Good morning everyone. This is Robert Hernandez and he's going to give a presentation on Git commits. Robert has been working in the computer engineering industry for over 10 years. He has used Git extensively on a daily basis and works for Nibbless Works. So, with that, give a big applause to Robert. Thank you. So, yeah, I'm a senior engineer over at Nibbless Works, a DevOps consulting company. My day to day, I work with Git. We use it for pretty much everything, and it's a super important part of our software development practices and kind of how we get things done. So, this logo, this is the logo for Git. To me it kind of looks like a traffic sign, I don't know, for like a cul-de-sac or something. But really what it's showing you is that Git likes to do branching. Branches in Git are cheap, and branching is a very important process and workflow in Git. So, what is Git? Well, if we go and look at the man page for Git, it gives us that... unfortunately, it's cut off. Well, anyway. Basically, Git is a fast, scalable, distributed revision control system with an unusually rich command set that provides both high-level operations and full access to its internals. Which I guess sounds good. I don't really know what I would do with that. But really what you need to know is that Git is a way of managing a set of files over a period of time. So, that's kind of it in a nutshell. So, what can Git manage? We're here at SCaLE, and we see a lot of people putting up links to GitHub, source control and things like that. But Git is really just a way of managing files. They can be text files, and it's really good at that, at showing differences between text files as they change. But it can also do binary files. Now, it's not as good with binary files and there are some workarounds; we won't get into that in this talk. But just know that Git can manage really any kind of file, some better than others. So, Git in the open source world, it's pretty much used all over now. It's super popular. And by show of hands, how many people have used a Git repository? Like, cloned one down. Yeah. Okay. I figured a lot of people. It's really easy. So, of the people that have cloned one down, how many people have actually modified the repo after you cloned it? You know, I have. Okay. So, a lot less. And of those of you who modified the repository, how many of you have taken those changes and contributed them back to the project? Alright. Maybe a few less. Right on. So, there's a bunch of reasons why you might not be able to contribute those changes back. You know, if you work for a company, maybe there's intellectual property in those changes and you can't contribute them back. Maybe it's something that's not really in the spirit of the project. Maybe you've hacked it up so much that you're like, oh, this is kind of embarrassing and I don't want to contribute this back. Or maybe you just don't know how. Right? Maybe it's just that that's kind of a jump, a leap too far, for you to go ahead and contribute it back. Which is, I think, really unfortunate. Because Git shouldn't be this crazy thing that's super hard to understand. It really should just be, you know, a tool that you can use to track your changes easily and consistently, and you should be able to be very confident in the tool and not find it scary. So, if we wanted to see what Git was all about, we can go ahead and look over at the man page and just see, okay, if we wanted to get started with this, what would it actually take?
So, let's see, I've got... I'm just going to run this, it's easier. So, we go ahead and start looking through the man page. Great, okay, I'm looking for some commands to pass it. But we quickly realize there's a lot here. You know, if you look at my pager, I'm only at about 15%, and now I've just gotten to some of the subcommands. And it just keeps going and going and going. Which, now I'm really thinking I don't want to do this anymore. You know, at this point I'm thinking maybe I'll just keep that change to myself. Which isn't what we want; it isn't good for the rest of the community for you to kind of hoard a change just because the man page scared you. So, before we get into a kind of Git workflow and a way to contribute those changes you might have back to a project, I wanted to first talk about my first PR, and the road I went down trying to get a small change into a project, not knowing everything that was entailed as I went through getting it submitted. So, this was back in 2012. I was working down in San Diego for a development shop. We were developing apps in Java. It really wasn't that interesting, honestly. And I started to think, okay, do I want to spend the rest of my life developing proprietary apps that really only a handful of people are going to see? It didn't really sound like a life goal of mine to do that forever. So, I started to look around in open source, and I had done some things with Linux before this Java development job. And as I was looking around, I found a project called Salt, and I heard about it on a podcast called FLOSS Weekly hosted by Randal Schwartz. For those of you that don't know, it's the Free/Libre Open Source Software podcast. They have a lot of interesting projects on, kind of just talking about what they're working on, and he interviews them, and one of them happened to be Salt. And so, I heard about this project and I was thinking, okay, this is something I can actually use, because I do a lot of work at my house with some servers and things like that. I'll give this a spin, and it's a new project, so maybe this can be kind of the first thing I start to play around with in the open source community. And so, I had a server I was going to set up. It was a FreeBSD server, and for those of you who don't know me, that's kind of my operating system of choice. Unfortunately, I can't present on it right now because the resolution was a little bit wonky and we had to get going. But anyway, that aside, I was setting up my FreeBSD server and I went ahead and installed Salt, got it working in a really small way, just being able to run commands on it and stuff. And then I started to actually provision what are called Salt states. And for those of you who don't know, Salt is a configuration management tool, and back in 2012 it was really, really new. So I started to configure these states to install packages on the server and set up some files. And when I went to go apply them, all of a sudden nothing worked. And I wasn't really sure why. So I started to troubleshoot, and sure enough I found out that, well okay, it's not actually a problem with the particular state that I was defining, it was actually a problem with Salt itself. It looked like there weren't very many people running Salt on FreeBSD, and so I hit a bug that wasn't really affecting anybody else in the Linux world.
So I went ahead and dove into it, and it ended up being this one line change. There was an additional argument that was needed for a particular function in the FreeBSD package module in Salt. And I was kind of shocked, because I was thinking I couldn't really fix this. There's no way. I don't really know anything about the project, and I'd only kind of hacked on Python when I was in college. So I triple checked what I had done, and well, this seemed to be the only thing that fixed my problem. And so I was like, okay, I've got this fix for Salt, but I don't really know what else it's going to take for me to get this onto GitHub so they can see it. Because what I was modifying at the time was the local package that I had installed Salt with. I was modifying that, not actually a Git repository or anything like that. So I went down the road of figuring out all the things that were required. And along the way, it was kind of just, I don't know if anyone's seen Kill Bill, but just hitting a wall. You know, like I'm not going anywhere. I keep blowing away the repository. I keep screwing it up. I probably spent like two days trying to figure this thing out. And I'm like, okay, I'm just going to do this systematically, figure out my steps, figure out what works. But I was also really worried that I didn't want to look like an idiot on GitHub by pushing something up that didn't look like I knew what I was doing. So I really tried to fine-tune that whole process of getting my pull request up there. And what I found out was that this is what was required to get my code merged. So I forked the Git repo. I forked Salt, right? That basically creates a copy of that project in my own account. And then I cloned down my fork locally, so I had a local copy of Salt that came from my fork. And then I had to create a branch for my changes, which was just this one line change, but it still needed to be on a branch. So I did that. And then I committed my changes to that branch. So now what I could do was push up my branch back to my fork, right? And then I put in a pull request. So, you know, what is that? That's six things I had to do. And I'm sure there were some other little things here and there, but these were the big milestones. And it just took so long to figure out. So a lot for just a little one line change, right? It's like, okay, now I understand why some people just say, forget it, I'm not going to go down this road. I burned a weekend figuring this out. So where would someone start if they didn't have to waste an entire weekend like I did, looking into how do I configure Git, how do I do all of this stuff to be able to put my pull request up to an open source project? So the first thing to consider is who's making the changes, right? We have to be sure that there's a name and an email address associated with whoever is making these changes. And this is cut off again, sorry about that. The good news is the majority of it is actually still relevant. So we want to go ahead and set the author to myself, and then I set my email address as well. This is what a typical commit would look like, minus it being cut off. But essentially, we just need to identify who's making these changes. And we do that by configuring our .gitconfig.
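For reference, that identity setup is just two commands. Here's a quick sketch (the name and email are placeholders; use your own):

  $ git config --global user.name "Robert Hernandez"
  $ git config --global user.email "rob@example.com"

  # Those two commands write a [user] section into ~/.gitconfig, roughly:
  #   [user]
  #       name = Robert Hernandez
  #       email = rob@example.com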
So this is a file that lives in your home directory and has all the things that are particular to who's making the commit. You can set things like what your editor is, what your pager is; you can do a lot of things in this particular file. But for now, all we really care about is setting (this is really annoying, anyway) the whole thing is just that we run git config --global user.name and set it to my name, Robert Hernandez, and then the next line is git config --global user.email and I set it to my email. That's essentially all that's really needed. And by doing that, you've created your .gitconfig, and you have a .gitconfig file with a user section and the user name and email address in there. That's all that did. But that is global for your user. So without a .gitconfig, we're unable to identify the author, at least not easily. Sometimes people's usernames make no sense compared to what their actual name is in real life. In this case, this person's name on their system is Minion, and their system hostname is what Git uses to build their email address. So yeah, Git falls back to info about the current user and the system. It tries to do some smart things, but it really doesn't help you as far as knowing who is making these changes. So the other thing that's really cool is that once you actually configure this, when you go to a remote like GitHub or GitLab (I think Bitbucket does it as well), they create this kind of heat map that has all of your commits throughout the year across all projects on that particular platform, which is really nice. It's just one of those things. Some people try to do a commit a day. I don't know, it's kind of fun to play with. Some people try to spell stuff out based on the days; there's a project to do that. I saw it the other day and was laughing pretty hard. So what's the workflow? At this point we have our Git config and everything, and that's good. But what's the workflow now? I briefly touched on it back when I was telling the story about putting in my first PR. But essentially, let's see if this is going to work. Okay, cool. So essentially, we have right here the original project, right? I went ahead and created a fork, and then we have my fork. This is in my GitHub account for my user. And then I clone it down right here, and then I have my local copy. This could be my laptop or desktop or whatever. And that's essentially the three places where the code is going to be. And then I'm basically going to be syncing any changes from the upstream (look at what we're calling upstream) to my local machine. And then if I have to make any pull requests, I push them over here, because I have access to make changes to my fork. I don't have access to contribute back to the open source upstream project, because they don't really know me; they can't give everyone access. So knowing that, let's dive into it a little bit with Git and figure out what commands actually let us do something like this. So one thing I'm going to try real quick: I can fix this resolution problem. Sorry about this. I just think it's going to be more helpful to make this change. They are, yeah, they're up on my GitHub. I have a link to my GitHub at the end. So we'll, yeah, they are up there. Oh, and they'll also be up on the site. Thank you. Refresh it. It should be better. Okay.
I think we'll be better now. I went ahead and set the width a little bit longer, so sorry about that. So we have our workflow, right, that we're going to get into. The first thing we're going to look at is step one, forking. So what does that look like? And what we're going to be using is a repo called first-contributions on GitHub. It's a cool project and its whole purpose is giving you your first contribution in open source. And it's really nice; it even has a whole step-by-step process, but we're just going to kind of elaborate on that here. My goal is that after everyone sees this, you can go to that repo and do exactly what we did here and go ahead and make that first contribution. So now that I have my project, the first-contributions project, I'm going to go ahead and fork it. So I hit the little fork button. This is in the case of GitHub, but a lot of the other websites have a similar thing; GitLab has the same concept, and so does Bitbucket. But we're going to go ahead and fork. We fork the project, and it gives you this nice little animated Git thing as it's forking. It should only take a few seconds, but it's kind of fun. And once we have the fork, that repo is now associated with my user account. So I have my own copy of the entire repository. Next, what we need to do is grab either the HTTPS URL or the SSH string of the fork. I prefer the SSH string because I think it's a lot easier to just put my public key on GitHub and be done with it, but you can use either. Just know that there are two. I think some people only realize there's a URL and don't notice the SSH option right here: if you click Use SSH, it'll give you the SSH string instead. So now we did the fork, right, that was step one. And now we're going to do step two, cloning it down to your local laptop. So let's clone down the repo. And that seemed to not help, so that's unfortunate. Does it look a little bit better to you guys, or no? A little bit longer? Okay. So I guess I'm getting closer. Let's just go to 400. That's what you get for developing these on a high resolution screen. Okay, well, I could scale it down. Let's keep going. Essentially we're going to clone down the repo. So we're going to take that string that we found here and we're going to put it in. And basically what it is, is just a pointer to my repository, with my username and the name of the repo. So in this case my username on GitHub is sarcasticadmin, and then it's going to be the first-contributions repo. Then I can navigate to that repo once I've cloned it down. If I cd into the first-contributions directory, well, now I'm in the repo. So what happened when I cloned it down, right? What are all the things that I got with this clone, by running that command? Well, I've retrieved the source code. The source code is going to look like it did on GitHub, but now it's on my local machine. And then additionally, with the clone I also get some things set up that Git kind of does for free. It sets up a remote, basically saying there's another copy of this repo, not local to my laptop but somewhere else. And it sets up this remote and happens to call it, by default, origin, but it can really be anything. That's just the default that Git sets. But it points back to my user's repo on GitHub.
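As a rough sketch, assuming the sarcasticadmin fork of first-contributions and SSH access, that clone step looks something like this:

  $ git clone git@github.com:sarcasticadmin/first-contributions.git
  $ cd first-contributions

  # The clone sets up the fork as a remote named "origin" by default:
  $ git remote -v
  origin  git@github.com:sarcasticadmin/first-contributions.git (fetch)
  origin  git@github.com:sarcasticadmin/first-contributions.git (push)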
And then it also sets a default branch. It's basically saying, okay, this is the branch you're going to make your changes from, and then they're going to be contributed back to this branch once you put them in for a pull request. And that's pretty typical for most projects. Some projects do it a different way and you might have to vary slightly, but by and large, the majority of projects that I've seen and contributed to on GitHub use a default branch: you branch from it to make your changes, and you contribute back against that same branch. So the default branch, master. Where does this name come from? What's the deal? I mean, really it could be anything, but a lot of people use master. Well, it comes from a term that's a lot older than Git itself; it's really a term from trunk-based development, where all of the code is contributed, like I was saying, back to that master branch. Everything is coming back to master. And the way to think about it is that trunk-based development is like a tree. The trunk is the fattest part; it's not like the branches are fatter than the trunk. And all of our code is coming back into that trunk to make it ever larger. So if I'm going to contribute to this repository, the first-contributions repository, what do I do? Well, I need to create a branch. So I'm creating a branch off master, and I happen to call it my-changes-one. It can be anything you want, just not master, because obviously that already exists. So I create my-changes-one and then I do a git status. And this is kind of a shorthand here: I'm saying git checkout, and the -b actually creates the branch and then puts me on it. So I'm currently on my-changes-one and I'm ready to modify the code and get it ready for a pull request. So now we have our new branch; let's go ahead and modify it. Here I'm just appending my name and my GitHub URL to the contributors file, the markdown file. It's a really easy modification; this is actually what everybody does in that repo to put in a pull request. And if I do a git status (and rather than just putting in blocks here, I went for the color with the screenshot) you can see that Git, instead of showing nothing, is actually showing that I do have a change. So Git is seeing this change, which is cool. Git's aware: hey, you've modified something and I'm just letting you know. That's the file I'm tracking, it's been modified, and that's what we intended. So now what we need to do is say, okay, for this particular change, we want to get it ready to commit. So I'm going to add Contributors.md to stage it on this branch, and now it goes green. Git's saying, okay, this markdown file has been staged, so the next time you run a git commit, I'm going to commit all the things that are in green. Git stages it. Great. So the last bit of this is committing that staged change to the branch. We just have to run git commit, and you can run it with a -m and give it a message that says what's happening in this commit. And once we do that, we'll notice that there are no longer any red or green changes or modifications on our branch, which is cool.
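Putting this step and the next couple together, the whole local sequence looks roughly like this; the branch name, file, and commit message are just the examples from the talk:

  $ git checkout -b my-changes-one        # create the branch and switch to it
  # (edit Contributors.md, appending your name and GitHub URL)
  $ git status                            # shows Contributors.md as modified (red)
  $ git add Contributors.md               # stage the change (now green)
  $ git commit -m "Add Robert Hernandez to Contributors.md"
  $ git log -1                            # the commit is now in the branch history
  $ git push origin my-changes-one        # push the branch up to the fork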
But what happened to the thing I was changing? Like, where did it go? Did it just vanish? Well, Git has this concept of a log: all of your commits are logged, and the log is specific to each branch. We can look at that now and see that there is the commit I just made. So it didn't just vanish, it actually got committed to the branch like I intended; Git is just saying there are no more working changes in my repository, they've all been committed. Which is great, that's what we wanted. So now if we go back to the workflow: I've committed everything on my local machine, and what we're going to get into now is pushing that up. We're going to push this particular branch to my fork. So to push the commit, really to push the branch up, we can just do a git push, and we specify the remote, which is my fork, which happened to be called origin by default, and then I'm also specifying the branch that I'm pushing up. Once we do that, I'm pretty much ready for a pull request. That branch is up there, it's up on GitHub now. It's still on my local machine as well, but it's also up in my fork. So what are the things I need to consider before I do a pull request? These are just some typical things that I do when I'm looking to put in a pull request, because I want to make it easy for the people maintaining that project to just look at my code and not anything else. Sometimes they have certain requirements around documentation or testing or things like that that you just want to be aware of. So check the documentation before the pull request to make sure that you're doing all the things they ask, because I think it gets annoying for them when people don't read it and just blindly put in pull requests, and they're like, well, did you read this document, this is why we put it here. Typically there's a CONTRIBUTING.md in the repo, or just the root README; those are two places I would definitely check.
And then I would also check any existing issues. So if what you're fixing is a problem, maybe someone's already reported it, or maybe there's something already in a pull request. Checking the pull requests is maybe something you could have done earlier, to see if someone had already fixed it. But it's worth taking a look at both of these areas to see if there's anything related to your particular change that you might be able to link to, to give better context about the changes you're making and what's happening in that change. So then you submit the pull request, right? And it basically gives you an option here; it usually defaults to the project's default master branch, and then it has your branch. GitHub is pretty cool about detecting when the most recent push was to your fork, so it typically will populate this with the most recently pushed branch, so you don't have to scroll through and figure out which branch you're going to be submitting. The main thing is just to make sure that you're submitting your change against the upstream master, right, the project that you forked, in their pull request section. And then include a detailed description. Don't just say "I changed stuff," because no one's really going to like that. Go ahead and elaborate; I think more information is better in this case. So put in, maybe, if you're working on some code, what system you're on, what versions of the software you're running. Anything related is always appreciated. And then you've put in your pull request, and yes, that's Simon Cowell from America's Got Talent on the slide. So now you have your pull request in there, right, and now it's like, okay, when are they going to merge this thing? I'm excited, you know, what are they going to say about this? So hopefully everything's good and they just say thanks a lot and merge it to the upstream, and then you feel really good that you've actually become a contributor to an open source project. Now, that's the best case scenario, and typically, if you take this workflow and follow it, I think you have a good chance of just getting it merged. But what are some other considerations about things that can happen after you've pushed up your code and put a pull request in? I'm going to cover two that I think come up, because they happen to me most often. The first one is something you kind of have to do after the merge happens, right? They merged your code, but how do you reflect those changes back on your local system? That was one for me: once I got that PR merged in SaltStack, I go, cool, but I still just have the branch locally; I don't know what to do with this. So let's talk about the first one: pulling the updates from the upstream repo. As the repo is growing and things like that, you want to make sure you have those same changes locally. And if we go back to our workflow diagram, it's right here, this link right here. We're going to be pulling things down directly from upstream, we're just reading from it, and syncing them back down to our machine. We're not going to go through our fork, because really our fork doesn't have the ability to do any of this kind of syncing.
So we want to configure the upstream repo for our local machine. And again, it's cut off, but the basics are that we're going to say git remote add upstream (you can call it whatever you would like, but we're calling this one upstream), and if we could see more of this, it would point to that upstream repository, which I believe is github.com/firstcontributions/first-contributions. And if we then do git remote again, we're going to see that we now have both origin and upstream set. So now we can sync from both of those, but remember, we can only write to one. We can only write to origin, because that's our fork and that's the only thing we have permission to write to. So if we want to update our master from upstream, how do we do that? We check back out our master branch, and let's assume that pull request you put in was merged. We pull down, and we're giving it --rebase: we're going to pull down from upstream's master, and this is actually going to replay all of the commits in the same order on our local master. So if our change was just merged, we should be able to see that reflected in our local master. And just as a tip that I would recommend: don't make any commits to master. Keep that branch identical to upstream; it'll save you a lot of time. Just make sure it's mirrored with upstream and everything else gets a lot easier. Just don't commit to master. So we knocked that out, right? The next thing that can happen (and this can actually happen while your pull request is in review) is if somebody makes a similar modification around the same lines, and Git can't really figure out a good way to deal with those changes and merge them together, it creates a merge conflict. Now, this is something that would show up in your pull request; it would say merge conflict, and really it's not as scary as it sounds. I think Git is also very verbose about when these things happen. So I wanted to briefly touch on this, because I think it happens more often than not, and people try to deal with it in a lot of different ways, some better than others. So I was going to walk through a merge conflict that I created on purpose in that first-contributions repo. Let me bring that up. I'm going to play this, and it's basically a recording of my terminal as I walk through a particular merge conflict, and I'm going to mix it with rebase. So, if I take our example, I'm currently on the my-changes-one branch, and I've gone ahead and modified it to append my name to the bottom of that markdown file. But I got a merge conflict. So I'm going to check back out master, I'm going to go back to my master branch, and then from master I'm going to pull the latest changes, like we just talked about, from the upstream master. I grab those latest changes so I can be sure that my master branch is up to date, and then I'm going to go back to my my-changes-one branch and deal with the conflicts from there. I update, I had some additional commits that I was missing, then I check back out my-changes-one, and then I rebase master on top of my branch. And this is the desired behavior, but it's freaking scary, right, if you don't really know what to expect. It really gets you: Git is saying, I don't know how to resolve this. You asked me to rebase master onto this branch, but this Contributors.md, I don't know what to do with it; it needs manual help, so can you please take a look at it.
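For reference, the whole sync-and-rebase flow being described here, including the conflict resolution and the force push that come up later in the Q&A, looks roughly like this (a sketch; the remote names and branch name are the ones from the talk):

  # One-time setup: point a second remote at the original project
  $ git remote add upstream https://github.com/firstcontributions/first-contributions.git
  $ git remote -v                          # now shows both origin (your fork) and upstream

  # Keep local master mirroring upstream (and never commit to it)
  $ git checkout master
  $ git pull --rebase upstream master

  # Replay your branch on top of the updated master
  $ git checkout my-changes-one
  $ git rebase master

  # If Git stops on a conflict: edit the file so it looks the way it should, then
  $ git add Contributors.md
  $ git rebase --continue

  # The rebase rewrote history, so the branch has to be force-pushed to the fork;
  # the open pull request updates automatically
  $ git push --force origin my-changes-one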
So if we go ahead and do that and take a look, we'll see that yes, it says both modified. Okay, so I have to go look at this file and figure out what I can do to make Git happy with my change. So I go look at the bottom of the file and I can see that it's showing up... sorry, I already did it, it just crashed, oh no. Okay, so I got a little ahead of myself there and it's not showing my change, which is unfortunate. The idea is that you should be able to resolve the conflict yourself; I was trying to do that here. I guess something's happening with the screen again, my apologies for that. But the idea is that you resolve the conflict yourself, make the file look the way you would like, and then you tell Git to rebase --continue. I think I mistyped; I should have just done it live instead of trying to record it. So let's see what happened here. Oh, I didn't add it. So basically, when you make the changes here, you have to git add them for Git to continue its rebase, and then everything will be good. But sorry that didn't pan out, at least for the demo. So anyway, wrapping up here. So now what? We went through the merge conflict. I can talk about it more; I don't mind bringing it up locally on my machine, I can recreate it. But the idea is that at least by now it's getting better, you've picked up a little bit of Git, and you can at least make your first contribution. And a good place to start is this example repo, the first-contributions repo on GitHub. So thank you. My name is Robert Hernandez, sarcasticadmin on GitHub; my GitHub account is sarcasticadmin. So yeah, if anyone has any questions, thank you.

If you have any questions or comments, please raise your hand and I can come by and give you the mic.

What is GI up? GI up? Oh yeah, I don't know, it's new. Thanks, Ellen.

I have a question. Do you normally create an issue first? Because what happens is you make the change, and really, for the maintainer of the project it's hard for them to be like, hey, this doesn't really fit with the project, maybe it's not something that, or maybe it's not an issue.

I think it depends. I'll just repeat the question because of that, okay, cool. Yeah, so maybe it's not something they want, or maybe it's not an issue, maybe it's by design or something like that. So it doesn't hurt to actually make the issue first; that's a very good question. For a bug, if it's actually something that's making the program crash, I typically won't, because that's usually a little bit more pressing for me, I've got to get that thing up and running, but I'll sort it out there and see what the team on the project says.

Great presentation, thank you. I'll try to pull down your slides on my phone. They look pretty straightforward, looks like they're in Markdown. How do I get them into a PDF?

Oh yeah, so you can go ahead and print it. I was going to push up the slides as they look like this after we're done here. You can also, if you clone the repo down, point the URL at that index file and it'll all just come up as it does here.

If only a small change is necessary, would it be better to just suggest the change in the issue, like give a detail of what they should do, instead of actually going through the entire process?
Yeah, so that's true. If you just say, hey, I think it's this line, yeah, that's certainly a lot easier. I just wanted the credit, because I'd spent my weekend on it, so I really wanted my commit in the tree. But you're right, that is a lot easier at times.

Anyone else? Actually, I think my question was very similar to that one. I think GitHub added a feature where you can...

Yeah, so I was actually playing around with that. We have a person in our content marketing department at work, and we're putting all of our content marketing into Git, and she's not the best with the command line. She doesn't have to be for her job, you know, it's understandable. And we were playing around with that editor, and it was actually really nice. Owen uses it quite a bit too, right? Owen, quite a bit, yeah. Yeah, we'll get into that. Anyone else? Oh, and again, if anyone wants me to walk through that botched demo, I'll be happy to do it.

So you showed how to resolve the merge conflict, but that's just in your local repository. What do you do after that?

So it depends. If you've already pushed it up and, like you were saying, it was in that bad state in the pull request with the conflict: since we went ahead and rebased it, and we fixed everything and it looks good, what we can do is just push that back up to our fork. Now, when we push it, we're going to want to do a force push, because we've modified history and Git just doesn't know what to do with that. So you want to tell it, okay, it's cool, don't be alarmed, I know that I want to do this. When you do that and you force push over that branch, the pull request is automatically updated, because the pull request is associated with that particular branch. You don't have to close the pull request and reopen it, you don't have to change it; it will magically appear.

That helps a lot, I wasn't sure about that part.

Yeah, that's a very good question. Alright, anyone else? Alright, I'll be hanging out up here, so thanks everybody, I really appreciate it. Sarcastic admin... oh, there you go, sarcasticadmin on GitHub.

One slide I didn't check. Anyway, on to introducing the next speaker. This is Dave Chiluk. Dave is a Linux platform software engineer at Indeed. He works closely with the DevOps and SRE teams to improve reliability, scalability and performance across Indeed's cloud. He's an Ubuntu core developer and has been debugging Linux issues professionally in one way or another for the last 15 years. He is a Linux contributor. Can we give him a big hand?

His presentation is a little more interactive, so feel free to ask questions and comment. Please wait for my mic.

Thank you. If you guys... oh my gosh, can we turn that down a little bit, volume wise... if you guys have any questions during the course of this, raise your hand and I'll get to a spot and I'll call on you. I do love teaching, but I love programming a bit more, so hopefully this will be useful. And again, this is intro to kernel debugging: just make the crashing stop. So if you're not here for that, you need to leave right now. Again, I'm Dave Chiluk and I'm a Linux platform engineer for Indeed. Can everyone hear me still?
Still good? All right, great. So what have we come here to do? A bit of an overview about all the things you're going to want to think about when starting to look at kernel problems. Am I just projecting, or is this actually... I'm still good? All right. So we're going to talk about gathering debug information. We're going to cover a bit about how the kernel development process works: you know, who's Linus, who are all the rest of the maintainers, how all of that works, because once you start interacting with the community, that's really important to know in order to get what you're trying to accomplish done. We'll spend a lot of time looking at oopses. Has anyone in the room hit an oops or know what an oops is? Can I see a show of hands? All right, fantastic. Can I see a show of hands of how many people would profess to have an understanding of C as a programming language? Fantastic, that's the audience I'm looking for. All right, I'm not going to be able to teach you C here, because that is an entire four-year college degree, but I will do some light work around it. We'll cover a little bit about code inspection, the kind of things you're going to run into when you start looking into the kernel and doing code inspection. I'll talk about some Git tips and tricks that you're going to want to use when you're actually trying to find solutions to problems you're hitting on your production clouds. And then we'll talk about engaging the kernel community, and I'll give you some ideas on how to dive deeper. And through all of this, I'm going to walk through a real-life case study that we hit at Indeed on our production clusters that was actively taking down nodes. So that brings me to who I work for. I work for Indeed, and we help people get jobs. We have many millions of clicks every day, and it is my job to make sure that those clicks continue happening, or part of my job. I am a Linux platform engineer, and what that means is I fix issues in the open source code that Indeed uses. Frequently that's been in HAProxy and the Linux kernel, but it could really be anywhere else that we're hitting those problems. Prior to that, I worked for Canonical on the Ubuntu sustaining engineering team, and there I worked closely with the Ubuntu kernel team to solve a number of kernel problems that the Ubuntu kernels were hitting. So I've been doing this for a while, and along those lines I eventually became an Ubuntu core dev, which means I actually have upload permissions for the Ubuntu archive. All right, so we're all here at SCaLE. Some of us are probably on DevOps teams, a lot of us want to become part of DevOps teams, and we've all heard about this concept of pets versus cattle. This really originated with a guy named Bill Baker, a distinguished engineer from Microsoft, back in 2012, and it's widely attributed to him. The basic concept is: the old way is you treat your servers like pets, you name them, and when they get sick you nurse them back to health. Now we're doing this scale-out model, which is you treat your servers like cattle, you number them, and when they get sick you shoot them. His words, not mine. I propose that that's a little bit of a simplistic view, because you've got these things called wolves, and that's what we're going to fix here. If wolves start eating too many of your cattle, you're going to need to deal with them too. But if they're just taking one or two, it's not worth it, it's not worth the hassle. So what is a wolf? Let's cover what a wolf is.
This is a problem that we hit at Indeed. This is the original description that we got from my ops team, probably a month after it happened. The description reads: we need to re-kick this server due to file system corruption; currently this host is in downtime and removed from the Mesos cluster. Okay, who cares? We're DevOps, we can redeploy a machine in about two hours because it's all scriptified and wonderful, right? Well, then they got another one. Okay. And another one. And then a fourth one. This one: /var is corrupt for the nth time. You're starting to hear the frustration come out of this ops guy's descriptions, right? And then we got a few more. Does anyone notice why this problem was going to be really difficult for us to solve? Thank you. I think you said it: the logs, all of the logs live in /var. Yeah, is that what you were going to say back there? I wasn't going to say that, but man, you hit the nail on the head, and we're going to cover a little bit about that. And we've got another comment in the back there. Yeah, so keep in mind these are just the top-line descriptions; this includes none of the text of the bug. But yes, I've got no information on what's triggering this, I have no information on what the output is, and I have no logs, because that's where all the logs live: they live in /var. So we're going to cover this problem in the scope of how to fix kernel problems. All right, so step number one: as many of you observed, it's sitting in /var and I need to gather information. Well, simple things. You need to grab the exact kernel version that's hitting this, because if we know this is a kernel problem, we're going to need to know the exact kernel version it's hitting; otherwise, how do you know what sources you're looking at? You're going to need to grab the logs. Well, unfortunately, they're in /var/log. The best thing you can do is recover the machine and then save off any logs that may have gotten saved. Unfortunately for us, the logs that were pertinent to the problem didn't actually get written to the logs, which brings me to the next little trick: using console output. Typically all oopses are going to be written directly to your console, which means if you log your serial consoles, you can actually grab those oopses from there. Another option is to get rid of /var/log altogether and just move to rsyslog and log to machines elsewhere. That would have been another strategy we could have gone with, but we actually found the logs in console output. Other things you're going to want to consider, and this is separate from the problem we're going to discuss, is grabbing crash dumps. So, enabling crash dumps: there are a number of ways you can do that, and it's going to be very distro specific. You can grab sos reports; sosreport on Ubuntu is a fantastic tool that's going to grab a lot of information out of /proc, /sys, your logs, and /etc for that matter. It'll sanitize some of it, and it will package it up into basically a two gig tarball, which is amazing for post-mortems. When I worked for Canonical, that's what we would ask customers to send us in order to do advanced debugging work without having to go back and forth, like, hey, what was this file? You look in that file and it's pointing to another file; oh, can you give me this file? So it saves a lot of turnaround time. Sir? Well, if you're going to hand it off to someone outside of your organization, having it sanitized matters, but it's also good DevOps practice to sanitize in case you end up shipping out /etc/passwd. If you have a hashed /etc/passwd and that gets compromised or thrown somewhere, you could basically have just compromised your entire cloud. It's never a good idea to have unsanitized logs if you can avoid it.
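As a rough sketch, the basic information gathering he's describing boils down to a few commands (sosreport is the tool mentioned above; package names and flags vary a bit by distro):

  $ uname -r                     # exact kernel version that hit the problem
  $ dmesg                        # kernel ring buffer, if the machine is still up
  $ journalctl -k -b -1          # kernel messages from the previous boot, on systemd systems
  $ sudo sosreport               # bundles /proc, /sys, /etc and logs into a tarball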
Okay, another tool that you might want to look into is sar. I don't use it all that often, but a lot of the DevOps teams that work with me tend to use it to great effect. All right, so what ended up happening for us is we had this console output, and this 1024x768 screen is going to make this a major eye chart for you guys, but I got this oops output. For whoever's seen an oops, this is going to look very familiar to you, but this is basically XFS telling you where all of the problems happened, and I'm going to explain how to read one of these things. That is what I consider the basics for starting to do kernel debugging. All right, before we go into that though, let's talk about other information sources you might want to gather. If you're looking at a device driver error, and you saw these oopses, you're going to want to grab the exact firmware levels you're at. You're going to want to grab your module arguments, maybe some modinfo output from when it's functioning. If you're hitting network errors, any kind of network file system or unsecured network communication, you know, something that's not SSL encrypted, you might want to grab a tcpdump or a Wireshark capture. If you're smart, you could probably still grab the SSL encrypted traffic and then grab the correct keys to decrypt it, but I've never done that and it sounds painful. If you're hitting file system errors, something you're going to want to do is grab a dump of the file system; just simply dd'ing the file system is useful. With XFS, we actually grabbed an xfs_metadump, which allows us to grab all the metadata without any of the actual data. So again, talking about keeping things sanitized: by doing an xfs_metadump, no PII that was on our production cluster was being handed to me, which is a great way to say, hey look, security team, I don't have any information, because it was all sanitized. Which is good. Another thing you can do is, once you have that metadump, you can do an xfs_mdrestore on your local machine and actually have the file system as it was when it crashed. All right, so now we've gathered our information. The next step is really to grab the kernel sources. Now, what you're going to want to do is get the exact sources used to build that kernel, because the kernel is so active that you may find something that's just a couple of minor revisions after the one that crashed, and it's going to have a few thousand fixes in those few minor revisions. It's imperative that you grab the correct kernel version. The easiest way to do this is to grab the sources for the RPM that was used to build your kernel, but I like to do it a little bit more difficult, and we're going to explain how the kernel development process works so that you can understand where your kernel is coming from. All right, so kernel development. Everyone knows the Linux mainline tree, right? Who owns the Linux mainline tree? Linus, yes. Linus owns the mainline tree. Where it lives, as you can see here, is on git.kernel.org under the torvalds directory. It's maintained by Linus, and this is where all of the active kernel development happens.
So when people are working on features, or on bug fixes that aren't necessarily stable or are a little bit more dangerous, they're all going into Linus' tree. And to give you an idea of how many fixes go in: between 4.17 and 4.18 there were 14,432 patches in just 10 weeks of development. That is how active the kernel is. Again, just reiterating why it's so important to get the exact sources that were used. All right, so to get around all of this churn, what the Linux kernel community has is the concept of a Linux stable tree, and this is owned by Greg Kroah-Hartman. Oh, hold on, I misspoke. All right, take a step back; ignore what I just said three seconds ago, I got ahead of my slides. So part of the development for Linus' tree is that there are a lot of subsystems that also have their own trees. For example, XFS has the xfs-linux tree, which is separate from Linus' tree but then gets pulled in by Linus as he's doing his development. So basically the XFS maintainers will vet all of the patches, put them onto their development tree, and then say, hey Linus, here's a pull request, please pull the changes from our tree back into the mainline kernel tree. And he will do so. And you'll actually see merge commits from Linus saying pulled from xfs-linux, this number of patches. This is great because it allows Linus to have lieutenants to vet and authorize fixes. And I've got a question in the back right there. It is not a Git... so, the Linux kernel was created way before GitHub, and actually way before Git, because Linus only developed Git back in 2005. So no, it is not a GitHub pull request. It's more of an LKML message from the maintainer of that subsystem development tree to Linus, via LKML, as far as I understand. Yeah, I can do that: the question was just, is it a GitHub pull request, and the answer is no. All right, so does everyone get the idea of how subsystems get developed and then pulled back into Linus' tree? We got some nods, good enough. All right, so once Linus does his series of RCs for these features, he will eventually do a release and bless it, and that is when Greg Kroah-Hartman will fork from there and create a stable branch in the linux-stable Git repository. Now, this is very important to us, because this is where most of the distro kernels come from. So for example, when 4.2 came out, he created a short term stable branch in the linux-stable tree, and then, as actual bug fixes that are deemed stable-worthy get added to Linus' tree, they are additionally submitted back into the stable tree. They follow a process called the Linux stable process, which vets fixes to only be bug fixes, so they're not going to possibly kill you because you grabbed a feature you didn't need. Again, when 4.3 released, the same thing happened: he created a short term support release and stopped the 4.2 development on the linux-stable tree. But when 4.4 came out, he decided that was going to be a long term support branch on the linux-stable tree, and that will be supported up through February of 2022; they're going to pull fixes back from Linus' tree up through the end of February of 2022. And then when 4.5 was released, that was again another short term support kernel release, and so on and so forth. All right.
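If you want to poke at those stable branches yourself, a quick sketch of what that looks like (the repository path and the linux-X.Y.y branch naming are the upstream conventions; adjust the version to whatever your distro kernel is based on):

  $ git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git linux-stable
  $ cd linux-stable
  $ git branch -r | grep 'linux-4\.4'    # the 4.4 LTS branch is linux-4.4.y
  $ git checkout linux-4.4.y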
So if you want these kernels, you go and grab them from Linus. This is an example with Linus' tree: you grab it off of git.kernel.org, and you do a make oldconfig. That's important because what that does... there's not supposed to be a space there; I think my technical writer added a space, but it is what it is. It's supposed to be make oldconfig, without a space. What that will do is take the configuration file for the kernel you are currently running, add it to the kernel tree, and then add all of the options for anything new that's in the new kernel sources you just downloaded. So it'll ask you: do you want this kind of module? Do you want that support? Do you want to deprecate this value? It'll ask you a whole bunch of questions, and that is the best way to create a config for a new kernel you've just downloaded from kernel.org. Then after that you're going to need to make it, make -j with the number of CPUs, then you'll do a make modules_install and then a make install. I know this is a little fast, but these slides are all available on the internet, so hopefully you don't have to be typing these commands frantically. After you've installed that kernel, you're going to want to update your initramfs, which will generate an initramfs for your distro, and you're going to need to update GRUB so that GRUB knows about it. All right, that all sounds well and good, but there's actually a much easier way, and that is to use the mainline builds available for your distribution. Two major distributions that I know of that do this: Ubuntu (because I'm a core dev) has a mainline builds repository that will let you test mainline builds without having to build them yourself. It's a great way to test whether something has been resolved by an upstream kernel. ELRepo provides the same for CentOS. I would suggest not running either of these in production if you can avoid it, the Ubuntu mainline builds specifically, because Ubuntu has a lot of SAUCE patches that are possibly not included in those archives. They're basically just used for debugging, to test whether the fix exists upstream, so that the fix can then be identified and brought back into the actual distro kernel, because that is the responsible thing to do; we want to maintain stability of our production clusters. All right, so Indeed actually uses (now that I've said not to do it) Indeed actually uses the Linux stable trees out of ELRepo, and I do my best to fix any problems that we find there and push them back into the ELRepo kernels, or upstream, wherever is most appropriate. Okay, so now that we know about linux-stable, this is a little bit more about what the distro map looks like. You can see down at the bottom here we've got CentOS, which forked off of the 3.10 linux-stable tree and has been maintaining it ad infinitum. The short term support Ubuntu and Fedora releases only have nine months' worth of development on them and then they're dropped, but this gives you an idea as to where they fork from: they're forking from the linux-stable trees, and then most of the distros are going to try to follow Linux stable practices and bring back stable fixes for the lifetime that they're supporting that kernel. All right, so we've gotten our sources, we've found our correct kernel, we're going to clone it, we're going to check it out.
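Pulling those steps together, building a kernel straight from kernel.org looks roughly like this (a sketch; the exact tree, tag, and initramfs/GRUB commands depend on your distro):

  $ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
  $ cd linux
  $ cp /boot/config-$(uname -r) .config    # start from your running kernel's config
  $ make oldconfig                         # answer the questions for any new options
  $ make -j$(nproc)
  $ sudo make modules_install
  $ sudo make install
  # then regenerate the initramfs and update GRUB, e.g. on CentOS:
  #   dracut -f --kver <new kernel version> && grub2-mkconfig -o /boot/grub2/grub.cfg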
If you're running CentOS, you're going to use these commands. Notice that CentOS is actually going to grab the tarballs from the Red Hat source repositories, so you have to do a little bit of weirdness here. It doesn't actually give you a Git tree; it provides you an RPM spec file plus a tarball, and that is really not a Git repository. I would love for RHEL to actually give us a Git repository, because their kernels might become more useful to someone who's trying to support kernels. Just getting a tarball gives me next to no information in terms of how many millions of patches they've added on top of their kernel. So if there are Red Hat people in here, please bring that back to your management. If you're an Ubuntu user, most people would say, hey, let's just do apt-get source on the linux-image package, and that will give you something very similar to what you get in CentOS land, which is basically an unpackaged tarball of the sources as they were used to build the kernel. What I would do instead, though, is grab them from kernel.ubuntu.org... sorry, .com, and that will actually provide you a Git tree that is searchable and much more usable for finding fixes and providing fixes back. That will give you the entire Git history, from the beginning of time, of where that kernel was forked from. In order to rebuild the Ubuntu kernels, you're going to want to do a fakeroot debian/rules binary: prepare generic, binary generic. I wanted to fix that slide, but I didn't get a chance. All right. Another thing, going back to our problem now: you've got your sources, and another thing you're going to want to have is the debug info. When kernels are provided to users, we always strip the source code and symbol information as much as possible before we put it on the machine. Otherwise, an unstripped kernel is going to be like 700 megs to 2 gigs, depending on how many modules and how many other things you've compiled into your kernel. So what we provide instead is stripped binaries, and then we provide debuginfo or debug symbol packages, which give you those unstripped binaries for use with GDB or crash, and we'll cover that in a second. These two commands down here are how you would install that; I don't need to read them to you because you can all look at that later. All right. So we've now all gotten our kernel sources. We've gone through a whole bunch of steps just to get these sources, but now you kind of have a better idea as to where to find the sources that you need.
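For reference, installing the matching debug symbols typically looks something like this (a sketch; exact package names, repos, and versions vary by distro):

  # CentOS / RHEL (debuginfo-install comes from yum-utils):
  $ sudo debuginfo-install kernel-$(uname -r)

  # Ubuntu (needs the ddebs.ubuntu.com repository enabled):
  $ sudo apt-get install linux-image-$(uname -r)-dbgsym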
All right. So we've now all gotten our kernel sources. We went through a whole bunch of steps just to get them, but now you have a better idea as to where to find the sources that you need. The kernel tree structure is actually really great; it's pretty simple. The Documentation directory is where you're going to want to start if you're first looking at debugging kernels. There's a Documentation/process directory that describes how to interact with the community; there's also a great admin-guide, which describes a number of kernel command line arguments that you may be able to tune; and there's a bug-hunting document, which describes a bit of what I'm talking about here today, like oops analysis. Everything else is very descriptively named: mm stands for memory management, the net subsystem is going to be in the net directory, the file systems are all going to be in the fs directory, arch is going to be very architecture specific (if you're looking in there, you're smarter than me), and if you're looking at driver issues, they're all going to be in drivers. Pretty obvious naming, so don't be intimidated: start with the Documentation directory and then dig into these individual directories and just start looking at them. You're going to recognize a lot of names that you've seen before, like ext4 and XFS; these things are all going to pop out at you and you're like, oh, that's pretty easy, I can start reading that. Two things I would like to highlight in the kernel tree, though: you're going to want to look at the scripts directory, which has two scripts in it called get_maintainer.pl and checkpatch.pl. get_maintainer.pl will tell you who is actually maintaining the source files you're looking at, and checkpatch.pl will take a look at whatever patch you want to submit back to LKML and tell you all of the things someone is going to flame you about, so you don't get flamed on LKML. Great. It's basically the kernel maintainers' coding style, sanitized, or scriptified, and that's what it's going to be checking for you.
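For example, from the top of a kernel tree (the file and patch names here are just placeholders):

    # Who maintains the file I'm about to touch?
    ./scripts/get_maintainer.pl -f fs/xfs/xfs_btree.c

    # What will people flame me about? Check the patch (or a raw diff) first
    ./scripts/checkpatch.pl 0001-my-fix.patch
    git diff | ./scripts/checkpatch.pl -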
All right, so let's get back to that oops I grabbed earlier. Actually, let's not do that; I found a much better oops in the git logs, and we're going to cover all the pieces of it. I'm not going to go into great depth; I'm just going to cover what all the pieces are, so when you get one of these you know where to look and where to go from there. All right, so starting from the top. You see how I've split that wall of text up; this is just the first chunk of it, and I'm going to do that on each of these progressive slides. Looking at the top (no one can read that because of the resolution), it says "BUG: unable to handle kernel NULL pointer dereference". When the kernel hits a bug, it's going to figure out why it hit that bug and give you a message about it; that's where this is coming from, and that's what actually initiates the entire oops. The next line is the IP, which stands for instruction pointer. That is the exact spot where we hit that NULL pointer dereference; you can see it says it's at __mutex_lock_slowpath plus an offset, something like +0x98. That is the exact place in the object file where this NULL pointer dereference was hit. We skipped a few lines because those were arch specific. The next thing you're going to want to look at is the Oops line, which is going to have a number one in it. If you see number three or four or anything higher than one, you're looking at something after the first oops, which means, if you are hitting a memory corruption bug and you're looking at the third oops, that first oops might actually be the source of your third oops, so there's no point in looking at the third oops if you can't look at the first one, right? So always make sure it says number one there; well, almost always, but you get the idea.

You're going to see a list of all of your modules that are linked in, which is a great way to determine what kind of devices were available and what was on the machine. You'll see what application was actually being run: the comm field says modprobe, so we were running modprobe when this oops was hit. Granted, this is not my XFS bug; it's just a much better example than my XFS oops. It says it's not tainted, which means there's no non-GPL code loaded, and it gives you the exact kernel version, so if you have an oops but don't know which kernel it came from, you actually really do know which kernel version it was. And then it has the instruction pointer again at the end, in case it scrolled off the screen because you were using console output to record this; so at the very least you should have the location where you hit the problem.

All right, scrolling down, we're going to see our registers and instruction pointer. To take this instruction pointer and map it to code, one of my colleagues actually mentioned that you can do it very easily with GDB; I've been using a much more complicated method, but we'll cover that in a second. What you do is you run GDB against your debug-info kernel. Remember that unstripped kernel we were talking about, and why it's so important? This is why: it allows you to decode that instruction pointer to the actual line of C where it was hit. If anyone's eagle-eyed and can actually read this, you'll notice that it makes absolutely no sense in terms of the oops we were looking at, and that's because I'm not using the right kernel version; at some point between 4.9 and 4.15 they rewrote enough of this code that it makes no sense in terms of the 4.15 kernel. Just trying to reiterate, again, why it's so important to get the right kernel version. Another thing you can do: the next thing in the oops is a bunch of register values. Translating those registers into something useful can be done one of two ways. You can either do the disassembly within GDB, or, what I've actually been doing, use objdump, which will dump all of the symbols, disassembled, into a file, and then I'll search that file for, say, mutex_lock_slowpath plus the offset. Then I'll have register values; you'll see here RSP and RBP, and these aren't just text, they actually map to the RSP and RBP registers up here. Because I have this, I can take my register values and go back and figure out: okay, it's a NULL pointer dereference, so I can look at the registers as they're being modified in the assembly to figure out where that NULL pointer actually came from. All right, so that's where those registers become useful. And if you have to go across function calls, be aware that in the arch directory there's a calling.h which actually describes where registers are saved on the stack, how they are popped and restored, and who's responsible for saving them.
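A minimal sketch of both approaches, assuming an Ubuntu-style dbgsym vmlinux path and using the symbol and offset from this example oops:

    # Map the instruction pointer to a source line with GDB
    gdb /usr/lib/debug/boot/vmlinux-$(uname -r)
    (gdb) list *(__mutex_lock_slowpath+0x98)

    # Or disassemble the whole unstripped kernel once and search it
    objdump -d -S /usr/lib/debug/boot/vmlinux-$(uname -r) > vmlinux.asm
    grep -n "__mutex_lock_slowpath" vmlinux.asm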
I'm not going to go any further into that. The stack: this is the beginning of the stack here. Talking about calling.h and how things are called, you're going to have to do some really pedantic work identifying where things got saved and when, but that is super advanced and probably not worthwhile here; I just wanted to mention it. This next part is the bread and butter of your oops. If you don't listen to anything else, know this: this is the function call stack that was used to generate your problem. This is what was called, and the order in which all the functions were called that ended up at that NULL pointer. The oldest is at the bottom and the most recent is at the top. You'll notice there are sometimes question marks; those are usually due to GCC optimizations, where it inlines functions that might not otherwise have been inlined, so it's not obvious in the object files exactly where you are, but the addresses will still be correct. All right, do you notice, down at the bottom here (oh, there's supposed to be another transition), you see delete_module on the third line from the bottom? Since we knew we were running modprobe from earlier in the oops, we were clearly removing a module from the operating system when this oops was hit. All right, there's one last line in the oops that you need to know about, and that's the Code line. That is the raw x86 machine code, which can be reassembled into more readable assembly using the scripts/decodecode script. That's all I'll say about that, but now you know every line of your oops, and if you're really desperate you can actually do some work there.
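Using decodecode is about as simple as it gets; feed it the oops text you captured (oops.txt is just a placeholder for whatever you saved from dmesg or the console):

    cd linux                          # any kernel source tree
    ./scripts/decodecode < oops.txt   # reassembles the Code: bytes and marks the faulting instruction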
All right, so let's go back to our XFS problem. This is our XFS bug; I hope some of you downloaded the slides. Let's see here: down at the bottom we see an extra message coming out of XFS. It's not generic to every oops, it's coming purely out of XFS, and it says: corruption of in-memory data detected, shutting down the filesystem, please unmount the filesystem and rectify the problem. Thanks, XFS. Yeah. All right, so let's dissect this oops a little bit. It was great in that it actually told me the exact line where it hit the issue: it says internal error XFS_WANT_CORRUPTED_GOTO at line 3505 of xfs_btree.c, the caller is xfs_free_ag_extent, and it is not tainted. Okay, so it provides the exact kernel file and line number, and we have the exact call trace in here, so we can do some tracing and figure out what is happening. If we also look down at the bottom, we'll see that what we were doing when we hit this error was removing a directory. How in the world did removing a directory cause filesystem corruption? I thought that would just be removing it and unlinking it, right? Okay, that was kind of my first thought as well, but then again, I didn't know XFS perfectly when I started this, and I don't think anyone really does; but I don't think anyone knows everything about the kernel either, and I think Linus would agree with me. You'll notice, up at the top, we see xfs_btree_insert, and that was called by xfs_free_ag_extent; two useful things I'm going to talk about in a second. And then again we have our question mark where we're not sure where this came from, but it looks like it's at this address.

All right, this is where your C chops really start to get challenged, so I'm going to start with xfs_btree_insert, because you see it at the top of that orange box there, and I'm going to start reading code. You can't get around this: in order to do kernel debugging, you're going to have to look at C code and love it. It's kind of like being a detective; you're challenged by the information you have from that oops and the code that you know was running at the time. We'll see that we've got our XFS_WANT_CORRUPTED_GOTO: it's actually checking to make sure that i equals one here, and otherwise it goes to error. Well, that's no help. We thought we knew the exact file and line number this hit; no, it's actually a macro that's just checking that i is still one at the end of this function. Well, where does i get set? i gets passed by reference into xfs_btree_insrec. Okay, so let's look at xfs_btree_insrec. It takes that, now called stat, and what does it do with it? Oh, it passes it to xfs_btree_new_root. Okay, great. So then you go into xfs_btree_new_root, and then into the alloc_block callback, xfs_allocbt_alloc_block, and then into xfs_alloc_get_freelist, and finally you find where stat is being set: it is set to zero because the block number equals NULLAGBLOCK. Okay, well, man, that doesn't seem like a bug; it seems like I read that block number off disk and it equals NULLAGBLOCK. So this looks like valid corruption, but valid corruption on disk. What in the world could be happening? Yeah, stat equals zero, and that's where our i was set to zero, which is where our original error came from. So now I understand what was happening: my block number was set to NULLAGBLOCK, that happened while I was doing a remove directory, and it happened while trying to insert a new free block into a btree. That's probably way more information than you want, but that's fine.

So the next thing you do, now that I understand that, is check for fixes upstream, because, step number one when you're hitting kernel problems, we do not live in a vacuum. There are hundreds and hundreds of developers all working on the Linux kernel; the likelihood of you hitting a bug that someone else hasn't already hit, let alone already fixed, is pretty low. So the first thing to do is check for fixes. You're going to use Google, our favorite friend; I mean, look for that stack trace that was sitting in my oops. I'm going to check mailing list archives, specifically LKML and then any subsystem-specific list (XFS, for example, has its own list, so I'd be looking there as well), and I'd look for any related terms, you know, like AGFL, AGFL corruption, memory corruption, maybe that XFS_WANT_CORRUPTED_GOTO line. I'd search for all of these things trying to find the fix. Another thing to do is actually use git. Because we're running an awesome distro, or we're running the stable Linux tree, we have a full git repository, and this is why I hate running CentOS kernels: because they don't
give this to me. What I can do now is look at whatever version I forked from (in this case we were running a 4.10 kernel at the time), and I can do a git log that says: show me all of the changes from 4.10 onwards with the word crash in them, in fs/xfs. I know it's an fs/xfs problem, and I want to see all the changes between my kernel version and mainline. I'm actually looking at Linus' tree here, and the reason I'm looking at Linus' tree is that he's the one who's going to have all of the fixes. I want to see just what's affecting XFS, and I'm going to search for AGFL, I'm going to search for btree, I'm going to search for insrec, and I'm going to read all those changelogs. There were probably a few hundred changelogs and it took me a few hours, but the majority of the time what's going to happen is you're going to find that someone has already fixed your problem, and it will save you days if not weeks of work. Plus, when you submit your fix back to the mailing list, they're not going to laugh at you for sending them a fix they already have; well, not that they're going to laugh at you, they'll probably just recognize that you're new and gently encourage you to check the git log next time. Absolutely. Sometimes when I submit fixes to the kernel I'll actually start with "I'm not sure if this is the best way to do this, but this is one way to solve the problem," and that way it leaves them room to know that I'm genuinely trying to figure out a better way; I just don't have as much experience as they do. Another thing you're going to want to do, if you can actually reproduce your problem (in my case we were hitting this on one to two nodes a week across our 2,000 nodes' worth of cluster, so there was no way I was going to be able to do a bisection on it), but if you can reproduce the problem, check a mainline kernel. Remember those mainline builds I told you not to run? Run one of those mainline kernels and see if you can still reproduce the problem. If you can't reproduce it there, the problem is now bisectable, and you can use git, which is an amazing tool, to bisect down to the commit that fixed your problem without having to read all those changelogs. Basically you tell git: hey, this is the commit where I'm starting, and this is the mainline build I tried that definitely fixes the problem, and then you go through a number of iterations of rebuilding the kernel and actually arrive at the commit that resolves your problem. That will solve 95 percent of the bugs you are going to hit with distribution kernels. Cool. Prior to git 2.7 you had to use git bisect bad and git bisect good, which is not terribly descriptive if you're trying to find a fix, because bisection was usually used to find a regression, so it's actually the reverse. But with git 2.7 and newer they provide new terms, --term-new and --term-old, which allow you to define: hey, this is the old behavior and this is the new behavior. I like that.
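Something like this, where the versions and search terms are just the ones from my example:

    # Look for existing fixes between your kernel and mainline, limited to the subsystem
    git log --oneline v4.10.. --grep=crash -- fs/xfs
    git log --oneline v4.10.. --grep=AGFL  -- fs/xfs

    # Bisect for the commit that FIXED it (git 2.7+ lets you rename the terms)
    git bisect start --term-old=broken --term-new=fixed
    git bisect broken v4.10      # your kernel: still reproduces the bug
    git bisect fixed v4.17       # a mainline build where the bug is gone
    # ...build, boot, test, mark "broken" or "fixed", repeat until git names the commit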
All right, so now I've got all my information, I've got a better way to look, and the next step is to try to gather some more information. For my problem I couldn't really gather too much more, but if you're hitting crashes, by all means enable a crash dump, because then you can do post-mortem analysis on that memory dump. You might want to instrument your kernel, now that you've read through all that C code and finally understand what is going on in there, to figure out what was being set that's causing that NULL pointer dereference; you might use SystemTap, kprobes, or ftrace (jprobes are now deprecated), or you might want to use eBPF. I'm going to plug Brendan Gregg's talk, because he's on in the next session and it will be awesome; I'm looking forward to it myself. If you're hitting a performance problem, you might want to run perf against it, which should identify what the hot paths are. And then you may want to analyze your filesystem metadata offline, which is what I ended up having to do with XFS: I did an xfs_metadump and an xfs_mdrestore, and then I verified that that NULLAGBLOCK magic value (it's something like 0xfafafaff or so, the kind of magic pattern like DEADBEEF that you usually see in C programs) was actually there in the metadata, at the correct offset, on the filesystem. And it was.
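The offline metadata analysis looks roughly like this; the device and file names are placeholders, and xfs_metadump copies only metadata, not file contents:

    xfs_metadump -g /dev/sdb1 /tmp/sdb1.metadump    # dump metadata from the affected volume
    xfs_mdrestore /tmp/sdb1.metadump /tmp/sdb1.img  # restore it into a sparse image
    xfs_db -r /tmp/sdb1.img                         # poke around read-only with xfs_db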
So the next thing you do when you're stuck, and I couldn't figure out anything more to do, is to really engage the community. And yes, Batman is part of the kernel team; this is actually a picture from the 2016 Linux Plumbers Conference in Santa Fe, and it was around Halloween, so that's why Batman's on the kernel team. I actually love it. So how do you engage the community? You've done as much homework as you can in order to cohesively describe your problem. You're going to want to look for... man, all of these transitions are not working; I had a bunch of transitions that don't exist anymore on my slides, I don't know what's going on. So: LKML is the list that most people tend to think about when they're thinking about submitting fixes or talking to the Linux kernel community. The problem with LKML is that it's something like two to three thousand messages per day. Very few people are actually monitoring all of that, and the people who are monitoring it are probably not the people most able to help you. Instead, I would greatly recommend you try to find a subsystem-specific list. In my case the XFS development list exists, and the way I interacted with them was through patchwork.kernel.org, which is a great interface into the mailing lists without actually having to subscribe yourself and read all of the email. So check out the patchwork system. There's also lore.kernel.org/patchwork or something like that, which works for LKML, for subsystems that don't have their own specific lists. Right now I'm working on a scheduler problem, and there is no subsystem list for the scheduler, so I'm probably going to cry for a few days while I have to be subscribed to LKML. Yeah, there's a cgroups list that covers the scheduler? I'm going to talk to you afterwards. All right, I learn something new every day, right? Other things you're going to want to consider: some of the subsystems are also going to have their own web pages. The XFS project, for example, has its own web page with a number of other ways of talking to them; for example, they were sitting on Freenode in #xfs. You might also want to check #kernel, or #ubuntu-kernel if you're hitting an Ubuntu kernel problem; there are lots of smart people in there trying to help you. The other one is kernelnewbies, which is a great place to start if you have a patch that you want someone to look at before you actually submit it to LKML. With all of these, though, be patient. This is kind of fire-and-forget: come back maybe an hour or two later and hopefully someone's responded, and if not, maybe the next day, because you're talking about people who are spread around the globe. They may be sleeping when you post your message; they may wake up, check their IRC logs and see, oh, I got pinged, or, this is a very interesting patch, let me go look at it, and they'll respond to you then. So be patient, and come back and look.

All right. So what I ended up doing is I sent an email to the linux-xfs mailing list. Being a responsible developer, I included the full oops output so they could make their own determination about the problem I was hitting, and I included the analysis I had. Basically I said: my best guess, given a code analysis, is that we are unable to allocate a new node in the allocation group free list. And I got a response, which was: "Without xfs_repair output we have no idea whether this is caused by corruption or some other problem. If I had a dollar for every time I've seen this sort of error report, I'd have retired years ago." That was from Dave, who's actually one of the main XFS maintainers. He responded to me, he told me exactly what to do, and you know what my job was then? I went and did it. I grabbed that xfs_repair output and then I responded to him on IRC via #xfs, which is actually a great way to communicate with him; he was really communicative. As soon as I sent him the xfs_repair output, he said: this is a known issue, it's commit such-and-such, "libxfs: pack the agfl header structure so that the AGFL size is correct", and "I need to resurrect some old patches I had that automatically detected this condition and fixed it." So he had a pretty good idea of what was going on. And looking at it, for all of the pain I'd been through, this is the commit that was actually pushed into the 4.5 kernel that bit us. It's about 20 characters of change, and it caused me two to three months of work. The problem with this commit is that it's actually completely correct: it fixed a bug in the pre-4.5 kernels where the structure wasn't compatible between 32-bit and 64-bit. The problem is, when you went from a pre-4.5 kernel to a post-4.5 kernel that had this commit, that free list went from 127 elements to 128, and now that 128th element had that NULLAGBLOCK value in it. So this is a perfectly valid commit. So what do we do? The temporary fix: I went back to ELRepo and I created two bugs for them, and I said, hey, look, you have people upgrading from the 3.10-based kernel to the ELRepo kernel, which is 4.10, and that upgrade crosses this 4.5 boundary; we need to remove this fix. And so I got them to remove that commit from the ELRepo kernels. And then I considered rerunning xfs_repair on every volume in our entire cluster. If anyone's ever tried to do that, you'll know how impossible it is, because in order to do that you need to boot into single-user mode.
It's incredibly painful, and it's much easier to just redeploy the entire world. My ops team was not cool with that, especially when I asked them and they said, hey, that's not okay. So what I ended up doing is I started creating a fix-up patch set: that thing Dave said he was going to do by resurrecting his old patches, I started working on myself. And then this happened. Yes, life happens even to kernel developers. I had my beautiful daughter, and fortunately, even though I was out of commission for quite a few weeks dealing with a new baby, the real fix came along. I had been doing things properly, engaging the community and giving them all the information, and they identified the real fix. There were actually four separate patch rewrites on the linux-xfs mailing list, and three months later they delivered this SHA into the XFS development tree. Now, remember our earlier discussion about how the kernel is developed: this went into the XFS development tree, not into Linus' tree. At that point in time, Linus was at the end of his RCs for, I think, 4.17 or 4.16, so he wasn't accepting any new fixes, especially something as large as this ended up being, because basically, when you mounted the file system, it now validated a whole bunch of the metadata to make sure nothing was broken, which could in and of itself lead to breaking things. So it was not acceptable to Linus at that point. We actually had to wait another month for Linus' merge window to open for 4.17-rc1, and that's when this fix got pushed into 4.17. And once it was in the mainline tree, we knew it was stable enough and people had vetted it, so we could submit it back into the linux-stable tree. I actually did the linux-stable submission and backport for that, and Greg Kroah-Hartman accepted the patches and added them to the stable queue, which is another tree within his git directory on kernel.org, and that eventually gets merged into all of the stable branches for all of the distributions that follow stable maintenance practices. Remember how I mentioned that your distros are going to try to follow stable maintenance practices? This fix ends up getting pushed to them because it's coming in through the linux-stable process. Wow. So we've covered a lot; that's what I was talking about. The last thing to do is: don't forget to upgrade your cluster. We were still hitting this problem probably three or four weeks after the commit with the fix was pushed into the ELRepo repository, and the reason is that we have a kind of rolling kernel update process, because we can't take our entire cluster down at once; we only have so much spare capacity. So we were still hitting this for a few more weeks after it was actually pushed upstream. All right. I don't see too many people sleeping, so: any questions? Thoughts? Yes, ma'am. The question is about embedded devices: the vendors ship kernels for particular devices that are never accepted into the mainline, and if you find a bug it's very difficult to get a patch in that way; do I have any thoughts? Yes, I have lots of thoughts about that. There are lots of problems with that. Basically, the hardware developers end up writing drivers that the Linux kernel community doesn't deem acceptable.
And the hardware maintainers feel it's easier to just fork the kernel and provide their own kernel with their patches on top. I think that's fine; that's what open source is meant to let you do. I think it's unfortunate in that once they drop support for, or stop selling, those devices, they end up also dropping support for their source trees, and they basically artificially ruin your device, because you can't get updates anymore. As a person who has a wallet, I do my best to only buy devices that push their drivers into the upstream kernel, and every time I go to buy a Wi-Fi device, I verify that the PCI IDs for that device exist in the upstream kernel before I buy it. I would probably do the same for embedded devices if it's possible. I know a lot of the graphics drivers in those embedded devices are really hard; it's not an easy proposition, and it frustrates me as well. Anyone else? Yeah, up front here. Yeah, because I had 50 minutes. So the question is that I didn't talk about crash or GDB too much; I mentioned them. How much do people want me to say about crash? Yeah, I mean, those tools in themselves could probably be another hour's talk. Since I didn't have a great example for them, I didn't really include them. I did mention that you can run GDB against your debug info. Another tool you might want to use is crash, against your debug info plus the crash dump that you capture. There's lots of great online help in recent versions of crash: if you go into crash with your vmlinux (your debug info) and a core dump, you can run help, and you can run help against every one of the commands and it'll tell you exactly what they do. It's actually really great. Just go and explore, really, is the best advice. It's very hard to give a great crash tutorial without going into specifics that don't really apply to 90% of problems. So that's kind of how I feel about that; sorry if that's not what you were hoping for. Yeah, I know, it was an extension I was hoping to add, but the timing just didn't work out.
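A sketch of what that looks like, assuming Ubuntu-style paths for the dbgsym vmlinux and for the saved vmcore (yours will differ):

    sudo crash /usr/lib/debug/boot/vmlinux-$(uname -r) /var/crash/<timestamp>/vmcore
    crash> help        # list every command
    crash> help bt     # per-command help
    crash> bt          # backtrace of the task that crashed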
Anyone else? Yeah. So, is there any sort of walkthrough that you're familiar with for how, if you wanted to crash your own kernel, you could go through this process yourself, without destroying everything? Yeah. Okay, so I can answer that pretty easily; I did this last week to investigate my scheduler problems. What you end up doing, if you're hitting a crash, is really easy: you enable crash dumps for your distro. That's going to be very distro specific, so what I would do is Google "Ubuntu enable crash dump", and there are a number of great wiki pages which explain how to do it; I think it's actually just an apt-get install of the crash dump package, if I'm not mistaken. What that does is add a kernel command line option that essentially reserves space for an extra kernel. Wow, I'm going way too deep too quickly. So: you've enabled your crash dump, you're going to have to reboot, and then what you can do, in order to get a crash dump on a running machine, is use the SysRq keys, which you're going to have to enable. So, SysRq: you're going to look for /proc/sys/kernel/sysrq to enable it, and then you can echo a 'c' into /proc/sysrq-trigger, and that will actually crash your kernel. Okay, so have nothing open, but that'll crash your kernel. And when that crash happens, because you've enabled crash dumps, it'll save your memory into a dump file that you can then, after rebooting, use with crash to do some playing around. Did that answer the question? Enough? No, no, absolutely not: you don't have to write any kind of bad C code, because echoing 'c' into /proc/sysrq-trigger just forces the kernel to crash for you, right? Right, right, yes. There are so many places you could do that; it's practically infinite. There are millions of lines in the kernel, and any one of them could probably cause this, right? Yeah, like stick a NULL into some pointer somewhere and then run that code path; there are lots of creative ways you could do that. Yeah.
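The SysRq part, for reference; only do this on a test box that already has crash dumps configured, because it really does crash the kernel:

    echo 1 | sudo tee /proc/sys/kernel/sysrq     # enable SysRq functions
    echo c | sudo tee /proc/sysrq-trigger        # force a crash (NULL dereference)
    # after the reboot, load the saved dump (location is distro specific) into crash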
Yes, sir. Was he Batman? Okay, well, we've identified Batman; I succeeded as a presenter today. Anyone else? We've got one at the back here. Yeah, I'm just wondering, from a life cycle perspective, how you encountered this issue in your production environment in the first place. Obviously it's a low-occurrence event, but what are the best practices as far as selecting kernels to run in production? Yeah, so my piece of advice is to run the distro kernel until you cannot run the distro kernel anymore, because the distro kernels are going to be the most widely used and, as a result, the most widely tested. So, is this a pretty good example? Was it in an LTS already? No; we were not running a distro kernel when we hit this problem. So you had special requirements? Yeah. Basically, we were running the CentOS kernel, which is based off of 3.10, and we were hitting a number of problems with overlayfs that RHEL had not yet resolved, so in order to run Docker images in a performant way, we ended up upgrading to the ELRepo kernels to solve that problem. So the fix for one problem resulted in another problem. I'm actually working on getting us off the ELRepo kernels, but it hasn't happened yet; it's kind of nice that we are on them, though. But that said, you could have also run into it going from one LTS to another LTS or something? Yes, absolutely. So you were just unlucky that nobody else had run into it first? Yes, absolutely. I think there are probably plenty of people who hit this and didn't know what to do with it, and they probably just reformatted their machine. The thing is, I was hitting this maybe twice a week across 2,000 to 4,000 nodes around the globe, so it wasn't something I was going to fix by reformatting; plus, the way we were deploying these nodes, reformatting wouldn't have helped, which was also interesting. So yeah, in the back. One second. You mentioned you recommend turning on crash dumps; are there any drawbacks to that, like reduced performance or other impacts? So the way crash dumps work is that a separate kernel is loaded into RAM, ready to take over when your original kernel crashes. The drawback, then, is that you need to reserve RAM for that extra kernel, and the amount of RAM you need to reserve depends on how much RAM is in your machine, okay? So you may run into issues by having to reserve multiple gigs of RAM for this crash kernel. That's the major drawback; in terms of performance it's almost negligible. I'm not familiar with netconsole; the question was, what do I think of netconsole as an alternative? Are you talking about console logging? I'm not familiar with that. Oh yeah, but that's not quite it; he's asking about crash dumps, about getting the actual memory dump, and whether netconsole handles that. Right, but an oops is not a crash dump. Yeah, okay. So that is something we could have done instead of the serial debugger, but it wouldn't have helped if you needed to get an actual crash dump; it would have been another option as opposed to our syslog. Are you immediately following me? No. Okay. I'm ready, though. Oh, you're on point. Anyone else? Any other questions? No. And I'm working on that one right now, and I'm really excited because I think I finally root-caused it. If anyone is running Kubernetes or Mesos in the cloud, using hard limits, and has performance problems, you'll want to talk to me afterwards; I can explain some awesome behaviors around cgroups and the CFS quota hard limits that we've now identified. You're saying, hey, it's going to be six months until the mainline release? Yeah, there are a number of things to work around there. I've only been working at Indeed for a few years, and there are a number of infrastructure projects that, when I get time, I want to implement, one of those being allowing Indeed to run Indeed-built kernels. My plan is to be able to run Indeed-built kernels in those periods when I've identified fixes and want a fix now, but I'm having to wait for the upstream kernel community; and I plan on running those only until the fixes are available upstream. That's my goal, because I don't like monkey patching and I don't like running source code that hasn't followed the process, right? I like to give back to the community. Anyone else? Yeah, but it's hard, because if you don't have the infrastructure, you have to create the infrastructure, and our build system is all built on Jenkins, which is made for Java and not for building RPMs. So I have a whole other pain that's all about trying to build RPMs in Jenkins, which, I don't know, maybe I'll open source someday. Yeah, my friend again, my SysRq friend. Yeah, I was just wondering: when you submit a fix for a bug, are there any regression tests created for those bugs that you also submit? Yeah. So testing in the kernel community is done by the kernel community. It's also done per subsystem; for example, XFS has its own set of regression tests. I don't know how many regression tests were actually added for this specific bug, but sometimes those tests are subsystem specific and sometimes they're not. Again, it's one of those things: if you have a way of writing a test for this, by all means write one.
But I know of no holistic, kernel-wide regression mechanism that is available. I know, for example, that the Ubuntu kernel team runs autopkgtest plus a number of other testing frameworks to verify that they haven't caused any regressions, and I assume there's something similar for Red Hat and any other major distro. Yeah, I could theoretically add to those or work on those open source projects, but you have to have the time and the desire, I guess. I have the desire; I just don't have the time, is what it really comes down to. Yeah, so apparently there is a unit testing framework in the kernel. Yeah, I think I remember reading something about that on LWN. Go ahead. To repeat that for the video: as of, I believe, 4.20, there is actually a unit testing framework in the kernel. So, 5.0? No, no, no. So I was one release off. But yes, there is a unit testing framework in the upstream kernel; you just have to be running the non-LTS kernels that came out this weekend. Yeah, yeah, okay, so that's why I don't know about it; I've been distracted by other problems. All right, anyone else? We're good. All right, I'm going to call it. Thanks, guys. And check out the eBPF and kernel debugging talk. Oh, yeah, okay, cool.

Good afternoon. I have the pleasure of introducing Alison Chaiken. She's going to be doing a presentation on virtual file systems: why we need them and how they work. A little bit about Alison: she's been an automotive systems programmer and kernel engineer and has worked at Nokia, Mentor Graphics and Peloton Technology. In 2014 and 2015 she collaborated on-site with a customer in Germany. Alison has spoken at events including ELC and ELCE, USENIX, LibrePlanet, linux.conf.au, Automotive Linux Summit, SCALE and Maker Faire. She organizes the monthly meetings of the 2,800-member Silicon Valley Automotive Open Source group. With that, please give her a warm welcome. Thank you. Okay. If you came here under the impression that this was part of America's Got Talent, you'll quickly realize your mistake: I'm not going to sing or dance. I'm going to tell you about virtual file systems and how they are one of the core technologies of Linux. We heard something about file systems and virtual file systems in the previous talk, and I'm going to keep on that theme. This work has also been described in a blog posting at opensource.com, Red Hat's website, that covers a lot of the same topics; it's very similar to this talk, except it has different mistakes in it. I work on putting Debian into trucks. This is the product I work on, and these actual truck drivers are some of my co-workers. The team that I'm on is hiring systems programmers and kernel engineers, so hit me up if you are interested in doing the kind of work I'm going to talk about next. The specific technical topics I'm going to cover are: how "everything is a file" works inside Linux, which is part of the POSIX UNIX heritage that Linux has; what virtual file systems are and what they have to do with the file systems we are all familiar with, like FUSE or Btrfs or ext4; then I'll focus in on procfs and sysfs, because those are two of the most important file systems in the kernel, and if you understand them (they are a little bit simpler), you'll understand a lot about the internals of other file systems; and I'll show you how to observe and monitor file systems with eBPF and the BCC tools that are built on eBPF.
It sounds very forbidding and opaque when I give you those acronyms, but actually BCC is the easiest-to-use Linux kernel debugging and tracing tool ever. If you stick around, I can certainly convince you that if you use BCC and eBPF, the hard part is figuring out what the data means, not how to get the data, as was discussed in the previous talk. Then I'll turn a little bit to bind mounts and the overlay file system. Those are two ways of faking out file system permissions and file system paths that are part of the virtual file system layer itself, and those facilities (bind mounts, overlayfs, and mount namespaces) make possible containers, as well as the read-only root file systems that are critical to embedded IoT devices and to live media boots. I'll talk, to the extent there's time, about some examples involving systemd and the Kali Linux live pen-test distribution.

To begin with, we should talk about what a file system is. To start consideration of this topic, I turn to the well-known book Linux Kernel Development by Robert Love, one of the early principal contributors to the kernel. Robert Love's book says a file system is a hierarchical storage of data adhering to a specific structure, which is a little bit of a vague definition. I would say that the phylogenetic family tree of single-celled organisms in the diagram there is, by this definition, a file system. But that's not what the Linux kernel means by a file system; it means something very specific. In particular, the Linux kernel takes as a file system any software object that has a struct file_operations implementing these three methods: read, write, and open. If you have used Linux much, if you have looked at the Linux man pages, you realize that read, write, and open are in man 2, the second section of the man pages. That's because they're system calls, and these kernel functions are the kernel side of the system calls that are documented in the second section of the manual. It's hardly surprising that for something to be a file, you need to be able to read, write, and open it. So that is the kernel's definition of what a file is.

Let's turn to the idea of virtual file systems and understand the role they play. I'm going to start to talk about what virtual file systems are by describing how they're used. This diagram, which I'm afraid is a little bit complicated, is really the heart of the talk. If we're out in user space and we type cp or echo or cat, then user space hits the system call layer, usually starting by calling open. Open then goes into the virtual file system layer. The main functions the virtual file system provides are that it resolves paths and checks file permissions. This is important because the reason containers, read-only root file systems, and live media can fake out the permissions and the paths is that the virtual file system does this for all file system implementations; I'm going to show examples of this in just a minute. When the virtual file system layer gets the open request, it resolves the path, looks at the file type, and from the file type it looks up the open method, or the read method, or the write method, that pertains to that file system. And then the file system functions themselves will go and talk to the device drivers, or they will perform an access to memory in the event that the file system is memory-backed; those are called pseudo file systems in Linux.
So a very common thing you might do is copy a file from an NFS-mounted drive to your disk drive. When you're doing that, you're actually making a request to the network using the methods from NFS, the data is getting copied up into the virtual file system layer and its data structures, and then we're using the write method of ext4, perhaps, to write it out. So we're actually translating from the format of one file system to another via this layer in the middle; if you heard the previous talk, XFS is using B-trees, and other file systems have their own on-disk formats. That's all a little bit abstract. Just before I move on to some more examples and explanations, I wanted to show you the first paper ever about virtual file systems, from 1986, about BSD, which one of my co-workers shared with me. It made me laugh, because this figure, which is 33 years old, is virtually the same as the figure that I made for this talk; if you get the answer right, you end up with diagrams that are almost the same. At least the graphics have gotten a little bit better. So if you are thinking about this splitting of the file system layer in Linux into virtual file systems and file system implementations, and you're a software developer, you might immediately think about object-oriented programming and realize that the virtual file system layer is in fact an abstract interface from which the specific file system implementations inherit. Just to remind people who haven't thought about object-oriented programming in a little while what inheritance is: this is the figure from the Inheritance article on Wikipedia, in which we see that Animal is a parent class and Dog is a specific implementation of Animal; Dog may either inherit or override the move method from Animal, and Dog may extend Animal by adding the bark method. By the way, if anybody is completely puzzled by what I'm talking about here, if I'm using acronyms you don't know or if my font is too small, please do speak up and I will try to do better. Making the analogy between Animal and Dog on the one hand, and virtual file systems and things like FUSE on the other, we can see that the virtual file system layer does not define open, read and write; they are "virtual void" if you are an object-oriented programmer. Yes, sure; is that better? It's going to move again, probably, but we'll hope. Okay, just shout out if it's not okay. Perfect. So the open, read and write methods are virtual void: they are actually null function pointers in C, and so file system implementations must define them; but there are also stubs in the virtual file system layer that implementations can inherit, or implementations can redefine functionality that's in the parent class. Just to show you what an actual file_operations function table looks like, this is the one for ext4. We can see here read, write, and open. I actually lied a little bit, because you can define a read or write iterator instead of functions called exactly read and write; and here's open. In this function table a lot of the function names say ext4, and those are the specific implementations for this type of file system, and then there are other functions, say generic_file_splice_read for example, which is inherited from the virtual file system layer.
So before I get into some details and some demos: the previous speaker mentioned the layout of the kernel source, and there's an fs top-level directory. In the fs directory are C files, and those C files in the top-level fs directory are the virtual file system code; the subdirectories have names like btrfs and ubifs and reiserfs, and those are the specific implementations. The important functionality that the virtual file system provides is that it resolves paths and permissions, and that's how it makes it possible to fake them out, which is what I'm going to spend most of the time talking about. And then, of course, inheritance from a parent class means that we have a lot of code reuse, which prevents code duplication and leads to better code quality, unless of course the parent class has terrible root-compromising bugs, in which case all the implementations inherit those bugs. And this is not just theory; this is something of a troll written by Jonathan Corbet on LWN last December, where an XFS developer (since XFS was discussed in the previous talk) found that there were bad security problems in the virtual file system layer itself, and fundamentally all Linux file systems therefore had the problem. So code reuse is great, except when it's really terrible. And I should say, all this magenta text is hyperlinked, so if you download the PDF of the slides you can follow up and have a look at all these articles.

So, having talked a lot in general about virtual file systems and file system implementations, I now want to show you some little demos. These are the sorts of things that, depending on what you have installed on your system, you could actually type along with and follow. I'm going to focus in on procfs and sysfs, because those are two super important file systems in the Linux kernel. procfs is a file system where the kernel publishes statistics about itself, publishes tables of data. sysfs is the file system that's used by udev to see when devices appear and disappear; it's critical to hot plugging, is what I'm saying, and to file system mounting and unmounting, things like that. And you might say, well, why do we have procfs and sysfs, why two flavors? That's an interesting question that I hope to answer right now by showing you a little bit about them. So, hopefully this is large enough to see; is this font good for everybody? Okay, so let us begin. Let's start by using the stat command to say stat /sys/power/state. /sys/power/state is a file in sysfs: it is about one page of memory and it is a regular file. That seems hardly surprising. Now let's suppose we say stat /proc/meminfo. /proc/meminfo is the file that the free command uses to find out how much free memory there is, how many shmem buffers are in use, how full the dentry cache is, all that kind of thing, how many shared pages. Well, if I say stat /proc/meminfo, we see that /proc/meminfo is of size zero and its type is a regular empty file. That seems kind of surprising: how would free work if /proc/meminfo were empty? So let's try wc -l /proc/meminfo, and it says that /proc/meminfo has 48 lines. Well, that is really odd. Suppose we say head /proc/meminfo: it looks like /proc/meminfo is kind of a table with information in it. Well, what about file /proc/meminfo? Empty. That seems really strange. So how about if we say cat /sys/power/state? Well, that has three strings in it instead of a table.
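If you want to type along, the demo is just these commands (the exact sizes and line counts will differ on your machine):

    stat /sys/power/state     # sysfs: a regular file, about one 4096-byte page
    stat /proc/meminfo        # procfs: size 0, "regular empty file"
    wc -l /proc/meminfo       # ...and yet dozens of lines
    head /proc/meminfo        # a table of memory statistics
    file /proc/meminfo        # reports "empty"
    cat /sys/power/state      # a few short strings, e.g. "freeze mem disk"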
So in fact, if you echo freeze or mem or disk into /sys/power/state, you suspend or hibernate your system immediately, which is really another fun, cool demo, but not the subject of this talk. Okay, so we see that sysfs files tend to be regular files of one page of memory with a few strings or a few numbers in them, whereas procfs is apparently empty but also apparently full of tables of statistics about the kernel. So what is up with that? Well, let's go back to the slides here; I've just shown you all that. In fact, procfs contains per-process stats. Most of what is in procfs (see if I can make the audience dizzy by switching screens here) are these numerically named directories. Each of those directories corresponds to the process ID of a running process, and in each of those directories are files with statistics about that process, and then the top-level files are statistics for the kernel as a whole, for example the free memory. So there seems to be a kind of duality to procfs: it's sort of full and sort of empty. Since my training is in physics, that reminds me of a famous paper called "Is the moon there when nobody looks? Reality and the quantum theory", and the money quote from that paper is this one: it is a fundamental quantum doctrine that a measurement does not reveal a pre-existing value of the measured property. What I mean by quoting that is not to suggest that procfs is actually quantum mechanical, but there is this sort of Schrödinger's-cat situation where the files in proc in fact do not have anything in them. What they actually are is hooks that trigger callbacks in the kernel, and those callbacks generate the information that is apparently in the file when you request it. So those files in procfs are callbacks, and there's a file API (because everything is a file) that allows you to request statistics from the running kernel dynamically. That is what procfs is, and that is not what sysfs is. sysfs is an interface that the kernel manages for publishing reactions to events, very often hardware events. sysfs is a mechanism for reference counting objects: when objects are created in the kernel they appear in sysfs, and the kernel is publishing to user space the fact that a hardware event has occurred. sysfs allows those reference-counted objects to be configured in many ways; there are a lot of tunables. sysfs, notably, is the embodiment of the kernel's quote-unquote stable ABI, the application binary interface to user space. You may have heard how Torvalds periodically lets loose with an epic rant about how one must not break user space, and in fact, if you follow this hyperlink, if you dare, there is one such infamous rant, which it would certainly be inappropriate to read at a family-friendly conference like this. What this is all about is that if people were really allowed to change sysfs between kernel versions, then the kind of ability the previous speaker talked about, of changing kernel versions without reinstalling the entire distribution, would not be possible. The stable sysfs is why we can change the kernel version and not change libc or the rest of user space, and it's also why we can change user space and keep the 3.10 kernel if that's what we want to do; they're independently developed projects.
So I'm going to talk now a little bit about how you can observe what sysfs does using eBPF. If you've come to this talk because you hate BCC and eBPF, which is what Brendan Gregg is talking about in the other room, you are going to be disappointed, because I also am going to talk about it; it seems you cannot get away from it today. So, how many people have heard of eBPF and BCC and know what they are? Okay, a couple of people. I think eBPF and BCC are the new hotness, so the good news is you can learn about them right now. In this console on the left is the BCC tools repository, which you can git clone; the link is in the slides. If we look at what's in this repository, there's a bunch of Python scripts and text files. The Python scripts have suggestive names, for example tcptop, llcstat, xfsslower, and, what else, killsnoop, offcputime, and so forth, so you can more or less guess what these tools do without even looking at them, and alongside each Python script is a text file that is its documentation. These tools are super, super easy to use, and I'm going to show you a couple of examples. The first one I'm going to show you is trace.py. I'm going to show you that we can use trace.py to watch sysfs in action, so that we can see that it is in fact a kernel facility that reacts to system events. So I'm going to say sudo ./trace.py and then the probe point sysfs_create_files, and that is all I have to do, except type sudo. So now, what has happened? The Python files that are part of these BCC tools actually have snippets of C source code embedded in them. When I run the Python script, the Python passes the C snippet, which is really about one line of C, to the Clang compiler; the Clang compiler compiles the C, and a system call inserts the bytecode that it generates into a virtual machine in the Linux kernel. This sounds like science fiction, but really, there is a virtual machine inside the Linux kernel that interprets bytecode, and you can load bytecode into the kernel by calling these Python scripts that get Clang to compile the C and make the system call. And look how easy this was: I didn't write this; trace.py is just part of these tools. Okay, so now trace.py has told the kernel: print a message whenever this probe, sysfs_create_files, fires; you can guess what it does, it runs whenever files are created in sysfs. And it's going to return the process ID, the thread ID, the command that called sysfs_create_files, and the invocation. So now, here's my amazing demonstration: I am going to insert a USB stick, and the virtual machine is watching, and there we go. We see that process ID 2186, a kworker thread (a kworker is just a kernel thread that services work queues), has called sysfs_create_files. So this is a kind of live demonstration of the event reaction that is implemented via sysfs, and in fact the result, if you look at dmesg at the same time, is that device sdb1 appears in /dev. So this really shows you how that works, except it's a little bit unsatisfying, because we didn't learn what file got created, and it was sort of a pretty toy thing. So let's look at a more sophisticated invocation of trace.py that gives a little bit more information. This would be a little bit too much typing for me, so I cheated and put it in a shell script. Basically we're still calling trace.py; we're calling it with a
So let's look at a more sophisticated invocation of trace.py that gives a little more information. This would be a little too much typing for me, so I cheated and put it in a shell script. Basically we're still calling trace.py, but now we're calling it with a -K flag so that we'll get a kernel backtrace, and we have the full function signature here, from the C source file of sysfs_create_files, with its arguments. At the end of this line — I find this so entertaining — you can see what is clearly a format string that you would pass to printf; this part of the line really is C source code. If you can program either C or Python you will love the bcc tools, because there are 80 examples in the bcc tools repository, you can easily extend them, and you hardly need to be a genius C programmer to create this, I think we can agree. We are going to print out the created file name using this invocation, and because we've included the function signature, we now have to give the path to the header file that defines sysfs_create_files. This is just the same thing you'd put in a Makefile; it's kind of a little Makefile and a little C snippet all on one line, but this is the most sophisticated thing you need to do to use bcc. So we are going to run this script — oops, with sudo. Okay, so the just-in-time compilation has just occurred, the bytecode is running in the kernel, and we see the same line as before. Now I'll do the amazing demonstration again of plugging in the USB stick, and instead of just the name of the kworker and the fact that sysfs_create_files ran, now we learn that the name of the created file is in fact "events", and we see the kernel backtrace. You can see that the kworker thread called something called sd_probe_async — for, I'm sure, "SCSI disk probe, asynchronous" — which called device_add_disk, which is hardly amazing, and device_add_disk created a file in sysfs called "events".
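The speaker's shell script isn't shown verbatim, but a rough equivalent of that richer invocation, written against the singular sysfs_create_file() rather than the plural variant traced in the demo, might look like this (a sketch only; the signature and header path must match your kernel source, and flags vary a little between bcc versions):

    $ sudo ./trace.py -K -I 'linux/sysfs.h' \
          'sysfs_create_file(struct kobject *kobj, const struct attribute *attr) "%s", attr->name'

The -K flag adds the kernel backtrace, -I pulls in the header that declares the function (the Makefile-like part), and the trailing "%s", attr->name is the printf-style C snippet that prints the created file's name.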
If you watched the previous talk about analyzing oopses, look at how much easier this is. If you listen to talks about the kernel, you get the impression you have to spend all this time reading source code and desperately trying to capture oops output, and a lot of times you do have to do that, but if you just want to get some information about how things work, I would argue this is absolutely the simplest possible way. You can then go and read the code in the functions in this backtrace if you really want to know how the facility works, but you can find exactly the source files you need to look at by doing something like this. So to get started thinking about a problem in the kernel, this is totally the easiest way, and as I said, there are 80 of these Python tools, so the likelihood that one is applicable to your performance problem, your intrusion detection, or your bug is really high. And if you say, well, I don't know what tracepoint to list, there's even a tool, tplist.py, that prints out tracepoints for you, and you can also trace user space. These tools are remarkably flexible, and hopefully I've convinced you that they're actually pretty easy to use. So with that, enough about virtual file systems and how they work; let's talk more about why we care about them. I've alluded a couple of times to the fact that the virtual file system layer resolves paths and permissions and can therefore fake them out. That means we can have bind mounts, mount namespaces, and overlay file systems, and that may sound vastly dull, but in fact the reason we have read-only root file systems in embedded IoT devices, the reason we have containers, and the reason we have live media is because of these facilities; they really are fundamental to the way those applications work. To begin with, I just want to make contact between bind mounts, overlays, and things that probably almost everybody who uses Linux knows about, namely symlinks and chroots. Symlinks are a mechanism for allowing a file or directory at one path to appear at another path, and so are bind and overlay mounts, but bind and overlay are much more featureful, configurable ways of doing that. For years in Linux we had chroot, which lets you change the point at which your top-level file system begins; in some sense it does what it says, it changes where root is defined. This was useful for build systems, but it had no security: if you wanted devices in a chroot, for example, you had to bind all of /dev in, so it was useful for some things but it really wasn't secure for containers or other such applications. Bind mounting allows you to provide a file or directory at another path, but in a secure, very granular way, dynamically, and overlay is an even simpler method that's used, as the previous speaker was mentioning, by container systems and also by live media. So let's talk a little about how they work. This diagram is just a rough one to illustrate what happens with a bind mount: typically you have a directory, which can be an in-memory tmpfs directory, which is bind mounted onto a path that is actually on storage media, and as a result you see the files that are actually in the tmpfs at the new path, the one that corresponds to the storage media. Basically the bind mount makes the files that are in memory look like they're on the storage media. Now why would you want this? One reason is that you have applications that want to write to those paths, but your storage media is unwritable. Another is that you have versioned software looking at that path with the path compiled in; perhaps you don't have the source code to the application and can't change the path it reads, but you can change what is at that path without rewriting the file. So that's the fundamental idea of bind mounts. There are a lot of flavors of them, and you can really get into the weeds with the different flags you can set: bind mounts can be mirrors, they can be one-way propagating, or they can be private. That privacy refers to mount events and file system visibility; it's not the privacy of files.
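As a concrete sketch of that diagram (the paths here are made up for illustration and are not from the talk's demo):

    $ mkdir -p /tmp/upper /srv/app/config
    $ sudo mount -t tmpfs tmpfs /tmp/upper            # writable, in-memory files
    $ echo 'debug=1' | sudo tee /tmp/upper/app.conf
    $ sudo mount --bind /tmp/upper /srv/app/config    # the app's compiled-in path now shows the tmpfs contents
    $ sudo umount /srv/app/config                     # undo; the original files reappear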
All this sounds very confusing and abstract, I think, so once again I want to show you rather than tell you. To do that I'm going to — whoops — fire up a container and show you how it uses bind mounts, and to watch it I'm going to use the bcc tool mountsnoop. I'm not even going to pass any arguments; I don't even know if it takes any arguments, just sudo mountsnoop.py. Mountsnoop is going to print out the command, the process ID, the mount namespace of the mounted file system, and the invocation of the mount call. I haven't really said what mount namespaces are, but they're kind of a view of the file system; each process has its own set of mount namespaces, described in the files in /proc/<pid>/mountinfo. So mountsnoop's bytecode is loaded into the kernel, and now I'm going to start a container with systemd-nspawn, just because it's dead simple and easy to do. systemd-nspawn is a container runner like runc or Docker, a very similar sort of thing, so if you did this same exercise with one of those containers you'd see something very similar to what we're about to see. Mountsnoop is listing, I start the container, and you can see there are a huge number of mount events. Rather than trying to look at all that mess, let's return to the slides; here are some highlights of the output I captured with mountsnoop when the container starts up. If you look at the invocation of the mount command, you can see, for example, that mount is mounting the proc file system from the host into the container so the container can use it; it is in fact a bind mount like the one I was just talking about, so /proc on the host is being made available inside the container at /proc. That's sort of a boring example. The container's top-level directory, which here is called /srv/nspawn, appears as / inside the container, and none of this is very surprising. What gets more interesting is if you look through this long, long list of bind mounts that occur when a container starts: for example, there's an intentional obfuscation by default of some of the files that are on the host. Instead of the file /proc/kallsyms — the exported kernel symbols that describe the functions the kernel supports — appearing inside the container, the systemd-nspawn container runner by default mounts an essentially empty garbage file over /proc/kallsyms. If we return to these windows: if we look at /proc/kallsyms on the host, here are all the kernel symbols, the addresses of kernel functions and their names; inside the container, where the prompt says root@nspawn, if we cat /proc/kallsyms it's empty, and that's because junk has been automatically mounted by the container runner over /proc/kallsyms. This is just a quick illustration of how bind mounts are used. They're not headline news for anybody, they're not exactly a sexy feature of Linux, but they're one of the facilities internal to the kernel that makes a lot of the stuff we take for granted actually work.
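If you want to reproduce roughly what was just shown, the moving parts are only these (a sketch; the container directory name is made up and would have to be populated first, for example with debootstrap):

    $ sudo ./mountsnoop.py &                      # from the bcc tools repository
    $ sudo systemd-nspawn -D /srv/nspawn -b       # boot the container and watch the mount events scroll by
    $ head -3 /proc/kallsyms                      # on the host: real symbol addresses and names
    # inside the container the same file reads back empty, because the runner bind-mounted junk over it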
Let me talk a little more about read-only root file systems. This is another topic whose importance and interest are not immediately apparent, but just to point out: if you have a server, you shut it down cleanly so that you don't have to run xfs_repair or fsck at the next boot. If you yank the power out of devices, then the journal, which is a transaction log for the file system, and the data on the storage device can get out of sync, and that makes life very sad. But think about, say, a car running Linux, the type of system I've worked on. Truck drivers, as wonderful as they are, do not type "shutdown" when they key the truck off, and people feel free to jump-start vehicles, where Linux comes partly up, goes down, comes partly up, goes down; that's how I think about jump-starting cars, given my life experience. The reason that's okay, the reason you can just take the battery out of your thermostat and not have to fsck when you put the battery back in, is because the root file system is unwritable. The fact that it's unwritable has a lot of benefits: it means it never gets full, it means malware can't modify important things on the file system, it means that if you work in field support, the device someone is complaining about should at least be the same as the device you have in terms of what files are on it, and at application design time it forces the developers to really separate data and configuration from binary programs, because by the design of the file systems we are not letting them write files where the binaries are stored. Having said that, read-only root file systems can be a real headache: Linux systems will not boot unless /var is writable, and there are a bunch of upstream programs that expect to be able to write in the user's home directory. There are various ways to finesse this, but a really great way, suggested by what I was talking about a few minutes ago, is to bind or overlay mount tmpfs in-memory file systems, which we are free to scribble on, over the paths where applications are looking for their files. I talked a few moments ago about bind mounts, where essentially you replace the files on the storage media, from the point of view of applications, with files that are actually in tmpfs in-memory file systems. OverlayFS is a somewhat easier to use but less configurable method that is also used by containers and by live media. If you've ever wondered how you can run from a live CD — obviously you're not writing /var to a CD, but the system has to write /var/run/utmp, for example, which holds the information that who or w read. OverlayFS is different from a bind mount in that it's supposed to be, as the name suggests, an overlay, kind of like a transparent overlay: you don't replace files that are already there unless there's a path-name conflict, but you compose two file systems together, so we can make the files that are actually in memory and modifiable look like they're in the same directories as files that are actually on the unwritable storage. To illustrate that, I've got one last quick demo. Let's make an upper directory that we're going to overlay on top, /tmp/upper; now let's make /tmp/work, which is a scratch directory that overlayfs needs; let's quit this container; and let's copy the passwd file that I just happen to have sitting around here into the upper dir. Then let's mount a file system of type overlay, named overlay, with upperdir=/tmp/upper, workdir=/tmp/work — hard to type at this angle — and lowerdir=/etc, onto /etc. What could possibly go wrong? So we know we've got our overlay there, and if I say whoami: dmr. That seems odd. What's in /etc/passwd? Just root and Dennis Ritchie. What's in /tmp/upper? Well, it is that same file, so it is now appearing at the path where my previous passwd file appeared. Now I can unmount the overlay — oh, but I can't, because Dennis Ritchie, great a god as he was, doesn't have sudo on this system. Happily, I'd already logged in as root on the console, having done this demo before; I reinstall my system a lot, it's a lot of fun. Anyhow, now everything is back to normal.
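Typed out, that overlay demo is roughly the following (same directory names as in the talk; as the speaker says, "what could possibly go wrong", so a throwaway VM is a good place to try it):

    $ mkdir -p /tmp/upper /tmp/work
    $ sudo cp ./passwd /tmp/upper/passwd              # the edited passwd file he "happened to have sitting around"
    $ sudo mount -t overlay overlay \
          -o lowerdir=/etc,upperdir=/tmp/upper,workdir=/tmp/work /etc
    $ cat /etc/passwd                                 # now shows the upper-dir copy (root and dmr)
    $ sudo umount /etc                                # everything back to normal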
Let me point out the advantage of doing this with an overlay: if I had bind mounted that /tmp/upper directory onto /etc, the only file ls /etc would show would be passwd, and everything else would be gone. I did that in an earlier version of this demo and of course I had to reboot. If I overlay mount it instead, then all the files in /etc except the one with the name conflict appear unchanged. This example is silly because it's a demo, but you can see real-life use cases of something like this: most obviously, if you want the passwd file in your container to contain just root and a container user, this is a mechanism for doing that. And features you're used to from live media — being able to change your preferences for fonts or something when you boot from an install CD — very often that's actually done with overlays. So as I said, bind mounts are more powerful, but overlays are actually a little simpler. Coming to my conclusions: virtual file systems go all the way back to 4.2BSD; they are part of the POSIX heritage of Linux. Virtual file systems, being the shim layer between system calls and specific file system implementations, are what makes "everything is a file" possible in Linux; they are really where the rubber meets the road for "everything is a file". procfs and sysfs, which I used as examples, are two of the most important file systems in Linux and two of the simplest, so they're a good way to understand these concepts: procfs is a place where the kernel publishes statistics that are actually generated when you ask for them, and sysfs is an event-management interface, a place where the kernel maintains reference counts to objects that have a lifetime. Bind mounts and overlayfs are part of the virtual file system layer, and they make a lot of the magic of virtualization, containers, and IoT devices with read-only root file systems work; they're not very sexy, but they're very important. And finally, the bcc tools and eBPF are much, much easier to use than ftrace; unlike ftrace they don't produce megabytes of output that you have to painfully grep through, they typically produce one or two lines, they have wonderful documentation, and I urge you to give them a try if you have not already done so. With that I'd like to thank some of my colleagues who made helpful comments on this talk, advertise that my colleague Sarah Newman is giving a talk about blind users in Linux at 6 p.m. in Ballroom H, and here are some references where you can learn more about this topic. Thank you. Someone is asking where the PDF of this talk is. This talk is posted; the link is in the announcement for the SCALE program, so if you look at the SCALE program schedule and find this talk, there's a link to these slides. Or if you look at the title slide here, I actually have the URL — well, no, I didn't put the URL in here. Anyhow, you can also find it — [inaudible] — yeah, great, thank you. And yes, I didn't really get to that, but there's a question about Kali Linux; I actually have that in the supplementary slides. Kali Linux is dependent on overlayfs. I thought it would be interesting to take actual live media that a lot of people use, look inside it, and see how bind and overlay are used. Kali Linux, for people who don't know, is a live distro you can use for pen testing; it's a kind of generic Linux, but built in are a bunch of fuzzing and pen-testing tools. What I looked at in this particular instance, though, was not the tools that are in it but how this live distro works.
So, inside Kali Linux, when you boot it there's some stuff that starts running, and then it mounts a squashfs, which is yet another file system implementation. I don't know if this text is large enough to read; let me zoom in — oh wait, is that better? I thought maybe it would let me zoom, but I guess not. So I loop mounted, which is yet another kind of mount, the ISO for Kali, and if you look inside the loop mount there's a squashfs, which is just a compressed file system that Kali Linux uses for its root fs; it uncompresses it and runs from memory when it starts. And if you look inside the squashfs there's a suggestively named shell script called 9990-overlay.sh, and if you look inside that shell script, it makes an overlayfs like the one I just showed you. So then you have your root fs, which is actually running in memory, and it is a positive feature of a live-media distro that you don't modify it, right? You don't want to go modifying the pen-testing tools when you're actually trying to attack or probe or investigate somebody else's system; what you want is just a copy of these things that you can use. You don't want to trash the files in the distro so that you have to reburn the CD or the USB drive before you use it again; you'd actually want them to always be the same. So this is a mechanism that allows you, just as I was illustrating, to temporarily change, to your heart's content, the things in this in-memory file system, and when you reboot, all those changes go away. So this is really handy. If you want to save out files that you're generating when you're running something like this, you have to have another storage device to copy them to. I guess that was it, but you can get a copy of this ISO from the Kali Linux website; Kali Linux is GPL and open source, and you can do the same loop mount and see for yourself. I tried to work this up as a demo, but just booting a live CD inside QEMU wasn't very exciting to watch, so I kind of gave up on that.
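For the curious, the same poking around can be done by hand along these lines (a sketch only; the exact paths inside the ISO and the hook-script names vary between Kali and live-boot releases):

    $ sudo mount -o loop kali-linux-live.iso /mnt/iso         # loop mount the downloaded ISO
    $ ls /mnt/iso/live                                        # the compressed squashfs root image lives here
    $ sudo mount -t squashfs /mnt/iso/live/filesystem.squashfs /mnt/squash
    $ grep -rl overlay /mnt/squash/lib/live/boot | head       # the live-boot scripts that set up the overlay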
Yes, your question: which file controls the permissions of the file that's actually seen in the file system, the one that's been overlaid or the one underneath? Good question. It's actually the file that's on top, except — well, let's go back to that slide to make it easier to understand. And as a follow-up, while you're thinking about it, the same question applies to the directory itself, the directory permissions. Right. So in this case, a file in the lower file system that appears unmodified will... is that true? Yeah, because that file is really on the storage. That's a good question; we could actually try it and see. Certainly the files that are actually in memory, in the upper directory, you can do whatever you want to them, and of course that's part of the point. If the files underneath were owned by root, or unwritable, they would still be that way after the mount; I'm pretty sure that if a file is not overlaid it retains its permissions, because there's no copy of it in the tmpfs. Actually, the nice thing about all this stuff is that it's easy to try. I very much like this investigation method, and then afterwards, if you want to change something or you're still puzzled about how it works, you go and read the source code. Any further questions? Well, thanks, everybody, for your attention. Can you guys hear me okay in the back there? Okay, good. Five more minutes — just making sure it works. Good afternoon. I have the pleasure of introducing Mr. Damon Edwards; he's going to talk about "Sysadmin to SRE: Creating Capacity to Make Tomorrow Better than Today". A little about Mr. Edwards: he is a co-founder of Rundeck, Inc., the maker of Rundeck, the popular open source self-service operations platform. Damon has spent over 20 years working on both the technology and business sides of IT operations and is noted for being a leader in porting cutting-edge DevOps techniques to large-scale enterprise organizations. Damon is a frequent conference speaker on a variety of topics in DevOps, SRE, operations, and the human side of all of it, and is also active internationally. With that, please give him a warm welcome. Thank you, thank you. Good, start with a nice applause; it's only downhill from here, right? So, how many of you are systems administrators by nature, by trade? Everybody? Good. Anybody just got the SRE title yet? A few folks? Nice, all right, good stuff. My talk is about the long-suffering, often thankless work of operations, and how we're always looking for better ways of working; SRE seems to hold that out as a guiding light, sort of like the DevOps movement did for the rest of the organization. This talk is really about what I've noticed from companies that are moving towards the SRE style of working: what's getting in their way, and for those that are being successful, what they are focusing on. It's less about the folks who have grown up their company in, or started with a blank whiteboard on, what is now called the SRE model, and more about companies that have what I'd call legacy, a history of success, and have had to pull that into the next generation. A little bit about myself, because it really matters for why I'm up here: in my career I've had this interesting opportunity to see inside all kinds of companies, high flyers, low flyers, startups, big household names, and that happened through my previous life as the managing partner of DTO Solutions. We got really known for doing DevOps and operations transformations, more so the ops-towards-dev than the dev-towards-ops that I think a lot of people see in the DevOps conversations. I do a lot of work in the community; I'm on the content committee for Gene Kim's DevOps Enterprise Summit — if you work in an enterprise it's a great conference; Jason Cox down here, raise your hand, Jason, is another helpful organizer, and it's a great event. I also wrote a chapter in the new O'Reilly SRE book, Seeking SRE; it's a very interesting book, a collection of essays, the first kind of non-Google SRE book from O'Reilly. So seeing inside all these companies really helps inform why I'm up here today. And so, a little story: not that far away, maybe in a company just like yours, let's think about life in operations. What happens? First of all, we all know this:
it's the overloaded, constant firefighting, whether it's tickets, the alarms going off, the phone going off, the project due yesterday, the project due tomorrow — fires going off everywhere. Life in operations is marked by this idea that we're overloaded, we're running at 110 percent. And then we're always waiting: ticket queues for everything, whether it's us waiting for somebody to do something for us, or the other position, a bunch of people waiting for us to do something for them — a constant waiting game. And then things keep breaking; they break again and again. How many times are you responding to the same problems, the same known incidents, over and over and over? Really two things are happening here: someone needs something fixed and they're waiting on it, and there are the constant interruptions being pushed onto people — hey, can you help me with this, can you help me with that — and this sinking feeling that everybody's busy but nothing's really getting better. It's the constant firefight, the constant overload; we're constantly being told there's a new improvement project or a new initiative we have to do, but the entire organization is running on overload. And to make matters worse, here come the business execs, grumbling to themselves: those operations folks — everything takes too long, it costs too much, it breaks too often. That's just not fair, right? So all of us together are one big grumpy mess. And I swear this is the only cat picture in my deck; I figured it would go over well at SCALE — for the enterprise crowds I talk to, I don't put the cat in there. So now there's this SRE thing. Oh, this is interesting: there are little whispers here and there, and little books you can read that sound like fantasy lands. But then you see the Walt Disney Company, you've got Standard Chartered Bank out of Singapore, you've even got IBM talking about it, and Ticketmaster — talk about big legacy — people are talking about this SRE thing and how it's working for them. So it generally goes like this: the same executives that were griping about those operations folks say, hey, here's this SRE thing, it's great, Google does it, so here's what we're going to do — we've got sysadmins, so we're going to make them SREs, just like Google's got, and it's going to be great. Nothing's going to happen, right? They've got the plan, they put it in the OKRs, it's going to be amazing. Then the boss comes down — oops, I missed the slide, sorry — and says, hey, look, your new title is SRE; please write code and be better at operations. And you're sitting there going, all right, I've got my CAB meeting calendar, I've got this 13-week provisioning process, and these exhortations that quality is job one — if we just concentrated harder, we'd stop making problems. So what happens is, basically, we have the old world: overloaded, constant firefighting, waiting in queues, things breaking and breaking again, everyone's busy, and bosses are griping that things are taking too long and costing too much.
So now we've got this new SRE world, everyone's an SRE — and oh, it's the same problems; we're in the exact same place. And the reality is that changing job titles or adding individual skills doesn't make systems administrators SREs. We've got our friend here, and it's great, we can dump all these new skills on them — observability, traceability, new programming skills, think about things as code, put everything in an SDLC, learn about distributed architectures, get Michael Nygard's Release It! book and learn all the anti-patterns, get good at incident response and blameless postmortems — but the reality is, if we're still wrapped in that same organizational trauma, so to speak, nothing's really getting better, and these are not SREs; they're just long-suffering systems administrators with new, more marketable skills, and they'll probably put that on LinkedIn and be gone in six months for richer pastures. The reality is that if you dig into the organizations that are being successful, SRE is a rethinking of how operations work gets done. It's not about the tools or the technology; it's about how we approach it, what the mindset is, and what our organizational philosophy is around operations work. So if you think about what these principles are, what sets SRE apart, the Google folks were helpful and defined three core principles of what it means to them, and I think they're pretty much spot on. The first one is: SRE needs service level objectives with consequences, with consequences being the key part. So let's talk about that. In the old-fashioned world we had this notion of an SLA, which was something operations agreed to, and if operations blew it, the penalty would be on them. In this SLO world — it's just one letter different, but there are some key differences — we start by saying, all right, there's some service level indicator, and that's got to be a business-measurable metric: it could be response time, uptime, transaction time — I'm saying a lot of "time" things over and over, but there are other things as well. And we get together with the business and say, for this service level indicator, what is actually acceptable? If partners only submit their requests once a day and we expect a next-day turnaround, we don't need five nines on that service. What is the acceptable level of error, or the acceptable level of slowness, or whatever it might be, that we will take? What's interesting is that, instead of the traditional SLA world, which says ops must be perfect — if there are errors at all, what's wrong with you; your SLA says you'll be penalized if you fall below this level, and if you're not hitting a hundred percent, something must be wrong — this world says: the gap between the service level objective and perfection is a budget. That budget should be used to say we can go faster, we can run things cheaper, we can do whatever we want, as long as we're within that budget and we haven't blown it.
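To make the budget idea concrete, the usual back-of-the-envelope arithmetic looks like this (assuming, purely for illustration, a 99.9% monthly availability SLO):

    $ echo '30 * 24 * 60 * 0.001' | bc    # about 43 minutes of "allowed" downtime per month

Everything under those 43 minutes is budget the teams can spend on risky releases, experiments, or planned maintenance; blowing through it is what triggers the consequences.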
But really this is about shared responsibility. Unlike the SLA world — where operations signs up for an SLA, generally under duress and political pressure, and thou shalt keep the thing at this standard or else the consequences are on you — the SLO world is pretty dramatically different: everybody, development and operations, the people who build, the people who run, and the people who pay the bills, all agree that the most important thing in the business is the running service and meeting that SLO. So when something goes wrong, it's no longer operations' responsibility, it's everyone's responsibility, meaning development has to swarm to help figure out what's wrong, and that means new business features are not getting done, and that means the business has to be okay with that; there's no "we've got to get on to the next set of customer features" while the site is on fire. It's a radically different departure from the mindset of traditional operations and SLAs; it's really all about shared responsibility. So that's number one. Number two of the three is: SREs have to have time to make tomorrow better than today. You'll probably notice that from my title; this is a biggie to me. And this idea of toil — you're starting to hear "toil" thrown around; have people heard that term in places? Yeah. So Vivek Rau at Google wrote the chapter on toil in the Google SRE book, and I should say about the Google books: people have mixed feelings about them because they're very Google-y, but the reality is there are a lot of trends in our industry all filtering towards the same ideas. Much like DevOps suddenly sprang up when Patrick Debois held the first DevOpsDays, you can look at all kinds of people in the industry arriving at the same principles; if you look at what's going on in the Google books, you'll see things happening at different companies that arrive at the same ideas — it just happens to be the Google flavor. But they did a very good job of defining some of these key ideas and terms. So: toil is the kind of work tied to running a production service that tends to be manual — stuff you do by hand; repetitive — doing it over and over again; automatable — you could have automated it; and tactical and devoid of enduring value. "Devoid of enduring value" is, I think, the key message there: it's not adding to the business; we're not making tomorrow better than today, we're just fighting a fire or doing whatever is necessary to keep the lights on. And the last one is a very interesting litmus test: it scales linearly as the service grows. If you have five people taking care of a service with, say, a thousand users, and suddenly something happens and you have a hundred thousand users, can you run it with the same five people, or do you need 20, 30, 40 people running that service? That's a good sign of how much toil is in the system, of how much the care and feeding of it grows with the scaling of the service; ideally the care and feeding is the same whether you have zero users or a billion.
So if you want to break it down, what's the opposite of toil? The opposite of toil, what we want to have, is engineering work, and engineering work is the stuff that takes human creativity. Put them side by side: toil lacks enduring value — it's ephemeral stuff we have to get done, maybe it's necessary, but it's not going to move the business forward; repetitive cleanup work or responding to an incident is necessary to keep the business going, but it's not adding enduring value — versus engineering work, where we're building enduring value, doing something that's going to make tomorrow better than today for ourselves and for the business. Toil tends to be rote and repetitive — and sometimes it's nice to zone out and do the same things over and over — but we're not really adding value, because what we should be doing is work that is creative and iterative; we're the most expensive assets our companies have, and we should be doing things that move the ball forward and require human ingenuity. Toil tends to be tactical; engineering work tends to be strategic. That scale factor I was talking about: toil increases with scale, while engineering work enables scaling. And toil can be automated — it probably should have been automated — whereas engineering work, by definition, generally can't be. Stuff that requires our brains: we want to be doing that; it's better for the business, it's better for us, it's better for humanity. So what's the deal here? The deal is we have to be careful about the balance of toil versus engineering work, because what can you do with engineering work? One thing is what we talked about, improving the business, but the other is using engineering work to reduce the toil, to fight back against it. Otherwise what happens is what I call engineering bankruptcy: when the toil is at such a high level that it crowds out our engineering work, we not only aren't improving the business, we have no capacity to reduce that toil. At that point we're in a death spiral: we can't help ourselves, we're just going to be shoveling as fast as we can for as long as we can until we fall over, because we don't have the engineering capacity to fix whatever is causing that toil. And note, importantly, that toil is a naturally occurring force. It's very interesting that Google, with Borg, the poster children for the new cloud-native world, have an entire book about operations with a whole section about these ideas of toil, because you never have unlimited budget and unlimited time. Niall Murphy, who now runs SRE for Azure, talks about this: because of budget constraints, and because no one has perfect foresight, you always start a new service with a whole set of things you don't have automation for — you've got to log in and hack away at something or redeploy something. Level two is "I've got my script that I hacked together, it's in my home directory": externally maintained, system-specific automation. Then, okay, let's share that script, or maybe start to put it into a tool like Chef or Puppet
or Ansible or something like that — externally maintained, generic automation. Then I start shipping that stuff with my releases, so it comes as the helper scripts in a helper-script directory that I can get at alongside my service. And ultimately I build it into the system: I've written the application to maintain its own database, or I've created an admin panel in the application where people can add their own users or partner interfaces or whatever it might be. You're going to go through this cycle over and over again for each application; each time, the toil gets less and less, but you're going to be launching new things, so it's really a naturally occurring cycle that goes around and around. Toil is not something we're going to solve once and be done with. So let's get to the third principle. The third principle of SRE is that SREs have the ability to regulate their workload. This is where it gets crazy for people. The idea is: what if handing off responsibility to SRE, or ops, or whatever you want to call it, wasn't a right? What if it wasn't guaranteed — you couldn't just say, hey, I'm done, marketing's ready to go, here you go, ops, it's got to go to production, you take it over? It's really this idea of separating "running in production" from "being run or owned by SRE or operations". And that's usually the point where people say, you're nuts, that can't possibly happen — anything that goes into production must be owned by our production operations team. It's a pretty dramatic idea, but if you have the right guardrails in place and you think about your organization as dividing those two things up, you'll see more and more organizations saying: hey, development, you run your stuff in production, you take care of it. If you look at the Google model, they actually have a safety valve: once a service hits a certain threshold of stability and certain criteria are met, you can hand it off to this other team, the SRE team, and they will help run and maintain it for you; but before that point, it's the development team's responsibility to care for and feed that service in the production environment. Now, it takes a lot of tooling and enablement and a change in mindset, but it is a pretty dramatic step forward, and one of those core principles that sets an SRE way of working apart from an operations way of working. There's a great talk Stephen Thorne did at the DevOps Enterprise Summit — there's a link there, and my slides are already on my Twitter — where he goes through these principles in depth. But the question always comes up: okay, where should I start? What's the practical way to get started with this stuff? This is where I depart from the canonical SRE lecture, where people say the place to start is number one: if you have service level objectives, with those consequences, and that shared responsibility
between dev and ops and the business, then you have an SRE organization. Well, the problem with that is that it's a company-wide culture change, and that's really hard and takes a long time; if you just come in talking about it, they might think you're a little bit crazy. I'm not discounting that that's where you eventually have to get to, but it's probably not the best place to start. Likewise with the idea that we should separate going to production from turning over the care and feeding of an application to operations — that's another company-wide culture change that's really painful. Those conversations need to happen, but I find, time and time again, that the companies that get started and make progress under their own control are focusing on this middle one: reducing toil. It's one of those things where everybody wins. Who's really against the idea of "we're going to be more effective, we're going to get more done, we're going to save more time, and we're going to feel less out of control"? No one really says no to that. So I strongly recommend that that's where you want to focus: reducing that toil so you have time to make tomorrow better than today. The rest of my talk is going to be about just that. Why focus on reducing toil? Number one, it has lots of value independent of SRE: you can never say the word SRE and still sell the idea of toil, of reducing it, and of what the organization and the individuals are going to get out of it. You're not going to find many takers on the other side of that argument; it's a pretty mom-and-apple-pie kind of thing, and it's obvious: your people are your most expensive assets, so stay out of their way. Toil comes in two areas. One is in delivering planned work: do we have a lot of ticket systems we have to bounce through, a lot of repetitive things we have to do over and over, do we constantly have to talk to the same firewall team to get the next thing changed — or, from the firewall team's side, are people day in and day out asking me for the next sequence of ports to sign up another partner, or something like that? A lot of toil comes in on the planned-work side, and also in responding to incidents: are we giving people the right instrumentation, the right collaboration or checklists or investigatory tools, are we empowering them to make decisions, pushing control closer to the edge, and are we empowering them to take action? In both planned work and incident response there are all kinds of places we can remove toil from the process. But there's a lot that gets in the way. I did a talk at a conference last year — the link's in the slides — where I spent half an hour going through all the things that get in the way of traditional operations organizations, even when the dev side of the house is cloud-native and DevOps-friendly; you can watch that, I think it's a fun talk. But the long and the short of it is that silos and queues are the major causes of this dysfunction.
So what am I talking about? Silos. People think about silos as "oh, it's just a team", but it generally doesn't have a lot to do with the organizational structure; it's generally more about how a team is working. Think about it: if everyone's working in one room — say we're a five-person startup or whatever it might be — we've got the same backlog, the same information and context, we're working from the same tools and using them the same way, which is important, the same set of priorities, probably the same boss. That's great: we can work fast and furious, we can get things done with very little friction between us. The problem is that nothing ever lives in isolation in the enterprise; you're always going to need something from somebody else, and this is where the silo effects start to take hold. I'm focused on my work in silo A, someone in silo B is focused on their work, and I'm pushing work or requests onto them, or waiting for something from them, and all that does is disrupt my flow of work. So I turn more and more inward and focus on optimizing what I do — I'm the best firewall-rule changer west of the Mississippi, and that is all I'm going to worry about — and if someone tells me my process slows them down, well, forget them; I'm looking at my card, it says firewall engineer, and I'm focused on that. These silo effects really start to cause disconnects, these mismatches between the context people are working in, the processes, and the tooling — you even see people who are both using a common tool but using it in totally different ways — and a mismatch of capacity: that best firewall-rule changer west of the Mississippi, there are only three of them, a hundred operations people, and maybe 500 developers, so there's a mismatch to watch out for. This is a big source of toil; these mismatches cause these disconnects and these problems. So how do we cope with it? We've got this beautiful thing called the ticket queue, or the ticket system: the team just puts the work in there, it batches it up, and it sends it on to the next person. But the reality is we know what happens: you put something in there, it sits and sits, you've got to get a project manager to poke that person to tell them to do something; then they look at it and don't understand what you're talking about — because you're not the firewall rule expert, they are — so they send it back and say this is no good, and you send it back to them, and they tell you you've got to wait until next Tuesday; and when they finally do it, they don't realize that they disconnected something else you also needed, and that causes another round and around and around. Sometimes you just say, forget it, I'll fix it myself on my end; other times you've got to scrap it and throw it back. Working through these queues causes all these interruptions and waiting and toil; this is a major source of
that toil, all the extra work we have to do that gets in the way. And these ticket queues are also what we call snowflake makers. The term snowflake — which is great, because I didn't invent it, someone else did — is this idea of something that's technically acceptable but is a brittle, unreproducible thing: manually done, a lot of one-offs. When you have this ticket-queue style of working, the queues don't really learn; they just become a bunch of functional silos of experts, and those experts are probably good at what they do: they get a ticket, they parachute in, they do something, they get the next ticket, parachute in, do the next thing, and whatever they leave behind might be technically perfect, but it's just a little bit off, or a little bit different, or the next person's not sure what it is. And — not to mix metaphors here — all those little snowflakes become time bombs that can interfere either with somebody doing something manually, making yet another snowflake, or with trying to do automation, because the only thing worse than automation that doesn't work is automation that's just a little bit wrong. So these ticket queues become snowflake blowers, and that's another major source of toil. So how do we get rid of this stuff? A few things. Number one, we have to actually track the toil levels for each team — and I don't mean pure time-tracking; usually I see folks use their internal scrum-master or project-management function just to interview people and get a sense, in a given week or day or whatever it might be, of how much time is being spent on toil and how much on engineering work — and, importantly, set toil limits for each team. Be specific about it: the canonical number out there is 50 percent — we don't want any of our operations teams spending more than 50 percent of their time on toil, this interrupt-driven work; we want them spending more than 50 percent of their time on engineering work. By setting those limits and being hard about them, we can actually start to fund the efforts to reduce the toil. It seems obvious, but I'm surprised how many organizations try to go down this path without actually starting with: track the toil level, set the toil limits, and then fund the efforts. Now, the first obvious approach is to just get rid of it: get rid of the excessive dependence on ticket queues, solve these problems in our applications, stop bouncing things around, by refactoring our apps, our tools, our processes. That's one way to go, a good way to go, and you can catch it where you can, but you're often depending on other people in that scenario, so it's a little hard to get started there. It's good to focus on, but the one I want to talk more about, the one near and dear to my heart, is applying a self-service design pattern. Get out of this mode where you're waiting for somebody to do something, or someone's waiting for you to do something, or
you're constantly being interrupted because people are waiting on you, and get into pushing self-service. People often approach the self-service question with "well, where do I start?", and the most powerful way I've seen — Scott Prugh at a company called CSG, one of the biggest outsourced billing companies in the world, I should have a link to him, was talking about this recently — is that they made it a 20%-time project: empower the teams to find and fix all the little things they can, and turn on self-service for other people in the organization. Just look at the ticket queues you're constantly interacting with and find all the little things you can start to chip away at; do that at scale and you get massive efficiency improvements. Some examples. "Do this, then do it again, and then do it again" — the obvious ones, the repetitive requests: you're incurring waiting on the requesting side, and you're pushing interruptions onto the people on the other side, and the whole time they're filling those repetitive requests they're ignoring the other work they should be doing. It's a big source of toil in most organizations, and often it's death by a thousand cuts — never one particular thing, just all these little repetitive things. So look to put self-service in place: have the experts set up those procedures, and what effectively happens is that those people are happy — kind of the Tom Sawyer idea, they're happy painting your fence for you — because they get that fast feedback, they get to do what they need to get done, and you can focus on the things that move the business forward, like getting rid of toil. The classic one is always access control: "I could do that, but I can't get to it — that's got customer data, so I can't restart that service even though I know it craps out all the time" — so now I have to push work onto this other part of the organization to do it for me; more toil. One of the cool things I've seen in big companies with this self-service idea is that they can actually hand out access to privileged environments to people who traditionally didn't have it: letting developers restart things in production, letting a load-test team refresh and reconfigure environments that have sensitive customer data in them, letting people do database maintenance or schema updates they traditionally weren't allowed to do, making firewall rule changes for repetitive things like partner onboarding, or spinning up new environments without hand-crafting new firewall routes. Instead of saying I have one trusted team that can do that, I can now hand it off to other folks and distribute control. And that actually passes muster with the audit crowd — they actually prefer it, because these people have limited access; it's not the traditional "well, here's a shell script, your SSH key, and your sudo privileges, good luck". It's more like: we've set up this self-service route for you, and these guardrails are in place. It's a great way to cut out that toil. A fun one, and we've seen this time and time again:
"I'm an expert, I don't read that wiki." The service has changed — because, as I mentioned, people are changing things — and they put it in the doc, they wrote it in the wiki: use this flag or bad things will happen; or, hey, the NOC says, can you pause that monitoring, because every time you do that it wakes all this up and it takes us 20 minutes to figure out it's just you doing a restart. And you know what always happens: that person rolls out of bed at three in the morning — "I've been here for seven years, I know what I'm doing, I'm not going to read a wiki to do a restart on a service" — and boom, the whole thing blows up. If you look at it from the self-service perspective, you have the team collaborating on those jobs, so the documentation is the code, so to speak; when these inevitable changes come up, these little things that need to happen, they go into the self-service job, so when that person rolls out of bed at three in the morning, they don't have to remember what the firewall rule team told them last Thursday — it's in the self-service job, we know what buttons to hit, or what command line to run, and we're off and running. And the last one here, another anti-pattern that's pretty prevalent, is this idea that dev work is more expensive than ops work: ah, developers, they're hot shots; ops folks make a lot of noise, they'll be fine. Oh — I didn't change the slide, that's why. You see this happen time and time again: we get these problems, someone goes, ah, service problem — oh, I know this one — you fix it, you move on. Sure enough, a week later, a month later, a day later, whatever it is, the same problem happens; you fix it, you move on. Eventually the operations side goes, hey, can we fix this, come on? And I've seen this more times than not: sometimes it's just passive-aggressive — the dev side just closes the bug, closed, closed; sometimes they'll give an answer and say it's not in the budget, and besides, you have a workaround, so don't come to us with your problem, you can fix it. We actually had a customer — this was funny, it was their first Rundeck job they ever put together — a mid-level insurance company that did specialty underwriting, where to underwrite a policy five or six people had to log into a web app and do their underwriting work, and there was some bug in the code such that if a person got a cup of coffee and their web session timed out, or they closed their browser and didn't hit that "release policy" button, the policy was locked at the database record level. In general, the way they found out was when some high-value Mr. Big Shot who owns 19 buildings was coming into the insurance broker's office to talk about his policies, the broker couldn't get into the policy because the thing was locked, everybody freaked out, the entire customer service center got yelled at, and they literally had to raise a P1 bug against the DBA team to go and fix it. This happened 30 to 40
times a month right this this would happen because there's hundreds of people involved underwriting process and they literally kept telling the operations team you know sorry there's no budget to fix that we're going to be replacing this system you know often that happens right within the next 18 months or two years or whatever and you know of course it never happened and they're like this is this is insane and it kind of went up to fight up to the up to the cio and the cio actually um you know uh sided with the development headed development at the same time there was a meeting the very next day where he was yelling out the operations team for not being able to hit all of their all of their uh all their milestones to improve the service and this is a great example of sort of you know how things ran in the organization so just by creating a self-service job to let those customer service agents unlock the record themselves a little power shell script that they that they ran to uh to do it uh you know they basically were saving 30 to 40 uh p1 dba bugs and the whole fire drill uh per month right so you know well yes it's the best way to do it is to you know to be able to to say hey let's fix these problems i think that is again like kind of like those um other longer conversations like i you know i mentioned earlier this idea that hey you know this operations work matters this is our business this is what we do and you know springing services or keep us in business and we have to prioritize that and and you know looking at the true cost of this stuff um in the meantime you can throw some self-service in there and you can kind of get rid of a lot of these repetitive uh repetitive uh problems so you know this pattern i'm talking about i mean this is something that we've seen uh from different folks in the uh in the industry i know that the uh not not any one company you know the disney folks this morning we're talking a lot about empowering self-service i mentioned scott prue um some other folks i'm gonna talk about later that that have done this sort of thing but and we're trying to capture this idea but what's kind of shaping up to be is that we have these uh consumer somebody needs an operations capability right so they have to be on the dev you know outside of operations could be within operations and we have this expertise um in the house and the idea is find a way to let the experts create the standard operating procedures that we can then let the other folks uh you know need as as needed and the key kind of ideas see key patterns to see over and over again number one uh that's pull based right that it's not um you know i'm asking somebody to inform me it's i can use it as i need on demand uh number two we see this idea that you know people try to solve this problem by saying oh we're gonna pick the next great tool right like last year well this year it's kubernetes right this brings change everything last year is ansible before that it was chef before that was puppet before that it was blade logic before that i was cf engine and now when you go into the company is they have everything right so the idea of one tool to rule them all just never really never shapes up right and frankly it's kind of an anti lean or a sort of an anti pattern to um to try to push tools on other people in other groups it's better to let them as much as you can pick the things they want to work with if that's the power shell team that's awesome if they're an ansible team you know or maybe a pure python that's 
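To make that unlock example concrete, here's a minimal sketch of the kind of self-service job they describe — in Python rather than the PowerShell they actually used, and against a purely hypothetical schema; the table, columns, and arguments are invented for illustration:

import sqlite3
import sys
from datetime import datetime, timezone

def unlock_policy(db_path: str, policy_id: str, agent: str) -> int:
    """Clear a stale record lock on a policy and note who released it."""
    conn = sqlite3.connect(db_path)
    with conn:  # commits on success, rolls back on error
        released = conn.execute(
            "UPDATE policy_locks SET locked = 0, released_by = ?, released_at = ? "
            "WHERE policy_id = ? AND locked = 1",
            (agent, datetime.now(timezone.utc).isoformat(), policy_id),
        ).rowcount
    conn.close()
    return released

if __name__ == "__main__":
    # e.g. python unlock_policy.py policies.db POL-12345 jane.doe
    count = unlock_policy(sys.argv[1], sys.argv[2], sys.argv[3])
    print(f"released {count} lock(s)")

Wrapped in a self-service job with access control, the customer service agent gets the unlock in seconds, and the job run itself becomes the audit record.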
So this pattern I'm talking about is something we've seen from different folks across the industry, not any one company. The Disney folks this morning were talking a lot about empowering self-service, I mentioned Scott Prugh, and there are other folks I'll talk about later who have done this sort of thing. We're trying to capture the idea, and what it's shaping up to be is this: you have a consumer, somebody who needs an operations capability — they could be on the dev side, outside of operations, or within operations — and you have this expertise in house, and the idea is to find a way to let the experts create the standard operating procedures that the other folks can then use as needed.

There are a couple of key patterns you see over and over. Number one, it's pull-based: I'm not asking somebody and waiting to be informed, I can use it as I need it, on demand. Number two, people try to solve this problem by picking the next great tool. This year it's Kubernetes, and it changes everything; last year it was Ansible, before that Chef, before that Puppet, before that BladeLogic, before that CFEngine. And when you go into a company, they have everything, so the idea of one tool to rule them all never really shapes up. Frankly, it's a kind of anti-lean anti-pattern to push tools onto people in other groups. It's better to let them, as much as you can, pick the things they want to work with: if that's a PowerShell team, that's awesome; if they're an Ansible team, or pure Python, great; if those people over there love shell scripts and that's all they know, that's fantastic too. Let them use the tools that work for them, and focus on encapsulating those tools and creating the self-service layer. Because you're not forcing everybody into one language, you can define a sort of meta layer around it.

We also see people really focusing on setting up guardrails: how do we make these jobs safer and safer to run? Whether that's making them idempotent — you can press the button 35 times and you still get one service running, not 35 — or allowing only certain values, because the experts in a particular area know the right incantations and the right flags, but other folks don't, so guide them. Another key principle you see time and time again is that self-service isn't just the ability to push a button; that's a rigid kind of self-service. True self-service is more like the AWS style. Think about Savvis or the old business-hosting portals: you hit a button and it fired off a static process telling people to go do something from a catalog. Versus the AWS model: we'll give you guardrails, but you define your own button, your own images, the configurations you want, and we'll keep you inside this framework, so it's standardized but you can do what you need to do inside it. The people who are really successful with self-service are the ones driving that idea of letting people define the procedures they need to run, so it's not just a one-way street.

And it's really critical to build security and compliance in. Access control is a must, but you also see folks teaming up with the compliance people and saying: this is actually a better system of record. Our old ticket-driven theater of compliance had all these approvals and ITIL-style templates, and at the end it came down to somebody doing the work, and it's Bob again, and Bob just marks it "done." Well, what happened? Do we even know? Was it really Bob who did it? What did Bob do? Did Bob do anything at all? Did Bob run Bob's script version one or Bob's script version seven? Maybe Bob ran Jane's script. Nobody knows. So being able to actually record what happened through this self-service capability is a great way to get funding, and it gets compliance onto your side.
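A minimal sketch of what those guardrails can look like in code: an allowed-values check plus an idempotency check before doing anything, wrapped around a restart action. The service names and the systemctl-based commands are assumptions for illustration, not anyone's actual job definition, and in practice this would sit behind the self-service tool's access control and logging:

import subprocess

# Guardrail 1: the expert encodes which values are allowed at all.
ALLOWED_SERVICES = {"billing-api", "report-worker", "policy-cache"}

def restart_service(name: str) -> str:
    if name not in ALLOWED_SERVICES:
        raise ValueError(f"{name!r} is not a self-service restartable service")

    # Guardrail 2: idempotency -- if it's already healthy, do nothing,
    # so pressing the button 35 times still leaves exactly one instance.
    status = subprocess.run(["systemctl", "is-active", "--quiet", name])
    if status.returncode == 0:
        return f"{name} already running; nothing to do"

    subprocess.run(["systemctl", "restart", name], check=True)
    return f"{name} restarted"

if __name__ == "__main__":
    print(restart_service("billing-api"))

The expert owns the allowed list and the command; the consumer just picks a value and presses the button.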
Let me give a couple of examples. It's interesting to see people who have taken this self-service idea on their own and used it as the foundation for a more strategic initiative. Hometown heroes here: Ticketmaster. Jody Mulkey was their CTO at the time; he's now over at Aspiration, the big unicorn bank. They had a very strategic problem: back in the 2014 time frame, their average MTTR for web-facing outages — the "Yankees fans can't print playoff tickets" or "you can't buy your concert tickets" kind — was 47 minutes. For a company that was already getting beat up by the world, that's rough, because that's not TechCrunch news, that's New York Times news when something goes wrong.

The way it worked was: do the work, throw it over the wall, and then there was a long escalation chain before anything came back to the people who actually knew something about the service. They had a NOC, and the NOC were basically escalators: they watched the screens, called people, walked up the escalation chain, and a lot of time was spent on that. They had all the context as they watched things happen, yet the people they were waking up and calling had none of it and now had to get involved — a big problem. So they started handing those procedures off to the NOC and turning them back into operators, pushing control to them, and testing those operational procedures all throughout the process, so that when something finally went into operations, the operations team was empowered to take charge and solve problems up front. Eventually I think they even got rid of the NOC altogether and pushed that control all the way back to the development teams. But in that context, in about 18 months they went from an average MTTR of 47 minutes down to 3.8 minutes — a reduction of 90-something percent, I can't really do the math in my head, but it's dramatic — all from saying: let's empower the people closest to the problem to solve the problem, and for the people we do have to escalate to, let's give them quick self-service access to production environments as well. So they took the idea of self-service and turned it into a really strategic initiative: MTTR reduced by 92 percent, escalations reduced by 50 percent (which made everybody super happy, since they were woken up half as many times), and overall support cost dropped by about half as well. A huge win, and I think they've gone a lot farther since then; that foundational thinking also made their Kubernetes introduction over the last few years go quite well. It's a cool story.

One more story: Sean Norris at Standard Chartered Bank. They're a huge bank that no one here has ever heard of; they're based in Singapore and London and have branches pretty much everywhere except the US and the UK, so all over Africa, all over Asia, a lot of emerging markets. They have around 86,000 employees in 60-plus countries, which means 60 different regulatory regimes they have to adhere to. A systems administrator in Malaysia can't access someone's data in Singapore; it's that kind of crazy stuff. And I think Queen Victoria signed their banking charter something like 300 years ago; it's nuts. So everything about the bank was built for compliance, and they had this problem where nothing got done, because it was a bunch of tickets, a bunch of manual work, a bunch of one-offs, because of all those different regulatory regimes and all those points that couldn't connect.
So the whole point of rolling out self-service for them was, strategically, number one, to have a common way of doing things, so they could stop all the manual toil — I think they had five or six thousand systems administrators doing things manually — but more importantly, to distribute that workload across the organization and let people solve problems for themselves without having to escalate into the ticket black hole that gets routed across different countries and different teams before anybody gets involved. It's a really compelling story; their talk covers a lot more than this, but in just a year's worth of time they saved 28 person-years of effort just from rolling this out, and handled 13,000 operations tasks in privileged environments that didn't require review.

And this part is super fun. Because of banking regulation, they had to prove that people who did anything in a production environment did what they said they did, and they couldn't figure out how to do that. So what the regulators made them do — the lowest common denominator across all those countries — was this: if you had to do something in production, even though you were already background-checked and everything, you went to a special ticket system and requested a production access token. With that token you logged into a special shell, entered the token, and that turned on keystroke logging and screen recording software that recorded everything you did. At the end of your session it emailed your direct boss a notification that the video was ready for review, and the boss had to watch the entire video and certify, basically on pain of their job, that you didn't do anything untoward. They avoided 13,000 of those video-watching sessions in one year just by going down the self-service path. Pretty miserable, but that's life, right? And they had 200 fewer customer-impacting outages. Their biggest problem was never random things breaking, because it never is; it's all self-inflicted, so that's 200 fewer gunshots to the foot during that year.

So again, it started as a little kernel of self-service. Just like at Ticketmaster, where it was an engineer named Mark who started it as a semi-skunkworks project to build self-service capabilities, and it got leveraged by the CTO into great things. Same thing at Standard Chartered: Sean Norris, the SVP sitting above the whole organization, saw a few people who — it wasn't really skunkworks, they got approval for it — went through the pain themselves to set up self-service for their own teams, and it got picked up and spread throughout the organization, and now they're all heroes.

So, where could you all help? We're trying to document this pattern and talk to different people. If your company is moving to a more SRE-driven, self-service way of working, I'd love to talk to you about it. We're trying to catalog these ideas and turn them into resources,
because I feel like there's a gap in the world: a gap between the organizations that grew up with an SRE, cloud-native point of view and those that have been around for 10 or 20 years, came from a different DNA, and are figuring out how to move toward it. So if you want to talk about it, I'd love to talk with you. rundeck.com/self-service is where you can see this; it's an open book project, and I'd love feedback on it and to talk to you about it.

So, to recap: SRE is more than just a title. You've got to focus on those principles, focus on changing how operations works; it's better for us, better for everybody around us, better for our companies. Things like error budgets and toil limits are really the different way of thinking that sets this apart. Be practical: start by focusing on the toil first. It helps people's lives individually, and it shows the business that we're working on something new that delivers immediate value. Find and fix those anti-patterns; it's the best way to do it. Empower people to get going and solve the little things that get in their way: look in their ticket queues, figure out which repetitive things they can start to tackle, and apply that self-service pattern wherever you can. Then refactor things and get rid of those handoffs altogether. That's my talk. I'm Damon Edwards on Twitter, the slides are already up there, email me anytime, and I'll take any questions or comments.

Maybe I'll take the comment that there's a lot of overhead people work with, because maybe there's no recognition of what went on underneath to actually solve the problems we're talking about. What we did at my organization is we implemented a system. First of all, it involved smart commits, where we required ticket numbers in every commit to a code repository, and we used Bitbucket Server. We used Rundeck to automate some tasks, and through a combination of Python, shell, Rundeck, and smart ticketing we were able to provide automation. We actually required people to put a ticket number into the system before something like a tagging operation took place, so it became self-documenting: we would publish the results of the operation and people could see what parts of the repo got changed or altered during the tagging process. I wanted to mention that because, in my experience, it's a key to effectively leveraging a ticketing system: it actually required people to document things, so you can actually use the ticketing system properly. Right, and that's probably a policy question.
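A minimal sketch of the smart-commit idea described above: checking that the most recent commit message carries a ticket reference before a tagging operation is allowed to proceed. The ticket-key pattern and the way the message is obtained are assumptions for illustration, not the commenter's actual setup:

import re
import subprocess
import sys

# Hypothetical convention: every commit message must reference a ticket like OPS-1234.
TICKET_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")

def head_commit_message() -> str:
    return subprocess.run(
        ["git", "log", "-1", "--pretty=%B"],
        check=True, capture_output=True, text=True,
    ).stdout

def require_ticket_before_tag(tag: str) -> None:
    message = head_commit_message()
    match = TICKET_PATTERN.search(message)
    if not match:
        sys.exit("refusing to tag: no ticket number found in the HEAD commit message")
    subprocess.run(
        ["git", "tag", "-a", tag, "-m", f"tagged for {match.group(0)}"], check=True
    )
    print(f"tagged {tag} against ticket {match.group(0)}")

if __name__ == "__main__":
    require_ticket_before_tag(sys.argv[1])

The result is the self-documenting loop they describe: the ticket links to the commit, and the published tag output shows what changed.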
And my beef isn't with tickets; my beef is with the ticket queue. It's people forcing their work through a queue: they break the work up into little functional pieces and it sits in people's queues. The queue is the problem; the ticket as a record, I think, is actually a great thing. We see people doing this all the time: I'm still using the ticket as a system of record, but the work isn't sitting in a queue anymore; I'm empowering people with self-service and just using the ticket as a way of documenting it. Now you might argue that's kind of an odd thing — if you were starting from scratch, why not use a database, why are we using a ticket system at all — but if the ticket system is already a fact of life in the organization, we see that as a positive. Ticket as a record is a good idea; having records is good. The ticket queue used to drive people's work is where things start to go wrong. Self-service is that key thing of getting rid of pushing work onto other people, getting rid of injecting waiting on my end and interruptions on their end. If I'm going to do it myself and there's a ticket involved in me doing it, that's great. If I'm going to open a ticket as a work order to make someone else do work for me, that's when the organizational toxicity starts to pile up. We see that a lot; we even see folks automating the creation of tickets as part of self-service: "I know that because of regulation I've got to have these 30 tickets to stand up a new environment, so I'm just going to automate those 30 tickets."

I don't want to monopolize the time, just one quick comment. When you talked about the SRE goals and objectives, I think you had something like time to improve in the middle, and control of your workload. There's also controlling what developers are sending over, that they're actually sending over the right amount of work; I feel like they call that a service-level contract, a service-level agreement. And in my experience, when you start talking about that, sometimes you can get management to agree to actually go with that middle piece, which is giving people time to improve the system.

Yeah. One comment I'll make about the ITIL thing, which is very interesting: if you look at DevOps plus SRE, it actually forms a self-regulating system; there are feedback mechanisms that make it self-regulating. Versus the ITIL model, which is a kind of top-down, hierarchical control: there's some body above me that's going to improve my work, driven through these formalized processes. The high-flying, faster-moving SRE-plus-DevOps organizations are horizontally self-regulating: they put shared responsibilities together so they control the system themselves. Because of that shared responsibility they've created between development, business, and operations, they don't need all that overhead; they don't need the CAB to tell them what to do or not do, because they've built the feedback mechanisms into the system themselves. I'm getting a bit off into the weeds here, but it's a very interesting idea. What fundamentally breaks in the ITIL world is the hierarchical command and control and the inspection by some external party like the CAB; I just don't see that working. The SRE-plus-DevOps way builds the self-regulating system into how people work, and it's a much more effective way to go about things. But that's a whole different talk. Thanks. Anybody else?
So if your team has no rule about tools — basically everyone can use whatever language they want, Ansible, Puppet, Chef, whatever, as long as they get the job done — how do you deal with that? How do you go from the situation where you have all these different tools and everybody's been using their own thing to some sort of natural reform? How do you combat that without ticking everybody off by saying "you have to use this tool"?

Sure. I think you focus on the integration points and on having a standard toolchain architecture. The easy example would be a deployment pipeline: one team can use Bamboo and another team can use Jenkins, but if you get together as an organization and say, this is what a CI server is, this is what a source repository is, this is how we're going to divide up our source code, this is whether we're going to deploy off of trunk or have long-lived branches — once you make those meta decisions as an organization, it matters a lot less whether you use Artifactory or Nexus or GitHub, or whether somebody is still using SVN, as long as you focus on the toolchain architecture so everyone has the same picture, and then focus on the integration points. If I actually have to work on your Puppet manifests, or vice versa, then yes, that's a problem; we have to work that out and come to some standardization of skills. But if we can encapsulate things at the toolchain level and say, I've got this higher-level self-service, and it doesn't matter whether it's calling a shell script or some other homegrown, full-fledged, Java-based management tool underneath, because that's my modular automation and this is my standard tool for calling it — that's a loosely coupled toolchain idea that I see working well. It's not just saying anybody can do whatever they want; there still has to be a cohesive pattern. I hope that answers the question, sort of; we can talk about it afterwards. I can give you a Rundeck example, but I don't want to shill the Rundeck thing here, so I'll talk about it later. The idea is having a framework that you can encapsulate those things in, so that I'm interfacing with that framework, not with your Python scripts. But if we have to share work on that script, then yes, we need to figure that out; you can't just say everybody can do everything they want.

...and the process becomes more attractive, and then you can use that as a carrot and incentivize people: you can keep managing that crappy little bash tool over there, or you could take five minutes and adopt the better tools. Ultimately a lot of organizational change just comes with culture change, and a little bit of that influence comes from making your tool, if not actually better, at least look better; it will get more people starting to use it, and that will help drive conversion. At least that's my experience. I hope the microphone got that.
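A minimal sketch of that "encapsulate the tool behind a common interface" idea from a moment ago: a thin dispatcher that exposes named actions while each team keeps its own implementation underneath, whether that's a shell script or an Ansible playbook. The action names, script path, and playbook file are all assumptions for illustration:

import subprocess

# Each team registers its own implementation; callers only know the action name.
ACTIONS = {
    "refresh-test-env": ["/opt/ops/scripts/refresh_test_env.sh"],          # shell team
    "rotate-app-logs": ["ansible-playbook", "playbooks/rotate_logs.yml"],  # Ansible team
}

def run_action(name: str) -> int:
    """Run a named, pre-approved action without exposing the underlying tool."""
    if name not in ACTIONS:
        raise ValueError(f"unknown action {name!r}; choose from {sorted(ACTIONS)}")
    return subprocess.run(ACTIONS[name], check=True).returncode

if __name__ == "__main__":
    run_action("refresh-test-env")

The consumer interfaces with the action catalog, not with the team's shell, Python, or Ansible underneath, so swapping the implementation doesn't change the interface.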
No worries. So, hopefully a quick question? Sure. No, I think SRE would be a subset of the broader DevOps conversation. DevOps is really an umbrella: a bunch of ideas aligned toward a set of problem statements, not really a specific methodology. But if you take traditional operations work and apply DevOps thinking to it, that's when you get SRE, which is a specific thing: a team, a way of working, a set of principles, a set of skills. They're both lean, in the lean-manufacturing sense; they both start from lean. DevOps is the more overarching idea, and I'd look at SRE as an instantiation of DevOps specifically for how we do operations work, because that work isn't going away. Some people call it service reliability engineering, which is probably a better name than site reliability engineering, but I think we're stuck with it.

Okay, lay it on me. So, you were mentioning integration points between the DevOps team and the SRE team. Would you say a possible good integration point would be the artifact repository, meaning the dev side's job is to get the artifact into the repository, and then SRE's job is to make sure it's reliable in production?

I think there are different organizational models. There's the Google model, where there's a separate development team with SREs embedded in it, a full functional unit that builds and deploys the whole application, full stack, but there's a divide between ops and dev: different teams, and they work differently. Then you look at the Netflix model, where the SREs are really more like incident commanders, because there is no centralized operations; operations is inherent to the cross-functional teams, everybody is their own autonomous team, and you build it, you run it, you own it — there are no orphans in that world. Both achieve high-velocity results with the same shared responsibility model: one does it by putting everybody on the same team, the other by building things like SLOs, error budgets, and toil limits to make the system self-regulating, so they can build that shared responsibility. So I think the org model matters a lot less than the way of working. The only complaint I'd have with what you just said — more of a "yes, and" — is about the artifact repository: make sure what goes into the artifact repository is the full stack. That's why containers are so great: it's not just "here's my WAR file," it's here's the container, everything that runs in it, the data; it's really almost the environment. I want to hand off an environment to that team rather than handing off artifacts they then have to reassemble on the other side. Or maybe the tools automatically build it for them, and that's fine too: give them artifacts plus some kind of build manifest that goes into something and produces that environment. But the full stack should be built by one side and then run by the other, and whether that other side is inside your team or in some other organization is totally up to the DNA of how your organization works. I look at the Netflix model and the Google model as perfect examples: they both seem to do a pretty darn good job, at least for themselves, with totally different organizational structures.

Anybody else? I think we're out of time. I'll be around if you want to talk. Thanks so much, I appreciate it; it's fun to be back at SCALE.
Yeah, thanks for the introduction. My name is Maik Aussendorf from Bareos. Bareos stands for Backup Archiving Recovery Open Sourced. Some people call it "bros," but that's not how it's pronounced; it's Bareos, and it stands for Backup Archiving Recovery Open Sourced, once again. Today we want to talk about resilience and disaster recovery in times of ransomware and other threats. We'll elaborate on why it's especially crucial to have open source software when making backups — if you think about data sovereignty, that's the crucial point — and then I'll give you an overview of Bareos, what it does and how it works.

I hope nobody here has seen a screen like this on their own computer. You can calm down, it's just a screenshot, it's not actually happening here. If this happens to you, you hopefully have two options: one option is to pay the fee to release your data, and the other option is that you have a backup and can just restore your data without paying the fee. And that is what we're talking about — we're not talking about the first one. Ransomware is just one possible threat to your data; other things may happen, like physical damage, software failure, user error, intrusion, all kinds of things, and of course there's always the unknown, the unexpected. In all those cases, the only thing that can save you is having a backup and being able to restore your data.

The title says "last line of defense." That means there are other things that are very important to avoid getting into such situations in the first place: you probably have firewalls, virus scanners, intrusion detection systems, and hopefully training to prevent social engineering. That's all important, but it's a subject for other talks; our subject here is resilience and disaster recovery. I like Star Trek, and here's a quote from Commander William T. Riker:
"Our daily routine is the unexpected." And that's what backup is all about: the unexpected. Does anyone know what this picture shows? It has to do with backup, but in another context. It's the Svalbard Global Seed Vault in Norway, designed to hold a large number of seed samples in order to ensure nutrition after a big catastrophe. The vault holds more than 800,000 different seed samples, it was funded by the Norwegian government, and it was used much sooner than the people who built it expected: in 2015 a research center in Syria was near the civil war region and decided to close down; they had brought their material over to Svalbard, later opened another location, and retrieved the lentil seeds they had been working with. So you shouldn't only make backups of your data but also of other important assets, and the lesson learned is that sometimes you need your backup much sooner than you expect.

Back to our subject: some general guidelines when talking about backup. If you make a copy of your data to a disk somewhere on your network, it's wise to keep it separated from the rest of your network. That means having a dedicated backup server and making sure nobody else can access the data; otherwise, if you have an intruder, they may try to destroy your backup as well. The second point is: back up your backup. It's always good to have more than one backup. What most people do is start with a backup to disk and then make a copy of that backup data to cloud storage or to tape media, so that you have two different kinds of technology storing your backups. If you back up to the cloud, you should consider encryption — or, in other words, encryption is mandatory if you hand your data, especially your backup data, over to someone else, because "cloud" is just a synonym for someone else's computer. And the last point: some people start making backups and after a while find out, "oh my, I've run out of disk space," because they never calculated how much space they would need. So if you do regular full backups, calculate the space you need, otherwise you'll run into problems soon; more people than you'd expect hit this problem. That guideline comes straight from the field.

Talking specifically about ransomware, some other things need to be considered. As I said before, if you back up to disk, make sure it's separated from the rest of your network and only accessible via the dedicated backup protocol. Even that might not be enough: if an intruder manages to take control of your backup software, they might also be in a position to destroy your backup data, even if it's only accessible via that protocol. One really good piece of advice is to have read-only media, like WORM tapes — write once, read many — which are protected in a physical way against unwanted overwriting of your backup data. And again, if you're using backup data encryption, you usually need the key to restore the data, so make an extra copy of that key; if you lose it, or if someone manages to encrypt your decryption key, you'll be in a situation where you have backup data but can't use it because you can't decrypt it.
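A minimal back-of-the-envelope sketch for the "calculate what space you need" point, assuming a simple rotation of monthly fulls plus daily incrementals; the sizes and retention numbers are made-up inputs, not a recommendation:

def backup_capacity_gb(full_size_gb: float, daily_change_gb: float,
                       fulls_retained: int, incrementals_per_full: int,
                       safety_factor: float = 1.2) -> float:
    """Rough disk capacity needed for a full + incremental rotation."""
    fulls = fulls_retained * full_size_gb
    incrementals = fulls_retained * incrementals_per_full * daily_change_gb
    return (fulls + incrementals) * safety_factor

if __name__ == "__main__":
    # Example: 500 GB of data, ~5 GB changes per day, keep 3 monthly fulls,
    # ~30 incrementals per full, 20% headroom.
    print(f"{backup_capacity_gb(500, 5, 3, 30):.0f} GB needed")

Compression and deduplication change the numbers, but running even a rough estimate like this before you start avoids the "oh my, I've run out of space" surprise.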
Some thoughts about long-term availability. If you think about backup as a kind of archive, a lot of companies need to store data, or be able to recover it, for a period of ten years or even more. Then you have to think that time span into the future, and in terms of technology ten years is quite a long time. Will your backup software still be available, and will it run on future hardware? Will the hardware available in the future be able to read your media? Imagine you have a new PC in a couple of years and a tape drive: you have to make sure the hardware you'll have then works with the media you have now. Think about that.

And if you think about the software you're using: avoid vendor lock-in, and there are different kinds of vendor lock-in. One, for example, is paying when you need to restore. There's a popular cloud service where you can store your data quite cheaply, and then you may get a surprise when you need it back, because retrieving the data is what costs. Of course the vendor knows what they're doing: if you need your data back, you usually have no choice but to pay. I think that business model is okay, you can use it, but you have to know what you'll pay when you get your data back; it should be somewhere in your budget. Then there's proprietary software, or software with usage restrictions; these are two examples from the field. One is a backup software that needs a license key to work: if the license key expires, the vendor tells you that you can keep using the software until the next reboot, because after the next reboot, or after you restart the backup software, it will try to validate the license key, and if you haven't paid the subscription fee, you won't be able to use it. That's of course not what you want, because then you don't have sovereignty over your data — that's bad. Another example from the field: after the subscription ends, you are obligated to delete the backup software, and of course without the backup software you can't restore, so that's not good to have either. And think about what happens if the vendor of the backup software leaves the market — over ten years that can always happen.

So the answer to all these questions is that open source software is crucial for backups. One thing you should really distinguish is real open source versus open core. Open core is, in the end, no different from proprietary software: parts of it may be open source, but to be fully operable, to restore your data, you'll probably need the proprietary parts, so it's nothing other than proprietary. With real open source you have no vendor lock-in at all: even if the companies behind a project disappear, the code is still available and can be adapted. The worst case is that you have to pay a developer to do the work for you, but you're at least in a position to do that; without open source it's simply not possible. Being future-proof and adaptable to future hardware can only be done with open source: if you run a backup software that's fully open source on Linux, a completely open source operating system, you have the full stack as open source, and you'll always be in a position to adapt it to your needs. And especially if you use cloud services for backup, it's important that the service provider uses open source, because otherwise you won't be able to retrieve your data when you decide to move away from that cloud service.
So always check that the software a cloud provider uses is compatible with an open source solution.

Requirements summary: backup software is only future-proof if it's 100% open source. If you want to be prepared against ransomware and other unexpected things, keep extra copies of your encryption keys, separate your backup data, use backup replication — back up your backup — and, if possible, use WORM media. As a last point, it's always good if your backup data is easily accessible even in case of a complete crash of your backup server: it's fine to start restoring your backup server, but that might take a long time before everything is up and working again, so it's good if you can directly access the backup data and retrieve what's in there without first having to reinstall a backup server. Any questions up to this point? Yes — the question was, if we back up into a cloud, what kinds of cloud storage we're going to support; we'll see that later.

Okay, that was the general part; now I'll switch to the Bareos part. As I said before, Bareos stands for Backup Archiving Recovery Open Sourced. We started as a fork of the bacula.org project. One of our co-founders is Marco van Wieringen; he worked in the Bacula project as a volunteer developer for a long time, and starting back in 2010 he collected patches that didn't make it into the Bacula mainstream. Around 2012 he decided we would make our own fork based on the work already done. Why did we decide that? We wanted to implement our own ideas, we wanted to move a little bit faster than the Bacula project was moving at that time, and we wanted to make sure it stays open source — we didn't want to do anything like open core, we really wanted to be open source. We released the first version in 2013, and since then there has been roughly one new major version per year; the current release is 18.2. We'll see a bit more about that on the following slides.

It's all about data sovereignty and no vendor lock-in. Bareos is open source, licensed under the AGPL, the GNU Affero General Public License. We did a lot of code cleanup and refactoring and developed a lot of new features over time. One of the first things was a Python plugin interface, which made it a lot easier to implement plugins, for example to back up databases and things like that. We see a growing open source community around Bareos. We have an open storage format, derived from the Bacula format and completely documented, and we have command-line tools which make it really easy to extract data from tapes or other backup media without having to set up a complete Bareos server first; in a disaster recovery scenario this helps you get back into an operating mode quickly.

Here's a little overview. In the left column you see the client side, the systems we can take backup data from: we have a Windows client, Linux and Unix clients, a macOS client — BSD is not on the slide, but there's also a BSD client. We have a plugin that takes VMware snapshots and incrementally backs up those snapshots using CBT, changed block tracking, and we have plugins that can back up Gluster and Ceph cloud storage at the block level. We have plugins to do incremental backups of Microsoft SQL Server with point-in-time recovery, and a similar plugin for MySQL; other databases like PostgreSQL we back up by making dumps, so there's no specific plugin for those, but for Microsoft SQL Server and MySQL we have point-in-time recovery.
NDMP is the Network Data Management Protocol, used by vendors like NetApp or EMC, and we support that protocol to fetch data directly from those storage systems, which is pretty fast.

In the central column we have the Bareos server; that's where we control everything. We have schedules that say when backup jobs run — for example a full backup every month and an incremental every day — and there you configure what to back up, from which client, when, and to which destination. The destination is the right column, where the backup data gets written. The easiest way is disk storage: if you have enough disk space on your Bareos server, you can write the backup data directly there, and that's what most people do in the first place. Then of course we support tapes and tape libraries and different kinds of cloud storage; Gluster and Ceph are named here, and meanwhile we also support S3 interfaces, so we can write to Amazon if you want, or to Ceph, which in some setups exposes an S3 interface, so we can use it that way as well.

A question: yes, for VMware we have a dedicated plugin; Hyper-V not yet. We support the S3 interface from Amazon — those two are basically what we support, and S3 is not on that picture yet. Are you using Bareos with S3? Ah, Google Cloud — all right, it would be interesting to hear how you do that. Very good.

In the middle column, the central column, we also have a web interface with a file browser: you select the client, see what's in the backup, mark directories or single files, and do the restore. We also have a text console, a command-line interface, to select files for restore, and a scripting interface, so you can automate it or include it in other frameworks you might have, like a web front end. There are some providers — better said, managed service providers — who use these kinds of interfaces; they have integrated solutions and run backup service platforms as managed services based on Bareos, so that's also possible.

As an overview, we don't have enough time to talk about all the features, but we support all the common features you'd usually expect from a network backup system. Multi-platform: we have binaries for Linux, Unix, Windows, and macOS. A scheduler supporting full, differential, and incremental backups. "Virtual full" means that if you have really large data sets and you don't want to run a full backup every month, you can do what we call virtual full backups: we look at all the data we already have in the backup system and rebuild a new full backup out of the existing full backup, which may be two or three months old, and all the incrementals made in between, and we use that information to construct a new full backup. Why is that interesting? If you run a full backup and then incrementals for a long time and then suddenly have to do a restore, that restore will take a very long time, because at that point the backup software has to reconstruct the file set starting from the original full backup while taking into account all the incremental sets you've made over time — which could be a hundred or so — and that would be very slow. So it's wise that at a certain point you take the existing backups and consolidate them into a new virtual full, which makes an eventual restore much faster.
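A conceptual sketch of what a virtual full consolidation does, reduced to the core idea: take the old full plus the chain of incrementals and keep only the newest version of each file. This only illustrates the logic, not how Bareos implements it internally:

from typing import Dict, List

def consolidate(full: Dict[str, str], incrementals: List[Dict[str, str]]) -> Dict[str, str]:
    """Merge a full backup with its incrementals into a new 'virtual full'.

    Each backup is modeled as {path: version}; later backups win, so the
    result contains exactly one (the newest) version of every file.
    """
    virtual_full = dict(full)
    for incremental in incrementals:  # oldest to newest
        virtual_full.update(incremental)
    return virtual_full

if __name__ == "__main__":
    full = {"/etc/hosts": "jan", "/var/db/app.db": "jan"}
    incs = [{"/var/db/app.db": "feb"}, {"/etc/motd": "mar"}]
    print(consolidate(full, incs))
    # {'/etc/hosts': 'jan', '/var/db/app.db': 'feb', '/etc/motd': 'mar'}

A restore from the consolidated set touches one backup instead of replaying a hundred incrementals, which is exactly the point being made here.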
"Accurate" means that you also keep track of deleted files. Sometimes when you do a restore you don't want old files back — for example spool files or log files or whatever — so if it's important for you to keep track of that, we have the so-called accurate feature. Then what we call the catalog is a database holding all the metadata: which file, with which checksum and which attributes, from which computer and which date, is on which backup medium. That's the main information you need if you want to restore a particular set of data, and it's usually MySQL or PostgreSQL. Those catalogs can get very big: if you have something like a billion files in your backup, the catalog can grow to a terabyte or so, and that's the reason we recommend PostgreSQL here; we've found it more performant and better tunable for large environments.

Then of course we support encryption in several ways. We can encrypt the backup data directly on the client with individual keys per client, we can use global keys, and we can encrypt on tape drives that support hardware encryption, which is of course the fastest way. Bareos takes care of the keys, stores them in the database, and when it comes to a restore it knows which tape needs which key and handles all the key management automatically. As I said, as backup media we support disk, tape libraries, and cloud storage. And, of particular interest, we have many scripting interfaces: the Python plugin interface is one, and an easier one is the pre- and post-job scripts that you can configure for your jobs, so that certain scripts run on the central Bareos server or, alternatively, on the client you're backing up. That's really flexible, and I'd guess most tasks can be handled with that kind of script.

Some new features we developed over time. One I've already named is the hardware encryption support on LTO drives. Another is client quota support, so you can limit the amount of data a certain client can send into your backup. Bandwidth limitation is particularly interesting if you have network constraints and don't want your backup to eat up all the bandwidth you have available. NDMP is interesting if you have large storage area networks using the NDMP protocol, like NetApp or EMC. We implemented replication to other backup sites, so you can have your backup in one data center, run a migration or copy job to another data center, and the Bareos server keeps track of where your data is available. The cloud storage support I already mentioned, and the Python plugin interface. And we have a multilingual, multi-tenant web UI supporting ACLs, access control lists, which means you can set it up as a self-service portal for your users. For example, if one person has a dedicated computer, you can grant that particular person access to the backup data for that computer; or if you're a larger organization with IT departments, you can grant a group of administrators access to all the computers belonging to their department. That works pretty well, and I think we now have around ten languages in the web UI, most of them coming from the community — Chinese, for example; I don't speak Chinese, nor do my colleagues, so someone contributed it, and that's how the base grows.
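Going back to the pre- and post-job scripts mentioned above: a minimal sketch of a pre-job script that dumps a database to a file which the subsequent file-level backup job then picks up. The database name and dump path are assumptions for illustration, not a Bareos default:

import subprocess
from datetime import date

# Hypothetical pre-job: dump a PostgreSQL database before the file backup runs.
DB_NAME = "appdb"
DUMP_PATH = f"/var/backups/{DB_NAME}-{date.today().isoformat()}.sql"

def run_pre_job() -> None:
    subprocess.run(["pg_dump", DB_NAME, "-f", DUMP_PATH], check=True)
    print(f"dumped {DB_NAME} to {DUMP_PATH}")

if __name__ == "__main__":
    run_pre_job()

Configured as a pre-job script, this is the "back up databases by making dumps" approach the talk describes for databases without a dedicated plugin.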
A few words about the current release and what we did there — which was a lot more work than we expected. We implemented transport encryption that is preconfigured and works out of the box by default. Before the 18.2 release, if you wanted transport encryption you had to create individual certificates for each client, configure everything, roll it out, and then make sure it really was encrypted, because there was no log message indicating whether encryption worked or not. If you think back to one of the first slides with the architecture diagram, you saw the connections from the client to the director; we authenticate those with passwords, with shared secrets, and at a certain point we thought: well, we already have these shared secrets, why not use them to enable TLS transport encryption by default? That's one of the major new things in the 18.2 release, and it took a lot of time to make sure it also works with all the older clients that don't support this kind of encryption, which is the reason the release came a bit later than announced. But now it works pretty well. We use a kind of header inspection — a pre-header inspection — which means that when a connection is established we see whether it's encrypted or not, and then we either start in the old way for an older client without encryption, or, with a newer client, with the automatic encryption. And of course the old way, with individual TLS certificates, is still supported, because it is somewhat more secure; but I personally don't know anybody who has actually configured it that way. Still, we didn't want to shut down that option, which remains a little more secure.

We also introduced PAM authentication, which in conjunction with TLS by default is a good way to use existing user authentication methods: you can use your Unix password, or connect it to a central directory for authentication, which is also interesting for large organizations.

And we did some internal work that makes life a lot easier for us. Who of you knows what autoconf is? Automake and so on — okay, some do. It's really a nightmare if you have to make changes to it. The old autoconf system, which was only used to prepare the source code to be compiled, alone contained 76k lines of autoconf script. We replaced it with the more modern CMake and reduced it to about 5k lines, which is still a lot, but better than 76k. We also have a lot of really old legacy C code, where the older developers used to do string operations with memcpy, which is of course a nightmare, and we've started to modernize those old constructs with C++ language features like strings. This is something users usually don't see but that takes a lot of development time; we think it's worth doing, because we realized that when you want to make a small change to a certain piece of code, it has impact in a lot of other places. So we started refactoring and modernizing the code, which will make it easier to implement new features in the future.

What we're currently working on for the next release is to continue that modernization. We're doing a lot of work on the storage daemon, the part responsible for writing data to disk, tape, or cloud.
One thing is that we started to implement SCSI drive reservation, which is especially interesting if you have a lot of drives and maybe separate Bareos servers that want to access the drives at the same time; they can use the SCSI reservation logic. Another thing: if you've already worked with Bareos, you may know that when you back up to disk you have to configure what we call virtual drives, because when we write to disk we more or less simulate a tape drive. So we have this concept of virtual drives, and if you want more parallel jobs writing to your disk space, you have to configure more virtual drives, which is of course kind of stupid, because on disk you don't have that limitation. We decided to make life easier for administrators: you'll be able to configure it so that Bareos handles those multiple virtual drives by itself, so you don't have to think about how many virtual drives you need; that will be completely automatic in the next release.

For the web UI we decided to switch to a new framework, Vue.js. That has a big advantage: it allows persistent connections between the server and the browser, so you can get push notifications — for example, something flashing when a backup job has finished successfully. It also has a more modern design and will hopefully look better. One thing we'll also add is the possibility to make configuration changes to Bareos through the web UI, which isn't possible today; until now the web UI is good for restoring and for monitoring, for seeing which jobs have finished and browsing the log files.

We also started to modernize the documentation. The old documentation was written in LaTeX, which is a great tool if you're writing mathematical papers, as I did during my university time, but it's not really suitable for technical documentation, especially not when it gets really long. So we decided to modernize that, and we're now using RST files with the same framework that Python, for example, uses for its own documentation. It's work in progress, but it's almost finished and you can already see the results at docs.bareos.org.

Another thing we're working on is building more unit tests. We've been using regression tests all along — more than a hundred of them — and we're also using Jenkins for continuous integration: when we build new packages for, say, a Debian distribution, we start a virtual machine running Debian, install the packages, run backup and restore jobs, and verify that everything works, and we do that for every distribution we support. Now we've started to add more unit tests: I told you we're modernizing the code and making it more modular, and that puts us in a position to do finer-grained tests using CTest. That's also a lot of work, and it will improve the quality of the code even more.

One word about installation packages: we use the Open Build Service, which originally comes from SUSE. That means that from just one source tree on GitHub we build the packages for all the supported Linux distributions, and we even build the Windows installer packages with cross-compile makefiles using the Open Build Service. Additionally, we also build on request.
Here's a picture of the web UI file browser. On the left you select the client you're interested in, then you can drill down and select single files or subdirectory trees or whatever, and restore them to the originating client or to any other client reachable by the Bareos server. You can even choose whether to restore to the original location in the file system or add a prefix like /tmp/restore, and then compare on the system which files you want back.

Now some things from the marketing or publicity side, the way people see us. I don't know whether you know Black Duck and Open Hub: they analyze open source projects, and they found us to be among the roughly two percent most active teams on GitHub. One thing that is particularly nice: the French government has an inter-ministerial working group that analyzes open source software and gives recommendations to the public administration about which open source software they can use, and they recommend Bareos; it's the only backup software on that list, which is quite nice for us. This is a chart showing weekly unique visitors to download.bareos.org; since we started, it has doubled roughly every nine months. Since we introduced mirrors there are no reliable numbers anymore, but we still see growing demand.

Some of our customers: we have several Max Planck Institutes, we have the Bavarian State Archives as a case study, and we have companies and organizations of all sizes — you can use Bareos in your home office and you can use it in big university and big company environments, even at national archives. We have one national archive in a European country with two petabytes of data, where we back up from a Gluster storage and write to a Ceph storage, because they wanted different technologies involved in their original archive and in their backup archive. We also have some finance institutes that are under supervision of the German Bundesbank, the federal bank, so they have to follow all the usual banking regulations.

We work together with partners. As said before, the software is completely open source, and we provide installation packages and repositories every year when we publish a new major version. In between we also do bug fix and maintenance releases; the source code of those bug fix releases is always available on GitHub, and people can compile it themselves if they want, or they can use our commercial subscription service, which is similar to Red Hat Enterprise Linux: customers pay to get verified binaries. We deliver support, consulting, and training services together with partners worldwide, but as you can see on the map, there are no partners yet in the western United States — which, to be honest, is one of the reasons we're here. We're looking forward to meeting companies and service providers that are interested in implementing our open source solution at their clients' sites, so if you're interested, just talk to me. Here are some links where you can find more information if you're looking for case studies or more technical papers. The last link is of particular importance: it's the link to the Open Source Backup Conference, and there's a full archive of the conference's ten-year history, with a lot of case studies, videos, and slides on a lot of topics around open source backup.
And a question about brick-level email backup, in particular Microsoft Exchange. We support VSS snapshots; the Windows client does that, so you get a consistent state of the files. But the rest is up to you. Whether it really works with single mails, I am not an Exchange expert, but I know that you have to restore on a file level. So if Exchange keeps more than one mail in one big file, a PST file say, then the smallest item you can restore is that file. I hope that answers the question. If you have other mail servers like Cyrus IMAP or Dovecot, they usually have one file per mail, and there you can restore single mails. Okay, next question here.

Ah, the question is whether using it at home with just one machine is practical. The answer is: with one machine it's not practical. I was thinking more about geek-like home users who have a little network. For example, me: I have my server running around the clock, then I have my wife's laptop, my own laptop, my children's Minecraft server and so on, and there I run my own Bareos server. But for one machine it's not practical. Okay, next question over there.

Okay, good question. It's about roaming devices like laptops that are not constantly connected. The answer is yes, that's one of the new features we implemented a few years ago; it was funded by a German ISP called Globalways. Okay, I have to dig a little bit more into the architecture here. Before we had this feature, the connection was always established from the Bareos server to the client: the server initiated the connection, talked to the client and said, hey client, you have to make a backup. That of course only works if the client is available at that time, and especially with roaming users like laptops you are often not in that situation. So they wanted the possibility to have the connection established by the client. That's what we call a client-initiated connection, and the Bareos server will then, when it gets the connection, send the information: hey, there are some pending jobs for you. So yes, the answer is it works this way. I think there was another question here, yeah.

Okay, the question was, for the home user scenario, how to back up the Bareos server itself. I must admit I'm not doing that, and that is of course sometimes a problem. What I do is keep the backup files on an external disk, so that's better than having them inside the server, but of course not perfect. In a perfect world you would have your own tape drive, make backups to tape, and bring the tape to your neighbor or a friend. Another option is to make replicas of your backup to some cloud storage that you trust, or rather, since trust should not be enough, you should encrypt the backup and then you can make copies of it to a cloud provider.
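As a rough illustration of that last point, encrypt first and then replicate, here is a small Python sketch. It is not a Bareos feature, just one way to copy already-written backup volumes off-site in encrypted form; the paths, the passphrase file and the volume directory are placeholders, and it assumes GnuPG is installed.

```python
#!/usr/bin/env python3
"""Sketch: encrypt finished disk volumes with GnuPG and copy them to an
off-site location (e.g. a mounted cloud bucket or remote share).
All paths below are placeholders; adjust them to your environment."""
import shutil
import subprocess
from pathlib import Path

VOLUME_DIR = Path("/var/lib/bareos/storage")   # where the disk volumes live (assumption)
OFFSITE_DIR = Path("/mnt/offsite-replica")     # mounted cloud/remote target (placeholder)
PASSPHRASE_FILE = "/root/.backup-passphrase"   # placeholder; keep it out of the backup!


def encrypt_and_copy(volume: Path) -> None:
    encrypted = volume.with_suffix(volume.suffix + ".gpg")
    # Symmetric encryption keeps key management simple for a home setup.
    subprocess.run(
        ["gpg", "--batch", "--yes", "--pinentry-mode", "loopback",
         "--symmetric", "--passphrase-file", PASSPHRASE_FILE,
         "--output", str(encrypted), str(volume)],
        check=True,
    )
    shutil.copy2(encrypted, OFFSITE_DIR / encrypted.name)
    encrypted.unlink()  # keep only the plain volume locally


if __name__ == "__main__":
    OFFSITE_DIR.mkdir(parents=True, exist_ok=True)
    for vol in VOLUME_DIR.iterdir():
        if vol.is_file():
            encrypt_and_copy(vol)
```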
Okay, next question. Okay, the question is what happens if the whole server crashes, a so-called bare metal restore. Yes, we have a solution for Linux. We work together with another open source project called ReaR, Relax-and-Recover; it's a Belgian project, and what they're doing is pretty interesting. It's a collection of scripts, and ReaR makes an emergency boot image: you run a script on your Linux server, and the script collects all the basic information, what kernel version, file system layout, logical volumes, DRBD, what other kernel modules are used, and then it creates a bootable ISO image containing all this information. If you boot from this ISO image, it will automatically recreate the basic system layout, including file systems, logical volumes, kernel and modules and so on. It also offers the possibility to plug in other backup solutions, so you specify in a configuration file that you want to have your Bareos client on that emergency rescue system, including the configuration to connect to the Bareos server, and when it boots up and has recreated the system layout, it will finally trigger a job restoring the data. So that is how it works with Bareos and ReaR, and ReaR also works together with a lot of other backup solutions; we are just one that plugs in there. For Windows we don't have a ready solution right now; we are working on it, but with a low priority at the moment. Okay, next question.

Ah, the question was about ReaR, where we store the ISO. You have several possibilities. You can just include it in your backup; that is what most of the customers do, and when it comes to restore, you first restore the ISO image to a separate server, burn it onto a CD, or make it available via network boot to the client you want to restore. But of course you have all the options: you can burn it directly onto a CD or put it somewhere else entirely. Okay, yeah, another question.

Yeah, sure. The question, or rather the remark, and I'm just repeating it for the video, was that some people say you don't have a backup until you have verified that the backup works. Yeah, that's basically true. We have what we call verify jobs; that's kind of an internal thing that checks whether the information we have in our catalog, in our database, is consistent with the data we have on tape or on the backup medium. But of course it's better to do real restores and then really check what you have. Some customers do it, some don't; we recommend doing it from time to time. I mentioned OSBConf before, the Open Source Backup Conference, and there was, I think two years ago, a very interesting talk about verifying database backups and what you can practically do with them. The approach was: you do a restore to verify that you have your data, and then you have a working copy of your existing data, and you can use it for analysis purposes, for read-only purposes, in a separate instance. So it was a collection of ideas of what you could practically do with these kinds of test restores. Okay, other questions? Yeah, okay, then that one.

Okay, how the VMware snapshot backups work. It's a Python plugin that makes use of the vCenter API, so that also implies that you have to have a license from the VMware side that enables you to use these APIs. These APIs offer the possibility to trigger a snapshot, that's the first thing, and the second thing is that they support a technology called CBT, changed block tracking. We start with a full backup of a snapshot, and then when we come back the next day to make an incremental backup, we pass the information of the timestamp we got, or, to be more precise, you get a unique identifier for the snapshot you created the day before, and we pass this identifier to the vCenter API and ask the vCenter API to give us only the changed blocks since that timestamp, or since that ID, and then we only write the changed blocks into the backup.
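The actual Bareos plugin is more involved, but as a rough sketch of the two API calls just described, take a snapshot and then ask vCenter for only the changed blocks, something like the following using the pyVmomi library. The vCenter host, credentials, VM name and disk key are placeholders, and CBT has to be enabled on the VM already.

```python
#!/usr/bin/env python3
"""Rough sketch of snapshot + changed block tracking (CBT) against the
vCenter API using pyVmomi. This is not the Bareos plugin itself; host,
credentials, VM name and disk key are placeholders."""
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim


def find_vm(content, name):
    # Walk all VMs in the inventory and return the one with the given name.
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    return next(vm for vm in view.view if vm.name == name)


ctx = ssl._create_unverified_context()          # lab setting only
si = SmartConnect(host="vcenter.example.org",   # placeholder vCenter
                  user="backup@vsphere.local", pwd="secret", sslContext=ctx)
try:
    vm = find_vm(si.RetrieveContent(), "my-vm")  # placeholder VM name

    # 1) Trigger a snapshot so we get a consistent point in time.
    WaitForTask(vm.CreateSnapshot_Task(name="backup-snap",
                                       description="backup",
                                       memory=False, quiesce=True))
    snap = vm.snapshot.currentSnapshot

    # 2) Ask for the changed areas of the first disk (deviceKey 2000 is a
    #    common default). changeId "*" returns all allocated blocks; on the
    #    next run you would pass the changeId remembered from this snapshot
    #    to get only the blocks changed since then.
    areas = vm.QueryChangedDiskAreas(snapshot=snap, deviceKey=2000,
                                     startOffset=0, changeId="*")
    for area in areas.changedArea:
        print(f"read {area.length} bytes at offset {area.start}")
finally:
    Disconnect(si)
```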
Okay, I think one more question, yes. Yes, in fact, if you do a default installation of a Bareos server, it will directly install the client on the server as well, and there are some jobs preconfigured. One is just a kind of self-test: it backs up one directory, so it proves that it works. And the other one makes a daily backup of your catalog database. So yes, that works, you can have a single server-and-client environment. Whether it makes sense, I don't know, but you could do it. Yeah, it's doable, and there is one scenario where it really makes sense: if you back up to, let's say, an S3 or a Ceph storage that can handle multiple connections at the same time. It's one of the big advantages of a cluster storage like Ceph that it can handle like 20 or 100 parallel connections. But if you do it the legacy Bareos way, you would have one of what we call a storage daemon, that's the server part talking to the backup storage, and in that scenario you would funnel all the parallel jobs through this one server, which would of course be a bottleneck. One workaround is to install a storage daemon on each of your clients, so that they can write in parallel to the cloud storage. It's of course not an optimal way, but that's one of the scenarios where it could make sense to have a server component on each client. But the short answer is: doable. Okay, good.

Ah, yeah, okay. Yes, you don't even need another Bareos server. We have some command line tools. One is called bextract, and it will just extract everything it finds on the backup medium, whether it's a tape or a disk image, and then you get everything that's inside there. Another useful tool is called bscan, which will reconstruct the catalog. So in the really worst case, where you have lost your Bareos server and your database backup but you still have your backup medium, you can have it scanned piece by piece by the bscan utility, and that will fill up your catalog again with all the data it finds on the backup medium. Okay, other questions?

Yeah, then I think, yeah, we are like nine minutes ahead of time, which I think is a good thing. Okay, one more question: the name of the bare metal restore solution. Yeah, ReaR, Relax-and-Recover, R-E-A-R. Okay, another question? Yeah, then I hope you have some plans for tonight. I don't know if they still have beer over in the other part, but I'm pretty sure there are things to do here in Pasadena. I'm looking forward to going to a brewery with some friends and having a nice evening, so I wish you the same, and yeah, I would like to see you again. Thanks for listening. Bye.