Check, one two three. Check, check, one two.

You know, I can't drink caffeinated beverages after what happened yesterday. Okay. I feel like I'm hitting it; I feel like when I move, the mic bumps into my face and makes a lot of noise. All right, I'll try not to mess up this arrangement. It's got to be the hair. I'm okay using a handheld as well, I think this will work. No, I just don't want the mic to make noise when I move around and it bumps into my face. I'll just not move around too much. The note on the podium says not to move around too much anyway, so okay. "Are you going to astroturf this presentation with fake questions?" I don't know how to juggle, so these people have nothing to worry about.

Okay, it's about time, so we should get started. Welcome to another great day at SCaLE. I hope you're all enjoying yourselves so far. This is the first talk of the first Saturday, but before we get started, a quick word from the sponsor of this room. We want to say thank you to our sponsor, Buildkite, for supporting the systems and infrastructure track. Buildkite is a CI/CD platform that enables teams to build and deliver software at scale. Not this SCaLE, but... And so I'd like to introduce our speaker, Anita Zhang, who is about to kill it. Give her a big hand and make her feel welcome.

Hello, everyone. My name is Anita Zhang. I'm a software engineering manager at Meta. I support the Linux Userspace team, which covers systemd and basically anything in the Linux ecosystem at this point. Yeah, I know, dangerous. Today I'm going to be talking about how I debugged a memory regression in systemd-journald. This is something we found when upgrading our fleet from 247 to 248. I'm also an active committer to systemd, so if you'd like to ask questions about systemd after this talk, feel free to do so; this talk is not about systemd in general, that's the one at six o'clock.

Okay, here's a rough agenda of what I'm going to be talking about. I'm going to set the stage and talk about how we discovered the issue. Then I'm going to go into how I investigated and tried to track down the root cause. Because the root cause was pretty unintuitive,
I'm going to spend some time explaining in detail what happened, and then close it off with some hindsight thoughts and takeaways.

So, setting the stage: this is what our production fleet looked like at the time we ran into the issue. Our entire production fleet runs CentOS Stream 8, and I was also able to reproduce the issue on Fedora 34. It occurred while upgrading systemd from 247 to 248, so quite a few versions ago, because the latest upstream version is now 251. Our systemd build comes from the CentOS Hyperscale SIG; my teammate David will be talking about the Hyperscale SIG tomorrow, so you can go listen to that if you want to hear more about how we build packages for hyperscalers. The idea behind this build is that we base it off the Fedora Rawhide spec, in this case 248.2. I also mention here that we used kernel 5.6 at the time; that's only relevant if you want to run some of the BPF tools I'm going to talk about.

So here's what we found. I got a notice one day from the fleet efficiency team that systemd-journald's average memory usage had grown since May 24th, from 17 megabytes to over 50 megabytes on average. That's pretty non-trivial for our production fleet, because we're running millions of servers and we want to eke out all of the memory utilization that we can, so they wanted to figure out what was going on. At the time I was rolling out systemd 248, and I was pretty close to the end of the rollout, about 90% of the fleet, so I thought it was pretty strange that I hadn't seen this issue earlier and that it didn't line up with the rollout I was doing. The way this rollout worked, I actually had to roll back a few times, so I would have expected to see up-and-down spikes in the memory usage if it was correlated with the rollout. In this case I didn't see that. So what was going on?

Trust, but verify; the graphs don't lie. Probably. So I hopped on one of the hosts that was exhibiting the issue and ran pmap to look at the memory mappings for systemd-journald. What I noticed was that the anonymous memory was over 300 megabytes for systemd-journald, which is pretty crazy, because normally it's more like 20 megabytes. Anonymous memory is important here because it's memory that's not backed by a file; it's the memory that's actually being used for the stack and the heap, so it's actually resident in RAM. So was it actually the rollout? This graph shows some of the anonymous memory growth in systemd-journald. The axes are pretty small, but it was a very slow growth, about a hundred kilobytes per hour.

So we figured: if this memory growth is not showing up in 247, then it must be something in 248. But why didn't we see it as part of the rollout? Normally when we upgrade systemd it's a live upgrade, and the daemons restart into the new binaries as part of the upgrade. We noticed that for systemd-journald this wasn't actually happening. From 248 onward this is fixed, and systemd-journald will restart into the correct daemon as part of the upgrade.
So for 247 we just manually upgraded the systemd-journald daemons, restarted them into the correct version, and that's when we were able to confirm that it was indeed the rollout and that something in 248 was responsible.

Okay, now that we know the issue was the rollout, we have to figure out what changed. The first thing I do, because I'm very in tune with what's going on in systemd, is go straight to the git logs and look at what changed in the journal. There weren't actually that many changes to the journal code itself between 247 and 248, so I had about five commits to look through. Looking at the code for each one, I didn't see anything in particular that stood out. Maybe some of the mmap-cache stuff was using more memory, but nothing obvious at this point.

So I was thinking, maybe it's a memory leak. If you have all this growth in your application, it's usually some kind of memory leak, right? So I ran Valgrind, as people do, against journald, and noticed that pretty much everything is freed. There are these 8,000 bytes that are still reachable, but that's unrelated; it's used in every version for the mempool cache. For the most part everything is freed and deallocated as expected. So what's going on?

I'm pretty much a generalist in this space, so I had no idea what to do when debugging a memory regression in something like systemd-journald. I work in other parts of systemd, but the journal was kind of a mystery to me at this point. So I decided to just take those few journal commits from the previous two slides, do a new systemd build based on each of those commits, deploy them on hosts, and monitor for the memory regression. Theoretically, if we wait a few days, eventually we'd see on our charts that the memory is growing on the bad commit, right?

While that was going on, I tried to figure out whether I could track this down faster with some tools. My favorite tool of choice is usually to jump straight to strace. Unfortunately, strace is not very good for debugging a memory regression, because a lot of the memory is allocated early on by libc when the application starts. You'll see brk or sbrk calls pretty early on, but the memory isn't actually allocated and mapped to physical memory until it's faulted.
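To illustrate that point, here is a minimal C sketch (nothing to do with systemd, purely a demonstration): the malloc() shows up as a brk() or mmap() call in strace right away, but the resident set, the anonymous memory that pmap was reporting, only grows once the pages are actually written.

```c
/* demand_paging.c - address space vs. resident memory, illustration only.
 * Build: gcc -O0 demand_paging.c -o demand_paging
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void print_rss(const char *label)
{
    /* VmRSS in /proc/self/status is the resident set size. */
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");

    if (!f)
        return;
    while (fgets(line, sizeof(line), f))
        if (strncmp(line, "VmRSS:", 6) == 0)
            printf("%-28s %s", label, line);
    fclose(f);
}

int main(void)
{
    size_t size = 64 * 1024 * 1024;   /* 64 MiB */

    print_rss("before malloc:");

    /* This shows up as a brk()/mmap() syscall immediately... */
    char *buf = malloc(size);
    if (!buf)
        return 1;
    print_rss("after malloc (untouched):");

    /* ...but physical pages are only faulted in once we write to them. */
    memset(buf, 0xAA, size);
    print_rss("after touching every page:");

    free(buf);
    return 0;
}
```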
So those allocations would not show up in strace if you were trying to use it to watch memory going into the kernel. What I decided to use instead was eBPF. eBPF lets you extend the kernel's capabilities without changing kernel source or loading modules, and it has become a great feature for tracing, profiling, and monitoring. In this case it would let me live-trace what's going on in journald, hopefully get some stack traces, and figure out what's happening there. And of course Meta is a founding member of the eBPF Foundation, so we have a lot of maintainers and others who could help if I had issues.

Like I mentioned before, I'm a generalist, so I didn't really know where I should go or which BPF features I should be using. So I did some googling, as people do, and I found Brendan Gregg's website for debugging memory leaks and memory growth, and that's what led me to try the BCC tool suite, which has a lot of pre-built applications for tracing and debugging memory leaks and the like.

The first tool I tried was stackcount. The idea here is that if I could count the calls to malloc, maybe I could see that journald was doing more malloc calls than it was before, and that could explain the memory growth. On the right you can see I call stackcount with -U for user stacks and malloc, to count the calls to malloc, and I give it the PID of systemd-journald. What I got was counts of the stack traces that lead to malloc calls. Unfortunately, when I compared 247 and 248, the number of allocations was pretty much the same. So that was a theory that didn't work out.

I was still pretty hung up on the memory leak idea, even though Valgrind said there were no memory leaks. BCC has a tool called memleak that you can run, and it tracks the allocations and deallocations happening in your program. For example, if something has been malloc'd for a really long time and never freed during the lifetime of the application, that kind of thing shows up in the stack traces memleak returns. On the right I have the top 10 allocations happening in systemd-journald; I looked for things that were not being freed within 10 minutes, because the way our configuration is set up, we should be rotating out the journal files quite frequently. Unfortunately, this was also a dead end, because the allocations were pretty much the same between 247 and 248.

All right, let's take a step back and look at what we know so far. A lot happens in debugging; you have a lot of theories and some of them don't work out. What we know so far is that there are pretty much no leaks. I've used multiple tools at this point: no extra leaks, no extra allocations. So where do I go from here? One thing I did learn from the BCC tools is the allocation path: from the journal side it starts in a client-context function and ends up in an allocation function that's actually in systemd's core libraries. For example, read_full_virtual_file is one of the key allocation functions that journald ends up in when it calls malloc and realloc. Using this information, I decided to change my bisect strategy and start looking at the allocation functions instead. So we go to git log once again.
This time I looked at the fileio functions, because that's where read_full_virtual_file lives. There were actually quite a few more commits there; I've omitted a lot of them, this is just an example. I did the same thing: systemd builds on each of those commits, deployed to machines in our fleet, and then just watched for the memory regression. At this point bisection prevails, and it takes us to the root-cause commit. The funny, or painful, part of this commit is that it was actually the last commit right before 248 was tagged. So if I had done a whole bisect across all the commits, starting from the middle, it would have taken quite a while.

I'll explain in more detail what the issue was, but the summary is this: normally, if we were reading something from proc, we'd start with a 4,000-byte buffer and increase it if we needed more. The function was changed to start from a four-megabyte buffer instead and reallocate down to a smaller buffer once we'd read from proc and didn't need the larger buffer anymore. If you just think about it intuitively, that seems fine, right? You take a larger buffer, you bring it down, you return that memory back to libc. So what's the issue?

All right, let's get into the details of what happens. Let's pretend this rectangle is our heap, and I'm going to talk about what some of the allocation patterns look like in systemd-journald 247. In 247 we have a bunch of malloc calls: journald reads a bunch of data from procfs, things like the command line, and those get allocated on the heap just to store the command line and so on. If for some reason the data is larger than the initial size of the buffer we set up, we call realloc and double the size of the buffer. If the buffer doesn't fit where it is, libc will just find a new space in the heap, we allocate the new space, and the old one gets freed. If we need to allocate even more space, because the data we're reading exceeded that buffer, we double it once again; same thing, keep calling realloc, and we actually double it up to three times. Most reads will not take up all 32,000 bytes, but this is something that could happen in the code.

The initial commit was trying to fix an issue where the maximum we actually want to allocate is not 32,000 bytes; we want to cap out at four megabytes. It also tried to optimize things so we wouldn't have to reallocate upward three times. The thought was that we could just allocate everything at once and then deallocate down. Logically it makes sense: we save a few calls to realloc, we get the correct maximum we're aiming for, and everything's good, right?

So here's what happens in systemd-journald 248, with the blamed commit. We have the large four-megabyte buffer. Most of the time the stuff we're reading from procfs is actually quite small, so there's a 4,000-byte, page-aligned realloc down, and we have a small buffer again. That looks good, right?
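To make the before-and-after concrete, here is a simplified sketch of the two strategies in C. This is not the actual systemd code (the real function is the read_full_virtual_file() she mentions in the fileio code); the sizes, the file, and the fread() loop are just stand-ins for illustration.

```c
/* Illustration only -- not the systemd implementation. */
#include <stdio.h>
#include <stdlib.h>

#define START_SIZE (4u * 1024)          /* initial guess */
#define MAX_SIZE   (4u * 1024 * 1024)   /* hard cap */

/* Roughly the 247 behavior: start small, double the buffer on demand. */
static char *read_grow(FILE *f, size_t *len)
{
    size_t size = START_SIZE, used = 0;
    char *buf = malloc(size);

    if (!buf)
        return NULL;

    for (;;) {
        used += fread(buf + used, 1, size - used, f);
        if (used < size || size >= MAX_SIZE)
            break;                      /* short read, or we hit the cap */

        size *= 2;                      /* grow and keep reading */
        char *n = realloc(buf, size);
        if (!n) {
            free(buf);
            return NULL;
        }
        buf = n;
    }
    *len = used;
    return buf;
}

/* Roughly the 248 change: allocate the maximum up front, then shrink
 * the block down to what was actually read. */
static char *read_shrink(FILE *f, size_t *len)
{
    char *buf = malloc(MAX_SIZE);

    if (!buf)
        return NULL;

    *len = fread(buf, 1, MAX_SIZE, f);

    /* glibc usually shrinks the block in place instead of moving it,
     * which is exactly what sets up the fragmentation described next. */
    char *n = realloc(buf, *len > 0 ? *len : 1);
    return n ? n : buf;
}

int main(void)
{
    size_t len = 0;
    FILE *f = fopen("/proc/self/status", "r");

    if (!f)
        return 1;

    char *a = read_grow(f, &len);
    if (a)
        printf("grow strategy read %zu bytes\n", len);

    rewind(f);

    char *b = read_shrink(f, &len);
    if (b)
        printf("shrink strategy read %zu bytes\n", len);

    free(a);
    free(b);
    fclose(f);
    return 0;
}
```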
Now let's pretend we have something like this on the heap: lots of little buffers scattered throughout, because as the application runs we allocate and free memory and create all these free chunks in the heap. Unfortunately, when you're looking for an area in the heap to store this huge buffer, it can't fit between those gaps anymore, so instead we have to go toward the end of the heap and malloc this big buffer there. Now we have this large space being taken up at the end of the heap. Then, like I said, we read the data from procfs, we don't need a buffer that big, and it gets realloc'd down. The problem is that if we keep doing this multiple times, the way realloc works is that it tries not to copy and move the buffer to a more optimal location in the heap; it just keeps shrinking it down from where it's already allocated. So it starts to create a lot of fragmentation in the heap. If we then free one of the chunks in front of it, we create even more gaps, and they won't be able to store the large buffer anymore, so we can't use the memory optimally.

Another point I should bring up here: even though we have all these free areas in the heap, to the kernel that memory is still mapped in the application. That's why it says the application is using all this memory. The memory in between those allocated chunks is still being held by libc in your application; it's saving it for later use, and it's still mapped. So the kernel says your application is using 300 megabytes when you could realistically just be using 20.

The fix I ended up merging was to partially revert some of the initial blamed commit. We go back to retrying up to three times to increase the size of the buffer, but I also addressed the original problem, which was that they wanted to cap out at four megabytes instead of 32,000 bytes; that's integrated into the fix as well. And as far as I know, Meta was the first to detect and fix this issue. It's kind of a benefit to the community that we run systemd upstream, close to the latest tag, across all of our machines: things that only show up at scale, we're able to fix pretty quickly.

Okay, closing it off with things that I learned in hindsight. I wanted to know how I could have detected this issue faster, right? If I see this issue again, it should not take weeks or days to bisect the issue and track it down. One thing I should have noticed: if I had instead looked at the rate of change of anonymous memory between systemd-journald 247 and 248 (sorry, I mislabeled the slide, it should say journald), I would see that in 247 there was basically no rate of change in the anonymous memory, but in 248 I'd be able to see these spikes of four-megabyte allocations. If I had looked at this graph initially and thought about it more clearly, I probably would have gone in the right direction sooner.

Another thing I want to bring up, because it's kind of cool, is mtrace. It's part of glibc; there's a function and a command-line tool. You insert the call at the start of your program and it will track allocations and deallocations. In my case, I think this was pretty similar to using BCC's memleak.
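For reference, using the glibc hook looks roughly like this; it's a minimal sketch, and the log path is just an example.

```c
/* mtrace_demo.c - minimal sketch of glibc's mtrace() hook.
 * Run with:    MALLOC_TRACE=/tmp/mtrace.log ./mtrace_demo
 * Then inspect the log with the mtrace command-line tool:
 *              mtrace ./mtrace_demo /tmp/mtrace.log
 */
#include <mcheck.h>
#include <stdlib.h>

int main(void)
{
    mtrace();                    /* start logging allocations and frees  */

    char *ok   = malloc(128);    /* allocated and freed: not reported    */
    char *leak = malloc(4096);   /* never freed: shows up in the report  */
    (void)leak;

    free(ok);

    muntrace();                  /* stop logging */
    return 0;
}
```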
So, if you have the choice and you don't want to modify code, I would probably use memleak, but I thought it was cool to call out that there's also a tool in glibc that lets you do this programmatically in your code.

Okay, another one that I thought was really cool, and would have been super helpful for this investigation, is Massif. It comes as part of Valgrind and tracks the peak allocations happening in the heap of your application. On the left you see that the peak memory allocation for systemd-journald 247 was around 100 kilobytes, but on the right, for systemd-journald 248, you see that the peak is four megabytes. I've omitted the stack traces here; Massif also shows you stack traces, and that would have been super helpful for tracking down where these allocations were happening and why this peak was happening. If I had known that, I could have nailed down the blamed commit a lot sooner.

Because I am a software engineer by trade, I wanted to know: was there another way I could have done this, another fix that would have been equally optimal for the heap? One idea was, instead of using realloc, which tries to keep the start of the buffer where it is, what if we just free and call malloc again, and copy the data to another part of the heap? In the graph on the right I show three different builds of systemd-journald. The first one, going all the way up, is systemd-journald 248 with the memory regression. In the middle, closer to the bottom, is the fix that I actually upstreamed, which brings us back to baseline systemd-journald 247 levels. At the bottom is the hypothetical fix I'm listing here on the slide: what if we instead just malloc, memcpy, and free the buffer? It is slightly more optimal. I didn't end up going with it because it would have been pretty inefficient to keep calling memcpy to copy data across the heap. You have to balance memory optimization against performance, so I think the version we ended up upstreaming was just fine.
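Here is a tiny sketch of that trade-off. It is only an illustration: whether realloc() really leaves the block where it is depends on the allocator and on the allocation size (glibc usually shrinks ordinary heap chunks in place), and the sizes here are deliberately small so the demo stays in the ordinary heap; the journald case involved a four-megabyte buffer.

```c
/* shrink_vs_copy.c - two ways to shrink a buffer, illustration only. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    size_t big = 64u * 1024, small = 4096;   /* stand-in sizes */

    /* Option 1: realloc down. Cheap, but the block tends to stay where
     * it is, which can strand a large gap in a fragmented heap. */
    char *a = malloc(big);
    if (!a)
        return 1;
    uintptr_t before = (uintptr_t)a;
    char *a2 = realloc(a, small);
    printf("realloc kept the address: %s\n",
           (uintptr_t)a2 == before ? "yes" : "no");

    /* Option 2: malloc a right-sized block, memcpy, free the big one.
     * Lets the allocator place the data anywhere, at the cost of a copy. */
    char *b = malloc(big);
    char *c = malloc(small);
    if (!b || !c)
        return 1;
    memset(b, 0x42, small);                  /* pretend this is the data */
    memcpy(c, b, small);
    free(b);

    free(a2);
    free(c);
    return 0;
}
```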
Okay, takeaways. Definitely invest in observability, monitoring, tracing, all that good stuff. I'm really fortunate that Meta has a lot of teams that make this super easy: I could just throw these builds on hosts and I would automatically get these graphs, I didn't have to set up anything.

Useful stack traces. One thing I didn't mention is that a lot of these BPF tools require frame pointers, and most builds that come straight from your distribution don't include packages built with frame pointers. So if you want the really nice stack traces I had in the BCC tools I ran, you should make sure your application is built with frame pointers; otherwise it's really hard to debug anything without stack traces.

And always be learning. I love debugging stuff because it allows me to pick up new skills and learn about new tools. There are all these things in Valgrind beyond the memory leak checkers that I didn't know about, that I now do, and I can keep them in mind if I have another issue like this in the future. And of course I want to keep getting better at using eBPF; there are lots of workshops here. Tracing is just super valuable nowadays for figuring out how to debug and track down issues.

All right, that's all for my talk. Thank you, everyone, for coming, and now I'll take questions. Yes, sir?

So this gentleman mentioned that for embedded systems we really want to decrease memory fragmentation and gaps, and asked whether there's anything in Linux that does this. Honestly, I'm not quite sure. When I was doing this investigation I learned that there are actually lots of knobs you can toggle in glibc malloc to change how the allocations work, like whether you want to cache more free buffers, things like that (a couple of those knobs are sketched at the end of this Q&A). I'm not sure if there's a tool specifically that can analyze gaps for you; there may well be something like that. Yeah, Neal's right: he mentioned that there's probably something in Valgrind that would do this, and the KDE folks especially have been looking into memory efficiency and things like that, so try and see if Valgrind has something.

Yes, over there. Yeah, this gentleman brings up a good point: we could also look at using madvise calls in the future to deal with what's happening with these memory allocations. Any other questions or comments?

Oh, for sure. So, this gentleman mentioned that we rolled out 248 to the entire fleet, and maybe we should have started in an RC tier. I didn't go into the details of how the rollout phases work, but we always start in the RC tier. One of the downsides of the RC tier is that we don't run all of our production workloads in it, so we don't get all of the signal we need. Sometimes it's at 50% of the rollout that we start to see a lot of the signal on these things.

Yes? Oh, RC stands for release candidate. In our case we have a bunch of machines where we run our initial binaries first, such as the kernel, systemd, and a lot of the widely distributed binaries. You can think of it as kind of a test tier, but some of them do run production workloads, and that's where we try to see if anything blows up immediately.

Yes? I think the comms team doesn't want me to say anything about that, and I'm pretty far removed from the dollar aspect of the machines. I think legally we're allowed to say millions of machines. But this is the entire fleet: we roll out the latest systemd to all millions of them, and it runs our production bare-metal fleet.

Another audience member here just mentioned that we can use alternative memory allocators like jemalloc, especially on embedded systems, and that helps with memory fragmentation and such. Yeah. Actually, when I started this investigation I was like, no way, this can't be systemd, because upstream we run multiple CIs with things like ASan, and if we saw a memory leak we should be able to detect it immediately.

I think for us, the Linux Userspace team, the kernel teams, the container teams, we're pretty well in tune; we communicate pretty frequently amongst ourselves. So if we notice an issue in one system, we work together, communicate, and figure out: is it this rollout, is it this other rollout? That helps narrow down the scope a lot, being able to communicate effectively between teams. And having a very generalist skill set, like the one I used in this case, meant I was able to pinpoint which system had the issue. Being able to work across a breadth of systems, sometimes just to narrow down "it's not my application, it's your application," and having some of the data to back that up, really helps.
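For anyone curious about the knobs mentioned in that first answer, here is a small sketch of a couple of the glibc ones, the mallopt() tunables and malloc_trim(). The values are arbitrary examples rather than recommendations, and glibc also exposes similar settings through environment variables and tunables.

```c
/* malloc_knobs.c - a few glibc malloc knobs, illustration only. */
#include <malloc.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* Trim the top of the heap back to the kernel whenever more than
     * 64 KiB is sitting free there. */
    mallopt(M_TRIM_THRESHOLD, 64 * 1024);

    /* Serve allocations above 256 KiB with their own mmap() instead of
     * the main heap, so freeing them unmaps the memory immediately. */
    mallopt(M_MMAP_THRESHOLD, 256 * 1024);

    /* ...the application's normal allocation-heavy work goes here... */
    char *buf = malloc(1024 * 1024);
    if (!buf)
        return 1;
    memset(buf, 0, 1024 * 1024);
    free(buf);

    /* Explicitly hand back whatever free heap memory can be released. */
    malloc_trim(0);
    return 0;
}
```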
That's a good point. I didn't address the issue you mentioned, where we could be upgrading tons of packages at once, like when we jump between distribution versions and have a lot of package upgrades happening. For systemd especially it was a lot easier, because we try to stick very closely with upstream, and the fewer divergences you have between versions, the quicker you're able to detect the issue, which is why I like running the latest systemd. It covers such a big space that I want to be able to detect issues faster; I wouldn't want to be running 239 and then trying to debug an issue that we only see when upgrading to 251. For the big distribution upgrades, for sure, we do those a lot slower; we detect issues as we roll out much more slowly across the fleet. But it's kind of the same thing: we just keep communicating between teams, and we do a lot of communication with upstream to resolve issues as well.

I mean, I'm happy to do it as long as the Linux Userspace team is still here.

[Audience comment:] If you're running an older stable version, it might be more stable in some sense, but when you do hit issues you can't really get them fixed upstream easily, so you end up carrying a lot more yourself. At my workplace we have personal experience dealing with this: we run a very popular distribution everywhere, and we've gotten to the point where we maintain backports and overrides for somewhere between 8 and 12 percent of the main package set, because the vendor itself can't fix it, there's not a whole lot they're willing to commit to, and nobody else really wants to work on it because it's so far behind. It's a long-term problem.

Anyone else have questions about this talk, or systemd in general? Okay. Well, thank you, everyone, for coming.

Hello. That's very loud for a very small room, so I'll try not to be too loud. It's very hard for me; I mostly want to be very loud. So hi, I'm Michelle, my pronouns are she/her, and I work at Netflix as a senior software engineer. So today I'm really excited to talk to you about Netflix Workstations. Throughout this presentation you'll see some before-and-after shots of work that was actually done on the workstations: some close-up beauty shots, some de-aging for some of our older celebrities, and some large environmental destruction, things like explosions and dust, some really cool VFX work. I'll be happy to answer any questions at the end as time permits, or if you find me throughout the day in the hallway, feel free to stop me. If nothing else, you can find me on LinkedIn, but definitely put something in the message; otherwise I'm going to think you're a robot and not respond. I'm like, yeah, you're not tricking me.

One question I will answer right now is that I have no power to renew or greenlight your favorite show. If I did, there would be a show called Cake with Michelle, my number one favorite show, where I just eat cake and tell you how great it is. It would be such a good show. It would be really boring for everyone else, but it would be really great for me. But what I can do is help artists and creators make the Netflix originals we all know and love. My team, Production Infrastructure Engineering, or PIE, is responsible for helping to make sure they can work every day, all day, and not worry about the infrastructure powering their workstations.

So here's our agenda for today. We're going to learn about what technology we used, what was the problem that we solved,
who are the users, how did we actually make Netflix Workstations, what did we learn, and what comes next in Netflix Workstations 2.0.

Before I talk about the problem and why we did this, I wanted to give a quick teaser trailer of the technology we used. If you're unfamiliar with any of it, don't worry; I'll be defining all these services as I go. And if you're a super expert on any of these, please stick around and answer my questions; it would be really helpful for my next sprint. At the top is Go and a healthy cup of Java, Salt for configuration, and Spinnaker for our fleet management. Images are made with Packer, and we use a lot of AWS services, but I wanted to highlight EC2 for the actual instances, DynamoDB for the data, and SQS to support our event-driven microservices architecture. Those EC2 instances come with NICE DCV, which is one of the ways we actually remote in and display to the user, the other one being Teradici. And since we're all about open source here, I wanted to point out some of the open-source software: Packer, Salt, and Spinnaker, which are all really great. If you have any questions about them specifically that you don't get answered today, I'm happy to explain how to get started with them.

So what was the problem? Historically, artists had machines built for them at their desk and only had access to the data and applications when they were in the office. This allowed for fully on-prem solutions that stopped at the door. There are a few reasons why that's no longer feasible. Some of them you may be well aware of: the vast shift to work from home in 2020. We had actually already started the project in late 2019, as if we were psychics, but it really accelerated our timeline to get people working as fast as possible. While people are beginning to come back to the office in some regions, the work landscape has obviously irrevocably changed.

The second reason, and our initial project mandate when we started in 2019, is that Netflix wants to make content at an unprecedented rate. I've worked in entertainment almost my entire career, which is about 12 years now, and I've never worked on this many entertainment projects at the same time. So we need technology that works at this scale and lets us do it more efficiently. There are simply not enough artists in a single geographic location to support the Netflix scale of content, so productions need to meet artists where they are instead of trying to bring everyone to one giant imaginary office. And finally, by taking down traditional geographic barriers to entry, instead of making everyone move to LA, as wonderful as it is and as much as we love it here, the industry and Netflix become more accessible to all types of people. Losing access to great talent because they live a hundred miles away from the nearest office, or have mobility restrictions, or just want to work from home, is not something we want to do.

So "why not ship a computer?" is the question
I'm always asked first. The North Star for Netflix Workstations was to provide the infrastructure for artists to get a one-click experience: you go from sitting down to working on a shot. But that can also be accomplished by building a computer and literally just sending it to them. So why do this instead? There are two answers, among many others: flexibility and security.

Production needs change fast and furious: new cameras and new software make it an ever-evolving space. By creating cloud-based workstations, anything from new features to security patches can be rolled out quickly. A user can switch from one operating system to another by opening a new window instead of a new computer. Instead of waiting for a package in the mail, they can get a more powerful GPU or larger storage with a configuration change. We get requests like this all the time: "oh, I've run out of storage, give me some more." Okay, I've attached more to your workstation, go to town, instead of waiting for something to come in the mail.

Workstations, like many cloud-based applications, are not designed to be permanent; artists are renting a workstation instead of buying it. The downside to flexibility, of course, is accounting for the other features that come with a single desktop, things like persistent user settings. When working on a desktop, user and software settings are just sitting on the hard drive wherever the software puts them. With cloud-based solutions we had to be more methodical: save crucial settings to persistent storage, associate that storage with a user, and attach it to all future workstations. So you have to think through the downsides a little more when you do something that's more flexible.

For security: asset security is critical inside studios. Files have traditionally been passed around in many different ways, from FTP to email to literally someone getting on a plane with a hard drive and flying across the ocean. Every time a file gets moved, it creates an access point for interception. By having all the assets on cloud storage along with a remote workstation, files do not have to pass through any intermediate service. No one is downloading those files to their home computer; it's all in the cloud. It also allows for granular control of a file's lifecycle and access, so you can grant access for the short time someone is working on it and then take it back, instead of saying "please delete this from your computer, pretty pretty please." The files never have to leave the original storage.

So when designing Netflix Workstations, we balanced the needs of three significant personas: artists, technical directors, and pipeline engineers. You might not have heard those titles before, so I'm going to go into detail; they are very specific animation and VFX job titles. And then there are also the many supporting teams, such as storage and networking and security, all the people that make these go.

So let's dig a little deeper into these personas. Artists want to focus on their art and not on technology. They are the main end users of a workstation. They want to get to work immediately and not wait for something to boot up or figure out what files they need to download or anything.
They just want to do their job. Artist workflows can vary wildly depending on the step of the production process they're on, but their basic needs are the same. You can think of building a movie like a factory, moving one item from one team to another: the shots of the film move from team to team, and each one adds something to get to that final image we all enjoy. They have source files, which include anything from original film plates they just recorded, say a celebrity jumping away from an imaginary explosion, to reference images, to something much more technical like a LiDAR scan to get a good idea of the 3D environment that was filmed in. The artists then use digital content creation, or DCC, apps to create new images that could be final drawings or the blueprint for a render. These new files then have to be shared for review. All these steps would ideally be completed directly on the Netflix workstation, which includes other tools and plugins to manage the whole pipeline from one team to the next: task management, getting reviews, getting things approved, final delivery, and all the many, many steps in the movie-making factory.

Technical directors have both artistic and technical expertise. Instead of working directly on shots, they work daily to improve the artist experience: they want to get the artists working faster and easier and not worrying about these things. They need to be able to create scripts and plugins for the artists and also debug their problems, so they do half an artist's job and half a developer's job and troubleshoot any issues. They need a lot of observability tools, a lot of access, things like that. They decide on the software and ensure the artists are on the correct configuration for their task: "this is the version of this software we're using for this project because this plugin works here and this will be really great for this," that sort of thing, really in the details of everything.

Pipeline engineers focus exclusively on creating pipelines, or workflows, so they handle that whole flow. Their portfolio can include things like file orchestration, license management, or anything else needed for the artist to complete their work. They glue the pieces of the pipeline together, with Netflix Workstations kind of in the middle. They make sure the artists can work as efficiently and smoothly as possible, and also that the technical directors have everything they need to actually develop on the workstations, not just make the cool images.

My team started with a strong emphasis on building a complete white-glove product for the artists to use. We thought we'd build one thing and everyone would just use it and it would be great. That's never true; I don't know why we thought that. The plan was always to move toward a more self-service model, which is good, because we got things into people's hands, we got to see how they actually worked, and we got that iteration going. I'll talk a lot more about iteration as we go on. So the first step was adding features to allow the technical directors to start to customize the experience, and the next step is moving toward a platform where the pipeline engineers control and manage their own fleets, right?
They're managing everything about it, and we just make the tools available to them so they can provide a very curated artist experience for all the teams and really customize what they see.

So how do we make them? Now we're going to start to get into the technology. Starting a brand-new project can involve making hundreds of decisions, both big and small. It can be very overwhelming, wondering if a choice made now will ruin an engineer's day a year from now. Well, I'm here to tell you not to worry too much about that, because it definitely will; it's a hundred percent going to happen. Have you ever worked on a legacy project and thought, "wow, the engineer that wrote this was a genius, everything makes sense, it's totally perfect, there are no weird problems with this"? That's every project, right? Of course not. And the best part about this project is that I got to be the person you're mad at, because I got to start with an empty repo and build something, and now everyone gets to be mad at me two years later. It's fantastic. As the engineer who built in that greenfield space, I'm the worst, but it's because I worked within the constraints of what I knew then, not what we know now about how people would actually use what we built.

So here are the techniques we did use to reduce the bad choices that future engineers have to live with. For all of my technical talks, I like to include this underlying theme of "do less but accomplish more." That is, focus on solving the problems and only build what you absolutely have to. If there's a service or an open-source tool or something already built to solve part of your problem, just use it. There are obviously things anyone should do to vet it, but it's important that you focus on the business problems and not on reinventing the wheel, so you can actually get things done.

Netflix values a culture of freedom and responsibility; you'll see that in our culture deck. And yet a commonly used design philosophy among the engineers is that of the paved path: tools and practices that are widely adopted in the company and supported. Engineers can choose any tool or build any service as part of that freedom, but then you have the responsibility to support and maintain it. And it's much easier if there are already 200 people using something who have already figured out all the problems, and you can just leverage that. It's the best. That matched my philosophy as well and guided most of the team's choices.

Of course, sometimes you can't. Sometimes you're making new things. Sometimes you're googling something and there are literally no answers anywhere, because no one has tried to do this yet, and you're like, "how is this possible? I have to write a new thing? The worst." That's when I put a premium on what tools I can use to build it and whether there's solid community adoption, as well as, if I build something, can I use it for multiple problems or am I just solving one problem? If I'm just solving one problem, what's the smallest amount of work I can do to solve it? And if it's multiple problems, what do I need to do to make this flexible?

It's okay, we're casual here, no problem. Okay, tell them not to call again, you're very busy. It's very interesting. Unless you want to put them on the phone so they can listen, because it's so good.
I Mean I would recommend it you call all your friends and be like you have to hear this talk the slides There's not much on them. You won't miss it. It's just like cool pictures that I downloaded from Netflix originals Cuz I was like these are the TV shows. I like Okay, so this one actually has a lot of text on it Which makes it even funnier that I made those jokes that I'm like, oh, there's no Texas It's fine except this one has like ten lines So in order explain the technology behind this project. I kind of wanted to break it down into main components So the first component is fleet management Which is making sure the workstations are available and tracking their life cycle has so I'm gonna acquire this workstation Is this workstation still building has this workstation is it a zombie now? Which I think is the great name for Something that we've lost contact with it's obviously gone rogue and eat all the other workstations The second component is configuration management Which makes sure the workstations can be designed for any artist needs whether it's installing an application or configuring the registry Or an environment variable you can do pretty much anything to configure a computer You know and we managed to find all the edge cases to do that I've learned a lot more about PowerShell than I ever wanted to know The you know the following two package and image management are both a subset of configuration management But they're slightly different package management handles those individual package like a single software installation While image management handles collections of packages such as specifying, okay These ten packages are required for a compositing artist You would think installing a single piece of software is not complicated Well, it's surprisingly complicated a lot of things aren't as built for this type of Cloud-based environment as you think so you have to be You know very Deliberate and how you install things So it's kind of together both the package and the image they ensure configurations are immutable And you need a clear history for stability and observability like who installed this software What was installed exactly? 
what were the actual lines of code they used, because that's the only way to dig into why something didn't install correctly and why it's not starting up.

The fifth component is user access, or role-based access control, and it's how the right user gets the right workstation with the right files, so we're not back in that world where people have access to the wrong images and the wrong movies. And then finally, remote display protocols are how the user connects to and actually interacts with the workstation remotely.

So let's start with fleet management. Fleet management handles the lifecycle of a given workstation as well as pools of workstations with identical configurations. Pools are similar to an auto-scaling group in AWS, but the control plane controls the scaling, so that we're not automatically scaling away or removing things that people are currently in the middle of working on; we need a lot more information than just letting it scale automatically. We do that with the control plane, which is a collection of Java Spring Boot services. Spring Boot is great: it allows for easier creation of Java applications by abstracting away some of the common tasks. Coming from Python to Java, it was really nice having Spring, because I didn't have to learn much to just get things working, and it has a really strong paved path at Netflix; so many services use it, and there are many people I can ask questions of.

That control plane tracks the fleet's needs and alerts Spinnaker to execute them. Spinnaker describes itself as an open-source, multi-cloud continuous delivery platform that helps you release software changes with high velocity. It's a common tool for releasing and maintaining services at Netflix, and it's on the paved path, so we reused it to control the workstation fleet as well. Most people use it for "I'm going to deploy my service, do this red/black deploy, get things working." We also said, hey, we can use the same tool to deploy instances with different configurations and different images and control what's happening there. So it's great that we could reuse this tool.

Spinnaker has a concept called pipelines, and you've already heard me use "pipelines" in another way, so this will be fun.
I'll probably use it five more ways; it's definitely a nicely overloaded word. In this case, pipelines are the instructions for creating the pools of workstations, which Spinnaker calls clusters. Pipelines can be accessed through an API and allow for variables and expressions, so we set up a variety of variables that we can mix and match for each use case. For example, an artist may need a GPU when doing graphics-intensive work, or extra-large storage to handle file management. Some artists need one workstation to support the compositing software, while others require a different one for different software. To minimize latency and get the workstations as close to the artists as possible, we also support selecting from a growing list of regions and availability zones. So we try to use the same pipeline for everything, but here are all the variables: launch it here, launch it there, launch it with this, that sort of thing.

The control plane runs that Spinnaker pipeline whenever it needs either a whole pool ("I need a hundred more with a totally new configuration") or just five more workstations because a couple more artists are coming. It connects to the API to terminate a workstation once a user is done with it, or to remove whole pools when a team says, "we're done with this configuration, we're going to do a whole new one, get rid of those."

Then the workstation has a Golang agent, which provides the heartbeat and other information about the workstation back to the control plane. The agent lets the control plane know whether the workstation is still up and in the correct state, or if it needs attention. We chose Go here because it was easy to build for different operating systems, which we knew was an early requirement: we're going to go all over the place, so let's make sure we can put our agent on every single type of workstation.

So, configuration management starts at the control plane, but the details of software installation and configuration are handled by Salt. We needed a system that can manage hundreds of thousands of workstations while being highly flexible and easy for anyone to jump into and create a new configuration; not just us as the engineers, but the technical directors and everyone. We want them to think, "hey, I have a new software package, how do I install it? What's an easy installation language?" That's why we chose Salt: it's all YAML- and Jinja-based, so it's a little easier to read than writing straight code. It also uses Python, which is a very common language for people in VFX and animation, so they're really familiar with it. There are a bunch of built-in modules, from installing a package to file management, copy this file over here. It also allows for logic statements to handle situations such as "mount this storage only in this region," or "only run the script if this file doesn't exist"; please don't reinstall this software five times even if this accidentally runs five times. It can get complex, but at the base it looks like this.
We're just saying, hey, install this package. It's the equivalent of just running "yum install lighting 10.1" on a CentOS instance, or "apt-get install lighting 10.1" on Ubuntu. The module is OS-agnostic and should find the right installer per OS. Today there are hundreds of different packages written for the workstations, from installing software to editing the registry and environment variables, lots of different things.

Salt was selected over other configuration management tools such as Ansible or Terraform, a question I get often, because it was used recently in other Netflix projects and got high marks, as well as being Python-based. This was an opportunity to create bridges to the other engineers we knew we'd be working with, because we knew they'd be familiar with Python.

Initially we served Salt formulas directly, allowing for fast changes and quick deployment. Everything was always the latest, with no ability to lock to a specific version. Which sounds really great, right? That seems like something that would not cause any problems, where anyone could just push code and everyone gets a different version depending on when they loaded it. That seems right. It was the Wild West, but it was part of being able to move quickly. So to help manage that, we created a package management system that wraps around those Salt formulas to enforce some rules. The first is a template for all the formulas, to make it easier to avoid naming conflicts and to enforce rules such as making sure that user information is available before this runs, or "this formula is a PowerShell script, so don't run it on Linux; it won't do anything." The second is that all the packages are immutable: to use a package on a workstation, it needs to be published to a version, and if any changes are made, it needs a new version. This makes it very clear what was run. It also allows for checks at the publishing stage, syntax and testing. This should sound very familiar: we basically just recreated git, because we wanted to make sure there was some version control in our packaging. The structure made it easier for more of the technical directors and other engineers to contribute to the library of formulas. I know exactly what code was run on the artist's workstation while not impeding their progress: they don't have to come talk to me every time they want to publish something; they just publish it, there's a version number, and we can track down what they did.

So let's talk more about image management. The image management process has gone through multiple iterations as we learned more about the users and observed how they actually use the product, and this section is a case study of how you iterate through a project and make it better and better as you learn how people use things. A workstation goes through three configuration stages: the bake, the run, and the acquire. The different iterations will show how we moved the logic around to better serve the user needs once we knew how they wanted to use it, so I've added the actual logic of each phase to the chart, and we'll move through the iterations. All of the deployed workstations start with an AMI; an AMI is an image used to launch a workstation in AWS. The bake stage is the process of turning a series of instructions into an immutable image. If you're familiar with Docker, it's just like that: here's the script, turn it into an image, so every time I launch this instance,
it's the exact same instance. The image can then be used to come up very quickly; it takes a minute or two for an EC2 instance to come up once you have that AMI. However, we found the process of baking can take a long time, and we wanted to ensure that in the early stages we could quickly release new features. We don't want to wait 90 minutes every time we try to iterate on a bug, and we don't want to be held up by those long waits while it re-bakes. So we only included minimal logic in the bake stage: just the operating system, plus some critical tools such as that Go agent. This is all we need to get the workstation up.

Spinnaker then creates the workstation during the run stage. Once the workstation is up, that agent starts running and the control plane knows it's available; there's a new workstation. During this first generation, there was no other configuration happening at this stage. The acquire stage is when a user asks for a workstation: "I want to get to work now." Most of the software installation, including those DCC apps, and all the user configuration happened during the acquire stage. It was kind of a just-in-time approach.

It's funny, I just went and got a sandwich at Jersey Mike's, so now I get to use a sandwich metaphor: I would refuse to go if we weren't going to order in advance. Which is crazy, because I was letting these people have workstations where they had to wait 30 minutes; there was no ordering in advance for them. I wouldn't accept that for sandwiches, and they shouldn't accept it for workstations. So we iterate.

The long wait times led us to move the software configuration and installation to the run stage. This created pools of workstations that had specific configurations, not just the basic OS. There were two risks to this method, but one great thing: the workstation wasn't available to acquire until everything was already installed, so when someone went to acquire it, it was fast, because they didn't even know it was being made in advance.

The first risk was ending up with many, many pools with slight variations. That turned out not to be an issue, because in practice the usage patterns of the artists were not as diverse as initially thought. The same team tended to use the same configuration for extended periods, so even though there were many pools, the number did not grow to be unmanageable, and the teams tended to switch configurations rather than use multiple at once: "we're all using this configuration; next week we're using that configuration, let's get rid of the original one."

The second risk was inefficiency: there were always idle workstations ready to go for artists, which meant they were sitting around, lonely and unused, wasting compute and racking up that AWS bill we all love so much. That was an intentional design decision in favor of usability over efficiency. Of course, that means it's time for a third iteration.

The third and current, but probably not the last, iteration is pushing the software installation to the bake stage. This is like having a library of blueprints instead of a bunch of empty houses: real estate isn't wasted, because unlike houses, workstations can be built from baked images very quickly as they are needed. But don't forget, this is what we tried initially, and because of long bake times
we didn't like it. The process has now come full circle: instead of abandoning baking, we improved the process and developed better tooling that allows for concurrent and easier bakes and a layering system. So you're not always doing a 90-minute bake; you're just adding one piece of new software, and you only do a full re-bake when there's something you really need to clean up, or you need a new operating system, or something else critical.

We also kept the other option as a backup, because one of the many lessons from bringing up a new project is to hedge your bets: let the old path slowly go away while you roll out the new features and see how they go, and maybe you need to go back to the old behavior if you haven't gotten the new one right. A real slow roll. The technical directors who handle the configuration choices are encouraged to give some time between creating a new configuration and letting an artist get a workstation. That's also an improvement in the education process: you shouldn't expect things right away; it's always going to take some time to install, so once you make your choices, let things happen, and then you'll get your workstations.

As with all projects, there are no perfect choices, just a bunch of trade-offs to maximize the user experience, cost, and technical efficiency. That's what we try to do here, and we'll see; maybe the next time I give this talk there will be iterations four, five, and six.

A crucial part of any remote workstation experience is connecting to it. We currently use two different remote display protocols, NICE DCV and Teradici. When deciding which remote display protocol to use, the user must know how they want to connect. This is not decided by the artist, but rather by which workflow they're using; generally the pipeline engineers who get the artist team up and running decide on the workflow based on what they need. Teradici is a commercial solution consisting of agent software, hardware clients, and gateway components. NICE DCV is an AWS service available on any EC2 instance and allows for a browser-based connection and application streaming. Application streaming allows for a very streamlined experience (how many times can I say "streaming" while also working at Netflix?), so artists can jump right into their tasks without launching an entire desktop; it's just the one piece of software. Imagine you just wanted to go into Slack: that browser window would just be Slack, and you'd get another workstation for another application, and it would just be that. With Teradici, on the other hand, it's: here's a window, it has an entire desktop in it, and your computer doesn't really have to do anything.

For both of these, latency and peripheral connectivity were vital features. For example, if an artist can't use their Wacom tablet or their two giant monitors, they won't be able to do their work as well on a remote workstation. VFX and animation is very graphics- and detail-oriented work: you can't have latency, you can't have color disparities,
You need to be able to see things accurately Or otherwise we'll get like a blue hole or something, but like only in one shot So initially all the users acquired an access network workstations through a UI Key components needed to allow for improved self-service are in the UI This includes giving the TDs the ability to create configuration and control access to them with that role-based access control or our back the UI is also for operators to visually check the health and usage of workstations such as checking logs when a support request comes in or Making sure an artist is on the correct configuration when they say hey, this isn't working correctly There are also like the bigger dashboards to check on your fleets So there's common questions like are we out of licenses that we need to buy more license for the software? Or did the auto scaler go haywire and we actually requested a thousand workstations with only steam installed for some reasons that have any Artist software, you know that may have happened that you know here they're there But you know we got rid of them before anyone noticed Over the past year we realized that stakeholders were less interested in an opinion in a product and more interested in a robust and customizable platform We're continuing our add abilities to control and customize the artist experience while improving the platform such as Reliability observability flexibility all those big production level things now that kind of out of that alpha stage but focusing on these and our strong partnerships and then Netflix workstations can be leveraged for an even greater impact So it's kind of like getting more open-source contributors that I have any to do it all yourself We just say here's the platform build what you'd like on top of it So I'm just going to put it all together The workstation starts with baking the AMI. This is where the majority of the configuration happens Then Spinnaker handles the fleet management by turning the AMI into a workstation during the run stage After that the agent starts up and makes the workstation available for acquisition Finally once a user requests the workstation they connect to it via the remote display protocol When they're all done with it the workstation gets terminated and the cycle starts again This system is designed for artists and everyone who support them Artists can be on board in a matter of their geographic location to accommodate any dynamic production schedules They can get any workstations They need with a configuration change instead of having to wait for a new one to arrive in the mail and Textual directors can quickly modify control configurations to maintain stability while pushing out new tools Yeah, I just had recent things. Just go in. Oh This file is missing. This is wrong. Let's let me remote in real quick and get that for you And now you're ready to go and you have to worry about me being sitting next to you and Those pipeline engineers can build on top of the platform to customize the experience to a specific artist workflow and have Whatever you're doing whether it's one frame a thousand frames this software that software You can have the system ready for you So here's what it looks like today for a user to request the workstation It goes by like really fast But it repeats so you can see it happening over and over again They select the workstation and the configuration they need they wait a very reasonable amount of time For the workstation to prep then they select the connection in this case. 
It's NICE DCV, so they'll be opening it in the browser, and then there it is: a desktop with a few tools already open automatically. That's another thing you can configure — hey, when this loads up, automatically open these tools, because these are the ones they're going to use. And if you can catch it, you can see Deadline, another AWS tool that we use for rendering, right at the bottom of the taskbar. See, a very reasonable amount of time — I definitely didn't edit the GIF to make it go that fast. So what did we learn here? You always want to debrief your projects to make them better, and that's what we did. Everything was new when the project started: new roles, a new team, and an empty repository. We made the best choices with the information we had, then we pivoted when real-world users showed us a different way. Here's some of what we learned about building artist-driven technology. Nothing can be taken for granted when making a significant leap in technology at your company. The new technology of cloud-based workstations must at the very least match the old experience of dedicated on-prem workstations, or the users won't use it. Having the exact same experience is table stakes: the things they're used to have to be there, at least. For example, they rely on their software settings to remain in place from one workday to the next, but that's not a given on cloud-based workstations, so you have to be more methodical about it. Therefore it's essential to learn everything people are doing — the user's whole workflow — so nothing is missed, and then improve on it, because why would a user switch to your new thing if it's not better than their old thing? We focused first on technical flexibility instead of the artist experience, but artists need custom workflows, not standard solutions. By initially building the entire product we kind of painted ourselves into a corner: we were on the hook for making everything, but unable to deliver the needed customizations because we had limited engineers. So we pivoted to a platform: build a strong core and allow for partner customization. Empowering others to develop with you gives your teams more leverage than they can accomplish as individuals. The more the system makes it easier for others to contribute, the better the adoption and the iteration will be. People do things you never expected them to do, which is great, because I don't need to do it all myself, and they're doing it. The engineers need to help each other. It's critical to empathize with your users. When an artist calls me,
I know it is a worse part of their day That's what I've known the entire time I've worked in entertainment They want to focus on their art and not on whatever it is you broke today because you push something Observability and testing can seem like things that can be cut when timelines can press say you thought you had a year to build Remote workstations and then you've only worked there two months and everyone is home now But they quickly come back and haunt you Especially when you're on call So we committed to improving in those areas and adding more robust automatic testing clearer failure mode reporting All the like good stuff you want in production that you were like, oh, we don't need this right now Everyone doesn't need we just need to get this to work But you need to get it to work And finally addressing one use case at a time from the micro to the macro level Addressing one operating system making sure that works or addressing one You know user persona first and then moving on to the next team and really hardening as you go From the small tickets and the small pull requests and small releases It you know things we try to abstract too early and be like, oh this will work for all use cases never like got it Right and we end up having to rewrite it anyway So it's better to just for us to do one thing at a time and then you see oh this is how All these five teams are using the same way and now we can I'm sure I things out And know you're on the right track So what's next like the Christmas Prince universe? Crossovers and spin-offs make things more fun. I hope you all enjoy the many Christmas Prince movies the Netflix provides Netflix workstations are kind of one part of a technology ecosystem designed to help kind of creators Make feature films and television at an unprecedented rate So you have new things every week to watch and you always stay a Netflix subscriber It consists of many other teams that help our infrastructure in new and exciting ways We'll be collaborating to plug workstations into more and more workflows So more things are automated more things are done quicker and easier and we always know what's going on So our talent worldwide can be part of making diverse content So thank you for listening It's been fun and I want to thank my many sitting colleagues without whom there'd be no Netflix workstations and I think we have time for questions. I think we have about 10 minutes. So if there are any questions, let me know So he asked how do you get the same frame per second in Netflix workstations? So one of the things we do is just being local regions like making sure that there's a workstation as close to the artist as possible We know someone's gonna be working in this country or the city Let's get let's get in this availability zone The second is we're really relying on our remote display providers to get that work and get that latency down So, you know, you see both those things like making sure you don't feel the latency when you're close enough Because like I have to test workstations obviously that are all over the world And I will notice it. I'll notice a latency if I'm working on a workstation Across the world, but the artist won't because it's right next to them And then relying on the actual graphics and getting the pixels there as fast as possible That's our our great partners at Teradici and ICCV Luckily, I don't have to do everything. 
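On the "get a workstation as close to the artist as possible" point from the latency answer above, here is a minimal, purely illustrative sketch of one crude way to pick the nearest region by probing connection time; the region shortlist is made up, and a real system would presumably drive this from its own placement rules.

    # Illustration only: pick the candidate AWS region with the lowest rough
    # network latency from wherever this runs. The shortlist is a placeholder.
    import socket
    import time

    CANDIDATE_REGIONS = ["us-west-2", "eu-west-2", "ap-south-1"]

    def tcp_rtt(region: str) -> float:
        """Rough latency check: time to open a TCP connection to the region's EC2 endpoint."""
        host = f"ec2.{region}.amazonaws.com"
        start = time.monotonic()
        with socket.create_connection((host, 443), timeout=5):
            pass
        return time.monotonic() - start

    def closest_region() -> str:
        """Return whichever candidate region answered fastest."""
        return min(CANDIDATE_REGIONS, key=tcp_rtt)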
Go ahead. Yeah, so the RDPs — those two, Teradici and NICE DCV — they're not proprietary to Netflix. Anytime we can buy or borrow, I really like that, because I don't have to build my own thing. Teradici is commercial software, and NICE DCV is on your EC2 instance. So when people say, hey, I want to try out doing a remote workstation, I always tell them this is what I think you should try first: go build an EC2 instance, install NICE DCV, and go see how it works for you, because it's already there in AWS for you. And if you decide, oh, maybe I want the Teradici experience, then you'll at least know that it already works. I think there's one over there, yeah. Yeah, so there's a lot there — he asked about color management. Obviously color is very important to getting your film out correctly. Some of that is the back-end technology you use in film production: there's this idea of color spaces, and those are all math (there's a tiny example of that kind of math a little further down). That's happening as part of the process behind the scenes when you're generating the image. You decide, okay, this entire film is going to use this math formula, so that when they all come out at the end, all the images match together. So that's part of it, and it's kind of outside of Netflix workstations — it's just how movies work. The other part of it is relying on our remote display protocol partners, and then also on the monitors and the environment the user is in. Sometimes people ask me, hey, can people work on their workstation outside? And I'm like, no — they're doing color-sensitive work; of course they're not working outside. That's the difference from making a remote workstation for a developer: of course you can work outside, as long as you can see your IDE and have access to Stack Overflow. But there's a lot that goes into making the color correct, and if our RDP protocols aren't right and things are off, or if the environment isn't correct, or if the color space — that math — isn't correct, that could all lead to different color output. So there are a lot of people working hard to make sure that's a hundred percent accurate, and they need really nice monitors and a nice room. When I used to work inside a VFX facility it was actually no lights allowed — imagine a normal office with cubicles, but everyone's just sitting in the dark, like Severance. It's actually pretty funny, but they have to, right, because they can't have that glare on their monitors when making feature films. Yeah, he asked what changes we would make if we had more variables and configuration to deal with. I think we're already doing that now, which is letting the users control the configurations and build their own stuff. When we first started, they'd be like, I want a workstation with this on it, and I'd literally download the software, figure out how to install it, set up the formula, package it for them, and then give it to them. Now I'm like, okay, here's our platform. Here's where you put any files you need on S3, here's where you configure it in this repository, here's how you build an image — go to town, test it, try it out yourself. So we're kind of outsourcing it to the people who want to use it.
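Here is that tiny example of the "color spaces are all math" point from the color-management answer above — just the standard sRGB transfer function, not anything specific to the Netflix pipeline; real productions use managed color systems rather than hand-rolled code.

    # Illustration only: the standard sRGB encoding curve (linear light -> sRGB),
    # one small piece of the "color spaces are math" story. Production pipelines
    # use managed color systems (e.g. OpenColorIO), not ad hoc code like this.
    def linear_to_srgb(c: float) -> float:
        """Encode one linear-light channel value (0..1) to sRGB."""
        if c <= 0.0031308:
            return 12.92 * c
        return 1.055 * (c ** (1 / 2.4)) - 0.055

    # If this math (or the viewing environment) is off, shots stop matching:
    # linear_to_srgb(0.18) -> ~0.46, i.e. middle gray lands near 46% signal.

Anyway, back to the self-service configuration point.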
They can make as many configurations as they want and we're there to make the tools easier He asked if it also increases the support workload and it kind of depends Right like we're on the hook for the infrastructure They're making sure that works and the basics work and when they go to log in they don't get a connection error But if they've installed software and that software doesn't run correctly We just bring it back to them and say hey you run this I see this is your code if you if there's like confusion On how to use the tools that's part of our support burden. That's part of like oh I noticed you tried to do this and you got an error Did that error make sense? Do you know how to fix it? But most of the time? You know we try to have good messaging so people can see like oh I Try to install the software and it says file doesn't exist And it says where an S3 it look for it and the files in there Maybe I should upload the file to S3 where I said it should be you know things like that where it's like We're trying to be more of a platform and a tool system and not debug every piece of software because There's a lot of software that takes a lot of expertise to install correctly It's just some DC yet see apps that are just like you don't know how much time I spent in the first round Trying to get this thing to install and it's like wild to install correctly on the cloud versus a desktop So and into an image. It's very different installing on an image and it's installing on a machine. That's Always gonna be running and you just leave it so there's a lot of I was called image-making like the devil in the details So having more people involved in configuring their own configurations actually helps to disperse that Knowledge because they should be experts on the software. They want to use Go ahead back there Yeah, so there's Teradici and nice DCV. Those are the two you can also just like RDP into them if you're like a power user. I do I use that sometimes when I'm debugging I just use them the one I really like it's called like Royal TSV has a lot of options for promoting in But that's not like for like the artist experience. The artist experience We use the Teradici client or the nice DCV client because that's a that's a really nice streamlined experience where they we actually generate Short-term use codes for them Every time they want to log on so we're constantly checking their credentials because we you know security is so important in movies So it's like a different experience But like as a power user you can kind of get in any way you want you can use Microsoft remote desktop to go in I'm sorry. Can you say again? We haven't tried that. No Teradici was kind of one that's been used other Entertainment companies have used it well before and the you know So we wanted to use something that people were a little familiar with and it worked well for us And then the nice DCV we just liked for the application streaming and the fact that it like Comes free with an ac2 instance that we were already using so if we run into other use cases where those two don't work We might explore other other options, but that's part of the like You know, it's in my general philosophy not to build new features until someone actually needs it So he'll sometimes ask like oh You know, they'll get I'll get an error and I'll be like, oh, were you trying to do something? Do you need a new feature on the bike? 
No, I'm like great Do it do it the other way then do it the normal way so Yeah, so we focus primarily on studio But there's a lot of different teams at Netflix that are interested in this But it's interesting because studio is like a very specific Area where they need that color accuracy. They need that a lot of storage movies are very big You know, we need to not lose anything It's also very different than like if you think about like streaming like you're watching Netflix That's like 200 million users. So everything has to be very Resilient but like a single user is less important than the group But if you think about Netflix workstations It's an internal tool for artists and if one artist cannot work that is an emergency Like they get to call me at three in the morning if they can't do their work for that day because you can't like Let people lose a work day. That's awful. Like they're not getting paid. The work isn't getting done So it's a different. It's very different experience Yeah Yeah, we're definitely exploring it, but we really want to make sure that we don't stretch So we asked if like Netflix games was gonna use this and it's definitely something that is possible But we don't want to stretch too far until we're really sure so, you know As I kind of mentioned in the beginning we stretched really far in March April May 2020 And now we're like make sure everything works really well And then we're gonna say okay, how do we get new users like what other groups want to jump in to this and Without changing too much so something like games is actually very similar to VFX animation So it might fit better than say like for engineers who have a totally different workflow But engineers and TDs actually don't have that different workflow. So it's an interesting interesting What to focus on in any project user expansion versus Technology hardening and you always don't want to go too far in one direction where you get too many users and something Nothing works or you spend too much time on technology hardening and you've actually gone in the wrong direction Because your users if you'd expanded users you would have gotten more information So we're trying to balance it a little better now that we're not in a rush But definitely yeah, I want as many users as possible Go ahead Yes, he asked about zero clients and I said yes, and that's about all I know about zero clients So there hopefully there's no follow-up questions. I've heard the phrase zero clients and they've used them. I I Believe it's like when you have like a very thin client right where it's like so like I get into workstations on my MacBook But that's like super overkill because I already have a MacBook But if I just had like I imagine a computer just like a little tiny thing and instead of an entire computer that all it does Is log into remote workstations? That yeah, that's another use case that people actually often use because they had a regular computer Why would they need a remote workstation? Oh? Sorry Guy in the gray shirt and then a guy in it black shirt Yes But okay, if you think about it The computers that they were using other desks were also very expensive artists usually don't have like nice cheap computers You can't like give an artist a Chromebook. I don't know if anyone's ever is like seeing like a box desktop It's very common. It's like they have to be so substantial anyway that Using the cloud isn't making as big a difference. 
So if you're exchanging your Chromebook for an EC2 instance, you'll see a much bigger leap in costs Then you'll see if you're exchanging your powerhouse gaming rig For an EC2 instance you see what I mean like there was there's not traveling as much in cost But one of the initiatives like I talked about the efficiency of not having idle instances sitting around is To do things like that is to like lower our Unneeded costs. It's needed to have an EC2 instance that has a great graphics card and a lot of storage That's a needed cost having extra wide idle workstations that no one's using as an unneeded cost So that's what we focus on is it makes sure of any extra costs Are we don't have instead of the ones that are actually necessary? Okay Yeah, definitely. So there's a There's so many things like, you know, we use this UI and you saw it a little bit in the in their In the GIF but that's just like the artist view and like the admin view is very different So you can see things like open a workstation and see all the grass of things like CPU and memory and that sort of thing So like if you see something spiking and starting to hit the limit You'll get that alert before they do seeing the storage before it hits the limit So you get the alert before they do seeing, you know, hey, this person Logged in five times over 20 minutes. What's that about? Like seeing these patterns of usage is really important to spot like patterns of problem or You know even grander like hey 30 people tried to log in and 30 people had failures All right, something majors down then then seeing that so it's about both the macro and that micro level so we added a lot of Logging and graphs and things like that to the individual Workstation page in the UI where you can see all the details about it and make it a lot easier to debug and then we are also starting to learn those patterns by aggregating the data and Seeing over time how people are using them That answer your question Yes, we call that rendering it happens all the time So what an artist usually makes they don't actually usually make the final image They actually make a blueprint. So this is all it's like artistry and math at the same time So they say they get like um, they'll get like a film plate they'll get the original film and then they'll say like add this in here add this in here add this in here But they'll be working on a very low resolution version of it And then what rendering does is it takes that image and it takes it the selections They've made and does a bunch of math and creates that like final beautiful image. So that's like that's like the headless Fortune Yeah, you don't need like a whole workstation to do that It's actually a super common VFX way of doing things and if you look there's like a The very fast and the gif I featured deadline which is a monitoring tool that helps you monitor render So that when you're on a workstation and you say launch my render to somewhere else in the cloud That does the render you can see it monitored on your workstation So that is a it's extremely common process. That's how all movies get made, but it's a little different because it's really headless Okay, we allocate a pool so that they're always available He asked about we were allocating a pool for spot instances or reserved And it's more that it's about pricing and more about that an artist can come in and immediately have something. 
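As a side note on the monitoring answer above — catching a spike before the artist notices — here is a minimal sketch, not the actual Netflix admin tooling, of what one such check could look like using CloudWatch; the instance ID, window, and threshold are placeholders.

    # Sketch only: read a workstation's recent CPU utilization from CloudWatch
    # and flag it before the artist feels it. Threshold and IDs are placeholders.
    import datetime
    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

    def cpu_is_running_hot(instance_id: str, threshold: float = 90.0) -> bool:
        """True if average CPU in the last 15 minutes exceeded the threshold."""
        now = datetime.datetime.utcnow()
        stats = cloudwatch.get_metric_statistics(
            Namespace="AWS/EC2",
            MetricName="CPUUtilization",
            Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
            StartTime=now - datetime.timedelta(minutes=15),
            EndTime=now,
            Period=300,
            Statistics=["Average"],
        )
        points = stats["Datapoints"]
        return bool(points) and max(p["Average"] for p in points) > threshold

    # e.g. if cpu_is_running_hot("i-0abc123def4567890"): page an operator, not the artist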
Oh We're at time someone's just during and I didn't know what he was talking about But it looks like we're at time So I'm gonna I'm gonna stop with the questions, but so great to have all these questions I love all the interest feel free to find me out like hanging out outside or again on LinkedIn Leave a message so I know you're not a robot Thank you Can you hear it now? Oh, there we go. Hey go. Thanks. I like the feedback. So if we can make this interactive, that'd be great So anyway, the point is that I am not going to apologize for anything other than the slides So Some people did help me get them at the last minute to have some nice Look and feel and all that kind of stuff, but the content is definitely put on there very rushed All right, so I don't want to apologize any further. Let's just just go ahead and get started Chaos is a feature not a bug. The point is that we don't want to Have it happen to us. We want to get out in front of it, right? So Some of the topics I'm going to cover here are a little bit of intro to chaos especially around like reliability and resilience You got to talk about that first chaos engineering, of course in fault injection I'm gonna talk a little bit about litmus chaos. It's the open-source project that my company harness acquired and the team that comes with it so we now have a really nice staff of chaos experts the company that we acquired was called chaos chaos native and litmus chaos is the open-source project that they submitted to the CNCF and it's now in incubating And we're talking a little bit about game days and or chaos days, you know game days can mean different things, right? So specifically a chaos tech game day So first, oh another thing. Let me do this Let me change this a little bit. I have some notes in here be helpful if I did this in Presenter view. I think you still should be able to see well now you can see what I see, right? How can I do this in a way that's a little more useful to you? Let's see here. Let's see. Hold on. Let me go back and put me in a different view Okay, I should probably get prepared for this before I do start my presentation. Hold on one second Let me just stop this for a second because I can't figure out how to do this Now let's see if I go into presenter view still doing that, huh? Oh, okay Thanks Let's go Let's stop mirroring. Is that what I do? Okay, so we're back to not stop mirroring. Okay. Now we go into Let's see. Do I move this over? Now I can go into See what I'm doing over there. Does that do it? All right. Now what happens if I go slide by slide? Oh All right, thanks for hanging with me on that Okay, so these are some of the subjects we're going to cover. I already cover that slide All right. So first before we start talking about chaos engineering. It's it's important to understand the core You know site reliability engineering Concepts that we're going to be applying this to So for example, and again sorry for the mix up on the slides here or the mess For example, it's never really been easier to create large-scale applications that are distributed We've got Kubernetes. We've got all these tools that help us do this and make bigger and grander and more scalable websites, right and So, you know by example of this we've now got our infrastructures in the cloud. 
We've got an amazing programming language support there are there's lots and lots of open source projects and components that we can use and We've got lots and lots of services now And so it is easier than ever to build an incredibly large and complicated website and so you know, I remember in the I've been in the space for a while or a long time. I've been in DevRel since early days of APIs and I remember in the early days you couldn't Get somebody to use your API because they would tell you why on earth what I want to add Your downtime to my downtime, right? That's that's how it was back then downtime was significant and You know these days We don't think that way as much you still do a little bit But you know, we don't really think too many too hard about using somebody else's service through an API Because their downtime is probably pretty reasonable, you know, if they're especially if they're of any size And so that's not that big of a deal. So we'll add that to our service. However, this creates a lot of opportunity for issues and so You know, nothing is 100% Reliable and so the larger your application the more of these components and these services that you add that creates more service area of more problems that could occur Even even like You know Amazon S3 or Google storage or Azure storage not those Have downtime to even though it seems like they don't right because night night how many nights will they have these days? But there are times when they've been head outages and so that's where resilience comes in, right? So There is no guarantee that these things won't go down and so you know These are the services that our systems are built upon and so any of this infrastructure can go offline at any point in time and You need a plan you need to plan that your applications and service will go down and so there will be disruptions and And to deal with this you think about your those components as being We said they have some kind of reliability right and then if your individual component goes down Then how do you deal with that is really the resilience? So how resilient is your application when some parts of it go down? So resiliency is the ability to withstand certain types of failure in your components or the pieces and yet remain functional to some level and So you can't be resilient if the components that make up your system Are not reliable or you know have some sort of known reliability And so this talk we're going to talk about how you can make your deployments more reliable through chaos engineering All right who am I my name is Dave Nielsen and I work as the director senior director of developer relations or community and advocacy at harness and Some of the things you should know just about harness is that while we started off as a purely CI or I'm sorry CD company It's just advancing okay Doing continuous delivery we then acquired an open-source project called drone and drone is Like the best kept secret, you know, it was created when Docker came out and the brad the creator of drone felt that Continuous integration should be done in containers Because all of those tests and all the different steps in the pipeline. 
We don't want them to impact each other. Especially if any of you have used Jenkins, you know those different steps along the way sometimes share codebases, or the platform, or the programming language, so upgrading any one of your steps along the way can be pretty difficult. But if everything's running in containers and you just want to update one of those steps, you just update the container. It could be a different version of the programming language; it could be a different programming language altogether. So he built that, and that's what Drone became. You choose your plugins — that's what we call those — you set it up with YAML, and you're off and running. It's very simple. But anyway, it was more or less a one-man show — I mean, he definitely had some others helping him, and he had a great group of volunteers building all these plugins. About two years ago we acquired Drone, but it had already reached hundreds of millions of downloads as this one-plus-person project that had never raised any funding. Pretty amazing. I used to work at Redis, and there's a similar story there with Salvatore, the creator of Redis — it's just really amazing when these things happen. So I'm really excited to be helping to promote Drone. Which is a different subject, but that is how we got most of our open-source community going, and then we acquired Litmus Chaos, which has a similar story, just more recent. So now we have two open-source projects as part of our community. Okay, so failures are inherent to complex systems. And what do we mean by complex? Even a monorepo can be very complex, but now that we've got all these different systems — microservices, different teams — it becomes so much more complex, almost exponentially more complex. A system is made up of multiple components, and each component might have its own reliability, but really what we're talking about is how the overall system behaves. Sometimes there's a rippling effect, and you just don't know how the other components in your system might behave if one system goes down. That's a complex system: you just don't really know how all of them will behave if one thing changes. So there are lots of examples of outages — why have these outages occurred? Well, for example, salesforce.com: they're a big company, they know what they're doing, and yet, boom, they had a DNS outage. How does that quote go? "It's never DNS. It was DNS." There you go, thank you, that's the one I was looking for. Doesn't DNS seem like one of those things that would be ridiculously reliable? And yet, it's a complex system too, and that outage took over a week to recover from. Clearly they had not run a chaos experiment on this problem, or I don't think they'd have taken a week to recover. Cloudflare — this one was interesting because it occurred when they ran a bash script but didn't use pipefail. And what happens when you don't use pipefail? If one of the commands has an error and you're expecting it to produce some data, the script just continues on and keeps running, but there's no data.
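This isn't Cloudflare's actual script, but here is a tiny demonstration of the pipefail behavior being described (it assumes a system with bash available).

    # Tiny demo of the pipefail point. Without `set -o pipefail`, a pipeline's
    # exit status is the LAST command's, so an upstream failure is invisible
    # and the surrounding script happily carries on with empty data.
    import subprocess

    without = subprocess.run(["bash", "-c", "false | cat"])
    print(without.returncode)   # 0 -> the failure of `false` goes unnoticed

    with_pf = subprocess.run(["bash", "-c", "set -o pipefail; false | cat"])
    print(with_pf.returncode)   # 1 -> the pipeline now reports the upstream failure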
So what happens is it goes and replaces data with no data and in this case That caused a big problem for them Because they basically passed in a zero row array into the next Step which then replaced all the data with nothing. So and that was a pretty big outage Then Atlassian had one. This is another example of where This is an example of where There was miscommunication. They were supposed to delete some accounts They passed in the wrong data like the one team gave the wrong data to the other team And then on top of that they had a script that said delete these accounts but Or mark them for deletion but in the same script was no delete them permanently and so they ran the wrong one and So not only did they pass in the wrong account information. They then deleted them permanently. So That was a difficult one for them to recover from So there's lots of different ones. This is one I personally experienced. I was at PayPal many years ago and We were Some of you may have had this experience So I remember sitting in the in the room when the consulting company that we had hired to tell us how to make a big change in our application basically PayPal was able to get to 50 million users on One database Yeah, so we are running on an oracle database on the biggest baddest you know Sun Solaris box that they could possibly have and The consulting company came and it said, you know In about six months There's nothing we can do to make this going faster and we're gonna run out of capacity And so the company literally stopped marketing everything because it'd be much worse for some financial type website to Not be able to respond to people and their money Then it would be to not have as many users and so we literally tried to put on the brakes But PayPal was so viral That even with zero marketing it just continued to grow and So we hit that sick month six months mark right about the time we were doing the update But it was a I mean that's a race condition. We were really worried about The website you know the database falling apart so anyway all the testing was probably not done very well and There were so many dependencies because we took what was a single database and we built APIs on top of it, but those APIs were really kind of All I can say is that we took processes and tried to split them out across multiple copies of the database and Once we put that into production. We didn't know what to expect and it turned out there was rolling blackouts essentially So about every couple days the website would just crash So I think More testing and perhaps chaos would have helped there as well. So there's a lot of examples and these examples are very expensive This is just a quick chart. I'm not gonna even tell you what the numbers are other big, you know, 30 million dollars per Outage that kind of thing. So the problem is is that there's It's so it's very hard to test all of the different combinations of things that need to test and so a Lot of times what happens is that we do the things that we know to test and what we think are important and at some point there's just not enough time and so We may not test things in fact I put this up here because Sometimes you just say I don't know what the problem is I don't know what the results would be if I don't do this I'm just not going to decide to do it now but the problem is you actually are making a choice you're making a choice not to do that and So You don't know what's going to happen But when some people say, you know, we're not running experience. 
I'm saying yes, you are you just don't know what they are And so there's a lot of different things that happen out there and I won't spend much time on the slide But yeah, there's there's just a ridiculous number of things that can happen. I remember You know the test that we used to do a long time ago just simply unplugging things, you know That was a pretty easy and cheap one to do Which kind of works but There's just so many things that can happen. You certainly can't just do them all yourself All right, so what are some real-world examples of things that might happen? Well, first of all, I was doing some research in the last week on all of this and one of the things I ran across is that Now this is over the past 25 years So it's definitely a lot of things have changed in that period of time But over the past 25 years according to the research I read a percent of all load losses Were in data centers were human cost All right, and I said this the very first person in our SRE team that I ran across I mentioned this to him He's like, oh, yeah Like what he's like? Oh, yeah, my first day of the job. I tripped over cord took out the entire data center I was like, oh my gosh really he's like, yeah, but if things, you know We've gotten better at this kind of stuff those cords aren't laying around as much but anyway There's other issues that that are real-world examples like you know resource exhaustion and you know I was at Redis and one of the things that we would exhaust is memory. Okay, and that would cause a problem, right? So That would be one type or or just your database fills up the hard drive, right? Or your log is fill up the hard drive. I mean, there's all these things hardware failure hardware failures, of course network latency Sudden increase or decrease in usage that could cause problems. I mean increase you can understand but even sometimes going down can cause problems Downstream dependencies, of course functional bugs. Those are some of the ones I mentioned possibly race conditions state transmission errors between senders Unusual combinations of interservice communications and I named this one This is not this is just me making things up here But the Dunning-Kruger problem when one node thinks it has all the correct information But it doesn't So it thinks it's doing all the right things, but it's not Okay, sir, some of these examples. How many of you have experienced possibly one of these examples Okay. Yeah, any example of the you see up you don't see up here Yeah, what's that air quotations? Oh Air conditioning, okay? Like heat. Yeah. Yeah Yes Excellent Yeah, air conditioning actually my first ever job was when I was in high school I went into a mortgage business and my job is to do the backup of the tapes You know, they're this cold room with all the tapes in it and I'd go on there and whatever So when I graduated I found somebody to replace me and at one point I Remember I didn't actually know that person that well, but I knew the people at the mortgage company And so at one point I'm like, hey, how'd that go with that guy? And he's like, yeah, not so good Like what happened? He's like, well, he got cold one day. So we turned off the air conditioning And they didn't figure that out for almost two weeks so they basically ended up with roughly about a week's worth of no backup So they were lucky that nothing terrible happened. Yeah, so he didn't last long Yeah, exactly. So all right now that we've talked about all that What is chaos engineering? 
All right, so — I'm going to read this to you — it's the practice of subjecting applications and services to real-world stresses and failures in order to build and validate resilience to unreliable conditions and missing dependencies. Okay, so that's what it is. This is from a Microsoft website; I think later on I talk about some Azure stuff, and I'm trying to mix in some AWS stuff too, trying to be fair to folks. And the next one is fault injection. So what is fault injection? It's the act of introducing an error to a system. Different faults, such as network latency or loss of access to storage, can be used to target system components, causing scenarios that an application or service must be able to handle or recover from. So chaos engineering is more like the overall practice, and as part of that you're going to run these individual experiments — those are the fault injections. Okay, so now, what about chaos engineering — where does it come from? Was anybody around when Adrian Cockcroft was giving those early talks? Anybody? Yeah, I happened to have been there. I work in Silicon Valley, and I run Silicon Valley DevOps, and I also ran this event called CloudCamp, and between those two events I gave lots and lots of people the opportunity to talk. Adrian submitted a talk to CloudCamp — I think it was CloudCamp — and I'd already known him from talking about other things. Of course we accepted his talk, and he gave, I believe, one of the very first talks about Chaos Monkey at CloudCamp. Over the years I've had him speak on different topics, but I remember I quoted him on a couple of things on Twitter and I think I got hundreds of likes — it was definitely a moment. So that's kind of cool, but it goes even further back, to maybe Jesse Robbins. He was one of the ones who was actually doing some of these things, like physically unplugging hardware to create tests at Amazon. And then over time Netflix ended up not only having the Chaos Monkey, which basically shut down instances — that's basically what I was doing — but then they would do other things, right?
They would create other monkeys that had different skills. They had the one that would shut down a whole region — that was the Chaos Gorilla — and they had ones that would make it look like a server was responding slowly, so over time it slowed down its responsiveness. They had all these different monkeys, and they ended up calling that the Simian Army. This was a big deal at the time; it was very, very innovative, and a lot of us were like, wow, I hope I can someday have a website that would deserve this kind of attention. But this concept definitely grew and grew, and then around 2014 they coined the term chaos engineering. In 2020 Amazon added reliability — which includes chaos engineering — to the AWS Well-Architected Framework. Litmus Chaos was donated to the CNCF in 2020 as well, and now it's at the incubation level. All right, so this is a very short and very biased history, but it's my history with chaos. Folks are definitely not just talking about it anymore, and it's no longer just chapters — there are multiple books about this; it has become a big deal, and by the way, there are some really good books in here. And if you haven't already checked it out, I think the AWS Well-Architected Framework is a good one to go look at; specifically there's the Reliability pillar, and there's a lot on this subject there, and some on chaos engineering. Yeah, I'll make them available — how am I going to make them available? At the end I have my contact information, so you can email me; that's probably the easiest way. Okay, so why are we seeing this now? Well, the way I think of these things is that it just takes time for some of these higher-level, longer-tail concepts to come to attention, but as we solve more of the immediate problems, we're able to start turning our attention to the ones that may not happen today but may happen tomorrow. Also, I think during COVID there were a whole bunch of new websites that spun up really quickly, had a lot of problems, and had a lot of people using them, so maybe all of a sudden we had a new appreciation for reliability. And then finally, just as Kubernetes makes it possible to spin up more container images — and your application might grow and have more surface area because it's easier — it's also easier to spin up an entire copy of your application and run chaos on that, which was not really possible before the cloud, or it was extremely expensive, because you'd basically have to own a second copy of all the hardware. By the way, this is also something we discovered as part of continuous delivery, which again is what our company does: you can run canary deployments, you can do rollouts, but if you want to do a blue-green deployment you're standing up an entire copy of your infrastructure, and that's something that's relatively recent. So it's kind of a similar thing. And by the way, if you roll out an entire copy to do a blue-green deployment, isn't that a great time to run a chaos experiment? You've got the whole thing right there, and you haven't switched it yet to be production. Perfect time.
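Here is a rough sketch of that "blue-green is a perfect time for chaos" idea. Every function in it is a hypothetical stand-in, not a real deployment API — the only point is the ordering: break the idle green copy on purpose, check that it holds up, and only then send real traffic to it.

    # Hypothetical sketch of gating a blue-green switch behind a chaos run.
    def deploy_green(release: str) -> str:
        return f"green-{release}"              # stand-in: would create the parallel environment

    def run_chaos_suite(env: str) -> None:
        print(f"injecting faults into {env}")  # stand-in: pod kills, latency, disk pressure...

    def health_check(env: str) -> bool:
        return True                            # stand-in: real checks would probe SLOs here

    def promote_with_chaos_gate(release: str) -> bool:
        green = deploy_green(release)   # full copy of prod, no users on it yet
        run_chaos_suite(green)          # run the experiments while nobody can get hurt
        if not health_check(green):
            return False                # didn't survive: keep traffic on blue
        # switch traffic to green here (load balancer / DNS / service mesh update)
        return True

    promote_with_chaos_gate("v2.3.1")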
Okay, so also, I think what's happening is that as this becomes more important, there are people who've been doing this for a long time, and they're like, we now all of a sudden have our moment in the sun, and we want a name for what we're doing. So — I'm making this up today — I'm now introducing to you "dev reliability ops." Okay, I'm joking. But just like DevSecOps, there are some people in the DevOps space who would say, isn't DevSecOps just DevOps? Shouldn't we be doing security as part of DevOps — does there have to be a whole separate thing? And I think it's the same with chaos: there shouldn't be dev reliability ops, it should just be part of DevOps. Maybe we'll spend more time on it than we used to, but that's the way I look at it. Also, I think it's important — from what I've seen and read — that chaos is actually a very good team-building exercise. Running those kinds of experiments requires working together and understanding each other, and it creates compassion for some of the people who have to deal with really big issues while others may not. So I think it's great for internal teamwork, not just for customers. All right, so here I just want to mention that with chaos we're also starting to see more open-source projects, and one of them is Litmus Chaos. It's a great tool for running your experiments, and I'll talk a little more about it, but this project actually started from another team — the team that created Chaos Native, which was mostly the same team that came from OpenEBS. They had built one or two chaos experiments on their own, for their own product, to test it, and that eventually became what is Chaos Native; so that's where it came from. One thing that's interesting about Chaos Native — or Litmus Chaos — is that it's very declarative, in that you describe what you want to occur. That's becoming more and more popular, so I think folks who were interested in a more declarative type of pipeline creation were attracted to it. And you can learn more about Litmus Chaos: it is a CNCF community project, there are lots of users, and the project's growing. A couple of things I'll just point out, which I can't really see here, but I'll try to get close — well, there you go. Scheduling all of these experiments is actually really difficult, especially if you're trying to decide in which order your experiments go and all that kind of stuff. So that's one of the main things you get from an open-source project like Litmus Chaos: you can actually schedule how, and in what order, you want the experiments to occur. All right, so let's get more into some of the key resilience practices. This is where we're going to get more into chaos and game days. Yeah, let me back up here for a second: the way I look at the field of chaos is as being part of the resilience practice.
Okay, so it's more than just testing your individual components and seeing how things respond It's also about just finding out is your entire team able to respond to these issues Because it's not just about Seeing something broke figuring out how you respond to that one thing with a technical fix it also could be Let's say it happens in the middle of the night and the person who gets the information Do they have what they need to be able to respond? Are they trained to be able to do it that they what they need to do it's a lot more than just follow these steps to fix the problem and especially if there's some sort of cascading effect that affects more teams and so That's why I put here like what are the steps to The key practice for resilience rather than just chaos because it's more than just the chaos It could be for example Game days which I'll talk about a little bit more in a minute and If you're doing a game day or even if you know for just real life you should have some sort of disaster recovery plan, right and So that becomes more about it on the on the technical side and then on the people side There's more about the team culture building So first let's talk about chaos engineering and game days all right, so Some some things you probably already know is that if if you're confident that your processes work and that if you're To deploy your application more often you're gonna do it more often, right? If you're not worried that you're not prepared For a failure for example, okay, you'll deploy more often the other thing that happens is if you deploy more often Then you also know that there are there's more Perceived risk All right, even though I think the evidence proves that when you deploy more often. There's actually lower risk There's still that fear of pushing the button and So if you know that you're prepared for some disaster and You know that you have a plan for recovering from it then you're gonna feel more confident and you're gonna go faster Okay, so I Call it swing in for the fences All right, if you feel good if you're a baseball player and you're confident in your ability to hit the ball You're gonna swing harder Because if you're not so confident you're gonna try to swing slower so that you Can kind of watch the ball coming in and get your bat on the ball and you're gonna go for a single or a But if you're confident in your swing You're gonna swing harder and you're gonna go for the fence and I think that that's the same that analogy I think works for me on deployment So you're confident in your ability to deploy you're gonna deploy more and more often And you're gonna get more done Your company's gonna move faster But you can't do that if you're worried about the failure if the failure isn't something that in your head you have confidence in recovering You might start swinging slower Which again, I think actually evidence proves actually creates more problems Okay And by the way just as a backup here for those of you who are a little bit newer to the subject Let me ask the audience here. So why would deploying? less often be likely to create more problems more changes but Go and those are both the right answers, but why most more change more changes what? per release Right, so more changes per release and we make more changes per release. There's more dependencies and From my own personal experience. 
I remember having a situation where somebody came to me and said, hey, we ran into a bug in something you wrote four months ago. So not only do you get more bugs or more errors, but then the developers have to go back and remember, what was I doing then? I don't even have that environment anymore; I've got to go figure out how to set that up. So there are just so many other issues that pile on and cascade from waiting to deploy. I just want to make sure those of you who are maybe newer know that. All right, where did I leave off? This is about where my slides get a little bit wonkier. Okay, so there's a book out called Accelerate, and one of the quotes I like is basically: build systems that are designed to be deployed easily, can detect and tolerate failures, and can have various components of the system updated independently. I think this goes without saying, but one of the nice things about having these pieces work more independently is that they have less of an immediate effect upon each other. Even so, if you do this as a best practice, you still need to test the entire system, because there are times when, for example, a system goes down and you don't notice the immediate effect because there's a queue somewhere. But now that queue is starting to back up, and at some point that queue might get to a point where it fails — or it doesn't necessarily fail, but the backlog is so long that there's never going to be a recovery. So even though those systems might look completely independent, that failure is still going to occur; they're not completely independent. Also, this is another quote I like: people keep talking about platform engineering taking over DevOps. This is one of those things where the pendulum keeps swinging one way or the other. So, pro tip: every time someone explains to you that DevOps is just CI/CD — basically just tools — one way you can show them really quickly that it's not just about tools is to ask them how game days fit in. Because a game day is your experience of responding to a failure, and even the best CI/CD platform on the planet is still going to have failures, and you have to have people who understand what the hell is going on so they can recover from that. So yeah, if you're not dealing with that part of DevOps, you have partial DevOps — there's another thing we're coining today, partial DevOps. I'm just kidding. All right, so what is the best way to understand your system? Science. Science is the best way to understand your system, and chaos engineering is essentially science: you're purposely going to run an experiment and see how it turns out. So let's lay out the basics here. First of all, as I said in the title, chaos engineering is a feature, not a bug — in other words, you're planning this, you're not waiting to find out. Again, this is a little bit of a summary, and I wanted to reword it because it was not my original material, but basically these are the four steps. You're going to carefully apply these failures with an explicit hypothesis — this is the science part of it, right?
So you come up with a hypothesis and you can test your hypothesis All right, so then You also want to Start off with a small blast radius right now They'll tell you that when you're doing when you're about ready to do chaos engineering Some people will say well, I'm not ready to do chaos engineering yet The answer is the question should be well, why not they're like why I need to prepare for that But that doesn't make sense because the chaos engineering is what helps you identify what you need to prepare Right or what you need to fix however You probably shouldn't just run experiments without much thought because the blast radius I mean you could take down Way more than you had intended Okay, so you have to start I would recommend we were all recommends starting small and then You got a communicative course to who is going to be impacted by that Right. You don't want to run an experiment on something in production That's gonna impact folks who aren't ready for that and They need to have a well-defined process and that goes about saying so When this occurs who's ready for the problems some problems are not going to be Anticipated we need to be ready for an unanticipated issues Okay, so in a nutshell Can't really read this here on my screen. I'm just gonna walk over here and read it off of here All right, so in a nutshell you're gonna select the experiments that you want to run You know you might specify a particular region goes down or something Then you're gonna run those experiments on a targeted system, which could be production. It could be your By the way, is anybody know in a blue-green deployment, which one is first? Is it For sure. Okay Because I think when they first came up with it I was reading when they first came up with blue green. I don't think they had chosen one in the beginning. They were That's why it's blue Okay So I was reading that about how they came up with it and they're like we just wanted two different colors But I do like that idea of having blue be first Because green feels like the newer one, right? Yeah What's that? It is but I'm telling you if you go back and read about it They didn't choose green because it was green, but anyway, it doesn't matter. I go with I'm going with what you said I think that's right. All right, so you could choose your targeted system and again a blue-green environment would be great One to do You can observe the results In that targeted system you want to learn from them Okay, and then you want to select the systems to well, I'm sorry you select the system to test first And then you run your experiments. Okay, that's a little bit out of whack All right, so So you know apply the scientific method, but how all right? So have you guys ever tried to teach what the scientific method is to other people? I did it with my 11 year old it was it was it was kind of interesting To teach her she was learning it in school, but you know when you're learning at school. I don't think they have a real Real world issue to deal with and so I was just trying to give her an experience, you know, let's come up with a hypothesis, right and It's kind of personal so I won't get into it But it was highly entertaining because when kids they come up with like the craziest results Right, they they don't have any real-world experience. So they come up with with results that are Just completely unbelievable And it's fun to test because obviously those things don't happen My daughter's a little bit on the fearful side. 
So we applied it to something that she was you know dealing with in sports and We were able to prove that it's not as scary as she thought Not the world didn't end when she tried it so But in in your your real-world experiments with with your team, you're gonna come up with a hypothesis and So first you want to be able to measure How the system changes during the experiment and it's important to understand How they behave now, so that's the first thing you need to do is document like what is happening right now So I'm comfortable understanding what I should be expecting And so you want to come up some baseline metrics, you know, it could be just something like response time or whatever, but some baseline baseline Then you want to think about okay Does this actually work the way I think it's gonna work like I have some assumptions about how this should work But let me actually be thoughtful about this and understand how this system actually works and And then you need to actually go and validate that Okay, because you're guessing most likely unless you're the one to build the system So you go and validate is this actually how it works and then you can start thinking about what kind of failures could I inject? to help me prove That it actually does indeed behave that way So one of the things you could do is simply run experiments to prove that the system behaves the way you think it does That's an excellent place to start Okay, because if you don't truly understand how the system behaves when you run your experiments and you get the results back It is really hard to understand what the impact was. What was the meaning? What are you now going to go and fix and change so you really do have to understand how it began out behave Or how it actually works by the way, I was just in a I was just in one of these Sort of like think tank type sessions you know I'll explain it just a group of people who are paid to sit in a room and answer questions about things My specialty is how Different Platforms have evolved over the years and one of the topics they brought up is how do you? How do you innovate? Okay, especially in an established industry and the one the one solution that was proposed that I liked was You get your product Which could be your application and you get the people who know the most about it in a room and then you document every single step of the process every single one and It's exhausting you could take a week All right, you document every single one and for every single one you asked do we need that? Is that necessary and You're ruthless about it And in this process You'll probably eliminate some stuff and you'll probably come up with some ideas about innovation But you could just do this with your own engineering team and your own application to understand what's happening So you can then design Experiments, and I was thinking about this in the case in the context of this talk and Probably we almost out of time No way, okay Geez, okay Sorry, I've got a lot more slides, but okay, so with the context of this talk I was thinking about how you go about designing a game day and why you're gonna get people together to do this and Simply understanding how your system works is is a great benefit You're gonna come up with parts of your process. You can eliminate or improve or Innovate all right. We do CICD. You know how many steps are involved in CICD? 
A lot. But if you think about how you can eliminate them: on one of my projects, as soon as I changed it to, how about the ten-click deployment project, people said, oh, we can do that. Right, so it makes you think differently.

Okay, so anyway, let me move along here. You've got to understand the blast radius of the experiments that you've chosen, and make sure you set up abort conditions, because things could totally get out of control and you'd better have a plan for that. Then analyzing the results, and then sharing the results. Okay, if you're involved in DevOps, you should know sharing is very important. So those are basically the steps, and again, there's a cycle. There's also a great Netflix ebook out there (if you email me I'll send you a link to it) which goes through a lot of this in a lot more detail, and one of the things they say that was different from the list I came up with is: start small, but then quickly increase the scope, because small may not be useful. Big is useful, but you have to be careful about how big you go. If you quickly increase the scope, you should be able to get there pretty quickly.

And of course automation is very important. You want to be able to automate this. Some people actually have this as part of their CI/CD pipeline, so it's something they're doing every time; they're running these experiments every time. Because if your application is big enough, with more teams, you don't know what's actually changed, and so you can't think holistically, "oh, that changed, I'd better go run that experiment now." You might just need to run them all the time. It's kind of like a build process with your continuous integration.

All right, so here's just a quick screenshot of the different types of experiments you might do if you're using LitmusChaos, for example; we call them experiments. And these experiments are now becoming broadly adopted beyond just LitmusChaos: for example, AWS has started to use LitmusChaos experiments as part of their FIS service. So these kinds of experiments are not just useful if you're using LitmusChaos, they're also useful if you're going to use the AWS Fault Injection Service. All right, I know I'm out of time here, but okay: if you want to contribute, please contribute, go check out our experiments. And if you contribute to one of them, you might find that somebody's running it in AWS, for example.

Right, for a quick demo: how many of you have seen the Bank of Anthos sample app? Anybody? Okay, well, it's a really good one; it's created by Google, and there are lots of reasons why it's good. I'm going to skip that right now. But basically this is a very simple example where you have an application, and here are all the dependencies. Very, very simple example. Banks have outages too. Now, they usually have everything separated into individual components, and one of the services might simply be a balance service, because you're using that balance all over the place, right? What if the balance service goes down? Okay, pretty simple example. Okay, so how does this actually work?
I don't know how other projects work, but I know how LitmusChaos works. What they're basically doing is putting in a proxy, and that proxy sits in between every communication between systems, or between components. So everything is going through the proxy, and if you're making a call to the, sorry, to the balance service, you can set up an experiment that intercepts that request and responds as if it had failed. Okay, and that's essentially what's happening in these chaos experiments: they're intercepting the request and responding as if it had failed. And this is even for databases. Databases start to get slow; it will actually intercept the request, have a timer, and then respond, and they'll set it up so it's slowing down over time, so you can actually see what happens if your database is slowly getting slower.

Okay, in this particular case it's just a failure of a service, and so you want to be able to respond. You don't want your whole application to go down, and you probably don't want to put up three dashes or whatever that is, right? So you're going to set it all up so that now, if that's down, I don't want my page to break, I don't want the dashes; you're probably going to want to put some better information there. And for that matter, you don't want people to have to stop doing other work.

And I happen to know that banks cache this piece of information, because when I was at Redis, one of the interesting things was that a lot of banks didn't want to replace Oracle, but that morning, when everyone checks their phone to see their balance, they have this huge spike around 9 a.m., and they just wanted to cache all of that. So they take your current balance and put it in memory. So that would be a good use case. I should share this with my Redis friends and say, look, you should run a chaos experiment, and then use Redis to cache that balance, so that even if the balance service goes down, people are still going to be able to see that amount. And what would you do? You would say, here's the balance. But you don't just leave it at that, because that could be very out of date; you would say, here's the balance as of this time. Okay, so you're protecting yourself. That's the kind of thought process you can go through: if you understand how the systems are working, and you understand what's happening behind the scenes, now you can come up with a hypothesis and test it. Can a cache actually solve this problem? It can, but wait a minute, what other issues are we dealing with? What about latency? This is no longer accurate information. So you put up the right additional caveat, "this is as of this time", and now maybe that works.

Okay, so there are obviously more experiments you can do. All right, I'm over time, but let me just take one minute here. I was going to talk more about game days. So, game days: with a game day you're basically doing this in a protected environment. You can do it with production as well, but usually you're going to set up a protected environment and run through this whole process with your team. But I wanted to share with you, just real quick, and if you want the notes to this, they're from Karthik, the co-creator of LitmusChaos. You can't really read this from here, but I just want to share these thoughts with you, because these come from actually running game days.
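(As an aside, here is a minimal sketch of that "cached balance with an as-of timestamp" fallback. The balance-service URL, account id, Redis key names, and TTL are all made up for illustration; it is just one way to express the idea, not anything from the Bank of Anthos app itself.)

    #!/usr/bin/env bash
    ACCOUNT=42
    if balance=$(curl -sf "http://balance-service.internal/accounts/${ACCOUNT}/balance"); then
        # Happy path: refresh the cache and remember when we fetched the value.
        redis-cli SET "balance:${ACCOUNT}" "$balance" EX 300 > /dev/null
        redis-cli SET "balance:${ACCOUNT}:asof" "$(date -Is)" > /dev/null
        echo "balance: $balance"
    else
        # The balance service is down (or a chaos experiment says it is): fall back to
        # the cache, but always show how stale the number is instead of three dashes.
        cached=$(redis-cli GET "balance:${ACCOUNT}")
        asof=$(redis-cli GET "balance:${ACCOUNT}:asof")
        echo "balance (as of ${asof:-unknown}): ${cached:-unavailable}"
    fi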
All right, so obviously define your radius Your blast radius to find the failure. So there's minimal damage So you you kind of can guess what kind of damage there will be so start off with minimal damage Try to do one variable at a time because then when you get the results You know that the result was because of that one variable that makes it much much easier to then go and correct Use a scientific method. All right Simulate as many real-world problems as possible. Here's some examples Resources this is these are like the popular ones that he knows it came to his mind Resources becoming slow noisy neighbor, you're on a server Maybe you have multiple of your own services running on the same physical machine. All right noisy neighbor can eat up resources Availability zones may go down instances may become unresponsive of shut sudden shutdowns This is one that I think everyone should be testing for black swan events. Oops misspelled that These are hard to come up with But it's a great exercise to think through because sometimes you'll identify some what you thought were black swan events that are not Black swan events. All right, so it's great to think about those Buildings going down regions going down AWS will tell you which like where their Regions are Google does not give you that much information. So if you're on Google, you may need to do a different type of test If you can test in production because that's where the most unknowns are right, that's where That's organic right people traffic all those things you don't have in a test environment Try to repeat them continuously. You don't know what's changed automate tests so that you can okay This is I think probably one of the most important things to do automate the tests Okay, it's it's super hard if you can't reproduce what you did so you got to automate the tests and Just keep in mind like I want to end on this thought which is that human beings are most likely the cause Okay, that's just the way it is We're out there Mix them up with these systems and we cause a lot of problems But this is something that Julie said so I want to share this with you from her You know, we have compassion When a system goes down like all the system failed. Okay, right, but when a person makes a mistake We don't necessarily have that same compassion Right, and I think we should I think Julie says it. She said she thinks that we should I agree with her So I'm not necessarily saying how you deal with that But you know blameless culture perhaps think of that way You know if it's blameless people are gonna be more likely to come forward and share what would cause the problem as well So I think I would leave you with that is that when you do these experiments. I think it's a great team-building exercise And you'll find out that oh wow that person's really dealing with some complex stuff Maybe you can have a little more compassion for them when it goes down Or even something simple because sometimes the simple things are things you overlook All right, that can happen Just have more compassion for the people you're working with. I thank you for staying overtime. I appreciate it And if you want I have my slides. Hold on a second. Yeah, here's my contact information Feel free to email me or get me on Twitter and I will make sure you get the slides Okay, let's get started. My name is Simon Almir. I am a production engineer at Metta I work with the artificial intelligence group. 
I'm going to be talking a bit about chroot and proot. So, I'm a production engineer; that means I work on the reliability, scalability, and efficiency of the systems at the company. In particular, I work on artificial intelligence systems, both training new machine learning models and deploying machine learning models into production. I don't work on the containers team, who would be the people with expertise on chroot and the topics we'll be covering today; I just think they're neat. Production engineers at the company tend to have a very wide breadth of experience, and it comes in handy if you happen to know a little bit about containers; it can be really useful on the job. For example, working in AI, I had a use case where I needed to isolate certain GPU devices from inside a container, so one container would have access to some GPUs and a different container would have access to a different set of GPUs. This was just something I learned externally on my own, and it came in handy in my job.

So we're going to talk a bit about chroot, what it is and what it does. We're going to have a lot of demos and examples in the talk, so hopefully the demos go well. We're going to quote-unquote accidentally forget the password on a Linux machine and then go and rescue that install, changing the password for it. I'm going to pull out an Android phone and run some standard Linux desktop-style applications on it, so we're going to play around a little bit with Android. We're going to use proot, which is a slightly more flexible, user-space version of these utilities, to fake some stuff in the OS, including pretending that we have a different CPU than what we actually have. I have, on a USB drive, a file system from a Raspberry Pi, and we'll be running some of its programs even though the Raspberry Pi has a completely different CPU architecture than what's in my laptop. And then at the end we'll cover a little bit about containers and what they do.

So containers are really great for isolating applications. You might think of a virtual machine as a more general version of a container. Where a virtual machine differs from a Linux container, the kind of isolation we're going to be talking about, is that a virtual machine has to load an entire operating system. If we're running a Linux application, we can just tell Linux that we want to run a Linux application without having to load another copy of Linux in a virtual machine. That's a container. Containers are also really useful for deploying software, since you can just ship an entire OS inside a container.

But let's just dive right in. So, chroot; I've heard a couple of pronunciations and I have no idea which one is correct. It isolates a program to a particular directory: it changes the root directory, hence the name. And we're just going to dive right into examples. What I'm going to do is create a couple of directories: a temporary directory and a bin subdirectory in there. And I have an application here called busybox; busybox happens to be a self-contained version of a lot of the standard utilities you might find on a system. I'm going to copy that into my directory here, and I'm going to chroot. I have to be root in order to use the chroot utility.
I'm going to change the root directory to be that temp directory, and then I'm going to run busybox inside of that temporary directory; it's now going to be bin/busybox, and I'm going to tell busybox that I want to run a shell. So I am now in a chroot environment. The only things I can see in this environment are what's inside that directory. If, for example, you were running a file server, you would probably not want that file server application to be able to access all of your system; you only want it to be able to access maybe a user's home directory or a web directory. chroot is a nice way to restrict that file server's ability to access the rest of the system.

So what can we actually do with a utility like this? Well, I built a small, probably minimal environment for running chroot, which just contained a single application. We could also chroot into an existing Linux file system. So what does that look like? I'm going to start up a virtual machine; this is just for the purposes of the demo, you could be doing this on physical hardware. And please feel free to interrupt if, for example, something I'm doing on the screen isn't particularly visible. So I have a demo user account on this machine, and the password is just "demo". I'm going to log in and change the password for this user, just with the console, with the command line. Let's go ahead and create something that's going to be a very secure password, like blah, blah, blah, blah, blah, blah, blah, blah, blah, blah. I'm going to paste that: the current password is demo, clear that out, paste in the new password, and log out. And now if I were to try to log into this account again, I am not able to. If this is the only account on this system, I would be pretty screwed. I would have to go and maybe reinstall the operating system; like, how would I access my files?

Let's reboot into an install environment. This happens to be the Kubuntu installer on a USB drive. I'm just booting into that USB drive and trying out the environment, so I get a regular desktop. There are a couple of things I have to do in order to gain access to the actual system on this machine. I have a CD-ROM drive, that's the live CD I have plugged into this virtual machine, and then I have the disk that's built into the system. So I need to mount some of this. I'll create a directory to mount this to, and I happen to know that the root part of the file system is /dev/sda2. So I'm going to sudo mount /dev/sda2 onto that mount-target directory. And now if I list out mount-target, we have all the files from the system, and if I look at it in the file explorer, this is the basic file system on this machine. So I could run applications if I chroot into this directory: sudo chroot into mount-target. And now I am on the file system of this physical OS. If I look at /home, I can see the two users that already existed on the system, and I can run a simple program like passwd demo and type in a new password. I'm just going to make it demo again, and now the password has been updated successfully. If I reboot the system and log in as demo, we should be able to log in using just the password demo.

If I wanted to do more complex things, I could mount some supplementary file systems. For example, if I wanted to install some applications or fix a GRUB configuration, or maybe I made an edit to fstab that I shouldn't have, let's add a little bit more into this environment.
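(For reference, the two chroot demos above boil down to roughly the following commands; the /tmp/newroot and mount-target paths are illustrative, and the disk device will differ on your machine.)

    # Minimal chroot environment containing only busybox.
    mkdir -p /tmp/newroot/bin
    cp "$(command -v busybox)" /tmp/newroot/bin/   # busybox bundles most common utilities in one binary
    sudo chroot /tmp/newroot /bin/busybox sh       # this shell can only see what is under /tmp/newroot

    # Rescue an existing install from a live USB: mount its root partition and chroot in.
    sudo mkdir -p /mnt/mount-target
    sudo mount /dev/sda2 /mnt/mount-target         # root partition of the installed system in this demo
    sudo chroot /mnt/mount-target
    # ...now inside the chroot, on the installed system's file system:
    passwd demo                                    # reset the forgotten password, then exit and reboot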
So there are some support file systems that will be important if you're running an entire OS inside of a chroot. I'm going to use mount --bind; this makes a part of the file system available in two different places simultaneously. The special file system /dev contains, for example, all of the disks that are attached to a system, as well as a whole bunch of other stuff, so I'm going to make that available as mount-target/dev. In addition to that, I'm going to do a couple of other ones: /proc, which contains information about processes that are running on the system, and /sys, which has configuration options and information about other devices in the machine. So now this is actually a fairly complete system. I can chroot into that system, I can run update-grub, and that will reinstall or reconfigure GRUB if I need to make changes there.

If I wanted to install some packages, I have to do a couple of additional things. One of the relevant configs for accessing websites on the internet is DNS; that happens to be configured in /etc/resolv.conf. At the moment, resolv.conf happens to be empty. I'm just going to make a backup of that, resolv.conf to resolv.conf.bak, and I'm going to write in just a name server. This is a DNS server; 8.8.8.8 happens to be Google's public DNS. I'm going to put that into resolv.conf, and now, if the Wi-Fi is being cooperative, I can apt update and apt install new applications in this environment. So apt install, let's do cmatrix, because that will be a fun little demo, and that gets installed on the actual physical machine; it's not installed in the live CD environment. So I can exit; cmatrix is not installed on the live CD, it's installed on the physical machine, and let's just reboot.

We're going to see two things when we boot into the original OS: that the password for the demo user has returned to normal, and hopefully that cmatrix is installed and accessible. It's taking a little while to restart; for the sake of the demo, I'm just going to hit the reset button. Don't do that on a real machine. So now we have the demo user, "demo" is the password that worked, and the program that I installed, cmatrix, is now available inside this environment. One last thing to do, since I replaced resolv.conf, would be to rename that file back to what it originally was; remove the one I customized and put the original back.

So that's just one of the really cool uses of chroot. Another really cool use is on an Android device, so let me plug in my phone. What I'm going to do now is grab Ubuntu base, and connect to my phone so you can see what's going on. If we want to run a desktop Linux OS on a device, we're going to need to download a version of it. The installer is going to be basically no installer; we're going to install it manually ourselves. But there are a few really simple options for doing that. Ubuntu happens to offer Ubuntu base, which is just the file system in a compressed tar archive that we can directly download and use on the phone itself. I have pre-downloaded a relevant image; arm64 is the CPU in this phone. So we're going to open up a terminal. I have to be root to do this, because chroot needs a privileged user in order to run. So we're going into the media directory. I have the commands on here; it's worth noting that these slides are uploaded to the SCALE site.
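(Condensed into commands, the extra rescue steps described above look roughly like this; the mount-target path is the same illustrative one as before.)

    # Support filesystems so tools like update-grub and apt behave inside the chroot.
    sudo mount --bind /dev  /mnt/mount-target/dev
    sudo mount --bind /proc /mnt/mount-target/proc
    sudo mount --bind /sys  /mnt/mount-target/sys

    # DNS for package installs: back up resolv.conf and point it at a public resolver.
    sudo cp /mnt/mount-target/etc/resolv.conf /mnt/mount-target/etc/resolv.conf.bak
    echo "nameserver 8.8.8.8" | sudo tee /mnt/mount-target/etc/resolv.conf

    sudo chroot /mnt/mount-target
    # ...inside the chroot, on the installed system:
    update-grub                          # reinstall/reconfigure GRUB if needed
    apt update && apt install cmatrix    # lands on the real disk, not in the live CD session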
So you will be able to access this after the convention, and I'll have links to relevant documentation at the end of the presentation, so don't worry about copying too much of this down. We're going to make a new directory for our Linux install. chattr is something I had to do on my personal phone, which I have rooted: some operating systems care about whether capitalization matters and some don't. Most desktop Linux environments care about the capitalization of files on the file system, and some file systems give you the option to toggle that behavior on or off at will. So what I'm going to do is make sure case sensitivity is set up the way we need on this directory with chattr +F. And now I'm going to extract the Ubuntu file system that I downloaded; that's just tar with all of its complicated options. The SD card happens to be mounted on this device at /data/media/0, so I take the downloaded Ubuntu base and use -C to extract it into the linux directory. So I am now, quote-unquote, installing Linux on this device just by extracting the file system.

And again, similar to what we had in the live CD environment, I need to add the supplemental file systems: mount --bind /dev into linux/dev, and similarly for /proc and /sys. And now I'm going to chroot. Rather than just chroot without any program specified, which would give you a basic shell, I'm going to run login -f root, which sets up more of a standard login environment; plain chroot tries to preserve some of the stuff from your original environment. So: chroot linux, and I'm going to run /bin/login, and the user is root. I am now running in an Ubuntu environment, a Debian derivative, on the phone. I can run standard Ubuntu applications like apt; let's apt update. This is going to fail with a lot of errors, but apt is running.

So let's actually fix those errors. Again, we have to set up DNS, so let's echo the 8.8.8.8 name server into resolv.conf. That's still not going to give us full network access: any application on Android that wants access to the internet needs to be in the inet group. This being a standard Ubuntu install, we don't have the Android groups or users in the environment, so I just get to add that. So groupadd, the group id is 3003, inet. There's also a raw-socket group, which we don't actually need right now, but if you're setting up other users, you might need it. apt downloads packages as an unprivileged user as a security measure, so we need to take the _apt user and change its primary group to the Android inet group. And now we are successfully downloading the package lists for Ubuntu. The underscore is part of the user name for this user; in some programming languages an underscore implies that it's a private or inaccessible variable, and this is analogous to that for users. They just prefixed the apt user with an underscore.

So that was chroot. It's a very powerful utility. Let's talk about proot, which basically works like a debugger. proot uses ptrace under the hood (I'm just going to move that out of the way), proot uses ptrace under the hood, the facility for debugging applications, and it doesn't require elevated privileges to run. proot gives you the ability to do things that are very similar to chroot and mount --bind; we've used both of those in the previous examples. And it can also do a binfmt-like trick, which we'll get to in a couple of minutes.
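(Roughly, the on-phone steps just described are the following. The /data/media/0 path and the 3003 group id are taken from the demo; details will vary by device and Android version, so treat this as a sketch.)

    # In a root shell on the phone (e.g. a terminal app plus su).
    cd /data/media/0
    mkdir linux
    chattr +F linux                                  # the casefolding tweak mentioned in the demo
    tar xpf ubuntu-base-*-arm64.tar.gz -C linux      # "installing" Ubuntu is just unpacking its rootfs

    mount --bind /dev  linux/dev
    mount --bind /proc linux/proc
    mount --bind /sys  linux/sys

    chroot linux /bin/login -f root                  # login -f gives a fuller login environment
    # ...now inside the Ubuntu chroot:
    echo "nameserver 8.8.8.8" > /etc/resolv.conf
    groupadd -g 3003 inet                            # Android's group that grants network access
    usermod -g inet _apt                             # let apt's unprivileged download user use the network
    apt update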
So one of the really cool things with proot is that we have the ability to bind files or directories without being a privileged user. lsb_release is a command which tells us information about the operating system that is currently installed. How it actually works is it reads a file in /etc; in the case of this machine, /etc/redhat-release contains the name of the OS and its version. Say, for example, I had a program that checked what version of the operating system it's running on, and I want to run it on a newer operating system, and maybe I don't have access to the source code of that program. Let's give it a slightly different view of the file system. So I copy this redhat-release file into my current directory and edit the file; let's change it to, I don't know, Not Fedora, release 99, there we go. And now we're going to use proot: -b is analogous to mount --bind, and we can do this with files as well as directories. So we're going to take the edited redhat-release file, make it available as /etc/redhat-release, and run lsb_release -a. And now, as far as lsb_release is concerned, we are running Not Fedora release 99.

proot also has the ability to do a chroot-like thing in an environment. I happen to have another copy of Ubuntu base here, the amd64 version, which is compatible with this laptop. I can proot -R; this is a shortcut which adds all of those things like /dev, /proc, and /sys to the file system for us, so we don't have to worry about that. I'm going to do that into the ubuntu-base directory. And now it's as though I had chrooted into this environment: I am in the Ubuntu base install, but I am not a root user, I'm just an ordinary user on the system.

So let's do that with a slightly more exotic machine. I'm going to plug in the disk from a Raspberry Pi and just mount it. And where did this mount? This happened to mount under /run/media, so let's go there, and rootfs happens to be the Raspberry Pi environment. Now, if I were to just try to chroot into this, that would normally give an error, except that I have a few additional packages on the machine. So let's first establish that I am running on an x86-64 machine. Now I'm going to sudo, let's just do the regular proot again, adding all of the additional file systems that we want, and I'm going to add this utility, /usr/bin/qemu-aarch64-static, because this is a 64-bit image. QEMU is an emulator: it lets us run something like a virtual machine of a different CPU, so we can emulate an ARM CPU on an x86 machine, or an x86 machine on an ARM CPU. This particular utility is QEMU user mode, which translates individual programs. So we can proot into this environment, and now if I run uname -a, as far as the applications in this environment are concerned, I am running an aarch64 CPU, and I have utilities that are only available on a Raspberry Pi. So I am now in a Raspberry Pi environment even though this is a completely foreign set of binaries for this machine. If I wanted to enter this environment as a privileged user, a slightly different set of options is helpful: -S sets up a shorter list of convenience mounts for us, and I can do the same login -f root. Oh, I need to be a privileged user, and I happen to have proot there. So I am now a privileged user on the Raspberry Pi file system, and I can apt install; let's do cmatrix again.
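(The proot invocations described above look roughly like the following; the directory names are the ones used in the demo, and the elided /run/media/... path depends on where the disk actually mounts.)

    # Override a single file without being root: -b is proot's version of a bind mount.
    cp /etc/redhat-release redhat-release
    sed -i 's/.*/Not Fedora release 99/' redhat-release
    proot -b ./redhat-release:/etc/redhat-release lsb_release -a

    # chroot-like environment as an ordinary user; -R also binds /dev, /proc, resolv.conf, your home dir...
    proot -R ./ubuntu-base /bin/bash

    # Foreign-architecture rootfs (Raspberry Pi) via QEMU user-mode emulation; -q picks the emulator.
    proot -R /run/media/.../rootfs -q qemu-aarch64-static uname -a

    # Fake root with -S, the shorter binding list that is friendlier to package managers.
    proot -S /run/media/.../rootfs -q qemu-aarch64-static /bin/login -f root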
And let's do xterm, because that's always a fun demo. So that's going to take about a minute. Do we have any questions in the audience at this point? Yes. So, proot is attaching a debugger to the application that we're running; in this case I'm running the login program. proot attaches a debugger and then monitors all the system calls that the application makes, and it can modify the results of those system calls. So when the program needs to execute a library, or when it needs to list a directory, we can modify the results of that listing with the debugger. That's how we, for example, changed the Red Hat release file: any time the lsb_release application listed that file or tried to read from it, proot intercepted that call and changed what the application saw.

So yeah, -q. This is how you specify the emulator that you would like to run. So -q with qemu-aarch64-static will run the binaries inside of an emulator, and -q specifies which emulator you would like to use. Yeah, so the man page for proot has the complete list, but -R adds quite a lot of additional convenience mounts for you. This includes resolv.conf, so we don't have to set that up ourselves when we use -R. It also mounts your home directory inside the environment, so all of the files you have, your configurations in your home directory, are also available in the isolated environment when you use -R. And when you use -S, you get a much shorter list. The -S option is a little bit safer for package managers, so you don't, for example, change the time zone on your physical machine when you're changing configurations in the environment. The paths with both of these options are theoretically read-write, subject to the regular UNIX permissions, so an unprivileged user is not going to be able to create new files in /dev. Yes, so bind mounts make a directory available in two places at once, so any changes that you make to one are reflected in the other, which is why it's important not to add too much stuff from your base environment into this isolated one.

Cool, so that finished. Let me go back to logging in as the unprivileged user. I can run xterm. Now xterm is running: the aarch64 version of xterm from the Raspberry Pi file system is running on this machine, and I can run cmatrix inside that environment.

So, people like to say that everything on a UNIX system is a file, and to a large degree that's true. But there are also a large number of shared resources on a system that aren't isolated when you use a tool like chroot. We're talking about things like how much CPU a program is allowed to consume. When a program accesses the network, that's also a shared resource; every program has the same view of what the network devices look like. So what do we do if we want to isolate some of these shared resources? We have two major tools for doing this. One is namespaces, which I'll cover now, and the other is cgroups, which I'll cover in a couple of minutes. Namespaces control what things are visible, and cgroups control how much of the resources we're allowed to consume. So, we mostly looked at what mounts are available when we talked about chroot. We can create a namespace for these mount points so that (and I will actually just dive right in) we can provide an isolated idea of what things are even mounted right now, in a chroot-like environment.
So if I bind mount something in an environment, that is visible to both the external environment, the base that's on the system, and inside the chroot. But if we're in a namespace, then the things that happen inside the namespace stay inside the namespace. So I've got this pre-typed. I'm going to run unshare with... oh yeah, I'm going to do the PID namespace. Another shared resource is what applications are actually running. If I run, for example, ps with a whole bunch of options, we see there are lots of programs running on this machine. unshare is a command-line utility which lets you interact with namespaces. I'm going to unshare, and I'm going to play with the PID namespace: PID, process IDs. --fork lets us run an application, and I'm going to use the same base as before, except I need to be in my directory there. Okay, so this did a couple of things. If I run ps again, now there are only two things happening inside this environment: the two applications that happen to be running are bash, and the ps command that generated this list in the first place.

And I forget, did I do this? Well, let's try this. Let's mount a new dev in this environment and take a look. Okay, so we have two things going on right now. Inside of this unshare environment, I just mounted dev, so this is a fresh new dev mount point. If I list /dev, we have a whole bunch of interesting stuff happening inside of that dev directory. But if I look at the equivalent path outside of that isolated environment, we have basically nothing happening in that directory. So the things that are mounted inside of the environment are different from what's visible on the system outside. That's kind of what you get when you use namespaces. And there are lots of different kinds of namespaces. I talked about mount, I talked about PID. You can also have an isolated view of the user IDs on a system, and you can have an isolated network environment, so that the network configuration inside the (now) container can be different from the network configuration on the base host. That's a basic introduction to namespaces.

Now we'll also talk about cgroups. cgroups are a little bit complex to play with. There are two versions of cgroups implemented in Linux: cgroup v1 and cgroup v2. cgroup v2 is the default now on most desktop Linux distributions, including Fedora, which I happen to be running on this machine, and cgroup v2 makes it easier to manage multiple resources at once. So all of the examples here apply to cgroup v2. cgroups let us control how much of a shared resource we're allowed to consume. This includes CPU, like what percentage of CPU utilization we're allowed to use inside of a cgroup, how much memory those applications are allowed to use, even more exotic things like how many processes an application is allowed to spawn. But let's play with it, then. If you don't already have cgroups set up, you can just mount it; it's a cgroup file system type. But this machine happens to have it already configured out of the box, and cgroup v2 happens to be mounted at /sys/fs/cgroup. So we're just going to hop into that directory. There's a whole bunch of stuff that we can configure here, including a lot of different cgroups pre-configured on this machine, but we're going to do our own from scratch. And how this works is we just create directories and write short contents into files.
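(Before the cgroup part, a rough sketch of the namespace demo above; the exact unshare flags used in the live demo may have differed slightly.)

    # New PID + mount namespaces. --fork makes the child PID 1 in there,
    # and --mount-proc remounts /proc so ps only sees this namespace's processes.
    sudo unshare --pid --fork --mount-proc bash
    # ...now inside the namespace:
    ps -ef                           # just this bash and ps itself
    mount -t tmpfs none /mnt         # mounts made in here are invisible outside
    touch /mnt/only-visible-in-here
    # ...in another terminal on the host, /mnt is still empty.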
The cgroup file system works in a very simple way if you're used to editing things on the command line. We can also change the permissions on the directories and files in the cgroup if we wanted to give an unprivileged user the ability to change the limits within this environment. You can create sub-groups inside of it as normal directories, so you can delegate some amount of resources to a user, and then they can sub-delegate a subset of those resources to additional applications that they want to run or constrain.

So let's actually do a demo; this will be fun. I'm going to load up top so we can see the CPU utilization on this machine, and just move that down to the bottom. Okay, now let's run a program like stress-ng -c 0; I need to be in my home directory to run that. stress-ng is a benchmarking program, or rather a stress-testing program. It's going to load up all of my cores, and we can see that basically all of the cores available on this machine are effectively fully utilized. Let's try to constrain the resources stress is allowed to use. So let's cd back into /sys/fs/cgroup and make a new cgroup; I'm just going to call it demo. That is now just a normal directory in here. We have a demo directory, and when we cd into that directory, we see that magic happened: there's now a long list of files that were automatically created for us in this directory that let us configure this cgroup.

There are a couple of things I need to do in order to get an effect from the cgroup. One is I need to configure a limit, and the other is I need to define what stuff should be restricted. So let's handle defining the stuff that should be restricted first. What I'm doing here is running pgrep stress, which prints out the list of all of the process IDs for the stress program that's currently running, and I'm going to write all of those into the file cgroup.procs. What this does is it adds all of the processes that stress launched to this demo cgroup. I'm also going to take the process ID of the shell that I'm currently running and put that into the cgroup as well. So now we have, in this cgroup.procs file, a list of the process IDs that are inside of this cgroup.

Then there are the resources we can constrain. cpu.max is one that's particularly fun, considering I'm burning the CPUs in my machine right now. Let's see what that looks like. It's just a file, and it's telling us (this is in the documentation for cgroups) that it's allowed to use the maximum available CPU for every period of 100,000. We can edit this. Let me just write... let's make that 100,000 out of 100,000, and write that into the file. What we should see happen is that the CPU utilization now drops down by a lot. 100,000 divided by 100,000 is one, so the total amount of CPU that all of these processes, as a group, are allowed to consume is one CPU. If I wanted to allow it to run maybe two CPU cores' worth of compute, I can just change that to a two, and now it's allowed to consume two; there are 16 cores in this machine, so 16 cores' worth would be roughly equivalent to max. We can see that reflected in top down below. We can also greatly restrict the amount of CPU it's allowed to consume, like that: 10,000 out of 100,000. So now everything, including this shell, only gets a very small amount of CPU granted to it. That's why pressing enter was very slow, because it's only allowed to use a very small fraction of a core.
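(Roughly what that demo does on a cgroup v2 system. Note that cgroup.procs only accepts one PID per write, hence the loop, and the cpu controller may need to be enabled in the parent's cgroup.subtree_control first.)

    stress-ng -c 0 &                                    # one worker per core, all cores busy
    sudo mkdir /sys/fs/cgroup/demo                      # creating the directory creates the cgroup
    cd /sys/fs/cgroup/demo
    echo "+cpu" | sudo tee ../cgroup.subtree_control    # make sure the cpu controller is available here

    # Define what should be restricted: move the stress workers (and this shell) into the cgroup.
    for pid in $(pgrep stress); do echo "$pid" | sudo tee cgroup.procs > /dev/null; done
    echo $$ | sudo tee cgroup.procs > /dev/null

    # cpu.max is "<quota> <period>": 100000/100000 is one full core for the whole group.
    echo "100000 100000" | sudo tee cpu.max
    echo "200000 100000" | sudo tee cpu.max             # two cores' worth
    echo "10000 100000"  | sudo tee cpu.max             # a tenth of a core; everything crawls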
And if I kill the CPU stress, it's a little bit more responsive, because it was sharing the CPU budget with stress-ng. And any application that I run in this shell is similarly constrained. So if I run a new copy of stress, again from the home directory, this new copy of stress is also constrained in the cgroup. So you can't just get out of your resource limits by launching a new program.

Those are all the demos that I had. I will take some questions, so if you have any questions, feel free to ping me. I'm going to leave a slide of additional resources up. The slide deck is available on the SCALE site, and it includes links to the documentation for chroot, proot, namespaces, cgroups, and all of that stuff. So these are the basics of a Linux container environment: if you're running something that looks like Docker or LXC, it's going to be using a lot of these kinds of tools to do that. So, questions.

Oh, that's because I ran it through sudo, so sudo escalates the privileges there. With these utilities you can also change the user inside of the environment, so I could su to a different user when I'm in that environment as well. Not as far as I know; you need certain capabilities in order to create a new namespace. Those capabilities are available to the root user; they're not normally available to unprivileged users. The Linux capabilities framework can give a normal user the capability to configure a namespace. Or you could have the root user create a namespace, go inside of that namespace, and then lower its own privileges to run as a normal user. So there is a way to create a namespace without then diving into that environment: if you check out the documentation for unshare, you can get a lot more information about creating a namespace without the --fork option. The --fork option is what did that for me.

Yes, so I've been using the chroot command-line utility, and the version that's in newer coreutils has a few safety protections available to it. As far as escaping chroots: chroot is also available as a system call, so any application (any C application, and I think Perl also supports it) can call chroot and change its root directory. What happens then if you delete that directory? Like, if you chroot into a directory and then delete that directory? Most Unix systems, I think, will actually let you do that. And if you are executing inside of a deleted directory, what happens when you cd ..? That can actually break out of a chroot environment. So chroot isn't perfect as a security mechanism, and some of the modern OSs do provide guard rails in a utility like the coreutils chroot. But that's where mount namespaces come in and are a little bit better, because inside of a mount namespace you can completely change the visibility of the file systems that are mounted, and you can unmount things that the namespace should not have access to.

So the question was: in my example I entered an environment that was not encrypted, so what happens if you have an encrypted disk? The short answer is you will need to decrypt it somehow. On a Linux live CD, like the Kubuntu installer, if I open up the file browser, like Dolphin, I can click on a file system that's encrypted with, for example, LUKS, and it will open up a prompt asking me to type in the password for that file system.
If I know the password and I type it correctly, then it will give you a new device in slash dev, and it will mount it that will give you the unencrypted view of that file system, and you could chroot into that. So the question is about why does the chroot command line utility require root permissions? That's because changing what is visible to a program gives you some additional control over the program, so you don't want anybody to just be able to do that. It comes from, I think, older Unix systems where the privileges were reserved for the root user. When things like proot exist today, it's perhaps not as useful a restriction as it used to be, but yeah, I don't know why it still has the same restrictions, most likely for backwards compatibility so that we don't change like APIs or add functionality to things that should not have that functionality. The question is, do ACLs come through in a chroot environment? Yes, they do. So if you have a file ACL attached to a particular file in a chroot environment, that would still be the same file ACL. And finally, I will leave you with one additional thing, a question that didn't occur to anybody to ask, but if you have to be root to run chroot, and I did my example on the Android device with a rooted Android device, what if you don't want to root your Android device? Could you use proot on that Android device? The answer is yes. So there are additional links there if you want to run a Linux environment on an ordinary stock Android machine. Thanks. Hello. Is it me you're looking for? I can see. Yeah, there's an art computer one. Hey, look, there's a metaphor, sir. I love reading, sir. We've heard it's worse. Yeah, we have. Why are we setting an ACL? Why? And I really look for anything. Can I get your back stand? Oh, come on. That's fine. Yeah, I didn't bring it. Okay. There you go. Sorry, no. The thing where you find your way to the machine. It's the thing came off. I kept no idea how to solve any of the things. I don't think that's what I'm saying. No, they just... I'm surprised. I know, I know. And I will do the few of them. That's the art. I mean... Yes, the demo. I know. You can do it yourself also. I was going to say, like, we should have fun. We should have installities in the laptop also. Yeah. Well, I mean, I had prepared it. Give the talk, I have. It's better this way. Six minutes? No. Everybody loves system D, right? Yes. I had the shirt. You have one? Oh, yeah, you have one. We just put it right there. Displayed it. Two minutes, four minutes. Okay. Hello. Hi, okay, so we're going to get started. So everybody sit down. I'm really surprised of having this room so full. Either people really like system D or really hate system D. And I can't really... No, no, no, no. No one till the end of the talk. Okay, cool. I'm Alvaro, she's Anita. We are here to talk to you about journey to the heart of system D, right? And that's kind of like the sales pitchy or the clickbait title. We're just going to do like an introduction of a little bit of what is system D and what are the less known features of system D, right? A small disclosure here is that both Anita and me, we both work at meta. And even though in meta we do use and abuse a lot of system D, this has nothing to do with how we use system D. This is plain vanilla system D and this is how you in your own life can use system D. And by the way, if somebody is counting the times that we say system D, at the end we're appreciated. So that's the meta logo in case you haven't seen it. 
And let's move on. A little bit of agenda, right? Since systemd is mostly a service manager, let's talk a little bit about what a service manager actually is, and then let's talk about how systemd fits into this space. We're going to do examples, which is basically going to be a side-by-side: how you used to do things before systemd, and how you do it now with systemd. And then we're going to do a live demo, which is always fun because of the thrill of doing it live. Cool. So with that said, let's just get started.

So what is a service manager? Nothing fancy here: a service manager is something that manages your service, right? So in general, say you have Postgres; a service manager is the thing that makes sure that Postgres is alive, or that Postgres is down, or that Postgres gets started before you start your web server. It kind of manages the operations of your service, and by operations we mean actual verbs, right? It also makes sure that your service starts and stops whenever it should; that's what we call lifecycle. So for those of you who have been around a while, you remember that in the good old days... did I strike a nerve? Well, I was there. I was there for the final part of it. Okay, so in the good old days, you used to put bash scripts into /etc/init.d. And nobody really knew why we did this; it was just convenient, so we put them there. And that was our service manager, right? So you were your own service manager at that point.

So you'd write stuff like this thing over here, which is, yes, thank you very much, the init script for Cassandra; it's taken from Debian, right? So here is one of the things: in this script you would define verbs, and this is just a shell script, so your verb is basically however you decided to parse the command line argument that you passed to the script. So you have start, stop, restart and force-reload. And here's the first thing that is wrong with this approach, or that we did not like that much: these verbs are not transferable from one service to the other. You can name something "start", or you can call it "go", and that is not transferable from one service to the other. Also, here restart and force-reload are the same thing, but I'm pretty sure that in a web server you don't want that. You don't want that in Cassandra either, but that's neither here nor there.

So if we now zoom in on what start is, there are two problems here. One, there is a lot of implicit knowledge: you see there that there's an environment variable that you have to know exists, and then you have to set it up, and this of course is not the same for all services, so you have to carry a lot of implicit knowledge about the thing that you're doing. The other thing is that there's a lot of boilerplate code, because the only thing that actually starts Cassandra is line 110; everything else is either dealing with command line parsing or deciding what to log to stdout, basically just boilerplate. And even if we zoom in further into start, this is not standard at all, so it's really hard to answer questions like: what user is Cassandra running as? In which directory is it running? Let me draw your attention to the last two lines: there are two start-stop lines. Why? There's a reason, of course, but only because somebody put them there, and one returns one and the other returns two if it failed, so it's weird, right? So, yeah.
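(To make the pattern concrete, here is a heavily trimmed sketch of the kind of /etc/init.d script being described. The app name, user, and paths are made up; a real script like Cassandra's runs to a few hundred lines of this.)

    #!/bin/sh
    # /etc/init.d/myapp : illustrative only, not the real Cassandra script
    case "$1" in
      start)
        # All the implicit knowledge lives here: user, working directory, environment, pidfile...
        echo "Starting myapp"
        su appuser -s /bin/sh -c "cd /srv/myapp && exec ./myapp" &
        echo $! > /var/run/myapp.pid
        ;;
      stop)
        kill "$(cat /var/run/myapp.pid)" && rm -f /var/run/myapp.pid
        ;;
      restart|force-reload)
        "$0" stop; "$0" start        # the two verbs often end up identical, as on the slide
        ;;
      *)
        echo "Usage: $0 {start|stop|restart|force-reload}"; exit 2
        ;;
    esac
    # Note there is no standard way to ask "is it running?": no status verb here at all.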
And of course, if you actually want to answer a simple question like "is my service running?", there's no way to do it, right? I ask again, did I strike a nerve? Okay, cool. So yeah, this is it. Now, this also has a lot of issues because these are shell scripts and there's no dependency information in them. So when you start a machine, or when you stop a machine, these things have to run in sequence, because there's no dependency between them. And you can impose some sort of ordering with numbers, but in general that doesn't make much sense, so startup and shutdown were slow. Also, debugging shell scripts is hard, right? Because after a hundred lines it's really hard to know what is actually doing something and what is just doing control flow.

So in the Unix world, or *nix in general, we kind of figured out that to take things into the 21st century this was not going to be good enough. So a few new init systems popped up; Upstart and systemd were the biggest ones. They were the emerging technologies, they went into a feud with each other, a healthy competition, and eventually systemd came out on top. So systemd won. And this is kind of where we start with this talk: whether you love or hate systemd is kind of beside the point. systemd is there, and it is on your systems. So I think it's pretty safe to say at this point that systemd is here to stay. It's in all of our major distributions, and the idea behind systemd is that you just tell it what commands you want to run, and you don't have to worry about a lot of the boilerplate that you had to deal with when you were writing init scripts; systemd will attempt to choose the best way to run your service the way you tell it to.

So here's an example of a systemd service unit. If you want to run the service as a certain user, you just set the User property to the desired user. If you want to change the working directory of your service, you just use the WorkingDirectory property. And this is one of the few areas that still looks bash-ish: this is the actual command for your service, you just fill out the exact command. Notice how there's pretty much no bash left at this point. Another cool feature of systemd is the ordering. This is something you couldn't really do with init scripts; people faked it by adding numbers in front of their scripts and things like that. With systemd you can tell it to run your service before or after other services, depending on how your dependencies look. And then you can have your service started; in this case we're using multi-user.target, which will start this service, myapp, as part of all of the user-space services.

Here's an example of how you start myapp.service on the command line: you use systemctl start myapp.service, and systemd will do all of the setup required for your service and then fork and exec it. This stops myapp.service: systemd will send signals to shut down your service, usually SIGTERM and then SIGKILL, and then clean up the state from your service. For consistency we're going to run start again, and then if you run systemctl status, you get a pretty nice overview of what's happening with your service and whether it started or not. In this case we see that the service was running; we took this snapshot about three years ago, and at the point when we took it, the service had been running for seven minutes. So systemd keeps track of all of this state for you.
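(A minimal sketch of what such a unit and the commands around it could look like; the unit name, user, and paths are made up for illustration, not the slide's exact content.)

    # Write an illustrative unit file, then drive it with systemctl.
    sudo tee /etc/systemd/system/myapp.service > /dev/null <<'EOF'
    [Unit]
    Description=My example app
    After=network.target

    [Service]
    User=appuser
    WorkingDirectory=/srv/myapp
    ExecStart=/srv/myapp/myapp --port 8080

    [Install]
    WantedBy=multi-user.target
    EOF

    sudo systemctl daemon-reload
    sudo systemctl start myapp.service
    systemctl status myapp.service      # state, uptime, recent logs, and the service's process tree
    sudo systemctl stop myapp.service   # SIGTERM first, SIGKILL if it refuses to exit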
You can also see below a brief overview of the process tree for your service, so these are all of the processes that have been started as part of myapp.service. A lot of this is made easy by cgroups, a Linux kernel feature which systemd makes pretty heavy use of. This would not really be a systemd talk if we didn't talk about two things: cgroups and socket activation. Let's get cgroups out of the way early in the talk. A cgroup is basically a control group, which is a mechanism the Linux kernel has to take processes and apply restrictions to them: how much CPU, how much memory, how much network can they use, well, you get what I mean. So that's how the kernel does it. And systemd definitely uses that part of cgroups, but it also does a very clever thing with it. For every service that you start, like myapp.service, it creates a special cgroup for that service; in this case, it will create system.slice/myapp.service. And then every process that runs as part of that service gets added to the cgroup. So you have the main process, and if the main process decides to have children, they're also part of the process tree, but they're also part of the cgroup. And if one of those children rebels and decides to double-fork itself and escape the parent, well, it escaped the parent, great, but it's still part of the cgroup. So this is a nice way for systemd to keep track of all the processes.

Now, I don't know if you remember, but in the old times we used to have an Apache server, and that thing used to have CGIs. And CGIs were basically just scripts that got executed, right? And being scripts, they could do whatever they wanted, so a lot of the time they would just double-fork themselves. And then somebody says, restart Apache. You restart Apache, or you stop Apache, but all these side CGI processes are still hanging out, right? Now, with systemd you have the main process, you still have the children, you still have that process tree that is easy to track, but systemd can also keep track of all the processes. So when you say, systemd, can you stop my service, systemd can actually track down every single process that you have and then just kill the cgroup. And this is kind of crazy, because if you think about it, before systemd we had no reliable way of stopping a service. So that's something that, at least in my mind, really stands out.

All right. So now I'm going to talk about dependencies in systemd, because one of the goals of systemd was to make your boot more performant by parallelizing a lot of the starting of services. The way it does that is it provides properties that let you configure both the ordering of your services against other services and whether you require or depend on another service in order to start. And so by doing that, systemd takes all of that information, creates a kind of tree internally, and uses that to parallelize a lot of these starts and stops when a unit is requested. This makes the boot and shutdown process a bit more performant than, you know, having to synchronously run everything. So here's an example of ordering using the After and Before properties. On top we have app.service, and it wants to run after db.service. You can specify that in two different ways: in app.service you can use After=db.service, or in db.service you can put Before=app.service.
Now, if you try to start both of these services at the same time, they will be ordered the way you specified, with app.service coming after db.service. If instead you want app.service to depend on db.service, you can use the Wants property. What this does is that if you try to start app.service, it will also start db.service, and the ordering is respected as well. However, Wants is a weak dependency: if for some reason db.service fails to start, app.service will just keep going. If you want something stricter you can use the Requires property; that way, if you start app.service and db.service fails, then app.service will also fail to start. And of course, as you will eventually learn with systemd, there are multiple properties for specifying all the different granularities you want for ordering and dependencies; you can read more about that in the systemd.unit man page. All right, so now I'm going to talk about journald. systemd-journald is a pretty central part of systemd: it's the central logging system, and it aims to solve a lot of the issues we had with syslog and traditional Unix logging. syslog was relatively insecure: you could tell it you were whatever PID you wanted to be, and syslog would believe it and log it to disk. There was no structure to the data that was logged; basically everything was just a text file, which was really easy for humans to read but pretty inefficient to process programmatically, because nothing was indexable. And with syslog you need an extra service to handle the file system getting full and to manage rotation, because it doesn't really have rotation built in, and its rate-limiting features were, well, limited. systemd-journald is one daemon that attempts to fix a lot of these problems. Here's an example of a journald config: you can still tell it to forward the journal logs coming from your services to syslog, you can change the runtime max use, you can choose where you want your logs stored, and, like I said, there's a whole multitude of properties that are not listed here; you can visit the man page to figure out how to configure the journal. So how do you get your logs into the journal? Well, anything that's logged to standard out and standard error in your service ends up in the journal. So for example, if we have an ExecStart that echoes "hello world", then when we run journalctl you'll see the "hello world" printed in the journal. You can also programmatically write to the journal. systemd comes with a utility called systemd-cat, and if you pipe things to systemd-cat, they get logged into the journal at the log priority you specify. If you use something like pystemd, which is a Python library for systemd, you can modify the different fields and metadata in your log message before you log it to the journal. And if you use C, systemd ships with the sd-journal API, and with that you can also modify the metadata that gets sent with your log. On the right you can see a lot of the metadata that comes for free with each of your systemd units: the fields that start with an underscore are logged by systemd for every unit without you having to do anything. So in addition to your log line you also get things like the PID, the hostname and all kinds of other things that are listed there. Now I'm going to show you some examples of how we query the journal.
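(A sketch of the two units from that example — the ExecStart lines are invented, but the ordering and dependency properties are the ones being described:)

    # app.service
    [Unit]
    After=db.service
    Wants=db.service        # weak: app still starts if db fails
    # Requires=db.service   # strict: app fails if db fails to start

    [Service]
    ExecStart=/usr/bin/app

    # db.service -- the same ordering could instead be declared here
    [Unit]
    Before=app.service

    [Service]
    ExecStart=/usr/bin/db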
Like I said — well, I don't know if I actually mentioned this — every field that comes with a journal log is indexable, and that includes the timestamp. So if you want to filter logs since July 12th until July 13th, you can do that with journalctl --since and --until. If you want to change the output so that you see more than just your log line, you can use --output; in this case I'm using verbose, which gives you a listing of every field that comes along with your log line. If you want to filter by a certain process ID, you can do journalctl with the _PID field; in this case that will show me all of the logs from systemd itself. And of course you can combine all of these filters: here I'm looking up the logs from systemd-logind, which was running as PID 1632, since July 12th, and that will show you exactly that. The other cool thing about the journal is that it actually collects logs from your entire system, not just from systemd services. So if you want to look at the kernel logs you can use journalctl --dmesg, and -b means since boot, and that will show you all of the kernel logs since boot. Okay, cool. So now we start digging deeper into systemd. Let's talk about a few features that systemd has. One of the most overlooked features is templates, and it gets overlooked because it's very simple, right? So people tend to just ignore it. As we saw, if you want to create a service unit you name it something.service. Now, if you put an @ symbol at the end of the name, and then you put these little %i placeholders in the unit, what happens is that when you start your app, after the @ you can put a word. And that word can be anything — you define it however you want. It could be a configuration file, it could be a path, it could be a user, any word. Whatever you put there is going to replace the placeholder. So if I put conf1 there, you'll see that wherever I put %i, conf1 appears. That means that with one very simple unit you can run different web servers: this is a web server running conf1, but if I do conf2 it will run whatever the definition of conf2 is. Another very common pattern is that the thing next to the @ is a user. So if you have a process like tmux or anything like that, and you want to make sure this process runs as a certain user, you start my-app@ plus the user name, and it will run as a different process but with the same definition. Pretty powerful for what it is. And then the promise of socket activation. This feature, to be fair, is not a systemd invention, but as with many things, systemd uses it a lot. So let's look at how a normal application works. You have your app; your app is a web server, and at some point it's going to bind to a port. So it has to open the socket, it has to bind, and then it has to listen, right? That's how it normally works. Then a request comes in and you can serve it. So you see there's a workflow: your app, the port, the request. Now, what happens if your app for some reason decides to die? There's nothing listening on the port, so when a request comes in, there is sadness. So now: socket activation.
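(A sketch of such a template unit — the file layout and config paths are made up for illustration:)

    # my-app@.service  -- note the @ in the file name
    [Unit]
    Description=My app, instance %i

    [Service]
    ExecStart=/usr/bin/my-app --config /etc/my-app/%i.conf
    # (or use User=%i to run the instance as the named user)

    # systemctl start my-app@conf1.service  -> %i becomes conf1
    # systemctl start my-app@conf2.service  -> %i becomes conf2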
So socket activation changes the paradigm a little bit. Instead of your app being the one that starts listening on the port, the port starts existing first, and systemd listens on it. You tell systemd to listen on port 80, and when that first request comes in, systemd starts the service, and instead of letting the service open the port, it just hands it the already-open socket descriptor. Your app gives the response, and that's great. The first advantage of this is that the thing that opened the port and the thing that is using the port no longer have to be the same process, so they can run as different users: systemd runs as root, and your app can run as a simple unprivileged user. That means your app no longer has to be privileged at the moment it binds to the port; the root user opens the port, and your unprivileged app just uses it. Now, if you take this and mix it with the templates that we just saw, you can make some pretty cool blue-green deployments. Basically you run my-app@versionA: the thing after the @ is the version of your application. That version could be a file on disk where you have a different binary, or it could be a commit in GitHub — whatever definition you have. systemd owns the port and forwards traffic to it. When you want to upgrade your application, you just start the application with a different parameter, and that loads version B. Then, when you decide that version B is healthy enough to be the one in production, you say bye to version A and version B becomes the official one. And now you have a very cool, simple deployment process with very little configuration, except for the small management script that does the rollout. So how do you actually build these things? If you want to start a service you write a .service file; for socket activation you also write a .socket file. There are many types of unit files in systemd. There's one I recommend reading about that we're not going to touch here, called path activation, which does exactly what you think it does: when somebody accesses a path, some service gets started, and it's pretty cool. But yeah — so this is my socket unit. As you can see, it has the same [Unit] section the service has, plus which port and which address it's going to bind to. And then you see it also has ReusePort, which is the exact feature that allows two applications to start listening on the same port; this is what the kernel uses to round-robin between them. So how do you do this in code? Because we still have to write the application to understand this. Traditionally, when you write a web server, you open a socket — that's the socket statement there — then you bind to an address and a port, then you start listening, then you accept a request and start processing it. This is your application; the part in white is the logic for opening the port. Now, since you don't have to open the port — systemd will already give you a socket — instead of opening it you say socket.fromfd, and systemd sets a nice little environment variable that tells you where that socket is. You specify the type of socket, and then your application just moves on as it normally would. So, programmatic access.
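(A minimal sketch of the receiving side in Python. systemd passes inherited sockets starting at file descriptor 3 and advertises how many there are in the LISTEN_FDS environment variable; pystemd and sd-daemon have helpers for this, but the bare version looks roughly like this:)

    import os
    import socket

    SD_LISTEN_FDS_START = 3  # first fd systemd hands to a socket-activated service

    # LISTEN_FDS says how many sockets systemd passed us
    nfds = int(os.environ.get("LISTEN_FDS", "0"))
    if nfds < 1:
        raise SystemExit("not socket activated")

    # Wrap the already-bound, already-listening fd; no bind()/listen() needed
    sock = socket.fromfd(SD_LISTEN_FDS_START, socket.AF_INET, socket.SOCK_STREAM)

    while True:
        conn, addr = sock.accept()
        conn.sendall(b"hello from a socket-activated service\n")
        conn.close()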
Now, systemd has a nice feature: it has a D-Bus interface, which is an IPC mechanism, so anything can talk to systemd. If you want to get information out of systemd you can use systemctl from bash, but you can also talk to systemd directly. So we created a binding for this: we have a Python API to systemd, pystemd, which basically exposes all the goodies of systemd to Python as Python APIs. And that's it — there's no magic behind it. You install it the same way you would install any Python package: pip install pystemd. I've been told this thing is also in Debian, and by extension its derivatives, and it works like this: you open your service object and then you can do start and you can do stop, just as you normally would. You can also get the same things you get from status; for example, you can ask the service for its main PID. The nice thing here is that you get an integer back. If you were doing this from a shell script, it would be text, and then you'd have to know the type and parse it and do all the magic. Since this is a real IPC interface, it's actually typed. And since it's typed you can get more complex things back, like an array: if you want to know which processes are in my service, you can actually inspect that at runtime and you get back an array of tuples. What a beautiful thing. So basically you get all the things. Anita? Cool. So now I'm going to talk about a few more applications of systemd: watchdogs and notifications. If you set WatchdogSec in your service unit to a certain number of seconds, it tells systemd to expect pings from your service within that amount of time. If the service ever exceeds the watchdog interval — say the service goes unhealthy and is no longer able to ping systemd back — we time out, and in that case systemd will by default just restart your service, and hopefully it gets healthy again and starts pinging; and of course this behavior is configurable. So here's an example of what this looks like from both the program side and the systemd unit side.
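(Roughly what those pystemd calls look like — this follows my reading of the pystemd README, so treat the exact method names, especially GetProcesses, as approximate:)

    from pystemd.systemd1 import Unit

    unit = Unit(b"my-app.service")   # hypothetical unit name
    unit.load()

    unit.Unit.Start(b"replace")      # like `systemctl start my-app.service`
    print(unit.Service.MainPID)      # an integer comes back, not text to parse

    # Typed data over the IPC: an array of tuples (cgroup path, pid, command line)
    for cgroup, pid, cmd in unit.Unit.GetProcesses():
        print(pid, cmd)

    unit.Unit.Stop(b"replace")       # like `systemctl stop my-app.service`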
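(And a rough reconstruction of the program side of that watchdog example, assuming pystemd's daemon helpers — the keyword names mirror the sd_notify fields READY=1, STATUS=..., WATCHDOG=1, but treat the exact signatures as approximate:)

    import time
    import pystemd.daemon

    # Matching unit sketch:
    #   [Service]
    #   Type=notify
    #   WatchdogSec=3
    #   ExecStart=/usr/bin/python3 /srv/my-app/main.py

    pystemd.daemon.notify(False, status="loading")             # visible in systemctl status
    # ... expensive startup work would go here ...
    pystemd.daemon.notify(False, ready=1, status="app ready")  # tell systemd we're up

    while True:
        pystemd.daemon.notify(False, watchdog=1)  # "still alive" ping; stop sending
        time.sleep(1)                             # these and systemd aborts the unit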
So on the left we have WatchdogSec set to 3, so systemd will expect pings from your process within three seconds, and on the right we check that the watchdog is enabled using pystemd's daemon watchdog helper and use the daemon notify call to send those watchdog pings. On the bottom you see an example terminal of what that looks like: the watchdog is enabled and the process starts sending watchdog pings; if we decide to stop sending those pings, you see a notice from systemd that we timed out on the watchdog, and it will abort and kill the service. And beyond watchdogs, you can do this with notifications in general. By default, systemd will track when your service is activating, when it's reloading, and whether it has started, but you can override these from within your program by sending different notification messages. To do that, in your unit you set Type=notify, which tells systemd to expect your service to report its own status, and using the daemon notify call you can set the status to "loading"; below, in systemctl status, you see that the service now says it's loading. And if you want to say that the app is now ready, you do the same thing: you send the ready signal, set ready to 1, and update the status to "app ready", and that gets reflected whenever someone runs systemctl status. And of course with systemd there's a whole host of different notifications you can send; these are just a few of them. All right, and now for the fun part: we're going to run some live demos using systemd and pystemd, where everything can go wrong, right? And I should have started this before, but I think I can do that — is that a little bit better? Very cool, awesome. As a fun thing, I named this demo systemd-demo. Thank you, there you go. So I need to set the stage so I can explain how the demos will work. On your left you have a Python shell — it's a bit overloaded, but it's just a regular Python REPL — and on the right you're going to see some information that changes a little bit with every demo; in particular I'm showing that this thing is running, this line over here: it is the demo. So in Python, to start a subprocess you use the nice subprocess.run, right? So I do subprocess.run with /bin/sleep, and I'm going to sleep for 300 seconds, and you can see on the right that a new process has started: it's a child of my process, and it's on the same tty, ergo it's in the same cgroup, ergo it's running as the same user — well, not exactly ergo, but it's also a nice property of this. Now, with pystemd we have a nice feature: systemd has a command-line tool called systemd-run, so between subprocess.run and systemd-run, I mixed them together and now you have pystemd.run. What this does is start /bin/sleep as well, but as you can see it starts it as a different unit. So this is a good way to tell systemd: start this right now as a proper unit. It has a status, you can see that it is running, it has a command, and it's actually so cool that systemd will create a service unit file for this on the fly — so if I cat that file, you can see that it's a proper service unit with a lot of information in it. So now let's take this a step further, and this is actually cool: what I'm doing here is starting a service unit, and as you can see it starts somewhere else, but I'm binding the stdin and stdout of my Python REPL to the stdin and stdout of the unit, and vice versa.
So by wiring those things together, even though I am in a Python REPL, I can actually interact as if I were that unit. The other thing that is cool is that since I named this my-service, all of these things are my-service: you can see that I can look at the status of it, and I can see the unit that gets generated. This is going to help inform the rest of the demos — and with that I hope everything was clear; if it wasn't, you'll catch up, don't worry. All right, so let's start with the first example. In the first demo we're demonstrating two sandboxing features that systemd provides: ProtectHome and ProtectSystem. Setting ProtectHome=true makes it so that, within the view of your systemd unit, the /home directory is completely empty and inaccessible. Compare that to the right, where outside of the systemd unit, on the host, you actually have home directories. And ProtectSystem=strict makes the entire root file system read-only from within your systemd unit; if you try to touch something inside the unit you get an error about a read-only file system, whereas on the right, outside of the unit on the host, we're able to touch things. Okay, cool. So the next example is a little bit more granular, because that first one was a blunt instrument. The logic here is that if you start your web server, you don't want your web server to have access to the AWS keys of every user on your machine, so you'd better mount things with some granularity. Here you can set up read-only directories, and this directory in particular is now mounted but read-only — I'm not going to try everything, you'll have to believe me that the thing named read-only is actually read-only. InaccessiblePaths shows the other thing you saw: even though /home exists on the real system, for this service it doesn't exist. BindPaths and BindReadOnlyPaths are also very interesting features, because now we can take this directory over here and say: attach whatever this thing is as another path. So inside the service you can actually play around with the file system layout, and you can see that this is mounted as something else. And this is true not only for directories but for binaries — or rather files — so in this case Python exists there as a directory that was just mounted as a different thing. This is interesting and cool. By the way, Simon did a talk about chroot and pivot_root that covers similar things, but he does it from scratch without the help of systemd, so once the recordings are online I really recommend watching that. Okay, more sandboxing features. With PrivateTmp you get a /tmp that is unique to your systemd unit; it's completely different from /tmp outside of the unit. You can also create a temporary file system with the TemporaryFileSystem property; in this case we have one under /var/cache. You can also create a private network inside your systemd unit: this creates a new network namespace, so you can't access the network outside of the loopback interface. And you can use PrivateDevices=true, which means that /dev will be a completely new mount, different from what's on the host, which you see on the right, outside of the unit.
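(The sandboxing directives from these demos, collected into one sketch — the paths are invented, and you would normally only set the ones your service actually needs:)

    [Service]
    ProtectHome=true                   # /home looks empty inside the unit
    ProtectSystem=strict               # root filesystem is read-only for the unit
    ReadOnlyPaths=/srv/my-app/data     # visible but not writable
    InaccessiblePaths=/home            # simply does not exist from the unit's view
    BindPaths=/var/lib/my-app:/data    # attach one path as another inside the unit
    PrivateTmp=true                    # per-unit /tmp
    TemporaryFileSystem=/var/cache     # fresh tmpfs mounted there for this unit
    PrivateNetwork=true                # new network namespace, loopback only
    PrivateDevices=true                # minimal private /dev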
This next feature is also kind of cool, but it takes a little bit of a turn. It comes from the general knowledge that you should start every service as a different user. If you have a web server like nginx, you start it as a dedicated user; if you start a Postgres server, you start it as a postgres user. But are you really going to do that for every single service — like that really crazy script you have that just deletes partially downloaded files? Well, the answer is yes, and you do it with DynamicUser. With DynamicUser, when your unit starts, systemd will create a user on the fly and wire it into the system, and this is the user it got — you see the number 6485 there. And if I do whoami, you see that the user doesn't actually exist, because it's not written into /etc/passwd; it's resolved through NSS, the name service switch. Now, if you remember, that UID was 6485; that UID can get reused for the same service or service name, but only within the same boot. So this is a good way to preserve that mapping from one run to the next. Okay, and if you want to do things like filter IP addresses, of course systemd has something for that too. You can use IPAddressDeny, and if you pass it all zeros, that tells it to deny all IP addresses. So when Alvaro tries to ping one of Google's DNS servers from within the service unit, he is unable to; however, we did allow 8.8.8.8 using IPAddressAllow, and he is able to ping that successfully from within the unit. Okay, cool, and now we're seeing cgroups in action. I don't know if you noticed, but this bash actually took a little bit longer to start; that's because we're setting the maximum CPU this service can use. Here we say it can use only 20% of the CPU. So, if I remember how to use this — is that how we use it? Sorry — yes, so I can use stress-ng and peg the CPU. Oh wait, why are you failing on me now? stress-ng --cpu... it's zero, okay, no, it's one, okay, never mind, we're going to use three CPUs. Live demos — we're going to fix this, don't worry. Yes, we're here, we're back, okay, go. So it was stress-ng --cpu 1, thank you Simon. So what you can see is that we start stress-ng, and this process can only use 20% of the CPU — and that's the thing we're actually aiming for: between these two they cannot use more than 20%, so they're effectively splitting it. Now, TasksMax is actually pretty cool, because it controls the number of tasks you can fork and exec inside the unit. As you can see, I'm starting bash, so the task count is one — you can see it right there, and yes, my mouse does work as a pointer. If I start sh, that number goes to 2; if I do sh again, it goes to 3; 3 plus 1 equals 4, you see where this is going; and when I do one more, it basically refuses the fork. So if you have a service and you know it shouldn't really be forking wildly or anything like that, this is a good feature. Keep in mind this is a feature of the kernel that systemd is exposing to you, not a feature of systemd itself. And finally — well, I need to exit all the things that I have. Okay, perfect, now I can run Python. Python uses a little bit of memory; that's one of the complaints people have. So if I do x equals a long string — like "hello", yeah, hello — multiplied by a very long number that I've been told is long enough, you're going to see that systemd is going to kill it: it's going to kill the Python process, and you can see we're still in bash, right?
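(Again as a sketch, the directives behind these demos — the numbers are either the ones used on stage or invented; the memory limit is presumably something like MemoryMax, though the demo doesn't name it:)

    [Service]
    DynamicUser=yes          # user/group allocated on the fly for the unit's lifetime
    IPAddressDeny=0.0.0.0/0  # deny all IPv4 addresses ("any" covers IPv6 too)...
    IPAddressAllow=8.8.8.8   # ...except this one
    CPUQuota=20%             # the whole cgroup gets at most 20% of one CPU
    TasksMax=4               # at most this many tasks (forks/threads) in the cgroup
    MemoryMax=128M           # beyond this, the kernel OOM killer steps in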
So what systemd — actually what the kernel — is doing here is the right thing: there was a process, and when the whole unit went over its memory limit, the OOM killer started and killed the minimum number of processes that would free the most memory. And of course, if you want to know more about OOM, check out systemd-oomd. Okay, now, since we failed, we actually need to do a couple of things — all right. So we're going to close off with the last demo, which really demonstrates how systemd units can toe the line with containerization. In this case we have a Debian root file system; if we set it as the root directory, our systemd unit will appear as though it is running Debian, when actually, on the host outside of the unit, we're running Fedora. So inside the unit you can do things like apt install sl, and you can actually run it as you would any regular Debian application. And you cannot exit that, because that's the reason why the application exists... so we exit, and then we're done. So, questions — go ahead, sir. No — so yes, there are features for timeouts and things like that; in the same unit you can set a timeout for how long the unit has to start. But you're right, someone will pay the price: the request that pays the price for starting up the service is the first one, that is definitely true. But there are things you can do with timeouts: systemd will just hold on to the connection and keep it in the listen queue for new connections — that also happens — and then while the unit is starting you can set a timeout on the unit, and if the unit times out on start, it can restart again or it can just die and the connection gets closed. That's not the default, but there's an option called Restart, and you can set it to restart on failure, on everything, or on nothing. The good news is that a lot of random applications out there do work with this. A good example is nginx, because the way this works is that systemd — I'm not sure if it sets an environment variable or... I think it sets an environment variable — and the application just checks whether that environment variable is there, and then you can actually tell nginx in your config that it was started from systemd.
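(A sketch covering the last demo and the restart/timeout answer — the root filesystem path is invented for illustration:)

    [Service]
    RootDirectory=/var/lib/machines/debian-rootfs  # unit sees this directory as "/"
    ExecStart=/usr/bin/my-app
    Restart=on-failure       # also: always / no
    TimeoutStartSec=30       # give up if startup hangs this long
    TimeoutStopSec=90        # SIGTERM first, SIGKILL once this is exceeded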
I'm pretty sure this works — I think it also works with MySQL; Cassandra probably doesn't, but that's Cassandra's fault. Yeah, I can talk about a few of the daemons that ship with systemd. For login we have systemd-logind; of course we already talked about systemd-journald; more recently we upstreamed and pushed out systemd-oomd, which is a userspace OOM killer — I did a talk about oomd. There's systemd-nspawn if you want to use containers and things like that; systemd-machined can manage things like your images and virtual machines. Those are the ones I use most frequently. Oh yeah, systemd-networkd: if you don't want to use NetworkManager, you can use systemd-networkd to manage your network interfaces. I think the next release is going to come with an office suite, so you'll have systemd-excel — we've joked about systemd-text-edit, but I don't think that will become real. Well, you have systemctl cat, which displays units, and I think there's also systemd-cat. systemd-analyze, to help verify and analyze systemd units. systemd-boot, if you want to replace your boot loader with something that's part of the systemd project. Am I missing any? Oh yeah, systemd-timedated, but all it does is manage a file, if I'm not mistaken — oh really? All right, cool, I omitted systemd-resolved; since you brought it up, it's not everyone's favorite. What's that, sir in the back? systemd timers! systemd timers — this is very good, we didn't touch it, we should have. Let's give our answer: just yes or no, on the count of three, whether we like them or not — one, two, three — yes! Yeah, in general I like timers. I would say the barrier to entry to learn how they work, compared to cron, is higher, because cron is something we've seen since we were babies, right? But they are way more flexible. So, the stop timeout for every service unit is configurable; by default it should be 90 seconds. But at some point your system has to shut down, so the service will time out: systemd does its best to gracefully shut things down with SIGTERM, but once the timeout is exceeded it will just go ahead and SIGKILL everything. That's something we didn't talk about, but as you saw in the init script for Cassandra, somebody had to define how to stop Cassandra; with systemd we never defined a stop — you can do that if you want, but the point is that we already know how to stop a service, so let's just use the default. Yeah, well, yes: you decide what signal to send, you decide what to wait for and what to do when the signal doesn't do anything. So you send SIGTERM, for instance, and when that doesn't finish things off you can send another SIGTERM, and eventually systemd will take the responsibility out of your hands and say it's more important to shut down this machine than to preserve your cat pictures. Is there a visualizer? Yes, there are boot charts — and I think there's a visualizer, but I honestly don't know. With systemd-analyze you can actually plot the boot process; it's not supported yet for shutdown, and I don't think it does it for arbitrary starts of units either, but if you want to plot your boot and figure out which services were taking up the most time, things like that, that's in systemd-analyze. It only does the last boot; it doesn't simulate the next one. But systemd does have a dependencies command,
and all you have to do is put in the target or the service you want to start — I can't remember the exact name, but you can probably find it — and it will show you what the order is, what it's going to do. And you can do this for anything, literally any service, every unit: I can't remember exactly, but I think there's a dependencies thing that shows you the dependencies of the thing you're going to start, and if you do it with a target, you see everything that's going to get started with that target. Yeah — we're rewriting it in Rust, I hear. You jest, but we are considering writing some systemd daemons in Rust in the future. Something you may not know: he is a contributor to systemd. Okay, that seems good. Thank you very much, everyone, for coming — I really appreciate it. We made good time.