Is this thing on? Excellent, okay. Test check. Good morning.

Four years ago, I was working at a tech startup in San Francisco, and I was releasing a new piece of code to production. It was a large Hadoop cluster, and we had these task workers on the cluster. And I put a little code in my RPM package to make sure that all of the old workers were killed when the new RPM was installed. It was just a little killall -9 of everything owned by the current user. Neglecting to think that, hey, when I do this install, that's gonna kill everything owned by root, because root is what's running the install script. So this took out sshd, syslogd, crond, and one fourth daemon running on the system, but it left the servers up. It didn't kill init. So the users of the cluster were still getting work done, and we didn't wanna just reboot all the machines, because they were all running HDFS and that's nightmarish. So we spent the greater part of a day, eight of us, logging in over the lights-out console, one at a time, into every server, and restarting the four daemons with our Diceware password that took like a minute to type. And eventually we got it all done, but it was a massive pain in the butt.

So, the first day on the job, my first job out of college, I am sitting in the NOC, done all the paperwork, and they sat me down in front of a console, and the senior sysadmin walks through and says, yeah, just go log into this machine and go kill the foobar process. It's under daemontools or that kind of thing, so it'll restart itself. So I log in, not really paying attention, and you get the login banner, it says Solaris 2.5, and I go killall foobar. On Solaris, killall doesn't kill by name, it kills everything. You know what happened. So streaming over my console are all of these shutdown messages and stuff, and then the phone rings. I pick the phone up and it's a trader, and you know it's a trader because the first thing you hear is: what the fuck is going on down there? Because I just killed trading for the whole company. So that was the first day on my job.

And we tell you these stories because what's really important about how a sysadmin grows into being an SRE is the experiences that we all have making mistakes, doing things in production that cause outages, and seeing things like single points of failure and the pain that they cause the business and cause ourselves.

So as we look at the evolution of sysadmin to SRE over the past 30 years: once upon a time we all managed individual servers, and we installed them, possibly using a CD, one at a time through a data center, and we would know the name of every server, and we would know, hey, server Zavix is not working well today because I screwed up one of the libraries on it last night when I was doing some maintenance. Over time, that didn't scale, and we had to learn how to manage our servers using code. And I think the real difference between the old world of systems administration and the new world of reliability engineering is that it's no longer about managing individual servers, it's about managing fleets of servers, and it's also about worrying about reliability engineering. It's not, I want to keep one server up. I need to keep the service up. I need to keep the whole fleet, the whole application, running.

So, my name is Jonah Horowitz. I work at Netflix, on the SRE team. I got here through a journey that I think a lot of us had. I started playing with computers when I was in high school. I had a little BBS that my friends would log into and play Trade Wars with me.
I worked on a help desk, and I worked at walmart.com; I helped launch walmart.com in 2000. I meandered through a bunch of other small startups around the Bay Area and landed at Netflix about 10 months ago.

My name is Al Tobi. I started a little later with computers than probably a lot of people. I was a music major through college, and dropped out and got that first job. I worked at Limelight Networks, Sony Online Entertainment, Cisco, a place called Ooyala, and now I'm at Netflix as an SRE.

So the SRE team at Netflix is called Core, and there's kind of a background up here; I'm not gonna read it to you. But our primary job at Netflix is to be the last line of defense. Most of the engineering teams at Netflix deploy and manage their own services. We're not doing the configuration for them. We're not spinning up their servers. We're not gatekeepers. The one thing that the whole business really relies on our team for is that when something falls through the cracks or some major failure occurs, we're the ones that respond, get online immediately, and make sure that the right things are happening to bring the service back online as soon as possible.

We also do some other things in terms of filling in the cracks. We do a little bit of research-and-development kind of things, mostly in terms of ops. So for example, we might discover that while we have this really awesome alerting and monitoring system that's doing a lot of really neat things, our alert volume might be a little higher than we'd like. And there are certain alerts that fire, say, when we fail across regions, even if it's intentional. So we wanna do things like Pearson correlations to say, well, the traffic is missing in this region, but it's showing up over here, and we can automatically squelch those alerts using just a little bit of Python code. That's the kind of thing our team would do, and then show that to the other teams, and hopefully they'll absorb it into their products.

It's important to note one of the really key aspects of how Netflix operates as a service: every engineering team is responsible for writing their code, deploying their code, and running their code in production. So we don't manage any of the services in production. But when you take a lot of engineers that come from other companies, and maybe their experience level with running code in production isn't very high, we have to consult with those teams to give them the operational skills they need so that they can be successful running their code in production. So we consult with all the engineering teams: we talk to them about making sure that they have good dashboards, making sure that they have good alerting, making sure that their Java garbage collection is tuned properly. These are all things where we take the skill set that we have and we make other teams' services at Netflix more reliable.

We have a couple of other smaller tasks. One of them is that we're a liaison to Amazon. Obviously we run everything in AWS, and knowing what's going on at AWS is extremely important, so we have a lot of communication with them. And finally, the last thing we do is incident response. During a large-scale incident, we're responsible for making sure that the communication keeps happening.
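To make that alert-squelching idea concrete before going on: a minimal sketch, assuming made-up region names, traffic series, and thresholds. This is an illustration of the technique, not Netflix's actual code.

```python
# Sketch: suppress a "traffic missing in region X" alert when the missing
# traffic is showing up in another region, i.e. an intentional failover.
# Region names, sample data, and the -0.8 threshold are all invented.
import numpy as np

def moved_not_lost(dropping, rising, threshold=-0.8):
    """Strong negative Pearson correlation between one region's streams-per-
    second and another's suggests traffic moved rather than vanished."""
    r = np.corrcoef(dropping, rising)[0, 1]
    return r < threshold

def should_page(region, sps_by_region):
    mine = sps_by_region[region]
    others = (s for name, s in sps_by_region.items() if name != region)
    # Squelch the page if any other region's gain mirrors this region's loss.
    return not any(moved_not_lost(mine, other) for other in others)

# Example: us-east-1 drains while eu-west-1 picks up the same load.
sps = {
    "us-east-1": np.array([100.0, 80, 60, 40, 20, 5]),
    "eu-west-1": np.array([100.0, 120, 140, 160, 180, 195]),
}
print(should_page("us-east-1", sps))  # False: traffic moved, don't wake anyone
```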
So we'll engage the engineering teams that are responsible for the services that are impacted, and then we'll bring other engineering teams into the call if they're needed, and we will take current status and communicate it outward to any team that's gonna be impacted, even public relations or upper management. And then after an incident, we document what happened. We work with the teams to do an incident retrospective, and we create action items for follow-up and track them to make sure that they get completed.

So part of the way that Netflix operates is this idea of context, not control. We never tell a team exactly how they should implement something. What we do is inform them about why, or what some of the other teams do, and we show them that by implementing some of the best practices, their lives are going to be easier. So instead of saying, hey, you should have an alert on your CPU if it goes above 80%, we show them that when their CPU goes over 80%, their service starts degrading, and we let them decide that they want an alert at that level.

The other thing that we do is across the entire organization, and for those of you who are fans of DevOps, this is one of the things Netflix does that I think is fairly unique: we apply this idea of hiring smart people, hiring the best person for the position, across the whole company. And that's a big part of why we can do things the way that we do. We hire like a professional sports team, or, as I prefer, a professional orchestra. We hire the very best people, so that when I'm playing a part, I know that the person next to me is also very good, and I don't have to talk to them about tuning their instrument or making sure that they do the right bowings. They know their job and they can do it, and it's not my business to tell them how to do it, and I don't have to worry about it.

And the other part is, if you talk to a lot of startups, especially Bay Area startups, they give you that line about, well, we're kind of like a big family. And the thing is, Uncle Bob never does the dishes after Thanksgiving. Never once. And you don't want Uncle Bob on your team, because he's not gonna help out. With that kind of family mindset, you get the kind of environment where you have people that you have to work around, right? And we've all run into that in our careers, and maybe have to deal with it on a daily basis. But because of the way that we hire and staff at Netflix, it's not a concern that we have. It's not a family; it's a professional sports team. People come and go, and that's just part of life. And it's part of life even without that kind of policy; it's just unwritten.

So another aspect of Netflix is freedom and responsibility. A year and a half or so ago, one of the front-end engineering teams said, we would like to switch from running Java as our production stack to using Node. And they decided on that because they write JavaScript for the front-end clients, and they wanted to run JavaScript on their server stack too. Now, we didn't know anything about running JavaScript in production, but we trusted that they were smart people, and they could figure out how to do it, and we could support them and help them be reliable as a service. Now, it didn't go perfectly smoothly. There were definitely some outages and some problems caused by that transition.
But now that team is running JavaScript in production, they're kicking butt, they've got their operations under control, and we're helping them fine-tune some of the smaller parts of their instances. We're not managing them day to day, we're not doing their code pushes for them; they figured that out on their own. So we gave them the freedom to decide that they wanted to make that switch, and they also took responsibility for the outcome of it.

How many of you have heard this line when you're interviewing somewhere: well, this team is kind of like a little startup inside of a big, huge company? Has anybody heard that one before? Anybody think it's nonsense? When companies attempt this, what they're trying to do is get that pace of innovation that comes from a startup: not having bureaucracy, not having all the controls in place that slow things down. And when you break that down and get to the bottom of it, what you really want is a learning organization, and you wanna be able to move at high velocity. And that's what we do at Netflix: we make mistakes. We make them in production, very often, but we've optimized for MTTR, sorry, mean time to recovery, and we try to get things fixed as quickly as possible, so that we can allow our engineers to make mistakes, and we accept that that happens on a regular basis.

One of the examples where Netflix very publicly made a mistake and corrected that mistake: does anybody remember Qwikster? If you look, there's still an apology from our CEO on our website, and that's just a great example that I loved when I came to Netflix. When I saw that, I was like, wow, that is really different from what I've seen at other organizations, where they would either try to bury it in the history or pretend it never happened. So that's a really important part, and that's part of our incident review and the blameless postmortem kind of thing. Our focus isn't to say, hey, Bob screwed up and deployed to production and we should fire Bob. It's: what was missing in the automation, or in the interface for deploying code or deploying settings to production, that made it possible for him to make the mistake, and let's see if we can remediate that.

So I talked about this a little bit earlier, but this comes back to: we give teams the tools they need so that they can run things successfully in production and drive reliability, without us taking responsibility for it ourselves and running everything for them. Every team at Netflix needs to have tooling and best practices for driving reliability, and we don't have a central team that's doing pushes or orchestrating any of that.

So now we're getting more into, well, we've told you a little bit about how Netflix does things. How many of you are sysadmins right now, or have systems administration in your title? And how many are SRE or production engineer, that kind of title? Okay, good, a good mix. Excellent. How about DevOps? No booing. That's cool. I actually like that as a title. I know it's unpopular in some circles, but because of the way that DevOps leads us forward into reaching out into the organization, I think that's a really cool way that the titles have been evolving.

One thing all of these have in common is that, like a systems administrator, even when you're a junior admin, you're a generalist, right? You do a million different things, and you're maybe not the best at all of them, right?
I'm not the best programmer in the world, but I do a lot of programming for automation and building tools to help with my job. But I also have to do systems-administration kinds of things, like dealing with Linux distributions and packaging and all of those things. And I've dealt with mail servers and a bajillion different applications. So we're generalists, and that carries through, I think, the entirety of our career tracks.

The other thing that we do, especially on the Netflix SRE team, one of the big differences, is we kind of leave behind a lot of the fiddly bits, right? I'm not managing LDAP servers. I'm not deploying Sendmail. I'm not messing with BIND, the kinds of things that are constants in our careers as sysadmins. Our goal is to take the knowledge and experience we've gained over the years of deploying all these applications, and dealing with all the bull crap that they bring into our lives, and look from a much higher-level view. So, Google Earth, and zoom out, right? Take a look at the Netflix stack as a whole, understand how things fit together and how failures might propagate through that system, and be able to communicate that to engineers inside the organization.

But Netflix is so large that we can't actually know the whole system. We have something like 3,000 microservices interacting with each other just so that we can stream movies. There is no sysadmin, there is no engineer at Netflix, who actually knows how the whole stack works. We have a lot of tools that we use to visualize it, and when we're having a problem with a node in that graph, we can see how that node is connected to other nodes. But as an SRE, you can't know everything about the system anymore. It's too big, it's too complex. It's more about allowing every team that's deploying code to production to have the tools they need to drive reliability. And as SREs at Netflix, we're helping those teams drive their reliability, helping those teams have the tools they need. And when we come across a best practice that one team is using, we make sure that's socialized throughout the whole organization, so that everyone is more reliable.

So, how many people are using Puppet, Chef, CFEngine, something like that? Back in the day, we all installed things by hand, maybe netbooted our servers, and then we started using these great configuration management tools. But they have a lot of challenges, and one of them is that if every development team is pushing its own code to production, you end up with these choke points around who is responsible for managing the Puppet config. You don't wanna give every developer commit access to the Puppet repo, because then they can change the /etc/passwd file and take out logins for the entire production operation. It's very hard to compartmentalize some of these configuration management tools, and it's very hard to train all the engineers on how to use them.

So Netflix has come up with a different way of doing that: we produce baked AMIs. Instead of pushing configuration management, instead of changing those configurations in production after they've been deployed, we deploy entirely new images. Every time we push to production, we push a complete virtual machine image. So when a new version of a piece of software gets released, it goes through our Jenkins pipeline, gets built into this virtual machine image, uploaded to AWS, and launched as a whole new cluster. And we just fail from one cluster to the other.
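In outline, that bake-and-flip push looks something like the toy sketch below. Everything here is an invented, in-memory stand-in, not a real Netflix or AWS API; it's just the shape of the idea.

```python
# A minimal sketch of a red/black push: every deploy is a brand-new cluster
# built from a freshly baked image, traffic flips from the old cluster to the
# new one, and the old cluster sticks around (out of traffic) as the fast
# rollback path. Toy in-memory model only.

clusters = {}   # name -> {"image": ..., "healthy": bool, "serving": bool}

def bake(build):
    return f"ami-{build}"                  # stand-in for the Jenkins bake step

def launch(service, image, version):
    name = f"{service}-v{version:03d}"
    clusters[name] = {"image": image, "healthy": True, "serving": False}
    return name

def flip_traffic(to, frm=None):
    clusters[to]["serving"] = True         # new cluster starts taking traffic
    if frm:
        clusters[frm]["serving"] = False   # old one stays up but idle

def red_black_push(service, build, version, previous=None):
    image = bake(build)
    new = launch(service, image, version)
    if not clusters[new]["healthy"]:       # never flip traffic to a sick cluster
        del clusters[new]
        raise RuntimeError("new cluster failed health checks")
    flip_traffic(to=new, frm=previous)
    return new                             # keep `previous` around for a while

v1 = red_black_push("api-proxy", "build-417", 1)
v2 = red_black_push("api-proxy", "build-418", 2, previous=v1)
flip_traffic(to=v1, frm=v2)               # something went sideways: roll back
```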
And then if there's a problem, we can quickly revert to the old cluster, running the old software stack. And we never have to log into machines after they've been launched. This provides a lot of advantages.

One of the things we're working on at Netflix is how we're gonna use containers across our environment. And one of the things that comes up regularly in those discussions is: what's the point? Because we already have a system in place that brings what I think is the number-one value of containers. And I definitely love using containers in different contexts. But that value is the ability to package up the software in such a way that I can test it in one environment, move it to a different environment, maybe canary it, and then, when it goes to production and starts receiving production traffic, I know bit-for-bit, at each of those steps, that the software is identical, all the way down to the kernel interface. That gives you a lot more confidence that you don't have things like libxml going from 2.0.1 to 2.0.2, changing some ABI symbol, and then the whole system comes crashing down. And I've had that happen before, where just an apt-get update destroys the system.

Sorry, what's the question, please? So, we store state in a separate set of clusters. We have a persistence layer that is responsible for storing state. So when we're talking about rolling what we call a red/black push, we're not flipping that stateful layer; we're just switching the higher, upstream services. The deployment process for new stateful services is a little bit different. It's more like a rolling release, where we'll bring up one node with the new setup on it and then destroy one of the old nodes, and do a sort of cascading release across the entire persistence tier. So that's managed slightly differently, but we still release those persistence tiers as fully baked machine images with a different set of software on them; they're just set up so that they can do rolling releases. And that's a critical component of microservices that isn't the subject of this talk, but it's one to take home with you: people talk about stateless services, and stateless services don't exist. What you wanna do is hold onto state for as short a time as possible, which reduces the probability of losing that state during a failover or a complete machine loss. And then we bleed traffic, as Jonah said, from one version to the next.

Where was I? So yeah, we get this bit-for-bit identical deployment into the different environments. And that's all managed by this tool, Spinnaker, which is open source. Maybe you've heard of Asgard; this is the new version. We'll talk about it more as we go. And then the other thing that comes along with this is we've got automatic canarying built into these tools, so we can have the tool automatically deploy a canary with some percentage of real-world traffic, so that we can test a push in the real world.

And as we were just talking about, chaos is something that just exists, right? We can't prevent it. And most of us that have been doing this for a while have done all the exercises to try to prevent every single thing that can fail, right? You've got dual network paths. You've got SANs with dual paths. You've got RAID, dual power supplies. Sometimes you've got these funky machines with mirrored memory and things like that, HP NonStop. The thing that all of these have in common is they cost an enormous amount of money.
And it's not like RAID is just two times as expensive, because you also need a RAID card, which means it's more than two times as expensive. So the cost of safety goes up quite quickly, in comparison to just building it into the system instead. So Netflix switched to this idea of deliberate chaos.

One of our more famous tools is a tool called Chaos Monkey. It's not actually very much Python code. It runs around in our production infrastructure every day, and it just kills some of our instances. This forces our developers to code in a way where they can't have special instances. There are no magic instances, there are no golden instances, because any instance in a production cluster can get killed off at any time by Chaos Monkey. When this was first rolled out, it was a little terrifying, and we let teams opt into it when they thought they were ready to take that sort of risk with their systems. But more recently, we've actually made it opt-out. So only teams that can justify not using Chaos Monkey, for some specific reason, are allowed to opt out of running it. And we have almost complete coverage on our production systems when it comes to chaos going along and killing things. We even have chaos running on our persistent data systems. We run a very large Cassandra implementation, and even the Cassandra nodes get killed off by Chaos Monkey on a daily basis. I mean, almost every cluster loses a machine every day to Chaos Monkey.

And this is great, because if you run in Amazon and you have any scale there, you will get, I get, 25 or 30 emails every morning saying that such-and-such node is going to have maintenance and is going to be turned off tomorrow, or next week, or something. When you're running with chaos, you're like, ah, so Amazon's going to kill that node. It might already be sick. Doesn't matter. It'll be fine. There's just no way around having this sort of system running if you're at scale in a cloud environment. And it's getting more important over time, because when Amazon was fairly new, this happened all the time, right? It was constant chaos when you deployed in Amazon. But they've been getting a little bit better at it, and their servers are getting a little more reliable. So we keep this running to keep people on their toes; we make sure that it's always running.

So, I talked about optimizing for innovation over safety, and one of the things that happens a lot in larger organizations, especially those that grew large very quickly, is that things go wrong, and then the first thing a lot of people say is, aha, I should put in a process to make sure that this never ever happens again, right? And the more never-ever it gets, the heavier the process gets, and the slower the innovation is. If you've been at a small company long enough for it to grow into a large company, you've been through that pain, where you used to be able to just fix stuff, right? When I started at this healthcare company I worked at for a long time, I could just SSH into a server as root and go fix things when they broke. And then over time it became, well, you can't just log into the server and fix things, you need to file a change request first. So I'd go and file a change request, and then I would go and fix it. And then it was like, well, no, you can't just file a change request and go fix it.
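A quick aside on Chaos Monkey before the rest of that change-request story: the speakers note it isn't much Python, and the core of the idea really is tiny. This toy sketch uses an invented inventory and kill call; the real, open-source Chaos Monkey does the same basic thing with far more care around scheduling and safety.

```python
# Toy sketch of a Chaos Monkey-style instance killer: pick a random victim in
# every cluster that hasn't opted out. Inventory and terminate() are made up.
import random

OPTED_OUT = {"payments-batch"}             # opt-out, not opt-in, per the talk

def instances_by_cluster():
    # Stand-in for an inventory lookup (the real thing asks AWS/Spinnaker).
    return {
        "api-proxy":      ["i-0a1", "i-0a2", "i-0a3"],
        "cassandra-main": ["i-0b1", "i-0b2", "i-0b3"],
        "payments-batch": ["i-0c1"],
    }

def terminate(instance_id):
    print(f"terminating {instance_id}")    # stand-in for the actual kill call

def run_chaos_once():
    for cluster, instances in instances_by_cluster().items():
        if cluster in OPTED_OUT:
            continue                       # teams must justify opting out
        victim = random.choice(instances)  # any instance can die at any time,
        terminate(victim)                  # so none of them can be "golden"

run_chaos_once()
```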
You have to file a change request, and then the change review board has to look at it, and the change review board is usually made up of people who have no idea what I just said in the change request, and then they have to approve the request, and then I can go and fix it. And it just keeps getting worse and worse over time, right? Then the security policies come in, and it just keeps crufting up over time. It seems to be one of those things that just keeps building up, like barnacles, and never really gets smaller or refactored down.

So we looked at that, and it turns out there is a different way to do it. Right, so instead of having to approve every individual change before it goes out to production, we, and this is a recurring theme, let our engineers take responsibility for releasing their code to production and making the changes they need to make in production. And then we log those changes in a system we call Chronos. At any given time, there is a literal stream of events going through Chronos: any sort of automated changes, and any intentional changes that are getting pushed to production, are logged in the system. So when we see an impact, we can look at the start time of the impact, quickly go into Chronos, narrow down through the potential things that could have caused the impact, find the team that's responsible, and work with them to mitigate or resolve the problem. We let them roll forward, we let them roll back; it's really up to them. But the ability to quickly find out what changed, how it changed, and when it changed allows us to get rid of the huge overhead of a large change-control process.

Another thing we do a little differently is how we handle all of our tooling for systems administration, as I mentioned earlier. How many of you are using Nagios today in production? That's pretty good, yeah. How many are using one of these other tools that's listed up here? Cool. So we don't really use any of these, because as Netflix grew, these tools didn't really keep up with the scale. A lot of them have grown up and improved for high-scale usage since, but at the time they didn't work, so we built a different tool. And the thing about these tools is, there were things that we also did, right? I also had a Nagios installation, and I would update the config every couple of days and check it into git, and had a cron job to pull it down, and that was fine. But the installation and the machine would start to cruft up with different things, like all the dependencies for the plug-ins, and it would sometimes be two, three years out of date, because nobody wanted to take on the job of refreshing the Nagios box, with all of the different little things that are gonna bust and ruin your evening.

So this is different, again, from a traditional large organization, or even a small startup: we don't take responsibility for owning these things. Netflix has a team called Insight Engineering, and they're great, and they run the production instances of our monitoring systems. So we don't have an SRE ops team that's responsible for writing code, deploying code, managing stuff in production, and, oh yeah, also being responsible for every last piece of monitoring that's running in production and whatever other tasks. We have pushed the authority and the ability to run those tools to the individual teams, and we all rely on this team, Insight Engineering, which wrote an open-source tool called Atlas.
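The Chronos idea reduces to something like this sketch: one stream of change events, and a query for what changed in the window before an impact began. The event store here is just a list, and all the names and fields are invented.

```python
# Miniature Chronos: every production change, human or automated, lands in one
# stream, so when an impact starts you can ask "what changed right before this?"
from datetime import datetime, timedelta

EVENTS = [
    {"at": datetime(2016, 1, 14, 7, 0),  "team": "api",      "what": "red/black push v417"},
    {"at": datetime(2016, 1, 14, 7, 42), "team": "playback", "what": "fast property change"},
    {"at": datetime(2016, 1, 14, 7, 44), "team": "core",     "what": "scaling policy update"},
]

def changes_before(impact_start, window_minutes=30):
    """Return changes in the window leading up to an impact, newest first,
    so the likeliest culprit is at the top of the list."""
    cutoff = impact_start - timedelta(minutes=window_minutes)
    hits = [e for e in EVENTS if cutoff <= e["at"] <= impact_start]
    return sorted(hits, key=lambda e: e["at"], reverse=True)

for event in changes_before(datetime(2016, 1, 14, 7, 45)):
    print(event["at"], event["team"], event["what"])
```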
Atlas is on GitHub, and it allows us to collect metrics at the scale that we collect them at Netflix. We're doing billions of metrics a minute, multi-dimensionally, so we can group them and look at them in all sorts of different and unique and interesting ways. And, oh yeah, we have some screenshots of what that looks like in a minute.

Yeah, so to repeat the question, it was: what would you say you do here? And that's a good question. Like I said, it's kind of a higher-level view. We're not the maintenance men working on the plane, we're more like the pilots, or really kind of ground control, right? Actually, ground control is better. So it is kind of like a NOC, but we don't actually have a NOC. We're not watching dashboards all day. In fact, we don't have any dashboards in our seating area. No information radiators; we banned them when we moved to our new space. Instead, what we do is pay attention to alerts, so we have that on-call component that is very familiar to most of you. We engage with the engineering teams, so there's a big consultative role, and that's where we spend a lot of our time. We comb over various things: there will be minor alerts that roll through, and we comb over those and look for things that are maybe instability in the system that's starting to jitter, and hopefully catch things before they bubble up. Mostly, though, when we find things like that, it means we need to go engage with an engineering team and make sure that their alerting is improved. So that really is the majority of the time.

Then there's kind of that other component, which is, because we have all this experience with all these other monitoring and time-series systems, we have a small role, very unofficial, as a kind of product manager for those tools. We work closely with those teams to say, hey, wouldn't it be nice if we had this? Hey, can you build this kind of feature? They have a lot of other demands from the engineering teams, but we're one of the primary drivers. We do both.

So yeah, I forgot: incident management. That is the other component, the other service that we provide to the whole company. When there are incidents, we get involved during the incident. We track all of the what-happened. Today it's in JIRA; hopefully we'll have better tools soon. We make sure that we log, this happened at this time, this happened at this time, so that when we do the incident review we can go through the series of events and figure out what happened, and then work with the engineering team to remediate.

So, talking about Atlas: this is an Atlas dashboard that we built and use as a team. It shows us what's going on across all of the regions: how many errors, and how many streams per second are getting started. The top graphs are, roughly, how many people in each of the three Amazon regions that we're running in have hit play successfully recently. And the bottom shows the number of client errors reported. You can see they're kind of correlated, but we took all the numbers off the side, so you can't see that those client errors are actually very few; it's not like the client errors match almost one-for-one with the starts. But we use all these great tools written by the Atlas team so that we can provide insight and troubleshooting. This system allows us to quickly isolate: are Android users having a problem? Are iOS users having a problem?
Are people on Apple TVs having a problem? We're able to select which group of people we want to look at and quickly deep-dive into a problem.

So this is a tool called Salp. As Jonah mentioned earlier, nobody at Netflix really knows what the entire environment looks like, or understands the whole environment all at once. So we need tools to assist us with that. Salp uses data from various systems inside of Netflix to draw the graph of microservices, so that we can see how they're all connected and interacting. Engineers can push new services, and very often we don't even know what's going on in terms of new microservices being launched, replaced, deprecated; they come and go. We don't really pay attention most of the time, until there's a problem. And then this is one of the places we can look to say: what is the chain of dependencies from a user hitting play in the Netflix player, all the way down to a Cassandra database or an EVCache cluster?

This is Spinnaker. We talked about it briefly earlier, but this is the tool that we use to release code to production. Each one of those green boxes represents an instance that's running in AWS as part of this API proxy cluster. We use Jenkins for the automated build pipeline, and then Spinnaker takes that build to an AMI and manages the launch of those servers in production. And you can define pipelines. This is all open source; we released it a few months ago. You can say, for instance: I wanna launch a canary cluster with five instances, and once those five instances are up and running and taking traffic, I want to compare the following list of metrics between the currently running canary instances and the current production cluster, and I wanna make sure that I didn't see an increase in errors or a decrease in throughput. And you score that, and once that scoring is complete, it automatically proceeds, or it can send you an email and say, hey, we've run the canary and it looks good, and you can say, yes, I wanna go to the next step. So you can make it manual, you can make it automated. But we have teams that release to production every single day on an automated schedule. It's seven a.m., they push to production, and it goes through an automated pipeline where it runs the canaries, and if the canaries pass, it just pushes the new code to production.

It also supports things like staggered releases. We always like to release in one region first, and then the next region, and then the next region, so we don't see a giant red/black push across our entire infrastructure take out everything all at once; we'd rather take out, like, one third of our production at once. So Spinnaker supports staggering the pipelines across different regions. And we're working on a new feature that's gonna allow us to do automated squeeze testing. That's where, when a new instance launches and goes into production, we can actually calculate what the maximum throughput of the new release of the software is, and then adjust our scaling rules for that cluster to match the new release. The other thing you can see on here is the gray part: when Spinnaker deploys a new ASG, it will very often leave the old ASG running, so that we have a really fast path to roll back if something goes really sideways. And, go ahead.
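Before the next screenshot, that canary gate boils down to something like the sketch below. The metric names, the 10% tolerance, and the all-checks-must-pass scoring are invented for illustration; the real canary analysis scores things in a far more sophisticated, statistical way.

```python
# Toy canary scoring: compare a list of metrics between the canary cluster and
# the current production baseline, score the result, and gate promotion on it.

BASELINE = {"error_rate": 0.010, "throughput": 5200.0, "p99_latency_ms": 180.0}
CANARY   = {"error_rate": 0.011, "throughput": 5150.0, "p99_latency_ms": 185.0}

# Direction: +1 means higher is better, -1 means lower is better.
CHECKS = {"error_rate": -1, "throughput": +1, "p99_latency_ms": -1}
TOLERANCE = 0.10            # allow up to a 10% regression per metric

def canary_score(baseline, canary):
    passed = 0
    for metric, direction in CHECKS.items():
        change = (canary[metric] - baseline[metric]) / baseline[metric]
        if direction * change >= -TOLERANCE:   # regression within tolerance
            passed += 1
    return passed / len(CHECKS)                # fraction of checks passed

score = canary_score(BASELINE, CANARY)
print(f"canary score: {score:.0%}")
if score == 1.0:
    print("promote: shift production traffic to the new cluster")
else:
    print("fail: tear down the canary, keep the current cluster")
```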
Now, this is a view of the pipelines. It's like a log, so it's just showing each pipeline that's run recently, and you can see some pipelines that were successful. The last three pipelines were completely successful; the third one down had six stages, where the other ones had four. Below that you can see some pipelines that had failed stages, so they crashed out before they finished. And the developers get a bunch of feedback as to why that happened, and they can adjust their code or fix their things before the problematic software gets into production.

And the last screenshot, well, second to last: this is a tool called Vector, which is also open source. It's built by our performance engineering team. What this does is provide the micro view. Atlas has the macro view: it has one-minute granularity for stats, and you can see them in graphs, but the problem with rollups is that you lose resolution. You can't see things in a one-minute aggregate. You can't see if you have, say, a 500-millisecond GC pause; that'll just get absorbed by the averages, because averages are awful. So what we have is this tool, which we can turn on on any instance in our fleet. Developers can go straight to this web interface and see live, high-resolution stats on the system, graphed here. And it's not on here, but down at the bottom there's a little link where they can now actually click, and it'll go off and build them a flame graph of, say, their Java application's full stack of call traces, so that they can see that performance information. It's all automated and self-service.

So with that, that's it. I'm Jonah Horowitz, like I said, and this is Al Tobi. All of our public code is available on netflix.github.io, so if you wanna take a look at Vector, Atlas, Spinnaker, and all of the other Netflix open-source projects, they're there. Coming up after lunch, Brendan is giving a talk on broken Linux performance tools, so I strongly recommend that talk; it should be great. We hope what you got out of this is a view of how Netflix does things. We hope it's interesting to everybody, and what I really hope is that as we talk about these ideas and you take them back to your companies, you can start to push on the boundaries outside of your bubble, and maybe we can see some of these practices become more common outside of Netflix. Because in the worst case it's a lot less unpleasant than some of the worst environments, and in a lot of cases it's really awesome to work in this environment. Thank you. We have some time for questions if you want.

So the question was: besides the Qwikster mistake, what were some other production issues that occurred that we had big learnings from? Maybe one of them is: DNS ruins everything. We were using a DNS provider that had some issues and went offline, and they were offline for a few hours, and basically all we could do was sit on our hands while watching Twitter feeds. And if you've had outages with very public services, like Netflix or Facebook or something, users get really upset. So we're sitting on our hands, and we're working with this vendor to get it fixed, but in the meantime we're also looking at other plans, already, before the issue is resolved, and starting to say: what can we do to give ourselves more control in this situation if it happens again, so that we can move traffic or make sure that our customers stay online?
So that was one kind of thing where we learned a lot, and we made some changes that were invisible to our users; but if it happens again, it will remain invisible to our users. Let's go here.

So the question is: how did the launch in all the new countries a few weeks ago impact our day-to-day life? So we've been working on that, it turns out, for a while. But one of the things that we did to help with that launch was, we have some automated tools that generate alerts, and they look for deviations in trend lines. When we launched in all the other countries, it was kind of hard to know if a country's traffic was being impacted by something, because we had no historical data on how much streaming activity to expect from that country. So we expanded our automated tooling to generate new anomaly detection across all the countries. So now, instead of just having anomaly detection on the 60 countries that we were in, we have anomaly detection across all 190 countries, and we can see if traffic is getting impacted. There's also some stuff about ISPs: in the 60 countries we were in, it was really easy to work with, or maybe not easy, but we had developed the skill set to work with, all of the ISPs in all of those countries. Now we have to work with ISPs in 190 countries. So we're still learning some of the lessons on that, and that will continue going forward, I'm sure. And that was a big part of how we did the launch, as we said: there's no learning like jumping in the deep end, right? So into the deep end we went.

I think back here, in the green. So, do we have plans to use Lambda and API Gateway? Yes, but we're still waiting on some features to be added to them so that we can integrate them with our VPC.

Yeah, so that's actually what I do. We have a project called Production Ready, and it is really about: these are the things you need to know, or these are the things you need to do, to be successful in production. It's actually not a very long list; it has about 15 things on it that we go through with each team. And how we engage with the different teams differs depending on how those teams operate as an organization. Teams that are used to using Trello as their tracking system, we'll create Trello cards for them. If they're used to using JIRA, we'll create JIRA tickets for them. If they just wanna track it on a spreadsheet, we'll track it on a spreadsheet. So we work with them, and sometimes that's just, like, submitting a bug fix to their service, because we can check out their code and see that if they just changed this one thing, it would be done, and so we can fix it for them. Or we'll clear hurdles and provide the knowledge transfer they need to check off all those boxes.

That's our boss. So, to repeat that: Coburn, our boss. It's about 150 people, outside of product teams, that work on just internal tooling and internal services, including things like internal service discovery. Most of that's actually open source; you can see it all on GitHub. And Cassandra? It's about 10, 10-ish.

Any other questions? Oh, okay, I think we have time, yep. So that's a service called Eureka, which is also open source. We have our own discovery system, and Eureka has, in the back end, I think some Redis and some Cassandra that it uses for persistence.
And so every time a service comes up, it has a client library that automatically contacts the discovery service and then figures out where all of its dependencies are, much like you would do with etcd or ZooKeeper or something like that.

Correct. So the question was: do we ever update AMIs once they've been baked? And the answer is no, we do not. Once an AMI is baked, it is only ever destroyed and replaced with a new one. It's as close to an immutable infrastructure as you can get. Yeah, I mean, the immutability moves down into the Amazon layer, right?

So the question was: how quickly can we react to a challenge in production and get an AMI completely baked and out into production? And I don't know that we wanna give you exact numbers on that, but, all right, so, Mike's been working on this. The experience with configuration management systems is that it often takes about an hour to get consistency across all of your configuration management, without running some sort of forced update across the cluster, and I will say that we are at least as fast as that. And that is something that we're working on. I mean, there's a certain amount of physics in the way of making it faster, right? The Amazon process for updating an AMI is not as fast as it could be. So that's part of where the container work is being done, trying to bring down the latency of being able to deploy code. But it's about as fast as it can be, if you do the math on how fast you can take an Ubuntu LTS image, roll in a JVM and code and all of that stuff, and then get it deployed out into the different regions.

Any more? We'll call it done. One more, and then we'll call it done. So the question was: do we have any part of our infrastructure that's really solid and boring and critical to our infrastructure? And yes, there is; Kafka is part of that. Our data pipeline, in terms of all of our streaming logs and things, is now running on top of Kafka, and Kafka is one of those tools where once you get it deployed and kind of working, you can forget about it, right? And I love Kafka for that reason. Most of our fleet relies very heavily on Cassandra, specifically because Cassandra is very good at multi-region replication, and that's what gives us that power, like in this video I was showing, I'll show it again, to move across regions at will. We can at any point in time say, you know what, things feel a little off in us-east-1, we're out, and we just do it. And part of the reason we can do that is because we have that multi-region replication: the data's just there, and we don't have to think about, oh well, now we have to fail over the MySQL master across to Europe and deal with all of the replication events, which is a painful process. With Cassandra, it's already done. Maybe there's Eureka in there too. And obviously Atlas is a cornerstone of everything that happens at Netflix. I think even our content teams use Atlas sometimes, to look at what's going on with a particular movie that's on Netflix. And with the big blizzard on the East Coast, we had some graphs going around internally showing Netflix viewership climbing on the East Coast, because people were staying inside. So that really is kind of a big pillar of our infrastructure.

All right. Well, with that, thanks a lot, guys. Yeah, thanks.
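One last sketch, picking up the per-country anomaly detection from the Q&A above: "look for deviations from the trend line" can be as simple as flagging a value that falls several standard deviations below recent history. The real detectors are far more sophisticated; the data, country codes, and three-sigma threshold here are invented.

```python
# Toy trend-deviation detector for per-country streaming traffic.
import statistics

def anomalous(history, current, sigmas=3.0):
    """Flag `current` when it falls more than `sigmas` standard deviations
    below recent history (we care about traffic going missing)."""
    mean = statistics.mean(history)
    sd = statistics.pstdev(history)
    if sd == 0:
        return False
    return current < mean - sigmas * sd

streams_per_sec = {
    "BR": ([5200, 5300, 5250, 5400, 5350], 5300),   # steady: fine
    "PL": ([410, 420, 400, 415, 405], 120),          # cliff: alert
}
for country, (history, now) in streams_per_sec.items():
    if anomalous(history, now):
        print(f"ALERT: {country} streams/sec dropped to {now}")
```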
OK, let's see if I'm on the speaker. I'm on the speaker. It didn't start playing, did it?

Do you already have IPv6?
What is IPv6?
It is a new protocol for the Internet. The current one, called IPv4, soon will not have enough addresses for everyone. IPv6 will help solve this problem.
This is not a problem at all, because we have NATs. They are great.
This may work for you, because you are using only a small number of hosts at the same time, and control all of them. But if you had a lot of hosts you do not control, you would not do NAT.
I have a perfectly running network. And NATs are good. They provide security. My inside network is not seen from the outside.
This is because the NATs are stateful. It is statefulness that provides the security, not the NATs. You can have a stateful firewall and IPv6.
But NATs are good. I get security for free.
You pay for it, in hidden costs that are required to support the NATs. A lot of people spend a lot of time working to make the applications work with NATs.
It is good for the economy. I give them work.
So, go and throw some junk on the street. Someone will have to be paid to clean it, and that will be good for the economy too.
NATs are good. I use them all the time.
Why do you need them?
Because they are good.
Would you not rather have a way to cleanly express your subnetting and your security policies? The much bigger address space allows that.
No, NATs are good. I will take your IPv6 if you give me NATs. They make me feel comfortable.
Go buy some weed and smoke it. It will make you feel much more comfortable than NATs.
Weed is prohibited by law. NATs are not prohibited by law. They are great.
Stupidity is not prohibited by law either.
Do not insult me. My applications do not work with IPv6.
This is because you think you will be able to use IPv4 forever, and instead of testing your applications, and paying the developers to fix the applications so that they support IPv6, you are buying more NATs.
Yes, NATs are great. I want to buy one for IPv6.
But you said you do not need IPv6.
Yes, I do not need IPv6. But I need NAT for IPv6.
Are you nuts?
NATs are great. I do not need IPv6.
And you do not need all the customers with the mobile smartphones, which will run IPv6 only?
My customers are not asking about IPv6.
That is because they are asking your competitors, who know of IPv6, and with that they can offer innovative applications.
I do not need new applications. The old ones are working well with NATs. NATs are good.
Not so good of a conversation this is, when you keep repeating yourself.
Yes, NATs are good.
Do you know that in the Netherlands and France and other countries the people are getting native IPv6 already?
I live in other lands. So I do not care. I do not care about their IPv6. Everyone has to use NATs. NATs are good.
Do you have an iPhone?
Yes, I do. It works with NATs.
And do you know that it already has IPv6, and you cannot use it?
Oh. You mean my iPhone has a feature I cannot use?
Yes. Your neighbor already uses IPv6 on the iPhone.
I want to use IPv6 on my iPhone.
Too late. You have been talking about the NATs too much, so you cannot have IPv6.
I want IPv6.
Sorry, I cannot help you. You need to go and ask your provider.
What if they do not have IPv6? I need IPv6 for my iPhone.
You can get IPv6 from a tunnel broker.
What is a tunnel broker? Do they do NATs? NATs are good.
No, they provide the IPv6 connectivity using your IPv4 Internet as the wiring. If you terminate the tunnel close to you, it is almost as good as native IPv6.
Where are they?
Go and ask your ISP. They need to know that you want IPv6. I will not tell you.
Then I will ask them about NATs too.
Yes, ask them what is better for them, NATs or IPv6. And I am going to the farm in the meantime, to drink some milk from the mad cows. This conversation has exhausted me.
Ask them if they have IPv6 for my iPhone.
I will. Goodbye.

Okay, everybody's here for Fiber Splicing 101, right? Absolutely. Okay: we're out of IPv4 addresses. Is there anybody in this room for whom this is news? Yeah, that's what I thought. So what we're going to talk about today is where we've been, where we are, and a little bit about how we got there. We'll have some fun. I'm Jason, a professor at SUNY Oswego, in our computer science department. I've been there just about two years, excuse me, finishing up my fourth semester. So I like it a lot. The winters haven't been too terribly scary for me up there; we had a lot more snow. Problem solved. Sorry about that. We'll have some fun, and we'll talk about how we can motivate content providers. And then we'll get into questions and answers. You can ask me questions about anything v6-related, not just the stuff in the talk, so feel free. And if you've got a question on something during the talk, raise your hand, yell something, I'll throw a microphone at you, and you can ask your question on the mic.

OK, so: a brief, somewhat accurate history of the internet. In 1967, believe it or not, Larry Roberts published the plan for ARPANET, which is the original genesis of the internet. The first version of the internet did not run on TCP/IP. How many people know what the protocol was for the first version of the internet? NCP, very good. You don't win anything, but right answer.
Network Control Protocol. How many bits in an NCP address? Bueller? Eight. Eight bits. We had a worldwide market for an internet of 256 hosts, people. Imagine that. The first host on the network was actually 1969, although I wonder if you can call it a network with only one host on it. But it literally was a single-host internet in 1969. By 1971, that was up to 15 hosts. Can you imagine an internet with just 15 hosts worldwide? But it was, at one point. In 1971, I was five years old at the time. In 1972, we got email, the first email on ARPANET. 1977, we actually saw our first multi-network demonstration, where ARPANET actually communicated with another, non-NCP network, and where multiple independent topologies started talking to each other. In 1980, we were up to 213 hosts on the network. Those 8-bit numbers are kind of getting cramped; maybe we need to do something about that. 1981, we saw the establishment of CSNET and BITNET. And in '82, TCP/IP and the Exterior Gateway Protocol were established. So this was literally when we did the development of TCP/IP as a new protocol for the internet.

We went from 8 bits to 32 bits: we went to 16.7 million times what we had, which at the time seemed like a really large number, and we said 4.3 billion hosts worldwide, for this network of research institutions and the US military specifically, is probably pretty generous. And from the available perspective at the time, when computers were still pretty expensive and a given university might have a couple of hundred hosts, it made sense. It actually seemed like going from 256 to 4.3 billion would allow for an awful lot of growth. And it did. And in fact, it wasn't a problem, as long as that's who was getting on the network.

In 1983, so in one year, we went from, TCP/IP is a protocol that we now know how to write, to, we're going to turn off NCP. January 1st, 1983, we turned off NCP forever. It was still running on some hosts in some isolated places, but it was no longer the lingua franca of the internet, or at the time ARPANET, as of January 1st, 1983. Everybody had to speak TCP/IP or they couldn't talk to anybody else, really.

In 1984, we introduced DNS. Believe it or not, does anybody know what the FTP'd host file from IEN 116 was? Prior to RFCs, we had IENs, Internet Experiment Notes. And IEN 116 described a file that was maintained by the central registry, that had a list of addresses and their host names, line by line. You may remember that most of your systems, even today, have an /etc/hosts file or equivalent. That is an IEN 116 host file. Now it just tends to contain your local information, but back in the day there was a central host file, and host names had to be unique across the entire internet, and you submitted your IP address and the host name you wanted to use to the central registry, and they put it in the host file, and everybody FTP'd that file, once a day or once a week or whatever suited them, onto all of their systems from that central registry. It was called HOSTS.TXT on the central registry. So we decided that wasn't scaling, and in 1984 we implemented DNS.

By 1987, the internet was up to 10,000 hosts. Yeah, we need more than eight bits for sure, but 32 still ought to be enough, right? I mean, 10,000 hosts, that's a lot of headroom up to 4.3 billion. By 1989, that 10,000 was up to 100,000. In 1988, we upgraded the internet from 56-kilobit links to T1s, so we got 24 times the bandwidth. That was nice. In 1991, we went from 1.5-megabit backbone links to 45-megabit backbone links.
That's right, 30 times growth in the bandwidth in just a few years. Those were exciting times. By 1994, we were up to four million hosts on the internet, and we realized that 4.3 billion might not last as long as we thought. So we chartered the IPng working group in the IETF, and they started working on a way around this 32-bit problem.

We're bad about scaling. Here are some famous scaling quotes that you might recognize. Thomas Watson once said, in 1943, I think there's a worldwide market for perhaps five computers. Think we're a little bit past that at this point. In 1989, Bill Gates said, I have to say that in 1981, making those decisions, a move from 64K to 640K felt like something that would last a great deal of time. Yeah, about six years, Bill. In 1995, Bob Metcalfe famously predicted the internet would soon go spectacularly supernova, and in 1996, catastrophically collapse. The supernova part, sure. The collapse? Not so much, Bob. You get to eat your column. And he literally pureed his newspaper column and ate it from a bowl at the World Wide Web Conference in Santa Clara in 1997. You can review the slide deck to get the details out of the article. In 1997, a certain Exodus facilities planner that I was working with told me that 400 watts per square foot of power should be enough for any data center, ever. I told him that he should put down the crack pipe. In 2007, Steve Ballmer, another wonderful Microsoft quote: there is no chance that the iPhone is going to get any significant market share, ever. All of these have one thing in common. They were completely and totally 100% wrong. We're bad at predicting how things will scale. Very, very bad.

The same with IP. Interestingly, with IP, each time we run out, we seem to square the size of the address space and hope that's enough. We started with eight bits. Then we squared that, and squared it again, to get to 32 bits. And now we've squared it twice again to get to 128 bits. But at 128 bits, we're talking about 3.4 times 10 to the 38th addresses. So it's literally going to be very difficult for us to produce enough hosts to overcome that, because you're talking about more addresses than there are molecules in the known universe. And it's very hard to build a single-molecule host.
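As an aside, the squaring progression is easy to check in a few lines of Python:

```python
# The "square it and hope" progression described above: each jump squares
# the size of the address space (doubles the bit width).
for bits in (8, 16, 32, 64, 128):
    print(f"{bits:>3} bits -> {2**bits:.4g} addresses")

#   8 bits -> 256 addresses
#  16 bits -> 6.554e+04 addresses
#  32 bits -> 4.295e+09 addresses   (the ~4.3 billion IPv4 ceiling)
#  64 bits -> 1.845e+19 addresses
# 128 bits -> 3.403e+38 addresses   (the 3.4 x 10^38 figure quoted above)
```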
You have a question, Joe? Use the mic, and then you get to pass it to whoever has the next question. I said, we're polluting the routing table anyway, and we already have networks with 12 bits of network addressing anyway. So whether or not we run out of bits for host addressing, we will eventually run out of bits for network addressing. Yeah, we're going to have to address that, but that's not going to be an address scaling problem. It's going to be a routing scaling problem. It's unfortunate we didn't address that with v6. Just hang on to the mic until the next person raises their hand and pass it to them. But you know, he's right. He's absolutely right. We've still got a route scaling problem. The way we currently route packets based on destination address is utterly stupid and non-scalable, and we're eventually going to have to solve that problem.

So that takes care of where we've been. Where are we now? We are out of IPv4 addresses, mostly. Four out of the five RIRs are no longer making conventional assignments. One RIR has some addresses left, but if you're not operating in Africa, that doesn't really matter, because that's AFRINIC. IPv6 now accounts for just over 10% of internet traffic globally, according to Google, speaking of Mike Joseph there. It accounts for more than 37% of US mobile traffic, and it's roughly 50% of the traffic here at SCALE, based on the statistics I just saw in the NOC a few minutes ago. Apple will stop accepting IPv4-only applications in the App Store this year. They've already made that announcement. Your apps all have to support v6 or you're gonna be out.

So now that we're out of v4, we have a few options. You guys saw the "more NAT" video earlier if you were here at the beginning. We could stop all internet growth and live with the internet we have today forever, but I don't think that's gonna be a very popular decision. We could start taking people off the internet in favor of higher-priority uses. Let's make the internet entirely servers, and people just have to not talk to the servers. I'm not sure what good servers are without people to talk to them, but that's a thought. Spammers don't work too well without clients to read the spam, actually. Microphone, get the microphone from Mike. Mike, oh, Mike doesn't still have the microphone. If we got rid of the spammers, that would free up a whole lot of APNIC IP blocks that were assigned to China. Well, except that they were assigned to CNNIC first, and CNNIC doesn't turn loose of anything, but that's a different issue. The reality is, the spammers are already on v6. Trust me, I run a v6 mail server; the spammers are already there. I think Michael can vouch for that too. His company runs some v6 mail servers. He does a lot more v6 email than I do, or his company does. So really, v6 is the only solution that's gonna actually allow us to continue to grow the internet in a meaningful way. More NAT fails really, really spectacularly in a number of ways, and it just isn't going to continue to scale for more than maybe another two or three years, and then it's just gonna become really horrific to try and keep using it.

So with that, let's talk about asking your ISP for IPv6, because it's fun. In 2008, I called a particularly popular large cableco. I won't name Comcast's name, but I was talking to them about my residential service, and the general response I got from the first person on the phone was, IPv-what? So they escalated it to tier two, and their response was, IPv-what? So they went to tier three, where it was, yeah, we're not gonna do that. So I kind of escalated it through the sales department and said, you know, I'm gonna eventually need to be able to communicate with the entire internet, and the extent to which the internet can grow on v4 is limited, so some of that growth is gonna end up on v6 eventually. So "we're not gonna do this" is not a valid answer if you wanna be an ISP in a few years. And they basically said, yeah, we don't care. We're not gonna do it. So that didn't work out so well. In 2013, by then, I was actually able to talk to Comcast, and they went, yeah, we don't have that in your neighborhood yet, but it's coming. And then it was, yeah, we don't have that for business class yet, but it's coming. And to their credit, today Comcast pretty much has IPv6 available for every customer that wants it. All you have to do is turn it on.

Question? Get the mic. Get the mic. No, this is being recorded, so if you wanna ask a question, you have to do it on the mic. No, no, exactly, I do not have a question. I work for a small internet service provider. What I see is that customers don't want IPv6. So how do we run a business without IPv4?
Well, so your customers are gonna eventually want v6, and you obviously can't force v6 down their throats. But I don't know how you keep providing v4 to your customers when you can't get it anymore. We're gonna have a similar problem in probably 50 or 60 years, where people are gonna continue wanting to run their cars on gasoline, but we won't have any more gasoline. Once there's no more, I don't know how you keep selling it. I don't have an answer to that question. At least in the case of internet protocol, we actually have v6 to offer them as an alternative, and it works. I don't know what the alternative to fossil fuels is going to be when we run out of fossil fuels, but when you run out of a limited commodity, you run out; you can't keep selling it. That is the nature of a limited commodity, and IPv4 addresses are a limited commodity. So I don't have an answer to that problem other than to convince your users that they need to consider v6. Yeah, yep. We have a lot of IPv6, but mostly we sell IPv4. Let's say there was a worldwide shortage of oats, and there were people that only liked oatmeal and didn't want to eat any other food. Once they get hungry enough, they're going to eat something else. Trust me, you've got to pretty much treat it like that for IPv4 at this point. We're out. I don't know what else to tell you. I'm going to move on.

So in 2013, I was looking at my iPad, and as long as I was on a Wi-Fi network that had v6, I could get to v6 websites no problem, but when I got on my AT&T cellular connection over LTE, it wouldn't do anything. It wouldn't get to the v6-only sites that I wanted to go to. So I called up AT&T. What do you think the first person I talked to said? IPv-what? IPv-what, exactly. The next person, at tier two, was actually more interesting. They started out with IPv-what, but when I explained what IPv4 was and what IPv6 was, they were like, oh yeah, we don't support getting to websites over internet protocol. This was tier two support, mind you. Not the idiot that just answers the phone, the escalation engineer. To which I responded, what do you use to reach websites, then? They didn't have a good answer. I managed to escalate it to tier three, who told me, well, the problem's actually with the web server you're trying to reach. We can't get to it either. And I said, yes, that's because your network is what's broken. And they didn't want to believe that. So they called Apple, and they got Apple on the phone with me, and Apple said, yeah, we can't get to it either, so it's a problem with the website. And I said, no, your network also doesn't support v6, so you're having the same problem, because your network is broken in the same way as AT&T's. Needless to say, this argument went on for a while, and then I finally got tired of playing the game and walked away without it getting solved. AT&T to this day, to the best of my knowledge, still doesn't know how to spell IPv6 on their cellular network. Yes, on U-verse they actually have some support for IPv6, if you know exactly who to talk to and the right secret IPv6 handshake, to get 6rd, which isn't even really IPv6, but it pretends rather well.

So finally, I'm gonna talk about a trade show that I was at in 2012, where I was in the ARIN booth, and we were stationed next to a booth where they were pushing some monitoring software. I don't even remember which one it was. It wasn't Zenoss, though they would have had the same answer at the time, though they might have been more honest about it.
And literally, people would come to the ARIN booth and we'd talk to them about v6 and get them all hyped up, and they'd walk next door to this guy's booth and say, does your product support v6? To which he'd say, well, you know, you're actually the first customer to ask us about that, so why don't you tell us what your v6 requirements are and we'll get back to you about it, because we don't have any plans yet. So at the end of the day, I walked over and said, so, does your product support IPv6? And he said, you know, you're the first customer to ask about that. And I said, you know, that's really interesting, because I've been in the booth next door all day, we've been telling people about IPv6, I've heard them walking over to your booth, and I've heard you tell more than a hundred people that they're the first customer to ask about it. And he says, oh, you heard that, huh? I'm like, yeah. And he says, well, actually, we're working on v6, but we don't want our competitors to know it yet. We wanna get out in front of them. So we're telling people that they're the first person to ask until we actually have a story to tell. I'm like, well, that's great for your marketing strategy, but have you thought about the consequences that's having on people's thoughts about their ability to move forward with v6 on their networks? Oh, that might be kind of bad, huh? So yeah, fun, fun times.

So how do we go about encouraging content providers? Well, I've tried a few different tactics over the years. One of my favorites was the IPv6 Buddy, which is a little USB keyboard. It's mostly a numeric keypad, but it's zero through nine, A through F, comma, period, colon, double colon, and slash. So you can actually type addresses in really fast, even v6 addresses, and they call it the IPv6 Buddy. And when I saw the announcement of the IPv6 Buddy, I thought, this is really awesome. So I go to my shell window and I type host www.ipv6buddy.com, or whatever their website was, and all I get back is an A record. And I was so disappointed. So, so disappointed. So I opened my email client, you know, my MUA, and I wrote an email to the guy's support address, and I said, you know, I'd like to buy your product, but it's very hard for me to take a product like the IPv6 Buddy seriously when I can't get to the website on IPv6. By the way, I work for Hurricane Electric, and I'd be happy to help you get your website on IPv6 through a variety of possible measures. Please contact me if you need help. A day later, the guy had a tunnel up and working through our tunnel broker, had put his website up on v6 and had it fully functional, and he wrote me back and said, thanks for pointing that out. You're right, it was a really stupid oversight. Send me your postal address and I'll send you a bunch of free IPv6 Buddies. So that was pretty cool. Yeah, it was nice to get the free keyboards, but even better, I got the website on v6 overnight, and I didn't even have to lift a finger other than writing an email. So I call that a success story.
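As an aside, the check I did with the host command is easy to script. Here's a sketch using Python's standard socket module; resolver behavior varies a little by platform, but it's good enough for a quick does-this-site-publish-AAAA-records test:

```python
import socket

def has_aaaa(hostname):
    """Return True if hostname resolves to at least one IPv6 address."""
    try:
        return bool(socket.getaddrinfo(hostname, None, socket.AF_INET6))
    except socket.gaierror:
        return False   # no AAAA record (or the name doesn't resolve at all)

# www.ipv6buddy.com is the site from the story; any hostname works here.
for name in ("www.ipv6buddy.com", "www.google.com"):
    print(name, "has AAAA" if has_aaaa(name) else "is v4-only")
```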
On the other hand, there's a company that's been coming to SCALE every year, and I've talked to them at SCALE and several other conferences every year. In about 2009, I talked to them and said, you know, you guys gotta come up with a way to support v6 on your server instances, or it's gonna be a real problem for your clients. And they said, yeah, we looked at that, it'd cost us $100,000, and we don't have the budget. So in 2010, I talked to them again, and they said, yeah, we looked at that, it'd cost a quarter of a million dollars, and we don't have the budget. So I talked to them again in 2011, and they said, yeah, we looked at that, it'd be half a million dollars, and we don't have the budget. And so I talked to them in 2012, and it was a million dollars. And I said, you guys realize that four years ago when I talked to you, it was $100,000. Three years ago, it was a quarter of a million. Last year, it was half a million. And now it's a million. This is not getting any better while you wait. We know, but we don't have the budget. So Amazon is a fail. I recommend moving your servers off of Amazon if you care about being on the rest of the internet. Unfortunately, Google Cloud still doesn't support v6 either, and neither does Microsoft Azure. So you need to look at alternatives like Linode. Linode fully supports v6. SoftLayer, which is now part of IBM, supports v6, though the older IBM data centers still don't. Or you can look at Host Virtual. All three of those providers do virtual hosting that supports v6. Sorry, Mike. You've got to get it on v6, or I'm going to keep bashing you. And you've got to admit that's fair. So if you care, move your stuff to one of those providers that actually has v6. And by the way, if you care about helping the internet in general, let them know why you're leaving. That's the important part.

Some success: I talked to Blizzard Entertainment at the Game Developers Conference at Moscone Center once. And at the time, their reaction was kind of like, yeah, we don't care so much. Nobody's doing games on v6, and we don't really care. But in fairness, like a year later, World of Warcraft was starting to do trials with v6. And today, I'm happy to say that universally, across the board, if you've got v6 capability on your network, you can turn on v6 in your World of Warcraft client, and it just works. Is that why World of Warcraft usage has dropped by 90%? No. Are you sure? Yes. People have mostly just gotten bored with it. That's why it's dropping.

So if you are a content provider and prefer not to get featured or bashed in one of my talks, there are two requirements. One, implement v6. Two, send me an email saying, please don't talk about us. I will not honor your request to not talk about you if you don't meet requirement one, so be forewarned about that. But I will happily honor your request if you meet requirement one and send me the request. With that, we're on to questions and answers. I'm from Akamai, and they're partially paying for me to be here, so: sponsor, slide, whatever. Questions? Comments? Concerns? Andrew, you're drafted. You can run the mic around. Yeah, that's what I've been doing.

Hi. Do you think it's likely that some part of the market, like some country, stays mostly IPv4-only while the rest of the world gets IPv6? Like, one or two countries just don't adopt IPv6? Well, I don't think that that's necessarily unlikely, but I think that once it gets to that point, those countries are gonna try and find a way to get on v6 pretty quick. Because once most of the world has adopted v6, they're not gonna keep supporting v4 just to reach those two countries, right? I think in three to four years, you're gonna see the larger cablecos and cellcos starting to at least charge extra if you wanna reach v4 content.
Because by then, we're going to be at a point where they can get away with that, and where it's actually costing them so much extra to maintain v4 connectivity for the customers that care that they don't really have a choice in the matter anyway.

So, Akamai is a big content provider, and you guys ship multiple terabits per second all around the world. My question is more around your global load balancing of this, moving from v4 to v6. As you learn these routes and collect the data to decide, I think you do a lot of DNS load balancing, you decide where to serve traffic from. How have you been able to go from working with a couple thousand IP addresses in a v4 slash-whatever to these big v6 subnets that you have to decide where to load balance to? Well, without getting into too much detail, we don't get down to the individual client address in the load balancing in either case. About as granular as we get in v4 is the /24, and about as granular as we get in v6 is the /64, and we just deal with them that way. And for the load balancing, if we're able, we'll return both an A and a AAAA, if the customer producing the content is actually able to work with us to supply the content over v6. We actually support a lot of customers that have their backend server, the origin server, only on v4, and we will deliver their content over both v4 and v6. So if you have v4 content and you want to be available over v6, actually reaching out to Akamai is one way you can dual-stack your entire web presence without having to do any work on your side to get v6 enabled. Having said that, I do wanna clarify one thing that you got a little bit wrong. We are not a content provider in the sense that we don't produce content, other than the Akamai website and the internet weather report and some things like that. We are a content deliverer and a content accelerator, and we also provide DDoS mitigation services.

So I actually did have a question. We discussed Amazon's inability to provide IPv6, and I think we're supposed to be glaring at someone over there, but I'm not exactly sure who. No, I wasn't glaring at anyone. I don't know if there's anyone from Amazon in the room. I was glaring at him for Google. Okay, okay, sorry about that. I like to pick on Mike because he picks on me at the ARIN events. So, is it possible to run a v6 tunnel into AWS? Can I just go to HE? You might or might not be able to do that. I don't know; I haven't experimented with Amazon's ability to cope with tunnels. To the best of my knowledge, everything at Amazon is rather strangely NATted. It is. And therefore I don't think the tunnel would survive the NAT process, because the tunnel is a protocol 41 tunnel. It's not a TCP port, so it's not gonna map the same way. A stateful NAT is gonna kind of blow chunks on that. Mike, and then the guy in the white shirt. I know that guy. I wouldn't answer his question. So I can't speak to GCE either, but if you can embed your tunnel in UDP, you can probably get it to each individual host on any cloud. But then you're gonna have to run a per-server tunnel, because even if you get it into the cloud, most of the clouds are not providing a true layer 2 emulation these days. So you're still gonna have to worry about host-to-host traffic. To answer the gentleman's earlier question too, I don't know what Akamai is doing, but a lot of networks are having trouble trying to approximate the aggregation boundary for v6.
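As an aside, the /24-versus-/64 bucketing Owen described is easy to express with Python's ipaddress module. This is only an illustration of the idea, not Akamai's actual load-balancing code:

```python
import ipaddress

def lb_bucket(client_ip):
    """Collapse a client address to the granularity described above:
    a /24 for IPv4 clients, a /64 for IPv6 clients."""
    addr = ipaddress.ip_address(client_ip)
    prefix = 24 if addr.version == 4 else 64
    # strict=False masks off the host bits for us
    return ipaddress.ip_network(f"{client_ip}/{prefix}", strict=False)

print(lb_bucket("198.51.100.37"))        # -> 198.51.100.0/24
print(lb_bucket("2001:db8:1:2::abcd"))   # -> 2001:db8:1:2::/64
```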
In v4, as Owen said, the /24 is pretty safe, and it's relatively easy to collect the entire list of /24s in a 16-million-row table, but it's much harder in v6. You can't assume a /64. You actually have to approximate where the subnet mask lies for a particular user range, and there are a bunch of people out there doing that, but it's challenging. I did have a question for Owen. Akamai used to charge an IPv6 premium. Do they still? No. When did that drop? I have no idea when it dropped. It was before I started working there. All right, good to hear. Yeah, if it hadn't been before I started working there, I'd have found the appropriate person and beaten them repeatedly about the head and shoulders until it got fixed.

Yes. So at home, I have what is for all intents and purposes a really great ISP, with one obvious exception. Every time I bring up, when are you going to support native IPv6, their response is, but we have 6rd. And I have this wonderful conversation with them about the difference between tunneled and native, and I never get a good answer out of them as to why 6rd, or me setting up my own tunnel, is the only option. Any tips for how to counter that? Sure, ask them about jumbo frames. Okay. Because the main problem that you actually have with 6rd is the inability to send a 1500-octet packet. So if they'll support jumbo frames for your v4 at 4096, for example, then you can get your 1500-octet packets through 6rd and you don't care. Yeah, there are still issues with firewall failover, and 6rd is not fun. Yeah, okay, that's fair. Because they actually can't do it natively on the particular modem they provide, so I have to do it myself, and then it gets ugly. Yeah. I believe they blame AT&T, because they're leasing the equipment, the CPE, from AT&T. Well, yeah, and AT&T doesn't know how to do v6 except 6rd. So it may be that they don't have the ability to provide you a better answer, in which case, again, I return to: ask them about jumbo frames, and that'll at least solve the MTU problem, if they can figure out how to talk to AT&T about jumbo frames. I'm not sure AT&T can spell jumbo frames. The biggest problem with AT&T is that it's a very large organization full of very specialized individuals, such that if you need a line of code typed into an AT&T router, it may take 38 people to do it. One to type the A's, one to type the B's, one to, yeah. Any other questions? Down here in the front. I'm gonna make you walk as far back and forth as I can, Andrew.

So, as Akamai has rolled out v6: v6 addresses are four times the length of v4 addresses. Do you guys have any performance numbers? Did you have to send more packets because you lose payload space? Actually, by and large, it's the same number of packets, because the header, believe it or not, is only twice the size of the v4 header, even though the address length is 4x, because we simplified the header a great deal in v6. But no, we generally don't see much of an increase in packet count.
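As a sanity check on that packet-count claim, here's some back-of-the-envelope arithmetic. The sketch assumes minimal headers and a hypothetical 10 MB transfer at a 1500-byte MTU:

```python
# The v6 header is twice the v4 header (40 vs. 20 bytes), not 4x, even
# though the addresses are 4x longer, so the packet-count difference
# at a normal MTU is tiny.
MTU = 1500
V4_HEADER = 20          # minimum IPv4 header, no options
V6_HEADER = 40          # fixed IPv6 header

payload_v4 = MTU - V4_HEADER   # 1480 bytes of payload per packet
payload_v6 = MTU - V6_HEADER   # 1460 bytes of payload per packet

transfer = 10 * 1024 * 1024    # a hypothetical 10 MB transfer
pkts_v4 = -(-transfer // payload_v4)   # ceiling division
pkts_v6 = -(-transfer // payload_v6)
print(pkts_v4, pkts_v6, f"{pkts_v6 / pkts_v4 - 1:.1%} more packets")
# -> 7085 7183 1.4% more packets
```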
And generally speaking, v6 actually performs better because of the simplified header. It's faster for the silicon to parse the header. The addresses are moved closer to the beginning of the header, so the silicon is able to start the lookup while the packet is still arriving, and other tricks like that. The FIBs are actually simplified quite a bit in v6 versus v4. The firewall rules tend to be simpler, because you're not having to deal with, well, what if it's NAT and what if it's not, and all these other weird complexities that come with v4 and address shortages. And so, as a general rule, all of the v6 stuff is just so much easier, so much cleaner, that it turns out to also be slightly faster, but not noticeably. TE is much, much simpler in v6. Hold on. There have been a bunch of studies at NANOG about this: because v6 traffic tends to be on tunnels that are independent of v4, and those tunnels are smaller, they tend to be able to TE onto more direct paths than v4 does. Yeah, but hopefully the tunnels are gonna go away. Hopefully that's temporary. Well, no, but, I mean, yes, although most of the carriers are still running MPLS, so they're still sticking it in some kind of MPLS-style tunnel anyway. And a lot of the tunnels that carry v6 traffic tend to be smaller and therefore easier to move onto more direct paths these days. But as v6 adoption increases, that benefit will go away. Yeah, and we try to avoid doing traffic engineering on our quote-unquote network, but that's partially because we don't actually operate a network in the conventional sense.

So, a governance question. You pointed out RFC 33, which had a drop-dead date for eight-bit IP addresses. Yep. And the question is, is there any discussion within ARIN or within the wider community of regional internet registries to have a drop-dead date for IPv4? Or do we just not have enough IPv6 adoption at this point? Well, so there are two problems with your question. The first one is that that wouldn't be an RIR thing, because the RIRs don't deal with RFCs. This would be more along the lines of an RFC that would have to come out of the sunset4 working group in the IETF. Now, as to whether that's happening in the IETF, it's funny you should bring that up. Jacques Latour from Canada and I are actually at the moment working on a draft that we're going to submit, an ID, to actually propose just such a date when we move IPv4 to deprecated status, and then a later date when we move it to historical status. And we're still kicking around exact dates, but the current thinking we're probably going to congeal around is proposing that it be deprecated around April 4th of 2020, and that it be moved to historical on 4/4/2024. And are those dates going to slip as much as ADS-B? Probably. But, you know, who knows?

Any other questions, comments, rotten fruit, tomatoes? Yes, over here. Hopefully he's not going for rotten fruit or tomatoes. Do you see IPv4 ever going away for local networks or something? That depends. Would you say Novell has gone away for local networks today or not? Novell, IPX. So you would say yes, but I'm willing to bet there are some people here that are saying, well, wait, I'm still running Novell. Okay, so the reality is, the answer to that question depends a great deal on your definition of going away. I think that it will probably not be in my lifetime that the last host running IPv4 is turned off or stops running IPv4. But I do think that IPv4 as the lingua franca of the internet is not going to last much more than another five years, if that long. I could be wrong about that. We may somehow limp it along much longer than that and really just keep hitting ourselves in the head with a hammer as long as we can, but eventually we're gonna lose consciousness and stop doing that. Because the guy hitting himself in the head with a hammer, once he loses consciousness, stops doing it. And that's what it's gonna take for IPv4 to stop on the global internet. Right now we're hitting ourselves in the head with the NAT hammer over and over again, and some people seem to enjoy that.
I myself am not such a masochist, so I don't run NAT at home. I've got enough v4 addresses that I don't have to. But other people are doing different things and have to support more growth than I do. So I think in terms of the global internet, four or five more years of v4 is gonna be achievable but painful; going beyond that is gonna be much more painful, and the cost just keeps escalating. There's already a rather large social network that is trying really, really hard to stop running v4 at all. They want the CDN that they're working with to accept all of their origin traffic over v6 and then front them for v4 and v6, so that they can turn off all of their v4 internally and run just v6. And they wanted us to do that last year. So you're gonna see more and more of that, I suspect. More and more people just not wanting to face the continuing and escalating expense of supporting v4. And that's what's gonna kill it eventually. But I think we're probably at least three or four years away from that. And sadly, maybe more. Because, as the other gentleman pointed out earlier, our customers want v4; they don't understand. Well, yeah, people want stuff they can't have all the time, and it is what it is. Anything else? Great, go for it. Oh, one more. We still have 13 minutes left in the session, so use them wisely.

So when we get enough folks migrated over to IPv6, are there any other concerns that you see on the horizon? Or is routing scalability, like Mike mentioned, gonna be the next great thing we have to solve? Routing scalability is gonna be the next great thing we have to solve, because currently we're maintaining a global table of everything somebody considers to be a uniquely routed prefix, including the people that think that if you get a /16, you need to turn it into 256 /24 routes, all with the same next hop, because that somehow avoids DDoS. It doesn't. Kid you not, look at the routing table. Almost every provider in Asia has developed that religion. They disaggregate everything into /24s. It's horrible. That's why we have a routing table pushing 700,000 routes today in v4. It's a little better in v6. They mostly are not disaggregating every /48, fortunately. Don't say that, just don't even think it. Yeah.
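To see what that disaggregation habit does to the table, here's a small illustration using Python's ipaddress module and a stand-in prefix:

```python
import ipaddress

# One /16 announcement turned into 256 /24 announcements, all with the
# same next hop. (10.0.0.0/16 is just a stand-in prefix for illustration.)
aggregate = ipaddress.ip_network("10.0.0.0/16")
more_specifics = list(aggregate.subnets(new_prefix=24))

print(len(more_specifics))                     # 256 routes where one would do
print(more_specifics[0], more_specifics[-1])   # 10.0.0.0/24 10.0.255.0/24
```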
So that's gonna be the next thing we have to solve: we have to find a better way than global prefix-based routing to route this. I personally think that if we started routing based on the closest transit AS, it would be much better, because then the transit AS can do conventional prefix routing. But in order to do that, we're gonna have to either add an extension header or change the format of the v6 packet. I've been looking at that for a number of years, and I haven't found a way to do it that's clean enough that I can propose it to the IETF. I'm happy to talk about it with anybody that wants to help do better protocol engineering, because I am not a protocol engineer and I don't wanna play one on TV. But that's the state of things, and that's what I think is gonna be the next big problem. Mike, do you have any other ideas on the next big problem? Say again? No, I mean, I think the point I was trying to allude to earlier was not just table growth. You and I both supported a draft in ARIN to actually implement /12s. And the problem is that we are giving away large swaths of the v6 address space to the largest operators, of which we are the largest operators, and we are the ones who approved giving that away. So far that hasn't bitten us. But I do worry about what happens 20 or 30 years from now, when v6 would otherwise be fine but for the fact that we've capped the number of large operators at some fairly small value, and we find ourselves running out of network bits. I mean, v6 has brought back classful addressing. Let's face it, the /64 is a pretty hard classful boundary, and we moved away from that before. And the /48 is a fairly firm, but not as hard, classful boundary. So we've whittled those 128 bits down to a much smaller range with which we can play. And then on the RIR number-space side, we've been giving out chunks of those pretty liberally, partly to support v6 adoption, which I think is right. But Postel did the same thing, and Vint has made similar comments about the early days of v4 allocation. So I worry a little bit about what happens in 20 or 30 years.

I worry a little bit about it too, but at the same time, I actually think we're probably okay, because if you look at it, there are maybe 40 or 50 organizations worldwide that could qualify for a /12 today under ARIN policy. And there are 4,096 /12s, well, less than that, because we've got to subtract a few for silly things like ULA. ULA is silly. I'm sorry, it's just silly. It has no purpose in life. It makes no sense whatsoever to me. Unique local addresses: it's basically the IPv6 equivalent of RFC 1918. Yes, it's a much bigger RFC 1918 space within which you may not collide, and you have a much lower chance of collision, but it's still RFC 1918, effectively. But the reality is, there are still probably more than 3,000 /12s available. And even at that, what I have repeatedly said is that if we manage to exhaust the current /3 in less than 50 years, and I'm still around, I'm happy to work on more stringent policy for the next one-eighth of the address space. So we actually have a safety valve there. If we burn through the first one-eighth of the address space faster than we anticipate in the next 30 to 40 to 50 years, we can apply the brakes on the next one-eighth, or even the one-eighth after that, if we have to keep issuing while we develop new policy. Arlena wants the mic.

So I think it's gonna be okay. You know, I know of a few organizations that I haven't worked for that could probably get /12s. You work for one of them. The others I can think of are AT&T, Comcast, and perhaps Verizon or Time Warner. But that's about it in terms of providers I can think of that could qualify. It's relatively easy, and I've actually so far qualified for three /24s, but they were for tiny little organizations like Akamai, the US Department of Agriculture, and Hurricane Electric. So, you know, we're not talking about particularly small organizations qualifying for /24s, and in that first /3, we've got two million of those available. So I think we're gonna be okay for quite a while with the current policy, which is, by the way, not universal across all five RIRs. ARIN is probably the most liberal of the five. Strangely enough, RIPE has the most liberal IPv4 policy and the strictest IPv6 policies on the planet. So go figure.
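For reference, the allocation arithmetic behind those numbers is just powers of two:

```python
total_slash12s = 2 ** 12               # 4,096 /12s exist across all of IPv6
slash24s_per_slash3 = 2 ** (24 - 3)    # 2,097,152: the "two million" /24s
                                       # in the first /3 being issued from
slash56s_per_slash48 = 2 ** (56 - 48)  # 256 /56s per /48, the ARIN
                                       # measurement ratio mentioned shortly

print(total_slash12s, slash24s_per_slash3, slash56s_per_slash48)
```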
Arlena. Just out of curiosity, any imminent network security concerns in the future of IPv6? Yes. IPv6 is very, very insecure. It has all the same security threats as IPv4. Are you aware of any class of enterprise network device, like firewalls, load balancers, proxies, where that class of device, as currently on the market, is actually gonna be hindering IPv6 adoption? I'm sure there are several products that don't support v6 and are therefore hindering v6 adoption among their customers, but there are enough out there that now do support v6 that my recommendation is: vote with your dollars. Don't patronize those vendors, and they won't hinder you. Juniper supports v6. I believe Check Point now supports v6. Who's the really tiny, goofy one? Not SonicWall, but... ZyXEL? No, not ZyXEL. No, not Palo Alto. It's Fortinet. Even they support v6 now. So, yeah. And even some of the home stuff, like from D-Link, is supporting v6 now. My HP printer supports v6, though not very well. Even the IPv6 firewall in my HP printer supports v6, except you can't enter new addresses for rule sets. It has everything it needs to support v6 in the firewall built into the printer, except the user interface for entering an address into a firewall filter only accepts v4 addresses. Thank you, HP.

So one of the challenges, actually, is in home networking. Just recently, I think we were both privy to a Comcast issue with prefix delegation. And I remember sitting down after the v6 launch day with some of our colleagues from Apple; I know Tony, Tony at Cisco rather, was working on issues around prefix delegation in the home, but our colleagues at Apple were too. Because one of the problems is v6 doesn't support NAT. So there's a class of devices that won't work in v6: all NAT devices won't work in v6, because v6 abolishes NAT, until somebody bastardizes it and starts doing NAT, which we know people are doing. You can already do NAT with iptables and v6. You can, and that's unfortunate, but we know people are doing it. And when v6 was designed, it was designed with the premise that, yeah. Actually, amusingly, that wasn't by design, it was by accident. They generically added v6 capabilities to the iptables code, and it turned out that when they did, all of the same things you could do with v4 became things you could do with v6. So, you know, the problem is, in the home there's a remarkable amount of double or triple NAT right now, because when home users wanna expand their home network, they stick another NAT device in and they don't think much about it. They just stick it behind their existing router, and it will happily double or triple NAT their traffic, which isn't great for performance and makes NAT punch-through really difficult. Gamers figure this out quick, but average home users just browsing don't. And so one of the challenges with v6 is, how do you deal with devices that are behind other devices? How do you do prefix delegation within the home? Right now most homes are getting /64s, but really they should be getting /56s, but even when they get /56s... No, really they should be getting /48s. There's still some debate about that. I'm strongly in the /48 camp. I remember. And I'm gonna stay there, and I've got the microphone up front. That's true.
Well, whatever size prefix it is, whether it's a /48 or a /56, there's not a great way to distribute that within the home, right? Most home users aren't gonna go in and construct a routing table for their multi-tiered topology, and so there's still a lot of work being done on the CPE end. Actually, once the CPE supports PD, generically stacking CPEs that support PD should be relatively simple. It should be, and that's where you do get into the /56 problem, right? Because one of the things that Cisco has pointed out is, if you're doing /56s, what size prefix do you allocate to an individual port on each device, and how do you figure that out? Yeah, at that point you're limited to some version of four by two. Yeah, so that's where you start to get into those problems. Right, which is why a /48 is better, because now you've got all kinds of possibilities, two by eight, four by four, lots of different things you can stack together. It is true, but then it further pushes down the aforementioned prefix size for networks. It doesn't do that much damage. And here's the interesting artifact of ARIN policy that you can get stuck in. If you assign /56s to your residential customers and /48s to your business customers, which I realize Comcast is not doing, to their credit: screwing their business customers just as much as they're screwing their residential customers, no matter who you are, you get a /56. It turns out that if you're issuing /56s and /48s, all of your ARIN measurements for whether you qualify for additional address space are based on the /56 number. So for everybody you gave a /48, you'd better be able to justify 256 /56s. Yeah, it's true. I did write that section of policy, but the community supported it.

So we've got one minute left. So, one more question. All right, thank you. Oh, there we go. Last question. You've talked a lot about various ISPs and telcos. Any comments on Verizon Fios? Verizon Fios: anybody here who has Verizon Fios, please call them and ask for v6 once a month or so until they finally implement it, because they have been completely stubborn and intransigent about it. They still don't know how to spell IPv6, and they're almost as far behind as Sprint these days. Believe it or not, in this regard, AT&T is slightly better than Verizon. AT&T U-verse, where you can get it, does v6 more than Verizon Fios does. And in fact, Verizon Fios is the first one to put CGNAT onto their customers, though at least they give you the option of opting out. So you may want to check whether you have a real public address or an address in the 100.64 range on the outside of your gateway if you're on Verizon Fios. And if you have a 100.64 address, I strongly encourage you to opt out of the carrier-grade NAT and tell them that you don't like it. I'll get to you offline, but I've got to clear the podium for whoever's next in the room. Who does support v6? Comcast over their cable network, Time Warner over their cable network, AT&T U-verse, T-Mobile and Verizon over their mobile networks, and Google Fiber, as long as you don't need a Google Cloud instance. Google Mail supports v6, YouTube, all of that stuff. So anyway, thank you all very much. Have a wonderful day at SCALE.

I had lunch before I, hang on a second. Hi. Testing, testing? Yep. You know, ZFS. I really wanted to go to the ZFS talk. You'll see what's going on, bro. Yeah, so that's why I was like, well, okay, it seems to be a big deal. Yes. It's a good call and a half, right?
Yeah, so I did the, we're glad you're here. I was curious about the European one before yours, then I missed yours. And then I went to the lessons-learned-the-hard-way one. So AJ went to the other one, right? He said he would do it, honestly. The server hardening one, what was that like? Yeah. Simple server hardening or something like that, yeah. Right, well, I looked at the Netflix slides last night. They were up last night, and I'm like, they have a whole bunch of pictures here, but they don't tell you anything at all. It's like all the pictures are different, right? Like, well, okay. So we're looking at your... Yeah. I guess I want to hear what Zachary and me said. We're supposed to put our slides up. One of the guys yesterday did his slides as a Docker container, and so he was like, if you want to make everybody upset, you know, wearing the yellow shirt, just go ahead and pull the Docker container in. And it's like 110 meg. It's a Linux Xcode, man. I know that we've got a bunch of Postgres users, that several of you are Postgres users. Anybody here not a Postgres user? Oh, there we go. Kinda, okay. How many people here use Docker and other container tech? A couple, okay. Well, one way or the other. Go ahead and get started.

So welcome. Late afternoon Sunday sessions are a little lightly attended, particularly when you're up against some of the people I'm up against, but I think you'll find this worth your time. I'm gonna be talking about doing high availability stuff with new tools for PostgreSQL. For our one non-Postgres person, this could be adapted to other systems, and probably will be, but right now it works for PostgreSQL. So rather than just starting in on a long description of the architecture, I thought it would be more fun to show it to you at work first, because there's been a lot of Postgres HA stuff in the past that has involved a lot of hand-wavy description of how things work, and instead I wanna show you an actual working system here. Now, mind you, it's not a real production system, because I'm running all of the containers on this laptop, which is not what you would do in production. Obviously you'd have them all running on separate machines, but it does make for a good demo.

So the first thing I'm going to do here is use Docker Compose to bring up my containers. And let's tail the Docker Compose logs so you can actually see what's going on here. Ooh, there we go. Okay, we're getting a lot of stuff. And we're getting reports from two of the nodes there, as you can see at the bottom, nodes three and one, that they bootstrapped from the master. We're getting feedback that those two are secondaries. So let's see, do we have a replication cluster here? Let's see. So here are our containers running right here. We've got three database nodes and one etcd node; I'll explain what that's doing later in the presentation. It looks like node number two is our master, so let me find out its address. I'm going to log into that. And it looks like it has two replicas. So we've got our cluster, one master and two replicas, right here. And we are up and running. But of course I'm here to demonstrate failover, right? And high availability. So let us kill off the master. Okay, so again, two dot one, node two, is our current master. So we're going to stop it, and you see a whole bunch of activity there. This is all log output from the Patroni system. And then we get a whole bunch of stuff here. Wait.
And we can see the other two nodes restarting. So let's try this again. Oh, we don't have a connection anymore. Hold on, it's got hold of the cursor and doesn't want to let go of it. Just a moment. Well, the TCP connection times out. Okay, so let us connect to, I think this is going to be, oh, nope, wrong one. Who's the master here? Node three says: I am the leader with the lock. So there we are. And you can see that now node one is streaming from node three, which is the new master. So this is the essence of our system. There's a lot of code that went into making that happen, and now we're going to talk about that. In the meantime, we'll leave our two-node cluster running. Since we do have a small group, feel free to interrupt with questions, although at this point I haven't even started to describe the architecture. Yeah? All right, yeah, actually, no, no, no. Hold on, I meant to do that as part of the demo. Let's go ahead and bring that node back up, hey? It should come back as a slave, so let's do that. Thank you for asking. Okay, so I just restarted DB node two. And you see we've got a bunch of stuff going on there. It's looking for its replication slot, it can't find it, and so it creates a replication slot. Replication slots are a feature of Postgres 9.4. They're really useful, which is why we create them by default. And now, if we look over at the master, which is still DB node three, it now has two replicas, because DB node two has restarted and rejoined the cluster as a replica. And that works whether we're instantiating a brand-new DB node two or restarting the old one; restarting the old DB node two only works if its replication history hasn't diverged from the new master, unless you're using the pg_rewind tool, which I'm not going to cover in this particular session because it's new. So that's operation in a nutshell.
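As a hedged illustration of those replication slots (this is the underlying Postgres feature, not Patroni's code), here's how you might create and inspect a physical slot from Python with psycopg2; the connection parameters and slot name are made up for the example:

```python
import psycopg2

# Hypothetical connection parameters; point this at the current master.
conn = psycopg2.connect(host="db-node-3", dbname="postgres", user="postgres")
conn.autocommit = True
cur = conn.cursor()

# Create a physical replication slot like the one created for a rejoining
# replica (replication slots are available since Postgres 9.4).
cur.execute("SELECT * FROM pg_create_physical_replication_slot(%s)",
            ("db_node_2",))

# A slot pins WAL on the master until its replica has consumed it, so a
# restarted replica can catch up instead of needing a fresh base backup.
cur.execute("SELECT slot_name, active, restart_lsn FROM pg_replication_slots")
for slot_name, active, restart_lsn in cur.fetchall():
    print(slot_name, active, restart_lsn)
```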
So now let's actually describe how that worked and why it's needed. We're gonna first start out: Postgres' built-in replication is really cool. I'm really happy with it. It took a number of years to hammer out, but at this point, it's easy to set up. It makes all kinds of guarantees about replicating your data. It prevents data corruption. It prevents a lot of the common foot-guns with replication. That is, you have to really try hard to fuck up your systems. The worst that'll happen is you'll break replication. Replication will stop working, right? And you can combine it with disaster recovery fairly easily: you have replication and disaster recovery in one mechanism. So you think, hey, all this is great, but we're kind of missing something. Which is: why is there no built-in failover? I mean, MySQL has built-in failover, right? Well, sort of. And this has actually led to a number of things. I actually heard this at a recent conference, I forget which conference it was, in the fall: somebody was talking to a Postgres person at their Postgres booth, and that person literally said to them, automated failover is too complicated. You don't want it. Well, no, that's not good enough. A lot of us have SLAs to meet. We have always-on applications. And it's not impossible. Automated failover is doable, particularly if you restrict the problem. Now, part of the problem that we run into in the Postgres world is we try to solve everything for everyone. And coming up with a failover system that will work for absolutely everyone, no matter how they're using their database, or what they're using it for, or what hardware or environment they're running in, is, in fact, pretty close to impossible.

However, coming up with a failover system for how a lot of people use their Postgres databases these days is doable: a bunch of OLTP web databases running in some kind of cloud or container environment, where we can have a pool of asynchronous replicas and automatically promote one when the master goes down, and where we have the ability to have some kind of watchdog node. Those are sort of our prerequisites, right? That's going to be our system. It turns out that this particular set of requirements meets the needs of a lot of people. Not everybody. There are people who need synchronous replication and guarantees against data loss. There are people who have to run on large hardware, where the cost of spinning up a new node is prohibitively expensive. But an awful lot of applications fit this spec.

Now, I wrote out that sort of spec a few years ago and then worked with a team of my former coworkers at PostgreSQL Experts to build a system called HandyRep based on that set of requirements. HandyRep is a master-controller architecture. It's built with Python, Fabric, and SSH. It's in production in at least one place that I know of. It's been forked a couple of times, so I don't know if it's in production elsewhere. And the idea, because we were a consulting shop, was to build something that would work with any of our clients' Postgres configurations in place, as they were, which often involved dealing with a lot of really screwy stuff in terms of LDAP authentication and special subdomains and all kinds of other things. It was also designed to be pluggable, in order to support all of those infrastructure variations. So we worked on that for about a year and a half and pretty much completed the initial spec, but there were some problems with it as a general solution. You pretty much had to have us install it for you. It was difficult to install, and it was really difficult to debug: when we would lose a replica out of the system, it was really hard to figure out why. In the end, the configuration had over 100 different options, possibly more depending on plugins. It scaled kind of poorly: it was great for two- to four-node, maybe six-node clusters, but not beyond that. And the HandyRep server itself, as a master controller, was kind of a single point of failure. We could have a secondary HandyRep server, but failover to it was manual, and that wasn't helpful; and we couldn't have two active ones, because then we had potential problems. So, okay: part of the problem with the HandyRep design was that we were trying to be too general. We were trying to tell our clients, you don't have to change anything about how you're doing Postgres; we'll retrofit a failover system on top of that. And that is what some people need, but it's not a general downloadable solution. I really wanted a general downloadable solution.

Well, in the meantime, there's this company called Zalando. Zalando is like the number one European fashion portal. Hold on, I'm trying to figure out a way I can put down my coffee without it falling over. Okay. Zalando is the number one European fashion portal. They've got about 15 million customers and ship some ridiculous amount of merchandise per week. They have 150 Postgres database nodes in their environment, and they have to be 24/7/365, because they control their own shipping, not just online sales. So allowable downtimes are tiny. And for that reason, they needed automated, decentralized high availability.
They did look at HandyRep, but they felt it didn't fit their needs, because among other things, they needed to support much larger clusters, and the clusters needed to be a lot more autonomous and not dependent on the DBAs to configure them. So, while Zalando was looking at this, and while I was getting dissatisfied with not being able to make HandyRep portable enough, some stuff happened. Now, Zalando tried to do it on their own first, and they ran into a lot of the common problems that you have with automated failover and high availability, right? False failovers, where you fail over when you didn't need to, which is always problematic because you break a bunch of application connections that have to reconnect, and if that's happening all the time, it becomes a bad experience for the user. Misfires, where you try to fail over and you can't complete it. Race conditions, where you can't figure out who the new master is supposed to be and you either end up with two or none. Those sorts of things. And then they encountered the worst problem with automated failover. Does anybody know what the worst problem with automated failover is? Yeah, exactly. Split brain.

So OpenX kindly supplied me with this little brain. So we've got split brain here. Here, have half a brain. Oh, and here we go, have half a brain. Ta-da. So, split brain. Yup, split brain is our big problem, and from the perspective of people who are approaching this from a transactional-database background, where consistency is considered important, split brain is kind of the worst place you can end up, right? It's actually usually better for the system to be down than to have a substantial risk of split brain, because there is no automated recovery from split brain. And under some catastrophic circumstances, no recovery at all. So what we really needed was a service that could come in and bless all of our little cloud Postgreses. Those are our little flying elephants, if you can't tell. Bless all our little cloud Postgreses: bless one of them as the master, and if that one goes away, bless another one as the master, and be consistent and immutable and independent about it.

While we were thinking about all of this, a company called Compose.io, which was an online hosted cloud database-as-a-service company, was about to be acquired by IBM. And before the acquisition, they open sourced a bunch of their stuff, including the system that they used to provide high availability for Postgres. Not the complete system, but the initial proof of concept that they did, which became a system when it got integrated with their architecture, et cetera, right? And they open sourced that as "high availability for Postgres, batteries not included": it was called the Compose Governor. Now, this was just a proof of concept, but the ideas behind it were really good. And part of it was that they used a lot of technology that was, at that point, just emerging. This was about a year ago. Actually, it's not even a year ago. God, six months, seven months ago? Seems like a much longer time. So: Linux containers, etcd for consensus, and a simple Postgres controller that lived on each node. And so we forked it. Zalando forked it, and I started contributing to it, into a new project, to actually make it production worthy.
So now that's our background; let me explain how it all goes together, so you actually understand the new system. The first thing to understand is that there are actually three parts to database failover. Part number one is detecting when you need to fail over. Part number two is actually failing over the database. And the third part is failing over the application from one database node to another. Now, in the current Patroni system, part number one, detecting failover, is handled by timestamps within etcd (again, I'll explain etcd in a minute) and API checks on the individual nodes. The clocks have to be consistent across the cluster for the etcd timestamps to work. And honestly, if your timestamps are wildly out of whack across your database cluster, you're going to have other problems as well; that won't go undetected for long. The second part is handled by what's called leader election within etcd, in order to decide who to fail over to, and then, of course, Postgres replication failover. The third part, failing over the application, is not yet handled in Patroni, although I'll talk later about how that's handled by external systems.

So here's how it works. We have our little Postgres node running in a Docker container. So here's our elephant in the Docker container, right? Now, that's not just running Postgres in the Docker container. It also has this little Patroni daemon, which is a pilot, and the Patroni daemon actually controls whether Postgres starts and stops, and controls its configuration. The idea being that if Patroni isn't running on that node, neither is Postgres, period. This was one of the big problems I had with HandyRep: I ended up with all this complicated logic of, okay, how do I detect whether Postgres is actually down or whether it's just HandyRep that's down? Well, the answer is, we set it up so that it is not possible for Postgres to be running if the Patroni daemon is down. No, no, you can actually set this up on VMs, and you can even set it up on real hardware. Containers actually encapsulate this a lot better, but no, I believe Zalando is actually doing this on VMs.

So you've got one of those, and then you've got a whole group of these, right? A whole group of our little Postgres containers, each piloted by Patroni. And Patroni is just a little Python program running, in the container case, as the application of the container. The nice thing about the container approach is that you don't have to take extra measures to make sure that if Patroni shuts down, Postgres shuts down. Because in a container setup, if my application for the container is Patroni, then if that application stops, the container stops. And that's enforced by the container infrastructure. If you're running this on VMs or real machines, then you would actually need to do some extra stuff with, say, systemd, to make sure that if Patroni exits, the Postgres postmaster exits as well. That's the advantage that running this in containers buys you, even if you're gonna have one container per machine.
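Here's a minimal sketch of that pilot idea, assuming a postgres binary on the PATH and a made-up data directory; this is the concept, not Patroni's real implementation:

```python
import signal
import subprocess
import sys

# Start the postmaster as a child of this pilot process. The data
# directory is a placeholder; a real pilot gets this from configuration.
postgres = subprocess.Popen(["postgres", "-D", "/var/lib/postgresql/data"])

def shutdown(signum, frame):
    # If the pilot is told to stop, Postgres stops too.
    postgres.terminate()   # SIGTERM triggers Postgres' "smart" shutdown
    postgres.wait()
    sys.exit(0)

signal.signal(signal.SIGTERM, shutdown)
signal.signal(signal.SIGINT, shutdown)

# If Postgres dies on its own, the pilot exits; in a container setup,
# the pilot exiting stops the whole container.
sys.exit(postgres.wait())
```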
So then what happens is, we say, okay, we've just started up three co-equal Postgres nodes. How does this become a replication cluster? Well, these three co-equal Postgres nodes need one other element, which is an etcd cluster. etcd is a distributed, consistent key-value store — it's actually more of an HTTP information store. I'll talk a little more about it later; for now, let's just understand how it functions. So we've got an etcd cluster here, which functions as a single consistent service. When we start up our nodes, they all send messages to etcd requesting to be the master. etcd holds what's called a leader election, and decides that one of them wins. It sends messages back and says: okay, node two, you're the master; nodes one and three, you are replicas and your master is node two. And so then we do an automated base backup from node two, in order to make nodes one and three replicas. Yeah? Yes, yes it does. etcd does in fact store its data, and it writes it synchronously to disk, so this does recover from being down — although you might see a whole bunch of failovers if the system doesn't come back up all at once, which is often the case. Yeah? It could be two nodes, it could be five, could be 15, whatever you want. No, there is no particular magic number; it's just that three is good for a demo, because it's enough that I can actually demonstrate failing over without using all the RAM on my system. So anyway, that's establishing the initial replication for a brand-new cluster. Now, if you were retrofitting this onto existing Postgres servers, you would have to take some extra steps. But if we're doing it as a blank canvas and we're going to load everything via pg_restore, then you just do this, right? So then the question is: what happens when we lose the master, as I just demoed? Well, all of our nodes are sending messages to the etcd server every 10 seconds in the default configuration — you can configure that interval. Those notices have a time-to-live of 30 seconds. So within 30 seconds, the other nodes check etcd and discover that there's no longer a master. Hey, no master exists anymore. And then, whichever one happens to do it first is going to try to grab the master key. Now, this happens in two stages. At initial deployment, whoever grabs the master key first gets it. During a failover, we want to be more discriminating. So whichever node does it first gets a temporary lock on the master key, and then it checks the replay point of all of the other potential failover nodes using the Patroni API. Because each one of these Patroni daemons — I forgot to mention — not only controls Postgres, but also has a RESTful API that is used for some of the operation of Patroni. And over the RESTful API, we can query what the replay point is on each of the different nodes. So if, for example, one node grabbed the master key, then checked the replay point on node number three and discovered that it itself was behind, it would give up the master key, at which point node three would grab it and promote itself to master. So that's the election process — a two-stage election process for failover. At that point, etcd sends back a message: you have the master key, or you don't have the master key. And then the node that doesn't have the master key changes its primary connection info, changes its replication source, and starts replicating from the other server. And we've failed over.
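To make the mechanics a little more concrete, here's roughly what that dance looks like against etcd's v2 HTTP API and the Patroni REST API — the key path, node names, and exact parameters here are invented for illustration, and the real keys Patroni writes may differ:

    # Stage one: atomically create the leader key; this fails if someone holds it.
    curl -s -X PUT 'http://127.0.0.1:2379/v2/keys/service/demo/leader?prevExist=false' \
         -d value=node2 -d ttl=30

    # Heartbeat: refresh the TTL, but only if we still own the key.
    curl -s -X PUT 'http://127.0.0.1:2379/v2/keys/service/demo/leader?prevExist=true&prevValue=node2' \
         -d value=node2 -d ttl=30

    # Stage two of a failover: before keeping the key, compare replay positions
    # over each candidate's Patroni REST API (8008 is the default port).
    curl -s http://node3:8008/patroni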
Any questions about that so far? That's a very good point. Let me actually talk about that a little more when I talk about split brain, because that's exactly what this goes into. So here — wait, hold on, we've got another brain to split. Who wants half a brain? Half a brain. Anybody else want half a brain? There we go. Yep. Because I haven't actually talked about how we prevent split brain. Well, in this case, we're relying on a lot of groundwork done in etcd. For those of you who are not familiar — because most of you did not raise your hands earlier — some people work with etcd and similar services: Consul, ZooKeeper. Yeah. So etcd is a distributed-consensus HTTP data store. It stores all of its data as HTTP paths, so it's kind of a document store. It uses the Raft algorithm, which is one algorithm for what's known as distributed consensus. In CAP terms, Raft chooses consistency over availability: when a partition happens, a minority partition stops answering rather than risk giving inconsistent answers — and I'll show you that in a minute. etcd is great for configuration information and metadata. Now, one of the questions people ask is: hey, if etcd is able to maintain consistency across the cluster, why not just put our database data in there? Well, here's one of the problems: it's really frigging slow, compared to a transactional database. The number of writes per second you get with etcd is measured in the tens, because it's doing this whole consensus thing on the backend. So you don't want to store real data in there. And as a matter of fact, we go to some trouble within Patroni to write only the things to etcd that need to be there, versus things that we can poll from the API of the individual nodes. Initially I had a design where I was constantly updating the replay point in etcd; that turned out to be not such a good idea when you're running etcd in a really lightweight container alongside a lot of other stuff. Now, there are some alternatives to etcd. People are familiar with ZooKeeper, which tends to be a little larger scale — it's the big Java-based thing. There is support for that in Patroni: you can run Patroni using ZooKeeper. Consul, by HashiCorp, is the other one, and that has the nice property of integrating discovery services as well. There is not currently support for Consul in Patroni; we don't have the module for it, yeah? No, no — that goes in the Patroni configuration, which I'll show you in a minute. So, initially I actually intended to support Consul, and I was writing a Consul support module, and then that use case went away because that particular user switched to ZooKeeper. So if someone wants Consul support, they're going to have to write the module. It shouldn't be too hard; we've already got examples of how to do both ZooKeeper and etcd. Anyway, the idea of etcd's distributed consensus is that if we actually have a network partition, and the etcd cluster can't establish communication among a majority of the nodes that were originally in the cluster, then it responds to information requests with failure messages, rather than providing the information. And that's deliberate. And that means we prevent split brain due to a net split, because any database nodes that can only reach a minority stub of the etcd cluster will get back failure messages.
Now, what happens in Patroni when it gets those failure messages is that the database node will restart in read-only mode if it was read-write. If it was already a replica, it will just keep going. Because the problem is, if that node was our original master but it's now in an isolated network segment, we do not want it to continue accepting writes. But it's okay for it to stay up and continue to accept reads. I mean, we'll get stale reads, but presumably in a net-split situation somebody is getting pager alerts, and presumably we're going to straighten out the application connections at some stage. And so that's basically how it's set up. For etcd, this means that your cluster is statically sized. Changing the size of the etcd cluster requires a restart of the cluster, I believe. Which is a little annoying, because during the restart of the cluster you're going to get a master election in Postgres, and possibly a failover. So give it some thought. The useful size for an etcd cluster is usually about five nodes, if you're really trying to guard against failure situations, so there's not a strong reason to make it larger than that. ZooKeeper, I believe, can actually be dynamically resized. It uses a different algorithm for consensus, but it performs a lot of the same functions. Yeah? Yeah, well, if we're doing asynchronous replication, yes — and under any failover circumstance with asynchronous replication, you will have lost some data. If you can't afford any data loss and you need to set up synchronous replication, we do have configuration options for synchronous replication within Patroni. I'll warn you that I don't think anyone is using them, so test the hell out of that if you're going to go that way right now. In general, for a lot of the webby stuff we're talking about, we're willing to accept the loss of a couple of seconds of data versus being down for an hour while we wait for a human being to check things out. That is, however, another reason not to restart failed nodes automatically. There are several reasons. First of all, if the node failed in the first place, you don't really want to restart it automatically, because you don't know why it failed until a human looks at it, right? Second, if the node failed, it has untransmitted data. If you restart it and force it to rejoin the cluster, you're going to wipe out that untransmitted data, whereas if a human being restarts it and isolates it from the cluster, they could potentially recover some lost transactions. Yeah. So let's look again at the setup in detail, now that you actually know what it's supposed to be doing. First of all, I mentioned that each one of those nodes is running a Patroni daemon. So this is the configuration: you configure Patroni on each individual node and pass it this configuration. This is the configuration that's getting loaded through Docker Compose into each node; the placeholders are for environment variables that are being supplied by Docker Compose. So let's actually take a look at that. There are a few things here. Scope is somewhat cryptically named: it's the name of the cluster you're running in. The idea is that you may have multiple database clusters on a single network that you're running Patroni on. They may even share etcd or ZooKeeper servers, in which case you actually need namespaces for each of them, and that's supported.
Default time-to-live and default polling interval are right here. And then we've got some other configuration. Like I said, there's a REST API running on each node, and so this is where you tell it what IP and port it will listen on, and, within that, what the advertised connection address is going to be. The reason these are two different lines is that if you're doing some sort of network redirection or address masking — particularly for, say, service discovery — your advertised address from outside the container or the VM might actually be different from how that address is seen internally, and we need to support that. Most of the time these will be the same. So currently the API supports SSL and simple, statically set user/password authentication. We haven't had a strong push for supporting something like LDAP for the API. So again, if that's something you need, fork the project; that's what it's there for. And then you actually need to configure your distributed configuration information service. Now, the configuration for etcd is really simple: we've got a scope, we've got a time-to-live, we've got the host. If you have authentication of some kind set up in etcd, then that information goes there too. ZooKeeper configurations tend to be a little more complicated; there's an example in the docs of the different elements you need for a ZooKeeper configuration. So that's where you configure which etcd cluster it's connecting to, and it needs to be configured on each individual node. Then you actually need to configure a bunch of things about PostgreSQL. Now, one of the reasons you need to do this is that if Patroni is initializing your Postgres cluster for you, then the configuration you put here is the only configuration there is for PostgreSQL — because Patroni needs to be able to rewrite the configuration in order to restart things. If you're retrofitting Patroni onto an existing database cluster, then it might be a little more complicated and you might be able to ignore some of this. But if you're doing it the way I just did it — spin up brand-new containers, et cetera — then all of your Postgres configuration comes from here. And this is all of your typical Postgres configuration, including the listen address, again the advertised connect address, which might be different, the data directory, and maximum lag on failover. That last one: we do a check, again through the API, of how far behind each replica is. And so, in addition to trying to choose the furthest-ahead node, you can also set a threshold, saying: hey, if the replica is this far behind, don't fail over to it anyway. In production you'd probably want to set that to something like a gigabyte, or more, depending on your traffic. And then the rest of this is the create-replica methods. I'll mention this again later on: by default we use pg_basebackup, because it's the simplest way of spinning up new nodes, but we do actually support other methods — point-in-time recovery, WAL-E recovery — in order to deploy new nodes, say if you have a larger database for which doing a base backup is prohibitively slow. So one of the things I'm actually working on is a modification to allow you to take the base backup from one of the other replicas, which is not currently supported, but will be supported in the future.
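Pulled together, the file we've been walking through looks something like this — a condensed sketch from memory, so treat the exact key names and values as approximate and check the current docs:

    # patroni.yml -- illustrative sketch, not a complete working config
    scope: demo-cluster              # cluster namespace in etcd/ZooKeeper
    ttl: 30                          # master key time-to-live, seconds
    loop_wait: 10                    # polling interval, seconds

    restapi:
      listen: 0.0.0.0:8008
      connect_address: 10.0.0.2:8008   # advertised address; may differ from listen

    etcd:
      host: 127.0.0.1:2379

    postgresql:
      listen: 0.0.0.0:5432
      connect_address: 10.0.0.2:5432
      data_dir: /var/lib/postgresql/data
      maximum_lag_on_failover: 1073741824   # bytes; roughly the 1 GB suggested above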
The other thing you have to set up is your host-based access — your Postgres access control file. Again, this needs to be written by Patroni if Patroni is initializing the database cluster. We set a whole bunch of passwords because, again, we're initializing the cluster and those passwords need to be set. And then, if you need to pass any parameters to Postgres — like, for example, parameters for archiving or for replication, et cetera — those need to be passed via Patroni, because Patroni is writing the configuration. Actually, it's not writing postgresql.conf; it passes these settings via Postgres command-line options. Yeah? I'm trying to remember where we did that — that was a bug I filed early on. When we initially initialize the cluster, we have to launch it in trust mode so that we can create the passwords, and it wasn't getting relaunched, so as a result we were locking ourselves out. I think now you don't actually have to do that. In this case, because it's a container with no SSH, it's perfectly fine to say local-all-trust, because no one can get into it, right? Unless they can hack the container — and if they have root access on the machine, they can get in anyway, right? In a different circumstance, you might not want to do that. So it would be worth testing, and if it's not fixed, add your case under that bug and say: hey, it's still not fixed. So, if you're doing this on VMs, you're going to have exactly that issue. And part of it also depends on whether you are initializing the cluster via Patroni or not. If you're not initializing it via Patroni, then it's not important to have local trust access, because you will have set those passwords yourself, rather than having Patroni set them for you. Do you follow me? Sure. So, okay — yeah, I was actually working on that, and I discovered a problem with Postgres, which is that we can't pass include_dir or include via the command line in Postgres. It has to be in a physical postgresql.conf file, which is kind of a pisser. See my long arguments on pgsql-hackers about why conf.d should be default behavior in Postgres — arguments which I lost. So anyway, the idea was that I was looking at a modification to support a conf.d directory, and I might put that into Patroni by default in the future. If you're doing containers, you would mount that as a volume; if you're doing VMs, you'd put it wherever you wanted; and then you'd have another place to drop in configuration options, rather than passing them through the Patroni config. The advantage of passing everything through the Patroni config file is that you basically have one master config file to rule them all, which can get checked in under whatever configuration management you use, and then you don't have to worry about several separate configuration files. Might be an advantage, might be a disadvantage, depending.
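For example, the access rules and server parameters we just talked about ride along in the same Patroni config, roughly like this — all the specific rules and values here are invented for illustration:

    postgresql:
      pg_hba:                        # written out for you if Patroni initializes the cluster
        - host replication replicator 10.0.0.0/24 md5
        - host all all 0.0.0.0/0 md5
      parameters:                    # handed to Postgres as command-line options
        wal_level: hot_standby
        archive_mode: "on"
        archive_command: "wal-e wal-push %p"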
So, let's go ahead and do that again. Okay, let's shut this down. And I actually need to tear down the containers. The reason is that if I don't tear down the etcd container, it'll just get restarted, and etcd writes its data to disk, so it already has a master marked as initialized. Actually, no — there are circumstances where Patroni can be unable to restart without human intervention: if etcd is down, and then all the database nodes shut down and come back up, there's no master and no valid etcd cluster to consult. Anyway, rather than restarting it multiple times, here I actually want to restart it from scratch — from scratch rather than as a failover circumstance. So again, we get lots of output; you can see all this stuff. The "failed to acquire initialize lock" message: so, there are two kinds of master locks, right? There's the "I am the current master" lock, and there's the initialize lock. The initialize lock is held for the entire cluster and it's only set once. Once it's set, it lives under that cluster's namespace in etcd until it gets deleted. And the reason is that you don't want to accidentally re-initialize a database cluster that has data in it. You can actually get wedged — I'm trying to remember the specific sequence of events, because I've done it — under circumstances where the Patroni cluster will refuse to start in read-write mode because it can't find an initialized master, but the initialize lock is still set. We decided that was better than having it wipe out all your data automatically. But that would be under a circumstance where everything went down and then sort of came back up again unevenly, and those are hard to protect against 100%. So we see this going on. I don't understand why, when I start this particular demo, node two always wins. I don't know why. It's actually the second one being started, but for some reason, with the timing, two always wins. So — oh, wait a minute. Two always wins, but doesn't always have the same IP address. Oh, it's three this time. So again, here we see node two has two replicas, nodes three and one. So that's our initialization setup. And so then we want to go ahead and kill two, right? And you see all of this traffic of things not being able to connect, and that sort of thing. Actually, that time it failed over really fast. Part of it depends on the timing: you've got a 10-second polling interval, right? And it depends on whether you hit that interval immediately or later on. So then we get this. Now, there's no contesting who's furthest ahead, because I'm not running any traffic on this cluster, so they're all at co-equal replay points. So we fail over, and now we have two nodes. Who's the leader this time? Three, three is the leader. Well, it's either going to be four or five, so let's find out which. Am I the leader? No — there we go. So one is replicating from three, right there. And so now we actually want to bring node two back up — or rather, bring a new node two up. Actually, hold on. Let's actually wipe out node two, huh? Because in a circumstance where we had a real failure, you wouldn't be bringing back up the original node, right? We want a new node two. So now we're going to go ahead and bring node two back up. Hold on. There we go. And node two has connected. We now have two replicas this time: two and one are replicating from three. Anything else you want to do to this cluster? It's all temporary containers, so if we screw it up, I'll just rebuild it. That is a very good point. Yeah, let's do that. Yes, you see all of these reconnections, that sort of thing. "Demoted self because DCS is not available" — that's the message you're getting.
The original master is restarting as a read-only node. Because the thing is, with Postgres you can start any node as a replica of a non-existent master, at which point it becomes a read-only node. And because we support cascading replication, it doesn't require breaking replication to the other two nodes. So we're going to keep getting this message, of course, because it's still trying to connect to etcd. Now, if we bring etcd back up — okay, let's see who's the master now. No, two actually promoted itself, that's interesting. Oh, right, because the TTL had expired. Yeah, so that will happen: if the etcd cluster goes down, when you bring it back up you may get a failover even though it's not strictly necessary. And that, again, is a timing issue, right? Because the etcd cluster has been down for longer than the master lock's time-to-live. Then, when it comes back up, the individual node doesn't know the difference between the master being completely down and the information service having been down, so it treats it as a master-election circumstance. Now, in this case, because all of our nodes were read-only, we lose no data — the nodes will have been at the same replay point anyway. So that's why this was not regarded as an issue for us to fix: it's not really a problem. Now, if I brought etcd down and then deleted it, we would actually never come out of read-only mode, because the nodes would keep polling for the list of servers and wouldn't find them. So, we're getting towards the end of our time here, and I've already taken a lot of questions. So let's actually finish talking about the other stuff. What's included currently in Patroni? Three things. The Patroni agent, which, again, runs on each server and has a RESTful API. There is also a rudimentary command-line tool, under heavy development, called Patroni CLI, which is steadily improving towards its full spec of features but isn't complete yet. Patroni CLI is basically a Python command-line tool that allows you to interrogate the API of the individual nodes to ask them for things, and, importantly, to do things that wouldn't happen automatically — like manual failover. Say you want to manually fail over because you're going to apply kernel updates to individual nodes, or because you're moving a node to a VM with more memory, or whatever. Also stopping individual nodes, et cetera — you can do that via Patroni CLI.
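As a sketch of the sort of session that implies — assuming the current patronictl command name and subcommands, which were still settling down as of this talk, so check the README:

    # Show the cluster members and who currently holds the leader lock.
    patronictl list demo-cluster

    # Manually fail over, e.g. before applying kernel updates to the master.
    patronictl failover demo-cluster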
Stuff that's not included: now, I mentioned that the third part of failover is failing over the applications. That is not included in the core Patroni project, and the reason is that it's going to be provided by other things. There's no GUI for Patroni, and I don't think anybody has any plans to build one. Deployment of containers, VMs or whatever — that's your own thing to do, or do it via downstream projects. And there's no built-in monitoring, except that the APIs do provide a lot of information, so a monitoring system can interrogate the APIs and get a lot of information for monitoring. We just don't have any templates set up for that. So, given that Patroni doesn't cover everything, we have a couple of downstream projects. The one that is in production right now is a project called Spilo, from Zalando. Spilo is Patroni plus a whole bunch of Amazon orchestration tooling. They're a very AWS-integrated company, and so they use all the AWS tools — the Amazon virtual IPs and the load balancers and that sort of thing — on top of Patroni, to provide a complete system. And that's available from their repositories, with documentation. If you're actually going to look at implementing Spilo, though, be prepared to devote a significant amount of work time to it. It's a complicated system with a lot of parts, and it's only ever been deployed at one company. I mean, they have done a really good job of trying to document it, but it is very complicated. The one that I'm working on — it is not available yet, but check back in a month or so — I'm nicknaming AtomicDB. I'm going to be using the Atomic/Kubernetes stack to supply more of a complete system, again on top of Patroni, with Kubernetes and service discovery providing the routing and failover portion of the whole thing, and also maybe provide some Patroni-based OpenShift containers for anybody who uses OpenShift. So check back in a month or two, or just follow my blog, and you will see that as it develops. More features: there is pg_rewind support. If you're okay with wiping out data, you can enable pg_rewind, which means that down nodes will be guaranteed to be able to rejoin the cluster — but that may mean wiping out data they contain that's not on any other node, because of the failover. Again, I said we have configurable node imaging via WAL-E and point-in-time recovery. There are instructions on how to enable synchronous replication support; like I said, I don't think that's really been tested, so you might want to do some testing. And we also now have a way to flag specific replicas as non-failover replicas, which is something you'd want to do if, say, you have 20 replicas for load balancing: you generally want three or four designated failover replicas that don't take load, and the rest of the replicas are load-balancing replicas.
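That flagging is per-node configuration, roughly like this in the node's Patroni config — the key names are from memory, so verify against the docs:

    tags:
      nofailover: true       # never promote this node during an election
      noloadbalance: false   # still advertise it for read load balancing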
Other things in development: cascading replication support — obviously, for geographically distributed setups you'd want cascading replication. An integrated proxy: somebody actually wrote a nice little proxy in Go for another project, which I was looking at potentially integrating, that will query etcd or ZooKeeper to find out who the current master is and reroute connections. And then, when bidirectional replication actually becomes a thing that mere mortals can install, we'll want to look at supporting that. But in the meantime, if you see features that we don't have and that you want: it's on GitHub, it's mostly Python, you can fork it. So here's a list of resources, as you can see. Either take a picture of this slide, or I will be pushing this to my GitHub page, jberkus.github.io, after this presentation, and then you will have all of those links. Yeah — this isn't pushed yet, but it'll be pushed within the hour. So, any final questions before we — oh, we've actually got five minutes for questions. So go ahead; we can even clobber the cluster if you want, yeah? So, the demo is separate. The thing is that the test suite that ships with the code is just unit testing, which is great to have, but it doesn't test some of the things we want to test, like: does failover work? And that was the reason I created the Patroni Compose project, which is what I'm using here. Oh, I actually don't have a link to that — I'll definitely link it off my webpage. The Patroni Compose project is a Docker Compose file designed to set up a cluster so that you can run automated tests of things like: does failover work? Can I add a node? Et cetera. That needs to be built out into a full test suite of failover behavior, which doesn't exist yet — so that's one of the things we actually kind of need. I'll probably build that out via Kubernetes as well: once I actually get AtomicDB going, we'll have Kubernetes, and that will have its own test suite built into it, because it'll be a little easier to automate than what I've got with Docker Compose, which is a bit rudimentary. So yeah, you had a question? Yeah, yeah. Okay — what I'm going to say about pgpool-II is: I don't talk about pgpool-II, because I don't want to trash other people's code in the community. I'll just say I would personally not use pgpool-II for automatic failover. Other questions? Okay, well, thank you very much. Yeah? Oh, you know what — and, ooh, there we go, okay. Sorry about that; all this graphical stuff is confusing me. So, first off, thanks to everybody for coming to SCALE and sticking around to the very end — that's awesome. I certainly hope everybody's enjoyed Pasadena; I've definitely enjoyed the new location and heard lots of good feedback about it. So, my presentation is the anatomy of the command line. We're going to talk about what order things take place in, in bash and the shell, so that we can avoid certain surprises — because things happen in a different order than it looks like they would. We're at SCALE 14x and all that. All right. So, I've got a couple of caveats. I'm skipping some of the pedantic details for the sake of time. So if you're thinking, hey, there are these 14 edge cases: hey, go read the man page, all right? I'm presuming you're already familiar with shell features such as pipes, redirection, and variable expansion. This isn't a full introduction to the shell; if you want that, go look at my SCALE presentation from about eight years ago. Or read the man page. And then also, I'm going for clear examples rather than better code. When you teach programming, or many concepts, you have to go for the thing that points out what you're trying to illustrate, and that is oftentimes not the way you would do it in reality — the whole physics thing with the spherical cow, right? If you could get a spherical, non-variable-density cow, that would be much better for production, but it doesn't work so well in the real world. Yet I'm certain somebody's working on it. All right, so let's dive right in and talk about ordering. So the shell, when you type a command line, is going to parse it and do certain things in a certain order: certain types of actions happen first, and some happen last. So let's talk. The very first thing to happen is redirection. That is the very first thing that takes place. Then you have pre-command variable assignment — we'll get to what that is, because it's kind of a funky name. And then we have expansion, the things we often think of as happening at the shell level. And pipes. And then, at some point, you get to the commands. So you can see there are a lot of things that actually happen before your command even starts up. So let's talk about redirection first. This is ripped from last week's headlines; some of you might recognize these examples. So: echo. Oops, getting a line ahead of myself.
Don't look — if you saw that, forget it for 60 seconds. All right. So we had that SSH vulnerability last week — why that feature was even in there, who knows. Still a fan of SSH, but that one deserves an apology. All right. So, there was a configuration change you could make while we were all waiting for package updates to come through: you could disable the undocumented configuration option — brilliant — in your SSH config. Now, you can change that in your personal config, and that will change it for your account. But if you want to change it for the whole system, you need to do it in the system-wide file, and that's /etc/ssh/ssh_config. So, on this particular command, can anybody tell us what will happen when you run it? Yes sir. No. First of all, who's running the command? I'm not giving you a shell prompt, so I am kind of cheating. But if I'm running this as me, what happens? You get an error, because permission is denied. So now, we all know sudo, right? So I can sudo — hey, this is a live presentation, I can change things, right? So I can put sudo in front and echo that out, and that will give me root permissions when I run the echo. Does that solve the problem for me? No. The redirection happens before the sudo is even looked at. Redirection happens first. So the redirection still takes place as me, not as root, even though I'm using sudo. So let's go ahead and — oops, I need to be able to see what I'm doing. All right. Whoa, that was not what I wanted. Oh, oops, sorry — forgot one part of the setup. A couple of people in the room understand that, and it's important for later in the presentation. Peace — there we go. So: permission denied, because as me, I do not have permission to change that file. And when I sudo, it didn't even ask me for my password, right? I don't allow myself to just randomly run things under sudo. It didn't even get to the point of asking me a question, because the redirection failed first. All right, so we can instead use tee. So what does tee do? tee takes output and splits it: to standard out, so back to your terminal, and also to a file, when you give it a file name. So if I run tee under sudo, as root, now tee as an application has root permission to open up the file. If you are doing this, remember the -a in this particular case, so that you append the new configuration option instead of wiping out your previous configuration — which would probably not be good in the case of SSH — and then you have access. So this is a good illustration of how redirection happens first. As I say, the redirection happens before whatever command you want to run even gets looked at.
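Condensed, the failure and the tee fix look like this — using the UseRoaming option from that week's OpenSSH advisory; exact error text will vary by system:

    $ echo 'UseRoaming no' >> /etc/ssh/ssh_config
    bash: /etc/ssh/ssh_config: Permission denied

    # sudo doesn't help: your shell performs the redirection before sudo runs.
    $ sudo echo 'UseRoaming no' >> /etc/ssh/ssh_config
    bash: /etc/ssh/ssh_config: Permission denied

    # tee runs under sudo and opens the file itself; -a appends rather than truncates.
    $ echo 'UseRoaming no' | sudo tee -a /etc/ssh/ssh_config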
This is from the talk description. So if I echo "Anka" into the file with a single greater-than, I'm truncating: I've created a new file, or emptied an existing one. And — oops, I try not to leave stuff all over my home directory. So I go through and echo that in, creating a new file — this was a brand-new directory — and then I check whether the content made it into the file. Then, in this next one, I'm again going to truncate the file and add the content, so we know it's in there. I grep for it, and then add the part that didn't get copied. Again, I grep for the content, so I know it's all in there, and then I take the output from that command and put it back into the original file. I get an error — that's what the frowny face tells me, it didn't work right — and my file has zero bytes. Because what happened in this command was that for "grep Anka file.txt > file.txt", the redirect truncated the file before the grep ever ran. The first grep was actually reading an empty file, because the truncation happens first; then we go find out what commands are going to run, and they start doing things. So by the time the grep is instantiated, the file has already been emptied. And then — since this is getting too long — in this one, I go through and create the file again. This time I grep, and use tee to open the file. So the first thing that happens isn't the truncation of the file: tee goes through — and I'm not using append, so tee will truncate the file — and then adds the new content. Now, I'm cheating a little bit here, and we'll see that later on; for right now, pay no attention to the man behind the curtain, okay? All right, and make that a little bit cleaner. All right, now in this case I'm using a subshell — we'll get to subshells later on — in order to do something. So as a result, the echo foo is going to append into the file, but we still end up with an empty file. The echo foo goes into file.txt, but that doesn't deliver any output for the outer echo to use. So we truncate the file, even though we've added content in there. And here we get the contents of the ls coming into the file, because what happened was: the echo foo went into file.txt, it produced no output, and then the ls -l output is echoed by the outer echo into the file that's been truncated.
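The core pitfall, boiled down to a terminal session you can try (file name and string taken from the demo):

    $ echo Anka > file.txt             # create the file with known content
    $ grep Anka file.txt > file.txt    # the shell truncates file.txt BEFORE grep starts
    $ wc -c file.txt
    0 file.txt

    # tee usually appears to work here, but only by luck of timing --
    # more on that in the pipes section.
    $ echo Anka > file.txt
    $ grep Anka file.txt | tee file.txt
    Anka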
Now, pre-command variable assignments — we actually just used one: I used it when I set the LANG, and we'll see an example of that. So, "LANG=C" — you saw me do that. What that is: the shell looks at variable assignments. Oh, yes, sir — it has the results of the ls command, yeah. Okay. So, when prepending a variable, you can see from the next example that I can prepend a variable assignment and set up a variable for one particular command. The shell looks for those variable assignments before it goes through and evaluates the command — again, that was the second item, and several other things happen before we actually get to the command. If you have just a variable assignment, such as LANG being assigned C, then it stops there and doesn't run a command, because there's no command to run. There are — for the pedants, go read the man page — a couple of extra pieces there. So in the next example, though, I am using that assignment in two different places, so you can see that the variable assignment is taking place. And a key point is that the variable assignment applies only to that particular command. If you do the variable assignment on the same command line and then have the command afterwards, that variable assignment does not actually affect the shell; it is just for the command being run after it. The whole thing. And yes, I intentionally had errors, so we could see the results from both of the commands. And also notice that I'm not getting the outputs in the order they appear on the command line: we got the output from after the pipe before we got the output from before the pipe. So: the find is looking for a directory that doesn't exist, and I had preset the language environment to German, so we get that error output in German. For the sed — which is also an error, because it's an illegal operation — I'd preset it to English, and we get its error output in English. And then I wanted to show, again, that redirection happens first. So here I'm using a pre-command variable assignment to assign a value to a variable, and then I'm using that variable as if its value were a file name. But because the redirect happens before the variable assignment, I'm trying to redirect to "null" — and you can redirect to /dev/null, but you can't just redirect to null. All right. Next is expansion. I call this the seven-layer burrito of the command line: you have a whole bunch of different things that all happen at around the same time, and they kind of get intermixed — like the layers of a burrito when you're chowing down on one, anyway. So the seven forms of expansion — I'll give this in a little better form in a second — are: brace expansion, tilde expansion, parameter and variable expansion, command substitution, arithmetic expansion, word splitting, and pathname expansion. We'll go over what each of those is. The order is: brace expansion, then tilde expansion; then parameter and variable expansion, arithmetic expansion, and command substitution, which are done left to right and all happen in the same pass — whatever order you find them in is the order they're done. It's kind of like addition and subtraction: they have the same precedence, and which happens first is just which one you run into first — whereas multiplication and division take a higher priority. So these all take the same priority and happen in the order encountered, all right? Then we have word splitting and pathname expansion. And then we have a bonus layer called process substitution, which takes place in the same pass as those other things. We'll get to process substitution. All right, so: tokenizing. This is from the bash man page: only brace expansion, word splitting and pathname expansion can change the number of words of the expansion; other expansions expand a single word to a single word. So: brace expansion, word splitting and pathname expansion can change the number of items on the command line. The other types of expansion don't — the result might have spaces in it, but it's still a single item. And then there's a later pass that tokenizes based on whatever your IFS is. We'll cover IFS — don't change it, but we'll cover it, all right? And that, as I say, is directly from the bash man page. Brace expansion is a bashism that is not in the traditional Bourne shell, but we use it a lot. Left-to-right order is preserved. And I'm going to give you some examples. So in this case, I'm taking nothing, /usr, the value of $HOME plus /local, and /usr/local, splitting those out into the different pieces of the expansion, and combining each of them with either nothing or an s, and then with bin. So that short line gives me all of these different directories. You might guess where I use this — in my profile — and it shortens what I need to keep in there and makes it easier for me to organize.
So let's take a smaller example of that, one that's a little easier for a human to parse. Again, in this case I'm taking nothing and /usr, followed by a slash, followed by nothing or s, plus bin. So nothing plus slash plus nothing plus bin gives me /bin; nothing plus slash plus s plus bin gives me /sbin; /usr plus slash plus nothing plus bin gives me /usr/bin; and /usr plus slash plus s plus bin gives me /usr/sbin. (We can also confuse ourselves in the middle of the presentation, sorry.) We can also do sequence expansions. So {1..10} says: break up one to ten and give me each of those as an item. And we have a couple more pieces with that. So, first of all, does anybody want to tell me what we're going to get as a result of this next command? We're not going to get the same thing as before. So what happens first on that echo command? It's going to try to do the brace expansion first, before it does variable expansion. So the brace expansion looks at it and says, I don't know how to expand that, so it leaves it in place; then later on you get the variable expansion; and then — oh yeah, there's a command you wanted to run, echo — it spits the literal text out to the terminal. And somewhere, in some shell — I thought they had it in bash for a little while — there's a way to actually get the variable expansion to happen first, and then it went away, and it got confusing. But generally, brace expansion happens first, and that makes it kind of annoying: you do have to switch to using a loop, or expr, or something like that, if you want to build a sequence from a variable. Now, we can nest them. I'm just going to go through a couple of these somewhat quickly — oops, I'll go through them quickly if I actually copy them. We can use the alphabet, and we can skip. So {1..10..2} is saying: give me every other one. You can do ..3 and so forth.
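Those examples, condensed into a session — using a variable n rather than the positional-parameter version from the slide:

    $ echo {,/usr}/{,s}bin
    /bin /sbin /usr/bin /usr/sbin

    $ echo {1..10}
    1 2 3 4 5 6 7 8 9 10

    $ echo {1..10..2}          # a step of 2: every other value
    1 3 5 7 9

    $ n=10; echo {1..$n}       # brace expansion runs before variable expansion...
    {1..10}                    # ...so the braces stay literal, then $n expands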
So: tilde expansion, parameter and variable expansion, arithmetic expansion, command substitution, and process substitution all take place in the same pass, in the order encountered. For my seven-layer-burrito metaphor, I count these as the squishy things — the things that all get mixed up, so that as you bite into it, you never know what you're going to get. All right. And of course salsa and sriracha are important parts of a burrito. Okay. And as I say, they might be intermixed as you go through. So, tilde expansion — I'm just going to cover these briefly. Tilde expansion says: look for a user with this name. If you don't give a user name, then it means your home directory. And actually, this example I did want to do, but without the leading dot — oh, that's why I took it out: it was too much output, sorry. So there we go: it goes through and expands each of those. We get the brace expansion on lib, log and cache, and then we get the tilde expansion on each of those directories. And then tilde-plus and tilde-minus — this was actually new to me; when I was getting ready for the presentation, I went, oh really? So, you may know that $PWD is the directory you're in, and $OLDPWD is the directory you last came out of. Tilde-minus is a way of writing $OLDPWD, and tilde-plus is a way of writing $PWD — the "next" directory, which, if you haven't been there yet and you don't have a time machine... but, you know. So there is a shell built-in way of doing a directory stack — I don't ever use it, but it's there — and you can go up and down that array using tilde-plus and tilde-minus with numbers, and dirs shows you the directory stack. I say beware of using tilde in scripts, because there are places where the tilde just does not expand, or doesn't do what you want. So generally, use $HOME, is my recommendation. Yes, it's more typing, and yes, it's annoying: we've got this perfectly cool tilde up there and we never get to use it, except when we accidentally hit it while trying to hit escape, because we use vi, and, you know. So — but beware its use in scripts. I recommend using $HOME, or, if you're trying to get into somebody else's home directory, use other tools to go through and expand that. All right. Parameter expansion. Variables. You know, good engineering documentation is like legal documentation: they take the native spoken language you're using and use it in a different way, to make it confusing. All right. So: parameters are variables. There's actually more to parameters than just variables, but for the most part, that's how we think of them. One of the things is that there's lots of fun string manipulation that can be done during parameter expansion. And before I get to that — I taught at a community college for seven years, and two things. If you ever teach, don't teach globs one night and regular expressions the next night. Your students will end up in the hospital. It's not good, right? They are two different languages using a similar character set, but not the same, doing different things, and it's rather confusing. The other one is the string manipulation: if you teach a bunch of these all at the same time, they get intermixed. So I find that if you're not familiar with the string manipulations, go look at one, play with it for a little while, get used to it for a couple of days, and then go play with another one. I really recommend doing that, so it's easier for you to understand them. That's how I learned them, inadvertently, but I've found it useful. The problem with finding them is that the bash man page is kind of like War and Peace with an encyclopedia appended, so it takes a while to get there. There is actually a very nice shortcut that works directly for these: if you search for colon-hyphen — ":-" — in the bash man page, that will take you down to the string manipulation portion of the man page. (Excuse me while I go Rubio on you.) And then the other piece is that, as it turns out, most of this presentation comes from about 400 lines of the bash man page, which describes it all more tersely than I do. So if you land at the ":-", you're right in the middle of the main portion of what I'm covering in the presentation.
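A few of those string manipulations in action — and, as recommended above, try them one at a time (variable name and paths invented):

    $ f=/var/log/syslog.1
    $ echo "${f:-/tmp/fallback}"   # use a default if f is unset or empty
    /var/log/syslog.1
    $ echo "${f##*/}"              # strip the longest prefix matching */ : a basename
    syslog.1
    $ echo "${f%.*}"               # strip the shortest suffix matching .* : drop the extension
    /var/log/syslog
    $ echo "${f/log/LOG}"          # replace the first match of a pattern
    /var/LOG/syslog.1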
All right: command substitution and subshells. These happen at the same time, and they're kind of the same thing, but not quite — we'll cover that. They get a copy of the current environment; they are not a brand-new shell. I call them the turducken burritos: you just keep shoving more things in there, right? And that's a reason why we don't use an old mechanism — we'll get to that in a second. So first is command substitution. It replaces the command with the results of the command. So if I use command substitution to say echo Fred, what ends up on the outer command line is Fred. If I do an ls, the results of the ls end up there. If I do a find, or if I, you know, go grab some random Wikipedia article via curl and put that to standard output — that all ends up in place on the command line. The mechanism for doing that is explicitly and only this — and I'll cover why it's not something else in a second: dollar, open paren, whatever command you're going to run, close paren. You can have extra spaces in there, because we're not Python; we don't care. But use this form, and the reason is that by having a left delimiter and a right delimiter that are not the same, we get nesting. So I can do command substitution inside a command substitution, inside a command substitution, and I don't have to do anything funky to make that happen. I can just use it. Most of the time you don't, because you're doing something pretty complex at that point, but once in a while it's rather advantageous to have that. There used to be — and officially it still works, but I tell my students it doesn't — a mechanism for command substitution using backticks. The problem with backticks is that they look like forward ticks; in some fonts they're just vertical lines. I'm like, is that a quote or is that a backtick? I don't know. And for those of us who need things bigger and bigger so we can see stuff, even when they're different characters, they look more and more like the same blurry, fuzzy thing. So they're bad because they suck for readability. But the other problem is: your left backtick and your right backtick are the same character, so the closing one is just the next backtick. If you want to nest, you now have to start escaping — and then you have to escape your escapes in order to get to your inner backticks. And in the time it takes you to figure out how to nest backticks, you could have created a new programming language that doesn't have the problem and written the thing in that. So avoid backticks: they're unreadable, they're a pain, and they don't nest well; it becomes uncertain what you're trying to do. If you've got scripts with backticks in them, please replace them — they're pretty easy: you take the two backticks out and you put in three other characters, a dollar sign and two parentheses, and they're gone. It's a beauty. All right: backticks aren't readable, and nesting them is a quoting nightmare.
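Side by side — a nested $( ) and its backtick equivalent (a throwaway example of mine, not from the slides; assumes file names without spaces):

    # Readable, and nests with no extra quoting:
    $ echo "Newest file: $(basename "$(ls -t | head -1)")"

    # The backtick version needs escaped inner backticks -- avoid:
    $ echo Newest file: `basename \`ls -t | head -1\``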
All right, subshells. So a subshell is like command substitution, but we took the dollar sign away. There we go: subshell. We create a new environment that doesn't change the parent environment, just like with command substitution. The difference is that we are no longer taking the output and putting it in place; we're just running a subshell that's doing stuff, and the output doesn't come back into the current command line. And in both cases, you can run multiple commands within that setup — I think I've already used that, and we will use it later on. A note — I should figure out where I found this; I think it was in the Advanced Bash-Scripting Guide — that input redirection from a file is faster than catting the file, because you don't have to start up another process. I will say, though — and I used to teach a shell scripting class, and shell scripting is important for sysadmins; we should all know how to do it really well — that at the point where you're caring about efficiency, you should probably be looking at a different language. Shell is not the most efficient thing out there. We can do a good job with it, and there are times when we need to use shell and be somewhat efficient, but as much as I love it, I will agree that if you really, really care about efficiency, let's consider something else. All right. Now, if the substitution appears within double quotes, word splitting and pathname expansion are not performed on the results. So even if the result has spaces — or whatever your IFS is — in it, it is quoted and becomes a single string, right? Just as if you had put echo, quote, hey, space, there, quote: that "hey there" with the space in between is a single string. Same type of thing. All right: arithmetic expansion. We can do math. (Oops — and apparently we cannot paste. Boom. All that to figure out four, sorry.) But an important part is that it is integer only. This one I'll just cheat on: syntax error. So the shell doesn't get floats, doesn't get decimals, and that can lead to fun. So — oops, wrong thing — what is the result of this particular math problem? One. So you end up with one, plus a remainder. I can also ask for the remainder, but I don't get both at the same time. There are other tools that will do that kind of math; you'll need to go to those if you need it. An important point here is that the expression is treated as if double-quoted. So when you're doing your arithmetic expansion, we don't have any quotes in here, but it is effectively quoted, so I don't have to escape my star when I want to do multiplication. All right: there used to also be a square-bracket syntax, $[ ], which is deprecated and has been going away any time now for the last 15 years. So who knows if and when it'll ever actually go away, because there are shell scripts that were written sometime in the late '80s and nobody's going to change them, for some reason. So don't use it, though. You might see it — and that's why I put it in here, just in case you look at it and go, what the heck? Hans didn't mention it. Well, I did. But don't use it for anything new. You can do nesting, so we get three plus three is six, and so on. A double quote inside the expression remains a double quote, because you're already effectively quoted, so it doesn't get changed. Parameter and variable expansion, command substitution, and quote removal happen within the arithmetic expansion. So: I've assigned a variable, and then inside the arithmetic expansion I'm using the variable; it gets expanded to the value of the variable, and that value is used. It still has to be a valid integer for it to work, but it works just fine. All right — and since we had a couple of security talks over the weekend, I thought we'd put a nice example for that in place. There we go — oh yeah, you still have to have valid syntax, for some reason. Okay, so we can see we've got that in there: I've got variable expansion taking place, a couple of different things. And again, in here we're highlighting priorities: the multiplication happened before the addition did. Standard arithmetic, as far as that goes. Yeah — oh yes, no, I did this because 1948 was when he wrote the book, and so... that was the excuse I gave for my typo, and it still works. All right. And sure — he came up with 1984 by swapping the digits of the year he wrote it, or something like that.
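The integer-only rules, in a quick session (1984 and 2112 borrowed from the demo):

    $ echo $((7 / 2))            # integer division truncates
    3
    $ echo $((7 % 2))            # the remainder is a separate operator
    1
    $ echo $((2 + 2 * 3))        # multiplication binds tighter than addition
    8
    $ x=1984; echo $((x + 128))  # variables work here with or without the $
    2112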
So: process substitution. This is the fun thing. We get it — "we" being, I'm presuming, everyone here using Linux, FreeBSD, et cetera — and other, lesser systems do not have this option. And I never really use it. There are a couple of edge cases, though, where it is quite handy. I usually just cheat and use a file, and that's the way I'll do it in my example, because, again, it's easier. But there are times when it's nice to be able to compare two different things without having to reach to the file system per se. So, this first one — actually, this way. So, first off, what happens after the hash? It's a comment. It's a comment in a shell script; it's a comment on the command line. In fact, if you are working on a command that's rather long and convoluted and you're like, oh, I've got to stop and do something else — insert a hash in front of it. Now you have a comment in your command history, and you can come back to it. Or you've gotten a late-night call, you're root and you're doing stuff, and you're like, oh, I was drinking a lot; maybe I'll just save this for somebody else, or get to it in the morning when I'm sober again, right? Okay. So, hopefully you're sober in the morning. All right, so this is showing you the named pipe that's being created for the process substitution. And then the next one is not going to work unless I actually copy it — there we go. So I'm grepping for sources.list in the output of the find command. So you'll say to yourself, well, you could just run find and then pipe it to grep. But then that wouldn't illustrate what I'm trying to show you. Yes, normally that's how I would do it. And as I say, this is useful for a place where a command needs a file, but you have standard output instead. All right, so I will now use an example where it actually makes sense. I'm creating a file, and — diff doesn't like non-files. diff wants two files, and whatever you do, it says: no, give me two files. I want two files, exactly two files. So what I'm going to do is give it two files; it's just that the second "file" isn't really a file — but diff doesn't know that. And so this is where we're using process substitution, and we get that the only difference is the file I created while I was reading the file system, because it's looking at the same thing. But there are lots of times when you have a more complex comparison — especially, you might have two different file systems that you're comparing, or something like that — and it is handy to be able to feed diff the output of two different commands and have it do the diff on the fly, without having to wait for a file, or without having to throw things into files first.
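The pattern, in its more typical two-command form — the directories are invented, and the /dev/fd names are what bash on Linux typically substitutes in:

    # Each <( ) becomes a named pipe that diff can open like a file.
    $ diff <(ls /etc) <(ls /mnt/backup/etc)

    # What the shell actually hands the command:
    $ echo <(true) <(true)
    /dev/fd/63 /dev/fd/62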
All right, and this is another place where it's useful. Let me get the syntax correct first. The issue here is that while is awesome, but it runs in a subprocess. So whatever happens within the while stays in Vegas, and you don't get to find out about it. Normally, if I go through and assign, in this case, 1984 to i: I take 1984, assign it to a variable, and use it within the while, because it exists before the while, and inside the while I do things to it. After I get out of the while, the subprocess returns, I come back to the original scope, and whatever happened in the while is gone. So for instance, if I go through and grab the file sizes for all the files in a directory, use while to iterate through them and add them up into a total, and then after the while I say echo total, I get nothing. What I usually do is echo out the running line every time as I go through the while, and then after the while, where I've got standard out, I pipe it to tail -1 and get just the last entry. But now I've echoed out a whole bunch of things, and it's rather annoying. By doing the process substitution with the redirection, though, I can actually get at my content after the while, because it changes where that subprocess is. So in this particular case I instantiated j inside the while, and all I did was add to i, and then I went through and got the result at the end. Now there are a couple of things that I didn't add to the presentation that occur to me I should mention. First, I did ++i instead of i++. Why did I do that? With ++i, I'm adding to the number and then assigning to j; the other way around, I'm assigning first and then doing the addition. So if I do i++, I assign 2112 to j and then increment i and throw the increment away, so I never see the increment. The other part: notice I left the dollar sign off inside the arithmetic expansion. In the previous examples I could have done that as well; variables are expanded inside arithmetic expansion even without the dollar sign. So really I should have done it that way up there too, but they both work, yes. Okay, so it depends on which editing mode you're using. I use vi mode, so I start off back at the beginning of the line. If you're using the default, which is Emacs style, I think it's Control-A, but I don't know, because I don't use Emacs. And that's another reason I don't use Emacs: Control-A does things in screen. It would mess up my day, and I use screen a lot, as you can see. But anyway, learn command-line editing. The readline library is where it all lives, and the documentation is inside the bash man page: if you search for "readline", the word read and the word line concatenated together, you can find out more about editing. And of course there are many, many guides and how-tos online. All right. Word splitting. We split on text that is not in double quotes; again, if we double quote, we protect it. It uses the internal field separator, IFS, and by default that is space, tab, and newline. Don't change it. You can; it's a variable. You can assign to it, you can assign Fred to it. Don't, please. Because when you do that you start getting other behavioral effects, and it affects everything. And those of you who know what you're doing, yes, you can do it safely, but please don't. There are other ways of handling it in most cases. If you do play with it, make sure you keep an original copy so you can set it back, and ideally do it in a subshell, so it goes away. Because you forget about it: you think, oh, I'm done, the script ends, and you never reset it. Then later on you go add things to the script, and if you're doing that as root and you wipe out your filesystem, you'll have removed the evidence that IFS did it, and you'll never figure out what happened. Don't mess with it.
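A minimal sketch of that while-loop scoping problem and the fix, using GNU stat to print file sizes (the details are illustrative):

    # Broken: the pipe puts the while in a subshell, so total dies with it.
    total=0
    stat -c '%s' ./* | while read -r size; do
        total=$(( total + size ))
    done
    echo "$total"    # still 0

    # Fixed: redirect from a process substitution instead. Now the
    # subprocess is the stat, and the while runs in the current shell.
    total=0
    while read -r size; do
        total=$(( total + size ))
    done < <(stat -c '%s' ./*)
    echo "$total"    # the real total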
And then, the example I give here. What happens with that is that the echo sees the string file, space, file. They're just letters and a space and letters, so echo prints them out. Then the word splitting takes place and ls is given two arguments: ls gets /etc/issue and /etc/resolv.conf as two different arguments. And that is taking place in the shell. All right. Pathname expansion: globbing, right? I mentioned globbing versus regular expressions earlier; you should know what globbing is. It takes place at this level, along with all the different things you can do with globbing. We'll leave the examples out. Then you get to quote removal. The shell just says, oh, there are quotes; they're gone. They just go away at that point. So again, by knowing when quote removal happens, you can see which pieces are protected by quotes and which are not. We finally get to pipes. So we did redirects, and I sort of teach that they're like pipes, but they're not really like pipes; they happen at different places in the shell. So we get to pipes. What haven't we gotten to yet? We haven't even gotten to the commands. All this stuff, and we still haven't found out what it is we're going to run, but we've got the pipes. What do pipes do? They allow us to run multiple commands: take the output from one command and make it the input for the next command. But not necessarily in that order. All the commands in the pipeline fire up at the same time. If I pipe find to grep, find and grep start at the same time. Okay, yes, for those of you who do kernel stuff: something fires up first. However, from our perspective, writing a shell script, it's indeterminate which one comes up first, and I'm not certain there's actually a deterministic way to figure it out. And that's why my example early on used tee instead of the redirect. I was cheating, because potentially the tee happens first and I end up with an empty file in that particular example. Which caused the first time I gave this presentation to not go very well, because I was like, what the heck? It's a good example that I need for that point in the presentation, but don't count on it, because that tee could actually be opening, and truncating, that file prior to the grep taking place. All right, so this error that I created before gives us a good example of what happens. First of all, if you remember, before we got the error from the sed first and then the error from the find. This time we got the find first and then the sed. It could just be the time it took for the errors to come out, but most likely they actually fired in the opposite order from before. And notice that the sleep is the first command, yet we got errors from the other two commands before the sleep completed, because they all fire at the same time. The other two commands weren't dependent on the output from the sleep, so they were able to do what they do. So when you use a pipe, they're all actually happening at the same time, and all the stuff we've covered before happens in there as we go. And just to show that standard out also behaves this way (that's why I did the ls /etc/issue), we got standard out before the sleep completed. So again, it wasn't just that the errors came up quickly; all of them fired at the same time. Now we get to do something.
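A minimal sketch of the timing point; the commands here are stand-ins:

    # All three stages start at once: the ls output (passed through cat)
    # appears immediately, long before the sleep finishes, even though
    # sleep is the "first" command in the pipeline.
    sleep 5 | ls /etc/issue | cat

    # The same fact is why the earlier tee trick is a gamble: tee may
    # open, and truncate, the file before grep ever reads it.
    grep pattern data.txt | tee data.txt    # don't rely on this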
So: the command, or its equivalent; for those of you who are pedantic, go have fun with the man page. We're just going to do the stuff. Functions and aliases: are they commands? Not exactly. If you're using functions and aliases, you need to know what they are and care about them, but from where we are right now they get evaluated at the same point, so for our purposes they're the same thing, and I'm not going to cover what functions and aliases are. So, regular expressions. What shell construct uses regular expressions? Grep? No, grep is a separate command; it's not a shell construct. There are a couple of isolated places: a few of the string operators I mentioned earlier for string manipulation have spots where this kind of matching comes in. But as far as the command line goes, the shell doesn't look at regular expressions. They get passed on to the commands, and the commands use the regular expressions. So if you just have a bare regular expression sitting there on the command line, what is it? It's a glob. The shell is going to look at it and go, hey, I know what these characters are; a star looks like a glob to me, and it's going to try to use it as a glob. So if you're going to use regular expressions, you need to quote them. Even if the particular one you're typing right now would be just fine unquoted, quote it anyway, so you can be sure it is protected for the command and not being turned into a glob. Let me, sorry. I thought my phone was doing things, and I don't want to pocket-dial the Secret Service or something. I was not at the secrets talk; I have no idea what you're talking about. Oops. And if you paste into the wrong screen session, it doesn't do what you wanted. So I'll just go back the other way. So, I'm creating a file. A file: was that correct? No. I'm creating three files, f.1.txt, f.2.txt, and f.3.txt, using brace expansion. I am then going to, hopefully, copy the correct line. Truncate file.txt again; I really don't like that file, apparently, because I keep beating on it. And I'm throwing fff in there. So now I have created my convoluted environment, and I will take advantage of it. So, what's the result of this command? Nothing. We know we have an f in that file. Grep is looking for an f in that file. But no: we get nothing. So how do we tell why we got nothing? We use a print statement. In the case of echo, echo just spits out whatever the shell handed it, right? So what we end up with is that echo gets the plain text that the shell interpreted and handed off to it. And the shell saw the f.* as a glob and said, hey, do I have any files that start with an f, have a dot, and then something else, or nothing else as the case may be? It turns out it does. So what happened with that grep was that grep got f.1.txt, f.2.txt, f.3.txt as its arguments. It did not see a regular expression; it saw the results of a glob. So if I want to use the regular expression, I need to protect it with quotes. All right?
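A minimal sketch of that trap (lowercase letters in the file, to keep grep's case sensitivity out of the picture):

    touch f.{1,2,3}.txt     # brace expansion: f.1.txt f.2.txt f.3.txt
    echo fff > file.txt

    grep f.* file.txt       # prints nothing: the shell globbed f.* first,
                            # so the "pattern" grep saw was really f.1.txt
    echo grep f.* file.txt  # the debugging trick: see what grep actually got
    grep 'f.*' file.txt     # fff. Quoted, the regex reaches grep intact.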
And we have three types of quotes. It's not part of my presentation, but hey, we're here. What are the three types of quotes? Single, double, and character is the way I list it. So: the backslash. (And if you're using backticks, you need backslashes to escape your backticks, and you need a backslash backslash just to get a backslash. So please don't use backticks.) All right. Single, double, and character. The backslash quotes the very next character. So let's go back down here. If I wanted to have two stars here, to break my regular expression, I can quote them one at a time. So there's your single-character quote. If I want to have a literal backslash, I need two backslashes, because the first backslash says the next backslash is actually a backslash. And when you do backticks, you need backslash backslash backslash to get your backslash; you end up with three of them per level. Don't do that. All right. And I think I did these examples, grep stuff. Okay, further examples. There's a grammar issue right here, because it says "further example": I took most of the examples and put them back up into the presentation. But I have one nice example. That is longer than the screen. Oh, that's going to be fun. All right, we're going to have to make this slightly less readable in order to have room for it. There we go. There we are. Okay. So first of all, what does this do? It's something I needed to do the other day. Somebody in the room might actually recognize what it was, I think. No? Good; I obviously removed enough details. All right. So, I had a text file with some content in it that was somewhat standardized, but there were other pieces of content in the file. I needed to find the lines that had content matching a particular regular expression. I then needed to iterate over those lines, count how many times each appeared in the file, and then sort them in a particular order. And while there were probably easier ways to do it, I built this up like we often do as sysadmins: okay, what's the first part? I go tackle that first part, and then, how do I do the next part? We have pipelines. We can do these things as pieces and glue them together like Legos. Except with Legos you're like, oh, where's that piece I needed? Oh, they don't actually make it. Here, I can make my own pieces. Awesome. But it gets kind of hectic. Anyway, I needed to do all these things, and this command is a little convoluted, to say the least. So let's talk about what happens, and in what order. First of all, what's the first thing that happens? The redirection, you would think, from watching my presentation. But actually what happens first is the subshell, that first parenthesis, because the redirection takes place on a different command line, inside the subshell. So the first thing we get is the subshell. We do a whole bunch of things in there, and eventually we get to the pipe. Don't forget the pipe. (It's a song about a pipe. Oh, that's a different thing, sorry.) So we get the subshell first, and then within that subshell we have things happening. The next piece should be the redirection; I hope so, because that's what I'm going to say. All right, the next piece is the redirection and the process substitution. And notice that we're actually getting both; I didn't cover this very well earlier. We have a less-than, a space, and another less-than. The first less-than is the redirection; the second one is the process substitution. So we're taking process substitution and using it to redirect standard input for the while loop, which totally makes sense. Yeah. Well, it is how it works, and I don't know who figured it out, or how long they had gone without sleep before it made sense to them. So I have that, and it will feed into the while. The process substitution gives us the grep that is looking for letters a to z, a space, and more letters a to z. Then we sort those with sort -u, unique, so it gives me just the unique ones: if I have 50 copies of a line in the file, I don't want to search for it 50 times; I just want to search for it once. So I take those unique things, and, okay, whoops: that is where the first parenthesis, the subshell, ends. So I was reading it wrong earlier on. Then we take that and feed it into the while. We read each line into a variable called name. We use command substitution (not a subshell, though it does run in one) to take the result of the count and assign it to count. The problem with just echoing count is that I get a number, but I don't know which thing had that number. So I also echo the name along with it. Now, why do I have a less-than here? Count or something, maybe. All right, I know why: at some point I middle-clicked in the middle of my file and totally messed up my example. Sorry about that; it still gives me the pieces I was trying to point out. So we get the process substitution and the redirection. We get into the while and get our unique counts. Then I echo the name with the count so that I get both of those together. And then I can finally sort them numerically, in reverse, and page it with less at the end. So yes, at some point I pasted the example inside of itself and made a turducken mess of it. My apologies; I will fix that and get them to upload the actual example so that we can have it online. (Yes, sir? Right: that was just a file, and it got pasted in the wrong place, apparently. So yeah, this is not quite what I had. The redirect with the grep and everything, those were the pieces; as I say, I obviously pasted it inside of itself and mangled it pretty badly.)
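In the meantime, here is a hedged guess at its shape. The file name, the exact regular expression, and the final paging are reconstructions from the description above, not the original command:

    # Pull out the matching lines, de-duplicate them, and feed them to the
    # while loop by redirecting from a process substitution.
    ( while read -r name; do
          count=$(grep -c "$name" data.txt)   # command substitution
          echo "$count $name"                 # keep the name with its count
      done < <(grep -oE '[a-z]+ [a-z]+' data.txt | sort -u)
    ) | sort -rn | less    # sort numerically in reverse, then page it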
Resources. The bash man page: I've mentioned it several times, and it's a great resource. It's long; don't try to read it all in one sitting. Break it out over different periods, like years. No, but it is a great resource, and it has a lot of information. The thing it doesn't have much of is examples. For that reason, I like to point people at other resources. A very good one is the Advanced Bash-Scripting Guide, which has a lot of examples. It's somewhat disorganized in many ways, but it is still a great place to go look for stuff. And of course, once you have keywords that you're looking for, you can just search for them online and find lots and lots of other examples. All right, and since it's one slide, I just put the copyright at the very end. Any questions? No? Oh, yes. Mm-hmm: mysqldump, piped to tar? It should work, right? It's been a while. But here's the thing: you don't need to dump it to tar. If you're doing a MySQL dump, you pipe it to gzip to compress it. It's already a single file, so you can just redirect it into a file. Tar is a way of taking multiple files and putting them in a single container, and mysqldump creates a single file anyway in most instances. So tarring it up doesn't really do anything for you, other than being what some people expect to find. I would just dump it to, you know, mydump.sql and then gzip that file. You can gzip it in line and only ever put the compressed copy on disk, but generally, for mysqldump, I throw it to disk first and then gzip it afterwards, something like the sketch below. Part of that is because mysqldump locks things up in the database, so I want the database part to be done as fast as possible, even if it costs extra I/O. For that reason, also: go talk to Dave Stokes and the MySQL folks. There are better ways to do backups for MySQL, but that's outside the scope of this talk.
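A minimal sketch of both variants; the database name and file names are made up, and credentials are left out:

    # Compress in line: only the compressed dump ever touches the disk.
    mysqldump mydb | gzip > mydump.sql.gz

    # Dump first, compress after: extra I/O, but the dump itself, and the
    # locks it holds in the database, finish as fast as possible.
    mysqldump mydb > mydump.sql
    gzip mydump.sql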
All right, any other questions? Thank you very much for coming to my presentation, and thanks for coming to SCALE. I hope we will see you here next year. And you saw we had lots of good content; if you have something that you think is interesting, please submit a presentation, because we are a great conference because we have lots of good presentations, lots of different people with different perspectives, and I think that's part of what makes SCALE awesome. Have a safe journey home, however long or short that might be.