Thank you. It's great to be here. It's great to get to participate in a conference. I don't get to do that very often. I'm normally just rushing around. When I'm not speaking or at the booth or whatever, I'm on a call or running out to see a customer or something like that. And for me, this is not just an opportunity to come and speak but to reconnect with people. And you're really recharging my battery. So thank you. Thanks for letting me be here.
A couple of logistics things. The first being the clicker. It might have just gone out. It buzzed. I think it means it connected again. So I'm glad that I'm at a DevOps conference, and I don't have to tell you what DevOps is. That's wonderful. Just common questions that I get: the slides will be posted. There's going to be a QR code at the end that you can scan. Also, the URL is already pasted in the Slack channel for this talk, if you want to do that. And there are lots and lots of links and references in the materials that I include at the end. And I love hearing from you. So reach out to me however you prefer.
To begin with a story, thinking back to about 2010, I was a database consultant. I had gotten interested in databases early in my career. And then that led me to a job that was really rewarding. I got to work with people who were having troubles with database performance for a while, almost five years. And about 2010, there was a company that was doing flash sales. So they would get inventory of popular clothing and accessories and things like that. And then they would make those available to everybody all at once. And thousands of people would rush onto the site and try and get bags at a discount or whatever. And they had a fairly typical architecture in those days. It was a typical pair of master servers in the core database, one in active mode and one in passive mode. The traffic was always going to one of those, and the other one was in idle mode.
And it wasn't just for disaster recovery, but it was also used for a lot of operational types of things. So when they would do a schema change, they would stop replication. Replication went in a circle. So they would stop replication from the idle one to the active one, do the schema change on the one that's effectively offline, and then restart replication and swap the roles of the servers. And this is a pretty common type of thing to do with these sorts of databases back then. Probably a lot of people are still doing this today. And it can work pretty well. But replication is non-instantaneous. And so there are consistency problems. And there can be race conditions and all of these types of things.
This was a set of MySQL servers, and that replication is also able to continue apparently working even when the data has actually drifted on those two servers. So replication is still working, and you think that your servers have the same data, but they no longer do. And this scheme had been in use for many years. And slowly, bit by bit, through these changeovers, the data had drifted quite far apart in small places because of those timing and synchronization problems between the switches. And what would happen was identifiers would get reused, because there were auto-incrementing identifiers. And so a very small minority of customers could log in and see someone else's history or shopping cart or saved items or whatever. Or lose their order history, as the case may be.
And I wrote a bunch of software during those days. Originally I was doing all of this stuff by hand. And eventually I figured out that it could be automated. And I wrote a bunch of software, which is now called Percona Toolkit and is in pretty wide use. But in those days I was kind of the main user, because I was doing a lot of this sort of gnarly exploration, like how do we get these things back together?
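The kind of drift described here, where replication looks healthy but the two copies no longer match, can be found by checksumming both copies in chunks and comparing the results. What follows is only a minimal sketch of that idea, not how Percona Toolkit actually works. It assumes in-memory SQLite databases as stand-ins for the two MySQL servers, and the table, column, and function names are all hypothetical.

```python
import hashlib
import sqlite3


def chunk_checksums(conn, table, key, chunk_size):
    """Return {chunk_start: hex digest} for rows grouped into key ranges.

    Table and column names are interpolated directly, so they must be trusted.
    """
    digests = {}
    cursor = conn.execute(f"SELECT * FROM {table} ORDER BY {key}")
    key_index = [col[0] for col in cursor.description].index(key)
    for row in cursor:
        start = (row[key_index] // chunk_size) * chunk_size
        digests.setdefault(start, hashlib.sha256()).update(repr(row).encode("utf-8"))
    return {start: digest.hexdigest() for start, digest in digests.items()}


def drifted_chunks(primary, replica, table, key="id", chunk_size=100):
    """List the chunk starts whose contents differ between the two servers."""
    a = chunk_checksums(primary, table, key, chunk_size)
    b = chunk_checksums(replica, table, key, chunk_size)
    return sorted(start for start in a.keys() | b.keys() if a.get(start) != b.get(start))


def make_db(rows):
    """Build a throwaway database holding a hypothetical orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, item TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    return conn


rows = [(i, f"customer-{i % 7}", f"item-{i}") for i in range(1, 301)]
primary = make_db(rows)

# Simulate a reused identifier: on the replica, order 150 belongs to someone else.
replica = make_db([(i, "customer-x" if i == 150 else c, item) for i, c, item in rows])

print(drifted_chunks(primary, replica, "orders"))  # prints [100]
```

Only the chunk holding the mismatched row is reported, which is what makes it practical to bring data back into sync a small piece at a time instead of all at once.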
And I automated it so that I could do it more effectively and with higher confidence.
So one day I'm working on this site as a consultant. Remotely, I'm far, far away. The team is in New York. I'm in Virginia. And my job is to very carefully bring back into synchronization another portion of the data. We were doing it a little nibble at a time. And I made a mistake. I didn't stop replication. And then I ran the command on the wrong server. And took everything down. And it was during a flash sale. I immediately realized what I had done. I actually had a very physical reaction to this. I felt myself starting to get tunnel vision. And the world seemed to get distant. I broke out in cold chills and recognized that I was about to pass out. So I slid off of my chair and lay down on the floor, propped my feet up, breathed for a couple of seconds. And then I got back up, got on. In those days, we were still using Skype. Hey, Pete. And got on with the operations team that I was consulting with and said, you know, this is what just happened. They did a couple of quick checks and confirmed that this was impacting the site and the current flash sale. So we were going to have to put up an under construction page, essentially, and take the site offline.
We started trying to triage. And they brought in some of the developers. Now, I had worked quite closely with the database team there for a while. And I'm still friends and still in touch with a bunch of them. But I hadn't worked very much with the development team. When they brought in some of the senior developers to help figure out what was wrong, what was impacted, and what was a safe way to recover things, I was really impressed. It didn't occur to me until then that a lot of the database folks didn't really understand the database, what was in it, and what was safe to do and what was unsafe, as well as the development team did.
And so the senior developer, you know, the one who knows where all the skeletons and all the closets are, is coming on and consulting with us. In just a couple of moments, we were able to figure out, you know, how much was impacted and how we could recover. It didn't take that long. Over the rest of the morning, we came back online.
After everything was done, I called up the VP of engineering and said, you know, of course you know that there's an outage. And he's like, yeah. And he did something that I had not expected. He said, I'm not here to criticize you. You know, on this call, I don't want to blame you. I don't think that'll be productive. What I want to know is, how can we learn from this? And how can we take those learnings and use them to prevent these types of things from happening, or to help them happen a little bit less, or to recover more quickly in the future?
And that was really not the way that I had been managed before. I had been blamed for a lot of things in my career. Even faulty equipment, when I worked in a factory and something cut my finger really badly. After two days of investigation into that, I finally was escalated up to a senior manager who told me that it was, quote, a dumb shit move. And if I did that again, I would be fired. And those devices were never fixed as a result of that. So this was kind of the first time that somebody had treated me with compassion and had taken a learning approach to it. So it was a day of a lot of learning.
That was like early days of DevOps. I had probably heard of it. I don't know how much that team had heard of it. But looking back with the benefit of hindsight, I can see what I didn't see at the time. I can see that a lot of seeds were starting to sprout in our industry. And, you know, there were a lot of things that you could see the early beginnings of that are now, like, widely accepted. And it's been really amazing.
It's been a really amazing experience to see these kinds of things turn into DevOps, agile, design thinking, continuous delivery, all of those sorts of things over the years. So now nobody argues that these are mainstream trends. Really mainstream. So there's a lot of stuff that DevOps is either adjacent to or overlaps with, is driven by or drives. And so I've just, you know, written a list of buzzwords on the screen, obviously. But the big general theme, the takeaway from this, is that DevOps is essentially a way to innovate in a way that's better for the business and more just for the humans involved. And I think that's really important. And I think it's also really notable that other ways of innovating that have not been just for the humans involved have not turned out to work that well. It's not accidental. If you want to innovate and do great work, people are doing that work. People and teams are doing that work. So the justice is an important part of it. Otherwise, go back to Pete Cheslock's talk: you're going to lose your best people and you're not going to innovate, among many other things.
Some of these buzzwords are a little bit newer than others, like containers and microservices. And I'll return to some of those trends. But ultimately all of these things are driving a big transformation in how we build and operate our systems and our teams, and ultimately how our companies run. For example, software as a service, largely driven by and co-occurring with the rise of cloud. I founded a software as a service company. And it's not just a different way of building a product, but it's a different way of building a company. The business metrics, the levers, the finances, all of those things operate completely differently than companies did 20 years ago. So we're really in a new world now. But we're still doing a lot of things very similarly in the database.
Even if DevOps has penetrated everywhere in an organization and there's a lot of buy-in from top to bottom and across, there's often a last holdout or two. And the database is often one of those places where DevOps just doesn't reach as quickly. And I'll talk a little bit about why that is. But first, to note that there are kind of two ways that DevOps gets applied to the database, really broadly and generally speaking. And here I'm really borrowing from Charity Majors, who uses the phrases the first age of DevOps and the second age of DevOps.
For the first age of DevOps, think classic configuration management, Chef, Puppet. I was a system administrator: SSH in and start typing things on the command line. The first age of DevOps was: stop doing that and start writing software. So operations is accomplished through development, and our work becomes software that configures and manages our systems. Right, so that's kind of the first and logical place to start, and it's true in the database as well. Database servers often got this stuff applied to them a little bit later than others. You know, you would Chef everything and recycle all of these machines, but you would never recycle the database server and bring it up under Chef, because it was important, right? So you could just, you know, completely replace all of your web servers and they would come up under nice clean configuration management control, but the database often got left behind in that.
And then the second age of DevOps is when developers start doing operations. And I'm really grateful to Holly Allen for her talk just before this about service ownership at Slack, because it is one of a set of stories that you can study and see a lot of common themes arising in. Everywhere from Netflix to Zalando, you know, there are lots and lots of stories of how this works. And this service ownership by teams, as opposed to the centralized operations team or centralized database administration team, is an important theme here. I don't think that it's a coincidence that companies who work this way, in my personal experience, tend to do better at learning and moving quickly.
So in order for developers to own database operations, to own the production behavior, performance, availability, and quality of service of their databases, you need a handful of things there as well, such as schema as code and automated migrations. And there are certainly areas where you can see this being done. But then, just as anywhere in DevOps, you can see that there are things that it's harder to do this with. So if you're doing a Ruby app on Heroku, it's a lot easier to do schema migrations than it is to retrofit that into a legacy monolith or something like that.
So that's kind of the broad brush. Like, what is DevOps for the database? Well, it's the same thing as DevOps. There's developers doing operations and there's operations doing development, and maybe some are a little bit or not so much of either of those, but only in a database-specific context. And that's partly because databases are hard to DevOps, and here I use DevOps as a verb. Throw your tomatoes now. It's okay. And databases are hard to DevOps because they're different. They are different and special in a number of ways, and we've intentionally made them that way. In some cases, we would like to do it differently, but it takes time for that to evolve and mature.
First of all, we try and make everything in our apps stateless as much as possible so that, for example, if a web server is on an EC2 instance that starts to go bad, we can just replace it. We didn't lose anything, because everything that was of value that we wanted to keep, you know, the statefulness, that got delegated and pushed over to the database. It's stateless, and you can create and replace and autoscale and all those kinds of things as much as you want. So that makes the web apps easier to operate and easier to configuration manage and all those kinds of things, but it makes the databases do all of that hard work.
The databases are also mostly legacy technology. Even Postgres and MySQL, and even some newer databases than that, are still really legacy technology, because it takes a long time for them to mature to the point that you're willing to trust the crown jewels of your business to them. So, you know, a database takes at least five years to mature for production. In today's world, five years is already legacy technology. Well, you know, most of the databases that we're using today at scale were designed well before configuration management and so forth. And so they don't have the appropriate control planes and lift-and-move handles that a lot of other systems do. And, you know, so configuration management systems have a hard time with them.
And then also, configuration management systems weren't designed for databases. They weren't designed for complex stateful services that are distributed and have consistency requirements from server to server. You know, they're not designed to coordinate across servers, for example, across hosts. So, I mean, that's not always true. This is not categorically true. But generally, you know, there's some truth in all of these things. And then the last thing, of course, is that the databases are doing really hard work that's really critical and has tight performance tolerances.
And that's increasingly so as the world becomes connected and, you know, your application now depends upon, you know, Twilio, and that's now actually linked into your application and a part of it. So if Twilio is slow, then your app is slow. All of these things become dependencies of each other. So Twilio's databases have to run really, really fast.
So folks often think of or, you know, regard themselves as doing DevOps pretty well, even maybe in the database. But I often see some signs that there is still room for improvement, that there's still opportunity. And this is not intended to be any kind of a blame game or anything, but you can kind of score things and see whether you do have some opportunity for improvement. And rather than looking at the databases themselves, look at the artifacts of the culture and the work processes and things like that to tell you whether you can do better or not.
So my favorite thing is the dysfunction that's caused when responsibility and authority are not co-located and tightly aligned. When somebody is responsible for something that they don't have control over, it's a recipe for madness. When the administrator is responsible for production query performance and the developer is the one who's actually writing the query, you have a problem. You have an automatic problem. And when developers are not responsible for that, they don't feel any pain, and they don't get paged when a query that they wrote is slow or is taking the site down by causing a pile-up or something like that. You know, that's a real issue, and those things point to deeper underlying types of problems. And if developers can't self-service, if they can't provision or change, if they can't observe in production how the services and the queries coming from those services are performing, then, you know, again, that's not only an inefficiency and a bottleneck, but a real problem for the org.
So one of the really clear ways to see this is if people are afraid of changing the database schema. And schema migrations exist even if you have a schema-less database. There is still schema. It's just a question of whether it lives only in the application or in both the application and the database. And if people are shipping a new version of the application that treats the data differently, that's schema. And, you know, if that's a scary thing and if it's done less frequently because it's scary, that's a form of dysfunction.
To address these things, there's a handful of things that I think of as being really important. One of those is tooling, because automation is good and important. You know, I took down the flash sale site with automation, but also with fat fingering and, you know, a mistake. But that automation was making possible something that previously was not even possible at all. Automation, not the automation that I wrote but the automation that that company wrote for doing their failovers, was also responsible for this data having drifted over time. Well, you know, the data drifted over time. That's a bug. But the automation was there fundamentally, and that's what's really important. So schema change automation, deployment automation, continuous delivery, and monitoring tools. These are all vendors, and I have a particular fondness for monitoring and observability. But that doesn't make it more or less important than these other things. So good tooling is really important. I'll talk a little bit later about bad tooling. Actually, I've already talked a bunch about bad tooling, haven't I? But good tooling is really important.
Then another thing is that databases are very complicated, and they have murky inner workings that typically only a few people really understand very well. So it also helps a lot to have a framework for how to approach these.
And the frameworks that work well for the databases, like defining what quality of service looks like for the database, are very similar to what they look like for any other service. I've introduced an acronym here. The CELT acronym, the USE acronym, those are defined in some other talks that I've given that there will be links to at the end. But ultimately what this means is that the database as a whole and its workload as a whole is characterizable, and you can quantify and describe the quality of service both globally and at the level of individual queries. That's really important when you're trying to figure out what has changed in the databases and how: to look at what useful work it's doing for its customers, that's the applications that depend on it, and where and how, at a very different level.
And there's also, I think, a monitoring philosophy that simplifies the big complicated problem of what should I monitor and how should I do it. There's like 350 status counters in MySQL, and other databases are similar. Which of those matter? Which of them should I put on a chart? Which of them tell me that I should increase the setting or decrease the setting? That's a really complicated problem, but it's also in the slideshow that I have linked later. So those are really important to help people to gain a sense of confidence that they can do the database operations well.
Another important thing for moving forward is figuring out where you stand now. And the classic way to do this was capability maturity models, which typically went level one through level five, and you were trying to move from level three to level four or level four to level five. And you can still use the good parts and not get caught in the traps. The best way to learn about this is to read the book Accelerate, and to look at the annual Accelerate State of DevOps reports. And not just the latest year's, but I encourage you to go back a couple of years and read all of them.
And see what has changed from one year to the next. There was some interesting database-specific stuff last year. This year there wasn't as much of a focus on that. This year there are some new capabilities mentioned. Accelerate will help you to understand how capability maturity models are broken and what's a better way to look at it. And the Accelerate State of DevOps report will help you to understand what capabilities in general are predictive of or influence DevOps performance. There's also Dickerson's hierarchy of reliability, which you can find in the SRE book. And it's basically a pyramid of needs. Monitoring is at the bottom for a really important reason: you can't really build on top of that unless you can see what things are doing in production. And in a database-specific context, I've outlined a bunch of these database-specific types of things in a book that I wrote, and there's going to be something on the screen later to grab that.
A few of the ways that things can go wrong as well. So first of all, the tooling that I've been talking about a handful of times so far can be very, very dangerous. Tooling written by vendors who think of database administration tools as something you put on your laptop and double-click to open causes a ton of problems. And in fact, in a lot of my performance consulting years, you know, during that period of my career, a lot of times we solved performance problems by turning off things like badly designed monitoring tools. So these things, you know, brittle, inadequate, naive tools, can really cause a lot of damage.
As a vendor, I've also been in the position multiple times of people saying things like, you know, I really want to get VividCortex in here, and we see that we're going to need a culture change, and we think we're going to drive that culture change by buying your product. I don't want that customer.
You know, that's going to be sort of, you know, more bodies lying on the street at the end of the day. It's going to be careers broken. It's going to be, you know, trust and faith broken. So you don't get cultural change from a vendor or from a tool or a product. You get it by doing the hard work from the inside.
So more than likely you'll have some amount of internal resistance, and there are people who really want to keep doing something that they see as career security. And sometimes that means that they resist other people having some involvement or even ownership of what they've previously been involved in. So, you know, I'll name the elephant in the room: some DBAs want to keep DBAing, but they need to learn a new way to do that.
And unrealistic plans: plans made without getting all the best minds in the room and really getting buy-in from everybody. And crucially, I think Holly Allen laid out the high-trust environment and the commitment to learning at the executive level. If those things are not present, then you get things like, you know, internal resistance, and people are not committed with their shoulder to the wheel.
So in a nutshell, the way that I advise getting started is really, really small and carefully. And that doesn't always have to be the case, and Holly made an eloquent argument for why sometimes you just need to take the leap and plunge in. But I do think that when you're dealing with people's careers and you're really solving some political and cultural problems, it helps a lot to show success in a small area. The Accelerate State of DevOps Report talks about different patterns of DevOps adoption and which ones work and which ones don't. So I think that's excellent reading as well. I like to encourage people to start with something that's not going to end their career. I punched a hole below the water line in the flash sale website. I try not to do that whenever I can.
Chaos engineering is not about actually crashing things. If you know it's going to crash something and cause an outage, it's not chaos engineering. Read the chaos engineering book. And it also helps to start with a new thing that is not yet in production, that is not business critical. The stakes are much smaller, and you can then have a pattern of what good looks like and then try and replicate that.
Culture is a funny thing. Culture is emergent. You can't really operate directly on culture. Instead you have to kind of come in the side door. You know, you have to start changing something that drives culture, like who's on call or things like that. And culture change is an outcome of that. In the e-book that I wrote about DevOps for the database, I referenced Ryn Daniels's writing. There's some really good writing in there about designable surfaces, I think is the phrase they use: the way to influence culture through things that you can actually change, that will then have consequences for the culture.
I think it's also important to kind of break down the large number of things that you could be focusing on and pick some stuff to get started with. So for me, when you're thinking about the database, the first thing I think about is: can you measure it at a really fine grain? Can you quantify the performance of an individual query, like add to shopping cart or something like that? Can you track that? Do you know what changes happened in your database workload in the last release? Those are really important questions to be able to answer. If there's an outage and you're wondering whether to point the finger at the database or not: there was no outage two hours ago. Can you look at the difference between now and two hours ago and understand what's changed? Can you ship a new version of your code and have that code work against an old version of the database schema? Or can you change the database schema without breaking the code, without a redeploy of the code?
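One way to picture code and schema that can change independently is an additive ("expand") migration paired with application code that tolerates both the old and the new shape. The following is a minimal sketch of that idea under stated assumptions, not a description of any particular company's process: it uses in-memory SQLite, and the `cart` table, the `quantity` column, and every function name are hypothetical.

```python
import sqlite3


def columns(conn, table):
    """Names of the columns currently present on a (trusted) table name."""
    return {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}


def migrate_expand(conn):
    """Additive change only: a new column with a default, nothing removed."""
    conn.execute("ALTER TABLE cart ADD COLUMN quantity INTEGER NOT NULL DEFAULT 1")


def add_to_cart_v1(conn, customer_id, item):
    """Old code. It names its columns, so an extra column does not break it."""
    conn.execute("INSERT INTO cart (customer_id, item) VALUES (?, ?)", (customer_id, item))


def add_to_cart_v2(conn, customer_id, item, quantity=1):
    """New code. Uses the new column if it exists, else falls back to one row per unit."""
    if "quantity" in columns(conn, "cart"):
        conn.execute(
            "INSERT INTO cart (customer_id, item, quantity) VALUES (?, ?, ?)",
            (customer_id, item, quantity),
        )
    else:
        for _ in range(quantity):
            add_to_cart_v1(conn, customer_id, item)


def item_count(conn, customer_id, item):
    """Read path that gives the same answer on either schema version."""
    if "quantity" in columns(conn, "cart"):
        sql = "SELECT COALESCE(SUM(quantity), 0) FROM cart WHERE customer_id = ? AND item = ?"
    else:
        sql = "SELECT COUNT(*) FROM cart WHERE customer_id = ? AND item = ?"
    return conn.execute(sql, (customer_id, item)).fetchone()[0]


def make_cart(migrated):
    """Build a throwaway cart table, optionally with the migration applied."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cart (customer_id INTEGER, item TEXT)")
    if migrated:
        migrate_expand(conn)
    return conn


# Every combination of old/new code and old/new schema agrees on the result.
for migrated in (False, True):
    conn = make_cart(migrated)
    add_to_cart_v1(conn, 1, "bag")
    add_to_cart_v2(conn, 1, "bag", quantity=2)
    print("migrated" if migrated else "original", item_count(conn, 1, "bag"))
```

Both schema versions report 3 items, so in this sketch the migration and the code release can ship in either order and either one can be rolled back on its own.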
Can you change these things without having to change them in lockstep? Because if you do have to tightly synchronize those changes, it will go wrong. And the more that those things are decoupled, not only the more control and flexibility you have in releasing and innovating, but also that is inherently the ability to turn off things that don't go well in production. So that's a really quick way to solve a lot of outages. There's going to be outages that aren't going to be solved by turning off a feature flag. But a lot of them can be. And then I won't really talk about the service alignment, but for those of you who might be watching this video later, please look at a recording of Holly Allen's talk about Slack and their journey to service ownership.
I talked a little bit about internal resistance, and that that might be coming from database administrators that see their roles changing in ways that they're uncomfortable with. Well, first of all, I think everybody should be comfortable with their role changing, because it's actually better when you are not sort of the human automation around the database, or when you don't have exclusive responsibility over something that you actually don't have control over. Because guess what, those developers can release things that influence the database in ways that you can't control. When you go into the new world of database reliability engineering, life is better. You become a subject matter expert for something that a lot of people really, really value: how do databases work? How does indexing work? How do queries work? How do I actually become more proficient in these things? Think of, let's say, an engineer who's a couple of years into their career and doesn't really understand a lot about databases, and they see this as a way that they can add value to themselves, make their careers more valuable, make their colleagues more valuable, and have a greater impact on whatever they're shipping to their end customers.
You have innate and almost unconscious expertise in these types of things, and you can change your role from gatekeeping around the database to actually spreading this knowledge and mentoring and coaching a lot of people into something that's going to be really rewarding for them.
A little closer here, or if you can advance. Thank you. Can you go back one? So just a sample: these are all actually hot-linked from the slides, and you can click through to any of these. And then this one is a QR code that you can scan right now if you want to download. This one goes to my company's website, and it's a form fill and then a PDF. It's like 65 pages of everything, really. There are so many stories that I was able to find. So most of this is not like my writing, most of it is just compiling other people's stories. For example, there are a lot of people who have figured out how to ship this information and their schema in very decoupled ways. So there's probably like, I don't know, 12 or 15 different very in-depth stories that I was able to find in recorded conference talks and slideshows and blog posts, even books, ACM Queue articles, things like that. So that's all linked in there. So there's a huge richness that you can learn from. You don't have to invent this yourself. And then the slides are in the channel, and this is the QR code for the slides themselves. And again, I'd love to hear from you, and thank you very much for the privilege of talking with you.