Hi folks, can everybody hear me? Okay, cool. So, DevOpsDays Philly, right? I've lived in Philly for about 21 years now; I consider this to be my hometown. So when DevOpsDays was coming here, I thought it was really important that I submit a really good talk. I asked people I know what I should submit, and a friend of mine said, you should talk about culture, because so much of our DevOps discussion is about tools. And tools are great, I love talking about tools, I'll talk about tools all day. But culture's kind of hard, right? It's really easy to annoy people and say the wrong things. As technologists, it's very easy for us to say things that are completely tone-deaf about culture, and we do it all the time. So the question was, is there anything I can add to a discussion about culture? Tomorrow we're gonna hear from Bridget Kromhout from Cloud Foundry. She's probably gonna tell us that containers won't fix our broken culture. She's completely right. I was at Velocity in New York a month ago, and Katherine Daniels, the author of Effective DevOps (which you should all read, it's an excellent book), told us that tools will not fix our broken culture. She's completely right. So this is the problem we all face: we have this embarrassment of riches in tooling and best practices, but none of it fixes the human problems that we have. In fact, it's not even very good at fixing the technical problems we have, so how do we expect it to fix our cultural problems? And this is because of something called Conway's Law. It comes from a paper from decades ago now: an organization is gonna build systems that reflect that organization. And "systems" doesn't just mean technical systems; it can also mean our cultural systems.
This is also called shipping the org chart. In fact, just this morning I was reading the original proposal for the World Wide Web from Tim Berners-Lee, and he describes how our organization here at CERN is like a web of different sub-organizations where things come and go, so we should build a technical solution that reflects our organization. This goes back a long time. So what we're saying here is that culture dictates our technology: our shared values, our shared beliefs, our shared ways of making decisions influence our technical choices. But culture is a complex artifact, right? It's a chaotic system, it's full of feedback. Instinctively, we know that bad technical choices can influence our culture in bad ways. We know that if we don't build our systems for reliability, our ops team will hate our dev team; they will hate their guts. We know that if we don't build for self-serviceability, our dev team will in turn hate the ops team. We know that if we don't build for flexibility, the project management team will hate the crap out of the dev team, and so on and so forth. Everybody's seen this cartoon, it's really funny; I particularly like the sysadmin column. But this is the anti-pattern, right? We know that if we're cranky on a Monday morning and we're sniping at each other because we were up all weekend because PagerDuty was giving us the wha, wha, wha (and those of you who use PagerDuty know what I'm talking about), because we didn't build reliable software, well, technology's influencing our culture. If we just start ignoring email because we're sending our bug reports and inter-team chatter and our code reviews and alerts and exception reports through it, which is actually something that I saw a company do once.
True story. And they're all going through the same medium, so we can't use any of it effectively. Technology's influencing our culture. If instead we're making decisions asynchronously and thoughtfully with a distributed team, and we're enabled to have a distributed team that might be working across the globe, technology is influencing our culture. So there's a feedback mechanism. There's this great quote from an Etsy engineer, in a discussion on Reddit of all places, where we can see this inherent back and forth between tools and culture: they both influence each other. Tools won't fix our broken culture; if your culture is totally broken, you gotta fix the culture first. But we can certainly have a positive feedback mechanism. So I tried to come up with what I'm calling four principles of software-defined culture. What if we could make technology influence our culture in positive ways? Or at least not make our culture worse? That would be a good start. Now, maybe "principles" is a little too grandiose for you, so call them paths, or considerations, for software-defined culture, whatever you want. I'm gonna go through these. The first one is building for reliability. Now, obviously, if your software is so unreliable that you lose all your users and you fail your mission as an organization and you go out of business, you have no culture, right? But that's kind of tautological; that's the easy case. Failure to build in reliability is going to lead to firefighting, and that firefighting becomes a rut for organizations. You do patch-fix work, the problems come back, they grow, they expand. Everything becomes urgent, and that means you stop focusing on new feature development and on whatever your organization's actual mission is. If you reward firefighting, you get a culture of arsonists, right?
Without a culture of reliability, we've misaligned our incentives, and we're at risk of creating a culture where we just lurch aimlessly from crisis to crisis instead of solving problems. And this is gonna lead to burnout. People don't like getting paged. They wanna work on whatever it is they came to you to work on, all those promises you gave them when you hired them. They also wanna sleep; that's good, too. And you're gonna lose people to this. The deep irony is that you're gonna lose your best people to this, and you'll be surprised when it happens, because one day they're just gonna say, you know what, I'm done, I'm gonna go raise goats in Idaho. And that'll be it for them. So we need to guide our technology choices this way, and one of the ways we can start is by stopping the resume-driven development. New technology isn't battle-tested; it is going to be broken all the time, and when you choose new technologies, you're always at risk of that. What's more, if you're doing that all the time, you've normalized risky decision-making. So learn Elixir, or AerospikeDB, or Vue.js, or whatever the new thing on Hacker News this week is, on your own time, that's an option, or in your development tools and your internal tooling, rather than yoloing that stuff into production all the time. Thanks, folks, that's great. This has ripple effects in terms of attracting like-minded hires, too. People who always wanna work with only the latest and greatest are going to congregate at places where they can get away with that. And what's more, and I've seen this at a number of organizations, they tend to bring along people who like that stuff from their last organization, and they'll follow each other around like a plague of locusts, leaving you with unmaintainable software, right?
This is gonna increase your hiring costs, and it has no business value to you. Rewriting your front-end every six weeks because you've got the newest React thing or whatever has very little value for your culture as an organization. This also biases your hiring against experienced engineers, and these are the very people many of us are trying to hire. They've already been through a couple of burnout cycles; they see this coming a mile away when they look at your job listing. It's like, oh, we've got all these new technologies, and they're like, no, no, no, no, no, I'm not gonna work there, I know what I'm in for. It biases you as an organization against older staff as well, people with families. When little Aiden needs to go to band practice or whatever, that's not the time they wanna be debugging your new stuff. It's also harder for our inexperienced but second-career types. We've been telling people, learn to code, it's a great way to bring yourself into the middle class, but these are people who largely are not gonna wanna be up all night because you decided to switch over to the new MongoDB storage engine after only testing it for two days under your workload, right? So one of the failure modes of trying to solve this is bimodal IT. This is the thing you see at enterprise organizations: we'll have one group of people who get to be the innovators, and we'll do all the new stuff there and self-contain it so they can't screw up the rest of the business, and then we'll have the people who are the maintainers. Bridget Kromhout calls this awesome mode versus sad mode, and it should be obvious what this does to the morale of your organization, right?
Like, you people aren't innovators, good luck, you know? So we should have a default bias towards choosing boring technology. We wanna minimize the unknown unknowns so that we can focus our energy on what matters: the mission of our organization, and the health and well-being of the people we've hired. Save the innovation for when it's gonna have the biggest impact. So our second principle is building for operability, and I'm really talking about orchestration here; things like monitoring and debugging I'm gonna talk about a little bit later. We're talking about having automated build and test, reliable deployment, and easy scaling, ideally in some kind of self-service way. Your development team should be able to self-service and provision the resources they need to do their jobs, without "well, let me fill out a form to order a VM." It's like, well, we did innovation: now it's a form on a webpage. That's not actually self-service. When we do this, we get a neat side effect, which is that now everyone gets test and stage environments whenever they need them, and that's going to empower our development teams to take ownership over not only the software, but the infrastructure on which it runs. Now, how this plays out varies, and one of the notional frameworks around this kind of stuff right now is GIFEE: Google Infrastructure For Everyone Else. And there are a bunch of companies who are willing to sell you stuff that'll do this. If you're building a distributed system of microservices that only speak HTTP, anybody doing that? Only speak HTTP? Okay. Then maybe GIFEE is the right answer for you. But you're taking on a lot of cultural baggage when you do that, right?
You're saying, oh, we're just like Google. Okay, so let's go back real quick. We were talking about GIFEE and the notion of these platforms that do all the things for you. What we've done in this case is split the application behavior between the application and the platform, and what that really means is splitting it between the application developer and the operator who runs that platform. Who that operator is will vary depending on the organization: maybe you have an infrastructure team that runs the platform, or a platform team, or the anti-pattern of a "DevOps team." Don't do that. The problem with this is that our application developer can't actually understand the total behavior of their application without understanding what's going on in the platform as well, which isn't managed by them. Largely, these platforms are very heavyweight, which means the developer is not gonna run the platform in their local development environment, and now we have "works on my machine": hey, this worked fine for me locally when I was doing my unit testing, and my integration testing was actually only testing a single service, and now it's going into production not really tested. A lot of these things have integration code associated with them, and code that's written for one orchestration platform isn't gonna be portable to another one. That means that if our operations team wants to make decisions around the cost or reliability of the underlying platform, they can't, because the application has been deeply tied to it. If any of you are trying to migrate off of DynamoDB because you decided to do that, you know what I'm talking about here, okay?
So this tight coupling of all the components, the application, the platform, and whatever integration code you have, has become its own source of friction and failure. And this is really messed up, because we've reintroduced all the problems that we were trying to solve by going to this kind of infrastructure in the first place. And GIFEE is only one solution to this; if you look at what goes on in larger enterprises, we've got an architect who's created the abstract singleton proxy factory factory, and we've just made it out-of-process and rewritten it in Go. Congratulations, folks. Excuse me, congratulations, folks. We've made some real progress here, right? Okay, so instead we should consider an application design that allows applications to be self-contained, and to self-operate with a minimal amount of global coordination. If we have this kind of platform-independent separation of concerns, where the application and the scheduler and the infrastructure are their own self-contained entities, then the people who are responsible for them can have ownership of them. So, ironically, to end the separation of dev and ops, we actually have to separate the components that they're working on. Orchestration needs to be laptop-friendly: if you can't run your stack on your laptop, then you are going to have problems that only appear in production. We want to eliminate "works on my machine" to the degree that's possible. So Google infrastructure is probably awesome, but does it solve the problem we actually have? I'm trying to get "jibba-jibba-apas-de-apwa" to come out, but I think it's like an offensive Welsh word, so I might have to, yeah, okay. So the third principle is building for observability, right?
And this is understanding what happens during incidents, after incidents, and certainly before incidents. We know that we have to monitor all the things; everybody knows the meme that's supposed to be here, monitor all the things, right? But with the rise of containers and microservice architectures, we're now all building distributed systems, and they are harder to observe and debug than the systems we were building just 10 years ago, when this quote appeared in ACM. And because these systems are distributed, many of the most difficult problems we're going to face are only ever going to happen in production, because they're going to involve things like the networking infrastructure. Which means we need to be able to debug our production environments, and debugging in production means you have to have a way to observe it safely. If your observability tooling alters the performance of what you're observing, or worse, causes it to crash, well, then you can't use it safely in production, and you won't. It's very much time for us to move away from dashboards. Dashboards are great for telling you what you already know, your known knowns or your known unknowns, but they're very bad at telling us the things that we haven't figured out yet. We should be selecting tools that allow us to explore and discover. And very importantly, I think, for modern teams, those tools should allow us to explore and discover the problems we're having as teams, not as individuals. There's some great tooling coming out now (this is honeycomb.io) where the monitoring tooling we select can improve communication and create durable, shareable processes for exploring and discovering the problems we've had in production, right?
When I go through and do my queries here, I can build them up iteratively, I have a record of what I've done, and I can share them with my team, which means that I, as a senior engineer, can actually go on vacation, and I get new mentoring and training opportunities out of the debugging work I've done. This also means we should be building for rainy days. Anybody can build a system that works when everything works; that's kind of tautological. If we only build for the everything-works case, well, then we've actually tied our hands for when things go wrong. So let's tell a story. We have a startup, and I have skinny jeans on, so of course it's in Node.js, and it has a programming bug. The right way to do this is to abort on uncaught exceptions when we run it. So we hit the bug and we get an error message, which isn't all that useful, it's just a TypeError, and we get a core dump. And I can take that core dump and move it into my observability tooling, my debugger, and I can get the stack that was associated with it, look at the functions associated with that stack frame, get the error, and now I get the object that caused the error. I've gone from a single bad request to a solution to the problem in just one request. There's no "well, we'll monitor it for the next week and figure out what happened." This is the good way to do it. But Hacker News had this really great article on promises this week, so we rewrote our whole stack over the weekend in promises, and now we have this problem: there's no automatic abort. I can get an error message if I grab that reason object and print it out, but there's no abort. So I have to remember to manually abort every time I've created a programming error.
Well, if I could remember to manually abort every time I've created a programming error, maybe I wouldn't make the programming error in the first place, right? But let's assume we knew how to do that, and we get our stack. Except now we've mangled our stack and we don't really know what's in it. We can get the error message out of it, which again doesn't really help us, because it's like, well, it's a null-pointer kind of error, but we don't know where. So we have to go find all the functions associated with that file, and then we have to go find all the closures that were associated with the then part of the promise. And then we can go spelunking through the heap like this to find all of the possible objects that could have been passed into that function. And by the way, if you're passing around a bunch of anonymous functions, the answer to that is: screw you, guess. But we have an object here that we know, just because I'm cheating, so we don't have to guess. And so we can finally find that object and print it out and say, ah, okay, that object doesn't have the attributes we're reading from it. Whew, boy, that's an awful lot of work, isn't it? And now you're telling me, hey, if I crash, that's fine, who cares? I'm just gonna restart the process again. I mean, when Basecamp came out with Rails, it crashed every five minutes in their internal environment, and that was fine, you know? So it's okay if we're losing throughput to these kinds of restarts, because we're web scale; we're just gonna scale it up horizontally, and that's fine. Maybe. But now you're spending more money, so your burn rate goes up, and you gotta go back to your VCs for more, and eventually all that VC money is gonna dry up because the institutional investors are gonna lose their appetite for it.
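The difference this story hinges on can be sketched in a few lines of Node.js. This is a minimal illustration, not the talk's actual demo; the function names are hypothetical:

```javascript
// Sync/callback style: the TypeError escapes as an uncaught exception.
// Run with `node --abort-on-uncaught-exception` and the process aborts at
// the throw site, leaving a core dump whose stack still points at the
// offending function and the bad object.
function handleRequestSync(user) {
  return user.name.toUpperCase(); // throws TypeError when user is null
}

// Promise style: the same TypeError becomes a rejection instead. Nothing
// aborts automatically; the stack belongs to the promise machinery, and
// unless every .catch() remembers to re-throw or abort, the heap state
// that caused the error is gone before anyone can inspect it.
function handleRequestPromise(user) {
  return Promise.resolve(user).then((u) => u.name.toUpperCase());
}

handleRequestPromise(null).catch((reason) => {
  // All that survives here is the message, not the state behind it.
  console.error('rejected:', reason.message);
});
```

Calling `handleRequestSync(null)` under the abort flag gives you the core-dump workflow from the first half of the story; the promise version only ever gives you the `reason` object, which is the heap-spelunking slog from the second half.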
And then when the bubble bursts because of credit default swaps, people are gonna be out on the street, and a lot of that debt is foreign-held, and that could even lead us to armed conflict, and then it's strap on the power armor, boys and girls, because we've got radioactive zombies. Okay, so I'm not saying that promises are gonna lead directly to radioactive zombies, of course. But what we've done here is we've made a seemingly small technical decision, really only for ergonomics, and we've ruined our ability to debug and understand what our software is doing in production. And if you don't have anybody on your team who can do this, you should probably learn how, or hire somebody who does. And if you're on a platform where you can't even get access to this stuff, where you don't have root, where you can't even SSH into the box, well, I'm not saying never use a PaaS, but I am saying you gotta understand what you're giving up. You're definitely giving up something by doing that. So choosing a platform where your engineering team is inevitably going to run into a problem where the only solution the very, very smart people you hired have is "well, better call support" may not be the culture that you wanna build, especially given what we know about some of that support. I'm not gonna name names there. All right, so we should bias towards choosing platforms for observability. A bias towards throwing money at a problem rather than our brains works great if you have more money than brains; for those of us who don't, maybe that's not the right solution. And a platform isn't just where you're hosting; it's also the selection of your tools. I'm not saying you shouldn't pick a language that says "debugging is hard, let's go shopping," but when you do so, you should recognize what trade-off you're making. We should be getting used to interacting with our observability tooling every day, right?
We have a choice to understand what we've built. We can build a culture around engineering excellence, or we can pretend that we don't need to. I know which culture I'm gonna bet on. Okay, so if you were annoyed at me before, hold on to your butts, because every tool comes with a community, particularly in the open-source world. I mean, this has been the great victory of open source: we have this huge number of tools we can use now. And some of these communities make it a point to be open and inclusive. This is a quote from the Rust community code of conduct. Even if you don't care about Rust at all, which is a totally reasonable position to have, this code of conduct is amazing, and they've made it a core part of how they've built the community around their software from the very beginning. This isn't something that somebody imposed on them from the outside later on. Other communities have a well-deserved reputation for some toxic behavior; I'm gonna try to avoid naming them or being combative. So when we choose a tool, we have chosen its community. The people on your team are gonna interact with that community every day. They're gonna submit bug reports, they're gonna submit patches, they're gonna read blogs about the tool, they're gonna attend conferences like this one. The tools you select will affect who you hire, too. And if you are building a very innovative, bleeding-edge piece of technology, where it matters that a monad is a monoid in the category of endofunctors (I have no idea what that means), then maybe selecting tooling that's gonna attract people who understand that, and who are part of that culture, is the right move for you.
But if you're building a web CRUD app, like someone like me would, then those people are gonna be very unhappy if they come work for you, and you will be unhappy with the things that they build. So we should make sure that we're picking tools that have the culture we wanna be associated with. And this goes all the way down to the design of the tools we use. If your programming language is intentionally designed with the idea that you're not smart enough to use a better one, is that the culture you wanna be associated with? When a community not only tolerates exclusionary behavior, but actually inspires a user revolt that forks the runtime to protect that exclusionary behavior, is that actually the community you wanna be associated with? I said I wasn't gonna call out communities, but I am. Okay. So at the beginning of the talk, I was talking about the embarrassment of riches we have in terms of tooling, and we jokingly talk about JavaScript fatigue: it's been two days since the last JavaScript framework. And that's a real thing, and it's actually great that that's the case; this is the victory of open source. But have we really thought about why there are so many tools? Some of this is the resume-driven development. If you're hiring people and you're asking, where's your GitHub account, and how many stars do you have on things, well, then obviously you're incentivizing people to go after that kind of stuff. They're gonna build tools, whether or not anybody actually uses them. But is the mission of our organization not enough for us as makers, as builders, as creative professionals? Are we not being fulfilled by that mission? So I'll ask a question, and don't raise your hand, because your boss might be sitting next to you: how many of you are working for an organization, or working on a product, that actually makes the world worse?
Just give me stern eye contact, right? Okay. Now raise your hands for this one: how many of you are working for organizations that are hiring? Okay, virtually everyone is up, and I assume everybody else is looking for work. Now, our profession doesn't have licensing, which is probably right, because there's a big difference between real engineers working on operating systems and dorks like me who don't even have a college education and work on distributed web stuff. So it's probably right that we don't have licensing. But we can learn a lot from our brothers and sisters in other engineering professions. Imagine if we had a code of ethics like this one. I'll read a little bit of it, because I think it's really important: engineers shall hold paramount the safety, health, and welfare of the public. Engineers shall act in such a manner as to uphold and enhance the honor, integrity, and dignity of the engineering profession. Imagine if software engineers had a code of ethics like that. Imagine if operations engineers, the people in this room, had a code of ethics like that. We can, individually. So I want everybody from that first group to go find somebody from the second group while you're at this conference. And don't tell me, oh, I've got student loans and mortgages and stuff like that. We all do, but right now this is a seller's market. The organizers probably had to keep recruiters out at the door, holding them off with pitchforks. Everybody wants to hire us; our salaries have gone up dramatically over the last few years. This is the time when we can actually have control over the kinds of tools and the kinds of projects we're working on. Your work creates the community. In a world where software is eating the world, the work you do creates the world around us, and it reflects our values.
It also reflects our blind spots. Our blind spots are how you get organizations like Facebook or Google demanding people's real names, which puts people who are trans or political activists at risk. It's how we end up with mobile sites that blow through the cheap data plans of working-class folks just to serve them an ad for a product they can't even afford. Inclusivity doesn't stop with your team. So many of the systems we build are ripe for abuse, and we've misaligned our incentives for working on those things, and largely we just don't care. I think we can do better. I think our software can improve our culture, and I'd like to invite all of you to join me in doing that. Thanks a lot, folks.