All right, awesome. Thanks for joining us today. I'm Ryan. This is Corey and Buzz. We are, or I was, and you two still are, on the PWS platform team, also known as CloudOps PWS. We do a lot of the operations work of running Pivotal Web Services, the public offering that Pivotal runs. Anything else? Yeah, we wear a few hats: the operator hat, the platform engineer hat in terms of building tooling and techniques in support of operations, and also site reliability engineering practices. So our talk today is called Double, Double, Toil and Trouble, a Will Shakespeare reference for the nerds in the room: SRE Transformation through Automation and Collaboration.

Act One: what is toil? This is a question we asked ourselves when we started putting this talk together. What are the indications of toil? What's the definition of toil? What are we trying to classify as toil? We came up with a few things. There has to be a well-defined path to resolution, something that's easy to describe and easy to follow. There has to be a low barrier to decision-making, so that any Joe Schmo can answer the questions required to resolve it. It's something that's painful to resolve. It's prone to operator error, and if it's prone to operator error, chances are you want to automate that error away. It's repetitive and self-explanatory. It impacts other team priorities or work. And it has a risk or severity attached to it. Those are all things we classified as indications of toil.

Now, some of you might be familiar with the Site Reliability Engineering book that Google has released. They have their own definition of toil, and there's a fair amount of overlap with what we defined, but they did it a lot more concisely and probably put a lot more research into it, so we're going to go with their definition. The first characteristic, repetitive, is self-explanatory; we covered that one as well. The second, no enduring value, means the end state after you resolve the issue should be better than the state you started in: after you resolve it, you should get value out of it, and if you're ending up right where you started, it's probably toil. Manual means you're doing work as an operator that you really shouldn't have to be doing, because it's automatable; a machine could be doing this, and theoretically a human doesn't have to. Tactical means it's reactive work that you have to do; it's not something you can put on the backlog and prioritize for later, it's stuff that comes up because there's a deadline or some pressure to complete it. And finally, it grows O(n), or worse, with service growth, meaning it scales with your service: as your service continues to grow, you're going to experience more of this pain and more of this toil.

If we compare notes between what we thought and what Google thought, we got the automatable part; that's what our first two items were. A well-defined path to resolution plus a low barrier to decision-making means it's automatable.
Now, prone to operator error: there's a little bit of crossover there with manual. And repetitive, we nailed that one. So what about the rest of these things? That it's painful to resolve, that it interrupts team priorities and work, and that there's some risk or severity attached to it. These all seem like really relevant aspects of toil, but they're not things that Google's definition addresses directly. We're going to set them aside for now, but they will come back up.

I wanted to give an example of how something can fit into this definition. Something operators here might be familiar with is UAA user and client tickets: you receive an ask ticket or a ServiceNow ticket saying, hey, I need some client or user created. It's repetitive because it's probably the same five uaac commands every single time; maybe the scopes change, but it's about the same every time. There's no enduring value, because all you're doing is unblocking the team that's asking; from your own team's perspective, a selfish perspective, you've done nothing but interrupt your workflow. It's manual because it typically requires an admin to go in and add them to UAA. It's automatable because all of the variables, the scopes and permissions required and so on, can be provided by the user; you shouldn't have to figure any of that out. It's tactical because you're reacting to a ticket you received. And it grows with your service, because as more people come onto the platform, you're going to get more tickets. In fact, that becomes extremely painful, because those same five uaac commands start to vary a lot and you're doing a lot more of this work as your platform keeps growing and growing. So that one I think everyone can relate to.

One that we experienced on the PWS platform, which was kind of fun, is cryptocurrency miners. We offer a public platform with a free trial, so obviously people are going to take advantage of it; it's free real estate, right? They're repetitive because they show up every single day, which is really annoying. There's no enduring value, because all we're doing is getting back the resources they're stealing. It's manual because we have to deactivate their accounts and remove the resources that have been allocated to them. It's automatable because, again, all we're doing is running the commands to delete or disable the account and going in and getting rid of their orgs. It's tactical because we have to take action on the miners' schedule; they actually do develop schedules, probably because they're running a script or some automation, or maybe it's just when they show up to work in the morning and click a button, and now they have more miners. And they do scale with the service: as our service grows and gets more attention, more people are going to say, hey, there's something free I can take advantage of. We'll come back to that one as well.

So I've been talking about what toil is, but why should you care about it? To put it in one concise sentence: you have no choice but to do this work, it requires action by a human when it shouldn't, it shows up frequently, it provides no improvement, and it will only get worse with time.
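As an illustration of what automating that UAA ticket flow from above could look like, here's a minimal sketch using the UAA CLI (uaac). The client name, scopes, and UAA URL are placeholders, and the script assumes an admin client credential is already available; this is not the actual PWS tooling, just the shape of it.

```bash
#!/usr/bin/env bash
# Rough sketch of automating the "same five uaac commands" behind a client-creation
# ticket. Names, scopes, and the UAA URL below are placeholders for your platform.
set -euo pipefail

UAA_URL="${UAA_URL:-https://uaa.example.com}"      # hypothetical UAA endpoint
ADMIN_CLIENT="${ADMIN_CLIENT:-admin}"
ADMIN_SECRET="${ADMIN_SECRET:?set ADMIN_SECRET}"    # injected from a secret store, never hard-coded

NEW_CLIENT="$1"        # e.g. "team-x-ci", supplied by the requesting team
NEW_SECRET="$2"
REQUESTED_SCOPES="$3"  # e.g. "cloud_controller.read,cloud_controller.write"

# Point the CLI at UAA and authenticate as an admin client
uaac target "$UAA_URL"
uaac token client get "$ADMIN_CLIENT" -s "$ADMIN_SECRET"

# Create the requested client with exactly the authorities the requester supplied
uaac client add "$NEW_CLIENT" \
  --secret "$NEW_SECRET" \
  --authorized_grant_types client_credentials \
  --authorities "$REQUESTED_SCOPES"

echo "Created UAA client '$NEW_CLIENT' with authorities: $REQUESTED_SCOPES"
```

Wrapped behind a small self-service front end, the requesting team supplies all the variables and an operator only has to approve, which is the direction described later in the Q&A.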
So in other words, why should you care? Because you don't hate yourself. That sounds awful. And what's the cost of toil? Well, these things come back: it's painful to resolve sometimes, it interrupts team priorities and work, and there can be risk and severity attached to it. And from another perspective, one that operators might not care about as much, it can cost money and it can cost time. As a company, these are definitely things that impact you. So we're going to classify all of these as the impact of toil from here on.

So we've decided what toil is. Next, we're going to move on to Act Two: how do we identify and prioritize this toil? And Buzz is going to take it from here.

Thank you, Ryan. So yes, Act Two: how do you identify and prioritize toil? If we want to sum it up in one sentence: you need to be cognizant of toil. Start collecting the common concerns brought up by your team. Start writing down the manual work that you do and recording how long it took. As we mentioned, that's work that usually adds no lasting value. One footnote we added: this toil might not always be technical. It could be emotional, and it could be perceived differently by different teams and even by different members of your team; one person's perception of toil might not be the same as another's.

You might be asking yourself, how can I be more cognizant? Well, as I mentioned, start recording: keep track of interrupts, pages, anything toilsome. Collect feedback from current team members, and from new team members especially, because they bring a view and experiences from other teams; they might have experienced more toil and might have potential solutions to the toil you're experiencing right now. Discuss within your team and within the broader organization what toil you're experiencing, and similarly you might encounter solutions that other teams already have for what you're dealing with.

One useful way to visualize the frequency of toil comes from our colleagues on the Europe side who manage the platform for Pivotal Tracker. They use something called a toil snake. Basically, as you encounter toil, team members put a sticky either on a physical whiteboard or on a real-time board, which is what we took a screenshot of. You might indicate who experienced the toil, when it happened, and potentially how long it took to resolve. You can simply look at this and see which toilsome task has the most stickies, and if you sum up the durations, you can even see how large the time impact is.

We also come back to an image many of you might have seen from xkcd, which is another useful tool when you're filtering toil and asking yourself: is this worth solving, is this worth automating? You need to weigh how much time you spend doing the toilsome task against how much time or money it will cost you to fix it, to automate it. Not all toil is created equal; some will have more impact on your business, but it might also take more time to solve.

Coming back to a slide from Act One, this is what we defined as impact, but we also need to consider who the impact is affecting, how severe the effect is, how long it has been going on, and whether you have support from the broader organization to address the toil.
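To make that xkcd-style "is it worth the time?" filter concrete, here is a back-of-the-envelope sketch; the numbers are invented for illustration, not measurements from PWS.

```bash
#!/usr/bin/env bash
# Toy "is it worth the time?" check, in the spirit of the xkcd chart mentioned above.
# All of the inputs are example values.

MINUTES_PER_OCCURRENCE=15   # how long one round of the toil takes
OCCURRENCES_PER_WEEK=5      # how often it bites you
HORIZON_WEEKS=260           # roughly five years, the horizon the xkcd chart uses

total_minutes=$(( MINUTES_PER_OCCURRENCE * OCCURRENCES_PER_WEEK * HORIZON_WEEKS ))
echo "Time spent on this toil over ~5 years: $(( total_minutes / 60 )) hours"
echo "So you could spend up to that long automating it and still break even."
```

With these example numbers that's about 325 hours, which is the kind of figure you can then weigh against the impact dimension discussed next.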
To visualize who the stakeholders affected by toil might be, you can start with your team in the middle, where solving the toil probably has the most impact, but other teams in your organization, other units, marketing or business units, might also feel the effects of this toil. You might also consider customers of your organization, the broader software community, and even society as a whole. If we cross all of those together, these are all the combinations from what we've presented that you might consider in determining the severity of the toil and how important it is to your organization to solve it.

A few examples of toil: one we mentioned was user and client creation, granting access to systems. You might be writing incident reports for your production platform, or forensics reports if there was no customer-facing incident; the toil around that might be the formatting of the report, collecting timelines, and synchronizing with different parties. There's also scheduling retros and other recurring team meetings, which is the last point. I'll skip over the rest of these examples, but they're all toilsome tasks we've experienced on our PWS platform team.

Another useful tool for putting all of this together, considering the frequency and the impact of a toilsome task to determine what you should focus on first, is a two-by-two. We've used the dimensions frequency and impact; for your team or your organization they might be different, but in most cases it will be frequency and impact. You start by taking your toilsome tasks, which you might have gotten from the toil snake and from discussing with your team and other members of your organization, and placing them in the quadrants. Usually the most important toil to fix first is in the top-right quadrant, high impact and high frequency, because solving those problems gives you the greatest reward. In our example, that's cryptocurrency miners.

To dig into that example: we certainly have a ways to go to fully automate solving this on our platform, and we've done it over a period of time. As our first iteration of alleviating this toil, we automated the detection and notification, so team members get notified when a potentially abusive application is detected on the platform. They can go in, either next business day or right away if it's more severe, determine whether it really is an abusive app, and take action. Another toilsome task we identified from speaking with the team was story creation. We use Pivotal Tracker, but whatever tracking software you use, you'd otherwise have to manually create stories and add the tasks to be completed. So we amended our automation scripts to create a story and add the tasks, so anybody can pick it up and perform this repeatable process; a rough sketch of that follows below. Iteration three might be automatic suspension of the organizations. You might still want to send a notification in case an organization and application were incorrectly suspended, but that, I think, is one of the later iterations of how we can automate this toil.

So now that we've talked about how you might prioritize and filter toil, I'll pass it off to Corey to tell us what's next.
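Here is a rough sketch of what that second iteration, filing a Tracker story with a task checklist when a suspicious app is found, might look like. The project ID, token, app argument, and task list are placeholders; the endpoints are the public Pivotal Tracker v5 API, but this is not the team's actual script.

```bash
#!/usr/bin/env bash
# Sketch of "iteration two": when a suspicious app is detected, file a story with a
# checklist of tasks so anyone on the team can pick it up without missing a step.
set -euo pipefail

TRACKER_API="https://www.pivotaltracker.com/services/v5"
PROJECT_ID="${PROJECT_ID:?set PROJECT_ID}"
TRACKER_TOKEN="${TRACKER_TOKEN:?set TRACKER_TOKEN}"
SUSPECT_APP="$1"   # app name or guid passed in by the detection job

# Create a chore describing the potential abuse
story_id=$(curl -sf -X POST \
  -H "X-TrackerToken: $TRACKER_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"name\": \"Investigate potential crypto miner: $SUSPECT_APP\", \"story_type\": \"chore\"}" \
  "$TRACKER_API/projects/$PROJECT_ID/stories" | jq -r '.id')

# Attach the runbook steps as tasks so no one misses an important step
tasks=(
  "Confirm the app is actually mining (logs, CPU profile)"
  "Suspend the org and deactivate the user if confirmed"
  "Record findings on this story before closing it"
)
for task in "${tasks[@]}"; do
  curl -sf -X POST \
    -H "X-TrackerToken: $TRACKER_TOKEN" \
    -H "Content-Type: application/json" \
    -d "{\"description\": \"$task\"}" \
    "$TRACKER_API/projects/$PROJECT_ID/stories/$story_id/tasks" > /dev/null
done

echo "Filed Tracker story $story_id for $SUSPECT_APP"
```

Keeping the checklist in the script rather than in people's heads is what makes the process repeatable by anyone on the team.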
That's fun. All right. So let's say we as a team have started to hone our SRE skills: say we've done things like adopt SLIs representing our customers' experience and SLOs representing our customers' expectations. We've figured out, excuse me, how to minimize incidents and improve how we handle them. And indeed, we're getting a better handle on how to recognize and address toil. So what next?

Well, the thing to keep in mind is that this is all really about culture. That is, the real value, in principle, of something like DevOps is cultural improvement, as illustrated by this great visualization courtesy of Daniel Stori. So the thing to do next is to spread the good news, to make this scale, if you like. Okay, fine, so how? How does this scale? Let's take a look at that visualization again with a few adjustments. As an aside, a big thanks to Daniel Stori for granting me permission to make these adaptations to his images.

Expanding on the idea that DevOps is fundamentally about cultural improvement in principle, I'd say site reliability engineering is fundamentally about cultural improvement in practice. And I suggest that the team best positioned to be at the fore of adopting those SRE principles and practices is you, the operations or platform engineering team. Let's come back to that point in a minute and dig into this question of how this scales with a story.

So here we are with the platform team. We've figured out some techniques, some practices, to address reliability, incidents, and toil. Great, let's go tell our friends about that. Let's go talk about this with our sibling platform teams. Let's go talk about this with our services teams, with our application developers running their applications on the platform. After all, they too deal with reliability and with incidents and toil. And when we talk with those teams, we encourage them in turn to go do the same: we encourage them to go talk to other teams. And here we find that we can go viral, if you like. This approach, really like any practice, is how we scale.

So why me? Okay, fine, returning to this question: is it really me who should be driving this? Why? Let's go back to this last picture. As we've seen, we platform engineers, or operators if you like, are really at the center of it all. That is to say, if you're a team operating a Cloud Foundry, or perhaps many Cloud Foundries, you are effectively a hub for collaboration. The services teams, the application developers, they're all running their services and products on the platform that you provide.

So yeah, but maybe our organizational structure doesn't support this cross-cutting collaboration. So again, why me? Many organizations might have a command-and-control-like structure, where directives, if you like, are issued from on high and passed down through some hierarchy. I suggest, however, that there's, excuse me, a fundamental flaw in this model when considering something like cultural change and the spread of practices: while all the parties in this hierarchy are well-intentioned, something is lost in the delivery of the message. You can think of it like the telephone game, where at each hop something is lost or mixed up, and the message is confused by the time it reaches the people executing on it. Rather than this, what we seek, or what we need, at least for spreading practices and culture, is an organization like this, where our senior leadership expresses vision, our middle leadership in turn translates that vision into strategy, and it is the staff who best figure out how to execute on and achieve that vision.
So once again, it is you who have figured out how to adopt reliability practices, how to address and minimize incidents, and how to burn down toil. And therefore it is you who are best positioned to spread these practices, which is to say, cultural change comes bottom-up. So again, that's why me.

To conclude our tale, let's return to the beginning. If DevOps is about the principles of cultural change, and SRE is about the practices that lead to cultural change, then is this also really important? The answer is yes. With that, that's Double, Double, Toil and Trouble. Here are some references, if you like, for the imagery in the slides. Let me open it up to comments and questions. Are there any questions or comments?

Have we talked to Google SREs about our model, and do we have any... sorry, what was the second part? Did they comment? We've talked some with Marie, who was a product manager for our team and is now at Google; we have chats now and then, a sort of free-form sharing of notes and ideas. I think of it this way: Google has offered the community these great resources, but it's not static. Folks at Google and across the industry are continuing to evolve this stuff; we're all on a journey, if you like. And for a given organization, there may be some things that make sense to adopt verbatim from Google, but that's not always the case. It depends a lot on your culture and on who your customers are. For Pivotal Web Services, for example, and for Pivotal, our customer is fundamentally different from Google's, so we're adapting and adjusting. And I think it's a bidirectional sharing of information. Go ahead again, you had more, Tony? Great. So the suggestion there is to watch the 40 minutes of video that was produced. Liz Fong-Jones produced it? Okay. Yeah, great. Thank you. Other comments, questions, suggestions? Gerrima? Thank you, Gerrima. I think if there are... oh, Marco. Yes. Previous slide? Previous. Yes.

So the question is, do we have practices for pushing back on the teams that might be creating toil for us, and how do we get buy-in? Well, taking the UAA user and client creation as an example, we're both feeling that pain, and of course we're blockers for those teams. So some of the work we're starting to put in place around that is self-service related: building tooling so that, with minimal approval from an admin or from our team, an R&D team can request the client or the user they need and get the credentials back. We do a lot of tooling around entirely internal toil, and also tooling to help support our R&D org. And in terms of getting buy-in for that, I'd go back to that slide where I suggest that culture comes bottom-up: if we produce this stuff and everyone across the organization finds that they're unblocked, and there's less work for us, there's no convincing to do.

Right, so the question was how do we address our users' toil, and what kind of communication and collaboration do we have to help them address that toil as well? I'd say that what we experience as user toil is, you know, pages at two in the morning saying, hey, my app broke, what did you do? Yeah, that seemed to resonate. And I would say that in general, everyone is going to have that same feeling of, oh crap, something's not working, I need someone else to blame, or I need someone else who has a better answer for me.
And that doesn't really go away. So we can track our pages and figure out how to optimize all of this so that we don't get woken up in the middle of the night. But going back to Corey's point in Act Three, helping people pick up the learnings we've already worked through, recognizing toil and figuring out how to address it, is really the best way to help them solve those problems themselves. If they have a recurring problem that they keep needing our help with, maybe that's something they should have an SLI or an SLO tracking, and maybe that's something they should learn how to automate themselves. And what better way to do that than to say, hey, this is how we did it on our end; would you like to come work with us, pair with us, collaborate with us, and we can help figure it out?

What communication tools are we using? Slack, mostly. A lot of the collaboration, though, happens in person. We try to cross-team pair and collaborate with other teams, whether that's remotely through a Zoom call or by getting together in a workshop or something like that.

Yeah, I think it's probably pretty common for platform teams to get all of these ask tickets and ServiceNow tickets and so on. For the most part, we aren't servicing direct users, so we don't necessarily have to go through an ask-ticket system. A lot of the time they'll reach out to us in Slack and we'll say, oh, hey, yeah, we can do a client credential creation for you; do you have some time this afternoon? We'll just pair on it together. And typically that goes over really well, because they know exactly what scopes and everything they need, and we never have to close out a ticket or anything. But there is definitely a fallback where, if a team doesn't know how to reach the platform team, they'll submit an ask ticket and it'll make its way down to us; it just takes a bit longer.

Yeah, right. So, pushing toil back onto the other teams and how we handle that, because it does kind of smell a little bit, right? I think in general it's helpful to decide what your boundary is. If this is information or work that we shouldn't be the ones resolving, then yeah, there's healthy pushback in setting those boundaries. But I would encourage us not to approach it as just pushing back on someone; it's pushing back and helping them figure out how to solve it, that is, how to solve it themselves. Often, if someone has a question that's definitely answered in the FAQ or on the Wiki and there's no reason they should be asking us, it doesn't really help to just say, go read our Wiki, because then they're like, all right, fine, I won't come to you for help again. So what we've tried to do in the past is say, oh, hey, yeah, we've actually answered that question before; I don't know where it is, but maybe we can find it together. So here, why don't we pair for a bit? Oh look, here it is on the Wiki; when you Google the question you just asked, it shows up. And then hopefully you build that collaborative relationship where you're not shutting them down and pushing back completely, but you're still helping them avoid having to come and rely on you in the future. One more question? Yeah.
How many people are on our team right now? We aim for eight engineers, plus our product manager. At any given point in time, since we pair, as folks might know about Pivotal, there's one pair on the actual operations and deployment efforts; the other pairs are working on platform engineering, tooling, and SRE things. The real reason we think an operations team needs eight folks is alert fatigue: you're going to burn out with any fewer than six at the very minimum, and eight in particular. We need more people.

Going back to some of the earlier questions about what I really see as the collaboration opportunities, with our teams again being at the center: take the problem of, is it the platform failing or is it the application failing? The way we go about that is, as the team working on the platform, we want to establish how reliable the platform is. So we set up dashboards and alerting around, in our case, things like cf push and application availability. These, again, are things designed to represent end-user experience and expectation. And around that we can have a conversation, if we start to burn our error budget, about how we change our behavior. What's our internal agreement? If we have no error budget remaining, let's focus on reliability stories. If we've got a lot of error budget remaining, let's move fast and introduce risky or experimental things.

Having developed those practices, we can then, within Pivotal, go to our R&D teams and help them establish those same things, which gets at your question. We can also go out into the field and work with our support folks, who are helping customers run Cloud Foundries, and collaborate with them on running these same workshops we've used to come up with these SLIs and SLOs and dashboards. In turn, we approach that as coaching them to become facilitators for those same workshops, and then they can go out to other platform teams and do the same. This gets back to that how-do-you-scale thing, right? And in turn, those platform engineers or operators can go to the application development teams and do the same thing: coach them on how to establish SLIs and SLOs around their applications.

What we end up with is a platform dashboard that your application developers can look at, built around those SLIs and SLOs. You don't want them looking at the raw service metrics, because something like high CPU might not actually matter. Now you've got a dashboard around which you can have a conversation, they've got their own dashboard around their applications, and because you established these things through in-person collaboration, you've also improved your relationship. So now the application developer no longer thinks the operator is a jerk, and vice versa; we at least trust each other's good intentions, and we've got these tools to have the collaboration and communication around. So we can say, okay, it looks like the issue is in the application right now, let me help you figure out why you're leaking memory; or, it looks like it's really in the platform, let me go back to the people who write CF and say, improve this widget, or this component.
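As a toy illustration of that error-budget conversation, here is a small sketch that checks remaining budget against an availability SLO. The 99.95% target and the request counts are invented numbers, not PWS figures; in practice the counts would come from whatever measures your SLI.

```bash
#!/usr/bin/env bash
# Toy error-budget check: given an availability SLO and the failed/total request
# counts for the window, how much budget is left? All inputs are example values.

SLO_TARGET="99.95"       # percent of requests that must succeed over the window
TOTAL_REQUESTS=2000000   # e.g. pulled from your SLI measurement for the last 30 days
FAILED_REQUESTS=600

awk -v slo="$SLO_TARGET" -v total="$TOTAL_REQUESTS" -v failed="$FAILED_REQUESTS" 'BEGIN {
  allowed = total * (100 - slo) / 100          # failures the SLO permits this window
  remaining = (allowed - failed) / allowed * 100
  printf "Allowed failures: %.0f, actual: %d, budget remaining: %.1f%%\n", allowed, failed, remaining
  if (remaining <= 0)
    print "Budget exhausted: prioritize reliability stories over risky changes."
  else
    print "Budget remains: safe to spend some of it on experiments."
}'
```

With these example numbers the SLO permits 1,000 failures and 600 have been spent, leaving 40% of the budget, which is the kind of figure the internal agreement described above hinges on.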
Anything else? No? Okay, yeah. You want to talk about those? Sure.

Thank you, Marco. Our cryptocurrency protection runs on a periodic schedule, so we use Concourse CI to do that, and it's pretty much a bash script. We utilize the Pivotal Tracker APIs to create those stories and add the tasks from a list we've determined people can follow easily, so they don't miss an important step. It just runs on a schedule, and if it detects an abusive app, it creates a story for it and alerts us to the story in Slack; a rough sketch of that kind of job is included below. I think we're about out of time, so thank you very much, and feel free to ask us any questions after.
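As a closing aside, here is a minimal sketch of the kind of scheduled detection-and-notification job described above. The high-CPU heuristic, the Slack webhook variable, and the particular CF v3 API calls are illustrative assumptions; the team's actual detection logic isn't shown in the talk.

```bash
#!/usr/bin/env bash
# Minimal sketch of a scheduled detection-and-notification job of the kind described
# above (e.g. triggered from a Concourse timer). The heuristic and variable names are
# assumptions for illustration only.
set -euo pipefail

CPU_THRESHOLD="${CPU_THRESHOLD:-0.90}"          # flag processes pegged above ~90% CPU
SLACK_WEBHOOK_URL="${SLACK_WEBHOOK_URL:?set SLACK_WEBHOOK_URL}"

# Walk the apps the admin user can see (first page only, for brevity; assumes a prior
# `cf login` in an earlier task)
cf curl "/v3/apps?per_page=200" | jq -r '.resources[] | "\(.guid) \(.name)"' |
while read -r guid name; do
  # Highest CPU usage across the app's web process instances (0 if no stats available)
  cpu=$(cf curl "/v3/apps/$guid/processes/web/stats" \
        | jq '[.resources[]?.usage.cpu] | max // 0')

  if awk -v c="$cpu" -v t="$CPU_THRESHOLD" 'BEGIN { exit !(c > t) }'; then
    # Notify the team in Slack; a later iteration could suspend the org automatically
    curl -sf -X POST -H 'Content-Type: application/json' \
      -d "{\"text\": \"Possible crypto miner: app $name ($guid) at ${cpu} CPU\"}" \
      "$SLACK_WEBHOOK_URL" > /dev/null
  fi
done
```

In the flow described in the talk, a job like this would also call the story-creation step sketched earlier, so the notification and the runbook checklist arrive together.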