From around the globe, it's theCUBE with digital coverage of PagerDuty Summit 2020, brought to you by PagerDuty.

Welcome to this CUBE conversation. I'm Lisa Martin, today talking with Tammy Bryant, who's a CUBE alumna, the principal site reliability engineer at Gremlin and the co-founder and CTO of the Girl Geek Academy. Tammy, it's great to have you on the program again.

Hi, Lisa, thanks so much for having me again. It's great to be here.

So one of the things I saw in your background: 10-plus years of technical expertise in SRE and chaos engineering. And I thought, chaos engineering, I feel like I'm living in chaos right now. What is chaos engineering, and why do you break things on purpose?

Yep, so the idea of chaos engineering is that we're breaking systems, but in a thoughtful, controlled way to identify weaknesses in systems. So that's really what it's all about. The idea there is, when you're doing really complicated work with technical systems, so like for example, distributed systems, and say for example, you're working at a bank, it's tough to be able to pinpoint the exact failure mode that could cause a really large outage for your customers. And that's what chaos engineering is all about. You inject the failure proactively to identify the issues and then you fix them before they actually cause really big problems for customers. You do it during the middle of the day, when you're feeling great, instead of being paged in the middle of the night for an incident that's actually causing your customers pain and making you lose a lot of money. So that's what chaos engineering really is.

Are you seeing in the last six months, since the world is so different, an increase in customers now that, for example, brick and mortar is shut down and everything has to convert to digital if it wasn't already? Is there an increase in demand for chaos engineering services?

Definitely. So a lot of people are asking, what is chaos engineering? How can I use it?
Will it help me reduce my incidents? And definitely, because there are a lot of new services that have been rolled out recently, say for example, curbside pickup. That's a whole new thing that had to be created really recently to be able to handle a large amount of load. And you know, people show up, they wanna get their product really fast because they wanna be able to just get back home quickly. And that's something that we've been working on with our customers, to make sure that curbside pickup experience is really great. The other interesting thing that we've been working on because of the pandemic is making sure that banks are really reliable and that customers are able to get access to their money when they need it and able to see that information too. And you can imagine that matters when you're in lockdown and you only can leave your house for maybe an hour a day, you need to be able to quickly get access to your money to buy food. And we've seen some big incidents recently where that hasn't been the case.

Yeah. I can imagine, I mean, just thinking of what happened with everything six months ago and how demanding consumers were. We expect to get whatever we want, whether it's something I'm gonna buy on Amazon, something that we stream on Netflix or whatnot. We have this expectation that we can almost get it in real time. But there was a delay a few months ago, and there still is to some degree. Companies like Amazon and Netflix, I can imagine, really must have a big focus on chaos engineering to test these things regularly, and have now proved, I would imagine to some degree, that with chaos engineering they're built to withstand that.

Yes, exactly. So our founders at Gremlin came from Netflix and Amazon. Our CEO had worked at both where he'd done chaos engineering and that's actually why he decided to create Gremlin. It's the first company in the world to offer chaos engineering as a service.
And obviously when you're working somewhere like Netflix, the whole product, you have to be able to get access to that movie, that TV show right in that moment. And also customers expect to be able to see that on, for example, their PlayStation in their living room and it should work. And they're paying for a subscription. So to be able to keep them on that subscription, you need to offer a great service. Same thing with Amazon, Amazon.com, they've done a lot of chaos engineering work over many years now to be able to make sure that everything is available. And it's not just that the entire Amazon.com is up and running. It's also, for example, that when you go and look at a page, the recommendation service works too and they're able to show you, hey, here are some other things that you might like to buy at this time. And I know as a consumer, I love that because it helps me save time and effort and even money as well because it's giving you some good advice. So that's the type of stuff that we do.

Exactly. So when you're working with customers, I'd love to understand just a little bit from the conversational standpoint. Is chaos engineering now at kind of the C level, or is it still sort of within the engineering folks? Because looking at this as a make or break, knowing that for example, Netflix, there's Hulu, there's Disney Plus, there's Apple TV Plus, if we don't get something that we're looking for right away, there's Prime, we're going to go to another streaming service. So are you starting to see like an increase in demand from companies that know, we have competition right behind us, we've got to be able to set up the infrastructure and ensure that it is reliable now more than ever?

Yeah, exactly. That's really, really important. I'm seeing a lot of executives. I mean, I've seen that since the beginning really.
Since I first started working at Gremlin, I would often be invited by executives to come and give talks actually within their company to help their teams learn about chaos engineering. And I love doing that. That's really great. So I'd be invited by C levels or VPs from different departments. And I often get people adding me on LinkedIn from all over the world who are in leadership roles because really like, you know, they're responsible for making sure that their companies can hit those critical metrics and make sure that they're able to achieve their really, you know, demanding business goals. And then they're trying to help their teams be able to achieve that too. So I've actually been so pleased to see that as well. Like it is really cool to have an executive reach out and say, hey, I'm thinking of helping my team. I'd like to get them introduced to you. Can you come and just teach them about this topic? And I love being able to do that. It's really positive and it's a great way to improve.

It is, I think, nowadays, with reliability being more important than ever. You know, we talk to leaders from every industry. And there are certain things right now that are going to be shaping the winners and the losers of tomorrow. And it sounds to me like chaos engineering is one of those things that's going to be fundamental to any type of business, to not just survive these times, but to thrive going forward.

Yes, I definitely think so. I mean, obviously people can easily just go to a different URL and try and use a different service. And, you know, we're seeing now failure across so many different industries. We didn't see that before, but for example, you know, I'm sure you've seen in the news, and I've heard from friends and family, about schools now being completely online and then kids can't actually access their classes, their resources, what they need to learn every day. So that really just shows you how much it's impacting us as a society.
We really know that the internet is critical. It's amazing that we have the internet, like how, you know, lucky we are to have this, but it needs to work for us to actually be able to get value out of it. And that's what chaos engineering is all about. You know, we're able to make sure that everything is reliable, so it's up and running. And we do that by looking at things like redundancy. So we'll do failover work where we completely shut down an application or a service and make sure it gracefully fails over. We also do a lot of dependency failure work where you're actually looking to say, this is the critical path of this service. And a lot of people don't think about this, but the critical path really starts at sign-in. So you need to make sure that login and sign-in works really well. It's not just about the experience once you've signed in. That has to work well all the way through. So actually, if you have a good understanding of user experience, it helps you create a much better pathway and understand those critical pieces that the customer needs to be able to do to have a great experience. And I care a lot about that. Like, whenever I go and work somewhere, I always read customer tickets. I always try and understand what are the customer pain points. And I love listening to customers and then just solving their problems. The last thing I want them to do is be complaining or be really annoyed on Twitter because something just isn't working when they need it to be working. And it is really critical these days. The internet's a really serious part of our day-to-day life.

Oh, it's a lifeline. I mean, for some folks it's the only way that they're connecting with the outside world, through the internet. So when things aren't... I had a friend whose son, on his first day of college a couple of weeks ago, freshman year, first class, couldn't get into Zoom. And that's a stressful situation.
But I imagine too, though, and I know you're going to be speaking at the PagerDuty Summit, that more folks need to understand what this is. And I can tell that you have a real authentic passion for it. Talk to us though about what you're going to be talking about at the PagerDuty Summit.

Sure thing. I'm excited to be speaking at PagerDuty Summit very soon. My talk is called Building and Scaling SRE Teams. So Site Reliability Engineering Teams. And this is something that I've done previously. I built out the SRE teams at Dropbox for both databases as well as storage. So block storage. And then I also led the code workflows team. And that's for over 500 million users, people accessing their critical data that they store in Dropbox all the time. You know, folks use Dropbox in so many different ways. Maybe it's like really famous musicians who are trying to create an amazing new album. That happens. Or maybe it's a lawyer preparing for a court case and they need to be able to access their documents. So those are a lot of customer stories that would come up over time. And prior to that, I worked at the National Australia Bank as well, leading teams too. And obviously like people care about their money, if they can't access their money, if there are incorrect transactions, if there are missing transactions. You know, duplicate transactions, maybe people don't mind so much about; you get like a double deposit, but it's still not good from the bank's perspective. So there's all types of different chaos that can happen. And I found it to be really interesting to be able to dive into that and make sure that you can make improvements. And I love that it makes customers happier. And also it helps you improve your company as a whole. So it's a really good thing to be able to do.
And with my talk, I'm gonna talk to folks about, you know, not only why it's important to build out a reliability practice at your organization. You know, back in the day, people used to go, why would you need a security team? You know, why would we need that? Now, everybody has a security team. Everyone has a chief security officer as well. But why don't we focus on reliability? Like we know that we see incidents out in the news all the time. But for some reason we don't have the chief reliability officer. I think that's definitely going to be something that will appear in the future just like the chief security officer role appeared. But that's what I'm gonna talk about there, how you can find site reliability engineers. I'll share a few of my secrets. I won't give any spoilers out, but there's actually quite a few places that you can find amazing people. There's even a school that you can hire them from, which I've done in the past. And then I'll talk to you about how you can interview them to make sure that you get the best people on your team. There are a number of things that I think are very important to interview for. And then once you've got those folks on your team, I'll talk to you about how you can make sure that they're successful, how to set them up for success and make sure that they're aligned to not only your business goals, but also your core values as a company, which is really important too.

Yeah. That's fantastic, it's very well-rounded. I'm curious, what are some of the characteristics that you think are really critical for someone to become a successful SRE?

Yeah, so there's a few key things that I look for.
One thing is somebody who is really good at troubleshooting, so they need to be able to be comfortable with complexity, ambiguity and open-ended challenges and problems and also thrive in those types of environments, because often you're seeing something that you've never seen happen before and also you're working with really complicated systems. So you just need to be able to feel good in that moment and you can test for that during an interview question on troubleshooting and debugging. So that's something that I'll go into in more detail, but that's definitely the first characteristic.

The other thing of course is you want to have someone who is good at being able to build solutions. So they can code, they understand automation, they can figure out how can I take this pain point, this problem and how can I automate it and then scale this out and make it available for everyone across my organization. So somebody who has that mindset of building tools for others and often they are internal tools because maybe you're building a tool that helps everybody know who is on call for every single critical service at the company and also non-critical service and they can identify that in a minute or less, like maybe even just in a few seconds and then they can quickly get that person involved if they need to escalate to them. For example, a tool like PagerDuty, that's really what you want. You want them to be able to think how can I just make this efficient? How can I make sure that we can get really great results?

And yeah, I think they also just need to be really personable too and work well in a really complicated organizational structure because usually they have to work with the engineering team, the finance team to understand the revenue impact.
They need to be able to work with the PR team and the social media team if there are incidents and then they need to provide information about when this incident is going to be resolved and how they can update VIP customers. They need to talk to the sales team because what happens if you're giving a demonstration and then somehow there's an issue, a failure that happens, an incident and then in the middle of your very important sales demo you're not able to actually deliver it. That can happen a lot too. So there are a lot of very important key skills.

Sounds like it's a really cross-functional role, pivotal to an organization that needs to understand how these different functions not only operate but also operate together. Is that somebody that you think has certain types of previous work experience? Is this something that you talk to the Girl Geek Academy girls about, how they get into it? I'm curious like what the career path is.

Yeah, it's interesting. Like I find a lot of SREs often come from a few different backgrounds. One is they came through the world of Linux and understanding systems and just being really interested in that, like deep diving into the kernel, understanding how to improve performance of systems. The other side is maybe they came from a coding background where they were actually building applications and features. I started off actually on that side but I also had a passion for Linux and then I sort of spread over into the other side and was able to learn both. And then often someone who's comfortable with being on call and handling incidents, but it is a lot of skills. Like that's actually something that I often talk to folks about and they ask me, how can I become a great SRE? There's so many things I need to learn. And I just say, take it slow, try and gradually increase your number of skills.
People often say that there's like this curve for SREs where you have the operations side on one side and then the coding side on the other and often like the best person sits right in the middle where they have both ops and engineering skills but it's really hard to find those people. It's okay if you have someone that's like really deep, has amazing knowledge of Linux and scaling systems and incident management and then you can pair them up with a really amazing programmer who's great at software engineering and software architecture. That's okay too.

We've been hearing for a long time about the sort of negative unemployment with respect to cybersecurity professionals. Are you guys falling into that same category as well with SRE, or is it somehow different, or you just know this is exactly what we're looking for? We want to go out there and even in the Girl Geek Academy maybe help girls learn how to be able to find what I imagine are a lot of opportunities.

Yeah, there are so many opportunities for this. So it's definitely an opportunity because what I see is there's not enough SREs. So tons of companies all over the world will actually ping me and say, hey Tammy, how do I hire SREs? That's why I decided to give this talk because I wanted to package that up and just share that information as to how you can do it. And also maybe you can't find the SREs because they don't exist but you can help retrain your team. So you can have an engineer learn the skills that are required to be an SRE. That's totally possible too. Maybe move them over to become an SRE. With Girl Geek Academy one of the things that I've done is run hackathons and workshops and just online training sessions to help girls learn these new skills. So that's exactly what our mission is: to teach one million girls technical skills by 2025. And I love to do mentoring at scale which is why it's been really cool to be able to do it online and through these workshops and remote hackathons.
And I definitely love to do something where I'll, say, work with some of our customers actually and run an event. I did one a while back, it was really cool. We were able to have all of the girls come in and be at the customer's office and actually learn skills with the customer which was really fun and it helps them actually think, hey, I could work here one day. That would be really amazing. And I'm gonna do that again in November. And it's kind of fun too. We can do things like have like, dad and mom and then daughter day where you actually bring your daughter to work and help her learn technical skills. And it's really fun because they get to see what you do and they understand it more and see how cool chaos engineering really is. Then they think, oh wow, you're so awesome. It's great. I love it.

That's fantastic. Well, it sounds like, like I said before, your passion for it is really there. And what I think is really interesting is how you're talking about chaos engineering, and just the word in and of itself, chaos, but you paint it in such a positive light: critical, business critical, but also all the opportunities there that businesses have to learn and fine-tune. So such an interesting conversation. Tammy, I'm sorry. Yeah, Tammy, we'll have to have you back on the program but I thank you so much for joining me today. And for those folks lucky enough that are attending the PagerDuty Summit, they're gonna get to learn a lot from you.

Thank you. Thanks so much for having me, Lisa.

For Tammy Bryant, I'm Lisa Martin. You're watching this CUBE conversation.