From theCUBE Studios in Palo Alto and Boston, connecting with thought leaders all around the world, this is a CUBE Conversation.

Hello everyone, welcome to this CUBE Conversation here in Palo Alto at theCUBE studios. I'm John Furrier, your host. We're here during the COVID-19 crisis, doing remote interviews. I come into the studio, we've got a quarantine crew here, getting the interviews, getting the stories out there. And of course, the story we continue to talk about is the impact of COVID-19 and how we're all getting back to work, either working at home or working remotely and virtually. But as things start to change, we're going to start to see events, mostly digital events. And we're here to talk about an event that's coming up called the Failover Conference from Gremlin, which has now gone digital. It's April 21st. But what's important about this conversation is not only to talk about the event that's coming up, but to talk about the scale problems that are being highlighted by this change in work environment, working at home. We've been talking about the at-scale problems that we're seeing, whether it's a surge of traffic or the chaos that's ensuing across the world with this pandemic. So I'm excited to have two great guests, Alberto Farronato, Senior Vice President of Marketing at Gremlin, and Tammy Butow, Principal Site Reliability Engineer (SRE). Guys, thanks for coming on, appreciate it.

Thank you.

Thanks for having us.

Alberto, I want to get to you first. We've known each other before; you've been in this industry. We've all been talking about cloud native and cloud scale for some time. It's kind of inside the ropes, inside baseball. Tammy, you're a site reliability engineer. Everyone knows Google and knows how the cloud works. This is large-scale stuff.
Now with COVID-19, we're starting to see the average person, my brother, my sister, our family members and people around the world go, oh my God, this is really high impact. This change of behavior, this surge, whether it's traffic on the internet or work-at-home tools that are inadequate, you start to see these statistical things that were planned for not working well. And this actually maps to things that we've been talking about in our industry. Alberto, you've been on this. How are you guys doing? And what's your take on this situation we're in right now?

Yeah, we're doing pretty well as a company. We were born as a distributed organization to begin with. So for us, working in a distributed environment from all over the world is common practice day to day. Personally, I'm originally from Italy. My parents and my family are in Milan and Bergamo, of all places, so I have to follow the news with extra care. And it becomes so much clearer nowadays that technology is not just a powerful tool to enable our businesses, but is also critical for our day-to-day life. Thanks to video calls, I can easily talk to my family back there every day. So that's really important. So yes, we've been talking for a long time, as you mentioned, about complex systems at scale and reliability, often in the context of mission-critical applications. But more and more, these systems need to be reliable also when it comes to back-office systems that enable people to continue to work on a daily basis.

Yeah, our hearts go out to your family and your friends in Italy, and I hope everyone stays safe there. I know that was a tough situation and continues to be a challenge. Tammy, I want to get your thoughts. How's life going for you? You're a site reliability engineer. What you deal with on the tech side is now happening in the real world. It's almost mind-blowing to me that we're seeing these things happen. It's a paradigm that needs attention.
And how do you look at it as an SRE, dealing with this from the tech side and now seeing it play out in real life?

Yeah, it's been such an interesting situation, obviously really terrible for everybody to have to go through and deal with. So one of the things that I specialize in as a site reliability engineer is incident management. For example, I previously worked at Dropbox, where I was the incident manager on call for 500 million customers. It's 24/7 shifts. With these large-scale incidents, you really need to be able to act fast. There are two very important metrics that we track and care about as site reliability engineers. The first one is mean time to detection: how fast can you detect that something is happening? Obviously, if you detect an issue faster, then you've got a better chance of making the impact lower, so you can contain the blast radius. I like to explain it to people like this: if you have a fire in your saucepan in your kitchen and you put it out, that's way better than waiting until your entire house is on fire. The other metric is mean time to resolution: how long does it take you to recover from the situation? So yeah, this is a large-scale global incident that we're in right now.

Yeah, I know you guys do a lot. You talk about chaos theory, and that applies. There's a lot of math involved. We all know that. But I think when you look at the real world, this is now going to be table stakes. There's now a line in the sand here: pre-pandemic, post-pandemic. And I think you guys have an interesting company, Gremlin, in the sense that this is a complex system. Think about the world we're going to be living in, whether it's digital events, and you guys have one coming up, or how work at home and the tools that humans are going to be using are going to work with systems, right? So you have this new paradigm that's going to be upon us pretty quickly. And it's not just buying software mechanisms or software. It's a complex system.
It's distributed computing. It's an operating system. I mean, this is kind of the world. Can you guys talk about the Gremlin situation, how you guys are attacking these new problems and these new opportunities that are emerging?

Sure, I can talk about that. One of the things that I've always specialized in over the last 10 years is chaos engineering. The idea of chaos engineering is that you're injecting failure on purpose to uncover weaknesses. That's really important in distributed systems, with distributed cloud computing and all these different services that you're putting together. The idea is that if you inject a small failure, you can figure out what happens, and then you can go ahead and fix it. One of the things I like to say to people is, focus on what your top five critical systems are. Let's fix those first. Don't go for low-hanging fruit. Fix the biggest problems first. Get rid of the biggest amount of pain that you have as a company. If you think about the Pareto principle, the 80-20 rule, if you fix 20% of your biggest problems, you'll actually solve 80% of your issues. That always works. It's something that I've done while working at the National Australia Bank doing chaos engineering, also at Gremlin and at Dropbox, and I help a lot of our customers do that too.

Alberto, talk about the mindset involved. It's almost counterintuitive. Whoa, whoa, risk, the biggest systems? I don't want to touch those. They're working fine right now. And then these problems just gestate. They kind of hang around, like the kitchen fire. That's okay, I don't want to touch it. The house is still working. So this is kind of a new mindset. Could you give your take on that? Is the industry there? It was kind of a corner case. You had Netflix. You had Chaos Monkey in those days. And now it's a DevOps practice for a lot of folks.
You guys are involved in that. What's the appetite, and what's the progress of chaos engineering in the mainstream?

Yeah, it's interesting that you mentioned DevOps. Recently Gartner came up with a new, revisited DevOps framework that has chaos engineering in the middle of the lifecycle management of your application. The reality is that systems have become so complex. Infrastructure has so many layers of abstraction. You have hundreds of services if you're doing microservices. But even if you're not doing microservices, you have so many applications connected to each other to build really complex workflows and automation flows. It's impossible for traditional QA to really understand where the vulnerabilities are in terms of resiliency, in terms of quality. Too often the production environment is also too different from the staging environment. So you need a fundamentally different approach to go and find where your weaknesses are, and find them before failures happen, before you end up in a situation like the one we're in today and you're not prepared. So much of what we talk about is giving people a tool and a methodology to go and find these vulnerabilities. It's not so much about creating chaos; it's about managing the chaos that is built into our current systems and exposing those vulnerabilities before they create problems. That's a very scientific methodology and tooling that we bring to market and help customers with.

Tammy, I want to get your thoughts on some of this. We used to riff a lot in our 10 years of theCUBE. We've had a lot of conversations, we've riffed over the years, but when the surge of Amazon Web Services came out, it was pretty obvious: the cloud's amazing. And look at the startups that were born. You mentioned Dropbox; you worked there. These companies, all born in the cloud, these hyperscale companies built from scratch. It was a great way to scale up.
And we used to joke about Google. People would say, I would like a cloud like Google, but no one has Google's use cases. Google really pioneered the SRE concept, and you've got to give them a lot of props for that. But now we're getting to a world where it's becoming Google-like. There's more scale now than ever before. It's not a corner case; it's becoming more popular and more of a preferred architecture, this large scale. What's your assessment of mainstream enterprises? How far along are they in your mind? Are they there with chaos engineering? Are they close? Are they doing it? How does someone develop an SRE practice to get Google-like scale? Because Google has an amazing network, they've got large-scale cloud, they have SREs, they've been doing it for years. How does a company that's transforming its IT get SREs?

It's a great question. I get asked this a lot as well. One of our goals at Gremlin is to help make the internet more reliable for everybody: everyone using the internet, and all of the engineers who are trying to build reliable services. So I'm often asked by companies all over the world, how do we create an SRE practice and how do we practice chaos engineering? You can actually get started rolling out your SRE program, and that's based on my experience of having done it. When I worked at Dropbox, I worked with a lot of people who had been at Google and at YouTube. They were there when SRE was rolled out across those companies, and then they brought those learnings to Dropbox, and I learned from them. But the interesting thing is, if you look at enterprise companies, large banks, say: I worked at the National Australia Bank for six years, and we actually did a lot of work that I would consider chaos engineering and SRE practices. For example, we would do large-scale disaster recovery. That's where you fail over an entire data center to a secret data center in an undisclosed location.
And the reason is that you're checking to make sure that everything operates okay if there's a nuclear blast. That's actually what you have to do, and you have to do that practice every quarter. But if you think about it, it's not very good to only do it once a quarter. You really want to be practicing chaos engineering and injecting failure on purpose. Actually, I prefer to do it three times a week. So I do it a lot. But I'm also someone who likes to work out a lot and be fit all the time. I know that if you do something regularly, you get great results. So that's what I always tell everyone.

Yeah, get the reps in, as we say. Get stronger, get the muscle memory. Guys, talk about the event that's coming up. You had an event that was scheduled as a physical event, you were right in planning mode, and then the crisis hit. You're going digital, going virtual. It's really digital; it's on the internet. So how are you guys thinking about this? I know it's out there; it's April 21st. Can you share some specifics around the event? Who should be attending, and how do they get involved online?

Yeah, the event really came together about a month ago, when we started to see all the cancellations happening across the industry because of COVID-19. We are extremely engaged in the community and we have a lot of talks, and we were seeing a lot of conferences just dropping, and so speakers losing their opportunity to share their knowledge about how you do reliability and the topics that we focus on. So we quickly pivoted as a company and created a new online event to give everyone in the community the opportunity to just fail over to a new event, as the conference name says, and to give those speakers who had lost their speaking slots a new opportunity to go share their knowledge. That came together really quickly.
We shared the idea with a dozen of our partners and everyone liked it, and all of a sudden this thing took off like crazy. In just a month we are approaching 4,000 registrations. We have over 30 partners signed up and supporting the initiative, and a lot of press partners covering the event as well. So it was impressive to see the amount of interest that we were able to generate in such a short amount of time. Really, this is a conference for anybody who is interested in resiliency. If you want to learn from the best how to build business continuity for systems, people and processes, this is a great opportunity at no cost. It's a free conference.

And the target audience you want to have attend: is it SREs, or folks doing architectural work? Who's the target person to attend?

Architects, SREs, developers, and business leaders who care about the quality and reliability of their applications, who need to help create a framework and a mindset for their organization that speaks to what Tammy was saying a minute ago: having that constant practice on a daily basis of going and finding how to improve things.

You know, Tammy, we've been going to physical events with theCUBE, extracting the signal from the noise and distributing it digitally, for 10 years. And I've got to ask you, because now those events have gone away. You talk about chaos and injecting failure. Doing these digital events is not as easy as just live streaming. It's hard to take the value of a physical event, with years of experience and standards, roles and responsibilities, to digital. It's a different consumption environment. You're trying to create a synchronous environment. It's its own complex system. So I think a lot of people are experimenting and learning from these events, because it's pretty chaotic. I'd love to get your thoughts on how you look at these digital events as a chaos engineer. How should people be looking at these events?
How are you guys looking at it? I mean, obviously you want to get the program going, get people out there, get the content. But to iterate on this, how do you view this?

It is really different. I actually like to compare it to fire drills in SRE. Often what you do there is create a fake incident or a fake issue. You're just saying, let's have a fire drill. It's similar to when you're in a building and a fire drill goes off, and you have wardens and everything, and you all have to go outside. We can do that in this new world that we're all in all of a sudden. A lot of people have never run an online event, and now all of a sudden they have to. So what I would say is, do a fire drill. Run a fake one before you do the actual one to make sure that everything works okay. My other tip is to make sure that you have backup plans: backup plans on backup plans on backup plans. As in SRE, I always have at least three to five backup plans. I'm not just talking about plan A and plan B; there's also a C, D and E. I think that's very important. And even when you're considering technology, one of the things we say with chaos engineering is, if you're using one service, inject failure and make sure that you can fail over to an alternative service in case something goes wrong.

Yeah, hence the Failover Conference, which is the name of the conference.

Exactly.

Well, we certainly are going to be sending a digital reporter there, virtually. If you need any backup plans, obviously we have the remote interviews here. If you need any help, let us know. Really appreciate it. Great to see you guys, and thanks for sharing. Any final thoughts on the conference? What happens when we get through to the other side of this? I'll give you guys the final word. We'll start with you first, Alberto.

Yeah, I think when we are on the other side of this, we'll understand even more the importance of effective resilience architecting and testing.
I think as a provider of tools and methodologies for that, we'll be able to help customers take a significant leap forward on that side. And the conference is just super exciting. I think it's going to be a great event. I encourage everyone to participate. We have a tremendous lineup of speakers who have incredible reputations in their fields. So I'm really happy and excited about the work that the team has been able to do with our partners to put together this type of event.

Okay, Tammy?

Yeah, for me, I'm actually going to be doing the opening keynote for the conference. The topic that I'm speaking about is that reliability matters more now than ever. I'll be sharing some bizarre, weird incidents that I've worked on and experienced myself, really critical, strange issues that have come up. I'm really looking forward to sharing that with everybody. So please come along, it's free. You can join from your own home, and we can all be there together to support each other.

You've got great community support. I know there are a lot of partners, press, media, ecosystem and customers. So congratulations, Gremlin, on having a conference on April 21st called the Failover Conference. theCUBE and SiliconANGLE will have a digital reporter there covering the news. Thanks for coming on and sharing; I appreciate the time. I'm John Furrier in the Palo Alto studios with a remote interview with Gremlin about their Failover Conference, April 21st. It's really demonstrating, in my opinion, the at-scale problems that we've been working on in the industry, now more applicable than ever before as we get post-pandemic with COVID-19. Thanks for watching. We'll be back.