Thank you so much, everyone, for being here. I am Garima, and we'll be playing a real-life Dungeons & Dragons today. Fun fact: I've never played the original game, so good luck. Let's start with something about me. These are ways to reach out to me, because there was literally nothing else that I wanted to say about myself, except the fact that I work for this company called Pivotal. It's great. I love working over there. I work in the R&D section, which builds the Cloud Foundry-ish things. Currently, I'm on a team which builds BOSH Windows stemcells, if you know about it.
So as you all know, this is going to be a highly interactive session. I'm going to ask you a lot of questions, and in turn, I promise you will get a chance to ask me a lot of questions as well. So for now, let's start with how many of you go on call? Raise your hands. DevOpsDays, what should I expect, right? How many of you think that you have enough team members to go on call? Yeah, I see the number of hands going down. How many of you think going on call is stressful? This talk is for you. Onboarding newbies to go on call is very stressful. How many of you feel that? Kind of. Awesome. And last question for this slide. How many of you know the original game? Yeah, everybody except me.
OK, so I will be referencing a lot of things from this book called Site Reliability Engineering. It is also called SRE. You know it now, right? SRE is now a position in companies. So we all know it. I'm going to call it SRE. OK, so let's talk about why and in what scenarios going on call can be stressful. As per the SRE book, for one person, it is healthy to go on call one week in a month. If you are going on call more than that, and notice the face is now sad, it is definitely going to be stressful. So let's do the maths. One primary person on call, one secondary, times four weeks in a month.
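As a rough sketch of that arithmetic (the function and parameter names are illustrative and do not come from the SRE book), the minimum roster size follows from the number of on-call roles per shift and the number of weeks in the rotation:

```python
def min_roster_size(roles_per_shift: int, weeks_in_rotation: int) -> int:
    """Engineers needed so that each person is on call at most one week per rotation."""
    return roles_per_shift * weeks_in_rotation


# One primary and one secondary, each on call one week out of four.
print(min_roster_size(roles_per_shift=2, weeks_in_rotation=4))  # 8
```

Changing either input shows how quickly the required headcount grows, for example when a third escalation role is added.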
So all those who said that you have enough team members, do you have eight highly qualified, fully onboarded engineers to go on call? Not really. Awesome. OK, so you need eight people. If you count those numbers, it is eight. OK, next is incidents per day. Even if you have eight engineers, if you have 500 incidents happening per day, it's not a healthy team, and engineers will burn out. A healthy count when on call is a maximum of two incidents per day.
Then how do we onboard people? We onboard people by drawing some boxes and lines about how the infrastructure is, which is great. But for the person who is doing it, it can become really stressful. So if you did not notice, I'll do it again: the presenter in this case was smiling and now is not smiling, because it can get overwhelming to do it every now and then. So what's the next way to onboard people? This is my favorite. I like fiddles. Fiddles are great. Once you're done with the fiddles, the happiness that comes from that is so good, but it onboards only the one person on the team who is doing it; not everybody is doing the fiddles. So that leaves us with not many ways to onboard new people on your project, on your product.
That's why we have D&D. So what is D&D? You know the original one, but in this case, there is a dungeon master, who is a person who was on call when some incident happened and who went through all the steps to fix that incident. Once the incident is done, this person gets to play the role of dungeon master and replays the whole incident in front of the rest of the team. And now, instead of that person solving the problem, the rest of the team solves the problem. We will see a demo of this, but before that, I'm just going to put up a snippet from the SRE book to increase the number of slides in my presentation.
So the SRE book talks about it in these four or five lines where it says, blah, blah, blah, a selection mechanism for picking a disaster, followed by role playing in which one person plays the part of the dungeon master. In this case, the system. And the other person plays the part of the on-call engineer: what the on-call engineer said to do, and we compare this against what they actually should have done. These are the five lines in there. Good luck trying to understand it. But at Pivotal, we came up with a way to do this role-playing exercise, calling it Dungeons and Dragons.
So let's back this up with science. When you're in a stressful situation, there's this hormone called cortisol which gets released in your system. It is not a very great hormone, but it is useful for building your memory. And that's what we are going to do today. I'm going to put you all in a very stressful situation. Good luck. And this will build your memory. All right, what would you gain from this talk? You would learn that D&D is a non-destructive, fun, and easy game to play, that stress creates memory, and that it builds your confidence and your knowledge of your system.
So how are we going to do this? We are going to do this on my infrastructure setup. How many of you already know about it? No one. Like, literally, if you knew it, I would be scared. But by the end of this talk, you would know more about my system. And if in a 45-minute talk you get to know more about my infrastructure setup, imagine how much you would learn about your own infrastructure setup when you play this game in your team.
So how to play? There's a dungeon master. Today, I am proudly presenting as your dungeon master. And then there is the rest of the team. Let's do a quick demo. The dungeon master says: we have a problem. We have 114 incidents raised by our customers saying that they cannot access the first page of our website since this morning. Oh, I see. Did we deploy something new in the system?
Yes, the dev team did a deployment overnight for their customer X feature. Are we receiving any logs for these requests? No, there are no logs in our load balancer. In fact, the number of requests is lower than yesterday. Interesting. What is the URL that they are trying? It's dungeonsareal.com. What is the IP that we get when we do dig dungeonsareal.com? We get 10.2.33.24. This looks like an internal IP. What should be the IP of our load balancer? It should be 45.22.12.12.23. Let's change the domain setting to first point to this IP. Then we can investigate what happened.
So this is just a quick demo. I'm sure you could understand everything from my voice modulation. We are gonna do a similar stress situation and we are gonna solve something together, but there are some assumptions. This is not a production environment, for two reasons. First, I will not be presenting my production environment in front of such amazing DevOps engineers, because it's risky. My company should not even allow it. And second, currently in my team, I don't have a production environment. So these are the assumptions. When you see one request in the setup, consider it as 10,000 requests, because we want to treat it as a production environment. And it is a startup kind of a setup, so everything is fragile; things are just built on top of everything. So yeah, those are all the assumptions for today. I am done asking you questions, and now it's your turn to start asking questions. Are we all ready? Okay.
Welcome to World of Dragons. So we are a startup. While doing some digging somewhere on Earth, we can't tell you the actual location, we found a lot of dragons, like literally real dragons, not joking. And so now we're a startup who sells dragons.
We have this amazing API for our vendors, like Amazon and other Best Buy kind of vendors, who can just pull our API to get dragons and we will ship them the dragons. Everything was so good, our life was great, we were making a lot of money, until we decided to do a field trip for our software engineers to the dragon place. It's not good, it's not good. I'll tell you, because we are a startup, right? We had one DevOps engineer, I should say one SRE, and that person got eaten up by a dragon. Life is not that good after that. So we have a lot of problems and we don't know the solutions to them. So we hired all of you to help us fix these problems. Let's start to work.
Let's start with our first problem, which is the mystery of certificates. We are getting a not-secure connection on our website, and all our clients who use our APIs are complaining about the same. How would you fix this issue? Let me show you the issue quickly. Yeah, so this is our great website, HTML, CSS and everything included, as you can see. Things look normal to me; there's just this red color over here, which I know what it is, but for the sake of the dungeon master, I don't know what it is. Can someone, or I'm sure many of you, help me fix this problem? Can you, yeah?
If you explicitly type in https:// before the domain, does it work? That's a good question. Oh, it was. It looks like it was there already. Okay. That answered my first question. We have a question over here. If you view the certificate details, is it still valid, or can you look at the validity date? Yeah, what detail, by the way? Can you explain what the certificate detail is? Hit F12. Hit F12, all right. You click view certificate. Sorry? Yeah, there it is, awesome. Please expand the details. So it says expired. That is interesting, but that still did not fix my issue. All right, we know the issue now, so that's great, that's great.
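The same validity check the audience did in the browser can be sketched in code. This is a minimal illustration, assuming you already have the certificate's `notAfter` text (the format Python's `ssl.SSLSocket.getpeercert()` reports); the function name and the dates are hypothetical and are not taken from the talk's setup:

```python
import ssl
from datetime import datetime, timezone


def cert_is_expired(not_after: str, now: datetime) -> bool:
    """Return True if `now` is later than the certificate's notAfter time.

    `not_after` uses the text format of the "notAfter" field returned by
    ssl.SSLSocket.getpeercert(), for example "Jan  5 09:34:43 2018 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch
    return now.timestamp() > expires_at


# Hypothetical dates, for illustration only.
checked_at = datetime(2019, 1, 1, tzinfo=timezone.utc)
print(cert_is_expired("Jan  5 09:34:43 2018 GMT", checked_at))  # True
```

Running a check like this on a schedule, rather than waiting for the red warning in the browser, is one way to catch an expiring certificate before clients do.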
Help us. Like, I understand that this is a certificate, so I need to go and change certificates somewhere, but can someone help me with where to go to make these changes? Go to the machine where the web server is running? Good question, very good question. We are a startup. We have 5,000 machines. Which one do you want me to go in? Why do you have 5,000 machines? We are a startup. You name it, we have everything. Big data, we are on Azure, we are on Kubernetes, everything, we got everything covered. We are hiring as well. Okay, are you in the cloud or on-prem? On both. We got everything; as I said, we are a startup.
Okay, then if it's microservices, then go to the load balancer, the machine where the load balancer is running. Yeah, which load balancer? Load balancer, the service. The service, yes. Yeah, so you would have multiple instances of this service running behind the load balancer, right? Help me get there, our DevOps engineer is dead. Help me get to that. I understand, so you go to the machine that's running the load balancer now. How do you get there? If you don't know your own infrastructure, that's a problem. Well, we hired you. Wow. All right, we had. It was under construction. We have a map, yeah. So the question was, do we have a map of the infrastructure? It was under construction when our DevOps engineer was eaten.
Use dig or nslookup? Use dig or nslookup, okay. Let's do that. Dig what? Dragon's world dot. Okay, so we get something here. Do you want me to remove these things? We got something here. 35, 241, 243. Awesome. What do we do with that? Go to the console and find that IP address. This startup is not Pivotal. So you want me to find this IP address? Like, what should I do with this? Is the machine Linux or Windows? Either you SSH or RDP into it. In front of this? Okay. Before I answer your question, someone else, I'm sorry, I don't know your name yet, asked me to put HTTPS in front of it. Okay, what should I do next? Hit advanced, okay.
And I'm saying proceed. Okay, this is what we get. Okay, so it doesn't have a web interface, but I'm not sure. Can you traceroute the IP to figure out which of your providers the load balancer is hosted on? Awesome, that's a good question. Traceroute. How do we read this? We know we're at UIC, so we're starting at one and going out to the internet. It ends at googleusercontent.com, so we assume it's on GCP or Google App Engine? Oh yes, that rings a lot of bells. We have something on Google Cloud. You name it, we have it. And this is what the platform looks like. What to do here? Come on, this is all your thing. Tell me what to do next. We have AWS. We hired you.
Okay, so we know that there is something hosted with some IP on GCP. What should we do next? GCP, so first figure out what service manages certificates for your infrastructure, for your VPC. Okay, so let's Google. What is that certificate management service called in GCP? Google "Google Cloud". Yeah, I mean, again, I don't know. Maybe somebody here can tell better about GCP. Okay, so find. Where's Harvey? Harvey? Harvey? Okay. Yeah, so once you find that certificate manager; most clouds have a centralized certificate management entity for whatever you're deploying. So once we find what that service is, from that portal we can get to it and then see what's going on there.
So what he suggested is, go to the certificate provider interface on Google Cloud. We don't have that, but we have something with some word that you used, so I'm gonna navigate over there. We do have a load balancer here. Yes, that's the certificate. And I look at it. It has expired. And thank you for getting us this far, because obviously I did not have any idea that we are hosting our services on Google Cloud using this load balancer, so this is great. Now I know where to go and change the certificate. Thank you for solving that problem. But this is literally the beginning of the problem.
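The inference the room made from the last traceroute hop (a hostname ending in googleusercontent.com suggests Google Cloud) can be written down as a tiny lookup. This is only a sketch: the suffix table is an illustrative assumption, far from complete, and the example hostname is made up.

```python
from typing import Optional

# Illustrative suffix table; extend it for the providers you actually use.
PROVIDER_SUFFIXES = {
    "googleusercontent.com": "Google Cloud",
    "amazonaws.com": "AWS",
    "cloudapp.azure.com": "Azure",
}


def guess_provider(hostname: str) -> Optional[str]:
    """Guess the hosting provider from a reverse-DNS or traceroute hostname."""
    host = hostname.rstrip(".").lower()
    for suffix, provider in PROVIDER_SUFFIXES.items():
        if host == suffix or host.endswith("." + suffix):
            return provider
    return None  # unknown: fall back to asking the team or checking billing


# Made-up hostname in the style of a final traceroute hop.
print(guess_provider("example-host.bc.googleusercontent.com"))  # Google Cloud
```

A match is a hint about where to look next, not proof of where the service runs.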
Now let's get on to some real problems, right? This was just a test of whether you all got D&D or not. Now let's actually play D&D. So, next problem. Request Sarin Dungeon. After a late night deployment of our new app by our dev team, because you know we are all agile, we don't deploy, the dev teams deploy, we have started getting errors where all our clients are randomly receiving HTTP code 429. This issue, as usual, did not happen in our staging environment. Help us fix this issue.
Well, again, the first question I would have for the dead guy is, is QA representative of the production environment or not? You said you tried this in staging and nothing went wrong. My question is, is staging representative of production? For the dead guy? Yes, it's all representative of production. We are all agile. Okay. Well, the DevOps person is dead, right? So in this case I would go to the version control management system and see what that deployment was. How much of a change was it? Is it a big impactful change? Is it a minute change? What is it, right? And at this point I would ask: there should be a rollback strategy, or you spin up another staging environment or another prod environment that's representative and deploy the application there so you can test within it. But first you roll this back so you can get service up.
What's the code 429? Okay, Google this. Too many requests. Too many requests, okay. So there are too many requests. Maybe there's a new feature which is highly in demand. We had some hands over there. Yeah, Matt, get on it. How can we reproduce it? Is it random, or is it some pattern we can figure out? Good question. Let's see it being reproduced. There it is. So we can reproduce it. Try scaling up. Literally, the DevOps engineer is dead. If you're talking about scaling up, walk us through it. I mean, again, there's this concept in cloud called auto scaling group, where it would scale automatically.
You imagine that we must be using auto scaling group, right? You said you're all agile and that would imply good practices. Wait, whoa, whoa, whoa. Whoa. Whoa. Okay. Who else? Let's try this. Let's try some other folks, some other suggestions. Have you rolled back the change to see if this fixed the issue? Great question. Because that's the first thing they force anyone who does changes immediately if we get this. They force us to roll back and if it fixes it, it's our fault. Makes sense. Makes sense. We can totally do that. Our dev team is sleeping right now. So we will pass on to that for now, but we will totally do this. Was there an update to the web server which changed their max request count? Very interesting. How should we find that out? Well, see, there are VMs. There are ways to get to those VMs and find the IPs and SSH onto those VMs. You all need to just tell me what to do. To services and try to see if that helps. Great question. How do we get to those VMs for that? Do you have any ideas how to get there? It depends what you're hosted on and if you have those guys available. But like for example, if it's a web server, just restart HTTP or something, no? Yes. So we got to a load balancer in previous issue. Do you wanna add to that? So he said, do we need to figure out all the app servers which are behind the load balancer? Yes, yes. Amazing. Let's get there. Sorry? Good idea. Okay. So I see some instance groups word here. So I'm just gonna click on that. And these look like VMs to me because they have external IP, but happy to be challenged over there. Does this look good? Should I get on to one of these? I would have asked you how to SSH, but that is too basic. Okay, we are in here. You want me to do a top? Should I sort this by memory? Or load average or whatever? Okay, I'm doing it. What's your thought process? Do you wanna just like talk about it? He's looking for resource contention. Nice idea. We have our hand raising over there. 
Are the instances healthy in the load balancer? Let's check that. Since I use Google Cloud, this means three instances are healthy. So the instances are healthy. We have our hand over there. Can, are all the 429s happening on all three instances? And if it's isolated, can that be not destroyed but removed from the load balancer so you can interrogate it while it's still alive in situ and still maintain actual 200 connections? Amazing. How do we check that? How do we check that? How do we check that? Is it happening on all three of them? So earlier in the scenario, you mentioned that you were getting 429 errors. Yes. So was it through whatever, if it was through a ticket system or people were calling, perhaps if they had a direct IP, if not then we should have a log, do we have a logging system that through there, a log where we can check and try and isolate where, if it's one particular or if it's across the board? Nice, nice. Can we try to hit each of those IPs and that's behind the load balancer? Okay, does that answer the question that you have been asking? Right, how to get one of these? Okay, let's go back. I'm not sure if I had put it in a VPC or not, so it's behind, it's behind a firewall. See, it is not reachable. But why am I giving answers? So this is not reachable. What should we do next? Can you say, hit it locally from the box. Good, good. Okay, so we are already in the box. So I'll do something like local IP address slash, let's just do this, right? Are we getting any answer? Not really. Should I try the private IP of this? He said, what interface or IP is it listening on? You have to check the config and my question is, how do we check the config? Well, we also know. On the web server, you're running the browser, could tell you that now you can hide it. Yeah, okay. So yes, check the config. Probably HTTPD would be the, yeah Apache would be the most common. We also know we need to hit on HTTPS and it's insecure right now. You had something to add. 
Go back to the Google interface, I should tell you. Okay. No GCP at all. If this was AWS, I can tell you exactly how to do it. He's saying if this was AWS, he could tell us exactly what to do. And there's another hand over there. Do you wanna just move in a little bit so that Matt doesn't have to walk too much, even if I like that? Can we try running that curl command on the box again? To hit HTTPS, and with the insecure flag? But not on the internal network, on the external network. Okay, on it. With --insecure. We have a hand. Oh yeah, go ahead.
So I don't know Google Cloud, but there was a tab above that said monitoring. I wonder what happens. Hmm, interesting. This tab, did you mean? Okay, there it is. You had high expectations, didn't you? Why was there a spike early in the morning? Was there any activity? Because I was testing the presentation. Let's do it for six hours. Maybe I can't test it as much.
Are you able to ping the gateway or the other nodes behind the load balancer as well? You mean other instances in the load balancer, from the browser or from within? From within, if you're able to ping the other nodes and if you're able to connect to the load balancer from there. And within the load balancer itself, you might want to check the configs to see if those nodes are actually included in the config file on the load balancer, because it might not be configured properly. I missed it, can you say it again? Oh sure, the config file on the load balancer itself, to see if it includes those three nodes. Okay, so you want to see the config of the load balancer itself? Yes, in addition to pinging the gateway and the other nodes. Okay, so I did try a curl here. It didn't work. Do you want me to ping it? Sure, all right. Okay, it looks like it's timing out. Try the internal address is what I heard, so let's do that as well. Okay, so let's take a step back after this and see what we are trying to get to.
Like, can somebody talk about the thought process? What are we trying to get to here? So he said, we want to know what process is running on what port in the VM so that we would know what to do with it, something like that? Where to keep looking. Where to keep looking, all right, okay. So, I understand that we are pinging different machines in the system, but on the machine itself, we haven't done much. We want to figure out if the problem is with the load balancer or with the apps running on the app servers. Awesome, how do we do that? Do you have application logs for the load balancer, or do you know what the application logs are on the app servers? I don't have logs from the load balancer, but I do have logs on the machine from the process itself. However, you will have to help us get to those logs. I don't know what your question is. I mean, this is what this whole session is about, right? We are learning some new tools, new techniques from each other, how people debug, and that's what this whole thing is about.
So I tried it from two other devices and it comes through just fine. You've got to keep trying, you would see it. All right, so the fact that I randomly started seeing the curl work on your laptop too tells me what Jeremy was saying there. Maybe it's an issue with one of the machines, because the load balancer keeps sending us to the bad one first, then it probably did a round robin, had enough, and then sent us to the one that's working correctly. Fair enough. Okay, so now I would say you remove them from the group, attach an external-facing network to the machines, and try each machine individually to see what the issue is there. All right, is there a better way of doing this? Because this will bring down the site, right, if I remove them all from the load balancer. Well, can we look at the web server logs and see which one is throwing 429s? There was an idea, but it got cross-questioned with how to find the logs. You have something.
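The suggestion to look at the web server logs and see which node is throwing 429s could look roughly like the following. This is a sketch under assumptions: the log lines are invented, the format is assumed to resemble the common access-log layout (status code right after the quoted request), and the node names are placeholders; the real application's logs in the talk may look nothing like this.

```python
import re
from typing import Dict, Iterable

# Assumes the status code follows the quoted request line, as in common access logs.
STATUS_RE = re.compile(r'" (\d{3}) ')


def count_status(lines: Iterable[str], status: str = "429") -> int:
    """Count log lines whose HTTP status equals `status`."""
    hits = 0
    for line in lines:
        match = STATUS_RE.search(line)
        if match and match.group(1) == status:
            hits += 1
    return hits


def count_status_per_node(logs: Dict[str, Iterable[str]], status: str = "429") -> Dict[str, int]:
    """Map each node name to its number of responses with `status`."""
    return {node: count_status(lines, status) for node, lines in logs.items()}


# Invented sample data, for illustration only.
sample_logs = {
    "node-1": [
        '10.0.0.1 - - [01/Jan/2019:10:00:00 +0000] "GET /count HTTP/1.1" 200 12',
        '10.0.0.1 - - [01/Jan/2019:10:00:01 +0000] "GET /count HTTP/1.1" 429 0',
    ],
    "node-2": [
        '10.0.0.1 - - [01/Jan/2019:10:00:02 +0000] "GET /count HTTP/1.1" 429 0',
    ],
}
print(count_status_per_node(sample_logs))  # {'node-1': 1, 'node-2': 1}
```

If every node shows 429s, as in this invented sample, the "one bad machine" theory becomes less likely and attention shifts to something the nodes share, such as the application code.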
Can you go to the details tab on the load balancer? Can we go to the details tab on the load balancer? What's your thought process? Okay, when you want to curl those, hit them on port 8080. Oh, how do you get that? Do you want to talk about that before? Why did we see that? Because it says incoming traffic is directed to port 8080. Because of this. We missed this bit. Okay, so let's do this again on the port. Not ping, curl. Try the other two. That's good, all right. So what was breaking? Ha ha, you didn't fix it yet? Okay, so now we can hit it internally. Wait, is this the IP address of one of the nodes or is this the IP address of the load balancer? It is one of the nodes. Okay, try the next node. Next node? Okay. Please. Only because you said that, I'll do it. Ha ha ha ha ha. Okay, let's do it for 197 this time. You got a space in there. Third one, ready for this one?
That's netstat -tulnp | grep 8080. Is that what you want? Space, pipe. Like your IP address or no? No, it's pipe. Okay. Grep 8080. Ah, before I hit enter, do you want to explain to the room what it is going to do? You're showing all listening ports and you're looking specifically for port 8080, which will tell us which web server you're running. Add a -a. Dash. Oh, okay, sudo. There's some. That IP address in the list is not in the batch of load balancers. Is that what you're saying? I thought it was; I see two different 35-dot addresses. The third one was a 35, but that was the app server itself, not the load balancer. The load balancer IP, as we saw, was 35 this. I saw a hand somewhere, maybe not. Okay, so what do we get from here? Yes, this thing.
Okay, before I hit enter, you want to explain to people, and this time I'm going to request Matt to give them the mic. Sure. So I just want to see what file handles are open by that process, the PID 14704. That should at least tell us where the logs are going. Makes sense. Okay. I wasn't expecting that, honestly. Yeah. Any other way? I know. Okay, there's some.
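For readers who have not used the `netstat -tulnp | grep 8080` pipeline, here is a minimal sketch of what it extracts: the PID and program name of whatever is listening on a given port. The sample output below is hand-written in the usual Linux netstat column layout; the PID 14704 matches the one mentioned in the room, but the program names are hypothetical.

```python
from typing import List, Tuple


def listeners_on_port(netstat_output: str, port: int) -> List[Tuple[int, str]]:
    """Return (pid, program) pairs for sockets bound to `port`.

    Expects text in the layout printed by `netstat -tulnp` on Linux, where the
    fourth column is the local address and the last column is "PID/Program".
    """
    found = []
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 6 or not fields[0].startswith(("tcp", "udp")):
            continue  # header or unrelated line
        local_port = fields[3].rsplit(":", 1)[-1]
        pid_program = fields[-1]
        if local_port != str(port) or "/" not in pid_program:
            continue  # other port, or PID hidden because sudo was not used
        pid, program = pid_program.split("/", 1)
        found.append((int(pid), program))
    return found


# Hand-written sample; program names are hypothetical.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      801/sshd
tcp        0      0 0.0.0.0:8080            0.0.0.0:*               LISTEN      14704/dragon-api
"""
print(listeners_on_port(SAMPLE, 8080))  # [(14704, 'dragon-api')]
```

Without sudo, netstat shows "-" in the last column for processes owned by other users, which is why the room asked for it.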
Can we grep for dragon in /var/log? /var/log, grep for dragon? Yeah, can we grep for dragon? Okay, that's because /var/log is a place for all the logs in general, and you have high expectations from this setup for standards and everything, because I talked about agile. All right, all right. You hope and pray the defaults weren't changed. Oh, you want me to get there? /var/log, there's something like a dragon. Oh, you did have high expectations, didn't you? You want me to tail those logs? Okay. Well, that's how production systems are. Not that straightforward, right? But it says health check is healthy. Can you say it again? Grep that file for 429. 429, and I heard we are still doing it on one node. Right, yes. So our working theory was that one of the three nodes is bad, but then we did a curl with slash dragon API against all three IPs and all of those returned a yes.
Can we run BOSH? So can you say it again? Once. We tried once, and once it returned yes. Okay, what was the next thing? Can we try to run BOSH? Is BOSH on here? High expectations, don't you? Right, no. Don't have BOSH. Chef client or Puppet, Ansible, anything? All right. I'll add it to our backlog and then we'll create stories for it. So do we know what image this VM was even created from? Like, how do we know that this image was not updated along with the application? That might be causing the problem. Nothing happened to the image, because our DevOps engineer is dead. Literally, she did not make any automation around recreating the VMs yet.
We are almost there. We're almost there. So we are on the VM. We know dragon API is something which is running as some process. We want to get the logs. We are almost there with the logs. You said something about version control. Somebody else also talked about getting to version control. We just need some way to connect the dots. Anybody? Now, yes. Literally could not hear any of that. Matt, you've got to pass the mic.
Can we make a request while tailing the log to see if something pops up? Good question. Okay, so let me SSH again. So we tail the logs. I should not tail the check logs, right? We are empty. And then curl itself at port 8080 for getDragonCount. Sorry, one. That should have worked. Let's see address you're coming. IP, right? Oh, I'm sorry. You have to do HTTPS? Let's get the right IP. Let's check the private IP. DragonCount request responded. Okay, so based on your hint, let's find where this application code is. Based on what? Based on your hint for the version control and something else.
Somebody said, where's the source for this? Right, it's being served out of var, what is that? /var/www/html. Let's try that directory. www. Don't have that. No, that directory doesn't exist? No. Okay, let's grep for the HTTPD process. Is this Debian or CentOS? So you want to get to the place where this code base is sitting. Can anybody here talk about how we can get it using /proc? So we have a process ID. If you go to /proc/ followed by the process ID, there are a bunch of files. I don't remember which part exactly, but if you ls -l... I'm just trying to get the... Okay, so you want me to do ls... Oh, sorry, yeah, ls and enter. And then that's fd, yes. Could you ls in the fd directory there? So the fd directory, file descriptors, right. And you need to sudo. So those are the files and sockets that the process has open, and if you see... But I guess the source code is closed. You're almost there.
If I get here, and if I do ls, that's where the git repositories are sitting, and the name of the repository is dragon API. Can you run ls -al? ls -al. Okay, can you run git status? Where are you talking from? Why can't I see? Yes, okay. Wait one second. What do you want me to see? I'll do it here, okay, git status. Okay, I don't see a .git folder. Okay, in the interest of time, we're almost there.
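The /proc trick described here can be sketched in a few lines. On Linux, each entry under `/proc/<pid>/fd` is a symbolic link to whatever that descriptor points at (a file, a socket, a pipe), so resolving the links shows where a process is writing its logs. The function name and the `proc_root` parameter are illustrative choices made so the sketch can be tried against a fake directory tree; nothing here comes from the speaker's machine.

```python
import os
from typing import Dict


def open_file_targets(pid: int, proc_root: str = "/proc") -> Dict[int, str]:
    """Map each open file descriptor of `pid` to the path or socket it refers to."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    targets: Dict[int, str] = {}
    for name in os.listdir(fd_dir):  # may need sudo for another user's process
        try:
            targets[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except (OSError, ValueError):
            continue  # descriptor vanished, or entry is not a numbered fd
    return targets


# Only runs where a Linux-style /proc exists; inspects this Python process itself.
if os.path.isdir("/proc/self/fd"):
    for fd, target in sorted(open_file_targets(os.getpid()).items()):
        print(fd, "->", target)
```

On Linux the same `/proc/<pid>` directory also holds a `cwd` link to the process's working directory, which is often a quick pointer to where the code was started from.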
The code base is actually here, not this one. DevOpsDays here, there was a dragon API, and there it was. And if we had looked for it over there, we would have seen that throttling had been added by the developers themselves. And that's why it was working well in staging, because not that many requests were coming in. So in this case, the load balancer's own IP was being rejected, because the load balancer has a finite number of IPs. This is one of the problems that I actually faced when I was at one of the startups. I had a few more cases to try to solve here, but in the interest of time, I'm gonna move on to recapping.
We use this technique not just for going on call and not just for incidents. Now we use this technique for any kind of debugging that we do in our teams, like a pipeline investigation into why something is not working. If one person or a pair solved the problem, they present it to the rest of the team, and that's how we increase knowledge and context on our team. That's all, thank you for having me here.
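As a footnote to the root cause of the 429 incident, the failure mode can be reproduced in a few lines. This is a sketch under assumptions: it supposes the developers' throttle counted requests per source IP, and the limit, window, and addresses are invented. Behind a load balancer, the application sees the load balancer's address rather than each client's, so many different clients share one budget.

```python
from collections import defaultdict
from typing import DefaultDict, List


class PerIpThrottle:
    """Sliding-window limiter: at most `limit` requests per source IP per window."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window = window_seconds
        self._hits: DefaultDict[str, List[float]] = defaultdict(list)

    def status_for(self, source_ip: str, now: float) -> int:
        """Return the HTTP status the app would send for a request at time `now`."""
        recent = [t for t in self._hits[source_ip] if now - t < self.window]
        if len(recent) >= self.limit:
            self._hits[source_ip] = recent
            return 429  # Too Many Requests
        recent.append(now)
        self._hits[source_ip] = recent
        return 200


# Invented numbers: five different end users, one second apart, but the app
# only ever sees the load balancer's address as the source.
throttle = PerIpThrottle(limit=3, window_seconds=60)
lb_ip = "10.0.0.1"
statuses = [throttle.status_for(lb_ip, now=float(i)) for i in range(5)]
print(statuses)  # [200, 200, 200, 429, 429]
```

With low staging traffic the shared budget is never exhausted, which matches why the problem only showed up in production.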