Welcome to the conference auditorium. My name is Nandan. I'm here to introduce our speakers for today, Mr. Chilur and Mr. Mulligan. They're going to be talking about learning from 500 hours of live debugging sessions. Please welcome them.
Thank you, everyone. Good morning. This presentation is based on what my teammate and I learned from spending 500 hours interacting with customers as developers. There were also support engineers on the calls, but this is from the perspective of developers who have written code that is being used more and more in the DevOps world, while we ourselves are not yet in a DevOps role. So how do you take the input from there, and what changes can you make to make your software better? That is what this talk is about.
A little bit of introduction. I have seven years of experience working in storage systems and in distributed systems. I started out as a developer for core Gluster, which is a distributed storage system, and did some integration of Gluster with Samba and then Grell. For the past three or four years, I've been working on integrating Gluster with Kubernetes and OpenShift. John, who is the teammate I spoke about, has been working on storage systems and system applications for over 10 years now. This talk is based on our experience, and the software that I spoke about is the storage offering that we have integrated with Kubernetes and with OpenShift. Feel free to interrupt me during the talk if you have any questions. We can also do Q&A at the end of the session.
I think most of you would be familiar with this. As kids, when we discovered programming for the first time, it felt like magic. What? You're telling me that I can dodge bullets? You're telling me I can control machines at my will? This seems amazing. I want to do this magic thing all my life. That's how we started, most of us. And it was fun. For a long time, it was fun. And then slowly, the magic starts wearing off. You feel that's not enough. I want to do more.
And then you think, hey, what if I could write code which others could use? And slowly, that took me here. I started working for Red Hat and wrote code which is shipped to millions of users. And honestly, I did not know what I was asking for. Like Peter Parker discovered, with great power comes great responsibility. In the next few slides, you will see 10 or 12 things that we learned about the responsibility part after having got the power. The format is something like this: I'll show you first what the customer told us and then what we learned from that.
The first one is an interesting, unexpected question that we got from a customer. You don't need to read everything which is there on the slide. Basically, this customer had four nodes. They were creating replica three volumes. They understood that when you create a replica three volume, only three nodes out of these four will be picked and used to create it. And they also understood that some redundancy is required, some extra storage space that needs to be available for the system to do the provisioning. But the specific question was: I don't want to have 25% redundancy. I understand that 0% is not going to work. Can you give me the exact number which will work for my use case? You can attribute it to me being a novice developer here, but I never thought such a question could come from a user. We had to spend two days on a script that ran enough tests to tell them exactly: hey, with 17%, you are good to go. You will get a volume provisioned 99% of the time. That's the unexpected question.
The second one is the more interesting one. The customer came running to us. He was like, help me, help me. I have this OpenShift cluster deployed on a cruise ship. It is going to depart tomorrow, and I'm not able to create volumes. If you don't fix it today, my job is on the line.
We are aware that there are data centers which do not have connectivity to the internet, and we need to account for that. We build our tooling and build systems so that we can provide our RPMs and containers to such customers. But the urgency of it, the urgency combined with internet access not being available, was again something which I did not expect. And more than that, the emotions that this customer came with: they were really anxious. There's only 24 hours left. I need to do something. My job is on the line. Please help me.
This is the first time that I understood that we need to empathize with the users. We are all taught that we should put the user hat on. We should think like an admin. As an admin, I might have a different storage back end than what others have, so there should be an option somewhere in my software to change the back end. But to empathize with users, to empathize with the situation that they might be in when they come running to you, this is not taught to us as often as it should be. So that's the empathy part. This is the important slide. If you had to go back with just one slide in mind, I would say this is the one. This is not so commonly found, especially in the open source communities, and I would like to emphasize it.
Next one. This customer had three data centers. They had set it up all right, but two of the data centers went down at the same time because of a power issue. They understood that having just one of the data centers up meant that the software could only perform reads and would have disabled writes. But they had critical applications running, and they wanted some way to let them run in read-write mode. Now, the senior developer in the call calmly asked them: OK, if you tell me exactly what the access pattern of these critical applications is, I might be able to help you. A junior developer in this situation would have asked something else.
They would have blamed the customer: why did you lose power in two data centers at the same time? Or they would have prioritized debugging and fixing the power issue before understanding the situation that the customer was in. The difference in these thinking patterns comes only after you experience these situations. This is not something which you can learn theoretically. The only solution is to learn on the job.
This is done all the time in jobs where these situations are common, the most common being firefighters. They have a long training period where they have to shadow experienced firefighters. They go along with them. They are put in the same situations as the experienced ones, but it's not their responsibility. This is what we suggest to juniors in our team now: just follow us along when we join a customer call. They don't need to do anything. But be there. See what the customer is going through. Understand what the thinking pattern should be when things start going wrong. And come up with solutions which are practical and safe for this customer, especially when you are working on a storage system. It's possible that one wrong action that you take here can delete the data that they have. This could be the only backup that they are working with. So it's important that you take practical steps, but at the same time, safe steps.
So this one. The customer had upgraded to a newer version, and they found a small bug, but they did not want to run the system with it. They decided it's not a big deal. Let me downgrade. The older version was working fine. They did so, but neither the old version nor the new version would work anymore. Any guesses on what went wrong here? This is where we learned that if you persist data, you need to version it. We had made a very small change in the format of the database which was being used here. Once they upgraded, the database had been upgraded to the newer format.
And the downgraded version would not work with the newer format. But the old version was not aware of the change, and it changed the DB again in a way that the new version could not read. This is a very simple thing which I think all experienced developers know, but it's one of those things which gets neglected when you're developing a new project. This is not the first thing on your mind when you're working on a new shiny thing. You think that you'll come back and do this at a later date, but that does not happen most of the time, until you hit something critical like this. Have a checklist. I would say that the frameworks that we work with should always emphasize versioning the format that's getting stored.
And you don't need to stop here. The very first thing which we do at this point is to put in a version number, which is an integer that you keep bumping up as and when you change the format. But that does not scale, especially for software which is long running and maintained for a long time. For those of you who have worked with protocols, you know that there is a better solution. It's not an integer that you should have here. What you should work with is a format, version, and capabilities scheme. The data that you store should come with a description of what formats or what different patterns it's using. And the software that is going to read the data should come with capabilities: OK, I understand format A, format B, format C, and format I. Anything in between, I do not understand. When the software picks this up, it reads the header which says the data here is present in C, D, and E formats, and concludes: I don't do D and E, so I should quit now. Once you start building this, you'll realize it becomes so easy later on to add new features and to deprecate old features, because the software is always capable of determining what it should touch and what it should not.
This is a corollary to that problem.
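To make the format, version, and capabilities idea described above concrete, here is a minimal Python sketch. It is only an illustration, not the code of the product discussed in this talk: the names `SUPPORTED_FEATURES`, `Header`, `check_compatible`, and `open_store` are hypothetical, and the header is assumed to be a plain dictionary carrying a `features` list.

```python
from dataclasses import dataclass

# Hypothetical capability set of this reader, using the letters from the talk:
# it understands formats A, B, C, and I, and nothing in between.
SUPPORTED_FEATURES = frozenset({"A", "B", "C", "I"})


class UnsupportedFormatError(Exception):
    """Raised when stored data uses features this reader does not understand."""


@dataclass(frozen=True)
class Header:
    # Features the stored data actually uses, as recorded by whoever wrote it.
    features: frozenset


def check_compatible(header, supported=SUPPORTED_FEATURES):
    """Return normally if every feature in the data is understood, else raise."""
    unknown = header.features - supported
    if unknown:
        raise UnsupportedFormatError(
            "data uses unsupported features: " + ", ".join(sorted(unknown))
        )


def open_store(raw_header):
    """Parse a header dict and refuse to touch data that is not fully understood."""
    header = Header(features=frozenset(raw_header.get("features", [])))
    check_compatible(header)
    return header
```

With this shape, an old binary that meets data written with features D and E stops before modifying anything, instead of silently rewriting the database in a way the newer version can no longer read. A single integer version cannot express that distinction as precisely.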
In this case, what happened was the customer made a change on their infra side. The FQDN of the system changed, and they wanted it to be updated in the database. The software that they had did not allow one to change the FQDN once it was entered into the system. We were willing to write an update for the software where we provided an API for them to make the change. But they were not in a situation where they could bring in updates to the software. They wanted to do it on the data which was stored there, with the same version which they were running. In these situations, you will figure out that whenever you store data in a machine-readable format, you should have tooling around it. That's the easiest way for you to be able to make changes. Again, this is not anything new. If you have worked with protocols, we have frameworks, we have tools, we have applications which can read the data that is on the network and show it to you in a human-readable format. This becomes crucial, especially when you're working with a customer in a time-critical situation. Being able to look at data and change it as fast as you can will help you fix the customer's problems much sooner than is otherwise possible.
I went there, so let's start here. This was one of the interesting problems we had. The software refused to start whenever there was an inconsistency in the database. It required manual intervention. This happened with our QE. This was not released software, but there were a lot of customer bugs that we were getting at that time where the database would become inconsistent because of some other issues that we had in the back end. And we decided that we would value consistency more than the user experience. We decided that if the software read the database and found anything wrong in it, it would refuse to start.
We wanted to do that because we wanted to stop more corruption happening in the database, but we realized that that's not what a user wants. The solution we found for these cases is that it's ideal if you can make a note of what's wrong in the database but, at the same time, let the software run, let the user keep using it, and abstract the problem from them. At a later date, they can come back and fix those parts of the database which are corrupt.
The part where I realized the true value of it is in the second case. This is where we had published documentation which said what the limits for this particular software are and under what circumstances they can go up to a particular limit. This customer was a curious person. They wanted to try: how far can I go without breaking the system? I'm pretty sure most of you have done some version of this on your laptops. You would have tried to open too many tabs in Chrome and then realized that nothing is responsive, and the only way out then is to reboot your laptop. This is a valid fix when you are talking about non-production systems, but in production systems, that does not work. You need to recover the system with the parts which are running, and this is something which we have always studied in software engineering. It is called mean time to recovery. I knew the value of it, but it was only a mathematical value that I understood then. In this particular customer case, we took close to 10 hours to recover the system, and that's when I truly recognized the value of mean time to recovery. Always make sure that the limits are hard coded in your software so that users do not get to surpass them and leave the system in a state that is difficult to recover.
Yep, so I found it. This was a customer who was again one of the curious folks. They executed something and then they found that the whole system went down.
In this case, it was a small CLI tool that we had embedded in the code because we thought that when a customer case comes to us, we would be able to execute it and fix the system sooner than otherwise. We did not realize that it would be visible to the user, and a user wants to try any new shiny thing that comes along. There's also a law around it, which we only realized later. It's called the attractive nuisance doctrine. If kids walk onto land which is owned by someone, find something attractive there, and that leads them to harm themselves, then the land owner is held responsible for it. There is a law for that. This is similar to what we do in software. If you provide any new feature in software, please expect that the customers, the users, are like kids who will go and touch it and say, hey, what's this new shiny thing? Let me try this out. It happens all the time. You cannot stop them from doing it. The only way you can protect yourself and the system is to hide these things. Do not make them visible to the user until you're confident that they are meant for a user. Your debugging tools should be hidden until they are ready. So basically the learning from here is: do not make it easy to break.
Again, there is the other side to the solution. Once you start working on these tools, they are not meant to be kept to yourself forever. You have to write tools in such a way that you take them to maturity slowly. In our case, what we did was, if I wrote a tool, I would use it for a few days. I'd make sure that it's working fine. I'd hand it over to my teammate, John. He would use it as a user, without asking me how it's meant to be used or what the corner cases are. He would use it for a few days. Once he was confident about it, we would pass it over to the support team. The support team would start using it. Once they were confident about it, we would start giving it to the users.
And at the very end, when you know that this is working and there are no corner cases that can impact the system badly, you should integrate it into the main software itself. If you can, you should also make it automatic. This is in particular for the case where we were fixing the database corruptions. We knew how the corruptions were coming in. We had automated the fix for them. And once we were confident about it, the software would, as soon as it booted up, do one check of the database, fix the inconsistencies, and then go into normal mode. This is what we have learned from years and years of working on file systems. This is how journaling data is used in file systems. If you remember the Windows 95 days, you had to stay up all night defragging your file systems, and that was a manual task. The last time I did it was, I think, 15 years ago. I have not done a defrag since, because all our file systems have enough journaling to fix things before the file system is even mounted for a user.
The next case is particularly relevant to all of you who work on server-side applications, and especially on those which have only a CLI and no graphical interface. Here the customer was in a fix. They wanted to delete a volume, but they were not able to. In this particular setup, we do not allow users to delete a volume if its nodes are down. They had one node down. At the same time, they were not able to use the CLI to detach this node because it was hosting a brick for this volume. If you don't know what a brick is here, it's a component of the volume, and it was for the same volume which the user was trying to delete. They came to us: hey, is this not a chicken and egg issue? I cannot remove the node, but I cannot remove the volume either. How do I proceed? The answer was very obvious to developers. They knew about a trick. All you had to do was use a brick-related command to remove the brick from the volume information.
Once that's removed, you can proceed and remove the node, and once that is done, you can even delete the volume. But this information was only available to the developers. The fix for this was very simple. In the error message which comes for either of the first two operations, we told them: if you are trying to remove a node and you are sure that the node cannot be brought online, all you need to do is use the replace brick command, do the steps, and then you can proceed to the next steps.
This is something which is very often neglected in CLI programs. The discovery of next steps in a non-GUI program is very useful. This can be seen in the newer systems that we are building. We have seen this in OpenStack. We have also seen this in OpenShift. But the smaller tools, the smaller programs which are being built, neglect this even now. It does not take much effort to do this along with the failure of a CLI command. Yes, it's good and it's mandatory that you provide an exit status which tools can take in to decide what to do next. But you also have stdout available, where you can give more information to the user, especially human users, and they will be able to make use of it. They will thank you for it, believe me.
The next case is about a user who came to us and said: hey, I upgraded the system using the documentation. It looks like everyone has unrestricted access to data now. Is there anything wrong in your documentation? Is there anything wrong in the steps that I followed? Help me here, because it's unauthenticated now. In many cases, what happens in this situation is that the developer throws the problem over to the documentation team and says: you did something wrong. I don't care about it. My shiny code, my algorithm inside, is working perfectly fine. Please fix this. Doing so, however, reflects a narrow view held by many developers.
When developers are made owners or maintainers of a project, it's not just responsibility for the code that they take over at that point. They need to treat it as being custodians of the entire system. Instead of owners, I think we should all start calling the maintainers of a project custodians. This way, anything which breaks in the system should cause them grief, and not just the code that they wrote. If you are building a new team, this is probably one of the important things to keep note of as tech leads, as managers: everyone in the team is responsible for the whole project and not just for the code that they write. As you can see, sometimes custodians are the ones who make the whole project better. They care about every small nook and corner of the project which can be improved and fixed.
This is one of the funny ones. I'll let you people make a guess. What we had here was that some information had come over from the customer, and John was trying to analyze it. He asked me: how many volumes do you see in the information that has come? I see 80. I replied saying that I see only 10. The tools that I'm using here are grep, awk, and word count. John said, I'm using my Python script. Can anyone guess what went wrong here? Line endings. The good old DOS versus Unix problem.
Even in current times, you need to be careful that when you're communicating with someone, there are multiple ways for the communication to be ineffective. It can happen when you're talking to a person from a different geography who might not get the phrases that you're using. It might happen because you're using multiple channels of communication. In this particular case, the system was OpenShift. They were on Linux. The analysis of the file was being done on Linux machines. But in between, they had copied the file over to a Windows machine, because that is how their FTP, SCP system worked.
And when they did that copy, it had converted the file from Unix format to DOS format. There are many other such occurrences. In one of the cases, we had a customer who was so excited to execute whatever we gave them that they would not even wait for the second command to come in. It became especially difficult when we were communicating an if-else case to them, because even before we got to the else case, they would execute the if case. The emphasis is mainly on figuring out how to communicate effectively. Again, this is one of the neglected skills for a developer. As a kid who discovered the magic of programming, I never knew that you need to know how to communicate in order to be a better developer.
This is one of the non-technical parts of it. In interactions with customers, especially in these stressful situations, you'll come across cases where the customer might be angry at you. They might be angry at you as a developer. They might be angry at you as the company which is supporting their deployment, and many such cases. It's very easy to take this back home. It might cause problems for you personally or for people around you. What we found working in these situations is that there is a very easy fix for it. The fix is to have a support system. John and I would finish these customer calls, go to a different channel of ours which was internal team only, and speak about what happened for five minutes. We did not have to analyze anything which happened. We just had to talk about it for five minutes, and that brought closure on most occasions. We could then move on and do our other activities for the day without being emotionally down because of something that went wrong.
It's not only that your customer might be rude to you. Sometimes it so happens that you encounter a situation which is so new that even after having spent 10 to 15 hours on it, you have no breakthrough. There is no way you are able to fix this.
You realize that the customer is also in the same situation. They might be helpful, they might be understanding, but at the same time it makes you sad that you are not able to fix a problem in the software that you have written, and that you have now put people in that situation. Again, in this case, the emphasis should be on making sure that you are doing your best. Talk to your teammates about it. Make sure that you have done the best you can, and certainly, if not in 10 hours then in 20 hours, you'll get a fix which you can then share with the customer, and the happiness will be back. But have a support system. This again is not anything new. In our software systems, we have replica two and replica three deployments, but when it comes to support, we send one engineer into the situation with no backup. This is something that we need to fix. Always have a backup, a support system, in the call, during it and even after it. This will help your team stay sane and do more productive work.
Again, a corollary to the previous point: in these calls that we did after talking to the customer, there was only one question that we asked. Did we do something new today? Most of the time the answer would be that we did. We figured out a new way to pass the system. We figured out a new way to protect this database from being corrupted, and all such learnings went back into the product. I don't like the word retrospective as much, because many new tools and processes have taken over this world and it becomes boring and more process. But just ask this one question: did we do anything new today? It will help you a lot.
This is a summary slide. You can look at it when these slides get uploaded. These are the 12 topics that we discussed here. I have got a signal that it's the end of my talk. I would like to sign off by saying code responsibly. You have to think as if you're driving on a road, and you should code responsibly. Thank you. Any questions?
If you have questions.