Hi, everyone. It's an absolute pleasure joining you today to discuss this ever-important topic: how can we tackle customer issues in cloud-native environments? Customer issues happen to all of us, and when one hits, it impacts the entire company. It's not just something that the R&D teams deal with. Because it's so critical for companies to get right, it's often one of the most important responsibilities of developer teams. And knowing how to tackle customer issues, with that added complexity of running in cloud-native environments, is super important for the success of any company, large or small. I hope you enjoy this presentation and find it useful, and I'd be super keen to continue this conversation with you all over Twitter.

So let me start by introducing myself. My name is Eleanor, and I'm a solution engineer at Rookout. As part of my role, I work alongside many different companies to help them leverage our live debugging capabilities. I grew up in New Zealand, and I currently live in Tel Aviv, and I've been fortunate to work with many different types of companies all over the world over the years. I have a PhD in engineering, and the thing that I enjoy the most is working in that sweet spot between people and technologies, seeing how we can make the most out of bleeding-edge technologies and innovations. I mainly focus on making sure that the technologies that I develop actually end up adding value.

This takes me to this talk, tackling customer issues in cloud-native environments. This is a really important topic that we should all aim to be better at every day. So I want to set the scene for the talk. Imagine you're on holiday. You've waited a year and a half to go on this holiday. You've endured lockdowns and toilet paper shortages, managed to finally get a COVID vaccine, and even did all those PCR tests in order to board the plane. You've really earned this.
The only thing is that your startup is small, and, well, there are certain parts of the code that only you can deal with. So before you managed to get the two weeks off, you had to convince your boss that everything was going to be all right. You did it by promising her that you'd take your laptop and that if something did go wrong, if there were any fires that needed to be put out, you'd be there to help.

You probably know how the story is going to end. Day two of your holiday, and just as you sit down on a beach chair, already covered in sunblock, you get a call from your manager. All hell has broken loose, and the team needs your help. Your platinum client, the one that the business team has been harping on and on about because they'll extend their contract soon, has just reached out to your CTO saying that they can't use your product. Their user management component is completely down. No one can access it, and it's preventing them from using the product properly. It's also impacting their users, which makes for an even more stressful situation. Worst yet, they're seeing that your platform, the one which they've paid very good money for, is actually breaking some of their services.

So from your client's point of view, your product is completely broken. They can't complete basic actions. Their users are locked out of their system. And something somewhere is using up a lot of resources, breaking their services. Lots of pressure from the client. But to add to that terrible situation, the sales guy has completely lost his mind. He was in the middle of expansion discussions with this client, and it was supposed to be a really big deal, one that would give him a massive boost towards his targets. So here you are, the one that has to save everything and fix this issue from your holiday.

Well, this happens to everyone. Ask any developer anywhere, and they'll tell you that they spend a significant amount of time solving bugs.
In fact, I actually came across a study not too long ago that found that 75% of developer time is spent on solving bugs. When we look at customer-related issues, the stakes are even higher, and we need to make sure that we know how to tackle those. Now I know that as developers we do a lot, and more often than not, we sign up for the job to write new, shiny, innovative code. But the reality is that we need to spend a significant portion of our time actually putting out fires.

A lot rides on this from a business point of view, especially for smaller companies. You know, if you look at a large company, say Google or Facebook, they have major outages as well. It makes it all over the news, all over forums, and it frustrates people. But more often than not, the next day, no one remembers, and the companies move on. If a small startup has a major outage, though, it could spell the end of it, especially when a key client is involved.

So why is this even so important to fix? Why is it important for us to fix customer issues? Why can't developers just work in that desired box of developing new code that does cool things? Well, the success of the business relies on our customers having a great experience. After all, customers are the ones that dictate sales, which is what drives revenue. If we have happy customers, they'll renew their contracts, they'll expand their business with us, and we'll grow our revenue. If they like what we deliver and trust our teams and products, they'll tell others about what we do, and this will give us a strong reputation in the industry. They'll tell all their friends, and this will drive even more sales. Having a robust and trustworthy product will elevate the company's position in the market. And if everything works all the time, people will use your product endlessly. They'll form habits around your product and increase the rate of adoption.
These four components rely on having the customers face a minimal number of issues and, when they do face them, having these addressed quickly.

So you ask yourself, how did we get to this point? We have a team of talented engineers. The company's been around for a number of years, and here I am, putting my sunbathing lotion on, and I can't continue going to the beach because I have to figure out a critical issue. This could literally make or break our company. So first of all, fear not. You're not alone in this position. But what I hope you'll find more useful to hear is that there are several ways in which we can deal with this better, and this will help us tackle customer issues better in the future.

But before we even delve into any of those details, let's really understand what has changed over the last few years. We are developing unique code at incredible speeds. Our environments, technologies, stacks, whatnot, are continuously changing and evolving, and it's hard to keep everything working in an orderly way all the time. When something new is introduced, we need to learn it, implement it, and track it, and there are many things that change that are beyond our control. It's not like a plumber who knows exactly how to unclog the drains. Our worlds are ever-changing, and solving client issues is no small feat.

What also contributes to the challenges is that the code is running in the cloud. Now, don't get me wrong. The move to cloud is one of the greatest and most exciting shifts we've seen in our careers, but it makes for many, many challenges. We all know and love the benefits that cloud can bring. It's flexible and it's scalable. We can support a high traffic of users or process an incredible amount of data, all without having to worry about physical resources. It's reliable. There are many redundancies and fail-safe mechanisms that cloud providers leverage to make sure that the services are up and running 99.9% of the time.
Even though your cloud bill might look high, it's actually saving you a lot of money. The alternative of having to manage all the hardware is oftentimes a lot more expensive. Add to that the fact that you don't need to have a whole team of engineers to maintain it anymore. Finally, security-wise, like before, you can rely on your cloud provider to look after that through automated mandatory testing and data security monitoring.

And with all that goodness, we're seeing that cloud-native app adoption is exploding. A recent IDC study indicated that by 2024, production-grade cloud-native apps will increase to 70% of all apps, up from 10% in 2020. So that's a massive increase. And this is the result of increased adoption of microservices, containers, dynamic orchestration, and DevOps technologies.

But with all new technologies, we're always seeing new challenges, and cloud is no different. Some of these challenges directly contribute to the complex customer issues that we are seeing. Take, for example, a more people-centered issue: lack of expertise. We oftentimes find ourselves working with engineers from the client side who have not managed to keep up with the latest and greatest. Organizations are indeed placing their workloads in the cloud, but they find it difficult to keep up with the tools, leading to the need for expertise.

Similar to that, we see that most companies' use of cloud technologies is somewhat segmented. There is this ad hoc strategy around cloud adoption and migration, and this often leads to intermittent cloud migration. Some development teams may use public cloud for specific applications and particular projects, whereas others may continue to use their data center equipment. Having this inconsistent adoption of any technology creates a mess and will inevitably result in issues down the line. There are also other challenges associated with cloud-native environments, and these are often centered around debugging.
When you spot an issue, be it by yourself or by your client telling you that there's something wrong, it's often difficult to understand what is truly happening there. You see, by moving your services to the cloud, you essentially gave up some of that control. There are multiple microservices communicating with each other, making it hard to get a holistic view of your application. And recreating the situation locally is next to impossible. Add to that the fact that your environment is continuously changing: instances are deployed and torn down periodically, and getting that dynamic visibility and predicting what is going to happen is often not an option. And finally, different people in your company probably have different access rights, and you probably won't be able to get to the root cause without someone else having to do a particular action, which makes things even more difficult. Now, when you work alongside your customers, that complexity is multiplied even more.

Now, if the tech challenges weren't enough, there's a list of other challenges that I'm sure you're all familiar with. First and foremost, when your customer raised the flag about the issue, they probably said one thing, you understood something else, and what was actually happening was different altogether. We often have problems when we don't have enough information on the issue. When we communicate over email or even Slack, getting the relevant information we need often feels like slowly pulling off a plaster. It's painful. We work with different tools, use different terms, and have different expectations. Add to that the fact that you're probably not working in the same space, let alone the same country. If you add time zones into the mix, you can have a really hard time getting the complete picture of the situation quickly. Internally, there are other challenges that we need to deal with.
With any critical customer issue comes a lot of pressure from the business and sales teams. They want the deal to go through, and that stress is oftentimes passed on to the developers. Then you have to take product into account. We've already set the quarter's roadmap. The dev team is already running late with the release of a new feature, which has already been plastered all over the latest PR campaigns. So to sum it up, a real mess.

But don't despair. Everyone knows that you can't develop code without bugs. It's simply a part of life. The important thing is to learn how to deal with it by building processes that better enable you to handle these situations when they do arise. As engineering leaders, whether you're in a managing position, a change maker, or someone who wants to ensure that you and your colleagues will be able to deal with bugs better next time they pop up, there are many things that we can start implementing. Here, I'm gonna split them into two sections, the first being process and behavior improvements that we can make, and the second focusing on technical aspects. Both are equally important. After all, the better prepared we are in all aspects, the better we'll be able to deal with these stressful situations.

So let's start by looking at processes and behavior methods. Having these implemented will enable a company to deal with customer issues in a much more efficient manner. Now, these touch the entire company. You'll see when we cover them that in order to address customer issues, we need various teams to work seamlessly together. We need to establish strong working relationships between teams in order to deliver the fix effectively and efficiently. The topics that we'll cover today include prioritizing what we work on, how to manage client communications, building the right team mindset to deal with issues and outages, and making sure that we continuously learn and improve through running retrospectives. Let's look at our first point.
Prioritization. We all know that we can't work on everything at the same time. We know that successful teams are ones that prioritize more effectively. More often than not, though, it feels like when we are faced with an issue, we don't know how to prioritize it over other things that we're working on. I mean, what's more important? Releasing a feature that will help us get a new client, fixing an old bug, or even completing the code refactoring project that we've been working on for so long in order to make sure that things will be easier to develop further down the line? A product manager is vital in these decisions, as they often have the high-level overview of all the initiatives that are taking place. In saying that, it's still important for everyone else in the company to understand how and what to prioritize.

To do this successfully, you first need to know and agree on what the high-level goals are. It could be around increasing user retention or expanding to new users, or it could even be around decreasing your infrastructure costs. Whatever that high-level goal is, everyone needs to be in sync with it. Then, when you stumble across an issue, you can evaluate its priority and decide how much time you're going to put towards it. For example, if the company's current number one goal is to expand to new clients, then naturally, releasing a new feature which will get new clients is more important than fixing a bug for an existing one.

But let's be honest, things are never black and white, and there's not going to be a situation where you would want to leave an existing client with a critical issue and a non-working product. That's why it's important to balance things out. Should we allocate 50% of engineering time to new features and 50% to bugs? Is one customer more critical than another? What is the effort associated with each piece of work versus the contribution it will make towards the company goal?
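
To make that last question a bit more concrete, here's a minimal sketch in Python of ranking work by contribution versus effort. Everything in it is an assumption made up for illustration: the `WorkItem` fields, the 1 to 10 contribution scale, and the 3x boost for a blocked customer don't come from any real tool, and each team would need to agree on its own numbers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WorkItem:
    name: str
    contribution: int               # assumed 1-10 scale: how much it moves the agreed company goal
    effort_days: float              # rough engineering estimate, in days
    customer_blocked: bool = False  # True if an existing client cannot use the product


def priority_score(item: WorkItem) -> float:
    """Contribution per day of effort; a blocked customer gets an assumed 3x boost."""
    score = item.contribution / item.effort_days
    return score * 3 if item.customer_blocked else score


def rank(items: List[WorkItem]) -> List[WorkItem]:
    """Return the items ordered from highest to lowest priority score."""
    return sorted(items, key=priority_score, reverse=True)


# Hypothetical backlog, loosely based on the three options mentioned in the talk.
backlog = [
    WorkItem("new feature for a prospect", contribution=8, effort_days=10),
    WorkItem("critical bug for existing client", contribution=6, effort_days=2,
             customer_blocked=True),
    WorkItem("code refactoring project", contribution=4, effort_days=15),
]

for item in rank(backlog):
    print(f"{priority_score(item):5.2f}  {item.name}")
```

With these made-up numbers, the critical bug comes out on top, the new feature second, and the refactoring last. A score like this is only a starting point for the conversation with your product manager, not a replacement for it.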
There are many aspects that we need to take into account, but the one thing to remember is that it is okay to not do something. It's important to identify what's really critical and what isn't. Perhaps it will be just fine to tell your customer that they need to wait an extra two days before the team will look at the bug.

Next, we have the ever-important client communication management. Working closely with the client is paramount. They are going to be the ones providing you with additional information, which will help you debug the issue. And once you have the solution in place, you'll probably need their involvement to deploy it. More so, the way you communicate with the client will impact the way they feel about the situation, and it will probably shape your relationship going forward.

Now, once a client alerts you to an issue, you need to stay in close contact with them. First, acknowledge that you've received their concerns. From that point forward, it's important to get consensus on what to tell the customer and when, how transparent you wanna be, the frequency of updates, and what you share with them. This will change from client to client based on who they are and your existing relationship with them. It is important, though, to give the client the respect they deserve and be honest and professional when you give them updates. For example, it's far better to be honest with your client and tell them that it's a complicated issue that will take two to three days to fix instead of telling them, oh, just wait, we're doing the best we can to get a fix out any moment.

What helps a lot with ensuring that client communication is managed well is having a single point of contact for the client. The person who's worked most closely with the client should be the one leading conversations. You should always avoid having different people from your team reach out to the client whenever they have questions and updates.
This dedicated person should be the one to filter messages, ask questions, and ensure that we don't spam the client too much.

Next, we have changing the team mindset. This is something that it's important to continuously build with your team, irrespective of whether you have an issue with a client or not. It's all about ensuring the team has the right mindset to approach situations when they do arise, and it comes down to giving the team a sense of ownership. Let's use an example from our daily lives. Say you had a baby and your baby cries in the middle of the night. I'd like to hope that you'll wake up to see what's wrong and deal with it. Now, if your baby is not a real baby, but rather a function, a feature, or an instance that you feel ownership of, you'll wake up in the middle of the night to resolve the issue because you understand the wider impact it has on your company. Naturally, you'd do anything and everything you can to make sure that any critical issues are resolved.

Giving teams a sense of ownership is crucial. There is no other developer or team that will jump in to rescue our deployment if something goes wrong. When our teams know and feel the responsibility that they have for the rest of the company, they will step up and do what they need to do in order to overcome challenges. Ownership needs to be complemented by a culture of trust and empowerment. It is important to trust that your team members will do what they need to and have the right skill sets to execute on it successfully. Without trust, you'll see that people could take a step back and stop taking initiative. Giving folks the freedom to explore new ways of solving problems will increase their engagement and sense of ownership.

Finally, a crucial habit that every team should get into is running retrospectives after dealing with an event or an incident.
A retrospective brings the team together and gets them to think critically about what went wrong, how things were dealt with, and most importantly, how they should have been dealt with. What could we have done better? This helps the team to create a prevention plan and continuously learn and improve. Carrying out retrospectives is the easiest way for teams to embrace failure and grow stronger. It lets you plan ahead for similar events before you have to enter into stressful situations again. It essentially means that you won't be bitten by the same snake twice. Within the post-mortem or retrospective process, your team will have a way to share their experiences and learnings with everyone else in the team. Integrating retrospectives into your everyday processes ultimately builds stronger, more cohesive teams. It brings that sense of ownership we've talked about before, and it focuses everyone on how we can improve and learn from our mistakes so that we don't repeat them.

Now, in addition to the processes and behavioral methods that we've discussed up till now, there are a number of technical elements that we can leverage to make sure that the team is well-equipped to deal with customer issues in cloud-native environments. Tools can give you visibility into what's happening now, how your clients are using your products, the level of performance, resource use, and probably most importantly, visibility into what you wouldn't be able to see otherwise. This visibility is critical in order to sustain your ability to serve whoever you need to serve so that they have a seamless experience and don't perceive any outages. Now, these are probably not new to you, but it's important to look into them. After all, any improvements you make will have significant benefits down the line. So let's look into these.
Now, one thing to note is that most of these tools and techniques help us identify issues and resolve them earlier in the development cycle. And that's really important, because the earlier we resolve an issue, the cheaper it's going to be. We're going to have to spend less time and effort fixing it, and we'll avoid more major issues later on. Actually, one study reported that the cost to fix a bug that's found in production can be up to 100 times the cost of fixing it in the design phase. We need to do everything we can to avoid getting bugs out to our customers' environments. And when we do, we need to have the right tools set up to collect the data that will help us solve the issues much faster.

All right, we'll kick things off with QA. Now, this one's a no-brainer, but most of us have a lot of room for improvement. The focus here should be on identifying issues and errors as soon as we can during the deployment lifecycle. So this includes ensuring that you set up and follow a structured and rigorous code review process. It's always good to have a peer review your code and check, from a different point of view, that nothing is missing and nothing could cause problems. Another aspect is conducting penetration tests to identify any issues and to ensure that bugs that you previously identified are resolved. And last but not least is setting up automated tests and tracking and monitoring the performance of your code to prevent the client from experiencing any issues.

Next, we have logs. Now, useful logs can be super helpful for developers, especially when they try to debug code that's not on their machine anymore. And when it comes to customer issues in cloud environments, they're even more useful. With logs, you get real insight into what your code is actually doing in its environment. So how do we make the most of logs?
Well, first, make sure that you follow your company's best practices and structure your code so that your logs match the agreed style. You want to make sure that the log messages are meaningful. That's especially important if someone else needs to look at them, so that they'll be able to understand exactly what is going on. We all know that there's nothing worse than trying to understand a cryptic log message at 3 AM when everything has gone wrong.

Next, you have to focus on quality, not quantity. I know it's super easy to sprinkle the magic log dust all over your code, but having too many logs can actually make things more difficult. It's hard to manually browse through all that clutter to get any value out of it. And probably worse is the fact that if you have too many logs, you could actually impact the performance of your application.

We also need to make sure that we use the right log level. When do we want those log messages to be printed? In what situations? This is a balancing act, and knowing exactly where to draw the line is tricky. It's something that comes with practice, but it will help you filter out logs which may not be so relevant. For example, should you set it to the debug level, which will log just about anything that happens in the program? Or should you use the error level, which will only log the error conditions observed? Finally, it's important to remember that logs can be used for more than just troubleshooting. You can use them very efficiently for auditing, profiling, and gathering statistics, so don't miss out.

Monitoring tools can give you a unique view of how the end users are experiencing your applications. They capture and display information on errors, crashes, and performance issues that users are encountering. It's also important to track specific events, especially if you have released a new feature and you want to understand what the uptake is like.
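
Going back to the logging points for a moment, here's a minimal sketch of meaningful messages and log levels using Python's standard `logging` module. The logger name `user_management`, the `configure_logging` helper, and the `load_user` function are all hypothetical names invented for this example, and the in-memory `users` dictionary stands in for whatever store your service really uses.

```python
import logging
from typing import Optional

# Hypothetical logger name; in practice follow your company's naming convention.
logger = logging.getLogger("user_management")


def configure_logging(level_name: str = "ERROR") -> None:
    """Set the root log level once, e.g. ERROR normally and DEBUG while investigating."""
    logging.basicConfig(
        level=getattr(logging, level_name.upper()),
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
        force=True,  # replace any handlers that were configured earlier
    )


def load_user(user_id: str, users: dict) -> Optional[dict]:
    """Look up a user, logging enough context to be useful to someone else later."""
    # Only emitted when the level is DEBUG, so it adds no clutter otherwise.
    logger.debug("Looking up user_id=%s", user_id)
    user = users.get(user_id)
    if user is None:
        # A meaningful message: what failed, for which id, and in what context.
        logger.error(
            "User lookup failed: user_id=%s not found among %d known users",
            user_id,
            len(users),
        )
    return user


configure_logging("ERROR")
load_user("42", {"1": {"name": "Ada"}})  # emits only the ERROR line

configure_logging("DEBUG")
load_user("1", {"1": {"name": "Ada"}})   # now the DEBUG line is emitted too
```

The idea is that the level is a single switch: you keep it at the error level day to day, and turn on debug only while you're chasing a specific customer issue.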
The reason why proactive monitoring can be really useful when working with customers is that we can often pick up on an issue before the client notices it. Say there's an error somewhere where information isn't being sent over. If we can spot it first through our monitoring setup, we'll be able to fix it quickly, and maybe the customer won't even notice it. On the other hand, if the customer has noticed an issue, you'll still have a lot of information to work with which was collected straight from the customer's environment, and this will be super useful for debugging.

Now, this last section focuses on leveraging tools that will help you when you do spot an issue in the client environment: tools like live debuggers, which enable you to capture debug data from your application while it's running in production. You'll be able to place a breakpoint and start collecting data without having to stop your code. More so, the data that you're collecting is coming directly from the cloud environment itself.

With any of these tools, it's important to do thorough research and understand which tools you can leverage for what. Do take the time to deploy them, learn how to use them properly, and make sure that the team is up to speed as well. Again, you don't want to get to a stage where you're on your holiday, your manager tells you to look into the issue, and only then do you start learning how to use a specific tool.

So, we've covered a lot, and as I hope you appreciate now, there are many things that can go wrong when it comes to tackling customer issues in cloud-native environments. It's a significant and important part of our roles, and it's important to implement processes and use tools which will help you better manage these situations. After all, bugs are inevitable, but at least you'll be able to do all the groundwork early so that you can enjoy your next holiday more. It's been an absolute pleasure sharing with you my thoughts on this ever-important topic.
If you have any questions or want to continue the conversation, please reach out.