Hi everybody, my name is Radha Kumari and I'm a staff engineer on the Demand Engineering team at Slack. We look after ingress load balancing, TLS/SSL, DNS, CDN, and service networking, among many other things. In 2019, our team decided to migrate our entire ingress load balancing tier from HAProxy to Envoy, mainly due to some operational overheads with HAProxy and some great features that Envoy offers. This talk won't go into more detail about why we decided to migrate and how we did it; we wrote a detailed blog post on that, so please see the link below. Two years and 15 days later, this migration was officially marked as done. In the next few minutes, we're going to talk about some of the oops moments during these migrations, the steps we took to troubleshoot and mitigate, and a few key takeaways towards the end. So, let's get started.

During our web load balancing tier migration, we noticed an elevated 5xx rate from our web API endpoints. This was detected internally, and we started digging. Retries were among the first things we looked at, and I was sure we had retries configured on connect-failure in our config. With that assurance, we moved on and tried a couple of other things to get that 5xx error rate down. Unfortunately, none of those helped. One fine day, one of the team members noticed that there was a typo in our configuration: instead of retrying on connect-failure, with a hyphen, we were retrying on connect_failure, with an underscore. Now, the question is, how did that change get past our testing and code review? In the version of Envoy we were running back then, the retry_on config field was just free-form text, and Envoy would silently ignore any unknown value like connect_failure. Every time we change anything in our config, we run a validation to ensure the new config is valid before we replace the old one, using Envoy's validate mode. Since Envoy ignored the unknown value and considered the config valid, this change got past our testing. To add to that, we had the same typo in other places, and our tests were based on that same wrong string, which is another reason our tests failed to catch this. Envoy wasn't doing any validation of this config value; this behavior was later fixed in Envoy version 1.11.0. So, we corrected the typo in our configuration to fix the regression (there's a small sketch of what the corrected retry policy looks like below), and yes, we also fixed our tests. To ensure this never happens again, we also fixed this in the library that generates our Envoy configuration, by explicitly specifying the list of all the strings that Envoy supports and making the library return an error on any unknown value.

Next up, this was during our internal web API tier migration. We had just finished migrating 25% of the traffic to Envoy in an AWS region, while the rest was still going to HAProxy in the same region. About 12 hours into that change, our customer experience team got a few tickets from one of our customers saying that they were getting HTTP response code 404 for some API endpoints, but only for 25% of the requests they were sending. This was soon escalated to us, and we could see the errors matched the timing of the rollouts we had done. Upon digging, we found that the client was appending the default SSL port, 443, to the HTTP Host header, which is uncommon but not unsupported. We hadn't accounted for this behavior during the migration. Our edge Envoys were configured to match only on the hostname and not on the hostname with the appended port, resulting in 404s, as Envoy was unable to find a route match for this type of request.
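Going back to the retry typo for a moment, here is a minimal sketch of what such a retry policy looks like in an Envoy (v3) route configuration. The virtual host, domain, route, and cluster names are made up for illustration; this is not Slack's actual config.

```yaml
# Illustrative Envoy route configuration fragment (names are made up).
route_config:
  virtual_hosts:
    - name: web_api
      domains: ["api.example.com"]
      routes:
        - match: { prefix: "/" }
          route:
            cluster: web_backend
            retry_policy:
              # "connect-failure" (hyphen) is the value Envoy understands.
              # A typo like "connect_failure" (underscore) was silently
              # ignored by older Envoy versions, so no retries happened.
              retry_on: connect-failure
              num_retries: 3
```

The guard in the config-generation library amounts to an allow-list: only retry_on strings that Envoy actually supports are accepted, and anything else is rejected at generation time instead of being silently dropped at runtime.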
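And here is a rough sketch of the virtual-host matching problem behind the 404s just described, again with made-up domain and cluster names. Envoy matches virtual host domains against the Host/:authority header exactly as the client sends it, so a bare hostname entry does not match a header that includes the port.

```yaml
# Illustrative virtual host; domain and cluster names are not Slack's.
virtual_hosts:
  - name: web_api
    domains:
      - "api.example.com"        # matches "Host: api.example.com"
      - "api.example.com:443"    # needed to match "Host: api.example.com:443"
    routes:
      - match: { prefix: "/" }
        route: { cluster: web_backend }
```

Listing both forms is the change described next; newer Envoy releases also added a connection-manager option to strip a matching port from the host header before route matching, which would be another way to handle this.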
Now, how did this work in HAProxy? The answer is simple: we didn't have any host- or authority-based routing rules there; it accepted everything. So, we rolled back the migration to ease the customer's pain, and later made a change in our edge config to match on both host and host:port.

Next up, this was by far the most difficult issue we've had. Our desktop team noticed 12% extra latency for every API request across all Slack clients. This issue was escalated to us, and we started investigating. We tried a couple of things in terms of timeouts and such, but nothing worked. So, we enabled tracing at the Envoy layer. This was the turning point: from the traces, we discovered that the extra time was being spent before request processing even started, and TLS happens before that. With that information, we started looking at all things TLS in our configuration and noticed a bug in our TLS session ticket keys. For those who don't know what a TLS session ticket is, it's a mechanism that allows clients to reuse TLS sessions when they reconnect within a short period of time, typically a few hours. It is used to speed up TLS handshakes, which improves end-user latency, especially for customers connecting from the Asia-Pacific region in the case of Slack. I'm not going to talk about how to set it up and how to use it in a secure way, as that's out of scope. So, going back to the issue I mentioned a minute ago: we noticed 12% extra latency for every API request, and we traced it down to a bug in our session ticket keys configuration. The bug goes as follows: sometimes the keys that we were using to generate these session tickets were not synchronously rotated across all Envoys. This means a request would sometimes come in with a ticket based on a key that the Envoy serving it didn't know about yet, and therefore required a full TLS handshake. Ensuring synchronous key rotation across all Envoys fixed the latency issue.

With that, let's look at some of the things we learned along the way. Don't underestimate the power of retries and timeouts; we have mistakenly missed these in every migration. First and foremost, aim for parity. Parity is more important than added features; it reduces the number of variables your rollout depends on. Keep the rollback or revert plan fast and simple. Fast rollbacks are more important than getting things right the first time. Preparing a rollback pull request that is tested, reviewed, and approved can do wonders sometimes. Our edge API load balancing tier migration was reverted around eight times, and even though it was a manual process of merging the revert PR and running a Terraform pipeline, it only took a couple of minutes. Evaluate risks and manage expectations accordingly. In any big migration, there's always a risk of causing outages and incidents, and those are always disruptive, especially for customers. Know that you will break things and make sure this is communicated to the business during the initial planning phases. The business or organization needs to be aware of why this migration is happening, how it benefits the organization, what the revert plan is, and how long a revert takes. This makes rollouts stress-free, because the last thing you want is having to justify the risks or the migration itself in the middle of an incident. I love curl. It is such a powerful tool. More than 95% of the configuration features were tested using curl, ranging from routing rules to TLS and much more. But remember, it's not the one tool to test all the things. With that, thank you so much for listening.
Hope you learned something from our takeaways. And we're hiring. Have a nice day.