Hi, and welcome to this presentation, Replatforming a $4 Billion Retailer onto Kubernetes and Linkerd. My name is Terrik Lingeberg. I work as a developer at a small consultancy in Norway, where I try to help modernize organizations and move teams onto Kubernetes, for example. This presentation is about the largest electronics retailer in the Nordics, Elkjøp Nordic. I was hired at the time by their Enterprise Cloud Solution Architect, Henry Hagnos, and together we built the platform and moved a significant part of Elkjøp Nordic's applications over to Kubernetes. In this presentation I'm going to tell you how we, as a small team, built a new platform while rapidly onboarding ourselves and the developers, and how using a service mesh was, in retrospect, a really good decision. I also have some lessons learned. You can find me in the usual places, like Slack, LinkedIn, and Twitter, and I try to blog a little bit as well.

So what are the goals for this presentation? What should you be left with? By listening to our story, you will hopefully pick up some tools that you can use on your own journey when building a Kubernetes-based platform. You will get some ideas and arguments for why investing in a service mesh could be a good idea, and learn how to onboard an organization with respect to the developers. You will also get some tips and tricks on how to handle developer managers who are more interested in keeping things as they are.

Just to set the stage: Elkjøp is a large company in the European sense. It consists of around 400 physical retail stores, has a large online presence, and holds over 25% market share. It sells electronic goods and can be compared to Best Buy in the States. Before the Kubernetes transformation project, Elkjøp had a microservice platform based on Azure App Services. It was stable and it was popular. But when Elkjøp started a large modernization project, the platform had to handle more load and run more services.
Obviously that increased the cost and operational overhead of the platform, and with one microservice per Azure App Service, the management of those was starting to get out of control. The Azure App Service hosting cost in 2020 was around 450,000 US dollars. So even though the Azure App Service platform was good, we wanted to get even better, so that someday we might become the best.

Not to give away the whole ending here, but after moving to the Kubernetes-based platform, we got better scaling of the applications, because we can bring them up faster and we can utilize technologies such as the Kubernetes event-driven autoscaling project, KEDA. We got better performance, as we could, for example, deploy applications that communicate with each other closer to each other. It was a better developer experience, where developers now had full control over their dependencies, for example. Operations also got easier with Prometheus and other visualization tools. We became more robust against disasters and had a better disaster recovery plan. Maybe the most impressive metric was the hosting cost: we were actually able to cut 75% of it, and that is without taking into account the savings we get from improved developer productivity.

Going back to before the Kubernetes project: Henry had the insight to see where this was heading and got a Kubernetes project approved by the Elkjøp board. I was lucky enough to be hired as a consultant, and together we started building the platform. First off, we thought we could look at the CNCF homepage for tips, and were pleased to see a link to a landscape page. So we thought, great, this should be pretty easy then. And then we saw this. So it's probably not that easy. We quickly realized that we needed some guiding principles for what technology to choose. This is not an exhaustive list, but these are maybe the most important ones.
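To make the KEDA part concrete: KEDA scales a Deployment on event-source metrics, such as queue depth, rather than CPU. As an illustrative sketch only (the deployment name, queue name, and thresholds are made up, not Elkjøp's actual configuration), a ScaledObject driven by an Azure Service Bus queue might look like:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
spec:
  scaleTargetRef:
    name: order-processor        # the Deployment to scale (hypothetical)
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: orders        # hypothetical queue
        messageCount: "50"       # target queue length per replica
      authenticationRef:
        name: servicebus-auth    # TriggerAuthentication holding credentials
```

With this in place, KEDA adds replicas as the queue grows and scales back down when it drains, which is what makes the faster application start-up times pay off.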
So first off, we wanted to embrace an aspect-oriented programming model and move as much as possible of the cross-cutting concerns over to the platform. One such example is mutual TLS: instead of putting checklists and QA on the developers to make sure they did certificate rotation and signing correctly, for example, we could just let them get it for free by deploying their applications onto the platform.

We wanted to create a pit of success. By that I mean it should be really easy to do things correctly and really difficult to do things wrong. This also ties into the aspect-oriented programming model, in that you should get as much as possible for free.

The biggest cost in software projects is often maintenance, not development. It is good if a technology has a low learning curve, but if that means horrible day-to-day operations, then it's really not worth it. We were not interested in saving a week of development time in return for a year of extra maintenance time.

We wanted to introduce technology to solve actual problems without introducing even bigger ones. For example, if getting mutual TLS means a really big maintenance job for a service mesh, then it's probably not worth doing.

We wanted to minimize the friction for initial adopters, to make our platform fun and easy to use, so that it would be something the developers want to use instead of something they have to use.

We also decided to use very clear language, with terms that all the teams could agree upon. Typical examples of words you use when onboarding teams are application, system, and service, so we wanted to be very clear about the context and the meaning of those words in that context.

Finally, we wanted to have everything as code, or as much as possible at least. So we had load testing as code, system configuration as code, alarms as code, and obviously infrastructure as code.
And of course, all of this code should be version controlled.

We built up an SRE team from Linux admins along the way as we were building the platform. They were handpicked, we gave them a lot of responsibility for supervision from the beginning, and they really shined. It was a good decision: it is much easier to do maintenance and operations now that they have been a part of actually building the platform.

During the implementation of the platform, now with an operations team that had grown in size, we saw that we needed better version control of the platform itself, and this is where we introduced Flux. We used Flux to manage the entire platform. We did not get as far as using it for the individual applications running on the platform, but that is hopefully something we can do in the not so distant future. By moving the configuration into Git and keeping it in YAML files, the initial infrastructure as code became rather small: we had a bash script to do the initial bootstrapping of the clusters on our hosted platform; it would start Flux, and Flux would then pull in all of that configuration and make Kubernetes move towards the desired end state.

This is not an exhaustive list either, but these are some of the tools we chose for our platform. Obviously the tools you choose for your platform could be totally different; I basically just added them for reference, for those who are curious.

For the onboarding, we quickly found some champions among the developers: developers who were keen on using Kubernetes and our platform, and as a consequence had a high tolerance for initial bugs, inconsistencies, and missing documentation. Those champions were a good proxy towards the other developers, who were a bit more skeptical. They gave us valuable feedback on what was missing compared to how they were working at the time, for example.

We also wanted to treat documentation as a first-class system.
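The bootstrap-then-reconcile pattern described above can be sketched with today's Flux (v2) resources; this is an assumption about shape, not the team's actual manifests, and the repository URL and path are placeholders:

```yaml
# Flux watches a Git repository holding the platform configuration...
apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: GitRepository
metadata:
  name: platform
  namespace: flux-system
spec:
  interval: 1m
  url: https://git.example.com/platform-config.git   # placeholder repo
  ref:
    branch: main
---
# ...and continuously applies a path within it, pruning anything
# removed from Git so the cluster converges on the declared end state.
apiVersion: kustomize.toolkit.fluxcd.io/v1beta1
kind: Kustomization
metadata:
  name: platform
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: platform
  path: ./clusters/production
  prune: true
```

Once a bootstrap script installs Flux and these two resources, everything else lives in Git, which is why the hand-written infrastructure-as-code footprint stays so small.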
By that I mean it should, of course, be version controlled, there should be QA on the documentation, and it should be a collaborative product among the developers as well. So we would give a link to the documentation to an appointed champion, get feedback on what was missing or confusing, and correct it together with that developer. Then the developer would move from being a student to being a teacher, helping another developer do the same. After a while of doing this, the developers had a stake in the migration project and had written some documentation themselves.

We also wanted to create templates and processes that were as good as possible. You only get one shot at a first impression, so we wanted to make it count. Ultimately, the platform should be something developers want to use instead of something they are forced to use, so that is what we worked towards.

Something we often heard from the developer managers was: why bother with this? This has not been a problem before. Well, even though that's true, it's no guarantee that it will stay that way. To tackle this, we did extensive load testing to show that it actually could become a problem. We also explained that we were on a mission to make things better, and that we were not interested in changing things just for the sake of it, to keep them at ease. We explained how we valued their input and concerns, and we tried to reflect those concerns and questions in the documentation and in the presentations we held internally.

We also took onboarding as a great opportunity to focus more on reliability and alarms for the teams and their applications. We created a baseline of alarms as code with Prometheus Alertmanager and made it part of the platform. That way we would automatically get a set of alarms for the applications as they were being onboarded. It actually resulted in us finding a lot of bugs that had previously been hidden behind Azure App Services.
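A baseline of "alarms as code" like the one described can be expressed as a PrometheusRule shipped with the platform. This is an illustrative sketch under the assumption that the Prometheus Operator is in use (the rule name, threshold, and label values are made up):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: baseline-app-alerts      # hypothetical platform-wide baseline
  labels:
    role: platform-baseline
spec:
  groups:
    - name: baseline
      rules:
        - alert: PodCrashLooping
          # fires for any onboarded workload whose containers keep restarting
          expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"
```

Because the rule matches on generic kube-state-metrics series rather than anything application-specific, every application gets it automatically the moment it lands on the platform.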
We defined a clear escalation path and policy, where we used the native integration between Prometheus Alertmanager and Atlassian Opsgenie. We as a platform team would act as a gatekeeper and push teams to create alarms that were more domain specific.

One of the critical design decisions that we needed to get right in the beginning was whether or not to introduce a service mesh. Did we really need to bring in even more technology? Isn't Kubernetes complicated enough? Henry gave me a week to experiment and see if a service mesh could solve two important problems: we wanted to have all traffic encrypted with mutual TLS, and we wanted insight into the traffic and the applications in an aspect-oriented way, without a lot of configuration. We looked at the usual suspects, like Istio, Consul, and Linkerd. The one that was, at the time, most aligned with our principles was Linkerd. It focuses on exactly the two problems we wanted to solve, and it was backed by the CNCF. It was also good that they embraced the Service Mesh Interface (SMI) specification. We quickly saw that they had really good documentation on day-two operations, and the community turned out to be second to none; we got really good and quick help whenever we needed it.

During load testing of the sales portal that all sales clerks in all stores would be using, we had a rather unpleasant surprise. It turned out that the application wasn't performing at all on our platform. This application was developed externally but was supposed to run on our platform, so we could not see inside it from Kubernetes; from the outside we would only see the requests coming in. This was the first case where Linkerd and its insights really saved us. By carefully looking at the metrics, we were able to identify a bug in the sales portal: it was basically not reusing sockets for outbound requests, and we could see it in the number of TCP connections being established.
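A connection leak like that shows up directly in the TCP metrics the Linkerd proxy exports to Prometheus. As a hedged sketch of the kind of query involved (the `sales-portal` namespace is a made-up stand-in, and the exact label set depends on your Prometheus scrape configuration):

```promql
# Rate of newly opened outbound TCP connections per pod.
# A client that reuses sockets keeps this near zero under steady load;
# a client opening a fresh connection per request shows a high, sustained rate.
sum by (pod) (
  rate(tcp_open_total{direction="outbound", namespace="sales-portal"}[1m])
)
```

Comparing this against the request rate for the same pods is enough to spot the pattern: if connections are opened roughly as fast as requests are sent, sockets are not being reused.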
So where is Elkjøp Nordic today? They're basically riding into the sunset with zero bugs and record sales each and every day. Well, maybe not record sales every day, but the first Black Friday on the platform was a really critical test. We were really curious, and pleased when, after Black Friday, we saw that the platform had performed with zero bugs and had actually delivered record sales for Elkjøp Nordic.