This story takes place at a company previously called Annotell and now Kognic. We rebranded about ten months ago, so if you've heard of us before, you probably heard of us under our previous name. We are a SaaS company and we strive to make machine perception possible for autonomous vehicles. We were founded back in 2018 and our headquarters are in Gothenburg, Sweden. Right now, and I say approximately because honestly we're still growing so it could have changed since I last checked, we're around 90 employees, out of which 45 people are working in engineering.

Who am I? Why am I here? I'm the product area lead for Engineering Enablement, which is what we call our platform function, working with developer experience. I'm also a CNCF ambassador and involved in the Cloud Native Nordics community, which is a group of meetup groups in the Nordics, and I'm located in Gothenburg. You can see the small arrow. I'm also a local meetup organizer in Gothenburg.

I'm going to tell you the story of how, since I joined in June 2020, we started working strategically to try to revolutionize developer experience with platform engineering. I'm going to tell the story of how we used specific projects to help us with that: I'm going to talk about external-dns, cert-manager, and Argo CD, and I'm going to take the story from there up until 2022.

When I joined, infrastructure and platform work was happening on a need-to-happen basis. A lot of our infrastructure had been spun up by one of our co-founders, and it was done to the point of: oh, it works, great, let's move on to more important things. So when I joined, there was no one really working on these things. There was this guy that used to fix things when they broke. You probably recognize the story. The first thing I did was to try to upgrade Kubernetes, apply security patches, these kinds of things. But most of all, I just tried to assess: look at what we had, look at what the needs were, and try to figure out where to start.

So, developer experience, usually called DX as well. You've probably heard a lot about it this week, and I think it's really fun to hear all the conversations going on about it. You might wonder, why do I care about developer experience? Well, I used to be a developer myself. I started out as a developer and then I moved through the stack. And as a developer, it's great when the developer experience is great: someone has already solved the problem for you, and you can focus on the things that you care about. We have these common needs as developers. Regardless of whether you're building a Python application, a SQL application, whatever it might be, there are things you need to do, problems that you have in common. Some of the things that our developers care about are CI/CD, load balancing, container runtime (spoiler alert: Kubernetes), metrics, logging, et cetera. But what the company cares about is the business logic. This is what differentiates each team; this is what each developer is actually trying to achieve. All the other things are not necessarily evil, but they are needed in order to achieve that. So the less time we spend on these white boxes, the more time we can spend on the purple box, and the more time we can actually spend on what makes our company stand out against the competition. How do we work with this?
We work with this with the golden path, the paved road. Hands up, who has heard about the golden path before? The paved road, same same, right? There's actually a really good talk from the last Kubernetes, sorry, not Kubernetes, KubeCon + CloudNativeCon in Valencia, from Daniel Bryant, called "From Kubernetes to PaaS to... Err, What's Next?". Try that title. I recommend watching that talk; there will be a link at the end, on the last slide.

So what we are looking at is that we're trying to build a platform. You saw it in the keynote this morning: we try to build a platform that solves all these common needs, building on all the best practices, so our developers have an easy way of actually achieving all the things they need without having to understand all the underlying technologies, because let's face it, there's a lot. So in the end, to reiterate: why do we have developers? We have them because we want to create business value. And the more time they can spend on that, the bigger the chance we have to actually succeed as a company. And that's what we care about.

So, I told you I was going to tell how we, three and a half platform engineers, revolutionized developer experience. How does being a small team impact this? Because the paved road, you've heard other companies talk about it, and you see a lot written about it everywhere. But some of these companies, honestly, they are really big, and they have hundreds of teams working on platform engineering. We are three and a half people. How does that differ?

Some uncomfortable insights that we have come to. One: you can't do all the things. You hear about all the cool things, but you can't do everything. You have to be conscious of where you spend your time. That's why, when we started out, we did not try to fix everything right away. We tried to figure out: what is causing the most pain? Where can we make the biggest impact? And sometimes it's actually valuable to not do something. Because when you pick something up, you have to keep working with it. You have to keep it available and you have to keep maintaining it, all the time. Which means that if you pick up something that might not give such high value to the developers, you still have to maintain it, and it still takes time away from other things you could do. So sometimes it's valuable to say: this is not worth the investment, we should not do it. Upgrades take time, and they are frequent. And then there are breaking changes, and then you have to upgrade other tools, change how you do other things, roll out new API versions, et cetera. It takes time. The more tools you take in, the more time you will spend maintaining and upgrading them continuously. And you probably don't need a service mesh. Joking aside, service meshes are great, but they might not be the solution to your problem. Just because someone else needs a service mesh, it doesn't mean that you need a service mesh. The service mesh is just a concrete example here; it could be anything. I don't need a service mesh. If you want to talk about not needing a service mesh, you can come talk to me afterwards.

So what has our approach been? What did we do? Now it sounds like I'm going to give you the most mind-blowing new thing, but honestly, it's not that complex. It's just: remove time-consuming tasks and bottlenecks. This sentence is literally what we have lived by for the last few years.
First: focus on unblocking the developers. If a developer is blocked and needs to wait for someone else in order to be able to do the things that they want to do, that's a bad thing for us. So our main focus is to unblock developers and make sure they can always self-serve to reach their goals. The second one is removing time-consuming tasks from the platform team. For the platform team, as I mentioned, there's a lot: upgrades, time, et cetera, et cetera. So make sure the team doesn't spend time on things they don't have to spend time on. And once you're done with that, you can move on to removing time-consuming tasks from the developers. The reason why the team comes before the developers is that a busy team will not be able to remove time-consuming tasks from the developers. So first you have to free up time for the team, and then you can focus on the developers.

I promised some concrete examples, and I'm going to take you through external-dns, cert-manager, and Argo CD. Let's start out with external-dns. This, for us, was something we implemented and started to use in order to unblock our developers. As a startup, and probably for everyone really, what you do is try to find the next big thing, right? So you build something, you want to test it, and you want to evaluate it. And you want to do that as often and as quickly as possible, to see if you built the right thing. For us, this often means pushing something to production. And pushing to production often meant that you wanted to expose it somehow, usually with some kind of DNS record.

So the process for the team was that they built the application, then they deployed the application with a LoadBalancer service type, so they got an IP for it. And then they went with that IP to one of the two people who could actually configure the DNS record in Cloudflare, and then they waited. They sent a Slack message: hey, can you help me set this up? I want this DNS name and I want it to point to this IP. And then they waited. And waiting is not good, right? That's a blocker. Eventually the DNS record was configured, and they could go on with their day and keep developing. We wanted to remove that blocker.

So what we did was look at external-dns. For us, it was extremely easy to install: basically it was just getting a Cloudflare API token, adding it to the configuration, and getting it up and running. external-dns tracks ownership with TXT records. This is an example of how it can look in Cloudflare: we have the A record for the application, which is hello-argo, something I found running, and then the corresponding TXT record with content that points out that external-dns is the owner. Why am I showing this? Because when we migrated, we also wanted to start tracking all the existing services with external-dns, so that when someone removed something, the corresponding DNS record was removed too, and we never had to care about manually configuring DNS again. So what we did was manually add an annotation to each service that told it which DNS record it should have. Then we faked the TXT records in Cloudflare and pretended that external-dns had added them. And then we deployed external-dns to the cluster, and honestly, it worked.
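To make that concrete, here is a minimal sketch of what the self-serve flow looks like from the developer's side. The app name, hostname, and ports are made up for illustration; the annotation itself is the standard external-dns one:

```yaml
# A developer exposes their app; external-dns sees the annotation and
# creates the A record in Cloudflare. Names and hostname are examples.
apiVersion: v1
kind: Service
metadata:
  name: hello-argo
  annotations:
    external-dns.alpha.kubernetes.io/hostname: hello-argo.example.com
spec:
  type: LoadBalancer
  selector:
    app: hello-argo
  ports:
    - port: 80
      targetPort: 8080
```

Alongside the A record, external-dns writes a companion TXT record whose content looks something like `heritage=external-dns,external-dns/owner=<your-txt-owner-id>`, and that ownership marker is what we faked in Cloudflare for the existing records during the migration.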
If I'm not wrong, I think this took us about a week to do. And that included figuring out what was running, what was configured, some custom shell scripts. No one had been there before, right? And in the end, it was a great success. So this is what the process looked like afterwards: instead of getting the IP, asking for DNS and hitting the blocker, you just built your app, you deployed it, you got the DNS record, and you kept developing. The return on investment on this one was really good. In the end, this turned into total self-service with very little education. Team members and developers could easily set up their own DNS records. It was super easy to use, and they were no longer blocked. If we look at this kind of matrix, you've probably seen it before: we have the effect, that is the outcome, what the result was, and the investment along the top. If you have a low effect and a low investment, then you have a medium return on investment, that kind of thing. In this case we had a low investment, as I said, about a week, and it has been running with very little maintenance over time, with high appreciation from the developers. So the effect has been really high, and the return on investment therefore very high.

Next up is cert-manager, and this was one of those remove-time-consuming-tasks-from-the-team things we were looking at. The process before was not great. It had some peculiarities that were very contextual to where we were. For some reason, we had decided to terminate SSL inside the container itself, meaning that certificates were loaded as a Secret into the deployment via a volume mount. Our certificates expired every 90 days, which meant that basically we were renewing them every 60 days, because we didn't want to do it at the last minute. We first had to use a custom shell script to create a new certificate: something magical happened in the terminal and you got a new certificate, great. Then we manually updated the Secrets in Kubernetes to contain the new certificate. All great, except that deployments don't pick up a changed Secret unless they are redeployed. So then we also had to do a rolling restart of all the deployments. I mean, it's a pain. It was not great. It was not fun. And also, you had to have this calendar event every other month reminding you about the certificates, and it was always up for discussion. It took a lot of cognitive load from the team.

So we were looking at cert-manager. Once again, I would say it's very easy to get up and running. Once again, Let's Encrypt and Cloudflare: we had to connect those to cert-manager. We pulled the resources from the official release, and you get a lot of cluster roles, et cetera, and lots of custom resource definitions from that. And then we configured a ClusterIssuer. A ClusterIssuer is a configuration that says, more or less: how to connect to Let's Encrypt, how to connect to Cloudflare, and how you want to configure your DNS zones. It also plays very nicely with Ingresses. Sometime around here, we also started using Ingresses instead of LoadBalancers for our services, but that's another story. It plays very nicely because we no longer had to do the rolling restarts of all applications. So in the end, this is what the process looks like for certificate renewal. And this is all I have on this slide, because we do nothing. It's great. It takes zero effort.
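I didn't show the actual configuration on stage, but a minimal sketch of a ClusterIssuer wired up to Let's Encrypt with the Cloudflare DNS01 solver looks roughly like this; the email, secret names, and DNS zone are assumptions:

```yaml
# Sketch of a ClusterIssuer: connect to Let's Encrypt, solve DNS01
# challenges via Cloudflare. Email, secret names, and zone are examples.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: platform@example.com
    privateKeySecretRef:
      name: letsencrypt-account-key  # where cert-manager stores the ACME account key
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token
              key: api-token
        selector:
          dnsZones:
            - example.com
```

With the Ingress integration, annotating an Ingress with `cert-manager.io/cluster-issuer: letsencrypt-prod` is enough for cert-manager to request the certificate, store it in a Secret, and renew it automatically, which is why the rolling restarts disappeared.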
So in the end, we removed the manual process, we have less cognitive load, and there's no rolling restart of all the things. Great, right? Where do I put this? It was a fairly low investment. Honestly, it's not a huge effect either; it was not something that took a lot of time from us, but it was annoying, and it was really nice to get rid of. So I would say a medium return on investment.

Next up is removing time-consuming tasks from the developers, and we looked at Argo CD for that. This is about deploying new versions of applications, and also deploying new applications, right? So the deployment process before, and this is once again where we started out: we had the build pipeline, which posted the image tags that were built to Slack. The developer picked up the tag from Slack, went to the terminal, and ran a magic script. I'm saying magic because it was magic to them. It did kubectl set image against the cluster, live-setting the image, which triggers a rolling restart. Then it ran kubectl rollout status to see what was going on. And then it posted to Slack that this deployment had now been updated with the new image tag.
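I never showed the actual script, but a sketch of what such a magic script might have looked like, with the registry, names, and Slack webhook all made up, is something like this:

```bash
#!/usr/bin/env bash
# Hypothetical reconstruction of the "magic" deploy script.
# The registry, container name, and webhook URL are assumptions.
set -euo pipefail

APP="$1"   # deployment (and container) name, e.g. hello-argo
TAG="$2"   # image tag picked up from Slack

# Live-edit the image on the running Deployment; this triggers a rolling restart
kubectl set image "deployment/${APP}" "${APP}=registry.example.com/${APP}:${TAG}"

# Wait for the rollout to finish (or fail)
kubectl rollout status "deployment/${APP}" --timeout=300s

# Announce in Slack that the deployment was updated
curl -sS -X POST -H 'Content-type: application/json' \
  --data "{\"text\": \"${APP} deployed with tag ${TAG}\"}" \
  'https://hooks.slack.com/services/T000/B000/XXXX'
```

Note that nothing here touches Git: the change exists only in the live cluster state.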
So this was good. It worked well for a very long time, but when we started growing, introducing new developers, doing more, maybe not exactly similar-looking deployments, et cetera, we got to a point where it wasn't really that great anymore. One of the problems was that it was a black box for the developers. They didn't know what happened in the script. They got some output, and sometimes it said that something went wrong, but not what. They went to the live version of the application, could see that the change had not been applied, and didn't know why. And they came to us and asked: what's going on? Something happened when I deployed. It's broken. When that happened, it was really hard for them to debug, and they also had kind of limited knowledge about the underlying Kubernetes resources. They weren't really aware that they were actually live-editing the image in Kubernetes. And this one is the one that kept me awake at night: we had no version control of what was running in Kubernetes. If someone accidentally deleted a deployment, there was no good way of getting it back. It could also be scary for people to make changes. If you were trying to change the amount of resources an application was using, whatever it might be, it was kind of hard, because there were some YAML files in Git, but they were not matching what was actually running in the cluster. So if you wanted to change that and then apply it, you could maybe overwrite something that had been live-edited, et cetera. There was a lot of uncertainty in managing your Kubernetes resources. And also, it was hard to know what was deployed and why. Because this Slack channel, I mean, there were so many new messages every day, I don't know how anyone used it.

So what did we need instead? We needed transparency. We needed to spend less time debugging failures, and we needed resilience: knowing that we could restore our systems in case something happened. So where did we look? We looked at GitOps, obviously. And if you look at the definition of GitOps, there's actually a really good web page that I forgot to link right here; I hope it's coming, otherwise it's on the last slide. GitOps should be declarative: a system managed by GitOps must have its desired state expressed declaratively. It should be versioned and immutable. It should be pulled automatically, and it should be continuously reconciled. Oh, there's the web page, great: opengitops.dev, check it out. I think it was launched at the last KubeCon. It's a great resource.

So what did we do? We looked at Argo, of course. I mean, everyone loves a good YAML file. At least they are declarative, right? And the same goes for Kustomize and Helm charts, et cetera: it's declarative, you can use that. Putting everything in GitHub allows us to have it versioned and immutable; and if you put branch protection on the main branch, no one can overwrite it, et cetera. Argo CD pulls automatically from GitHub, and then it continuously reconciles. And there's a link to the Argo project.

So what we did first was migrate, because as I mentioned, something had been pushed to Git, but it was not really what was running in the cluster. We started out by trying to figure out what was actually running, what was in the cluster. Then we continuously exported that back to YAML files, trying to figure out when we would reach a state that was stable enough. We added that to a GitHub repo, and then we started setting up Argo CD with sync disabled. Argo CD with sync disabled means that you will see what is mismatching, but it will not try to apply anything to the cluster, which is great when you're starting out and don't want to overwrite something. Then we announced a migration window. We could probably have done it live, but at the same time, we figured this was going to be a one-off. We were, at this time, I guess 33 developers or something; they could live with a two-hour window where we said: please don't deploy, because it will mess things up. And then we turned it on, and it was a success. Or, well, I'll get back to that. So what did the process look like afterwards? Get the latest tag and push it to Git. And that's it. And then it works. I mean, it just works, right?
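For reference, a minimal sketch of what an Argo CD Application can look like once sync is enabled; the repo URL, path, and namespaces are assumptions:

```yaml
# Sketch of an Argo CD Application. Repo URL, path, and names are examples.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: hello-argo
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/deployments.git
    targetRevision: main
    path: apps/hello-argo
  destination:
    server: https://kubernetes.default.svc
    namespace: hello-argo
  syncPolicy:
    # Omit this block to run with sync disabled, as we did during the
    # migration: Argo CD then only shows the diff without applying it.
    automated:
      prune: true     # delete resources removed from Git
      selfHeal: true  # revert live edits back to the Git state
```

So deploying a new version really is just changing the image tag in the repo and pushing; Argo CD notices the commit and reconciles the cluster.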
In the end, we got a very transparent process. It was much easier for developers to see what was going on, what happened, and what steps were involved. The negative part was that they had to learn a few things they didn't have to care about before. They had to know that there's a deployment configuration, that there's an image tag field, and all those things. But in the end, exposing that to them actually empowered them to know what was going on, and got us to a state where it was easier for them to manage. We had resilience; we were able to restore in case something happened. And dammit, it was easy to use. Kinda. Actually, I want to go back a slide here. When we rolled this out, this was not a one-week effort, I can tell you that. It was a little bit longer. It took a lot of time. We tested a lot, because we wanted to make sure we were not accidentally deleting production, which is something you really want to be sure about when you make these types of changes. So it was kind of a high investment; we spent a lot of effort trying to get it right. But in the end, it was also a really high outcome, which is the reason for this slide, because I think at the time it was a medium return on investment. But with time it only grows, because we don't spend that much time maintaining it and keeping it alive, yet it continues to deliver really high value.

So let's talk about that star on "easy to use". Because while the technical solution was probably a really good one, and I think it worked really well, we did not consider the cultural parts of rolling it out. To some extent, our developers were not that happy when we said: stop using your magic script and start using this process instead. And by the way, you have to learn a lot of things. They were not that happy. With the rollout, we should have spent a lot more time and effort on creating buy-in with the developers, to make them more aware of what they would gain and why we were doing these things. Because I think, from their perspective, it was: the platform team says we should do this, and now we have to learn a lot of things. Whereas we should have told them: you will get all these things, and these are the reasons why we're looking into this; we think it will be valuable for you. We missed that. So whenever you make these types of changes, it's important to consider the people you affect with them. In the end, it went well. And this is a note from one of our developers, who said: "Kudos to the platform team and the work on making my life easier as a developer. Argo CD together with automatic DNS and all that good stuff really helped me today." And this is what we work for. This is what we are striving for. This is what we want to achieve. So I think that in the end, it's a success, even if the rollout could have been better planned.

So these are the things we can more or less stop thinking about: DNS records, certificates, and deployments. And then, once again, the small star. What's up with the stars? See: rebranding. When you rebrand from one name to another, you have to update all your DNS records, all the look and feel of your application, all the things, and suddenly you actually have to care about these things again. But using external-dns, using cert-manager, and using Argo CD made it a lot easier for us. That's a talk in itself, and I'm actually thinking of writing that one, but not today.

So, our strategy as a platform team has been to focus on unblocking developers, removing time-consuming tasks from the team, and removing time-consuming tasks from the developers. And that's all I have for today. Thank you. Please leave me feedback. Thank you.