Welcome, everyone. Thank you for joining our session. This talk is about our approach to platform engineering for manufacturing at the LEGO Group, and we will dive right into it. So let's talk about LEGO bricks. First off, a quick show of hands: how many of you have ever built a LEGO set before? Nice. Thanks. So the LEGO Group is a toy manufacturing company. We produce a wide range of different toys, like the ones you see here, and we produce these at our factories around the world.

What you see here is one part of the manufacturing. These are molding machines, the actual machines that produce the LEGO bricks and elements, and when the fresh bricks come out, they end up in the baskets or boxes you see at the very end of the machines. If you look at this, maybe something stands out to you: there are not a lot of people in these pictures. Automation is a big part of manufacturing at the LEGO Group, and it is all IT dependent. Software supports the manufacturing, and this is software that the digital teams at the LEGO Group are writing. In this case, what happens next is that at some point these boxes fill up; they're not magical. In the lower right corner you can see a blue vehicle. That's an automated guided vehicle, essentially a robot that patrols the factory floor. When a box is full, it will come over, swap out the box with an empty one, the machine continues to produce, and the full box goes on to the next step in the process. So this is one example of where automation comes in for our manufacturing, and this is the environment we operate in. This is where we produce software and support our teams.

If we zoom out a little, we have these factories across the world. We are in six locations right now, and a seventh will open in the near future. Not all factories do the same parts of the whole manufacturing process, but what is common is that we have data centers at these factories, and whenever we talk about edge in this talk, these are the locations we refer to. So essentially, this is what our talk will be about: how do we do platform engineering for a context like this? And with that, over to Mads. Thank you.

Okay, so now you've seen a little bit about our factories. Now I'd like to talk about how we organize the people in the group. We are a bunch of digital builders, software engineers doing the integrations that Jesper just talked about. We are around 1,200 software engineers in the LEGO Group. We produce a bunch of code, and, importantly for the rest of this talk, we have around 250 individual digital products that support our various processes.

Looking at this a little: imagine the green figures at the top here. These are colleagues doing, well, you could call it real work. These are the colleagues actually managing the molding machines, the colleagues producing the bricks. Now imagine that you need some kind of digital support for these processes. It could be automation, it could be applications they need, or it could just be a proper setup of software. We don't want the people in the direct value chain to be doing this work. So we have this concept of digital enablement: a bunch of teams that do this direct enablement.
Now these teams might need things like databases, or some staple things that are common across all of these products. Again, we don't want those colleagues to have to build that, so instead we have platform teams that enable them. To be a little more concrete: imagine you have a factory with molding machines and people operating them. Maybe you need a digital display that shows something about how the process is running, or something that can order a robot to come by. We don't want the operators to be implementing this code, so instead we have, for instance, a molding technologies product, and they do this work. They might need, for instance, a message broker or a database. We don't want them to be building that either, so instead we have a team; let's call them the edge platform. This team delivers staple services. We don't want that team to also be operating Kubernetes, because Kubernetes on-prem is a hard problem, so we could have another team, and so on and so forth. It's platforms and tools all the way down until you get to the actual physical factory. We are the edge platform. This platform could be a product that offers a message broker and a database, and it supports the molding processes and the factory processes across the world.

Which brings us to us. My name is Mads. I am the lead engineer of the edge platform team. I do the roadmaps, the long-term planning, and the engagement with stakeholders. And with me I have Jesper. Yeah, my name is Jesper. I take the roadmaps and try to convert them into something that can run, but I also orchestrate the workflows in our team to make sure that we are working on the right things and delivering the value we want to deliver. One point that I'll come back to later is that we are a fairly young team. I've been at the LEGO Group for close to one year. Yeah, and me even less. So this talk will also be about how we handle the fact that we are somewhat new in what we do.

When we do these digital products, and specifically the things in our area, we strive to provide a cloud-like experience. This can mean a lot of things, and it means different things to different people, but for us it's very important that it enables self-service. I want to make sure that when we give these various products a choice about what they want to use, it's a real choice. They need to be able to self-service, and we translate that as: we need to provide APIs for our things. Sometimes you also need a nice graphical user interface, so we strive to make our things accessible via our internal developer portal as well. When you deliver these services, it's also very important that they are robust, secure, and supported, because we want to lower the cognitive load on the various teams, and we can only do that by ensuring they can rely on the products we are building. So these are some important attributes. Later in the talk I'll cover how we build this into our products, but before we do that, it's important to talk about how we go about creating an architecture that can actually support these kinds of products.
Yes, let's spend a bit of time looking at how a solution like this works, or how we build a self-service platform in the context I outlined earlier. When we started this out, we set down some principles that we still work by. First, we wanted to use the available bricks: whatever tools and platforms other product teams were building, we wanted to use those as much as we could, along with cloud native tools that fit our needs. Specifically for services, we rely on operators. We use the knowledge embedded in an operator for each service. Of course we need to know how a RabbitMQ cluster works in Kubernetes, but the actual management of the cluster we delegate to operators. We also use GitOps wherever we can. We have different sites, we want to run the same infrastructure across all of them, and we want a nice central way to manage that. It also ties in well with the next part: we have no direct access to these sites. We don't have Kubernetes API access from a central location with pipelines and so on, so we treat all the edge locations as remote islands, essentially.

All right, let's take a look. Our goal is to enable our colleagues to get Redis or RabbitMQ at our factory sites. As I mentioned earlier, we have two data centers available at each site, DC1 and DC2. Fortunately, this is one case where we did not have to start from scratch: we have a container platform available at each site. For now we will abstract away from the fact that there are two data centers and just look at it from a Kubernetes perspective. So this could be one location, and we already have a couple of tools we can use to solve our problem. First of all, we have Argo CD and External Secrets. We use those as the primary mechanisms to roll out our infrastructure across our sites. We use GitOps to roll out our operators and the individual service instances; in this case, that would be the RabbitMQ operator and RabbitMQ instances. We have GitOps repositories that we do have access to, and the manifests get pulled in with Argo CD and installed into our locations. We also have a secret store, and secrets can be pulled in with External Secrets in case we need any additional configuration. So this is all pull-based. We don't have direct access to our clusters, but these tools let us get around that and give us mechanisms for getting all our services out at the edge.

If we look a little deeper at how we do it: we create a manifest file, the one in the lower left corner. In this case it could be for a RabbitMQ cluster. It's a small set of values that are essentially input to a Helm chart that installs the RabbitMQ custom resource. We push it to a GitOps repo for a particular site, let Argo CD pull it in and install it, and then the operator takes over and starts bootstrapping the RabbitMQ cluster. So it was quite easy to get here in some ways. We have this up and running, and we are starting to see adoption. We managed to get here without having to write a lot of code; we took the tools we had available and put them together. And that got us to the point where we now have the services running in our locations.
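To make that flow a bit more tangible, here is a minimal sketch of its two ends: an Argo CD Application that watches a site-specific GitOps repo, and the RabbitmqCluster custom resource that a Helm chart like ours might render from a small values file. The repository URL, names, and sizing are all hypothetical, not our actual setup.

```yaml
# Hypothetical Argo CD Application pulling manifests for one site.
# Repo URL, paths, and names are illustrative only.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: site-a-services
  namespace: argocd
spec:
  project: edge-platform
  source:
    repoURL: https://git.example.com/edge/site-a-gitops.git
    targetRevision: main
    path: services
  destination:
    server: https://kubernetes.default.svc
    namespace: services
  syncPolicy:
    automated:
      prune: true      # remove resources deleted from the repo
      selfHeal: true   # revert manual drift in the cluster
---
# The kind of custom resource the chart renders; once Argo CD
# applies it, the RabbitMQ cluster operator takes over.
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: molding-rabbitmq
  namespace: services
spec:
  replicas: 3
  persistence:
    storageClassName: fast-local
    storage: 10Gi
```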
From here, we can go in different directions. One thing we could do is focus on extending our services, adding new ones: Redis, Postgres, and so on, as Mads was talking about. We want to do that, of course, but we also want to build out the interfaces towards our colleagues, to provide that self-service and cloud-like experience. So the first thing we do next is build an API. This API manages our GitOps repositories and secrets, and we can expose it to our colleagues. At the LEGO Group we are API-first; every integration we do happens through APIs, and in our developer portal we have access to all the APIs that different teams are producing. So we build the API, host it centrally, and give it to our colleagues, and they are able to automatically provision a RabbitMQ cluster, a Redis database, and so on.

From there we go one step further. We have an internal developer portal based on Backstage, which, as I mentioned, has all the APIs and also information about the different product teams. Here we are building a plugin to provide that cloud-like experience, so you actually get a full plugin in our developer portal. We have team members working on this now, with help from the product team behind the developer portal. These are all the steps we are going through as a team to build a platform for our manufacturing, and this gets us to the point where our colleagues can go in, request a service, get it created, and maybe list what they have in their product team.

But there are a couple of things we are still missing, a couple of questions we want answered. One is the actual state of things: is my RabbitMQ cluster running? And two, sometimes you just want to restart the thing; that's sometimes just needed. So we have a couple more things to do. What we have is good for creating the desired state at our edge locations and having Argo CD and the operators drive it towards an actual running state, but we also want that running state back, and ultimately exposed in our developer portal. So we build out a couple more things. We have our observability platform, where data gets pushed out from the sites. And we build our own little agent that sits in our locations, pushes the actual state back out, and optionally pulls in a command that needs to be executed against one of our clusters. So now we're there, you could say: we pull our desired state out to our locations with GitOps, and we push out all the information we need and want to expose in our developer portal. That closes the whole loop. So it's a few different components we need to work on to enable this cloud-like experience.

A few learnings we took away from it: our use of operators has definitely allowed us to focus on building the full experience. Also, the fact that we go out of our way to use what other product teams at the LEGO Group are building has worked well. Of course, we also needed some things changed in certain areas; in our container platform, for example, we needed permissions to install custom resources. But for the most part we go out of our way to fit into the other teams and their platforms as much as possible.
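Looping back to the agent for a moment: to illustrate how it closes the loop, here is a sketch of the kind of status document an agent like ours might push back out, together with a command it could pull in. Every field name here is invented for the example and is not our actual schema.

```yaml
# Hypothetical payload the edge agent pushes to the central API.
site: site-a
reportedAt: "2024-05-01T09:30:00Z"
services:
  - kind: RabbitmqCluster
    name: molding-rabbitmq
    desiredReplicas: 3
    readyReplicas: 3
    status: Running          # answers "is my cluster running?"
  - kind: Redis
    name: molding-cache
    desiredReplicas: 2
    readyReplicas: 1
    status: Degraded
# Hypothetical command the agent pulls in on its next check-in,
# e.g. a restart requested from the developer portal.
commands:
  - id: cmd-0042
    action: rollout-restart
    target: RabbitmqCluster/molding-rabbitmq
```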
And then a realization: we definitely need a mix of disciplines to do something like this. We have the operational side in Kubernetes, we have backend development, and we have frontend for our internal developer portal with React. So there is quite a mix of disciplines involved in building this out. Of course, not everyone in the team does all three things, but we definitely need a mix of experience and skills. And that's what our current solution looks like and what we're building out. Now let's hear how we actually make sure that the thing works.

Yep, let's do that. You remember a while ago, I hope, that I mentioned this thing about our services being robust and supported. Remember, we are a somewhat young team, and when I joined the team I had to figure out an approach to get our things to the level we actually needed them to be. We are a product organization: digital products doing digital enablement of a bunch of colleagues. A big part of being a true product organization is that the individual teams that maintain the products and are trying to create the best possible solution for an end user actually have a choice. We want to make sure that if they feel they need to pick a specific database in order to provide a certain level of service, they are able to do it. We definitely have paved paths, things we want people to use, directions we want people to move in. But it's very important to us that people have an actual choice, and they do. So when we build out this nice RabbitMQ service, it's fully up to the individual users to choose whether or not they want to use it. The alternative could be running it on their own VMs, and that might not be a good experience, but if that's what they need to do, that's what they'll do. So in order to build these services, we have to make sure we get user adoption, and it's not a given; we can't mandate that people use our services. We had to come up with a way to make sure that we, as a young team, could build some pretty critical services for manufacturing that the teams in manufacturing actually wanted to use.

We do a few different things, but the one I'd like to talk about today is chaos engineering. Some of you might know that chaos engineering is not necessarily as chaotic as you might think; it's a pretty scientific approach. In chaos engineering you do your research, you form your hypothesis about what you expect to happen, you run your experiment in a controlled way, you conclude, and then you make your service more resilient if you actually need to. That's the process, simplified a little. What we have in the middle is our approach: when we want to test a service, we have some load generation, we have some injection of chaos, and then we have our learnings. Surrounding that is the test plan: a plan where we sat down and documented what we expect to happen, the various phases we are going to go through, and what we want to prove.
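As an illustration of what such a test plan can capture, here is a sketch of one, written as a small document. The structure and wording are ours for this example, not a schema from any chaos engineering tool.

```yaml
# Illustrative chaos test plan, agreed with the end users up front.
service: rabbitmq
experiment: single-node-loss
hypothesis: >
  The cluster keeps accepting publishes while one of three nodes
  is down, and the lost node rejoins within five minutes.
phases:
  - generate a steady baseline load against the test cluster
  - inject chaos by killing one node
  - observe recovery via metrics, logs, and the RabbitMQ console
steadyState:
  - publish and consume rates stay at the measured baseline
  - no message loss on replicated queues
abort:
  - stop immediately if anything outside the test cluster is affected
learnings: documented afterwards and fed back into hardening
```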
Now, the very important thing here, and I don't know whether you remember the first diagram with the colored figures, is that we make our test plans with our end users. We pick some of the colleagues in manufacturing who actually have to support these high-impact, critical processes, and we have them help us understand how best to create these test plans. If they are very worried about a specific part of how a service works, they have an opportunity to bring that input, and that's why we're doing chaos engineering. So I'll talk a bit about what we expect to get out of chaos engineering, then how we actually did it, and then what we got out of it.

LEGO has this motto that even the best is not good enough, roughly translated from the Danish "det bedste er ikke for godt". The idea is that for the final product, the bricks, the sets we create for our end users, for the kids out there, we should strive to make the very best product we can. There's no limit to how good we can make it, and we should do whatever we can to make it the very best product. Now, as some of you might know, it can be a dangerous thing to take that mindset verbatim into a world where you are running stateful services, because you risk trying to deliver, for instance, one hundred percent uptime, and the reality is that that's probably not what your end users necessarily want. Remember, we're going for this cloud-like experience, and if you look at a hyperscaler, at any of the big clouds with their huge range of services and features, there's just no way a team like ours would be able to deliver that. So not only can we not necessarily do one hundred percent uptime, it's probably also hard for us to create all those features. By doing this chaos engineering, and by sitting down with our end users and talking about how we actually want to support this thing, we also create a space where we can talk about what is good enough. Do you actually need one hundred percent uptime? Do you need it during a specific period of the year? As you can imagine, we have a focus on staying up throughout Black Friday, for instance. So there are phases where you really need to be up, but maybe other times of the year where you don't. This provides a safe space where we can talk about the actual robustness of our services, and we normalize these discussions about what's good enough. It also gives us a way to learn from our end users.

Another thing, if we turn our focus to our own team instead: we are now operating some pretty critical stateful services at our sites. How do we make sure our engineers are actually confident in operating these services? By making sure that when we run these experiments, we use the same tools as we do in production. We use the same dashboards, the same access to the clusters, or lack of same. We make sure our engineers have a feeling for what it's like to operate these services, not only when they work, but also when they are failing, because they've seen what a crash looks like. They've seen how you restore a RabbitMQ cluster, for instance. This also allows us to discover the various other products that are adjacent to us.
For instance, if we need to do a stress test of the storage layer, we might need to engage with the team that runs our compute platform. We don't have the access to go all the way down into the hypervisor and see how it operates, but by doing this chaos test, and because we actually need data from them to do the test, we have an opportunity to engage with them, bring them in, and make them part of the test. So should the day arise where we have an issue in production, now we know who to call; we know that that team exists. All of this is normalizing and cultivating a culture around failures and learnings. We normalize the fact that things will fail, but we also drive home the focus on learning from those failures. One awesome thing is that we now have a setup where we can let an existing application connect to a test cluster in a production-like environment and then simulate a crash; well, we trigger the crash. This allows our application teams to figure out whether they are able to, for instance, reconnect when the service comes back up. And this is a huge part of treating our customers as colleagues, because they are in a situation like ours, where they also have systems they need to support.

Okay, so that's what we wanted to get out of it. One strategy we used when we got started was to keep things as simple as possible. We wanted those early learnings, and we wanted to take the low-hanging fruit first. So we have this 80/20 rule where we save our ambitions for later. Instead of doing a fancy load injection that simulates the exact characteristics of a production load, if you can do it with a bit of shell script, do that; it will work just fine. You can easily simulate a crashed pod by just deleting it.

Okay, so how do we actually go about doing this? I would love to do a live demo now; I'm afraid I can't. We run these things in production environments, which means we run them at our factories, and we want to keep those bricks flowing. So for now you'll have to make do with some screenshots. We need some load generation, some chaos injection, and some monitoring, and I'll walk you through it. When we run the tests, we are monitoring; this is actually from when we did a test of RabbitMQ. We monitor with metrics, logs, and all that, of course, but we also just watch the existing RabbitMQ console. In this picture you can see that one of the nodes is crashing. How do we then generate the load? As mentioned, we could do something fancy, but actually we're just scheduling a pod into the cluster that runs the stock RabbitMQ load generator; there's a performance test tool that comes with RabbitMQ, and we started off with that. Later on, if we want to get ambitious, we can do something else. When it comes to actually injecting the chaos, well, it's Go code, but it does roughly what you saw in the bit of bash before: it deletes random pods. It's reproducible, and it works just fine.
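In the spirit of the 80/20 rule, both pieces can be sketched as plain Kubernetes manifests. Below is a hypothetical version: a pod running RabbitMQ's stock PerfTest image for load generation, and a Job that deletes one RabbitMQ pod at random, the moral equivalent of that bit of bash. The names, namespace, credentials secret, and chaos-runner service account are all assumptions for the example, and the Job assumes the kubectl image provides a shell and shuf.

```yaml
# Load generation: RabbitMQ's stock PerfTest tool in a pod.
apiVersion: v1
kind: Pod
metadata:
  name: rabbitmq-loadgen
  namespace: rabbitmq-test
spec:
  restartPolicy: Never
  containers:
    - name: perf-test
      image: pivotalrabbitmq/perf-test:latest
      # Credentials come from the secret the cluster operator
      # creates for the default user (name assumed here).
      env:
        - name: RMQ_USER
          valueFrom:
            secretKeyRef: { name: test-cluster-default-user, key: username }
        - name: RMQ_PASS
          valueFrom:
            secretKeyRef: { name: test-cluster-default-user, key: password }
      args:
        - "--uri"
        - "amqp://$(RMQ_USER):$(RMQ_PASS)@test-cluster"
        - "--producers"
        - "2"
        - "--consumers"
        - "2"
        - "--rate"
        - "100"
---
# Chaos injection: delete one pod of the test cluster at random.
# Assumes a service account with permission to list/delete pods.
apiVersion: batch/v1
kind: Job
metadata:
  name: kill-random-rabbitmq-pod
  namespace: rabbitmq-test
spec:
  template:
    spec:
      serviceAccountName: chaos-runner
      restartPolicy: Never
      containers:
        - name: chaos
          image: bitnami/kubectl:latest
          command: ["/bin/sh", "-c"]
          args:
            - >
              kubectl get pods -l app.kubernetes.io/name=test-cluster
              -o name | shuf -n 1 | xargs kubectl delete
```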
And with this quite simple setup, we already got results. For instance, we had a test where we wanted to see whether we could detect when a cluster lost a node and when the node rejoined. The hypothesis was that at some point we should be able to see in the logs when the node rejoined. We ran the test, and we couldn't, because it turned out that out of the box we had misconfigured the logging, and we could not actually see the node rejoining. We added the configuration, it was a quick fix, and now we are able to see it. And we could have ended up going into production without this fix if we hadn't done the test. We are also already getting a bunch of lovely feedback from our end users. Jaroslav is one of the engineers who has been running RabbitMQ for a long time in the group, and he is very appreciative of the work we're doing, telling us again and again that he loves that we take so much care doing this work. I believe a lot of that comes from our willingness to engage through these chaos experiments, where we're not afraid of demonstrating that things can crash. As for further ambitions, of course we need to go deeper into this and look at other tools. We actually have a colleague right now doing his master's thesis on this, so we're working with more chaos engineering and will go deeper into it for sure. Cool, then let's talk a little bit about our takeaways.

Yes, so we're coming to an end, with a few final words to close on. First of all, having worked on this since last summer, as Mads mentioned, this environment has proven to us that doing cloud native in an on-prem environment has never been easier. We recognize that our container platform team, our compute team, and our networking team did a lot of the hard work to set this up for us, but at this point we can use a lot of the cloud native tools, and they solve our problems. We saw how we use Argo CD and GitOps to get around not having direct access to our edge locations. So this has been quite an eye opener for us.
Of course, it helps that we sit close to the container platform team, physically together in the office. But the product focus that Mads mentioned really shines through: teams are not forced to use what we are building, so we have to build for the experience of our colleagues, so that they want to use the services we build. That focus shines through in the platforms and helps a lot in making them accessible to each other and to us. And the focus on integrating with APIs has been essential for making this possible, also for building out the self-service part. We started by building the internal developer portal plugin, but down the road we may add other interfaces; as long as we have the API, we can keep adding them, depending on what our colleagues want and on the specific services.

And just to double down on this: if you have a way to get some early value, go for it. If it requires a bit of bash, just do that. There are some awesome tools for chaos engineering, and we'll definitely look at them later, but we learned a lot by doing what more or less equates to a bit of bash. So remember to do that if you have the opportunity. Now, if you took notice, the architecture that Jesper stepped you through was quite different. There was no bash in there; there's a lot of YAML, and we were just picking from the landscape. So when you're doing foundational things, we are now at a phase where there are some pretty robust projects out there. Make sure to use them, build your layers carefully, and make sure to integrate them. And yes, it does mean you're not writing as much custom code as you might like to, and in the end you will just be a YAML engineer. But remember that this is saving you time, and I would suggest you take that time, invest it with your friends and family, and then maybe go find some LEGO sets and do some awesome builds. And with that, I would like to thank the community for creating this awesome landscape, thank all of you for being curious about what we are doing, we will definitely keep returning and sharing our experiences, and thank all our awesome colleagues who have made this possible. Thank you all.