Okay, hi everyone — last talk, we finally made it. It's only us, and then we all go out, drink beer and enjoy the rest of the sun. So welcome, nice that you're still here with us. We are here to talk about running Kubernetes in manufacturing — what could possibly go wrong?

So what is this about? We won a project to implement a Kubernetes platform at a chemical-producing company. That was a challenge in itself, because if you produce chemical goods and something goes wrong, it is potentially not so good for anyone.

But before we go into detail, a few things about us. My name is Tobias — you can say Toby. I have been working for Kubermatic for four years, now as a principal architect, helping customers adopt Kubernetes in a scalable way. And I'm Mario, I also work for Kubermatic, as a professional services Kubernetes consultant. We basically help our customers build up their Kubernetes solutions inside their data centers or their infrastructure. As you can see, we are both from Bavaria — that's why we are wearing lederhosen, of course.

And now, to go deeper into everything, we need to go back in history. Sorry, it's Friday, and nobody likes history lessons, but bear with us. At the end of the 18th century we had the first factories that started mass producing goods — this is what we call Industry 1.0 now. I mean, the name is newer than that, but we'll go with it. Then the industry evolved: we had mass production at the start of the 20th century, and computer automation in the 70s — the first manufacturing lines that used robots and started automating things. And now we have reached the point we call Industry 4.0, where we have interactive systems.
We have microchips in all of the manufacturing lines, down to the single device, and everything in the manufacturing plant is also virtualized. So when we look at Industry 4.0, every manufacturing plant needs to be self-sufficient but also interconnected with other plants, because the whole production line may span many factory sites. We are in a process where we have data analytics in every single step, plus security and observability to make sure our systems keep running, and we want to improve every single step of our manufacturing process.

The problem here: we now need a lot of compute power and a lot of servers, but we cannot put them in the cloud, because the amount of data we would move to the cloud and back is just too big. So everything is placed right next to the manufacturing line. All of those machines need to work, but some locations are very small, so maybe you do need to put something into the cloud. Which brings us to the big topic: it's all everywhere, we need to combine everything, and we need to be flexible. And for this we thought: why not run the workload on the tool we use every day — Kubernetes?

And then we said: okay, Kubernetes, cool. Chemical manufacturing — oh, scary. Okay, let's get this started. We said: come on, we are experienced with Kubernetes, what could go wrong? So what were the main targets for this project? We started with some workshops, discussed with the customer, and came down to a few key project targets.

One is hosting manufacturing software. Why is that different? Manufacturing software is mostly not cloud native. You need to connect to ERP systems like SAP. You need to connect to machines.
You need to connect to devices. We have a requirement that some Windows tablets need some of our services, and there are printers — have you ever been asked about printers in a Kubernetes project? We have, and that was the thing. And then we realized: this is also a different kind of network. We are used to IT networks, but this is a manufacturing network. There are at least three firewalls, as far as we know, to cross to reach this network: a firewall from our managed service to the customer, from the customer to the cloud, from the cloud to on-premise, and from on-premise to the user workload. So there are a lot of interconnected components, and we first needed to understand what manufacturing is really about.

What they want is high stability and a high degree of automation. Why? It could happen that a data center at the manufacturing site gets destroyed by an accident or something, and they need to be able to rebuild and reproduce it quickly. Then we need independence: our operation of Kubernetes should not affect any other microservice deployment or the manufacturing software deployment. We need high availability, across the distributed locations — we are currently running in three data centers, but the plan is to run in 30 data centers. And then we need end-to-end operations: a team who can operate the data centers across the globe. The next data centers are in the US, so it will be quite different from working only in Europe as we do today.

Okay, then: independence — what does that mean in detail?
It means that we want to decouple things. We were thinking in microservices, but it also means that an update of component A should not block a rollout of component B — and that goes all the way down to the infrastructure. If we block people from deploying their changes, we become the bottleneck, and that is not acceptable in this case. We wanted to be fast, because for them fast means they want to migrate more or less everything to the manufacturing line in the end. I call it a digital backbone: for them it is not Kubernetes, it is the digital backbone of their future. So they want to be fast.

And an important thing for me: abstract the complexity. Everything is really complex. I think Kubernetes is now at a point where we say: this is really, really complex — maybe too complex already — and you need to keep in mind to make things simple, more consumable and more repeatable, so that a two-pizza team can still handle it.

Okay, so: high availability. Everyone has a different interpretation of it. In our case it was combined HA across multiple data centers — not only within one data center, it must be across all data centers and multiple failure zones. We have vSphere there, we have Azure, we have multiple physical machines that we need to cover. And we have a target SLA of 99.96%, which means roughly 17 minutes of allowed downtime per month. That is not much if an upgrade fails or something — it is a really small budget to fail in. And if you think in years: it's only about three hours a year, and three hours can be a really tough window to fix something if it breaks. But currently we are fine, we have managed to do it. Now we'll get into the detail of how we did it.
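As a quick back-of-the-envelope on the 99.96% target just mentioned — a minimal sketch, not part of the talk's tooling:

```python
# Rough downtime budget for a given availability target.
# The talk quotes a 99.96% SLA, i.e. roughly 17 minutes per month.

def downtime_budget(sla_percent: float) -> dict:
    """Return the allowed downtime for an availability target."""
    down_fraction = 1 - sla_percent / 100
    minutes_per_month = 30 * 24 * 60      # assuming a 30-day month
    minutes_per_year = 365 * 24 * 60
    return {
        "minutes_per_month": down_fraction * minutes_per_month,
        "hours_per_year": down_fraction * minutes_per_year / 60,
    }

budget = downtime_budget(99.96)
print(f"{budget['minutes_per_month']:.1f} min/month")  # -> 17.3 min/month
print(f"{budget['hours_per_year']:.1f} h/year")        # -> 3.5 h/year
```

That monthly budget is smaller than many single maintenance windows, which is why the failed-upgrade case is so sensitive.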
Yes — and there is another component you usually don't think of: we are in a corporate world. A corporate world has a lot of departments, they don't necessarily want to talk to each other, and they are really, really big — you need to communicate with everyone. So we went to our customer and said: we cannot do everything by ourselves, we need a counterpart. For this we have a stakeholder team at the customer, with whom we design the whole architecture and talk on a daily basis.

But we also need to talk to the consumer departments, because they create the requirements for what they want to run on the cluster. For example: "we now need 200 terabytes of storage, accessible in every data center" — that's a new requirement. So you are constantly talking to all of the teams. And then there is the team that everyone likes and everyone loves: security. The first thing they ask is: do you really want to do this? How do you want to do this? No, you can't do this. It's constant communication with them, bringing them up to speed and also teaching them, so that you work together as a team.

So we came up with the idea of splitting the competences into small departments, so that everyone knows exactly: this is my part, this is what I focus on, this is where my core competencies are. We started with one team which is part of the customer: the core infrastructure team. They provide the basic level — I need my data center, I need my network.
I need my firewall appliances, I need my Azure accounts — with this team we first designed the base layer of everything. The next team is the application platform team. This is the team we work with; you could say it's a vendor inside the corporation, and their customers are the different departments, who ask things like: hey, we need a Kafka, but we don't need a Kafka for us alone — can we just jump on a shared one? So this team provides large services on a global level, and you can basically go in and buy stuff from them.

The last group on the customer side is not one team — we don't know how many there are, because we never talk to them directly; they always talk to the department above them. These are the teams actually running their workloads on the Kubernetes clusters that we provide together with the customer team. From them come all of the requests: we need this, we need that, we need more storage. We need to take all of this into account, and that is our job: we are the cloud native infrastructure team. We provide large services in all of the data centers, we do the consumption reporting and the metering, we create the clusters and all of the infrastructure, and we take care of that infrastructure. We are the people you call to say: hey, everything is down, do something — now.

And then we come to the point: running Kubernetes in the manufacturing line — how do we actually do this? It's not the simplest way, but we told them to accept it. What could go wrong? Now we are on the more technical side, since the setup is sorted out. A core idea was: to handle this, we need to think in functional units. A unit does not have to be a single technical component — it must be something we can reuse.
This is not just Kubernetes — this is infrastructure. We have lots of things to care about: vSphere, permission models, and everything should be managed by API and GitOps. That was our approach to manage the scale, because we already knew we would need to manage 30 data centers across the globe, 24/7, with a high SLA, and for that automation was really key. But how would our friend the great firewall handle this? The firewalls don't have an API we can implement against. Have you ever talked to firewall teams in big corporations? It's hard. We called ours the Wonderwall, because you never know if there's one firewall more.

So what we did: we built object groups. We structured the firewall rules into objects and said: component A of a data center needs to talk to component B of a data center — but "data center" is a generic object into which we put the different data centers, and that way the rule set becomes scalable. Maybe you think this was some fancy program you can download. No — it was an Excel sheet.

Okay, what else? I think everybody has heard it: infrastructure as code is great, and automation is our key success factor — starting with how our own team bootstraps. We need to be really clear on how someone can bootstrap, because it can happen that a colleague — some of you are sitting here, hi — gets called in the night, needs to check out a repository and somehow find something. That is usually where you lose time, and remember: if everything is down, we have 17 minutes in the month. That's why we said documentation must also live in the code repository, so you have a single entry point. And we need a centralized scaling architecture: a few core concepts that we can then duplicate.
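The object-group idea above can be sketched in a few lines. This is a hypothetical illustration, not the team's actual Excel sheet — the group names and endpoints are assumptions:

```python
# Hypothetical sketch of the "object group" firewall model from the talk:
# rules are written once against generic groups, then expanded into
# concrete source/destination pairs per data center.

object_groups = {
    "datacenter": ["dc-eu-1", "dc-eu-2", "dc-eu-3"],   # generic object
    "master": ["master-azure"],
}

# One abstract rule: every data center must reach the master on 443.
abstract_rules = [("datacenter", "master", 443)]

def expand(rules, groups):
    """Expand group-level rules into concrete firewall entries."""
    concrete = []
    for src_group, dst_group, port in rules:
        for src in groups[src_group]:
            for dst in groups[dst_group]:
                concrete.append((src, dst, port))
    return concrete

for src, dst, port in expand(abstract_rules, object_groups):
    print(f"allow {src} -> {dst}:{port}")
```

Adding a new data center then means appending one entry to the group, not rewriting every rule.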
Similarly, we think of data centers as kind of our pods: we want to scale data centers up, and the users want to consume them like that, so the difference in cluster provisioning between clouds is only a few parameters. That brought us to the idea: okay, we standardize the data center. A standardized data center must be independent and universal, but with standard interfaces for how we connect it and how we manage services there, and we don't use anything cloud-specific. We basically say: okay, data center, we put everything into Kubernetes — that's the interface we implement against, and it makes it easy for our end customers to consume. And the setup is really important: it must be repeatable in a short time, because the disaster recovery case is one of the main factors for this company. The company told us: if production is down for half an hour, it already costs a million. Think about what that means if, say, hardware fails — we need to be really good at this.

So, going back in time: first we introduced containers. Containers are cattle — we don't care about them, we throw them away. Then the next step: my nodes — I don't care about them either, we can throw them away. And now we reach the point where we say: Kubernetes clusters.
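The "standardized data center, only a few parameters differ per cloud" idea might look like the following — a minimal sketch under assumptions, where the class, field names and service list are illustrative, not the actual tooling:

```python
# Sketch of the "standardized data center" idea: one template,
# a handful of parameters per location. All names are illustrative.

from dataclasses import dataclass

@dataclass
class DataCenter:
    name: str
    provider: str          # e.g. "vsphere" or "azure"
    region: str
    node_count: int = 3

    def render(self) -> dict:
        """Render the standard service stack for this location."""
        return {
            "name": self.name,
            "provider": self.provider,
            "region": self.region,
            # the same standard interface in every location:
            "services": ["dns", "registry", "ci-cd", "dhcp", "s3"],
            "nodes": self.node_count,
        }

# Provisioning a new location is only a parameter change:
eu = DataCenter("dc-eu-1", "vsphere", "eu-central").render()
us = DataCenter("dc-us-1", "azure", "us-east").render()
assert eu["services"] == us["services"]   # same interface everywhere
```

The point is that consumers see the same interface everywhere; only the provider-specific parameters vary.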
I don't care about Kubernetes clusters either — also cattle. I can throw them away and quickly create new ones. So why not take the same approach with the data center? We basically created a template for a data center for each of the locations. We have one management layer on top of it, currently located in Azure — but that doesn't really matter — and we use our own open source software, called Kubermatic, like our company. We basically run Kubernetes inside of Kubernetes: we have a seed cluster in each of the locations, and inside the worker nodes of your seed — your master — run the control planes of the user clusters, as pods instead of dedicated machines. This makes the overhead really small: each user cluster needs only about 0.3 CPUs to run its control plane in an HA setup, and the worker nodes are then really there for your workload. You can quickly spin those clusters up in every environment you need, because you don't only want production — you also want dev, integration, and probably a sandbox where you can test things.

But when you template a data center, you don't just need Kubernetes clusters. There are a lot more things you need in every single data center, and this brought us to the idea: hey, why not create a service cluster?
Right next to our seed, in every single data center, there is a service cluster, and with it we provide all of the basic services you need to run the data center. This brings us into the whole CNCF world. In every one of the service clusters we have CoreDNS for the DNS service, a registry — we're currently using Harbor — a CI/CD pipeline, and Argo CD.

And now the funny thing: you need IPs for your worker nodes and for your load balancers — and for tablets and printers, yes. And out of the box there's nothing. Well, there is something we figured out: ISC DHCP. But ISC DHCP doesn't run in a container, so we put it inside a container, and now we can hand out IP addresses everywhere outside of our Kubernetes cluster while running the DHCP server inside of it. And you need storage — to store pictures, for example: every single manufacturing step takes a photo, so you can check whether everything is all right and what can be improved. So a lot of pictures are being taken. And you also need a repository.

But wait a minute — not all arrows are connected to the master service cluster. For the S3 service we said: we don't want the data replicated to the master, because it's just too much data, and it's only important inside the data center where it's really needed. So yes, you can template a data center, but you cannot do it the same way for every service — you really need to figure out what is important for your use case and what is not. And as you can see, we set up the idea that everything has a local address, which means that every single service in all of the data centers always has the same address. So when I'm in manufacturing plant A
So when I'm in data when I'm in a manufacturing plant a And I call local dot DNS dot manufacturing. I get the nearest I get the nearest core DNS if I do the same thing in Data center B. It's the same address but it's also the local one and this brings us to our flexibility that we have When we hit the disaster case because as we said half an hour outage one million loss So basically the data center If if something happens with the internet connection, we still need to our plan to be operational. So we made our Fail over case the default So our fail over case is always this is how it's the the request is typically looks So we always call the the local service and we don't care if we have the master available Or we don't care if the other data centers are available. This also is a speed improvement But what if my local installation is failing? That's also easy we can just fail over to the master or fail over to the to the second data center location and How we managed to do this is basically everything is Tripled down from the master. So the changes are made and pushed to the master and it's always Put down to all of the these services below so that you basically can repeat That that all of the data is still in on the same on the same page And this makes everything really really easy in a disaster case and we can even say oh We have lost our master and we have lost our local data center So we just use another data center and this makes everything more and more reliant to any other topic So what this was the theory what we started with and what went wrong now Yeah, basically you can think that it's not everything work from scratch So and it was more like a trail map. 
It's really something to consider: each project has its own characteristics, each data center has its own characteristics, and there are people involved. You cannot just say "we do it this way and that's the only way", because security says: I don't care what you want — you need to discuss it with us and we need to approve it; we are the final ones who approve. So it's more like a trail map that you repeat and repeat, trying to improve.

Our core success factor was that we started small. We said: we want to solve one end-to-end path first, and then improve over time. The first thing we looked at was the local infrastructure. We needed to get into the data center setup, have it explained to us, understand the policies, and figure out how we can connect to it. Even getting a jump server in there was a hard task, because first you need to convince them that we have a safe implementation for it — we call it the virtual operation center — and that it is safe for us to access their critical production network.

Also: backup. What does backup mean? Sure, we do an automatic backup every 20 minutes from our clusters, no problem. But that's not the end of the story, because you need to think about where to store it for the disaster case — it doesn't help you if the backup sits in your local data center and that data center burns down. So you need to talk with the customer: what is your failover strategy, where can we store it?
Do we have an S3 service to store it, and so on. Then core infrastructure: we thought that was the easy part, but we figured out that the vSphere setup was different from what we expected and from what we had seen at other customers. Suddenly there were dedicated zones we weren't aware of, and we had to figure out how to use them to make it really reliable. And the network — well, when you have a name like Wonderwall for something... The network was really a challenge, because sometimes we didn't even know there was a firewall in between. Security — I said it already: segmentation and authentication. Authentication is important: every person must be identified. And how do you onboard a whole managed service team with customer accounts? That was a real organizational challenge.

Good, what else? Services: we said everything we want to consume needs to be a service. Mario told you about the S3 service, for example, and it must be local — like the DNS service with its local caches. The implementation detail that everything can reconcile was key: the services are independent, but they reconcile, and that's the point about independence. The central master is the holy grail: if we change a configuration, it lives there, and on the locals we just have replicas — similar to pods. They can go away, they can spin up, they can scale up, but if something happens, the core brain is the central master. And then there is partial connectivity, where we said: that must be our default, and we need to reconcile. That's what brought us to this architecture.
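A minimal sketch of that master-and-replica reconcile loop under assumptions — the configuration keys and version strings are purely illustrative:

```python
# Sketch of the master/replica reconcile idea: the central master holds
# the desired configuration; each local data center converges toward it
# whenever connectivity allows. All names and values are illustrative.

master_config = {"dns": "v3", "registry": "v2"}   # the "core brain"
local_config = {"dns": "v2", "registry": "v2"}    # one data center's replica

def reconcile(local: dict, master: dict, connected: bool) -> dict:
    """Converge local state to master; keep serving stale state offline."""
    if not connected:
        return local               # partial connectivity is the default case
    for key, desired in master.items():
        if local.get(key) != desired:
            local[key] = desired   # pull the change down from the master
    return local

reconcile(local_config, master_config, connected=False)
print(local_config)  # unchanged -- the replica keeps serving while offline
reconcile(local_config, master_config, connected=True)
print(local_config)  # now matches the master again
```

The design choice is that disconnection is never an error state: a replica simply keeps serving its last-known configuration until it can reconcile again.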
I think automation is clear to everyone, but if you have things that are not automated, or new tools you don't know yet, then think about declarative management there too. We created wrappers around things that maybe aren't in our responsibility but that we needed to configure and hand over to other departments — like the firewall rules.

Cluster management is mostly our core topic, where we come from, but it was only one path. We have a global service API to provision across the clouds: I can talk to the same endpoint to provision a cluster in Azure, in a local data center in Europe, or in a local data center in the US. That also gave us the ops tooling — there are dashboards — and there we had to be aware: how do we stay operational when a data center is disconnected? We potentially need more entry points into the customer environment, because if we only have one jump host and it's unreachable, we have a single point of failure. Okay, we solved that one.

Then we came to the applications. We found that the applications had other dependencies than we expected: as we saw, they use Kafka, they have some really classical workloads as well as super nice microservices, and they like to consume a lot of storage. Everything ran fine on vSphere; then we moved to the cloud and it also looked fine. But after a while we noticed: we only had 10% utilization in the Kubernetes cluster. Why? We had our cluster autoscaler, just as on-premise — but on-premise VMs can mount a lot more volumes than the machine type we chose in Azure.
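The effect of that disk cap can be sketched with a small calculation. The limits below are illustrative numbers, not a specific Azure SKU:

```python
# Sketch of why the Azure cluster sat at ~10% utilization: the number
# of data disks a VM can attach is capped per machine type, so volume-
# heavy workloads force extra nodes whose CPU/memory then sit idle.
# Limits below are illustrative, not a specific Azure SKU.

import math

def nodes_needed(pods: int, volumes_per_pod: int,
                 max_disks_per_vm: int, pods_per_vm: int) -> int:
    """Nodes required once both pod capacity and disk limits apply."""
    by_pods = math.ceil(pods / pods_per_vm)
    by_disks = math.ceil(pods * volumes_per_pod / max_disks_per_vm)
    return max(by_pods, by_disks)

# 60 pods with one volume each, on VMs that fit 30 pods but only 8 disks:
print(nodes_needed(60, 1, max_disks_per_vm=8, pods_per_vm=30))
# -> 8 nodes because of the disk cap, although 2 would cover CPU/memory
```

Splitting node pools by workload type — volume-heavy versus compute-heavy — lets the autoscaler scale each pool on its own constraint instead of paying for idle CPU, which is the redesign described next.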
I don't know if everyone knows that, but in Azure you have a limit on the number of disks you can mount per machine type, and that meant we spent a lot of money for nothing, because CPU and memory sat idle. So we redesigned it, learned from it, and said: okay, we need different groups of node pools — and the autoscaler was also modified so it can scale individually, based on volume-heavy workload and on normal workload. Then it scaled up and down properly. And once we understood it, we said: okay, continuously watch — and that's the key point: you start the whole cycle again. You cannot implement all of this in one step; if you try, it will be slow, you will never have a result, and people will start asking "hey, is anything there?" — it looks like a black box. For us, a key success factor was: quickly have something we can show, and then improve. We are now improving the security, but the application team could already start, and the other teams can already work with it.

We came away with three key points. The platform we provide needs to be adaptable by design; we need some kind of standardization; and we also need modularity, because — as we learned — not every data center is the same. We must be flexible, but with predefined components that we can easily reuse.

Currently we run a managed service, and under the SLA we have 39 clusters, 206 nodes, 878 CPUs, 3.2 terabytes of memory and 200 terabytes of storage. And you also need a playground, which is not under the SLA, where you can just mess things up — delete the master and things like that. As seed clusters we have 12 clusters, 49 nodes, 274 CPUs and 1.1 terabytes of memory. And with this we are finished — and so are you, KubeCon is over. Thank you everyone!

Questions? If not — oh, there's one. There's a mic next to you. Oh — so, we set it up:
We started implementing it in February last year, and we went to production in April. Well — April was pre-production. And we basically did it with two engineers from our side. Any other questions? If not: have a great evening, thanks for coming, thanks for staying, and see you at the next KubeCon, hopefully.