So, welcome to my talk. Can you hear me at the back? Yes? Great. I'm talking today about our road to a Kubernetes-based closed build environment. My name is Sigi, I'm a cloud development architect at SAP Customer Experience, and I'm sitting in Munich.

Quickly, the agenda: I will talk about the technology itself, but also about soft stuff like the product aspect and the workflows we introduced in our team, about why we built all this, and about the learnings at the end.

To give you a little bit of context up front: we are a distributed team across Munich and Montreal. We are seven team members, two of them working students, and we have been building this system for about a year. There are a few details I will not go over because they are company-specific networking and security things, but in general it's very generic, and I think you would experience much the same if you started doing this tomorrow.

When I say closed build environment: we are building a software package which goes out to the customer.
So it has to be tamper-proof: it's not allowed to be changed while we are building it. It also has to be auditable: for the package the customer is running on his on-premise infrastructure, we have to know exactly how we built it. There are certain details around that, and you will see them popping up later on.

We use a managed Kubernetes. For us this is a small team; for some it's probably a big team. I was never before fortunate enough to have seven people building the CI/CD system; usually I was the only guy doing that. But even so, we would not be able to do this without managed Kubernetes; a self-managed cluster would be too much work.

Now, to give you an idea of what we do: Sandboard is an open-source tool which we wrote, and it's basically there to orchestrate Jenkins environments. We build the Jenkins environment with a Chef cookbook, and there are seed jobs: you spin up a Jenkins, the seed job creates all the build pipelines you want to have, it configures security, it configures plugins and all that. We provide this to our teams. There are roughly 20 to 30 teams, all working on parts of the product we are building, and they all get Jenkins instances.

Then, sometime in 2018, there was a little problem: based on some management decisions, our old data center was being terminated. The decision itself was not my problem, but suddenly we had to move everything over. And we did not want to just move stuff over; we decided to think about what we actually want to do, which is a great situation to be in: it's basically greenfield.

With the experience we had before, we knew one thing: if you have 20 or 30 teams, they do not care too much about a Jenkins, they do not care too much about the seed jobs or about updating the Jenkinses, and at the end of the day we have to do that. So instead of just providing a Jenkins, we decided to also provide the pipeline as a service. My team is
only here to provide the Jenkins instances; a second team, which I will not talk about today, is actually building the pipeline as a service on top of our infrastructure. So when I say customer, I mean that other team: they build the pipelines, and they in turn have the customers who actually use the pipelines.

Our requirement is really simple: we have to support up to 120 build nodes, so at peak up to 480 virtual CPUs and 1.6 terabytes of RAM, and that will roughly double this year. It has to be cost-effective; big company or not, you don't have unlimited budget, so autoscaling was a must for us. We had actually never autoscaled before: if you own your own data center, there are a hundred machines just running, and it was never a problem. It has to be secure and maintainable, because we have other work to do as well; it has to be reliable; and at the end it has to have customer-facing UIs.

The customer-facing UIs are a real problem. If you look at Tekton and other modern stuff, you have to build a lot around it before customers can actually use it, and that's the reason why we still use Jenkins. We looked around at the current options, and Jenkins, with the Job DSL and everything around it like unit tests, graphs, and access management, provides so much that we cannot just migrate away from it. It's planned to keep looking at how the landscape is changing in the future, but it's changing quite a lot.

So, Kubernetes to the rescue. We already had experience with Kubernetes 1.6, which is not that old, but there is a huge difference between 1.6 and the 1.12, 1.13, 1.15 or 1.17 of right now. We have experience with Chef and Ansible. And Kubernetes does support autoscaling on the platform which we chose.
It's Google, though you can probably get autoscaling on all the other cloud providers as well. It handles certificates, which was, unfortunately, a little problem for our team too: sometimes they just expire, and the certificates are for our internal services, so we cannot just use Let's Encrypt.

For us, it's also more secure than VMs. You might know the one or the other snowflake VM sitting around somewhere that has not been touched for a year or two; you know the problem. With Kubernetes, a node is super lightweight; there's nothing on it, it auto-updates, and you don't care about the underlying operating system: you just assume that it's there, and it's there.

It's a personal opinion, but the last point is probably the future, and it is also a change in mindset: we switch from imperative to declarative. Instead of hacking away, creating your virtual machine, installing software, and configuring it, you now have to describe it first, and then someone else takes care of creating that infrastructure for you. This is huge, because in our environment you cannot start by doing; you start by writing. So everything is basically already infrastructure as code, and when you ever have the problem that something is not running and you just delete a whole namespace with your complete monitoring setup, you press the sync button and five minutes later the whole setup is back there again, including all the dashboards and everything; it's all based on code. You will love it. And you don't need to back up anything; the only things you actually back up are the artifacts, and you put them on some cloud storage, where they take care of your backups for you.

A quick jump into Kubernetes itself. It's still quite young. In July 2015,
we had version 1.0. The first KubeCon was only in November 2015. Currently we are already at 1.17, and it has quarterly releases. What does that mean for you if you work with it daily? This is not something we have had big trouble with, but it is something we are very aware of: this ecosystem is really young. Everything changes. You find open bugs because you just run into them, and you see them getting fixed; two months later a new release is out and new features are popping up; and tools you're using now suddenly have business and enterprise features because a lot of people started using them and requested those features.

I was a Java developer for probably ten years or something like that. Java is great and really stable, but I have never had an ecosystem change this quickly. You have to keep up, and you have to put the effort in. Don't run an old Kubernetes cluster; if you still do and you don't have time for it, you are creating a zombie machine, and it will be a horrible migration nightmare later on. I can guarantee you that already now.

So, we saw the greenfield, and we have Kubernetes. Instead of saying we are just an infra team, we just throw something out, and customers come to us and say, hey, can you do this, can you do that — we said no. What we create is a product, and product means customer focus. We don't just throw something at our customers; we think about how they can log in and how they can use the tools we provide. Documentation is a must; it has to be clear and understandable. If it's not clear and understandable, someone creates a ticket and we have to do something about it, and I personally don't like that very much. So I'm trying to give everyone the chance to do things by themselves, but in a way that they like, right?
It really takes effort to make people like your infrastructure. A friend of mine told me that a work colleague came to him and said: hey, I really love what you are doing, I want to have something very similar. That's what you are aiming for.

We also define our boundaries: what is the customer's responsibility and what is ours. So if a customer comes to us and says, we need this feature, we will look at our roadmap and check whether it fits. Of course we are building it for our customers, but nonetheless we don't stop our own work just to provide something when we have our own stuff to do.

Maintainability, quality, and security: we are enforcing security by default. We are responsible for security; it's not a discussion I have with a customer. I will come back to this later on. We also have a maintenance window defined already, so our customers know: if you're using this service, there will be a maintenance window. It's not that we have to use it every time, but it's enforced; it's our responsibility.

The entry point for our customers is a wiki page. It's all defined up there: the weekly maintenance window, all the documentation and links. It's internally public, everything is available, everyone is able to browse it; it's very transparent. We give our customers this wiki page link, and from there they can get to understand our product. We are still trying to optimize it; it's a constant work in progress.

Now to the very beginning of what we built when we started developing. It's a little bit awkward perhaps, but stay with me. One of the first things we built was a namespace inspector. This is a tool which runs in our Kubernetes cluster, and it deletes everything which is not whitelisted. So if you create a namespace called i-want-to-test-something, one hour later
it's gone. We do the same thing with the default namespace: if you try to do something in the default namespace, it's gone. If you want to test something, you can use dev or test; dev is deleted after one week, test is deleted after two weeks. This creates a general attitude that as a developer you don't start by playing around; you start by coding something, because whatever you hack together could be deleted within a week, and one week is not that long. I think that saved us a lot of headaches about where stuff is coming from; we basically enforced it on ourselves.

The second thing which was really important for us was insight into our cluster; we started with that probably the second day after the namespace inspector. We decided on the Prometheus Operator. It's a beta project, and it basically provides you with the whole setup: Prometheus, Alertmanager, Grafana. It already has dashboards in it, and it has predefined alerts, so the guys behind it clearly know what they are doing; I think they are people who operate a lot of Kubernetes clusters themselves. So if a node disk is getting full, it will pop up: the alert is already defined, you don't have to do anything, it just works.

Last year at FOSDEM I became aware of Grafana Loki. We do not give our customers access to our cluster, but they want to see the log files, right? And Grafana Loki is basically that: a lightweight log aggregation tool. It's no longer alpha; we started using it while it was still alpha, with awareness of the risk. But running an Elasticsearch cluster inside your Kubernetes cluster is a huge amount of effort, and I wanted to avoid that. Loki does exactly what we want: it provides the Jenkins slave and master log files to our customers, and we have log file dashboards for that as well.

Here is how a dashboard from the Prometheus Operator looks; in this case it's just a node.
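To picture what those bundled alerts look like: the rule below is a simplified, hypothetical example of a disk-filling-up alert in the Prometheus Operator's PrometheusRule format, not the exact rule the bundle ships.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-disk-filling-up   # hypothetical rule name
  labels:
    prometheus: k8s
spec:
  groups:
    - name: node.rules
      rules:
        - alert: NodeDiskFillingUp
          # fire when less than 10% of a node filesystem is left
          expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10
          for: 30m
          labels:
            severity: warning
          annotations:
            message: "Filesystem on {{ $labels.instance }} is over 90% full."
```

The point is that rules of this kind come predefined with the stack, so you get node-level alerting without writing any of them yourself.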
The node dashboard tells you IOPS, network I/O, memory, CPU. We also have our own dashboard, and that's probably the more interesting one here: you can already see the autoscaling mechanism. We scale up to a hundred nodes, and two hours later they're gone. And if you look at the left side, you can see that we have also introduced a cost model, so we actually know roughly how much money we burn. It is really important to keep an eye on this: if we did not autoscale, our bill would be roughly 30,000 to 50,000 euros a month, and right now it's more like 5,000 to 10,000 euros. There are a few bonuses if you're in a big company; you can leverage discounts, so it gets cheaper, but nonetheless it's a big deal, right? Everyone has a cost problem. And of course you cannot do this in your own data center: scaling something you already own up and down does not make much sense.

Next, we needed a tool for our customers to create the Jenkins. Remember, we are talking about the customers who are creating the pipelines, so they are very technical, but they only want to have a Jenkins, and it should work and be safe; that's it. So we are using Kubeapps. It provides Helm applications through a simple interface; it does exactly that and nothing more: it provides the customer a way of spinning up an application. And the values YAML file is basically the contract we have with our customers. We don't tell our customers they have to know how Kubeapps works but, unfortunately for them, they do have to know how YAML works.

So how does it look? It's a simple interface, a great project, very active: it gets new features, it's actively developed, but it's really young. For example, there was a problem with the namespace dropdown at the top: it showed all the namespaces.
Sometimes it didn't show any namespaces at all. So you constantly get new features and fixes.

This is the values YAML file the customers actually get. We try to have all the important things a customer really wants to change at the top, and we try to document it, but nonetheless a lot of stuff is further down. Basically, what they define is a seed job: they spin up their own Jenkins and say roughly how much memory and CPU they want. Which of course means that if they create a big Jenkins master with 24 gigabytes of RAM and a few CPUs, they will get it, but it will cost them a few hundred dollars per month, versus a small Jenkins for testing. That's really great in Kubernetes: it's really simple to just give your Jenkins two vCPUs and four gigabytes of RAM if you are testing something, or to give it a big machine.

Now you have a lot of stuff configured in your git repos, and you need to get your Jenkins and all the other stuff, all the monitoring, all the Kubernetes applications and Helm charts, into your Kubernetes cluster. How do you do this? I personally would highly recommend Argo CD. It's a great open-source project, and it does nothing other than making sure that whatever you have defined as an application gets installed from the git repository into your Kubernetes cluster. Here you see an overview of all the applications we have. We also configure our ingress controller through git; obviously, the very ingress controller that exposes everything is itself configured in git and deployed into Kubernetes.

If you go into the details, you can see the sync status.
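To make the values.yaml contract mentioned a moment ago more concrete, a customer-facing values file for such a Jenkins chart could look like the sketch below; all keys, the repository URL, and the sizing numbers are illustrative, not our real schema.

```yaml
# Hypothetical values.yaml: the knobs a customer is expected to touch
# sit at the top; everything else lives further down in the file.
master:
  resources:
    requests:
      cpu: "2"        # small test Jenkins: 2 vCPUs ...
      memory: 4Gi     # ... and 4 GiB of RAM
seedJob:
  # git repository containing the customer's Job DSL seed job
  repository: https://github.example.com/some-team/jenkins-seed.git
  branch: master
```

The customer edits only this file; everything behind it (charts, security, plugins) stays our responsibility.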
In the detail view you see all the pods, the service accounts, everything. And the greatest thing about Argo CD, which I really love: Argo CD provides its own custom resource definition, so you define an Argo CD application as YAML, and they actually built it in a way that you can configure Argo CD through Argo CD. You only have to install Argo CD once on your Kubernetes cluster; then Argo CD sees, oh, there's an application called argo-cd, I should maintain it, checks it out, and suddenly Argo CD manages itself, which is really great.

The namespace inspector application looks very similar: not a lot to do, the target revision, where the values are coming from; and we inject secrets from Vault.

Now, how does the setup look? It's not very big and not very complicated, but already the base layer is ten virtual machines, so something like 40 or 50 vCPUs that we always need. We have Prometheus and Grafana, for our customers and for our own setup. We provide Jenkinses, but we also need a Jenkins of our own to build images and everything. And we provide Dex, an OpenID Connect identity provider. The thing is, tools like Prometheus create a UI, and it's unprotected; that's not something you can do in our company. So we put Dex in front and connect it with our LDAP server, so you can use your company credentials to log in to Dex and from there get access to Prometheus or Alertmanager. We also have an Artifactory, for caching reasons.

Now, you saw the environment; it's not that much, but already we have over 40 git repositories in our organization. It's huge.
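Going back to Argo CD managing itself: the trick is an Argo CD Application resource pointing at the git repository that contains Argo CD's own manifests. The apiVersion and field names below are Argo CD's real ones; the repository URL is made up for illustration.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: argo-cd
  namespace: argocd
spec:
  project: default
  source:
    # illustrative repo; in practice it holds Argo CD's own manifests
    repoURL: https://github.example.com/our-org/k8s-argo-cd.git
    targetRevision: HEAD
    path: .
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from git
      selfHeal: true   # revert manual changes made in the cluster
```

Apply this once, and from then on Argo CD reconciles its own installation from git like any other application.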
Our product is one git organization, and we have prefixes for all the repositories: the k8s prefix means it's an application running in Kubernetes, a Helm chart or a combination of Helm charts; the job prefix is for the Jenkins seed jobs and the stuff we configure there. We also eat our own dog food: we have our own seed job which creates all the Jenkins jobs for building the images, doing security scanning and all that. We use Terraform to create the cluster, and then we never talk about Terraform again.

Every repository has to have a README, and for the Helm charts we also have a how-to-maintain document; I will come to this in a second. And we have ops manuals. We have on-call, which basically means: a service is not running, you get a call, you have to fix something. The ops manual tells you, for each monitoring alert, what you can do. There are a few ideas in there, but it also says things like: you can delete this namespace, and thanks to Argo CD we think that's not a problem. I think right now we could delete the whole cluster, click once, and our seeding would just work; most of it, anyway.

We have our documentation in git; it's closer to the code, and it's text-based formatting, which makes moving things around much easier than in Word. It works great, and you can use GitHub Pages if you want to, which is really lovely.

Now, we have a lot of Helm charts: we have Helm charts for our Jenkins, and we use upstream Helm charts. What do we do?
We literally have documentation on where the upstream Helm charts are coming from and how to maintain them, so you can pull a chart down and change something. We are not finished with the migration, but we have been moving to Kustomize; Kustomize is patching the Helm charts we take from upstream. We need this for certain things: we have priority classes on every service we use, so we have to have the priority class definition everywhere. If it's not available in a Helm chart, we can ask upstream nicely to add it, but nonetheless, in the meantime, we have to patch it in ourselves. The same goes for all the images defined in the charts: we are not pulling them from the internet. If you do autoscaling, you cannot do that, because a node doesn't exist for long and you cannot rely on a local cache; so on Google you want to have everything in your own GCR registry.

When we started, it was unclear what we wanted to use, because there are a lot of options out there, and it's still not entirely clear; Helm 3 was just released. But Kustomize has been part of kubectl since 1.14, so this is a no-brainer now, and Argo CD supports it as well.

Keeping it stable: we have a lot of software running just to provide this, so let me show the process changes we introduced. We name it SRE, although it's probably not the SRE you know from the Google talks, and there are plenty of Google talks about this topic. For us, though, it is a stable process.
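As a footnote on the Kustomize patching described above: a kustomization.yaml for that could look like this sketch. You render the upstream chart to plain manifests, then let Kustomize add the priority class and rewrite images to GCR. File names, the image name, the tag, and the project ID are illustrative, not from our actual setup.

```yaml
# Hypothetical kustomization.yaml
resources:
  - upstream-rendered.yaml      # output of rendering the upstream chart
patchesStrategicMerge:
  - add-priority-class.yaml     # sets priorityClassName on the workload
images:
  - name: jenkins/jenkins       # image as referenced by the chart
    newName: gcr.io/our-project/jenkins   # mirrored copy in our GCR
    newTag: "2.204.2"
```

The images field is what keeps autoscaled nodes from ever pulling over the internet: every reference is redirected to the in-house registry in one place.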
The point of that SRE rotation is that we know we have this system in our team: so if I go on holiday, I know certain things are happening. It's a weekly rotation, everyone on our team has to do it, and so everyone on our team gets the time and space to actually look at every part of our infrastructure. Because we have so many different components beyond just Jenkins, you really need this time. And it gives the rest of the team the focus to work on the stuff they are working on without getting interrupted all the time. How does it look? It's a wiki page with a timetable, and then there's a checklist; nothing magic, but it works great.

We are upgrading and maintaining a lot: you can assume one person per week spends something like 10 to 30 percent of their time on it. What we also built is a Grafana dashboard which shows us all the upstream chart versions next to what we are running, to see how we are progressing. And sometimes, actually quite often, there are major versions, and we are upgrading again and changing again. Which is really nice, actually: it feels good when you read new release notes and say, oh, this is a really cool feature, I was waiting for this, it's super helpful, you love it. But you cannot just build something and keep it there; it will rot right away.

The operations manuals live on the service page: if you wake up at three in the morning, you go to this wiki page, click through your ops manuals and the Grafana dashboards, and that's it.

Now we are really close to the findings already. But we have a little problem, right? Our customers are running their current Jenkinses, and they are not updating them. I'm not sure who knows that feeling, but we have it quite often. So we decided this time to do something else, and I think it's a good practice; we saw it in a few other projects as well.
We build a Jenkins image every night. This is the Jenkins master image, and it already contains all the plugins: we pull down the base image, run Jenkins, let Jenkins install the plugins, and the image is done. We tag it with latest and with the current datetime. And we delete all images after three months, so basically there are only images available which are at most three months old. In theory we could keep certain versions around if a high-severity vulnerability popped up, but that's not implemented yet.

So we tell our customers: you have two choices. You use latest, or you use a certain datetime version, and we tell you the real limitation: your image is only available for three months. Then the customers can figure out how they want to handle that process. And to make sure a Jenkins is not magically running for longer than three months, we just kill it after three months.

We also have network policies, pod security policies, everything in place. So if a customer just spins up a Jenkins and says nothing to us, it cannot connect to the internet, and a lot of stuff is simply missing. For us it's a whitelist approach: they have to come to us and say, we want to connect to this system, and we tell them no. Or, if they really have a good reason why they want to connect to that system, we will whitelist it for them, and then it's our responsibility that this connection works. In a big corporate environment we actually have to request firewall rules and things like that, and we have to put the connection into our security concept.

Now, our learnings. Stateful services on Kubernetes are hard; Kubernetes has not been built for stateful services, and if someone tells you this works really well, I think they're lying. For Jenkins, there's no highly available setup available.
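The whitelist approach from a moment ago can be sketched with two standard NetworkPolicies: a default deny on all egress for the customer's namespace, plus one explicitly whitelisted destination. The namespace name, the CIDR, and the port below are made up.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress
  namespace: customer-jenkins     # illustrative namespace
spec:
  podSelector: {}                 # applies to every pod in the namespace
  policyTypes:
    - Egress                      # with no egress rules: deny all egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-artifact-store      # one whitelisted destination
  namespace: customer-jenkins
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 10.0.20.5/32    # made-up internal service address
      ports:
        - protocol: TCP
          port: 443
```

In practice you would also have to allow DNS egress, or name resolution inside the namespace breaks.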
I think there's some cloud bees magic plug-in Which doesn't work or work and you just can't use it For us, it's a big problem way because our build this build runs for 6,000 hundred machines So if one Jenkins breaks or five Jenkins breaks, it has to be really reliable Otherwise, I think some one because something like 30 or 50 dollars So you don't want to skip it if you have much smaller builds if you build a little bit resilient It's it's not a problem at all Make sure that you have port priority So a Jenkins master should not got thrown out because a Jenkins slave is suddenly more important than a Jenkins master doesn't work out and System ports will and should always win and they will they will do this So if you think all your Jenkins everything is fine tune. It's now stable. So I'm your Jenkins master is going. Why is my Jenkins master gone? Yeah, there's a system service and it will kill your Jenkins. I will get get in each in detail about this in a second Don't think about this cluster is now set up Assume that you can't get rid of this cluster again So what we do is we move our Jenkins artifacts to blob storage blob storage is independent from our cluster Everything which is persistent needs to be in git or in some blob storage Otherwise, it's not persistent and we don't assume that it exists Sometimes you have our money trying for example the Prometheus only saves it for 24 hours But our long-term storage with Thanos at the back end put put pushes it to GCS. 
There's persistent discs Which we use but it's not our We don't expect that they are getting not they can be deleted and should be allowed to be deleted For the blob storage for example, we actually build a small browser Which is secured to add up and then you as a customer instead of just going to say you can use the Jenkins artifact browser But for any long-term thing there is this GCS tool and you go on this and you can just use it and it will always be there Although a big issue if you're building long runs, you have to be ready for maintenance maintenance is coming up In our case Google is forcing us to do this if there's a critical security issue What do you do with the critical security issue? You fix it, right? Sometimes most the time you fix it right in this case Google just enforced it for us Which is nice. I assume they also have the security by default We are custom they tell us what to do But they also enforce a maintenance window, which is great for me because I go to manager say hey She case enforcing maintenance window. We want to have a maintenance window. This is a maintenance window No discussions need it We tried at the beginning to make our services as resilient as possible tell them QB need is you're not allowed to kill it and That doesn't work in our specific case Google just deletes it in after an hour. It looks at it for an hour and say Okay, let's go instead of now and for trying to make it as hard for QB need is itself to manage itself We are trying it that different the way We are making sure that if it's getting killed and it's getting killed more often than you wish It's more often care getting killed than you wish for we make sure that is quickly back Now the auto scaling is a big problem if you do auto scaling You you run with 10 or 50 nodes. 
And you think everything is fine. Then you do a load test with 100 nodes, and suddenly your monitoring is not happy anymore, because it's just 100 nodes more than there were five minutes ago. You have to do load tests and look at the results; it has to work.

And there's something else, not a big problem, but you have to be aware of it. Calico, for example, has a wonderful daemon which analyzes how many nodes are running, and roughly every ten nodes it reschedules its pods and wants to have more resources. And remember what I said about system services: there's a system service, and it will kill your Jenkins. Instead of requesting 250 millicores of CPU, it suddenly requests a full CPU, and then you have three of those pods instead of two. And if you try to keep your cluster highly utilized because you think cost saving, cost saving: cost saving is important, but 100 bucks per month more, just to be sure that a system service is not killing your Jenkins master, is fine.

Autoscaling is also slow. That's great if you have Jenkins jobs which run for an hour; no one cares then. But a few minutes of waiting for something which just starts is bad. So what we do now: we still use autoscaling, for cost reasons, but we over-provision. We have a pod defined which requests the same amount of resources as a Jenkins slave, with basically minus-one priority. So Kubernetes says: oh, you want a Jenkins slave? There's a pod with no priority, I'll just throw it out; and you have a smaller latency again.

Also, images are not cached: your nodes are not running for that long. They might run for five hours, they might run for an hour.
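That over-provisioning placeholder is a known pattern: a negative-priority PriorityClass plus a Deployment of pause pods sized like a Jenkins slave. The replica count and resource numbers below are illustrative.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1                 # below everything: first to be preempted
globalDefault: false
description: "Placeholder pods that reserve warm capacity."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 2
  selector:
    matchLabels:
      run: overprovisioning
  template:
    metadata:
      labels:
        run: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: k8s.gcr.io/pause:3.1   # does nothing, just holds space
          resources:
            requests:
              cpu: "2"      # mirror a Jenkins slave's request
              memory: 4Gi
```

When a real slave pod arrives, the scheduler preempts a placeholder immediately; the autoscaler then brings up a fresh node for the evicted placeholder in the background.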
Nodes might run for a day, but then they're gone.

Also think about the maximum you autoscale to. I mean, we have to be sure that we don't autoscale to 1,000 nodes, because that costs a lot of money. So we put something like zero to ten on our default node pool, the one where no build slaves are running. And suddenly you run out: upgrades no longer work because you already use all ten nodes, and if you are replacing nodes one by one during an upgrade, you need one additional node. So also don't run your Kubernetes cluster completely full; you always have to have a little bit of buffer, otherwise pod preemption starts and you get weird behavior at full capacity. We are trying not to run at that limit; if you do, it's effort.

We thought we would put everything in one cluster with one node pool. Luckily for us, it's super easy to define multiple node pools: with node affinity you say, Jenkins masters, you get one node pool, and the Jenkins slaves get another. Keep them apart; the Jenkins slaves are more secure because they are isolated. But it breaks your mental model of one big cluster.
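For reference, the node pool separation just described is, on GKE, a one-line node selector on the pod spec, using the node-pool label that GKE puts on every node; the pool name here is illustrative.

```yaml
# Fragment of a pod spec: schedule Jenkins masters only onto
# the dedicated node pool; slaves get a different pool name.
spec:
  nodeSelector:
    cloud.google.com/gke-nodepool: jenkins-masters
```

The same selector with a different pool name on the slave pods keeps the two workloads on disjoint sets of machines.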
The model was: I control it with priority classes, and I run it at the fullest to save costs. That doesn't work; at least, it doesn't work for us.

The project structure: for one Kubernetes cluster you have tons of service accounts, images, everything. Start to annotate them up front, so you actually know why this image is in that one folder and who actually uses it. It gets really messy otherwise.

If you think you have one Jenkins slave, and it's just a small Jenkins slave, and you are getting away with one virtual CPU and two gigabytes of RAM: think again. Kubernetes itself uses and needs a certain amount of resources, and as the nodes get bigger, the setup itself gets more efficient. We are running now between four- and eight-core machines; we do not go under four cores, and I would say you should be more on the six or eight cores per node. That is of course a problem if you do autoscaling and your budget is a little bit smaller than ours: it can mean the difference between 50 and 200 bucks per month, just because you cannot be that fine-granular. I expected to be very fine-granular, but we threw that overboard very quickly.

So, thank you very much. I have a little bit of time for questions and answers; if you have feedback, send me an email, and if you want to talk to me about this infrastructure in more detail, come find me. In a big company, it's really hard to open source something. It's easy to inner source it. It's hard to