Okay, everybody, let's move on. Let's welcome our next presenter, Elvir Kuric, who is a member of the performance engineering team at Red Hat and is going to tell us about building an OpenShift cloud on RHV. Welcome, Elvir.

Thank you. So thank you very much for joining us this morning. As Evgeny said, this presentation is about how we in the performance team built an OpenShift cluster on top of RHV, which problems we faced, how we solved them, and how you can build it yourself in your own environment. In this talk I'll say a bit about motivation, then about the hardware used for this installation. I'm also going to explain how OpenShift was built on top of RHV: the complete process of building the RHV virtual machines, feeding that information into the Ansible playbook, and how OpenShift was installed. Then, for the test, how we created 20,000 pods on top of this OpenShift installation. After that I'll walk through the critical places I hit during the installation, where I spent quite a lot of time just understanding why they happen and how to avoid them. Is the sound level good enough? Okay.

In the previous talk it was mentioned that it is good to have someone you can call at 3 a.m. for support on a particular problem. That is one motivation: you can build OpenShift on top of RHV and it is fully supported, and if you are a Red Hat customer, that means a lot. A second reason was to build OpenShift Container Platform on top of RHV, because you can find documents explaining how to build it on Google Cloud, on Amazon, on OpenStack, but for RHV I did not find any; maybe I didn't look in the right places. This could be particularly interesting for people and customers who want to use OpenShift and all the features it offers, but who don't have the skills and knowledge to build it on top of OpenStack. A lot of people are skilled in traditional virtualization technologies like VMware or RHV; I say traditional meaning a bit older than OpenStack, and learning OpenStack just to build OpenShift on it could be a problem for them. It was also a project task in the performance team: we wanted an environment where we could start as many pods as possible for another project, related to OpenShift metrics, and tomorrow at 1 p.m. my colleague Ricardo and I are going to talk more about that. Once the cluster was built, the number of pods we planned to schedule was around 20,000. We wanted even more, but we were blocked by specific bugs and it didn't make sense for the project to go further.

This is the hardware I used for this configuration. It was pretty good hardware, except for this red line, the storage backend, which obviously was not good enough to support the number of virtual machines and pods created. In total I had 19 hypervisors in the RHV cluster, totaling about 2.5 terabytes of RAM and 688 cores. Every hypervisor had two network interfaces, one 10 gigabit and one 1 gigabit, and the 10 gigabit interface from every hypervisor was put into a logical network.
So all OpenShift communication happened over that 10 gigabit logical network. As storage I used an iSCSI storage domain on RHV, and all virtual machines were on it. The storage backend hardware was a VNXe 1600 from EMC; not the best solution, I think.

The RHV installation, I guess, is nothing new: kickstart RHEL, and build RHV on top of RHEL. I followed the path of installing RHEL and then adding the RHV packages on top of it. The other option would be to use the RHV ISOs directly and follow that installation path. Once RHV is installed, the storage domain configured, the logical network configured, and everything in place, you need some way to create the virtual machines themselves in that RHV environment, and without the API it would be really hard. Here is an example of how I did it: I used the Python oVirt SDK. I'm pretty sure there are a million other ways to do it smarter and better. If you open that link on GitHub, you'll find a script which let me create a number of virtual machines with X virtual CPUs and Y gigabytes of memory, and at the same time attach each machine to a specific RHV network (a sketch along these lines follows below). I was creating one machine at a time, starting it, and then creating the next one. I could do threading and create everything in parallel, but I'll tell you a bit later why not to do that.

Okay. When creating virtual machines, unless you say otherwise, RHV creates them on sparse disks, which saves storage space: if you create a 30 gigabyte virtual machine on a sparse disk, it does not occupy that full size on the backend storage immediately; the space is consumed over time. The second option is preallocated storage. In my tests I noticed that the sparse storage type is quite problematic for masters, etcd servers and routers, especially for etcd, because etcd is very sensitive to delays, and during mass pod creation it led to problems with etcd. So I put my etcd machines and masters on preallocated disks; OpenShift nodes stayed on the sparse storage type. When using the API, you have to say explicitly that a disk is preallocated. The trade-off is that creating a virtual machine with a preallocated disk takes time directly proportional to the size of the machine, because it does a complete clone of the storage in the background, and at 200 machines that can cost quite a lot of time.
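The exact script lives at the GitHub link from the slide; as an illustration only, here is a minimal sketch of the same idea with the Python oVirt SDK v4 (ovirtsdk4). The URL, credentials, cluster, template and sizing are placeholder assumptions, not values from the talk.

```python
# Minimal sketch, not the exact script from the talk: create and start RHV VMs
# one at a time with the oVirt Python SDK v4 (ovirtsdk4).
import time

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://rhvm.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,  # lab setup; pass ca_file=... instead in production
)
vms_service = connection.system_service().vms_service()

for i in range(210):
    vm = vms_service.add(types.Vm(
        name='ocp-node-%03d' % i,
        cluster=types.Cluster(name='Default'),
        template=types.Template(name='ocp-golden-image'),
        memory=8 * 2**30,  # 8 GiB
        cpu=types.Cpu(topology=types.CpuTopology(sockets=1, cores=2)),
        # NOTE: masters/etcd should get preallocated disks; via the API that
        # means cloning the template disks as RAW/non-sparse, omitted here for
        # brevity. Attaching the VM to a specific logical network (a vNIC
        # profile) is likewise omitted.
    ))
    vm_service = vms_service.vm_service(vm.id)
    # Wait until the disk copy finishes, then boot this VM before creating
    # the next one -- the staggered start keeps the iSCSI backend from being
    # hammered by hundreds of parallel boots (see the power-on story later).
    while vm_service.get().status != types.VmStatus.DOWN:
        time.sleep(5)
    vm_service.start()
    time.sleep(15)

connection.close()
```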
Okay, this I mentioned already. In this specific cluster I had three OpenShift masters, three etcd servers, one router, or rather load balancer, seven infra nodes which I used for infrastructure pods, meaning the router, registry, logging and metrics pods, and the remaining 210 nodes were reserved for ordinary application pods. In total there were 224 machines in the cluster. In this configuration the etcd servers were not on the same boxes as the OpenShift masters, and I think it is good to divide these machines. There are different opinions; discussing this with different people, some think that if you divide them there could be network latency problems between the OpenShift masters and etcd. But at the same time, separating them gives you flexibility: if they are together and you lose one machine, you lose both an etcd member and a master, while if they are divided you have a bit more room to breathe. There is also another problem, a Bugzilla I opened a couple of months back: on OpenShift machines with many pods, it happens that iptables and iptables-restore cause high CPU usage. That is a particular problem when there are many pods, and that Bugzilla is still open. I noticed that problem on my etcd machines when they were together with the masters, so that is one of the reasons these were self-standing boxes. The load balancer was a self-standing box as well. Also, the masters were not schedulable, and since the recommendation is to have dedicated nodes for the infrastructure pods, I created special nodes just for running those.

In three steps, the installation is: create the virtual machines; collect the IPs of the VMs once they are created, because you need some way to gather the IPs in order to feed them into the Ansible playbook later; and finally use openshift-ansible to build the cluster (a sketch of such an inventory follows at the end of this part).

When I first tried to install this, and I tried multiple times, I naively created a golden image in RHV without preinstalling any packages. If you just take bare RHEL and configure the repositories properly, the installer downloads all packages during the installation. That works fine for five, maybe ten machines, but when you have 220 machines, it does not. That first installation attempt lasted, I don't know, seven hours, and I closed it and went home because it was downloading everything. That was my mistake. The next thing I did was increase the Ansible forks in the Ansible config; the default is five, but I bumped it to ten because I wanted a few more Ansible forks running in parallel. Then I went back to my RHV template and, based on my previous experience with Ansible and OpenShift, realizing that it would do everything in parallel on a lot of machines, downloading packages and so on, I preinstalled the NFS, Gluster and Ceph packages and, I think, a bunch of others, for example PyYAML. I also figured out which SELinux Booleans need to be on, or will be on after the OpenShift installation, and set them in advance on the golden image. After I reworked the RHV image like that and repeated the installation, installing 224 nodes lasted around two hours. So here is the main advice: if you are installing a big cluster, put as much data as possible into your golden image.

Are there any questions at this point? Okay, the question was what storage was used for Docker inside the virtual machines. There was no special additional disk, like a /dev/vdb, inside the virtual machines; I was using device mapper built on a logical volume on top of the root partition. So it was inside; there was no separate /dev/vdb, no, that was not the case. One more time, please? A local mirror, yes, a local mirror. The question was where I was downloading the packages from during that long first installation: from a local mirror. I mean, it would have been a bit unusual, and I would say stupid, to pull everything from the CDN, the content delivery network. So yes, it was local, maybe a couple of meters away.
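For reference, hedged because openshift-ansible variable names changed between 3.x releases, the collected IPs get fed into an inventory of roughly this shape; the hostnames and deployment variables here are placeholder assumptions.

```ini
# Minimal openshift-ansible inventory sketch (OpenShift 3.x style);
# the hostnames stand in for the IPs collected from the freshly created VMs.
[OSEv3:children]
masters
etcd
lb
nodes

[OSEv3:vars]
ansible_user=root
openshift_deployment_type=openshift-enterprise

[masters]
master[1:3].example.com

[etcd]
etcd[1:3].example.com

[lb]
lb1.example.com

[nodes]
master[1:3].example.com openshift_schedulable=false
infra[1:7].example.com  openshift_node_labels="{'region': 'infra'}"
node[1:210].example.com
```

And the forks bump mentioned above goes on the installer host:

```ini
# ansible.cfg on the installer host: raise parallelism from the default of 5
[defaults]
forks = 10
```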
Okay. Then, for another project, I needed as many pods as possible; it didn't matter what kind of pods, because I just wanted to collect memory, CPU and network numbers. So I decided to create 20,000 pods on top of this cluster. In the performance team we have a cluster loader tool; you have the link here. It could be a separate presentation, because it is a big, big project, but in a nutshell: in a YAML file you specify the number of projects you want, the number of pods you want per project, and which image to use, then you run it; it is written in Python, and after some time it creates all of that for you. It was also necessary to change one default: pods-per-core is 10 by default, so in this configuration I had to set pods-per-core to zero to allow more than 10 pods per core (see the sketch below). There were no special requirements on the pod type; I just wanted as many as possible.

Okay, lessons learned. Try to get a fast storage backend, meaning hardware. The one I used is, by its own specification, an entry-level storage array from EMC. It is probably fine in general, but for this number of virtual machines it caused me quite a lot of trouble. Sorry, yes: iSCSI with jumbo frames, and I had four paths; it was multipathing over iSCSI. The question was how many paths we were using from the hypervisors to the storage backend.

Next: etcd and masters on different machines. I think they should be on preallocated disks when running on RHV; if you ask me how this looks for the OpenStack case, I cannot really answer. And etcd is quite sensitive to delays. So, this one: if you find messages like this in the etcd logs, it is saying the server is likely overloaded. I saw this before I moved to preallocated disks. At first I thought it was a network problem, but I had a 10 gigabit network and I never saw nodes going to the NotReady state, so the network was ruled out. Then, in my disappointment, I said okay, let's move etcd just for a test to /dev/shm, purely in memory, set everything up again, and it worked fine; I never saw the message again. It was just a test, of course. I recreated everything on preallocated disks, and it was better.

On the etcd machines, I think it is not necessary to run the atomic-openshift-master, atomic-openshift-node or docker services. Even when such machines are not schedulable, the OpenShift masters are going to poll them for status, so that just adds load. It also sometimes happens that time on virtual machines does not behave well, so ensure it is in sync; it should be, but just check. Once you hit this problem, you will see something like "deadline exceeded by 1.7 milliseconds", and you will probably go to /etc/etcd/etcd.conf and try to play with the election timeout and heartbeat interval parameters, and get the disappointment that it is not going to help you. The original values are 500 milliseconds for the heartbeat interval and 2,500 milliseconds for the election timeout.
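As a reference, hedged because exact file layouts differ between OpenShift 3.x releases: the pods-per-core change mentioned earlier goes into the node configuration, typically /etc/origin/node/node-config.yaml.

```yaml
# /etc/origin/node/node-config.yaml (OpenShift 3.x; restart the node service
# after editing). Setting pods-per-core to "0" disables the per-core cap.
kubeletArguments:
  pods-per-core:
    - "0"
  max-pods:
    - "250"   # the absolute per-node cap still applies; raise as needed
```

And the etcd timing parameters just mentioned live in /etc/etcd/etcd.conf:

```ini
# /etc/etcd/etcd.conf -- values in milliseconds
ETCD_HEARTBEAT_INTERVAL=500
ETCD_ELECTION_TIMEOUT=2500
```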
But if you change these, the complete process of pod creation gets slower. As for that server message, I found the reason can be in the storage, in the network, or both; in my particular test case it was storage, but it can also be the network. Check the CoreOS etcd documentation, there is nice advice there.

Okay, I also noticed something else. When you work at a customer site and install everything, and they say fine, it all works, but then: let's shut everything down and power it on again. That happened to me a couple of times when I was working for a telecom; you need to power down everything because they want to relocate it all 200 kilometers. Once, I shut down all my virtual machines over the API very quickly, and once they were all in the down (red) state, I powered them on again over the API, also very quickly. That meant the storage backend got hammered by 220 machines trying to boot at the same time, and some of them did not boot properly. With better storage that should not happen, but in any case, put a delay between starting machines, at least let's say 10 to 15 seconds: start one, wait, start another, and so on. That is the reason my script creates one machine at a time and starts it immediately after creation. I could do that in parallel, and that would probably be a nice improvement, but it is not implemented yet.

Okay, and this is an additional motivation: this is an example of Red Hat technologies all playing together to build OpenShift. We have RHEL plus RHV plus OpenShift Container Platform plus Ansible to build a platform as a service. Another solution would be to use OpenStack instead of RHV. This presentation will be on GitHub in a couple of seconds. Are there questions? Yes, please.

"Does preallocated mean just from the RHV point of view?" Yes, preallocated is the space on the storage domain, from the RHV point of view. Thin provisioning was still used inside the virtual machines for Docker: when you use the device mapper storage plugin for Docker, it uses thin LVM inside. So yes, both can be true at once. Okay, you had a question.

"First of all, thanks for the presentation. I'll probably ask you a lot of questions afterwards, because I have exactly the same task. But, for example, how did it behave if, with 20,000 pods running, you shut down one master or one etcd? Did you test those scenarios?" Thank you; let me repeat the question into the microphone: did I test what happens when a master or etcd goes down? No, I did not shut down a master or the etcds during that testing, because that was not the task. It would be good to see how it behaves, but I had no need for it, to be honest. There are three masters and three etcds, and if there is a problem when we shut down one master, then that's an issue for Bugzilla. Yes, please, just a second.

"What about capabilities...?" What specifically do you mean? I think we cannot compare these; they are different things: an OpenStack installation versus an installation on RHV.
I mean, there will always be people who specifically want RHV, or that type of virtualization, rather than OpenStack. I didn't investigate the points of which is better, RHV or OpenStack, if I understood the question correctly, so I don't think we can fully compare them; that's my opinion.

"One more thing: you could probably have done the whole thing using Ansible, and then it would make the whole thing easier." You mean using Ansible to create the virtual machines and so on? Yeah. As I said, there are probably ten different ways to do this part of creating virtual machines that are better than this one; I'm not claiming the exclusive right that this is the only one. That would be excellent. Please do.

"I also have an opinion from my side, from what I have been testing with oVirt: the storage is the major pain point, and you don't actually need centralized storage for running pods. They don't need live migration, they don't need redundancy, because you should use persistent volumes for that. So I was testing with local storage domains, which are the disks of the hypervisors, and then you don't even have central storage, because it is the bottleneck anyway. As soon as you get more powerful storage, you just double the number of pods, and it kills it again. This is the bottleneck. Did you try this or not?" I mean, the problem is not storage for the pods, persistent volumes inside the pods; the problem is the storage for the virtual machines themselves, where the virtual machines live. The problem here was with that iSCSI storage domain. I can completely ignore storage for the pods and use persistent volume claims hooked to some other storage somewhere; that would work as well. The whole problem was the virtual machines and the storage we used for them.

More questions? We have a lot of time. No? Okay then. Thanks a lot, Elvir, we'll continue after this. Thanks for the attention. Just one more thing, can I? Please come tomorrow for the second part. And FYI, there will be a second part of this story tomorrow, presented by Elvir at 1 p.m.