Welcome to our user report on running Zuul in production. We operate a Zuul instance at the Open Telekom Cloud, and we would like to share our experience with setting up, configuring, and doing day-to-day operations of this continuous integration platform. My name is Nils Magnus, but I am really just the host of this session, since I have the pleasure of introducing Artem Goncharov, the main designer and chief architect of our Zuul installation at the Open Telekom Cloud. The stage is yours, Artem.

Thank you, Nils. I expect pretty much everyone here clearly understands that CI and CD are a must in every modern software development process. Five years ago there were not many fitting options for CI/CD; nowadays there are many more: Travis CI, GitLab, GitHub, and others. For our installation we weighed the advantages and disadvantages of every possible solution. Since we are a downstream OpenStack provider offering additional services on top of OpenStack, we obviously run OpenStack ourselves, and our development process has the same requirements as OpenStack's. So we said: Zuul is simply the best fit for all of our needs.

We then compared how Zuul is used in the OpenStack installation, with its huge number of projects, and tried to map that to our use case. Gerrit is cool, but it is far too complex for many developers. If we expect users of the Open Telekom Cloud to contribute to the development of our CLI extensions or our additional Ansible modules, the Gerrit workflow is too complex for anybody who is not familiar with the OpenStack development style. GitHub, on the other hand, is already there and provides the code review features that Gerrit is mostly used for. So we said: Gerrit we will not use, and since we historically already host all our source code on GitHub, that is the way for us to go.

That brought us to our first proof of concept, around two or three years ago. Back then, more or less the only workable solution was to deploy Zuul on bare VMs, and we decided to use Windmill for that, a project that helps configure the Zuul components using Ansible roles. We deployed one scheduler, a ZooKeeper cluster, a Nodepool builder and launcher, one executor, and one web instance, publishing our logs to Swift. We were also able to push some changes to the Zuul source code to address some of our needs, such as support for security groups.
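To give you a picture of the layout, here is a minimal sketch of such a deployment expressed as an Ansible inventory. The hostnames and group names are purely illustrative and not our actual Windmill configuration:

```yaml
# Hypothetical inventory for a minimal Zuul deployment like our first PoC.
# One host per component group; only ZooKeeper is clustered.
all:
  children:
    zookeeper:
      hosts:
        zk01.example.com:
        zk02.example.com:
        zk03.example.com:
    zuul_scheduler:
      hosts:
        zuul01.example.com:
    zuul_executor:
      hosts:
        executor01.example.com:
    zuul_web:
      hosts:
        web01.example.com:
    nodepool_builder:
      hosts:
        nodepool01.example.com:
    nodepool_launcher:
      hosts:
        nodepool01.example.com:
```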
After the installation, it was time to honestly judge whether this was good or not. When you start with Zuul, I expect that pretty much everyone operating it, at least back in those days, enabled debug mode. And hackers were not sleeping, especially back then, and they actually found sensitive data in our logs. From GitHub, where we store our project configuration and Zuul configuration, there are links to the UI of our installation, where you can find links to the log files stored in Swift. Some of those log files unfortunately contained credentials, pretty sensitive credentials, so that literally the whole domain was compromised. What can you do in this situation? Simply tear everything down and try to limit the damage. So that installation was gone after probably a couple of months of running. It ran, it helped us, but not for long.

It took a while to decide what to do next, and due to some organizational issues this time was rather long: around nine months passed before our second attempt started. We weighed options for running an OpenShift or Kubernetes cluster, but maintaining an OpenShift or Kubernetes cluster is additional effort in itself and yet more knowledge you need to bring in, so we decided to continue with a bare-metal installation, only this time a bit differently.

In the overall architecture you will not see anything new. Compared to the first installation we have three executors, and three isolated pools in which the test VMs run, so that the danger of a compromise is somewhat reduced. Some facts about this second attempt: our installation is still publicly visible and publicly accessible, we store our source code on GitHub, and Zuul is tightly integrated with it, meaning the log files of execution runs are still available to everyone.

We install on bare metal, as I already mentioned, and this time we decided to give the Fedora CoreOS image a try: a pretty nice and cool operating system that gives you the opportunity to scale up and down very flexibly, with everything in containers, as modern as it can be. But we faced issues with it pretty much immediately, within maybe a few days. Fedora CoreOS likes random reboots. Not really random: when the operating system detects that an update of the image is available, it fetches the update and reboots without asking anybody. While that is a cool feature for containerized workloads in general, it is not very cool here, and Zuul does not like it.

The next challenge: everything is inside containers, sure, but what about the Nodepool builder? That did not really work properly, at least back when we started the second attempt, especially for SELinux-enabled images, so the Nodepool builder could not be properly containerized. Maybe today it is different, but back then it was not.

And then, back to the issue with reboots: a ZooKeeper cluster pretty much dies after each reboot. If you have a cluster of three instances and two of them reboot at the same time to install updates, your cluster is gone. Yeah, that is it.
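For anyone running into the same reboot behaviour: updates on Fedora CoreOS are driven by the Zincati agent, and they can be switched off declaratively at provisioning time. A minimal Butane sketch (assuming the standard Zincati drop-in path; transpile it to Ignition with the butane tool) might look like this:

```yaml
# Butane config that drops a Zincati override disabling automatic
# updates, so the node never reboots behind your back.
variant: fcos
version: 1.4.0
storage:
  files:
    - path: /etc/zincati/config.d/90-disable-auto-updates.toml
      contents:
        inline: |
          [updates]
          enabled = false
```

The trade-off is obvious: you give up automatic patching and have to schedule updates yourself, but at least the ZooKeeper quorum stays alive.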
So we were forced to move some of the workloads and some of the components to regular operating systems, either CentOS or a regular Fedora installation. Basically, we moved the ZooKeeper cluster, Nodepool, and the Zuul scheduler onto those proper operating systems, and everything else remains, as of now, on Fedora CoreOS, running in containers, with the Nodepool builder exception I already mentioned. The containers are rootless or rootful depending on what exactly they run, so we are using the complete wildness of what Podman under Fedora CoreOS gives us.

For the second installation our GitHub Zuul application was re-registered: we packed it with new keys and connected it to Zuul. In the meantime, GitHub itself improved a bit with regard to how it handles privileges and permissions. What many previously criticized is that as an admin in GitHub you could literally do everything, bypassing Zuul completely. Since our first installation at least some parts here were improved, and in some areas you can now really enforce that even as an admin you cannot overrule Zuul.

A very important thing to mention: we run lots of functional tests against our cloud. We develop lots of public-facing tools, such as the Terraform provider extended with support for our additional services, our own additional Ansible modules, our extensions to the OpenStack SDK and CLI, and so on. We want to deliver good software quality, so we need to run real functional tests in the real cloud, and all of this is still publicly visible, so security is still a concern and needs to be treated carefully.

Some facts about our workflow. It is nothing really weird; it pretty much follows the regular recommendations of the Zuul installation documentation. We enforce that every project managed by Zuul has branch protection, otherwise Zuul simply ignores it. The check pipeline must pass, otherwise the pull request will not be merged. Code review is also enforced, and the "include administrators" option is on, exactly to help us avoid the situation I mentioned, where admin privileges let you merge a pull request without the tests passing. The Zuul application itself must also be allowed to write, which is a bit annoying considering you have already granted the Zuul application write permissions in the project or the installation, but you still need to explicitly allow it to write to the protected branch.

We use a pretty new GitHub feature, draft pull requests, which help us identify work-in-progress pull requests and, as I will cover in more detail later, partially replace the workflow functionality of Gerrit. The check pipeline is pretty much as usual; we also have cross-project dependencies and use the Depends-On feature very heavily. Then we have our post-review pipeline, check-post. We were forced to introduce this pipeline because Zuul treats secrets a bit differently from all the other CIs: if you want to use secrets in a job, that job must be reviewed by somebody before it is executed. Since we really want to run those functional tests for our projects before we try to merge, we introduced this pipeline, and together with draft pull requests we rely on the pull request approval as a kind of meta-approval while the pull request is still a draft. This gives us the possibility to really run those tests before merging.

Gating is implemented by taking pull requests with the gate label, pretty much the standard way you would expect it to be.
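As a rough sketch, assuming a label simply called gate, such a label-driven gate pipeline looks roughly like this; a real configuration would carry more requirements and reporters:

```yaml
# Illustrative gate pipeline for the GitHub driver: adding the "gate"
# label enqueues the pull request; on success Zuul merges it.
- pipeline:
    name: gate
    manager: dependent
    trigger:
      github:
        - event: pull_request
          action: labeled
          label: gate
    require:
      github:
        label:
          - gate
    success:
      github:
        merge: true
    failure:
      github:
        comment: true
```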
Now we can look in detail at the limitations and problems we have faced with this type of installation. These are exactly the things we would like to discuss with the community, because maybe somebody else is experiencing the same problems.

So again: we gate using the gate label as the trigger for merging, and with Zuul there is no possibility to limit who can actually do that. In Gerrit you have rich possibilities to limit who can trigger a workflow; in GitHub, unfortunately, that is not possible, and actually everyone who has more than read permissions in the project is able to manage labels. This means it is pretty hard to isolate. Users with admin or maintainer privileges in the project are also still able to merge pull requests bypassing Zuul. So while we enforce that the check pipeline passes, gating can still really be bypassed, and this is a challenge we are not able to properly address as of now, except that we try to give regular users at most write permissions in each project and to manage the administrators and their privileges carefully.

Next: manually managing over 100 projects in GitHub. We currently have two organizations in GitHub, and Zuul takes care of more than 50 projects as of now. For all of those 50-plus projects you need to configure all the branch protections properly and verify that user access is set up properly, and doing that manually, which is what we have today, is becoming a real challenge. We are trying to come up with some automation, but such automation, whether triggered through Zuul or not, would need what amounts to super-admin privileges in GitHub, which is itself a security-relevant issue and might not be very nice.

An issue we also had from time to time is that Zuul does not respect, or at least did not respect back then (as far as I know there is work in progress towards fixing it), code review requirements. In GitHub you can configure that code review is required, but Zuul itself does not really check what exactly is required: whether more than one review is expected, or whether a code owner review, that is, a review from somebody specific, is required. So Zuul tries to gate and merge a change, but the merge fails, meaning Zuul consumes resources without really being able to do the merge. From the security point of view everything is fine, but it is a real waste of resources.

Then there is the single maintainer or contributor problem. While this works with Gerrit, in GitHub you cannot review your own pull request as its owner. This means that if you are the only contributor and the only maintainer of a project, with this particular Zuul workflow you will not be able to merge a single change into your project. This is not good, but as of now I do not see any proper workaround for it.

The next challenge we see: there is no easy way to trigger jobs in other pipelines, like periodic or promote pipelines, without involving an admin. Since we use GitHub heavily, we address regular developers who do not necessarily have a deep understanding of Zuul, so involving them in Zuul administration is not really the proper way, and permanently bothering a Zuul administrator just to trigger a job is not good either. There are ways, such as writing your own admin UI for this purpose or using JWT-based access to the admin API, but this is still not a very nice concept: it still requires an understanding of Zuul internals, or at least a deeper understanding of Zuul than you can expect from a regular QA person who is responsible for keeping the periodic jobs and tests running.
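For context, a periodic pipeline in Zuul is normally driven purely by a timer trigger, roughly like this sketch (the schedule here is just an example):

```yaml
# Illustrative periodic pipeline: jobs attached to it run on a fixed
# cron-style schedule, with no user-facing way to trigger them ad hoc.
- pipeline:
    name: periodic
    manager: independent
    trigger:
      timer:
        - time: '0 2 * * *'
```

Everything outside that fixed schedule, for example re-running last night's failed job on demand, is exactly the part that today ends up with the administrator.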
The next challenge we have is secrets handling: the concept in Zuul is pretty different from the one you have in Travis, GitHub, and all the other CIs. We really have trouble explaining to people why this is important and why we require them to use some workarounds, or to change their regular development workflow, just to achieve this security. It is definitely required for security, but it sometimes really makes life complicated.

Next, separation of logs with sensitive data. As I already mentioned, we run functional tests against a real cloud, and running functional tests means you are playing with real accounts. Using secrets definitely helps, but there are places where you cannot really do it in any other safe way, and those credentials still might leak. So we need to work on a way to distinguish whether the logs of a run should be treated separately and not become publicly available.

We also have an issue with the continuous delivery scope, or actually with the lack of it. While Zuul runs Ansible, and Ansible is able to do everything, it is still not very easy to use Zuul for continuous deployment or continuous delivery with the help of Ansible. The inventory is there, but you cannot really do that much with it. So continuous delivery is possible, but not in the way we would like it to be. There is also a lack of Ansible collection support; I guess there is work going on in that regard, but at least as of now there is none.

We also observe, from time to time, jobs being lost, and Zuul really looping in retries of individual jobs, or actually of all the jobs, so that we are forced to completely restart all the processes, which is not good. Unfortunately, we have not had enough time to really dig into the details there. And as I said already, Fedora CoreOS comes with Podman, which means we use Podman for all of our containers and we get all the problems of Podman: periodic crashes, ports not being bound, problems with overlayfs, and so on. But we try to overcome that.

As for the future plans of our installation: we want to migrate at least some parts, if not everything, onto an OpenShift cluster. We are not yet clear on the way to do that, whether we would like to use some existing projects or come up with our own solution, but this is a clear direction for us. We also want to start running at least some jobs in the OpenShift cluster rather than spinning up VMs for them, and we definitely want to do more continuous delivery with Zuul. As I said, we already do some parts of that, like publishing our documentation to some web servers, but we would definitely like to do more.
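To tie the secrets story and the delivery story together, this is roughly what such a delivery job looks like in Zuul. All names here are made up, and the ciphertext, which Zuul's encryption tooling produces against the project key, is abbreviated:

```yaml
# Hypothetical docs-publishing job: the SSH key is stored encrypted in
# the repository and only decrypted by the executor at run time.
- secret:
    name: docs_webserver_key
    data:
      ssh_private_key: !encrypted/pkcs1-oaep
        - "...base64 ciphertext, shortened here..."

- job:
    name: publish-docs
    description: Upload the rendered documentation to the web server.
    run: playbooks/publish-docs.yaml
    secrets:
      - name: webserver_ssh_key
        secret: docs_webserver_key
```

Because the job uses a secret, Zuul only runs it after review, which is precisely the behaviour we keep having to explain to newcomers.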
And that is our experience with the Zuul installation. Well, thank you a lot, Artem, for your overview of all the details. We are now open for questions and answers, if time permits: please connect to the conference system here to ask your question, or get in touch with us directly; on the last slide we have our contact details as well.

The last thing in this session I would like to point out is our OpenStack scavenger hunt, an easy treasure hunt with a few questions to be answered on the occasion of the tenth anniversary of OpenStack itself. We are raffling off a photo drone, so there are some nice nerd gadgets to win, and if you are interested, there is the URL with the link for that. With that said, thank you very much. We are looking forward to feedback, comments, and any questions. Thank you very much and bye-bye. Bye.