So, hello everyone, thanks for joining our presentation, Failover from OpenStack to AWS based on SCI and TFT. My name is Miroslav Vadkerti, I'm the team leader of the Testing Farm team and I also work in Linux QE. Today I'm joined by Miloš Prchlík, principal QE, also in the Testing Farm team in Linux QE and the main developer of the service we'll be showing you today, and by Evgeny Fedin, another software QE engineer from our team.

So what have we prepared for you today? We'll go a little bit into the introduction and problem space and show you the use cases we are trying to solve with this failover. Then we will briefly look at the architecture of the service we have created, some details of its programmable routing, drivers, metrics, and the API and CLI, its main interfaces. We will show you how you can deploy the service yourself, what our setup is, and also our future plans.

The Testing Farm team is a team which develops, maintains and runs a testing system as a service, which is now backing Fedora CI, Packit and RHEL CI. These services use it as a testing backend, so we do all the testing for them. We also maintain and run an internal Red Hat CI system, and the service is basically shared between those two. Important facts to mention from the start: we are focused on testing the operating system, namely RHEL, Fedora and CentOS. We run several thousand tests per month in our services. We use multiple infrastructure providers where we provision mostly VMs or bare metal machines, because when you are testing the operating system, container tests are usually not enough; for most cases, and for the functional tests, we need VMs or bare metal machines.

So how did our testing pipeline look before? If you are testing the operating system, your pipeline usually ends up something like this. This is a very simplified view where on the input there is the artifact, be it a package or a module, which you are trying to integrate into the base operating system. Then you have the testing process, and somewhere in that process you provision the VMs that you will be running your tests against. Previously, we had separate pipelines for all the infrastructures that we support: OpenStack, AWS and Beaker. OpenStack and AWS are mostly used for VMs and Beaker for bare metal machines, and this setup has some drawbacks.

This is how our pipeline looked before, so what were the problems there? If there was an outage in one of the provisioning providers, it usually meant that the test pipeline would fail, and that was a pain for our users. Also, it was not possible to fail over between the different provisioners even if they had the same capabilities. For example, OpenStack and AWS are similar infrastructures in terms of VMs, but we couldn't fail over between them transparently for the user. Also, in case of usage spikes, none of the infrastructures is actually infinite, right? So you can get into a state where you cannot provision enough machines because you run out of some resource, and you cannot do anything about it: you're still using one infrastructure and the testing gets delayed or, in the worst case, fails. It's also not possible to cloudburst, right? You would like to cloudburst to a different compatible infrastructure in case of usage spikes.
And the big drawback was that the pipeline needs to know details about how to provision the specific resource on that specific infrastructure, and it also needs to know, before it is running, which infrastructure will be used, because we had three different pipelines, right?

So to tackle all these problems, we have created a service which we call Artemis. It's a standalone service with a well-defined API and hardware specification, and it has programmable routing. We will be deep diving into the details a little bit later on; you can actually program how it will choose the infrastructure and how it will do failover and cloud bursting, according to your needs. It also takes care of short-term outages. In Red Hat, as we use OpenStack, it sometimes happened that OpenStack was not stable: in some cases it returned 500 from the API, but on the next retry it worked, right? These retries are also something that Artemis can take care of for you, transparently. And just to mention, Artemis as a service is focused on getting you one machine. If you need multi-host scenarios or anything like that, that should be part of the pipeline; currently it is really only good for getting you one machine from a pool.

Here I have examples of how this could be set up. On the right side, we use OpenStack as the primary infrastructure, and if OpenStack is down, has an outage or the cloud quota is full, we can move the workload to AWS, transparently to the user. On the left side, we can leverage AWS and its ARM infrastructure, and if we are over budget and don't have money to use more AWS, because that is a paid resource, then we can move to Beaker. And for other architectures where we have no other option, we can just use Beaker, for example. I'll now hand over to Miloš for the Artemis internals.

So we basically split the provisioning process, from the first request to actually getting the final machine, into several distinct steps that are repeatable and that we can cancel at any time and try again. The service we built to solve our troubles is powered by what we call routing, a script that takes care of picking a suitable provisioner, the actual cloud service, Beaker or whatever else. It is based on many inputs, and the routing can happen again at any time when we run into trouble with any of the pools, any of the real services that we use for provisioning. We retry a lot. The tasks are designed to be retryable, because our experience with these services is that they are often fragile, so retrying again in a couple of minutes with some exponential backoff is very, very useful for us (a minimal sketch of such a retry loop follows at the end of this section). And if we run into trouble, we can just try a different kind of pool without bothering the user behind the API; they don't have to be aware of any trouble and don't really care about the actual infrastructure used for the provisioning.

Internally, the service is built on top of a queue and a broker, with a database added for persistence. It provides metrics and alerts based on the standard Prometheus and Grafana setup, and we threw in Sentry for monitoring because we don't like it when errors escape our attention. So next slide please. The service works with what we call pools. They represent the actual provisioning services like OpenStack, AWS, Azure, Beaker, whatever else we would like to connect it to.
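To illustrate the retry behaviour described above, here is a minimal sketch of what a retryable provisioning step with exponential backoff could look like. The names, exceptions and delays are illustrative assumptions, not Artemis' actual code.

```python
# Minimal sketch of a retryable provisioning step with exponential backoff.
# All names here (TransientPoolError, acquire_instance, ...) are illustrative.
import random
import time


class TransientPoolError(Exception):
    """Raised when a pool fails in a way worth retrying (e.g. HTTP 500 from the cloud API)."""


def acquire_instance(pool: str) -> str:
    # Placeholder for the real driver call (OpenStack, AWS, Beaker, ...).
    # Cloud APIs can be fragile, so this may raise TransientPoolError.
    raise TransientPoolError(f"{pool} returned HTTP 500")


def provision_with_retries(pool: str, attempts: int = 5, base_delay: float = 60.0) -> str:
    for attempt in range(attempts):
        try:
            return acquire_instance(pool)
        except TransientPoolError:
            # Exponential backoff with a bit of jitter: roughly 1, 2, 4, 8 ... minutes.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 10))
    # Give up on this pool; the routing can now pick a different one for the request.
    raise RuntimeError(f"pool {pool!r} did not deliver a machine after {attempts} attempts")
```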
Each of these pools provides some virtual machines or bare metal machines with a particular configuration or configurations, whether different from the other pools or the same, it doesn't matter. It's usually possible to get more than one kind of virtual machine from OpenStack or the Amazon cloud, so we organize these machines, these flavors, into several pools, and then we can basically switch transparently between those pools without involving the requester. Next slide please.

Yeah. So for communicating with the machine providers, we needed to create some unified interface, so we created the drivers. It's a layer between Artemis and the virtual machine providers. They take care of running the commands that create instances in the providers themselves. They also translate the generic environment specification into a cloud-specific configuration: for example, the OpenStack driver will translate the compose into an OpenStack image name and the hardware specification into an OpenStack flavor. Drivers also provide resource usage for the machine providers, such as the number of used CPU cores, used memory, number of instances, etc., as well as limit metrics, to make sure we are not overusing our providers. For now, we have drivers available for four machine providers: OpenStack, AWS, Beaker and Azure. Next slide.

Right. The heart of the service, the core that makes the decisions, we call routing, and its only purpose is to pick the most suitable pool or pools for a given request. The request usually specifies some software requirements, like what kind of distribution or compose, and hardware requirements, like how much RAM, for example, the user needs. It's a Python script, so the whole power of the language is available to us, together with a couple of interfaces that Artemis provides to the script. Routing has access to the history of the provisioning and to the current state of all the pools involved, so it can check their metrics, the rate of successful provisionings and the usage of resources. Then there's a couple of policies that basically filter the input set of pools to get a final list of the most suitable pools for the provisioning (a rough sketch of such a policy chain follows at the end of this section). Faulted pools or pools that are exhausted are removed from the set, and at the end you end up with one, two or five pools that can be used for the provisioning, and that's it.

If the provisioning runs into trouble, for example our downstream OpenStack suddenly has an outage or we lose the connection to one of the pools, Artemis will take care of starting the provisioning from this point again, and the routing, because it has access to the history of the provisioning and to the pool metrics, can basically pick another pool or try again with the same one in a couple of minutes, hoping things get better. All the decisions are stored, everything is monitored, all the errors are logged, so in case of trouble humans are usually notified quite easily. Otherwise it's hard to tell what's going on, so Artemis exports many metrics and will export even more in the future, so maintainers can get insight into what routing decisions were taken, what the usage of resources in each of the pools is, which pool was favored, how many machines were provisioned from the different pools, what the latency of API requests is, and so on. So next slide please.
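To make the policy idea above more concrete, here is a rough sketch of how a chain of routing policies could narrow down a set of pools. The pool attributes and policy names are assumptions made for illustration, not Artemis' real interfaces.

```python
# Rough sketch of a routing policy chain: each policy filters the candidate pools.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Pool:
    name: str
    arches: List[str] = field(default_factory=list)
    healthy: bool = True          # e.g. derived from recent provisioning failures
    free_instances: int = 0       # e.g. derived from the pool's resource metrics


def supports_arch(pools: List[Pool], request_arch: str) -> List[Pool]:
    return [p for p in pools if request_arch in p.arches]


def drop_unhealthy(pools: List[Pool], _arch: str) -> List[Pool]:
    return [p for p in pools if p.healthy]


def drop_exhausted(pools: List[Pool], _arch: str) -> List[Pool]:
    return [p for p in pools if p.free_instances > 0]


def route(pools: List[Pool], request_arch: str) -> List[Pool]:
    # Whatever survives the whole chain may be used for provisioning.
    policies: List[Callable[[List[Pool], str], List[Pool]]] = [
        supports_arch, drop_unhealthy, drop_exhausted,
    ]
    for policy in policies:
        pools = policy(pools, request_arch)
    return pools


pools = [
    Pool("openstack", arches=["x86_64"], healthy=False),
    Pool("aws", arches=["x86_64", "aarch64"], free_instances=10),
    Pool("beaker", arches=["x86_64", "aarch64", "s390x"], free_instances=3),
]
print([p.name for p in route(pools, "x86_64")])  # -> ['aws', 'beaker']
```

The point is that each policy is plain Python, so a team can encode its own preferences, budgets or fallback order.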
For example, here are a couple of charts from our downstream monitoring panel, which should hopefully land in the upstream repository soon. We can easily see how much we use each pool, how many machines we are provisioning right now, how many failures we had in the last few hours, and how often a particular pool was broken so that we were forced to use a more expensive one, expensive in terms of not just money: for example, if we try to provision something from Beaker, it takes more time, which delays the testing. This can also be involved in the routing. Our downstream routing prefers the cloud provisioners for quicker turnaround, but if they are overused or have an outage, we can fall back to Beaker and accept the penalty of more time needed to provision the machine. Next slide please. As an example, this is how Artemis provides us with charts for usage, how many instances we have already grabbed from an OpenStack pool, for example. Next slide please.

Communicating with Artemis is done using the API, which is designed to be fully REST compatible. It's based on a polling mechanism, so when you submit a guest request you need to periodically check its state until it is ready (a small sketch of this polling workflow follows at the end of this section). To create a guest request you need to pass the environment specification, which consists of the architecture and the operating system, like Fedora 33, CentOS 8, etc. Optionally you can specify the exact pool you want a guest from, and also optionally you can specify that you want a guest with snapshot support. We are working on adding the capability to specify hardware requirements of the guest. The API exposes all functionality as well as some metrics, and the API documentation is implemented using API Blueprint and available in the API web interface. For a better user experience we have created a CLI called artemis-cli. It uses the REST API to communicate with Artemis and is built using the Click Python package. The output of the CLI is JSON formatted, so you can use it to integrate Artemis into your environment. Next slide.

So let's look at an example. This is how you send a guest request to Artemis. It returns all the data available at the moment; the most interesting for us are the address, which is not available yet, the guest name, which is the ID of the request, and the state. States represent the internal status of the guest request in Artemis, so in this example 'pending' means the guest request is in the queue but has not been evaluated yet. After sending the guest request you can inspect it using the inspect command; you need to pass the guest name as an option to the command. Here we see the state has changed to 'provisioning', which means Artemis sent a request to a machine provider and is waiting for a response. Next slide. If we wait a couple more minutes and send the inspect command again, we will see the guest request is in the 'ready' state, which means the guest is available, and we can see the IP address and start working with it.

Right, the idea behind Artemis is basically that each team has different needs and different ideas about how the resources should be used, so we try to make Artemis easy to deploy for a team. Teams can deploy an Artemis instance, connect it to the resources and services they have available, and implement just a couple of simple routing policies so it makes the decisions most suitable for their workflows, and that's it. We don't aim for one solution that solves all problems of all teams; it's definitely supposed to be tailorable to each team's needs. We are not there yet, it's not that easy to deploy Artemis, but we are working hard on that, to provide Helm charts so you can easily deploy it as an OpenShift application combined with the best practices for these apps.
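As an illustration of the guest request and polling workflow described above, here is a hedged sketch in Python using the requests library. The endpoint paths, payload fields and deployment URL are assumptions based on the talk, not a verified API reference; the API Blueprint documentation is the authoritative source.

```python
# Hedged sketch of the Artemis polling workflow; endpoints and fields are assumptions.
import time

import requests

ARTEMIS_API = "http://artemis.example.com/v0.0.1"  # hypothetical deployment URL

# Environment specification: architecture plus the operating system (compose) we want.
guest_request = {
    "environment": {
        "arch": "x86_64",
        "os": {"compose": "Fedora-33"},
    },
    "keyname": "ci-key",  # assumed field naming the SSH key to install on the guest
}

response = requests.post(f"{ARTEMIS_API}/guests/", json=guest_request)
response.raise_for_status()
guestname = response.json()["guestname"]

# Poll until the request moves through its states ('pending' -> 'provisioning' -> 'ready').
while True:
    guest = requests.get(f"{ARTEMIS_API}/guests/{guestname}").json()
    if guest["state"] == "ready":
        print("guest is ready at", guest["address"])
        break
    if guest["state"] == "error":  # assumed terminal failure state
        raise RuntimeError("provisioning failed")
    time.sleep(30)
```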
At this moment there is also a deployment available for developers: one is based on minishift and a second one uses docker-compose. There are some differences between the two, some developers prefer the first one and some the second one, but it's easy to provide both.

Awesome. So just a few notes, because we don't have too much time, on how we actually did the OpenStack to AWS failover for RHEL. RHEL tests run internally, so we had to have a virtual private connection in AWS, connected to the internet, which was set up by our IT department. Once we had this, we deployed Artemis, set it up to use the pools, of course we deployed it internally, and set it up according to our needs. That's basically how easily we implemented connecting to AWS from an internal private infrastructure cloud. If you want to know anything about this, just ping me or find me and I will get you more information if you are from Red Hat.

As for our future plans, we plan to write more documentation, do more testing and add more metrics, and we try to focus on stability, resilience and reliability. We already run Artemis in production, so it's fairly stable; we have run it for, I think, half a year, so we have quite good experience with the deployments and everything, but we are working on improving the installation, as Miloš mentioned. The big feature that we want to have soon is being able to properly specify hardware requirements, because currently it's very basic, and we want Artemis to be able to choose the correct infrastructure according to more specific hardware requirements. The CLI, as you saw, is very basic, just some JSON, and later on it will definitely get improved. If you want to find us, we are in the TFT channel on Freenode, or also on the internal Red Hat IRC if you are a Red Hat employee. Thanks very much, we are ready for questions.

First question: could you compare and contrast with Terraform? Of course, Terraform is awesome if you are setting up your whole infrastructure, and we actually use it for some of our stuff, so Artemis is not a replacement for Terraform. It's really focused on usage in CI systems, where you can programmatically define these failovers and the cloud bursting, and that is not something you can do in Terraform: there you basically describe the environment, how it should be set up, then you run terraform plan and the infrastructure is created for you. So the focus is definitely different; Artemis is currently mostly for CI systems, for provisioning VM resources.

Next question, from Nick Piper: are you likely to extend it to also be able to provision virtual networks, with several machines in that network? Currently no, and we expect that you do this outside of Artemis: on the infrastructure, you prepare the virtual networks as you need, and then you use Artemis with this failover, for example if you prepare a network in the different clouds. So currently definitely not; we would not like to complicate the current way of things very much, but we are in conversations with some internal teams which have these special requirements and we are having talks with them. So if you have some specific needs around this, definitely contact us and we can talk about whether Artemis could be of any use to you.

Andre Buda asks: is there a Jenkins integration for Artemis, so that Jenkins can ask Artemis for an instance and know when Artemis is done? Currently no. If you would like to write a Jenkins plugin, it should be fairly easy, because
we have a REST API. You can do the integration from a shell step currently: using artemis-cli, for example, to provision the machines in a shell step and then tear them down in some post step is possible, but currently we don't have a direct Jenkins integration. If somebody would like to write that, or has some Groovy experience, we would be glad, but currently it's not our focus and we don't have anything Jenkins-based ready.
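As a sketch of how such a shell-step integration could look, here is a small Python helper that drives artemis-cli and consumes its JSON output. The exact sub-commands and options shown are assumptions based on the talk, not a documented interface.

```python
# Sketch of driving artemis-cli from a CI job (e.g. a Jenkins shell step).
# The sub-commands and options are assumed for illustration.
import json
import subprocess


def artemis_cli(*args: str) -> dict:
    # artemis-cli prints JSON, so its output is easy to consume from scripts.
    out = subprocess.run(
        ["artemis-cli", *args], check=True, capture_output=True, text=True
    )
    return json.loads(out.stdout)


# Provision step: ask for a Fedora 33 x86_64 guest and remember its name.
guest = artemis_cli("guest", "create", "--arch", "x86_64", "--compose", "Fedora-33")
guestname = guest["guestname"]

# ... poll with `artemis_cli("guest", "inspect", guestname)` until state == "ready",
# then run the tests over SSH against guest["address"] ...

# Tear-down step (e.g. a Jenkins post action): return the machine to the pool.
artemis_cli("guest", "cancel", guestname)
```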