Hey, everyone. Kicking us off in our first track, Operating at Scale, we have our first talk by Vadim Rutkovsky and Aditya Konarde about Cincinnati, a case study in SRE bootstrapping. I'll share my screen for their talk. We'll wait a few minutes to see if they join in for a live discussion, but otherwise I'll just continue playing their recording. All right, we have Aditya in the chat, so I'll continue with the recording.

We've all seen Red Hat's strategic imperative of delivering a hybrid cloud business model, including SaaS offerings across the portfolio. In this session, Red Hat's Service Delivery and over-the-air (OTA) updates teams will walk through a worked example of how to bootstrap an application with a modern SRE-based approach. We will discuss Cincinnati, also known as the OpenShift Update Service. We will look at how to develop a cloud-native service and how SRE practices bootstrapped this one, then at how SRE and engineering collaborate to deliver the service. Finally, we'll give you a sneak peek into our development workflows. Let's hand it over to Vadim to talk about Cincinnati.

Cincinnati is the name of both the project and the protocol; in the next slide, I'll describe what it's all about. The OpenShift Update Service, also known as Project Cincinnati, ensures that every connected OpenShift cluster is able to upgrade between different versions. It's based on the Cincinnati protocol, which is a development of Omaha, the update protocol used by CoreOS Container Linux. The OpenShift Update Service, OSUS for short, builds upgrade graphs. These are consumed by OpenShift clusters, which constantly send requests to OSUS to figure out the list of update versions available to them (a minimal client sketch of this exchange appears below). The update service requires high availability, since all of the connected clusters are requesting it. From the development side, it's also stateless, so it can be effortlessly scaled. Next slide, please.

Here is the diagram of how everything works together to make OpenShift update from one version to another. The admin can list the available versions on the console. The cluster constantly requests the OpenShift Update Service and, using the Cincinnati protocol, receives a list of versions available for upgrade. Once the upgrade begins, the cluster queries the Quay container registry, fetches the update image, and the upgrade starts. Next slide, please.

Here is the timeline of how we started developing the service and collaborating with AppSRE to achieve high availability. It all began with a minimal viable product called Cincinnati and the development of the Cincinnati protocol. We asked AppSRE to start managing two instances, one production and one stage. We collaborated to set up automatic deployment of the latest master commits to stage, and later we started working on application-level metrics covering the number of requested updates, the number of errors, and other details. Later, both instances had persistent logging set up, and AppSRE developed a tool called saasherder to manage all the deployments to both stage and prod. Later still, the Cincinnati application itself added the ability to use OpenTelemetry and tools like Jaeger to inspect what's happening inside it.
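To make the protocol concrete, here is a minimal client sketch, assuming the publicly documented OSUS graph endpoint and its JSON shape of version nodes plus index-pair edges; the channel and version values are examples only:

```python
"""Minimal sketch of a Cincinnati protocol client.

Assumes the publicly documented OSUS graph endpoint and JSON shape
(nodes plus edges of node indices); channel and version are examples.
"""
import json
import urllib.parse
import urllib.request

GRAPH_URL = "https://api.openshift.com/api/upgrades_info/v1/graph"

def available_updates(current_version: str, channel: str, arch: str = "amd64"):
    query = urllib.parse.urlencode({"channel": channel, "arch": arch})
    req = urllib.request.Request(
        f"{GRAPH_URL}?{query}",
        headers={"Accept": "application/json"},  # the service answers with a JSON graph
    )
    with urllib.request.urlopen(req) as resp:
        graph = json.load(resp)

    nodes, edges = graph["nodes"], graph["edges"]
    # Find the index of the node matching the cluster's current version.
    current = next(
        (i for i, n in enumerate(nodes) if n["version"] == current_version), None
    )
    if current is None:
        return []
    # Each edge is a [from, to] pair of node indices; collect direct successors.
    return [nodes[to]["version"] for frm, to in edges if frm == current]

if __name__ == "__main__":
    print(available_updates("4.10.20", channel="stable-4.10"))
```

The in-cluster implementation does considerably more, but the core exchange really is this simple: one HTTP GET that returns a directed graph of versions.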
Later, the Cincinnati project was renamed to the OpenShift Update Service, OSUS for short, so as not to be confused with the protocol. Next slide, please.

Our future plans, and the things we're working on now, are: support for disconnected installations, which don't require Red Hat's hosted OSUS, including the work on operating OSUS and helping customers use the Kubernetes operator to install it; support for OpenTracing on our production instances, and helping our customers make use of that; and work on memory profiling and implementing it in our end-to-end tests. Next slide, please.

Here is how we work and collaborate with AppSRE to achieve stability while developing a public service, and here are the tools and tricks we use. Most of our day-to-day work with AppSRE happens in the alert manager channel, where all the work related to the stability of our instances is organized around alerts. We treat every alert as an actionable item: we have to explain why the alert happened, and either how to prevent it from happening again, if it's a false alert, or what needs to be fixed in the affected environment, if it's a valid issue. We use the following tools to help us with that. First would be Kibana and Elasticsearch, to keep and visualize logs of the pods we're running. During development we use Jaeger to get insight into how a particular request from the customer has been handled by Cincinnati, or OSUS. And we use the oc-inject tool to add debugging tools into a live pod without restarting it, so that we can use strace, a debugger, and other tools which are not present in the production containers. We have worked with AppSRE to establish a small protocol for how we collaborate; that resulted in a dedicated interrupt-catcher engineer who can guide us and help us with questions and issues we're unable to resolve on our own. That's been incredibly helpful. Next slide, please.

During development we also think about how changes would affect the stage and prod instances. First of all, we have established end-to-end tests, and we require every pull request to Cincinnati to pass them. These tests focus on verifying the functionality of the service, but we have also added a lot of tests which ensure that the SLOs we have agreed to support cannot be breached by a change (a sketch of such a test appears below). We're also working on improving and adding more end-to-end tests based on the experience we get from alerts: we add tests from real failure cases, and we ensure that new features are covered by those tests and have their SLO requirements. Next slide, please.

All of that work is based on feedback from both teams, so we have several feedback loops set up. For instance, the stage environment always has the latest commit deployed, so we can check how the most recent code change has affected it; engineering can already give it a test, and we can find out which issues could potentially affect prod before we roll out. Every other week we also gather for observability meetings, where we discuss how to get more information about what's happening in our environments and how to improve observability; that helps us get more feedback from different interested parties. Next slide, please.

The SRE team provides the developer team a framework and a set of tools to enable them to deliver the service faster. Let's take a look at some of these tools. The first thing to remember here is that the SRE team is customer number one for OpenShift.
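Here is what an SLO-guarding end-to-end test can look like, as a hedged sketch; the stage endpoint, sample count, and latency budget are all assumptions, not the team's actual values:

```python
"""Sketch of an end-to-end test that guards an SLO, not just functionality.

The endpoint, channel, and latency budget are hypothetical; real tests run
against the stage instance with thresholds agreed between SRE and engineering.
"""
import json
import statistics
import time
import urllib.request

ENDPOINT = "https://cincinnati.stage.example.com/api/upgrades_info/v1/graph"  # hypothetical
LATENCY_BUDGET_S = 0.5  # hypothetical p95 budget agreed with SRE
SAMPLES = 20

def fetch_graph():
    req = urllib.request.Request(
        f"{ENDPOINT}?channel=stable-4.10", headers={"Accept": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body, time.perf_counter() - start

def test_graph_is_valid_and_within_latency_budget():
    latencies = []
    for _ in range(SAMPLES):
        graph, elapsed = fetch_graph()
        latencies.append(elapsed)
        # Functional check: the response must be a well-formed update graph.
        assert "nodes" in graph and "edges" in graph
    # SLO check: the p95 of sampled request latencies must stay within budget,
    # so a change that regresses performance fails CI before it reaches prod.
    p95 = statistics.quantiles(latencies, n=20)[18]
    assert p95 <= LATENCY_BUDGET_S, f"p95 latency {p95:.3f}s exceeds budget"
```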
We consume OpenShift Dedicated and run our services on top of OpenShift Dedicated. As a result, we also have 24x7x365 incident response, and we use GitOps tooling like app-interface, an open source project created by the SRE team, to deliver the services. We have other tooling like manifest-bouncer, which we will take a look at in the upcoming slides; we do static analysis; we provide observability tooling; and we have also ended up developing our own SLI/SLO framework and a bunch of CI/CD tooling. Along with that, we have soft measures like regular checkpoint meetings with services, to ensure that they are following the current best practices which are not enforced by our GitOps tooling; that also gives us feedback on what improvements our tooling needs. We also collaborate with the service development teams on feature requests for our own tooling, and we collaborate with upstream and downstream projects to be the feedback loop and contribute features upstream where possible. Here's a list of some of the integrations and automation we wrote as part of feature requests from the service teams, or to fulfill our own requirements. All of these projects are available on GitHub at the link provided.

One good example I would like to talk about is how Cincinnati drove the work around our SLOs. First of all, Red Hat as a company, and Service Delivery as an organization, needed to decide on our own approach and vernacular for SLOs, SLIs, and SLAs. We then started implementing an SLA and SLO tracking schema in app-interface, which is our centralized GitOps repository. The slo-libsonnet library, started upstream by Matthias Loibl, gave us the framework for a GitOps-based generative approach to SLIs and SLOs using Prometheus and the Prometheus Operator (a sketch of the kind of check this generates appears below). We took that and developed our own framework, after which we mandated that all of our components register an SLO in the app-interface repository. We got some early feedback, we had another quick revision, and right now it's at revision two. At that point we felt we wanted a service to go through this with us; Cincinnati kindly volunteered, and we started working with the Cincinnati team on getting the performance parameters added to app-interface. This led to a ton of other work, such as load testing and figuring out SLIs for Cincinnati, along with improvements to the metrics system. Finally, we added the performance parameters for Cincinnati and AppSRE's own applications, and the end result is a prototype Cincinnati SLO dashboard, which we will take a look at during the demos.

Nothing is possible without the collaboration between engineering and SRE. The whole idea of SRE is that engineers talk to SREs, and they work together and combine their skills to deliver a service. Let's take a look at some of the initiatives we had. First of all, as mentioned a few slides ago, we have a managed services observability working group, where interested services and tenants can come together with the monitoring engineering group and the SRE group, and we all discuss what improvements can be made to our observability tooling. Observability is the base of the hierarchy of needs for SRE, and continuously improving it is our goal. The AppSRE team also collaborates and consults on SRE practices with the teams regularly, and on a per-need basis.
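To illustrate the kind of check the SLO framework generates, here is a hedged sketch of computing the remaining error budget from Prometheus; the Prometheus URL and metric names are assumptions, and the real rules are generated declaratively from app-interface via slo-libsonnet rather than scripted like this:

```python
"""Sketch of an availability SLI check against Prometheus.

The Prometheus URL and metric names are hypothetical; in the real setup these
rules are generated from app-interface definitions via the slo-libsonnet
framework and evaluated by the Prometheus Operator.
"""
import json
import urllib.parse
import urllib.request

PROM_URL = "http://prometheus.example.com/api/v1/query"  # hypothetical
SLO_TARGET = 0.999  # e.g. 99.9% of requests succeed over the window

# Ratio of failed requests over a 28-day window; the metric name is an assumption.
ERROR_RATIO_QUERY = (
    'sum(rate(http_requests_total{job="cincinnati",code=~"5.."}[28d]))'
    ' / sum(rate(http_requests_total{job="cincinnati"}[28d]))'
)

def error_budget_remaining() -> float:
    query = urllib.parse.urlencode({"query": ERROR_RATIO_QUERY})
    with urllib.request.urlopen(f"{PROM_URL}?{query}") as resp:
        result = json.load(resp)["data"]["result"]
    error_ratio = float(result[0]["value"][1]) if result else 0.0
    budget = 1.0 - SLO_TARGET           # the allowed error ratio
    return 1.0 - error_ratio / budget   # 1.0 = budget untouched, <0 = SLO breached

if __name__ == "__main__":
    print(f"error budget remaining: {error_budget_remaining():.1%}")
```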
We also do capacity planning with the teams, so that we always know what's coming, and we stay closely in the loop with the teams on upcoming product releases or expected traffic spikes. Along with that, alerts are only good when they are tested, so we also run fire drills and simulated outages, which keeps us on our toes. And we make enhancements to the SRE tooling, which benefit multiple teams because it is a shared framework.

Let's look at what a perfect deployment looks like for us. Standard tooling saves developers from all this overhead; without the SRE team, a development team would necessarily have to either crash their application and realize these things afterwards, or develop this tooling by themselves. Manifest-bouncer, which I previously mentioned, is best-practice tooling for deployments on Kubernetes, and with the manifest-bouncer check we ensure that all deployments that get through to production are of the highest quality possible (a sketch of such checks appears below). SRE also has some standard patterns, like multi-AZ deployments, PodDisruptionBudgets, and pod anti-affinity for HA, plus some Kubernetes specifics, such as enforcing the update strategy, readiness probes, liveness probes, and limits and requests, and all of them being up to the mark. We also check for deprecated API endpoints and deprecated objects, and we manage other CI and CD workflows to standardize across teams and services and provide improvements across the board.

On this slide I will describe the work that has been done to improve Cincinnati based on the experience gained with the SRE team. Initially we had an extensive list of alerts and situations we wanted to track, but a lot of them were triggered for no reason, and that caused alert fatigue. So the first thing we did was minimize the list of alerts, leaving only the ones which show that our service is in a catastrophic state. After that work was done, we added upstream checks to ensure that SLOs are not broken when a change is proposed; that helped us avoid a few performance regressions. We also performed capacity planning for Cincinnati, which resulted in several performance fixes. After that, load test documentation was created, so that other developers on Cincinnati and in SRE would know which parts of the load tests are important and how they were performed. We worked with SRE and extended the performance-parameter definitions in app-interface, so that folks from other teams could see which performance parameters are critical for Cincinnati. After that, we worked on zero-downtime upgrades for the Cincinnati service and created a special dashboard to show the SLOs and the current state of Cincinnati, using the app-interface tooling provided by SRE. The result of all that is the assurance that no change can break the production experience we've built with the SRE team. Next slide, please.

The first focus, and the first area where we started working, was ensuring that app-specific metrics were added to Cincinnati. Then we worked on integrating with the OpenShift router, ensuring that the metrics from that router, which reflect exactly what the customer experiences, are tracked and are part of the SLOs. Meanwhile, we have been helping the SRE team work on the SLO library, added load testing and capacity planning into the framework, and made it a requirement for Cincinnati deployments. Next slide, please. After that, we tried to visualize and report more details about Cincinnati, so we created a dashboard for our SLOs, which we will show during the demo.
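Here is a minimal sketch, in the spirit of manifest-bouncer, of the deployment best-practice checks just described; the rule set and function names are illustrative, not the actual tool:

```python
"""Sketch of Kubernetes manifest best-practice checks in the spirit of
manifest-bouncer. The rules shown (probes, resources, update strategy,
anti-affinity) mirror the checks described in the talk; the function and
rule set are illustrative, not the actual tool."""

def check_deployment(manifest: dict) -> list[str]:
    problems = []
    spec = manifest.get("spec", {})
    # Rolling updates avoid downtime during deploys.
    if spec.get("strategy", {}).get("type") != "RollingUpdate":
        problems.append("update strategy should be RollingUpdate")
    if spec.get("replicas", 1) < 2:
        problems.append("need >=2 replicas for HA")
    pod = spec.get("template", {}).get("spec", {})
    if not pod.get("affinity", {}).get("podAntiAffinity"):
        problems.append("missing pod anti-affinity to spread pods across nodes")
    for c in pod.get("containers", []):
        name = c.get("name", "?")
        # Probes let the platform route traffic only to healthy pods.
        for probe in ("readinessProbe", "livenessProbe"):
            if probe not in c:
                problems.append(f"container {name}: missing {probe}")
        resources = c.get("resources", {})
        for field in ("requests", "limits"):
            if field not in resources:
                problems.append(f"container {name}: missing resource {field}")
    return problems

if __name__ == "__main__":
    minimal = {"spec": {"template": {"spec": {"containers": [{"name": "app"}]}}}}
    for p in check_deployment(minimal):
        print("FAIL:", p)
```

Running such a check as a CI gate is what turns "best practices" from a wiki page into something no deployment can skip.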
Next slide, please. One of the biggest improvements from the capacity analysis was a pull request which increased the performance between two components of Cincinnati fifty-fold, thanks to the analysis done by the SRE team. This issue was identified during the initial stages, and later on we added special tests which ensure that this performance hasn't degraded since then. Next slide, please.

We worked a lot to avoid false positives, and now, whenever an alert happens, we either fix the underlying issue or work with the SRE team to improve the alerting so that we exclude the false positives. All of the alerts are defined in the upstream Cincinnati repo, so we work on them independently, and every deployment automatically applies those upstream alerts. We try our best not to bother the SRE team without a reason, and we try to triage incoming alerts ourselves; that also helps us improve our knowledge of how the service behaves. We also work with SRE to make sure they understand the alerts as well, so we try to document them and involve SRE folks whenever we add a new alert. Next slide, please.

So now that we have seen what the deployment for Cincinnati looks like, let's take a look at it in production, and at some of the dashboards we have. The first thing I would like to show you is our visual representation of Cincinnati in app-interface. The app-interface repository, being a GitOps repository, needs to be visualized in a nicer way, so we have this tooling called visual-app-interface, which queries the app-interface server over GraphQL and displays a nicer visual representation of the YAML files in app-interface. Here we can see the Cincinnati application file. It provides useful information for the SREs at a glance, such as the description of the service, a Grafana dashboard, what the service's SLO is, what the contact points are, and what the dependencies are, plus direct links to production. I'll leave out the links to production here, but you can see that Cincinnati production and Cincinnati stage are running on two different clusters, and we as the SRE team have direct links to these clusters; that helps SRE quickly access any of the services when needed.

The next tooling I would like to describe is the Cincinnati SLO dashboard. This dashboard is based on the metrics generated by the SLI/SLO framework, which in turn builds on slo-libsonnet, so it follows the SRE book's best practice of multiwindow, multi-burn-rate alerts (sketched below). We also have availability, errors, volume of requests, and latency at a glance, so that SREs can quickly triage the health of a particular service. We follow a common pattern across all services, so that everything is standardized and we are all speaking the same language. We have some additional screenshots here of the Cincinnati SLO dashboard, and we also sometimes have service-specific dashboards that let us dive deeper into the service specifics. I think that's it, so please let us know in the chat if you have any questions.

All right, so that was Aditya and Vadim on Cincinnati, a case study in SRE bootstrapping. If you have any questions, please feel free to enter them in the chat; we have Aditya in the chat, so he'll go through and answer all of your questions.
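Finally, a minimal sketch of the multiwindow, multi-burn-rate evaluation those dashboards and alerts are built on; the threshold table follows the SRE Workbook's recommendations, while the 99.9% target and the observed burn rates are illustrative:

```python
"""Sketch of multiwindow, multi-burn-rate alert logic.

Thresholds follow the SRE Workbook recommendations for a 30-day window; the
burn_rates input stands in for Prometheus queries over each window and is a
hypothetical example, as is the 99.9% target.
"""

SLO_TARGET = 0.999  # hypothetical 99.9% availability SLO

# (long window, short window, burn-rate threshold, severity).
ALERT_RULES = [
    ("1h", "5m",  14.4, "page"),    # burns 2% of a 30-day budget in 1h
    ("6h", "30m", 6.0,  "page"),    # burns 5% of the budget in 6h
    ("3d", "6h",  1.0,  "ticket"),  # burns 10% of the budget in 3d
]

def should_alert(burn_rates: dict[str, float]):
    """burn_rates maps window -> observed error rate divided by budget rate."""
    fired = []
    for long_w, short_w, threshold, severity in ALERT_RULES:
        # Both windows must exceed the threshold: the long window proves the
        # burn is significant, the short window proves it is still ongoing,
        # which is what keeps false positives (and alert fatigue) down.
        if burn_rates[long_w] > threshold and burn_rates[short_w] > threshold:
            fired.append((severity, long_w, short_w))
    return fired

if __name__ == "__main__":
    observed = {"1h": 16.0, "5m": 20.0, "6h": 4.0, "30m": 3.0, "3d": 0.5}
    print(should_alert(observed))  # -> [('page', '1h', '5m')]
```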