Okay, so I'm going to be talking about how we manage services using GitOps and GraphQL. This talk is a story about our team, the App SRE team within Red Hat. I'm one of the SREs on this team, and what I'm going to talk about is our evolution since last year into this year, and how we used GitOps to design a way to scale up.

Last year we were running just one service, basically just one, and we were seeing that we weren't able to scale up, so we tried to implement a system based on GitOps to allow us to deliver more services. Today we are managing 17 services. When we built the GitOps solution that we implemented, we identified a few challenges, and we built a few tools around those challenges to alleviate them, and I'm going to talk especially about those tools. At the core of the idea, we borrowed the Kubernetes programming model, specifically the idea of controllers. We created controllers, we call them integrations, but the same idea applies: they essentially reconcile the current state to the desired state, just like operators do in Kubernetes. I will talk about the integrations that we have, how we build them, and how everything fits together. The last thing is what we can improve in this platform and what the roadmap ahead is.

So let's start with the team. The team is called App SRE, which stands for Application SRE, and our goal is to run internal services and deploy them on OpenShift Dedicated. The key idea is that we are regular customers of OpenShift Dedicated, so we consume it in the exact same way that customers consume OpenShift Dedicated. What is OpenShift Dedicated?
Well, it's a fully managed Kubernetes offering that Red Hat has. It's a product, a service you can buy a subscription to, and you are able to deploy your applications on these OpenShift clusters. OpenShift Dedicated is fully managed by Red Hat, and one thing I think is worth mentioning: as a tenant, as a customer of this product, you don't get cluster-admin, because the cluster is being managed by other teams in Red Hat. What you get instead is something called dedicated-admin, which essentially allows you to become admin in all the namespaces. This is just a minor detail, but I think it's interesting, because Red Hat builds OpenShift Dedicated, and even though our team is inside Red Hat, we consume it as customers, so we have to abide by the requirements and by the way the service works.

I said we run services and applications, so I guess I have to explain what kind of applications we run. Typically they are web applications, or microservices that compose a bigger service than themselves. I'm going to talk about two examples of things that we run. We run these projects — Telemeter, OCM, which stands for OpenShift Cluster Management, Hive, and others — and all these services together compose the OpenShift Dedicated provisioning platform. So when you go to cloud.redhat.com/openshift you will see a UI that allows you to deploy OSD clusters, provided that you have a subscription, and everything that builds the OpenShift clusters and manages them, those are services that we run. I like this a lot because it's a chicken-and-egg situation: we're running, on OpenShift, the service that allows you to provision OpenShift.
So I think it's interesting, and I think it exposes very well the capabilities of OpenShift Dedicated, because it means you are able to build things as complex as a fully managed platform just with OpenShift Dedicated. Another example is Eclipse Che, che.openshift.io. It's an online IDE; you can go there and use it, and it has interesting features.

So how do we run a service? Well, the theory is very simple. We generally have two environments, staging and production. The developers give us Kubernetes deployments and we just run them, with `oc apply` or `kubectl apply`. We have to provision other OpenShift resources like ConfigMaps and Secrets, and if they ask for databases we provision them on RDS, etc. But in order for it to be possible for us to offer this to the customers — when I say customers I mean the internal developers who are the owners of the service — we have to manage them. We have to give them access to the clusters, manage the role bindings, ask them for their usernames, set up the pipelines, so they are actually able to do promotions to staging and promotions to prod. So the idea is very simple, but in the end it gets a bit convoluted, because you have to manage a lot of things.

Let me introduce a few of the things that we had to deal with back in 2018. We are SREs, right, so we try to automate everything, but every interaction we had with the developers meant a different thing we had to do, and we kept building stuff. For instance, a new developer comes and says, hey, I want to deploy a service. So what do we do? We create a new namespace and provision the resources they want. Of course, this is automated.
We had a script that does this. So this introduces two things: manual processes — the manual process being running a script that provisions all the things they require to be bootstrapped — and Jenkins pipelines, so we can deliver the pipelines that they will need to actually lifecycle these services. Another example is when they came to us asking, hey, we need a database. At first, for a long time, we were just provisioning the databases through the AWS console, and that was a lot of clicking, so we ended up using Terraform. So now we have another thing, Terraform, and in order to send them the passwords for the database we had to ask them for their GPG keys, so we could send them the password encrypted.

Another example, and this one was quite problematic: when new team members came, we had to grant them access to all the namespaces. How do we know what namespaces they need access to, in what clusters? We had to keep a log of what the teams were, who the members of the teams were, what namespaces they owned, and we had a data repository in which we included all this data. When they wanted to create a secret, same story: they had to give it to us encrypted, so we had to exchange GPG keys, and then we would apply it and save it locally in an encrypted git repo that the developers didn't have access to. So you know what I'm getting at: this was a bit of a mess.
We got a lot of team interrupts, developers saying, hey, we want to do this, and we got stuck at this stage. Manual reconciliation was a problem: if someone left the team, we had to deprovision them from the clusters, from GitHub organizations, all these things, and we had to keep track of those, so we had other data repositories where we tracked them, and separate scripts that dealt with this.

So the summary of that year was that we had a lot of processes. We had a bunch of scripts written by different SREs on our team, some in Go, some in Python, some in Bash, and they weren't interconnected with each other. Of course, we automated stuff: we were using Ansible, we were using Jenkins jobs to provision things, we had small GitOps-like setups, so when you committed something to a repo, something happened. But in the end we were running just one service, and we had a bit of a mess, and we really needed to get out of this, because the idea for the team was to be able to run a lot more services.

So fast forward one year. The previous slide finished about a year ago, around 2018, after summer, September or so. Nowadays we're running 17 services, managing 250 developers, lots of roles, lots of permissions, AWS accounts, Quay orgs, and we do this with the same seven members we had then — the people have changed, but the number has stayed the same — and we think we are prepared to scale a lot more. We are basically doing this using the solution that we built around GitOps. So let's talk a bit about the solution itself. This is the naive approach, the initial approach.
This is the design that we had in our minds: let's have one single git repo with everything in it, and the whole goal is that if someone wants to do something, they just send a PR, and when we merge it, something happens that configures the service, onboards someone, offboards someone, etc. We wanted to do this using reconciliation loops, in the Kubernetes style of things. You have different scripts — this box here represents the several scripts, and you could read each script as one controller: deploy-to-OpenShift scripts, manage GitHub organizations, manage Quay registries, things like this. The idea was very pretty, but we identified several things that we didn't like, and I will get to that in a minute.

So let me talk a little bit more about reconciliation loops in Kubernetes. There have been some talks today about these things, the controllers, and I'm not going to do a better job, but I want to explain exactly what we borrowed from this idea. The first thing, for me personally, was that we were changing from an imperative model to a declarative model. Instead of saying, hey, we want to do this, let's run this script, the idea is to describe your desired state — it's declarative — and then something happens that evolves the current state into your desired state. In Kubernetes, you have your user-provided desired state, which resides in etcd, a simple infinite loop that repeats again and again, and controllers — which in our GitOps setup we call integrations. They watch specific resource types, like ReplicaSets, and they deal with the differences: if the user wants a ReplicaSet with seven pods but there are only six pods, it has to deploy a new pod.
So it reconciles the state. This is what we wanted to do, and this is what we wanted to translate to the GitOps world. So we were on a good track — single repo, reconciliation loop — but we found four big problems. The first one is schema validation; I will talk about this in the next slide. Then we have three interconnected problems, which are data redundancy, repeated logic, and language independence. We found that if we did this, the integrations repeated a lot of code: there was a lot of logic in the integrations that was simply reading from the git repo, loading the files, doing stuff that didn't provide any value. You could say, well, write a library to alleviate that, but if we did that we would be tied to one language, and we wanted something that would allow us to write one controller, one integration, in Go, another one in Python, another one in Bash. That was the whole idea.

So, back to schema validation. Let's say you have a script that reads from the git repo, and it expects someone to define a user this way: a field called name, a field called github_username, and a list of permissions, where every permission has those three items. The whole idea here is to be able to self-service this, so developers will be sending PRs. What if they make a mistake?
What if instead of typing `github_username` they type `github_user`? The script will fail, because it's expecting `github_username`. What if they omit the org, and it's a required field that an integration needs? The most obvious schema validation problems are typos and missing required fields, but there are a lot more: if there's a field that represents an email address, we want to apply a regular expression and ensure that it's a valid email address, or a URL, things like this. So this is something that we wanted to fix in order to be able to set this up.

The three other things: data and application logic redundancy, and language independence. This one is a bit harder to explain, but the idea is, let's say we have two developers, Hymen Melis and Robert Johansson, and they are both on the same team. Chances are they will have the same collection of permissions. Does this mean that we need to write the same list of permissions in two different files? What happens if we want to add a permission? We would have to modify all the members of the team. So it's obvious at this point that what we need is a database; we need something that will provide some normalization to this. This would be the ideal solution: instead of writing out a list of permissions, I just want to reference a role, and I want this role to be defined in another file, right?
So whenever I modify that one file, it affects all the people that are using this role. So we wanted to do GitOps with the ability to have relationships, foreign keys, and things like this. Also, we saw that if we could somehow expose this, we would be solving the language independence problem as well, because we would be putting in front of the repo something that already knows how to solve all these problems and exposes this information to us.

So let me introduce you to the two main components that we built. The first one is schema validation. In reality it's just a Python script that checks every file in the git repo against a schema, and that's all it does; it's super simple and was very easy to write. The other part, the one that is a bit more complex, is that in between the git repo and the integrations we deployed qontract-server. That's our name for a GraphQL server that essentially knows how to read the data from the git repo and exposes it via GraphQL. With this in place, we believed we were solving all the things that we had identified as problems.

So, the qontract validator. Let me explain how it works; it's extremely easy. As I said, that's the URL if you want to open it. It's just a Python script which uses a Python package called jsonschema, and it uses JSON Schema to validate the documents. This is a shortened real example: every document in the git repo has a reference to a schema that will validate it, and the schema is just a JSON schema, as defined in the JSON Schema spec. So you can say things like: this has to be an object, because it's a dictionary of fields; we define only two fields, name and github_username.
We don't allow any additional properties, and both of them are required. If you try to do something that doesn't match this, it will fail and the PR will get rejected. So this idea was super simple.

Now, qontract-server. This is GraphQL, so let me explain briefly what GraphQL is. GraphQL is a query language. When you implement a service and you want to expose an API, instead of implementing a REST API you can implement a GraphQL API, and I will demo this in a bit, so in case you have never played with GraphQL, you will see how the query language works. GraphQL provides a server runtime and tooling, so it's very easy to build the schemas that you need in order to create the server. It allows for queries and mutations — you can mutate data — but in our case we are not using mutations, because we're mutating the data directly in the git repo, not via GraphQL: we want things to be PRs, because we are doing GitOps. An interesting thing is that you only get the fields that you request.

The box there is an example of a GraphQL schema. It defines a type, Character, and it has two fields: name, which is a String with an exclamation mark, meaning that it's required; and appearsIn, which is an array — because of the square brackets — of Episode, with an exclamation mark meaning the elements cannot be null, and the array itself also has to be there; it cannot be null, it has to be at least an empty array. As you can see, this fits in very nicely with JSON Schema: we're defining required fields, whether or not they are strings, whether or not they are arrays. So it was very clear to us that GraphQL was a good solution to mix with JSON schemas. Our implementation, qontract-server, we wrote in Node.js with Apollo GraphQL.
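To make the validation step concrete, here is a minimal stand-in for what the validator does, in plain Python. The real validator uses the `jsonschema` package, as mentioned above; this sketch only hand-implements the two failure modes from the talk (typos and missing required fields, plus a basic type check) so it can run without dependencies, and the example document keys mirror the talk's user example.

```python
def validate(doc, schema):
    """Collect error messages; an empty list means the document is valid."""
    errors = []
    props = schema.get("properties", {})
    # Missing required fields (e.g. omitting a field an integration needs).
    for field in schema.get("required", []):
        if field not in doc:
            errors.append(f"missing required field: {field}")
    # Unknown fields (e.g. a typo like github_user for github_username).
    if not schema.get("additionalProperties", True):
        for field in doc:
            if field not in props:
                errors.append(f"unknown field: {field}")
    # Basic type check for string fields.
    for field, spec in props.items():
        if field in doc and spec.get("type") == "string" \
                and not isinstance(doc[field], str):
            errors.append(f"{field}: expected a string")
    return errors

# The schema from the talk's example: two fields, both required, nothing else.
USER_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "github_username": {"type": "string"},
    },
    "additionalProperties": False,
    "required": ["name", "github_username"],
}

print(validate({"name": "Jane Doe", "github_username": "janedoe"}, USER_SCHEMA))  # []
print(validate({"name": "Jane Doe", "github_user": "janedoe"}, USER_SCHEMA))
```

The second call reports both problems at once: the required `github_username` is missing and the typo'd `github_user` is rejected as an unknown field, which is exactly why the PR would fail the check.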
We're not just node.js developers, but we We believed at the moment that the ecosystem for GraphQL was miles ahead in the JavaScript world So we went with that and I think we made a good choice. We wrote it in TypeScript Quoting someone from the team. It's the only way to write a sign Same JavaScript, and I think it's it's awesome And essentially it's exposed to the good repo data and allows you to solve the references so you can Remove the duplication of code so Basically, it's a it's a relational relational database We have files that are users files that are roles files that are permissions and we can establish relationships with them So if you query a user you can say hey So what roles does the user have because it's in the spec in the JSON schema We have that we have we're saying that the roles that the user has a parameter which is roles that points to a role Yeah, so this also other problems that that that we that we had and There's another There's another cool thing that in contract server does and I think it's the coolest thing that it does It allows you to do back references The idea is that in the schema. We're not defining users users not there. We're just defining I mean, I mean the yellow box We're not defining the yellow box that contract server is smart And if you query roles and say hey, what are the users that are pointing to this role? 
It will show us those users. So just by loading this data into qontract-server, the GraphQL server, we're able to navigate through the foreign-key relationships, and that's what makes this powerful, I think.

We've made a bunch of assumptions about what's in the git repo, and the repo has to satisfy the requirements that qontract-server imposes. It has to be a collection of data files, and each data file has to be YAML or JSON. It has to have a key, `$schema`, which points to the schema file that will validate it. And if you want to reference another file, you simply define a key, `$ref`, with the relative path to the other file, and that's it; that's all you need to do. We also need to provide the JSON schema validation files that say a user has these required fields, etc. And, unfortunately, we still have to provide the GraphQL schemas, but this is something that will be going away soon, because we want to generate the GraphQL schemas from the JSON schemas. There are a couple of things we need to solve before we get there, but that's where we're going, so only the first two things will remain in the long run.

Okay, so I'm going to do a demo. In this demo I'm going to show you a few data files in this repo, I'm going to show you how the validation works, I'm going to show you how qontract-server, the GraphQL server, works, and give a quick overview of the GraphQL query language. I'm going to try and do it here, but I cannot mirror the laptop because I'm on Wayland, so I don't know. Okay, so this is the repo; we call it app-interface. Inside the data folder we have the data files that we are going to be exposing. You'll see we only have four here in this example; I removed everything else.
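As an aside, the `$ref` convention just described can be sketched in a few lines. This is a conceptual illustration, not the server's code: the file paths and contents below are invented, and parsed YAML is shown as already-loaded Python dicts.

```python
# An in-memory stand-in for the repo: path -> parsed YAML/JSON document.
# A role references a permission file via the $ref key, as in the talk.
REPO = {
    "teams/example/roles/app-sre.yml": {
        "$schema": "access/role-1.yml",
        "name": "app-sre",
        "permissions": [{"$ref": "teams/example/permissions/quay-admin.yml"}],
    },
    "teams/example/permissions/quay-admin.yml": {
        "$schema": "access/permission-1.yml",
        "service": "quay",
        "level": "admin",
    },
}

def resolve(node, repo):
    """Recursively replace {"$ref": path} nodes with the referenced file."""
    if isinstance(node, dict):
        if set(node) == {"$ref"}:
            return resolve(repo[node["$ref"]], repo)
        return {key: resolve(value, repo) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve(item, repo) for item in node]
    return node

role = resolve(REPO["teams/example/roles/app-sre.yml"], REPO)
print(role["permissions"][0]["service"])  # quay
```

Following the reference inlines the permission document into the role, which is what lets one shared role file update every user that points at it.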
So, to keep it simple, we have a couple of users, a role, and a permission. Let's open this user. This user has a `$schema` key which points to the schema file that we will take a look at now, and it has the few fields we need. It's a regular YAML file; it could also be JSON. In reality this is my user; it has like ten roles or something like that, I just removed all of them. You have the GPG key and all these things, so everything the service needs to know is here. The role is just a collection of permissions; as you can see, there's the `$ref` which points to another file, and qontract-server understands this. And now I'm going to show you the schema for a user. This is a real schema that we're using for users. It has all the fields that we require, all the fields that are possible. It's actually quite simple; it's just a regular JSON schema.

Okay, so, sorry — here it is. I can't see anything. Okay, so let's start with users. This is a regular GraphQL server, so we have — actually, I can't do it like this, I'm going to connect the other laptop. Okay, much better.
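Before the demo output, the "you only get the fields you request" behavior from the GraphQL overview can be illustrated with a toy projection function. This is not the server's code, just a conceptual sketch; the user data below is made up.

```python
def select(data, selection):
    """Project `data` down to the fields named in `selection`.
    A value of True means "leaf field"; a dict means a nested selection,
    mimicking how a GraphQL query names nested fields."""
    if isinstance(data, list):
        return [select(item, selection) for item in data]
    out = {}
    for field, sub in selection.items():
        value = data[field]
        out[field] = value if sub is True else select(value, sub)
    return out

user = {
    "name": "Jane Doe",
    "redhat_username": "jdoe",
    "roles": [{"name": "app-sre", "permissions": ["quay-admin"]}],
}

# Ask only for the username and the names of the roles:
result = select(user, {"redhat_username": True, "roles": {"name": True}})
print(result)  # {'redhat_username': 'jdoe', 'roles': [{'name': 'app-sre'}]}
```

The response has exactly the shape of the query: `name` and the role's `permissions` are omitted because they were not requested.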
Okay, so we are going to query the users, and we simply need to define the fields that we want, and this returns the users from qontract-server. Now, the cool thing about GraphQL is that you define the fields that you want, so you could add something like redhat_username, and this will give you the Red Hat username. But if you remember, in the user file we had a roles array which pointed to other roles. Let me prettify this a bit. If we list the roles that we want — there you go, you have the roles. We just followed the reference, returning the data from the other files. And the description: we can visualize the whole path to it. Right, so I'm just declaring the data that I want it to return.

Let me see if there's a notepad here. If I click "copy cURL" and paste it somewhere, you will see that it's just a regular curl, and the payload is just that string; it uses the query string that I have on my left, and it returns JSON. This means I can do this from any language. Another thing that is super cool about GraphQL is that it has introspection: the server itself lets you see all the things that you can query. So we can query users, and the users have all these fields, and we can go to roles, and inside roles we have users again — this is the whole relationship of the database that I was showing you, and it's available here. This is the official GraphQL tooling; I'm not doing anything special, I'm just starting a GraphQL server, providing it with a schema, and this is automatic. This is one of the reasons we went with Node.js: this ecosystem was amazing to use, and we're just implementing the logic that resolves these references. That's basically all we're doing. And the back references: if we query roles_v1, now we can query the users, and this is not defined in the data files.
This is something qontract-server knows how to do just because it looks at the relationships in the schema, and it builds all the direct relationships and the back references. Okay, so this was the demo for the GraphQL server. Does anyone have any question regarding the GraphQL query language? I find it super cool, because you just declare the data that you want and that's it.

Okay, just a few more slides left. We've talked about the upper part of the stack; now I want to talk about the integrations. How do we write them, and what are they? Integrations are extremely simple to write. They just need to follow a few small patterns. They fetch the desired state by contacting the GraphQL server. They fetch the current state by using the API of whichever service we're managing — GitHub, Vault, Jenkins, Quay, whatever. You have to do it in such a way that it's idempotent, so if you run it again and again, things won't break: if you try to create something that already exists, you cannot raise an error; you silently say, okay, this was already created. We can write them in any language, and in fact we do. And a very important thing is that they all have to have a `--dry-run` flag, so we're able to simulate what happens on every PR: the PR comes in, we run the whole list of integrations, and we see what this PR is going to do.

So let's look at one simple integration: defining Quay registries. The logic is what I just said: desired state from GraphQL, current state from the Quay REST API. We iterate through the desired state; if items are not in the current state we create them, and if they are different we modify them. Then we iterate through the current state, and if items are not in the desired state we remove them.
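That create/update/delete flow, together with the `--dry-run` behavior, can be sketched as follows. This is a conceptual sketch of the pattern, not a real integration: states are plain dicts keyed by name, and the repo names are made up.

```python
def reconcile(desired, current, dry_run=False):
    """Diff desired vs. current state and act on the differences,
    the way the talk describes the integrations doing it."""
    actions = []
    for name, spec in desired.items():
        if name not in current:
            actions.append(("create", name))
        elif current[name] != spec:
            actions.append(("update", name))
    for name in current:
        if name not in desired:
            actions.append(("delete", name))
    for verb, name in actions:
        if dry_run:
            # On a PR, every integration runs in this mode to build the report.
            print(f"DRY RUN: would {verb} {name}")
        else:
            # A real integration would call the Quay/GitHub/Vault API here.
            print(f"{verb} {name}")
    return actions

desired = {"org/repo-a": {"public": False}, "org/repo-b": {"public": True}}
current = {"org/repo-a": {"public": True}, "org/repo-c": {"public": False}}
plan = reconcile(desired, current, dry_run=True)
```

Note the idempotency property from the talk: running it again once current equals desired produces no actions at all, so repeated runs are harmless.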
It's just like a Kubernetes controller — that's exactly what you would do. And you implement dry-run: with dry-run, you print what you would do instead of doing it, so it's pretty much trivial. This is the query that you would have. On the left, the developer defines the registries they want: they point to a Quay organization and list the registries, for instance. And in our integration we just request the data like this: we build this GraphQL query, we get the data back, and we do all the logic based on the results. So it fits in very nicely, and the good thing is that, because the data was already validated, we know it is not going to give us false information: if name is supposed to be there, it will be there, and if name is supposed to not be null, it won't be null.

So nowadays this is the list of integrations that we have, and we write them super quickly. We have integrations that deploy resources to the OpenShift clusters, manage Quay repos, Vault configurations, AWS resources — we have a lot of things, we retrieve everything from the single git repo, we follow this pattern, and this is allowing us to scale a lot. Another interesting thing is what happens when a developer sends us a PR. How do we decide whether or not to merge it?
Well, every time a PR is sent to this repo, all the integrations are run with `--dry-run`, and we get a report. So if the developer has deleted all the users, we would see it, and we would refuse to merge that PR. If there are any problems, any schema validation problems, we are able to see them just by looking at this report, and we have all the information we need to decide whether or not to merge it.

Another tool that we built around this is visual-qontract. Essentially, the git repo has a lot of references going from one document to another, and navigating this as a human is sometimes a bit confusing; if you want to follow things you have to keep jumping from one file to another. So visual-qontract is a React app that displays this information. We have services, clusters, namespaces, users, everything — the Grafana dashboards are linked from the namespaces; if you go to clusters you will see the namespaces, and in the namespaces you will see that everything is interconnected, and you can navigate from one to another. This is just a screenshot of the repo that I currently have, in which I only have two users — normally I think we have 250 or so — and if I click on one, you will see something quite interesting, which is the edit button. The edit button just sends you back to the page in the git repo — it's a GitLab repo, so it sends you back to GitLab — meaning: if you want to modify this, simply send a PR modifying this file, and that's it.
So it's just a link, but at some point we will probably have a dynamic web form: based on the JSON schema you could generate a dynamic web form to modify these fields and send an automatic PR, but we're not there yet.

Future work. As I said, dynamically generating the GraphQL schemas is one of the things I would like to work on soon, because it's a bit of a pain to maintain the schemas in two places, in JSON and in GraphQL format. It's a bit tricky, but I think we're in a good place to solve this problem. Another weakness we have is that the documents are validated in the context of themselves: we are able to see if their fields are missing, etc., but the validation does not look at other documents. This means there are simple things, like uniqueness, that we are not enforcing — someone could define the same Red Hat username twice, and that is a problem. We think we have a good strategy to solve this, which is essentially defining GraphQL queries that need to pass in order for the PR to be merged. If we set it up in such a way that they're easy to define, then we'll be able to say: this PR cannot go through, because it's repeating a field that is already being used somewhere else.

Productize. One of the obvious questions is, can this be used elsewhere? It can, but it's a pain to set up the logic to run the integrations. We're currently doing it in Jenkins using webhooks: when the PR is submitted, a webhook triggers the Jenkins job that runs the integrations. We really don't like this approach, and we would like to build a Kubernetes operator, so you basically define in a CRD the integrations you want to run, and it listens for events like this. Once we have that, I think it would be easy to deploy this elsewhere; as of now, the biggest challenge for someone trying to deploy this is setting up the whole logic to run the integrations.

The last thing is automatic PR merges. Right now a big part of our work is reviewing PRs, and we want to automate that; we want to be able to define some tests. For instance, one thing we want to enforce is: if someone tries to modify a service that they don't belong to, we should fail that PR. Right now that is a manual process — we look at it and say, hey, this person is trying to modify this service, but they're not part of it. So I want to do something similar to the GitHub OWNERS file, but again using GraphQL queries and things like this.

So, the conclusion, just to finish. In this talk there are two main ideas I want to convey. The reconciliation loop, the declarative approach, is amazing; we found it to be useful in this use case. It improves a lot of things, and it can be used in many different scenarios, so it could be useful to evaluate whether you can borrow this declarative approach and apply it elsewhere. The other thing, a lesson we learned from 2018, is that automation is necessary, but you have to do it with a plan and a design; if you just start automating, in the end it will be confusing and complex and you won't be able to scale up. And yeah, these are the links for the projects if you want to look at them, and that's my email, if anyone has any questions.

Q: One of the reasons I could see is the TypeScript
thing you mentioned for GraphQL, but outside of that, why Python? Why not all Go?

A: So, for instance, we have an integration, this one, the Vault configuration integration, that we wrote in Go, because HashiCorp Vault's main API client is in Go. For the query-repos integration, on the other hand, it made sense to write it in Python and not in Go. We really didn't want to be bound to a single language, and that's why we followed this approach of having something that exposes the data in such a way that you can consume it from any language using just HTTP, which is essentially what a GraphQL client is: just an HTTP client.

Q: It was like you read my mind, because I did have a question but I didn't raise my hand. My question is about GitOps. I am new to this concept of GitOps, although everything behind it is very familiar. It seems that one of its features is the idea of a pull request that's checking the difference between desired and current state, and that PR is really what makes this a GitOps process rather than, you know, just a continuous delivery process. But you want to get rid of the PR part and automate that. Does that mean there's some sort of flaw with GitOps?

A: Okay, so I don't want to get rid of the PR check; I want to automate the creation of PRs. The edit button, this thing I said about creating dynamic forms, would essentially allow the developer to send a PR without modifying anything by hand. So the PR will still be there; we'll have traceability of who sent it, and we'll have a simulation of what would happen if we merge it. But you said another thing, which is that the PRs are the essence of GitOps. They are indeed a very important part, but I think the most important part of GitOps is just having a state that you want to see deployed.
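That idea of having a desired state and applying it can be sketched in a few lines. This is only an illustration of the reconcile step; the dict-of-manifests shape is an assumption made for the sketch, not our real data model:

```python
# Illustrative reconcile step: compare the desired state (from Git)
# with the current state (from the cluster) and compute the actions
# needed to converge. Keys identify resources; values are manifests.

def reconcile(desired, current):
    """Return the (create, update, delete) actions that make
    `current` match `desired`."""
    create = {k: v for k, v in desired.items() if k not in current}
    update = {k: v for k, v in desired.items()
              if k in current and current[k] != v}
    delete = [k for k in current if k not in desired]
    return create, update, delete
```

An integration runs something like this on every merge and then applies the actions; running it again immediately should produce no actions, which is how the loop converges.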
So PRs are extremely useful, and they're a benefit that you get from GitOps, but I think the main thing is just being able to apply that state. Next question, please.

Q: With respect to the Vault integration you have: in that case the secrets, for example, are still not encrypted in etcd. Rather, they're encrypted in your store, and something pulls each secret out, decrypts it, and puts it in etcd, right?

A: Exactly, it's exactly what you said. The developer puts the secret in Vault, and in our GitOps repository they say: apply this Vault secret in this namespace. What the integration does is simply obtain the data from Vault and create an actual Secret in Kubernetes. And it's versioned, so they can do rollbacks and that kind of thing.

Q: Right, so HashiCorp has a newer thing where it has an init container which effectively ensures that you don't even have to write the secret unencrypted into etcd. Is that something you've considered and found not a good idea?

A: So we saw this a while ago, I think, and the problem was that we weren't sure we could run it in OpenShift Dedicated, because installing it requires cluster-admin, which we don't have. But the good thing is that if you do have it, writing an integration for that is trivial.

Q: Okay, one last question, sorry. I'm not too familiar with GraphQL. Is it taking the data from the repo and storing it somewhere, or is it just parsing it directly?

A: GraphQL doesn't have anything to do with "give me the Git repos". When you create a GraphQL service, it's like when you create a REST API service: you have to implement the logic, right?
In our case the logic is to go to the Git repo and return this data, but it's our logic that obtains the data from the repo, not GraphQL. GraphQL is simply an API, an API specification.

Okay, thanks a lot for coming.
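As a closing illustration of that last answer, here is a minimal sketch of the resolver idea. A real service would use a GraphQL library to parse queries against a schema; this only shows that the data-fetching logic is yours. The `repos` field name and the canned data are made up for the example:

```python
# Sketch of "GraphQL is just the API spec; the logic is yours".
# The resolver is a plain function we wrote; GraphQL only defines
# how clients ask for the field.

def fetch_repos_from_git():
    """Our logic: in the real service this would read documents
    out of the Git repository; here it returns canned data."""
    return [{"name": "example-repo"}]

# Map top-level query fields to the resolvers that implement them.
RESOLVERS = {"repos": fetch_repos_from_git}

def execute(field):
    """Dispatch a top-level query field to its resolver, the way a
    GraphQL server does after parsing the incoming query."""
    return {field: RESOLVERS[field]()}
```

So asking for `repos` runs our function and wraps the result, exactly as a REST handler would; nothing is stored by GraphQL itself.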