Hello, everyone. Thanks for coming. Do you hear me well? Okay. In this talk I will share our stack migration experience, from infrastructure to devs, as well as from an organization standpoint. I will share the motivations behind our move to Kubernetes, the experience we gained building our own bare metal cluster, and how we made sure it was adopted by every kind of tech that we have at Numberly. I will be sharing some configuration examples. I will also take this chance to showcase the GraphQL choices that we made for building APIs. So it's not a talk on GraphQL itself, but rather about those choices. I will also share code examples to showcase the demo app, and then I will demonstrate how they fit together. That will be about the developer workflow.

Just a quick word about myself: you can find me about everywhere as ultrabug. I'm a Gentoo Linux developer; that's my open source life. I maintain quite a number of packages, the MongoDB one for instance. I'm a PSF contributing member because I open source and work on some Python code, and I'm also CTO at Numberly.

Before we begin, I wanted to share a story. When I submitted this talk, a colleague of mine came to me and asked: "Hey Alexys, couldn't you have more buzzwords in your talk title?" I felt obliged to answer this question, since you may also have wondered about it before deciding to come here or not, and the answer is no.

So let's begin. The first thing I wanted to share is our previous workflow, which has been running for more than five years at Numberly. This is how we are still working for some of our projects, but this talk is about the transition and how it's being done. So we have our friendly developers.
We use GitLab internally, so that's where we have code repositories and configuration repositories. We keep them separate, so there are no secrets in the source code. That's where we run continuous integration, code reviews, etc. That's basically what the developers and the project managers interact with every day.

Then, to start and orchestrate the deployment of the projects that we build in GitLab, the developers just have to create a YAML configuration file at the root of their repository; it's basically Ansible-task-compatible YAML. Then we created a web interface that we call deploy docus, which is linked to GitLab. You log in to it through the GitLab SSO, you see the list of the projects that you work on, and you select one and execute it. That runs an Ansible playbook in the background that connects to GitLab, gets the source code and configuration repositories and merges them together. It then connects to all the bare metal servers that we have, at least the ones targeted in the YAML configuration file beforehand, and starts creating Python virtualenvs, takes the code, deploys it inside, and then configures uWSGI, deploys uWSGI configuration files, nginx configuration files and everything at once on multiple servers or clusters.

If the project or service that we are deploying is publicly accessible, we need the help of some network engineers to set up F5 load balancers, which also act as SSL offloading proxies, so all the SSL usually happens in the F5. This is pretty cool.
It's working very well, and it has been working well for a long time for us, but there are still some limitations. The first obvious ones relate to the deploy docus web interface and to the Ansible playbook that runs everything behind it. If GitLab changes its API for some reason, we have to fix it in all that orchestration, in Ansible and in deploy docus. So it's a bit of work. It doesn't happen that often, to be honest, but when it happens you basically end up with a large herd of angry developers who can't deploy their things, so it's not that cool.

On the server side, the virtualenvs that we are able to create on the bare metal servers depend on the Python versions available on those servers. That means we have maintenance operations to keep up to date every bare metal server that is part of the different web or application clusters that we operate. Of course, the developers also depend on the network engineering team to do the mainly manual (not fully, but mainly manual) SSL configuration when the website, the API or whatever it is has to be accessible on the web through an HTTPS URL.

Also, you can see that it's based on virtual environments, mainly Python ones. So what if a developer needed a different kind of stack? That would mean that the ops or DevOps people would have to make it available on the bare metal servers as well. And there is no uninstall feature either.
That is something we wanted to have for some corner cases. It's not that important, but still, there is no "hey, just forget about it". There is no performance isolation either, at least no very strict one, so you can have problems when a developer deploys code that eats all the RAM of the node.

We could have just kept this and rewritten the orchestration a bit. You could ask yourself: why don't you take all this nginx, uWSGI and virtualenv stack on the servers and just run it on LXC, Docker or rkt? You would be addressing basically what's on every node. That's right, but that means we would still have to keep up with all the orchestration that makes this happen. So it solves only a part of the problem.

When we felt that it was the right time to move on, the Kubernetes ecosystem was already, I won't say stable, I would say popular enough that the community behind it and the documentation were enough for us to go into it. We didn't want to have to maintain container orchestration by ourselves, so we just joined in the fun and decided that we would build our own Kubernetes cluster with our bare metal approach. That's what we did, and I'm now going to give you an overview of how we've done it. The first thing was to actually build the bare metal cluster.
So that's the methodology. Then we had to decide on the tooling, and when I say tooling, it's mainly how the developers will interact with the cluster, which is not a simple question, actually: you have to take a stance on the level of abstraction that you want to give. We wrote documentation, because if it's not documented, it doesn't exist. Then we worked hard to make sure that this new platform, this new way of working, was both adopted and supported. There are organizational ways to do this, and I will share a bit later how we did it. And then we distributed the expertise, so that the expertise on the Kubernetes workflow and cluster does not belong only to the people who built it in the first place.

A lot of our production clusters at Numberly run on Gentoo Linux. This is part of our deep-dive approach to everything we do. So we decided to continue with it, and it was also a good chance for us to get to know and understand all the bricks (and they are numerous in the Kubernetes ecosystem) and how they fit together. So we built it on Gentoo.
We leveraged our infrastructure as code way of approaching things: we already have a lot of Ansible playbooks that operate all those machines I was talking about earlier, so we built on them and added full automation for deploying, reconfiguring and provisioning machines in the Kubernetes cluster. As our name says, we are obsessed with metrics and numbers, so we are extensive users of Grafana, whether the data behind it comes from Graphite or Prometheus. So we built dashboards to monitor and see how the cluster was doing in the early stages.

Then we decided to adopt a developer-driven approach when designing our cluster, because we wanted to remove friction. That was our main goal in the first place. Of course, that must not compromise security, so we will see the decisions that we made to keep the right balance. One thing we adopted quite early is that we didn't want to have too many abstractions. We decided to allow developers to interact with the Kubernetes cluster directly, so they have kubectl at their disposal. There is no Helm and no overlay between the developer and the Kubernetes cluster. That means we also took some security measures to make sure that it didn't get out of hand.

The first one is authentication. At Numberly we are using G Suite, so every employee has a Google account, and this Google account offers OpenID authentication. So the workflow to first authenticate on the Kubernetes cluster is to go to a kubeconfig URL and log in as usual with your G Suite account. We get MFA for free thanks to the Google account, and every developer and employee at Numberly has a YubiKey for this. Then the Gangway project provides them with their kubeconfig, which they just have to download, and they are ready to start interacting directly with the
Kubernetes cluster.

Then we have to handle authorization and permissions, and for this we already had a nice workflow on GitLab: we have everyone on GitLab, with groups and roles on GitLab. So we decided that it would be interesting to map all those permissions and groups to Kubernetes. There was no project doing this, so we wrote and open sourced our own, called gitlab2rbac. The principle is that a namespace in Kubernetes relates to a team, and this team relates to a group in GitLab; that's how it was already working. This project continuously maps the GitLab groups, users and their permissions to Kubernetes. So we don't have two separate authorization and permission systems to operate. We do everything on GitLab and it replicates to Kubernetes.

To give you an overview of the cluster capabilities and the choices that we made: GitLab also offers an image registry, which we of course leverage. That's where the images are pulled from when they are deployed on Kubernetes. We enforce some security QA on this: only whitelisted images can be deployed, because we don't want any random image from the web running on the Kubernetes cluster. We also enforce running as non-root from the start. That means that no container can run on Kubernetes if it's running as root. And we have strict network policies. Those are the NetworkPolicy objects that regulate how pods can talk to each other, developers to pods, or the internet to pods. Basically, we disallow almost everything unless it's coming from the ingress.

Speaking of ingress, we are using the nginx ingress provided by the Kubernetes ecosystem, and we added a fully automated Let's Encrypt certificate lifecycle. This allows developers, with just two or three lines of configuration, as we will see later, to have a free HTTPS endpoint.

The cluster is multi-tenant, which means that we decided, for a start,
(maybe it will change over time) to have all the environments inside the same cluster. We don't have one Kubernetes cluster for development, one for staging and one for production. It's a multi-tenant one for all environments, so you can have a production pod running next to a development one. There's no real consensus in the Kubernetes ecosystem yet about this strategy. Our approach was: we are rolling out something new, so we want to leverage the resilience and simplicity of the machines and of the workflow that we provide.

To help people get acquainted with the cluster, we also created a special sandbox namespace that basically allows anyone who is authenticated to do anything, and it's wiped every day. You don't have to read this; I just put it on the slide for reference so that you can see how we wipe it every day. So it's just for testing.

We don't have distributed persistent storage yet. That doesn't mean that we don't provide persistent storage, but it's a simple one through NFS. So for now it's really mainly about stateless applications. We won't be hosting databases on Kubernetes yet. Maybe it will come, I don't know. And when you are a bit obsessed about security, there are good benchmarks provided by the CIS, so we of course made sure that our cluster passed them.

Then you have to write good documentation. Here I'm listing the topics that we felt were needed and that we worked to get covered. When you enter this space, all your techs might not know what a Dockerfile is. So you need to kickstart them in Docker.
You need to kickstart them in Kubernetes, you need to kickstart them in deployment as well, etc. So we leveled this a bit. The idea behind it, and the trap that I hope we didn't fall into, is not to rewrite the Docker documentation or the Kubernetes documentation. Instead, this documentation is a practical one, making references where needed: get your hands on it and let's go step by step. It has been very helpful for our developers to get their hands on the Kubernetes cluster very quickly and have concrete results. Here the sandbox namespace helps a lot, because they can try and learn in it, and everything that we ask people to test is based on the sandbox namespace. We put a lot of effort into this, actually, because this is where your adoption lies.

Speaking of adoption, you have to foster it and then you have to scale it. At Numberly we have multiple teams and poles that share the same core skills, backend development let's say, so you have multiple backend developers, for instance. In all those teams we wanted to make sure that there was someone who was identified, and valued as well, as being able to help and give support, so that the people who built the cluster were not the only reference. We were starting to get spammed, and it doesn't scale that way. So we created our internal Kubernetes certification, so that we can make sure the people who take it have basic but still strong enough knowledge to support the people around them. I think this is also a nice way to value the expertise of team members.

So, a quick takeaway on the Kubernetes side. We use GitLab for RBAC and as the image registry with Kubernetes.
The project is called gitlab2rbac. You can check it online; if it's useful to you, we would be very happy. It's written in Python. We always have to balance security versus freedom. They are not opposed at all times, but still, that's something you have to take into account: give freedom, but not so much that it can put your company at risk. That's why we have to enforce the security and QA rules from the start. It's important for us and, I guess, for anyone starting on this path. For now we get reports on non-whitelisted images running; we still have work to do to make that enforceable from the start as well.

What I like very much, and what we value very much in this approach, is that now ops can concentrate on adding features to the cluster that developers can leverage in their day-to-day work. I think this is really nice. Instead of going cluster by cluster, we can now see our Kubernetes cluster as a set of features that we can use. Having practical documentation helps a lot. And to spread expertise, a certification is maybe a good trick. Maybe we will create more certification levels later.

So how does it look now? It basically looks like this. We removed the configuration repositories; that has moved to Kubernetes secrets and to Vault, depending on the project. We are not entirely settled on this, so that's something we're still working on. GitLab user roles are mapped to Kubernetes RBAC and groups to namespaces, and GitLab is also where the Docker image registry is. Now, instead of having the web interface at the bottom, we just allow our developers to run kubectl commands to interact with the cluster, which will in turn orchestrate the pods behind an nginx ingress.
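The group-to-namespace and role-to-RBAC mapping described in this part of the talk can be sketched roughly as follows. This is not the code of the open source project mentioned in the talk, only a minimal illustration of the idea. The numeric access levels are GitLab's documented values and `view`, `edit` and `admin` are Kubernetes' default user-facing ClusterRoles, but the choice of which level maps to which role, the `role_bindings` helper and the sample names are assumptions made for this example.

```python
# GitLab access levels (10 guest, 20 reporter, 30 developer, 40 maintainer,
# 50 owner) mapped to Kubernetes' default user-facing ClusterRoles.
# The mapping itself is an assumption made for this illustration.
ACCESS_LEVEL_TO_ROLE = {10: "view", 20: "view", 30: "edit", 40: "admin", 50: "admin"}

RBAC_API_GROUP = "rbac.authorization.k8s.io"


def role_bindings(group, members):
    """Build one RoleBinding manifest per member of a GitLab group.

    `group` is used as the Kubernetes namespace (one namespace per team),
    `members` maps a user identity to a GitLab access level.
    """
    bindings = []
    for user, access_level in sorted(members.items()):
        role = ACCESS_LEVEL_TO_ROLE[access_level]
        bindings.append({
            "apiVersion": f"{RBAC_API_GROUP}/v1",
            "kind": "RoleBinding",
            "metadata": {
                "name": f"{user.split('@')[0]}-{role}",
                "namespace": group,
            },
            "roleRef": {"apiGroup": RBAC_API_GROUP, "kind": "ClusterRole", "name": role},
            "subjects": [{"apiGroup": RBAC_API_GROUP, "kind": "User", "name": user}],
        })
    return bindings


if __name__ == "__main__":
    import json

    # Hypothetical team and user, for demonstration only.
    print(json.dumps(role_bindings("backend-team", {"alice@example.com": 30}), indent=2))
```

Running such a function periodically against the GitLab groups, and applying the resulting manifests, is one way to keep a single source of truth for permissions, which is the point made in the talk.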
We have a free Let's Encrypt endpoint for the projects that need it, and we still need to work on automating the F5 SSL offloading as well for the public domains. We deploy a lot of projects every day, so you might wonder: why don't you just go for Let's Encrypt everywhere, and why keep this F5 thing? It's because we have to face some limitations from our clients that force us to support, let's say, not so up-to-date browsers. So we have to be able to sit in between. That's especially true when you work for banks.

Anyway, now let's try to build a GraphQL application on this Kubernetes cluster, and then we'll finish with the workflow that makes it happen. For the demo app, whose source code is provided as well, I thought it would be a nice idea to demo how you can proxy, let's say, the Trello REST API through a GraphQL endpoint. So you interact by issuing GraphQL queries that will be turned into Trello REST API queries.

The first thing you ask here is: how do I do GraphQL in Python? Usually the answer is Graphene, which is the most popular library to do GraphQL in Python. At the time we were asking ourselves this question, it did not support asyncio, and we are big asyncio lovers, so that was kind of a problem. The other problem was the design approach of Graphene, where you basically express your GraphQL schema as code, as classes, etc. But in the GraphQL ecosystem, as we will see later, there are other ways, and most importantly language-agnostic ways, to do it. So that's why we didn't go for Graphene.
We went for this. For the non-French people around here, this is called a tartiflette. It's a mountain dish: it's basically potatoes, cheese and cream, potatoes with cheese with cream with potatoes and cheese, and a bit more cheese. The top must be cheese, otherwise it's a revolution. And if you are very hungry you can add lardons to it, but that's a plus.

The project itself is called Tartiflette, so you understand that the core developers are French. They are the guys from Dailymotion, and they are doing great work. What I especially like in Tartiflette is that it's modern Python: it's fully built on asyncio, and in a good way, I think. It has a schema-first design based on the Schema Definition Language. What this means is that you express your schema using the GraphQL SDL only. This is completely agnostic to the programming language. Then you just point the Tartiflette engine to this raw flat file and it loads the entire schema. You don't have to express it using code, classes or Python objects. You express it in a way that everyone in the GraphQL ecosystem can understand, then you put it in the engine and you're good to go, and we'll see how. They offer an aiohttp integration, and they embed a GraphQL development web interface to help you as well. So it's pretty developer friendly. Tastes very good.

This is what the SDL looks like. You define a query and then you define types. Here I'm defining the type Member, which refers to the member type in the Trello REST API, that is, either you or someone else in Trello. Then you have its properties and the scalars associated with them. So this is not Python. This is not any kind of programming language.
This is the SDL, the standard SDL that defines schemas in GraphQL and that can be understood by any kind of language or library.

What's interesting as well is that if you look at the Trello API documentation, you see that the member object has a property listing the IDs of the boards. That means that when you query a member, you get the IDs of the boards and not the details of the boards themselves. So when you use the Trello REST API, you have to get the member, get the list of board IDs, and then, for each board, even if you just wanted to display its name, make a query with its ID to the boards endpoint and get the name out of the response. That is how you would do it in REST.

GraphQL allows you to abstract this, because there is a graph behind it. You only have to add a boards edge that is a list of Board objects. This is how you present it in your GraphQL endpoint, and the machinery, or the magic, that you have to provide is to make sure that your GraphQL endpoint does all those REST calls for you. The front end issues a single query, which in this example ends up being three queries to the Trello REST API. That's one of the key features of GraphQL, and you can see it here.

Show me some code now. How do you create this?
You have the generic SDL, in the middle here. You create the engine and you pass it the path to the raw SDL file in your project, and that's all. It gets validated, and then your engine is ready to receive queries. When you issue a query for a type in your GraphQL schema, the engine needs to know the resolvers that are able to get the data that is being asked for. That's what the import of the resolvers is about.

To write a resolver in Tartiflette, you just use a simple decorator pointing to the node of the schema that we are talking about. So if we remember the query and the type Member, you just have to create a simple async def function and decorate it with the resolver, and that's all. You return a dict object that has properties, and if some of those properties represent an edge, the engine will check for you: do I have a resolver for the boards edge? Because for now I just have the IDs; I need the names, but they were not provided by the first call. It orchestrates everything: it iterates like this through the graph, based on what got queried, and calls the resolver functions. Simple, super easy. It does this concurrently as well.
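The query resolver plus edge resolver pattern described above can be illustrated with plain asyncio. This sketch deliberately does not use the Tartiflette API: there is no decorator and no engine here, and `run_query` only stands in for what a GraphQL engine would orchestrate. The in-memory `FAKE_MEMBERS` and `FAKE_BOARDS` dictionaries are hypothetical stand-ins for the Trello REST API, so nothing touches the network; only the `idBoards` property name comes from the talk.

```python
import asyncio

# Hypothetical in-memory stand-in for the Trello REST API (no network calls).
FAKE_MEMBERS = {"me": {"id": "me", "username": "demo-user", "idBoards": ["b1", "b2"]}}
FAKE_BOARDS = {"b1": {"id": "b1", "name": "Roadmap"}, "b2": {"id": "b2", "name": "Ops"}}


async def fetch_member(member_id):
    await asyncio.sleep(0)  # stands in for an HTTP GET on the member endpoint
    return FAKE_MEMBERS[member_id]


async def fetch_board(board_id):
    await asyncio.sleep(0)  # stands in for an HTTP GET on the boards endpoint
    return FAKE_BOARDS[board_id]


async def resolve_member(member_id):
    """Query resolver: one REST call, the result only carries board IDs."""
    return await fetch_member(member_id)


async def resolve_member_boards(parent):
    """Edge resolver: one REST call per board ID, issued concurrently."""
    return await asyncio.gather(*(fetch_board(i) for i in parent["idBoards"]))


async def run_query(member_id):
    """Walk the graph: resolve the member first, then its boards edge."""
    member = await resolve_member(member_id)
    boards = await resolve_member_boards(member)
    # Keep only the requested fields, as a GraphQL engine would do for us.
    return {"username": member["username"], "boards": [{"name": b["name"]} for b in boards]}


if __name__ == "__main__":
    print(asyncio.run(run_query("me")))
```

With two boards, the single incoming query fans out into three backend calls (one member, two boards), which matches the "one query becomes three REST queries" example from the talk.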
So it's also quite fast, and it's really easy to reason about. Here you see that I get my boards from the idBoards that got returned in the JSON from Trello. I just have to loop over each board ID and fetch the board to get its name, and then I return the object that has been returned by Trello. I didn't have to filter either, because the filtering is already done by the GraphQL engine. So that's all: a query resolver and then an edge resolver.

Okay, now let's ship it. The first thing you have to write is a Dockerfile. This is a demonstration of a multi-stage build to get a smaller image at runtime, which I find very helpful, so I'm providing it for you to come back to. As you can see on line 25, you also have to enforce the nobody user as the one running your application. That's basically how it's built.

Then the workflow itself, on the git side. We have the build, and the build relates to the current git branch. I provide a simple script just to showcase how you can build and deploy the image to your GitLab registry based on the branch you're working on; you could also do it in a hook. The development branch gives you a development pod on Kubernetes, the staging branch gets you a staging pod, and production is on master plus a git tag. So it's a bit more complicated than just the bash here, but that's how we do it easily.

Now you have to deploy to Kubernetes. For this you create a deployment YAML. I trimmed it a bit because they are quite verbose. You can see that we also enforce running as nobody in the deployment. Then we get the secrets from the Kubernetes secrets and provide them to the code as environment variables. On the developer side as well, you can ask for your Let's Encrypt SSL endpoint with the domain that you want, and it will be created for you for free. So I'm crazy enough to have a quick demo. Yeah. With
my HHKB keyboard. So, you're ready? Okay, let's go. Basically you type three commands; I can do it without my hands. You just build and upload: this builds the image and uploads it to GitLab. Then you can see that for now it's not running, so there's no deployment for our project. So we apply the development deployment. Here you can see that it's being created on Kubernetes. Now it's there, but it's not ready yet: zero out of one. Let's see if there is a service. Yes, we have an IP for our service, easy. Is the pod created? Not yet. Now it's created. It's running, and now it's ready. That's all, and I have my SSL as well.

Takeaways on GraphQL. It removes friction and helps teams collaborate, because it gives you a spec, so it normalizes how data is addressed and communicated between teams. Having an SDL approach, I think, lets people concentrate on the data and not the code, which is really important. Tartiflette is really modern, has this SDL approach, and is very good, I think, so give it a try.

We have a workflow for environment deployment based on git branches. Maybe we will challenge the multi-tenancy of the cluster later, as I told you before, and that would maybe have an impact on this. Secrets are shared with applications as environment variables, and we still have to work on generalizing Vault. As for giving power to the developers, we decided to give them kubectl as their main tool to interact with the cluster. Maybe at some point, when adoption grows, we will allow some other abstractions to interact with it, such as Helm, but for now it's working pretty well.

And that's it. You have all the source code here, you can reach out to me here, and I think we still have some time for questions.
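The two workflow takeaways above, one environment per git branch and secrets handed to the application as environment variables, can be sketched like this. The talk's own build script is a bash script that is not reproduced in the transcript, so this is an independent illustration: the image tag format, the registry and project names, and the `image_reference` and `load_secret` helpers are all assumptions.

```python
import os

# Branch-to-environment convention from the talk: development and staging
# branches map to their own environment, production is master plus a git tag.
BRANCH_TO_ENV = {"development": "development", "staging": "staging", "master": "production"}


def image_reference(registry, project, branch, git_tag=None):
    """Return the image reference to build and push for the current branch.

    The "<registry>/<project>:<tag>" layout is an assumption for this sketch.
    """
    env = BRANCH_TO_ENV.get(branch)
    if env is None:
        raise ValueError(f"no environment is deployed from branch {branch!r}")
    if env == "production":
        if not git_tag:
            raise ValueError("production images require a git tag on master")
        return f"{registry}/{project}:{git_tag}"
    return f"{registry}/{project}:{env}"


def load_secret(name, environ=os.environ):
    """Read a secret that the deployment exposes as an environment variable."""
    try:
        return environ[name]
    except KeyError:
        raise RuntimeError(f"missing secret environment variable {name}") from None


if __name__ == "__main__":
    # Hypothetical registry and project names, for demonstration only.
    print(image_reference("registry.example.com", "team/demo", "staging"))
```

Failing loudly on an unknown branch or a missing secret keeps a misconfigured deployment from starting with silent defaults, which fits the "enforce the rules from the start" stance taken earlier in the talk.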
Thank you very much. If anyone has a question for Alexys, please come to the microphones in the aisles.

Thank you for the presentation. I wanted to ask why you decided on using bare metal instead of using either a cloud provider for the servers and then Kubernetes on your own, or a fully managed Kubernetes as a service.

Because most of our applications interact with data, and this data is on our own infrastructure. We have a hybrid approach with the cloud, not a fully cloud-based one. We come from the bare metal approach, and this is something that we value very much, first because we value the skills of the people who work with us: our own machines, our own skills. It's also something that we have to cope with, because it requires some extra work, of course. But mostly it's because all the data that lies behind it is also sitting on our machines.

Okay, thank you.

I really liked your documentation page, that was really great. Do you use some tools to deploy the YAML files, or do you just use kubectl?

Like you saw there: kubectl.

But if, let's say, you have staging and production, like three environments, do you create three YAML files for the services?

Exactly, exactly.

I was wondering, since you have the nginx ingress: do you actually host nginx inside each of the pods that run Python as well?

No, we have a separate namespace for all the ingress, because we apply network policies between the namespaces as well.

I don't think your mic is working, I'm afraid. I'm sorry, can you... I didn't...

Thank you. Hi. Since you're operating your own Kubernetes cluster, have you considered using OpenShift instead?

No.

You haven't evaluated or benchmarked it?

No, we didn't evaluate it. We have quite a deep-dive approach.
So we wanted to operate Kubernetes, that's for sure, and we wanted to operate it with nothing else on top.

Because I think it's easier to install locally on your bare metal clusters.

Maybe, but I'm not sure this is provided in Gentoo Linux.

No, there are YAML scripts to install it.

Yeah, but it doesn't fit with how we operate our own infrastructure today. For us it's more natural to just go for the packages themselves and then go for every brick, because we already have all the Ansible tool set at our disposal.

Okay, thank you for the answer.

Thank you so much, Alexys. Let's have a hand for him.