Hi, everyone. Thank you very much for being here. I know there are many good talks to see, so it's nice that you are here. First of all, let me introduce myself. I'm Alex Vistranum. I've been working as a full stack developer at Kiwi.com for almost two years now. I really enjoy researching and developing tools that improve the developer experience in our company, and that's how I ended up in the platform team. You can find me almost everywhere at XView.

So what is all this about? Everything has a beginning. For companies, it's usually the startup world, where you don't really care about how you deploy stuff; you just want to get things working. I'll start with a brief introduction about where we started, which is useful for understanding why we took some of those decisions and what weight they carried over time. Then I'll talk about what made us change, because of course things change over time, and things that didn't make a big impact in the early stages, when you barely have traffic, affect you a lot when you have to scale your application or when you have a much more complex system. Since the focus of this talk is how we ended up unifying all the diverse technologies we were using, and all the authentication methods, under a single roof using cloud native technologies, I'll cover which decisions we took and how we achieved this unification in the end. And of course I'll also talk about how these changes impacted both our infrastructure and our developers, and how they have been received inside the company. So let's go for it.

For those of you who don't know, Kiwi.com defines itself as a virtual global supercarrier, which is a pretty cool name that actually hides a pretty simple explanation: we aim to provide door-to-door transport everywhere in the world. For that we use something called virtual interlining, which is about connecting different carriers or different transport methods to let you go from door to door. It sounds crazy and complex, but at the beginning we really didn't need any big infrastructure. Being totally serious, Kiwi.com, like many companies that started around the same time, and probably even now, started by running everything inside the same server. I'm talking about the application, the in-memory cache, the database, everything on a single server. And that actually affected us much more than expected.

So I would like to explain why we started doing all this. If I have to sum it up in a single word, it's basically sanity. There is no way you can keep track of all this, now that we are a pretty big company, if you don't automate and unify things. After the startup stage finished and the system became more complete and complex, we couldn't really handle the load anymore. That's why we created a team specialized in unifying all this stuff.

To give a bit of context about our baggage at the point when we started this unification process: we had hundreds of containers running Python apps. The good thing is that we had already started using Docker by then, which removed a lot of the complexity we would otherwise have needed to run software, in VMs or whatever, in a unified way. So at least we had Docker containers everywhere. And historic reasons everywhere. Every time somebody tried to find out why something was done a certain way: historic reasons.
And that actually has a pretty big impact in that it makes debugging difficult. You don't know whether something was done intentionally, because back then there was some reason and the author just didn't document it properly, or whether it's a bug. Of course, there were no environment rules. Every team had basically full access to Amazon, and they would just provision whatever they needed and deploy the software in whatever way they liked. So there was no way for us to know who had access where, or to limit how access was managed. And of course, monitoring was inconsistent or nonexistent, which for us is a critical part right now; we cannot really go further without proper monitoring. Everything was done through DevOps: every time somebody needed something, they would just ping us. So there was a huge ops load for everything that wasn't just writing code.

So where did we start? It sounds obvious, but we started by identifying our pain points. The list is really long and I couldn't fit it into the talk, so I just chose the most important ones. The first one is the lack of standardization within our infrastructure. I mean that there was no proper, recognized way to get a bucket or a database. It was just: here you have full access, get it yourself. Which doesn't work at all. The second is the impossibility of keeping our projects up to date. Kiwi.com is about six years old right now, and some of our core services are six years old. That sounds good, but that same service first ran on a single server, then it was moved from bare metal to a virtual machine, then it ran through Docker, and now it's a Kubernetes pod. All of that went through different stages, and we didn't have a proper way to replicate those migrations across all the projects at that scale. And finally, the credential management hell. As I said, when we started this exercise, everyone had personal accounts for our cloud providers, and they would just log in and do whatever they wanted there. There was no proper ruling or management of anything.

So first of all, inconsistent resource management. How did we provision resources? The most common way was just clicking in the UI. Cloud providers have this awful feature where they offer you a cloud console that lets you do anything, so if you have access to do anything, you just do it in the UI. Of course, that's not reproducible, so you don't have a proper way to know what was created, when, and by whom. After we changed that, the second most common way was to ask DevOps in person. That didn't scale, so people switched to asking DevOps via Slack. And when that didn't scale, we started having tickets for DevOps. And of course, when people didn't want to bother DevOps or go through the process, they would just reuse or share infrastructure between different projects, or sometimes even between teams. In some cases, we actually ended up with people buying from third-party providers for things they didn't know Amazon could provide and that we would support.

The most logical option for unifying all this was to go for an infrastructure-as-code approach, and for that, the most recognized product right now is Terraform. So we now have infrastructure repositories where anyone can make a merge request specifying which resource they want. For more complex stuff, we also built our own modules that bundle different smaller pieces together.
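Just to make that concrete, here's a rough sketch of what such a merge request might add to one of our infrastructure repositories. The module source and variable names here are purely illustrative, not our actual internal modules:

```hcl
# Illustrative only: the module source and variable names are hypothetical,
# not Kiwi.com's actual internal modules.
module "search_cluster" {
  source = "git::https://gitlab.example.com/infrastructure/terraform-modules.git//elasticsearch"

  name          = "booking-search"
  node_count    = 3
  instance_type = "m5.large.elasticsearch"
  storage_gb    = 100
}
```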
So if anyone wants an Elasticsearch cluster, they just use that module instead of having to specify the instances, the storage, and so on. This also allows us to update infrastructure at scale: if we figure out that we don't want to do something a certain way anymore, we can either change the core Terraform module and just re-run and apply it everywhere, or, if it's more complex, we can write custom code that goes through all the Terraform repositories updating resources. In the end, you just need to parse the HCL file, make some changes, and commit back. It's not that hard.

But unifying the infrastructure wasn't enough. We also needed a better way to put our code onto this infrastructure. The problem was that not only the infrastructure was inconsistent; the deployments were too. As I mentioned before, we went through multiple stages to arrive at this point: from bare metal to virtual machines to Rancher and now to Kubernetes. None of the earlier stages had a proper deployment method, so we ended up using almost everything. Just to name some of the tools we used: Chef, Ansible, manually connecting over SSH to copy code and reload Nginx or whatever, Docker Compose, Rancher Compose, Puppet, Fabric. Basically, you name it, we probably used it at some point with more or less success. But in the end this doesn't work either. At the point we are at right now, we cannot afford not to know what we are deploying and where. So we also needed to unify all these ways of deploying software into a single one.

The solution was quite simple for us: we just went for Kubernetes. That's probably not a surprise, seeing that it's the industry standard and has this huge community and support from everyone. Why would we choose anything else? Having a single way to define our deployments means that now we can do anything with them, not just use them for deploying. We can use them to search for issues in the deployments; we can scan all our repositories, knowing that they use Kubernetes YAML files, and try to find issues there; we can also build an index of the services deployed everywhere. We can do basically everything at scale.

But of course, storing all this information in our repositories brought another issue, which is, once more, inconsistency. This time in the manifests. It sounds good: you have all the deployment information in your repository. Sure. But what happens when developers who were used to just clicking a plus button in some UI to get one more container now have to manage manifests for ingresses, for pod autoscalers, for pod disruption budgets, for everything? Well, we had issues like copying and pasting manifests from different projects, because, of course, a few hours of trial and error can save minutes of reading documentation. Why spend time checking docs? Just copy and paste from somewhere else and see what happens. Non-standard structures. This is not really bad in itself, in the sense that anyone should be free to set up their manifest structure as they want. But as I mentioned before, we don't want the manifests only for deployments; we also run automated tools over them, and having a common structure removes some of the complex logic needed to target the manifests. So for us, this was a big plus.
Misconfigured deployments: as I said, it was a new environment, developers weren't really used to it, and when they started, some mistakes were made. Like ingresses using NodePorts instead of cluster IPs, and then, because you also copied and pasted the manifest from another project, you end up trying to deploy two services with the same NodePort on the same node, which of course doesn't work and doesn't make any sense. Misconfigured agents: as I mentioned, we rely heavily on data and metrics right now, so we really want at least a bare minimum set of metrics and data reported from every single service. All these services need their own agents, and the agents need some configuration, because we cannot really ingest all the data in the world. In the end we also had the problem where projects were misconfiguring how they connected to the agent, and we ended up with missing metrics or overloaded agents that would even slow down other projects. And missing required resources: right now we are trying to use as many features from Kubernetes as possible, we have our own operators, and we have secrets that authenticate the deployments to other places, and that was also often missing. When people tried to deploy something to our Kubernetes cluster, the main issue they had was that they forgot to mount the Kubernetes secret with the GitLab credentials, so they couldn't pull the image from the registry. Stuff like that was quite common.

So how did we end up unifying all this? We use Kustomize. As Nico explained before, it's really flexible, and in our opinion it's the best way to manage our own software. Helm charts are good if you want to share your deployment with someone outside, but in our case we don't really need to share that at all; we just need to share a set of resources: ingresses, operators, secrets, whatever. So we really use the remote bases feature. Everyone who needs an ingress, instead of defining it themselves once more, just includes the remote base that specifies the ingress. That way we can update it from our end: if tomorrow we want to change some annotations in all the projects, it's as easy as updating the remote base and making everyone redeploy. It also allows us to make sure that no one is doing something that shouldn't be done, insecure stuff for example.

And just to give an idea of how all this looks for us right now, this is how our developers get a Google Cloud Platform project today. Instead of logging in, going to the console UI, clicking some stuff, getting it wrong, and then us having no idea what's going on, they just use our own Terraform modules plus whatever is already provided by Terraform. They just specify things like the name and the GitLab project ID, from which we automatically set up the Kubernetes integration and inject the CI/CD variables necessary to make everything work, plus some basic stuff like whether they want the Datadog agent or not. Right now, setting up a Datadog agent is not much more complicated than setting a value to true in the Terraform file; there's a rough sketch of such a module call below. Once this Terraform file is applied, they get a Slack message saying: hey, you've created a new GCP project, here's what it is, this is your ID.
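To sketch what that Terraform file looks like (the module source and variable names here are illustrative, not our exact internal ones), this is roughly all a developer has to write:

```hcl
# Illustrative sketch: the module source and variable names are made up,
# but this is roughly the amount of work a developer has to do.
module "my_service_project" {
  source = "git::https://gitlab.example.com/infrastructure/terraform-modules.git//gcp-project"

  name              = "my-service"
  gitlab_project_id = 1234   # used to set up the Kubernetes integration and inject CI/CD variables
  datadog_agent     = true   # opt into the Datadog agent
}
```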
The message also gives them the project URL, and then the second part, which is the most important part of the message: a cookiecutter template command. For those of you who don't know, cookiecutter is a project that uses Jinja templates, so you can basically feed it configuration and it will generate whatever you want based on that. We use cookiecutter for almost everything. In this case, we give developers a command that automatically pulls our GCP cookiecutter repo and pre-fills it with some data like the project ID or the project domain. Once they run that cookiecutter, it generates all the files and pushes them to GitLab, and they get something like a README file with all the resources listed, some explanations, some links, useful stuff that everyone should know. And it's in code. So it's reproducible, it's consistent, and it allows us to audit everything and keep it up to date. It solves all the issues explained until now.

Now, once we solved the infrastructure provisioning problem and the deployments, what's next? The codebase. As I said, consistency was lacking everywhere, not just in infrastructure or how we manage resources. Let's take a look at what I mean by this.

First of all, I'm talking about problems like starting with an outdated stack. As I said, we are six years old now. When we started, everything was monolithic, everything was the same codebase. Then we started with containers and so on, which meant splitting the monolith into smaller services. But how do you split a monolith into multiple services? The simplest way is to just inherit the whole stack from the monolith and remove part of the codebase. Which of course is far from optimal, because services ended up with too many dependencies they didn't need, configurations that were wrong simply because things aren't like they were six years ago, or stuff that just wasn't useful. Another issue is that some configurations were just copied and pasted. Same issue as with the manifests, same issue as with everything. You don't really want to spend time researching which config you need every single time you create a new service, and when you are in the phase where you are splitting, I don't know, five monoliths into 300 microservices, it's pretty tedious to research it every time. But of course, this meant that all the projects were, again, inheriting everything from the big stack instead of starting with a lean configuration containing only the parts they needed.

And of course, multiple approaches for basic stuff, like connections to the database or the in-memory cache. Everyone, for some reason, reimplemented it every time, and nobody did it the same way. That meant that when we enforced SSL on databases, some services worked and some didn't, because some were mounting the credentials properly and passing them to the connectors they used, and some weren't. And as I mentioned before, we rely on platforms like Datadog or Sentry for storing all the information about our services, but of course this wasn't always there: some people just forgot about them, and some simply weren't aware they should add them. So that's also a pain point we had. Finally, aspects like logging or error management were often also done in multiple imaginative ways, which made debugging harder and longer for no reason.
So how did we end up unifying all this? By providing resources like code templates that help kickstart your project with all the basics configured. With this, you don't need to worry about how to configure Sentry, how to configure your Datadog agent, how to configure your database connector. Everything is pre-done for you. We have cookiecutter templates for all the main languages we use: one for Python (well, actually two for Python, for sync and async apps), one for JavaScript, one for Golang. It's as simple as running the template and passing some data, and you get the project structure with everything already set up.

But what about GitLab CI? You couldn't expect anything different: the same issues we had in the codebase also show up in CI. Even though probably 95% of our projects in GitLab need a job that is a Docker build, for some reason people ended up with totally different Docker build jobs in multiple places. Here I'm talking mainly about issues like Docker builds without cache that just waste resources for no reason, rebuilding the same layers over and over. Passing secrets insecurely, using build arguments instead of a better solution: even though our self-hosted GitLab is private, our registry is private, everything is private, we still want to keep the same security policy as if it were public, so we really don't want to bundle credentials inside our Docker images. Incorrect approaches for some jobs, like a job with a Docker-based image that would docker run the image you built in the previous stage just to copy some files out, which doesn't make any sense when you could just expose them as artifacts or cache in the stage before, or, if you really want a separate job, mount the same image and copy them out directly. Things like that. Outdated variables: GitLab CI renamed a whole bunch of variables, and even though they haven't removed the old ones yet, we don't want to wait until they do and everything breaks. And missing the latest features. GitLab CI is personally the feature I like the most in GitLab, I really enjoy using it, and they have been adding pretty nice features over time, like the dependency graph between jobs, or pretty nice advanced syntax for deciding when a job should run with only and except. All this stuff wasn't used, because teams usually don't have anyone focused on researching and improving their GitLab CI file.

How did we improve this? Surprise: we provide templates that they can just use. We tried the GitLab Auto DevOps feature, but unfortunately it doesn't really work for us; it works for a really specific use case which is totally different from ours. But the approach is awesome, so we did the same with our own code. We have templates for our most common jobs, and right now, if you have a Python web service with a Dockerfile and a couple of the things we need, you can run your whole CI pipeline by just importing our templates. You don't even need to write a single CI job. And how does it look? This is an example of a basic containerized Python app pipeline that you can run without writing anything: you just have the stages at the top, a couple of variables the jobs use to know what to target, and then you just import everything.
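A rough reconstruction of what such a file looks like; the template project and file names are illustrative, not our actual ones:

```yaml
# Illustrative sketch of such a pipeline; the template project and file
# names are hypothetical, not our actual ones.
stages:
  - build
  - test
  - deploy

variables:
  SERVICE_NAME: my-python-service   # used by the imported jobs to know what to target
  KUBE_NAMESPACE: my-python-service

include:
  # our own shared job templates
  - project: "platform/gitlab-ci-templates"
    file: "/templates/docker-build.yml"
  - project: "platform/gitlab-ci-templates"
    file: "/templates/python-tests.yml"
  - project: "platform/gitlab-ci-templates"
    file: "/templates/kubernetes-deploy.yml"
  # and some templates that come directly from GitLab
  - template: "Security/SAST.gitlab-ci.yml"
```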
Of course, we didn't reinvent the wheel, so some of those templates come directly from GitLab, but all the rest are our own.

And finally, the credential management hell. Honestly, I don't even know where to start here. I didn't join at the very beginning, so I don't know all the problems from back then, but I do know that when I joined, there was still a totally inconsistent way of provisioning access to places. We literally had an Excel spreadsheet with the name of whatever you wanted and who was responsible for it, and we had a please-give-me-access channel in Slack, so you would just go there and say: hey, I need access to this. Of course, that's far from optimal. To solve this, we decided it was time to unify all of it too. In this case we needed two different solutions: one for how humans log in to services, and one for how applications authenticate to other services or get secrets provisioned.

For human access, we switched to SSO. We use Okta, but unfortunately Okta isn't supported by every single tool, so we ended up using Okta to provision the roles and then writing custom Terraform modules for whatever had an API: we pull the user information from Okta and then use the APIs of the different services to create the users in the service itself. For all the service-to-service stuff, we just use Vault. As I mentioned in the previous talk, we also use it for provisioning CI credentials on the fly. For example, all our serverless deployments are done by getting credentials inside the CI job, deploying to Amazon or GCP with the Serverless Framework, and tearing the credentials down at the end of the job. We do have a single point of failure there, in the sense that any job running on the runners that are allowed to connect to Vault could potentially reach more than it should, but it's a really good layer of abstraction that saves us from bigger problems.

And now that I've covered all the pain points and how we ended up solving them, you might be asking: all this sounds great, but how do you make sure it actually gets used, instead of just having a bunch of stuff there that nobody touches? In the end, all of this has a lot of potential, but if it's not used, it's totally wasted. For that, let me introduce you to the Zoo, which looks something like this. What does this thing do? Basically, it indexes all our services: at any point in time we want to be able to get a list of all our services, like, give me all the production services that run in GCP. Thanks to this, we actually can. It also scans our whole codebase every hour. We have a framework for writing your own checks that scan your code: we declare which issues we want to search for everywhere, we define some metadata like a name and a description, and even patches, and then we have our own code for running it all. Every hour the scheduler picks up all the tasks, runs them against every service, and we create the resulting issues in our database. After that, we can either have a button that takes the metadata and opens an issue in GitLab for that same project, or, since our framework even supports patching the code, you can write code that fixes your own code and commits it to GitLab. There's a sketch of the idea below.
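Just to illustrate the idea (this is not the Zoo's actual check API, only a hypothetical sketch): a check gets a checked-out repository, looks for a known problem, and reports it with enough metadata to open a GitLab issue:

```python
# Conceptual sketch only -- NOT the Zoo's actual check API, just the idea:
# a check receives a checked-out repository, looks for a problem, and
# reports it with enough metadata to open a GitLab issue (or a patch).
from pathlib import Path


def check_image_pull_secret(repo_path: str) -> list[dict]:
    """Flag Deployments that never mount the GitLab registry pull secret."""
    issues = []
    for manifest in Path(repo_path).glob("kubernetes/**/*.yaml"):
        text = manifest.read_text()
        if "kind: Deployment" in text and "imagePullSecrets" not in text:
            issues.append({
                "name": "missing-image-pull-secret",
                "description": (
                    f"{manifest.relative_to(repo_path)} defines a Deployment "
                    "without imagePullSecrets, so it cannot pull from our registry"
                ),
            })
    return issues
```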
Automatic patching is not really that easy, though, and we are not covering all our issues with patches, mainly because some things are pretty hard to patch. But it is definitely supported. The Zoo also serves as our analysis and insights platform: as I said, I work in the platform team, which means developing internal tooling, researching best practices and so on, and all of that needs good information. Making informed decisions is hard if you don't even know what your codebase looks like. So in the same step where we check out the repo and run our checkers and patches over it, we also try to gather as many insights from it as possible and store them in our database, so we can search and filter later. It's open source. We have integrations with Sentry, Datadog and PagerDuty. And funny enough, we had some of this before GitLab did: I recently saw that in 12.6 GitLab added support for OpenAPI definitions, which we had already had for quite some time.

And with that, it's time to wrap up. How does all this impact our infrastructure? Git is our single source of truth now. We don't have to guess or spend weeks finding stuff; we can just check the code. All the changes are kept in history, so we can understand why some things were done a certain way, or at least point to the people who did them so we can ask. It's easy for us to roll out changes that impact a lot of things, mainly because we either use common resources shared across projects or, if that's not possible, at least we know there's a fairly common structure, so we can write code that patches our own code. And we centralized both secret and network management, which are critical parts that we don't want to leave up to individual projects to configure.

How does this impact our developers? They can have almost zero-maintenance pipelines: they just import our templates and most of the common stuff is already solved for them. We can provide standards as libraries they just import, and everything is set up for them, agents included, for free. It's easy to create new projects, and it's easy to provision infrastructure without a big hassle.

And what are we missing? All these features I mentioned are pretty cool, but local development is not as easy. We have identity-aware proxies, Vault, whatever; all this stuff is not really comfortable to run locally, so we either have to have bypasses or some mechanism to make it work. We still haven't fully leveraged the more advanced features, like custom resources in Kubernetes: we do have operators, but again, it's something we could research more. Getting people to accept your tools is a bit tricky: if a tool doesn't solve their issue perfectly, people are not really willing to use it, so in the end it either has to solve everything, or it has to do what it does really well. That's one part that is pretty tricky to get across the whole company. And finally, a more reliable project lifecycle management. Right now we cover creation pretty well, but keeping a project relevant, for example adding Kubernetes deployments to an existing project, is something we have only partially automated. And I know you cannot just convert a project to async automatically; that would be pretty nice, but it's hard to do. So what can you do?
Basically, what we did: define your infrastructure as code, and use templates; there are a bunch online, and if none of them fits your specific use case, you can make your own and distribute it back afterwards. Leverage industry standards as much as possible, and try to avoid custom stuff whenever possible, because it ends up biting you. And that's it. I'm running a bit late, sorry. Thanks. I don't have time for questions, but if anyone needs anything, feel free to reach out to me.