All right, on the dot. Let me share; you have the slides in the invite. Can everybody hear me? Yes? Thanks. So, functional updates from infrastructure, production in particular.

Last week we built a streaming backup solution with WAL-E. This came after the outage; we basically decided we wanted almost zero data loss. Roughly every minute we push the WAL segments into S3 and Azure object storage, which means we can actually restore the database to any point in time. It's been working for almost a month and it's working really well. It's great, actually.

We also moved all of the front-end and back-end fleet to the ARM environment. ARM here is Azure Resource Manager. We were using the classic environment, which is the old one; they have a new one, which is kind of the same thing AWS did some time ago. We used the opportunity to split API, web, Git, and Sidekiq load onto different sets of machines. That let us downsize a lot of the hosts: instead of 20 hosts that are all exactly the same, there are many more hosts, but they're much, much cheaper, and that actually reduced the bill. We also removed the legacy original NFS backend. It's gone. It was 48 devices and it was basically wasting money.

We also built a continuous delivery system for development boxes using Terraform. If you want a development box, you just send a merge request; it gets merged into master and the box pops up. That's it. That gets us out of the business of manually provisioning and delivering these kinds of boxes. We also enabled Postgres load balancing in production last week. It's actually quite interesting: you can see how the load changes, and it's making the database much more resilient. We still have some challenges there, but it's working better. And we moved Prometheus and all the monitoring to the new ARM environment as part of the fleet-to-ARM process we started.

We used this opportunity to look at another thing, which is challenges of scale. We're basically in unexplored territory with what we're doing. We had to upscale Prometheus from 7 GB of RAM to 30 GB of RAM because it was dropping metrics on the floor; it just couldn't keep up. We're also hitting IO limits on the database. This means that every now and then the database performs a checkpoint flush to disk and blocks, because it just can't keep up; we keep getting more and more data. Mirrors are also a problem right now. They now run once per day to reduce the pressure; otherwise they were basically hammering the file system all the time. There is an ongoing effort on this from the development side because we're going to change the way it works, but that's going to land in 9.1 or 9.2, if I remember correctly.

On security, we don't want people to SSH into production servers. We raised this issue and got a lot of pushback from a lot of teams, because they basically need to SSH in to run a Rails console to do their work. That's fine, but it's really a tooling problem: we don't have the right tooling to enable people to do their work, and it's showing.

We also have challenges on resiliency. The moment we moved the fleet to ARM, we released a lot of capacity on the front end and the back end started struggling. That's when we hit the IO limit on the database. Yesterday, it seemed to be a Git push that was taking us down.
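What we can see today is the aggregate picture from node_exporter. As a rough illustration, here is a minimal sketch of pulling the NFS hosts' disk write rate for an incident window over the standard Prometheus HTTP API; the server address, metric name, and instance label are assumptions for illustration, not our actual setup.

```python
# Minimal sketch: query the disk write rate on the NFS hosts around an
# incident window via the Prometheus HTTP API. The Prometheus address,
# the metric name (node_exporter versions differ), and the instance
# label matcher are assumptions, not our real configuration.
from datetime import datetime, timedelta, timezone

import requests

PROMETHEUS = "http://prometheus.example.com:9090"  # assumed address
QUERY = 'rate(node_disk_bytes_written{instance=~"nfs.*"}[5m])'  # assumed metric/label

end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

resp = requests.get(
    f"{PROMETHEUS}/api/v1/query_range",
    params={
        "query": QUERY,
        "start": start.timestamp(),
        "end": end.timestamp(),
        "step": "60s",
    },
)
resp.raise_for_status()

# Each series carries its labels (instance, device) plus [timestamp, value]
# samples; report the peak write rate per series.
for series in resp.json()["data"]["result"]:
    labels = series["metric"]
    peak = max(float(value) for _, value in series["values"])
    print(f"{labels.get('instance')} {labels.get('device')}: peak {peak / 1e6:.1f} MB/s")
```

That gives per-host, per-device write rates, but nothing in it ties a spike back to a particular repository or push.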
We still don't really know, because there's a lack of observability there, and we're being impacted by that. So we have quite a set of challenges ahead.

In the next week we expect to have the streaming backup for all the databases we have right now. There's version.gitlab.com, for example, and some other sites we own that aren't covered yet; they're being backed up with snapshots, but we don't have good enough backups for them. We would also like to have backups for files. That's going to be a long one, because we have a lot of data; it's going to take a while, but we'd like to at least get it started. We also want to remove the legacy file storage host, which is one of the first hosts we had for shared files, for example logs, artifacts, etc. I think we have something like 20 terabytes of artifacts now, which is a lot. We want to drop that server and replace it with more, cheaper hosts.

We want to move to Vault instead of Chef Vault, basically because Chef Vault does not scale with the number of people, hosts, and secrets we have around, and it's getting really challenging to use. We would like to start using service discovery and real-time configuration for the fleet. We're going to need this to start using things like review apps for staging; we'll need a way to manage the fleet dynamically, not as rigid as it is with Chef right now. We would like to start building VMs and container images with Chef Solo, which means using our own Chef configuration to drive those VM and container image builds, roughly along the lines sketched below. We'll do that for review apps, and afterwards we'll use it for production, separating pets from cattle: pets being the machines that stick around, the ones we adopt, for example NFS servers, and cattle being everything else that's basically ephemeral, for example front-end machines like web workers or Sidekiq.

We would like to move deployments out of the terminal and into Marvin, to start doing deployments with ChatOps. We took the first steps in that direction by starting to use Chef push jobs, and it's working well so far. The next thing we want to do is get that out of the terminal so people don't need to SSH into production anymore to perform a deploy. And finally, we would like to have multiple replicas and load balance across all of them. Right now we have just one replica and one primary; we would like to have at least two replicas. That's ongoing and should land soon, but I don't know if we will have more than that. We need to discuss it.

So, any questions? I will stop sharing now and check the chat. Yeah, regarding ARM: that's Azure Resource Manager. We're on Azure. They had the classic environment with certain shapes of machines, and they now have a new infrastructure and they're basically pushing everyone onto it. That's all the change is; it's not the CPU architecture. The problem is that we had half the fleet on the classic side and half on the new ARM side, with a peering network in between, and that costs money. So we started pushing everything to the new infrastructure instead of the classic one, so we can stop paying for something that's basically waste.

Containers in our infrastructure? Not yet. We would like to.
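On the container image side, the rough shape of "build images with our own Chef configuration" could be something like the sketch below, with chef-solo converging the node at image build time. The base image, recipe name, and paths are placeholders, and this is one possible approach rather than anything we have in place today.

```python
# Sketch: build a container image whose configuration comes from our own
# Chef cookbooks, converged by chef-solo during `docker build`. The base
# image, cookbook layout, and recipe name below are hypothetical.
import shutil
import subprocess
import tempfile
from pathlib import Path

DOCKERFILE = """\
FROM ubuntu:16.04
# Install chef-solo via the Omnibus installer, then converge the node
# at build time instead of at boot.
RUN apt-get update && apt-get install -y curl \\
 && curl -L https://omnitruck.chef.io/install.sh | bash
COPY cookbooks /chef/cookbooks
COPY solo.rb node.json /chef/
RUN chef-solo -c /chef/solo.rb -j /chef/node.json
"""

SOLO_RB = 'cookbook_path "/chef/cookbooks"\n'
NODE_JSON = '{"run_list": ["recipe[gitlab-web]"]}\n'  # hypothetical recipe


def build_image(cookbooks_dir: str, tag: str = "gitlab-web:review") -> None:
    """Assemble a throwaway build context and run `docker build`."""
    with tempfile.TemporaryDirectory() as ctx:
        ctx_path = Path(ctx)
        (ctx_path / "Dockerfile").write_text(DOCKERFILE)
        (ctx_path / "solo.rb").write_text(SOLO_RB)
        (ctx_path / "node.json").write_text(NODE_JSON)
        shutil.copytree(cookbooks_dir, ctx_path / "cookbooks")
        subprocess.run(["docker", "build", "-t", tag, ctx], check=True)


if __name__ == "__main__":
    build_image("./cookbooks")
```

The appeal is that the same cookbooks that configure the existing fleet would produce the images, so review apps, and later the production cattle, stay consistent with what Chef already manages.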
The thing is that first we need a way to build and configure these containers with our current Chef, because everything needs to play along, and that's what we're working on right now. We will start moving things to containers; I don't know exactly at what rate, but containers on their own are not a goal per se, they're basically a side effect of what's coming. With a container scheduler? Yes, I mean, that's the beauty of it, right? Any more questions?

"Pablo, you said that one push can take us down, but you're talking about parallel Git pushes, right? You're not talking about a single instance pushing." The problem is that we have an observability problem there, so we don't really know exactly. What we do know is that there was a huge spike in writes on the NFS server; it basically choked the NFS server, and then we went down because everything across the fleet was waiting on IO. We don't know exactly where it came from because we don't have good enough metrics there, so we first need to understand exactly where it's coming from. There's a discussion with the Gitaly team to have some form of this, the same thing Bitbucket did, where they were rate limiting, so to speak, the Git access. We will probably need to do something similar to prevent people from pushing a lot of data at once; there's a rough sketch of that idea below. The problem when you do a Git push over SSH is that it starts a process and shovels a lot of data at once, and depending on your connection speed it may use all the available IOPS. That's what it looks like. The problem is that we don't know for sure that that's the case. We are using Prometheus to log this, but the issue is that Prometheus gives us the rate and the bandwidth used; we already have that, but we cannot identify which repo it is with Prometheus the way it is today, so it looks like we need something else there. I think I opened an issue for that yesterday; we can go back to that one. Any more questions?

Yes, Brian: GitLab Shell and Gitaly. I think in the short term we're going to be replacing the Git part of GitLab Shell with a Gitaly client; maybe that's the place to do it. Okay, have a great day. Oh, hold on, Jim. Yes, we're using GitLab to manage Chef. We use the dev environment for that, so in case GitLab.com goes down we can still recover it with the dev one. It's all Git, basically. Okay. Have a great day, everyone. Bye.
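For reference, the per-repository throttling of Git pushes mentioned above could take roughly this shape. It's only a sketch: the capacity and refill numbers are made up, and wiring it into the receive path as a pre-receive-style check is an assumption, not something that exists today.

```python
# Sketch of per-repository throttling for large pushes, in the spirit of
# the rate limiting discussed above. Capacity/refill numbers are made up,
# and calling this from a pre-receive-style check is an assumption.
import time
from collections import defaultdict


class TokenBucket:
    """Allow `capacity` bytes in a burst, refilled at `rate` bytes per second."""

    def __init__(self, capacity, rate):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self, nbytes):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False


# One bucket per repository: a 1 GiB burst, refilled at 50 MiB/s (made-up numbers).
buckets = defaultdict(lambda: TokenBucket(capacity=1 << 30, rate=50 << 20))


def admit_push(repo, pack_bytes):
    """Hypothetical check run before accepting a pack from a push."""
    return buckets[repo].allow(pack_bytes)
```

A bucket like this leaves normal pushes untouched and only slows down a client trying to shovel a very large amount of data at once.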