So everybody can see it? Good. So hello, and welcome to the infrastructure functional group update for August 2017.

I'd like to start by welcoming our new hires, and that includes me: I'm a new hire myself. My name is South Ishtar, and I'm the new Director of Infrastructure. We've also added John Jarvis as a senior production engineer and Gregory Stark as a database specialist.

To rehash, the goal of infrastructure is to ensure that GitLab.com is ready for mission-critical tasks. This includes the availability of GitLab.com at 99.9%, 99% of user requests served in under one second, and completing the top ten risk assessment actions.

Now, accomplishments. Production: we are now indexing the site with Elasticsearch, and the cluster supporting it has worked very well. We're able to create and destroy the cluster as required with a minimal amount of work, thanks to Terraform and Chef. We've isolated the container registry; this has been a big step in breaking up the monolith, and it moves us to a more cloud-native setup. The Sidekiq fleet has also been split up, and we now have dedicated fleets for specific concerns, which allows scaling and tuning per fleet. Developers can now control feature flags themselves (yay, feature flags). They no longer have to wait for somebody from the production team to set it up for them; it's all through ChatOps. Takeoff deployment has been finished and it's being used by the release managers. A lot of the issues around deployment have been resolved, and it keeps getting better; the latest refactoring is being tested in staging today. And lastly, VPN access has been set up for access to the production fleet.

That brings us to the database. The query timings on the project issues page have been reduced by ten times, with the total response time of these pages being cut in half. Loading time for the Explore projects page is significantly reduced as well, and with these changes, queries for the event feeds are, in the best case, 66 times faster than they were before. Sidekiq and Unicorn connections are now separate in PgBouncer. This means that a spike in Sidekiq activity will no longer result in Unicorn having no connections available. There's a new merge request template for database changes, which should make reviewing database-related changes much easier and faster. In the same vein of making things easier, the database team has refactored their handbook pages, which got a much-needed update.

Gitaly: all Git clones and fetches are using Gitaly now, and have been doing so for a week. This has greatly reduced NFS traffic. The Gitaly-Ruby sidecar is now running experimentally in 9.5; with this we can skip porting the Ruby Rugged code to Go, which allows us to focus purely on the endpoints. We are well on track for meeting the Q3 OKR of 25 migrations.

Security: package signing is now active in 9.5, and remote access to production now requires a VPN, which is a big step.
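As a brief aside on the Sidekiq fleet split mentioned above, here is a minimal sketch of what pinning a worker class to a dedicated queue can look like. The queue name, worker, and concurrency are hypothetical examples for illustration, not GitLab's actual configuration.

    require 'sidekiq'

    # Hypothetical worker pinned to its own queue so a dedicated fleet can
    # process, and be scaled and tuned for, just this kind of work.
    class PipelineProcessingWorker
      include Sidekiq::Worker
      sidekiq_options queue: :pipeline_processing, retry: 3

      def perform(pipeline_id)
        # heavy pipeline work goes here
      end
    end

    # A dedicated fleet is then just Sidekiq processes started against that
    # queue, with concurrency tuned independently of the other fleets:
    #   bundle exec sidekiq -q pipeline_processing -c 15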
There's now a wider deployment of the intrusion detection system in the GitLab environment. A GitLab static analysis gem has also been merged; it adds rules to RuboCop to look for the use of calls that have known security issues, so we can flag them before they even get into production. We engaged an external security audit firm, and they found a critical vulnerability in Git. I'm sure we all saw the announcements and the alerts; everybody I know in the industry did, which is great. We've also added protections against denial of service in CI stemming from a very old but very interesting Ruby regex.

Now, concerns where we need help. NFS fleet stability is still an issue. We ran into a recent issue with Gitaly consuming all of a host's resources. Luckily this has a quick fix: putting the Gitaly process inside a cgroup that limits how much of the host's resources it can consume. However, NFS overall is still a single point of failure in the current setup. The good news is there has been a lot of work on circuit breakers (shout out to Bob) and on Gitaly to remove this.

Another issue is the size of the storage we're encountering outside of the repositories: the artifacts. It's very large and becoming unwieldy. This needs to move to object storage to make it horizontally scalable.

We also have a concern with iteration speed. When something is deployed and there are issues or blockers, it can take a long time before anybody can get a fix out and deployed. To alleviate this, we're highly recommending feature flags. The Gitaly team has used these to great effect: they're able to turn new features on and off and ramp them up, because as we know staging cannot give us the same level of traffic as production. They're able to ramp up to 10%, then 20% of traffic; hopefully they never find bugs, but in case they do, they can just turn the flag off rather than having to redeploy.

On to database concerns. While it is being worked on, we do not yet have a full failover system in place in production. There's also an issue with the primary's disks that causes IO throttling under high load. Both of these are being addressed in linked issues and hopefully will be resolved soon. PgBouncer is still running in our temporary setup instead of being based on Omnibus, so that's an issue that needs to be resolved. And lastly, InfluxDB is incapable of handling 25 hours of data. It handles 12 fine, but it croaks at 25: around midnight it just stops recording data for about 20 minutes. So we have to keep it in 12-hour increments and recalculate every 12 hours.

Gitaly resources: the major blocker right now to delivering Gitaly faster is simply the current engineering resources; the more engineering resources we can get, the more endpoints we can get through, faster. File server scaling: Gitaly is handling more operations with every release, but until we're off of NFS we're unable to scale horizontally. This means Git operations are concentrated on the existing file servers, and those are the only resources we can use. Security: we're seeing a definite increase in the level of CI abuse, and we feel the large security issues are not getting the priority needed to move them forward.
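To illustrate the circuit-breaker idea mentioned above, here is a generic sketch, not GitLab's actual implementation: after repeated failures against a storage shard, calls are short-circuited for a cool-off period instead of piling more load onto a struggling NFS server. Shard names, thresholds, and error classes are illustrative only.

    require 'timeout'

    class StorageCircuitBreaker
      FAILURE_THRESHOLD = 5    # failures before the circuit opens
      RESET_TIMEOUT     = 60   # seconds before we let a request try again

      def initialize
        @failures  = Hash.new(0)
        @opened_at = {}
      end

      # Wraps a storage operation; fails fast if the shard's circuit is open.
      def call(shard)
        raise "circuit open for #{shard}" if open?(shard)
        result = yield
        @failures[shard] = 0          # a success closes the circuit again
        result
      rescue Errno::EIO, Timeout::Error
        @failures[shard] += 1
        @opened_at[shard] = Time.now if @failures[shard] >= FAILURE_THRESHOLD
        raise
      end

      private

      def open?(shard)
        opened = @opened_at[shard]
        return false unless opened
        return true if Time.now - opened < RESET_TIMEOUT
        # cool-off has passed: allow one attempt through ("half-open")
        @opened_at.delete(shard)
        @failures[shard] = 0
        false
      end
    end

    # breaker = StorageCircuitBreaker.new
    # breaker.call('nfs-file-05') { perform_git_fetch(repo) }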
Now, our plans. The first thing is that you can't fix what you can't see. To that end, we're revamping the logging infrastructure. Eventually we want all logging to go through it, ending the need for individuals to log into a machine to view any logs; there should be one centralized place for logging.

We're splitting up the Redis cluster. Initially it's just being split into a cache cluster and a cluster for everything else, but this lets us dump the cache without affecting operations in other parts of the site. There's also work being done to make GitLab.com region- and cloud-redundant with Geo. This will allow us to fail over to other locations if the current one is experiencing issues, such as the lowlight this past month when a repository storage server went offline for over seven hours. If we have something like Geo in place, we'll be able to fail over to a different region and mitigate a lot of that downtime.

For the database: Yorick is not here, but Greg is with us now. The plan is to bring him up to speed, which will reduce the load on Yorick immensely and improve the throughput of the whole team. The database team will be working with the production team to ensure the primary's disks are up to snuff, and they're working on an automated failover system for the database in production. PgBouncer will also be changed to be set up using Omnibus, and the temporary solution will be removed. And of course, work is always continuing on improving important controllers to make them faster.

Gitaly: migration, migration, migration. You know, that's their plan. I believe after this OKR there are going to be 75 more endpoints to migrate.

For security, there are auditing improvements; we'll continue to improve auditing, including investigation of events such as account lockouts, repository access, and group membership changes. Then we're going to have big improvements in our vulnerability scanning infrastructure, and automated, thorough dynamic web scanning for each release.

And, as always, we're looking for more people. If you know people who would fit these roles: we're looking for a Director of Security, security specialists, security engineers, and production engineers. We're always on the lookout for good people, so please point anybody you know our way.

Now let me see if there are questions in chat. I'll stop sharing my screen just to see the chat.

Right, "they never find bugs", never. So feature flags are recommended for features that might affect infrastructure, not for 99% of our features, right?

I mean, I would recommend feature flags just because we never know how features are going to operate under load. A good example is the GPG signatures: it seems like a nice feature, but it started really spiking load. With a feature flag you can just disable it rather than having to hot patch systems. It's also a good way to test which features people like and use; I've seen feature flags used almost like A/B testing. But yeah, from our point of view, for anything that touches the infrastructure, anything that could harm the infrastructure, we definitely want to see feature flags used.
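To make that recommendation concrete, here is a minimal sketch of guarding an expensive code path behind a flag, using the Flipper gem. The flag name, the ramp percentages, and the in-memory adapter are illustrative assumptions; GitLab's own Feature and ChatOps wrapper differs in its details.

    require 'flipper'
    require 'flipper/adapters/memory'

    # In production the adapter would be backed by the database or Redis;
    # the in-memory adapter keeps this sketch self-contained.
    flipper = Flipper.new(Flipper::Adapters::Memory.new)

    # Guard the expensive path so operations can switch it off without a deploy.
    def create_commit(flipper, commit)
      if flipper.enabled?(:gpg_signature_verification)
        # expensive verification work happens only when the flag is on
      end
      # the rest of commit creation is unaffected by the flag
    end

    flipper.enable_percentage_of_time(:gpg_signature_verification, 10) # ramp to ~10% of calls
    flipper.enable_percentage_of_time(:gpg_signature_verification, 20) # then ~20%, and so on
    flipper.enable(:gpg_signature_verification)                        # fully on once it behaves
    flipper.disable(:gpg_signature_verification)                       # kill switch under load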
Yeah, I think it's a trade-off. Look, using a feature flag means you're releasing it and then you have to clean up the code after everything gets to 100%, so now every feature takes two releases, takes two merge requests, takes two reviews; you might forget, you might have old code still there, et cetera. So it is a trade-off. Use them when there can be an impact to production, but for 99%, or maybe 89%, of our features, if you just change something in the front end it's not very likely to have an impact. From time to time we might miss something, but it's a trade-off of velocity for reliability. Something like GPG keys, yeah, someone might think about the impact it has, or we might miss it, and that happens. But it's not like every new feature now needs a feature flag; that's not what we want, though I think it's something we should be saying to all the back-end developers.

Okay, fair enough. Infrastructure recommends them around stability issues, but yeah, it's a judgment call.

Do you think we should expose feature flags to end users, like in the admin area?

That's a product decision; I'm just recommending them for us to use internally to stabilize the infrastructure. Yes.

Can I add something on that topic? We sometimes have, for example, a high load on the database and we don't have any way to respond. When Twitter had stability problems they had kind of a dark mode, where the basic things would work, but the things that really taxed their infrastructure, like complex searches, got disabled for a while. I think it would be really worthwhile if we had something like that in GitLab: not for new features, but for the things we know cause a lot of load, with the production team able to turn them off. So if we're having a problem we can turn off non-vital services, so that at least all the issues still load, because people really need those.

Yeah, that sounds good; I'm just making quick notes.

Yeah, exactly, Brian, like disabling mirrors. So instead of something like max mirror capacity, which we'll never get exactly right, have a mode that says, look, we're under load. We shouldn't call it a panic mode, but maybe a dark mode, or reduced functionality, or something like that: going to a low-load mode for GitLab.

Yeah. Yeah, Tony, it's to prevent fail whales, not to present them.

Are there any other questions? No? Okay. Well, thank you everybody for your time, and I'll see everybody on the team call in a couple of minutes. Have a good weekend if you're in Europe, or have a good Friday if you're in the US or somewhere else. Take care, everybody.
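As a closing note on the "dark mode" idea discussed above, here is a minimal sketch of what a low-load switch might look like. Nothing here is an existing GitLab feature; the feature names and the environment-variable toggle are purely illustrative assumptions.

    # Known-heavy, non-vital features that one ops-controlled switch can shed
    # while core functionality (issues, git access) keeps working.
    HEAVY_FEATURES = %i[repository_mirroring complex_search activity_feeds].freeze

    def low_load_mode?
      # In practice this would be a feature flag or Redis key settable via
      # ChatOps; an environment variable keeps the sketch simple.
      ENV['GITLAB_LOW_LOAD_MODE'] == '1'
    end

    def feature_available?(name)
      !(low_load_mode? && HEAVY_FEATURES.include?(name))
    end

    feature_available?(:repository_mirroring) # => false while in low-load mode
    feature_available?(:issue_tracker)        # => true, vital features stay up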