All right, good morning, folks. Welcome to the first fireside chat put on by your GitLab technical account managers. My name is Andy Gunter; I'm a technical account manager with GitLab. Hopefully everyone can hear me okay. Thank you all for joining — I can see the attendee list growing. Today we have a special speaker by the name of John Northrup. He's a senior production engineer on GitLab.com, and the topic today is based on questions fielded by the technical account managers from the likes of yourselves and other customers: why do we need it, how do we do it, how does GitLab.com sustain the volume that it sustains, and all of the good technical details.

Let me quickly go around the small virtual room and make quick intros on folks' behalf. We have Luca Williams, a technical account manager covering the East Coast and maybe some of the Midwest; myself, Andy Gunter, technical account manager, as I said, covering the East Coast; and John Northrup, production engineer for GitLab.com. I think that covers it. I don't want to spend too much time talking myself, so I'm going to let John jump into it unless there are any other items to cover beforehand. You came to hear a little bit about how this works, so we'll start that off. John?

Hey, folks. Greetings from sunny Nashville, Tennessee. As Andy said, my name is John Northrup. I'm a senior production engineer, and I take care of running all of GitLab.com's infrastructure as well as the internal infrastructure we use to build GitLab.com and the on-prem product that you run at your sites. Today's overview is a look at how we run GitLab.com, the architecture involved, and some of the things we do to make that scale. I'd like to keep it informal so you can ask questions as we go. Your session should have a Q&A function where you can ask questions; Andy will receive those and we'll work through them. This is a fireside chat, like Andy said, so don't feel the need to reserve all the questions for the end. As we go along I'm going to hit some high-level things to introduce you to our architecture and how we do things — if you have questions along the way, feel free to ask them. I also have additional questions here that your TAMs have submitted, so if you don't have any at the end, I'll start working through those.

So let's dive right in. Today we're going to cover a series of things, starting with our current GitLab architecture and an overview. We're going to look at the components of state — what is stateful in our architecture and what isn't, what you can reboot any time you want and what you need to handle with gloves on. Then our HA configuration and how we scale what we scale; the scaling of GitLab in terms of which components scale horizontally and which scale vertically; front-end load balancing, or how we apply our secret sauce; keeping track of it all, meaning how we manage all of this; and then keeping it running — how we monitor it and what that looks like.

We'll start with our architecture. GitLab.com uses a series of front-end load balancers — HAProxy load balancers — and we split the traffic that comes into those. We have a series of hosts that cover our API, we have our registry for Docker images,
we have web for the UI front end, and then we have Git for the Git interactions, whether they're HTTP- or SSH-based Git sessions. If you look at these, they're all components you would find inside your Omnibus installation — we just break them out.

Looking further at our database architecture, because we're going to cover that next, this is where we tend to get a little more complex. We run Postgres as the internal database for GitLab.com, and we run it with some of the high-availability features that we're shipping in the Omnibus package. So we have Postgres with PgBouncer in front of it for connection pooling, and then we're also using repmgr for adding Postgres nodes dynamically, so we know when we have one node or when we have three nodes. When we have three nodes, the other two become followers of the leader, and the internal GitLab application routes accordingly: write queries — updates on issues, comments, and things like that — go to the primary, while displays of those things, where you're not updating or writing data, just fetching it, go to a follower SQL node. That's how we distribute SQL load. This is all part of Omnibus, and we manage it through, like I said, PgBouncer for connection pooling, repmgr for dynamically adding nodes, and then Consul. Consul is a service discovery tool, and it's what we use to dynamically advertise when a SQL database node comes online or goes offline; that's how we keep this fluid. The service discovery piece is something you'll see grow within the GitLab product — the database is the first place we started using it, and we hope to scale it out to things like Redis and web nodes, but this is the first place we've actually implemented it.

On to components of state. This can be fairly obvious or not, depending on how technical you are. Within GitLab there are components you need to be mindful of, because the data you're pushing to them or pulling from them is stateful, and it needs to have actions around it that maintain that state so you preserve the quality of your data and your application. There are other things in the application that aren't stateful.

Systems that need care: our database storage, the system of record. Inside the Omnibus package, inside GitLab, the Postgres database is the system of record for everything — everything from what your user ID is, to comments on an issue, to the latest hash ref to be displayed for the Git repo you're looking at. This is all in Postgres, and we balance it with PgBouncer for connection pooling. Those two systems need care when you maintain them: any action you take on PgBouncer or the Postgres database needs to be mindful that there is stateful replication happening — WAL transaction logs between the Postgres nodes, communication between PgBouncer and the Postgres database, and the applications talking to PgBouncer to access the database.
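A minimal gitlab.rb sketch of that split, as an illustration rather than a copy of GitLab.com's actual configuration — the hostnames are placeholders, and the setting names are taken, as best I recall, from the Omnibus HA and database load-balancing documentation of that era:

```ruby
# On an application (Rails) node -- writes go through PgBouncer to the
# current leader, reads are spread across the follower nodes.
gitlab_rails['db_host'] = 'pgbouncer.example.internal'   # placeholder host
gitlab_rails['db_port'] = 6432                           # PgBouncer's listening port
gitlab_rails['db_load_balancing'] = {
  'hosts' => ['postgres-02.example.internal',            # read-only followers
              'postgres-03.example.internal']
}

# On a database node, the bundled role wires Postgres, repmgr and the
# Consul agent together, and Consul advertises the node for discovery.
# roles ['postgres_role']
# consul['services'] = %w(postgresql)
```

The exact keys can shift between releases, so treat the current HA docs as the source of truth; the point is that the write path and the read path are configured separately.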
Next, our work queue — that's Redis. In most Omnibus instances you have a single Redis install, and Redis is where all of the real-time transaction queue work happens. Any time you click an object to be modified, any time you submit new text for an issue or a note on something, that gets submitted into the Sidekiq queue — which is Redis at the back end — and then pulled off for processing. This queue is stateful in the sense that you need to take care with the Redis process. Most of you have it writing to disk — that's the base configuration as shipped — so if you shut it down it dumps its contents to disk, and when you start it back up it pulls them back from disk. We'll talk about scaling later on; on GitLab.com we have split the Redis work into two different queues, a cache queue and a persistent queue, and the cache queue we don't write to disk — if we lose it, it's just cache data that can be cached again.

Stateless systems: the UI display — the web nodes that are just running the web front end really have no state to them — and processing, which is the Sidekiq nodes. Some of you may or may not be running Sidekiq nodes independently; it's a feature you can run independently, but even on a single-node server, restarting the Sidekiq process is a non-destructive thing you can do to stop and restart your queue processing.

Next up is HA configurations — and I realize I'm moving fast here; we'll take time for questions, so if you have questions please ask them and we'll come back to specifics.

John, this is Andy — we do have a question, and I'm happy to convey it now or hold it until the end, but I think it's relevant to this section. The question is whether Consul is embedded or included in the Omnibus package, or separately installed.

Yeah, so in the Omnibus package today we do include Consul, and it's installed along with the Postgres database. If you're using — I'm forgetting our product names now — the high-availability portion that we ship with this, we include Consul and the Consul agent, and when you set up multiple nodes they register as Consul agents of each other to do the Consul replication.
Consul is, of course, another thing you could split out independently; in the Omnibus install we keep it on the Postgres database nodes, so they're running Consul and the Consul agent to talk to each other — a Raft-based algorithm for who's available, using the gossip protocol. It tracks who's available and who's not, does health checks on Postgres, and if a Postgres node becomes unhealthy, the Consul agent pulls it out of Consul. That is shipped with Omnibus today.

Any more questions? Nope, that's it for now, thank you.

Okay. Those are the stateful components; HA is next. This will be a brief touch on HA and then we'll talk about it a little afterwards. One of the common questions the TAMs were seeing was about priority levels for HA consideration — what's the most important thing to look at if I'm going to be adding HA to my system? We really feel that database-level high availability is what you should consider first if you're just starting to look at this. We've tried hard to make database high availability attainable and easy through the use of Consul, PgBouncer, and repmgr, so that all you really have to do is spin up another node, tell it in your gitlab.rb configuration that it's going to be running the database function, and then use a gitlab-ctl command to register the nodes to each other so they know there are two databases. What happens from there is that the second database automatically receives a configuration that puts it in follower mode, a replication slot is added on the primary, and it begins replicating data to the secondary. Once that's set up, you can go into the gitlab.rb configuration on your web nodes and all your other nodes and tell them you have a multi-database setup, and when you do, you get the traffic pattern I talked about earlier: your reads and queries come off a secondary node, and your writes happen on the primary database node.

Next is Redis. If Postgres is the heart of the GitLab Omnibus application, Redis is the vein structure that delivers the blood to the entire system. It's our messaging system for our queue structures and pub/sub architecture; it's what distributes the workload throughout the application and makes sure it gets processed and returned. Today we ship Redis Sentinel with Redis, which lets you run the same kind of gitlab-ctl commands to have multiple Redis nodes that communicate with each other. There's a similar process inside the configuration to add another Redis node with the gitlab-ctl commands. Today that is not an auto-discovery process: you run Omnibus on the node, you set the configuration options to tell it it's going to be a Redis node, you put the Redis Sentinel configuration in place — which is in the documentation — and then you start Redis up. The nodes see each other, communicate that they should be part of a Redis pool together, and then you update your client gitlab.rb configurations to say "I have two nodes in my Redis Sentinel configuration," or however many nodes you've added.
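Again as a rough illustration rather than GitLab.com's literal config — hostnames and passwords are placeholders, and the exact keys should be checked against the Omnibus HA documentation — the client side of a Sentinel-backed Redis pool looks roughly like this in gitlab.rb:

```ruby
# On the Redis/Sentinel nodes: give the replicated pool a logical name that
# Sentinel tracks, and require a quorum of Sentinels to agree on a failover.
redis['master_name'] = 'gitlab-redis'      # placeholder logical name
redis['master_password'] = 'CHANGE_ME'     # placeholder password
sentinel['quorum'] = 2

# On the application nodes: point Rails at the Sentinels instead of a single
# Redis host, so it always follows whichever node is currently the master.
gitlab_rails['redis_sentinels'] = [
  { 'host' => 'sentinel-01.example.internal', 'port' => 26379 },
  { 'host' => 'sentinel-02.example.internal', 'port' => 26379 },
  { 'host' => 'sentinel-03.example.internal', 'port' => 26379 },
]
```

The idea is the same as on the database side: the clients don't care which physical node is currently the master; they ask the discovery layer.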
After that we move on to the web front end, and the last one is Sidekiq queue processing. You want users to be able to reach your site and do all the things we've added in the UI — checking where milestones are at, issue status, and so on. Adding web front-end availability is really easy: it's literally installing another box, node, or configuration of the Omnibus package and telling it where the database and the other back-end pieces are. Sidekiq is also an important structure here, in that you want availability for the job processing you have. This leads into later questions about scaling, because adding a second Sidekiq node, if you just add it by default out of the box, gives you duplicate queue processing for all the queues you have in Omnibus — and that's very helpful; it means you have two nodes processing all the queue work that takes information in and out of the process for GitLab, and you can lose a Sidekiq process or a Sidekiq node and everything continues on just fine. One of the things we'll talk about in a bit is what to do if you're seeing slowdowns in processing — computing milestones or burndown charts and things like that — where we can add Sidekiq queue processing to speed things up, and that's part of scaling.

So in scaling GitLab, what we're going to look at are the parts to scale. Sorry, moving on here — if there were questions from that last section, please feel free to ask them.

We did have a question, John — it popped in, and it was on my mind as well, and I think we might get a couple more. The first question: in an HA architecture, the NFS part you use to get shared storage is often the main bottleneck in terms of SLA
and resilience, from the perspective of the questioner. So is there any scheduled evolution — trying to translate a little here — toward more optionality regarding storage or persistence? Or is there any option for a different kind of file system or server other than NFS?

Yeah, let me stop and talk about that for a bit. We've gone through a series of evolutions with storage on GitLab, and specifically on GitLab.com. When GitLab.com started, it was a single-node architecture. In its history it went to DRBD-replicated disk; that worked until it had to scale a whole lot larger. It then went to NFS — a single NFS back end with multiple front-end workers — and, as you've pointed out, that works until it doesn't, because it is a bottleneck for scaling. The next piece relates to a question somebody submitted to a TAM: have we looked at things like Ceph and CephFS, or other large storage file systems that present themselves as a single POSIX file system to the guest system? We have. We actually spent about five to six months implementing, rolling out, and using CephFS as a centralized storage back end. What we found was that it added an inordinate amount of complexity to the configuration. It's not impossible — if you already have Ceph and CephFS experts in your shop and you're running on bare metal or in environments you control, it might be an option for you. For us, running Ceph with CephFS on top of it in the public cloud, we became the noisy neighbor — and we have blog posts about this — where the I/O write load we were subjecting our cloud provider to was so high that they began introducing I/O wait latencies on us, because we were the noisy neighbor affecting other people. So if you're running on-prem and you've already got expertise in this, it's a decent way to go.

Let me pause and roll back in the evolution: when Ceph didn't work for us, we implemented NFS sharding. NFS sharding is in the application today. What it allows you to do is stand up multiple NFS servers and tell your GitLab installation that you have multiple NFS servers to shard across; as new repos get created, they're created in a random, round-robin order across them, and the database keeps track of where they are. Now, this still leaves you with a single point of failure — you just now have one NFS server serving a given repository. Most on-prem sites use an NFS appliance for this, so it'll be something like an EMC file system or — I'm blanking on the popular competitor to that — but an NFS appliance, gated behind an HA NFS front end that's actually pushing out to multiple storage systems in the back end.
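For reference, the sharding John describes maps to the repository storage paths setting in gitlab.rb — a minimal sketch, with placeholder mount points, assuming each path is an NFS mount from a different server:

```ruby
# Each entry is a separate repository storage; GitLab records in the database
# which storage each project lives on.
git_data_dirs({
  "default"  => { "path" => "/mnt/nfs-01/git-data" },  # placeholder mounts
  "storage1" => { "path" => "/mnt/nfs-02/git-data" },
  "storage2" => { "path" => "/mnt/nfs-03/git-data" },
})
```

There is also an admin setting (at least in versions of that era) to choose which storages accept new projects, which is one way to steer new repos onto the newest shards while leaving older shards to grow in place.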
Where we're going from here is an open source offering called Gitaly. Gitaly is a service-based architecture where we remove NFS and Git from the front-end nodes. We're going to create Gitaly storage pools; these storage pools serve much the same purpose as the NFS servers, but you don't have to mount them over NFS — there's a Gitaly process on the front end that communicates with the Gitaly server, and there's no NFS mounting. That's version one, and we're just about rounded out on it. The next version will take care of replication — sideways replication inside of that — to bring HA, so we can do Git distribution across the back end for high availability and disaster avoidance without you having to mount and access multiple stores. So that's the path there.

So John, let me add question two to this, which is really the second part. To be more specific: does GitLab schedule some evolution to support more or different shared file systems than NFS — I think you answered that — like SMB version 3.0? The Linux kernel 4.11 supports SMB version 3.0 shared file systems.

Yeah, to really answer that from the standpoint of GitLab: it doesn't care. All you need to present to GitLab is a POSIX-standard file system. You can present that over NFS, you can present it over SMB, however you want to do it — it just has to be a POSIX-based file system that you're presenting to GitLab. So if you want to do SMB v3 shares to, say, the web front-end nodes and the Git nodes, that's perfectly doable. The only requirement is that you present a POSIX file system to GitLab.

Perfect. And I think you also answered the evolution question when you spoke to Gitaly as the roadmap. Okay, that's all that's in the queue for now.

All right, I'll go back and share my screen and we'll pop back into this. When we talk about scaling on-prem, one of the things we talk about is a good HA front end. We use HAProxy as the HA front end and for internal load balancing and scaling, and I can't stress enough what adding a good load balancer in front will let you do in terms of scaling your web traffic — being able to separate out API traffic, separate out Git traffic, and really break the components up and scale them. For those of you who don't know, GitLab.com right now is not anything different from what you could buy and scale on-prem. GitLab.com is the GitLab EE version; we take it from the Debian repo the same as anybody else could, and we use Chef to multiply it and spread it around, and then we apply our configurations so that only specific nodes run specific jobs, tasks, and functions, with traffic routed to them through a front-end load balancer.

Next is Sidekiq queue splitting. Sidekiq, like we said, is the worker process that takes things out of the queue, processes them, and brings them back in — so that's all of your Git operations, your notes, merge requests, CI jobs; those are all spun up and run through Sidekiq queues. The ability to take the Sidekiq queues and split them out is one of the things people don't really realize they can do, but it adds immense flexibility and scale. This is really handy if — even in a single-node instance — things are taking a long time to process because you've got a lot of people making merge requests, looking at diffs, creating issues, maybe polling things with a bot, or submitting and generating runs. What splitting your Sidekiq queues out does is
give you more processing power for the tasks being put in the queue — and we'll touch on how in a moment.

Next, NFS sharding and NFS appliances. I talked about this already: today inside GitLab.com we shard over NFS — we currently have 24 NFS shards, we generate four new shards at a time, and we generate new shards any time a file system is approaching a little over 60% full. When file systems get to 60% full, we generate new NFS nodes, limit new repo and project creation to the new NFS nodes, and leave the old nodes just for growth of the existing repositories. Inside GitLab.com today, that's a single point of failure, so we have internal scripts that go around and snapshot all the disk back ends of those nodes; we keep snapshots of the disk state inside our cloud provider so we have rollback points. And then NFS appliances: if you're an on-prem shop doing NFS — somebody mentioned SMB v3; that protocol has come a long way since people reverse-engineered it and it was incorporated into the kernel — those are all great options. I say appliance because a lot of times it's easier to get an appliance that's pre-tuned for high IOPS and availability than to spend the engineering time to do that yourself. And then multiple database nodes, like we talked about at the front: when you have multiple database nodes, we offload a lot of the database processing, so the reads that can be intensive go to follower nodes and just the write functions or time-sensitive functions go to the primary node.

I mentioned Sidekiq queue splitting — this is a function we rolled out, I want to say, in a later version of GitLab 8, and for sure in GitLab 9. The queue list can be found in sidekiq_queues.yml; if you don't want to go digging through the code, that's the actual master config right there. It's a list of all the Sidekiq queues: every time we have a specific function to execute inside GitLab, it has a named queue, and we do that so we can keep track of how we process the work and handle it through our systems. What we internally do at GitLab is split all of these out. To process the workload for GitLab.com today, we have roughly 35 Sidekiq nodes doing various Sidekiq functions. This is enabled through sidekiq-cluster in your deployment — an option inside your gitlab.rb that you can turn on — and when you turn it on, it takes queue groups, and a queue group is just a bracketed set of the queues you want that node to run. So if you're looking for specific, discrete processing — say merge requests take a long time, people are doing a lot of merge requests and you need a way to clear this out quicker — you can spin up additional Sidekiq nodes. You start with an Omnibus install, you tell it in gitlab.rb that it doesn't need to run anything else except sidekiq-cluster — so you disable all the other processes and enable sidekiq-cluster — and inside the sidekiq-cluster queue groups (that's a mouthful) you specify the queues that pertain to the merge request functionality. That node will connect to the Redis cluster and begin polling, just like any other Sidekiq process.
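A minimal sketch of what a dedicated Sidekiq node like that might look like in gitlab.rb — the queue names here are illustrative, so take the real ones from sidekiq_queues.yml, and check the sidekiq-cluster documentation for your version, since the exact settings have moved around between releases:

```ruby
# Dedicated Sidekiq node: turn off everything this node doesn't need...
postgresql['enable'] = false
redis['enable'] = false
nginx['enable'] = false
unicorn['enable'] = false
sidekiq['enable'] = false          # the single default Sidekiq process

# ...and run sidekiq-cluster instead, with one process per queue group.
sidekiq_cluster['enable'] = true
sidekiq_cluster['queue_groups'] = [
  'update_merge_requests',         # illustrative queue names -- confirm
  'post_receive',                  # against sidekiq_queues.yml
  'pipeline_processing'
]

# This node still points at the shared database and Redis through the usual
# gitlab_rails['db_*'] and gitlab_rails['redis_*'] settings.
```

Each extra node like this just adds more workers pulling from the queues you name, which is the horizontal scaling John describes next.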
That lets you flush the queue with much greater responsiveness than having one Sidekiq node running all the queue groups. Within GitLab we have several different node groups that we specify. We have one called "ASAP," which runs all the queues that are "do this as soon as possible" — things like: when I add a note, that needs to be processed immediately; when I add an issue comment, that needs to be processed immediately; some Git functions need to be processed immediately. We have a best-effort group, which is more "when this can happen, make it happen." And because helping people manage their projects — how they've gotten to GitLab.com or replicate from it — is important to us, we have queue nodes that do nothing but process imports, and queue nodes that do nothing but process Git replications. Those are all things you can branch and split out, so if you're looking at horizontal scaling, this is a great way to get into it and start adding some real gas to your GitLab instance.

One of the things the TAMs got asked about was NFS requirements — what do you look at, and how do you know what's speedy and what's not? There are a few key, tried-and-true factors the industry has looked at for a long time. One is IOPS — input/output operations per second — and the other is I/O wait. So: how many operations per second can your underlying file system do, and are you experiencing I/O wait when those operations happen — do you incur a write delay, are you seeing write queue waits, things like that? And then, for the future state of our NFS nodes — or storage nodes, as we're moving to call them — we want to look at CPU load. That's because of Gitaly: when Gitaly comes into play — it's already in use on GitLab.com, and it's a beta feature you can choose to use on your Omnibus install today, because we do ship it with GitLab Omnibus — the actual Git processes are run and transacted on the storage nodes. Previously, storage nodes only had the requirement of serving NFS over the network; when we move to Gitaly they also have to do Git work — git pack, git repack, any of the Git maintenance — and those are processor-intensive things you need to account for if you're building new architecture and want to use Gitaly.

All right, John, if I can take you back — a question popped in, and I think it's related to the queue splitting. The question is: is this part of the Premium edition?

No. So queue splitting — I would have to look, but I'm positive that functionality exists in GitLab all the way down to CE. It's a little wrapper we wrote for Sidekiq called sidekiq-cluster, and I'm almost positive anybody can configure it in their gitlab.rb; there's no license key required — it's just base functionality.

Excellent. Just to add to that, assuming we've interpreted the question correctly: is it as simple as adding the YAML file, the sidekiq_queues.yml?
Well, let me go back and explain that. That YAML file is there by default — we ship it as part of the instruction set in the basic GitLab configuration. I reference it to tell you that, if you're looking for what the queue names are — which queue names are in play — those are all listed in that YAML file. Actually, let me pull up a great example of the file out there in the deck. If you look at this — let me move this over — this is just our YAML file, and it enumerates all the different queues we have. You'll see that when we consume this we attach weights to the queues, so the application knows which queues are more important and what priority to attach to them. But this is all the queues the actual GitLab application consumes — expiring builds, authorized projects, everything from mail handling; there's a queue for emails-on-push. So if you need to prioritize one of these, it's easy to go in here, look at what the queue names are, and then add them to your gitlab.rb.

Gotcha, excellent. There's a follow-up question, and that is: how are dead jobs handled inside the queue — dead jobs in terms of when a job isn't completed and has just expired? That's all I've got to go on, but I think that's a fair interpretation.

Yes. So we've implemented a layer on that. Redis is just the back end for it, and we use Sidekiq. Sidekiq has job completion tracking, so when a node pulls a job out, that job has a job ID, and that job ID is held until the node comes back and says "I finished that job." There's a timeout period on that Sidekiq process, so if a job ID has been pulled out and the node isn't returning it, we'll run a timeout — I think the timeout is high, something like 15 minutes — and then that job will be up for grabs for somebody else. So there is locking on job IDs.

Is there a concept of an orphaned job that is persistent — durable across reboots — or anything that gets orphaned in the overall stack that could linger and you can't get rid of?

No, there shouldn't be, and that's why I said the Sidekiq nodes are stateless in that sense. Sometimes — software is software and sometimes it has a mind of its own — we have experienced cases where, for whatever reason, a job ID gets stuck in a Sidekiq queue. Those are cases where it's safe to just go ahead and flush the queue: you can restart Sidekiq on that node to flush out the hung job and it'll move right along.

Beautiful. All right, carry on — that's all we've got in the queue right now.

So, front-end load balancing. People ask us what we use for load balancing: lots and lots of HAProxy. We use HAProxy on the front end of GitLab.com; we use it for high availability, for some rate limiting on our API, and for role separation. If you're interested in looking at what we do — and I'll touch on this later — all of how we manage GitLab.com is done via Chef, and that's all public for you to use. If you go to gitlab.com/gitlab-cookbooks, those are all the cookbooks we use to manage our internal infrastructure and GitLab.com and how we roll things out. This particular one, gitlab-haproxy, is the cookbook we use to manage the front-end HAProxy, and if you look in the default templates you'll
see a lot of the magic we use to split things out. One of the things we do is have HAProxy configuration that looks at the URL you've tried to access and splits traffic based on some base URL patterns. If the URL deals with a series of Git transactions, those are handed off to a back end that is just Git nodes — all they do is process Git transactions, whether they come in via SSH or via the HTTPS front end. That's how we split out web UI functionality — clicking around in the interface — versus "I'm making an HTTPS Git transaction"; we split that so the front-end web UI nodes don't handle it, and that's all done with HAProxy. We also split our API traffic off, so if you're accessing URLs that are API-centric, those go to a different cluster of back ends that handle just API traffic, and on those we apply rate limiting that limits how many concurrent sessions you can have per second against our API. That prevents us from being overrun and overloading our API, and it lets us keep a pulse on the abusers of it, as well as on how we need to scale it if we're getting hot. And then, like I said, role separation: this is where we use that HAProxy magic to separate things out. We have Git nodes that we direct the SSH traffic and the HTTPS Git workload to; we have front-end web nodes that handle the UI, and UI only; we have API nodes; and — this is on my plate this week — we're going to be spinning up API nodes solely for CI traffic. What you can separate out there is pretty flexible; we love HAProxy for that reason. There's a lot of little magic happening in that cookbook if you're inclined to look at it.

Keeping track of it all — I wanted to touch on this because I think it's important. We use Prometheus monitoring and alert management to monitor all of our systems, and we ship Prometheus inside GitLab Omnibus. The default configuration for that is evolving: right now the internal Prometheus lets you look at and monitor the internal guts of the GitLab application. We're shipping more and more templates — right now, out of the box with GitLab monitoring, you should be able to look at your Sidekiq queue structures and their queue depth, and things around HTTP response times inside the application, so how long it takes the application to respond to a request. Those are all things we ship internally and continue to push out. Now, at GitLab.com we take that Prometheus portion, set it aside, and scale it bigger: we have three nodes for monitoring all of our infrastructure and two additional nodes for monitoring application statistics, because we're beginning to instrument the applications with telemetry so they can actually give us data about what's happening inside the application. We feed all of this into Prometheus's Alertmanager, and Alertmanager helps us roll things up — what the threshold for an alert is, whether alerts can be aggregated, "if I've seen this alert too many times that's a problem, if I haven't seen enough data on this alert that's a problem." All of that is flexible and configurable, and we use Alertmanager for it.
For us, Alertmanager takes two different prongs. We have a lot of Slack integrations, so in Slack channels Alertmanager pings somebody and lets them know there's a potential problem or that something needs attention; and then before things reach a criticality state, we pull out PagerDuty and page an engineer to take care of the problem. If you haven't looked at the Prometheus that's included in GitLab's monitoring, I'd advise you to do so — the Prometheus methodology for monitoring is really slick and there's a lot you can do there. And if you've ever wandered over to monitor.gitlab.com, which is a public Grafana and Prometheus instance that we're working on scaling up more, you can see the public-facing side — the same as our internal side — of all the monitoring we do for the application: queue timings for GitLab.com, a lot of the dashboards we've crafted to look at things, those are all there. I forgot to include it on the slide, but we also have a repo of all the Grafana dashboards we've built to look at GitLab.com, so those are public as well.

The second thing I think is highly important is managing configuration. We use configuration as code, and by that we mean that no node running any part of our system is hand-tooled — we don't manually type things into gitlab.rb and put them out there. We use Chef, and that's the cookbooks I was alluding to earlier. Every portion of our infrastructure — how we care for and feed it, how we set sysctl options, everything from our LVM configuration to disk mounts — is managed through Chef, and it's all in our GitLab cookbooks. We believe that should be the sole source of truth, and we use our own internal GitLab product and Git revisioning for how we apply things to GitLab.

Keeping it running: one of the questions the TAMs got asked was when we patch systems. At GitLab, every Monday we do systems patching, and if the patching calls for reboots, those are scheduled so they happen staggered and aren't customer-facing. For a lot of the web front-end nodes we'll cycle through HAProxy, pull a node out, reboot it, put it back in, and move through them in an automated fashion; for other things, we'll pull database nodes out, reboot them, and put them back in, and that's all made easier using Consul with dynamic adding and host discovery. And as you know, we deploy release candidates to GitLab.com when they're available, and the final product release happens on the 22nd of every month.

All right — questions?
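As an aside on the bundled monitoring John described a moment ago: on a single-box Omnibus install the Prometheus stack is on by default, and the gitlab.rb knobs look roughly like this — treat the exact setting names as assumptions to verify against the monitoring docs for your version:

```ruby
# Master switch for the bundled Prometheus, exporters and related services.
prometheus_monitoring['enable'] = true

# Let an external Grafana or a central Prometheus reach this instance
# (the default is to listen on localhost only).
prometheus['listen_address'] = '0.0.0.0:9090'

# Per-exporter toggles for host, Redis and Postgres metrics.
node_exporter['enable'] = true
redis_exporter['enable'] = true
postgres_exporter['enable'] = true
```

These are the kinds of toggles that control what the built-in monitoring collects before you ever stand up a separate monitoring stack.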
Yeah, we've got a couple. Let's start with the first one that came in — we let this one queue up a little. It's to do with logging: a lot of troubleshooting requires going through logs to find the cause of errors. We only run a four-node HA cluster and are using syslog to centralize the logs — do you do something different for GitLab.com?

That's been an iteration for us. Up until fairly recently we used syslog and streamed it all back to an ELK cluster, so we were running Elasticsearch, Logstash, and Kibana, and that's what we did internally. We recently modified that as part of moving to Google — which hopefully you're all aware of — so in our current structure, because we're leveraging GCP and some of the things there, we're pushing it all into Google Stackdriver right now. But prior to that, and it worked very well for us, we used the ELK stack: syslog streaming everything to an ELK cluster, and we'd look at our logs that way.

Excellent. All right, next question: have you tried using the Azure load balancer instead of HAProxy, and do you have any customer feedback on that?

Yeah, we're currently in Azure. The Azure load balancer is a very basic load balancer — the only thing it can tell you is whether the TCP port is available downstream and whether it should route traffic to it. If you look at what we're doing inside the cookbooks we have for HAProxy, we're doing regex matching on the URL path structure, we're doing rate-limit checks on how many times we've seen a specific IP address accessing a URL — those are things we can't get out of the Azure load balancer today. In addition, we used to front our database nodes with an Azure load balancer, and unfortunately we found that when we removed the load balancer, we achieved an entire factor of improvement in access times to the database nodes behind it. So we haven't had good success with it. When we move to Google, some of that will change — the load balancers there are better, so some of the functionality will be pushed out into a GCP load balancer — but some of the other functionality, the advanced route splitting and regex matching, just isn't there yet in the GCP layer-7 load balancer, so we'll keep some of that in-house. But that's a great question.

All right, that's all we've got in the queue at the moment.

All right, let me move back to this presentation — I think I've just got one or two more screens and then we'll go back into other questions. There we go. One of the things that gets asked of us is how we deploy GitLab. If you work like us — you use Chef and you want to do multi-node deployments — or if you're just looking for pointers on how we automate this, we have an open source tool we wrote called takeoff. It's free for you to use, take, and modify — do whatever you want with it — but it automates some basic things for us inside a little wrapper. Some of the basics: we warm up the deployment by pre-downloading the package to all the nodes, and then we go through a process where we stop services if we need to — so if we're doing a change where we need to stop Sidekiq processing before we execute it, or stop, for example, mailroom from sending out mail notifications while we're processing the change, we can do that with flags in takeoff. It's just a series of automated steps, but it's the way we deploy code.
In the deploy chain, we first deploy to the web nodes and perform the DB migrations. When we do this we're using one of the features we ship with Omnibus, which is skipping post-deployment migrations: that lets you deploy the code and do just the database migrations needed to add new tables and columns and modify the database so the new code can run on the web nodes. Then we deploy to Sidekiq, we deploy to the Gitaly nodes, we restart services if they need to be restarted, and after all that's done we do the post-migration process — we run the deploy again without skipping post-deployment migrations. Those are the database operations we held back: if we were going to mass-modify something, add a default to a lot of rows, things like that, we let that happen after we've upgraded everything so it can process in the background. Some of these post-deployment migrations can take 20 or 30 minutes; as of GitLab 10.5 we put a lot of them into batch scheduling, so if they're going to take a long time we just schedule them as batches. But that's the order in which we deploy GitLab. If you're a consumer of GitLab.com, sometimes this works well and sometimes we hit a snag — we're trying very hard to reduce the times we hit a snag. And that was all the pre-canned slides I had, so I'd like to open it up now. If you have freeform questions, let's finish out our remaining time with some fireside chat.

All right, yeah, we did have one pop in — one last one, which I think is a good question: will takeoff be integrated into GitLab CI/CD projects?

Takeoff is really just used to deploy GitLab.com. For CI/CD projects we're trying to go a different route, and that's through Kubernetes integration and our product called Auto DevOps. Auto DevOps, if you're tracking what we're doing there, looks at the code you have in the repository, uses some characteristics to understand what kind of application you've actually got — is it a Ruby on Rails project, is it a Node.js project, things like that — and then helps you spin up what you need, if you'd like, to facilitate that. That's where the CI/CD route is going: Auto DevOps, with heuristics based on the code you've committed, covering everything from letting us make the choices for you to saying "here's the application, I've noticed this, and I'd like some suggestions to get started." Takeoff really is just an app we wrote to help us deploy GitLab.com faster, to reduce downtime and bring more stability to the deployment process so that users of GitLab.com are affected less.

All right, thank you. That's all we've got in the queue, so back to your suggestion, I believe.

All right, so if anybody has questions about horizontal or vertical scaling, please feel free to ask them. I'm just going to walk through some things the TAMs have submitted as questions for the remainder of the time — so like I said, if you have questions, feel free to ask. One of the questions I'll pull off this list: which AWS services can be used for HA rather than maintaining traditional VM roles or nodes — are there alternatives to running things like your own Postgres? Yes — one of the things you can use is Amazon's database offering if you don't want to run your own database. We have some constraints around that in terms of
what versions are supported, because we do some version-specific things, but if you look in the Omnibus release notes we'll always tell you the minimum version you should be using, and when that aligns with the offering inside Azure, AWS, or Google, you can use it.

Another question: is networking fully isolated between nodes on GitLab.com, or do nodes share a VPC subnet freely with just an external firewall? On GitLab.com today we actually do have isolation, and we run firewalls between things: the web nodes can't talk back into other nodes except the database, and the Git nodes can only talk to the file system and to the API they pull their Git functions from. All of that is segmented into separate subnets, with firewalls that have specific holes punched through. The follow-on question was whether the product changes enough to make strict port isolation hard to manage, and the answer is no. The only thing to be mindful of is that if you're using NFS, you need to configure NFS to use specific static ports instead of dynamic ports. Other than that, the ports we use for the application don't change, and we try not to introduce changes like that without letting you know in advance.

One of the questions from the chat: Martin Remeer has asked whether we've tried using other market HA appliances like F5 or others. We have not. We have a strong belief and core structure that we want to use and support open source products, so we use HAProxy as an open source product, and where we see deficiencies we try to contribute back to the source.

Anonymous asks, "is there a link for it?" — I'm not sure what you're asking for a link for. I think somebody asked if this is being recorded: it is being recorded, and I believe we'll be providing a link afterwards. That's correct, yeah.

I'm going to give this one a try: Sigis asks, "I know you mentioned doing snapshots for the data nodes — do you also currently do that for the database nodes? Do the storage and database backups need to be done at the same time?" We do something different for the database nodes. We use an open source package called WAL-E that does streaming: inside Postgres, the database writes what are called WAL segments of the continuous data it's processing; WAL-E takes those WAL segments as the database writes them and streams them to a destination of your choosing. We use WAL-E to do real-time, per-minute log shipping of Postgres to an S3 storage bucket, so we can do incremental restores of the database down to the minute. We do a nightly full WAL-E backup, and then as the database operates we do real-time WAL-E streams. If you want to look at how that's configured, inside the GitLab cookbooks there's a WAL-E cookbook that shows how we install WAL-E on our database nodes, how we use PGP to encrypt everything, and how we ship it off to Amazon S3. The second part of your question — do they need to be done at the same time — no, they don't. We try to keep them close together: because we're doing per-minute incrementals on the database we have a high level of granularity, but we keep them close because you are storing some cached data about hash references inside the database.

And then — sorry — Thomas asks: have you tried Azure Redis Cache or Azure
Database for PostgreSQL — do you have feedback? We have made tries at those. They work okay for small-to-medium installs. The tuning we do for the level of vacuuming and row cleanup, and how our database replicates — that's not really something that fits within the Azure hosted Postgres model or their Redis cache at the size we're operating at. But if you're a much smaller size, those certainly are options you can use, and they do have some great functionality.

I don't see any other questions in the queue, and we've got two minutes, so feel free to ask something if you like. Or do we have more minutes? I'm not sure — I'll just plunk through this until somebody cuts me off. We have 15 minutes. Oh, okay — 15 minutes. So if you have any questions you want to ask, be they esoteric, crazy, why or why not, whatever you want to know, let's go through those. In the absence of that, I'll go to the next question submitted by a TAM, which was: how does GitLab use GitLab HA and Geo? We've talked about the HA perspective — how we're doing high-availability Postgres and high-availability Redis, with Redis Sentinel for Redis and repmgr and PgBouncer for Postgres. Geo is kind of a new thing for us. We're in the process of moving from Azure to Google as a back-end cloud provider, and GitLab is leveraging GitLab Geo right now to do that migration for us. All of our issues, repositories, and upload storage — we're using the Geo product to keep Google in sync with the Azure deployment of GitLab. Right now there is a fully stood-up version of GitLab.com in Google, it's in near-real-time replication, and we're in the final stages of working on the cutover. We're practicing right now; in fact, tomorrow morning we'll have another practice where we cut over an environment — we're taking our staging environment of GitLab.com and practicing a cutover of that data to the Google staging environment, all done using the Geo product, the Geo processes, and the Geo scripts. We're continuing to refine that process and finish developing some of the finer details of the product, and that's what we'll use toward the end of this month to migrate GitLab.com from Microsoft Azure to Google.

Another TAM question: what benefit does using HA roles provide? The HA roles, as we describe them in our HA documentation, are about the degree of segmentation you're using inside the application. This goes back to: do you just need multiple web front ends to handle the number of concurrent users, or to bring high availability so you can take down a web node; do you need more than one database node; and the HA roles there are database, Redis, and Sidekiq processing, like we talked about.
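To make the roles idea concrete, here's a hypothetical gitlab.rb fragment — the role names below are the kind the Omnibus HA documentation defines, so check them against the docs for your version rather than taking them verbatim:

```ruby
# A role collapses the per-service enable/disable toggles into one statement,
# so a node only runs the function you give it.  On a web front-end node:
roles ['application_role']

# Other nodes each get their own single role, for example:
#   roles ['postgres_role']       # dedicated database node
#   roles ['redis_master_role']   # dedicated Redis node
```

The benefit is mostly operational: instead of flipping a dozen per-service enable flags on every node, you state the node's job once and Omnibus wires up the matching services.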
Martin Ramirez says he looks forward to a copy of the slides — yes, these slides are publicly available and we'll be putting them out there afterwards.

Another TAM question: are there ways to isolate a core group of users or use cases against abuse or high load from other areas of the business — how do we keep mission-critical use cases going, or prioritize them? I'm not sure if the TAM who submitted this is on the call; I'd love to have it fleshed out more — by users, do you mean specific GitLab users inside the application? I'm not sure.

Sorry, John, I don't have any help for you on this one — I'm not sure who submitted it — but it does point to some of the throttling capability I've seen at the instance level related to, I guess you could say, malicious automation, crawlers, and so on. I don't know that that's what the question was geared toward, but you may be able to speak to some of it in that way: keeping that sort of traffic to a minimum.

Sure. Today we use HAProxy on GitLab.com to limit concurrent access to the API; the rest of the application we don't have that level of throttling in front of externally. Internally, inside the GitLab application — and this is true for all of you running GitLab — we ship an open source project called Rack Attack, and we've got Rack Attack configured to help mitigate the abuse scenarios you might see. When people make submissions rapidly, Rack Attack will throttle them; when there are rapid click patterns — it looks like somebody has automated rapid clicking to add a comment or a note or something — we'll throttle that. Those are some of the built-in protections. If you look in the documentation for Rack Attack, you'll see the configurations we have for it, and we're trying to expose more of those configurations through the GitLab admin interface. Currently you can tune at what level you want to throttle somebody for rapidly consuming your GitLab instance, and more of those settings will bubble up as we expose them in the application.

We have another one that came into the queue: do you have a user-level document for setting up an HA GitLab instance on-prem? Yeah, we actually do. In the post notes for this we can include a link, but if you just Google "GitLab HA," the second link you'll find is our administrator documentation on how to configure GitLab HA and what it looks like, with steps all the way from configuring the database to configuring Redis, adding additional nodes, and load balancers. That's all on docs.gitlab.com, so if you're looking at it from an administrator's standpoint — everything from a hybrid configuration all the way up to the fully distributed configuration that GitLab.com runs — that's documented there, with how to configure each of those steps.

That's all the questions I had from the TAMs, and I see no other questions in the queue, so I'm going to hand this back to Andy. If more questions come in, I'll be happy to answer them, but if not — thank you for attending this fireside chat, it's been a pleasure talking with you, and thanks for using GitLab.

Thanks, John — that was great information. Out of this we'll definitely send a link to the recorded version of the session, so you'll be able to go back, parse through it, and pick out the things that are of interest to you. If you have any additional questions, you may send those to — wow, where should we send them? Let me consult my oracle momentarily. Okay, yes, of course: very intuitively, you should send any further questions to your TAM, assuming you know who that is — I'm sure you all do. So if there's follow-up, send it to your TAM, and expect an email from your TAM regarding the recorded
session. Thanks for joining, folks — appreciate it. John, thank you; expertise goes a long way, so it's useful to get that perspective. All right, have a great afternoon, everyone, and we'll see you next time around. Thanks.