Well, thank you for coming in at this late hour; I know everyone must be a little bit hungry. My name is Shlomi Noach, and this is "Automating schema migrations with GitHub Actions, skeema, and gh-ost." I'm with the GitHub database infrastructure team, and I author a bunch of open source projects, among them orchestrator and gh-ost. I will talk a little bit about gh-ost today.

I work with GitHub; I assume I don't need to introduce the company here. In the past year we shipped a lot of new products and services: Actions, Packages, security advisories, code navigation, search, notifications, Sponsors, mobile, and more. Seeing that our MySQL backend stores all of our company's metadata, all these new developments imply schema migrations. New feature, new table. Added column. Oh, we need a new query, so we need a new index. Maybe we drop a few tables. Maybe we iterate: throughout the process of maturing a product, we create and then drop tables and change the design all the time.

I asked earlier, by show of hands, how many people run a schema migration per month, per week, per day. There were a few people who run a schema migration per day. On average, we run two per day. Of course, not every day is the same; some days we have five or six schema migrations.

What is a schema migration? On the face of it, it looks like just a CREATE TABLE, just another table, just a DROP TABLE. But I think it's a lot more than that, and I'll ask by quick show of hands: do you find that you have a lot of manual labor before, during, and after a migration? Does this happen to you? Not so many. Okay. So I'd like to break down how I perceive schema migrations done in a well-controlled, audited, safe, reproducible, debuggable fashion, which is the way that we do it.

The way I see it, we have developers who design their features and create the statement they need:
"I need this table," or "I need these new columns." Hopefully they will ask their peers to review before going to production; I don't just decide something myself and push it. That can happen in smaller setups, but hopefully we go through some review process. Hopefully, if you have a database team and DBAs, they will review it as well, because there are guidelines they will be concerned with, like we do or do not support foreign keys; oh, this integer is signed, it should be unsigned; the length of a column; this is a duplicate index; stuff like that.

From there we need to formalize and make sure the statement we want to generate is correct; I'll talk a little bit about that later. Then we decide where to run it. Smaller setups have a single MySQL cluster; larger setups have multiple, ten, twenty. Wait, this ALTER TABLE, on which cluster does that table live? ALTER TABLE users; wait, we have a users table in this cluster, and in this one, and in this one. Who figures out exactly which one it is? Who makes sure this is indeed the table we're talking about?

We need to schedule the migration. If we run multiple migrations a day, then possibly one of our clusters is already busy right now running a long migration, and I'm not going to kick off another concurrent migration. It's possible for us, but it really slows things down, so we wait until the first migration completes before we start the next one. So there's scheduling here.

We run the migration. We audit it, make sure everything is fine and production is happy. We probably need to report to developers: hey, what's the status? We sometimes have a multi-day migration because the table is crazy big; we have a couple of tables that take 20 days to migrate.
That's a bit of an extreme, but we do have a substantial number of tables that take hours to a couple of days to migrate, and we need to be able to tell people: hey, it's still working, ETA 12 hours, whatever. At the end we may need to clean up and drop the old table, and let people know, notify them: hey, that's complete; do you want to test it? Do you want to verify? Come back to me and tell me everything is okay. And then finally the developers do their deployments, merges, whatever it is they need to do. Does that make sense?

Would everyone agree that there's a little more manual toil here? Maybe I wasn't very clear previously about what the manual labor around a migration is. So yes, raise your hand now. Yes, okay, thank you.

Internally at GitHub, we actually happened for a long time to own even some parts of the developers' roles, because we owned so much of the process that we took ownership of those extra steps. So for us this was a real problem, because if we need to run five or six migrations on some days, we become a human scheduler, multitasker, context switcher: okay, I need to begin this one while that one is running; I'm going to review this; oh, this one is complete, let me kick off the other one. This put a huge amount of toil on us: best case scenario, hours of manual labor per week; worst case scenario, hours per day. And we sought automation to solve that.

So now that we've explained the problem space, I'll illustrate the solution we came up with. Before showing the solution, I'd like to tell you that the problem is complex. It's complex not because it's rocket science, but because there are different environments and different ownerships here. The code change comes from a developer, but I'm the DBA running it, and maybe there's a separate team monitoring, or maybe it's all the same team for you. But there's a development environment and then a production environment. How do you bring this from here to there?
Searching for a single complete solution that knows all the environments, meaning it has credentials to all the environments and can control everything, is pretty risky. So we came up with a design that is a combinatory solution: a few loosely coupled applications, each responsible for its own zone or realm, orchestrated together to give you a full solution. This is what I'd like to describe.

The first thing to note is the code. I asked earlier who here doesn't even review their code; a few people raised their hands. In my opinion, a schema change needs to be part of the code and needs to be reviewed as if it were code. That's my opinion. Specifically at GitHub, it's a necessity. For example, we don't only have github.com; we also ship GitHub Enterprise, an on-prem solution that sits at a customer's site. While we do continuous deployments to github.com, we do periodic upgrades, every few months or whatever the time frame is, on the customer's hosts, and we need to be able to reproduce on the customer's machines the schema changes and code updates that we made on github.com. Which means we have to have the log, we have to have the version: we need to know exactly which migrations we ran, and we need to decide up to where we're going to upgrade. And that should be coupled with the code, because if I pair the wrong schema version with the wrong code, everything will explode.

So in my opinion, schema changes should be treated as code, and internally at GitHub we always couple the schema changes, the schema design, within the same repo as the code that uses them. Always together.

To that effect, we're beginning to figure out how the solution starts, because if we're going to treat schema changes as code, then we at GitHub happen to have a solution for that. We happen to know a Git hosting service that provides versioning, source control, reviews, etc.
And that's the very platform we develop. So for us, a schema change is coupled with what we feel is the heart of the GitHub flow, which is the pull request. To make a schema change, a developer checks out a branch, modifies the schema, commits, pushes, and creates a pull request. The pull request is the place where the developer seeks peer review or DBA review, where we discuss the schema change, say this is good, this is not good, let's move on, where CI runs, and where communication happens.

For us it looks like this. In this sample PR, I create a pull request and describe what it's for. I made a few larger slides, since I knew the slides here would not be well visible to the people in the back. The way we design and version our schemas is that on our file system we take the declarative approach: we keep the CREATE TABLE statements for our schemas. When I make a schema change, I create a PR, and my commit reflects the new schema design. This one is smaller but taller: I added a column and I dropped an index, and this is my new schema design. I don't explain how to get from the old design to the new one; I declare what I want to have with this code version. That's the declarative approach, as opposed to the programmatic approach, where you version the ALTER statement itself.

So if we use this declarative approach, how do I know what the migration statement is? What migration do I need to run? This is the before, this is the after; git diff is not a good solution here, it's very bad at diffing SQL.

To that effect we use skeema. skeema is a fantastic open-source tool by a friend from the community, Evan Elias, who's not here today; he's based in the US. It's open source. I'm going to do it an injustice, because I'm only going to describe it in a couple of minutes: it's a tool for automated schema migration control.
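To make the declarative idea concrete, here is a minimal sketch; the table, column, and index names are hypothetical, not taken from the talk. The repo stores only the desired CREATE TABLE, and a diff tool derives the ALTER:

```sql
-- Desired state, as committed in the PR branch (declarative):
CREATE TABLE people (
  id    BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  name  VARCHAR(128) NOT NULL,
  email VARCHAR(255) NOT NULL,  -- newly added column
  PRIMARY KEY (id)
  -- the old KEY name_idx (name) is simply absent from the file
);

-- The migration statement a diff tool would derive by comparing
-- the master version of the file with the PR version:
ALTER TABLE people
  ADD COLUMN email VARCHAR(255) NOT NULL,
  DROP INDEX name_idx;
```

The point is that nobody hand-writes the ALTER; the file describes the end state, and the statement is computed from the difference.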
We use a subset of the capabilities of this tool, and I'll show you how we use skeema. The thing with skeema is that for people who use Git or any general-purpose version control, it will feel very familiar. The idea with skeema is that in my repo, in my code, I will have some root path; then for each database I have a subdirectory, and for each table I have a definition file, the SQL that defines that table. Okay, so far so good.

There are also skeema configuration files. They tell skeema, given an environment, where to find the database server for that environment. So skeema gets access to database servers: if you want to later give it access to production, give it access to production; if it's QA, testing, staging, whatever you choose, you tell skeema what the environment is and what the credentials and location are. The table definition files are just normal CREATE TABLE statements, nothing interesting about that.

So skeema sees a file system with a database and table layout on one hand, and on the other hand it has the information to connect to your databases, and you can do some interesting things with it. You can say `skeema push`: push will connect to your database and apply whatever schema is defined in your file system onto the database. It will create tables as needed, drop tables as needed, and alter tables as needed, and the nice thing is that skeema is very good at analyzing the correct diff and generating correct, well-formed ALTER statements. You can also ask skeema to pull: just overwrite my file system with whatever is on the database. And finally, which is roughly what we use, you can say `skeema diff`: don't do anything, just output the ALTER statements that would take my database from its current state to the one in the file system. Does that make sense so far? Cool. Where do we run skeema?
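The layout and commands just described look roughly like this; the paths, database name, and host are illustrative assumptions, not GitHub's actual setup:

```text
schemas/                    # root path, checked into the repo
└── mydb/                   # one subdirectory per database
    ├── .skeema             # config file: environments, host, credentials
    ├── people.sql          # one CREATE TABLE statement per file
    └── repositories.sql

# .skeema might contain, per environment:
#   [production]
#   host=db.example.internal
#   port=3306
#   schema=mydb

$ skeema push production    # apply the file-system schema to the server
$ skeema pull production    # overwrite the files from the live schema
$ skeema diff production    # only print the ALTERs that would be run
```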
Last year we released GitHub Actions. GitHub Actions is a place for you to run code, almost arbitrary code, in response to events on your repos. An event could be the creation of an issue, someone closing an issue, someone pushing or creating a pull request, requesting a review, reviewing, adding a label, whatever; something happens on your repo. The code you run runs in a container, and the container runs on GitHub's own infrastructure. You don't need to provide that container; you don't need to do anything. You just create an action YAML file, and in that YAML file you put whatever commands you want to run: shell scripts, etc. GitHub will run that code on GitHub's infrastructure on your behalf, given some event.

Last but not least, while that container is isolated from the world, it has access to your repo: it implicitly gets a GitHub API token to interact with your repo. So GitHub Actions is commonly used for CI/CD; that's the classical use case, and that's how we run skeema. We run skeema to diff the changes upon a pull request.

By the way, GitHub Actions is free for public repos; for private repos there are some free plans with some free amounts of minutes, and from there on it's charged.

So it looks like this. This is a very simplified YAML file, the action definition file; it's of course incomplete, the actual file is much bigger. It says: per pull request, I'm going to do the next three things. One is that I'm going to check out master, or the base ref; I'm going to check out master. And I forgot to say that the container runs Linux.
It runs Ubuntu, along with a bunch of useful software, among which is a MySQL server. So next, I'm going to `skeema push`: I'm going to apply the schema definition as found in the master branch onto the container's MySQL database. Okay so far? And then finally I'm going to check out the PR's branch, or head, and `skeema diff`, essentially asking skeema: tell me what statements are needed to take the database from the master version to my PR's version, my branch's version.

And it looks like this: it runs, there's a bunch of steps, and eventually, the way we set it up, it ends up adding a pull request comment. Let me zoom in on the comment on your PR: this is the well-formed ALTER statement, this is what skeema has analyzed your PR to do. Does that make sense so far? Okay. All this story, thus far, requires one YAML file you put in your repo. Done deal.

The migration itself: CREATE TABLE and DROP TABLE are relatively simple. For ALTER TABLE, we use gh-ost. Raise your hand if you've heard of gh-ost. Cool. We've described it in previous years, so I'm not going to repeat that; I'm just going to say that for us gh-ost works very well. It does not impact production; we don't need to worry about gh-ost running. It's well controlled, well behaved, well throttled, and does not hurt our production servers. So we kind of don't care whether gh-ost is running or not; it's good and safe for us.

And so the final glue is skeefree. Skeefree, the name coined by Tom Krouper; names are hard. It's based on skeema, and hence "free," or whatever, and we go ski-free. This is the orchestrating service that connects the dots and makes the entire flow. We developed it, and are still developing it, at GitHub to solve our migration problems. Allow me to illustrate the entire flow for you, and you'll understand what skeefree does. Okay, so we begin with the developer who creates a pull request. We've seen this.
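The simplified action file walked through above (check out master, skeema push, check out the PR head, skeema diff) might be sketched like this. This is a reconstruction, not GitHub's actual file; the workflow name, the `local` environment, and the presence of skeema on the runner are assumptions, and installing skeema is omitted:

```yaml
# .github/workflows/skeema-diff.yml (illustrative sketch)
name: skeema-diff
on: pull_request            # run for every pull request event

jobs:
  skeema-diff:
    runs-on: ubuntu-latest  # Ubuntu runner; ships with a MySQL server
    steps:
      - name: Check out the base ref (master)
        uses: actions/checkout@v2
        with:
          ref: master
      - name: Apply the master schema to the runner's MySQL
        run: skeema push local
      - name: Check out the PR head
        uses: actions/checkout@v2
      - name: Diff the PR schema against the master schema
        run: skeema diff local   # prints the required ALTER statements
```

In the real setup the workflow also posts the resulting diff back as a PR comment and applies a label; those steps are omitted here.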
This is the change. We have that YAML action file; it runs, it generates the diff, and it adds a magic label indicating a skeema diff. Hopefully the developer asks their peers to review and approve, and someone reviews and approves.

Skeefree is a process that is aware of my repos. While it doesn't have access to my code, my Git repository content itself, it does have access to the GitHub API, and to my repo through the API. It looks for open pull requests that have schema diff statements, carry that label, and have been approved by a peer. That means they're ready to move on.

How does it move on? Skeefree says: well, the developers have approved; I'm now seeking review from the database infrastructure team, because they're the next step; that's the process. For some repos we have a DB schema reviewers team who are more application-aware, so one or two further reviewers may be needed.

Assuming the database infrastructure team approves, skeefree sees that; it keeps polling the pull request. What's the status? Oh, it's been labeled as a schema change, the peer approved, and the database infrastructure team approved. All right: I compute where I need to run this. This specific schema change is going to run on this and that cluster. The scheduler kicks in, and it knows whether a migration is already running there and whether I'm good to go. As it begins to run the migration, it adds a pull request comment, and as it completes the migration, it adds a pull request comment. The developer just knows, because all they need to do is listen to their notifications. We use an already established mechanism, GitHub notifications; we don't need to reinvent the wheel; we don't need to communicate with the developer separately, because the automation tells them. Finally, maybe the PR has multiple migrations; once everything is done, we notify the developer: everything is done, please go ahead.
Do your thing. Great, I'm happy to merge and do whatever I want to do, deploy, whatever. Makes sense so far? Cool.

So this is how it looks today. We started running this about three months ago. For those sitting in the back: the entire list of steps that used to be owned by the DBAs is now mostly taken by automation, and where developers should own a step, they own it. The DBA team now has only one role in the process, and that is to review and approve the PR. Once we review and approve the PR, automation takes it away; we don't need to think about it anymore.

The impact for us: we're down from hours per day, or hours per week, to minutes per week. That's the time we now spend on migrations. We care, because we review, but once we approve, it's taken away, and everyone knows. Developers have visibility; they also have chatops, so they can proactively check the status, but like I said, they get notifications on the PR page. Everything is communicated through the PR page.

Better time utilization: we don't need to wait for a database engineer to be up and awake at their computer to kick off the migration; the migration kicks off as soon as possible. So developers have been telling us how much shorter the wall-clock time is from the moment they publish a pull request until the moment it's done and in production.

Okay, so far so good. Skeefree was developed internally at GitHub and uses internal GitHub resources and services, like our service discovery mechanism, our inventory service, chatops integrations, and internal Go libraries that we have. So it's not really a general-purpose solution, and it's something that's very difficult to open source and just ship to everyone, because it relies so much on our internal infrastructure. And yet we hope the community can benefit from it.
So we're open sourcing it. What you'll get: within a couple of weeks, I hope, look out for a GitHub engineering blog post where we will share this. You'll get incomplete code; you'll get code that doesn't build, doesn't compile, but with some hints and suggestions for you. Okay, instead of using our service discovery, why don't you start with a simple configuration file and work it out from there? If the community finds this helpful, we're happy. So hopefully within a couple of weeks we'll release skeefree as open source. And I'm happy to take questions. Yes, sir.

The scale at which we deploy these migrations: we have over a hundred production servers in a dozen or more production clusters. We are not huge, but we are very busy. We do have a couple of clusters that are maybe a couple of terabytes worth of data set; others are smaller. But almost all of them are extremely busy, so an interruption to those clusters is very quickly reflected as a GitHub incident. So what's important for us is to keep everything under the radar. If it takes longer to run, that's fine; most important is to keep it under the radar. Does that answer your question? Cool. Yes, sir.

Right, yes. Skeefree itself is backed by its own MySQL database. Skeefree is stateless, but everything gets written there, so skeefree can easily check whether there's an entry. There's a liveness check that keeps updating, and if a migration dies for some reason, eventually that entry will be garbage collected. Yes, sir.

How do we handle migrations that update data? We call these transitions. These are not schema migrations; these are: okay, once we've created some column, now you want to populate that column based on some other query. This we call a transition. It involves us, but it's not a schema migration.
So the developers design the transition, the UPDATE, the INSERT, whatever it is. The one thing we do provide to them is a throttling mechanism, because a transition might involve updating 50 million rows, and we don't want to do that in one go. We have a throttling service called freno, f-r-e-n-o; it's actually listed in the first slide. It's a voluntary pushback service. You ask freno: hey, am I good to write to that cluster? And freno will say: yeah, it looks healthy to me; or: no, please refrain from writing. If it says okay, you write maybe 50 or 100 rows and ask again, and again, and again. So that's the service.

Do I have time for more questions? One last question, sir, in the back. Yes, good question: do we also migrate schemas with sharded tables? We do; we experiment with Vitess. We are in different clusters, at different stages of maturity. So yes, skeefree is shard-aware. One of the things that is internal to GitHub is our Vitess setup, so skeefree can talk to Vitess and verify whether a table is sharded, and if so, it knows that your single ALTER statement needs to run on three or four different clusters, because the table is in three or four shards. It will independently alter the table on each, and part of the Vitess integration is to update the VSchema, etc. But yeah, this is basically handled.

Okay, thank you very much.