Oh, it's very cozy, with a very cozy group of people. Cool. So today I'm going to talk about release best practices. These are kind of hard-earned lessons I've picked up over the years, from the number of times I had to wake up in the middle of the night to troubleshoot things. That's how you learn how not to do things. Hopefully it will help you in your future career, or actually right now, if you're handling a deployment for an internship or anything else.
Right, a little bit about me. I've already been introduced: I was a software engineer and data engineer at Facebook in Singapore for a couple of years, and earlier this year I moved to the startup world again. I graduated from computer science at NUS back in 2015, early 2016.
Sweet. A little bit about Basis AI; it's compulsory for me to do this. We're building an ML deployment platform for enterprises, and we're hiring. Herbert here was also our intern last summer, so if any of you want to know what it's like to work at Basis AI, just ask him. He'll give a better account than me.
All right, the compulsory meme: one does not simply release code on a Friday. So why do we not do that? Because obviously you want Saturday and Sunday to yourself. You don't want to wake up and troubleshoot: what's happening with the database, what's happening with the latest release of the code? It's a meme, but it's real.
So keep this in mind. If there's only one thing you take away from this talk, it's this.
So what is a release? Basically, you apply changes to your production environment, whether that is a code change, a database schema change, or whatever. In short, it's the first time your users actually use something you have been cooking. Can I have a very quick show of hands: who has deployed something where there are real users who are not your friends, not yourself, but complete strangers? A few people, I guess. Cool. So that's basically how it feels, right? You've just released your puppy into the world, and you hope nobody messes it up.
So why is it important? If you break the application, obviously there's going to be downtime, and you disappoint a lot of people: your friends, yourself, and your users. So let's try not to do that. Someone has to fix it when you break it, right? You yourself, your team, your boss, whoever that is, and at any time, because it's not good to have any downtime at all. 3 p.m. or 3 a.m., you have to wake up and fix it.
So now, what are the rules of thumb for releases? Why do I start with this? Because the actual act of releasing something is very simple. Most of the time it's just clicking buttons: you click a deploy button, it applies the change to production, and that's it. So for the release itself there are just a few things you need to worry about. First of all, don't release on a Friday. I hate repeating myself, but it's true. For all practical purposes, don't release any time you're about to leave and don't want to touch work at all. Second of all, plan to be around for at least one hour, so that if anything goes wrong after the release, you're there to fix it, because you have the most context, right?
And third of all, let people know. Let your team know: hey, I'm doing this. If not, too bad, you're the only one on the hook.
Cool. With that out of the way: the key to a safe release, aside from those very opportunistic conventions, is all the planning that comes before it. It has a lot more to do with how you do the development itself. How do you think about backwards and forwards compatibility? Because breakage happens when either of them doesn't hold.
There are a few types of change when it comes to a production release. The first is a code change. Often code changes are stateless: it's just a binary, you switch one version for another, and it's not too much of a problem. The second, more complicated, is a data-related change, like a schema change, or, if you're on a schemaless database, a data format change. If you change the log format from one column name to another, or one field name to another, you can crash someone else's dashboards, and monitoring will trip up. And the third type is a config change: you change your web server, you change your load balancer config, you change a CNAME, and all of a sudden you can't access the site anymore. So those are the three main types of change.
Within the context of this talk I focus on backend and web releases, because mobile releases are very different. Anyone want to take a wild guess at how mobile is different from web and backend? Yes, reviews. If you want to release to the app store, there's an additional step; it's not like you just click a button and it goes out. So to keep this talk focused, let's think about backend and web only. Those are the more manageable ones, and people tend to have developed good practices around them.
So, the first type: code changes. Before you do anything about code changes, try to think of continuous integration and continuous delivery.
This is really just a kind of pipeline, and you get a lot out of it. If you don't have any other tooling yet, this is the first thing you should set up. There are a lot of tools that are already off the shelf: Jenkins, Travis, CircleCI, TeamCity. Basically, if you visit any of the more established GitHub repositories, you'll see a badge for one of these. Look for a check mark; if it's not a check, if it's red or something, then probably don't use that.
The idea behind continuous integration and delivery is that when you make a code change, a one-line or two-line change or whatever, it runs through a whole bunch of automated testing, even automated deployment, before you merge it into your master branch. All of this infrastructure is there to make sure your changes are already vetted before they're merged in, so your merge is safe. I won't talk too much more about continuous integration and delivery, because there are much more informative blog posts about it that you can search for online. Let's just zoom past this point; there'll be Q&A towards the end.
The second trick you can do for releases is what I call "insert two colors" deployment, because people call it various names: blue-green, red-black, whatever. The point is that there are two versions: the version with the new changes, and the old version. All you have to do is bring up the new version, with all the nice things you've built over the past couple of weeks, and then change the load balancing layer so that it points all the traffic from the old version to the new version. That's basically the idea. There can be multiple levels of configuration. Some load balancing tools let you direct a percentage of your traffic to the new version, and with others you just flip a switch and everything moves over to the new version.
To expand on this idea, you can move traffic from 0.1 to 0.5 to 1 to 2 percent, and so on. One story from when I was working at Facebook: for a site that literally serves billions of people, you would have thought they'd have a very stringent release process. But you'd be amazed: all these tools are built to automate the major part of it. If you make a change and merge it in, literally hours later it reaches millions of people out there. That's because of these traffic-moving steps. They release in so-called circles, on 0.1 percent of the traffic first, and if the monitoring tools don't trip, you increase it gradually until a hundred percent of the traffic is using the new version.
Now, this is purely a technical thing, right? Controlling traffic, diverting traffic, and whatnot is typically the job of a DevOps engineer or a software engineer. But to expand on this: what if you could control the release of a feature, as a business need, completely separately from the release of the code? The idea is that you ship the new version of the code, but you can dynamically toggle your feature on or off. You can target certain groups. Let's say you release a new version and want to target internal users first, so only employees can see the change. You release to that group first, and if everything goes well, you expand it to the rest of your company or organization, or you target certain customers you want to test with. So basically the idea is that you can decouple feature release from code release. If you talk to non-technical people, they tend to get very excited about this, so that's one thing to keep in mind as well. There are a few off-the-shelf services that can provide this, but building one yourself is actually quite interesting and not difficult at all. You can read up on this: a very well-known person, Martin Fowler, wrote an entire piece about feature toggling, so that's something you can look up as well.
The third one is API versioning. Let's say your code changes are all internal to your application. If you change your backend and the only consumer is your frontend, then even if you trip something up, it's still easy to just revert, because you control both. It becomes much trickier when you have an API, because then the set of consumers is not only yourself anymore. There are other people who depend on a particular version of the API to work. So how do you guard against this?
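A quick aside before answering that: the feature toggling idea from a moment ago fits in a few lines. This is only a minimal sketch, assuming an in-memory flag. The `FeatureFlag` class, its fields, and the `"employees"` group name are all hypothetical, not taken from any real toggling service. It combines the two ideas from the talk: targeting a specific group first, then opening the feature to a percentage of everyone else.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class FeatureFlag:
    """A hypothetical in-memory feature toggle (not a real library)."""
    name: str
    enabled: bool = False                              # global kill switch
    allowed_groups: set = field(default_factory=set)   # e.g. {"employees"}
    rollout_percent: float = 0.0                       # share of other users, 0 to 100

    def is_on(self, user_id: str, groups: set) -> bool:
        if not self.enabled:
            return False
        # Targeted groups always see the feature while the flag is enabled.
        if self.allowed_groups & groups:
            return True
        # Everyone else is included only if their bucket falls under the percentage.
        return self._bucket(user_id) < self.rollout_percent

    def _bucket(self, user_id: str) -> float:
        # Stable hash, so the same user always lands in the same bucket in [0, 100).
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        return (int(digest[:8], 16) % 10000) / 100.0


flag = FeatureFlag("new_checkout", enabled=True, allowed_groups={"employees"})
print(flag.is_on("alice", {"employees"}))  # True: internal users first
print(flag.is_on("bob", {"customers"}))    # False: rollout is still at 0 percent
flag.rollout_percent = 100.0
print(flag.is_on("bob", {"customers"}))    # True: released to everyone
```

The code for the feature ships once; turning it on for more people is just a change to the flag, and setting `enabled` to `False` acts as the switch back.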
So you try to follow semantic versioning. What semantic versioning tries to do is give you a way of naming your versions so that people know what to expect. If it's 1.0, it kind of means: hey, I'm ready, generally available. If it's 2.0, it means: I'm still ready, still generally available, but there are non-backwards-compatible changes compared to version 1. That's actually very helpful. Let's say you release version 1 and you're not happy with the design or whatnot. Then you can build your version 2 while keeping v1 working at the same time, and after that you put out a press release or whatever to say: hey, we now have v2, which is so much better, so why don't you move there? Basically, just don't break the existing v1 API; create a new version instead.
There are a lot of times when v1 is due for deprecation, and this deprecation process takes months or even years. If you look at the Facebook Graph API, for example, I believe the first version took something like five years from the moment it was released until it was fully deprecated, and the larger part of those five years was spent in deprecation mode. If you made a call to those APIs, it returned some sort of warning that you shouldn't do that, and Facebook sent you an email saying you shouldn't hit this endpoint anymore.
The very important part, if you want to do API versioning, is that you have to have some sort of tracking, some sort of metrics, to tell you how many people are still using the old API version. If not, you don't know when it's safe to cut it off and say: okay, now I can delete all of the v1 API. So that third point is very important for this. Right.
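To make those three points concrete, here is a small sketch, assuming a toy dispatcher rather than any real web framework. The handler names, the response shapes, and the `calls_by_version` counter are hypothetical. It keeps v1 working next to v2, attaches a warning to deprecated calls, and counts calls per version so you can tell when v1 is safe to remove.

```python
from collections import Counter

# Hypothetical usage counter; in practice this would feed a metrics system.
calls_by_version = Counter()

DEPRECATED_VERSIONS = {"v1"}


def get_user_v1(user_id):
    # Original response shape, kept working for existing consumers.
    return {"id": user_id, "name": "Ada Lovelace"}


def get_user_v2(user_id):
    # Non-backwards-compatible change: "name" is split into two fields.
    return {"id": user_id, "first_name": "Ada", "last_name": "Lovelace"}


HANDLERS = {"v1": get_user_v1, "v2": get_user_v2}


def handle(version, user_id):
    """Dispatch on API version, count the call, and warn on deprecated versions."""
    if version not in HANDLERS:
        return {"status": 404, "error": f"unknown API version {version}"}
    calls_by_version[version] += 1
    response = {"status": 200, "body": HANDLERS[version](user_id)}
    if version in DEPRECATED_VERSIONS:
        response["warning"] = f"{version} is deprecated; please migrate to v2"
    return response


handle("v1", 7)
handle("v2", 7)
handle("v2", 8)
print(dict(calls_by_version))  # {'v1': 1, 'v2': 2}
```

Once the count for `v1` stays at zero for long enough, deleting `get_user_v1` becomes a decision backed by data rather than a guess.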
So after all that, code changes are more or less binary: you can just switch between versions. If it's too much trouble, or we break something, it's like: okay, just revert, no big deal. But what if it's a database-related change? Those are much trickier. First of all, rolling back a database is much, much harder, and if you trip something up with a database, it might lead to permanent data loss. There was an incident at GitLab about two or three years ago, and it was quite a hoo-ha at the time, because they lost about six hours of data. Somehow they eventually managed to recover from it. But imagine you do something like that and you don't have a proper backup: those six hours' worth of data are gone forever. There's no way in hell you'd be able to recover them. So for database changes, there are a few things we can look into.
All right. First of all, make sure you have backups, right?
Most of the time you would use a cloud provider's database service, and those already have various built-in backup mechanisms, either every hour, every day, or on some trigger. A lot of them have APIs you can call to trigger a backup. So you don't have to worry too much about the logistics of backing things up, and they generally tend to be pretty good at recovery as well. You don't have to worry too much about the integrity of the database backup that GCP or AWS created for you. But it's still a good idea to test your backups every now and then and make sure that this thing works. If not, you're just backing up junk, and the moment you need it, it's like: oh shit, it doesn't work.
And then the third thing: after all the backups are in place and tested, you still have to worry about backwards and forwards compatibility. What that has to do with the database is this. Imagine you change a column name. How do you make this change safely? There are various places where it can fail, right? You can start writing to the new column although your schema hasn't changed yet. You can still be reading from the old column when your schema has already dropped it. What's the interplay between the code and the database schema? The thing is, the code is stateless, whereas the database itself is stateful. This is a well-covered topic. Stripe Engineering actually has a very good tutorial on how you can do this, and I've just written out the steps here, because they are very common and not that difficult to follow at all.
First of all, try to double write. If you want to rename a column, you create the new column and then write to both the old and the new column, so that you maintain data integrity in parallel. After that, you run a data migration to copy whatever is in the old column over to the new column. From there on, the new column has the full data set: the history, the present, and the future, because you're already writing to the new column. Then, the third thing, you remove the reads from the old column. Because you already have the data in the new column, you don't need to read from the old one anymore. Eventually you remove the writes to the old column as well. After these four steps are done, there is no interaction left between your code and the old column, and you can safely drop that column, or even the table, or whatever. So it's kind of a five-step thing. If you visit the link I have here, it's actually a lot more interactive. People spent time on animations and whatnot, which is better than what I'm trying to do here, so I highly recommend reading that blog post from Stripe Engineering.
Right, so the third type of change that has to do with deployment is config changes: web server changes, load balancer changes, or whatever else isn't captured in the database or in your code. There was a very famous incident just a few months ago, in March: Facebook had about 14 hours of downtime. Partial downtime; images and whatnot couldn't load at all. The reason was a config change that made it out to the entire fleet of Facebook servers, and then things just stopped working. And if I'm not wrong, it was caused by some intern. So if you keep that in mind, you won't bolster the stereotypes of intern season anymore.
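Going back to the column rename for a second, the five steps can be sketched end to end. This is a minimal illustration, assuming an in-memory SQLite database and a hypothetical `users` table whose `fullname` column is being renamed to `display_name`; it is not the code from the Stripe post. In a real system each step would be its own release, with time in between to confirm nothing broke.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO users (fullname) VALUES ('Ada'), ('Grace')")

# Step 1: add the new column and write to both the old and the new column.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def create_user_double_write(name):
    conn.execute(
        "INSERT INTO users (fullname, display_name) VALUES (?, ?)", (name, name)
    )


create_user_double_write("Linus")

# Step 2: migrate rows that were written before double writing started.
conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")


# Step 3: reads use the new column only.
def list_names():
    rows = conn.execute("SELECT display_name FROM users ORDER BY id")
    return [row[0] for row in rows]


# Step 4: stop writing to the old column.
def create_user(name):
    conn.execute("INSERT INTO users (display_name) VALUES (?)", (name,))


create_user("Margaret")
print(list_names())  # ['Ada', 'Grace', 'Linus', 'Margaret']

# Step 5 (not run here): once nothing reads or writes `fullname`, drop it,
# e.g. ALTER TABLE users DROP COLUMN fullname on a database that supports it.
```

At every point in this sequence, both the previous and the next version of the code can run against the schema, which is what makes each step safe to revert.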
So keep that in mind, right? The first thing about config changes: don't just do a one-off and release it to your entire fleet of servers. Do a gradual rollout. By gradual rollout I mean: choose a specific group of servers or targets, make the change there first, and observe how they behave. If all goes well, release it to a larger group, similar to what I suggested earlier for code changes.
Another thing you can consider is a key-value store. There are various ones that are pretty good, like Consul and ZooKeeper. Those key-value stores can host your configuration as well, and they tend to have good backup or audit mechanisms built in, so if you change something and it doesn't work, you can revert to the old version pretty easily.
And then the third thing is to version your config. If you add another row or change some config values, create a different version for it altogether and name them v1 and v2, so later on, if v2 doesn't work out, you just revert to v1. Instead of a config change, you convert it into a code change. That's the idea behind infrastructure as code as well. Before that, everything was kind of ephemeral and state was everywhere, with no good way to track it. Now people say you should track all your configuration and your infrastructure in version control as well. So the trick for dealing with config changes is to try to convert them into code changes, and that's better for everyone. But of course, don't commit your secrets, right? Database passwords or whatnot: absolutely no. Other things are okay.
Cool, so I've tried to wrap it up in a very short amount of time.
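The gradual rollout and the versioned config can be combined in one small sketch. Everything here is an assumption for illustration: the `CONFIGS` values, the stage fractions, and the `is_healthy` callback are hypothetical stand-ins for real config files in version control and real monitoring.

```python
# Hypothetical versioned configs, as they might live in version control.
CONFIGS = {
    "v1": {"timeout_seconds": 30, "max_connections": 100},
    "v2": {"timeout_seconds": 10, "max_connections": 500},
}

# Rollout stages: fraction of the fleet that gets the new version.
STAGES = [0.01, 0.10, 0.50, 1.0]


def rollout(servers, old, new, is_healthy):
    """Apply `new` stage by stage; revert everyone to `old` if a check fails.

    Returns a dict mapping server name to the config version it ends up on.
    """
    assigned = {server: old for server in servers}
    for fraction in STAGES:
        # Always touch at least one server, even for tiny fractions.
        count = max(1, int(len(servers) * fraction))
        batch = servers[:count]
        for server in batch:
            assigned[server] = new
        # Observe the servers on the new config before widening the rollout.
        if not all(is_healthy(server, CONFIGS[new]) for server in batch):
            return {server: old for server in servers}
    return assigned


fleet = [f"web-{i}" for i in range(20)]
result = rollout(fleet, "v1", "v2", lambda server, cfg: cfg["timeout_seconds"] >= 5)
print(set(result.values()))  # {'v2'}
```

Because v1 is still sitting there under its own name, reverting is just pointing servers back at it, which is the same move as reverting a code change.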
So now it's Q&A. If you have any questions about releases, what are some of the things you've been wondering about in deployment?
Yep. So usually, if you have a mechanism to do this traffic moving, there's some manual switch for you to turn the version back immediately. Say the 0.5 percent is breaking things: usually some DevOps guy will press the button and you revert entirely to the old version. It depends on the company and the practices as well, but most of the time what I see is that they revert the entire thing back to zero. It depends on your tools, but that's what you should aim for.
Hmm, good question. First of all, I'm not the infrastructure kind of guy. Software engineers usually don't interact directly with the servers, right? Usually there's a DevOps team, an infrastructure team, or a production engineering team that focuses on maintaining the servers and the basic infrastructure, like networks. So I might not be in the best position to answer that. From my little experience in that area, though, the first thing you check is usually whether the node is up: whether you can physically connect to it, whether it's reachable over the network, whether you can SSH into it. What are the ways you can communicate with that server? But this is kind of from the old days, when a single server mattered a lot. Nowadays servers are treated as ephemeral.
You just migrate the workload somewhere else. But the first thing is to check whether you can connect to the thing at all. Second of all, look into your logs. That means either centralized logging or, sometimes, after you've connected to the server, actually looking into /var/log for the services you're interested in and digging from there. From there onwards, it's very specific to your stack and your application, so your mileage may vary. There's actually a pretty good book from Google, Site Reliability Engineering, from the people who maintain the server farms and whatnot at Google and the basic infrastructure I mentioned. A lot of good practices are in there, so if you're more interested in this area, I recommend you go read that book instead.
A/B testing. So this actually has a lot to do with it, because with A/B testing you essentially want to maintain two versions of your application at the same time, right? And a lot of the time you want to control which group receives one treatment versus the other, so some sort of feature toggling, some sort of feature targeting, is absolutely crucial. Something very specific to A/B testing, though, is that you have to measure your success rate. After you divide the groups, you have to have logging and all sorts of metrics data, pump it into your data warehouse, and perform analysis to say that group A is better than group B, or something like that. So it's closely related, but it has a lot more intricacy, because you have to build your data infrastructure next to it. I believe there's a lot of other literature on A/B testing, so I might not be the best guy to answer this.
Yeah, so spin up a different database, try to recover into that, and just run SQL on it. Or, better yet, if you have some automated database testing, like the one you built for us, that would be cool. So, yeah. Again, I'm not a database administrator, so there are other best practices specific to testing your backups that I might not know about.
Cool. All right, thanks guys.