Thank you everyone for coming down today. My name is Pratik, and I'll be talking about the continuous integration journey we've had at Haptic: how we went from releasing about once or twice a month to releasing multiple times a day, and how we got there.

To start off, a bit about Haptic itself. Haptic is a chatbot platform, the largest one in India. You might have heard of the Haptic app on the Play Store, but we also have a publisher side and an enterprise side of the business. Across all of these we have more than 40 bots deployed. We process about 1.2 to 1.3 million messages every single day, which culminate in about 65,000 conversations. The consumer app itself is one of the highest rated apps on the Play Store, with more than 4.5 stars and more than 100,000 user reviews over the course of its life.

To understand where we came from and how we started out, I have to briefly touch upon the team structure and how we operate. We have engineering teams that operate completely independently: machine learning scientists working on the core models of the chatbot platform, full stack developers shipping the platform itself, and feature teams that work directly with product managers to ship specific features. Those features might be bots, they might be related to the platform, but all of these teams operate independently, each with its own release cycle and its own set of features. What we initially did was merge everyone's features into one massive release: 40 engineers shipping onto the same code base, trying to make one big release. Sure, it felt good to do one big release a month with a lot of features, but as you can imagine, it resulted in a lot of problems.

Briefly, on the tech stack: we're primarily a Python shop. MongoDB, MySQL, and Elasticsearch are our main databases. Around that we have Redis, React on the frontend, and RabbitMQ, which sits in the middle as one of our most important message brokers.

So that's the landscape. Now, the problems we had. With 40 engineers trying to release multiple sets of features at the same time, something always breaks when you release, and it's absolute chaos trying to figure out whose code shipped the bug and how to fix it. Everyone points fingers: it's this team, it's that team, no, this is how we fix it. That would always end in either a blocked staging environment or a down production environment, with literally the entire team spending the whole day on one big massive release. Everyone is doing just that. That was the major problem, and we realized this is not how we want to operate.

Another major thing that went wrong: after the release was done, all the environments were out of sync. One developer had run a migration on MySQL, but a developer on another team never got that migration.
That became a problem: keeping the development and staging environments sane, making sure everyone had up-to-date data and up-to-date schemas. Even after the release was done, it was really difficult just to get back to being functional.

So we realized this wasn't working for us and set some goals for what we wanted to achieve. The first thing that became clear was that we wanted zero-downtime deployments. Even deploying in the middle of the night with half an hour or a few minutes of downtime was not acceptable; we had quite a few users online at that hour, and we wanted to make sure deployments never interrupted service at any time of day. To move away from that big, high-risk deployment, we wanted high-frequency deployments: keep shipping often, keep shipping fast, which also reduces the integration problems you have. We wanted to step the ops team out of handling deployments and make them completely hands-off in the entire release. And we wanted to manage all the dependencies. For each release, developers would say: hey, I need this package installed, I need this available on the server. That was another manual process where someone had to go in and set these things up on each of the servers, and it would drift on the development environments too: other developers forgot to install those dependencies. Where is the list of dependencies maintained? How do you keep track of it? How do you bring up servers from scratch? We wanted to automate all of that. And more than anything, we wanted the engineer who developed a feature to own it from development all the way to production: no handover process, no transfer. You are responsible for shipping the code and for whatever issues happen in production. That gives you control of the entire development cycle.

Those were the goals we set out with, and this is the first version of where we landed. As you can see, there are multiple development environments; each developer was given their own EC2 server with everything installed. There were some shared pieces, like databases with different schemas, but more or less each developer had his or her own environment. We set up Jenkins to kick off a deployment job on staging: every time new code was merged into develop, any developer could go to Jenkins and hit deploy, and that would put the latest code on staging, with a similar rollout for production. (I'll show a rough sketch of what such a deploy job might run in a second.) There are two colors of arrows here. The purple arrows are where you could set up each environment any way you wanted: say you wanted to start developing a different chatbot or a different channel, you just went into that environment and configured it. Each environment was fully configurable; you could bring up what you needed and get it running.
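To make that concrete, here's a minimal sketch of the kind of staging deploy script such a Jenkins job could run. The paths, branch, and service names are assumptions for illustration, not our exact setup:

    # deploy_staging.py: a rough sketch of a Jenkins-triggered staging deploy.
    # Paths and service names are placeholders, not our actual configuration.
    import subprocess

    def run(cmd):
        # check=True makes any failing step fail the whole Jenkins build.
        subprocess.run(cmd, shell=True, check=True)

    def deploy_staging(app_dir="/srv/app"):
        run(f"git -C {app_dir} checkout develop")
        run(f"git -C {app_dir} pull origin develop")          # latest merged code
        run(f"pip install -r {app_dir}/requirements.txt")     # keep dependencies in sync
        run(f"python {app_dir}/manage.py migrate --noinput")  # apply pending Django migrations
        run("sudo supervisorctl restart all")                 # restart app processes

    if __name__ == "__main__":
        deploy_staging()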
For example, if someone wanted to build a Father's Day chatbot for a specific event, they could just enter that data there. The yellow arrows are the infrastructure configuration that went in. We centralized that, so when someone updated anything there, it would get pulled automatically into each of the environments. Suppose someone decided, hey, I need to bring up a new replica, and we need that configuration on the staging environment: it would automatically be available to all the teams on their dev servers as well. This took the manual process out of each of the deployments.

But we were missing something key here, and I don't know if you can spot what it is. We were missing tests. There was no principle of testing things before they went onto the staging environment, and that resulted in absolute chaos: all we ended up doing was shipping buggy code faster. People would just deploy bad code faster, the staging environment kept going down with bad code being merged in faster and faster, and that caused further problems. Pretty soon we realized that you just cannot have CI without tests. It's fundamentally wrong, because you're bringing the system down with worse code, faster. You have to be able to ship with a certain level of confidence.

This presented a really big problem for us: how do you go back and write tests for an entire system that's been around for three years? There were some tests, but it wasn't part of the culture that there has to be this amount of coverage, this many tests. We wanted to go back and write those tests, but how do you quantify that value? You can't just say, hey guys, we're not going to do any development for two months, we're just going to write tests; I would have gotten a tight slap for pitching that idea. And even for new features, writing tests takes time: you're now taking maybe ten percent longer on each feature, and you have to justify that time to your product managers, to whoever is going to notice the delay in shipping. We had to fight that battle as well. So we realized this was a major, major problem for us when we started out.
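To give a flavor of the kind of test we started requiring, here's a minimal pytest-style sketch. The handler and its behavior are hypothetical; the point is that each test asserts a piece of business logic rather than chasing raw coverage:

    # test_intents.py: a hypothetical example of the unit tests we began requiring.
    # `detect_intent` is an illustrative function, not our actual code.
    from bot.nlu import detect_intent

    def test_greeting_maps_to_greeting_intent():
        assert detect_intent("hi there") == "greeting"

    def test_order_query_is_not_a_greeting():
        # Guards real business logic: order queries must not be swallowed by greetings.
        assert detect_intent("cancel my order") != "greeting"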
Another big problem we recognized was the data. For a machine-learning-driven company like us, the data is just as important as the code, and as you saw on the previous slide, data was being entered in multiple places by different people. That resulted in major conflicts: someone configures something differently somewhere else, and when you merge it all back together, the data conflicts. So we realized data is a first-class citizen for us. It's just as important to move data from environment to environment as it is to move code, and you almost have to version-control that data: make sure it travels along, make sure multiple people can't change the same thing, and so on. (I'll show a rough sketch of what that data movement could look like in a moment.) We have all the chat data in Mongo, which is used to send out all the bot responses; that's easily 20 or 30 thousand pieces of copy. There's configuration data in MySQL: how that Father's Day bot is actually set up, what the static values are. The environment configuration was already centralized in S3. And there are the models we generate: we didn't want to regenerate them for every environment again and again, because training those models is time-consuming, and if your chat data and your model are out of step you get conflicts there too. So we realized this had to be a primary focus for us, just as important as code.

Some key takeaways from this initial version. The staging environment became absolutely chaotic: it constantly kept going down and blocked testing. The setup fixed some problems around the manual side of deploying code, but it introduced a whole set of new ones. That was a key takeaway. Another was that certain cultural changes had to be made: we had to move to a test-driven development standard, change how developers work, change how we do estimates, change people's mindset. When people are used to working a certain way for a few years, that's hard to change. We also experimented with a pre-prod environment between staging and production, and that was another learning: it added to the problem. It was one more environment to configure, one more environment people had to enter data on, and transferring that data again exaggerated the problem further. So we scrapped the pre-prod environment as well.

And this is the main question you have to ask yourself when building any CI pipeline: do you have the confidence to ship this code directly to production? If you don't, then that manual system, that manual process, is always going to be there somewhere in your pipeline.
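Coming back to data as a first-class citizen, here's a rough sketch of what versioned data movement between environments could look like. The database and collection names and the version scheme are assumptions for illustration:

    # snapshot_chat_data.py: a sketch of versioned chat-data movement between
    # environments. Names and the version tag scheme are placeholders.
    from pymongo import MongoClient

    def snapshot(source_uri, dest_uri, version, db="chat", coll="bot_responses"):
        src = MongoClient(source_uri)[db][coll]
        dst = MongoClient(dest_uri)[db][f"{coll}_v{version}"]
        docs = list(src.find({}))
        if docs:
            dst.insert_many(docs)  # copy the data, tagged by version
        return len(docs)

    # Pull staging chat data onto a dev environment as version 42 (illustrative):
    # snapshot("mongodb://staging:27017", "mongodb://dev1:27017", 42)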
So we set ourselves some extra goals to get this system actually working and usable, because otherwise it just lies there causing further problems. Yes, we had to bring tests in: start writing unit tests and clean up the existing ones. And cleaning up means removing the bad tests too, not just leaving them lying around, because if you get a test report with, say, a hundred failures, you'll ignore it like just another issue report and not even look at it. It has to really show you the problem you're looking at. We also had to focus more on functional coverage: people get really excited about great coverage numbers, but it's meaningless if the tests don't test something of value, don't test business value. Coverage for the sake of coverage isn't really testing anything; that kind of coverage adds no value to the test suite. And we had to maintain consistency across environments: the right data available and the right configuration while you're testing and developing. If your environments are configured very differently, that's a problem too; you need to keep things in sync. All of this boils down to less time spent in manual QA: catch those bugs early, catch those bugs often.

So this is what we came up with, and I'll walk through the process of what it takes to build a feature from scratch and ship it. It all starts on the developer environment: you create a feature branch off develop and build what you need on your dev environment. Once you feel it's in a good place, you create a PR. That PR is deployed onto a completely separate test environment, where the entire test suite runs, and on the PR itself on GitHub we post your coverage as a label: which lines are covered, which lines are not, where you missed out. That's visible to everyone, so it became psychologically very evident, very quickly, who's writing tests and who isn't. We also ran linters to make sure the code was up to par, and integration tests to make sure whatever you developed wasn't breaking anything else. Only once the PR was approved by at least two people could you merge it into develop, and from there it went on to staging. At this point you know the code is not going to break anything, you have that confidence, so we could start deploying to staging automatically the moment code was merged. That job ran off Jenkins: it would pull anything new and deploy it to staging.

As you can see, the arrows have now changed: the purple arrow has a single point of ingress, which is the staging environment. We had everyone configure everything on staging first: all data, all configuration, any new chatbot or environment went onto staging first. From there, a small service backed everything up and made it available on your development environment, or moved it directly to production. So if you retrained a model on staging, you could just push it onto production. You could never directly edit anything on production; there was only a single point of change. Even to change something on production, you always changed it on staging first and then pushed it to production.
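Since models and configuration lived in S3, promoting a retrained model from staging to production could be as simple as a server-side copy. Here's a minimal boto3 sketch; the bucket and key names are made up for illustration:

    # promote_model.py: a sketch of pushing a staging-trained model to production.
    # Bucket and key names are illustrative, not our real layout.
    import boto3

    s3 = boto3.client("s3")

    def promote_model(model_key, staging_bucket="models-staging",
                      prod_bucket="models-production"):
        # Server-side copy: the model bytes never leave S3.
        s3.copy_object(
            CopySource={"Bucket": staging_bucket, "Key": model_key},
            Bucket=prod_bucket,
            Key=model_key,
        )

    # promote_model("intent_classifier/2018-06-01.pkl")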
We also built a data tagging service. Every time a chat broke, or the bot wasn't able to understand something, that was a key data point for us. We'd push it out and it would get re-tagged: we pick up the entity, we map the intent for it, and then we take that data, but we don't push it back to production directly. We push it back to staging, the data gets added there, and then it gets pushed to production again. This maintains the consistency of data everywhere, and it lets you retest everything on staging first, to make sure a new intent or entity didn't break something else for someone else, because that data is shared. So this gave us a single, unidirectional flow for data, which was really important to us.

For the production environment, we brought in Spotinst to help us out, and the deployment process goes like this. First, the majority of the tests have already run on the test environment, so once the code is on staging and passes a basic sanity check, you're actually good to release. If something requires manual testing, here's what we do. We follow git flow, if you know what that is: the moment your code is ready on develop, and you know everything on develop is clean, you create a release branch off it. If someone else wants to test something, they merge into develop, but your release branch sits on the side, ready to go to production whenever you want. You cut release branches as frequently as you like, so whatever has been tested and is shippable has already been branched out, and you can merge it into master whenever you want.

Now the actual deployment flow itself; I'll walk through it quickly. The Jenkins deploy job kicks off an Ansible script. But first, let me quickly touch on what an on-demand server and a spot server are on AWS, for anyone not in the know: an on-demand server is always available to you, it never changes; a spot server is available at a much lower cost but can be taken away by Amazon at any point. For every service, we keep one on-demand server. The script first removes that on-demand server from the load balancer; if it's a web server, we stop uWSGI, otherwise we stop supervisor, and take the server out of rotation. It then installs any requirements that server needs: new pip packages, OS-level dependencies, anything. It pulls the latest code from master and, if there are any migrations to be run through Django, runs those as well and lets us know when they're done. Once that's done, we start uWSGI (or supervisor) again, still out of the load balancer, and run a basic sanity check against the server to make sure everything is fine. Then we start the rollout phase. In the rollout phase we create an AMI of that entire server, a machine image: the server is out of the load balancer, it gets shut down, and an AMI is created from it. That AMI is then handed over to Spotinst.
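The AMI-baking step of that rollout might look roughly like this with boto3. The instance ID and naming scheme are placeholders; our real version was driven by the Ansible script:

    # bake_ami.py: a rough sketch of the AMI-creation step of the rollout.
    # Instance IDs and image names are placeholders.
    import time
    import boto3

    ec2 = boto3.client("ec2")

    def bake_ami(instance_id):
        image = ec2.create_image(
            InstanceId=instance_id,
            Name=f"app-release-{int(time.time())}",  # unique name per deploy
        )
        ami_id = image["ImageId"]
        # Block until the image is usable before handing it to the spot manager.
        ec2.get_waiter("image_available").wait(ImageIds=[ami_id])
        return ami_id

    # ami = bake_ami("i-0123456789abcdef0")  # then hand `ami` over to Spotinst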
I don't know if you've heard of Spotinst, but it's a fantastic service that runs on top of AWS. It lets you use spot servers for your main compute, anywhere you don't have a single point of failure in your system. If all your web servers can go up and come down at any point, you just tell Spotinst: hey, I need 10 servers of this type. It's responsible for making sure that when Amazon takes a server away from you, it provisions another one. You just give it the AMI, and whenever one server goes down, another comes back up. So we hand that AMI to Spotinst, and Spotinst does a blue-green deployment: alongside the 10 servers running the old code it brings up 10 new ones, makes sure they're up and running fine, and then takes the old set out. After that, the original on-demand server the AMI was created from is added back to the load balancer. So there's no interruption at any point, and that one on-demand server is always available.

So that was the back-end deployment. How were we going to harden the apps the same way? Apps were actually fairly straightforward. We built unit tests and integration tests that ran on every single PR in the same way. The integration tests, which ran through Selenium, used to take really long, so we ran those only nightly, got a full test report, and people would go fix those things. We also created a separate staging build that you could connect to any dev server or to the staging environment for testing, which gave us flexibility as well.

Next, the bot testing tool. We pretty soon realized that the unit tests we wrote were testing the business logic and the code at the API level, but what we were really missing was a way to test bots. This was a real challenge, because we couldn't find any tool out there, nothing out of the box, to test a bot in a deterministic way. We had to write it from scratch, and we scratched our heads over it for a while: how do you validate and test a bot? You can't say "for this input, expect this output," because the output is constantly changing. People are always fooling around with the copy, changing the way the text is written, adding new nodes, things like that. The example here is a graph-based bot, not one running on deep learning. What we built is a system where we test just the start point and the end point: we provide it with a set of flows and some input text, and we test where the conversation lands. If, for a certain input text x, you reach a certain end node, and the truth table of collected values holds (you collected this entity and that entity), then we can validate that that chat flow works for that input string. This is something we're still working on, and it's a rudimentary way to test, but it let us automate what was otherwise a really manual process.
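Here's roughly what a test case for that start-point/end-point idea could look like. The bot client, node IDs, and entity names are hypothetical; the key property is that the assertion survives copy changes, because it checks the node reached and the entities collected rather than the response text:

    # test_bot_flows.py: a sketch of deterministic, copy-agnostic bot tests.
    # `BotClient`, node IDs, and entity names are hypothetical.
    from bot_testing import BotClient

    def test_order_status_flow_reaches_end_node():
        bot = BotClient(bot_id="order-bot", start_node="root")
        result = bot.send("where is my order 12345")
        # Assert the flow endpoint, not the wording of the reply.
        assert result.end_node == "order_status_reply"
        # Assert the truth table of entities collected along the path.
        assert result.entities == {"order_id": "12345"}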
Sitting and manually tapping through a bot to make sure each and every thing works every time we changed something had become really hard, especially when you change the machine learning model: you go from one model to another and you have to run regression across all 40 bots, across 20,000 pieces of copy. That's a massively manual process, and we made sure that at least at some level we could automate the testing of our bots. And again, it's agnostic of the content: if someone goes in and changes "hey" to "hello" in a response string, we're protected from whatever copy changes the product guys make.

So that was the major back-end CI pipeline. Along with the bot testing tool and the CI we had on the mobile apps, it really hardened our entire system. It allowed teams to have their own release cycles; teams were able to ship a lot faster and at better quality, and we were able to step out of the process completely. Releases now happen multiple times a day with no one from operations involved. Big release, small release, migrations, packages to be installed: it just happens, and everyone owns the entire thing. Developers ship whatever code they need, and the apps are hardened daily with the nightly reports that come out.

One of the key realizations for us was that it was really not about the tools. We could have used something else instead of Jenkins, we didn't have to use Ansible, and we would have ended up in the same place. What was really important was the culture we had to build in the company: the culture of test-driven development, the culture of cutting release branches, changing the way you operate. That was the actual underlying learning in getting something like this working and making it part of the company's day-to-day.

As for where we go from here: we'd like to reach functional coverage across the entire system. We've reached a certain level of code coverage, but are all the use cases covered, is all the business logic covered? We'd like to build the confidence to move to continuous deployment; we already have continuous deployment live on staging, where you just merge and the deployment kicks off on its own. API testing is another layer of validation we want to bring in soon. And another concept we're toying with is a sort of chaos monkey of conversations: we go through our entire database, pick out any conversation any user has ever had, toss it at the system, and it should just work. That would stress the system, really harden it, and help us find flaws, because what sometimes happens is you fix something, you think you fixed it permanently, and then a later change breaks the earlier fix: something specific to one model works in one case but not for the previous data you had.
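A first cut of that conversation chaos monkey could be as simple as sampling stored conversations and replaying each user message against the live bot. This is a hypothetical sketch; the collection layout and the replay client are assumptions carried over from the earlier bot-testing example:

    # conversation_chaos.py: a sketch of replaying random historical conversations.
    # Collection layout and `BotClient` are assumptions, not our real interfaces.
    from pymongo import MongoClient
    from bot_testing import BotClient

    def replay_random_conversations(mongo_uri, n=100):
        chats = MongoClient(mongo_uri)["chat"]["conversations"]
        sample = chats.aggregate([{"$sample": {"size": n}}])  # n random past chats
        failures = []
        for convo in sample:
            bot = BotClient(bot_id=convo["bot_id"], start_node="root")
            for message in convo["user_messages"]:
                reply = bot.send(message)
                if reply is None or reply.end_node == "fallback":
                    # The bot failed to understand a message it once handled.
                    failures.append((convo["_id"], message))
        return failures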
This kind of random replay of conversations is something we want to try, to make sure we're actually taking steps in the right direction by making smarter, better bots. Yeah, that's where we're at now. That's all from me. Thank you, guys.