All right, you guys ready to get started? Thanks for joining us this afternoon. Before we introduce ourselves, I'd like a show of hands: how many people in the room are new to OpenStack? How about CI/CD — are you working on an implementation? We're looking for a back-and-forth dialogue today. We're going to walk through our journey implementing a CI/CD solution, the needs that started us down this path, and some of the tool sets available to you if you're trying to implement this yourselves.

My name is Vic Howard. I'm part of the Comcast cloud services team. — I'm Shridhar Basam, lead engineer for the Comcast OpenStack team. — And I'm Prashant Hari, a senior engineer on the Comcast cloud team.

We're going to talk about three things today. First, how do you design a CI/CD process and workflow that works for your organization? We'll give some real-world examples so you can see what it looks like. What we've implemented is a work in progress, so we can show you where we're going, where we've been, the difficulties we faced, what worked really well and what didn't. Then we'll go into technical details at the end, with architecture diagrams and more implementation specifics. But we'll try to keep it somewhat high-level, to give you an introduction and let you know you're not alone in trying to implement this stuff.

I think it's really important to take an inventory of where you're at. Maybe you're just starting out; maybe you already have some kind of manual process in place for deploying software. Find out where you are, and then figure out what a working CI/CD system would look like for you. Maybe you want automated deployment; maybe you're not so concerned about that, but you want to speed your code into production.

Let me take one step back and talk a little bit about our environment. We carry some patches that we'd like in our version of OpenStack that may not have made it upstream yet. So we do hold a number of patches until they land upstream, but we focus on getting our code into the upstream.

What worked well for us was actually whiteboarding a workflow: what our CI/CD system was, how long things were taking, and the pain points in that workflow when we first sketched it out. In addition, make sure you aim for a minimum viable product. Even if there are manual processes involved in taking code and pushing it into production, go ahead and write them down, look at the most painful and time-consuming portions of that workflow, focus on those first, and iterate over them every couple of weeks.

It's also very important to leverage the existing infrastructure the community provides. We've used a lot of different things from the community, as well as things other people have turned us on to, like pbuilder and Jenkins. We also use workers — nodes — with Jenkins, so we can do many different things at once.
We're not tied to just one master orchestration engine. We also use reprepro to manage our repository, and we use dupload to get packages there. We're big into testing, and we use Puppet to deploy to our QA and production environments; we'll go into that in a little bit. Things we don't use internally: Zuul, Nodepool, Jenkins Job Builder, and Gerrit. We have been looking a lot into implementing Jenkins Job Builder, though.

So where did we start? We had fewer than five data centers, and we started to realize that as we scaled up, we were going to have to put more and more automation around our process. It was taking us more time than we'd like to push things into production. We have a goal of getting each milestone release into our production environment within two weeks, in the most automated way we can.

The things that worked really well for us: Jenkins was great as an orchestration tool tying everything together; we really enjoyed it. A lot of the default Puppet manifests we got from the community were great — minor tweaks, or just our own learning curve, were the only real blockers in implementing them. And Tempest: leverage the community there too. We use Tempest, and we think it's really good for functional testing, above and beyond unit testing, when you deploy something.

The community is great. There are a couple of IRC channels I'll give you at the end of the presentation. Feel free to reach out to us or talk to anybody in the community — tons of people are willing to help. IRC is great; I've reached out to several people in different organizations and talked about their issues and ours. It's really nice collaboration, and definitely leverage it as you're starting out trying to improve your process.

On defining workflow: when we set out, just defining the workflow and actually putting it on a whiteboard or on paper really helped us see "well, we're not even thinking about this portion" or "we missed this piece." It really helped us come up with a roadmap for what we're trying to do.

What went badly when we started out? We didn't have a lot of standards. We didn't have one consistent place we put all of our code, and we didn't have a naming convention for our code versus upstream code. Those things don't sound like much in the beginning, but they can really come back to bite you if you're trying to automate and have a consistent deployment mechanism — a consistent way of taking source into production — especially if you're carrying your own packages.

Packaging was really difficult for us as well, mainly because we had no experience with Debian configs and not a lot of experience with Debian in general. As far as packaging up OpenStack and deploying it, the community was very helpful there. There are a lot of good configs on Launchpad that we leveraged: we'd pull those down, modify our changelog, modify some things in the configs — which was a manual process — and then push that out. We also had a lot of manual processes we were dealing with in the beginning, mainly around where we gate.
Deploying to QA was very manual; deploying to production was a very manual process. Even triggering CI builds from Jenkins was manual in the beginning. So was managing the build hosts we used for packaging — the Jenkins nodes I was talking about earlier. We were using Vagrant and librarian-puppet, which was our best solution early on for preparing, say, an Essex or Grizzly or Havana environment to run the packaging unit tests on, and it became really difficult and time-consuming to manage.

Then we had 5x growth last year — 500% growth in our data centers — and we really knew we needed to automate more of the way we artifacted code and put it all together.

The things that worked really well for us: I see Chris sitting here — he turned me on to pbuilder. That was really helpful because it segmented our builds and allowed us to use one unified build host, whereas before we had to manage them all with Vagrant and librarian-puppet, which was very taxing. We parameterized Jenkins and came up with consistent tagging mechanisms for our code and for the patches we hold. This is really key. In the beginning we were making manual updates to Jenkins to change our code location, which is a big no-no. By tagging things consistently and keeping the same naming convention for the internal Debian configs we hold, we were able to automate our Jenkins jobs. No one has to update anything in Jenkins anymore; instead we point at our source location, present a drop-down with all of the tags, and those tags match our configs. That has saved us a lot of time and sped things up.
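To make that concrete, here's a minimal sketch of triggering one of those parameterized jobs from Python with the python-jenkins library. The job name, parameter names, and tag convention shown are made up for illustration; the point is that the tag is the only input, and it keys both the source checkout and the matching Debian config.

```python
# Hypothetical sketch: kick off a parameterized Jenkins build whose only
# inputs are a tag and a component. Job and parameter names are invented.
import jenkins

server = jenkins.Jenkins('http://jenkins.example.com:8080',
                         username='builder', password='api-token')

# The tag picked from the drop-down doubles as the git tag to build and
# the key that locates the matching internal Debian config.
server.build_job('openstack-package-build', {
    'SOURCE_TAG': 'icehouse-2014.1-cc1',   # hypothetical tag convention
    'COMPONENT': 'nova',
})
```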
We still have a lot of manual processes involved in deploying. Even though the system is automated, we have to trigger it manually, and we have a lot of configuration — Sridhar will go into that later. And we need more testing. When you look at the big picture, what you have to ask yourself is: do I feel comfortable letting this system push source into QA, or into production, automatically? That should be the final point you reach — you feel comfortable using the system without doing anything manual. At least for us, that's the goal, so the more testing we have in place, the better.

All right, let's take a look at how everything fits together. This is what we thought CI/CD would look like when we got started — we just didn't know. It sounds so big, right? We thought, oh, we'll just get some CI/CD from open source and implement that. What we found out is that everybody's a little bit different, everybody has different needs in their business, and for us it really came down to figuring out what we needed and learning how to tie everything together.

This diagram shows the big picture of the current state of our implementation. We have developers contributing upstream. They also rebase off of milestone releases to carry patches when we have to. We then grab the Debian configs, and using Jenkins and pbuilder we build those artifacts. At that point we reach out to the community and bundle together the community version we need with our local patches, and then the packages are available for Puppet and for cobbler to provision our QA and production systems.
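As a rough illustration of the build step in that diagram, here's what a thin Python wrapper around pbuilder and dupload could look like. This is a sketch under assumptions about a typical pbuilder setup; the paths, base tarball, and dupload target are placeholders, not our actual configuration.

```python
# Sketch of the build-and-upload step: build a source package in a clean
# chroot, then hand the result to dupload. All paths are hypothetical.
import subprocess

def build_and_upload(dsc_path, changes_path):
    # pbuilder builds inside a throwaway chroot, so one build host can
    # target several releases just by swapping the base tarball.
    subprocess.check_call([
        'sudo', 'pbuilder', '--build',
        '--basetgz', '/var/cache/pbuilder/precise.tgz',
        '--buildresult', '/srv/build-results',
        dsc_path,
    ])
    # dupload pushes the resulting .changes (and the .debs it references)
    # to the artifact host nicknamed in dupload.conf.
    subprocess.check_call(['dupload', '--to', 'internal-ppa', changes_path])

build_and_upload('nova_2014.1-cc1.dsc',
                 '/srv/build-results/nova_2014.1-cc1_amd64.changes')
```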
I'm going to hand over to Prashant now; he's going to talk more about the way we manage our artifacts and our apt repos.

Hi, I'm Prashant Hari, a senior engineer at Comcast. I'm going to talk about artifacting: what our approach was to building and integrating all the artifacts, and how we set this up. I may also have a two-minute demo on Tempest gating, time permitting.

To get started: what is artifacting? In layman's terms, an artifact is any completed piece of work. Mapped to software, it can be software that is ready for deployment — source code, or a package ready to deploy. In the context of this presentation, by artifacts we mean the OpenStack source code, any packages relating to OpenStack, and any other dependent packages that are going to be deployed to the target production nodes.

First we need to identify the problem. The problem is that deploying OpenStack is complex. We need to create packages, we need to keep up with the releases, and it's not just OpenStack we're deploying — we carry along a few other things. There are patches; you might also have system management tools and monitoring tools, and all of these need to be in sync. We have to integrate them, and we have to be consistent on every release.

So there were two challenges we faced. There are all the community releases — Essex, Grizzly, Havana — and internally at Comcast we need to stay in sync with them. We have to take all of that code, consistently integrate all of our internal baseline packages, create packages for the target operating systems like Ubuntu or Red Hat, and deploy them to the target location. That's the problem we're trying to solve.

Our approach, as Victor mentioned, was a minimum viable product. Our OpenStack journey began with Essex. When we initially deployed Essex we were kind of in the dark: we knew we wanted to do continuous integration, but the solution evolved — it's still evolving. The strategy we adopted was that Essex was our first internal playground on OpenStack; we wanted to get a feel for how OpenStack works. We identified all of our internal tools — that part was fairly easy, because we have a standard set of tools internal to Comcast that needs to be deployed in any production environment — and we had our internal package repositories. The only challenge was that most of our systems were Red Hat-based, so we spent a considerable amount of time converting all of our internal tools to deb files, because our OpenStack production runs on Ubuntu precise. For the OpenStack Essex packages themselves, we used the external apt sources directly. That was our first venture into this.

When we moved on to Grizzly and Havana, that was the first time we were rolling out our internal patches: we had to patch our Keystone deployment, and we also had to patch Horizon. That's when we realized we needed more control, because in the Essex environment we were consuming the external Ubuntu Cloud Archive debs directly, so we had no control over the patches — when we deployed patches internally, the upstream packages kept changing underneath us. We needed a way to host these packages internally. So when we started with Grizzly, we decided to set up a partial mirror of the Cloud Archive and deploy from it internally; that was phase two. We started downloading all of the external package dependencies and created PPAs for the OpenStack releases. We introduced Jenkins for continuous integration and built automations — I'll come to the workflow in a moment and show you our current solution. Jenkins would call the automations to kick off the artifacting and CI process.

With Icehouse, I'd say we're in good shape on CI/CD. We have automated builds and workflows, we've introduced filtering of upstream packages from multiple LTS repositories and other external subsystems, and we're doing production gate tests using Tempest — we've integrated Tempest with a Gearman job server. For Juno and Kilo we'd like to leverage Rally for benchmark tests, and we're also enhancing our artifacting so the apt repositories use an object store backend.

The tools we currently use for artifacting: reprepro for package management, and germinate — plus some automation around it — to verify package dependencies. On each release, whether it's Icehouse, Juno, or Kilo, we have criteria-based filtering that uses germinate to get all the packages from upstream. Puppet is our configuration management; we use Puppet for deploying into production.

So this is our workflow. As Victor mentioned in the earlier slides, pbuilder generates all of our internal custom patches. After the pbuilder process completes, dupload uploads our packages to the artifacting nodes — the PPA nodes you see here — where all of our CI automation and package repositories live. Once the patches are in place, the CI job invokes build-ci, a Python script, which kicks off the process of building the entire release.
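Our build-ci script is more involved than this, but as a toy sketch, a driver in that spirit might stamp out a dated, self-contained apt repo and pull the built packages into it with reprepro — something like:

```python
# Toy sketch of a build-ci-style driver (not our real script): create a
# timestamped, self-contained apt repo and include the built .debs via
# reprepro. Directory layout and repo codename are hypothetical.
import glob
import subprocess
import time

def assemble_release_repo(release='icehouse'):
    stamp = time.strftime('%Y%m%d%H%M')
    repo = '/srv/ppa/{0}-{1}'.format(release, stamp)
    # reprepro expects conf/distributions to exist under the repo dir;
    # we assume a template for this release was copied in beforehand.
    for deb in glob.glob('/srv/build-results/*.deb'):
        subprocess.check_call(
            ['reprepro', '-b', repo, 'includedeb', release, deb])
    return repo

print(assemble_release_repo())
```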
So internally what's happening is like we'll see I Uses calls the filter packages filter packages. We have set of configuration files where to know what all dependencies are there for different OpenStack releases it downloads that and also it creates like it we didn't want to complicate this basically so we We want to make it pretty much self-contained repositories like app repositories. So What we did was like For every OpenStack releases we create like a timestamp based PPA Repository and that would have that would host all the Upstream packages and all the patches and also all our Comcast internal tools. So the advantage what we have here is like Once we have this setup during the production gating So if you wanted to do like when you do gate tests we can We just have to pick up I mean just cherry-picked that particular timestamp releases and then we can start doing the gate tests So currently in our production gating we are you're kind of Using both manual testing as well as tempest the workflow for yet, so So the gate So what we have basically is like so once the once the packages the QA packages are created Jenkins So we are using the tempest community Code but I mean we did some rap we built more automations on Tempest basically like we have like a set of tempest clients that would invoke test and then Send it to a gear man job server we have workers and then workers would subscribe to gear man and then perform the test Does run the tests to the open stack at the production endpoint and also like I mean we can do like In the lab we have like you a Gating test so the test basically like the the tempest client Sends the region information where the test needs to be done workers subscribes that performs the tests on the target node Sends back the result and saves the result in Mongo. So I'm going to show you like a two minute. I mean short demo on This so to get a feel of it. So we did so basically the automations which we did was The back and the worker is basically running the same tempest run The nose test and the run test. That's what it's doing the worker we Had a lot of It was pretty much challenging for us to do this I mean we simplified a lot of simplification we did a lot of simplified the solution We had to spend like a lot of hours like I mean identifying the tempest configuration right? I mean The if you say like the tempest configuration pretty much complicated and we wanted to know Which configuration works for us so we had to spend like a lot of hours tweaking the tempest configuration files and So we so we had basically like I mean have template tempest configuration template and all the the tempest client what it does is it's just going to pass the region information and we already have like the configuration file that works for us and during runtime it creates the configuration file and executes the test and Sends back the result to Jenkins So This I mean the workflow which I explained this a short two-minute demo we have the tempest client that is Normally it's executed in the Jenkins for the demo sake, I mean We are running it manually. So that's the worker that that's waiting for requests to be sent So the Jenkins has sent Invoked a test and it has also passed the region information where the test needs to be executed. 
So Yeah, so the the test test picked off and Started creating images So we had to patch most of these tests some of the tests didn't work for us because in our I'll come to the challenges later, but So we are also storing the test Performance data so that we could we could feed that back to our KPI tools which we have Internal to Comcast and also we could do yeah, so Yeah, did the test it sent back the result and it's also saving the data in In a time series. We are basically using Mongo. We are also using the same tool for our operational Dashboards the idea is also we are also using the same tool for periodic health checks on our production setup now coming to the Challenges we had on tempest Huh So I think I don't know that I mean tempest is a valuable tool. So but making it work was It was a real tough task for us So we we had to spend like a lot of hours to make it work some of the tests One of some of the community scenario test didn't work out of the box for us because I was set up Used multi regions and we are also using provider networks So when we initially rolled this out the scenario tests the region configuration so the service the service definitions in this tempest Configuration file even though we mentioned the region information that I was not getting picked up So I think that that has changed in the current version of tempest So they have already the community has already merged the the service Client as well as the as well as the tempest client has been merged a single namespace So that is fixed now, but I mean we when we are rolling it out that didn't work and for for the multi but for the provider network there is already a Tempest scenario test which is under review So we are we currently we are testing this internally we but we are also running a patched version of it So so if any of you like our testing in your environment if you are using provider network I'd suggest like I mean to review this code. I think it's what it works and Please vote and vote on that Tempest scenario tests, so Can you yeah, so just to put things together So we had the Gate tests that's what I demoed now I mean we had the automated tempest tests and then the once that automated tempest tests are completed We so so we have cherry-picked one particular I mean a specific QA build and that QA build would be Promoted to a production repository like a golden repository and that repository is what will be deployed into a production node, so With that I'm passing on to Sridhar. 
With that, I'll pass it on to Sridhar, who is going to talk about how the packages are deployed from the production repositories, and about Puppet.

Thank you, Prashant. I'm going to cover the deployment phase of our CI/CD pipeline. We use a bunch of tools. There are two phases to the deployment process: the first is getting an OS provisioned on the nodes, and the second is configuring the OpenStack services to run on the nodes once the operating system has been installed.

We use cobbler as our imaging API, and we drive cobbler through Puppet. Hiera is our internal data model: you first define what an OpenStack region looks like, and all of that information is then replicated for different regions. Our manifests are the same across all regions; the data portion is what changes between them.

For the first step during installation, when you cobbler a node, you need to know some information about it: you need the MAC address, you may need to enable the option ROM, or change the boot order so that PXE is the first thing that starts. We do this with a bunch of scripts. Depending on your hardware, some of this can be obtained through IPMI; other times you have to use the vendor-specific out-of-band management API to get that information.
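For the IPMI-capable hardware, that pre-provisioning step can be as simple as the following sketch using ipmitool. The BMC address and credentials are placeholders, and on some vendors' gear you have to go through their own out-of-band API instead.

```python
# Sketch: point a node at PXE and power-cycle it so cobbler can catch it.
# Host and credentials are placeholders.
import subprocess

def set_pxe_first(bmc_host, user='admin', password='secret'):
    base = ['ipmitool', '-I', 'lanplus', '-H', bmc_host,
            '-U', user, '-P', password]
    # Make PXE the first boot device for the next boot...
    subprocess.check_call(base + ['chassis', 'bootdev', 'pxe'])
    # ...then power-cycle the node into the installer.
    subprocess.check_call(base + ['chassis', 'power', 'cycle'])

set_pxe_first('10.0.0.42')
```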
As Vic and Prashant mentioned, we're Debian-based. We use a mix of precise and trusty: we're currently testing trusty, and production is mostly precise. We use a preseed-based installation — I'll talk a little bit about our challenges with preseeds and where we're going in the future. We have different profiles depending on the function of the nodes: our storage nodes have more disks, more memory, and maybe slower CPUs, while our compute nodes typically have fewer disks — local disks, or SSD-backed disks in case we need ephemeral storage to be fast — and depending on that, we create preseeds to match the flavor we want to deploy.

That's the current phase. In the future we want to avoid so much manual disk discovery: even though it's scripted, somebody has to run the script and feed that data into our Hiera data model to generate the configuration. What we're moving towards — and some of this is already in flight — is an internal configuration management database (CMDB) that tracks the entire lifecycle of the hardware, and the OpenStack configuration too. When a new node comes up, the CMDB detects that it doesn't know it and automatically provisions it with a tiny RAM kernel to do discovery. We discover the information and put it back into the CMDB, and at that point we know what the node looks like in terms of hardware — number of disks, memory, CPU, everything — and an engineer or operator can then define that node's purpose within OpenStack: whether it's a controller, an API node, or a storage node.

I also mentioned our issues with preseeds. One of the things that hit us with Essex, when we were using the upstream repositories, is that the upstream Ubuntu repositories update packages regularly. So when a node dies — say a drive crashes — and you rebuild it, you don't want it to be different from every other node in the cluster. What we found was that there were subtle changes in the way the Essex APIs behaved in a later package, and that caused issues for us. That's the reason we now host all of those debs internally: when we deploy, we deploy against the internal Debian repo for the OpenStack service debs, and everything else comes from upstream.

On hardware lifecycle management: when the node first comes in, we don't know anything about it, so we do automated discovery and post that information back. Then the node goes to the next phase, which is hardware burn-in — we want to discover hardware issues early in the process, rather than after you've installed it and deployed it to production. Once it passes burn-in, the plan is to then go through the actual OpenStack installation and configuration.

All right, so this is our current way of deploying OpenStack. Production is mostly Havana and a little bit of Essex; Icehouse is still in QA. We're using Puppet — again with our Hiera data model — to configure the OpenStack services once an engineer has defined the data model for an OpenStack region. As you know with Puppet, you need the OpenStack services brought up in a certain order for things to function well; otherwise you have to do multiple Puppet runs to get to the final state. The way we do it today is mostly scripted and automated, but a person actually goes through the sequence, bringing up nodes: you bring up the storage first, so that your object and block storage are there and all of your Glance images can land in them; the next piece is bringing up the load balancers and API nodes; and then bootstrapping the Galera cluster. Those steps are a scripted but manually driven process today. Once the initial build is done, we don't have to do anything more manually.

So the current pain point is that it takes an engineer to go through the initial configuration to stand up a new OpenStack region.
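In the spirit of that sequence, a bare-bones driver might just walk the node groups in order and do a one-shot Puppet run on each. The host lists here are invented, we assume key-based SSH, and the scripted reality has retries and health checks between phases.

```python
# Sketch of the bring-up ordering: storage first, then LB/API, then
# Galera bootstrap, one converging Puppet run per node.
import subprocess

PHASES = [
    ('storage',    ['storage01', 'storage02', 'storage03']),
    ('lb-and-api', ['lb01', 'api01', 'api02']),
    ('galera',     ['db01', 'db02', 'db03']),
]

for phase, hosts in PHASES:
    print('== phase: %s ==' % phase)
    for host in hosts:
        # --onetime/--no-daemonize gives a single foreground agent run.
        subprocess.check_call([
            'ssh', host,
            'sudo puppet agent --onetime --no-daemonize --verbose'])
```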
What we're working on, and what we hope happens in the future, is to drive all of that through our configuration management database. We're using an orchestration tool called Rundeck to build out the entire environment. It's a workflow tool, similar to Ansible, and it's been used internally for other projects, which is why we're using it here. Once you define the entire workflow for building a region, you can have Rundeck build the whole OpenStack environment for you without any operator intervention. The CMDB then becomes the external node classifier; we no longer use Hiera. The manual data-entry step goes away too: where somebody used to have to build out the data, almost all of it is now discovered. There's very little information someone needs to input into the configuration management database — things like which network blocks we have for the OpenStack region, and the region name — everything else is discovered, and our CMDB tells Puppet how to configure nodes once they check in.

Yesterday there was a really good session on Ansible in the Meridian room, and we heard that lots of the community is using Ansible to do this same orchestration work. We feel we're probably going to use Ansible instead of Rundeck and leverage the larger community, rather than trying to build something on our own. That's it.

Sorry — to summarize, here's a lot of contact information. We've gotten a lot of good information off of the ci.openstack.org site. Just reach out to us; there's a lot of detail we couldn't go into because our session was limited, but we're here for you. These IRC channels are great — toss up some questions. You might hear crickets for ten or fifteen minutes, but someone will get back to you. And we're always looking to start, and continue, a dialogue about the best way to implement this. Does anybody have any questions?

[Audience question about promotion and tags.]

So it's a copy, and then yes, we use the tag within the repo. Today, Hiera knows which tag to use — what the production tag is — and later, our configuration management database will know which tag goes into which region.

Was your question specific to the package repository — moving the packages from the timestamped QA builds to a production repository? Yes. That part we handle through reprepro itself; it does a good job. Once we know the packages in that QA build work, we move the packages directly to the production repository. We're not reusing the QA build repo itself; we move all the packages that were in it, and then feed that through Hiera.

To add to that a little bit: we currently keep a bundled copy of all the community packages and ours, and we have a Jenkins job that promotes it to production, which basically tags it in the apt repos as production instead of QA. Puppet then knows to look in that portion — this is a production version, not a QA version. We're looking at ways to consolidate, because obviously that's not the best way to do things; it's not storage-efficient. So we're looking at keeping one copy of the milestone and pulling from it as we build — almost like frozen versus global, where the global repo would constantly pull from the community and the frozen repo would be the milestone we're building against, Icehouse or Juno.

[Audience question about hardware burn-in.]

Yes, that's still in the QA phase. We use things like CPU burn, or just stressing the disks, to weed out bad hardware. We're also interested in doing a lot of stress testing.
We don't do a lot of that right now; stress testing is mostly manual. As Prashant's slide showed, in the future we're going to use Rally for that. Anybody else? All right — well, thank you for your time.