Ewen McNeill: I'm the convener for the miniconf, and we're going to get started now. Our first speaker is Javier, who's going to talk to us about configuration management: a love story.

Javier: OK, welcome everybody. My name is Javier. I'm here to talk about configuration management, and about love. It's really tough to go after the keynote, so please take it easy on me.

So what's the story with this story? I think the first time I heard about configuration management was at another great Linux conference. It was 2010, and I fell in love with the topic. Then I followed all the streams about DevOps, and it happened that I moved to live in Australia. I started working for a company, REA Group, that manages realestate.com.au, and I've been doing a lot of configuration management since then.

The story behind this presentation was that we had the Puppet Labs guys coming to our office one day, and they asked this awkward question: configuration management, what are you doing about it? We were like, oh, that's a long story, and started sketching on the whiteboard. A few days later, one of my colleagues, David Lutz, said to me, we have a slot in one of the meet-ups I'm running, which is Infracoders. Can you put together a presentation for that? I said, well, this could be a good story. When is the meet-up? And he said, tomorrow. I said, OK, I'll try to do that.

So I thought, what is going to be the fastest way to get a presentation together? I had this idea of just drawing on a whiteboard and taking pictures of the different things that we are doing. So that's what I'm going to be showing. We are going to go through four stories; hopefully I have enough time for all four of them. I'm going to try to be quick.

So let's start with the first one. Hands up who has been in love at any time. Excellent. Hands up who is doing something about configuration management. Cool, a few of you. Hands up who has ever hated configuration management. We have a few as well.
So, the first love. Everything is going to be perfect. She is the one. She has everything that you need; she is the one. So here we go with the story.

Up in the top left we have a calendar, which gives rough dates for when things were happening in the company. You can imagine it was a smaller company than it is today. It had only one data center. It had a small dev team and a small ops team, typical. So, a few servers. The deployments were done manually, and configuration was done via SSH: someone would SSH into one of the servers and craft the configuration.

Then the dev team keeps growing and growing over time. Requirements make you want another data center, for DR, for running active-active. So things keep growing, and the number of ops people slowly grows with it. But the dev team really wants to deploy, and deploy fast. We want more features; we want more business value delivered. So what can we do? Our brave ops go and start deploying into the data centers as fast as they can, until they hit a problem and we get an incident. Oh, slow down, slow down. We start to get that tension between the two groups that we all know about. Everybody is angry; nobody is happy with what's happening.

Then suddenly someone goes to a conference and comes back and says, oh, I heard about this tool. I fell in love with this tool. It's called Puppet, and it's great. It's going to solve all our problems. So the team goes and starts implementing it. It's a good team, so they start crafting a few pieces of code and put it together. They push it to Git, then we have a Jenkins job that pushes it to two Puppet masters, one in each data center. And then we have a lot of servers running the Puppet agent in daemon mode, and they start pulling that configuration. The configuration gets pulled, and everything is fantastic. End of the story. Hooray, we're winning.
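The flow just described (code pushed to Git, a Jenkins job feeding the Puppet masters, agents pulling in daemon mode) is a pull model: each node periodically asks a master for its desired state and converges toward it, touching only what has drifted. A toy sketch of that convergence loop in Python (all names and data invented, not actual Puppet internals):

```python
# Toy model of pull-based configuration management: a "master" holds the
# desired state per role, and each "agent" pulls it and converges local state.

DESIRED = {  # what the masters would serve, keyed by node role
    "web": {"nginx": "installed", "app": "v42"},
    "db":  {"postgres": "installed"},
}

def pull_catalog(role):
    """Agent asks the master for the catalog matching its role."""
    return DESIRED[role]

def converge(local_state, catalog):
    """Apply only the differences, like an agent run; returns changes made."""
    changes = {}
    for key, want in catalog.items():
        if local_state.get(key) != want:
            local_state[key] = want
            changes[key] = want
    return changes

node = {"nginx": "installed"}           # a web node that drifted: app missing
changes = converge(node, pull_catalog("web"))
print(changes)                          # only the drifted resource is touched
```

The key property is idempotence: running `converge` again immediately afterwards makes no changes, which is why the agents can safely run on a schedule.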
But then things started to get complicated, and I've tried to represent that with the amount of code, the number of pieces of code, that we had in that repository. We tried to make abstractions, we tried to use classes in different places, and we tried to reuse code all over the place. So what happens is you push a piece of code, and it not only breaks the server you were deploying to, it breaks a lot of other servers that are related to that piece of configuration.

So then the ops team starts the typical talk: oh, it's an incident, we need to meet our 99.99999-whatever. This is not going to happen again. We need to do this manually. This cannot happen; we cannot break a lot of servers at once. So when the dev team comes again and says deploy, please, the ops team only had one way: removing the Puppet daemon from the servers and doing the deployments manually. What could possibly go wrong?

The ops team would put everything in Git, it gets pushed via Jenkins to the two data centers, and then someone SSHes into a box and applies it. OK, that's all right. But when you haven't done that for a long time is when you start to have problems. Maybe you have updated the same server a lot of times, but there are a few servers that you never update. So whenever you try to update those, you find that you have 72 changes to apply, and three of them are errors. So what do I do now? I need to deploy. I need to deploy now. What could I do? Someone had a great idea: I will SSH into the box and make my change, and then I'll copy that into configuration management. What a great idea. Yep. Commit. It's already done. You can imagine this over time and over time and over time. It turns into a really, really difficult situation, really difficult to manage. Loss of confidence and lots of problems. So we had some configuration management, but it was probably done incorrectly.
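The "72 changes to apply" problem is configuration drift: once the agents stop running, the gap between desired and actual state grows silently until the next apply becomes terrifying. A dry-run check in the spirit of Puppet's `--noop` mode counts the pending differences without applying them; a minimal sketch, with invented example data:

```python
def pending_changes(desired, actual):
    """Dry-run diff: list what a config run WOULD change, without applying it.
    Each entry is (key, current value or None, desired value)."""
    return [
        (key, actual.get(key), want)
        for key, want in desired.items()
        if actual.get(key) != want
    ]

# Invented example: one hand-edited file, one resource never applied at all.
desired = {"sshd_config": "hardened", "ntp": "pool.example.com", "app": "v42"}
actual  = {"sshd_config": "hand-edited", "ntp": "pool.example.com"}

diff = pending_changes(desired, actual)
print(f"{len(diff)} changes pending")   # the longer you wait, the bigger this gets
```

Running a check like this regularly, even when applies are manual, at least makes the size of the drift visible before someone has to deploy in a hurry.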
So there comes the next story, which is "the other". Now you're having troubles with your partner, you start seeing other people around, and you start hanging out with someone else. Let's see what's out there. And then you start to have other ideas.

So let's first present the problem. We had our growing development team that wants to deploy. But as we were having a lot of trouble with our deployments, they wanted a staging environment where they could deploy and check things before going to production. Great idea. So the ops team goes and provisions a bunch of things where they can deploy, and then we have a great staging environment. But when you have tens or hundreds of developers that want to deploy, there's a lot of contention on that environment. So they come and say, can I have two staging environments? And you're like, oh well, we'll have to go to whoever is our hardware provider and buy more tin, ship the tin to the data center, and everything. That was not ideal; we had a lot of people waiting.

We are into agile, so we write user stories, we write requirements: as a dev, I want to test my changes in a prod-like environment as early in the process as possible, and I don't want to be waiting for other people. So someone had a great idea. We created a team called Gandalf. Gandalf came to the Shire and said, I'm going to solve all these problems. You just need to hold this ring and I'll come back later. Gandalf was a project put together with some people from the ops team, some people from the dev team, and some consultants coming in. A mix of everything, and it was going to solve the problem. It came with new powers and new ideas. We were like, why don't we use the cloud? The cloud is infinite. You can use as much as you want, as much as you can pay for. And there's not going to be contention, because we can create environments for all our developers.
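If every developer gets their own environment, the tooling behind it mostly amounts to mapping a named profile to the set of server roles to launch for that person, sized down from production. A hypothetical sketch of that mapping (profile and role names invented, not REA's actual tooling):

```python
# Hypothetical profile -> roles mapping behind a "deploy <profile>" command.
PROFILES = {
    "frontend": ["web", "cache"],
    "backend":  ["app", "db", "queue"],
    "full":     ["web", "cache", "app", "db", "queue"],
}

def plan_environment(owner, profile):
    """Return the (smaller-than-prod) set of servers to launch for one
    developer, namespaced by owner so environments never collide."""
    return [f"{owner}-{role}" for role in PROFILES[profile]]

print(plan_environment("alice", "frontend"))
```

Namespacing by owner is what removes the contention: two developers asking for the same profile get disjoint sets of machines instead of queueing for one shared staging environment.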
So the idea was: why not let each developer have their own environment? Then they can test any changes they want. And then we said, well, to make that easier, why don't we write some tooling? With this tooling we are able to deploy easily into that environment. You can just type "rea deploy" and say, I want an environment that has only the front end, or only the back end, or everything. It could go from a couple of servers to probably 30 different kinds of servers and databases. And then the last great idea was: why don't we implement it in Chef? Because we've seen that Puppet is not working. So we had this other idea: why don't we bring in another tool, and that will solve all our problems.

So Gandalf waves his wand, and we created a VPC in Amazon EC2. People were able to write their own Chef recipes to define all the services that we wanted to use, and then, using the tooling, they could deploy into an environment, like that one there. They had everything that they needed, in a smaller size than in production, and they could make all the changes. This is great for hack days: whenever you want to make a change, you just create a full environment, hack a little small thing here, and you can see it happening all across the environment. That was great. And then another team wants to deploy, or even an individual wants a full environment with everything they want to test, and they could do it. Oh great, we're winning. OK, now we've tested it, we can do stuff. Now let's go to prod. Let's see what happens.

Then the ops team, because nobody thought about prod, goes and says, OK, I'll replicate the changes you want in my Puppet and deploy it to the servers. And yeah, that always works. Can you guess why not? Mm, "prod-like". Well, we forgot about a few things here and there. Cloud versus data centers: they're not the same environment, and there are a lot of things that change. Puppet versus Chef.
We were not keeping consistency between the two configuration management systems. Debian versus CentOS: oh, why would we deploy on the same operating system? That's nonsense. Not the same deployment mechanism. Well, you can guess what kinds of problems were emerging from there. So yeah, that worked for a long time; that was kind of the strategy. We kept finding problems, but at least we could have as many staging environments as we wanted, so that problem went away. But many others came with it. So we thought, oh, we need to standardize. We need one way of doing things. We want to get engaged.

The first thing to tell here is how the company changed over that period of time. We no longer had an ops team. We were following agile practices and DevOps principles. We organized the company into different lines of business that align to different business objectives, and inside those lines of business we have different teams composed of people from different practices. For example, in LOB X Team 1, we have an iteration manager, we have a QA, we have developers, and we have one ops person. They have everything they need to create and deploy a full project without depending on other teams, which was great and is working really well at the moment. On the other side we had a platform team, whose duty was to maintain the data center infrastructure, the cloud infrastructure, and some of the services that these teams were using. But we were drawing the line at the operating system: everything from the operating system up, in terms of keeping up with security patches or deploying and managing the application, was done by those teams, removing handovers between different teams.

At the same time, we started to have our first EC2 production environment, and the first thing that we deployed there, which was really successful, was the images for the site.
You can imagine that at a real estate site, 90% of the site is the images. If you can't see the pictures, there's no point in having that kind of site. We moved them from the NetApp into EC2 and started to have a farm of resizing servers there, which was working really well and gave the business the confidence that we could move our workloads from the data center into the cloud. So, more confidence in having things in the cloud going forward.

So what the platform team did was say: why don't we provide an image that everybody can use, whether it's in EC2 or in the data center, that has everything installed at the operating system level that is normally required by all the teams? These are things like LDAP, things like NFS configuration, things like New Relic or Splunk configuration. Everything that is a base and that all the teams are using will be pre-configured and installed in an image that you can deploy in both environments. That reduces the amount of configuration management you need to do on top when you want to deploy a service, which makes it really easy. So what we did is we wrote a bunch of Puppet code that we chucked in Git, and then we have a Jenkins job, again, that creates an AMI and a VMDK image, and we are able to test that before sending it to production. Whenever we are happy with it, the team promotes it to stable, and then everybody can start using it for their deployments. So it's a tested image that is consistent between all the teams and all those different environments I talked about before where people were deploying. The process of deploying for these teams is: we write our code, we push it to Git, then we have a Jenkins job or a Bamboo job, and that packages everything into an RPM that can be deployed on top of that platform image.
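The platform-image pipeline boils down to: build an image from the shared config code, test it, and only promote it to "stable" once it passes, so every team deploys from the same known-good base. A minimal model of that build-test-promote lifecycle (the registry and names are invented, not the actual tooling):

```python
# Toy image registry modelling build -> test -> promote-to-stable.
registry = {}   # image_id -> {"status": "candidate" | "stable"}

def build_image(image_id):
    """A fresh build is only ever a candidate, never immediately stable."""
    registry[image_id] = {"status": "candidate"}

def promote(image_id, tests_passed):
    """Only a tested candidate may become the stable base everyone deploys on."""
    if not tests_passed:
        raise ValueError(f"{image_id} failed testing; not promoting")
    registry[image_id]["status"] = "stable"

def stable_images():
    """What the teams are allowed to build their deployments on."""
    return [i for i, meta in registry.items() if meta["status"] == "stable"]

build_image("platform-2014-07-01")
promote("platform-2014-07-01", tests_passed=True)
print(stable_images())
```

The point of the explicit promote step is that a broken image can never silently become the base for everyone: until a human (or the test suite) promotes it, teams keep deploying from the last stable one.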
Again, that reduces the amount of things that need to be done in the configuration management layer, to make it as simple as possible. So a deployment looks something like this: we have an RPM with the application, and we have the platform image. We put them together and push them out using some automation, first to test, and then, when we are happy, we push to prod. Which makes sense.

There's one small bit that is not there yet. You have different environments, so you may want different configuration variables in those environments. The thing we were doing, similar to what platforms like Heroku do, is to inject those into the service before booting it, and have the service consume them. For that we built what we call a config service, which is just a REST API that you can consume, and it gives you all the information you need that is related to that environment and is not going to be packaged in the RPM. That way we can deploy the same image and the same application into dev and into prod, and the only thing that changes is those small pieces of configuration. At the time there wasn't anything out there that we liked. Probably if we rebuilt this we would use some of the products that are out there, like Hiera or etcd; there are many options for that. But at that point it was, let's build our own. It probably wasn't a good idea. So you put all that together, and then you can deploy and remove some of the problems that you would normally have in your deployments. That's the full thing put together: we have the platform image, the app deployed as an RPM, and the configuration.

Finally, we are getting closer to today. We have this idea that you have different teams, and the teams are independent of each other. Why do they all have to marry the same girl? They can make their own choices. We are seeing that a lot with other technologies, for example languages.
People are starting not to use the same language for everything; they choose the programming language that is best adapted to the application they are creating. The same thing with databases and persistent storage: you don't use the same kind of database for everything, or the same deployment mechanism for everything.

Again, the infrastructure environment has evolved a little bit more. At this time, in 2014, we had Amazon creating the region in Sydney, so we had the opportunity to connect the Amazon region to our data center using Direct Connect. We had a tiny connection directly from the cloud into our data center, and that helped us migrate applications really easily. Which in the long term maybe wasn't the best idea, because when you want to move to the cloud you want to re-architect your applications to be more cloud-like, and that connection sometimes stopped the engineers from thinking about how to move stuff and how to decouple things between the old way of doing things and the new way of doing things.
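Coming back to the config service from a moment ago: the idea is that the artifact (platform image plus RPM) stays identical across environments, and only the environment-specific values are fetched from the small REST API at boot and injected as environment variables, much as Heroku does. A sketch with a stubbed-out lookup standing in for the real HTTP call (the endpoint shape, keys, and values are invented):

```python
import os

# Stub for what would be an HTTP GET to the config service, e.g.
# GET https://config.internal/environments/<env> (endpoint invented).
FAKE_CONFIG_SERVICE = {
    "dev":  {"DB_HOST": "db.dev.internal",  "LOG_LEVEL": "debug"},
    "prod": {"DB_HOST": "db.prod.internal", "LOG_LEVEL": "warn"},
}

def fetch_config(environment):
    """In the real system this would call the config service's REST API."""
    return FAKE_CONFIG_SERVICE[environment]

def inject(environment):
    """Inject per-environment settings into the process environment before
    the service boots, so the same image + RPM runs unchanged everywhere."""
    for key, value in fetch_config(environment).items():
        os.environ[key] = value

inject("dev")
print(os.environ["DB_HOST"])
```

Because the application only ever reads `os.environ`, promoting the exact same artifact from dev to prod changes nothing except which set of values gets injected.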
At the same time we created this circle. As a team, you have the autonomy to decide what the best options are for you in terms of technology, and you have the autonomy to make your own choices. The team gets together and decides what technologies they want to use, and that brings back the accountability: every system that you build, you're going to have to maintain. So you have to think beforehand about whether you want to use this technology or that technology, and you are empowered to do whatever is needed to build your services.

Then there's all this conversation about economies of scale. Well, if everybody's using different services and different technologies, how do we share as a company? If someone moves from team A to team B, they're going to have to learn all the stuff that this team is using, and if we hire people, they're going to come from different backgrounds. And sometimes there's this view that the teams are doing the same thing, they are reinventing the wheel, because we have probably 10 different ways of deploying. The idea was, well, we fancy the first option, having the teams accountable and autonomous, more than having the economies of scale. We know that there's a little bit of waste in doing it this way, but we're going to continue doing it that way.

Some ways that you can try to solve that problem are guilds and the open source model. We pride ourselves on being a sharing-culture company, where everybody is happy to help someone else, everybody is happy to teach other people, and there's a continuous conversation going on. So we have this thing called guilds, which other companies are doing as well, where people get together around a shared interest. There are different guilds; we run one called the ops dojo, where we get together and practice operational skills and people learn about operations. But there are many other guilds, including things like a happiness guild, or a public presentation guild, or anything.

The other part that we use to bridge that gap of teams doing their own thing is trying to follow something like the open source model. In the open source world, everybody can start a project and decide what kind of project they are building, and then there's a community that builds up around it, a supporting community. If there are two projects that are attempting the same thing, eventually they may come to an agreement and converge on one. Or if both of them are successful, they will continue doing their own thing, but that's only if they are happy with that; if not, they can switch. So the idea is that, through all that sharing and all those guilds, we have the opportunity to broadcast what we are doing, and the best projects will win, and people will start to adopt those, instead of a top-down approach where we push the teams to use the technology that we think is better.

So again, coming back to the autonomy: the teams are happy to see what other people are doing and to embrace tools and practices from other teams. For example, what we've seen a lot, now that we are using more and more Amazon Web Services, is that CloudFormation is becoming the base that we use to build our infrastructure. Before, in the data center model, we couldn't really decide and we couldn't really change the topology much. Now, with CloudFormation, the teams can decide what they do and how they build: well, I need a VPC, I need a load balancer, I need this kind of database. And they can work independently, which is great. It's somehow replacing some of the configuration management tools that we had before, because it's allowing us to shape the infrastructure that we need, so it
covers the deployment process. Some teams will use the platform image to do the deployment, again to shorten that gap and minimize the amount of configuration management required. But other teams say, you know what, in that platform image there are a lot of things that I don't use. I don't need all that stuff, it's making my boot time really slow, and I can cover it easily with my own configuration management. So they decided to stop using the platform image and move to an Amazon AMI. That's totally fine; we're happy with that. Maybe it's more work for them, but they can get better results. And then again they can choose whatever deployment mechanism they want; some people choose Ansible, for example. What they do is define the infrastructure in CloudFormation: this number of load balancers, this number of servers, these auto scaling groups, these databases, whatever. And then when it comes to the deployment mechanism, we normally have a CI server that runs a deployment process, and that uses the configuration management to do the final steps of the configuration of that server.

Continuing: team 1, for example, is using the Amazon CLI to do the deployment, but other teams, for example team 3, have decided to build their own tool. Again, coming back to the kind of teams that decided they could build something better than the CLI, something more adapted to what they need and want to do. So some teams have created their own Kool-Aid. Some teams are moving into Docker, and the teams moving into Docker at the moment are building their own tool to wrap around Docker, to make it work the way our deployments are required to be done. In this case, for example, they are using the Ruby SDK instead of the CLI. That is totally fine, again. And they decided that the platform image is a really good idea for them, so they continue using it. And some say, oh, you know what, I'm not building an RPM, I'm not doing anything else. It's so easy for me that the only thing I do is just bash: deploy this RPM. So yeah, again, different choices, which I may agree with or not.

Questions?

So we have about five minutes for questions. You were faster than I expected.

From, I think, 2013 you're talking about having images with, I think, everything installed that might possibly be needed. That seems so weird to me. Do you mean that every system had an LDAP server installed, every system had a...

Oh no, no. What we have is the client. For example, we have an LDAP server in the data center, and when you deploy new servers, they may be using that LDAP server for authentication. So what they have is the client configured to go to that LDAP server whenever anyone wants to log in. In the platform image we're only putting the LDAP client. Does that make sense?

Yes, but is it really so hard to just configure it in a configuration management system?

Is it so difficult to do it in the configuration management system? I think that relates back to the first story, where we tried to do everything in configuration management, and we tried to create classes that applied to every single system, and then when someone wanted to modify those they were really scared, because it touched every single system that we had. And it was really difficult to test. So it can be done; yes, that's fine, you can do it that way, but you can also do it these other ways. This is an alternative that we used to minimize the problems we had at that point. But yeah, there are different ways to tackle the problem. I think the main idea behind the talk is that there are many different ways of doing things; we have tried some of them, some have been successful for us and some failed. But yeah, we had that same problem you mentioned, about an image where they tried to install everything. So we
had another team that was taking care of the Windows deployments, and the idea there was, well, what about if I put in everything that could possibly be needed? I have an SQL Server, I have everything I need, so the deployment is really easy and the image can be used by everyone. But our image was much leaner. It was really, really small; it only had the things we thought 90% of the teams would be using. That's analytics, logging, authentication, and just a little bit of our own customization.

Yeah, hi. How do you manage or secure things like database credentials in configuration files?

It depends; there are many ways that we do that. One of the ways we used to do it was through the config service: every environment is able to hit that configuration service and ask for credentials, and those are injected only into authorized boxes, which can then access the other systems.

OK, this one is going to be the last question.

Hi. One of the problems I see with this kind of approach is that you've now gone from supporting one or two different technology stacks to supporting 10 or 20 or 30, and when those development teams leave or recycle, in about six months or a year's time, you've got this huge legacy that you're having to carry. You'll have one person in one team saying, why the hell did they use Ansible, why didn't they use Salt? And suddenly it becomes a real burden on the company, because of all these multiple technologies that are now involved.

The way that we tackle that: normally, as I said, it's up to the team to decide, but there's a bigger unit, the line of business, that is business-oriented and has different streams and different teams. The ownership of the systems is normally given to that line of business, so they try to do some reuse of technologies; they have the opportunity to get some of those economies of scale, and they know they're going to have to support the system over time, so they try to minimize the number of technologies they use and find the right people for doing that, some of that inside those smaller circles. But we thought that the unit shouldn't be the company itself, because when you make decisions at a company level, you're making them for some kinds of cases, and they will not work for everything. But yeah, that's true: this approach comes with some problems, like moving people between teams and having legacy to support over time. But that comes with the accountability part: we're giving you the autonomy to choose whatever you think is better, but you have to have a plan to maintain it over time.

OK, that's all the time we have for questions. Please join me in thanking Javier for his talk.