Actually, I pitched three talks to DevOpsDays KC, and that's two of the titles. "Learning from the trenches" would have been a slightly different flavor of talk. But at any rate: when is your company no longer a startup? Maybe when you can't run it on two people, or can't feed it on two pizzas, right? Maybe all the original founders have already left. You may not know when it happens; it may have already happened. You may be operating at scale, but you're not behaving that way as a company. Modern DevOps, the way I think about it, and the way a lot of the talks here and the literature frame it, means the twelve-factor app, immutable infrastructure, disposable environments, that sort of thing. But a lot of companies, whether they're large legacy corporations or startups that hacked together a 1.0 just to get off the ground, aren't operating that way. They're not doing any of that. So what do you do when DevOps is missing? How do you migrate legacy systems and legacy platforms, and update production systems in place? More importantly, how do you get teams to change behavior? DevOps is a culture, right? A lot of the talks today have covered that. It's a matter of behavior, and behavior is hard to change. So how do you outgrow a startup culture? My name is Caleb Hyde. I was going to be presenting with my colleague Scott Howell, who unfortunately couldn't make it here today. Scott and I worked together at Pinsight Media, downtown KC. I'm no longer there, but he still is, so some of the later content is information he contributed. So I want to give a timeline: a tale told in quarters, driven by profits, signifying progress. When I joined at the end of the second quarter, June of 2015, there was one man at the time, a small man, not a big man by any means, John Todd, JT, kind of keeping the lights on. He was the entire AWS team, the entire infrastructure team.
So we called him John Todd Van Damme. Which, if you've ever met him, he's this really sweet man, a good guy, doing everything he could to keep things running. Pinsight, then and now, has, depending on how you count it, eight or so major product platforms, plus supporting platforms and internal systems contributing to those, like Jenkins for builds. So eight or twelve platforms, supporting systems, and sub-components of those platforms. Five developer teams: Java enterprise developers, mobile developers for iOS and Android. A lot of different stuff, a lot of interesting things going on. And already at that point, a footprint in AWS of about 300 servers or so. Each platform, each team, was kind of its own thing, right? That's where I came in: five developer teams, one man supporting the AWS platform, and I come on to help out. So what do you do in a case like this? You come into an existing company, maybe they call themselves a startup, but they've already got multiple revenue streams and multiple products, and ostensibly you're the AWS steward; you want to help them build some automation and handle deployments. What do you do? Well, it's a lot of little things, and it actually turned out really well. This whole conference has touched on it: Paulie's opening talk covered a lot of what I was thinking about covering, and a lot of this came up in the open spaces yesterday, but from the management and leadership perspective. My talk, my perspective, is more from the individual contributor. You come in, you don't necessarily have the leeway to make purchasing decisions and that sort of thing. What do you do?
So JT and I had a ticket queue, a backlog of requests. We had developers who would walk into our cubes asking for deploys, sometimes three different people standing in your cube at once. We had Slack; we used to joke it was like "Slack a human" to open a request for a deployment. So we started consolidating that, put it all through one single ticket queue, and raised awareness that there was a ton of work, a big backlog. In the meantime, in my free time, which is to say somewhere in the 50-to-60-hour work week, I started writing Ansible config management for internal platforms, things that weren't production-facing or customer-impacting. If I could upgrade the Jenkins server or the Artifactory server and provide a little reliability, plus a feature upgrade, because those systems were out of date at that point, maybe that would help raise awareness. The other thing I did, interesting to me at least: there were a ton of servers running in AWS. People had stood them up as needed and named them whatever. We honestly had servers with names like "John's job browser" or something equally generic. There was no consistent tagging or naming. So I pulled down the AWS detailed billing report, which is this enormous file of every billable line item. But it's got a ton of information and you can map it. I mapped it to product platforms and created a month-over-month view of spend across the eight different product platforms and sent that out. I started generating that once a month, manually, and I think it helped raise awareness of where the spend was going and what was going on in our production systems. So that was the first three months or so. Fourth quarter was a lot more of the same. We borrowed someone else from the organization.
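The spend report above can be sketched roughly like this. The tag column name (`user:Platform`) and the sample rows are assumptions for illustration, not Pinsight's actual schema, though the detailed billing report does expose cost-allocation tags as `user:`-prefixed columns:

```python
import csv
import io
from collections import defaultdict

# A toy slice of an AWS detailed billing report. Real reports have many
# more columns; "user:Platform" stands in for whatever cost-allocation
# tag your org uses (hypothetical here).
SAMPLE = """ProductName,user:Platform,UnBlendedCost
Amazon EC2,ads,12.50
Amazon EC2,analytics,7.25
Amazon S3,ads,0.75
Amazon EC2,,3.10
"""

def spend_by_platform(report_text):
    """Sum unblended cost per platform tag; untagged lines land in 'untagged'."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(report_text)):
        platform = row["user:Platform"] or "untagged"
        totals[platform] += float(row["UnBlendedCost"])
    return dict(totals)

print(spend_by_platform(SAMPLE))
# {'ads': 13.25, 'analytics': 7.25, 'untagged': 3.1}
```

The "untagged" bucket is the interesting part: it is exactly the spend that nobody can attribute to a platform, which is the argument for consistent tagging.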
He happens to teach courses in wilderness survival, which isn't germane to this conversation, but he's a cool dude, and I can give you his info if you want to learn about wilderness survival. So we borrowed someone from elsewhere in the org, worked through the tickets, and worked really hard to take the low-hanging fruit and clear out some of that backlog. We had tickets in there that were a year old and hadn't been updated in months. Pinsight also runs a data center in a colocation facility at 1102 Grand, managed largely by a different set of teammates. I worked on trying to bridge that and provide Ansible automation for that piece; they'd have disk failures and the like, need to replace drives and upgrade servers, and I tried to help out with that. It didn't really work out: the data center had a different cadence and a different set of priorities, and it wasn't directly the product platform, so we just redoubled on the AWS work. So that's three, six months in: a lot of work, a lot of long hours, a lot of manual deployments. I'd get a text at 4 p.m., a platform developer saying, hey, I'm ready to do a deploy, and then four hours later: does it look good? Are we good? But that's just the start of it, right? Going into the beginning of this year, the January timeframe, Scott Howell came over. He had been doing front-end development for the marketing sites, and he came over willing to learn DevOps, which is awesome. Having someone with that mindset and that approach is great. So at this point we had our wilderness survivalist, our front-end developer, and myself, and John Todd had left by then. Three people, basically. And really a lot of the same work. Then we got hit by a pretty bad outage. It wasn't production, but QA; we had to take down our entire QA environment and rebuild it.
The upshot is that as a result, I got to write config management for the QA environment and the WildFly Java enterprise platforms. It turned out the production platforms were on an older JBoss version, so we didn't actually deploy any of that right away; it took several more months before those playbooks got used. But we got to write it and get it into the backlog. Also around that time, we stood up a streaming architecture in AWS: Kafka, Storm, Zookeeper, all of it, greenfield. It was the first clean environment we built while I was there, and it was built with config management. It was basically to offload compute from the data center, which held a lot of the storage and compute, was highly constrained, and doesn't auto-scale. That went really well; it ran for about the next six months with essentially no problems, until we upgraded and expanded it and added more platforms to it. On the other hand, I proposed a project to leadership, and got it approved, to provide Ansible playbooks for the existing legacy platforms. But then the release cadence didn't really allow for it; it didn't fit into the developers' two-week iterations, so that project actually got canceled. So we tried a lot of things. Some did pretty well. Some raised awareness, made the developers aware that this approach was maybe producing fewer alerts and alarms on their platforms. Some didn't work at all; projects got canceled and that sort of thing. But the business has to go on. We were growing at this point: the platform was at 500 servers, and we were doing pretty substantial work in AWS. Scott was telling me they interviewed someone just the other day who said, we're running 30 servers, and they had to politely say, that's great, we're running 300, or 600. Which, I mean, numbers don't really mean much, right?
They might be underutilized; it might be a waste of resources. But the point is it wasn't a trivial environment, and it was growing. At this point we had an internal documentation wiki, and we started publishing standards. We put together a diagram of all the platforms. We'd had functional diagrams from different developer teams for one particular platform: that app would write its logs to S3, and that was all the developer knew. But it turned out someone on the ETL team was ingesting those logs. These are hidden dependencies, a lot of interdependencies that no one team or person was tracking. We put together a very comprehensive diagram in Lucidchart. It's a huge page, and down at eight-point font everything is there, every single line and connection I knew about, because I was running the security groups and the VPC, so I knew what would go into that document. I think that also raised awareness of how complex the system was and how many sub-components each app had. Then we hired someone externally, so we had four people, actually hired rather than borrowed from within the company, and our director carved us off from what was previously the infrastructure team, the data center, IT support and that sort of thing. We were made this new team, DevOps. At that point the phrase was there, the team was there, we had four people and a manager, and we were really starting to hit that hockey stick of awareness and success within the org. So this looks like progress, this looks like good things, but there was still a lot of legacy stuff. Like the last talk, which I heard from behind the stage: we had developers who had SSH access, they had sudo rights on their platforms, and they would log in to tail logs and that sort of thing. So what do you do?
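One way to bootstrap a diagram like that, since the security groups already encode who talks to whom, is to walk the SG-to-SG ingress rules and emit dependency edges. The group names and the trimmed data shape below are hypothetical, modeled loosely on what boto3's `describe_security_groups` returns:

```python
from collections import defaultdict

# Hypothetical security groups, heavily trimmed. An ingress rule that
# references another group's ID means that group is allowed to talk in,
# which is a dependency edge worth drawing on the platform diagram.
GROUPS = [
    {"GroupId": "sg-api", "GroupName": "ads-api",
     "IpPermissions": [{"FromPort": 8080,
                        "UserIdGroupPairs": [{"GroupId": "sg-web"}]}]},
    {"GroupId": "sg-db", "GroupName": "ads-db",
     "IpPermissions": [{"FromPort": 5432,
                        "UserIdGroupPairs": [{"GroupId": "sg-api"}]}]},
    {"GroupId": "sg-web", "GroupName": "ads-web", "IpPermissions": []},
]

def dependency_edges(groups):
    """Return {consumer_name: [provider_name, ...]} from SG-to-SG rules."""
    names = {g["GroupId"]: g["GroupName"] for g in groups}
    edges = defaultdict(list)
    for g in groups:
        for rule in g["IpPermissions"]:
            for pair in rule.get("UserIdGroupPairs", []):
                # the referenced group talks INTO g, so the edge is pair -> g
                edges[names[pair["GroupId"]]].append(names[g["GroupId"]])
    return dict(edges)

print(dependency_edges(GROUPS))
# {'ads-web': ['ads-api'], 'ads-api': ['ads-db']}
```

This only captures dependencies that flow through security groups; anything mediated by S3, like the ETL-ingesting-logs case above, still has to be found by asking people.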
You can't just take it away; like the previous speaker mentioned, you have to provide an equivalent or better alternative: forward the logs to CloudWatch, and connect CloudWatch to the managed Elasticsearch service, or Splunk, or any of a ton of things. Actually, more recently Scott showed me they have Grafana running now, a lot of that data goes in there, and they have Grafana dashboards, so the developers don't have to worry about an AWS login and that sort of thing to get access and check on this stuff. So we were starting to turn things around a little bit, getting away from manual work and toward automation. We had also been asking for a while for enterprise support; like I mentioned, I don't make the budget or purchasing decisions myself, but we petitioned for it and secured it. If you've never used enterprise support: you open a ticket, and there's an option where they call you. They call you immediately, put you on hold, and pick up within like five minutes. It flips the whole thing around. Rather than opening a ticket and hoping they get to it in the next 12 hours, they're calling you. Our manager saw this, our director saw this, and again it got noticed; we got some traction there.
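Forwarding logs to CloudWatch is usually done with the CloudWatch agent, but the core of any forwarder is just batching lines into the shape the `PutLogEvents` API accepts. A minimal sketch, where the per-batch limits are assumptions drawn from the documented caps at the time of writing (treat them as such), and nothing is actually sent to AWS:

```python
import time

# PutLogEvents takes batches of {"timestamp": ms, "message": str}.
# Assumed limits: 10,000 events per batch, and each event counts
# len(message) + 26 bytes against a ~1 MiB batch ceiling.
MAX_EVENTS = 10_000
MAX_BATCH_BYTES = 1_048_576
PER_EVENT_OVERHEAD = 26

def batch_log_lines(lines, now_ms=None):
    """Pack raw log lines into PutLogEvents-shaped batches."""
    now_ms = now_ms or int(time.time() * 1000)
    batches, current, current_bytes = [], [], 0
    for line in lines:
        cost = len(line.encode()) + PER_EVENT_OVERHEAD
        if current and (len(current) >= MAX_EVENTS
                        or current_bytes + cost > MAX_BATCH_BYTES):
            batches.append(current)
            current, current_bytes = [], 0
        current.append({"timestamp": now_ms, "message": line})
        current_bytes += cost
    if current:
        batches.append(current)
    return batches

# Each batch would then go to boto3's CloudWatch Logs client, e.g.:
#   client.put_log_events(logGroupName=..., logStreamName=..., logEvents=batch)
batches = batch_log_lines(["GET /health 200", "POST /deploy 201"])
print(len(batches), len(batches[0]))
# 1 2
```

The point of giving developers this path is that tailing happens in CloudWatch, Elasticsearch, or Grafana instead of over SSH with sudo on the box.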
So we kept growing the playbooks, the automation, the build automation, and the internal tooling like I mentioned earlier, Jenkins, Artifactory and that sort of thing, plus deploying new platforms and new instances of Jenkins for other teams, and again just raising awareness. Developers really took notice. We started encouraging them to use Vagrant or Docker for development, because in the past they would say, well, our code's ready, it's ready to go into QA, and we're blocked on DevOps, because they needed a server and the code deployed so they could QA. Well, you can do a lot of that with local development, right? But if you have these interconnected platforms, you need something like Docker Compose to run them together, and we advised them to check it out. We really didn't have time ourselves to write that for them, but they did, and now, Scott was telling me recently, one of our developers is using Docker Compose to run three of the interconnected platforms locally, so they're no longer waiting on us. They get through an iteration without spending half of it waiting on DevOps to turn up a server. And then there's platform coverage for new platforms. I didn't talk much about it, but there's a PMO organization we worked with to provide a gate review, an architectural review, for new product launches. A new product or platform has to have functional diagrams and sequence diagrams at the initial gate, and at the end, when it's ready to roll into production, we have launch readiness checklists, none of which existed before. These are now structural documents: did you load test? Did you estimate the volume of traffic you're going to get? At least do a calculation on that, and tell us how it ought to auto-scale, and what the dimensions for the scaling are, right?
Is it compute-constrained or file-descriptor-constrained? That sort of thing. So this has been six quarters, a year and a half, a lot of work, a lot of long hours, trying things that didn't work, things that failed, but in the end we made a fair amount of progress. Now we're running Jenkins and Docker for builds, and leadership has really taken up optimization, cost, and automation as priorities for themselves and for the company. Slowly, through a lot of trial and error, we effected change. So it's a good thing; it's been pretty nice to rest a tiny bit, maybe. When I started, you'd deploy code by copying it onto a server and capturing an AMI, and the AWS console would say the original AMI for this instance cannot be found, because the images had been ratcheted forward like that for three or four years. It was just keeping things running, keeping the lights on. These days we're releasing multiple times a day; instead of a deployment taking several days of work directly with developers, we can do multiple deploys in a week, mostly through Jenkins and Docker like I mentioned. Most importantly to me: when I started, it was "the infrastructure team" and "we're blocked on IT." Now we have a DevOps team, and we've brought this language into the company: here's the terminology, go look it up, config management, continuous delivery. We're having discussions about how to get to CI and that sort of thing. It's like inception or something, right? You whisper in someone's ear, the project itself fails, and then three months later someone says, what's this thing, Ansible?
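The launch-readiness sizing question from the checklist, estimate your traffic and do at least a back-of-the-envelope auto-scaling calculation, can be as simple as this. All the numbers are made up for illustration:

```python
import math

def instances_needed(peak_rps, rps_per_instance, headroom=0.7):
    """Back-of-the-envelope fleet size: run each box at ~70% of its
    measured capacity so there's slack before the auto-scaler catches up."""
    return math.ceil(peak_rps / (rps_per_instance * headroom))

# Hypothetical numbers: a load test showed one instance handles 400 req/s,
# and the product forecast is a 3,000 req/s peak.
print(instances_needed(peak_rps=3000, rps_per_instance=400))
# 11
```

The same arithmetic works for whatever the real constraint is; if the platform is file-descriptor-bound rather than CPU-bound, substitute connections per instance for requests per second.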
What about config management? And it's largely worked. Like I say, the most exciting thing for us is when the developers come to us and say, hey, I got this running in Vagrant, or I got it running in Docker Compose, and I just ran into this particular issue, and you go, okay, let's look at the Nginx configs or something of the sort. When they come to you having already done the research, read the AWS docs, and they ask, what about ALBs? Then you know you're no longer butting heads, you're no longer the blocker they're waiting on. It's cultural, it's behavioral. So again, my name is Caleb Hyde. That's really all I have. Scott Howell helped me put this together and was my peer at the company, but he couldn't make it today. That's pretty much it. Thank you all so much. This is a really awesome conference.