My name's Nick Bienham. I'm here today to talk about Comcast, otherwise known as the Artisanal Internet Company, it seems. Some good news to start off with: Dr. Nick has volunteered to product-test our new Artisanal Internet product. We haven't told him yet, so when we take him offline, at least he can't complain about it. All right, so my name's Nick, Nick Bienham. I'm an engineer at Comcast. I'm going to leave my job title blank for just now, because a lot of what I'm going to talk about today is almost my own personal evolution. The transitions and transformations that I went through are very similar to what a lot of people are doing at Comcast today. I've been at Comcast for about 10 years, and I tend to do a lot of different things. So, the Enterprise Service Platform: what is it? The Enterprise Service Platform at Comcast is our cornerstone back-office platform. We run pretty much everything through it, from provisioning services to device services. You pay your bills through the Enterprise Platform. We roll trucks to customers. It has a large footprint: over 500 physical servers spread across three data centers in the continental US. The stack is pretty much all Oracle, all the way down. We run on WebLogic, Java EE, with Oracle databases on the back end. So the problem that faced us, when we wanted to do our transformation and move from the old legacy SOA platform onto something much more cloud-native using Cloud Foundry, was that we had a large legacy code base. Some of the code was over eight years old. There's a Mr. Carlson in the audience here, I know. He left us with some of that code. Walked away, never to be seen again. We had a lot of manual and people-based processes. Everything needed a ticket, everything required an email, required some sort of approval.
Our deployments sometimes took three weeks, from database deployments updating schemas, to rolling out large bundles of applications all at once and then flipping switches. So the cycle times were anything from a month to over three months to get features from ideation through to production. The third huge problem that we had was physical infrastructure. Physical infrastructure is the enemy of scale. Looking back at the Golden Gate example, where the guy gets up and paints that bridge every morning: well, we were doing that, except when we got to the end of the bridge, we went back to the start. It took us three months to scale our platform. So by the time we ordered our hardware, had it delivered, racked, stacked, powered, cabled, and cooled in the data center, it was time to start again. Our fourth big problem was commingled data. What I mean by that is people wanted to share data, and they didn't want to separate it, and they didn't want to assign an owner. So it was an access pattern Stephen King could be proud of. It was horrible. And we couldn't change data underneath without maybe unintentionally impacting somebody else. And the last one, one of the more interesting ones, and one of the really hard ones to deal with, is that we already had the BAU and project teams, and they were taking business requests, and we had to enable them without them losing velocity. So we had to be very careful about how much change we could introduce at once, and what their absorption and tolerance for change would be. So in essence, we were moving to a cloud-native microservice model, but we had to look like SOA, we had to talk like SOA, and we had to act like SOA. And that presented its own set of challenges. So, our approach. The first side we knew was going to be cultural. We had to start changing the way people thought about code: how they thought about delivering the code, the tooling that they used, how they wrote that code.
We were used to having product teams very much isolated, very much siloed. Everybody had their own little job to do. Nobody really shared. There were no economies of scale when it came to things like libraries or plugins or development methodology. Even when you look at some of the CI/CD platforms, everyone had their own copy of Jenkins. Everyone sometimes even had their own copy of Sonar, Artifactory, things like that. We wanted to bring a lot of people together. We wanted them to start working in unison and collaborating. And then the other side of this was the technical. And the technical was a much easier problem to solve. One of the things we wanted to do was innovate, and we wanted the feedback to come back to us as fast as possible. So we unshackled everybody. We said, we're no longer going to be prescriptive and say you must do something a certain way. We're going to give you a set of outcomes, and we're going to define them like the definition of done. If you can meet all these outcomes, then you have succeeded. And how you succeed is up to you, as long as you meet the outcomes. So everybody just went boom. And what we thought was that we would end up with lots of snowflakes. And that didn't happen. And I'll tell you why in a little bit. So, how we used to work. Now, you can have a laugh at the firewalls, but they are almost literal. By firewalls, I mean we looked at things and we said, what are these things? So when I talk about a firewall for communication, I'm thinking: I have to send an email to somebody, or I have to create a ticket. How do I remove these frictions? How do I remove these blocks? And everybody has a label. I'm a developer, I write code. You write tests, you run tests. Or worse, the QA says: I write the tests, not you. It worked both ways. And the operators were quietly busy trying to replace the other two with Perl, and generally cleaning up after everybody, as we know.
And we would get on to the calls and all the fingers would go like this. And they'd say, it's the network. Or, no, your tests didn't pass. And everyone would end up blaming the DBA, who would either have a fit or go home crying. So what we wanted to do is increase the velocity and shorten those feedback loops. Because right now our feedback loop has to go around almost the whole circuit. And it took time. Some of these feedback loops were weeks long, especially if someone went on vacation and an email was left sitting in an inbox. Definitely suboptimal. So the first thing we did is we reorganized. We moved people physically into the same geographic location, into the same office, into a pod. A pod is about 12 people. They sit around the edges, and there's a big table in the middle where they can have a conference call. But that wasn't enough. Because we still got into the scenario where I'm a developer, I'm a tester, and I'm an operator. What we had to do is remove those labels completely. And that's where we reached out to Pivotal and asked for some help and said, how do we do this? They said, well, you're going to do test-driven development and you're going to do pair programming. And I was like, yes! I'd been pushing change for a long time. I was like, we need to change. We need to do this, and we need to run continuous delivery, and we need to do automated testing and everything else. And then one day, at the board with Pivotal, they said, Nick, you're going to write some code. And I'm like, what? No, no, no, no. I push apps. I don't write code. And in that moment, that was my transformation. Because I got taken out of my comfort zone, much like I am today. And I was put into a space that I had no familiarity with. And at that point, I had been pushing change onto a lot of other people. And it gave me a huge amount of empathy for how they felt.
Pushing change, I had been making people miserable, because now I was like, whoa, this is horrible. So we started pairing. And it was the right way, and it was like an epiphany to a lot of us. This is how we removed the labels. So what we ended up with was a team. One unit, all cross-trained. Everybody knew everybody's job, at least to a functional extent. And once we had that team, we had the shared responsibility, the shared accountability, to succeed as one. So when we got on to calls, when we were troubleshooting issues and bridges and production outages, we didn't have this. There were no arguments. We've all seen this. I had this printed in my cube for a long time, because it was there to teach me that, you know what? You all succeed and fail together. And when you have that, and you all have that shared accountability and that shared responsibility, then you start to succeed. And you don't run into these problems. And the DBA doesn't go away crying at night. So once we'd finished all this, we had to standardize. So this is great. We've done this. We've transformed a single team, 12 people. How do we roll it out to an audience of 600 developers and 700 applications? We can't do snowflakes at this point. And we can't do it one at a time. There are just too many people. So we started to look for opportunities to standardize for scale. Now, these took many forms. And I'm going to go back to the water-cooler chat. When we set people out with outcomes, we said, we don't care how you do it, as long as you meet those outcomes. And I'll give you an example. We wanted to move away from the relational stores into more NoSQL, highly available stores. And again, we didn't tell them how they were going to do it. We just said, go and do it. What happened was that people went out, but they didn't collaborate over things like Slack.
They started to collaborate just by talking to their neighbors. And this was really evident. As things started to coalesce around technologies, the East Coast coalesced around Couchbase, and the West Coast coalesced around MongoDB, almost exclusively. Because the worry that we had, when we said everybody go out and do your thing, was that we would have 100 snowflakes come back. And we didn't. We had two. And that wasn't just for that technology or that area. This happened when we moved to Gradle. Everybody moved to Gradle. We started writing plugins. And we started raising the floor for entry, or sorry, lowering the bar for entry, where you could inherit a lot of functional stuff, like token caching in the security library, concise logging frameworks, everything like that. So when the business teams that we were going to roll on came in, all they had to do was run a little generator and drop in their business logic. Everything else was wired in for them. Even down to provisioning the OAuth tokens: we ended up going with a PingFederate solution that was all managed for you. We did have a couple of patterns we had to solve. We did want to decompose those large monolithic SOA services into microservices. But again, we weren't allowed to have too much impact on the current teams. And that was a hard one. So this was my epiphany. We had to create a router. So we created this little ESP router. It's very lightweight. It's written in Go, which will make Mr. Hicks there happy. And what it does is we pass it a regular expression, and it looks through the SOAP message. And it says, if you're this method, you go to Service A. And if you're another method, you go to Service B. So we could implement a strangler-type pattern to start decomposing our services while remaining transparent to our clients. Now, those Service A and Service B were written as RESTful services, so what we did is we put SOAP facades on them.
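The routing idea just described, matching the SOAP operation name against a regular expression and strangling traffic away from the monolith one method at a time, can be sketched roughly like this. The real ESP router is written in Go; this is only an illustrative Python sketch, and the operation names and backend URLs are made up for the example.

```python
import re

# Hypothetical routing table: a regex on the SOAP operation name maps to a
# backend service. Anything that doesn't match stays on the legacy monolith,
# which is what makes this a strangler pattern.
ROUTES = [
    (re.compile(r"^(getDevice|provisionDevice)$"), "http://service-a.example.internal"),
    (re.compile(r"^(getAccount|payBill)$"),        "http://service-b.example.internal"),
]
LEGACY = "http://legacy-monolith.example.internal"

def pick_backend(soap_body: str) -> str:
    """Pull the first element out of the SOAP Body and route on its local name."""
    match = re.search(r"<(?:\w+:)?Body>\s*<(?:\w+:)?(\w+)", soap_body)
    operation = match.group(1) if match else ""
    for pattern, backend in ROUTES:
        if pattern.match(operation):
            return backend
    return LEGACY

request = "<soap:Envelope><soap:Body><getDevice>...</getDevice></soap:Body></soap:Envelope>"
print(pick_backend(request))  # prints http://service-a.example.internal
```

Because the match happens on the wire format the clients already speak, methods can be peeled off to new services one regex entry at a time, with no client changes.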
Now, if you want new functionality, that's not going to go into the facade. So we have something to get people to move towards the RESTful side of the house. And then for the commingled data, this is a rough representation of what we did. First thing we did: all right, we put in some firewall rules to see who screams. And then we put in an access layer there. These are common patterns. These aren't new things. But to implement this was pain and suffering. It was a large fight, because people didn't want to change. So, going back to my newly found empathy for people undergoing change, we worked through it, and we worked through it together. So I'm going to cover some of the technology pieces now. Just quickly, because I know a lot of us have been at a lot of talks in the last couple of days, and we tend to talk about the same things. So, the platform: we went from WebLogic to Cloud Foundry. We also needed scale of capacity, because we're probably going to need more capacity in Cloud Foundry than we did in WebLogic, just through the decomposition: we're going to be making more service calls. So that 250 million a day is probably going to turn into maybe 700 million a day by the time we break everything down, which will make Greg very happy. Our data is moving from Oracle into a NoSQL store. Most of that is going to go to Couchbase. We implemented things like Cross Data Center Replication to make it available around the country. And then routing becomes quite an interesting one. The old way of doing things was everything was IP-based. We knew all the IPs of our consumers, so we could route them to, like, the farms based on their IP. For service-to-service calls, that's not going to work in Cloud Foundry anymore. So we came up with this idea of consumer-based routing. We know who you are. You put a header into your request, and then we can look it up and route you to generic pools.
So we have consumer isolation, at least at a pool level. That could be within a space, or it could be within an org. And then we can have some sort of isolation to stop maybe some of the heavier consumers overrunning some of the smaller ones. And then our delivery stack, and I've covered some of this already: we made some changes around continuous delivery. We went to Jenkins, and then we went to GoCD, and that was an interesting exercise. What happened is Jenkins just decided to stop working. And we're going to leave it at that. And we had nothing left in the toolbox. We went to reach for the tool and it wasn't there. And that was a big warning sign for us. I was like, no, we're not going to do that again. So going forward, we started building a runway of always having the next tool ready in the box. We didn't want to spend the time we had to spend moving from one platform to another, looking for pipelining functionality and other things, again. I'll circle back around to that topic towards the end. The build: we moved from Maven to Gradle. Gradle gives us a lot more flexibility and gives us the opportunity to create plugins. Again, to lower that bar to entry, to allow people to adopt the frameworks as easily as possible. And our deployments moved from those big bundle scripts, where we would push out en masse and do database updates; it's all driven by our CI platform now in GoCD. Everything is driven by plugins. The fact that Cloud Foundry gives you such a rich API to write your own deployment methods has allowed us to do some quite creative things. And I'm quite proud of this one. This is our zero-downtime deployment plugin. We're going through the process of open-sourcing it, and I'll give a shout-out in the Cloud Foundry Slack channels when it finally gets through legal.
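Before getting into the deployment plugins, the consumer-based routing described a moment ago reduces to a simple header lookup that maps a known consumer onto an isolated pool. A minimal sketch, assuming a hypothetical `X-Consumer-Id` header and made-up pool names (the actual header name and pool layout at Comcast are not stated in the talk):

```python
# Hypothetical mapping of a consumer-identity header onto an isolated pool.
# Pool URLs and consumer names are invented for illustration only.
CONSUMER_POOLS = {
    "billing-portal": "https://esp-pool-large.example.internal",
    "truck-roll-app": "https://esp-pool-small.example.internal",
}
DEFAULT_POOL = "https://esp-pool-shared.example.internal"

def route_for(headers: dict) -> str:
    """Look up the consumer header and return the pool serving that consumer,
    so heavy consumers can be isolated from smaller ones."""
    consumer = headers.get("X-Consumer-Id", "").lower()
    return CONSUMER_POOLS.get(consumer, DEFAULT_POOL)

print(route_for({"X-Consumer-Id": "billing-portal"}))  # the large, isolated pool
```

The point of the indirection is that the pool behind a consumer can be scaled or moved without the consumer ever learning a new address.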
What we wanted to do was deploy our apps in a zero-downtime fashion. So, there's a chap at Pivotal, Joshua Cruck. I hope I got his name right, or at least pronounced it right. He wrote a really cool scale-over plugin, and it did exactly what we wanted it to do: we could scale over our apps without causing outages. The way we did it first is we would deploy one app, and then we would deploy another one on the same route, and one would come up and one would go down. Now, when we were running at about 30,000 transactions a second and 80 instances of one application, as soon as that first one came up, everything went down. So we ended up with an eight-minute outage and unhappy customers. So we had to go back and rethink how to roll it out in a less impactful way. And so, we did the zero-downtime deployment using Joshua's plugin. And then we extended that: we wrapped it, and we managed our routes. So you push your app, your app comes in, it maps onto the route, it will roll over, and then it will clean up the old app. To give you some context here, and this goes towards the cycle time as well: we had a defect request come in at 11 o'clock in the morning, and we accidentally deployed it to production at four o'clock in the afternoon. So not only did we take a three-month cycle time and reduce it to hours, we actually managed to deploy and then had to go and ask for forgiveness later. The canary deployment was then an extension of the zero-downtime deployment. What we wanted was to be able to say: I want to deploy one instance of an application, and I want to pass that URL back to my CI platform, which will then run all the contract tests, et cetera, against it, and then promote that version into production. So this allows us to pre-stage a lot of applications.
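The route-managed rollover described above, map the route onto the new app, verify it, then unmap and clean up the old one, can be modeled in a few lines. This is only a sketch of the sequencing: the real plugin drives the Cloud Foundry API, whereas this models the route table as an in-memory dict, and the health-check step is elided.

```python
# Illustrative blue/green ("zero downtime") rollover: the new app joins the
# route alongside the old one before the old one is drained, instead of
# swapping everything at once (which is what caused the eight-minute outage).

def zero_downtime_deploy(routes: dict, route: str, new_app: str) -> dict:
    """Roll `route` over to `new_app` and return the updated route table."""
    old_apps = routes.get(route, [])
    routes[route] = old_apps + [new_app]   # map: old and new serve traffic together
    # ...in the real plugin, health-check the new app here before proceeding...
    routes[route] = [new_app]              # unmap and clean up the old app(s)
    return routes

routes = {"api.example.com": ["esp-v1"]}
zero_downtime_deploy(routes, "api.example.com", "esp-v2")
print(routes["api.example.com"])  # ['esp-v2']
```

The canary variant is the same flow with the first step stopped early: one instance gets mapped to a test URL, the CI platform runs contract tests against it, and only a manual promote triggers the final remap.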
We have manual triggers in the promote stage, but, again, it will deploy the application, pass it back, run the tests, and then remap the routes to make it live. And this will hopefully all be coming up for open source in the coming weeks. A couple of the other smaller pieces we added: we added a recycle plugin. Some of our consumers like to use Jersey for some reason or other, and to propagate property changes we had to restart it. But, again, we didn't want to restart everything at once. So there's a restart-instance method on the CLI which takes an argument of an array index. We just wrapped that in a plugin. It cycles through the array, spins one up, takes one down, spins one up, takes one down. You can pull in your property changes without impact to your users. And then the other piece that we wrote was a scaler service. And, again, this is all going to be up on our open-source GitHub as soon as it goes through. What we wanted to do was give third-party and remote applications access into the foundation to do things like... So, our use case here was to tie it into our monitoring system, which is done by AppDynamics. So we tie it in, AppDynamics triggers an event, say on high load, which will then scale the application. So, what's on the horizon for us? Still on our to-do list is change management. A lot of external processes require tickets. We hate it, and it slows us down. So how do we reduce the risk and keep them happy, keep security happy with things like compliance, and manage that change log? That is one of the ones we're working through right now. And it's not an easy problem to solve, because a lot of people like to categorize all the changes: these are all the changes that are going to go tonight, these are all the changes that are going to go tomorrow. They don't like us using an API, so we said we'd just create a ticket every day, just in case.
So, we could just deploy when we want. That's some of our challenges around that. On-boarding: we have a lot more teams. We did six teams last year total. We have another dozen to go. In the end, there'll be about 140 services migrating to the platform. Some will be retired, some will be eliminated, some we'll invest more in, and some we'll just tolerate. We're exploring the use of Concourse. I think I mentioned this before. As we started rolling out GoCD, we hit a sort of vertical scale limit where, with too many pipelines, the UI starts to slow down, so we had to start rubber-stamping out copies. Right now we have about 12 deployments of GoCD for various teams. We lose that sort of governance and consistency across the platform at that point. So that is one of the reasons we want to start looking at using Concourse, and then actually deploying it as a service to people. So, again, you commit your code and that's the last time you worry about it. You don't need to worry about your CI solution anymore. We'll provide that for you. We'll provide templates that you can adopt and tokenize and move forward with. And the last one is my little fun side project. I was on a couple of calls where they were asking us, what's the state of this application? And to me, that just seemed like a ludicrous ask. I was like, what do you mean you don't know? Why are you asking a human about the state of a machine? And usually outages don't happen when we want them to happen, like when we're in the office, you know? We tend to be in other places. So we wrote a little bot, just as a proof of concept, to see how we could integrate this with Slack, which is our collaboration tool. So we can ask Marvin, which is somewhat appropriate seeing as it's Towel Day today: what's the status of my app? We integrated him with Ansible, so we can run things like Ansible Playbooks.
He's integrated with that scaler service from before, so we can get health checks and status checks reported right back from Cloud Foundry to Slack. And I guess I wanted something to tie everything together, and our journey. I'm a big Terry Pratchett fan, and I came across this quote: always be wary of any helpful item that weighs less than its operating manual. I think we have a lot of Swiss Army knives. We have a lot of things that want to do everything and do nothing terribly well. And this is how I sort of feel about Cloud Foundry. When you think about it, the manual for the developer is, like, this small. I write my 12-factor app, cf push. That's pretty much it. Everything else is managed for you. So the simplicity, from the developer's standpoint, of how I implement and how I deploy is: you don't have to worry about that anymore. You don't have to read those huge manuals. I remember the Oracle certification manuals being this thick. That's all gone now. And it does what it sets out to do, and it does it very well. And it provides that as a service to you. So, back to where we were. Hello, my name is Nick. But I'm not an ops guy anymore. I'm a developer. I write tests. I write code. I was chatting with our project manager last night, and we were talking about Ginkgo and this behavior-driven development. And he says, you're not an ops guy. You sound like a developer. So now, what am I? I look at myself as a stuff-doer. I do stuff. And I have plenty to keep me busy. Well, that's all from me. And I'm quite happy to take any questions that you guys might have. Okay. Well, if you want to catch me outside, I can always answer them out there too.