Hello everybody, thanks for coming. My name is Jason Rualt. I'm with Time Warner Cable. I've got two members of my team here, Sean Lynn and David Medbury, and we're going to spend some time talking to you about the Time Warner Cable deployment. We think we've done some pretty cool stuff, we think we have some good learnings we can share with you, and we'd also like to talk to you a little bit about our path forward. So let's get going.

Just real briefly, who is Time Warner Cable? We are the second largest cable provider in the U.S. We provide video, phone, and broadband services to all our customers. We're located in 29 states. We have about 15 million subscribers, and a big part of those subscribers are in the two largest markets, Los Angeles and New York City. Now, to support those services, we have quite a bit of infrastructure: four national data centers and over 20 market and regional data centers. So there's quite a bit of infrastructure that needs to go in place to support cable.

Over the last few years, Time Warner Cable has really had a mission, a vision, to transform television: to take it from the style it has been, delivering content at very specific times of day into people's homes on their televisions, and move it to this notion of any content, any time, anywhere, on any device. As you can imagine, making that type of shift in the television industry, the cable industry, actually requires quite a bit of change. Those changes span from technical to cultural to organizational. From a technology standpoint, we're doing a lot to roll out new set-top boxes, new software to support those boxes, and new architecture changes to support transcoding and video content delivery. There are also a lot of new software platforms now for all the different consumer devices, so a lot of technology changes are going into place. From the cultural perspective, we actually have an interesting talk right after this in another room about the culture changes that Time Warner Cable has undergone from an OpenStack perspective, but as you can imagine, culture plays a big part of this: really moving from this notion of very slow, methodical, deliberate delivery mechanisms to very rapid, fail-fast delivery. And then organizationally, we're restructuring our teams into DevOps organizations to deliver on all these changes.

So why OpenStack and open source? Well, we really think it's simple: it's the platform. It's going to give us the ability to deliver on this whole model of very fast, flexible, reliable scale-out services with programmatic interfaces. And there are a lot of other benefits as well; most everybody here is aware of those, so I'm not going to go into them.

So what did we do at Time Warner Cable? About a year ago, in January of 2014, we embarked on a vision: let's stand up OpenStack, with a certain set of requirements that we'll try to hit, and we'll try to do it in six months. The first requirement was that our OpenStack cloud would span two regions; that was key. We wanted, obviously, self-service capabilities; that's what most people want. Our customers really want hands-on access and the ability to spin up instances very fast. And we needed to support thousands of VMs.
We wanted our identity system to be global, so your identity carries with you between regions, and it's also tied into our corporate credential system. Key to this was the control plane infrastructure for this cloud deployment: we wanted it to be very highly available and have DR components, because our customers are going to put mission-critical applications on this, so our control plane infrastructure needed to follow those same tenets of HA and DR. Live migration was also a key driver behind why we implemented certain things that you'll see a little later. It was very important to us that we could move our customers' workloads around with no impact to them, so that we could service the physical hardware, upgrade the OS, upgrade the kernel, things like that. Automation, automation, automation was another key tenet. We wanted to be able to do everything in an automated way and keep hands and manual changes out of production, so we wanted to automate all our software deployments, all the integration, and also the configuration of the bare metal systems.

So we worked on this for six months and we hit our target: we got to production July 1st of last year. All the while, we also built a team to deliver on this, so we were really doing two things at once. We didn't have a team in place to deliver on all these requirements, so we brought some talent in from other parts of the organization, we hired externally, and at the same time we brought in all the tools and processes that a DevOps organization requires.

One of the things that's important when you're bringing OpenStack, a private cloud like this, to an organization like Time Warner Cable, or any large enterprise, is educating the customers on what they're getting into. OpenStack is a very different environment from what they're traditionally used to. So we spent a lot of time, and again, this is coming up in the culture talk, educating our customers, and we wanted to make it understandable why they should move onto the cloud. So we put together a little promotional video, and I thought I'd share that with you before we dive into the details of our deployment. I'm going to go ahead and play this video here. It's kind of a simple little video, but it actually does a good job, for people who aren't familiar with cloud and virtualization technology, of showing the power of those tools and capabilities and the advantages they get from them. There are a lot of buzzwords in there, but they actually ring true. People get tired of waiting months to get a machine and then having to work with one group to get networking set up and another group for something else. This really opens things up and empowers our customers who are building applications to move at light speed.

As an overview of what we've deployed, and again, we're going to get into more detail, we have all the core IaaS services up and running. As you can see here: Nova, Glance, Cinder, Neutron, Swift, Keystone, and Horizon; we also have Heat out there, which isn't on this slide. We're up and running in the two national data centers, as you saw, so we have two regions. We have roughly 5,000 VMs of capacity at this point, we're adding capacity very rapidly, and we're seeing a ton of adoption.
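To make the multi-region, self-service idea concrete, here is a minimal sketch using the openstacksdk Python client. This is purely illustrative: the region names, credentials, and the image, flavor, and network UUIDs are placeholders, not Time Warner Cable's actual values.

```python
import openstack

# Hypothetical region names; with a global Keystone the same credentials
# work in both regions.
REGIONS = ["national-dc-east", "national-dc-west"]

for region in REGIONS:
    conn = openstack.connect(
        auth_url="https://keystone.example.net:5000/v3",  # placeholder
        username="jdoe",
        password="s3cret",
        project_name="web-team",
        user_domain_name="Default",
        project_domain_name="Default",
        region_name=region,
    )
    # Self-service: boot an instance in each region.
    server = conn.compute.create_server(
        name="app-" + region,
        image_id="IMAGE_UUID",
        flavor_id="FLAVOR_UUID",
        networks=[{"uuid": "TENANT_NET_UUID"}],
    )
    # Block until the instance reaches ACTIVE before moving on.
    conn.compute.wait_for_server(server)
    print(region, server.name, "is ACTIVE")
```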
The Swift that we've stood up is replicated across the two regions, and this goes along with the notion we wanted to bring to our customers that it provides a very good DR strategy: there are a lot of things you're running in the cloud that you can back up, drop into Swift, and have available right away, replicated into the other region. From a storage standpoint, we have about one petabyte today of combined object and block storage. The networking we have enabled is Neutron, using ML2 with a VXLAN overlay, so each of our customers has their own private network space; we'll talk about that more when Sean gets up. And the Keystone we're using has been enabled with hybrid auth, which we'll also talk more about, so that we get global identity across the regions.

A big part of standing up an OpenStack cloud is operationalizing it, so a lot of work went into monitoring and alerting and the tooling around that, and all of that was implemented within that timeframe as well. And last but certainly not least is CI/CD. I talked about automation; we want to automate everything. We spent a great deal of time putting in a complete continuous integration and continuous deployment tool chain and set of processes so that we can rapidly roll out changes to production. We do that weekly, sometimes more than weekly, and on some of the services we're actually already on Liberty. We can pick and choose which services we want to be close to trunk on. Some of the more mature, more stable services we're okay keeping on stable releases; some of the faster-moving services that we're experimenting and playing around with, we're actually pulling from trunk. There was already a talk at the summit on our CI/CD tooling, so hopefully some of you got to see that. All right, next I'm going to bring up Sean Lynn, who's going to go into some of the deployment details.

Hi, everybody. My name is Sean Lynn. I'm a lead engineer at Time Warner Cable, and let's delve a little bit deeper into what our deployment actually looks like. As Jason mentioned, we have two data centers. Our Keystone is a global data store between both data centers: tenants and users are the same on both sides, with no separation there. That was key to us up front. Behind the scenes you'll see two things. One is that we have a replicated Galera MySQL cluster, which had some interesting challenges going multi-region but is pretty solid now; you'll see the Galera arbitrator in there, which is a tiebreaker that sits in a third data center. And then we use hybrid auth: our Keystone checks locally first, which we use specifically for service accounts, but it also checks our corporate Active Directory as well, so any user who comes into the system can log in just like they would normally, with their user ID and password.

As Jason mentioned, we use a VXLAN tenant-based network architecture. We were a pretty early adopter on that: we spiked it in Havana and definitely brought it into production in Icehouse. The decision there was to enable our users to have really arbitrary networks, which they're really not used to; in most cases it allows them a lot more flexibility than they had in the past. We use floating IPs for public access to those applications.
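As a rough illustration of that tenant-network-plus-floating-IP pattern, here is a sketch using the openstacksdk client. The cloud name, network names, and CIDR are placeholders, and it assumes an external provider network (here called "public") already exists.

```python
import openstack

# "twc" would be an entry in clouds.yaml -- a placeholder name here.
conn = openstack.connect(cloud="twc", region_name="national-dc-east")

# Private tenant network on the VXLAN overlay.
net = conn.network.create_network(name="web-private")
subnet = conn.network.create_subnet(
    network_id=net.id,
    name="web-private-v4",
    ip_version=4,
    cidr="10.20.0.0/24",
)

# Router uplinked to the shared external network so floating IPs can reach us.
ext = conn.network.find_network("public")  # external network name is an assumption
router = conn.network.create_router(
    name="web-rtr",
    external_gateway_info={"network_id": ext.id},
)
conn.network.add_interface_to_router(router, subnet_id=subnet.id)

# Allocate a floating IP for public access to an instance on this network.
fip = conn.network.create_ip(floating_network_id=ext.id)
print("Floating IP:", fip.floating_ip_address)
```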
These are things like TimeWarnerCable.com, TWC TV, and other real production apps that are in our cloud using this architecture right now.

Our storage architecture: Swift has been there from day one, and we continue to expand it. We deployed in both data centers from the start, and day-one site-to-site replication was there. Basically, that let us offer our customers a DR strategy: snapshots to the cloud, backups, and all sorts of neat things that we documented in a knowledge base we provide to our customers. And recently we put in a Dropbox-like app, which has been getting rave reviews internally, "oh yeah, it's so easy to share things now," and that uses Swift as a back end. We also have block storage. It wasn't Ceph initially, but we quickly moved to Ceph for reasons that were explained in our last talk, which was very interesting. Our Ceph deployment is now multiple hundreds of terabytes and growing all the time, and we've operationalized it. Really, this is what enabled live migration up front. Right now we're looking at other options and expanding into multiple tiers of storage, so we'll have a pure SSD storage option pretty quickly and let our customers select what type of storage they would like.

As Jason mentioned, one of our original requirements, a mandate in fact, was live migration. This is a very uncloudy thing to do, some people think, but operationally it lets us, as administrators, administer our cloud much more easily. I would say this should be almost a requirement for anybody who's putting up a cloud; otherwise you have to kill instances on nodes and call customers up, and for us that's just not cool. We have a lot of customers who are quickly enabling their applications for the cloud, becoming very cloud-aware and scaling out horizontally, so it doesn't matter if one of their instances goes down. But not all of our customers are there yet, and so this is really important to us.

We have kind of an interesting high-availability strategy. We're trying to do active/active on any of the services, any of the service interactions, that we can, whenever possible. Basically this gives us more flexibility in upgrades and code deployments, and less downtime for our customers is what it comes down to.

Monitoring has been there practically from day one, as we were initially standing up the cloud; it was actually a requirement for us. Definitely start small. We use Icinga. It comes with a host of checks anyway; we enabled those, then started weeding things out, and as we ran into problems, we basically solved the problem, added a new Icinga check, and rolled it out (there's a small example of what such a check can look like below). Just make that part of your development process and keep building it up. Now we have hundreds and hundreds of checks running, which lets us be a lot more proactive about problems that come up and actually prevent them in a lot of cases. A lot of times now our customers never see the problem because we've fixed it before they've seen it, which is a great place to be. In addition to a monitoring strategy, you have to have people respond: we use PagerDuty and we have an on-call rotation. We are a true DevOps organization, so there's no throwing code over the fence. You actually have to eat your own dog food, you have to help customers out on a daily basis, and about once every three months you have to get up and deal with whatever alerts come in.
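Here is a minimal sketch of the kind of Icinga/Nagios-compatible check plugin described above: does Keystone answer, and how fast? The endpoint URL and threshold are placeholders, and a real check would be wired into Icinga's command configuration.

```python
#!/usr/bin/env python
"""Toy Icinga/Nagios-style check: is the Keystone endpoint answering?

Nagios plugin contract: exit 0 = OK, 1 = WARNING, 2 = CRITICAL.
"""
import sys
import time

import requests

KEYSTONE = "https://keystone.example.net:5000/v3"  # placeholder endpoint
WARN_SECONDS = 2.0  # arbitrary illustrative threshold

try:
    start = time.time()
    resp = requests.get(KEYSTONE, timeout=5)
    elapsed = time.time() - start
except requests.RequestException as exc:
    print("CRITICAL: keystone unreachable: %s" % exc)
    sys.exit(2)

if resp.status_code >= 500:
    print("CRITICAL: keystone returned %d" % resp.status_code)
    sys.exit(2)
if elapsed > WARN_SECONDS:
    print("WARNING: keystone slow (%.2fs)" % elapsed)
    sys.exit(1)

print("OK: keystone answered %d in %.2fs" % (resp.status_code, elapsed))
sys.exit(0)
```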
Being on call like that is actually a little bit annoying when you live it, but it also makes you much, much more responsive to your customers' needs and to the changes that you're putting into the field; you're a lot more sensitive to what the impacts are.

Lastly, on metrics: beyond Icinga, we had initially delved into Ceilometer and ran into some scaling issues with that. We are currently using Monasca more and more. We'll have our internal administrative monitoring through a dashboard in Monasca, and by end of year it's on our roadmap to provide our customers with monitoring as a service, so they'll have default dashboards and be able to create a limited number of new dashboards.

Jason mentioned automation up front. Automation and CI/CD are so much a part of our team's culture right now that I can't imagine living without them. I think you have to start small on this as well and iterate and grow. Basically, we're using Cobbler and Ubuntu preseed to kick everything into place, and then individual host management is via Puppet and Puppet modules, typically the StackForge modules plus a lot of custom modules that we've started committing upstream; we have core contributors on the Puppet side here. Where Puppet often falls short is orchestrating rollouts: it has no notion that this change needs to be applied to this system before that system. We had a previous talk on this, but in short, Ansible is used. We turn off Puppet, we do our weekly changes, our code rollouts, via Ansible, roll out the change, restart Puppet, and we can orchestrate our entire cloud upgrades this way.

The more we rely on external code sources, the more they become a single point of failure, so we started mirroring and bringing a lot of the code sources in house. That actually improves our rollout speed as well as our ability to keep up with all the changes that are happening. A key point again: upgrades should be intentional and frequent. The longer you let bugs lie in place in OpenStack, the more difficult it is to upgrade. Up front we started running a little bit behind, and in the last six to eight months it's been a massive task of ours to clean this up. We can now do deployments on demand. We can always get better; talk to our CI/CD guys, they're after one-touch deployments, but compared to where we were a year ago this is really flexible and really, really important. Everything, or most everything, is automated, the upgrades are weekly, and in fact in a dev environment it's six times a day.

Our environments start from the development environment. We can roughly simulate everything: any developer can simulate our entire production cloud, in a virtual environment on our cloud, to test changes. This includes Ceph, Swift, the basic services. You can bring up what you need, test out code changes, and submit to our internal Git repository, and it really speeds up our development. We aren't sharing one development environment and stepping on each other's toes. One thing this also addressed: usually, when you deploy a code change and roll it out, you have no insight into whether that change can rebuild the environment from scratch. Having these virtual environments improves that.

So I kind of touched on this before, but here's the process. Each individual developer has a virtual environment up in the cloud that's 100% theirs. They submit changes through Git, and through Gerrit review, plus ones and minus ones, the change gets submitted to the master branch.
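To give a flavor of the kind of automated check that can gate a change like that, here is a purely illustrative smoke test in Python. The endpoint URLs are placeholders and this is not Time Warner Cable's actual test suite; it just asserts that the core API endpoints in a freshly built environment answer without server errors.

```python
"""Toy smoke test of the sort a dev-environment test battery might run.

Run with pytest. Endpoint URLs below are placeholders.
"""
import requests

ENDPOINTS = {
    "keystone": "https://keystone.dev.example.net:5000/v3",
    "nova": "https://nova.dev.example.net:8774/",
    "glance": "https://glance.dev.example.net:9292/",
}


def test_api_endpoints_respond():
    for name, url in ENDPOINTS.items():
        resp = requests.get(url, timeout=10)
        # Unauthenticated root requests should still answer, not 5xx.
        assert resp.status_code < 500, "%s returned %d" % (name, resp.status_code)
```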
The change then goes out to our development environment, where another battery of tests is run. To go to staging and production, we tag it and then migrate it into both of those environments, and this is done on a weekly basis, or more often, to production. We continue to improve our CI/CD tool chain; this is actually super key for us. As we move closer and closer to trunk and mix and match our services, the ability to pull down upstream changes via Git, manage local patches, and merge becomes important. We use essentially a process that's very much like the upstream OpenStack process. It makes it a lot easier on us, and the tooling is the same: we do have Gerrit, and there are unit tests run in Jenkins and all the goodness around that. We continue to improve our ability to test and to roll out, which is the Jenkins-plus-Ansible portion.

One of the first things you have to change in a culture, this is my read of it, and we did this up front: if you expect to yum update in one massive go and then wait from Juno to Kilo, you're lost. You have to have a better process in place in your team and within your company to upgrade, and we now have a process where, in minutes in some cases, especially on our Horizon UI, we can deploy a change to production. Just don't wait; deploy early and deploy often. As far as massive upgrades go, definitely test the full upgrades. So if you're going from Juno to Kilo, we have the ability in-house to test that whole process, and more and more testing goes on there: test database migrations, test upgrading services, test the order that services have to be upgraded in. You need to think about these things and improve the process. I know that we've gotten bitten in a couple of places, and now we get better every time. And with that, I'm going to turn this over to Dave Medbury.

Hello, I'm Dave Medbury. I've worked for Time Warner Cable for a little over a year now, and I asked specifically for this slide. I do want to talk about working with the OpenStack community; that's why we're here this week. These are just some tips, and I'm not going to go into too much detail. Join the mailing lists. There's a mailing list for every project: Nova, Neutron, Puppet, Chef, whatever. There's a mailing list for every single one, and that's actually a requirement to be part of the OpenStack community. Participate in meetings. There are meetings occurring every single day, almost every hour of every day. They're largely done in IRC, so you need to be familiar with IRC, and just join the meetings; if you just Google for them, you'll find the right ones. But let's talk about some meetings that you might not know about. Obviously there are meetings here this week: design sessions, operator sessions, and then seminars, or marketing sessions, or whatever you call what I'm doing right now. Those are the kinds of meetings you probably know about. But there are also mid-cycle meetings. Every project has the opportunity to do a mid-cycle meeting, and they'll use it either to iron out issues where there's conflict within the team or as a planning session. That also includes the operators. We're big operators now; we're operating a large cloud. We have contributed to those sessions, but we've also taken a lot of value away from them, and they benefit all of OpenStack when we do that. All of OpenStack gets better when operators give feedback.
So look for the operators' mid-cycle meetup if you're not familiar with it. You need to get familiar with community processes. One of the things you don't want to do is go in and try to start selling something in a community process. What you really want to do is educate people, participate, and add value to the community process; it's really not a place for marketing things. Some of those community processes are very rigorously enforced, such as code changes: there's a very detailed way of getting code into OpenStack, and that's one of the community processes. Another community process is proposing a session, like this one. You need to get familiar with that, and you might need to do a little social media to get yourself up on stage.

Sean talked a lot about the community tools; he had a couple of slides. We are heavy users of the community tools, and I'm not just talking OpenStack, I'm talking OpenStack infrastructure: Gerrit, Git, Zuul, Jenkins... what? Not Zuul? Okay, well, we don't use Zuul. But Puppet. I mean, there are tools out there, and a lot of people have done work. Anybody who has checked in a commit knows that there is a gate, and that gate is basically operational tooling that you can bring into your own organization and use. There's value in the work of others, all right? So you do need to do this community work. You need to participate in this session, you need to participate in this bigger summit, but you also need to participate on a daily, weekly basis, especially on things you're interested in, and if you're an operator, things you have problems with. There's a very excellent mailing list for operators that basically keeps us updated on what all the other operators are seeing.

But I'm really up here to talk about pain points, and if you've been an operator, you know that RabbitMQ is one of our pain points. We had a session a couple of days ago, a very good session, that told us things have gotten much better. Much better since March, okay? So things have gotten much better. But historically, messaging failures are difficult to detect (there's a small sketch below of one way to watch for queue backlogs). The AMQP layer basically pervades all of OpenStack, and everything has to be responding to it and behaving properly. And it's not really RabbitMQ's fault; RabbitMQ gets a lot of the blame, but it's really how the different OpenStack services and projects have incorporated RabbitMQ. So queues are tricky. HA queues are tricky. We actually use the multi-host, multi-node HA, not the load-balancer or HAProxy type of HA. And failovers happen automatically in RabbitMQ, but not all the services are aware of them, so be aware of that. That has actually gotten fixed in oslo.messaging, which I wouldn't say is where most of the problems were coming from, but it was the most susceptible to them. And in Kilo, that's basically all fixed. So if you're on Kilo, you're probably not an operator at production scale, or you moved very, very quickly; but when you get to Kilo, you'll see that those are fixed, and most of the fixes are being backported to earlier releases by the distros. Heartbeating was one of the big issues there as well, and that also landed in Kilo. And as you can tell, operators are focusing on RabbitMQ because it has caused us a lot of those late nights, early mornings, and hair-pulling.

Neutron: everything is a network problem. If you were in the Ceph talk that we did a couple of hours ago, you heard that everything is a networking problem, because it's just kind of the easiest thing to blame.
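As one illustration of detecting the kind of messaging trouble described above, here is a sketch that polls the RabbitMQ management API for queues that are backing up, which is often the first visible symptom of a consumer that never noticed a failover. The URL, credentials, and threshold are placeholders, and it assumes the management plugin is enabled on the default port.

```python
"""Sketch of a queue-backlog check against the RabbitMQ management API."""
import sys

import requests

RABBIT_API = "http://rabbit1.example.net:15672/api/queues"  # placeholder host
MAX_BACKLOG = 500  # arbitrary illustrative threshold

resp = requests.get(RABBIT_API, auth=("monitor", "s3cret"), timeout=10)
resp.raise_for_status()

# Each entry describes one queue; "messages" is its current depth.
backlogged = [q for q in resp.json() if q.get("messages", 0) > MAX_BACKLOG]
if backlogged:
    names = ", ".join(q["name"] for q in backlogged)
    print("CRITICAL: queues backing up: %s" % names)
    sys.exit(2)

print("OK: no queue above %d messages" % MAX_BACKLOG)
sys.exit(0)
```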
So the networking guys, like Sean, have to defend themselves every single day and prove that it's not a networking problem, because everything's a networking problem until it's proven not to be. Even with the RabbitMQ problems, we didn't know what was going on; we thought network traffic was actually down, that something was broken in the network. Well, it really wasn't. It was a messaging issue. Neutron problems lead to angry customers because they can't get to their instances. If they can't get to their VM, they get very upset, because whatever their VM is doing can't be done. One of the best tips we have is to stay as up to date as possible on Open vSwitch. There were a number of bugs that got fixed after the Icehouse release that didn't get really, truly fixed until much later, like Kilo. So you may need to upgrade your Open vSwitch faster than your vendor, if you have a vendor, would like you to, but you've got to take care of you. Only the brave use the newest networking features. We're semi-brave: we did a talk on Designate about an hour ago, and that's one of the newer features in the networking realm that we're using.

Monitoring VMs is tricky. We like to know if the VMs are responding, so we ping the VMs all the time: not continuously, but regularly we ping all of them. But if you've disabled ICMP, we can't ping your VM, especially if you've done it inside your VM and not with Neutron rules. So monitoring VMs is tricky. Let your customer, your user, your tenant monitor their own VMs, and only really worry about a VM if it was pingable and stops being pingable; there's a minimal sketch of that kind of reachability sweep below.

Kernel panics happen. Kernel panics happen. Have a plan to handle kernel upgrades. Hopefully, if they happen more than once, you actually get a kernel dump and you can repair them; you can get a new kernel. We're very big on live migration, but live migration doesn't help you if your kernel panics, because you're not live at that point. So have a plan to handle kernel upgrades. VENOM happened last week, just last week, and a lot of people had to do kernel upgrades on Thursday. You don't know when that's going to happen. So how do you plan on debugging kernels? Do you have kernel people on your team? Do you have a kernel vendor on your vendor list? You need to think about those things. Practicing crash dumps is my best recommendation for how you can prepare your cloud infrastructure, not your VMs, your cloud infrastructure, for kernel panics. Newer kernels are generally better: a lot of times even your unknown, undetermined kernel panic is actually fixed upstream. You may not be able to prove exactly which patch it was, so try a newer kernel.

Okay, users. Users are one of our pain points. They bring value too, okay? But they might require some infrastructure-as-a-service education, because they might not be doing things in cloudy ways. You need to educate your users. In particular, I'd really recommend that you give them an overview of OpenStack, more than just the cartoon, which is awesome; the cartoon has helped us a lot, but go beyond that. Provide some hands-on training, some seat time, and get them aware of what they should expect. Applications they build should be cloud-aware, and that is a cultural shift. Tooling: there's a lot of tooling in OpenStack, but users may still be doing things the old way.
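For the ping-style check mentioned above, here is a minimal, illustrative sweep. The addresses are placeholders, it simply shells out to the system ping (Linux iputils flags), and a real version would pull floating IPs from Neutron and skip tenants known to block ICMP.

```python
"""Toy reachability sweep over a set of VM floating IPs."""
import subprocess

FLOATING_IPS = ["198.51.100.10", "198.51.100.11"]  # placeholder addresses


def is_pingable(addr):
    # One echo request, two-second timeout; returncode 0 means a reply came back.
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", addr],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


for ip in FLOATING_IPS:
    status = "reachable" if is_pingable(ip) else "NOT reachable"
    print(ip, status)
```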
Just kind of advertise to your users that there are better ways to do things. If you're using continuous integration and continuous delivery, why not get your users doing the same thing? The tools are already there, and you already know about them. So do a brown bag; educate them. All right, pleasant surprises: users do bring pleasant surprises. We had a training session, and the training session went over Heat, and within three weeks we had a couple of heavy-duty Heat users. Nobody on our team had ever used Heat; we knew it was there, we provided the API endpoint. So when Heat issues started coming up, Sean and Matt became Heat experts overnight. But that's a pleasant surprise, and that's because we provided the training, and that training included Heat.

All right, what's next? Processes and tooling: we'll do better integration there; I think Jason mentioned this earlier. Deployment tool improvements: we're doing Python virtual environments already. Service additions: we've got load balancing as a service coming next month, next month. We've got DNS as a service in beta trial inside our cloud already. We've got monitoring as a service coming: something that hasn't been talked about much here at the summit is Monasca, which is monitoring as a service. We're doing it for our infrastructure, but we're also going to open it up to our VM customers so that they can monitor their own VMs. Database as a service is coming soon; it's not immediate, it's not next month. Hadoop as a service too. These are all things we'll be bringing before we talk to you in Tokyo.

There are a couple of other sessions coming up. There's one immediately after this, "Changing the Culture at Time Warner Cable," and that was probably the most key thing: Matt Haynes, who's sitting in the back, was able to basically say, we have to do it this way, and if you don't buy into doing it this way, we're not going to try, because we need that CIO, CTO, CEO level of support. "Neutron in the Real World": Sean's going to go into depth at 1:50. "Real World Experiences Upgrading OpenStack," with Matt Fisher in the front row here, is tomorrow. And we're facilitating operator sessions all day. I've got a session right after this, so I'm going to miss Matt's presentation; we're going to talk about how operators are having problems and having successes with Ceph. And then later today there's another operator session about doing OpenStack upgrades, going from Havana to Icehouse to Juno to Kilo, because there are operators still on releases even before Havana; they're just not on stage.

I think that's all I've got. Thank you. You can reach us, and there's also @TWCCloud, a Twitter handle where we sometimes put out information. As soon as the slides are available this week, I'll definitely put something out on @TWCCloud and mention @OpenStack so you can all find it.