Hi, my name is Kais Halcroft, and I'm here on behalf of Bloomberg. What I'd like to talk to you about today is how we've architected our OpenStack cloud to be highly available and extremely robust. I'd also like to talk a little about some of the things we got really right, some of the lessons we learned doing it, and where we plan on going in the next few months.

If you don't know much about Bloomberg: if you walk into almost any major financial institution, and even some not-so-major ones, you'll see screens like this. This is the Bloomberg Terminal, or Professional service, as it's officially called. That's our core product. There are a bunch of other media products that surround the core terminal, which supplement and support it. We're also moving into other industries these days, like government, law, and sports. Bloomberg is the market leader in financial data analysis.

Just to give you a sense of the scale of our system: the Bloomberg back end handles about 22 million instant messages a day, and about 200 million messages overall. It's available in 11 local languages. We pull down about 10,000 data feeds, processing about 45 billion ticks a day. So it's a pretty substantial system.

To give you a little sense of the history of this thing: it was started in 1981 with just a couple of guys. That's four years before Windows 1.0. By the time the first web server was developed at CERN, the terminal already had an install base of about 20,000 people, and the growth has been pretty steady since then. By the time Google was launched, we were up to 100,000, and we had 300 employees in R&D alone; those are essentially programmers. Right up to the present day, that's 30 years of history, with 4,500 people in Bloomberg R&D, 30 years of code, and 30 years of infrastructure. So the terminal, the core product, has been in development for 30 years. It initially started out in the early 80s as a hardware product and has since evolved into a software product, but we still make hardware in the form of specialist keyboards, specialist security devices, screens, and what have you.

So what are we doing with OpenStack at Bloomberg? When we were designing our cloud infrastructure, it was part of a larger cultural shift at Bloomberg towards a DevOps philosophy. The major driver here is essentially turnaround time. For a company such as Bloomberg, time to market is everything. So we wanted that increase in developer flexibility and developer productivity, to reduce the turnaround time on machine deployment, and to get all the benefits of machine provisioning as code. If you want to know more about this, Jombalone from Bloomberg gave a great talk, which is public and available at this URL; if you just Google his name, you'll find it.

So, designing our stack: like I said, for Bloomberg, high availability and robustness of the architecture is everything, so that's our primary design requirement. We went with the idea of many smaller clusters, as opposed to one enormous cluster, and each cluster has to be highly available, not necessarily at the machine level or the VM level, but in aggregate. We also wanted a simple architecture that is fairly homogeneous across the stack; we don't want specialist snowflakes here and there, which could become single points of failure for the entire system. And we obviously want an infrastructure that scales horizontally, trivially.
One of the key requirements in the enterprise, which I think is commonly overlooked, is that it has to be deployable in the absence of full internet access. That means maybe no internet at all, or heavily proxied internet. And of course, we wanted to make maximal use of all the open-source tools and community out there, and to contribute back.

The solution we've come up with we call the Bloomberg Clustered Private Cloud, or BCPC for short. It's available on GitHub. It's a set of Chef recipes and supporting scripts to deploy your own cloud. It uses entirely open-source software, and I'll just give a shout-out here to all the various bits we use; it's not a totally exhaustive list, but it's a pretty good subset. So please visit our GitHub repo, take a look, and try installing it yourself.

We'll walk through the architecture a little. This slide looks a little busy, but we'll break it down into individual components. This represents our stack; it is in actual fact a picture of three example nodes in our cluster. And the immediate thing you'll see is that they're absolutely identical across all three. We can break our architecture down into four or five layers. The host layer runs the OS and so on. Above that we have a distributed storage layer running Ceph, then a database and messaging layer, and then the OpenStack infrastructure-as-a-service layer. On top of that sits a monitoring and services layer, and right at the very top we put our high availability layer. We'll break these down a little.

The host layer consists essentially of the machine hardware and the core OS. Just as with the rest of our stack, we want no specialist machines and no specialized hardware; we use identical hardware across the entire cluster. In Chef BCPC we assume three networks; how you arrange those physically is up to you. The first is a management network, which carries all the OpenStack traffic and some monitoring traffic. The storage network is dedicated to the distributed storage layer. And the float network carries the VM traffic. The individual boxes are just vanilla pizza boxes: you can stuff them full of either HDDs or SSDs and contribute those disks to the distributed storage pool, obviously keeping one or a couple back for the OS. We currently run Precise (Ubuntu 12.04); we're moving to Trusty (14.04) in the next couple of weeks. We also have some limited support for CentOS.

Our distributed storage layer uses Ceph. As I said, every node contributes to the cluster. It is, of course, host-, rack-, and row-aware, so we only need 50% plus one to continue running and keep all our data. The Ceph services we run are the RADOS Block Device (RBD) for backing Cinder, and the RADOS Gateway object store for our S3 endpoints. We have boot-from-volume with copy-on-write semantics available. This has been an extremely solid part of our infrastructure; it's a very well-written piece of software.

The next layer up is our messaging and database layer. We use MySQL Galera in multi-master mode, which provides the database services for OpenStack and also for some of the higher layers, such as monitoring, PowerDNS, and a few other bits and bobs. Our Rabbit layer uses RabbitMQ 3 for clustering, so it's all clustered queues, disk-backed, and that provides the queuing service for OpenStack.
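To make the Galera piece concrete, here's a minimal sketch, in Chef's Ruby DSL, of the kind of recipe that could lay down the wsrep settings for multi-master MySQL. The node addresses, file paths, and cluster name are illustrative assumptions, not the actual chef-bcpc recipe.

```ruby
# Minimal sketch only: render the wsrep settings that put MySQL into Galera
# multi-master mode. IPs and paths are assumed for illustration.
galera_nodes = %w[10.0.100.11 10.0.100.12 10.0.100.13] # assumed management-network IPs

file '/etc/mysql/conf.d/wsrep.cnf' do
  owner 'root'
  mode  '0644'
  content <<~EOS
    [mysqld]
    binlog_format            = ROW
    default_storage_engine   = InnoDB
    innodb_autoinc_lock_mode = 2
    wsrep_provider           = /usr/lib/galera/libgalera_smm.so
    wsrep_cluster_name       = bcpc-db
    wsrep_cluster_address    = gcomm://#{galera_nodes.join(',')}
    wsrep_sst_method         = rsync
  EOS
  notifies :restart, 'service[mysql]', :delayed
end

service 'mysql' do
  action [:enable, :start]
end
```

With wsrep_cluster_address pointing at every head node, any node can accept writes and replicate them synchronously to the others, which is what lets the higher layers treat the database as a single highly available service.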
We provide a single point of access to these services through our high availability layer, which I'll talk about in a moment.

In our OpenStack layer there's nothing terribly exciting; it's pretty vanilla. We just deploy everything in a shared-nothing architecture. Cinder needed a bit of encouragement, but pretty much everything else runs almost out of the box in this high-availability mode. All service endpoints are published through the HA layer, and that HA layer also distributes the load across every single node. These services then communicate with MySQL and Rabbit back through the high availability layer. We use Nova networking and not Neutron, simply because of the high availability support in Nova networking; it wasn't really available in Neutron at the time.

Our high availability layer runs Keepalived and HAProxy. Keepalived publishes a VIP, a virtual IP, through VRRP, and every node participates in that VIP. There's a potential problem here: if you've got a real network partition, you could have all nodes trying to take the VIP. So what we do is tie Keepalived to Ceph, and you can't take the VIP unless you're part of the Ceph quorum, which keeps the whole cluster together. It's actually quite a nice solution. HAProxy takes the traffic from Keepalived, if you like, and distributes it across the cluster. We actually pass SSL through and do the termination at the endpoint. We've had some scaling issues with HAProxy, which we're addressing, perhaps by using Apache or something like it as an alternative.

For high availability we generally use three or four paradigms. We have the VIP, which we use for publishing all the service endpoints. A lot of our components have a built-in high-availability mode, and where that's available, we use it: MySQL Galera being a good example, along with Rabbit and Graphite. For database-backed applications, such as monitoring with Zabbix, and PowerDNS, we rely on the high availability of MySQL Galera. And for most of OpenStack, we just use the shared-nothing architecture.

In our common services and monitoring layer, we use PowerDNS for our DNS servers. We have a little trick where we create a view that provides both forward and reverse DNS for all tenants. Our tenants' DNS structure is: if you're tenant foo and you have a VM called bar, then you get bar.foo.bcp.<your-corporate-name>.com. We've had some performance issues with this particular architecture, and we're revisiting it right now. We use 389 Directory Server for LDAP services; there's some more work going on in Chef BCPC right now to re-examine that.

For our monitoring layer, we use Fluentd and Elasticsearch for log aggregation and analysis, with Kibana as a nice front end. The Fluentd agent pulls the log files off individual nodes and pushes them through the VIP into Elasticsearch, which runs on all our nodes. We've been using Zabbix for monitoring and alerting. We've had some reliability issues with it: it tends not to deal well with parts of the cluster going away, and it doesn't really want to run in high-availability mode; it doesn't like lots of Zabbix servers running. Graphite and Diamond we use for graphing, analysis, and drill-down. All our monitoring is on-cluster, and we're revisiting that paradigm, maybe moving to a two-layered or three-layered system: on-cluster monitoring plus some centralized logging as well.
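Coming back to the HA layer for a moment, here's a minimal sketch of what tying Keepalived to the Ceph quorum might look like, again as a Chef recipe. The VIP, interface name, and the exact quorum check are assumptions for illustration; the idea is simply that a VRRP track script demotes any node whose monitor has fallen out of quorum, so a partitioned node can never claim the VIP.

```ruby
# Sketch only: Keepalived holds the VIP via VRRP, but a track_script fails the
# instance on any node that is not part of the Ceph quorum. VIP, interface,
# and the quorum check itself are illustrative assumptions.
file '/usr/local/bin/check-ceph-quorum' do
  mode '0755'
  content <<~EOS
    #!/bin/bash
    # Succeed only while this host's monitor is listed in the Ceph quorum.
    # (Assumed check; adjust to however your monitors are named.)
    timeout 5 ceph quorum_status | grep -q "\\"$(hostname -s)\\""
  EOS
end

file '/etc/keepalived/keepalived.conf' do
  content <<~EOS
    vrrp_script ceph_quorum {
      script "/usr/local/bin/check-ceph-quorum"
      interval 5
      fall 2
      rise 2
    }
    vrrp_instance bcpc_vip {
      state BACKUP
      interface eth0
      virtual_router_id 51
      priority 100
      virtual_ipaddress {
        10.0.100.5/24
      }
      track_script {
        ceph_quorum
      }
    }
  EOS
  notifies :restart, 'service[keepalived]', :delayed
end

service 'keepalived' do
  action [:enable, :start]
end
```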
So how do you go about deploying one of these clusters? Seeding is always an issue, especially if you don't have full internet access. Chef BCPC has quite a nice way of doing this. The first thing we do is stand up a bootstrap node. If you run the Chef BCPC scripts, you'll get a bootstrap node which runs a Chef server and a Cobbler server, and which also pulls down any mirrors we need to get the cluster bootstrapped. So you can stand in one place, pull from the internet, and then push onto the bootstrap server. The reason we have to go through these gyrations is that a lot of packages like to phone home; the usual Chef installers, for example, really want to call back, so you have to work around that.

The procedure for deploying either the first node or the nth node is exactly the same: you just PXE boot it, assign it a Chef role, Chef it, and off you go. For updates, you can either push or pull them onto your bootstrap node and then rerun Chef. We don't tend to run Chef continually in production; we've had some reliability issues doing that. We tend to run it when an update hits the system.

Now, I said earlier that we have an entirely homogeneous stack across all our clusters. Obviously that's not entirely true. We run what we call a head node, which runs the full stack, and we run many of those. But we want, of course, to expand our cluster beyond just that infrastructure, so we can add to the cluster using a work node. A work node runs a very strict subset of the components available on a head node: it obviously runs the host layer, but it also runs Ceph OSDs and some subset of the OpenStack layer, usually just the compute and networking services. You use work nodes to expand the storage and compute capabilities of a cluster; there's a sketch of this split just below.
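Here's a hypothetical sketch of that head-node/work-node split expressed as a Chef role. The role and recipe names are made up for illustration and aren't the actual chef-bcpc names.

```ruby
# Hypothetical role sketch: a work node runs a strict subset of the head-node
# stack. Recipe names below are illustrative assumptions.
name 'bcpc-worknode'
description 'Host layer, Ceph OSDs, and the compute/networking subset of OpenStack'
run_list(
  'recipe[bcpc::networking]',   # host layer: management / storage / float interfaces
  'recipe[bcpc::ceph-osd]',     # contribute local disks to the storage pool
  'recipe[bcpc::nova-compute]', # hypervisor / compute service
  'recipe[bcpc::nova-network]'  # VM networking (nova-network, not Neutron)
)
```

A head node's role would then carry the full run list on top of this subset: the Ceph monitors, Galera, RabbitMQ, the OpenStack control plane, and the monitoring stack.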
One of the really cool things about Chef BCPC is that it comes with its own integrated development environment. If you download the recipes straight off GitHub and bootstrap, you'll get a cluster running in VirtualBox; optionally, you can use Vagrant on top of that as well. So you can have a full Chef BCPC cluster, exactly like Bloomberg uses in production, running on your desktop or your laptop. 16 GB of RAM is nice; some people claim 8 is doable. And one of the great things about this is that it's the absolutely identical system that gets deployed into production, even down to the PXE booting: you deploy a bootstrap node, and then the scripts PXE boot the individual nodes from there and deploy them, exactly as one would on bare metal.

So of course the question is: we've gone to all these lengths to make a highly robust architecture, but does it in actual fact work? The answer is yes. Most of the time, I'm the guy who gets called if there's a major problem, and I spend my evenings doing other things. So it's pretty solid.

Where are we going? The roadmap can be broken down into various components. On the OpenStack side, there are some upgrade procedures we have to work through. We're still rolling out Havana; Chef BCPC has two branches right now, so you can flick between Grizzly and Havana. We're really looking forward to rolling updates in Icehouse. That will be a wonderful thing for us, as will the possibility of single-version skew within a cluster. That would be fantastic.

Neutron versus Nova networking: we're sticking with Nova networking for now. It does everything we need it to do and works absolutely fine; until high availability is available in the L3 agent for Neutron, we'll stay with Nova networking. The Keepalived VRRP approach has worked pretty well, but it obviously requires layer 2 spanning between racks, so we're doing some work on anycast to see if we can use an anycast system to host the VIP. We'd also like to support some more diverse architectures, and that includes continuing our support for CentOS as a hypervisor.

On storage: we use the RADOS Block Device and RADOS Gateway as the main Ceph services, and Ceph has a really clear consistency model. CephFS, which we've had a look at, has a consistency model that's a little harder to reason about, so we haven't actually used it yet. Ceph is synchronous, which brings up some interesting questions about how you do cross-data-center replication, if you want to do that. Erasure coding, a new feature, looks awesome, but what does it do to our failure domains? I think we need to understand that. And then there's some work with HDFS and CephFS: if you look at Chef BCPC, you'll see we also have a Hadoop branch in there for deploying Hadoop clusters using a similar methodology, and it would be nice to unify the Hadoop and OpenStack storage layers. That's ongoing work. And then of course there's the whole question of cold storage and backups: how do we do that?

So in summary, we've been running this for a year, and it's been absolutely rock solid. It's part of a much bigger change at Bloomberg towards a DevOps culture, so it's not just us doing cloud work; there are a lot of people working on automated deployment, Chef, and what have you. Some of the choices we made have been really good: Ceph, our distributed storage layer, has been rock solid. Some of them we're going to revisit; we've learned a lot in the last year. And I would encourage people to go take a look. Chef BCPC is fully available, and the community there is very welcoming to requests, or rants, or whatever you want to do. So come along, take a look, download it, play with it. Contact people on GitHub if you have problems: open an issue, or even better, submit a pull request. New features are definitely coming down the pipeline based on our production experience. And that's it. Thank you very much.