So, welcome everybody to our talk about the experience we at anynines had with operating open source Cloud Foundry. Let's begin with our talk. So what does "experience report: open source Cloud Foundry operations" exactly mean? We started three years ago with a public installation of open source Cloud Foundry, and setting such a system up we faced a lot of challenges, because of the requirements we had for a public installation. And we think people doing the same thing as we did face the same challenges. So we're trying to share our solutions and the experience we gained through these three years, to inspire you and help you do your own thing.

Yeah, so let's start. Before we really dive into our talk, we'd first like to introduce ourselves and give you a short overview of the structure of the talk. Some of you may have noticed that the talk was initially proposed by our colleague Julian Weber, but he couldn't come here to Santa Clara, so we replaced him. The thing is that he does more operations work than we do, so he's way more knowledgeable in this. But we've been working with the installation for a couple of years now, so I guess we should be able to guide you through the presentation and hopefully also answer your questions at the end of the talk.

So I'm Lucas Pinto. I'm a French guy, and I've been living and working in Berlin. I've been a Rails developer for a long time, so I'm more on the dev side, and I joined the anynines team a bit more than a year ago, where I help them with the Ruby knowledge that I have. If you have any questions after the talk, you can contact me at my email address or ping me on Twitter. And my name is Stefan Zuber. I've been working in the anynines team for two years now. I co-developed the service framework we use in our public installation, and I also help with the operations of our public installation. As Lucas mentioned, just write us a mail, ping us on Twitter, or come to our booth; we have a booth in the foyer, if you have any questions.

So let's start with our talk. The talk will be structured this way: first we'll look at what a standard open source Cloud Foundry installation looked like three years ago. Like our friends at Springer Nature say, that's the DEA and dinosaur era when it comes to Cloud Foundry, but I think it's still an interesting thing to observe and analyze. And what we'll do is look at the weaknesses of this installation for our concerns, which is serving our clients, and we'll try to spot where it doesn't really fit our requirements. Then we'll come to the most interesting part: the strategies we chose and the solutions we implemented to get around those weaknesses. And then, if time allows, we'll have a small round of questions. But as Stefan said, we have a booth in the main hall and you can definitely come talk to us about it. We're unfortunately out of swag; we brought really nice swag, but it was so good that it's all gone by now. Hopefully we'll see you in Frankfurt; we'll be there, have a bigger booth, have more swag, and we'd love to talk to you in Germany.

Yeah. Let's start with the requirements: listing, first of all, what requirements we had for a public installation, and what the goals are we wanted to aim for and achieve. One of the main requirements was high availability, because our public installation is used by our clients and also by our developers for their internal projects, and their apps should be running all the time.
So downtime for apps, for most people, means lost money and reputation, and if our customers or clients lose money and reputation, the same goes for us. And one thing we learned in these three years: everything fails. Processes fail, virtual machines fail, really everything fails. So you have to keep that in mind if you want to design a system that should be available all the time. Fortunately, there are some patterns you can use to stabilize your system. The first thing is to make your Cloud Foundry components redundant, and to use clusters for your backing services. Another important point is to aim for fast failover, which means if some part of your system goes down, it's recreated really fast, and for that you should also have some kind of self-healing system. As we are using Cloud Foundry, we're using BOSH for that; it's always there if you have Cloud Foundry, and it's pretty good at self-healing.

The second point is resilience. Some of you are microservice builders, so I guess you are aware of the meaning of resilience in this context. The idea is that an issue that happens somewhere should be contained where it happens and shouldn't propagate through your installation. That means it shouldn't cascade, and that makes it easier to spot and fix. An example is the recommendation service in an online shop. If your recommendation service is down, it shouldn't mean that your users can't use your minimum viable product, which is placing orders. Maybe the recommendations won't be there anymore, maybe you show a placeholder instead, but it shouldn't impact your users beyond that. What we had in mind is to keep boundaries between our services and to be sure that we handle those kinds of failures properly. Because, as Stefan said, failure is something that is going to happen. You have to be aware of it and build a system that is coupled loosely enough that the impact of the death of a component will be minimal.

The next important point is, as I mentioned before, self-healing. That means if a component fails, it gets automatically recreated without you having to do anything manually. And there are two important aspects: monitoring and auto-provisioning. Monitoring here means that you have health checks for your components, so that you can detect if there is a failure. Auto-provisioning means that you don't have to, let's say, get up at 3 a.m. in your pajamas to start your virtual machine again. And there are perfect examples of this in the Cloud Foundry environment: you have the self-healing for applications done by the DEA, or now by Diego, and you have the same thing done by BOSH for the jobs deployed by the BOSH Director. It detects if a job is missing or failing, and uses the deployment manifest to recreate it, without you having to do anything. That's really amazing about BOSH, and that's why we started with it and are still using it.

The last thing is something we won't really talk about much during the talk, but it was one of our requirements, and it's monitoring. Because we want to be aware of issues before everything has crashed; we want to be aware of them as fast as possible. So we had to build endpoints for logging and for metrics, and then bind them to tools such as Elasticsearch/Logstash/Kibana or, in our case, Graphite, and later Grafana. You may also have heard of the Firehose or NOAA. There have been a couple of talks about monitoring here; unfortunately, I couldn't see any of them, but I hope you folks did.
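To make the metrics part a bit more concrete, here is a minimal sketch of what pushing a single data point into Graphite can look like, using Carbon's plaintext protocol ("metric value timestamp" over TCP, port 2003 by default). The hostname and metric name are just placeholders, not our actual setup:

```ruby
require 'socket'

GRAPHITE_HOST = 'graphite.example.com' # hypothetical endpoint, not our real one
GRAPHITE_PORT = 2003                   # Carbon's default plaintext protocol port

# Send one metric data point in Graphite's plaintext line format:
# "<metric.path> <value> <unix_timestamp>\n"
def send_metric(path, value, time = Time.now)
  TCPSocket.open(GRAPHITE_HOST, GRAPHITE_PORT) do |socket|
    socket.puts "#{path} #{value} #{time.to_i}"
  end
end

# e.g. a health check could report how many DEAs it currently sees as healthy
send_metric('cf.deas.healthy', 12)
```

In practice you would keep the socket open and send many points, but the line format above is the whole protocol.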
And I guess that's the last of our requirements. So, now that we have mentioned the requirements, we can move on to the analysis of the standard Cloud Foundry system that we did before setting up our installation. In this part, we looked at all the components in the Cloud Foundry environment and spotted possible single points of failure, or components that don't fit our requirements.

Here is a little overview of the Cloud Foundry environment. Keep in mind that we started three years ago, so today it looks quite different. I think most of you have already seen these components: you have the HAProxy and the router for the incoming requests; the Cloud Controller, its database and the blobstore for the API, your organizations and the applications; the UAA and the UAA database for managing identity in Cloud Foundry; the DEAs for running your applications; the Health Manager for the self-healing of the applications; and, last but not least, NATS, which is responsible for the messaging within the system. What we did before we started is, as I mentioned, analyze every component regarding redundancy, high availability, and the consequences an outage of one of these components would have on our system. And we also had to keep in mind issues that can arise because of our environment; that means we try to serve our clients and make them happy, and one point is that we are using OpenStack as the infrastructure layer. As you can see on this slide, the orange boxes are the components we figured out are critical in our system; that means if one of these components fails, we have an outage in our system. In the next two slides, we will go through every component and talk in more detail about what the problem is.

The first thing we want to talk about is the databases. Those two databases, the Cloud Controller one and the UAA one, are not clustered; each is a single instance. So if they're down, it's really problematic for us. The consequence concerning the Cloud Controller database is that the Cloud Controller API won't be able to work anymore and the Health Manager won't be able to do its job anymore, which means that issues could spread in that case, and our API won't be reachable anymore. That is quite painful for us and for our customers. The second one is the UAA database. The consequence here is that we won't be able to authenticate our clients anymore, which means we can't let them do anything anymore. So that's also an issue.

The next thing we spotted is the Cloud Controller blobstore. It's the Debian NFS server job, and it's not clustered in a default Cloud Foundry setup; one reason is that NFS scales badly. So if the job goes down, you can't push your applications, because the buildpacks are stored there, and the self-healing doesn't work anymore either, because the droplets can't be accessed if the blobstore is down.

The next one is the HAProxy. Here we have a twofold problem. The first one is that it's also a single machine, so if it's down, it's down. That's something we could avoid by having an HAProxy cluster, but this would also mean that we would need to build our own specific load balancer for the infrastructure. The second problem is a bigger one for us, because we have clients pushing their applications to our platform, and some of them might need a secure connection; I think that's something that's quite important for them. But that also means that they need to host their SSL certificates on the web servers, and that is something that is not possible with the HAProxy.
Another alternative, of course, is that we upload the SSL certificates manually for them, something we've done in the past, but it scales quite badly, so that couldn't be our solution.

The next problem comes with the environment we set up in. We are using OpenStack with DHCP, so we have a problem with static IPs and some templates in the Cloud Foundry releases. The problem here is, let's assume you have a virtual machine with a static IP assigned to it. Now the compute node fails where the virtual machine is running. BOSH, as I mentioned, detects that there is a job missing, recreates the virtual machine, and tries to assign the old static IP to this new virtual machine again. But in OpenStack, this static IP can't be reassigned directly, because the compute node failed. So we had a problem with static IPs, and we decided to use service discovery via DNS instead of the IPs. That was one of the big problems we had.

And now we can start with the solutions we used to make our system stable and to fulfill our requirements. We will work through the solutions in the same order we showed you the single points of failure and the static IPs, so that you can see how it works.

Our first issue was with the databases. And here, since we're a German company, we're environmentally friendly, so we thought about recycling something we had already written, which is a BOSH release for PostgreSQL that allows us to build a three-node cluster, making this more stable. We have redundant data, and we can be confident that this service will be available all the time. If you want to know more about it, you can watch the talk our CEO gave yesterday, or you can read our blog posts; you can find them on our blog, anynines.com/blog, I think, or blog.anynines.com, where we go more in depth into the topic of building this PostgreSQL BOSH release.

The next thing was the Cloud Controller blobstore. What we did is take the blobstore out of the Cloud Foundry system and use an external blobstore, which is clustered and redundant. At the time, there were two possibilities: OpenStack Swift or Amazon S3. And as I mentioned, we are running on our own OpenStack, so we use the OpenStack Swift within our system. The connection between this new external blobstore and the Cloud Controller wasn't really a big problem, because you can use the Fog library, which is already in the Cloud Foundry system. So it was easy to connect: you just have to change the endpoints and the credentials, and then you can use the external blobstore.

Now the HAProxy. Like I said, this issue was twofold, and the most important problem for us was serving clients, right? Our solution for the SSL certificates was to create a BOSH release that allows our customers to upload their SSL certificates themselves. For that, we wrote a virtual host API which the clients can talk to. This virtual host API then forwards the user data via RabbitMQ messages, which land as virtual hosts on our SSL gateways. And on our SSL gateways, we have workers that take care of reconfiguring our Nginx web servers, so the SSL certificates and private keys can go from the clients to the web servers. And because we wrote that ourselves, we also wrote it in a way that lets us cluster those SSL gateways, ensuring once again redundancy and availability.
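Coming back to the blobstore connection for a second: to give you an idea of what talking to such an external Swift blobstore through Fog looks like, here is a rough sketch. The credentials, the auth URL, the container name and the file key are all made up; they just stand in for the values you would put into the Cloud Controller's configuration:

```ruby
require 'fog'

# Connect to an OpenStack Swift blobstore via Fog. All values below are
# placeholders, not our production configuration.
storage = Fog::Storage.new(
  provider:           'OpenStack',
  openstack_username: 'cc-blobstore-user',                          # hypothetical
  openstack_api_key:  'secret',                                     # hypothetical
  openstack_auth_url: 'https://keystone.example.com:5000/v2.0/tokens' # hypothetical
)

# In Fog, containers are "directories" and objects are "files".
# The Cloud Controller stores droplets, buildpacks etc. this way.
droplets = storage.directories.create(key: 'cc-droplets')
droplets.files.create(
  key:  'my-app/droplet.tgz',              # hypothetical object name
  body: File.open('droplet.tgz')
)
```

Because the same Fog interface also speaks S3, switching the provider really is mostly a matter of changing the endpoints and credentials, as we said.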
Yeah, now coming to the static IPs. As I said before, we tried to use DNS service discovery. Therefore, we deployed a Consul cluster in parallel to our Cloud Foundry system. Consul is a tool you can use for service discovery, and it provides a domain name system that resolves hostnames to IP addresses. So if a host gets a new IP address, the assignment to the hostname will be updated. For that, we created our own BOSH release for Consul and deployed a Consul cluster with five Consul nodes and also two Dnsmasq servers; I'll say a little bit more about those later.

First, why we are using Consul. Consul has a few different components: the Consul servers, and the Consul clients running on the nodes. The Consul servers act as a service registry. Services can be registered or updated there by any client, which sends the information to the Consul servers, and it also enables you to discover services and virtual machines via DNS. Now, the Consul clients, as I mentioned, are co-located on each virtual machine we deploy. So when the Consul job starts, it automatically sends its information to the Consul servers, and so you have the hostname associated with the IP address. You can then use the hostname to access virtual machines and backing services. One advantage of this solution is that if one of the nodes gets recreated, the new virtual machine gets a new IP address, Consul starts again, registers the new IP address under the old hostname, and you have no problem accessing your virtual machine again.

So let's talk a little bit about why we are also using the two Dnsmasq servers in our Consul cluster. You can only register three DNS servers in the resolv.conf file on each virtual machine. One slot is taken by the PowerDNS of the BOSH Director, which can be used for internal MicroBOSH DNS hostnames, so you have two slots available. There you could just set one Consul node and one public DNS server, but then you have the problem that you still have two single points of failure: if one fails, it doesn't work properly. So we decided to use two Dnsmasq servers, with our own BOSH release. The two Dnsmasq servers are configured in the resolv.conf and act as load balancers, and they also do some health checks.

Another advantage of this Consul cluster comes from the service framework we built, which also uses hostnames for the connections between applications and service instances. So we can just use the Consul cluster we deployed here for the applications as well. That means if you create a service and bind an app to it, your app gets the hostname and connects via the hostname, and so the failover still works. For this to work, you have to update the resolv.conf on the DEAs so that the hostnames can be resolved correctly.
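To make the registration-and-lookup flow a bit more tangible, here is a minimal sketch of the two sides of such a setup, with made-up names. It registers a service with the co-located Consul agent over Consul's HTTP API, and then resolves the resulting service hostname through the agent's DNS interface directly (port 8600), instead of going through the Dnsmasq servers as in our installation:

```ruby
require 'net/http'
require 'json'
require 'resolv'

# 1. Register a service with the local Consul agent (HTTP API, default port 8500).
#    "postgres" and port 5432 are illustrative, not our actual service definition.
uri = URI('http://127.0.0.1:8500/v1/agent/service/register')
request = Net::HTTP::Put.new(uri, 'Content-Type' => 'application/json')
request.body = { 'Name' => 'postgres', 'Port' => 5432 }.to_json
Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }

# 2. Resolve the service hostname to its current IP address via Consul's DNS
#    interface. If the VM is recreated and re-registers, the same name now
#    resolves to the new IP -- which is exactly the failover behavior we wanted.
resolver = Resolv::DNS.new(nameserver_port: [['127.0.0.1', 8600]])
puts resolver.getaddress('postgres.service.consul')
```

In our setup, step 2 happens transparently: the Dnsmasq servers in resolv.conf forward the query to the Consul cluster, so applications just use the hostname.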
So that was it from us. I hope we got you interested and curious about the topic. I would like to thank you for being here; it was, for both of us, our first conference talk ever, so thank you for your support. And yeah, if you have any questions, please shoot.

At the moment it's not multi-tenant, so we don't have these problems. But I must say I'm not really into exactly how the SSL gateways were built, because we started with them at the very beginning. If you want more details, I can ask my colleague who was supposed to hold this talk, and he can give you more details about it.

When we started, there was no Consul in Cloud Foundry, so we built our own cluster in parallel. Now we are doing a lot of experiments trying to use our Consul cluster within the Cloud Foundry system as well, but at the moment we have both. That means we have a Consul server in Cloud Foundry for the internal resolving, and also our Consul cluster for resolving the hostnames of the service instances. Exactly. Yeah, we connected them via our Dnsmasq servers so that they are queried in a different order: depending on the job, which has the Consul in its resolv.conf, we either ask the internal Cloud Foundry Consul first, or, on the other virtual machines, the Dnsmasq.

Actually, we are not as big a team as we might be, or would love to be. But we try to keep up with the pace of Cloud Foundry, which, as was mentioned on the first day, releases every two weeks; at the moment we are at version 230. And the size of the system always changes, depending on how many applications are running. I think when there are a lot of applications and customers using it, there are about 80 virtual machines, and a lot of them are the DEAs, separated into three availability zones, so we have about 60 DEAs and then also three Cloud Controllers and things like that. We are between 10 and 15. So, 15, but not all working full time. Any other questions? Well, thanks a lot. Thank you very much.