Welcome. My name is Christoph, I'm from Scaleup in Germany, and this is Frank. We are here today to talk a little bit about how we built our OpenStack-based cloud at Scaleup and what we do to keep it running, all of that from the perspective of being a rather small company doing it with a very small team. To give you a rough overview of the agenda: I'll talk briefly about our team at Scaleup and the Scaleup story, how we actually got started with the whole cloud, then get into how we built our first OpenStack cloud and took it into operations, and then some additional topics. We'll briefly talk about billing and metering. We'll talk a little bit about upgrades and the good and bad things we learned by doing them. And last but not least, if we have time, we'll briefly touch on the topic of Ceph-based block storage. I'll try to leave some time for questions at the end. A little bit about ourselves: Frank is the COO of the company, so he's running all the operations at Scaleup, and he's been working on OpenStack since 2012. I myself am the co-founder of the company, which was started in 1998, so quite a while ago, and I've been involved with OpenStack since 2011, when I went to my first OpenStack Summit in Santa Clara. We have a third colleague who is not here today; at least one of us needs to keep working. Joe is the newest addition to our team, and he's been working on OpenStack since last fall. A bit about Scaleup, to give you a perspective on who we are and what we do. As I said, my best friend Gijan and I founded the company in 1998 in Germany, and our first offering was plain web hosting. Back then, that was a big thing; nowadays it's nothing special anymore. And we grew from there. Our headquarters are in Hamburg, Germany, and we also have offices in Berlin. We currently operate four data center facilities, two in Hamburg and two in Berlin.
The reason for that is that we do have some customers who need a way to run redundant systems across multiple data centers; that's why we have two data centers in each city, to enable exactly that. Maybe some of you were at the OpenStack Summit in Austin last year. We gave a talk there as well, and back then our main topic was how we actually built our cloud and how we got there. So I'll just briefly reiterate that for all of you who haven't heard that presentation. We actually got started with cloud computing back in 2009. Around that time, we had quite a few customers and potential customers approach us wanting us to run their server infrastructures in a more flexible way. Back then, there were a lot of new startups around all those Web 2.0 topics, e-commerce companies. All of them had great ideas and small budgets, and they all wanted a server platform which could scale if they needed it to, without spending too much money on it. So we figured there needed to be a way for us to help those customers more efficiently. We already did quite a bit of virtualization of our infrastructure back then, but we wanted a more automated way to help those customers. I stumbled upon a company called 3Tera, based out of California. They had a product called AppLogic, which back in 2009 was probably one of the first cloud computing platforms on the market. It was a small startup; there were, I think, 10 or 15 people back then. They built a cloud platform on top of Xen as the virtualization layer and some other open source software components, and actually built their own operating system, which you installed onto a couple of servers. Those servers formed a grid on which you could easily provision applications.
It actually had a very cool user interface, which more or less looked like Microsoft Visio, where you designed your application on a canvas by connecting all the dots. You pushed a button, and the software provisioned all those virtual machines and interconnected them with each other. So it was a very interesting platform, and we started using it in production in 2010, after some extensive tests, and launched a cloud offering on top of it. Now, we probably would not be here if that had all worked out. This was proprietary, closed-source software, and the company got acquired at some point. After the acquisition, they didn't really develop the product any further, and it lacked traction in the marketplace. So we had to make a decision about what to do next, and that leads me to the topic of OpenStack and how we got started with it. Around 2012, we were still running this AppLogic-based cloud platform, but we had a few customers requesting an S3-like storage platform from us. We looked around in the marketplace and decided to use OpenStack Swift to build such a platform, in order to offer cloud storage services just as Amazon does. So we built a Swift-only OpenStack cluster whose only purpose was to offer an object storage service, and that's actually how we got into OpenStack. By 2013, it was obvious that AppLogic was not going where we wanted it to go, and that's when we decided to go fully into OpenStack. So, to talk a bit about how our OpenStack setup is actually built: I don't know, do you want to take over from here, Frank? Yeah, OK. We've now been running our OpenStack cloud for three years. It ran in addition to the AppLogic cloud, which we were still running when I began building it. It took only three months to get it to production.
And from the beginning, I focused very much on an HA setup, because I knew we do not have too many employees, and only a few people who are able to work with OpenStack. So HA is the number one issue to have in mind when you build up this environment. This is a picture we used last time; it was our initial setup. You see we had the object storage attached to it. Then I had two controllers in an HA cluster setup; now I would tend toward three controllers. We had a three-server Galera cluster with MongoDB for metering and everything running on it. We had two Neutron nodes. And block storage was provided by HP MSA SANs, because we weren't able to start with Ceph from the beginning; they were fed into the cloud by the Cinder cluster we had. This is a setup which basically still exists and has been in production since then. It's not too big. We are a small company, and this is not the only thing in our portfolio; we have data centers, we have customized environments of dedicated servers, and things like that. So we're now running 14 compute nodes inside that cluster, and we have about 120 to 150 instances, the number keeps changing, running on this cloud. A large share of these instances is our own infrastructure. I would guess we save about 40 dedicated servers, 40 pieces of metal, just by migrating our own environment into the cloud. The rest is used by our customers. And, well, to briefly come back to the previous slide: the initial setup was built on Ubuntu Trusty, and one of the goals was to use existing hardware infrastructure that we had. That's one of the reasons why we integrated existing storage arrays into this infrastructure. Now, we did see some issues with that over time. For example, we had a few compute nodes integrated into the setup which had AMD CPUs next to Intel-based CPUs. And in day-to-day operations, it was running fine.
But we did run into trouble from time to time. We had issues migrating instances from there onto the Intel-based platform, and we also ran into some issues while upgrading our OpenStack infrastructure. So last year we decided to get away from this heterogeneous approach and make sure that we use only one vendor when it comes to the servers. We had a mix of HP-based servers and Dell-based servers, and in the end, well, it was supposed to work, but we did have issues from time to time. So this is something that we've already changed. Another thing that we are about to change: we're building up a new block storage environment based on Ceph, and we'll talk about this later. Currently we use multiple one-gig connections for this infrastructure, but this will also change to a 10-gig-based environment. So, coming to the topic of actually keeping all of this running. As you've seen or heard, we're a small team; we're essentially three people, and I can't really count myself as a full member of the team because I'm not really involved in day-to-day operations. So how do we do that? Before I begin, let me say first that this is just how we did it. There are decisions involved, to go this way or that way, which doesn't mean another way is bad or our way is the best way to go. We just want to describe how we manage to keep the cloud operating with very, very little downtime. I think we had one major issue in three years, which was related to two compute nodes and the customers on those two compute nodes, and it lasted for about 10 hours. This was a major issue during an upgrade procedure. That was the moment I decided to get rid of the AMD CPUs, because the servers in question were those two compute nodes.
There was no problem with migration as long as I put those two servers into their own host aggregate; migrating from one to the other was no problem. But there were other things with the hardware which I didn't want to care about any longer. So why do we have a small team? Well, it's really hard to get OpenStack experts. It's even harder to get them over in Germany; there aren't many, and if you get them, they are really expensive. And the other thing is that the really expensive OpenStack experts very often don't want to do the normal system administration work, which in a small company everybody has to be part of. So what to do? In the end, the OpenStack cluster is just a bunch of servers; you have to focus on that view for a moment. A lot of the problems you have with your cluster are not OpenStack-related problems, but really normal cluster trouble, which might be a broken network interface, a broken cable, some database problems, and so on. But in most companies, especially the smaller ones, whenever there's any problem related to OpenStack, all the colleagues are screaming for the OpenStack professionals to solve it. Maybe after a while they hand it back to the database experts or the networking experts, because they found out, okay, it's not related to any OpenStack problem; it's just a normal networking problem or a normal database problem. But usually this small team is asked first. So how do you get out of this dilemma? You have to consider yourself your own customer. You have to build up an escalation matrix like you do with your customers, so that you have a clearly defined escalation whenever a problem occurs: who is alerted first? And only if they're not able to solve the problem is the responsible OpenStack expert alerted, not the other way around.
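As a toy illustration of such an escalation matrix (the team names and problem categories here are invented, not Scaleup's actual setup), the routing logic is essentially just an ordered chain of responders per problem category:

```python
# Hypothetical escalation matrix: each problem category has an ordered
# chain of responders; the OpenStack experts sit at the END of the chain
# and are only paged when earlier levels could not resolve the alert.
ESCALATION = {
    "network":  ["ops-oncall", "network-team", "openstack-team"],
    "database": ["ops-oncall", "dba-team", "openstack-team"],
    "compute":  ["ops-oncall", "openstack-team"],
}

def next_responder(category, failed_levels):
    """Who to page next, given how many levels already failed to fix it."""
    chain = ESCALATION.get(category, ["ops-oncall"])
    return chain[failed_levels] if failed_levels < len(chain) else None

# A broken cable is plain "network" trouble: the generalist on-call is
# paged first; the OpenStack team only sees it at the last escalation step.
print(next_responder("network", 0))  # ops-oncall
print(next_responder("network", 2))  # openstack-team
```

The only point of the sketch is the ordering: the expensive specialists are the last link in the chain, not the first.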
This might sound very simple, but in very many companies I was talking to, it's exactly that problem. How can you achieve this? We use Check_MK as a monitoring tool in our company, which is just a framework on top of a Nagios core. First of all, you do something very simple: you build up host groups, you build up service groups, and you have certain teams which are responsible for this or that host group, for this or that service group. The next thing is to maybe use some business intelligence. In Check_MK, if you set up business intelligence, you first define packs; a pack is just a container, a naming for a set of BI rules and aggregations that belong together. And the thing is, if you're not familiar with business intelligence: you don't get alerted because one single process is not working or one single server is having trouble, but you get alerted whenever a setup is not delivering, is not working correctly as a whole. Which means if you've got three nodes hosting the controllers and one of those nodes is not working, you don't get alerted, because you know you've got your HA setup and your OpenStack cloud is not in trouble. So why bother the OpenStack guys? They can have a look the next day, maybe, but first of all it's just the normal stuff: take a look at what isn't working, whatever the problem is. This is of great help anyhow. Another very important thing is the database. A lot of the problems I ran into, we ran into, were database related. One example: we use an HA setup as it's described in many tutorials and much of the OpenStack documentation, just a Galera cluster, starting with three database servers in a master-master setup. And what we did, which is what we do for many of our other customers, non-cloud customers, is load-balance the MySQL requests with HAProxy, with a decent backend check configured inside HAProxy. But this didn't work too well.
We had a lot of problems, and we didn't understand what was happening, what was going on. I think it was finally in Austin that I talked to the Galera guys, and they told me something I didn't know: Keystone, and several other pieces of OpenStack software, don't perform too well when the queries get balanced across different backend nodes. This is not a fault of the HA setup, it's not a fault of Galera, but Keystone, for example, doesn't like getting connected to a different backend every now and then. So what we did is change the setup: we installed MaxScale in front of the database cluster, in an HA setup managed by Pacemaker, which is really easy. And we did not split reads and writes, as some colleagues proposed would be best; what I did is configure connection-based routing, which means every connection stays on the same backend. In the end it balances pretty well, because you've got roughly as many connections from one server to one backend as from the other server to the other backend. And the result: all these mysterious problems, Keystone not working, error notices like "no valid host found" or your networking quota being gone, all of this vanished. Okay. So, about logging. Another thing that might sound obvious, but nonetheless it's important: have a single pane of glass looking at all those logs, aggregated across your cluster. What we do here is use an ELK cluster. Do you want to go into more detail? Yes, it's really easy. It's a standard ELK cluster setup, but I would recommend that companies starting with their own cloud really start early with a centralized log server, whatever kind of logging server. If you try to read logs on six, seven, eight, nine different nodes from who knows how many different services, you go crazy really soon. This is of great help, so you should start with centralized logging early. Kibana, Logstash, and Elasticsearch are a great toolset, and it's an easy setup.
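As a sketch of what that kind of centralized pipeline can look like (hostnames and the port are placeholders, and the grok pattern only roughly follows the default oslo.log line format, so treat it as a starting point rather than a drop-in config), a minimal Logstash pipeline might be:

```conf
input {
  beats { port => 5044 }          # log shippers running on the OpenStack nodes
}

filter {
  grok {
    # Roughly matches oslo.log lines: "<timestamp> <pid> <level> <module> ..."
    match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{NUMBER:pid} %{LOGLEVEL:loglevel} %{NOTSPACE:module} %{GREEDYDATA:logmessage}" }
  }
}

output {
  elasticsearch { hosts => ["es1.example.net:9200"] }
}
```

With fields like `loglevel` and `module` extracted, Kibana can then filter, say, all ERROR lines across every node and service at once.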
The ELK stack is also horizontally scalable really easily, so I would recommend starting with it early. So, what to do when all hell breaks loose and the cluster fails? That's certainly something we thought a lot about. Having a small team, you really need to make sure, and I guess we succeeded in this, that you know what to do when something goes wrong. What we did for all the other team members, the ops guys who are not part of the OpenStack team, is provide a wiki where we document whatever we think is necessary. And whenever there are issues or problems that we come across, we communicate them back to the rest of the ops team at the company, even though they are not really involved in the day-to-day operations of OpenStack. Nonetheless, we saw that it helps them to better understand what's going on. In the end, one of the most important things we did, and we already talked about that, is to have an escalation matrix, as we call it, in place, where it's clear, when certain things go wrong, who gets the alarm, who is going to fix it, and when to involve the OpenStack team, when to call for help. That's one of the most important things. To achieve this, we have a page, which is in German, on what to do when all hell breaks loose, so that all the colleagues can just go to that page and check: okay, this is happening. And what I did is write a lot of scripts and tools to find out, for example, what's happening on your Neutron cluster. What happened to us, and this was a lesson learned from an incident, was that one Neutron node had broken down, and not all of the L3 routers realized that the node in question wasn't alive anymore.
So it's really easy to just change the agent responsible for the router inside the database. But if you don't know what to do, you can search for ages until you find the solution. So what I did is just a script which checks: okay, is this router on one of the surviving agents? And if not, it changes the entry inside the database. There are a lot of things like this. Well, I mentioned the bunch of servers. You see, a lot of my colleagues are familiar with KVM virtualization, with all the libvirt tools and so on, but they don't see the connection. At first sight, they don't see that this is just a hypervisor running KVM, this is just libvirt, and you have the whole description, everything necessary, inside an XML file. And if you are really in trouble and there's no way to fix it with OpenStack's onboard tools, you can help your customer by migrating any instance to another still-alive hypervisor. Afterwards, the expert can come and fix the OpenStack setup again. You will have a customer with a living application; he will not be able to manage it via the API or the dashboard, but that is of minor importance in that very moment. So what I try to teach my staff is: just see it as a bunch of KVM hypervisors. You work with those a lot in your daily work, and if things are that bad, forget about the OpenStack bit. We will care about the OpenStack bit afterwards; you can fix everything afterwards, but you have to get this application running, and that's easy to achieve. Okay, I think we have to hurry up a bit. So upgrades is the next topic; maybe we'll go a bit faster here. The initial setup we did for Swift was based on Icehouse, which was kind of a standalone setup. Then we built a new OpenStack cluster for the compute part, which we based on Juno back then. And at some point we integrated those two environments into one.
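Coming back to the Neutron router rescue for a moment, the decision logic of such a script can be sketched like this (the agent records mimic what `neutron l3-agent-list-hosting-router` reports, but all names here are invented, and a real script would reschedule via the Neutron API or, as described, the database):

```python
# Sketch of the rescue logic described above: find routers whose hosting
# L3 agents are all dead and pick a surviving agent to adopt each one.
def reschedule_plan(hostings, all_agents):
    """hostings maps router id -> list of agents currently hosting it.
    Returns {router_id: new_agent_id} for routers stuck on dead agents."""
    survivors = [a["id"] for a in all_agents
                 if a["alive"] and a["admin_state_up"]]
    plan = {}
    for router, agents in hostings.items():
        if survivors and not any(a["alive"] for a in agents):
            plan[router] = survivors[0]   # naive choice; could balance load
    return plan

agents = [
    {"id": "l3-agent-1", "alive": False, "admin_state_up": True},
    {"id": "l3-agent-2", "alive": True,  "admin_state_up": True},
]
hostings = {
    "router-a": [agents[0]],   # stuck on the dead agent
    "router-b": [agents[1]],   # fine where it is
}
print(reschedule_plan(hostings, agents))   # router-a moves to l3-agent-2
```

The scan itself is trivial; the value, as described above, is that the on-call colleague does not have to work out the logic during an outage.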
And last year we upgraded this environment from Juno to Kilo, and then to Liberty. We came across a few issues, as already mentioned. One of the issues was networking related, and we never really discovered the true reason for it: two hosts in the cluster were having issues, and all the rest were fine. Nonetheless, that upgrade went pretty smoothly in the end. It went pretty smoothly, yeah; most of the stuff had no downtime at all. Which brings us to the next step: we are now in the process of doing the next upgrades, and here we actually see more problems. We want to upgrade to Newton. We already have test environments running on Newton, and they run fine. Maybe you want to elaborate a bit on the issues we see there? Well, yeah. One issue I ran into in our test environment was this: logically, you have Neutron and Nova running together, at least on the compute nodes, and in normal setups you would also have them running together on the controller. And I realized that I wasn't able to upgrade them in either direction, neither first Nova and then Neutron, nor starting with Neutron and then Nova. Because both are fed by the same Python stack, and if you don't have a virtual environment for each of the two, some dependencies get upgraded which don't really cope with the older version of the other program. I will maybe try this again. Well, we have two different ways to go, and I'm discussing it now with the founder of my company: my aim is to make the test environment our new production environment and migrate all the stuff over to it. Except for the three database servers, our test environment is exactly the same as our production, only with two compute nodes. The Newton nodes are even better. I've got this test environment running with Newton right now, and it's really nice, really smooth. I like it a lot.
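One hedged way around that dependency clash, sketched here with throwaway paths, is to give each service its own virtualenv, so that upgrading one service's Python dependencies can never break the other:

```python
import tempfile
import venv
from pathlib import Path

# Hypothetical sketch: one isolated interpreter environment per OpenStack
# service. The base path here is a throwaway example, not a real layout.
BASE = Path(tempfile.mkdtemp(prefix="openstack-venvs-"))

def make_service_env(service):
    """Create an isolated virtualenv for one service and return its path."""
    env_dir = BASE / service
    venv.create(env_dir)  # add with_pip=True to install packages into it
    # The service would then be installed into *its own* interpreter, e.g.:
    #   <env_dir>/bin/pip install <service>==<pinned version>
    return env_dir

for svc in ("nova", "neutron"):
    print(make_service_env(svc))
```

With that in place, Nova's stack and Neutron's stack can be upgraded in either order, since neither shares site-packages with the other.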
I've integrated our newly built Ceph cluster into it with a 10-gig backend, and we are providing all images, all ephemeral storage, all block storage, and backup from the Ceph environment, and we're really satisfied with it. So I'm thinking about, instead of upgrading the old production, just making the test environment the new production and the old production our test environment, because there are several things at play. It's not only upgrading the OpenStack software: we are running all of this on Ubuntu Trusty, and we will want to change to Xenial. So on top of the OpenStack upgrade procedure from Liberty to Mitaka to Newton, we would have to upgrade the servers as well. The other thing is that we've got one customer who really extensively uses Load Balancer as a Service. In the production environment, this is still LBaaS version one, which no longer exists since Mitaka changed to version two, and as far as I know there's no way to migrate seamlessly from version one to version two. So I would have to find some way, with a slight downtime for the customer, to migrate this to version two. So there will be some downtime anyway, and I think we can have it more controlled and more customer-friendly if we do it by migrating. Our provider network is mainly based on VLANs, so it's pretty easy to just interconnect the backends, the provider networks, of the old and the new cloud. The only thing is that we have to find a way to move the public IP addresses, the floating IP addresses, because we will not be able to provide them from both sides at the same time. That's about it. Yeah, very quickly, billing and metering, because that's always the question. We actually have different things in place, and that's one recommendation if you only have a small team and not so many resources. We have some self-service-based offerings; they are fully automated and integrated into our billing system.
Then we use Ceilometer to generate reports; that's something we scripted. They generate reports every month for the cloud storage usage, and the compute usage we actually extract out of Nova. Our current plan is to change this, because the Ceilometer data is just growing too much and generating too many entries, and to migrate to Gnocchi with Ceilometer. Well, as with almost all environments three years ago, we set up Ceilometer with a MongoDB backend; most telemetry setups were built like that. And it's a common problem that Ceilometer generates a lot of noise, that there is quite a bunch of entries inside the MongoDB, and very soon it gets really hard to work properly with such a large database. I already have the whole Mongo stack running on the cloud itself, nine servers: two shards with three config servers. You might ask: okay, the metering data is running in the cloud itself, so what happens when the cloud doesn't run? Well, what do you want to meter when the cloud is not running? So it's okay to have it there. I also have some tests running with Gnocchi. It's really fine now, but it didn't work out too well at first: the last time I tried it in production was with Kilo, and that was not very successful, but with Liberty and Mitaka it's working much better. And that's what we're going to do. Yeah, so I guess we are out of time, so unfortunately we have to skip the Ceph-based stuff, and we have maybe a minute or two for questions. If not, you can approach us afterwards anyway. So, any questions? There's a question coming. Two questions. Have you had any security scares, and if so, how did you handle them? And the second question, sort of a classic one: if you had to do it all over again, what would you do differently? I guess the first question is easy to answer: no, we didn't have any security issues. Second question.
Well, we started building the setup almost four years ago, and a lot has changed around OpenStack since then. So yeah, maybe we would do it differently, but on the other hand, we like to really understand what we do, and the way we built our OpenStack is the way we approach other things as well. So maybe we wouldn't change that much. Yeah, the thing is, well, I didn't get to this point earlier: you can use DevStack for testing, but if you're in a small company and you start to build up a production cloud using the ready-made Puppet modules or the Ansible stack or anything like that, you get to a working and satisfying result very fast, but it's hard to gain the knowledge you may need later on, when things are not working the way they were planned. So the other thing is: learn it the hard way. The gain is that you have more understanding of what is going on under the hood, and this helps you a lot in daily operational work. Thank you. You're welcome. There's one more. How large is your environment in terms of virtual machines and customers? Well, I think it's just around 20 customers; it's very small. Virtual machines, it's between 120 and 150 right now, changing. And we are, as I said, maybe our own second biggest customer. There's one other customer who's really extensively using the cloud. He's the one with all the load balancers; he's got his own images, his own flavors, his own Kubernetes stack and all this. Sometimes he's number one in instance count and sometimes we are. But yeah, imagine: it's 14 compute nodes plus another 10 or 12 servers, and our company alone no longer needs to run 30 or 40 bare-metal servers, because we have all this stuff in the cloud. So it's a win situation. The thing is, why are we here?
We are here because we want to encourage small companies to give it a try, to dare to use OpenStack, to dare to use cloud environments. And yeah, that's what we want to tell you guys. With that, we have to close, and I guess there's beer in the marketplace as well.