OK, so I'm Max, and today I'm going to talk about what we've learned operating Bluemix. The funny part is that a lot of the people who operate Bluemix are in the audience, so they could probably join me up here and give this talk. But I'm here representing them, so I'll do my best. Another person who isn't here, and with whom I've given this talk before, is Matt Zuchiu. He's part of the IBM China Lab, and he's the lead developer for our CPI group. This talk covers what I've learned indirectly, as the open source liaison, from working with my colleagues here, and with the product manager in the back, well, the director, Fabio. He doesn't have the long hair, but that's him.

All right, let's get going. This should be fun and interactive, so feel free to stop me. Obviously I have a deadline to stop, and then you all get to go to lunch, so if the discussion runs too long, we'll move on. But do ask questions. I've given this talk before, it was interactive, and you'll see why: the format helps it.

So, first thing: what's Bluemix? I'm not going to bore you too much, but it's important to set the stage. It's a certified PaaS. Well, we're getting there; there are other aspects of the certification that we still need, but it's a moving process, and our intention is definitely to support it. I'll share some public statistics that my colleague Simon Moser mentioned in the keynote today. The first thing we mentioned is that we have three kinds of deployments: public, dedicated, and private. Public is what you get when you go to Bluemix.net. Dedicated is essentially a slice of a SoftLayer environment for a particular customer: each customer gets their own environment, but we manage it for them. And private is on-premise, so very similar to Pivotal's PCF in some ways, and other companies have that too.

As Simon mentioned this morning, we have over a million registered users. Those are the public numbers; if you want to know more, talk to Fabio. I'm not allowed to say more than that, but it's getting pretty big. We have 500,000-plus running apps and hundreds of services. And we run on SoftLayer. It's a source of both strength and pain, but let's leave it at that. We've also looked at OpenStack as a potential alternative, but OpenStack has its issues too, so don't get me wrong: everybody has issues when they start operating at scale.

And it's a worldwide team, right? Developers and also DevOps. A lot of our colleagues here in Germany, in Böblingen, do much of the development, and they're in the room. But we also have people doing DevOps, for instance in Rome, and people in North Carolina as well as in China. It's a worldwide effort.

So what have we learned? In the US there's this guy called David Letterman, now retired, who loved doing top-10 lists. Those work really well, especially when you have a short amount of time and a very focused set of things to talk about. So that's what we're going to do: a top 10, going from 10 down to 1. And that's why this format is interesting for you if you have questions.

So, number 10: I call this "change." Fabio and his team and a lot of other people run a tight change request process. It has its bad parts.
And now you'll see the format of each of these top 10: I start with the bad, then the good, then the lessons.

So what's bad about this? Obviously, it makes us slow. We have to go through a tight change process where people like Fabio, Richard, and others have to approve any change to the system. That can make things a little hard. But we're a huge company; remember, we are 300-something-thousand people, and I don't know how many work on Bluemix, but sometimes it feels like all 300,000 do. And because it's worldwide, somebody is always in the middle of their sleep. What do you do? You have to deal with it, right? So that's a problem, but there are some good things.

What's good about it is that if somebody in Rome decides they're going to make a change, and they've figured it out, before it actually gets approved it has to go to the US, and then the folks in China have to take a look. And people are traveling, right? So before anything actually happens, there's this process that coalesces the change and makes sure it's actually good. It makes you slower to adopt new changes, but it limits any problem, and there are more eyes reviewing it. So that's the good part.

What we found is that if you can afford not to have a large, distributed team, obviously do that. If you can have everybody in San Francisco like Pivotal does, more power to you. IBM can't: we're "I" for International for a reason. We are global, and we are a very large company, so we have to go worldwide. That's just the way it is. Yes, Chris? [inaudible audience comment] Yeah, and obviously there is going to be that too, right? There are customers that like this; they are happy to slow down. So that's number 10.

Number nine: audit checklists. In some ways we audit things a lot, and that comes with a bad part: some of the audits are manual, and it can take a long time to check everything. The good part is that canary-based deployment, which is a feature of BOSH, helps, because you can run your deployment, verify that things aren't quite working, and backtrack; there's a small sketch of the canary settings at the end of this item. And we also do some audits at steady state: operating the system and verifying, as it runs, that things are working.

The lesson here is that we ended up creating a home-grown tool called Doctor. It comes from the China team, and it essentially allows you, well, it's getting there, to do continuous monitoring and audits of all the deployments. I can't reveal the numbers, but you can imagine we have X customers with private deployments, then maybe X-squared dedicated ones, and then public, which is a smaller number. All of these feed into the same tool, so the teams all over the world can get a view of it.
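Since canaries came up: that behavior lives in the update block of a BOSH deployment manifest. A minimal sketch, with illustrative values rather than our production settings:

```yaml
# update block of a BOSH deployment manifest; values are illustrative
update:
  canaries: 1                      # update a single canary instance first
  canary_watch_time: 30000-240000  # ms to wait for the canary to report healthy
  max_in_flight: 4                 # then update at most 4 instances in parallel
  update_watch_time: 30000-240000  # ms to wait for each updated instance
```

If the canary never becomes healthy within its watch time, the deploy stops right there, which is what makes the verify-and-backtrack approach workable.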
Number eight: log checking and monitoring. The bad part of this is that because we are such a large environment, or really such a large set of environments, the logs rotate very quickly. So you end up with a problem, and if you don't catch it in time, you might lose the evidence. And it's a large amount of data, in the terabytes. How do you keep it? How do you manage it? Obviously, cloud storage helps: with something like S3, for instance, you can keep large amounts. But over time even that becomes impossible. What happens two years from now? Bluemix has already been in operation for about a year and a half, so it's getting close to that. And then how do you separate the data? How do you query over those logs? That's a huge issue we've experienced.

The good part here is the way the system is designed. It doesn't always work, so don't get me wrong, but the design point is to allow streaming and to allow different sinks. When you do that, you can feed the data into your own set of tools to do the kind of queries you need. Working with Pivotal, for instance, I've seen Splunk used as an example; at IBM, we have our own home-grown solution. So I don't have a particular solution for you, but the lesson learned is that you definitely want to introduce some level of log tooling early, because as your system grows it becomes an issue. How do you debug? It becomes a problem.

All right, moving along: number seven. I actually like Drake a lot. I don't think he's always sad, but that was a sad phase of Drake that I thought looked cool. So what is this? bosh-init rules. I say this because I'm actually the third contributor by lines of code to bosh-init; you can just go to the repo and you'll see. So I'm not complaining about it, but I'm saying there are some issues with new stuff. For instance, with bosh-init it can be hard, and I think we're solving this now, to recreate a director. And because it was in active development, we've had lots of frequent updates, and keeping up with them was an issue for us. That may not be an issue for smaller companies, but we're a large company with a history of being very good at keeping things running: we deliver something to a customer, and 20 years later, as long as they're still paying the bills, we'll support it. That's a great thing, but it's also a source of pain. When you're in that mindset, frequent updates can become an issue.

The good part is that in general bosh-init has been great. It's moving toward the new BOSH CLI; if you haven't looked at that, it's going to be a single binary, and it really works well. And as part of this, we've also introduced external CPIs, which, as was mentioned today, are essentially a source of growth for this movement, for Cloud Foundry; I'll show the rough shape of the CPI contract right after this item. If you think about all the other cloud PaaSes, none of them have this one feature that we have: we can work across different clouds, and as far as I know it's close to 20 CPIs now. And right before this, I came from a talk on multi-CPI, where we have a meta-CPI that talks to multiple clouds. So it's a great source of strength for us, and we definitely love that. The key is better planning for things like these changes; it's important for you to do that, otherwise you run into issues.
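To make the external CPI point concrete: a CPI is just an executable that the director invokes, passing one JSON request on stdin and reading one JSON response from stdout. Here's a rough sketch of the shape of a create_vm exchange, rendered as YAML for readability (the wire format is JSON, and the arguments are abbreviated):

```yaml
# request the director writes to the CPI's stdin
method: create_vm
arguments: [agent_id, stemcell_cid, cloud_properties, networks, disk_locality, env]
context:
  director_uuid: "..."
---
# response the CPI prints to stdout
result: "vm-cid"  # the IaaS-assigned ID of the new VM
error: null       # or an object with type, message, and ok_to_retry on failure
log: ""           # free-form debug output
```

Implement that contract for a new IaaS and most of the rest of BOSH comes along for free; that's why the CPI count keeps growing.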
All right, the next one is what I call custom software. This is actually a super important one. The point is that you want custom releases. In other words, if you want any piece of code in your organization to be part of the environment, whether it's installed on the stemcell, on all the VMs, or running as a separate deployment, like, say, a service, don't try to hack around BOSH: adopt it and embrace it.

What I mean by this is that some people at IBM created our own hacky stemcell, put a little agent running in there, and then said, well, maybe some of those deployments should have our own job. And it doesn't really work with BOSH. When you start doing this, you start breaking the abstractions, and it becomes much harder to upgrade and to keep the system healthy. So create your own custom releases instead.

First, custom software is going to happen when you're a large company. For instance, we have a security release at IBM, because the security folks think they know better, and maybe they do, I don't know. Who am I to say? I'm not an expert in security. So I just deal with it: they create their own release. That's going to happen. You can try to bake things into the stemcell, but that's usually a bad idea. Create your own release and add it. These days, BOSH has this thing called add-ons, where you can easily use the runtime config to deploy those custom bits onto all VMs; see the sketch at the end of this item. You definitely want to do this. And with external CPIs and co-located releases, you pretty much solve this problem, right? There's also a lot more information these days about creating releases, which used to be an issue. Maria, who's essentially the top committer for BOSH at Pivotal, has written really great presentations about how to create releases, so I urge you to go look. Toward the end I have a couple of slides pointing you to where to go, but if you look at bosh-init, you'll see at the top some pointers to those slides. You definitely want to create your own release; it's not easy, but there's much more information now.

So, a simple lesson: do not have your own custom stemcell or your own custom BOSH. Use the extension points: the CPI is an extension point, and the release is an extension point. But don't try to do more than that, because then you can't evolve with the rest of the system.
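Here's that add-ons sketch. The release and job names are hypothetical stand-ins for something like our internal security release; the shape is the BOSH runtime config mentioned above:

```yaml
# runtime.yml -- applied with: bosh update-runtime-config runtime.yml (new CLI)
releases:
- name: acme-security        # hypothetical custom release
  version: 1.0.0
addons:
- name: security-agent
  jobs:
  - name: security-agent     # job from the custom release
    release: acme-security
```

The director then co-locates that job onto every VM in every deployment it manages, with no custom stemcell involved.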
Number five. This is an important one, and it's still kind of a work in progress: PowerDNS is actually an issue. We highly recommend that you not use PowerDNS, but we don't really have a good alternative yet; we're still working on it. Let me talk about what's bad about it. First, it's a single point of failure, and it's very hard to make it HA; we don't really have a solution for that. So if you're deploying on a cloud that has a DNS solution, obviously use that. Otherwise, you're going to have a bad day.

The good part is that, as I mentioned, some IaaSes support DNS natively. And we're also working on a PowerDNS-less solution for BOSH where we essentially use local DNS: all of the VMs have the information about all the other VMs, kept in sync. Now, that doesn't give you a fully DNS-free deployment of CF, because CF uses discovery as part of it, especially Diego with its discovery of services, and they use Consul, and you can't really replace Consul. But there are various discussions in the open source community about how to do this. So you may still be able to use those other DNS solutions, but maybe the platform will eventually have a PowerDNS-less solution, batteries included.

The lesson learned, and it's a general lesson, is that you really do have to think hard about any node in the system that is not designed to be HA, because it will prevent you from growing in the future. So be careful of that; it can become very hard to remove. I mean, this is based on our experience: we're still struggling to remove our PowerDNS usage.

Number four: updates. Security updates are important, and they're painful, but you've got to do them. Why is that? Well, the internet, as you know, unlike what some orange person in the US thinks, is not a place where you can trust any information. It's exactly where a lot of bad guys and bad girls come to try to attack your system. This is nothing new; you should all know that. What happens is that there are CVEs pretty much weekly. You may not all be aware of them, but I can tell you, Chip and the rest of the Cloud Foundry Foundation, and various folks at Pivotal and IBM and other companies, we know when there is a CVE, and those happen pretty much weekly on average; sometimes you get a quiet week. And when you do get one, you have to be able to get the security fix in. They're costly, but you still have to deal with them, because the last thing you want is to be hacked and end up in the news, like what happened to Yahoo. It's unfortunate: they were already struggling, and then they got hit by this massive security breach. You don't want that to happen to you.

So what does the BOSH team do about this? They release new stemcells and new releases frequently, and recently they've been hardening the entire system for security. And obviously, as I mentioned, at IBM we have our own security release. You definitely want to take this very, very seriously and follow the guidance. If there's one thing you remember from this talk, it's that if you're going into production, being able to deal with security updates is very important. We did come up with a way to make even updating the OS faster; we call it OS reload. It's also a source of various pains, but this is maybe one advantage of that feature.

Number three: multi-BOSH deployment. This is still something we're struggling with, but it's certainly an approach you should think about. When you start with one BOSH director, it becomes essentially your bottleneck, and potentially your SPOF as well. And as you grow, especially when you have multiple environments, that becomes a problem, right? So have a way to do multiple deployments with BOSH: essentially, you run one BOSH and then use it to create other BOSHes for your other deployments, so that when one fails, you don't lose the entire system; there's a rough sketch of this layout after this item. We've seen the director fail. I mean, we've seen pretty much every failure in the book, some of it self-inflicted, but some of it because the software has problems. It's a big piece of code, right? So there are issues. Divide and conquer: that's the obvious solution.

So definitely reconsider your deployment strategy. There are different best practices; Dmitriy is certainly a source of great knowledge, because he talks to everybody and has seen a lot of it, so you want to tune into what he discusses on Slack, for instance. And experiment: as you run your deployment, try other things. Don't get stuck on your own style of deployment; there are other ways of doing things that may be better for you.
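To sketch that divide-and-conquer layout: a small parent director whose only deployments are per-environment child directors, each of which then manages its own environment. This is heavily abbreviated, the names are illustrative rather than our actual topology, and your releases and jobs will differ:

```yaml
# deployed by the "parent" BOSH; losing this child only affects its environment
name: child-director-dedicated-42
releases:
- name: bosh
  version: latest
- name: bosh-softlayer-cpi    # the SoftLayer CPI release, in our case
  version: latest
jobs:
- name: bosh
  instances: 1
  templates:                  # database, NATS, director, and CPI co-located
  - {name: nats, release: bosh}
  - {name: postgres, release: bosh}
  - {name: director, release: bosh}
  - {name: softlayer_cpi, release: bosh-softlayer-cpi}  # job name is illustrative
  persistent_disk: 20480
# networks, resource_pools, and properties omitted
```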
Number two. This is something we've learned from our colleagues in Italy, and it's kind of interesting, because you see cultural things from different places filter into a worldwide team. I know from talking to Fabio and the rest of the team that a lot of the deployments never work 100% of the time. You do a deploy across, say, thousands of VMs, and you think, OK, I'll come back tomorrow, and then it fails halfway. And then you think: when was the last time the whole deployment actually worked end to end? You can't even remember. So what do you do, right? You address the failures and then continue. BOSH allows that model. But the point is that it's never going to be 100% successful, and the folks in Rome, maybe with their cool attitude, say: it fails, just restart it, don't stress about it. And that's the key: the system is so large, especially in big environments, that there will be failures. Don't expect a fully pristine, everything-just-works-perfectly run. Like my car: it's perfect German engineering, and it still doesn't always work like that. It fails, and then you deal with it. BOSH will help you with that, and even when a deployment is failing halfway, the system is still usable. So don't stress too much about it. Embrace failure, in some ways, right? That was the point Simon made today: this weirdness is going to happen, and you just have to adopt it as part of your culture. So trust the tool. Failures are going to be part of the system, things sometimes don't work perfectly the first time, and that's OK, as long as it works in the end.

And number one is to back up your director DB, often and a lot. OK, why do I put this at the end? It's an obvious one. Well, because I've experienced it personally. Last year, trying to go on vacation: I think it was a Thursday, and I'd taken Friday off; this is the story I'll finish with. I get a call at 4 a.m. Chris Ferris tells me: we need your help, buddy, the whole Bluemix is falling apart, and I think it's BOSH. Now, Michael Franco in the back actually does a lot of that firsthand, and he's very good at identifying where the problem is; I'm not sure if he was the one who traced it to BOSH, but anyway, it got down to me, and of course, I'm going on vacation. So what do you do when Chris calls at 4 a.m.? Well, you deal with it. So I'm getting ready to go to the airport in San Francisco, I've got my laptop, I'm going through the code, with very little in the way of logs to go by. And then on the plane, I have the aha moment and see where the problem is. I get back and identify two lines of code in BOSH, in the Go code, where we weren't checking for an error, and potentially that's what happened. The symptom was that the disk on the director, where the director keeps all its data, disappeared. So now we have a director and a deployment that don't know about each other, because the data just disappeared. So first is identifying it, and then, of course, what do you do about it? In the end, the disk was lost, and when you disconnect the director from its deployment, there's not much you can do, although the folks in China came up with something clever, which I'll talk about.

Obviously, how do you solve this, or prevent it from becoming a problem? You back up, right? You back up your director database, so when something like this happens, then worst case scenario, you can always go back to the old backup. Any VM or disk created since that backup, you lose, so that's a real issue, but you keep most of the system.
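One concrete aside on mechanics: the bosh release itself has director properties for scheduling IaaS snapshots of persistent disks, including the director's own disk. A sketch, assuming the property names from the bosh release spec; check them against your release version:

```yaml
properties:
  director:
    enable_snapshots: true                     # snapshot deployments' persistent disks
    snapshot_schedule: '0 0 7 * * * UTC'       # cron-style schedule, with seconds field
    self_snapshot_schedule: '0 0 6 * * * UTC'  # also snapshot the director's own disk
```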
The other problem with backup was that it was very slow for us. We have a large system, and the backup took forever: before the first backup was done, we were already doing the next deployment. So it was useless, right? What happened is that the team from Rome worked with the BOSH team to make the backup faster, partly by not backing up any information about blobs, because they live in a separate place and you can recreate them, which is good. Also, some IaaSes, for instance AWS, have a way to snapshot disks. So if you can, the best thing to do is to snapshot the disk on your director, so that the IaaS is doing the backup for you, in addition to maybe the BOSH backup, right?

One solution we came up with, and it's kind of a clever one from the folks in Beijing, is to modify the dummy CPI and replay the deployment. Think of it this way: you have a director that no longer knows about the deployment, but you still have the deployment. And when you have the deployment, you can always go to your IaaS console and get all the IDs for your disks and all the IDs for your VMs. So now you can create or modify a dummy deployment: you take your manifest and pretend you're going to deploy it again, and instead of actually deploying, the CPI gets the disk and VM IDs from a list you've created of what exists. And when you're done with your deployment, which is a fake deployment, you've reconstructed your database.
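I don't have their exact format to share, but the list the modified dummy CPI reads would look something like this; the shape here is hypothetical, reconstructed from the idea rather than from their code:

```yaml
# existing_cids.yml -- gathered from the IaaS console; instead of creating
# anything, each create_vm / create_disk call returns the recorded CID
vms:
  router/0: "vm-4f2a..."      # hypothetical instance name -> IaaS VM ID
  router/1: "vm-9c01..."
disks:
  postgres/0: "disk-77b3..."  # instance name -> persistent disk ID
```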
So we did that, over my two weeks of vacation. The good thing at IBM, and I'll say this, is that typically we have very, very good managers, and I happen to have one, Alex, and he's super cool. So I got those two weeks back, eventually. I'm still trying to get a couple more weeks, but anyway. So it happens. With that, I want to thank you and take some questions.

[Question about OS reload.] OK, so the guy who owns OS reload is sitting next to you; that's Fabio, on your right. The problem here is kind of unique to SoftLayer, but the solution we came up with is interesting and probably works in other places too. Some data centers have limited VM capacity, and SoftLayer doesn't have a first-class reservation system. So if you have a large deployment with VMs in it and you try to, for instance, update or redeploy or add to your deployment, you might lose VMs: if you release a VM, you might not get it back, since the data center is out of capacity. So instead of releasing VMs and getting new ones, we keep the VM and reload the operating system, which in some ways is all you need to do, because the hypervisor side of the VM is fine; it's just the OS that you have to reload, right? To do this, you actually have to modify the CPI. I've talked to Dmitriy about this for a long time, because Pivotal actually has a similar problem in their data environments: instead of creating a new VM and then reloading a large amount of data, they'd rather update the stemcell by reloading the operating system. So there is some discussion about maybe making this a first-class citizen.

Any other questions? Before you go, I've got number 11. Oh, Chris. So, he's the hiring manager. Thanks for taking over my talk and doing a little advertising. OK, so number 11, let's do this: always seek knowledge. Why is that important? Because this stuff is large and it's hard. If you're a new BOSH developer, you might not understand how to use BOSH. That is very true: not everybody who is a great developer on BOSH knows how to use BOSH. I was in Böblingen and I mentioned this: I love cars, I'm an engineer, I love German cars, but guess what? I am a terrible driver. I think most of you probably fit in one of those categories too; being a developer doesn't mean you know how to use the thing. BOSH is complex, with a long learning curve. The good thing is there are lots of talks. Dr Nic has tons of information, Maria is adding stuff, and Dmitriy, anything you need to know, just ping him on Slack; you'll pretty much always get an answer, and if he can't give you one, he'll point you to somebody who can. So bosh.io is your source, right? And Dr Nic, as I mentioned. So seek knowledge, OK? That was in case nobody had questions.

Anything else? OK, we have 15 seconds. No? Thank you.