Check. All right, I guess this thing's on. Hi, everybody. I tried to hide all the way up here in the corner, and it seems a bunch of you found me anyway, so I guess I have to give a talk. I'm Bill Chapman, and I am a cloud architect and technical account manager at Stark and Wayne. I've been deploying, maintaining, and troubleshooting platforms and services on top of OpenStack for a couple of years now, primarily Cloud Foundry and BOSH-related services. I came to this space by way of application development: I was an application architect looking for a better way to scale my applications. And as I said, I work for Stark and Wayne. Stark and Wayne is a multidisciplinary operations team. We've got folks from the application space, DevOps, and infrastructure, and we all come together on the layer in between the infrastructure and the application, where our PaaS lives. Cloud Foundry and BOSH are a core component of what we do every day. And yes, as far as anyone is concerned, Batman and Iron Man are our founders. So this is OpenStack. I don't think that I need to explain OpenStack, and I would need the entire conference to do so. But the reason I put this slide in is so that we can take a second to look at OpenStack for what it is: a very complex distributed system. Errors can occur anywhere, and they can bubble up to places you don't expect them in other systems. And this is Cloud Foundry. This is what Cloud Foundry looks like today, along with a reference resource list there. Cloud Foundry is also a very complex distributed system. And this is BOSH, which is a distributed system to distribute distributed systems. No, really: BOSH allows you to deploy distributed systems. And Cloud Foundry, of course, is deployed via BOSH. There are some situations, like with Pivotal's Cloud Foundry, where BOSH is mostly abstracted away behind their Ops Manager, but it's still there. You still have access to it.
So any discussion about troubleshooting Cloud Foundry, at least at a high level like this talk is today, is primarily going to be a discussion about troubleshooting BOSH. If BOSH is doing its job, then Cloud Foundry and OpenStack really shouldn't interact that much. And this is your brain on all of that. This is what cramming a whole bunch of distributed systems together looks like. As I mentioned, I came to this space by way of app development, so I went from something that looks like this, in general, to something that looks like this, and I'm starting to question my life choices. But fortunately, I found a lot of places to get help. Getting involved in the Cloud Foundry community is one of the first steps toward troubleshooting your environment. I spent a lot of time on the forums when I was first trying to find my way through my first Cloud Foundry deployments, especially on OpenStack. And you would be amazed how many people in the Cloud Foundry community have phenomenal experience and deep knowledge of OpenStack. We have some people in the audience today who have that knowledge. But wait, there's more. If you sign up today, I'll throw in over 80 other non-trivial services and systems that you can deploy via BOSH. A whole lot of this talk also applies to the ecosystem of BOSH releases. A quick side note: I'm going to go through a lot of examples, and some of them are no longer relevant on edge versions of BOSH, Cloud Foundry, and OpenStack. But very rarely do we encounter an organization that is on the latest version of these systems. You should note that I think BOSH is actively tested on Liberty, Mitaka, and Newton right now, but many of the stacks we work on are already older than that. So a talk about troubleshooting this platform really needs to be geared toward what's happening in the wild. Cloud Foundry and OpenStack have very aggressive release schedules for projects of their size.
So I considered pruning this talk to only edge-related issues, but it turned out that I wouldn't be getting a good cross-section of what you might encounter. Next, I'm going to go through a bunch of basic problem classes that you might experience, and some errors. A lot of times those errors are not necessarily going to be intuitive; that's why I'm pointing them out. And some of them may seem obvious. But I've got these from my notes, from the notes of colleagues, and from the notes of some of the folks I've worked with in the community, and every one of them has caused someone to go to others for help or to lose a day or two of progress. True to the talk title, we'll start out with networking. This represents the basic collection of networks and subnets that you would need just to get BOSH off the ground. In production, obviously, you're probably going to have a much more complicated scenario. But the bottom line here is to always make sure that the networks available in your OpenStack map onto the topology that's shown in your manifest. We're just getting started, but this is a typical error you might get, and you should get used to it. You should make friends with it. You're going to spend a lot of time together. You're trying to date Cloud Foundry, but this is the annoying friend that keeps tagging along. Now, I say that this is related to networking, but you also might run into this error if there are issues bringing up a VM, and in that case it's not necessarily related to networking at all. And that's where it gets messy, because you look at this and you think, oh, that must be network related. So now we've got to talk about how you figure out where the problem actually is, and that's where we learn to love our logs. BOSH has its own logs, and sometimes you'll get distracted by them, because BOSH only knows about BOSH. The CPI is what's aware of the underlying infrastructure, in this case the OpenStack CPI.
But as errors flow up from OpenStack, the IaaS layer, through the CPI into BOSH, very rarely do you end up with a smoking gun. It's not always going to make sense, and it's not always going to be obvious, so log tracing is going to be a skill on its own. You need to be aware of the primary ways to get at the logs. I find that I spend a lot of time in the OpenStack logs, generally under /var/log/ for each OpenStack component. But over time, I've spent less and less time there; I think the community has really come together and made the CPI a lot better. Like I said, some of these examples are going to be a little dated for edge versions of some software. The most important thing to consider is that you need to have a good understanding of OpenStack and BOSH networking. If you have three hours before you have to go stand up Cloud Foundry, and the only thing you have time to do is read through some of the docs on OpenStack networking, you can't go wrong. It'll save you a great amount of headache later. And I have a link up here for one of the troubleshooting guides that are available. BOSH debugging is obviously another skill you need to spend some time on. What I'm really excited about is this last example: the BOSH v2 CLI allows you to follow the logs for a particular job, which can be pretty useful. But again, this talk really isn't about the details of BOSH. It's just that before you go into a Cloud Foundry deployment, make sure you understand the BOSH CLI, make sure you have at least a cursory knowledge of OpenStack networking, and make sure you understand what the manifest's role is in your BOSH deployment. Sometimes you'll use bespoke systems where you're editing your manifest piecemeal, and you won't necessarily appreciate that the manifest is everything that makes your deployment your deployment. If there is something wrong in your deployment and it's not OpenStack, it's probably something in your manifest.
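Since so much of this comes down to log tracing across components, here's a toy sketch of the idea: a hypothetical helper (not a real Stark and Wayne tool) that merges lines from several component logs into one timeline, assuming each line starts with a sortable timestamp. Real OpenStack and BOSH log formats vary, so treat this purely as an illustration of why a merged view helps.

```python
import re
from pathlib import Path

# Assumes lines start with a timestamp like "2017-05-10 12:00:01".
TS = re.compile(r"^(\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2})")

def merge_logs(paths):
    """Merge timestamp-prefixed lines from several logs into one timeline."""
    entries = []
    for path in paths:
        for line in Path(path).read_text().splitlines():
            m = TS.match(line)
            if m:
                entries.append((m.group(1), str(path), line))
    # Sort by timestamp so errors from nova, neutron, and the CPI line up.
    entries.sort(key=lambda e: e[0])
    return [f"{p}: {l}" for _, p, l in entries]
```

With nova and neutron logs merged like this, the "VM create failed" in BOSH and the underlying Neutron error usually sit right next to each other.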
And if you go to the community I mentioned before for help, the first thing they're going to do is ask you: can you show me what your manifest looks like? You'll see some examples later where this comes into play. I threw this in here because it's a diagram we like to use when people come to ask for help. Try to classify which vertex of this triangle the problem lies on, because you've got OpenStack that has to speak to your virtual machines, you've got BOSH, the director, that has to speak to your virtual machines, and you've got BOSH that needs to speak to OpenStack. The problem can lie on any one of these vertices. And what's interesting is that BOSH has its own view of the world, and OpenStack has its own view of the world, and it's really helpful to understand that, because some of the errors we're going to go through are just a matter of BOSH's view of the world not syncing with OpenStack's view of the world. So you go to look at your manifest, and your manifest says, oh, this is right. And then you go to look at OpenStack, and OpenStack seems to imply everything's right. But it turns out that there's an error somewhere else, where it thinks the VM was down and it can't release a port, and it gets pretty interesting. So if you're coming to the community for help, spend some time trying to figure out if the problem is between OpenStack and the VMs, BOSH and the VMs, or OpenStack and the BOSH director. Hopefully some of these examples will help. Now let's get to what are probably way too many examples in too little time. I'll try to get through them quickly; I know we probably want to go to lunch, right? So, you will probably face some of these issues. Let's start out with key pairs. BOSH must be provided with a key pair that it can use to communicate with instances. Without a valid key, your deployments will fail. This is a pretty nice class of error because it usually is spelled right out for you: it'll say missing private key.
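To make that triangle concrete, a minimal probe script can tell you which edge is unreachable. The ports below are my assumptions based on commonly documented BOSH defaults (NATS on 4222, agent mbus on 6868, Keystone on 5000); check your own manifest and endpoints for the real values.

```python
import socket

def can_reach(host, port, timeout=2.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def classify(director_ip, vm_ip, openstack_api):
    """Probe each edge of the director / VM / OpenStack triangle."""
    return {
        "director->vm (agent mbus, assumed 6868)": can_reach(vm_ip, 6868),
        "vm->director (NATS, assumed 4222)":       can_reach(director_ip, 4222),
        "director->openstack (Keystone, 5000)":    can_reach(openstack_api, 5000),
    }
```

Run it from the director (or whichever vertex you suspect) and the False entries narrow down which conversation is broken before you start digging through logs.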
But sometimes it's not. In this case, if you're using OpenStack Liberty or Mitaka, you can't use their SSH key generator; you have to generate one manually. So even if you tell the API to generate that for you, it doesn't matter. Actually, can you use the API to generate a key? Yes, you can, of course you can, but it will break; it will not work. You'll look in OpenStack, and OpenStack will say, hey, yes, this is here, it worked. And then you'll go try to run bosh deploy and it'll fail. This is a case of a bug that's actually in Liberty and Mitaka, so it doesn't fall into the base class of error, and a lot of validation might not even catch it. Another thing to keep in mind is VM communication: BOSH requires that the virtual machines be able to communicate with one another. I think we've seen this error before. This one's gonna come up a lot. I told you, it's that annoying friend that wants to tag along on your dates. It's not gonna go away. This here is a typical error you might get if you have blocked network connectivity between the agent and the BOSH director. But then again, it's also typical for a whole other class of problems. And we're gonna see a pattern here, and that pattern is that the error BOSH spits out is going to point to an entire class of problems. Sometimes that might mean you have to go look in the Nova logs, and sometimes it might mean you have to go look in the Neutron logs. We'll talk a little bit about how to mitigate that later. Security group rules. The BOSH security group is the security group that BOSH VMs will be deployed within, and this right here is the reference list of security group rules. It's the minimum set that you need to make BOSH do its thing. It's not necessarily production; you wouldn't use this in production, and it's definitely not the most secure. But what I've seen happen far too many times is you have problems with security groups, and then you do this.
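Rather than opening everything wide, you can script a minimal rule set so it's reproducible. This sketch just prints `openstack` CLI commands; the specific port list here is my assumption from commonly documented BOSH defaults (SSH 22, agent mbus 6868, director 25555, everything within the group), so verify it against the BOSH docs for your version before using it.

```python
# Assumed minimal ruleset for a BOSH security group; adjust to your docs.
RULES = [
    {"proto": "tcp", "port": 22,    "remote": "0.0.0.0/0"},  # SSH for bosh ssh
    {"proto": "tcp", "port": 6868,  "remote": "0.0.0.0/0"},  # agent mbus
    {"proto": "tcp", "port": 25555, "remote": "0.0.0.0/0"},  # director API
]

def rule_commands(group="bosh"):
    """Emit openstack CLI commands for a minimal BOSH security group."""
    cmds = [
        # VMs within the group must be able to talk to each other freely.
        f"openstack security group rule create --proto tcp --dst-port 1:65535 "
        f"--remote-group {group} {group}",
    ]
    for r in RULES:
        cmds.append(
            f"openstack security group rule create --proto {r['proto']} "
            f"--dst-port {r['port']} --remote-ip {r['remote']} {group}"
        )
    return cmds
```

Printing the commands instead of opening every port means the next person can see exactly what was allowed and why.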
I've seen this happen a lot, especially when you're fighting with third-party SDNs. Many a young adventurer has lost a battle with Neutron. I've done this myself; this is actually from an OpenStack that's running in my home lab. It's a quick fix, but you'll feel dirty about it in the morning. Don't do it. Seriously. Oh no, not again. This error is gonna pop up again: security group rules. Default flavors. So this one's interesting. Again, some types of validation will probably check to make sure you have all your default flavors. But a lot of people don't put a lot of thought into it, because when you go to, Gesundheit, sorry, when you go to the documentation from Pivotal or from us at Stark and Wayne, it'll give you this list and it'll say: put these in, make sure OpenStack has these available. But you really need to give some consideration to what your use case is for Cloud Foundry. Often these flavors will not be sufficient for what you're trying to do. When I mentioned this to my colleagues, the responses ranged from "really, you have problems with flavors?" to "yeah, that bit me." So when it's a problem, it's really a problem. It really depends what you're gonna do with it. And this is a fun example for me, because I was looking for an example from the past. I remembered having issues with flavors, so it made it into the talk, and when I was trying to find an error for the slide, I couldn't reproduce the failure I had remembered. So I went looking, and I found this, and I thought, this is great, somebody else had the same problem I did. And five minutes into reading through it, I said, no, I had that problem. This is me. It happens way too often. This error happened because the manifest had a vSphere-specific directive in it, and it was just skipping all of the OpenStack stuff.
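Flavor and quota sizing is arithmetic you can do before you ever run bosh deploy. Here's a hypothetical back-of-the-envelope check; the flavor numbers are made up for illustration, so substitute your manifest's instance counts and the real figures from your tenant quota.

```python
# Made-up flavor sizes for illustration: name -> (vcpus, ram_mb, disk_gb).
FLAVORS = {
    "m1.large":  (4, 8192, 80),
    "m1.xlarge": (8, 16384, 160),
}

def required(plan):
    """plan: list of (flavor_name, instance_count) -> resource totals."""
    vcpus = ram = disk = instances = 0
    for flavor, count in plan:
        c, r, d = FLAVORS[flavor]
        vcpus += c * count
        ram += r * count
        disk += d * count
        instances += count
    return {"instances": instances, "cores": vcpus, "ram": ram, "disk": disk}

def fits(plan, quota):
    """For each resource, does the plan fit inside the tenant quota?"""
    need = required(plan)
    return {k: need[k] <= quota.get(k, 0) for k in need}
```

Ten Diego cells on the made-up m1.xlarge above already need 80 vCPUs, which is exactly the kind of thing that blows through a default tenant quota.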
Quota issues are also common, especially the first time you deploy things, and especially if you have a large production deployment and your organization or your tenant doesn't have the resources it needs. The big thing here is to consider Diego, which is probably by far going to be your largest consumer of resources, because that's where your applications run. And when you think about your quotas, you're gonna come back to your minimum deployment. I use PCF here because this is what Pivotal documents as the minimum amount of resources that you need to run a PCF deployment. So it should become pretty clear that quotas are a big deal. And then you end up doing this, which is probably also a bad idea, but if you happen to have admin rights to your tenant, you might be tempted. You probably shouldn't. Here's another error that falls into the quota class. In this case, BOSH was trying to provision a new VM, and the tenant didn't have the quota for it, so you get VM creation failed. This is the second class of error; the timeout-pinging one we've seen three times already. VM creation failures you're also gonna see all the time, and again, the pattern here should be that you shouldn't get too caught up in the error. You need to dig deeper, because the errors aren't necessarily going to explain the problem to you. OpenStack APIs: this is another one to think about. You need to know that BOSH needs to be able to talk to compute, it needs to be able to talk to identity, it needs to be able to talk to image storage, and optionally it needs to be able to talk to networking. Here's an issue you get if your API is unavailable. But what's interesting is that this error happened because there was an upgrade to a newer version of the OpenStack CPI. Everything worked fine on version 27, and on version 28 OpenStack networking became the default, and all of a sudden we're in a situation where this client wasn't using OpenStack networking.
So this problem arose after it worked fine yesterday, which ends up being difficult. Stemcell issues. Sometimes you have a problem with stemcells, and usually it's pretty straightforward, because BOSH will just yell at you: missing stemcell. But sometimes you get a problem where the stemcell is missing but BOSH thinks the stemcell is there, so you try to upload the image and you can't. This usually means that somebody deleted the stemcell out from under you in your OpenStack. Again, all of these are real-world examples from notes. It's just a cross-section that I believe most folks will encounter. Image provisioning. You need to be aware of how OpenStack applies things like your SSH keys to your stemcells when they come up, right? Well, the stemcell uses the metadata service. So you can check to make sure that an image can hit the metadata service by doing something like this, and in this case, that's the output you should receive. If you don't, then there's a problem. And then we get this error again. Four times, four dates, four best friends tagging along. In this case, if you're having problems with the metadata service, you're not gonna be able to SSH into any of your VMs. You're also going to have other issues: BOSH is gonna bring the VM up just fine, but then it's not necessarily gonna be able to communicate with it. Also remember that BOSH stores stemcells in Glance, or the image service in general, so you're gonna wanna check the amount of disk space you have available, things like that. Rate limiting. This one's also fun, because BOSH throws hundreds of calls against your OpenStack API, and if Nova has rate limiting set too low, you're gonna get an error like this, which is actually not very intuitive, because it doesn't say anything about making too many requests. I think in newer versions of the CPI this may be cleaned up, but this was the error we were getting at the time.
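The usual client-side mitigation for rate limiting is retry with exponential backoff. Here's a generic sketch of that pattern, not what the CPI actually does internally; `RateLimited` is a stand-in for whatever over-limit response your API returns.

```python
import time

class RateLimited(Exception):
    """Stand-in for an over-limit response from a rate-limited API."""

def call_with_backoff(fn, retries=5, base_delay=0.5, sleep=time.sleep):
    """Call fn, retrying with exponential backoff when it's rate limited."""
    delay = base_delay
    for attempt in range(retries):
        try:
            return fn()
        except RateLimited:
            if attempt == retries - 1:
                raise  # out of retries: surface the error
            sleep(delay)
            delay *= 2  # back off: 0.5s, 1s, 2s, ...
```

The `sleep` parameter is injectable purely so the behavior is testable; in real use you'd leave the default.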
So, if stemcells are actually just machine images, we need to check the amount of disk space that's being used for Glance. If you're having trouble uploading stemcells, make sure you haven't run out of space. I've moved on now; I've got a bunch of housekeeping things that we'll go through. VM performance. I've found, actually we've found, that sometimes you'll have problems with the performance of large distributed systems on top of OpenStack, and you have to take care to figure out what type of emulation mode you're using. Usually, setting your CPU mode to host-passthrough will alleviate the problem, but it's really not something you should do without understanding what's going on under the hood. It's probably a decision for your admin team to make, but this is something you can point them at if you're having issues. I also like to be aware of the default CPU allocation ratio and the overcommit ratio for memory, because sometimes you wanna double-check those things if you're having performance issues; if you didn't stand up the stack, you don't necessarily know what's going on under the hood. Chances are you're not gonna have that access, but you could at least say, hey, I'd like you to check this for me and let me know what's going on. Network performance is another interesting class of problem. Jan mentioned MTU issues earlier; you can make changes to your manifest, and you can update MTU settings in Nova, and there's a bit of a dark art to that. But another networking performance issue that isn't always apparent is that Neutron itself can be a bottleneck. There are ways to get around this, distributed virtual routing, for example, where the L3 agent is distributed across the compute nodes. But the first time this came up, I actually hadn't expected it.
You can add more Neutron nodes, or you can use a third-party SDN provider to try to get around this, but when you consider that a typical large enterprise production Cloud Foundry might have, say, 110 Diego cells, you wanna make sure that you're scaling for that traffic. Proxy issues. These are not necessarily OpenStack specific, but the reason they're here is that a lot of the clients we've dealt with that are using OpenStack are running some type of internal OpenStack. It's more likely that an OpenStack is gonna be behind a corporate firewall, and you're gonna run into some pretty interesting issues with this. Interestingly, this is the first issue that is Cloud Foundry specific; everything else has been about BOSH. Some of the buildpacks actually need to go grab stuff from GitHub, so sometimes you have to specify the proxy in your application manifest. There are also some problems with the UAA, where the UAA controller will not recognize the proxy flag. There's the HTTP proxy flag and then the no-proxy flag, I think; there are two flags, and the hierarchy between them isn't set right, so you might have one flag set and think everything should work, and then you can't get any access to the outside world. So you need to be aware of proxy issues. Cinder. I don't actually have anything about Cinder right now, but Cinder, if you're not aware of it, is the OpenStack dating app: it allows you to swipe left for reliable block storage. Come on, really? Actually, no, I do have an example for Cinder. Sometimes a BOSH deployment will break because a compute host goes down. BOSH will try to resurrect the VMs, but the VMs will become unresponsive, because they relied on a host that's no longer available; the compute node went down.
So this could mean that the Cinder volumes attached to those VMs aren't resurrecting properly, because a volume is stuck in the attached state to a VM that no longer exists. And with this, we've come full circle, back to my first Cloud Foundry, so we just went backwards in time: newer errors at the beginning, and this one is relevant because it was one of my first battles with BOSH on OpenStack. As you'll see, we've seen this error twice already. In this case, I had assumed that OpenStack must be working correctly, because this was one of my first experiences with OpenStack, and it's this huge project, it's got all this support, I'd met all these great people. So I thought, okay, OpenStack's working, this must be a problem with BOSH. After a day or so of troubleshooting, I finally found my way to the Nova logs, and it turned out that there was a bug in Nova that was not allowing IP addresses to be manually assigned to new instances as they were brought up. So I figured this out, I patched it, then I did some Googling, and I found that there was actually an open issue for it. And it's interesting, because we've now seen three or four situations that all produce the same error in BOSH: the same VM create failed, a bunch with the same unable to ping. The point of all of that was, when I was looking through my notes, it was pretty fascinating that I kept seeing the same errors over and over again when I hadn't expected to, and it really is all about needing to go dig deeper. This is my last example, and I can thank Sean Carey at Pivotal for this one. In this case, the BOSH deploys were failing. Instances were getting created, but the director couldn't talk to the agent, and it turned out that Neutron was timing out trying to release the IP address from an orphaned instance, and that was preventing those agents from having an IP address assigned.
And what I like about this one is that it's really indicative of the general case of problems we see in the wild, so to speak: there's no way to prep for this one except really understanding OpenStack networking, which was my point from the beginning. All of the examples that preceded it seem to fall into nice, neat classes, but then I show you an error message that doesn't fall into the class it appears to, and this is another one that drives that point home. The reason I wanted to share this is that it was a really painful one for them. They had a lot of people who know their way around working on it, it took a couple of days to figure out, and it turned out to be something fairly silly. So, tools for the discerning operator. There are a lot of available packages out there that can help here, and any list of them would probably take a whole other talk, but these are ones that we use a lot at Stark and Wayne, or at least that I use for some of the things I do. libvirt: you should get to know libvirt. libvirt, of course, takes a lot of the pain out of troubleshooting virtual-host-based issues. For example, we've had instance migrations fail in OpenStack when we're trying to move things around, and you can use virsh to get an idea of what the hypervisor thinks the world looks like, and compare that to what BOSH thinks the world looks like. The CF sizing tool: this one is surprisingly useful. You can pick your IaaS, you can say what size deployment you want, and it will spit out: this is what you need, these are the flavors you need, this is how your flavors should be sized. If you use the AWS one, it'll even tell you how much it'll cost.
And this is nice because if you do this before you go down the path of deploying Cloud Foundry, you can at least get an idea of how you're gonna size those quotas and the flavors that you're gonna use. So, Codex. This one's interesting, and my colleague Xu Zhao is actually gonna speak a little bit about this later. Codex is a workbook that Stark and Wayne has been putting together, and it is all of our best practices for deploying on the IaaS layers that we have worked with for our clients. And it's a living document: as we come up with new techniques for things, we add them in here, so that anybody who wants to see how we're doing things can come here. Often our clients will approach us and say, we want you to do what Codex says. It's a little different from what other companies do, I think, because this is just the way we do it, but it has become our internal best practices. And then, Jan talked about this at length: there's a CF validation tool that the community supports that actually will cover a whole lot of the errors we've seen today, or at least the error classes. I tried to throw in errors that might not necessarily get caught by the validation tool. At the very least, it won't catch them all if you're not running the validation in a pipeline regularly. For example, it can tell you if the security groups can be created, and it can tell you if they exist right now, but if you're not running your validation regularly, you don't know what the state of the world is or whether it's still valid tomorrow. This should go without saying, but script all your automation. Terraform is one way; our Codex documentation uses Terraform, and we already have Terraform scripting that will stand up our reference OpenStack. The first time I set up OpenStack, it was 100 or so lines of Python, and that's actually really friendly too,
I mean, given that Python integrates so well with the OpenStack API. BOSH UI is another helpful tool; it's a relatively new project that Stark and Wayne has added to the Cloud Foundry community, and it gives you a nice view of your BOSH world. It lets you SSH into your BOSH VMs, and you can see logs. We're gonna update it for BOSH v2, so that it has some of the features that the BOSH v2 CLI will support. And finally, if this isn't readily apparent: ship your logs. If you've got 30 compute nodes and every one of them keeps its own local copy of its own view of Nova, that's a problem. Some distributions, actually I hope most distributions, now do some type of forwarding to a controller node, but if you're standing up your own bespoke OpenStack, I've often come into situations where there's nothing happening with the logs out of the gate. This should be the first thing you do. It should be pretty clear from what I've talked about today that you need access to those OpenStack logs. For most of the errors we've talked about, I showed you what BOSH said the problem was, and I did have lots of slides that also said, this is what OpenStack says, but it turned out I just didn't have the time to discuss all of them, so I thought it would be easier to make the point that you really have to dig a little deeper, and that's why we get to this point. This here is a project from Stark and Wayne called Sawmill, and Sawmill's pretty useful because it's basically tail -f for distributed systems. You're not getting any log retention, but you can see exactly what's going on right now, make the error happen, and get it in front of your face. A few people contributed to my talk, and some of their examples are in here, and I wanted to thank them: James Hunt, Jeremy Budnack, Sean Carey, Craig Buczek, and then my wife, for putting up with me in general. And that's all I have today.
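As a small illustration of the ship-your-logs advice above: even Python's standard library can forward log lines to a central syslog collector over UDP. In practice you'd configure rsyslog or your distribution's forwarding on each node; the collector address below is a placeholder, not a real host.

```python
import logging
import logging.handlers

def make_shipper(collector=("logs.example.internal", 514)):
    """Return a logger that ships records to a remote syslog collector.

    The default collector address is a placeholder; pass your real
    (host, port) tuple. Uses UDP, syslog's traditional transport.
    """
    logger = logging.getLogger("shipper")
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.handlers.SysLogHandler(address=collector))
    return logger
```

Even this toy version makes the point: the log line leaves the box the moment it's emitted, so a dead compute node doesn't take its history with it.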
There's a couple of minutes left, but it is also lunchtime, so if anybody has any questions... Oh, sure, sure. For the validator? Yep, Jan can answer that one. What OpenStack versions does the validator support? Ah, okay, yes, I actually mentioned that: the three from earlier, Liberty, Mitaka, and Newton. Yep, excellent. All right, enjoy lunch, everybody. Thanks for coming.