Step with me into a time machine. Let's define that as a contraption that can transport you to any event in space-time, which in this case is 45 years ago, in a quaint and quiet Swedish small town. Today is October the 10th, 1975, and on this autumn Friday a young doctor fresh out of medical school is on shift in the Accident and Emergency ward in the hospital of a small Swedish coastal town of about 35,000 people. It's been a very calm morning and it's now lunchtime, and most of the senior personnel, the doctors and nurses, are down in the hospital cafeteria for lunch. Everything looks and feels very routine until suddenly the phone rings. The hospital is being advised by emergency responders that a plane crash has occurred in the vicinity and that the arrival of the first casualties by helicopter is imminent.

Now, as soon as the young doctor hangs up, he grabs a nurse and they race over to the shelf with the binders containing the emergency procedures, because they now have to immediately prepare the area for a mass casualty contingency. That means set up triage procedures, remove all non-essential vehicles from the car park, activate personnel that's on call, et cetera. But their frantic preparation is cut short when they hear the chopper land on the hospital's landing pad, and seconds later the first patient is wheeled through the doors on a stretcher and the doctor immediately starts evaluating the patient. He's conscious and he's breathing, but otherwise he's not in great shape. He's shaking violently. He's incoherent, and when the doctor addresses him he can't verbally respond, because all that the patient utters is a sound like [unintelligible]. Now, the doctor immediately recognizes these symptoms as consistent with cerebral trauma.
The patient evidently has had a head injury and he could lose consciousness any moment. As he's examining this patient, the doctor is removing the patient's life jacket (it was a plane crash into water) and drops that on the floor, and what he sees now is a military flight suit that the patient is wearing. So evidently he's dealing with a fighter or bomber pilot, and suddenly the penny drops for the doctor that the sound that the pilot was making could have been an attempt to speak Russian. Now, the doctor speaks a little Russian from classes that he took at university, so to reassure the patient that he'll be alright he says, in Russian, "I'm a doctor, you're in a Swedish hospital." At this the pilot's eyes get as big as dinner plates. They just widen in horror. He's still not verbally responding, but to the doctor the picture is very clear: this is a Soviet pilot (again, this is the mid-70s), and that pilot is understandably terrified that he's been shot down over hostile territory. And the doctor comes to a dismal conclusion, which is: if we're shooting down Soviet jets, then we're probably at war with the Soviet Union, and that in turn means that we're probably in the opening minutes of World War III. 1975, part of the Cold War.

But as he momentarily looks down, he realizes that he has a much more urgent problem on his hands with this particular patient, because he, the doctor, is standing in a red puddle that has formed on the floor. So this patient evidently doesn't just suffer from cerebral trauma, he also has an open wound, possibly an arterial bleed, and they haven't found that yet. So they need to get his flight suit off of him as soon as possible. Unfortunately, the doctor isn't very accustomed to patients wearing military flight suits, so he's running into a snag: he simply can't find the right zippers or buckles to get the suit off the pilot. Whatever he tries, all that he manages to do is open one of the many pockets that the suit has. So he grabs a pair of scissors to cut the suit open, and at
this stage, it's something like 90 seconds after the original call, and the senior doctors and nurses are finally pouring in from wherever they had spent their lunch break. Just as he's about to cut into the trousers of the flight suit, a nurse stops him and says, "Don't do that, that's a G-suit. Those are thousands of kronor apiece." That's 1975 kronor, so you'll have to multiply by about 5.7 to account for inflation to the present day. At this comment the doctor looks at her, obviously momentarily befuddled, and in the pause she adds, "Oh, and you may want to step off his rescue jacket, because it looks like you popped the sea dye marker pouch and you're making a mess all over the floor." After which she expertly strips the patient of his flight suit, and the evaluation of this patient continues.

A patient who, it turns out, is a Swedish Air Force pilot who was forced to eject from his Saab AJ 37 Viggen strike fighter, tail number 37005, in which he had been on a routine training flight over the Baltic Sea. He had lost control of his plane due to a structural failure, a wing fracture, a problem that was common and plagued the airframe in the mid-1970s. He was subsequently pulled out of the 8-degree-Celsius water by a search and rescue team, which then flew him promptly to the nearest hospital because he was suffering from a rather bad case of hypothermia, the symptoms of which of course include incoherence, slurred speech, and violent shivers.

The reason I'm telling you all this is that absolutely every single assumption that the doctor had made was wrong, with one exception. There was actually a plane crash, but it involved a single-seat military aircraft rather than a commercial airliner. Therefore mass casualty treatment was never required; they were dealing with a single accident survivor all along. The patient wasn't suffering from cerebral trauma but from hypothermia. The patient also was not trying to speak Russian; he simply couldn't get any words out, again due to his hypothermia. And
the patient also was not suffering from sudden catastrophic bleeding. Instead, the red puddle came from a fluorescent sea dye marker cartridge. Those, interestingly, are supposed to look green when they dissolve, but they're bright red when they're concentrated. And of course Sweden was not at war with the Soviet Union, and nuclear Armageddon wasn't imminent.

Thankfully for the doctor, none of the decisions that he made based on his wildly inaccurate assumptions actually harmed anyone, other than of course scaring the living daylights out of Swedish Air Force reserve lieutenant Harald Gartl, who, after what must have already been the worst day of his flying career, came to only to find out that rather than being pulled from the Baltic Sea by Swedish Navy comrades, he'd presumably been abducted by a team of Spetsnaz and spirited away to somewhere where the doctors only spoke Russian. And he, that is the doctor, reflected later in life that he was actually quite thankful for this inconsequential mishap early in his career, as it profoundly and permanently influenced his later thinking, which was to always challenge your own assumptions.

The young doctor incidentally became a very famous man following this approach. In the early 1980s he discovered a working prevention mechanism for konzo, or bound legs disease, while on a Doctors Without Borders mission in Central Africa. He became the professor of international health at the prestigious Karolinska Institute in 1995, and in the early 2000s he rose to internet fame with his data visualizations and riveting commentary about public health issues. I'm of course talking about Hans Rosling, who left us much too soon in 2017.

But what we can do is take Hans's advice and apply it to operating and running OpenStack. Let's always challenge our own assumptions. That's really what this talk is about: becoming a better OpenStack user or operator by not always relying on your first impression. Now, OpenStack's complexity undoubtedly comes
with operational challenges, and in situations where OpenStack misbehaves it's frequently non-trivial to find the actual cause of an issue. This talk includes several examples of red herrings in OpenStack, and suggestions for spotting and avoiding them.

What's a red herring? For those of you who are unfamiliar with the idiom, I'm not talking about actual red herrings. A red herring is something that misleads or distracts from a relevant or important question, according to Wikipedia at least. In other words, among other things, a red herring is the apparently obvious cause of a problem, whereas the real cause is non-obvious and frequently completely different.

Let's start off with something relatively straightforward: virtual routers in Neutron. What I'm doing here is I'm simply creating virtual routers in a loop. I'm operating against a single tenant and I just add one virtual router after another, and that seems to all work just dandy until suddenly it doesn't. So let's see what's at fault here.

Let's start with the obvious assumption: I'm running into an administrator-imposed limit. Providers can set these limits through the OpenStack quota system, so let's check whether perhaps I'm running into a quota limit. Luckily I can always check what my quota is. If I look for my router quota in this case, I see that I can create a whopping 500 of them, and the same thing is true for subnets and also for networks (those might also be issues), and I can also see that I have enough ports. So evidently I'm not running into a quota issue. Besides, if I actually exceeded a quota, what I'd get back here from Neutron is an HTTP 413 error, rather than the HTTP 200 combined with the router error status that we're actually seeing.

So we can dig a little bit further. Maybe Neutron has a configuration limit on the maximum number of routers per tenant, just like Heat has for stacks. That does exist, quota_router, but what it sets is just the default number of routers, and it does get overridden
by a quota explicitly set on a tenant. So unfortunately the router quota avenue doesn't get us anywhere.

Let's try one thing by way of experimentation: let's try and create a router that has HA disabled. So we're creating a router explicitly with the no-HA flag, and then finally that works, and it works immediately. So without HA it works, with HA it doesn't.

What about HA routers, and how do those work? Well, way back in the OpenStack Juno release we got high availability support for Neutron routers, which means that, assuming you have more than one network gateway node that can host them, your virtual routers will work in an automated active/backup configuration. In effect, what Neutron does for you is that for every subnet that's plugged into the router, and for which it therefore acts as the default gateway, the gateway address binds to a keepalived-backed VRRP interface. On one of the network nodes that you have, that interface is active, and on the other one it's in standby. If your network node goes down, keepalived makes sure that the subnet's default gateway IPs come up on the other node. That keepalived configuration is completely abstracted away from the user; the Neutron agents happily take care of all of that.

But in order to enable HA routers, Neutron creates one administrative network per tenant, or project, over which it runs the VRRP traffic. And in order to tell apart all the keepalived instances that it manages on that network, it assigns each of those an individual virtual router ID, or VR ID, sometimes also pronounced VRID. And here's the problem: RFC 5798, the thing that defines this protocol, VRRP, defines the virtual router ID to be an 8-bit integer, with valid values from 1 to 255. That means that if you use HA routers, then setting a router quota over 255 is useless, because Neutron will run out of VRIDs in the administrative network before your tenant can ever hit the quota. And this is a hard limit. There's really not much that Neutron can do about this, apart from changing
its approach or changing the RFC, which probably won't work. So therefore, at least for the time being, if you want more than 255 highly available virtual routers, you'll have to spread them across multiple tenants.

You might say you really don't need HA routers. Well, first of all, you probably do want them, really. But let's assume for a moment that you actually don't, or rather, that it's more important for you to have more than 255 routers in a single tenant than for any of them to be highly available. So you're guessing you can create routers with the HA flag set to false. But it turns out you probably won't be able to do that. And that's not because you can't change the router's HA flag without first temporarily disabling it; that's fine, that's not going to hurt you much. It's because the default Neutron policy restricts setting the HA flag on a router to admins only.

So if you want to be able to disable a router's HA capability from a user API call, you'll first need to override some default entries in, depending on OpenStack version, Neutron's policy.json, policy.yaml, or whichever. What you want to do is override the rules create_router:ha, get_router:ha, and update_router:ha, from rule:admin_only to rule:admin_or_owner. And of course, if your cloud service provider deploys Neutron with OpenStack-Ansible, or you are that service provider, then you can define this in a variable for OpenStack-Ansible.

Once the policy has been overridden in this manner, you should totally be able to create a new router with this command: openstack router create --no-ha. And you can also modify an existing router's high availability flag with openstack router set: you first have to disable the router temporarily, then you toggle the HA flag, and then you re-enable it.

Here's another red herring that's interesting, and it comes from Magnum. What are your prerequisites in Magnum to run a Kubernetes cluster? Well, there's three, really. You need to have a Glance image for one of the operating
system platforms that Kubernetes supports. There are several, but Fedora CoreOS is the best tested and most widely used, and before the Fedora CoreOS transition happened it used to be Fedora Atomic. Secondly, you need a Magnum cluster template that references that image. And then finally you need to use that template to actually spin up your Kubernetes cluster.

So let's look at this. Here's my image. It's a little dated at this point, you shouldn't be using Fedora Atomic 27 anymore, but the same consideration essentially applies to the current Fedora CoreOS 32. That image should be totally supported for deploying the Kubernetes release that I've selected here with Magnum. And I have a cluster template: it sets the cluster orchestration engine to Kubernetes, and it also sets the kube_tag label, et cetera. And my cluster is spinning up exactly as expected.

But I then decide, well, I really want to use a different image. Again, the example that I'm using here is slightly dated at this point; it talks about Fedora Atomic Host 29, which you shouldn't be using anymore, but it still serves to illustrate the concept. So I do that: I've uploaded a Fedora Atomic 29 image, and again, in principle, deploying Kubernetes off of this should work. But it's a little bit weird. Everything I set up is exactly as it should be, but what I'm still getting is this HTTP 400 error, which is, you know, this rather nondescript Bad Request error. When you see this, you probably think that something's wrong with your API call. It's a bad request, and since I've been making all my API calls essentially by the book, the first problem to assume would be a bug in the Magnum client library, or the OpenStack client, or in both.

Well, wrong again. It's just another red herring. It turns out the culprit is actually a missing property on the image, os_distro, which must be set to the proper value for the matching driver. So in this, as I said, slightly dated example, I used k8s_fedora_atomic_v1, and there it had to be set to
fedora-atomic. These things have now slightly changed for Fedora CoreOS, but it's essentially the same thing. This is actually very well documented; you can totally find this in the Magnum documentation. But many Magnum users never really need to use a private image, and when they do, the missing property and the rather unhelpful error message sometimes trip them up. But once we set this property, we're ready to create a cluster template; once we've got the template, we can fire up a new cluster; and then we can use Kubernetes from there.

My third red herring for you has to do with Heat templates. Now, this is also an interesting one. What I'm doing here is fire up a Heat template, and I get a nondescript HTTP 500. Adding --debug to this command will clarify that we're dealing with a server-side encoding error. It actually complains about the fact that something in there can't be decoded as proper UTF-8. So in other words, it's the Heat API endpoint, not the client, which is weird, that's complaining that it's been given a template with an invalid encoding. That sounds buggy, because if it actually was an incorrectly encoded template, then the Heat client should have caught that.

The additional information that I got here in this case was pretty useless, because I actually saw the exact character that Heat was complaining about, but that one was definitely not incorrectly encoded. I could verify that the encoding was correct; it was in fact a US-ASCII character, so there really can't be any encoding issue when it comes to Unicode.

The funny part about this one is that when I ran into this, I ran into it without making any changes to my template whatsoever. It had previously worked quite alright. The only thing that had changed when I first saw this problem was that the OpenStack region I was running against had recently been upgraded. This was a few versions back, but it's still an interesting case to discuss. And I did have another
region available where the template ran fine, and I had other regions with the new release where it broke. So surely this must be a regression, right? Something that somehow slipped past all the gates and CI checks. Well, I can tell you, when I ran into this I spent some rather significant time working this one out, but in the end the alleged encoding problem turned out to be yet another red herring. And here's what it was, just in case you run into something similar in the future yourself.

You may recall that in Heat templates we can use a function called string replace, or str_replace, for string templating. Here's an example of how we can use this. In this example, the string "host" in the template parameter is replaced with the IP address of a Nova instance, and that then results in a usable URL that can then be retrieved with openstack stack output show, for the logging URL in this case. Fairly straightforward. And of course you can use other functions than get_attr to construct this value here.

But what happens here is that the parameter substitution is just simple string replacement, which means you can name your parameters anything. You're not required to use any variable marker prefix, like, say, the dollar character would be in Bash. But that quickly makes templates very unreadable, so most people do use some sort of prefix, although there is really not much of a convention there, because that makes the template slightly more readable. So they would, for example, use something like a prefixed dollar sign. Some people have something like a JSP or ASP.NET background, and they might be using some version of angle brackets and percent characters. Some people just use capitals. This is essentially up to you; the documentation doesn't mandate anything, as long as it's valid YAML, and of course that includes things like proper quoting and so on.

So I'm going to show you a code snippet of one of my Heat templates that used to run perfectly fine in all OpenStack releases up to a
specific one. What this does is it simply takes a stack parameter named Ubuntu mirror, and then it injects that into an instance's configuration via a cloud config resource, so that depending on which OpenStack region I launch this stack in, I can select a suitable Ubuntu mirror. Like I said, this particular template worked just fine, and then we upgraded and it suddenly produced an HTTP 500.

Now you may say: ah, of course, what's clearly happening here is that Heat is trying to parse the string "curly brace, mirror, close curly brace" as an intrinsic function, which doesn't exist. Well, I have two answers for you. First, masking an unknown function name behind a Unicode decode error would be pretty silly. And secondly, if you do use proper quoting for the template string, you see exactly the same problem.

In reality, this is all it took: using a different prefix. Using the percent prefix here did the trick, and then there was no API problem. To this day, by the way, I still have no idea what exactly made this break specifically in that update, but something surely did.

So that's my talk on red herrings. I hope you'll find those useful, even if it's just as an impetus to look beyond your first impression and then continue troubleshooting along the non-obvious avenues. The slides for this talk are available under a CC BY-SA license, and I also have some image credits to close out on. With that, I thank you very much for your time.
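
As a footnote to the Heat example: the point that str_replace parameter substitution is "just simple string replacement" can be sketched in a few lines of Python. This is only an illustration of the idea, not Heat's actual implementation; the function name, the sample strings, and the addresses below are all made up for the example.

```python
from typing import Mapping


def str_replace(template: str, params: Mapping[str, str]) -> str:
    """Hypothetical stand-in for the idea behind Heat's str_replace:
    every occurrence of each key in `params` is replaced by its value.
    Nothing marks a key as a "variable"; it is matched as plain text."""
    result = template
    for placeholder, value in params.items():
        result = result.replace(placeholder, value)
    return result


# A bare parameter name works, as in the talk's "host" example ...
print(str_replace("http://host/logs", {"host": "192.0.2.10"}))

# ... but it also matches inside unrelated words ("localhost" here),
# which is one reason people tend to pick a distinctive prefix.
print(str_replace("host: localhost", {"host": "10.0.0.5"}))

# With a marker such as percent signs, only the placeholder matches.
print(str_replace("deb %mirror% focal main",
                  {"%mirror%": "http://example.invalid/ubuntu"}))
```

The second call shows why an unmarked name is fragile, and the third shows the kind of percent-style marker that ended up working in the template discussed above. Why the curly-brace form specifically broke after the upgrade is not something this sketch explains.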