Good morning, good afternoon, good evening, wherever you're hailing from. Welcome to another edition of Ask an OpenShift Admin. I am joined by the one and only Andrew Sullivan and the guests he has invited today. Andrew, you want to take it away from here?

Please, sir. Thank you. And again, welcome everybody. Appreciate you joining us today. So this is our second episode after our brief hiatus there for Summit and KubeCon. Again, if you didn't catch all the content related to that, it's all in the channel history. There are blog posts, there's all kinds of other stuff out there. Oh, JP Dade wants to know how you're doing, Richard.

Yeah, I know. Always nice to hear from you, James.

Yeah, Richard, welcome back to the stream. You were here for our very first episode, so it's good to have you back. So this is one of our office hours series of live streams. Effectively what that means is that we are here for you, our audience, to talk about whatever is on your mind. So feel free at any point throughout the stream, and Chris and I will try to remind you periodically, to ask any questions you happen to have. Of course, we are OpenShift experts, and I think all of us here, or at least me and Chris, are on the administrator side. I think Joe and Richard have a little more developer-side experience, so we might be able to help with some of those things too. But by all means, whatever is bothering you, whatever burning questions you have, let us know and we're here to help.

With that in mind, if you all don't have questions, we of course have a topic, and in line with that topic I have invited a couple of folks to join us. If you haven't seen it on your screen, today's topic is the SPLAT team as well as the vSphere Problem Detector. So first I want to hand off to Joe Callan to introduce himself, as well as what the SPLAT team is. Joe, what is SPLAT?

So, I'm Joe Callan.
I've been at Red Hat for about five years, working out of my basement around Kalamazoo, Michigan. No one knows where I actually live. Anyways, we're outside Kalamazoo. So, the SPLAT team: we're the specialist platform team. Each member has a specific platform that we're supposed to be an expert on, and our goal is pretty much: how does that platform integrate with OpenShift, and how do we help our developers, who are principally our customers, navigate through that platform?

Got it. So let me make sure I'm getting this right: internal to the engineering teams, you're an escalation point for when they have questions about how AWS or vSphere or Red Hat Virtualization work and how they should integrate with that platform.

Exactly right.

Got it. And we are also joined by Richard Vanderpool. Richard, as I said, you were on our very first show and now you're on our 29th show, but your role over the last, what, eight or nine months has changed pretty significantly, right?

Yeah, it sure has. So I've been with Red Hat for, gosh, about 26 or 27 months at this point. It feels like it's been 10 years, because it's been an absolute blast and there's so much to learn and so many ways to develop your skill set here at Red Hat. It's just been a really great time. But yeah, prior to about nine months ago,
I worked in OpenShift support, so likely a number of the folks in the stream have either talked to me over the phone or corresponded with me via case updates. That was a really awesome time to spend learning the product and learning the pain points. Now that I'm in the specialist platform role, I get to bring some of that customer insight into this position, and it has been awesome to take some of the issues that we were aware of in support and begin to make more folks aware of those, and to try to develop KCS documentation, fix bugs, whatever we can do to help the broader organization as it relates to the various clouds that we support.

Yeah, I can only imagine how much the engineering teams value that perspective, because any time I interact with engineering folks, they're always kind of begging for that customer perspective. Like, why are customers doing this? What does this mean to them? They love it when we bring them on customer calls, too. So I can only imagine how valuable that is. And yeah, to say that Red Hat is a fire hose in your first... I've been here for not quite three years, it'll be three years in August, and yeah, it's a fire hose, that's for sure.

No, that's par for the course. I've been here about three years as well, and yeah, I enjoy drinking from fire hoses. And I came over from the Ansible side, so, yeah.

So as always, for anybody who is new to the stream, and for those who are old to the stream, before we get into the topic I like to talk about a few things that are top of mind, or things that I feel are important to you all as administrators. So there are roughly three and a half things. I say that's the precise amount, and that's exactly what I meant, for anybody who watches Cowboy Kent Rollins on YouTube. My kids have been getting into him lately, and he measures everything.
He just dumps it in and goes, "Yep, and that's the precise amount that I wanted." It could be a teaspoon, it could be... anyways. So, a couple of things that are top of mind. First and foremost, if you missed the email or you have unsubscribed from our various mailing lists, you might not have noticed that Red Hat Enterprise Linux 8.4 has gone GA today. Let me find the right... you got it. So first I'm going to share my screen, and then I'm going to grab this link and paste it in here. This blog post, which is kind of a meta blog post, has links to all of the other various "what's new in 8.4" blog posts and other things that should be interesting to you. Come on, there we go; I'm trying to drag your videos out of the way so I can actually access the URLs. Additionally, I'll also make a plug for this on-demand webinar.

Oh, this one is great.

I haven't seen all of it yet, because I also need to learn what's new in 8.4. So I definitely recommend it if, like I do, you keep a video or something running in the background that you can half-listen to while you're doing other things. This is a great way to learn about what's new in 8.4 as well. So definitely check out both of those resources. I know we have a lot of crossover between RHEL admins and OpenShift admins.

And the RHEL show is today, so we'll be talking about this a little bit, I'm sure.

Oh yeah, good point: this afternoon at 2 p.m. Eastern, if you're not subscribed to the calendar, which I'll drop a link to now. The second thing I wanted to talk about is that after much editing, I'll use the word editing, the OpenShift subscription and sizing guide has been updated. You can see the last updated date here is May 18th. This is actually posted in two places: it's on redhat.com as well as on openshift.com. I'm going to put the redhat.com one up here.
So I'll paste that link. Inside of here, the major changes, from being a part of the editing process: we added a bunch of stuff around OpenShift Platform Plus. So if you're curious about what OpenShift Platform Plus is, what the components are, and all that other stuff, there's a huge section in here about that. Additionally, we made quite a few updates and clarifications around what is and what isn't an infrastructure node.

Mm-hmm. Nice.

Way on down here, there's a whole section around infrastructure nodes.

And we always get a lot of questions about that, so that's cool.

Yeah, and unfortunately I can't link directly to this section, for whatever reason. So there is an infrastructure node section here. It includes a lot of clarification on what does and doesn't qualify. So for example, if you are using the infrastructure node paradigm, it's not going to invalidate that subscription exception if you're using, like, a custom monitoring agent. "Hey, we wrote our own monitoring agent and I need to deploy that": we don't consider that an application workload, so it's still an infrastructure node. The same goes if you need to deploy VMware Tools, or whatever that happens to be. So definitely read through that if you are curious about any of the details, or if there are things that you think might qualify for the infrastructure node qualification. Again, lots and lots of details inside of there. And I did post that link, so be sure to check that out.

So the third thing I wanted to talk about is one that I want to make everybody aware of, but I don't have any additional knowledge beyond what you see here. It was yesterday, or it might have been early this morning: there was an announcement of this CVE around a symlink exchange attack with runc.
That's very bad.

Yes. I mean, it says it's "important," but the premise is very bad.

Yeah. So again, I don't have any inside details or knowledge. This is low-level stuff that is outside of my realm of expertise.

runc is the container runtime under CRI-O, by the way.

Thank you. Yeah, so it's the component that instantiates containers, right? Or no, it's the container runtime standard interface?

Yeah, CRI.

Yeah, container runtime interface, right. And it's used by OpenShift, it's used by RHEL, etc. You can see all the details inside of here. One thing to note from a Red Hat perspective: I think we mark this as reduced impact. If we come down here to the FAQ, it talks about SELinux helping to mitigate this. That doesn't mean that there is no impact. It just means that systems with SELinux enabled have a reduced impact.

So if you're watching this and you're not using OpenShift, but some other distribution, heads up.

Yeah, and the good news is that both OpenShift 3 and OpenShift 4 have SELinux enabled by default. So unless you went out of your way, and I honestly don't even know how to turn off SELinux...

Yeah.

Unless you have gone way far out of your way, SELinux is there. So please, please, please, if you have clusters that qualify, and there are several criteria here: if you allow untrusted users to run containers in your environment, or allow trusted users to deploy untrusted containers in your environment. If you qualify for either one of those, please keep an eye on this security bulletin. It'll be updated constantly. I'm going to jump to the top: if you saw up here, it was updated an hour ago.

Yeah, wow.

So they're continuing to update that with what's going on. The last thing that I wanted to talk about is something that actually came out of a phone call
I had this morning. I was on a call with a customer over in Europe, bright and early at 7 a.m. this morning. That's why I might seem a little more wired than normal, because I've had my whole pot of coffee.

You've had sufficient. Wow, for you.

So they asked us, and it was myself and one of our peers that were presenting: "Hey, sometimes if we open a support case, or even if we're just searching through the Red Hat material, we'll find stuff in the KCS, on access.redhat.com, that is either in addition to or sometimes contradictory to what's in the documentation. So how do we deconflict? How do we choose which one of those is the authoritative source?" I'll tell you my off-the-cuff answer, and I definitely want to get Richard's perspective here, because I'm sure, having been in support, he has some opinions on this. My off-the-cuff response was essentially: the documentation is the source of truth, except where the KCS supersedes it. From my perspective, to be clear, the KCS is often used to fill in gaps in the documentation, because docs take time to update. I think the one that this customer pointed out to me was infrastructure nodes. The KCS has this big article here on infrastructure nodes that goes into a number of details. And yes, it's been updated recently, but from what I remember, this article actually predates when we had detailed infrastructure node information in the documentation. So this was a bridge in order to get there. And the support folks largely do a good job of keeping these things up to date. So anyways, I want to pick your brain on this, Richard, and get your thoughts.

Yeah, for sure.
I'm actually quite familiar with this specific issue, because how to set up infra nodes has been a long-running topic almost since the inception of 4. The one bit of caution that I'll give when judging KCS versus docs: the docs are the source of truth, but they don't cover all possible scenarios. If you're looking at a KCS that is 18 months past its last update, approach it with caution, because things change fast. From y-stream to y-stream, from version to version, things can change and break, and these KCS articles are written to address very specific problems. In general, they're really good ways out of sticky situations. Now, and this is a good example, where we have infrastructure nodes being discussed in this KCS article, this is a situation where we need to raise a bug back to docs and say, "Look, this is a missing piece of the documentation that numerous customers are running up against. What is the best practice?" Because we may not want support defining the best practice on how to set up infra nodes. We're going to do what needs to be done to help the customer through, but it may be that the product team has a very different view of how to approach setting up infra nodes, which they did. So making the product team aware via bugs is really key, because otherwise they may not be aware that this is a massive question out in the customer base.

Yeah, definitely, and I like that there is that feedback mechanism as well. Sometimes it does take time. The docs team, they have a lot on their plates. And I know, back in the 4.4 time frame,
I wrote a series of blog posts on day-two actions with your deployed cluster, and it was nothing more than a reorganization of the docs. I think it took two versions of the docs for them to incorporate that; it was the 4.6 time frame when that was made into a section in the docs. Just keeping up with all the things going on in the product consumes a massive amount of time. So doing those reorganization things, it's not that it's a lower priority, but it can often involve a lot of effort.

Absolutely, because everything in the docs is tested very thoroughly. So it takes time to get a blessed, thoroughly tested procedure into the docs.

I'll also offer that if anybody has feedback on the docs, like UX, user experience, that type of stuff, let us know. I now have a recurring one-on-one with the director for the docs team. I guess I've opened my mouth enough that I was given that pleasure.

Privilege.

Privilege.

That's pretty cool.

So yeah, I'm happy to work on behalf of you all, our audience, customers, the field, etc., to bring that information back to the docs team. If there are things that are wrong, please open a BZ or a GitHub issue. But if it's like, "Hey, it would really be easier to use these docs if they were like this," that type of stuff, I am super interested in that. So please don't hesitate to send me a message. You can do that with just standard contact information: andrew.sullivan@redhat.com.
You can send me a DM on Twitter, @practicalAndrew. If you're watching any of the chats, it's the same as my Twitch username here, practicalAndrew. So yeah, please don't hesitate to reach out if you have any feedback on the documentation UX.

Yeah, and if you encounter a KCS that is not clear or that you have questions about, don't hesitate to open a case to get an official weigh-in from us as to the applicability of that article. Sometimes, depending on what you're doing, it makes sense to get a double check.

And yes, Christian is doing his leg day and saying he would like more anchor tags in the docs.

Yeah, so funny thing, Christian, now that you bring that up, I'll use this as an opportunity. In the docs, almost every one of the little subsections really is a link. The problem is they're not always obvious. I'm just choosing a random section here. So I can go to, like, "Obtaining the installation program," and you see this is a link. It's got the little hyperlink icon, and you can see up here in the address bar it's got the ending to it. This one does too. This one does too. Let's see if I can find one that doesn't. Of course, I would choose a section now that doesn't have that problem. I can't find what I'm looking for.

Well, yeah, you picked a good page.

I know.

Yeah, I thought it was only the major categories that had the links, because I ran into that too. Like, "this is the section I want to send you," but... The one that you're probably looking for is going to be in the install docs, like the bare metal install, where you have the networking infrastructure. There's a section under there that doesn't have it. I can't recall; I think it's, like, DNS.

I say this, but we appear to be getting better at this, because now they... yeah, here: load balancers. So, I'm using, what is this, Chrome. If I right-click and go to Inspect, and then I need to move this over here
so it's easier to read. If I go to the DOM inspector here inside of Chrome, you can see it actually does have an id. It's literally, in this instance, called load balancers. So I can copy that id out, then append a pound sign, or hash, and the id to the end of the URL, and it will go right there. So, a little docs hack. It took me a while to figure out that you don't always have to link to just these top-level ones. Or here, the network topology requirements is another one. See, there's no link associated with it. If I do the inspect, it comes up with network topology requirements, and I can do the same thing.

I didn't know you could do that. That is pretty cool.

That is awesome. That particular section, I've wanted to hard link like that maybe a hundred times. So that's good to know.

So yes, Christian, I agree, it's not the greatest workflow, but it gets it done. I know all of our docs are written in AsciiDoc, and I don't know if it's just...

I think you have to flag it, or...

Yeah, maybe it only links down to an h3 or something, and this is an h4 or an h5.

Right. Yeah.

So anyways, a little tip there.

Cool.

All right, so, today's topic: the specialist platform team. Again, very privileged, very happy to have both Joe and Richard join us to talk about what it is that they do and how that interacts with engineering. Joe, you mentioned that you are effectively an escalation point for the engineering folks on how specific platforms work. And just this morning I included you all, because I know you're both kind of vSphere specialists, on an email where it was like, "Hey, what are some of the low-level things that we can check for performance and all of these other things?"
Which I know is not necessarily directly in your day-to-day responsibilities, but I hope you don't mind me relying on you as an escalation point.

Yeah, no worries.

So what does a day in the life of a specialist platform team member look like?

I actually think one of the great things about this role is that you don't really know. I know that's really vague, but we're all on Slack and gchat, fielding questions at least part of the day. It could be from anybody within Red Hat that has a problem. CI is definitely a big focus. This morning I was making a change in CI to run our upgrade process between versions, 4.7 to 4.8, more often. So definitely monitoring CI, making sure that the specific platform we're responsible for is passing accordingly.

Can I go down a tangent for a moment? I think it would be interesting to know a little bit about the CI system that we use for OpenShift itself.

Oh, sure. So CI actually runs on OpenShift, and I will admit that I don't know it well enough. Our specific OpenShift cluster for CI actually runs in AWS, but our vSphere environment runs on VMware on AWS, so in VMC. Our current environment in VMC is six nodes. We originally had it set to eDRS. I don't know if anybody out there knows about VMC, but they will spin up additional bare-metal nodes for you based on the load within your vSphere cluster. We recently set that statically, so we've got six nodes. We run 12 OpenShift clusters currently, and usually we're hitting that max most of the time. It's all vSAN. VMC is a very opinionated design, so we get what we get. But that's the reason why we wanted it: we would rather not be vSphere admins.
We just want to use it. I mean, we can get into more details of how it's set up, but that's kind of the gist.

Yeah, that's the whole managed services thing, right? "I just want to use the platform. I don't want to have to..."

Right.

Speaking of which, I've been meaning to invite our product manager for managed services on.

Yes.

Hey Chris, I think you saw me discussing that idea: I want to have the managed services folks come on, and I want a really inflammatory title, like how managed services is the death of the administrator or something.

Oh, damn. It ain't the death, that's for sure.

Definitely not. As Joe said, you still have to do some administration.

Oh, yeah. We were just having an issue in VMC that we spent quite a bit of time on, where we were pretty much being administrators again. So yeah, it does happen from time to time.

And I know you all get involved in a lot of different areas. In particular, I'm thinking of how I saw both of your names in, like, every other message on the recent BZ about the vmxnet offload issue that delayed OpenShift 4.6 to 4.7 upgrades.

Right.

And I know JP Dade, who I saw here, I think he was one of the ones who was waiting on that to be fixed, along with many, many other folks that were impacted by it. So I think it would be interesting to get your perspective on, one, maybe a little bit about what the actual bug was, but two, the process behind that. How do we find these things? How do we address these things? How do we fix these things, on your side?
So it was kind of interesting how it came up originally. Richard and I were asked by... I don't know if you guys have discussed this on the channel or not, but the Assisted Installer folks reached out to us. They were saying, "We keep having issues installing on vSphere with 4.7." At first we were kind of like, "You've got to be doing something wrong." And the more we looked at it, we found out that they were right, there was this issue. So it was kind of a guess. I don't know why it popped in my head to start disabling offloading features, maybe just because I've seen problems with that in the past. I think the original offloading feature that I disabled was a higher-level one, and the cluster just kind of sprung back to life.

So the gist of the bug, and it's not really a bug per se, especially from VMware's perspective; they don't feel that it's a bug. VXLAN can use two different ports: 4789, and I can't remember the other one.

Don't worry about port numbers.

Okay, yeah, I don't remember. So VMware added offloading features for UDP for VXLAN, which is awesome, right? Because everything in OpenShift SDN is VXLAN. But if you change the port for VXLAN to the non-default one, the vmxnet3 NIC just drops those packets, which was the real cause of the problem. In RHEL 8.5, I think, under the current kernel BZ, they've implemented not a fix but a workaround to turn those offloading features off by default. And I think it's being backported to 8.4, which would then be picked up in RHCOS. Yeah, it was a very interesting bug to work on. Just to reproduce it was interesting.
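Editor's sketch: the manual workaround discussed here, turning off the NIC's tunnel offload features, can be expressed as a handful of `ethtool -K` commands. The snippet below only builds and prints those commands; it does not run them. The two feature names are an assumption about which offloads are involved (the speakers do not name them), and `ens192` is a hypothetical interface name, so treat this as an illustration rather than the exact change that was made.

```python
"""Sketch: build the ethtool commands for switching off UDP tunnel offloads."""
import shlex
from typing import List

# Assumption: these are the ethtool feature names that govern VXLAN
# (UDP tunnel) transmit offloads. The transcript does not name them.
TUNNEL_OFFLOAD_FEATURES = (
    "tx-udp_tnl-segmentation",
    "tx-udp_tnl-csum-segmentation",
)


def disable_offload_commands(interface: str) -> List[List[str]]:
    """Return one `ethtool -K <iface> <feature> off` argv list per feature."""
    return [
        ["ethtool", "-K", interface, feature, "off"]
        for feature in TUNNEL_OFFLOAD_FEATURES
    ]


if __name__ == "__main__":
    # "ens192" is a hypothetical interface name used only for illustration.
    for argv in disable_offload_commands("ens192"):
        print(shlex.join(argv))  # print only; nothing is executed
```

Keeping the commands as argument lists makes them easy to review, or to drop into a script that a machine config delivers to every node, before anything touches a real interface.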
It took a lot of work.

Yes. I tell you, I didn't know it at the time, and it didn't occur to me until, I don't know, two weeks ago. When 4.7 was first released, I was deploying a new cluster like I always do, and for whatever reason, at that time I was deploying a cluster that was all bare metal, or, you know, platform equals none, non-integrated, platform agnostic. I had my control plane on vSphere and my worker nodes on RHV, and I could not get the cluster to deploy. I could deploy vSphere IPI; it wasn't stable, but it would deploy. With that mix of nodes, I couldn't get it to come up at all. I was actually chatting internally with Christian Hernandez, like, "I don't know what's going on, I just can't figure it out." I was banging my head against it for like two weeks or something, and finally just gave up and would switch between all vSphere and all RHV instead of the mixture. It didn't occur to me until about two weeks ago that I bet I was hitting that bug, and I actually went in and manually turned off the settings. On last week's stream, and I'll pick up the link here in just a second, I showed how to manually do those things. I did that, and it just popped right up and finished the deployment without an issue. It was funny to encounter that.

And just a quick plug here: at openshift.com/blog, every week after a live stream we publish our show notes and everything.

Yes.

So be sure to check Friday mornings. Sometimes early morning, sometimes mid-morning, but we get these blog posts out. So here's the link I was referring to. It shows the machine config, and then what it uses, right here, that executes on the host
Yep, to turn off that offload setting inside of there.

And that's moved. This was my original PR to do that change, with the help of some networking folks that said, "Do it better this way." So that's moved to common, someplace else, because of what you mentioned, Andrew: if I've got platform none but I'm still running on vSphere, I can't just apply the fix to the vSphere section of MCO, because that will only be templated if the install config says platform: vsphere.

Yeah, I've been learning a lot more, and I reference this section of the Machine Config Operator GitHub repo fairly frequently, because if you're familiar with the on-prem IPIs, they use the keepalived thing. Here, I'm going to pick on one of these. We'll go in here and look at... actually, I don't want common, I'll look at the control plane. If we look at on-prem and files, we've got keepalived. This is what manages the API virtual IP address that's associated with an on-prem IPI cluster. So I use these, again, pretty frequently. But what you might notice here is that some of these, I don't know if it's on this one, have templates that they use. Like, this is a template file. So here,
So here Like where does that lb config lb port come from and so I've been like digging through and learning more about how all of this stuff works in order to Determine You know where where that about those values come from and how they're impacted and where changes happen at and all this other stuff so It's one of those like working for an open source company It's phenomenal that I can just go and look at the code and at the same time I can go and ask the engineers on the back end You know we're working at other you know previously non open source companies where it's one of those like well You got to know the right guy who knows the right guy who can enable your account to access the source control system With the right permissions, but we also have to trust you and make sure you're not just going to sell the code Like yeah, it's it's so nice Sometimes you still need to know the right guys. Yeah that knows how it works yeah So okay, so Splat the specialist platform team right so I think Joe I know you're a vSphere, you know, you're one of the vSphere specialists and richard I believe you're the other vSphere specialists, but on the team. I know you all have other Other folks that focus on other platforms Yeah, we sure do um our other the other platform that we're Um managing at the moment is azure. We're looking at expanding to two other clouds in short order But right now our primary focus is vSphere and azure and Joe and I usually hang out in forum VMware in in slack We have other folks that hang out in forum azure. Kenny Woodson is is It's also one of our colleagues and he's he's primarily working on azure We do also have someone working on arm, which is a pretty Which is pretty cool, too. Um, that's one of the newer platforms. We're looking at It's someone say my favorite architecture Yeah It's still my dream to stand up open shifts on a raspberry pi. I mean, I only have the four gig model So that's unlikely, but that would be cool. 
Yeah, yeah.

What I really need is to convince my wife that I can buy some, like, six eight-gig Raspberry Pis, then put ESXi Arm on them and see if you can build, like, a UPI. I don't know if IPI would be good.

Guys, I've gone down this road, and I actually ended up with Andrew pointing me to a website to buy used servers, and now there's a big blade chassis in my basement. So I've gone down the Pi route. I would like to go down it again sometime, but I would have to upgrade everything Pi-wise.

Yeah. Speaking of, we have a What's New and What's Next coming up relatively soon, don't we?

It is June 22nd, I think. Double-checking here: What's Next is later in June. June 24 is the What's New.

Okay. Yeah, so June 24, folks, tune in for the What's New in OpenShift 4.8 session.

Yeah, so that's already on the calendar, a month away. Time flies when you're having fun.

So, lorvis, I see you commenting here: "MCO is OpenShift's secret sauce." You know, it's funny, I'm an old vSphere administrator as well. That's what I did for a decade; I was an admin and an architect. So I look at MCO, and OpenShift's relationship to CoreOS through the bridge of MCO is very similar to how vCenter manages ESXi. It's all done, or almost entirely done, from within the platform, and it gives all of that configurability and manageability and all that stuff that's inside of there. So yeah, I'm also a big fan of the Machine Config Operator. It does take a little bit of a mindset shift. You get out of the mindset of "I'm going to SSH into the node and modify these files," and instead you treat individual nodes mostly as disposable.
So you only do things that apply to all the nodes, and we've talked about that before here on the stream: you don't use MCO to set static IPs on individual worker nodes, because it just doesn't work. So, yeah. Prepare to shift those paradigms. Chris is full of aphorisms today. Yeah, it's a very inspirational workout today. I'm kidding. So yeah, Richard, I wanted to ask you about your perspective, coming from the support side, where you had some influence and some input into what engineering is doing, like, "Hey, there's a problem, we need to fix this." Now, from your side, how does that change? What I mean by that is, I think you still have some customer interaction and stuff like that, and you definitely bring that perspective. How do you take your vSphere expertise, bring that into the OpenShift platform, and influence what they're doing with, like, the Machine API for vSphere? Yeah, so I would describe our team as a multiplier. We are actively seeking out input into things that need to be fundamentally reviewed. For example, Joe and I are actively reviewing case data right now to try to understand why installations on vSphere are problematic and why there are so many cases opened. That's just an example of one of the efforts we've undertaken to try to holistically look at the customer experience on vSphere, and then look at ways we can go back and either improve things ourselves or file bugs or RFEs to try to push improvements back in and have them considered for implementation on down the line. So our team is actively engaged, I wouldn't say necessarily in the support process, but certainly with the data that we're getting back, and Joe and I still get on the phone with customers from time to time.
I wouldn't say it's frequent, but I would say maybe monthly we're on a call with a customer, working through and trying to understand pain points to either push a bug along or see if we can help out ourselves in the resolution. So I'm going to steal a link out of our notes here. I hope you all don't mind me sharing this one, which I'm assuming is one that you all created. It is. Yep, absolutely. Yeah, so this is an example of an issue that is just pervasive throughout VMware cases. We have an ambiguous statement of "My master nodes are slow." Okay, well, what does that mean? It could be that the host itself is overcommitted, or it could be something wrong with OpenShift. We really don't know, or have the tools to dig in. So what we did was put together this knowledge article, which describes what we do as SPLAT engineers when we're trying to understand where the bottleneck lies. In conjunction with that, we also did a two-day education session with the support team to try to pass along this knowledge and give them some handles to know when to dig in and when to push back and try to get help from VMware to understand where a performance bottleneck might lie. Because we may have customers that are following OpenShift best practices, but they may not be following VMware best practices, and that can have a tremendously negative impact on the stability of their OpenShift deployment. And it's not something that we have insight into, necessarily, from a must-gather. Well, we've also noticed from customers that the silos still exist. Andrew, you were saying you were a vSphere admin, and I was a vSphere admin at one point. You know, there's a network team, there's a storage team, and there's a VMware team, and now there's an OpenShift team. At least on customer calls, you can see that those silos are definitely still alive. And sometimes there's an operating system team that's in there.
Sometimes there's a security team that's in there. Yeah. When I was a customer, I was the vSphere administrator, but I had to work with the network team, the storage team, the security team. We didn't know what was inside of the virtual machines. We were handed basically OS templates: here's your Windows Server 2012 template, here's your RHEL template (at the time it was RHEL 3 and RHEL 4), and whenever somebody asks for one of these, deploy one of those, that type of thing. So yeah, that kind of top-to-bottom troubleshooting is hard. How do you know where the problem is? And looking at this, there's some really interesting stuff inside of here. I didn't know that we had a podman command here to do the etcd perf test, the openshift-scale etcd-perf. That's really cool. Yeah, that's definitely a good one to run to check your storage, because we don't know what your underlying storage is. I mean, you could have SAN, local disk... Yeah, local disk, Fibre Channel. Oh god, Thunderbolt drives. It could be anything. Yeah, anything. So with the intent... Yeah, I'm sorry. No, go ahead. I was just going to say quickly: the intent of this article is to provide the VMware ESXi host side of analyzing and gathering data, but also, here at the bottom, it gives handles on how to collect the metrics that are going to be useful from an OpenShift perspective. So there's a set of queries down at the bottom of this article that you can run to pull back the data that we can analyze, or support can analyze, in conjunction with the ESXi metrics. So it gives you both the host and the guest perspective of things, and it's helpful in pinning it down. Very cool.
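As an aside for readers, here is a minimal sketch of the kind of check the storage test above is getting at. It assumes you have already collected fsync latency samples in milliseconds by some means; the function names and sample values are made up for illustration and are not taken from the knowledge article. The 10 ms threshold for the 99th percentile is the figure commonly cited in etcd's disk guidance, not something stated on the stream.

```python
def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def storage_ok_for_etcd(fsync_ms, threshold_ms=10.0):
    """Return (ok, p99) where ok means the 99th percentile is under the threshold.

    threshold_ms defaults to 10 ms, an assumption based on general etcd
    guidance; adjust it to whatever your own reference recommends.
    """
    p99 = percentile(fsync_ms, 99)
    return p99 < threshold_ms, p99


if __name__ == "__main__":
    # Hypothetical samples: mostly fast writes with a couple of slow outliers.
    samples = [2.0] * 98 + [12.0, 15.0]
    ok, p99 = storage_ok_for_etcd(samples)
    print(f"p99 fsync latency: {p99} ms, within threshold: {ok}")
```

The point of looking at the 99th percentile rather than the average is that a handful of slow writes is enough to upset the control plane, even when most writes are fast.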
And I want to use this as a jumping-off point, or another tangent point, around the vSphere problem detector. Both of you actually very helpfully corrected me, because I have been kind of casually associating your team with the vSphere problem detector when in fact it's not yours. So can you give us some background there? What is the problem detector, where does it fit in, and what does it do? And I'm going to pull up the GitHub page here. Absolutely. So the vSphere problem detector was introduced in 4.7, and its intent is to identify common misconfigurations that will lead to potential outages. It could be outages with regard to storage or the inability to scale machines. Its intent is to provide alerts specific to the misconfigurations being detected. It's something that rides along with the cluster storage operator, and it is something that's being actively enhanced as we get feedback from the field. A recent example of that, and I'm sure everyone has run into this at some point, is the encoding of a password with a trailing line feed. It's something that is really difficult to pin down unless you know that that's a problem, so we've got a PR in place right now to add detection for that. That's something that can waste days, and it can have a real business impact on our customers because they can't scale their workloads effectively. Yeah, and this one is interesting to me because I view it as a way, and I can only assume, or I hope, there might be more things like this in the future, to help us kind of pre-detect or pre-fix, you know, pre-identify.
I'm gonna keep using the prefix "pre" just the same, you know, for issues before they become issues. So at a previous company we got a new SVP of support, and I remember one of the things she kept talking about when she came in was, "We want to get out of proactive support and we want to get into preemptive support." Proactive is "I found something, I'm taking care of it." Preemptive is "I fixed it before it was ever a problem." So this is one of those things where there's a lot of potential for me, and it makes me very excited. Although I will say, I think that there is actually an issue with upgrades related to the vSphere problem detector, so I feel a little less guilty pointing that out now that I know it's not you all directly responsible for that. Well, there is, but it's being actively addressed. Yeah, right, a PR went in for it yesterday, actually, to address this. So, just for those that don't know, suppose you installed following the vSphere installation guidance pre-4.7, and you did not provide a valid username and password.
Let's say, for example, your vSphere administrator wouldn't allow you to have the credentials necessary to install, but you still followed the UPI guidance for VMware installs. When you upgraded to 4.7, your storage operator degraded, because we're checking for that now. So it's a valid catch, but we're softening it from a degradation to an alert. And yeah, kind of going back to the predictive piece of this, the vSphere problem detector feeds metrics into Prometheus, so tools like Insights (and I believe Telemetry also pushes up some of this data as well) are directly fed by the vSphere problem detector. Yeah, to help us. So I think one of the biggest things within the problem detector, at least from our perspective moving forward to the out-of-tree CSI, is to determine which vSphere versions our customers are running, so that we can better determine when we can roll out the out-of-tree CSI, which requires 6.7 U3 and higher and hardware version 15. Yeah, so we have a question here, and I think that might be Peter: any guidance on vMotion for control plane nodes and worker nodes? Oh, so I know from a documentation perspective... let's go up here, and we'll go back to our Installing on vSphere docs and just pick this one. There is a... So, Richard. Yeah, so Richard's tested this. Storage vMotion, we know, isn't supported. Richard's got a script, kind of like a chaos thing, where it vMotions your machines all over the place, all the time. Like, that's what it does. Okay. Richard, do you want to talk about it? For sure. Yeah, so, touching back briefly on the CI testing we discussed earlier, one of the key things you run as part of that is the OpenShift end-to-end test suite, which is a massive set of tests. Yeah, over 1,200 tests get kicked off.
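Picking up the version point from a moment ago, here is a rough sketch of how the stated requirement (vSphere 6.7 U3 or higher, hardware version 15) could be checked against an inventory. This is not how the vSphere problem detector does it. The input shapes are assumptions for illustration: the vSphere version is passed as a hypothetical (major, minor, update) tuple, and each VM's hardware version is a label in the "vmx-15" style.

```python
MIN_VSPHERE = (6, 7, 3)   # 6.7 Update 3, written as (major, minor, update)
MIN_HW_VERSION = 15       # hardware version named on the stream


def parse_hw_version(label):
    """Turn a hardware version label such as 'vmx-15' into the integer 15."""
    prefix, _, number = label.partition("-")
    if prefix != "vmx" or not number.isdigit():
        raise ValueError(f"unexpected hardware version label: {label!r}")
    return int(number)


def csi_blockers(vsphere_version, vm_hw_labels):
    """List the reasons an environment falls short of the stated minimums.

    vsphere_version: hypothetical (major, minor, update) tuple.
    vm_hw_labels: dict mapping VM name to a label like 'vmx-13'.
    An empty list means nothing in this simplified check blocks the rollout.
    """
    blockers = []
    if tuple(vsphere_version) < MIN_VSPHERE:
        major, minor, update = vsphere_version
        blockers.append(f"vSphere {major}.{minor} U{update} is older than 6.7 U3")
    for name, label in sorted(vm_hw_labels.items()):
        if parse_hw_version(label) < MIN_HW_VERSION:
            blockers.append(f"{name} is at {label}, below vmx-{MIN_HW_VERSION}")
    return blockers


if __name__ == "__main__":
    # Made-up inventory for demonstration only.
    for line in csi_blockers((6, 5, 0), {"master-0": "vmx-13", "worker-0": "vmx-15"}):
        print(line)
```

Returning a list of reasons rather than a single yes or no mirrors the alerting idea discussed above: each shortfall can be reported on its own.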
So what we theorized was that vMotion in and of itself wasn't actually causing problems, but that it was the underlying infrastructure that was causing problems when vMotion was invoked. So in order to prove that, we wrote this chaos script that vMotions the control plane nodes and the worker nodes almost continuously while running the OpenShift tests, and we measured to see if there was a marked increase in the number of failures. There was not a significant uptick. Now, this does assume... so I have two environments. I have my environment behind me, which is a Dell 2U, I'm sorry, it's an HP 2U server. It does not hold up as well under this test. Where it did hold up well was in VMC, where we have, I think it's a 25-gigabit backbone, right? That's 100. So vMotioning all that time there versus on my one-gigabit network: in my environment it had problems, and in VMC it did not. It's simply because if your environment is performant enough, it can handle the load. If it is not, then vMotioning control plane nodes may not be the right choice, because you may see an API outage. And just to add to that, I just pulled up vCenter, and our CI cluster has vMotioned 48,000 times. I'm sorry, how many times? 48,000. Yeah, so that's going for the world record. Exactly. So that goes back to when we had EDRS enabled. Now we've set that statically, so I don't think it's growing as fast as it was. Okay. Because it pulls hosts out of the cluster once you reach the threshold, too, so we were increasing and decreasing hosts quite often. It's interesting.
I wonder if that's one of those things where, if you are experiencing a problem and you have Enterprise Plus and distributed virtual switches, turning on Network I/O Control to limit the amount of throughput that vMotion can consume would actually give the VMs the ability to have their bandwidth. Especially etcd, I can imagine, because vMotion, from what I recall, has no limit on the amount of throughput that it will consume. So if it's sharing bandwidth with, for example, the network adapters being used by the control plane for etcd communication, that would be bad. I would also refer back to the VMware best practices around vMotion, because there are very specific recommendations to try to reduce the possibility of having that type of resource contention, but I would say it's not something that is implemented to the T very often. Yeah, that makes sense. I'll dig up the link for VMware's, or if you have it handy you can just post it over to me. I'll see if I can dig up the link for their vMotion best practices, and we can share that in the blog post. So again, I know we've only got a couple of minutes left, and I actually do have a hard stop at the top of the hour. Please keep an eye out on Friday for the blog post. In those blog posts, if you saw earlier, I put a timestamp in where we answer questions and stuff like that, to try and turn those into a reference. I know that with a 60-minute recording, finding the 30 seconds where we answer your question is hard, and sometimes you don't want to listen to 60 minutes of me babbling. So I do try and do that, and we also include all the links that we shared here. We'll have all of that stuff inside of there. So with just about four or five minutes left, please, if you have any other questions, anything that's top of mind, let us know. So, to circle around and answer that question about vMotion.
So yes, vMotion is absolutely supported. Nothing wrong with doing that. If you start to encounter issues, maybe check your environment and see what's going on there, but it is something that is supported by us. All that means is that if you open a support case, they might end up sending you over to that link that I had up a minute ago, the one created by Joe and Richard on how to determine the performance characteristics of your vSphere environment, to help troubleshoot those types of things. But yeah, storage vMotion is definitely not supported. And I know, among other things, that it causes issues with persistent volume claims as well. If you storage vMotion a node that has PVCs attached to it, it also brings all those PVCs with it, which then causes the provisioner to lose track of them. It just doesn't know where they are. So that's an important one. Yeah, and I don't think that changes with CSI either, from what I have from the storage team. We were in a chat about a month ago, and it sounded like the original plan for the out-of-tree CSI was that it was going to support storage vMotion, and it sounds like that's being backtracked. Though if you look at the recent PRs, there was one for if you were running in a vSAN environment, so maybe that changes how they can handle it. But I'm not quite sure, if you have vSAN, why you'd be storage vMotioning. Well, yeah, unless it's to move off of or onto vSAN. Yeah, to vSAN, right. I guess that doesn't make sense. So okay, well, I don't have any more questions for y'all. I just want to say thank you very much for coming on today. I know our originally scheduled subject for this week was licensing and entitlements, but our guests had to delay a week. Apparently there's some PM thing going on today. So we'll be covering that next week, for anybody who is watching. So thank you very much, Joe and Richard. I know I asked you, like, super last minute.
"Hey, do you mind joining, or is it possible for you to join?" And you both immediately jumped on that grenade, and thank you so much. I really do appreciate it. Thank you. Yeah, I really appreciate it. It's been fun. Yeah, sure. So for anybody in the audience, if we didn't adequately answer your question, if you didn't get to ask your question, or if something comes up after the fact, whether or not we're on the air streaming, please don't hesitate to reach out to me, andrew.sullivan at redhat.com. Chris is short at redhat.com. You can also reach out on Twitter, practical andrew or chris short. We're more than happy to field any and all of those questions at any point in time. So thank you very much for watching today. Again, please keep an eye out on Friday for the follow-up blog post, and have a great rest of your day. Yeah, see you soon, folks. Bye.