Good morning, good afternoon, good evening, wherever you're tuning in from — welcome to another edition of the OpenShift Container Storage Office Hours. Today, we are going to break things on purpose. I'm joined by Michelle De Palma and Eric Nelson. Michelle, how are you doing? It's been a little bit. It has, and I have to tell you, breaking things on purpose is like my favorite thing, so this is going to be a good show. It's going to be fun. Okay, so we're still on our migration train. If everyone remembers, last time we talked about migrating from OpenShift 3.x to 4.x in general, but now we actually want to get into the problems you might encounter. So Eric Nelson is back, and I'm going to ask Eric to give a quick level-set overview of where we are, so we can pick up where we left off last time. Sure. Hi, thanks for having me back. Hey, thank you. So right now we're looking at the OpenShift 4 console. MTC is a tool that helps you migrate your app workloads — typically we see an OpenShift 3 to an OpenShift 4 cluster, although that's not actually required. It'll do 4-to-4 workloads as well, and we've had people even go from 4 to 3. It's agnostic as far as which direction you're going. We're looking at the OpenShift 4 side, so this is the target. We call this the control cluster — typically you have one controller. The API surface is a set of CRDs: you create CRs, and the controller orchestrates things across clusters in order to make your migration happen from a source to a target cluster. It has its own UI. So, Michelle — there it is. There it is. So at a high level, when you're thinking about a migration, you've got two clusters, of course — you're going from a source to a target cluster — and you need an intermediary replication repository. And that's going to hold...
It depends on exactly what type of migration you're doing. We actually support indirect and direct migration: with indirect, you upload objects into storage in between your clusters and they end up in the target from there, while with direct, you move data straight into the target. There are reasons why you might or might not want each, depending on your network topology and that sort of thing. But the replication repository is always required, even if you're doing direct, because your Kubernetes resources — the objects themselves — are serialized and deserialized out of the replication repository. Direct handles your volume data directly, and that's the important thing: it's going to be your heaviest bucket of stuff. So that's the second piece that you need, and then you're good to go to describe a migration plan. You plan out your migration ahead of time, to be executed at a time of your choosing in the future. So today what we're going to show is a series of break fix exercises. These deliberately break things, and then we're going to look at repairing them, so you'll get a chance to see what things look like when they go sideways. For the first one, we are going to misconfigure our cluster and replication repository, we should see MTC let us know, and then we can repair it. Okay, so just so everyone can see, we have the OpenShift TV demo repository with our broken stuff, and I've already cloned it. So talk me through the setup: at this point MTC is set up, but we've only got the OCP 4 side configured. Yes. This is what you would see if you had just installed MTC into your OpenShift 4 cluster — no other clusters have been registered. We have an OpenShift 3 cluster with some workloads on it, but it hasn't been registered yet. So when you come into MTC, the first thing you want to do is register your source cluster.
And separately, you want to set up your replication repository. So we're going to go through doing that — in a corrupt way. Michelle, if you want to hit the Add cluster button, we'll see exactly what information it's going to ask you for. In the exercise, if you recall, I said that the API for MTC is all CRDs, so you can do everything from the CLI — more specifically, from the Kubernetes API. You can use kubectl to create your objects, and we actually have some YAML files that automate the creation of this. What you're doing when you fill out this form right here is actually creating a MigCluster CR instance to register your other cluster. We're going to put some bad values in it and watch the controller reconcile it and determine that it's bad. Okay, do you want me to do that now? Go to the CLI — let's get out of this for a second. All right, first one: misconfiguration. Yep. Okay, so there's a 01-misconfigure, and if you want to click on that, maybe we can take a look at what's actually in there. The instructions just tell you to create it, but it's probably useful to take a look. So you can see there's a series of resources here. There's a MigCluster object, and in order to actually register another cluster, you really need two things: the coordinates to the cluster, which take the form of the API server's URL, and secondly, a service account token. When you install MTC, it requires an agent on each of your clusters — it's actually an operator that deploys its operands. There's what we call a control cluster and a remote cluster variant; the remote cluster gets a subset of the things that the control cluster gets.
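As a rough sketch of what that Add cluster form produces behind the scenes — the cluster name, URL, and token here are placeholders, not values from the demo — the MigCluster CR and its referenced secret look something like this:

```shell
# Minimal sketch of registering a source cluster via the Kubernetes API,
# assuming the default openshift-migration namespace. All names, the URL,
# and the token below are illustrative.
oc apply -f - <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: ocp3-cluster-token            # hypothetical secret name
  namespace: openshift-config
type: Opaque
data:
  saToken: PHNlcnZpY2UtYWNjb3VudC10b2tlbj4=   # base64-encoded SA token
---
apiVersion: migration.openshift.io/v1alpha1
kind: MigCluster
metadata:
  name: ocp3-cluster                  # hypothetical cluster name
  namespace: openshift-migration
spec:
  isHostCluster: false
  url: https://api.example-ocp3.com:8443       # source API server URL
  serviceAccountSecretRef:            # credentials live in a secret
    name: ocp3-cluster-token
    namespace: openshift-config
EOF
```

The UI is just filling out these same objects through the Kubernetes API, which is why the exercises can drive everything from YAML files instead.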
Specifically, it's Velero, really, because we require Velero in order to pull our stuff out of that cluster. In addition to that, the operator will create a service account and permission it with all the permissions we actually require in order to do this on the source cluster. That service account token is what we use for the credentials. Separately, you're going to see there are some AWS access and secret keys — a bunch of junk in this file — and that's what we're going to want to go in and repair. So if you wanted to, you could automate all of this by creating a file like this and filling it out with all of your good details; the UI itself is just talking to the Kubernetes API and filling out these objects as well. So we're going to create a bad MigCluster, and we're going to create a bad replication repository, and we're going to see how the controller reconciles that. Okay. All right, so hang on, let me just go back to the README so it's nice to look at. We're going to apply this. Just so everybody knows, I'm logged into my 3.x cluster. Yep, the source. All right. And... I don't know where to start — we're breaking things early. So, we actually ran that on the wrong cluster. I was mistaken in telling you to run that on the OpenShift 3 cluster. Now this is awesome. All right. Yeah, you are really going to get it now. All right. So what did it create? It created the MigCluster and the MigStorage on the OpenShift 3 cluster. The reason that's not going to do anything is there's no controller on the remote cluster actually reconciling those — nothing's watching them. Right, that's a core concept in Kubernetes: you've got your watchers of these objects.
It's also telling you that it can't find openshift-config, because the way the data model is structured, a MigCluster references a secret with your sensitive credentials, and on OpenShift 3 there is no openshift-config. So I knew right away, seeing this, that we were doing it on the wrong cluster, because openshift-config isn't a namespace that's there by default on OpenShift 3. Good to know. So you want to log into the 4 cluster for the next part. Yeah. I'm going to stop sharing for one second, hang on. Feel free to chat amongst yourselves in the meantime. There we go. So, Eric, can you send those to me? Yep. Thanks. Let me grab the command for you. So, while everybody's waiting: don't forget, Red Hat Summit is a big event this year, and you should go because, well, it's free. And if that's not a good enough reason — well, it's still free. So if I just drop the link in chat there, if you want to hit that, send them whatever info they ask for, and away you'll go. Red Hat Summit is broken up into three parts this year. The first part is going to be more keynote focused. The second part is going to be a little more, you know, technical. And then for the third part, we're hoping, fingers crossed, that we can come to a place near you and have some hands-on workshop kind of thing. Sign up and get in the loop on all that info — even if you don't want to go to the first part or the second part, you still probably want to know about the third part. So go ahead and sign up for all that info. You ready? I think so, I'm just logging into the cluster here. All right, I've got the command for you, Michelle, I'm going to send that to you. Thank you. So, if you're not familiar with operators, folks... You don't have to work away from home? Oh, I'm sorry to hear that, Tony.
But if you've not heard of operators and what they do, we have a wonderful book on operators you can get for free — just head over to the operators ebook link I dropped. Operators are incredibly powerful. There's a GIF of a person jumping into a box and the box closing — essentially what you're doing is taking the operational knowledge that someone has in their head and putting it into code, for lack of a better term, right? You can write these with Helm charts or Ansible or Go — there are now a billion ways to write a good operator these days, it feels like. So don't be afraid to get your hands dirty creating some operators to start automating your workloads, or even automating deployment of your entire application stack, because operators can do that. They live on the premise of a reconciliation loop: they're always watching a certain group-version-kind, waiting for events to happen, and they run once something is triggered. It's an incredibly powerful way to do things that would normally be either step by step, or something you'd want to do on a regular basis — get this piece of work out of my way and let me get on to the more challenging things, like blowing up clusters. Okay, we're good. So again, I'm applying this to the 4 cluster, where the watcher is to reconcile it. Okay. There we go. All right, it says it's created a repository — let's go see. Okay, look at that. So we created these CRs with some known bad values in them, right? And the controller — what Chris was describing — is reconciling these objects toward their desired state, and as part of that it runs through a validation.
And it does that for both a cluster and a replication repository. For an S3 bucket, it actually creates a dummy piece of data in the bucket in order to verify that it can talk to it, and for the cluster, I believe it makes some API requests in order to confirm that it's reachable. So there's a kebab menu to the right of that connection field — if you hit Edit, we'll see the values. We've got the name, and you can see misconfigured-url.com, which hopefully does not refer to a Kubernetes cluster, and then we've got a bad service account token. And it gives you the reason down here. Yes. And if we were to pull up the MigCluster CR — we don't have to do this — it's going to have a status section. Typically the spec section is the user's intent for a CR, while the status section is the current state of that CR as the controller views it while reconciling. That's the typical Kubernetes pattern, and that's where you're going to find this type of information: you'll find conditions associated with that CR that describe what's going on with it, and in this case you'd see a failed condition that says, hey, your connection is bad. So it's looking for the domain — misconfigured-url.com does appear to be available. Okay, so how do we fix it? Can we fix it from here, or what do you prefer? And should we go look at the replication repository and the error there as well? Yeah, let's take a look there too. And either I can shoot you the service account token, Michelle, or I can actually repair it on my side — I'll fix it, and you should see it on your side because it's live, it's polling. Okay. So to fix the cluster, I can go in and try to fix that. Yeah, yeah. Eric is the man behind the curtain fixing the cluster. Yeah, so I'm getting the...
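Checking those conditions from the CLI looks roughly like this — a sketch only, with a hypothetical cluster name, and the condition shown is illustrative of the shape, not a verbatim controller message:

```shell
# Dump the MigCluster CR and look at its status section, which is the
# controller's view of the object (as opposed to .spec, the user's intent).
oc get migcluster ocp3-cluster -n openshift-migration -o yaml

# In the output, the validation results show up as conditions, shaped
# something like:
#   status:
#     conditions:
#     - type: Ready
#       status: "False"
#       message: "Failed to connect to cluster at misconfigured-url.com"
```

This is the same information the UI surfaces next to the connection field; the UI is just reading these conditions back.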
API location, so I'm going to pop open Edit and drop in the URL. I'll refresh at some point — actually, you should see it live; if you close the modal, it's polling. Okay, cool. So I'm updating the cluster with my credentials right now. It's validating. Okay — now it's connected. Nice. Yay. Okay, so you fixed it. Yay. Hey, look at that — that's our first break fix — and it's validating it again. Okay. And that will just continue to happen: if we were to go in and break it again, it would end up not connected. Actually, most of the data model and the controller are watching for Ready conditions — that's how it knows when it can do work. It won't actually initiate a migration unless everything is ready. So I can go in and do the same thing for this one, if you want to take a look at that. I'm going to do it from my side as well, because we do have some sensitive credentials here. Yeah, don't share anything you don't want the internet to see, please, it just makes my job hard. It's like watching a pot boil. Or, you know, paint drying. Somebody said something in chat. Yeah, are there any questions so far? Just people appreciating the break fix idea and walking through the troubleshooting. Right, it's like I always say before anybody goes on for the first time on the channel: people appreciate when things fail, because they get to learn how we troubleshoot. Yeah, so we wholly embrace failure. Okay, let's see. I thought it was interesting that Eric mentioned that it doesn't have to be OpenShift 3 to 4 — it can be migrations between clusters of different versions that are all on 4. That could definitely open up possibilities. I tend to think of it as just coming from OpenShift 3.x, but that's not true, you can do all kinds of stuff, so that's cool.
So I'm actually having a problem registering this, so this is off the rails a little bit. I think we can repair it, so we should probably walk through that. If you open up the kebab menu and hit Edit — okay. So, let's see: I updated the name and credentials, but now it considers it to be a self-hosted S3 rather than an AWS S3. Up top, if you see the storage provider type — let me change it. Yeah. Okay, so try to update that, just hit the Update button; I think the values I provided are good. And if you scroll down that modal, we may see an error here. Okay, let's see — the backup storage config settings. You can see that it's still trying to connect to the misconfigured endpoint — of course there's a bucket that lives there, right? Not yet, but I'm sure someone could get one running by the end of the show. Yeah. So there are a couple options you could take here. Let's see if we can take a look at the actual object on the CLI. What am I editing? Do an oc edit — let me give you the specific command. What I'm trying to do is pull that bad endpoint off of the object. Let me double check the command. Do you have the openshift-migration namespace active? That's the namespace where you're going to find all of our Mig API objects. Openshift what? Migration — sorry, I'm usually in a different namespace. openshift-migration, did you say? Yeah, I think you want to make it active by doing oc project openshift-migration. Okay, okay. So if you do oc edit migstorage — one word, that's the short name of the CRD — and then a space and s3-repository. Good. Yep. Okay. So this is the CR that represents the repository. Let's see if we can find that misconfigured URL. Oh, there it is. Yeah. Go ahead and actually delete that key.
If that's possible. Okay. All right, that's okay — I was just going to talk through some of the other fields, but it's possible that may have made your vi crash. That's impossible. Yeah, vi doesn't crash. All right, it's completely deleted. Okay. Do you want to talk about the other fields first? Yeah, sure. So, I mean, there's nothing super interesting here, but you can see that we use the bucket name — really, it's similar: we need the coordinates to the bucket and we need the credentials. The way you provide your credentials is not by putting them in as plain text; they're sitting in a secret. You can see that it actually refers to a secret, and to fully qualify that, we provide the name and the namespace. Okay. Fingers crossed that secret has the correct values. And we should see — did you save it? I did. Okay, it's thinking, hang on. Yeah. Okay. Yay — I think it actually just came ready. Did it? Yeah. Nice. Good. Just like that. We repaired that bucket so fast. We have repaired the cluster and the intermediary replication repository. So that is the first break fix exercise — some speed bumps, but we made our way through them. Oh, it's good. Okay. So do you want to move on to two? Does anyone have questions? No questions. Okay. That gives you an example of how, generically, when things go wrong, the controller should be reconciling and giving you meaningful error messages. In the case of the replication repository, that actually wasn't planned, but based on the error reported by the controller, it was clear that "misconfigured" was wrong. So naturally, the first thing I went to do was check the CR, because that's going to be where the error is — and of course "misconfigured" is the wrong value. Okay, that makes total sense. So are we moving on to two? Sure.
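The MigStorage CR being edited above looks roughly like this — the repository name matches the one used on-stream, but the bucket, region, and secret names are placeholders:

```shell
# Sketch of the MigStorage CR that represents the replication repository.
# A stray, misconfigured endpoint key under backupStorageConfig is what
# got deleted during the fix; what remains is the bucket coordinates plus
# a reference to a secret holding the credentials (never plain text).
oc apply -f - <<'EOF'
apiVersion: migration.openshift.io/v1alpha1
kind: MigStorage
metadata:
  name: s3-repository
  namespace: openshift-migration
spec:
  backupStorageProvider: aws          # AWS S3, not self-hosted S3
  backupStorageConfig:
    awsBucketName: my-migration-bucket    # hypothetical bucket name
    awsRegion: us-east-1
    credsSecretRef:                   # fully qualified: name + namespace
      name: migstorage-creds          # hypothetical secret name
      namespace: openshift-config
EOF
```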
Unless there are any questions — doesn't sound like it. All right. Okay. So I can give a little brief on this one as well. These break fix exercises were written for a version of MTC — actually, let me give a little background on this part first. When MTC transfers your data, there are a lot of different ways it can do that. One of them is a file system copy, which you can kind of think of as an rsync. At one point in time, we really only had one mechanism for that, and that was Restic. Restic is a tool that's bundled with Velero, and Velero is a backup and restore tool. It's Kubernetes native: you create your Backup and Restore CRs, and it pulls your Kubernetes objects and your data and drops them into an intermediary repository. That's actually why we require one — we're built on top of Velero. So Velero pulls your data and drops it into the replication repository, for restore at another point in time in the future, because it's a backup and restore tool for DR. In our case, for a long time the way MTC worked was that it would orchestrate backups and restores between a source and target — it abstracted the implementation details of a migration, which was a dance of backups and restores. And of course that's not ideal, because it means, say you had a terabyte of data in a volume, you'd need to put a terabyte of data up into an S3 bucket and then pull it back down — you're transferring two terabytes of data, twice the amount you actually have, and it's going to take longer, etc. So with a recent version, we've moved to supporting a direct migration, which is quite literally an rsync — I can go into the technical details if people are interested — over a route to your target cluster.
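For context, the Velero objects MTC orchestrates under the hood during an indirect migration look roughly like this — a minimal Backup CR sketch; the names and the migrated namespace are illustrative:

```shell
# Minimal sketch of a Velero Backup CR of the kind MTC creates on the
# source cluster. Velero serializes the Kubernetes objects (and, with
# Restic, the volume data) into the replication repository for a later
# Restore on the target.
oc apply -f - <<'EOF'
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: migration-sample-backup       # hypothetical name
  namespace: openshift-migration
spec:
  includedNamespaces:
  - nginx-example                     # the app namespace being migrated
  storageLocation: default            # points at the replication repository
EOF
```

A matching Restore CR on the target cluster completes the "dance of backups and restores" described above.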
So it goes directly, which means in theory it should go twice as fast, because you're moving half the data that you needed to with the former indirect mechanism. As part of the indirect mechanism, we require something called stage pods — and in fact, direct requires a similar mechanism. In order to actually access the data, we need to bring up a pod so we can run things like rsync with the volume actually mounted; we can't access it without the volume mounted. There are actually other tools we have that can get directly onto the nodes and access the data from there — a much lower-level copy, which is sort of an escape hatch — but in general, that's how this works. So there are cases where stage pods may not actually spawn the way we require them to, and it's a hard prerequisite — we need to be able to do it. Things like resource quotas may prevent them from spawning; in other cases, maybe there's a node selector that doesn't find a matching node. What you're going to see here is what it actually looks like when that happens. These exercises are based on real-world scenarios that we've seen and had to work through. So in order to fake a problem with a stage pod, what we're going to do is corrupt the image pull spec so that it's bad. We should see an image pull failure, the stage pod will never come up because of that, and this will be in the middle of a migration. So, question — I can't remember, because it was a while ago: at what point are we selecting whether it's going to be a direct migration or not? That's in the migration plan details? Yep, so we're going to go through and write up a migration plan, which is a good refresher for folks as well. Now we've got a stable cluster and replication repository — we've configured everything.
We fixed any problems that we fat-fingered into the UI, so we're ready to go ahead and set up a migration plan. Okay. So for the sake of the exercise, am I doing the preparation? Let me take a look — I don't think the order matters, so let's go in and actually corrupt it first. There are two shell scripts in here; let's dump the script so we can take a look at what it's actually doing. The stage pod one, yeah — the restore script is the fix. All right. So, hey, I'm waking up over here, give me a second. Oh, you're fine. Michelle, I thought you had a vat of coffee. I'm working on mine, it's right here. Not this one — oh, I see it. Sorry. Copy, paste — I like that. Okay. So what this script does: we store the image pull spec — the fully qualified string we use to identify the stage pod image. That's parameterized and held inside a config map that the controller reads, I want to say — it's the migration cluster config. The first command gets the good value: we extract the current image and echo it into a file so it gets persisted to disk. Then we patch in a bad value — you can see the quay.io/konveyor invalid image there — which patches the config map with a bad value. So the controller is going to try to spawn the stage pod with that bad value, and it's going to fail to come up. And finally, it actually deletes the config map, which forces the operator — as Chris was talking about — to re-reconcile and recreate this config map with the bad value in it. That's what this script does. Can I run it? Yep, let's do it. Okay, so we deleted the config map. Right now I'm expecting the operator — the operator is watching for this config map — to see that it's not present and reconcile with the newly updated bad value.
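The break script just described can be sketched along these lines — the config map name follows the one mentioned on-stream, but the data key and the invalid image tag are assumptions, not values from the demo repo:

```shell
# Sketch of the break script, following the steps described on-stream.
# The key name STAGE_IMAGE and the bad tag are illustrative.

# 1. Save the current (good) stage image pull spec to disk for the restore.
oc get configmap migration-cluster-config -n openshift-migration \
  -o jsonpath='{.data.STAGE_IMAGE}' > good-stage-image.txt

# 2. Patch in a bad pull spec so stage pods hit an image pull failure.
oc patch configmap migration-cluster-config -n openshift-migration \
  --type merge -p '{"data":{"STAGE_IMAGE":"quay.io/konveyor/invalid-image:bad"}}'

# 3. Delete the config map so the operator re-reconciles and recreates it.
oc delete configmap migration-cluster-config -n openshift-migration
```

Note that step 3 hands control back to the operator; if the operator regenerates the config map from its own source of truth rather than the patched value, the bad value can quietly come back corrected — which may explain why, later in the episode, the stage unexpectedly succeeds.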
Then we can actually go in and create the plan while that reconciles. Okay. Refresh my certificates — hang on a second. Can you see my screen? Yeah. Okay. This just happens; you might need to re-accept that. Yeah. Or maybe not. Sorry, I just had a little blip on my side, connection-wise. Okay, so our clusters are okay, our repository was fixed already — we did that — and we're going to add a migration plan. Okay. So feel free to enter anything. Test. Bad. How's that? Yep. So we've got our two clusters. I should mention here too: when you register a cluster, it's not designated a source or a target. That's actually how people are able to go in reverse directions — they can just create a new plan and swap the source and the target. Yeah. So when you create a plan, that's when the clusters actually get that semantic designation. Let's go ahead and hit Next. Okay — oh, maybe go back, I think we might have... The source cluster selection — look, there's not much to screw up here, so I'm surprised to see this. This may be another opportunity to debug. Let me take a look. Okay. So you're saying this isn't what you expected? Yeah. So what this screen does is a project discovery. We have another controller running, so when you create a plan and say "I want to use this as the source cluster," our controller actually goes and looks at that source cluster and does some filtering — it basically determines the set of namespaces that are eligible for you to choose as your source. And I'm surprised to see dvm-benchmark, because I think that's actually on the host cluster rather than the OpenShift 3 cluster. So something may be off about our configuration. Okay, let's go to the clusters and double check that it is actually what we think it is. Oh, this is interesting.
I gave you the wrong one — we registered the 4 cluster as the 3 cluster. Okay. So more breaking things on purpose. So do you want me to fix it, or do you need to put in the account token? In order to fix this, we just need to update these values so that it's actually pointing to the correct cluster. And I think I actually know how that happened: I had logged into the wrong cluster, ran the command, and gave you the coordinates and the token that dropped out of it. So I mean, this is the kind of stuff that you're going to do by accident while you're doing these migrations. So let me give you the actual credentials — or maybe I'll just re-register it so it's correct. Yeah, do that, because that way I don't have to worry about it. Yeah, I've got it right here in front of me. Yes, folks, this is what you do Saturday mornings when you're doing your big migration of something. It happens, people. Source-destination wires crossed, game over. Yeah. It happens. And David, I'm looking into an answer to your question right now — I'm pretty sure it shows all project resources, like all of them, cluster and apps combined, but I've got to double check, because it's 3 and I don't remember. There it is. Okay. Connected. We're happy. So now, because I've got this controller re-verification, I'm pretty confident we shouldn't see that anymore; we should see the source apps. We're good. Migration plan, then my repository, next. There we go — there's another fix. That is literally the kind of stuff we're typically helping people through, and we're not immune to it either. So we've got a few projects to choose from here. Let's choose nginx-example — and you can see it's got pods and PV claims, so it's actually using a PV, it has some data.
And actually, this may show off the feature I mentioned where folks fill up their volumes past what they've actually requested — I can talk a little bit about that if you think it's useful. Okay, so nginx-example should have one PV, right? I thought I saw one. Yeah. All right, so that's our PV. This is where we choose how to migrate the data. I can dive into what these are, though that might lead us off onto a bit of a tangent — we kind of covered it last time, but I can give a quick overview. Sure. So copy is literally a clone of the data: it's copying the data onto the target. There are a couple ways you can do it: there's a file system level copy, which is going to be like an rsync, and there's a snapshot, which is a provider-supported API — like an AWS snapshot — and it will restore from the snapshot as long as you're going to the same provider. A move is interesting, because it's actually a PV swing: it's like an unplug of the volume and then plugging it back into the target. And then, yeah, so we're going to go ahead with copy. I actually just got a notification that my headset is running low, so it may drop out, but I can grab some other headphones and I'll be back in two seconds. All right, so we're going to do a copy — my light is fading over here too, it's great, things are going sideways. Okay. We're happy. So here is where you can designate your target storage class — we're actually going from an OCS 3 to an OCS 4, which is on our 4 side. Okay. And you're happy. Okay. I'm just poking around — what is this "verify"? I can't quite see it. It checksums on verify. Okay, so that's going to take a lot more time, but it will validate the files at the file system layer.
So here, Michelle, you were asking how to tell it whether to use direct or not — this is where you do that. Okay. So actually, can you uncheck that? We're going to do indirect, so we'll use the old-style stage pods that this exercise was written for. Okay, cool. All right, so no direct image migration, no direct PV migration, and we're not adding any hooks, correct? No hooks. Okay. So this warning is telling you that we have a feature enabled that intelligently detects when your volumes are approaching their capacity, and it will request a slight buffer on the target side to try to give you enough space. This is a feature we added because we've identified a number of failures where people request the same size volume on the target side and then run out of space — things blow up because the target volume isn't large enough to hold all of their data. Okay, so it's done automatically — what if someone wanted to tune it? Say that again? What if someone wanted to tweak it, or say "don't do it here" — do they have any control over how much? There are some low-level CLI ways to do that, and we actually have some stories on our roadmap to make that more accessible, so you can tweak the threshold that would trigger it and also the amount of space it buffers on the target. Okay — also, by default this feature is turned off. Okay. All right, but we're going to see it in action. Okay. So at this point, if you run a stage — yep, we're actually going to do the stage. It's going to attempt to stage the data. What will happen is it will attempt to spawn the stage pod to mount the volume, and of course it will fail, because we corrupted that value. Okay.
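The MigPlan the wizard produces once direct migration is unchecked looks roughly like this — a sketch only; the plan, cluster, and namespace names are hypothetical stand-ins for the ones used on-stream:

```shell
# Sketch of a MigPlan with direct migration disabled. The two indirect
# flags force the old-style stage-pod path this exercise was written for.
oc apply -f - <<'EOF'
apiVersion: migration.openshift.io/v1alpha1
kind: MigPlan
metadata:
  name: test-bad                      # hypothetical plan name
  namespace: openshift-migration
spec:
  srcMigClusterRef:
    name: ocp3-cluster                # hypothetical source cluster
    namespace: openshift-migration
  destMigClusterRef:
    name: host                        # the control/target cluster
    namespace: openshift-migration
  migStorageRef:
    name: s3-repository               # the replication repository
    namespace: openshift-migration
  namespaces:
  - nginx-example                     # project chosen in the wizard
  indirectImageMigration: true        # unchecking "direct" in the UI
  indirectVolumeMigration: true       # maps to these two flags
EOF
```

Swapping the `srcMigClusterRef` and `destMigClusterRef` in a new plan is also how the reverse-direction migrations mentioned earlier work, since clusters only get their source/target designation at plan time.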
If you click just to the right of the bad stage pods there's a button. Yeah. Okay, and then click on Stage there. Oh, on the left. Okay. Yeah. So right now you can see it's backed up five objects; it's working on the Kubernetes objects first. Okay. I'm actually going to take the opportunity to go grab my headphones. Go ahead. Okay. So, anything else to click on down here? One of two. Well, whether this page auto-refreshes is the question. I'm assuming so. Any questions, Chris? Anybody got anything? No, no questions. Really, folks, if you have any questions, OpenShift storage related or not, feel free to ask. Or, I'm sorry, OpenShift Data Foundation, is that what it is now? Yes. Yes, ODF. There was a name change, what, two weeks ago? Last week? Changes are hard, and naming things is hard in general. All right. Oh, okay. So now it's at the cleanup stage, cleaning up all the stuff it put out there to make the migration happen. Okay, so I need to ask him when he comes back, Chris, because I thought we would break here. Right. Like, I wonder if we're going to break here. Nope, it finished. I thought it was supposed to be broken. Go to the stage restore details real quick. Stage restore details, that one. Hmm. That's interesting. But it says it finished. Maybe we found a bug, who knows. It's going to be great. Hey, can everyone hear me? Yes, we can. Oh, that's great. Okay. It looks like it might have succeeded when we expected it not to. Yeah. Here's the overview. So everything looks complete, but if you go in, you can see some parts actually didn't back up, didn't restore. Okay. Let's take a look; go up one level. Okay. So where we expected this to fail, I believe, is the stage backup. I think I know what might have happened, but we'll take a look. When we broke that stage specification, it's possible that it actually got recreated with a correct value.
I'm not sure how that might have happened, but there are some steps we can take to verify. So we said that the stage actually ran to completion. Yes. Okay. So, a really authoritative place to see the parameters that are passed to the controller is its environment variables. Those are the values it's always going to respect, so that's a good place to assert your expectations. I would expect to see a bad value having been passed to the controller; it would never have been able to spawn a stage pod if that value was actually corrupt. So that's a good place to start: confirm that the environment variables passing those values to the controller are actually what we expect them to be. We can do that by running oc get cm to see what config maps we've got here. The config maps are the resources that hold these configuration values. If I remember correctly, it's in the cluster config. So dump that migration-cluster-config object to YAML and then grep case-insensitively for stage. Okay. Okay. So that's actually correct. That's interesting. So let's do the same for the controller's deployment. Actually, let's check the pods first. Once we change that value, the controller pod needs to be restarted to pick up the new values, so it's possible it's stale. It actually looks like it did restart, because it was restarted 13 minutes ago, if you look at the age of that pod. So let's just confirm that it has the right value. Okay. We'll do the same thing: dump that pod to YAML and grep case-insensitively for stage. I think it'll show up. It may not. Okay.
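The check Eric walks through boils down to dumping the config map and grepping it. Against a live cluster that's an oc one-liner; the sketch below simulates it against a local file so the shape of the output is visible. The STAGE_IMAGE key and the image value are assumptions for illustration, not MTC's exact keys:

```shell
# On a live cluster you would run something like:
#   oc -n openshift-migration get cm migration-cluster-config -o yaml | grep -i stage
# Simulated locally with a stand-in config map:
cat > /tmp/migration-cluster-config.yaml <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: migration-cluster-config
  namespace: openshift-migration
data:
  STAGE_IMAGE: quay.io/konveyor/example-stage:latest   # hypothetical key/value
EOF
# The grep shows whether the stage image value is the one we expect.
grep -i stage /tmp/migration-cluster-config.yaml
```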
It probably doesn't, because we've annotated the pod to mount the config map values as environment variables, so they're not actually on the deployment. Just look at it. Do you want me to grab something else? We could try it again and watch the pods and see if it actually does spawn the stage pod and succeed. Like, really try it again? Corrupt it again? Well, we did corrupt it, and we can confirm that, because the config map does have that value in it. Yeah. Here. Okay. I think I missed a section of the migration, so let me think for a second. What I want to see is, I would want to verify that on the source cluster. If we were to watch the pods while this is running, we should see the stage pod spawn, and then what I would expect is for it to error out and fail to launch with an ErrImagePull due to that bad value. But it seems like what we actually saw was a pod created that then succeeded. Go ahead and ask your question. So, this script is run against the source cluster specifically? Oh, you know what, I think that's it, because I ran it on the target; that was just left over from before. I think you nailed it. Yeah. It is on the source. So hang on a second. I think we want to repair that on the 4 side and then go corrupt the 3 side. So let's run it directly. I think you can just run that restore, and what it will do is read the correct value out of the file that it persisted. Actually, let's look at it first, if you don't mind, just so people can see. Yeah, we can talk through what it does. Okay. So yeah, when we ran the initial corruption, it first grabbed the good value and wrote it to disk in that file. So here it's reading it back out of that file.
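The reason the grep on the pod spec comes up empty is the mounting style Eric mentions: the pod pulls in the config map wholesale instead of listing each variable inline. Schematically, assuming an illustrative container name:

```yaml
# Sketch: env vars sourced from the config map rather than listed inline,
# so grepping the pod or deployment YAML for "stage" finds nothing.
spec:
  containers:
    - name: mtc-controller          # illustrative container name
      envFrom:
        - configMapRef:
            name: migration-cluster-config
```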
And then it's patching that back into the config map. Okay. So let's restore. Okay. And then do you want to check? Can you delete the... give it another second. No, am I looking at the wrong one? I am. Oh, I know what it is. Can you delete that config map? Okay. What the restore actually does is patch the correct value into the operator, which is the owner of that config map, and by deleting the config map, the operator will recreate it with the correct value. Okay. That's the reconciliation loop. What was the name of this config? It's this one. Nope. Where am I? It is migration-cluster-config. Yeah. Okay. Okay, we're sure we're happy? Yeah. Okay. Okay. Yeah, that's the cool thing about operators, right? They own all this stuff. There are some amazing demos of this; with OpenShift 4, everything's operator-managed, so you can really wreak havoc on the cluster and the operators are just going to continually repair it. Yep. You can try all you want; until you delete the operator, it's going to keep trying. So it should pop up. Yep. Michelle, if you want to log into the OpenShift 3 cluster, let me know if you need anything. It's going to have credentials in the login, so you probably want to stop sharing, but I can send you the command. I think I have it. Okay. Try it. And then. Okay. And also. There it is. Okay. So, just so you know, the environment variable hasn't been fixed yet. I can see it. You know what, go ahead and send it to me anyway, just so I'm sure I'm on the right one. Good call. Thank you. So Narendra has asked what database we're migrating. It's not a database. This is a toy example with nginx, and nginx is storing its logs in a PV, so that's what's persisting. But yeah, this will work with something like a MySQL database, and I think the exercises actually call for that. Okay. We're good. Hang on a second.
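The break and restore scripts follow a simple save-then-patch pattern: stash the good value in a file before corrupting, then read it back to repair. Here's a minimal local simulation of that pattern; the file names and image strings are made up for illustration, and the real scripts patch a ConfigMap on the cluster with oc patch rather than editing local files:

```shell
# Save-then-patch sketch of what the exercise's break/restore scripts do.
cm=/tmp/cluster-config.yaml
backup=/tmp/stage-image.good

# Stand-in for the live config map (hypothetical key and image).
printf 'STAGE_IMAGE: quay.io/example/stage:latest\n' > "$cm"

# "Break": stash the good value to disk, then corrupt the live copy.
grep STAGE_IMAGE "$cm" > "$backup"
sed -i 's|stage:latest|bad:none|' "$cm"

# "Restore": read the good value back out of the stash and patch it in.
good=$(cut -d' ' -f2 "$backup")
sed -i "s|quay.io/example/bad:none|$good|" "$cm"
grep STAGE_IMAGE "$cm"
```

Deleting the config map afterwards works for the same reason it did in the demo: the operator owns it and will recreate it from the corrected source of truth.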
The sharing. Okay. Can you see everything? Yes. Okay. So now I can see the nginx example. So you can verify that's OpenShift 3. And all the rest, except for that. Okay. So that's a good example of what I was talking about earlier, a control cluster versus a remote cluster. There's no controller on the 3 cluster, so all you see here is Restic plus Velero. Okay. And then I'm going to run this here, correct? The same break-it stuff. Okay. Yes. All right. Okay. So what do you want? Do we rerun this migration plan? That's a good question. I don't know if it will. Let's try that. Let's rerun another stage. Okay. Yeah. So the difference between a stage and a migrate is that the whole point of a stage is that you can rerun it over and over again to catch up to where your data is at and get it onto the target side. You really want to get as much of it across as possible, so that your final cutover is just capturing that final delta. Can we run it? Yes. We go here, correct? Yeah. Michelle, if you want to... well, never mind. I was going to tell you to watch the pods in the namespace, but we might not even run the command fast enough to pick it up. Okay. And there we go. Beautiful. There we go. We've broken it successfully. Yes. So it will perpetually hang here. There are timeouts, but in general, this is what it looks like when your migration stalls. And like I mentioned, there is a litany of reasons why stage pods would not be able to come up. It could be node selectors unable to schedule, it could be resource quotas, or it could be a bad image pull spec like this. Also, when I was running through these exercises, it hit the API so many times that it actually got rate limited, so we may see that as well. That's another thing you may see.
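When a stage pod hangs on a bad image pull spec, the telltale symptom on the source cluster is in the pod status. A representative fragment of what oc get pod -o yaml would show; the container name and error message are illustrative:

```yaml
# Representative pod status for a stage pod stuck on a bad image reference.
status:
  phase: Pending
  containerStatuses:
    - name: stage                   # illustrative container name
      ready: false
      state:
        waiting:
          reason: ErrImagePull      # becomes ImagePullBackOff after retries
          message: 'rpc error: ... manifest unknown'   # illustrative message
```

Node-selector and quota failures surface differently: the pod stays Pending with a FailedScheduling event instead of a container waiting state.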
Docker has recently cut down its rate limits. But this one's pulling from Quay, so I think we should be all right. At this point, if we repair it, the controller will see that, update the values, and pick back up. Okay. So this is the restore, just to remind people. We're just going to set it back. Okay. And now we should just see this repair itself. Do the instructions tell you to rerun it? I actually can't remember. I could read the instructions; that always helps. Okay. We did all of this. Oh, it tells you to cancel it. So if you want to go in and cancel, you can cancel migrations. That's going to abandon it, and that's always useful because it will clean up everything. You can see it actually goes through cleanup helpers and stuff like that. So we canceled that. Now we can go ahead and try another stage. Okay. Staging started. Okay. I think it just passed the step. Yeah. Did we pass? Okay. So this is stuff that folks may... I think last time we had a number of people watching who were familiar with CAM, the early version of MTC. Well, CAM didn't have any of this progress reporting. We've added that stuff to upstream Velero, and I think they actually just recently released some of it; Pranav, an engineer on our team, added that. A lot of what you see here, we've just gotten so much better transparency into what's going on, based on these pipelines and everything. Vince Conzola is the UX person who's been helping us with this. So we've come a long way from CAM, and we're continuing to work on this. The tool does what we originally set out for it to do, which is migrate workloads from one cluster to another. We can do that faster, which is what direct migration is helping us do.
And then the last part of that is: when you're evacuating an entire cluster full of workloads, if anything is wrong, it's probably going to show up. That could be a bad storage layer; it doesn't matter. A lot of these clusters typically have something wonky about them, and that will often show up. So we're just trying to really improve this debug experience to help you get to the root problem faster. Okay. So I was just going to ask you, in this case here, we're at creating the stage backup, and this says zero of zero, so it seems to me there's nothing to back up. Correct. That's probably an artifact of the toy workload; I'd have to look at that. Oh, okay. There may not even be anything in that volume. Right. That was my thought. Yeah. Nothing to back up. So, okay. Okay. Awesome. Do you want to stage or restore? Everything looks right, and there was nothing to actually... okay, that's fine. We're just going through the motions here. I mean, just, Eric, you know, I can talk to the UX team if you want, but zero percent is not good UX in my opinion, right? Like, "nothing to move" would have been better than zero percent. Yeah. I'll make a note of that. You could say a hundred percent of nothing has been migrated: zero bytes of zero bytes, and it's good. Okay. Well, then just say it's complete, or say there's nothing there, whatever. Writing that down. Okay. And I just wanted to say, for the storage people out there, remember: if you're going to do this and you're going to need staging pods, you cannot do it on a source cluster that's already resource-limited and constrained, right? This is not an emergency migration, because you're going to have problems spinning up staging pods. This is a planned, relaxed,
I'm taking my Saturday to do some migrations kind of thing. It's not a disaster scenario, right? Yeah. This is not a DR scenario. Not a DR scenario. Controlled, designed-to-migrate, that kind of scenario. I'm just mentioning it because sometimes... Yeah. No, that's a very valid point, right? Like, oh my gosh, my cluster's failing, let's just migrate it. No. No, you can't migrate a bad cluster. I mean, you could, but you wouldn't like it. We don't have a lot of time left, but do we want to touch on anything else that's in the demos? I think that's the main piece. We could infinitely go down these breakage scenarios. We saw some today that we didn't expect, but that's typical; that's the kind of stuff that happens, and that's what the process of fixing it looks like. The third one, I'll quickly describe it, is much more in depth than the others. It's for when your GVKs, your group/version/kinds, are not compatible between your source and your target cluster. A good example of this is when an API exists on your source cluster and you're using it, but it doesn't exist on your target cluster. Maybe you've got a CRD that's v1alpha1 on your source cluster while the target cluster is using v1. And as OpenShift 4 continues to progress and follow Kubernetes, and its versions increment, APIs are getting deprecated. In fact, we're going to have to deal with one ourselves: v1beta1 CRDs are going to no longer be supported. So when you're taking objects out and putting them into your target cluster, you're going to have to address that, and this exercise walks through how to deal with it. Okay. Awesome. Eric, you have a KubeCon talk, don't you? I do. Yeah.
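The GVK mismatch Eric describes earlier in this segment is easiest to see with CRDs themselves: the same CustomResourceDefinition serialized from an older source cluster has to be rewritten to the newer group/version before the target will accept it (v1beta1 CRDs were deprecated in Kubernetes 1.16 and removed in 1.22):

```yaml
# Source cluster (older): deprecated API version, removed in Kubernetes 1.22.
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
---
# Target cluster (newer): the served version the object must be rewritten to
# (the v1 schema also requires structural schemas, so this is not always a
# pure apiVersion string swap).
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
```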
We're going to be chatting with a couple of folks from the Velero project. It's a birds-of-a-feather talk, so it's going to be kind of a relaxed Q&A session. Dylan Murray from our team, I think, is going to be sort of MCing that. So yeah, it should be fun. Yeah, Kubernetes birds-of-a-feather sessions are always interesting. Right. I've moderated a few of them before and it's always fun, especially if you know nothing about the thing; you learn very quickly, if that makes sense. And actually, I've never met the folks who are going to be on the panel, so it'll be interesting just to compare notes: this is the stuff we're seeing, this is the stuff they're seeing. Yeah. Nice. Okay. Cool. Anything else we want to talk about in the last minute? Well, we've got Konveyor up on the screen. I just dropped the link in the chat there. Yeah. I'll just mention: Konveyor is the upstream organization where we're building MTC; upstream the project is called Crane, under Konveyor. It's an organization with an umbrella of projects underneath it, and we're running meetups as part of it. So there's Crane, which handles container workloads, but you can also see Forklift, which migrates virtual machines to KubeVirt. There are other projects involved in modernization and monitoring, like performance refactoring. So if you're interested in any of this, please go subscribe to the mailing list; there are meetups, invites, and other discussions happening there. And then separately, we have a channel on the Kubernetes Slack: #konveyor.
Those are great places to get in touch with us. All of our engineers are hanging out in there, and we're helping people with their migrations there directly. So yeah, we'd love for you to subscribe and get involved. Awesome. Fantastic. Cool. Wonderful. Well, thank you for a wonderful episode, Michelle and Eric, and thank you, audience, for tuning in. Later on the channel today, we're going to be talking to Lars Herrmann and Katie (I can't say her last name, so I won't try) about Red Hat Marketplace. Red Hat Marketplace is a place where you can go and get everything you need off the shelf, with one simple bill. You could grab something from the marketplace that might not belong to Red Hat, but we've worked with the vendor so that you can plug it into your cluster and off you go. So tune in for that. We've got DevNation shows coming up, we've got "DevSecOps Is the Way," where we're going to be talking about compliance, and last but not least, today at 3 p.m. Eastern, 1900 UTC, GitOps Guide to the Galaxy will be exploring CI with Tekton. So please stay tuned to the channel today; there's a lot going on, and we look forward to seeing you all there. Take care.