Hi, hi. How are you all doing? Okay, you all look energetic, so let's start. As it says on the board, we are going to talk about the challenges and the best practices that we can share with you from our experience of upgrading Istio in a production environment.

All right. As mentioned earlier by Mitch, I am Nirupama. I work as a software engineer at a small-scale startup in India called re-skill, and that's my Twitter handle, where you can find me.

Am I audible? Yeah. Hi, I am a Kanj Gupta and I am a software engineer at a startup called Zeta. We are a fintech startup and we do PaaS and SaaS for banking services, so service meshes are very important for us. You can find my Twitter handle there, and let's hand it over to Nirupama.

All right. So what are we going to cover today? The strategies we can take for upgrading Istio, the pros and cons of each strategy, and the common challenges one can face. We are going to go in a pattern: things to do before upgrading, then upgrading, and then things to take care of after upgrading.

First of all, this was the most obvious slide I had to add. So what's Istio? It's big. Everyone here knows it: it's an open-source service mesh that helps organizations run distributed, microservices-based apps anywhere.

Then there's the architecture of Istio. There are two planes to it: the control plane and the data plane. The control plane contains all of the management and every configuration, custom or pre-installed, that exists in our Istio. Then there's the data plane, which controls all the traffic: where the traffic goes, and how much of it goes to the different services. Over the years we have seen that the structure of Istio has changed entirely.
It looks more self-contained now, but in function it is exactly the same as it was before. We had a few things like Pilot, Citadel, and Galley in our control plane before. Now it's all combined into one component, and it's called istiod. And then there's the data plane as it is, with the different services and the proxies that manage the traffic towards them. This is just a small view of the proxy status and the different things contained in the system.

So why do we need to upgrade our Istio? The most basic reason is that you want to stay up to date. You want to keep picking up the latest features that Istio provides, and it's for security and for better performance. Next slide.

This is a simple command you see here, istioctl manifest generate, which generates the manifest to install Istio for you. When you do this, a whole list of things gets installed together with this single command. So this is the big list. It contains 15 custom resource definitions and two deployments. These two deployments are istiod and the Istio gateways. Then, as you can see in the list, there are mutating webhook configurations: whenever a new pod is created, they attach a sidecar to it automatically. So there's one of those. Then there's the horizontal pod autoscaler. We'll talk about north-south and east-west traffic later in the presentation. So this is just the basic set of configurations that get installed.
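The resource list on the slide can be cross-checked by tallying the top-level kinds in the generated manifest. The following is a minimal sketch in Python, not something shown in the talk: count_kinds and SAMPLE are hypothetical names, and SAMPLE is a tiny hand-written stand-in for real output, assuming the manifest has been saved to a file first (for example by redirecting the output of istioctl manifest generate).

```python
import re
from collections import Counter


def count_kinds(manifest_text: str) -> Counter:
    """Tally top-level `kind:` values in a multi-document YAML manifest.

    Only unindented `kind:` lines are counted, so nested fields such as
    a CRD's `spec.names.kind` are ignored.
    """
    kinds = re.findall(r"^kind:\s*(\S+)\s*$", manifest_text, flags=re.MULTILINE)
    return Counter(kinds)


# Hand-written stand-in for a generated manifest (an assumption, not real output).
SAMPLE = """\
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: virtualservices.networking.istio.io
spec:
  names:
    kind: VirtualService
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: istiod
---
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: istio-sidecar-injector
"""

if __name__ == "__main__":
    for kind, count in sorted(count_kinds(SAMPLE).items()):
        print(f"{kind}: {count}")
```

Run against a real saved manifest, the same function would show how many CRDs, deployments, and webhook configurations a given Istio version is about to install, which is handy to diff between two versions before an upgrade.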
This is part of the screenshots, if you want to look at them. Otherwise we can move on, but I'll let it stay on the screen.

Yeah, so whenever we install Istio, we have 15 CRDs. I just generated the YAML for those 15 CRDs, and you can see them on the screen right now. This is probably the most basic installation of Istio, and we will go into the upgrades later.

Okay, so this slide asks: why should we decouple Istio components? It is because we have so many services and different components working inside Istio, and we want to be able to work on them individually. That way you do not have to go through the whole file, make your change, and then rerun the whole deployment each time you make even a small update. That is where these smaller components come in: only the one deployment you changed needs rerunning, it takes up less of your time, and the rest of Istio is left as it is.

So on the left-hand side you can see the Istio operator, in the middle we have the north-south traffic, and on the right-hand side we can see the east-west traffic. North-south traffic is the traffic coming into the cluster and going out of the cluster; east-west traffic is from service A to service B inside the cluster. So say we want to change a port of a service, from 15443 to something like 8080.
It's TLS, but if we want to, we can. Then we just have to change the decoupled east-west traffic operator, and that's it. Apart from that, it is possible that the istiod operator, the Istio control plane, is managed by one team while other things like traffic management or the gateways are managed by another team. So it's good to decouple Istio components: when we make changes or deployments afterwards, we don't have to make one huge change to Istio at one point in time. We can make one specific change and deploy just that.

Okay, so before upgrading, what should we do? We should back up all the configurations, obviously. Let's say our Istio is going from 1.12 to 1.13 and we have not yet tested what it is going to look like, whether it is going to be compatible or not. It should be, because when we do a rolling upgrade, Istio version n should be compatible with version n+1. But even so, we should always take a snapshot of our basic configurations whenever we can. So before upgrading, just dump the output in YAML form and keep it aside. If the upgrade is broken, we can roll back in one single command.

Then we have the istioctl precheck command. I don't know if you are familiar with upgrades, but with Kubernetes we have kubectl, and with Istio we have istioctl. For validating prechecks, whether the deployment is going to go through or not, or whether we have to monitor anything, we can just run istioctl x precheck. It will tell us if the whole cluster is compatible with Istio. Along with that we can run istioctl bug-report, which gives us all the logs and everything in one single place, and we can compare the logs afterwards. The third point would be to address and analyze any issues.
So just analyze all namespaces to see if they are compatible with Istio, and run istioctl proxy-status to see how many services are registered to the service mesh. Pretty straightforward.

Next, I would say Istio is one of those services which exposes a lot and a lot of metrics, so we should always have a Grafana dashboard ready for Istio: what version all the namespaces and all the deployments were running before the upgrade and after the upgrade, what the resource quotas are, how much the peak load is. We can check the Grafana dashboard before the Istio upgrade and after it, and compare to see whether the upgrade was successful.

The fifth point is: go and look at the Istio upgrade notes. They are different from the release notes, but the upgrade notes are very important, and they will tell you if there is any breaking change when you go from some version n to n+1 or n+2.

Coming back, this is a sample Grafana dashboard. Here we see that all my pods are running on 1.18.2 of Istio. I couldn't get a screenshot where there were two separate versions of Istio running, but if you have a Grafana dashboard built like this, you can see it. Until the whole deployment is restarted, the sidecars are not upgraded to the control plane version, so at a single point in time there might be a 1.16 or a 1.17 running along with the 1.18 version. Apart from that we can see the memory usage, the CPU usage, the disk, and everything. So this is a basic Grafana dashboard, and I would highly recommend that whenever we work with Istio, we have a Grafana dashboard ready beforehand.

Okay, so what should we
pay the most attention to? The most attention should go to any custom configurations. Let's say we have custom Wasm filters in 1.11 or 1.12, custom configurations that do something Istio does not provide directly. We should look after those, because as soon as we upgrade, say from 1.11 to 1.12, or to 1.13 if it's a canary upgrade, these custom configurations can break. So we need to look after those custom configurations, and always the Envoy filters. They can be a pain.

So, types of upgrade. There is one called canary, which is the most recommended one. The second is rolling, or in-place, upgrades. And we have a third upgrade mechanism which, in our terms, is called lazy upgrades; I will talk about lazy upgrades in the coming slides. We have to consider factors like service availability, rollback options, and complexity, and we should always choose the strategy that aligns with our organization's priorities. I am in a banking company; we manage a lot of transactions per day, and if I do an in-place upgrade and it breaks, there would be a definite breach of SLAs. That would cost us upwards of a couple of hundred thousand dollars. So in Indian rupees that's like a lakh to a crore rupees, and that's a lot.

Apart from that, yeah, we will talk about in-place deployments, canary, and everything. So, if we are doing an in-place deployment, what should we do?
In-place deployments are very easy, direct upgrades. If you are on version n and you have to go to version n+5, you have to go to n+1, then n+2, then n+3, and then n+4 and n+5. So if you are going from 1.9 to 1.18, like I showed in the Grafana dashboard, you will have to start with 1.9 going to 1.10, then 1.11, 1.12, and so on, taking care at each step that all the sidecars are working perfectly and nothing is breaking, especially the custom configurations we mentioned earlier. If we are doing canary deployments, the only difference is that Istio says: okay, if you are on version n, you can go to n+2 at most. So if you are on 1.9, you can go to 1.11 without any breaking changes.

So, the in-place or rolling upgrade. In an in-place upgrade, we start by directly upgrading the Istio operator or the istiod control plane. Then we upgrade the ingress gateways and the egress gateways. We restart the deployments, because if we do not restart them, the sidecars will not move to the same version of Istio that is on the control plane. And then we run istioctl proxy-status, and we will find out if we have any problems with any of the services or the deployments.

Now coming to our lazy upgrade option. A lazy upgrade is a very funny thing, because you have to take care of a lot of things, but it is a very fast upgrade path. On a lower environment we do something like a rolling upgrade: if we are going from 1.11 to 1.19, we check on 1.12 that everything is working fine in the cluster, then on 1.13, 1.14, and so on till 1.19. Then in prod you can go directly from 1.11 to 1.19, with the control plane going through 1.12, 1.13, 1.14, and so on without restarting any of the deployments, because you have already checked on the lower environment that everything will work perfectly fine. We can schedule something like an EKS upgrade, or an AKS
upgrade or GKE upgrade, or whatever, and it will restart all the namespaces in one go. So that is something called a lazy upgrade, and that is the path we chose in our recent upgrade timeline.

Now coming to the canary upgrade, the most recommended one. What do you do? We have two ways to do canary upgrades. In the first, we have a control plane with 1.12, and we put in a control plane with 1.13. We start to redirect traffic to 1.13. Once we are at 100% load, with the dashboards ready, and we do not see any spike and everything is working correctly, we just delete the earlier 1.12 version. In the second, we have 1.12 and we do a blue-green type deployment: we put in 1.13, deploy a copy of the same service that was being served on 1.12, redirect to the copy on 1.13, and then delete the old service and the old Istio control plane as well.

Okay, after upgrading, what do we do? We analyze all namespaces to see if they are working fine. We go and check the Grafana dashboard for any spikes. Then we do nothing but a bug report: the bug report will have logs, so we can compare them to the logs that were generated before the upgrade and see if anything is breaking. If something is breaking, you can just roll back, because we took a snapshot earlier in the day.

Okay, so how much time do we have? 10 minutes. So let's try to do a demo. I am working with minikube, so I don't know if it will work perfectly, but let's do a demo. Just give me a minute, I need to change my laptop. So do we have any questions till now? Yes, please. Just one question, because I am going to start with the demo.

Yeah, you mentioned... how do you pace out the normal service deployments along with that? Do you tell developers they are not supposed to deploy their services while the Istio upgrade is ongoing, or is it kind of transparent to the developers?

Sorry?
So I am saying, do you kind of pace out the Istio deployment against the service deployments, like the application developers deploying at the same time?

Yeah, so the thing is, when we are in a company which deals with a lot of transactions, as I said earlier, we have scheduled windows for all our deployments, right? If there is an infra-related deployment, it is scheduled at a time which will not affect any service deployment. When we are doing a service deployment, I will not prioritize any other type of deployment. So coming back to your question: whenever I am doing an Istio upgrade, an EKS upgrade, or some other type of upgrade, I will not put a service deployment on top of it. Although we can: if the Istio upgrade is a canary upgrade and I am doing a service deployment, we can just point it at the old control plane and then start moving traffic to the new control plane. So that would not be a problem.

Okay, so what do we have? I am using something known as k9s. How many of you have heard about k9s? Oh, a lot of you. I have seen people work with Lens, but I prefer k9s over Lens because I can work in the terminal. So what do we have?
So I have started a minikube instance, and we have the default namespaces, all the namespaces which are created when a new cluster is started. Okay, so we are in Istio demo. Yeah. So let's do an istioctl version check. Globally, my system is running 1.19.3 right now. Let's just say I have downloaded 1.12.5. If I go there, I have installed something known as direnv, and I would like to give a shout-out to Tetrate Labs, because they showed it to me. Thanks for direnv. It will use the bin path of 1.12.5 as the istioctl version. So let's check istioctl version. Exactly.

So let's do a kubectx and check what my context is right now. Okay, so I have minikube running right now. Clear. Yes. Now I will use the precheck command to check whether everything will work fine on the minikube side: istioctl x precheck. And we can see that Istio is safe to install or upgrade. So we have the 1.12.5 Istio version which we are going to install, and according to the precheck command it's safe to install. Thank God. Let's install Istio.

I have used Sublime Text to copy all my commands here so that I don't forget. So the easiest way to install Istio is this: an istioctl install with revision 1-12-5. If I change the revision it might conflict, but I don't want to check. So let's do this. It will ask if I should install all the core components of Istio. I will just enter y and go ahead with the installation. Just a couple of minutes. Some warnings. There is a problem, but that is due to minikube, I think. Apart from that, Istio 1.12.5 will be installed.

Let's go to k9s. We have an istio-system namespace coming up, and we have an ingress gateway and an istiod. If we go to deployments, we have istiod 1.12.5. Let's see how much time it takes to come up. The container is still creating, and it will continue to do so for another two minutes. Do
we have any questions in the meanwhile? Questions? Okay, let's try to do something else. Let's try to see all the pods. I can already see the pods in k9s, but let's do a kubectl check. So the ingress gateway pod is being created, and the Istio control plane is being created. It's running.

Next, let's check the mutating webhooks. As mentioned by Nirupama, whenever a new pod of any service is created, the mutating webhooks will check it and inject the sidecar into it. So: kubectl get mutatingwebhookconfigurations. Yes, so we have a sidecar injector for 1.12.5.

Let's do one thing. I know we might not be able to get this done. Let's see. Deployments. Okay, it's running. Hang on. Let's deploy something such as Bookinfo. Let's skip that. Let's start with installing 1.13.2, or whatever it is. So let's go to this: cd .., cd. As soon as I go into the directory for Istio 1.13.2, I get the 1.13.2 istioctl version. So, istioctl version. It says that my client version is 1.13.2, but my control plane version is still 1.12.5, and the data plane version, the sidecar proxies, are on 1.12.5.

So what I will do first is istioctl analyze. Let's do this. Yeah, so after analyzing, no problems are found, so we can go ahead with installing 1.13.2. We will use this command. It will just say, yes, Istio is being upgraded from 1.12.5 to 1.13.2, do you want to do that? And I certainly want to do that, so yes. Some problems with minikube, but everything else should show up now. istio-system. So istiod 1.13.2 is coming up. The Istio gateway, the ingress gateway for 1.13.2, is also coming up. And my 1.12.5 version and the Istio gateway for 1.12.5 are also working. So, do we have time?
No? Okay. So let me just explain what we were going to do next; I will stop here. Although I did not deploy any service here, if I had deployed one, it would be served on our 1.12.5 version. I could have done two things. I could have just deleted 1.12.5, and the traffic would have rerouted to 1.13.2. Or I could have deployed the same version of the service as a blue-green deployment and set some weights, like 90% of the traffic going to 1.12.5 and 10% going to 1.13.2. As soon as everything is working fine, we can change the weights from 90/10 to 50/50, and then the reverse, 100 to 0, with everything on 1.13.2. And then we can delete the whole old thing. Another option in the blue-green deployment is to put in a whole new service served with 1.13.2 and remove the old service on 1.12.5.

So that was the presentation on Istio upgrades. If you have any questions, please let us know.

Yeah. Yes.

Yeah, a rolling upgrade would be the best way to do it. So I would go from 1.12 to 1.13, then 1.14, then 1.15, then 1.16, then 1.17, like that.

Yeah, so Istio says that only n+1, like 1.13, will have no breaking changes with 1.12, but it gives you no guarantee for 1.11. That is why I would just do a rolling upgrade for that, or a lazy upgrade if I have checked all my upgrades in a pre-prod kind of environment.
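The one-minor-version-at-a-time path from that last answer can be written down as a small planner. This is a minimal sketch in Python under stated assumptions, not tooling from the talk: upgrade_path and its max_hop parameter are hypothetical names, max_hop=1 models the in-place rule described by the speaker, and max_hop=2 models the speaker's rule of thumb for canary upgrades.

```python
from typing import List


def upgrade_path(current: str, target: str, max_hop: int = 1) -> List[str]:
    """List the minor versions to step through, excluding `current`.

    `max_hop` is the largest minor-version jump allowed per step
    (an assumption taken from the talk: 1 for in-place, 2 for canary).
    Patch versions are ignored; only "major.minor" is planned.
    """
    if max_hop < 1:
        raise ValueError("max_hop must be at least 1")
    cur_major, cur_minor = (int(part) for part in current.split(".")[:2])
    tgt_major, tgt_minor = (int(part) for part in target.split(".")[:2])
    if cur_major != tgt_major:
        raise ValueError("this sketch only plans within one major version")
    if tgt_minor < cur_minor:
        raise ValueError("this sketch does not plan downgrades")

    path = []
    minor = cur_minor
    while minor < tgt_minor:
        # Jump as far as allowed, but never past the target.
        minor = min(minor + max_hop, tgt_minor)
        path.append(f"{cur_major}.{minor}")
    return path


if __name__ == "__main__":
    print(upgrade_path("1.12", "1.17"))
    print(upgrade_path("1.9", "1.18", max_hop=2))
```

Each entry in the returned list is one upgrade step to rehearse in pre-prod, with the precheck, analyze, and proxy-status checks repeated before moving to the next one.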