So hi everyone, my name is Adnan Hodzic. Welcome, everyone. I've been with ING for more than four years, where I work as a lead site reliability engineer, currently as part of the public cloud team, previously in the same role at the Data Analytics Platform in Amsterdam. This talk covers ING's machine learning platform migration journey, or battle, to Kubernetes, which took over two years. What initially started off with a lot of resistance, resistance that lasted throughout the process, ultimately turned out to be the only right choice for us in the end.

Just to set your expectations: I won't focus too much on technical implementation details, but will instead give a high-level overview, to answer some of the questions I myself was looking for at KubeCons, and to provide some guidance if you're in the same situation at your company. And just so there is no confusion: while I'm part of the public cloud team now, this talk revolves around my time in the same role on the MLP team. Since I'll be saying MLP a lot: MLP is an acronym that stands for Machine Learning Platform. Please keep this in mind, because I'll say it many times.

Our premise takes place at ING. ING is one of the world's biggest banks, and definitely the biggest bank in the Netherlands. It has around 60,000 employees, 18,000 of them working in tech, with offices in around 40 countries, serving around 38 million customers.

Now, this is the case in many environments, but especially in banks: since we work in a highly regulated environment and are subject to rigorous policies in terms of risk, security, and change management controls, it's not as easy to deploy workloads to production. It's even more complicated if you're a data scientist who would like to do the same without any underlying SRE, deployment, or infrastructure knowledge. That's where MLP steps in: it allows data scientists to seamlessly deploy their machine learning models to production while abstracting all of this underlying complexity away from them. In the end, it serves as a model hosting platform.

The setup consisted of Python ML models being wrapped into Python containers, which, along with other services, were part of Docker Compose files. These Docker Compose files were then deployed to numerous VMs using Ansible. You could say that things "worked great", but I thought the missing link in this whole picture was that it was not running on Kubernetes. By observing how things were architected and set up to work in the existing VM setup, I immediately thought we would have a problem with scalability as MLP grew in scale; something that could, of course, be fixed by Kubernetes.

Since I did use quotes around "worked great", here's the list of the problems and annoyances we had in the existing setup. ING being a bank means it has very little appetite for risk, so besides bi-weekly patching cycles, we had a constant influx of risk work, maintenance work, and other things that we simply had to act upon, and this took a lot of our time. What made this problem even worse is that VMs were treated like pets and not like cattle. So if something went wrong with one of our VMs, or a whole environment of VMs, we couldn't just get rid of it and start over from scratch, because you don't want to get rid of your pets, right?
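To make that old setup concrete, here's a minimal sketch of what such a Compose-plus-Ansible deployment can look like. This is my own illustration, not MLP's actual code; the image name, paths, and inventory group are hypothetical:

```yaml
# docker-compose.yml -- hypothetical sketch of the VM-based model hosting setup:
# a Python ML model wrapped in a container, one of several services per VM.
services:
  model-api:
    image: registry.example.com/models/model-api:1.4.2  # assumed image name
    ports:
      - "8080:8080"
    restart: on-failure  # per-host restarts only; no cluster-wide reconciliation
---
# playbook.yml -- hypothetical Ansible play that ships the Compose file to each VM.
- hosts: mlp_vms  # assumed inventory group of model-hosting VMs
  become: true
  tasks:
    - name: Copy the Compose file to the VM
      ansible.builtin.copy:
        src: docker-compose.yml
        dest: /opt/mlp/docker-compose.yml
    - name: (Re)start the stack
      community.docker.docker_compose_v2:
        project_src: /opt/mlp
        state: present
```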
Pets joking aside, not being able to seamlessly re-provision our VMs from scratch made us extremely careful not to misconfigure any of those VMs, and this made the whole deployment and development process very lengthy and error-prone. To combat this problem, I developed a tool called chameleon, which allowed us to mimic prod's, or any other target environment's, setup as part of a local container. Now we could be more creative with our code: we could see whether certain things worked for a target environment before those changes even hit that environment. We could even adhere to infrastructure-as-code principles, although in a very limited context, but still. And now, instead of having to commit our changes, push them, create a pull request, wait for that pull request to be approved, then for the changes to be merged and for one of the pipelines to pick them up, just to see you have a syntax error half an hour or an hour later, we had an immediate feedback loop. If something was wrong with our code, we could see it immediately, and this made things drastically faster, at least in terms of development and deployment. I'll show a small sketch of the idea in a moment.

Related to what I said on one of my first slides: it was now evident to everybody that with exponential growth of MLP and of the number of models being onboarded, getting all this infrastructure provisioned and configured would take a significant amount of our time. Now it wasn't just me saying it; it was a known risk on the horizon, and it was a fact. Then we had a large number of unplanned migrations; not something we were trying to do, but again something we simply had to act upon in order to keep our workloads running in a compliant, secure manner. And again, it took a lot of our time and effort.

By this point I was treading very lightly about even making Kubernetes a proposal, but here are some of the things you might think about if you're considering doing the same and making a Kubernetes migration a possibility. Containerize everything, well, everything that you can containerize; there are certain things that maybe you shouldn't. But otherwise, just do it. Even if you're not thinking of moving to Kubernetes, using containers can make your life easier, and adopting a microservices architecture will do the same. If you do decide to move to Kubernetes, or any other container orchestrator for that matter, it's going to make your setup future-proof. And this is not the future; this has been the present for years now. Even Linux desktop apps are shipping as containerized workloads, for example snap packages, which I also used to package one of my private Python apps. Because if you think about it, what better way to handle a Python application with a bunch of its libraries than a form factor that will work regardless of where you ship or deploy it?

This should be obvious, but the more open source or CNCF software you use, the better. For example, our monitoring stack was using Prometheus, and getting it migrated from VMs to Kubernetes was a breeze. Also, with open source software, if you have a problem and you know how to fix it, you can simply create a pull request. This way you fix your problem, you contribute back to the project you're using, and you benefit the whole ecosystem. I think it's a true win-win scenario.
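Going back to chameleon for a moment, the core idea can be sketched roughly like this. This is my own minimal illustration of the concept, not the actual tool; the images, paths, and playbook names are all assumptions: run the very same Ansible playbooks you'd run against prod, but against a disposable local container, so mistakes surface in seconds instead of after a pipeline run.

```yaml
# docker-compose.yml -- hypothetical local "prod mimic": a systemd-enabled base
# image stands in for a prod VM and gets provisioned by the same Ansible code.
services:
  prod-mimic:
    image: registry.example.com/base/rhel8-systemd:latest  # assumed prod-like base image
    container_name: prod-mimic
    privileged: true  # needed to run systemd inside the container

  provisioner:
    image: registry.example.com/tools/ansible:latest  # assumed image with Ansible installed
    depends_on:
      - prod-mimic
    volumes:
      - ./ansible:/ansible:ro                      # the same roles/playbooks used for real environments
      - /var/run/docker.sock:/var/run/docker.sock  # lets Ansible's docker connection reach the mimic
    # Run the prod playbook against the mimic container instead of real VMs:
    command: ansible-playbook -i 'prod-mimic,' -c community.docker.docker /ansible/site.yml
```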
Next: spread the principles of SRE and share the knowledge among all team members. Think of this as if you are creating a soccer, or football, team; and I know I lost half of you when I said this, but please hear me out. The idea here is that not only one person can be the striker or the goalie, but that anyone on the team can step in and be a striker, or defend the goal in the absence of another team member, or even play a different position for the time being, while still adhering to their expertise; some people just have more affinity for, or experience in, kicking the ball into the goal. To draw a parallel with the MLP team: while it mostly consisted of software engineers, we worked in the same way, in that whatever was on the board could be picked up by anyone, while still adhering to our expertise, so people with more experience in a given field could lead those stories. I think this is just a nice way to work, regardless of whether you're migrating or just doing daily tasks.

And document everything. I used to be one of those "code is the documentation" people; not anymore. It proved to be very valuable to have things documented before doing them, and after the whole migration process as well. Besides spreading the knowledge, it allows you as an expert to move on and do other things: now a software engineer who has never done disaster recovery can do a disaster recovery by following our guidelines. Documenting also means creating design pages, so everybody on the team is on the same page before the big changes are even made, and before you even get to coding.

Since by this point we were doing all of these things, and many more actually, I thought: why not just make a proposal that we move to Kubernetes? And this is where the saying comes in that an idea is only 5%, and the other 95% is execution. Here's why. Things would probably be different now if you suggested to your team, "Hey, we should move to Kubernetes", for whatever reasons. But when I suggested it, pretty much nobody was on board; maybe one guy, this guy. The main reason was that things just worked in the current setup, and that we wouldn't really benefit by moving to Kubernetes. While I was definitely bummed by what happened here, to say the least, I didn't think it happened for malevolent reasons, people not liking Kubernetes or whatnot. It was a simple fact: they didn't understand Kubernetes or its architecture well enough. So instead of giving up, the idea was that I needed to bring it closer to them: through research, POCs, and demos, show why this would be a good idea for everyone to adopt.
Please note, I didn't have any dedicated "move or migrate to Kubernetes" time at this point; I was busy with many other things. Luckily for me, at ING we have something called mastery, or study, days, where every two weeks you get to spend one day working on an idea or a concept that you might be able to use in your work in the future. This is where I would, for example, do research, a POC, and a demo of MLP v2 running on Argo CD and what that would look like, or what it would look like if we used Helm charts, or Kustomize instead of Ansible, or showed that we didn't even need Ansible in the new setup. I also used every opportunity to demonstrate how existing problems could be fixed with Kubernetes. I distinctly remember how, around this time, a lot of team members, including myself, were bothered by containers crashing and then remaining in that failed state until you literally went in manually and brought them back up; to the point that we were about to start developing something that would amount to a reconciliation loop for Docker. I remember saying, "But this is already a feature in Kubernetes; what's going on here?"

Then we had a big one, at least for me, which was the attempt to go to the public cloud. Now, as you might have had a chance to hear from my ING colleagues Diana and Thijs earlier today, things are in a very desirable state in ICHP, the ING Container Hosting Platform, which we did evaluate as one of our targets. But back in 2020, some things were still rough around the edges. We didn't like the whole namespace-as-a-service concept; we thought it was very limiting to us as a platform and more tailored toward applications. Also, some of the features I demoed, like Argo CD, were not possible due to compliance issues. All of these things meant it wasn't necessarily our first-choice target, which is a shame, because it was the only target we could go to within ING at the time.

So when, out of the blue, as the lead SRE, I was given a choice one day, "Hey, if we're going to Kubernetes, where are we gonna go: ICHP or cloud?", I said cloud. And as part of this decision, within a single sprint I created a fully working POC of MLP running on GKE, Google Kubernetes Engine. Everything worked: we could call models, and everything was encrypted on the GKE side. These were stateless workloads, so no data persisted anywhere on the Google infrastructure, and even the in-transit traffic was encrypted. I documented the whole thing, so anyone in ING who wanted to recreate it for their own setup could do so.
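As an aside, this POC is also a good place to show why the "reconciliation loop for Docker" discussion made me smile: in Kubernetes, a single Deployment manifest already gives you self-healing out of the box. A minimal sketch with hypothetical names, not the actual POC manifest:

```yaml
# Hypothetical Deployment: Kubernetes continuously reconciles the actual state
# toward this declared spec, so failed containers come back without manual work.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-api
spec:
  replicas: 3  # if a pod dies, the controller recreates it to get back to 3
  selector:
    matchLabels:
      app: model-api
  template:
    metadata:
      labels:
        app: model-api
    spec:
      containers:
        - name: model-api
          image: registry.example.com/models/model-api:1.4.2  # assumed image
          ports:
            - containerPort: 8080
          livenessProbe:  # kubelet restarts the container if this check fails
            httpGet:
              path: /healthz  # hypothetical health endpoint
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 15
```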
I demoed the POC to my team, to a bunch of other teams, and to interested parties, and just when I was ramping up to make this a success story, the whole thing was forced to shut down. Why? Well, in our POC we used dummy data, while in production we would be using actual ING customer data, and even with everything being encrypted, a legal ruling called Schrems II is what stopped us in our tracks. I won't get too much into it, but the gist is that it makes handling data between EU and US companies a lot more complicated. And while some other Dutch and European banks don't have problems with Schrems II, customer data remaining in compliant, secure hands remained the number one priority for ING, which meant public cloud was a no-go. Please note this was back then; things are better now. Back then we didn't even have a public cloud team, and now I'm part of that public cloud team. Things have moved, and are moving, toward the better, especially as with each passing day more and more people realize that public cloud being part of our future is simply inevitable. But that's another topic for another discussion.

What just happened here was definitely a breaking point for me, and I was literally this close to giving up on the whole idea. But I thought that instead of giving up, the best thing I could do was keep pushing on the idea and give more reasons, because now people could see how easy it was and how fast you could deploy with Kubernetes, and the whole thing was generating a lot of traction. Here, by the way, are some of the reasons I used. Nobody on the team liked doing VM or any other maintenance or risk work, so when I demonstrated to the team how, if we went for a managed Kubernetes solution, we wouldn't need to do any of these things, this got likes from everybody on the team. It's also worth mentioning that Kubernetes comes with a lot of great features out of the box, like HPA (Horizontal Pod Autoscaler), VPA (Vertical Pod Autoscaler), cluster autoscaling, service discovery, and the reconciliation loop I was talking about before. And these are basic features; this is not advanced stuff. Which means you don't need to spend time creating your own tooling.
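To make two of those features concrete, here's a minimal sketch with hypothetical names matching the Deployment from before: an HPA that scales the model deployment on CPU usage, and a Service that gives it a stable, discoverable DNS name. Both are built in, no custom tooling:

```yaml
# Hypothetical HPA: scale model-api between 2 and 10 replicas on CPU usage.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # add pods when average CPU tops 70%
---
# Hypothetical Service: cluster DNS makes the pods reachable at a stable name
# (model-api.<namespace>.svc.cluster.local) -- service discovery out of the box.
apiVersion: v1
kind: Service
metadata:
  name: model-api
spec:
  selector:
    app: model-api  # routes to the Deployment's pods
  ports:
    - port: 80          # stable port behind a stable virtual IP / DNS name
      targetPort: 8080  # container port on the pods
```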
Take service discovery: as a growing platform, just around this time we had realized that we needed service discovery. Even creating our own was in the cards, but we ended up going for an existing solution, and it still took a lot of our time, and we had to do a lot of re-architecting to make it part of our setup. With Kubernetes, service discovery is just there, like the Service sketched a moment ago.

One of ING's global ambitions is to go green, and I would even use this as an argument. In the current VM setup we had VMs running on servers which are on all the time; in Kubernetes you can scale up and down depending on the workloads, so you consume less energy, your whole energy footprint is smaller, and even the costs are lower.

Then we as a team also had discussions about hypothetical situations, like: if we went for public cloud, how would that work? Just for us to realize that we were not very mobile in the existing VM setup, whereas with Kubernetes, once all the deployments and services are part of YAML files or Helm charts, in that form factor we could move anywhere: on-prem to on-prem, or cloud to cloud. This was a big positive realization for the whole team.

Excuse me. So, parallel to all of this, here's what was happening in my private life: auto-cpufreq, one of my open source tools, was trending at number one on Hacker News for pretty much the whole day. Happy days? Well, sure, but under the load my private blog went down, because people were coming to see what the project was about. And why am I even talking about this? Well, because, eerily enough, my private setup was very similar to what we had in MLP: I had containerized workloads with a bunch of services as part of Docker Compose files, which were then deployed to VMs using Ansible. And although I was using AWS, which is like a synonym for scalability, since I was using AWS Lightsail I didn't have this feature even in the cloud; I couldn't scale.

At the time I was working on the wp-k8s project, an open source project that offers you prod-ready, fully scalable, and highly available WordPress for Kubernetes; because, fun fact, WordPress by default is a stateful workload and as such is not going to scale on Kubernetes at all. This was something I was working on in case I ever wanted to move to Kubernetes. So since I had a project that was almost finished, and a cluster that was shut down, I thought: what better time to prove that the proof is in the pudding, to myself, to my team, to everybody. I simply started the cluster back up, made a simple DNS change, and since the wp-k8s project had horizontal pod autoscaling and cluster autoscaling, it automatically scaled the cluster and the pods to accommodate all the incoming traffic. The scaling issue was fixed within minutes. The result of that was the wp-k8s project, which was targeting GKE at the time; just something to help people get started with WordPress on Kubernetes while being fully scalable and highly available. If you're interested, take a look.

Of course, when I realized how much it was going to cost me to run a private blog on GKE, and given that I had two Synology NASes with around 48 terabytes of data, I thought: why don't I just get a couple of Raspberry Pis and build my own little private cloud at home? And that's what I ended up doing.
I extended the WP case project to also Private Kubernetes offering I extensively document at everything from which UPS I bought which Raspberry Pi's about which SD cards Why I chose Ubuntu and microcades because idea was here that I don't want to maintain this thing that it runs as automatically as it can and Yeah, and if you go to that blog post right now It's being served for my home and depending on how many you go there. It will automatically scale and Since you're in Amsterdam and you might enjoy a beer to highly recommended by the way another workload another project that's hosted on the same cluster is Answered and toilet urinal Finder available on ATUF app just something to keep our city clean. All right So back to reality. I went to the team. I shared the news I think it does even happen over the weekend and they were impressed with what ease I fix the scalability issue But all of it still remain to be one of those like cool story, bro And that was it the decision in this time and point in time is that we still remain running on in current setup Meanwhile the platform is growing more and more time is needed to keep the lights on in the old setup to the point that The tenant features are not getting developed How much time it took cut to maintain the old setup? And then we had an influx of a lot of a lot and a lot of big models They requested a lot of our time in terms of hardware resources and just engineering effort And that's when we hit our old nemesis, which is an ability to grow in current setup To keep growing as a platform and to keep scaling Options were simple one We hire more people just so we can scale or number two We hire more we move to Kubernetes And then we don't have to do half of the things that we're doing now and this is when it was Like this is when it was clear crystal clear for everyone on the team Migration to Kubernetes is just inevitable and that's how our decision was made. 
So let that sink in. Once the decision was made, it was as simple as creating a design page with acceptance criteria and the scope of work: what we want from this migration, what we don't want from this migration, what can be done now, and what can wait for later. A big part of it was concluding the research on where we wanted to migrate to. Public cloud was still not an option, so this consisted of creating a big research document, a Confluence page, which extensively compared both options and even had a pros and cons list between them. The options were ICHP, which we had evaluated in the past, and a new kid on the block: the DAP Kubernetes cluster, which is now the biggest Kubernetes cluster in ING.

Option one was DAP, which was vanilla Kubernetes, so it meant we could have any feature we wanted. It had dedicated Nvidia GPU cores, so if a model was onboarded that required custom GPU support, we could provide it. It was part of our area, so we could develop it, or heavily influence the direction of its development. A match made in heaven? Well, not really, because us taking part in its development meant we would have to continue doing maintenance and development work on it, which is exactly what we were trying to avoid. Also, this cluster, the biggest in ING, was only targeting dev workloads, and for us to keep our integrity rating as a platform we would need to be running in two zones, so that if something went wrong we could have an active failover.

Option two was ICHP, which we had thought was not a good option for us in the past because it didn't have some of the features we wanted; now that we had acceptance criteria, we figured we didn't even need those features for now. It was stable, it was running in multiple zones, and it handled a huge portion of our risk and maintenance work. It simply came down to ticking more of the checkboxes on our acceptance criteria, and it was the option we went for in the end. At that point it was just "do it", because everything had already been demoed or researched as part of the mastery days; we knew exactly what we wanted, and now we just put it all together.

So, some takeaways from the whole journey. As with every migration, the best approach is to keep running the two setups in parallel. Don't try to extend your existing environment; the new one is going to be a completely different landscape, so just start from scratch. Once you verify that everything is working in the new setup, simply switch the traffic over and shut the lights off in the old setup. And monitoring: good monitoring and alerting will go a long way. I'm saying this because it's easily put aside. Besides being alerted if something goes wrong in your old environment, you also want to know if something went wrong in your new environment, especially in your new environment, so you don't just keep building on top of something that's already broken at its core.
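To make the monitoring point concrete, here's a minimal sketch of the kind of alerting I mean, as a hypothetical Prometheus rule file; the job and env labels and thresholds are assumptions, not our actual rules:

```yaml
# Hypothetical Prometheus alerting rules: watch both landscapes during the
# parallel-run phase, with an explicit rule for the new environment.
groups:
  - name: mlp-migration
    rules:
      - alert: ModelApiDownNewEnv
        expr: up{job="model-api", env="kubernetes"} == 0
        for: 5m
        labels:
          severity: critical  # broken at the core of the new setup, fix first
        annotations:
          summary: "model-api target down in the NEW (Kubernetes) environment"
      - alert: ModelApiDownOldEnv
        expr: up{job="model-api", env="vm"} == 0
        for: 5m
        labels:
          severity: warning  # old environment still serves traffic until switchover
        annotations:
          summary: "model-api target down in the old (VM) environment"
```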
And be brutally pragmatic: stick to the vision and to what was decided. Once the migration started, we had all kinds of crazy ideas; for example, we also wanted to rebase all of our container base images onto a different distribution. Don't do this, because when things fail, and they will fail, you want the blast radius to be as small as possible so you can debug more easily. I'm also saying this because, trust me, there were a lot of options we could have gone for, and if we had, I wouldn't be here giving this talk; we would still be busy doing the migration.

Journeys like this take time: over two years is what it took for us, and I don't think our setup is that complicated, so depending on the complexity of your setup and the size of your environments, it might take even longer. Even the "just do it" step from the previous slide took about three months. So please note: good things take time.

And perseverance, not giving up: there are so many ways to just give up during the whole process, and it was very tempting at times. But while adhering to this "not giving up" and persevering, you also need to keep re-evaluating the landscape with an open mind, so you don't miss certain opportunities because your view was too narrowly scoped. A perfect example is ICHP, which we once thought was not a good option for us, yet at this point in time it was the best option. The same goes for the private cluster: I had tried doing the same thing with Raspberry Pis before, and it was just not a good experience; the hardware was not up to the job, and now it was. Maybe that's one of my biggest takeaways from all of this: keep in mind that everything changes, and certain ideas, along with certain circumstances, will match the changes in the landscape and may lead to a perfect storm, as it did in MLP's case.

So that was our journey, and that is why, for all of you still running on VMs, I think that with Kubernetes, resistance is futile. Thank you.