Hi, and welcome to this presentation called Building an Agile Platform in a Highly Regulated Industry. This is a joint project that Jonas and I worked on for almost a year at If, the insurance company. My name is Fredrik Klingberg. As I mentioned, I have this username from my time at uni, and you can find me on Twitter. Obviously I'm on LinkedIn and Slack as well, and I try to blog a little bit, which is out there. I'm also a Traefik and a Linkerd ambassador, and I try to help out in those communities whenever I see people having somewhat similar problems to those we have already been faced with. Yeah, and my name is Jonas Samuelsen. I'm a platform lead at If Insurance and the main driver behind the platform we're going to present today. I love tech, teamwork and centralized agile platforms that provide value for dev teams, and I'm an avid craft beer enthusiast; maybe you can tell by looking at me. So in today's talk we'll try to walk you through some of the considerations we had to make, and hopefully a bit on how we solved them, when creating this new agile platform service for dev teams in our company. We have an agenda as well. I'll kick things off by looking at If Insurance, a very brief introduction, to set the context a little bit. There are constraints in an insurance company, and at If of course, and we'll look at those. Jonas will lay out the starting point: there were some clusters already in place at If, and we'll look at what those starting points were. Then we introduce some principles that helped to guide us when building the platform. Obviously we look into the actual building: what tooling did we choose, what kind of constraints were we under, and what considerations did we make. Jonas will touch upon onboarding of the teams, and what some of the challenges were moving to a more cloud-native environment,
especially with the GitOps way of doing things. And we'll have a Q&A at the end. Yeah, but first off a bit about If, the insurance company. We're mainly operating in the Nordic market. We have a strong market position in the Nordics, with over 3.9 million customers, and a solid market and financial position today. We're a wholly owned subsidiary of the Sampo Group in Finland since 2004, and they're listed on Nasdaq in Helsinki. We have a solid base in motor and property insurance, with sort of traditional lines of insurance products. And we give people confidence today to shape their tomorrow; that's our purpose. But enough of the boring figures and key stats, let's move over to the more interesting story of shaping If's tomorrow. We face an ever-evolving market and technology landscape, with new competition coming all the time from pure tech companies with little or no legacy. Customer expectations are changing, and the new young generations are expecting to interact with us on smartphones and tablets, like with any other tech company out there. Our partners are expecting more digital integrations and information sharing, and at the same time we face the normal competition on the traditional insurance market. So this kind of sets the scene for why we're here today, and the start of the journey for our company to become a more tech-driven company. Let's look at some of the constraints. Insurance obviously is built on trust: if the customers can't trust the insurance company in regular hours, how can they trust it when there's a flood or natural disaster? There are regulatory requirements that If obviously takes very seriously, and those set constraints on the company and the tech and how you go about doing things. And then there are the newcomers, who maybe don't have the legacy applications and networking requirements that If has. How do you move fast to compete in the marketplace against those
without losing the trust that you have built up over several years with your customers? That's right, Fredrik. On the one hand we have risk management, compliance, service delivery and the CISO at If driving implementation of regulations and internal compliance policies stemming from GDPR and Solvency II, which we of course have to comply with when operating on the European insurance market. On the other hand we have the developers and teams wanting to adopt new tools and technologies like any other technology-driven company. So some of the considerations we had to make are a bit of a balancing act between having control and being able to move fast, and the question is: can we find ways to allow both to coexist? In my opinion it should be built into any platform or offering that we have as a central offering in the company, so it becomes easier to do the right things, with the guardrails that Fredrik will present a bit later on. So the starting point was to question many of the so-called truths we have in the company today, stemming from the legacy, and to try to find new solutions to some of the old issues. Just like the balancing act shown on the previous slide, we have a bit of a balancing act in implementing the right level of security and finding common ground. The most secure solution is of course the one that is not running at all, but that kind of makes it hard to run the business. Many of the traditional concerns still apply and are valid, like networking and web application firewalls, but some of the modern capabilities give room for solving issues and concerns in new ways, and they can be addressed with modern security approaches and layering. We pretty quickly got the proposal to separate our clusters into a DMZ cluster setup, because that's how we do secure on-prem cluster platforms. But we questioned this truth, due to the fact that it would be very costly and maintenance-heavy to do it that way. So we suggested that we connect the platform clusters to the new central components that the networking team was setting up at the same time, utilize things like network policies from Calico inside the clusters, and use the central capabilities being created at If, like web application firewalls and API management and things like that. We also integrated the clusters with our RBAC roles and identity. So what I'm trying to say, in a sense, is that building a new platform on new technology, with limited experience around this in the company, is difficult and takes a lot of discussions and anchoring to find the new ways forward. But just because it's difficult doesn't mean you shouldn't try to do it. It's a bit more work, and you have to clear out the question marks, but it's possible. At If a lot of teams, many of the teams, were already using Kubernetes. So what was the starting point when we started? How were those clusters configured?
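As a minimal sketch of the in-cluster layering described above: standard Kubernetes NetworkPolicy resources (which Calico enforces) can replace a DMZ sub-cluster by denying all ingress in a team namespace and then explicitly allowing only the traffic you want. The namespace names and labels here are hypothetical, not If's actual configuration.

```yaml
# Deny all ingress to every pod in a (hypothetical) team namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: team-abc
spec:
  podSelector: {}        # empty selector matches all pods in the namespace
  policyTypes:
    - Ingress
---
# Then explicitly allow traffic from the ingress controller's namespace only.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-ingress-controller
  namespace: team-abc
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx   # hypothetical namespace
```

With a default-deny baseline in place, every allowed flow is an explicit, reviewable piece of YAML in Git, which is one way to answer the traditional segmentation concern with cloud-native tooling.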
We started interviewing the teams with a predefined template, systematically gathering data and seeing where we were at. One of the things that came out of those questionnaires was that managing Kubernetes takes a lot of work, obviously. Just vanilla Kubernetes doesn't cut it, at least not for us. Although we use the managed Kubernetes offering from Microsoft called AKS, it takes a lot of work moving from AKS to a platform, and that's the whole point of Kubernetes as well: you want to tailor it to your needs and build upon it to create something that is truly yours. These are some of the questions that we asked the different teams. It's not an exhaustive list, but maybe some of the more important ones. On-prem connectivity: do you need to communicate with those legacy applications? At If there are a lot of, not mainframes maybe, but large applications on-prem that are very, very secure. How many environments do you have? We'll come back to that one on the next slide; the number of environments varied greatly. Authentication and authorization: how do you go about doing that? Monitoring, a big topic, and what we saw quite early on was a heavy focus on logs, which is good, but we tried to push them gently towards thinking more about metrics and scaling on those, and going from events to alarms, incident handling and escalation policies. A lot of these concerns were maybe not that present in the platforms already at If. Scaling: if there's a natural disaster and the calls come flooding in, no pun intended, how do you go about making sure that your application scales, instead of going in and clicking and scaling up manually? And again security, paramount for If; it goes through all of the layers. If the customers can't trust If in regular hours, you're out of luck.
I mentioned the number of environments. There were a lot of them, very many environments: you have staging, and you have this batch way of thinking where you have QAs, and then QAs of the QAs, before going out to production. We posed a friendly challenge. We were not trying to put down the teams; we were trying to help them to create a platform that they can use. So right off the bat we said: we'll give you a dev, a test and a production environment, and obviously you get your own namespace in those, and try to think more in terms of trunk-based deployment. Get things out quickly. They are small changes, and if things break you can roll back quickly, especially with the GitOps flow that we'll present shortly. We emphasized that we were not trying to put down the teams; we were trying to see what the pain points were. We got a lot of data. It's a balancing act between creating something super custom for each individual team and having them meet us halfway. But we introduced some principles to help us build the platform as well. You go to the CNCF homepage and you're presented with this low-resolution image.
I'm sure you've seen it before, and you've heard all the jokes. It's really difficult to choose the correct subset of the CNCF tools. Obviously we could have tried to hit all of those technologies, a cool challenge, but maybe not that useful for If. This is not an exhaustive list of the principles, but they ran through the whole build like a red thread. We wanted to have that guardrails model where, even if a team makes a mistake, they will gently be nudged in the right direction. And the biggest cost in an application's or system's lifetime is the maintenance part. Even if one technology would be easy to get going with, if that means horrible day-two operations it's not worth it, because that's where the main part of the time is spent. We'll see how we used that when choosing the technology. I think everybody wants to reduce complexity, but it's easy to forget when you see the cool technology that you're presented with in the CNCF landscape. Do one thing, and do it well; we'll see how we went about choosing a service mesh based on that one. Ubiquitous language is a big one: talking to the teams, they could see things from different viewpoints. What do you actually mean when you say application? What do you mean by system? We tried to be very specific about what our viewpoints were, without making other people feel bad about theirs, but we made clear that this is how we talk about these terms on the platform. Everything as code kind of goes without saying. We worked with disaster recovery in mind from the get-go, and we should be able to get up and running with everything really fast. And we needed to minimize developer friction. We are creating a platform that hopefully the developers want to use, not one they are forced to use: have good documentation, listen to what the requirements are, and it should be a joyful experience.
Just get the application up and running and you're good to go. Okay, on to the building of the platform. Obviously I can't go into every detail of building a platform; we don't have time for that. But one of the things we started out with from the get-go was to try to do GitOps everything. On the previous project that I worked on, I had used GitOps for the platform, but not for the teams. We thought it would be a good idea to see: could we do GitOps everything, have all of the state in Git, and be able to rebuild the clusters? We'll go into the details of orchestrating that. There are other talks that do the GitOps topic more justice than I can, but suffice to say it's an operating model that we chose for the platform, but also for the teams. That meant that the teams needed to learn a new way of doing the CD part of delivering software. There's a sharp separation between continuous integration and continuous deployment. This was good for the teams: we could say, do CI exactly like you do it now, just push an image, and we'll make some small changes to the continuous deployment part, working with them. And then there's the question of where does infrastructure as code start and end, and where does the configuration with GitOps start and end. We chose a version where the infrastructure as code part was very small, the smallest possible, and instead moved everything into Git and used GitOps for the rest. We chose Flux version 2. There are other GitOps tools that are probably awesome, but we're happy with Flux version 2; it works, we're good. We used Bitnami Sealed Secrets to seal the secrets, so that we could keep all of the state inside the Git repos. And during the lifetime of the project we were turning into big Kustomize fans.
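The Sealed Secrets flow just mentioned can be sketched roughly like this: a plain Secret never goes to Git; it is encrypted locally against the controller's public key, and only the resulting SealedSecret manifest is committed. The names and the ciphertext placeholder below are hypothetical.

```yaml
# A plain Secret is encrypted on the workstation with the kubeseal CLI, e.g.:
#   kubeseal --format yaml < secret.yaml > sealed-secret.yaml
# Only this SealedSecret is committed; the controller in the cluster holds
# the private key and creates the real Secret from it.
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: db-credentials          # hypothetical name
  namespace: team-abc           # hypothetical namespace
spec:
  encryptedData:
    password: AgBy3i4OJSWK...   # placeholder for kubeseal's ciphertext
  template:
    metadata:
      name: db-credentials
      namespace: team-abc
```

Because the ciphertext is safe to store publicly, the whole desired state of a cluster, secrets included, can live in the Git repos that Flux syncs.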
We posed a friendly challenge to the teams that were very Helm focused. I'm not opposed to Helm, but some of the things that I see people introducing Helm for, you can actually solve with Kustomize, kind of a hidden gem in the CNCF landscape I think, and it's also part of kubectl, so we already have it. Let's do a small excerpt of how we went about bootstrapping and building up everything, and this ties into scaling as well. We can now scale the number of platform members working on all of the things, and we can scale up the number of clusters and rebuild them very quickly. We kick things off with a very small bash script that I wrote using the Azure CLI. We install the Sealed Secrets controller; this is the critical part, we need the private key installed, and be sure to keep that safe. Then we install the Flux components, and then we kick things off by starting to sync a specific branch and path for the cluster that you're building. Again, this is not a GitOps talk, but suffice to say that there are some CRDs in Flux version 2, so here we are syncing from a specific repo and applying the Kustomization onto the cluster. And we had the platform config repo consisting of this and a lot of other things as well: you have the primitives, you have Grafana and other tools, and then we have the system syncs as well. The reason why we chose to separate those was that we as a platform team can introduce new technology at one cadence without messing things up when introducing new systems. The system part here is where we introduce the teams. So system, aka team, aka namespace, it's the same thing, and we can onboard those, again at a different cadence than the other parts. There are a lot of cool things in Flux that made our life easier. One of them is the cluster role that you see on the left-hand side there: the Kustomization can run as a service account bound to that cluster role, and obviously that's something that we in the platform team control. Remember those guardrails: we could say to a team, just drop all of the YAML that you have inside that repo, and if you by mistake create a Service of type LoadBalancer, the service account doesn't have that permission, so you get gently nudged back onto the correct path again. We used the notification controller, and it was a bit of a challenge to get the teams to embrace the eventual consistency of Kubernetes. Through the notification channel, they could drop in the YAML and potentially get notifications within a small amount of time saying, hey, you have something breaking here, or something is not correct. And we have a Teams channel for each of the teams, getting that feedback to them as quickly as possible. Automatic image updates, like I mentioned, a Teams channel for each of them, and Kustomization dependencies. Not to spoil what I'm going to talk about in the next couple of slides, but sometimes you want to make sure that some components are already running in your cluster before you apply everything else, and with the Kustomization dependencies you can say that these health checks need to be good before you start applying and deploying other things. The teams would have many applications; they would be in more of a microservice architecture.
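The Flux pieces described here, syncing a team repo, running it under a restricted service account, and waiting on health checks, can be sketched roughly as below. The repo URL, names and the health-check target are hypothetical, and the exact API versions vary between Flux releases.

```yaml
# A GitRepository source pointing at a (hypothetical) team config repo;
# one branch per environment (dev/test/production).
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: team-abc
  namespace: flux-system
spec:
  interval: 1m
  url: https://git.example.com/org/team-abc-config   # hypothetical
  ref:
    branch: dev
---
# The Kustomization that applies the team's YAML. serviceAccountName is
# the guardrail: the SA's RBAC decides what the team's manifests may create.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: team-abc
  namespace: flux-system
spec:
  interval: 5m
  path: ./overlays/dev
  prune: true
  sourceRef:
    kind: GitRepository
    name: team-abc
  serviceAccountName: team-abc-deployer   # hypothetical restricted SA
  dependsOn:
    - name: platform-core                 # platform components sync first
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: ingress-controller            # hypothetical dependency
      namespace: platform-system
```

If the team's YAML asks for something the service account cannot create, such as a LoadBalancer Service, the apply fails and the notification controller can report it straight into the team's channel.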
Obviously, everybody's doing that now. What we said to them was: rip out all of that YAML, put it in the repo that we give you, and we'll start syncing it in, working with them closely on the branching strategy and the trunk-based model, and have everything there. This is quite good, because previously, when you had all of that YAML in the different application repos, whenever you wanted to make some changes to your system configuration, let's say you want some more CPU or memory, or to change the scaling, or the delta between application A and B, you would potentially go through the whole CI part again and build a new image. Whereas here you can make those changes independently of your application. It's a little bit of the same point as with microservices: you can change the scaling and do it quite rapidly. We spent quite a bit of time on the branching strategy for the platform as well. It's very important to emphasize that there were no commits in the test branch, for example, that were not in the dev branch. So we didn't use the branches to hold state in that sense; we only used the branches to be able to roll things from dev to test, and from test to production. Having it this way, we were able to take one branch and compare it to another to see exactly what was in dev but yet to be pushed to test, which was previously quite difficult when all of that lived in the CD pipeline instead. This is a small excerpt of the Kustomize part, something you see in the Kustomize community, where you have the base folder and the overlays folder. We would introduce the system into dev, make all of the overlays for test and production, and again only use the branches to promote things from dev to test and from test to production. Kustomize has a lot of features that can swap things out in your YAML for you and make sure that you're not deploying things to the wrong namespace. So in this case we have a system called ABC that does its ingress changes and a few other small changes in the overlays, but references everything in base. And of course you want to have as much as possible in base, so you can easily roll things out to new environments. Having all of the state here, we could start thinking about the platform as being more immutable, and this is good in terms of disaster recovery. We trained for disaster: let's say we want to upgrade to a new Kubernetes version. Downgrading the Kubernetes version in the cloud provider is, I think, actually impossible, or at least very hacky. Whereas this way we would just rebuild the cluster, and we trained for that disaster recovery from the start. We're actually down to 20 minutes or something like that, and that's mostly due to pulling the images and getting the nodes up and running. Even if there were a disaster, we could quite easily just bring things up. Obviously we used node pools and priority classes to make sure that the most critical applications keep running, and all the other tricks as well. And we played that "what if" game, no pun intended here either. Use separate node pools: some applications are very memory-intensive, Prometheus for example; keep those in different node pools, use taints and tolerations, keep things separated, and think about these things from the start. Then there are the policies we got going with. We obviously wanted to extend them as part of the guardrails: make sure that if you try to deploy something without limits on CPU and memory, you get a friendly message in your Teams channel saying you're missing these things. We like the Gatekeeper project, and we're looking forward to the Pod Security Standards coming out shortly as well. We used Azure Policy for that, and this is something that we are currently working on extending even further with those policies. Yeah, the policies also make sure that you have priority classes set and all of those things. Then there's the question of a service mesh. This question came up quickly; when I joined the team they were already using a service mesh. One that you happen to like, right? Yes, that was already in place. But we hear this question from the different customers we work with as well: a service mesh, come on, do you actually need that? It brings in a lot of components; things are complex enough. It turns out that you probably do need it. There are some questions that you maybe didn't know you should be asking, and this is where the service mesh comes into play. We had two main questions that we wanted to solve. This is something that we see a lot of people having: you move things into Kubernetes, and all of that communication is unencrypted. We want to have it encrypted, and we also want insight. You want to get the insights down to sockets, to network traffic: who's talking to whom in terms of applications. I gave it away a little bit earlier, because you saw Linkerd on one of those slides. We chose Linkerd.
It's what I've chosen and used on all the previous ones as well. Full disclaimer: I'm a Linkerd ambassador, so totally unbiased, of course. But Linkerd has its own proxy, which is really efficient. I doubt that Istio, Consul or the other ones would have been an issue performance-wise, but we take all the performance gains we can get. The main driver was that Linkerd has very good day-two operations: not only the documentation, but also the community, which helped us with solving some of the scenarios, constraints and concerns that we didn't initially have answers for. Although the encryption is important, the main thing that we got from Linkerd and a service mesh is that network insight. You have it already in other environments, but in Kubernetes you are somewhat flying blind. Here we could see sockets, whether the applications are behaving correctly and reusing those sockets; you could see who's calling whom, and a lot of other metrics as well. There are a lot of other CNCF tools that we used; I'd like to highlight the Thanos project. We use the Prometheus operator, scrape the Prometheus metrics that we get from Linkerd, and store those in Azure Blob Storage. The idea is that a couple of years from now we can look at trends going back: where are the most calls coming in, which applications are having the most trouble. And we also had KEDA in place, which could take those Prometheus metrics coming from Linkerd, for example, and do the automatic scaling, again trying to be more cloud native. Something we saw from the teams was that scaling was not really something they were thinking so much about, but we tried to help them with that as well.
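The KEDA-on-Prometheus setup just described can be sketched roughly like this: a ScaledObject that scales a deployment on a request-rate query, such as one derived from Linkerd's proxy metrics. The application name, Prometheus address and query are hypothetical, and field details vary between KEDA versions.

```yaml
# Scale a (hypothetical) deployment on a Prometheus request-rate metric.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: claims-api
  namespace: team-abc
spec:
  scaleTargetRef:
    name: claims-api             # the Deployment KEDA should scale
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090   # hypothetical
        threshold: "100"         # target requests/sec per replica
        query: sum(rate(request_total{deployment="claims-api"}[2m]))
```

With something like this in Git alongside the rest of the team's YAML, scaling behavior is reviewed and promoted through the same dev-to-test-to-production flow as everything else.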
Yeah, so the final section, about team onboarding. We had to spend time on onboarding teams in the company, to introduce some of the key concepts around GitOps, of course. There are initial thresholds to get over when moving from a push- to a pull-based approach. And we also had to do practical setup, sometimes together, like Fredrik showed: setting up the repos and everything. Also, there's sometimes a need to get used to Git ways of working, with pull requests, reviews, quality assurance tooling and things like that. We had to emphasize that we try to keep the platform as stateless as possible, so it's possible to recover more easily in the event of a failure, as Fredrik was discussing. We also emphasized that apps should be built in a more stateless way and not rely on a single pod or a session. So the onboarding sessions are really helpful for us, and they give insights for both sides, so we can run the system in an efficient way together and learn from each other. We also create dedicated Teams channels for communication between us and the teams that onboard onto the platform. I think we have a couple of questions for the Q&A as well. Yeah, can you maybe go to the microphone? So, hi. As you are part of a highly regulated industry, and being in a highly regulated place, have you encountered any pushback, from, like, the regulating body or from inside the company, regarding going on with these newer technologies which you are starting to use? Yeah, we did. All right, but how did you deal with it?
We tried to explain, in many ways, how we would solve their concerns. A lot of the time they're saying: no, no, no, you need this subnet here, we need to separate traffic here, and we should do this and that. And we said: okay, but what you're actually trying to achieve is to be able to encrypt traffic and not have anybody sniffing it. Trying to get to the underlying problem that they're trying to solve, and telling them: we are solving it as well, but with more cloud-native tools. That's one way of doing it. And if you mean the question about moving to the cloud at all or not, then of course we have that discussion in the company as well, but currently we have taken the position that we can do it, and we just have to find the ways forward. That's something we hope this talk will convey as well: if the biggest insurance company in the Nordics can do it, hopefully this talk can push the others to see that it's actually possible. Thank you. So I had a question. You said you have folks, or development teams, that voluntarily come and use your platform. So one of the things I wonder is: what are some strategies or approaches you took with your internal customers
that worked and made them want to onboard? We tried to carefully listen to what problems they have and tell them how we could solve those problems for them. We were trying to build a minimum viable product, more of a startup approach, where we could create something that solves a problem, not just cool technology. The GitOps part, for example, being able to control the applications, and scaling automatically, were pain points that a lot of teams had. Yeah, the short answer is: try to solve the problem that they have, and then they will come. A lot of teams spent too much time on managing the Kubernetes and platform part; it's a lot of tools and technology to get into. And we said: how about we do that for you, and you meet us halfway and embrace, in this case, the GitOps part, and you'll get automatic encryption, you get scaling, you get alarms that we set up for you. And then they'll probably onboard. But yeah, think more in that startup mode: solve something, and they'll probably come. Thank you so much, an excellent answer. Thank you. Yeah, a quick question about test data, and how test data works in a highly regulated industry, if you could say a little bit on how you distribute and work with it?
I know it's a big topic. It's a bit off topic, the test data; it's a big topic, and it's kind of outside our scope for this platform. But it's of course naturally a discussion we're having in the company as well, how to manage the test data, and I think one approach is masking that data from the backend. That's what I can say right now. And trying to keep the platform as stateless as possible, as Jonas mentioned, that's one way of doing it. But I've seen a couple of scenarios where you test the application on a very small amount of data and then you go into production with real-world data, which is so much more, and you see that the application and system don't scale. Being able to generate some synthetic data to test those things is an important thing. And when it comes to GDPR, I mean, masking that data, etc., as you said, is a big topic, but very, very important for sure. So my question is about, you had several teams, for example for the web application firewall: was that integrated into GitOps, or was it still separate? No, we're keeping it separate for now. We're working with the networking team, who are sort of at the same time setting up this central capability in the cloud around web application firewalls. So our approach is to connect to that and use it. Yeah, there was already a lot of security in place, firewalls, Palo Alto and F5, everything in place. We spent quite a bit of time figuring out: okay, where can we put our Kubernetes clusters to still use the functionality of those and keep the networking team at ease? F5 and Palo Alto firewalls are awesome products; we didn't want to replace all of those. But finding that sweet spot took quite a bit of time. But quite often you need to make changes specific to an application; did that then still go to the central team, or was it all in GitOps, so you could make that change to the central product yourselves? We are onboarding that team now.
Jonas has more knowledge from that recently. But that's a big topic as well, trying to embrace the cloud-native way. Obviously, if you come from something that you have on-prem, one big monolithic stateful application, moving out to a cloud-native landscape is very different, and you have to know more about system development as well. That's something we try to work on in the platform team: we shouldn't only be concerned with platform and cloud-native tools; we should also have people who have been building systems and been application developers, especially if you want to move a stateful application from on-prem onto the platform. Yeah, good question, good question. And we will be in part dependent on another team, as you say. We're trying to solve that in different ways, and one of the ways is us owning maybe the certificates and the entries, so that we can easily onboard teams and add them as entries, so that we still can move fast. Any other questions? Yeah. Does the platform also hold the data? You didn't talk about StatefulSets, and how you can do disaster recovery with the GitOps approach when you have drives. And also, how do people create databases? Is this part of your platform, or do they have to order them offline and then just connect to them? Good question. Yeah, it wasn't part of the talk: how do you manage the data, especially in this disaster recovery setup? Is that the right understanding? It's a little bit of a cop-out, I would say, because we kept the state and the databases outside. Once you introduce all of those databases into the platform, backing them up and doing snapshots of them is more difficult, for sure. One part is Prometheus, for example, which has this persistent volume, and we made a conscious decision: okay, in the worst-case scenario, we would lose two hours of metrics. But it's a trade-off.
We could spend more time on it as we move on, but it's an MVP, and then you start asking: okay, is that crucial? But shipping those metrics off with Thanos helped us quite a bit. Thanks; we have a very similar setup and we have the same problem, and I'm trying to solve that. Thanks. There's a queue for questions. Yeah, I wanted to ask you about audits. Do you have external audits for security compliance, and how do you pass them? Because from my experience, auditors, when they see this added complexity of Kubernetes, they just cannot always follow all of it. So what do you give them? We have audits, as an insurance company, yeah, of course we do. And what we're going to show them, if it comes to it, is who implemented what change at what time. You know, they get the manifests and they get the history, right? So we have it all there. That's the cool part here: we sort of have an audit trail of all the changes that have occurred in the environment, both from a system perspective and from a platform perspective. I'm working for a new company now that does identity and payments, and obviously PCI DSS and those other things are at the forefront of everybody's mind there. It doesn't solve everything, but like you also mentioned, having those audit logs, and making sure that you lock down the cloud so that, for test and production, every change has to go through infrastructure as code
You have to do every change in Infrastructure as code and all the application code and platform in those git repos and you can then dump those out to the auditors To say that this is who did what and when That's your audit log Thank you Hi, I was wondering so what's the adoption of your platform and when you don't get all the developers in their company And I guess as an insurance provider you have some kind of legacy teams who are working on stuff that not fits in this concept and If you have a strategy to I don't know Get this adopted across all the company so to mention one Yeah, you're totally correct to start with we have legacy teams And what we have also had legacy Java applications and they have actually been the first movers now moving from the Java on-prem sort of boss environment and It's been starting to spin it up in containers and getting the you know know how and knowledge around that So that's actually the first use cases. We've had so far And there to mention the mount I would say we're right now currently having around 10 10 to 15 different teams onboarding at this moment. I Think we are out of time. Thank you