Hello and welcome to this talk about the cloud native journey at Adobe. I'm going to talk about how we use Kubernetes and other CNCF projects, what went well, and what didn't, so hopefully you'll learn some lessons that will help you on your own teams.

I'm a cloud engineer on Adobe Experience Manager Cloud Service, a product in the Adobe portfolio that I'll describe in more detail. My background is in open source: I started the Jenkins Kubernetes plugin, and I've contributed to Jenkins X, to Apache Maven for many years, to the Eclipse Foundation, and so on. So my background is very much in the open source community.

First, what Adobe Experience Manager is, so we can better understand the challenges we had to go through. Everything here applies to my team inside Adobe Experience Manager Cloud Service, which we announced at the beginning of this year. There are other teams in AEM, and many teams across Adobe, a lot of them also using Kubernetes; this talk is just about my team.

Adobe Experience Manager is a content management system, digital asset management, and enrollment forms product, and it's used by many Fortune 100 companies, companies you already know that are already customers of Adobe Experience Manager. Now we are providing it as a cloud service. It's an existing distributed Java application: you have author instances where authors create content, and publish instances where visitors see the content that was created, and all of this scales horizontally. This already existed before the move to Kubernetes; it has been around for years. The stack is Java with OSGi and a lot of open source components from the Apache Software Foundation. There's also a huge market of extension developers who write modules that run in-process inside AEM, and I'll get to why that is important to understand later on.

The challenge for us was to run Adobe Experience Manager on Kubernetes. Getting into it: we are running on Azure, we have more than 10 clusters, and we keep adding clusters, across multiple regions: US, Europe, Australia, and more coming. At Adobe we have a dedicated team managing clusters for multiple products, so we don't create the clusters ourselves; another team takes care of that.

Every customer can have multiple AEM environments that they can self-serve, so they can create environments on their own. For each customer we have at least three namespaces, for the development, stage, and production environments. We also have sandboxes, which are evaluation-like environments. All of this is managed by the customers themselves through Cloud Manager, a separate service with its own web UI and API. So customers can have multiple environments that map to multiple Kubernetes namespaces.

For us, the namespaces provide the scoping: network isolation, quotas, permissions. We use what Kubernetes already gives us. And in each environment we use a lot of init containers and sidecar containers to apply separation of concerns: sidecar containers that do storage initialization, httpd fronting the Java application, metrics exporters, Fluent Bit to ship logs, and, as another example, Java thread dump collection, where we gather the thread dumps and store them.
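To make that concrete, here is a minimal sketch of what one of these pods could look like; the names, images, and ports are illustrative, not our actual configuration.

```yaml
# Hypothetical sketch: the existing Java application surrounded by an
# init container and sidecars, one concern per container.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aem-author
  namespace: customer-dev          # one namespace per customer environment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: aem-author
  template:
    metadata:
      labels:
        app: aem-author
    spec:
      initContainers:
      - name: storage-init         # prepares the content store before the JVM starts
        image: example.com/storage-init:1.0
      containers:
      - name: aem                  # the existing Java/OSGi application
        image: example.com/aem:1.0
        ports:
        - containerPort: 4502
      - name: httpd                # fronts the Java application
        image: httpd:2.4
        ports:
        - containerPort: 80
      - name: fluent-bit           # ships logs to the centralized store
        image: fluent/fluent-bit:1.9
```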
So we have several sidecars: custom-developed ones like the thread dump collector and the storage initialization, which are very particular to our use case; open source ones like Fluent Bit; and some that we extended from open source, like httpd. We have to scale these to hundreds of customers and thousands of sandboxes, so a big challenge for us is making sure that whatever services we build around the product will scale.

Some of the issues we faced with scaling: you can run into Azure API rate limits, especially during upgrades, so we have to limit each cluster to a few hundred nodes. You could also work around this with bigger nodes, so you get more cluster capacity with the same number of nodes.

We use the Kubernetes vertical and horizontal pod autoscalers extensively. We use the vertical pod autoscaler to scale memory and CPU up and down, but the JVM footprint is hard to reduce. If you know how Java applications handle memory, you basically reserve the memory at startup by setting the heap size, and from then on it's up to the JVM to manage it. One thing to be careful with in the vertical pod autoscaler is that changes to the requests need pod restarts to become effective; make sure you don't set the VPA to apply them automatically, because otherwise you get random pod restarts, and depending on your application that may be a problem. We also use the horizontal pod autoscaler, set to scale up on requests per minute. One thing to be aware of is that you cannot use the same metric for the HPA and the VPA; that's in the Kubernetes documentation. There's a sketch of both objects below.

One service we built to manage the scale of the clusters is something we call hibernation, for environments used by engineering and for sandboxes, which are used sporadically. We scale these environments down, which allows us to overbook clusters and save a lot of money by packing more resources into one cluster. How it works is very simple: we have a Kubernetes job that checks Prometheus metrics, and if there has been no activity in the last n hours, it scales a bunch of deployments down to zero. The user is shown a message and can de-hibernate the environment by clicking a button, and that's all there is to it; this lets us get much better packing in the clusters. Ideally we would de-hibernate automatically on a new request, like function-as-a-service or Knative do, but you have to account for the fact that the JVMs in this product take around five minutes to start, depending on how much content you have in the store. Unless you use the new JVM micro frameworks, an existing Java application is probably going to take a good amount of time to start.
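Here is a minimal sketch of those two autoscaler objects; the names, numbers, and the requests-per-minute metric are invented for illustration. Note the VPA update mode avoids automatic evictions, and the HPA scales on a different metric than the ones the VPA manages.

```yaml
# VPA that computes new requests but only applies them when a pod is
# (re)created, so it never restarts pods on its own.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: aem-author
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: aem-author
  updatePolicy:
    updateMode: "Initial"    # "Auto" would evict pods to apply new requests
---
# HPA scaling on a requests-per-minute custom metric (hypothetical name),
# served through the custom metrics API.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: aem-publish
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: aem-publish
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_minute   # must differ from the VPA's metrics
      target:
        type: AverageValue
        averageValue: "1000"
```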
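And here is roughly the shape of the hibernation job: a CronJob that asks Prometheus whether an environment saw traffic recently and, if not, scales its deployments to zero. Everything here is a hypothetical sketch: the metric, the image, the service account, and the eight-hour window are all invented.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hibernation-check
spec:
  schedule: "0 * * * *"                    # check once an hour
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: hibernation  # needs RBAC to scale deployments
          restartPolicy: OnFailure
          containers:
          - name: check
            image: example.com/hibernation-check:1.0   # hypothetical image with curl, jq, kubectl
            env:
            - name: NS
              value: customer-dev          # the namespace to hibernate
            command: ["/bin/sh", "-c"]
            args:
            - |
              # Sum the requests seen in the last 8 hours for this namespace.
              ACTIVE=$(curl -s http://prometheus:9090/api/v1/query \
                --data-urlencode 'query=sum(increase(http_requests_total{namespace="'"$NS"'"}[8h]))' \
                | jq -r '.data.result[0].value[1] // "0"')
              # No activity? Scale everything in the namespace to zero.
              if [ "${ACTIVE%.*}" -eq 0 ]; then
                kubectl -n "$NS" scale deployment --all --replicas=0
              fi
```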
For networking we use the networking capabilities of Kubernetes, which are a bit complex, and we have to account for multi-tenancy, since we run multiple customers on the same cluster. So we limit things so that services cannot connect to services in other namespaces: we block everything by default and open specific cases as needed. Everything is virtual Kubernetes networking, which gives flexibility but of course introduces complexity; it's managed by Kubernetes, and while we had some issues, overall it works pretty well.

We use Cilium for networking. It uses eBPF instead of iptables, which is more efficient and performant, and it allows custom network policies at layer 7, so things like policies on paths, HTTP headers, and HTTP methods; a higher-level API that lets us define more fine-grained constraints. We use the NetworkPolicy object to block or allow traffic: we block access to all other namespaces and only allow some outgoing HTTPS and common ports. Customers may also want to allow specific things, like egress rules or IP allowlists; say you have a development environment that should only be accessible from the IP range of your private network, we can do that too. There's a sketch of these policies below.

For ingress we are using a Contour fork with some extra features; Contour also has more features than the standard Kubernetes Ingress object, so we use blocklists, allowlists, path-based routing, things like that. It uses Envoy behind the scenes, so we use Envoy heavily. Envoy is a kind of love-hate relationship: it can break in many different ways, and it can break badly. If Envoy is misconfigured you can cause a cluster-wide outage; if you push a wrong configuration and Envoy restarts, it clears all the Envoy routes and you get a cluster-wide outage. We also had issues when the rate of configuration change was too high, where it locks up and everything slows to a crawl. But it's very, very powerful, and we had to do some work to fix issues and use it correctly. One thing we do now, for instance, is validate all the configs, both at build time and at runtime, to prevent causing problems. As I said, it's very powerful, but it's used everywhere, and if you do something wrong it can cause a cluster-wide outage.

For logging we use Fluent Bit sidecars that send logs to a centralized store, and we use Grafana Loki for log aggregation. For monitoring and alerting it's the typical Prometheus and Grafana; we aggregate the data from all the clusters and we have alerts coming from Alertmanager, so a very typical stack. One feature we have is customer logging: customers also need access to some of these logs, so we use Fluent Bit to send the logs to either Logstash or, more recently, Loki, and the customer can see them in the Cloud Manager UI and get whatever is in the logs, both from the application and from their custom code. We are moving away from Logstash because it's JVM-heavy and uses a lot of memory, and we think Loki is a better option for these multi-tenant services.

On resiliency and self-healing, some things you have to be aware of. You have to have readiness and liveness probes, so that your services are taken out of rotation if there's any problem and restarted automatically if something is wrong. You have to have pod disruption budgets to ensure a minimum number of replicas during rollouts and cluster operations, so that, say, at least one pod of a specific service, or 50% of the pods, are always running and not all killed at once by Kubernetes. And you want pod anti-affinity to distribute services across nodes and availability zones: if you have multiple availability zones, you have to make sure that not all the pods end up in the same one, so you use pod anti-affinity to spread the load across AZs. There's a sketch of these pieces below too.
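Going back to the policies: here is a minimal sketch of the block-by-default approach plus one Cilium layer 7 rule. The selectors, namespaces, ports, and paths are invented; a real setup would also need to allow DNS and the other common ports we open.

```yaml
# Default deny: block all ingress and egress in the customer namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: customer-dev
spec:
  podSelector: {}
  policyTypes: ["Ingress", "Egress"]
---
# Then open specific cases, e.g. outgoing HTTPS only.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress-https
  namespace: customer-dev
spec:
  podSelector: {}
  policyTypes: ["Egress"]
  egress:
  - ports:
    - port: 443
      protocol: TCP
---
# A layer 7 rule with Cilium: only GET requests on a content path
# are allowed to reach the publish pods.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: l7-allow-content
  namespace: customer-dev
spec:
  endpointSelector:
    matchLabels:
      app: aem-publish
  ingress:
  - toPorts:
    - ports:
      - port: "80"
        protocol: TCP
      rules:
        http:
        - method: "GET"
          path: "/content/.*"
```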
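And here are those resiliency pieces on one deployment, again as a sketch with invented paths, ports, and numbers:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aem-publish
spec:
  replicas: 4
  selector:
    matchLabels:
      app: aem-publish
  template:
    metadata:
      labels:
        app: aem-publish
    spec:
      affinity:
        podAntiAffinity:               # spread replicas across availability zones
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: aem-publish
              topologyKey: topology.kubernetes.io/zone
      containers:
      - name: aem
        image: example.com/aem:1.0
        readinessProbe:                # take the pod out of rotation when unhealthy
          httpGet:
            path: /health
            port: 4503
        livenessProbe:                 # restart the container when it is stuck
          httpGet:
            path: /health
            port: 4503
          initialDelaySeconds: 300     # the JVM can take minutes to start
---
# Never let voluntary disruptions take down more than half the replicas.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: aem-publish
spec:
  minAvailable: 50%
  selector:
    matchLabels:
      app: aem-publish
```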
So we're building a very multi-tenant service, and in everything we write we try to limit the blast radius. As I said before, customers are isolated by namespace, and we enforce that all the deployments have CPU and memory requests and limits; this is key. There's a sketch of how to enforce that at the end of this section.

We are also running customer code, which is an interesting topic, because a customer can basically write something that takes down or causes trouble in their instances. We run checks before that code goes into the production clusters, but there's always that risk, so for these pods we've started testing Kata Containers, where the pod runs in a virtual machine, transparently. It's a very nice project: it's transparent, so you keep using kubectl and the same APIs and command line (there's a sketch of that below too). And developers from the other team that manages the Kubernetes clusters for us are contributing improvements upstream, so we are helping improve Kata Containers.

On external services: we don't run everything on Kubernetes. We have an external MongoDB running outside the cluster and we use Azure Blob Storage; those two are for persistence. We don't use Kubernetes persistent volumes much, because it's a bit risky; obviously it's better if you don't need them, and some people do need them for their systems and applications, but especially on Azure we found that we hit the Azure API rate limits when we had hundreds of persistent volumes. It's been a few months since that happened, but we stopped using them in favor of other options, more like external storage for data.

For data processing we have a service based on Kafka, which lets us sync data between author and publish, and it works worldwide when we have publishers in multiple regions: customers that create content may be publishing their websites, their data, and their assets into multiple regions, and this Kafka service syncs all those regions.

For CDN we use Fastly in front of the Kubernetes load balancer, and it also fronts the binary content we store in Azure Blobs. One new feature we had to build, for instance, was a dedicated egress IP. Customers want a dedicated egress IP, either for firewall configuration or to avoid being throttled or blocked together with other tenants: if you have multiple tenants coming out of your cluster, there's a range of IPs that is shared across all of them. So we can provide customers with a dedicated IP that is exclusive to them. We built this with a scale set of proxies outside the Kubernetes cluster, with their own load balancer, dedicated to each customer, and we use network policies so that only that customer's pods can access the proxy. The customer's pods go out to the internet through this proxy, which gives them a dedicated IP used only by that customer.

For continuous delivery, how are we managing all this? We moved from a yearly release on the non-cloud version of the product to a daily release on Kubernetes, which has its own challenges. We use Jenkins mostly for CI/CD, and we also use queues to trigger some jobs; there's a Jenkins plugin that can do that for you. We use GitOps extensively: all the configuration is stored in Git and reconciled on each commit. And we have a pull model instead of a push model so we can scale: pushing the configuration from Jenkins to all the clusters and all the namespaces is not going to scale, so we have services in the clusters that pull this data and do the reconciliation.

For Kubernetes deployments we use a combination of Helm for the Adobe Experience Manager application, plain Kubernetes manifests for some services, and Kustomize for some new microservices we built, so we have a mix of all of them. We use Helm extensively, and one thing we learned is: don't mix application and infrastructure configuration in the same package, because if you then need to make a change due to some infrastructure change, you have to redeploy all of those Helm releases, and we have thousands of them. Helm also needs to push to all namespaces, and pulling is easier, so we are moving towards the Helm operator, which will improve this situation. There's a small Kustomize sketch below.

Things we use during CI and CD to make sure nothing breaks: kubeval to validate the Kubernetes schemas (we have some custom CRDs and we created schemas for those too), so every pull request, every change, goes through kubeval validation. It also goes through conftest, using Open Policy Agent rules, so we validate every Kubernetes object with conftest rules. Developers can still create pull requests and see the output of these rules, so they keep their autonomy: they can see what they should do and what the best practices are through these rules. These are things like security recommendations, labeling standards, and image sources, where you want to make sure all your Docker images come from a specific registry and not from other ones. It's very helpful for letting developers manage their own services and their own features, so a lot of it is about self-service.
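Before I wrap up, a few sketches of things I mentioned. First, enforcing requests and limits per customer namespace; one way to do it is a LimitRange plus a ResourceQuota (all numbers invented):

```yaml
# Default and cap the CPU/memory of every container in the namespace.
apiVersion: v1
kind: LimitRange
metadata:
  name: container-limits
  namespace: customer-dev
spec:
  limits:
  - type: Container
    defaultRequest:          # applied when a container sets no request
      cpu: 250m
      memory: 512Mi
    default:                 # applied when a container sets no limit
      cpu: "1"
      memory: 2Gi
    max:                     # upper bound any single container may request
      cpu: "4"
      memory: 8Gi
---
# Cap the total resources one customer namespace can consume.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: namespace-quota
  namespace: customer-dev
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
```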
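Second, Kata Containers: it plugs in through the RuntimeClass mechanism, so the pod spec barely changes. A sketch, assuming the nodes have a "kata" runtime handler configured:

```yaml
# RuntimeClass pointing at the Kata handler installed on the nodes.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata
---
# A pod running customer code inside a lightweight VM, transparently.
apiVersion: v1
kind: Pod
metadata:
  name: customer-code
  namespace: customer-dev
spec:
  runtimeClassName: kata     # everything else stays plain Kubernetes
  containers:
  - name: aem
    image: example.com/aem:1.0
```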
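And third, the Kustomize layout for a new microservice, with a shared base and per-environment overlays. The service name is invented, and you would apply it with kubectl apply -k overlays/prod:

```yaml
# base/kustomization.yaml: the shared definition of the service.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- deployment.yaml
- service.yaml
---
# overlays/prod/kustomization.yaml: production-only tweaks, e.g. replicas.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../base
patches:
- target:
    kind: Deployment
    name: sync-service
  patch: |-
    - op: replace
      path: /spec/replicas
      value: 4
```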
That's it. I hope you enjoyed it and got some examples of how these technologies are used in a real-life use case with an existing application, because a lot of the time we talk about Kubernetes and it's all greenfield development, brand new applications with the latest technologies. I hope this shows how you can bring an existing application, an existing service that was not initially designed for the Kubernetes world, into Kubernetes; Kubernetes makes it really easy to do this lift and shift of something that is already running. I hope you got some ideas and that it helps you. Thank you for being here. And we are hiring, so if you know any of the technologies I mentioned today, let me know; you can ping me on Twitter, and you can also ask me any questions now. Thank you.