services. A microservice architecture structures the application in a different way: as a collection of services with a specific set of characteristics. They are highly maintainable and testable, they are loosely coupled from each other, they can be deployed and scaled independently, and they are organized around business capabilities. We saw that the microservice architecture was the right solution for the company, especially because it allows us to scale better: if something needs to scale, you can scale the individual microservice. It is easier to develop, because each team can focus on their own microservice and choose the specialized stack they need. But it also comes with important problems that need to be considered. The architecture is more complex, because you need a platform to host these microservices. The full system has less stability, because there are lots of small components that really need to be properly integrated. Microservices are more difficult to deploy and to test, and they can have security issues as well, because the communication between them needs to be done properly. So it is a hard transition, but I think it is really worth the effort, and the results are paying off.

So what was our main strategy? Every transformation needs to come with a strategy, and in our case it was done in steps, considering the key pieces of the transformation. The final goal of the journey was to extract microservices from the monolith to isolate functionality, mostly using domain-driven design. The first step was analyzing the business functionality to identify the functional units and define the boundaries between them. We also needed to decide how to manage the data schemas, because with microservices the data needs to be isolated as well. We needed to adapt to the new architecture and define how services would communicate with each other and with the outside.

A very important task was also to consider the potential impact on the organization. This transformation is not only about the architecture; it affects everyone. It affects the developer experience, because developers were used to working in a monolithic environment, touching a single application, and now they need to learn how to code for microservices and consider the boundaries. That involves new knowledge and a transition period. Another key area affected by the transformation was QA and CI/CD, because testing and deploying a monolith is very different from testing microservices: in a monolith you test one single application, while with microservices you need to test every functional unit, then integrate them, and then deploy all the pieces together.

With those considerations, the paradigm change needed to be propagated across all the teams. Every team needed to be aware of it and change the way they were developing, behaving, and communicating, focusing on key areas. These were the requirements we needed to spread for designing and coding a microservice. A microservice needs to be idempotent: a service needs to behave in the same way for all requests, which means it needs to be stateless, and any stateful situation needs to be managed carefully, considering all the problematic use cases. A microservice needs to be resilient.
A service can fail, and the application needs to be robust and survive it. A service needs to handle retries and timeouts carefully and efficiently, and the whole integration needs to support that. Microservices need to be decomposed into functional units: a microservice needs to be as minimal as possible, and if we start seeing that a microservice is touching a different functional unit, it is better to separate that into another one. There needs to be proper dependency management: the architecture involves a complete set of minimal components, and the design of the application needs to consider these dependency chains, avoiding things like deadlocks.

And finally, something that very much affected my team was the architectural requirements. The microservice architecture really needs a platform that supports it, so it involved a complete change of infrastructure. What were those changes? Microservices are mostly deployed on containers, so we needed a container orchestration system; in our case we were hosting on Amazon, so it was an EKS Kubernetes cluster. Microservices needed a way to communicate efficiently internally, which brought the need for a service mesh. Microservices needed to expose their services to the outside, with client-to-service communication, which brought the need for an API gateway. They needed observability, a common layer for metrics, traces, logs, and alerts, instead of letting every team build their own. They needed a specialized CI/CD system to test pieces individually and integrate them together, so we needed proper CI/CD tooling. And they needed to live in a zero-trust environment, meaning that we cannot trust any component of the system, even if it is internal.

With all those premises, we started our transformation, so let me talk a bit about our microservices architecture. What I show here is a very, very simplified view of our current architecture. We are using Kubernetes for container orchestration, and microservices are just regular Kubernetes Deployments, so we have different pods and replicas running there. We run on AWS, mostly using the managed EKS clusters. Traffic always enters through Amazon load balancers and then reaches a first layer, the Kong Kubernetes Ingress Gateway. This is a common API gateway that exposes all the microservices that need to be reachable from the outside. Some microservices are internal and don't need to be exposed, but for all public-facing endpoints we are using Kong. Internally, all the services inside the mesh communicate using the mesh technology, Kong included: Kong talks to the other services through the mesh as well. For the mesh we are using Istio as our platform. We chose it for many, many features; we mostly trust it for the features it offers and the strong community behind it.

Okay, now it is time to explain why we actually need a service mesh. Why not just use plain Kubernetes as the way of communicating between services? Let's first define what a service mesh is. A service mesh is a dedicated infrastructure layer that is built into an application and controls the service-to-service communication, mostly focused on microservice architectures. The main responsibilities of a mesh are controlling internal traffic between microservices, properly performing load balancing for that communication, and discovering all the services that are enrolled in the mesh. How is this achieved? The logic is abstracted away from the main service using what we call a sidecar proxy. A sidecar proxy is a container that is deployed alongside the main one: in a pod you can have multiple containers, so the mesh deploys an extra one that acts as a proxy. This proxy is in charge of managing all the traffic that arrives at and leaves the service. The proxy sits in the front line: it collects all the traffic arriving at that specific pod, redirects it to the right place, does any transformation, and finally passes the request to the main service. The same happens for traffic leaving the service.
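To make that concrete, here is a minimal sketch of how sidecar injection is typically enabled in Istio; the namespace name is hypothetical, following the idea that each functional unit gets its own namespace:

```yaml
# Labeling a namespace tells Istio to inject the Envoy sidecar proxy
# into every pod scheduled there; no application changes are needed.
apiVersion: v1
kind: Namespace
metadata:
  name: payments            # hypothetical functional-unit namespace
  labels:
    istio-injection: enabled
```

From that point on, every Deployment in the namespace gets an extra istio-proxy container that transparently intercepts its inbound and outbound traffic.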
The mesh provides a set of features that are very, very useful for us, and next I am going to show some relevant problems we had and how we solved them with the help of the mesh. Among the features we need is advanced inter-service routing: we need to expose multiple versions of the same service, have canaries, have a way to control the versions. We need transparent end-to-end mTLS communication, meaning that internal service communication is secure and developers don't have to take on implementing that themselves; it needs to be offered by the mesh. We need better security: the mesh offers authorization policies as a way to control which services can talk to which, with ingress and egress rules to control external traffic. We need proper load balancing for service-to-service communication, especially at the gRPC level; this is very important for us. Our services need to be highly available and resilient, so we need the ability to control retries, timeouts, and availability conditions for a service, which is a key point of the mesh as well. We need robust and automated deployments, because we need to deploy all the different versions and components of the services, so we need proper CI/CD tooling, and the mesh integrates with that and helps us a lot. And it also opens the door to better observability, offering better metrics, traces, and logs, and these metrics can be integrated with external tooling. Okay, those were mostly the features we are using the service mesh for. Now let me talk a bit about practical cases where we really used the mesh to solve our pain points.

Problem one: gRPC load balancing. This was actually one of the first problems we had when we were moving to microservices. We decided to use gRPC as our main internal communication method, but then we found a problem: Kubernetes wasn't able to properly load balance gRPC. What happened? In theory, Kubernetes has ways to load balance communication between services: you just create a Service and Kubernetes can balance across it. But gRPC uses the HTTP/2 protocol, which relies on long-lived connections and multiplexes requests over those connections. That is quite different from HTTP/1, which opens a new connection for each request, sends the request, and closes it. Kubernetes is fine at load balancing in that style, but it is not very efficient, because it involves a lot of overhead: creating the connection, preparing it, sending the traffic, closing it. With HTTP/2 we can use persistent connections that stay open to a target, and sending requests over them is much more efficient. But that has an incompatibility with Kubernetes, because Kubernetes load balances at layer 4, at the connection level. If each request is its own connection, it balances fine, because it just sends the different connections to different targets. But if we are using persistent connections, all the multiplexed requests stay pinned to a single target. This is where the mesh helps: it opens persistent connections to all the replicas of the target service and load balances at the request level, at layer 7. So we avoid relying on Kubernetes, which only balances at the connection level: we keep the connections open, and the mesh does the load balancing request by request. This was key for our strategy, and a very important piece where the mesh was really fundamental for us.
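With the sidecar proxying the traffic, Istio balances each HTTP/2 request individually out of the box; a DestinationRule like this sketch (service and namespace names are hypothetical) just makes the balancing algorithm explicit:

```yaml
# With the sidecar in the path, every gRPC call is balanced individually
# (layer 7), even over a single long-lived HTTP/2 connection.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: inventory
  namespace: payments
spec:
  host: inventory.payments.svc.cluster.local
  trafficPolicy:
    loadBalancer:
      simple: LEAST_REQUEST   # send each request to the least-busy replica
```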
Okay, the second requirement we had was end-to-end mTLS communication, including internal service-to-service communication. One of the main requirements for our platform was zero trust, meaning that in this environment nothing can be trusted: we cannot trust any microservice, even if it is located in a trusted zone. That means we need end-to-end mTLS, but in our architecture, as you can see, TLS is terminated at the Kong ingress level, which would leave the internal service-to-service communication uncovered. So how can we do it? We had two alternatives. The first one is implementing TLS on our own: developers initiate and terminate HTTPS requests from their code and manage their own certificates. But that means slower development, because they need to care about certificates, trust stores, and serving HTTPS, and it also means overhead for SRE, because they need to handle all the different certificates. The second alternative is having a unified platform and letting the mesh manage it. In Istio, we can enforce mTLS communication by simply defining some rules: there is a PeerAuthentication object that says we want to enforce mTLS traffic, so any insecure request will be upgraded to mTLS automatically. Istio takes care of generating the certificates and applying them to each service; each published service has its own certificate. And a service will only accept connections that are authenticated with certificates coming from a trusted Istio CA.
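As a sketch, enforcing this mesh-wide can be as small as a single PeerAuthentication object applied to the Istio root namespace (assuming the default istio-system root namespace):

```yaml
# Applied to the Istio root namespace, this enforces mTLS mesh-wide:
# sidecars reject any plaintext traffic from peers.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # root namespace makes the policy mesh-wide
spec:
  mtls:
    mode: STRICT
```

Workload certificates are minted and rotated by Istio's CA, so application code never touches them.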
Continuing with the zero-trust concept, we needed policy control and enforcement rules. Our cluster is placed in a secure environment, inside a VPC, but attacks can still happen; hackers can get in. And if an attacker gains access to one microservice inside the cluster and we have no protections in place, they gain access to all of the services in the platform. So we need a way to restrict this communication. In Kubernetes we have different options, for example network policies, but we can do better with the help of the mesh. Istio has the concept of authorization policies, objects that allow us to control how microservices communicate between themselves. We have them configured in our system to only allow communication between services inside the same namespace, which means communication is isolated per functional unit. If a microservice needs additional communication, it has to be specifically granted using these AuthorizationPolicy objects, and we typically do that at the namespace level as well, letting one namespace talk to another. We use the fact that namespaces are deployed per functional unit, so we are effectively controlling authorization at the functional-unit level.

We can also control external traffic, which is very key for us: we can control which endpoints a microservice talks to. In our case, we block all external communication by default and only allow-list specific connections using ServiceEntry objects. This way we can prevent certain types of attack: if a hacker gets into a service, they cannot use it to attack other systems, it cannot become an origin for attacking other clusters, and it cannot be used for unlawful purposes such as crypto mining or any other illegal activity. The mesh really helps us cut off all those threats.
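A minimal sketch of both mechanisms, with hypothetical namespaces and a hypothetical external host; the deny-external-by-default behavior also assumes the mesh is configured with outboundTrafficPolicy mode REGISTRY_ONLY:

```yaml
# Only workloads in "payments" or "billing" may call services in "payments";
# with no other ALLOW policies, everything else is denied.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-trusted-namespaces
  namespace: payments
spec:
  action: ALLOW
  rules:
  - from:
    - source:
        namespaces: ["payments", "billing"]
---
# With outbound traffic restricted to the registry, external hosts must be
# explicitly allow-listed before any workload can reach them.
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: external-payment-provider
  namespace: payments
spec:
  hosts:
  - api.example-payments.com   # hypothetical external endpoint
  location: MESH_EXTERNAL
  resolution: DNS
  ports:
  - number: 443
    name: tls
    protocol: TLS
```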
Another requirement for us was unified observability: we needed a single system for it. With the transformation to microservices, it is very tempting for every developer to start creating their own dashboards and their own metrics, but that would cause a mess, making things very hard to maintain and making it very difficult for other teams to have visibility into the specific services they connect to or are interested in. We also sometimes hit the issue that communication between microservices can be a black box: a request enters the system through the gateway and traverses several services to produce a result, but then it is difficult to see which services were involved in that request, how they behaved, and how much time was spent in each service. Istio relies on the Envoy proxy, and this proxy provides specific metrics and stats for traffic and communication that we can use. Those metrics are exposed by Istio in a way we can consume, and they can be integrated with other tooling, Prometheus for example. This data can also be used to generate alerts that help us meet the SLOs we have defined for a service: we can raise an alert when there are too many errors or when latency is too high, using Prometheus alerts. Here you can see some screenshots of the dashboards we have been preparing. Istio also offers Kiali, a graphical management platform that is very useful for visualizing the traffic topology and all the traffic details of the internal services and how they communicate with each other. It also has tracing enabled, so we can see all the different communications, the spans between services; it is a very powerful tool for debugging abnormal behavior in a microservice.

And one of our final requirements was to have a proper way of deploying microservices. The move to microservices meant that we release new versions more often, and they are small pieces, so we need to be sure that all those pieces fit together. This is quite an important challenge in terms of integration and deployment: deploying a version of a service that has not been properly tested, especially against its dependencies, can cause failures, and those failures can lead to service outages. We really needed to address that, and Istio has ways to mitigate it and help. One of them is canaries. A canary allows us to deploy a newer version of a service without shifting the traffic to it all at once; instead we increase it gradually, using percentages or other specific rules. When we deploy a service, we release the new version but keep using the old one: the new version starts without any traffic, and we increase its traffic gradually, by five or ten percent, while checking the results against specific metrics in Prometheus to be sure that everything works as we expect. We keep shifting traffic gradually, and when we reach 100% we make the new version the stable one: we declare that the version behaves properly and drop the old version of the service. Because we do it gradually, we avoid service outages. In Istio we do this using the concept of a VirtualService, which exposes the different versions of a service and lets us say which percentage of the traffic each one takes.

While this can be done manually, just increasing the weights in the VirtualService object by hand, we can also do it in an automated way. Istio integrates with specific CI/CD tooling like Argo CD, Flagger, and others, so we can specify deployment strategies for these canaries: I want to deploy this service gradually, this is the starting weight, the final weight, how the percentage will increase, the duration, and the tooling automatically performs those weight changes based on metrics. It can even do an automated rollback: when you deploy something and the metrics show it is misbehaving, the rollback is done automatically. That saves a lot of time and definitely prevents outages.
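As a sketch of the manual version, with hypothetical service and namespace names: a DestinationRule defines the two version subsets, and the VirtualService splits the traffic between them:

```yaml
# Two subsets of the same service, selected by the pods' "version" label.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: orders
  namespace: payments
spec:
  host: orders.payments.svc.cluster.local
  subsets:
  - name: stable
    labels: {version: v1}
  - name: canary
    labels: {version: v2}
---
# 90% of requests go to the stable version, 10% to the canary;
# promoting the canary means gradually editing these weights.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: orders
  namespace: payments
spec:
  hosts:
  - orders.payments.svc.cluster.local
  http:
  - route:
    - destination:
        host: orders.payments.svc.cluster.local
        subset: stable
      weight: 90
    - destination:
        host: orders.payments.svc.cluster.local
        subset: canary
      weight: 10
```

Tooling like Flagger automates exactly this loop: it edits the weights step by step, watches the Prometheus metrics, and shifts the traffic back if they degrade.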
Okay, so with all of that: we actually had more problems, but I think I have covered the main ones, and we learned a lot over this journey, so let me explain a bit about that. This transformation to microservices is a long road, but I really think it was worth traversing, because the results we are getting are really rewarding. We have seen more efficient development of services and more efficient deployments, and I think we have really gained control over our platform. It required a total paradigm and architectural change; it was not easy, and it needed to be spread across the whole company: infrastructure teams, development teams, QA, SRE, everything was affected. And it not only brought a lot of advantages but also some specific problems in terms of development, infrastructure changes, testing, and deployment. It was really a challenge to learn about everything and face all those problems, and this is where tooling like the Kong API Gateway and the service mesh provided great value for us.

Specifically, the service mesh was fundamental for us in service-to-service communication, security, and observability. We have given real importance to zero trust; we really focus on security and see it as key, and now the system is more secure and we have control over it, with the mesh being a key part of that. We also found important challenges in integrating microservices and in inter-service communication: sometimes this communication was hard to implement at first, but I think it is really the right way to go. And of course there is always room for improvement and learning, so this journey continues. Okay, so thanks for being here; it has been a pleasure. Now, if there is time for questions, please ask me whatever you need. Thank you.