Good morning everyone. We are going to start, so give it up for Gijs Molnar from Spotify, with gRPC. I would also like Marcel Klassen, if he's here, to make himself known to me please. Thank you. Perfect, thank you. Gijs, yeah? Perfect.

Thank you for having me this early. I'm fresh, you're fresh, so that's good. I'm going to talk about proxyless service mesh, but I'm going to talk specifically about migrations. So, I'm Gijs, I'm Dutch, if that's not obvious. I've worked at Spotify for about ten months as a systems engineer, and since the beginning of this year I'm also tech lead of the Hermes deprecation. I'll talk a bit more about what Hermes is and all the pain points we have with this deprecation project. Before that I was a research engineer, also a contractor, and I worked mostly in science, specifically in astronomy. I spent a lot of time in South Africa working in radio astronomy there, where they've been building a huge radio telescope called the Square Kilometre Array. My research was specifically in large-scale, containerized data reduction pipelines; I did a PhD in that. Whether it's a good idea to build these pipelines that way is a different question, but that's why it's research, right?

Many talks have a quote from some person in them, and I thought it was a good idea to have one too. I'm quoting Warren Buffett: it's good to learn from your mistakes, but it's better to learn from other people's mistakes. I'm the tech lead of this migration now, so I was looking around at migrations that didn't go flawlessly, or didn't go very smoothly. In science there are many, but I don't want to bore you with some obscure radio astronomy package that we tried to migrate away from. So I want to focus a bit on Python. I don't know if everybody here is old enough to remember this; I think so, if I look around. The migration from Python 2 to Python 3 didn't go very well. It took way longer than Guido van Rossum, who is also Dutch, anticipated. So, to quote Guido van Rossum: "I had no idea it would be so hard or take so long. I still think it was the right thing to do, but it's clear I underestimated the magnitude of the task." Poor Guido.

So why do people want to migrate? Often you migrate to something that has improved features, or new features you want to use. It could be that there's better support, paid support, or a community of people you can ask questions about this new thing you're using. It could be compatibility: the new thing is far more compatible with all these other things that are also new now. Your community could be larger too, so you have more people to ask questions, but you can also hire people more easily: new people coming into your company don't need to be taught all this old, obscure technology, because they already know the new technology and can start quicker. But there's also a natural resistance within every company that has an older stack. It could be that your software is very complex and very hard to adjust, and that's very much the case in radio astronomy, where you have these very old packages from the 70s written in Fortran that you need to modify, and there are not very many people who can do that.
It could also be that you have a lack of resources, in money, in people, or in time, and migrating just isn't a priority right now. We see that a lot within Spotify: we know it's important, but there are a lot of other important things happening, so this migration doesn't float to the top. And it could also be that a lot of people just think it's good as it is now; you don't need to change anything because it just works.

If you put this in a plot, with the latest and greatest technology on the right: it has all these cool new features, but it's very hard to migrate to because it requires a lot of changes, and I would argue Python 3 was somewhere over there. On the left there is technology that might actually be easier to migrate to, but it has similar features to what you already have, so it's just not that interesting and there's no real need to migrate, because you're already sort of there. You want to be somewhere in the middle.

So now let me move back to Spotify. Spotify was the first paid streaming platform, pioneering in this field, and they were among the first to solve these problems. They needed tools that didn't exist yet, so they had to build those tools. With that comes great creative freedom, right? You can build whatever you want because there's nothing there yet, and very smart engineers before us made all kinds of very cool things: encryption in online streaming, compression, all very much state-of-the-art technology. The tradition within the company was always to adopt the latest things available and utilize them to their full potential. It's the same with the microservice architecture: as soon as Docker was invented, or became popular, Spotify was one of the first to go all in on that kind of technology. So there are a lot of these old technologies we invented ourselves. I'm not going to mention all of them, but there are three I want to zoom into, because they have a close relationship with containerized platforms: Hermes, Helios and Nameless.

Hermes is a protocol invented by Spotify a long time ago. It behaves a bit like HTTP, but HTTP was just not good enough and HTTP/2 didn't exist yet; HTTP/2 solves many of the problems Spotify saw with HTTP for backend-to-backend communication. So it's home-built, built on ZeroMQ and protobuf, which had just been released by Google as version 2 back then, and it supports bundling multiple requests and handling them out of order, which gives way less overhead, better performance, and saves money and data usage.

Another technology we made ourselves is called Nameless, and it is a service discovery service. It's based on DNS SRV records. DNS SRV records are not that popular, but they're similar to CNAME records in that they give you a list of host names when you resolve a service, except they also include a port number. All internal Spotify services register per region with a Nameless instance, and other services can then query that Nameless service to get all the hosts that offer a specific service. We then use client-side load balancing: the client decides which host is going to be picked, which is usually round-robin.
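To make that concrete, here is a minimal sketch, not Spotify's actual implementation, of what Nameless-style discovery looks like with standard tooling: resolve a DNS SRV record and round-robin over the returned host and port pairs on the client side. The record name is made up for illustration and would not resolve as written; it assumes the dnspython library.

```python
# Minimal sketch of SRV-based discovery with client-side round-robin.
# The record name is hypothetical; real Nameless records are Spotify-internal.
import itertools

import dns.resolver  # pip install dnspython

# An SRV lookup returns host names *and* port numbers, unlike A/CNAME records.
answers = dns.resolver.resolve("_playlist._tcp.services.example.net", "SRV")
endpoints = [(str(r.target).rstrip("."), r.port) for r in answers]

# Client-side load balancing: the client itself picks the next endpoint,
# here with a simple round-robin over the resolved list.
round_robin = itertools.cycle(endpoints)

for _ in range(3):
    host, port = next(round_robin)
    print(f"would send the next request to {host}:{port}")
```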
Nameless operates per region, and we have multiple regions around the world. If some services, for some reason, don't run in a specific region, we can manually redirect traffic from one region to another, but that is a manual, or at best semi-manual, task. It's also important to know that Nameless works across different environments: it works with classic VMs and with Kubernetes hosts, on Google and on AWS. It's completely independent, because it's just DNS.

The third one is Helios, which is our own homebrew container platform, and we're very proud to mention that we recently discovered something: a colleague of mine did some archaeology in the programs of old Docker conferences, and we discovered that Helios was actually announced one day before Kubernetes. Back then they were functionally very similar: both are platforms where you can schedule containers, schedule them together, expose them on a network, and drive them through an API. But Kubernetes quickly became way more popular and took over. Helios is still used today, but we're migrating away from it.

These technologies brought us to where we are now: a worldwide company leading the streaming business, we're everywhere. But all this technology just cannot keep up with the growth we're seeing. Another problem is that all these things come with overhead. You're the first mover, you create something, and then other people learn from your mistakes: they create something better, more people start using it, it has more momentum, and before you know it you have your own do-it-yourself technical debt. The company also keeps on growing, and all the technology we've built ourselves only goes so far; you have to keep adjusting it so it can keep up with the scale. Many of the tools we created don't have proper third-party support. For the protocol, for example, there are no packet sniffers that understand something like Hermes. And many new developers, including myself, are just not familiar with these things. I joined Spotify ten months ago and I have to learn all of this and reverse-engineer my understanding of something like Helios. So we want to move away from Helios; I mean, I already explained it: Kubernetes has way more momentum, Kubernetes is all around us, and all the big cloud providers offer it as a service, so that is more or less a solved problem.
Hermes is also Spotify-only, so it lacks that third-party support, but most importantly it lacks proper support for modern load-balancing techniques and it is not zone-aware. Or rather, it wasn't; I'll say a bit more about that. And Nameless just doesn't scale; it's hard to add new features, mostly because the response size of a Nameless response packet is limited to 64 kilobytes, which is absolutely tiny these days. What this means is that if you use Nameless and you want to discover, for example, the Spotify playlist API, you get back one response with all the host names that represent that service in a region, but that whole list can only be 64 kilobytes in size. In practice that means we can return about 800 to 900 host names, and only host names: you cannot add additional information, or you have to reduce the number of hosts, and that becomes a problem when you have this many hosts. So we're really pushing the limits of DNS there, and there are just better technologies available now. Another problem with DNS is that it's actually quite slow: there is a lot of caching happening, and it's often unknown where the caching happens. You can play with that by changing your TTLs and things like that, but it's just very hard to debug your problem.

Another problem we have is cross-zone traffic. If you go to your favorite cloud provider and create a new Kubernetes cluster, they advise you to just use three zones, fire up some VMs in those zones, not really worry about it, and deploy your pods randomly across them. There is a cost for inter-zone traffic, about one cent per gigabyte, which all sounds fine, but it becomes a real problem if you don't think about it and start transferring petabytes or exabytes between your zones: at a cent per gigabyte, a single petabyte of cross-zone traffic already comes to roughly ten thousand dollars. We sort of made that mistake by growing without really thinking about it; we planned to migrate, but the migration took longer, and before you know it this becomes really expensive. So we also want to save money, and for that we need zone-aware routing. This was hard to do with Nameless, but I have a very smart colleague who hacked zone-aware routing into Hermes, and we're actually deploying that as we speak. We're saving money as we speak; we merged a couple of pull requests yesterday. But we cannot keep doing this; we cannot keep hacking these protocols while more robust, future-proof technologies are available.

So now, on to service mesh. For people who don't know what a service mesh is: it's a different way to manage the traffic going through your cluster. Traditionally you have your Kubernetes clusters and your pods, and if one service wants to communicate with another service, it just does: the service has a name you can resolve, and there are a couple of pods behind it. A service mesh introduces another abstraction layer into this whole contraption: a dedicated infrastructure layer, typically manifested in the form of a proxy, and in practice this is almost always Envoy. This proxy downloads its configuration from, or communicates with, a central configuration entity called the control plane, Traffic Director in this picture, and the control plane configures the proxy. What this means is that clients can be steered to specific backends in a specific zone, cluster, or region.
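As a rough illustration, here is a minimal sketch of what a control-plane-driven gRPC client can look like in Python, the "proxyless" flavor I'll get to in a moment. The service name and bootstrap path are assumptions for illustration, not Spotify's actual setup, and it assumes a recent grpcio release with xDS support plus a bootstrap file that tells gRPC how to reach the control plane.

```python
# Minimal sketch of a "proxyless" gRPC client configured by an xDS control
# plane such as Traffic Director. Names and paths here are illustrative only.
import os

import grpc

# gRPC finds the control plane via a bootstrap file (this path is an assumption).
os.environ.setdefault("GRPC_XDS_BOOTSTRAP", "/etc/xds/bootstrap.json")

# The xds:/// scheme tells gRPC to ask the control plane for endpoints,
# routing rules and load-balancing policy instead of doing a plain DNS lookup.
channel = grpc.insecure_channel("xds:///playlist-service")

# From here you would create a generated stub on top of this channel and make
# calls as usual; the control plane decides which backend each call hits.
```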
This gives you way more control over your traffic routing: you can do zone-aware load balancing and things like that, fail over to different regions, and if certain zones get very expensive you can play a bit with the weights and steer traffic toward certain zones or regions.

Traditionally a service mesh has been realized with a proxy, but the dream is the proxyless service mesh, and that's where we're heading. What you can do is just get rid of this proxy, but to do that you need a client library that is able to communicate with the control plane. Luckily, all the modern gRPC libraries basically support this, so for us it's now just a matter of switching fully to gRPC. Traffic Director is the managed service-mesh control plane from Google; an alternative is Istio, I hope I pronounce it right, I always get a bit confused there. So you might ask why we don't use that. The problem with Istio is its multi-cluster support: we have many clusters per region and we need better support for that, as well as the more advanced routing techniques that Traffic Director does support. We also need global routing: certain services are bound to certain regions, for legal reasons for example, and traffic needs to be redirected between them. And we're still not fully converted yet; we still have significant infrastructure running on Helios.

So why would you want to go proxyless? Well, you just have fewer moving parts: you don't have a proxy, you don't have that mental overhead. And we actually already do this. As I mentioned before, we already do client-side load balancing, where our Hermes client gets these Nameless records in and decides who to talk to. It's just a bit dumb this way; we can make it a bit smarter. So this whole architecture is philosophically compatible with what we've already been doing, and everybody is reasonably comfortable with it.

So what is stopping us in this whole story? The Hermes-to-gRPC migration is something we have been slightly underestimating; it's actually quite hard to migrate a protocol. I mean, I've not been involved with the Helios deprecation effort, but that seems much easier: you have a container and you move it to a different platform. Here you need to modify the contents of the containers, and also the protocol the containers use to talk to each other. So there's reasonable, natural resistance within the company. It's not that people don't want to; it just takes time and effort to move, and to help all these teams and all these services integrate a new protocol into their infrastructure. Another problem is that we need to regain trust after a massive incident, and the massive incident I'm talking about gave us the honor of being the number one incident on Downdetector last year. In March 2022 we had a global incident: Google accidentally pushed a faulty configuration to Traffic Director, which was propagated to the whole infrastructure. In combination with that, there was a bug in a library that we use, which propagated the error even further. The result was that basically every service within Spotify started returning 404, and as a security measure all the clients were locked out. We have a perimeter security check, and when people start accessing things they're not allowed to, they get locked out. I don't know if you ever realized it, but one of the things about Spotify is that you're never logged out: nobody ever needs to log back in or type their password.
And the problem is that nobody knows that password. So this problem propagated through the whole company: suddenly you have all these clients, but also artists who are worried about their income, resetting passwords, contacting support, and for weeks there were shock waves going through the company. So the rollout of Traffic Director, and I mean there were only very few services migrated back then, but some very important services had been migrated as a trial project, that rollout got rolled back. Afterwards we sat down with Google and said: look, we really like Traffic Director, we want to use it again, but we need to prevent this from happening again. Google sat down, looked at what happened, and came up with a couple of solutions to address these issues. One is simply to make it harder to make this mistake, to push a faulty configuration or break the configuration. Another is that they introduced canary rollouts per region, and they improved the observability of what's happening, so you get early detection if anything is going wrong.

So, to bring everything together: to adopt service mesh we need gRPC, and gRPC is the big piece. We should not underestimate the migration, and just asking people internally doesn't work either. Spotify is a community of people, and just asking nicely doesn't help; you need to create a bit of an incentive. That's what I'm working on now. I'm not expecting miracles, but I just want to make it the obvious next step, and we're working on that with certain teams: we're slowly migrating very important services and hoping that has a sort of network effect within the company. I think we will get there. And that brings me back to the quote I showed from Guido van Rossum: actually, that's a fake quote. It was created by ChatGPT; he didn't say that, but he could have said it. So anyway, that was my talk. Thank you. I'm open for questions; I'm not sure I can answer all of them. Are you going to point, or should I?

Thank you very much, Gijs, that was amazing, and I really thought that Guido said that. There we go. Does anyone have questions for Gijs? Yes.

Thanks for the talk. Back in the days before service mesh existed, Netflix had a library called Hystrix, and they moved away from it because it was forcing them to do everything in Java and use that library, which caused tight coupling. Now you're moving in exactly the same direction: what has changed?

Sorry, can you repeat that one part?

Back in the day, Netflix was using Hystrix as a library to do the communication between the services, service discovery, and all the things that a service mesh does nowadays, and they moved away from it because they didn't like that all of their microservices had to be programmed in Java and had a tight coupling with that one library.

Right. gRPC is a protocol with implementations in almost all languages, so we're not tightly coupled to Java. And we're investigating other languages as well; we're mostly Java-based now, with some Python here and there. But the beauty of gRPC is that you can write these protobuf definitions and generate code in almost every language, so we are language-agnostic there.
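As a hedged illustration of that last point (the service and message names below are made up, not Spotify's actual APIs): you define the service once in protobuf, run the code generator for each language, and call the generated stub. The Python side might look roughly like this, assuming grpcio-tools has generated playlist_pb2 and playlist_pb2_grpc from a hypothetical playlist.proto.

```python
# Illustrative only: assumes a hypothetical playlist.proto along the lines of
#
#   service PlaylistService {
#     rpc GetPlaylist(GetPlaylistRequest) returns (Playlist);
#   }
#
# compiled with grpcio-tools into playlist_pb2 / playlist_pb2_grpc.
import grpc

import playlist_pb2
import playlist_pb2_grpc

channel = grpc.insecure_channel("localhost:50051")
stub = playlist_pb2_grpc.PlaylistServiceStub(channel)

# The same .proto generates equivalent stubs for Java, Go, C++, and so on,
# which is what makes the protocol language-agnostic.
response = stub.GetPlaylist(playlist_pb2.GetPlaylistRequest(playlist_id="abc"))
print(response)
```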
Any more questions for Gijs? No? Well, thank you very much, Gijs. Give it up for Gijs, please. Thank you. So, we were fifteen minutes late with everything this morning, then we switched to being ten minutes late with everything, and now Gijs just bought us back five minutes, so we're in a regression state, that is fantastic; eventually we're going to get to being on time. We are going to have a little break now and start again at ten past eleven, so ten minutes behind the schedule. Then we are going to have the security talks with Marcel Klassen, and I would like to have Marcel Klassen make himself known to