Hi, thanks for coming. This talk is about why Rust is going to be a foundational technology for the future of cloud-native infrastructure. Before I get to that, let me introduce myself. My name is Oliver Gould. These are my dogs, and I'm the creator of a project called Linkerd, a service mesh that's been part of the CNCF since 2016. I'm also the CTO of a company called Buoyant, where we make Linkerd and some other infrastructure tools. Before that, I worked at internet companies like Twitter and Yahoo, really focused on production operations and infrastructure, and that's really the lens through which this talk is going to be delivered.

This talk is basically three parts. First, I want to take you through a brief history of the cloud from my perspective. Next, I want to get into the details of why I think Rust is so important to cloud technology, and then we'll wrap up with a quick tour of the Rust toolkit that we use in Linkerd. I want to emphasize, when we talk about the history, this is the history from my perspective; like all histories, it's subjective. If you've been in the industry through this whole time, you may have a slightly different perspective on things, and that's fine, but I think it's important to set the table for where we've come from before we talk about where we're going.

When I entered the industry and started working at Yahoo in 2007, Yahoo was a big old internet company. They had literally millions of physical hosts being managed. They were managed by dozens of hardware teams, people in data centers, people provisioning hardware, and also dozens of systems teams. Our team in production operations was one of many responsible for managing this fleet of hosts across the world. Because of all these legacy systems that accrue over time, it was extremely heterogeneous: lots of FreeBSD, Linux creeping in, OS versions and configurations proliferating in every different way. The idea here is that these were largely what we'd call pets and not cattle, really bespoke configurations across millions of hosts.

Also at the time, if you wanted to get hardware, if you wanted a new server, you had to go to something called the hardware review committee. This was literally a meeting with the CTO of the company, David Filo, where you'd justify your need for a server or for a fleet of servers. It was undoubtedly slow and laborious and a little bit stressful to get hardware. This was done to save costs and to make sure we were using things efficiently, but it's a really different way of acquiring hardware than we have today.

The first problem I started working on there, the problem I worked on through most of my time at Yahoo, was config management. Across these millions of hosts, how do we make sure they get security patches? How do we make sure that new users at the company get access to the hosts, or that when a user leaves the company they no longer have access? How do we manage a proliferation of configuration? This is a hard problem, and we worked on it for years. And it wasn't unique to Yahoo. Lots of companies at the time were going through similar problems; they all had hosts that needed to be connected to the internet securely. And so a proliferation of projects came up around this: first CFEngine, written in C a long time ago, and then in the mid-to-late 2000s we saw new projects coming online, mostly in Ruby or Python, projects like Puppet, Chef, Ansible, and of course others.
And the job of a config management system is actually kind of simple: they mostly have to run commands and generate templates. The scripting languages that were really coming into popularity at the time were a good fit for that; they're system scripting languages. The real job of config management systems is to make the host able to run an application, and frequently config management systems were responsible for actually deploying code, for getting application software ready to run. That meant there was a pretty tight coupling between your host database and your actual application. And it's not really where we are today, obviously.

Around the same time, really throughout the early 2000s, we saw a proliferation of virtualization technology coming online: projects like FreeBSD jails on the earlier side, then Xen and Solaris Zones, building up to Linux cgroups, which so much of what we're talking about today is built on. Really, the job of these virtualization technologies was to make systems multi-tenant. No longer do I have to have a single host with a single application or even a single user or customer; now I can run multiple operating systems on a single piece of hardware. And this really changed the game. Of course, all of this is very low-level operating-system code, written in C at best, but also with lots of assembly to get it done, because you're virtualizing hardware. You're actually mimicking what a machine does.

This gave birth to a new set of products and services that we call the cloud today. It really made the data center a service, or a product. EC2 was probably the first widely available one of these, and of course many have followed since. I remember having a conversation with a colleague at Netflix in probably 2009 or 2010, where he told me how Netflix was moving from their AIX mainframes to AWS. This made no sense to me. Again, I was working at Yahoo with these massive fleets of bespoke systems, and the idea that a growing, popular internet company would go to Amazon's infrastructure didn't make any sense to me. Boy, was I wrong, huh?

The great thing about this, as we all know, is that it made servers accessible. There's no longer a hardware review committee I have to go to to get a new server. There's no longer somebody at the data center I have to call to fix a server. I now just have an API and a credit card, and I get an internet-connected server. It means that as students, we can just get online and get server technology easily. And as startups and businesses, we get access to these things without having to sign data center contracts or take on any of the overhead and lead time that's associated with that.

The other interesting thing here is that Linux is really tied to this. Linux does not generally require licenses to operate, and so this was a great fit. I no longer have to get a Microsoft or whatever license to get into prod; I now just get a free operating system. I also no longer have to worry about the vast array of driver compatibility issues, which were kind of a headache before this. So now I can just make an API call, click a button, and I get a server that's ready to operate on the internet, which is great. It really reduces the barriers to getting involved in server technology. However, there are some downsides here.
We're still dealing with hosts. Even though they're virtual hosts, VMs, we still have the host as our primary abstraction, which means config management is still a big problem, which is why we had all these config management companies and products coming online. We also have little to no control over the hardware we're using, which means less reliability there. No longer are we trying to keep one super-powered machine online all the time; now we're dealing with a world where systems might fail and we can't call anyone to deal with it. We really have to consider variable performance. No longer do I have exclusive access to a machine; I might have other businesses running on this machine that are dealing with lots of traffic. These concerns end up changing how we actually think about operating services. We have a whole new set of failures, soft failures, more frequent failures, and this gives birth to a new testing methodology called chaos testing, also coming out of Netflix, probably in 2011 or so.

Building on that, by the time I'm at Twitter in 2010, we have a new set of technologies coming online that really decouple applications from hosts. No longer, as someone operating a service, do I have to think about config management or SSHing into a host or all the overhead of that. Now, through projects like Mesos and Aurora, I just want to ship my workload: I want to write software, build an artifact, get it running. This was really great for Twitter and other growing, scaling companies, Uber, Lyft, et cetera, where you now need to focus on developer productivity. We have hundreds of engineers; how do we get them to write software and ship it to production quickly without having to think about all the operational overhead? How do we stratify that and separate that?

The downsides of these projects were that they were really operationally complex. It was pretty hard, if not impossible, to run a full Mesos cluster on your laptop. You actually needed quite a bit of hardware to get started, or some pretty beefy cloud boxes. There's some overhead here; this is not a broadly accessible technology that you can just get started with. Also, we're dealing with a lot of JVM runtimes, which come with runtime costs, overhead in memory and CPU, and operational costs in terms of debugging GC and things like that. And in this world, we're dealing with highly dynamic systems, where Mesos may reschedule pods or instances without any user involvement. We have to deal with things like service discovery and load balancing and retries and timeouts and all the things that are necessary to manage services at this scale. At Twitter, we were working on a library called Finagle, and that's really what came to be the core of the first version of Linkerd. As we dealt with all of these production issues and worked on making communication more reliable in this library called Finagle, the idea with Linkerd was: how do we package that up into a proxy and make it accessible to folks who are not writing software with Finagle?

Following that, or around the same time, there was a new set of technologies coming online: what we call cloud native. It really starts with Docker in a lot of ways. Docker builds on Linux cgroups, that technology we were talking about a little bit ago.
Docker makes it possible, as many of you know, I'm sure, to package up an application and ship it somewhere and get it running with resource constraints. It pulls in parts of the config management story and isolates them into a binary that is almost a whole operating system running in a binary. Kubernetes extends that model and makes it possible to take a cluster of servers and run these Docker containers anywhere. With that, we have a heavy reliance on the network. What we call microservice architectures are tiny services that are distributed in a data center or in a cluster and communicate over the network. Tools like gRPC and Envoy and Linkerd fit into this world to focus on managing the complexity of a dynamic system. We deal with fault tolerance. We deal with the fact that we're going to have to load balance. A lot of these things, Kubernetes and Linkerd, and especially Docker as well, focus on user experience: on reducing the costs of managing it, of getting started, of understanding it, to make it accessible for application owners to get running. We focus on applications and not hosts. We've finally broken those barriers down.

Let me take a little detour and describe what Linkerd is, in case you don't know, and then we'll get into why this is so important for us. Linkerd is a service mesh. A service mesh is a pattern of deploying a rich data plane, generally as a sidecar proxy, that deals with this communication complexity. We have to deal with load balancing over a set of instances or a set of replicas in a cluster. I have to make sure that everything gets TLS'd by default, because I may not trust the network that I'm running in. I also want to have identity on either side of this; I want to know which workload is talking to which workload. That's easily done through TLS. So what we do is we deploy a proxy sidecar next to every application instance, and this helps manage communication complexity.

In Linkerd, there are two halves to this. We have a control plane, which talks to the Kubernetes API and deals with a lot of the configuration and discovery and the fact that things are dynamic, feeding that to the proxies. The proxies are supposed to be very lightweight, small instances, so that many, many of them can fit on a host to serve this traffic. We can look at it like this: we have the Kubernetes API, and Kubernetes is, of course, written in Go. We have the Linkerd control plane, which is also, today, written in Go. We chose Go for the control plane because it's so coupled to the Kubernetes API. We want to use client-go; we don't want to have to write a Kubernetes client from scratch and think about all the complexity that's in a Kubernetes client. This is, again, three-plus years ago when we were starting. We wanted Linkerd's control plane to feel like part of the Kubernetes ecosystem, so we chose Go for that. But when we went to write the proxy, the sidecar proxy, we chose Rust, and that has been a great experience. When we were starting, though, it was really rough around the edges. We had to bootstrap ourselves, we had to build lots of technology, and we invested heavily in the Rust ecosystem to make this work.

So why is Rust happening now? What is it about this moment? Why is Rust so appealing to us at this point in time? Rust gives us a bunch of primitives to build components. It's a programming language that focuses on safety and efficiency, and really on making developers productive.
I, as an engineer, can write a data plane proxy, a microservice proxy, and I can do that with high confidence that it's not going to have memory leaks or memory safety issues, and that it will do the job well. We want to use this to build cloud-native technology. Cloud-native systems are, again, dynamic, networked, fault-tolerant, and loosely coupled. These things end up aligning pretty closely, and I'm going to get into why.

But first, let me take you back to the OSI model. This is one of my favorite pictures; every talk I do has it. We look at the application stack, or the networking stack, and we see all these layers. But really, for applications, for the people building websites and user-facing applications, this is how the world looks. They should only really care about their application logic, whether it's tweets or pictures or payments or whatever. Maybe the presentation too, whether that's JSON or Protobuf, or the details of how it's rendered and shared. But everything beneath that is infrastructure. It's the cloud. And somebody has to build that stuff, and that somebody is us. Down at the bottom, we have the physical and link layers, which are really the domain of the cloud providers or the data center or hardware. And this middle glue layer is where we spend all our time as infrastructure developers. Things like Linkerd and Kubernetes really fit into this middle glue layer that is not talked about too much.

So what we're all becoming is systems programmers. Anyone working in the cloud-native space is really not an application developer. Applications are end-user-facing; systems programmers build software that supports applications. And generally, these things have to be highly trustworthy, meaning they're going to work safely and correctly, and they generally have pretty tight performance requirements. This is really where Rust fits in. Rust is a native language, meaning we actually compile to native code. We're not running in a JIT or a runtime VM, so we have access to low-level memory primitives and things like that. But we also need to do that safely, and that's where I think Rust really shines.

So I'm going to walk through some comparisons. I'm going to compare it to Go, because Go is what I know well and is really the state of the art in cloud native, and we'll talk about where Rust really makes improvements over the current state of things. Here's a really simple example: a function that fails, and I call it and ignore the error. This is a bug, right? If something fails, we should have to handle it. Rust makes that really easy, and it uses the type system to do it. One of the big advantages of Rust is a really nice type system, and types let us express constraints in a much richer way. Really, the goal of all of this is taking things that might fail at runtime, when the application is running, and making them fail before we even build the thing, before we test it, as early as possible, during the compilation phase. Rust really excels at this. So here, with the same function that just fails, if we ignore the error, Rust will actually emit a compiler warning. In Linkerd, this will prevent Linkerd from building, and we'll have to fix it before we go on. Another example is pretty similar: here is a place where I access an uninitialized value. We've had this type of bug in the Linkerd control plane or CLI countless times, more times than I care to admit.
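The slides aren't reproduced in this transcript, but a minimal sketch of the ignored-error pattern in Rust, using a hypothetical parse_port helper, looks something like this:

```rust
use std::num::ParseIntError;

// A function that can fail, returning a Result rather than an out-of-band error.
fn parse_port(s: &str) -> Result<u16, ParseIntError> {
    s.parse::<u16>()
}

fn main() {
    // Ignoring the Result triggers the `unused_must_use` warning, because
    // Result is marked #[must_use]. With warnings denied in the build, this
    // refuses to compile until the error is handled.
    parse_port("eight-thousand");

    // Handling it explicitly: both cases have to be dealt with.
    match parse_port("8080") {
        Ok(port) => println!("listening on {}", port),
        Err(e) => eprintln!("invalid port: {}", e),
    }
}
```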
This has been a big pain in my neck, and Rust again makes it simple with the type system. No longer can I access something that hasn't been initialized; if I try to do that, I'll actually get a compilation error. I have to deal with the fact that something may not be set. There's no null value in Rust; Option is the closest thing we have. Either it exists or it doesn't, but it's part of the type system. To fix this, I can get the same runtime failure that I would get in Go, but I actually have to document that: I have to call expect with an error message. So rather than just segfaulting because I did something dumb, Rust makes me deal with these things before I even compile.

Similarly, concurrency becomes a big issue, especially in a proxy like Linkerd. We have multiple connections and requests going at once, we're talking to the control plane, and there's lots of concurrent access. In Go, this can be quite dangerous by default. So here I've taken an example from the Go by Example website, where they demonstrate how to use mutexes, and I've just left the mutex out. Go will happily compile it, and it'll even run. If I run this thing for a second, it works just fine, which is great, right? Unless I run it for longer: I increase the runtime here to 10 seconds and all of a sudden I hit an error. This is completely nondeterministic. I can write tests that pass, and then when I ship it to prod, this thing can fail in an unexpected way.

This is virtually impossible to do in Rust. Here's the same code, effectively, written in Rust, and when I try to compile it, I'll actually get an error that says: hey, this map you built, you can't use it in multiple places at once; somebody has to own this thing. This idea of a borrow checker is really an ownership model for who owns memory, for what code is responsible for it, and it means I can't even compile this code in Rust, because the access patterns are unsafe. To fix it, I actually have to go and put a mutex in. This is effectively the same fix that should exist in the Go code, but the compiler enforces it in Rust. When I add the mutex, everything works as you'd expect, which is great. Rust has made my program safer without my even writing tests.

My last example is that Rust has this idea called RAII, resource acquisition is initialization. This deals with lifetimes, tying back to that borrowing and ownership model: when I drop something, it no longer exists. In the example on the left, we have Go code that sends two messages on a channel and then drops the sender, and another task that continually reads from that channel. If I run this, it runs until that error happens: we hit a deadlock and Go exits, which is great. I mean, Go should fail in this case, because that's the best it can do, but we can do better in Rust. In Rust, I have the type system again, and what the type system lets me do is get an optional value back. So there's no more runtime failure here; I can't even write the code in a way that doesn't work. I have to handle these conditions, and when I do, we see that we get a Some value back every time we do a read. This is an example of the type of safety net that we get from Rust, and we haven't even gotten to any of the details around memory access or Rust's kind of provable safety.
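The slide code isn't shown here, but a minimal sketch of that mutex fix, with a hypothetical shared map of counters incremented from several threads, looks something like this:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Shared mutable state has to be wrapped: Arc gives shared ownership
    // across threads, Mutex guarantees exclusive access while mutating.
    // Without them, the borrow checker rejects handing `counters` to
    // multiple threads at once.
    let counters = Arc::new(Mutex::new(HashMap::<String, u64>::new()));

    let handles: Vec<_> = (0..4)
        .map(|_| {
            let counters = Arc::clone(&counters);
            thread::spawn(move || {
                for _ in 0..10_000 {
                    // Lock, mutate, release (the guard drops at the end
                    // of the statement).
                    *counters.lock().unwrap().entry("a".to_string()).or_insert(0) += 1;
                }
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    println!("{:?}", counters.lock().unwrap());
}
```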
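And a minimal sketch of the Rust side of the channel example, using Tokio's mpsc channel: dropping the sender closes the channel, and the receiver gets an Option back, so the end-of-stream case has to be handled rather than deadlocking.

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<u32>(8);

    // Send two messages, then drop the sender. Dropping it (RAII) is what
    // closes the channel.
    tx.send(1).await.unwrap();
    tx.send(2).await.unwrap();
    drop(tx);

    // recv() returns Option<u32>: Some(value) while messages remain, None
    // once all senders are gone. There's no way to forget the closed case;
    // the loop just ends.
    while let Some(n) = rx.recv().await {
        println!("got {}", n);
    }
    println!("channel closed");
}
```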
These are all ways that we take failures that could happen at runtime, after I've shipped my software to production, when I hit a weird corner case where things can crash and break the rest of the system, and Rust lets us bring all of those types of failures back into the development cycle, so they have to be dealt with explicitly.

Finally, let me take you through a quick tour of Tokio's async ecosystem. Tokio is a Rust library that's kind of similar to Netty, if you're familiar with the JVM; it gives us asynchronous IO. When I start a program, I set up the Tokio runtime, and this lets us run IO concurrently without having to have a thread per connection. You can basically think of it as similar to Go's runtime, where Go has these green threads that let you run things concurrently and just block; Rust and Tokio give you a similar set of primitives. And Tokio has an ecosystem around it that really lets us build up systems on these good, trustworthy primitives.

The first one, which we've invested in heavily in Linkerd, and a lot of Tower's primitives come out of Linkerd, is a system called Tower, which is really similar to Finagle's services. It's a service abstraction where there's a request and a response, and a set of layers, or middlewares, that let us stack these things together so they can be used. I can write loosely coupled components and then bind them together. Here's an example from the Linkerd proxy. This is an HTTP client: for every endpoint we're talking to, we build one of these services, and it has an HTTP client with a reconnect layer, it lets us do Linkerd's tap feature, and it adds metrics. All of these features are orthogonal; they have no dependencies on each other, really. So I can write all of these separate modules that are easily tested, shared, and reused without having to couple them together. This is a really great building block. I should also emphasize that a lot of the primitives we've developed for Linkerd are freely available in the Tower library and framework. You can use, for instance, Linkerd's load balancer without having to pull in Linkerd; that's the tower-balance crate, something we've contributed back upstream. So this is, again, a set of reasonable components that you can use to build new systems.

Another library that we've been heavily involved in is something called Tonic. Tonic is a gRPC binding for Rust, again really bound up with Tower and Tokio's async runtime and async networking. Tonic lets me write little gRPC services. Here on the right is a load testing service that I wrote: we take a gRPC protobuf, implement the functions generated from the API, and now we have a network server. This makes it really easy to build microservices, or little pieces of services, in Rust, again with Tokio's async runtime.

Finally, I want to call out another library which is newer in this ecosystem, called kube-rs. kube-rs is basically client-go for Rust. What it gives us is Kubernetes API bindings and primitives, built on the Tokio primitives, that can be merged easily with Tonic gRPC services or Tower services.
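The Linkerd proxy's actual client stack isn't reproduced in this transcript; the following is a minimal sketch of the same layering idea using Tower's ServiceBuilder, assuming the crate's util, timeout, and limit features, with a trivial service_fn standing in for a real HTTP client:

```rust
use std::convert::Infallible;
use std::time::Duration;
use tower::{Service, ServiceBuilder, ServiceExt, service_fn};

#[tokio::main]
async fn main() {
    // A trivial inner "client": a Service is just an async function from a
    // request to a response.
    let inner = service_fn(|name: String| async move {
        Ok::<_, Infallible>(format!("hello, {}", name))
    });

    // Layer orthogonal behaviors over it. Each layer is independent and
    // reusable; the resulting stack is what callers use.
    let mut svc = ServiceBuilder::new()
        .concurrency_limit(16)              // bound in-flight requests
        .timeout(Duration::from_secs(1))    // fail slow calls
        .service(inner);

    let rsp = svc
        .ready()
        .await
        .unwrap()
        .call("world".to_string())
        .await
        .unwrap();
    println!("{}", rsp);
}
```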
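The load-testing service from the slide isn't shown here; this is a minimal sketch along the lines of Tonic's standard greeter example, assuming a helloworld.proto compiled by tonic-build (the module and type names below come from that assumption, not from Linkerd):

```rust
use tonic::{transport::Server, Request, Response, Status};

// Assumes a helloworld.proto defining a Greeter service with a SayHello RPC,
// compiled by tonic-build; the generated code lands in this module.
pub mod hello_world {
    tonic::include_proto!("helloworld");
}
use hello_world::greeter_server::{Greeter, GreeterServer};
use hello_world::{HelloReply, HelloRequest};

#[derive(Default)]
struct MyGreeter;

#[tonic::async_trait]
impl Greeter for MyGreeter {
    // Implement the generated trait method and we have a network server.
    async fn say_hello(
        &self,
        req: Request<HelloRequest>,
    ) -> Result<Response<HelloReply>, Status> {
        let reply = HelloReply {
            message: format!("Hello, {}!", req.into_inner().name),
        };
        Ok(Response::new(reply))
    }
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    Server::builder()
        .add_service(GreeterServer::new(MyGreeter::default()))
        .serve("127.0.0.1:50051".parse()?)
        .await?;
    Ok(())
}
```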
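And a minimal sketch of what kube-rs usage looks like, listing pods and their container ports. The prototype described next uses kube-rs's watch machinery to keep such an index up to date; a one-shot list keeps this sketch short, and exact APIs vary across kube-rs versions.

```rust
use k8s_openapi::api::core::v1::Pod;
use kube::{api::ListParams, Api, Client};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Connect using the local kubeconfig (or in-cluster config).
    let client = Client::try_default().await?;

    // All pods, across all namespaces.
    let pods: Api<Pod> = Api::all(client);
    for pod in pods.list(&ListParams::default()).await? {
        let name = pod.metadata.name.as_deref().unwrap_or("<unnamed>");
        // Collect the container ports declared on the pod spec.
        let ports: Vec<i32> = pod
            .spec
            .iter()
            .flat_map(|s| s.containers.iter())
            .flat_map(|c| c.ports.iter().flatten())
            .map(|p| p.container_port)
            .collect();
        println!("{}: {:?}", name, ports);
    }
    Ok(())
}
```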
And here is an example from a prototype I'm building, which watches all the pods in a cluster and indexes which ports are available on those pods. This is something I'm really excited about, because it means that as we replace or add new controllers in Linkerd, we can start doing them in Rust, where several years ago this was totally not possible. Now we have a rich ecosystem of projects around Rust, and around Tokio specifically, that we can use to stamp out new infrastructure code that is going to be much safer, that we will be more productive writing, and that we'll generally have a much easier time with.

So, in summary: cloud computing creates new, ubiquitous abstractions. We no longer have to deal with managing hosts or acquiring hardware in anything like the way we did a decade ago. Now we have Kubernetes APIs, and we have lots of glue beneath the application, and we need that glue to work well. We've all become systems programmers; anyone who would have been in operations a decade ago is basically a systems programmer now. But scaling that out by having an industry of people writing C would not be great. It hasn't been great: we have security vulnerabilities, we have safety issues, and C has a pretty steep learning and development curve. Rust makes this way more accessible. One of the things I'm most excited about in the Rust ecosystem is the number of young engineers getting involved. People in school or just out of school are really gravitating towards Rust, and I think the industry is going to be transformed by this. We're going to have a much richer, more reliable systems ecosystem that's built on Rust.

Finally, this wasn't possible a few years ago. There's been a tremendous amount of investment: our team has invested heavily in Rust and these ecosystem libraries, and folks at Amazon and Microsoft and Google, you name it, have been investing in Rust. I think this really points to a future that is going to be much safer, more efficient, better for the environment, and more reliable. And so, finally, thanks for coming. I hope this talk was useful, and I hope you enjoy the rest of the talks today. Have a good one.