Welcome, everyone. Thanks for attending the Open Source Summit. This is my first time speaking at this conference, though I have spoken at other conferences previously. I work at Indeed.com. For those of you who don't know, Indeed is the number one job search engine in the world. We aggregate jobs from across the web and enable job seekers to search and apply to jobs across the world. We're available in 60 different countries and offer a wide variety of languages. My name is Maya. I'm a senior engineer working on service infrastructure at Indeed. This includes everything from the mechanisms that drive how our systems communicate with one another to the actual runtime platform that backs our ecosystem. One of my teams, the service architecture team, is responsible for the delivery of tools, systems, and services that help drive our service-oriented architecture.

This talk is largely focused on how we went about vetting this kind of migration, and why we would even make such a migration. I'll provide a little bit of background about how Indeed has historically written services, as well as the motivation and end goal of this kind of migration. I'll give an overview of a test harness that we built to help address some of the concerns that our product teams and engineering managers across the company had when it came to this adoption. I'll show you some examples of how we built baselines from our existing production ecosystem to ensure that the migration would be safe and successful, and then, time permitting, simulate a few workloads. It's a much smaller audience, so I might go through things a little bit quicker to enable more of an open-forum discussion at the end. We will also look at some of the things that we learned along the way.

Historically, Indeed has built its systems using Boxcar, which is an internal distributed services framework. It's largely based on protocol buffers and offers a code generation layer. It works by load balancing a set of operations across fixed-size pools of TCP connections, and we only really have great support for it in Java. Two years ago, we first started considering migrating off of Boxcar. We have historically only had a handful of people working on the framework internally, so being able to leverage something like gRPC, which is an open source solution, gave us a lot of value. In the course of using Boxcar, we had also found ourselves reinventing the wheel when it came to things like circuit breaking and client-side load balancing, and so we started to consider the use of a service mesh as well.

During the summer of 2017, we evaluated a bunch of the different service mesh solutions out there, and the one we ended up going with was Linkerd. This was Linkerd v1, the JVM-based version, not the rewrite that they have since done. Later that year, I spoke at KubeCon, describing how we were able to leverage some of these pieces of technology to enable communication with our Java-based Boxcar services. In 2018, something interesting happened: I actually left Indeed for about seven months. Since I was one of the main technical leads pursuing this initiative, the project stalled a bit, and the team had to work really hard to backfill a few roles while I was out. Upon my return, I discovered that we had actually switched technologies: we went from using Linkerd 1.1 to using Envoy because of its lower resource footprint.
We did evaluate Envoy in that summer of 2017, but there were some performance issues at the time when it came to handling the large number of services that we have running in our ecosystem. Since then, the team has been hard at work driving that solution across our ecosystem.

Now, some people might ask why we would do such a thing, where we've taken our existing infrastructure, almost thrown it out the window, and started over. For one, Indeed started to leverage Python in more key business areas. Because of the way Boxcar was written, we couldn't really port it to Python all that well, and so we wanted to consider a solution like gRPC, which would give us more native support for Python. Envoy would fill in some of the other gaps that we had around thick client-side load balancing, as well as common circuit breaking patterns. Together, these two pieces of technology also let us transparently handle encrypted HTTP/2 connections. Applications would be able to continue to speak the protocols that they were very familiar with, whether it be HTTP or even plaintext HTTP/2, and Envoy would encapsulate the encryption and the long-lived TLS connections. In our Apache Mesos ecosystem, both the application and the Envoy sidecar ran within the same container; as we started to shift more and more components to Kubernetes, each of these processes would run as its own container within the same pod.

Once we built our initial integration, my team went out and started working with different products and technical leads at the company, trying to get them to adopt or consider the solution. After presenting to a few of our larger products, they were comfortable with the idea of adopting a service mesh if it meant that their application logic was easier to reason about, but they were concerned about the underlying performance of the two different protocols. At the time, we had some anecdotal evidence that proved out our integration and showed that performance was relatively comparable, but it wasn't really the traffic or the numbers that these business-critical systems were looking for. And so we built a system called Hyperloop internally. It was originally designed to exhaustively test new versions of Boxcar before releasing them to the organization at large, but with a few modifications, we were able to leverage this system to directly compare Boxcar and gRPC. We used Prometheus to monitor the underlying infrastructure. With the way that our current applications work today, we emit all of our performance metrics via StatsD, and we were able to leverage the StatsD exporter from the Prometheus community to translate those metrics into something that Prometheus could scrape.

Hyperloop itself has two main components: a service and a consumer. The service is pretty simple; it's just a ping-pong style service that receives a request, sleeps for some configured duration, and then returns a response. The client can be configured to induce different loads on the service, and there are a few different knobs that can adjust that. With a few more modifications, the same ecosystem was able to test out the actual underlying service mesh integration as well. While we didn't necessarily perform this head-to-head analysis in the case of this test, we do run the triad of systems in our QA and production environments: one setup for Boxcar, one setup for gRPC, and then one setup for gRPC over the service mesh.
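To make the harness a bit more concrete, here is a minimal sketch of what a Hyperloop-style ping-pong service could look like in gRPC Java. The `PingPong` proto service and its generated `PingPongGrpc` classes are hypothetical stand-ins, since Hyperloop itself is internal; only the sleep-then-respond behavior comes from the talk.

```java
// Hypothetical proto: service PingPong { rpc Ping(PingRequest) returns (PingReply); }
import io.grpc.Server;
import io.grpc.netty.NettyServerBuilder;
import io.grpc.stub.StreamObserver;
import java.util.concurrent.TimeUnit;

public class PingPongService extends PingPongGrpc.PingPongImplBase {

  private final long sleepMillis; // configured "work" duration per request

  PingPongService(long sleepMillis) {
    this.sleepMillis = sleepMillis;
  }

  @Override
  public void ping(PingRequest request, StreamObserver<PingReply> responseObserver) {
    try {
      // Simulate server-side work, e.g. the 11 ms sleep used in the trials later in the talk.
      TimeUnit.MILLISECONDS.sleep(sleepMillis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    responseObserver.onNext(PingReply.getDefaultInstance());
    responseObserver.onCompleted();
  }

  public static void main(String[] args) throws Exception {
    Server server = NettyServerBuilder.forPort(9090)
        .addService(new PingPongService(11))
        .build()
        .start();
    server.awaitTermination();
  }
}
```

The consumer side would then just spin up a configurable number of worker threads, each issuing batches of `ping` calls against this service and recording the observed request rate and latency.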
One note here is that Boxcar would not be able to run over the service mesh, as it requires direct management of the underlying TCP connections to ensure load balancing happens properly across our ecosystem. When it came to tuning and adjusting these underlying systems, Boxcar was really easy: clients would just set whatever number of connections they needed to maintain, whereas servers would set the number of connections that they were willing to accept.

gRPC, for those who are unfamiliar with some of the underlying mechanics, specifically within the gRPC Java implementation, has two key components on the client side. The first is an executor, which isn't pictured in this diagram. At first it wasn't really clear what it was responsible for, but after working with the gRPC team at Google, we learned that it was primarily responsible for communication with out-of-band load balancers. So this is more or less the gRPC load balancer, which has eventually been replaced with xDS today. The second component is the event loop group. This is more Netty-specific; it handles writing all of the requests to the server and reading all of the responses coming back. If you're using the Google APIs extensions library, you can also choose to leverage a channel pool. This is a way to scale out the number of connections that you open to a backend within gRPC. By default, gRPC will only open a single connection per backend, and a channel pool gives you a way to improve throughput in the case of having a fixed number of backends.

On the server side, we have three more components. The first is the boss event loop group. It is responsible for accepting all new connections coming into the server, performing the handshake if you're using TLS, and then handing the connection off to a pool. Because of the light demand on this component, we didn't really have to worry about tuning it as we moved through our testing, but it's something you might want to be aware of if you have a system that needs to manage a high frequency of new connections. The second component is the worker event loop group. It's the server-side equivalent of the client-side event loop group, handling reading the requests coming into the service, handing them off to the executor, and then writing the responses back. Finally, the executor is the other core component. It's responsible for directly invoking the business logic on your server side, and one thing that's important to note here is that it's actually server implementation agnostic (a rough sketch of where these knobs live follows below).

Once we knew the components that we wanted to tune and adjust within the underlying ecosystem, we started to build a baseline off of our production workloads. Some of the top bits of information that we wanted to know about were the scale of requests and their distribution across our ecosystem. Using some metrics that we were already emitting to Datadog, we could graph the request rates of servers against the total number of connections that were established to them. On the x-axis, we have the number of connections that are currently established, and on the y-axis, we have the corresponding request rates. We can see some of our higher-end services performing about 2,400 operations per second. But one of the nice things about this graph is that it shows where the majority of our services actually do their work, which is at less than 500 operations per second.
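For reference, here is a rough sketch of where the client- and server-side components described above get configured, assuming grpc-java's Netty transport (the exact builder methods have shifted a bit across versions). The thread counts and the reuse of the hypothetical `PingPongService` from the earlier sketch are illustrative only, not our production values.

```java
import io.grpc.ManagedChannel;
import io.grpc.Server;
import io.grpc.netty.NettyChannelBuilder;
import io.grpc.netty.NettyServerBuilder;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import io.netty.channel.socket.nio.NioSocketChannel;
import java.util.concurrent.Executors;

public class GrpcTuningSketch {
  public static void main(String[] args) throws Exception {
    // Client side: the event loop group does the socket I/O (writing requests, reading
    // responses), while the executor handles callbacks and out-of-band work such as
    // talking to an external load balancer.
    ManagedChannel channel = NettyChannelBuilder.forAddress("localhost", 9090)
        .channelType(NioSocketChannel.class)
        .eventLoopGroup(new NioEventLoopGroup(1))
        .executor(Executors.newFixedThreadPool(4))
        .usePlaintext()
        .build();

    // Server side: the boss group accepts (and TLS-handshakes) new connections, the worker
    // group does the request/response I/O, and the executor invokes the service's business
    // logic regardless of which server implementation backs it.
    Server server = NettyServerBuilder.forPort(9090)
        .channelType(NioServerSocketChannel.class)
        .bossEventLoopGroup(new NioEventLoopGroup(1))
        .workerEventLoopGroup(new NioEventLoopGroup(2))
        .executor(Executors.newFixedThreadPool(8))
        .addService(new PingPongService(11)) // hypothetical service from the earlier sketch
        .build()
        .start();

    server.awaitTermination();
  }
}
```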
Similarly, we were able to construct this visualization for our client side, because the number of client instances matters too: clients might issue requests at different rates than servers are able to receive them, and so we wanted to be able to answer this side of the problem as well. Again, the number of connections a client had open is on the x-axis, and the corresponding request rates are on the y-axis. One caveat, or I guess one callout, here is that each dot in both of these graphs represents a single application in our ecosystem. So you can see, in the far upper right, we have a single client that issues about 4,500 operations per second. What this effectively did was give us a great way to prepare a mental model that described our ecosystem well. We knew a few of the extremes, the busiest server at about 2,400 operations per second and the busiest client at about 4,500, but we also knew that the majority of our systems fell below 500. And so it gave us an easy target to shoot for in our evaluation.

The last thing that we wanted to ask about was latency. Under these request loads, what latency did we see when it came to just raw wire time, excluding whatever the server itself was doing? When we started to compute that, what we found was that the latency was highly dependent on the ecosystem that we were running in. Cloud providers make different guarantees around what kind of latency you'll see at different points in time, and so we couldn't really use that as a way to establish a baseline. Instead, we just went with more of a direct comparative analysis, and then sanity checked ourselves when we got into our QA and production ecosystems.

From there, we were able to start simulating workloads. One of the big things that we needed to do was figure out how to size the event loop groups. After digging through code, we learned that the client-side event loop group and the server-side event loop group were dynamically sized off the number of cores allocated to the process, except that in certain versions of Java, that check wasn't properly respecting cgroups. So you would ask, what number of cores am I allocated, and you'd actually get back the total number of cores on the underlying host. To mitigate this, we just assumed a single core: we programmatically set one event loop thread on the client and two on the server.

With some minimal configuration, we were able to replicate most of our workloads at Indeed. We had 32 worker threads that were each issuing two calls in a batch, and then repeating that process until the client-side process had terminated. Both gRPC and Boxcar were able to match their underlying request rates, and the 90th percentile client-side response times were identical. It's important to note that the server in this case had been configured to do an 11 millisecond sleep, and so the client actually saw zero milliseconds of wire time. There may have been some lower thresholds that we could have gone to, but the unfortunate part about our ecosystem was that we only emit this information at millisecond granularity. We continued on and doubled the number of calls we were performing in each batch, again being able to match the request rates between Boxcar and gRPC, and again seeing the exact same response times. Then we increased our batch again from four to six.
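The cgroup pitfall mentioned above is worth making concrete. A minimal sketch of the workaround, assuming the same Netty-based builders as in the earlier sketch: don't trust the JVM-reported core count, size the groups explicitly.

```java
import io.netty.channel.nio.NioEventLoopGroup;

public class EventLoopSizing {

  public static void main(String[] args) {
    // On older JVMs that are not cgroup-aware, a container limited to one or two cores
    // can still see the full core count of the host here, which in turn inflates
    // grpc-java's default event loop sizing.
    int reported = Runtime.getRuntime().availableProcessors();
    System.out.println("JVM-reported cores: " + reported);

    // Instead of trusting that number, pin the sizes explicitly. In the trials described
    // above this was one event loop thread on the client and two on the server; these
    // groups would then be passed to NettyChannelBuilder.eventLoopGroup(...) and
    // NettyServerBuilder.workerEventLoopGroup(...) as in the earlier sketch.
    NioEventLoopGroup clientLoops = new NioEventLoopGroup(1);
    NioEventLoopGroup serverWorkerLoops = new NioEventLoopGroup(2);

    clientLoops.shutdownGracefully();
    serverWorkerLoops.shutdownGracefully();
  }
}
```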
And this time we were again able to match request rates, but we finally saw the first point where the response times were different: gRPC had seen a wire time latency of about one millisecond, whereas Boxcar still had not. We increased again from six to eight, and now we finally saw a difference in the actual request rates that gRPC and Boxcar were issuing. Boxcar was still able to handle 2,555 operations per second, whereas gRPC lagged behind by four ops. And in this configuration, we actually saw a little bit more latency: Boxcar started to see about one millisecond of latency and then eventually got a little bit better, whereas gRPC saw more like one to two milliseconds. We increased again to 10, and Boxcar continued to pull ahead, at 2,909 operations per second, whereas gRPC fell a little bit further behind at 2,671, hanging a little closer to that two millisecond mark.

But what was interesting about these trials was that in the last few steps, we saw smaller and smaller deltas between each iteration. In the first three, we saw differences of about 600 operations per second, but in the last three, we were only seeing 100 to 200 operations per second of difference. And this pointed to a bottleneck that we had in our underlying test implementation. If you think about it, there are 1,000 milliseconds in a second; with each call lasting about 11 milliseconds and 32 workers each issuing calls, the maximum rate that Boxcar could have produced was about 2,909 operations per second. With gRPC at 12 milliseconds a call, we can compute that ceiling to be about 2,666.

So what we wound up doing was going back to the point just before we put the system under contention, and starting to work on sussing out some of the differences that we saw, both in terms of request rates as well as in latency. The first thing that we noticed was that the server's request rate was about 100 less than what the client was reporting, and so we went and doubled the number of event loop threads on the server. After doubling them, we saw that the request rates for both client and server matched again, but we were still seeing that one millisecond of latency on the client. So we decided to try doubling that not only once, but twice. And after not being able to suss out that one millisecond of latency, we decided to accept that maybe there was a networking issue within our test harness, maybe there was just some underlying reporting fuzziness that we couldn't really deal with, and maybe one millisecond was something that we would have to accept as the cost of adopting this framework.

With that configuration, we were able to replicate most of our workloads, but we were still missing that 4,500 operations per second mark. By quickly doubling the number of workers on the client that were issuing requests, we were able to get up to that peak number, or at least close to it, at 4,200 operations per second. In preparation for this, we did double the number of threads on the server-side executor pool, as well as the number of event loop threads. And at this configuration, what we noticed was that gRPC was actually outperforming Boxcar, both at the 90th and 99th percentiles.
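The back-of-the-envelope ceiling is easy to reproduce. The snippet below simply restates the arithmetic from the talk, assuming 32 fully synchronous workers and a fixed per-call latency.

```java
public class ThroughputCeiling {

  // With N synchronous workers and a fixed per-call latency, each worker can complete
  // at most 1000 / callMillis calls per second, so the harness itself caps the rate.
  static double maxOpsPerSecond(int workers, double callMillis) {
    return workers * (1000.0 / callMillis);
  }

  public static void main(String[] args) {
    System.out.printf("Boxcar ceiling at 11 ms/call: %.0f ops/s%n", maxOpsPerSecond(32, 11)); // ~2909
    System.out.printf("gRPC ceiling at 12 ms/call:   %.0f ops/s%n", maxOpsPerSecond(32, 12)); // ~2666
  }
}
```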
While we did incur some slight increases in latency, we were able to look at it and say, well, knowing that we run into problems when Boxcar is under contention, this may be a better solution for our ecosystem. We took the opportunity to then also test how the number of channels in a channel pool impacted performance in relation to the number of event loop threads. What we learned was that doubling the number of event loop threads gave about the same performance benefit as doubling the number of channels that were established to the backend. There are some key differences in when you want to do one or the other, and we'll get into those in a minute, but generally speaking, if you can scale your backend, you don't necessarily have to worry about setting up a channel pool.

So in this process, we learned quite a few things. When it comes to tuning gRPC clients, you want to size the event loop group based on your expected I/O, that is, the number of requests that you're sending to the backend service and the number of responses that you're getting back from it. This can be influenced both by application load as well as by the use of stream-based API calls, so if you're doing client-side streaming, server-side streaming, or even bidirectional streaming, you want to take that into consideration when setting this value. A channel pool is something you want to use sparingly, but the place to use it is when you can't tune or configure the number of backends that you're speaking to. In the case of a service mesh, you typically only dial one target, that single proxy that you're talking to, and so being able to leverage a channel pool can help increase throughput to that underlying proxy (a minimal hand-rolled version is sketched below). On the server side, you want to tune the worker event loop group similarly to the client's, based on the number of requests that you're reading in and the number of responses that you're writing back. The executor can be used as a way to apply back pressure and only permit a certain number of operations to run concurrently. The other thing to note about the executor is that the type of executor can influence your performance as well. By default, we use a fork-join pool, which allows us to reclaim threads in the event that a worker sleeps or goes off and does some other I/O out of band.

The last question, the one the product teams had been asking us all along, was how did Boxcar and gRPC compare to one another? We were able to tell them that they performed comparably. We couldn't tell them one was better than the other; there were some situations where one outperformed the other, but generally speaking, we tended to prefer the newer solution. Boxcar was much easier to tune for our application developers, since they would only have to worry about setting one value, but then they would run into problems where, when their system was under load or under contention, we had to go in and manually intervene to alleviate the problems. And finally, gRPC was a lot more flexible for the types of workloads that we tend to see, where bursts in traffic occur, and occur often.

One of the final lessons that we took away was from the gRPC conference earlier this year. Anna Berenberg, a distinguished engineer at Google, presented this notion where the actual client application and server application can be actors in the data plane of a service mesh.
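For the channel pool point above, the Google APIs extensions library provides one out of the box, but the idea is simple enough to sketch by hand: open several channels to the same single target (for example, the on-box sidecar proxy) and round-robin calls across them. This is an illustrative sketch, not the gax implementation.

```java
import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class SimpleChannelPool {

  private final List<ManagedChannel> channels = new ArrayList<>();
  private final AtomicLong counter = new AtomicLong();

  SimpleChannelPool(String host, int port, int size) {
    // Several independent HTTP/2 connections to the same target, e.g. the local Envoy proxy.
    for (int i = 0; i < size; i++) {
      channels.add(NettyChannelBuilder.forAddress(host, port).usePlaintext().build());
    }
  }

  /** Round-robin a channel out per call site; stubs built from it share that connection. */
  ManagedChannel next() {
    return channels.get((int) (counter.getAndIncrement() % channels.size()));
  }

  void shutdown() {
    channels.forEach(ManagedChannel::shutdown);
  }
}
```

A caller would then build its stub per call or per batch with something like `PingPongGrpc.newBlockingStub(pool.next())`, keeping with the hypothetical service from the earlier sketch.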
So instead of just having to go through a proxy, systems that were really latency-sensitive would be able to actually partake in the mesh as well. And if you follow the source code for gRPC closely like I do, you can actually start to see where they've begun to integrate directly with systems like xDS as a means of acting as a control plane for the applications themselves. Thank you. Any comments or questions? I think I have a mic... I don't have a mic, and I don't know how this thing turns on, so I'll reiterate the questions back.

The question was, did we mix gRPC and Envoy in our architecture? Yeah, so the service architecture team started to require both the gRPC and service mesh migrations at the same time, instead of doing them as a two-phased upgrade, where you might go and upgrade to using gRPC first and then cut over to using a service mesh. The xDS integration in gRPC has only recently been put in, and I'm not sure of its maturity yet. We never went through and actually did that full performance evaluation, but I'd be really curious to go take some of the same harness metrics that we were able to put together and do that same test.

So, when we were using the service mesh in our test harness, we saw some latency, but when we got into our production ecosystem, we actually saw comparable performance. It was rather interesting, and one of the things that we were looking at was its possible relation to an MTU issue that we were having inside of our cluster. The ecosystem where we had run this performance test was an internal OpenStack cluster, and when the systems administrators had set up that cluster, they set the MTU to be much lower than your default packet size, and so we ran into all sorts of problems when it came to handling SSL connections or downloading larger files. It was a really, really frustrating thing from our perspective, and it bit us quite a bit during our Kubernetes migration as well.

So the question was, what's our control plane? We wrote our own internal xDS implementation, given that it's just a spec. It was really easy to sit down and take the existing data store that we had; we use Consul for service registration and discovery. Consul Connect was not available at the version that we were using, plus the semantics of how we were registering services didn't quite align with the way these systems were expecting to resolve them, so we wrote a quick little layer, I believe it was a Java layer, that did that translation for us. It would read from Consul, translate the data, and return it to our clients.

Transparent TLS was a big one, especially with the ability to have HTTP/2 connections point to point. That way, languages like Python or PHP, where you didn't necessarily have the benefit of an HTTP/2 connection from your actual application logic, would benefit from this kind of short on-box hop to the proxy, and then the proxy handled the more performant, long-distance call across machines. Some of the other big features that we were looking to get were traffic monitoring and traffic control, so being able to do more shifting between different versions of our applications. We didn't really have a great way to do that in Boxcar, but it's been a big feature ask, where people want to say, I want to send 10% of my traffic to this new instance, and the service mesh would have been able to give us that type of capability.
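To make the control-plane answer a bit more concrete, here is a very rough sketch of the kind of Consul-to-xDS translation just described: read the healthy instances for a service out of Consul and hand back an EDS ClusterLoadAssignment. The `ConsulCatalog` interface is a hypothetical stand-in for whatever Consul client is in use, and the proto builders come from the Envoy xDS v3 API bindings; our actual internal layer is not shown here.

```java
import io.envoyproxy.envoy.config.core.v3.Address;
import io.envoyproxy.envoy.config.core.v3.SocketAddress;
import io.envoyproxy.envoy.config.endpoint.v3.ClusterLoadAssignment;
import io.envoyproxy.envoy.config.endpoint.v3.Endpoint;
import io.envoyproxy.envoy.config.endpoint.v3.LbEndpoint;
import io.envoyproxy.envoy.config.endpoint.v3.LocalityLbEndpoints;
import java.util.List;

public class ConsulEdsTranslator {

  /** Hypothetical view over Consul's health API (GET /v1/health/service/{name}). */
  interface ConsulCatalog {
    List<ServiceInstance> healthyInstances(String serviceName);
  }

  static final class ServiceInstance {
    final String ip;
    final int port;
    ServiceInstance(String ip, int port) { this.ip = ip; this.port = port; }
  }

  private final ConsulCatalog catalog;

  ConsulEdsTranslator(ConsulCatalog catalog) {
    this.catalog = catalog;
  }

  /** Translate one Consul service into the EDS load assignment a client would resolve. */
  ClusterLoadAssignment toLoadAssignment(String serviceName) {
    LocalityLbEndpoints.Builder locality = LocalityLbEndpoints.newBuilder();
    for (ServiceInstance instance : catalog.healthyInstances(serviceName)) {
      locality.addLbEndpoints(LbEndpoint.newBuilder()
          .setEndpoint(Endpoint.newBuilder()
              .setAddress(Address.newBuilder()
                  .setSocketAddress(SocketAddress.newBuilder()
                      .setAddress(instance.ip)
                      .setPortValue(instance.port)))));
    }
    return ClusterLoadAssignment.newBuilder()
        .setClusterName(serviceName)
        .addEndpoints(locality)
        .build();
  }
}
```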
Your hand was up next. So the question was, did we consider Istio? Yes, we've actually gone back and forth on it. In that first evaluation in 2017, we did look at it, but at the time it only supported Kubernetes, which didn't match our underlying runtime platform; we were on Apache Mesos at the time, and so we ruled it out as an immediate option, but we put a pin in it. Upon my return to Indeed, I actually went through and did a whole other evaluation of Istio, because we had learned that they had added support for Consul and more Nomad-based ecosystems. Not that we run Nomad, but given that it supported Consul as a backend, it was something that we could easily integrate with as well. Again, it boiled down to the differences in how we were registering services within the cluster, and then, when we were going through and trying to figure out what that migration path looked like, it looked very painful. Istio, from what I had seen, was really easy to adopt if you didn't have any existing communication infrastructure in place; if you were already operating in a clustered ecosystem like Nomad or Docker Swarm or Kubernetes, it was really easy to just drop in and gain all the benefits from, but in a legacy ecosystem like ours, it was really difficult to sit down and actually get it integrated.

You said you had two questions, right? So the question was, how are we handling authentication and authorization? Mutual TLS is primarily how we're doing the authentication piece. Authorization is currently done off API tokens, but we were talking about leveraging things like the RBAC and control policies that Istio was originally providing, and then possibly implementing some of that within our xDS layer.

So, a question was around distributed tracing. We do have distributed tracing in place. We actually implemented an internal version off the Google Dapper paper back in 2014, and that has since been migrated to using things like OpenTracing and LightStep as our primary sources of integration, and that's been working out pretty well. It's actually reduced a lot of the metrics that we no longer have to emit, because tracing gives us a lot of what we were looking for.

So, the service mesh specifically: the question was, are we using the service mesh in production? Yes, there are several teams that are currently using it, and they've actually reported better performance than Boxcar in their cases. A good example is our messaging pipeline: the team that owns all of our email integration switched their services over to using the service mesh and has seen a huge improvement, primarily around the ability to do things like mutual TLS, and then also, what was the big one? The HTTP/2 connections, being able to interleave requests along a single connection, plus compression both ways, so compression on the request plus compression on the response. We did have that in Boxcar, but only on the response side; we did not have it on the request. Request compression is a very difficult problem to solve if you don't have it in from the beginning.

Any other questions? I think that's time. I think it's lunch, so I'll be up here at least for the time being; feel free to come up and chat. I'll also be out and about at lunch, so feel free to reach out. Thanks.
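As an addendum to the compression answer above: in grpc-java, enabling compression in both directions looks roughly like the sketch below, with the client opting in on its stub and the server opting in per call via an interceptor. The `PingPongGrpc` stub and `PingPongService` refer to the hypothetical service sketched earlier.

```java
import io.grpc.Metadata;
import io.grpc.ServerCall;
import io.grpc.ServerCallHandler;
import io.grpc.ServerInterceptor;

public class GzipBothWays {

  /** Server side: ask gRPC to gzip every response for this call. */
  static class GzipResponsesInterceptor implements ServerInterceptor {
    @Override
    public <ReqT, RespT> ServerCall.Listener<ReqT> interceptCall(
        ServerCall<ReqT, RespT> call, Metadata headers, ServerCallHandler<ReqT, RespT> next) {
      call.setCompression("gzip");
      return next.startCall(call, headers);
    }
  }

  // Wiring it up (sketch):
  //   server: ServerInterceptors.intercept(new PingPongService(11), new GzipResponsesInterceptor())
  //   client: PingPongGrpc.newBlockingStub(channel).withCompression("gzip").ping(request)
}
```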