It's mostly focused on being available. We can deal with some inconsistencies later, even if it means manual intervention. But we never want to be down. We want the app to be working every time, every day, every minute. And that's challenging. What you're seeing on the screen is a view of Amsterdam, the city I live in now, from Uber's point of view. The brightness of the color represents how often an Uber car drives there. As for our architecture, we have many servers in geographically distributed data centers, and we have a service-oriented architecture with hundreds of microservices.

And it's not only about people, actually. No tech talk is good enough without a picture of some kitten. Uber is actually going into the last-mile delivery business. We can deliver packages in different cities; that's part of the UberRUSH service. We can deliver food very quickly, within minutes, as part of the UberEATS service. And we have some fun promotions where we partner with an animal shelter: we take some kittens from them, people can order an Uber car with a kitten, the money they pay goes to the animal shelter, and they can even keep the cat. So yeah, some fun.

Right. So about open source: our platforms are built on open source technologies everywhere. We are using Ubuntu and Debian Linux on our servers. We are using a lot of Docker. Docker was a big buzzword last year; I don't know if it still is this year, but I guess it is. The main languages for our backend services are Python and Node.js, and recently Go has been gaining a lot of speed. Our storage and other backend services include Kafka for logging, Redis for caching, Cassandra for key-value storage, basically, and Hadoop for data analytics. And the list goes on and on. We use open source technologies everywhere. But we are not only using them, we are also contributing.
Every project, every service that is created at Uber is considered for open sourcing. Of course, that can't happen every time. There are some business-critical things that must be kept secret, and sometimes it doesn't make a lot of sense to open source something. But if we can do it, we open source things. We have more than 880 original public GitHub repositories and more than 40 forked repositories. And our software engineers are constantly contributing to projects in different ecosystems, mostly in Node.js at this point; our people contribute upstream very often.

So this graph shows you how many microservices there are at Uber. It's a historical graph. Two years ago, there were no microservices; everything was served by a monolithic API service. But since then, the number of microservices has exploded. We now have over 700 microservices running at Uber, and even keeping them running is a challenging thing. So I'm going to talk about three things that support this microservice-oriented architecture. The first is Ringpop, which is scalable, fault-tolerant application-level sharding; we'll see what that means in a minute. Then TChannel, a high-performance RPC protocol, and Hyperbahn, service discovery and routing for our large-scale service-oriented architecture operations.

All right, so Ringpop. Ringpop is actually the project that we are developing in Amsterdam. Ringpop consists of two main things: a consistent hashing ring and a membership protocol. Consistent hashing allows us to consistently shard requests to the workers that actually do the business logic. So imagine you have a trip that has started. The trip has an ID. And because the trip carries a lot of state, and you're doing some caching, you want the cache to be fresh. So you want every request for that trip to be routed to one host, to one instance, and that one instance is going to handle the whole trip.
And you can imagine that the supply service, for example, or the demand service that we have are being served on hundreds of hosts. There are hundreds of instances. And we need to route these requests consistently to one machine that is going to handle that one trip, and then it can do other things. And of course, we want to be available. So if that machine fails, we want this failure to be non-disruptive. We want the trip to still continue. If the machine crashes, we cannot afford to lose the trip. And a membership protocol gives us failure detection in a decentralized, scalable way. We will see how these two things come together.

So I will dig a little deeper into the underlying technology. The hashing ring is actually a continuous circle over the space of four-byte integers. The first thing we need to do is place our workers, our instances as we call them, on this hashing ring. Then, as the requests come in, we route them to these instances based on hashing. So here we have three instances, A, B, and C. We've run them through a hashing function that returns a four-byte integer, and that places them on the hashing ring. This determines the key space division: every instance is responsible for the hashing keys that are counterclockwise from it, until the next instance. So in this case, instance B is responsible for the left half of the key space, and instances A and C have split the right part of the key space. In reality, it's a little more complicated than this; it actually never happens that the division would be this imbalanced. But for the sake of simplicity, let's assume it's this imbalanced, even though in reality it's not. And now, if we want to determine which instance is responsible for a given user, for example, or a given trip, we hash the user or the trip and it falls somewhere on the ring.
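The consistent hashing just described can be sketched in a few lines of Python. This is only an illustration of the idea, not Ringpop's actual code; the hash function, the lookup direction, and the instance names are all assumptions:

```python
import bisect
import hashlib

def hash32(key: str) -> int:
    # Four-byte integer position on the ring (the talk's 32-bit key space).
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

class HashRing:
    def __init__(self, instances):
        # Each instance is hashed onto the ring; this fixes the key space division.
        self.positions = sorted((hash32(i), i) for i in instances)

    def owner(self, key: str) -> str:
        # Walk from the key's position to the next instance on the ring.
        h = hash32(key)
        idx = bisect.bisect(self.positions, (h, ""))
        if idx == len(self.positions):
            idx = 0  # wrap around the ring
        return self.positions[idx][1]

    def remove(self, instance: str):
        # A failed instance is simply taken off the ring; its neighbor
        # inherits its key range, everyone else's keys stay put.
        self.positions = [(h, i) for h, i in self.positions if i != instance]
```

Because only the keys between the removed instance and its neighbor move, taking an instance out of the ring leaves all other keys with their original owner, which is exactly the non-disruptive behavior the talk goes on to describe.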
In this case, users 1 and 5 will be handled by instance B. In the same way, user 8 is going to be handled by instance C and user 4 is going to be handled by instance A. And now, let's say instance C has caught fire; it's down, somebody has cut the cable. We need this to cause as few disruptions as possible. So our membership protocol determines that instance C is down and removes it from the ring. Now instance A is responsible also for user 8, because C is no longer part of the ring. But as you can see, users 1 and 5 are still being handled by instance B. So unless an instance goes down, a request is going to be routed to the same instance every time. And in the other direction, if we put another instance into the ring, in this case D, it's going to take over some of A's key space, and user 4 is now being handled by instance D. So the important thing to remember: unless something goes wrong, a request is always being handled by one instance, from beginning to end.

All right. And now, how do we do the membership protocol? Our membership protocol is based on SWIM. If you've never heard of it, go read the SWIM paper. It's an excellent paper, a great read; I really recommend it. Do it. So assume we have three instances, and we somehow want to keep track of the membership, and we want to do it in a decentralized, scalable way. SWIM allows us to do this. In the steady state, these instances cycle through their membership list and, at given intervals, ping randomly chosen neighbors. If nothing happens, they just happily ping and everything is fine. But say instance B goes down, and instance A tries to ping it and doesn't get a response back. We don't know yet what happened. Sometimes packets get lost on the network; it might be just a temporary thing. We don't want to jump to the conclusion that instance B is down. We want to be sure.
So what happens is that instance A asks another instance, in this case instance C, to do what is called an indirect ping. It sends a ping request to instance C, and instance C pings instance B on A's behalf. This is for the case where the link between A and B has been broken, but the link between C and B is still fine. We don't want to mark B as faulty, because it's still there; just A cannot reach B, and that's not a big problem. So say the indirect ping fails too, and now A declares B a suspect. That means it's not yet declared faulty; it could be just a temporary failure, and we don't want to cause disruptions unless it's really necessary. So it's just been declared a suspect.

Ah, I forgot something. What's important here is that this protocol is based on what we call gossiping: on these pings, we piggyback information about the membership to other instances. So the next time A pings C, the ping also carries the information that B has been declared a suspect. And after a little while, when B still doesn't respond, it's actually declared faulty.

The best thing about this approach is that it's very scalable. It's doing what we call infection-style dissemination: it randomly disseminates information about membership, but this random, infection-style dissemination eventually gets to all the nodes. And the beautiful thing is that the traffic is constant per node. When we grow the cluster, nodes don't need to keep connections to every other node there is; they are just pinging some subset at constant intervals. So it's very scalable. We have clusters with 1,000 instances in production. We have tested it up to 2,500 instances, and we just don't have enough hardware at this point to go further. But this thing is going to be fine up to 10,000 instances.
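The ping, indirect ping, suspect, faulty progression above can be sketched as a toy model. This is not Ringpop's SWIM implementation; the link model and the single probe round are heavy simplifications, and the gossip piggybacking is left out entirely:

```python
import random

ALIVE, SUSPECT, FAULTY = "alive", "suspect", "faulty"

class Member:
    def __init__(self, name):
        self.name = name
        self.up = True      # is the process itself running?
        self.state = ALIVE  # what the rest of the cluster believes about it

def ping(src, dst, broken_links):
    # A ping succeeds if the target process is up and the link isn't cut.
    return dst.up and (src.name, dst.name) not in broken_links

def probe(src, target, members, broken_links, k_indirect=1):
    """One SWIM-style probe round from src toward target."""
    if ping(src, target, broken_links):
        return  # direct ping answered, nothing to do
    helpers = [m for m in members if m not in (src, target)]
    for helper in random.sample(helpers, min(k_indirect, len(helpers))):
        # ping-req: the helper pings the target on src's behalf
        if ping(helper, target, broken_links):
            return  # reachable through a peer, so don't suspect it
    target.state = SUSPECT  # not faulty yet, just suspected

def expire_suspects(members):
    # After the suspicion timeout, suspects that never refuted become faulty.
    for m in members:
        if m.state == SUSPECT:
            m.state = FAULTY
```

With a cut A-to-B link but a healthy C-to-B link, the indirect ping succeeds and B stays alive; only when nobody can reach B does it become a suspect and, after the timeout, faulty.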
So that's something about the scale at Uber: we have a service that has 1,000 instances in production. OK, so how does one actually use Ringpop? Ringpop is an application-level middleware. Requests come into the service, and Ringpop, based on the hashing ring and the membership, decides whether it should handle a request or forward it to some other instance. So you can imagine something like this; this is just pseudocode, don't look too closely at it. If you look at the diagram at the bottom, these are our three instances. A request for user 1 comes to instance A. Instance A is not the owner of the user based on the consistent hashing, but it knows immediately where the request should go. So it forwards it to instance C; instance C handles it and replies through instance A, which sends the response. This is what the code could look like for some kind of business logic on users.

So you have seen that there's a lot of RPC happening: many services, many pings, many requests and responses flying around. We used to use HTTP, but there were many issues with it. HTTP is a complex protocol, and it's kind of slow in some cases. So there was a need to make something more reliable and faster, and we created TChannel. TChannel was created with a service-oriented architecture in mind from the start. We are not calling hosts; we are calling services. We wanted to be able to trace any request in our service-oriented architecture, so that we can do diagnostics, performance testing, and tracing of requests. We wanted it to be easy to implement in the languages that we use. We also wanted it to support multiplexing: you can have one connection open and send many requests over it, without having to wait for the response as you would with HTTP. With HTTP, you can work around this by using multiple connections, but then you're using more connections, and it's not as scalable.
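The handle-or-forward pseudocode from the Ringpop part of the talk could look roughly like this in Python. It's a self-contained sketch, not Ringpop's API; `owner_of`, the `Service` class, and the instance names are made up for illustration:

```python
import hashlib

def hash32(key: str) -> int:
    # Four-byte ring position, as in the consistent hashing discussion.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

def owner_of(key: str, instance_names) -> str:
    # Pick the first instance on the ring at or past the key's position.
    ring = sorted((hash32(name), name) for name in instance_names)
    h = hash32(key)
    for pos, name in ring:
        if pos >= h:
            return name
    return ring[0][1]  # wrap around the ring

class Service:
    """Minimal handle-or-forward middleware, in the spirit of the talk's pseudocode."""
    def __init__(self, name, cluster):
        self.name = name
        self.cluster = cluster  # name -> Service, shared by all instances

    def request(self, user_id):
        owner = owner_of(user_id, sorted(self.cluster))
        if owner == self.name:
            # This instance owns the key: run the business logic here.
            return f"user {user_id} handled by {self.name}"
        # Not ours: forward to the owner and relay its reply back.
        return self.cluster[owner].request(user_id)
```

Whichever instance receives the request, the reply is the same, because the hash ring gives every instance the same answer about who owns the key.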
We wanted it to support arbitrary serialization. We still use a lot of JSON, but the problem with JSON is that it's slow to parse and produce. We found that many of our services were basically doing one thing all day, and that is parsing and producing JSON. So we actually started using Thrift, which is a different serialization approach. And of course, we wanted it to support high-performance forwarding, because, as you have seen, that's what we do a lot.

So TChannel is a binary protocol. It's binary because we want fixed positions in the TChannel header that we can access directly, without any complicated parsing. Based on these headers at fixed positions in the byte stream, we can determine the forwarding; we don't need to parse the whole request at all. That allows very fast forwarding. And there's a request ID, which actually allows us to do the tracing I described: we can trace any request by its ID throughout the whole network. Again, TChannel is open source. We have four implementations: Node.js, Python, Go, and Java. And we've also built tools to support it. tcurl is just curl for TChannel, and tcap allows us to trace and diagnose TChannel requests as they fly through the network.

And of course, if you have a large service-oriented architecture operation, you need some way to do service discovery and routing at this large scale. It's one thing to have one service that works nicely, is reliable, uses consistent hashing, and everything is fine. But you need a way for these services to communicate with each other. That's where Hyperbahn comes in. Hyperbahn gives us service discovery and request forwarding. Clients have to do almost zero configuration: you just need to give a client the IP address of one Hyperbahn instance and it will bootstrap automatically. And Hyperbahn also allows us to do circuit breaking.
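The fixed-offset header idea can be made concrete with a tiny framing sketch. The field layout here is an assumption chosen for illustration, not TChannel's actual wire format:

```python
import struct

# Illustrative fixed-offset frame, in the spirit of TChannel's binary framing:
#   size:uint16 | type:uint8 | flags:uint8 | request_id:uint32 | payload
HEADER = struct.Struct(">HBBI")

def encode_frame(frame_type: int, request_id: int, payload: bytes) -> bytes:
    # size covers the header plus the payload
    return HEADER.pack(HEADER.size + len(payload), frame_type, 0, request_id) + payload

def peek_request_id(frame: bytes) -> int:
    # A router reads the id at a fixed offset and can forward the frame
    # without parsing (or even looking at) the payload.
    return HEADER.unpack_from(frame)[3]
```

A router only needs `peek_request_id` (and, in a real protocol, a similar fixed-offset peek at the routing fields) to forward the frame; the payload bytes are never parsed, which is what makes the forwarding fast.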
That means if one service is misbehaving and doesn't respond, we can cut traffic to it so that it doesn't disrupt services that are downstream or upstream of it. Hyperbahn also allows us to do rate limiting: if, again, one service is misbehaving and producing too many requests, we want to limit it dynamically so that, again, it doesn't disrupt more of our architecture.

So Hyperbahn is actually based on Ringpop. The Hyperbahn routers form a ring, and services connect to this ring. Services do not connect to every node in the Hyperbahn ring; they just connect to a subset of them, what we call an affinity group. You can imagine that when a service starts, it contacts one of the Hyperbahn nodes, and that node tells it where it should connect, where its affinity group is. It connects there, and then it's ready to serve requests.

So this is what a request flow could look like in Hyperbahn. Say service A wants to make a request to service B; that's the goal. Service A sends the request to Hyperbahn. Hyperbahn determines where service B's affinity group is and forwards the request to one of the routers in B's affinity group, which then forwards it to service B. Service B again determines which of its instances should actually handle the request, forwards it to the correct instance, and then the whole thing goes back the same way. So you can see there's a lot of routing and forwarding happening, and it's only thanks to TChannel and everything I've described that we can do this in a high-performance way.

So, to conclude: Uber is both an open source user and contributor. We have a large service-oriented architecture based on our open source projects. There's good documentation; you can really go to GitHub, take a look, and play with them. These projects are Ringpop, TChannel, and Hyperbahn, and there's plenty, plenty more there. But we still have a lot of interesting challenges ahead.
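The circuit breaking just mentioned can be sketched as a small state machine. This is a generic sketch of the idea, not Hyperbahn's implementation; the thresholds and the injectable `clock` parameter are illustrative choices:

```python
import time

class CircuitBreaker:
    """Open the circuit after repeated failures; retry after a cool-down."""
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.reset_after:
            # Cool-down over: close the circuit and probe the service again.
            self.opened_at = None
            self.failures = 0
            return True
        return False  # circuit open: shed the request

    def record(self, ok: bool):
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip the breaker
```

After `max_failures` consecutive failures the breaker opens and traffic to the misbehaving service is shed; after `reset_after` seconds requests are let through again to check whether the service has recovered.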
We've been around for four or five years, something like that. Our architecture is still evolving, so things will change and things will break, but a lot of interesting stuff is happening. So thank you. And yeah, Uber is hiring; if you want to work on this and more cool stuff, get in touch with me. Also, if you want to take an Uber and you still don't have an account, you can use the promo code MARKFORYOU to get 250 crowns off your first ride. So that's all from me; I'm ready to answer any questions you may have.

All right, I think you were first. OK, so the first question was: do we use some kind of Raft or Paxos implementation? The answer is: kind of. We use Cassandra and we use Riak, which are both based on these things, but we don't have anything in-house based on them. And the second question was: we collect a lot of data, so what are we doing with it, are we selling it or not? We are definitely not selling the data. We do have a big Hadoop operation and we are analyzing a lot of data, but we are definitely not selling it. What are we doing with it? OK, I'm not sure I can actually talk about that a lot. There's a lot of space for optimization in routing, for example, you know, of the cars. And that's about everything I can probably say here, sorry.

Yeah, so, we work with UUIDs throughout. The question was: how do we determine the identity of the user? We don't use usernames; every user is associated with a UUID, and we use these UUIDs as the source for the hashing. Yeah? The next question was about security and encryption with TChannel. This is all happening internally; we don't do any encryption at the application layer. I believe there are IPsec things happening in the lower layers of the network.
But, yeah, you wouldn't use TChannel to connect from your cell phone to our infrastructure. We still use an HTTP- and REST-based approach to get to the end device. Yeah? So, the question was: if a node dies, another node takes over, but what happens in the background? Do we have to replicate the data, do we use Cassandra for it, or is it stateless? In other words, what happens if a node actually dies in the middle of a trip, and how do we load the state? There are a couple of things we do. We use Redis for caching, so you would store the transient state in Redis as one place. And we have a couple of storage services, based on Cassandra and on Riak, where we store the data so that it can be loaded in case the node dies. So the advantage there is that your caches are fresh. And also, there are other services which only store really transient data that gets refreshed every couple of seconds. So if one of the instances dies, the data gets refreshed from the source very quickly. We get a disruption of four seconds or something like that, but it's not a big deal.

Yeah? So, the question was: is there any guarantee that the requests get where they are supposed to go? The answer is no. You can always lose packets somewhere in the network. But what you can do is speculative execution: you actually send the same request to two instances and use the first response that comes in, so that if one of the packets is lost somewhere, you still get the other one. You can also wait for the response and, if it doesn't come in, repeat the request; but then, of course, you are faced with some latency issues. So there are ways to work around this. But of course, if a packet is lost somewhere in the cable, it's just lost; you need to deal with that. Yeah?
So, thanks to the fact that we have request IDs, we know that we sent a request with an ID and we are waiting for a response with that ID. So if the response doesn't come in, we can safely resend the request, because the destination service knows that it already served it and can just send the same response again. So we can guarantee that it's going to handle the request only once, and likewise we can guarantee that the source service is going to handle the response only once, thanks to the IDs.

Yeah? So, what used to happen is that devices would basically ping our data center every couple of seconds, get the whole state of the application back, and just display that state. But we are moving away from this approach. Now the mobile application knows what data it should request and just asks our backend services for that data. If the connection gets broken, your application won't update for a couple of seconds, but when the connection comes back, it will just load the data that is necessary. So I don't think there's a big room for conflicts, really. The application is fairly simple, actually. When you make a request to get a car, you commit by pressing the button. The request either gets to the server or it doesn't; if it gets there, it gets processed on the server, and you wouldn't send the request twice. So I don't think there's a lot of room for conflicts.

Yeah? So the first question was: Hyperbahn is written in Node.js, so why didn't we choose Go? Basically, Node.js at the time was the language to write things in. At Uber, there were software engineers able to write things in Node.js, so that's the basic reason. And the second question: did we have problems with Node.js? Yes, we've had problems. That's actually why Go is gaining a lot of speed at Uber now. We are considering all options.

So, unfortunately, we are really out of time. Thanks, Mark, for the great talk.
You speak English, right? So I will repeat the question. The question was: why did we write Hyperbahn ourselves? Why didn't we use something off the shelf?