So to get things right, this is not rocket science. It's kind of simple; it's a manual application of KISS. So let's keep all of this as simple as possible.

Before we get started: we've seen the diagrams, but what is signaling, really? Signaling is the offer/answer exchange process that is needed to set up a peer-to-peer connection between two participants. I normally think of it as the handshake: how do we agree on the connectivity pieces?

So, another sequence diagram. We have four pieces: Alice and Bob, their application sides, and Alice's WebRTC library instance and Bob's WebRTC library instance. Alice starts all of this off by calling createOffer. That gets her an offer, which is sent across to Bob by any means; in this case we just call it the carrier pigeon. Bob uses that offer to call createAnswer and sends the answer back to Alice, who can complete the handshake by calling setRemoteDescription. Once that has happened, we have a peer-to-peer connection created between Alice and Bob, thanks to the magic of ICE, STUN and TURN that has been mentioned before. (There's a small code sketch of this handshake a bit further down.)

So the title of the talk is fast, reliable and scalable. What does that really mean? Fast means we want to minimize the call setup latency. Reliable means we want really reliable call setup, so we have as few failures as possible, both for the users and for the developers; if you get that wrong, you're going to have pissed-off developers trying to use your system. And the final thing is you want a system that actually survives the stampede of your own success.

So what's the impact of doing this? As I said, I've been on a bunch of products, and in our experience we managed to reduce the call setup time from about 10 seconds to 2 seconds. We got significantly more reliable signaling, or call setups. The system is a lot simpler to build and run than what we used to have before, and it turns out it's pretty cheap to run as well.

So let's replace the carrier pigeon with some details. We have some more components to look at in this system: we have HTTP servers and a database, and we have the same use case as before, we want to set up a peer-to-peer connection between Alice and Bob. Just as before, Alice starts off by calling createOffer. She then sends the offer to Bob using an HTTP POST operation. Bob is, sort of by magic, already polling the system trying to receive that offer, so once it's inserted into the database, Bob is going to receive it. As soon as Bob receives it, he can use it just as before to call createAnswer on WebRTC and send the answer back to Alice using POST. As soon as Alice did her POST, she entered her waiting loop, so she keeps polling, waiting for the answer to come back. As soon as the insert completes, Alice is going to receive the answer, she can call setRemoteDescription, and the peer-to-peer connection is going to be created.

Looks good, right? Not really. If you build this and try to run it at scale, you're going to discover that it's slow, it's unreliable, and it doesn't scale at all. As always, the devil is in the details. So what are the problems with the previous solution?
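Before digging into those problems, here's the promised sketch of the offer/answer handshake itself, using the browser's RTCPeerConnection API. The transport functions sendToPeer and waitForPeerMessage are placeholders for whatever carries the messages (the carrier pigeon, or later the signaling service), and ICE candidate exchange is left out to keep it short.

```typescript
// Minimal sketch of the offer/answer handshake with RTCPeerConnection.
// sendToPeer / waitForPeerMessage stand in for the signaling transport.
declare function sendToPeer(msg: RTCSessionDescriptionInit): Promise<void>;
declare function waitForPeerMessage(): Promise<RTCSessionDescriptionInit>;

// Alice's side: create the offer, then apply Bob's answer.
async function caller(pc: RTCPeerConnection): Promise<void> {
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  await sendToPeer(offer);                   // "carrier pigeon" out
  const answer = await waitForPeerMessage(); // wait for Bob's answer
  await pc.setRemoteDescription(answer);     // handshake complete
}

// Bob's side: apply Alice's offer and answer it.
async function callee(pc: RTCPeerConnection): Promise<void> {
  const offer = await waitForPeerMessage();
  await pc.setRemoteDescription(offer);
  const answer = await pc.createAnswer();
  await pc.setLocalDescription(answer);
  await sendToPeer(answer);
}
```

The setLocalDescription calls aren't mentioned in the walkthrough above, but the real API needs them before the descriptions can be exchanged.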
So polling is pretty crappy: it adds overhead, both on the server side and the database side, and it also adds a lot of latency. Using polling over HTTP adds sort of double latency, because you need to pay for the poll cycle and you need to pay for the TLS handshake for every single poll, which is going to be crazy.

Using a single database also has a set of problems. It's going to add latency, because that database needs to live somewhere globally. It's going to have replication problems, because it's just one database. And it for sure won't scale.

There are some other details to this. There aren't any delivery acks. If you don't have delivery acks in the system, you don't actually know whether a message made it across; developers are a lot happier when they know. And beyond that, we don't have any rendezvous system: as I said, we rely on magic for Bob to know when to start polling. So it just doesn't work.

So these are the details we need to go through and design. We need to build a rendezvous system. We need to look at the semantics of message delivery, so we get those right. We need to figure out how we do system sharding, or partitioning, so we get the properties we want out of the system. We need a better client-server protocol; HTTP GET and POST are great, but if you actually want performance, and you want performance on mobile devices, you need something else. And finally, and probably the most important from a developer's perspective, let's make sure the system can handle retries and idempotency, otherwise it's going to be painful.

So, rendezvous. It's kind of a fancy name for a really simple question: how does Bob know that Alice is calling him? We have two main options: we have GCM and APNs, or we could build our own system.

GCM, or Firebase Cloud Messaging, or Google Cloud Messaging — same thing, different names — is really push notifications for Android. It's "reliable", and it has reasonably low latency: the normal latency is about 350 milliseconds, assuming you can actually reach the device; if it's in a drawer somewhere without battery, it's going to take longer. I say "reliable" in quotes because there are corner cases where they drop messages silently, especially if you run out of capacity on the server side to store more messages for that endpoint.

APNs is the Apple Push Notification service. It's built into all iOS devices, and it's best effort by default, which means messages can be dropped both client side and server side. I'm fine with that: at least they deliver what they promise, unlike the GCM approach. They have better latency than GCM most of the time, about 250 milliseconds; it sort of depends, but you can count on it.

Then we have the third option: you could always go off and say, hey, I want to build my own thing. You could use a long-lived TCP connection from the device to the cloud and handle that yourself. You can make it as reliable as you want, and you can make it really, really low latency. It turns out it's scary hard to implement, especially at scale, it will most likely drain the battery rather than make the user happier, and on iOS it's almost impossible to actually implement because of the constraints put on background processes. (There's a tiny sketch of what the push-based rendezvous looks like at the end of this part.)

So, message delivery semantics. A detail, but a really important one: what does "delivered" really mean? We define a message as delivered when it has been received by the destination device and handled successfully by the application.
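Before working out what that means for our offer/answer example, here's the promised sketch of the push-based rendezvous. It uses today's Firebase Admin SDK as a stand-in for the GCM/APNs machinery described above; the important part is that the push carries no call data at all, it just wakes the callee's app so it can connect and fetch whatever is pending. The deviceToken argument and the payload shape are assumptions for this sketch, not the real Duo protocol.

```typescript
// Rendezvous "wake-up" push, sketched with the Firebase Admin SDK.
import { initializeApp } from "firebase-admin/app";
import { getMessaging } from "firebase-admin/messaging";

initializeApp(); // uses the default credentials in this sketch

async function wakeUp(deviceToken: string): Promise<void> {
  await getMessaging().send({
    token: deviceToken,
    android: { priority: "high" },  // wake the device promptly
    data: { type: "wakeup" },       // no message body: just "come and get it"
  });
}
```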
So what would that mean in our example? It means the offer from Alice has been received by Bob, Bob has been able to apply it using createAnswer, and that operation succeeded. Only at that point do we mark the message as delivered. Anything short of that is just lost, or in an unknown state.

We have some other details in here: we need to decide how strict we're going to be with message delivery. There are really two big choices. The first one: in order or any order. In order means that we promise to deliver all messages in exactly the same order as they were sent. The other option is to be a little more relaxed and say we deliver them in the same order most of the time, but sometimes we get it wrong. The reason for that "sometimes" is that when you scale out, you're going to realize that servers go up and down at the worst possible time; when that happens, you need to retry, and that tends to rearrange things on the wire.

The second thing, which is also sort of semantic but critical: you can say that we're going to deliver every message exactly once, or at least once. Exactly once means one message was sent and we deliver exactly that one; the other one means, yeah, sometimes we deliver it more than once.

From an engineering perspective it feels kind of nice to say let's provide in order and exactly once; of the four combinations, that's the one that sounds simplest to use. But when you add the latency and complexity dimensions, you're going to realize it's slow and scary hard to build. So the Google way is: let's make it simple and let's make it fast, and handle the problems client side. It turns out that works really well. (There's a small sketch of that client-side handling at the end of this part.)

System sharding: we need to partition the system we're building. We also know by experience that most calls are made within a single region. You only call your friends, and most of your friends are in the region where you are, most of the time. If we don't take advantage of that, then every single request that leaves one region just to go to another, for no good reason at all, is going to add another 250 milliseconds of round-trip time to every call. If we apply this to the initial polling design, that would mean the lowest possible latency would be 250 milliseconds, because that's the cost of a round trip, and that's if you got extremely lucky; if you got unlucky, it would be worse.

So the thing we realize from this is that we need to store user data in the region where the user lives, and we need to avoid cross-region calls if at all possible. Those last words are critical: if at all possible. So we know we need to shard the system, and we know we need to shard it by users.
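Here's the promised sketch of that client-side handling: deduplicate by message id, apply the message, and only ack once it was handled successfully, which is exactly the "delivered" definition from before. The SignalingMessage shape and the ackMessage call are made up for this sketch, not the actual wire format.

```typescript
// Client-side handling of at-least-once, mostly-in-order delivery.
interface SignalingMessage {
  id: string;                       // unique per message, chosen by the sender
  from: string;
  type: "offer" | "answer";
  sdp: string;
}

declare function ackMessage(id: string): Promise<void>;

const handled = new Set<string>();  // ids we have already applied

async function onMessage(pc: RTCPeerConnection, msg: SignalingMessage): Promise<void> {
  if (handled.has(msg.id)) {
    await ackMessage(msg.id);       // duplicate: just re-ack, don't re-apply
    return;
  }
  await pc.setRemoteDescription({ type: msg.type, sdp: msg.sdp });
  if (msg.type === "offer") {
    const answer = await pc.createAnswer();
    await pc.setLocalDescription(answer);
    // ...send the answer back to the caller here...
  }
  handled.add(msg.id);
  await ackMessage(msg.id);         // "delivered" = handled successfully
}
```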
So how do we do it? We can be lucky and lazy, as we were on Allo, and say, hey, this is phone-number based. Phone numbers are by default sort of regionalized, so we can use that; it works okay.

There are other options. A really simple one is to rely on a registration system and DNS: you rely on DNS to find the server that is closest to the user, and on that server to know which region it belongs to. A simple example: we have Alice again, we have a global user database, we have a DNS system — it happens to be the one Google provides, but you can use any of them — and then you have servers in each of the regions. When Alice wants to register, the first thing she does is a DNS query for some hostname. If you configure DNS correctly, that resolves to the server that is closest to the user. That server knows which region it's in: on all the cloud providers you can query the instance metadata and it will return the region. Then we update the user record in the global database and say Alice's home is the EU. Pretty simple. (A rough sketch of this registration step follows at the end of this part.)

The drawback is that if Alice happens to be traveling in the US the first time she uses the application, she ends up homed in the US and she never gets moved; she's going to have a bad day. So you need to add some kind of user migration scheme if you do it like this.

Okay, so a few times here I've said local storage or global storage. What does that really mean? Local storage is really some kind of storage, could be a MySQL database, running within a single region, with a low replication factor, that we can treat as scratch space. That means we get super fast writes and super fast reads. Global storage is the thing we want to avoid; sadly, as you can probably tell, we can't. It's globally replicated and it's very cacheable, because we don't change it very often. Reads can be fast, because we can cache everything, but writes are going to be glacially slow; that's just a fact. So don't write to it often, and you can live with it.

Okay, client-server protocol. As I said, we know that HTTP polling is expensive and slow: TLS overhead, the overhead of extra calls and extra round trips, especially on mobile networks. We also risk hotspotting the database because we keep reading the same entry over and over and over again. It's not very good.

So we have this wish list for how the client-server protocol should work on mobile devices. We want it to support server-to-client push, so we don't need to do the polling; we just do one TLS handshake, and then when the server knows something, it can tell us about it. We need it to be really cheap on the wire. Sure, we're going to set up a call soon and that's going to burn a lot of data, but still, the actual negotiation should be cheap. We want it to be fast, really, really fast. And the last one, especially as a developer: it should be simple to use, not complex and hard.

At Google, we use protocol buffers. There's a standing joke at Google that there are only two kinds of jobs: you either do proto-to-proto or proto-to-UI, and that's all you do at Google as a software engineer. Essentially, it's true. Protocol buffers are a very flexible way to define data structures that are language independent. They're pretty easy to use once you get used to the quirks, and they have some. They're strongly typed, and they have a really nice, compact binary representation, so when you send them across the wire, they're cheap to send.
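Going back to the registration flow for a moment, here's a rough sketch of the server side of it: the geo-aware DNS has already routed Alice to a nearby server, so all that server has to do is figure out its own region and record it as her home. The userDb handle is hypothetical; the metadata URL is the one GCE exposes, and other clouds have an equivalent.

```typescript
// Assign a user's home region on first registration.
declare const userDb: {
  setHomeRegion(userId: string, region: string): Promise<void>;
};

async function localRegion(): Promise<string> {
  // GCE instance metadata; other clouds have a similar endpoint.
  const res = await fetch(
    "http://metadata.google.internal/computeMetadata/v1/instance/zone",
    { headers: { "Metadata-Flavor": "Google" } },
  );
  const zone = (await res.text()).split("/").pop()!; // e.g. "europe-west1-b"
  return zone.replace(/-[a-z]$/, "");                // -> "europe-west1"
}

async function register(userId: string): Promise<void> {
  // First registration wins; a traveling user needs a migration path later.
  await userDb.setHomeRegion(userId, await localRegion());
}
```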
About a year ago, we released something called gRPC, g as in Google: Google Remote Procedure Calls. It's a public port of an internal system that we use more or less day to day for everything we do. Services are defined using protocol buffers: you define your service as a set of operations. It supports server-to-client push through long-lived bidirectional RPCs: essentially, the client connects once and can then send and receive across that connection without setting up a new one each time. It's based on HTTP/2 or QUIC; for Duo we used QUIC, and that has been publicized. And the nice thing is it has support for more than ten languages, so when you build something that spans iOS, Android, server side and something else, you don't need to write the same service API library multiple times.

Okay, probably the most critical thing to get this working for real: you need to realize that mobile networks are flaky. I can't stress that enough. They're really flaky. Even when they seem to work, they don't. You have frequent disconnects, you have a lot of timeouts and packet loss, and you just need to deal with it. In practice that means every single API you publish has to be retry friendly. The fancy word for that is idempotent: you need to be able to do the same operation multiple times with the same arguments and get the same outcome as doing it once.

In this example, we have a really crappy retry loop that tries to send Bob an offer. If it succeeds, we break out of the loop, and if it fails, we try again. If this isn't implemented in a retry-friendly way, we're going to send Bob three copies of the offer if he's on a crappy network, because the call may succeed even though the sender thinks it failed. Because of timeouts: the request makes it to the server, the server processes it and sends it on, but the client times out before it realizes that it worked. (A retry-friendly version of this loop is sketched at the end of this part.)

So, a design summary then. We want a rendezvous system based on GCM and APNs. We define delivered as successfully handled by the application. We pick a delivery model that says, yeah, we deliver at least once and in any order, which means the client needs to handle duplicate messages and all kinds of crazy stuff. The lucky thing is that's kind of easy to do on the client; doing it on the server is tricky and really, really hard. Sharding: we just say, hey, let's do it by user, based on the region at first registration. And for the protocol, we make sure to build everything as idempotent gRPCs. Kind of simple, no rocket science, as I said.

So, we remember this one, right? This is where we started, more or less: HTTP, POSTs and GETs and polling, and we already know why this is bad. A reliable gRPC solution looks more or less the same, somewhat. You could even claim that it's a bit more complex, but if we work it through, you're going to realize it's kind of nice and easy to reason about. We have a new server type, CS, the connection server, the first thing you connect to. You have GCM, which is Google Cloud Messaging, just to keep it simple and not have APNs and all of that in the same picture. And you still have Alice's app and Alice's library, Bob's library and Bob's app.
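Here's the promised retry-friendly version of that loop: the client picks the message id once, up front, and reuses it on every attempt, so if a "failed" attempt actually reached the server, the retries are recognized as the same message rather than as new copies. sendMessageRpc is a stand-in for the real send RPC.

```typescript
// Idempotent, retry-friendly send: the same message id on every attempt.
declare function sendMessageRpc(req: {
  messageId: string;
  to: string;
  sdp: string;
}): Promise<void>;

async function sendOfferWithRetries(to: string, sdp: string): Promise<void> {
  const messageId = crypto.randomUUID(); // chosen once, reused on retries
  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      await sendMessageRpc({ messageId, to, sdp });
      return;                            // success, stop retrying
    } catch {
      // Timeouts are common on mobile networks; the request may still have
      // reached the server, which is exactly why the id must not change.
    }
  }
  throw new Error("send failed after retries");
}
```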
Okay, so how does this work? Alice calls bind as soon as the application starts. We want to do it as early as possible, to make sure the connection is ready when we want to send; setting up the connection is going to be the slowest thing. Bind, in this case, is a bidirectional gRPC, so it's going to stay up, and it's the one we're going to use for all of the subsequent operations. As soon as the offer is created, we send it to Bob over the bidirectional bind channel. That's going to hit one connection server, just any random one. We're going to tickle Bob, because we know that Bob isn't around; I'll go into the details soon. GCM is going to forward the tickle to Bob's application, causing it to wake up and actually do something about it, or really, causing Bob to answer. On the other side, Bob does the same bind to set up his connection. As soon as he calls bind, the offer sent by Alice is delivered, and we say this is an offer and it's coming from Alice. As soon as Bob receives it, he calls createAnswer. Once that succeeds, he acknowledges the offer and sends his answer to Alice. We know that Alice is connected, so Bob's connection server can forward the answer directly to Alice's connection server; we don't need to jump through any other hoops. Alice does the same thing: she receives the answer, calls setRemoteDescription, and then acknowledges the answer. So once again we have the peer-to-peer connection up and running. That's pretty simple, not very hard; it's kind of easy to reason about, anyway.

So let's dig into the actual details of how to implement this. This is a bit gritty, but this is the way it is. Again: a connection server, GCM, another connection server. We have a local connection database, we have the global user database, and we have the local message database.

Alice starts by issuing her bind. As soon as she binds to a connection server, we write her connection information directly into the connection table; that's essentially IP, port and some identifier for Alice's state on that connection server. When we bind, we also check the messages table. No one has called Alice, so there's nothing to fetch.

Then Alice wants to send the first message, so she calls send, addresses it to Bob, and passes in the message. We need to verify that, A, Bob is a valid user, and B, where he is, because he may or may not be in the same region. As soon as we know where Bob is, we can issue a select to get the connectivity details for Bob. But as we know, Bob isn't around, so that comes back empty. We also insert the message directly into the database. This is actually the point where we tell Alice that the operation succeeded, and the reason we do it as early as possible is to minimize the risk of an internal error propagating out. As soon as we've returned to Alice, we continue: we call GCM and say, hey, tickle, because we need to reach Bob. At some point further in the future, GCM is going to wake up the application on Bob's side, and that kicks off Bob's end of things.

Okay, so the application is running on Bob's side. He calls bind, and once again we insert his connection info into the database. We check whether any messages are pending: yep, there is a message from Alice, so we're going to deliver that one. Bob handles it by calling createAnswer, and then he calls ack. On ack, we update the state of the message in the database, but we don't actually delete it. That saves a tad bit of overhead, especially on the database, but it also requires some kind of cleanup system that goes around after the fact and deletes all of these old messages.
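To make that walkthrough a bit more concrete, here's a rough sketch of the send path on the connection server, covering both the case where the recipient has a live connection and the case where we only have the tickle. All the handles (userDb, connectionDb, messageDb, push, forward) are hypothetical interfaces for the components in the diagram, and authentication, idempotency checks and error handling are left out.

```typescript
// Sketch of the connection server's send path.
interface Deps {
  userDb: { homeRegion(userId: string): Promise<string | null> };
  connectionDb: { lookup(region: string, userId: string): Promise<{ addr: string } | null> };
  messageDb: { insert(region: string, msg: { id: string; to: string; payload: string }): Promise<void> };
  push: { tickle(userId: string): Promise<void> };
  forward: (addr: string, msg: { id: string; to: string; payload: string }) => Promise<void>;
}

async function handleSend(deps: Deps, msg: { id: string; to: string; payload: string }): Promise<"ok"> {
  // Where does the recipient live? (cacheable global lookup)
  const region = await deps.userDb.homeRegion(msg.to);
  if (!region) throw new Error("unknown user");

  // Look up connectivity and persist the message in the recipient's region.
  const conn = await deps.connectionDb.lookup(region, msg.to);
  await deps.messageDb.insert(region, msg);

  // The sender gets "ok" as soon as the message is durable; delivery continues
  // asynchronously. Always tickle, even when a connection exists, and accept
  // the race between the direct forward and the push.
  void deps.push.tickle(msg.to);
  if (conn) void deps.forward(conn.addr, msg);
  return "ok";
}
```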
So Bob wants to send his answer back to Alice. He calls send, and once again we check: does Alice even exist? That's pretty cacheable, but we do it anyway. We figure out where Alice is connected, and since she is connected, we actually get a result back. Then we insert the message into the database. We know that Alice was connected, so we got her IP, port and ID. That means Bob's connection server can open a direct connection to Alice's connection server and do a direct call saying, hey, here's the message. But the trick is, there's always a tiny, tiny risk that Alice goes away and loses her connection before we manage to deliver it. So we always tickle via GCM, even when we know there's a connection there. There's going to be a race between the delivery of the message and the tickle from GCM; we just need to live with it. Client side, that's kind of easy to do; server side, it would be a pain. Once again, Alice receives the message, handles it, and acks it; we update the state in the database, and it's done.

So how does this really work when you have a bigger system with all the pieces in place? Once again: connection servers, GCM, local state in each region, and the global user database. As I said, the global state is drawn as global only, but think of it as highly cacheable state you can move about. So, again, Bob wants to send his response to Alice. He goes out and asks, where is Alice's home? Then he goes to the connection database in exactly that region and checks: is Alice connected at the moment? And he also inserts the message into the database there. Alice is connected, so Bob's connection server connects directly to Alice's and forwards the message. We do the tickle in parallel to make sure we don't lose her because of a dropped connection, and we have the normal race. Alice receives the answer and calls setRemoteDescription. That succeeds, so she acks, and as soon as she acks, we update the entry in the database to say this is now delivered. That message will never be delivered again. There are of course races, where a message is coming in right as you reconnect and things like that, so it may actually be delivered again; that's why we say it may or may not, and the client needs to deal with it.

Okay, so, back to the overview of the reliable solution. It's kind of easy to reason about, because you don't have polling loops, you don't have strange state; it has clean and simple operations. One thing this doesn't show is that a lot of the operations in this slide and the previous slides are very easy to parallelize, so you can do them at the same time, and you can also do them a bit optimistically: you just try, and if it fails, you just pay the cost of the failure.

There have been a lot of details in here, but there's also stuff we haven't covered, which is equally hard or simple, depending on how you look at it: authentication, identity, load balancing, idempotency and how you actually implement it to make it work, message identity, cache eviction, especially for global state, how do you make sure you evict the stuff that's in the cache so you can update it. And how do you handle poison messages?
In the current system, if a poison message comes in that causes the receiver's client, Alice's client, to crash, we can never get out of that state: it's going to keep crashing and crashing, because as soon as we deliver it, the app dies. So that needs to be handled.

So, the summary of this tiny talk. You've got to remember the details, because that's where the actual complexity is. Don't go home and build a custom rendezvous system; it just won't work, it has been tried once too often, so let's not repeat that. Make sure you have very clear message delivery semantics; they don't need to be exactly the ones I picked here, but make sure they're clear, so the client teams know what to expect. Shard the system based on actual user behavior: for Duo and Hangouts we know how the users behave, but how your users are going to behave, I don't know; you need to figure it out or assume something when you start. Use something really fast and really lightweight as the client-server protocol, preferably something that's easy to use as well, because it's going to make your developers a lot happier. And finally, and I can't stress this enough, make sure all APIs are retry friendly, because if they're not, you're going to have very strange states flying around in your application.

That's it, questions?