All right, so I'm going to talk about something that product managers don't normally talk about: operations. I'm not going to follow the subject that's in the schedule. A lot of my talk is related to WebRTC, but it's really about the bits around WebRTC, the things that are required to run a WebRTC service, and some of the lessons we've learned along the way.

Just quickly about Twilio, if you're not familiar — wow, we've got really bad resolution here. If you're not familiar with Twilio, we provide a cloud platform for building communications apps of all kinds, with carrier connectivity all over the world. We have over 750,000 developers on the platform and are processing more than 50 billion interactions with our API a year.

If you've been working with WebRTC for a while, you know that the simple case is easy — it's a testament to the strength of the WebRTC API that you can get something simple up and running quickly. A peer-to-peer call between two instances of Chrome is not challenging. But the complexity of a WebRTC service scales very rapidly, and it does not scale linearly. You're introducing more variables into the equation, and more things can go wrong. It starts when you want to talk to more than one browser: if you want to talk to Firefox, your client needs to get thicker to deal with the differences in the WebRTC implementations. If you want more reliable network connectivity or want to support mobile devices, you need to introduce TURN, and you need to get WebRTC compiling for mobile apps. Then if you want to support multi-party, like Emil talked about, you need to introduce media servers, and at that point you'll probably also want to separate signaling from media, because they have very different scaling properties, and move the state of your application out into a separate service. So now we've got even more boxes running, and then someone like me, a lousy product manager, will come along and say, well, we want to connect to SIP devices too, so you introduce a SIP gateway, and the next thing you know you're calling PSTN phones. At each step here we're introducing more things that can and will break, and we're introducing them into a multi-point network all over the world.

Things will go wrong. Networks fail, services fail, hosts fail. Something will break. Something's breaking at Twilio all the time, and we hope you don't know that — we hope we've built a system that can protect customer applications from those failures. We don't have perfect solutions for all of this, but I wanted to share a handful of the tools we've created over the course of running a WebRTC service for a few years now, and hopefully they'll be useful to you.

Things break for a bunch of reasons. In a cloud service, hosts fail at random, networks degrade, and you have to deal with people on the internet trying to do bad things, sending traffic to your servers that you may not expect. You also have the human elements, the things you can control: operational mishaps, inadequate testing, bugs you've introduced because we're human beings. As I mentioned, these things are happening all the time at Twilio. We have tens of service failures a week.
Some alert is going off, something is paging, someone is looking into something, because it is such a large network. But we've been able to deliver 100% uptime for the last two quarters, and we'll have 99.95% uptime for the year in our WebRTC service.

The first and most important part of our strategy for dealing with failure is to test constantly. We have a set of what we call end-to-end testers — if you're ever talking to a Twilion, they might mention our E-to-E tests or end-to-end testers. These are a set of applications that are continually hammering our service the same way a customer would. We learned a few things when we built these. The end-to-end testers run on servers, either in Amazon or in another cloud, and make requests into the Twilio cloud. Our first approach was to use a server-based project like Node WebRTC to try to act like a browser client. We found that that was just really, really difficult — it's hard to keep WebRTC running well on a server right now; server-side WebRTC isn't quite there yet.

So what we ended up doing over time is separating these things. We broke our signaling tests out from our media tests. We have a set of very lightweight, fast tests that are continually making sure the signaling infrastructure is available, and that's really key because a lot of the application logic, of course, lives at the signaling layer. These are tests running in Node that are simple WebSocket clients of our gateway, constantly exercising various application functionality. Then we have a set of what we call robo-callers. These are basically headless browser tests, running the latest version of Chrome, Chrome Canary, Firefox, and any other browser that comes along with WebRTC or ORTC support, acting the way a browser will when it's interacting with our service. These tests run less frequently, maybe once every five minutes, but the combination of these tests gives us a full picture of the availability of our service at any given point in time.

We aggregate events from these tests in a product called Rollbar. Rollbar is a great tool if you're operating any kind of service or application; I highly recommend it. We pull all of these things into Rollbar, and it gives us the ability to aggregate events, very easily identify changes in event patterns, and set off pagers if necessary when things go wrong.

We also have a bunch of end-to-end manual testing tools. Here are just a handful of them; there are others I haven't depicted. The idea here is that automated tests don't always catch everything. As browsers change, as we release new versions of our SDKs, as we make changes in the cloud, you sometimes miss things. So we have infrastructure in place so that the minute we identify that something's gone wrong, we can jump into a set of tools that let us very quickly exercise a very specific scenario: we want to connect to a conference mixer in the Twilio cloud, or connect to a PSTN endpoint, or make a call from a SIP endpoint to a WebRTC endpoint, or make a call between two WebRTC endpoints.
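To make the robo-caller idea a bit more concrete before moving on, here's a minimal sketch of that kind of headless browser test in Node, using selenium-webdriver and Chrome's fake media devices. This is not our actual harness; the test page URL, the element IDs, and the idea of a "call status" element are hypothetical stand-ins.

```javascript
// Hypothetical robo-caller sketch: drive headless Chrome through a WebRTC test page
// with fake media devices, and fail loudly if the call never connects.
const { Builder, By, until } = require('selenium-webdriver');
const chrome = require('selenium-webdriver/chrome');

async function roboCall(testPageUrl) {
  const options = new chrome.Options().addArguments(
    '--headless=new',
    '--use-fake-device-for-media-stream', // synthetic mic/camera, no real hardware needed
    '--use-fake-ui-for-media-stream'      // auto-accept the getUserMedia permission prompt
  );
  const driver = await new Builder().forBrowser('chrome').setChromeOptions(options).build();
  try {
    await driver.get(testPageUrl);
    // Hypothetical test page: a "Call" button plus a status element that the page
    // flips to "connected" once the PeerConnection is actually up.
    await driver.findElement(By.id('call-button')).click();
    const status = await driver.wait(until.elementLocated(By.id('call-status')), 10000);
    await driver.wait(until.elementTextIs(status, 'connected'), 30000);
    return { ok: true };
  } catch (err) {
    return { ok: false, error: err.message }; // feed this into Rollbar / the pager
  } finally {
    await driver.quit();
  }
}
```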
All of those scenarios — conference mixer, PSTN, SIP to WebRTC, WebRTC to WebRTC — are baked, locked and loaded, and ready to go, so we can figure out very quickly when things have gone wrong. One of the things we've started doing more recently that I think is pretty cool is using chat as a control plane for this. We use HipChat, but this can be done in Slack as well. We have a set of very simple bots set up so that we can issue commands in our team's HipChat room, run a set of end-to-end tests against the entire cluster at a moment's notice, and get a report back of which tests passed and which failed.

At the end of this, we end up with a layered approach to service monitoring. We've got end-to-end tests running at both the signaling and media layers. We have an anomaly detection service that is looking for changes in behavior and generating pages or alerts if anything seems to have changed. We have your typical host monitoring through Nagios, and we have this set of manual tools that let us test very quickly.

I've already talked about how things fail, but you not only have to plan for things to fail, you have to plan for what happens when you recover from a failure. Here's what I mean. A simplified view of a portion of Twilio's architecture looks like this: we have a DNS name in front of our gateways that does DNS-based distribution across a number of load balancers in a given AWS region. Behind those we have a set of gateways for signaling, and further back we have a registrar. If you're not familiar with the concept of a registrar, it's just a database that keeps track of where users are connected right now, so that if someone tries to call them, you can ring their phone. This isn't important if you're doing a room-based WebRTC app, but if you want a persistent connection so you can reach someone quickly — a phone-call sort of use case — you need a registrar to keep track of where everyone is.

Obviously we've built redundancy into this architecture. If we've got a client connected to load balancer A and routed to gateway A, and gateway A fails, they can be redirected to gateway B; they can reconnect. If load balancer A fails, that client can connect to load balancer B and get connected to a gateway from there. But what happens when it's not just one client, but thousands or hundreds of thousands or millions of clients connected, and a load balancer just goes away? Because that happens sometimes in the cloud. This is first-hand experience here: what will happen is that all of those clients will start hammering the surviving load balancer, and in a way that you might not have thought of. This happened to us a couple of years ago and caused some significant problems. It led us to think about this failure case: let's look at our peak number of connections and consider what would happen if all of those connections had to move from one balancer to another, or one data center to another, at the drop of a hat. Would we be able to handle that shift? It ended up with us putting extensive rate limiting on both the balancers and the gateways. You also have to think about what happens to your database layer when you're recovering from that failure, and how you handle locking in your database.
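The client-side half of surviving that kind of reconnection storm is usually exponential backoff with jitter, so a fleet of clients spreads its reconnects out instead of stampeding the surviving balancer in lockstep. A minimal sketch, assuming a WebSocket signaling connection and an illustrative register message rather than our actual protocol:

```javascript
// Illustrative reconnect loop: exponential backoff with full jitter so that
// thousands of clients don't all re-register at the exact same instant.
const WebSocket = require('ws');

function connectWithBackoff(url, identity, { baseMs = 500, maxMs = 60000 } = {}) {
  let attempt = 0;

  function scheduleReconnect() {
    attempt += 1;
    const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
    const delay = Math.random() * ceiling; // "full jitter": spread clients across the whole window
    setTimeout(connect, delay);
  }

  function connect() {
    const ws = new WebSocket(url);
    ws.on('open', () => {
      attempt = 0; // connection is healthy again, reset the backoff window
      // Hypothetical registration message; a real signaling protocol will differ.
      ws.send(JSON.stringify({ type: 'register', identity }));
    });
    ws.on('error', () => ws.terminate()); // let the close handler decide when to retry
    ws.on('close', scheduleReconnect);
  }

  connect();
}

// e.g. connectWithBackoff('wss://gateway.example.com/signaling', 'alice');
```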
The database layer matters because you'll be getting hit with a burst of registration requests very quickly, and you have to make sure your registrar can write all of those records down without getting bogged down.

Next, I think we've already touched on this a bit, and Fippo did a great job of talking about some of the ways great WebRTC apps work around lousy networks. We found that as we started using WebRTC on mobile, we needed to get more effective at testing in poor-quality networks so that we could understand the various trade-offs we were making. What is it like to use H.264 versus VP8 on an iOS device, and how does that shift under various network conditions? Are we getting much more out of hardware acceleration in a typical network environment? These are just some of the questions we asked, and we needed a way to inject various network conditions into our environment and measure the results. You can do this on one machine: if you're using a Mac, Apple provides the Network Link Conditioner, which you can download, and it's great if you just want to test through the simulator. It doesn't scale well, though. So what we actually did was create another Node app, called Network Throttler — you can find it on GitHub if you're interested. It's just a Node.js service that issues commands to netem, the Linux network emulator, and it runs on a Raspberry Pi. We have a Wi-Fi interface on a Raspberry Pi running this network emulator, and we can do things like simulate a 3G network, simulate a 4G network, or simulate high loss at various levels. So we have a closed environment set up in our office in Mountain View where we can run tests under various network conditions and really get a sense for how WebRTC, and how our customers, will fare under typical network conditions.

On that point, the next major tool we use is WebRTC's getStats API. Anyone who's operated a communication service of any kind — and a bunch of you in this room have — knows that you will always have complaints about call quality, you will always have reports of one-way audio, you will always have complaints about video call quality, because networks are hard and they're not reliable. So about a year ago we started aggressively gathering statistics from getStats on all of our Chrome endpoints, and Firefox as well. We poll the getStats API, take the statistics we get back, bundle them up in a standard format, and push them to Amazon Kinesis. If you're not familiar with Kinesis, it's a great product for dealing with large data sets and subscribing to different data streams to process them in various ways. We pump the data out of Kinesis into some real-time monitoring tools so that we can see changes in call quality for various customers, and we also pump it into Redshift for historical reporting. So we have long-term reporting and real-time monitoring of call quality via the getStats API. Some of the stats we capture: audio input level, audio output level, all the typical things you would see in getStats. I know getStats is a moving target in all the browsers, and you have to kind of roll your own if you're dealing with WebRTC natively, but I highly encourage you to take a look at this.
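To give a feel for the client side of that pipeline, here's a rough sketch of a getStats polling loop using the spec-style stats shape. Older Chrome versions used a callback form with different field names, and the shipStats function standing in for the push toward Kinesis is hypothetical, so treat this as illustrative rather than exactly what we run.

```javascript
// Illustrative getStats polling loop: every few seconds, pull the stats we care
// about off the PeerConnection, bundle them with call metadata, and hand them to
// a shipper that batches them up to the backend (Kinesis, in our case).
function monitorCall(pc, callMetadata, shipStats, intervalMs = 5000) {
  const timer = setInterval(async () => {
    const report = await pc.getStats();
    const sample = { ...callMetadata, timestamp: Date.now() };

    report.forEach((stat) => {
      if (stat.type === 'inbound-rtp' && stat.kind === 'audio') {
        sample.packetsReceived = stat.packetsReceived;
        sample.packetsLost = stat.packetsLost;
        sample.jitter = stat.jitter;
      }
      if (stat.type === 'outbound-rtp' && stat.kind === 'audio') {
        sample.packetsSent = stat.packetsSent; // zero on a long call smells like one-way audio
      }
      if (stat.type === 'media-source' && stat.kind === 'audio') {
        sample.audioInputLevel = stat.audioLevel;
      }
    });

    shipStats(sample); // e.g. POST to a collector that writes to Kinesis
  }, intervalMs);

  pc.addEventListener('connectionstatechange', () => {
    if (pc.connectionState === 'closed' || pc.connectionState === 'failed') {
      clearInterval(timer); // stop sampling once the call is over
    }
  });
}
```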
Here are some of the patterns we look for. The obvious ones, for a given customer: if we see packet loss for a customer we know works out of a single office location, we know they're probably having really bad call quality right now; the same goes for spikes in jitter or latency. Then there are the more complex ones: calls that are longer than five seconds but have no audio, or have a very low audio input level; or calls that have been up for 30 or 50 seconds but have no packets sent from the browser. These are indications of one-way audio.

A good story here: we were able to use this tool to identify an issue in, I think, Chrome 41. I think the fellow from Wix actually mentioned it. It was a very, very unusual scenario where, if you were using a USB headset on a Mac in Chrome 41, you would have frequent occurrences of one-way audio, and it was more likely to occur if your app was served over HTTPS. The HTTPS part wasn't really about getting access to the microphone; it was that if you call getUserMedia and prompt the user every time, you're less likely to end up with a corrupted audio stream. Using some of these heuristics we were able to say, look, the number of one-way audio calls just increased tremendously with the latest release of Chrome — we should probably dive in and do something there. We also slice all of these things by various vectors: by Twilio account, by the identity of a given endpoint, by which browser they're using. We're trying to identify changes in behavior.

The last thing I want to talk about is that operational excellence — building these tools, thinking about these tools, caring about service uptime — is an issue of culture. It is a team effort, and it does not come for free. Some of the things we do at Twilio to build this into the way we work: we write the SLA first, or at least we try to; we don't always get this right, but when we're designing a new product or service, we start with the service level we want to deliver to our customers. When I started at Twilio there was an old saying, "write the API first," which came from our roots as an API company, and we've modified that over time: not only write the API first, but write the service level first as well. We also have a culture where the teams that build a service operate that service in production. There is no ops team. We have people in support, but the people who build services operate them, so there's a very personal sense of ownership of service quality. Five Whys analysis — I probably don't need to explain what that is. Regular fire drills: actually creating incidents, either in a development environment or, in a controlled way, perhaps even in production, and responding to them, so you have a well-oiled machine when a real incident comes, et cetera.

The last thing I'll say is that communication tools are only as good as their availability. We've all had that experience where we go to use a communication service, a product, and it just isn't there; something doesn't quite work. So this is key. Hopefully you find some of these tools useful. Thanks.

Thanks, Rob. That was really interesting. Why don't we take some quick questions? Anybody? I'll look in the back first. Maybe I'll start, Rob, with a simple one: what would you say is the most common issue that you see with WebRTC?
Right now we still see a lot of one-way audio problems, for various reasons — it could be anything to do with DTLS issues, ICE failures, et cetera. That's probably one of the most common issues, yeah.

Did you come across any best practices for injecting streams during testing?

No, that's something we're still working on. Our media testing is still improving. We do a set of things where we play audio through the browser and try to play it back, but nothing too complicated.

What percentage of your customers are using audio versus video calls?

Almost all of our calls are audio right now. Twilio has been using WebRTC since 2012, but we didn't have a video product in the market until earlier this year. So we do a lot of WebRTC calls, but they're all audio.

Last one. What tools are you using to identify those patterns — when you see there's a flaw, whether it's the browser or something else?

Yeah, like I said, Rollbar and some of our own anomaly detection tools are really key. For identifying issues in the browser, we use getStats a lot, and we use the Selenium-based automated tests on the Beta and Canary channels of each of the browsers. Those are the key things for us.

Awesome. Thanks, Rob. Thanks. All right.