So, to continue on the iOS topic, I'm happy to invite up another one of our gracious sponsors from Twilio, Chris Eagelson. Chris is another really deep expert in this area. He's been doing mobile iOS development since the App Store was introduced, he's been part of a WebRTC startup, and now he's over at Twilio. So I'm really excited that he's going to follow on in iOS and tell us even more. Thanks.

Awesome. Thanks, Jed. So my talk is video communications on iOS, and I'm going to dig a little deeper into what it means to use WebRTC with iOS and talk about a few things we've discovered at Twilio while doing this. iOS at Twilio covers a number of products, but today I'm obviously going to focus on the video SDK. We also have a voice SDK that is not based on WebRTC at the moment, but our video SDK is based on the Chromium implementation of WebRTC. So we take a little bit of a different approach than Eric does: in our case, we consume the C++ SDK and use those interfaces, as opposed to building on top of the Objective-C APIs that exist today.

With that said, our agenda today has three parts. We're going to start with audio: I'll talk about the audio pipeline and also relate a few experiences we had with CallKit. After that, we'll get into video, looking at the video pipeline and the included capturer that ships with WebRTC and lets you use the camera. Then, I think, we'll get into the most interesting part of the talk, which is custom capturers and how you could build them to satisfy your own use cases. Specifically, I'm going to walk through building a screen capturer with you during this talk.

So the WebRTC audio pipeline starts with something pretty important: the Audio Device Module. This is the central place responsible for playback and recording of audio. As Eric mentioned during the Q&A session, there are actually several different implementations of the Audio Device Module. You implement an interface called AudioDeviceGeneric, and there are implementations for different platforms such as iOS, Android, Linux, Windows, and so on.

When it comes to codecs, there are a couple of popular ones you're going to see when using WebRTC, and those are iSAC and Opus. Both of these are software implementations on the iOS platform. iSAC is a lower-complexity codec; it works better at lower bit rates and is optimized mostly for voice. Opus is suitable for a wider range of audio content, including not just voice but potentially live music or anything else you want to throw at it, and it tends to do a pretty good job. The disadvantage with Opus, however, is that since it's a software implementation, it uses more CPU and consumes more resources on your mobile device, which is something to keep in mind. At this point, Opus is the default: unless you go ahead and change the SDP, like you saw in the Android talk, you're going to get Opus just by negotiating a peer connection.

So let's move into iOS and talk about the audio pipeline there. The iOS implementation of AudioDeviceGeneric is called AudioDeviceIOS, and this class is written in Objective-C++. It uses a couple of Apple APIs: AVAudioSession, plus Core Audio and Audio Units. By taking advantage of those last two frameworks, we're able to achieve real-time audio playback and recording.
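As a rough illustration of what that setup looks like at the AVAudioSession level, here's a minimal sketch. This is not WebRTC's actual internal code; the 48 kHz rate and 10 ms buffer duration are illustrative values.

```swift
import AVFoundation

// A rough sketch of the session setup that sits underneath real-time
// playback and recording. WebRTC's AudioDeviceIOS does this internally
// via Core Audio and Audio Units; the values here are illustrative.
func configureAudioSessionForCall() throws {
    let session = AVAudioSession.sharedInstance()

    // Play and record; .voiceChat is WebRTC's default mode (the mode
    // choice is revisited later in the talk).
    try session.setCategory(.playAndRecord, mode: .voiceChat,
                            options: [.allowBluetooth])

    // Ask for hardware settings close to what the pipeline wants:
    // 48 kHz samples and buffers roughly 10 ms long. iOS treats these
    // as preferences, so read back the actual values after activation.
    try session.setPreferredSampleRate(48_000)
    try session.setPreferredIOBufferDuration(0.01)
    try session.setActive(true)

    print("sample rate: \(session.sampleRate), " +
          "IO buffer: \(session.ioBufferDuration) s")
}
```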
So on iOS specifically, we're able to take advantage of a couple of things that are unique to the platform. Every single iOS device that ships has a hardware unit that can do echo cancellation and gain control. So rather than using a software implementation of these features, we just defer to the hardware and take advantage of it at all times.

Another class to be aware of, and this one is in the public Objective-C APIs that you might consume as a developer, is RTCAudioSession. This is a wrapper around AVAudioSession, which is Apple's way of managing audio on the device. Both of these are singletons: there's only one of them in your application, and WebRTC mostly manages it itself, but I'll get into that a little more later in this talk.

Next I want to talk about capturing and encoding on iOS. Something interesting here is that audio capture on iOS runs on a real-time thread. You can see in this diagram that the Core Audio thread over here on the left is responsible for real-time audio capture. This is a thread that you can't create yourself; it's only created by invoking Core Audio and the Audio Unit APIs. WebRTC does this internally, which causes the real-time audio thread to be created, and it goes ahead and captures audio. While this is happening, the voice processing unit, the one doing echo cancellation and gain control for you, produces audio, and we pull samples out of that unit on the real-time audio thread.

Now, there's a bit of an impedance mismatch between what WebRTC wants as a system and what iOS is going to give you. On iOS, we set up the AVAudioSession and ask it for buffers that are approximately 10 milliseconds long, but in reality it's going to give us something slightly different, and WebRTC would prefer, actually must, operate in 10-millisecond intervals. So in order to capture this audio and pass it off to WebRTC for encoding, we need to do a little bit of buffering to compensate for the difference between the roughly 10 milliseconds we get from iOS and the exact 10-millisecond chunks that WebRTC wants. Once we do that, we can fill our intermediate buffer, and when we have 10 milliseconds of audio ready, we deliver it over to WebRTC, at which point the WebRTC thread encodes the audio for us so we can send it to other people.
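Here's a toy illustration of that buffering step, assuming interleaved 16-bit mono samples. `deliver10msFrame` is a placeholder rather than WebRTC's real interface, and a production version would use a preallocated ring buffer instead of allocating on the real-time thread.

```swift
import Foundation

// Accumulates variable-sized chunks of captured audio and emits fixed
// 10 ms frames, which is what the WebRTC audio pipeline expects.
// Illustrative only: a real implementation would avoid allocation here.
final class TenMillisecondPacker {
    private let samplesPer10ms: Int
    private var pending: [Int16] = []
    private let deliver10msFrame: ([Int16]) -> Void

    init(sampleRate: Int, deliver10msFrame: @escaping ([Int16]) -> Void) {
        self.samplesPer10ms = sampleRate / 100      // e.g. 480 samples at 48 kHz
        self.deliver10msFrame = deliver10msFrame
    }

    // Called from the capture callback with whatever buffer size Core Audio
    // produced -- usually close to, but not exactly, 10 ms of audio.
    func append(_ samples: [Int16]) {
        pending.append(contentsOf: samples)
        while pending.count >= samplesPer10ms {
            let frame = Array(pending.prefix(samplesPer10ms))
            pending.removeFirst(samplesPer10ms)
            deliver10msFrame(frame)
        }
    }
}
```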
So that was a basic description of the audio capture pipeline on iOS. At this point, I'd like to talk a little bit about what we've learned from using CallKit. We recently shipped a second beta of our video SDK for iOS, and this version added support for CallKit. It was a pretty interesting experience, and we learned a bunch of things during this development cycle.

One of the things we started with was that the RTCAudioSession I mentioned before has its own view of the world and its own set of state, most of which is based on what AVAudioSession itself is doing. With CallKit, however, there's a wrinkle in this problem: if you go through the CallKit APIs, as Eric explained in the previous talk, Apple is going to activate the AVAudioSession for you. When they do this, it's essentially without WebRTC's knowledge. WebRTC thinks it hasn't activated the audio session yet, but in reality Apple has already gone and done it through CallKit.

Another thing we ran into was sample rate mismatches. If this happens to you, you're going to figure it out really quickly, because the audio is just going to sound like garbage, pretty much like a computer talking to you or an old modem. It's not going to be good. What we discovered is that a particular event was not being handled internally by WebRTC: the very long-named AVAudioSession route configuration change notification. We were able to make a change of just a few lines of code to handle this event, at which point we would always be in sync between the sample rate being produced by Core Audio and what WebRTC actually expects.

Another thing we ran into, which is an issue that has been filed on the bug tracker, was a risk of App Store rejection. One thing Apple is concerned about is private API usage. We looked into this, and what actually happened was that RTCAudioSession uses a selector that happens to conflict with something Apple has privately, something we couldn't have known because we don't have access to their source code. Nevertheless, this was causing some people to get their apps rejected. A simple change here, just renaming a couple of methods, gets you past Apple's App Store review and into the store. That's ticket 6382 in the Chromium bug tracker. If you're interested, feel free to vote for it. Google tends to prioritize tickets based on what you vote for, so your stars count; let them know what you think.

Moving on with CallKit, one of the things that was pretty important for getting CallKit to work properly, and this is a little bit of a hack on our side, but it's the approach we took for our first release supporting CallKit, was to use the manual audio option in RTCAudioSession. What we did was flip this on externally using the Objective-C class RTCAudioSession, and then from there we made a few internal changes in order to satisfy the API contract that Apple has. In particular, there are a couple of things WebRTC does completely automatically. For example, if you open up a peer connection, it starts recording because it knows you want to send audio to someone else. Now, say you close your last peer connection because you're no longer talking with anyone: WebRTC automatically does a bunch of things. It un-configures your AVAudioSession, and it stops the audio unit that drives playback and recording. This can sometimes get in the way of the API contract that Apple wants you to adhere to for CallKit. So we ended up making a few modifications inside the AudioDeviceIOS implementation, and that, coupled with a somewhat hacky use of useManualAudio, allowed us to get full CallKit support and reliable usage of these APIs.
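The manual-audio pattern looks roughly like this in Swift. The RTCAudioSession method names come from the WebRTC Objective-C SDK, so double-check them against the revision you build; treat this as a sketch rather than our exact implementation.

```swift
import AVFoundation
import CallKit
import WebRTC

// Sketch of wiring CallKit's audio-session activation into RTCAudioSession
// when manual audio is enabled, so WebRTC's view of the session matches
// what CallKit has actually done.
final class CallController: NSObject, CXProviderDelegate {

    func enableManualAudio() {
        // Tell WebRTC not to configure or activate AVAudioSession on its own.
        RTCAudioSession.sharedInstance().useManualAudio = true
        RTCAudioSession.sharedInstance().isAudioEnabled = false
    }

    // CallKit activates the AVAudioSession for us; pass that along so
    // RTCAudioSession stays in sync and audio can start flowing.
    func provider(_ provider: CXProvider, didActivate audioSession: AVAudioSession) {
        RTCAudioSession.sharedInstance().audioSessionDidActivate(audioSession)
        RTCAudioSession.sharedInstance().isAudioEnabled = true
    }

    func provider(_ provider: CXProvider, didDeactivate audioSession: AVAudioSession) {
        RTCAudioSession.sharedInstance().isAudioEnabled = false
        RTCAudioSession.sharedInstance().audioSessionDidDeactivate(audioSession)
    }

    func providerDidReset(_ provider: CXProvider) {
        // Tear down any in-flight calls here.
    }
}
```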
Another thing that was introduced in iOS 10, which you may have noticed if you're into AVAudioSession and have used it a lot in the past, is a new audio threading mode. In this mode, if you happen to have an iPhone 6s or a 7, one of the devices that supports the Live Photos feature, they actually change the audio behavior. What they'll do by default, if you don't tell them otherwise, is run multi-threaded audio capture and playback: two real-time threads, one for capture and one for playback. What we did, just because we didn't want to make any more changes to the WebRTC source code, is force things back to the old iOS 9 and earlier threading model. You can do that by using AVAudioSessionIOTypeAggregated, and if you do, you'll get a single thread for real-time audio, which gets you back to the iOS 9 behavior. We did this purely for compatibility reasons: we didn't want more code paths to test and a bigger testing burden. So this is just something we're doing for now; hopefully we'll be able to validate the new behavior with the latest versions of WebRTC and adopt the new way of doing things.

Another thing we discovered that's pretty interesting is AVAudioSession modes. In particular, WebRTC likes to use the voiceChat mode, and as you might imagine, this mode is a lot more useful for voice chat than it is for video chat. What we do by default at this point, and we actually give you a choice in our SDK, is use the videoChat mode. This is pretty much the preferred mode if you're making audio and video calls. For example, if I'm in a call, I'm typically holding up my phone and using the front camera. If I'm doing that, the microphones I actually want to use are the ones on the top, the ones pointed toward your face as you speak into the camera. In the voiceChat mode, what you actually get is the bottom array of microphones, and that's not great for a video call: you're not speaking into those microphones, they pick up indirect sound, and you're just not going to get very good voice intelligibility. So my recommendation is to use the videoChat mode if possible. We had to make, I believe, only a few lines of code changes in WebRTC to support this: there's a little bit of code that selects mono playback and recording, and it's keyed off of the voiceChat mode. If you make that small change, you'll be fine, and you can take advantage of the videoChat mode, which is pretty much the preferred option on iOS for a video call.

So as I mentioned, we just released a beta with support for CallKit over the past couple of days. We've also been talking to Google about potentially contributing some of the CallKit changes we made back to Chromium WebRTC. That's our plan moving forward: to get at least some of these changes back into Chromium so we don't have to keep porting them forward. No one wants that burden, so we'd rather put it upstream and let everyone take advantage of these improvements. So that was CallKit.
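Before moving on to video, here's what those two audio-session tweaks, the videoChat mode and the aggregated IO preference, look like from application code. This is a small sketch, not the changes we made inside WebRTC itself.

```swift
import AVFoundation

// Prefer the face-facing microphones and single-threaded audio I/O for a
// video call. Sketch only; add real error handling in production code.
func configureSessionForVideoCall() throws {
    let session = AVAudioSession.sharedInstance()

    // .videoChat tunes the microphones for speaking toward the front
    // camera; .voiceChat favors the bottom microphone array instead.
    try session.setCategory(.playAndRecord, mode: .videoChat,
                            options: [.allowBluetooth, .defaultToSpeaker])

    // On devices where iOS 10 splits capture and playback onto two
    // real-time threads, this asks for the older single-thread behavior.
    if #available(iOS 10.0, *) {
        try session.setAggregatedIOPreference(.aggregated)
    }
}
```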
For the next section of the talk, I'm going to move on to the video pipeline, and I'm going to start with capture. First, a definition of what capturers and renderers are. Capturers you can think of as sources that produce raw video, but along with that, they also produce timing information and other metadata that's important when you're doing video capture. Renderers, on the other hand, consume raw video, and usually the reason you want to consume raw video is just to draw it to the screen. But I'm sure you're all developers; you can come up with some pretty cool things to do with renderers besides drawing to the screen. Nevertheless, that's what they do by default.

Moving on to codecs: an encoder is basically a module or block that consumes raw video and, on the other end, produces an encoded bitstream that you can transmit over the network using RTP. Decoders, on the other hand, consume encoded video and produce raw decoded video, which you might send to a renderer in order to display it.

On iOS, there are three codecs to be aware of, and they're pretty important: VP8, VP9, and H.264. There are a couple of different approaches to implementing these codecs on iOS. Apple only provides hardware for H.264, so as you might imagine, the VP8 and VP9 implementations are done in software. Chromium WebRTC relies on a dependency called libvpx, which includes support for both VP8 and VP9. Because these implementations are done in software, there are a couple of constraints on the system. If you have captured buffers coming out of your camera, for example, they aren't always in the format these software encoders expect, so you have to take these buffers, which reside in hardware-mapped memory, do a format conversion, and bring them into main memory. In the case of the WebRTC video pipeline, all of its software encoders expect I420 buffers, so if you're not in that format, you have to make a conversion to get to the format they need as input.

On the H.264 side, we're able to take advantage of something called VideoToolbox. This is a framework that exists on both the Mac and iOS, and it lets you use the hardware H.264 encoders built into your iOS device. This is an extremely popular feature. I believe it started with WebRTC 45, when it was first introduced behind a flag, and since then it's gotten 80 stars on the bug tracker, last time I checked. So this has been an extremely popular issue that everyone's voted for, Google has listened, and they've done a really good job of improving this implementation over the past 10 releases of Chromium. As of 55, you're able to get something called zero copy, by which I mean the buffers coming out of your camera or video source are not copied on their way to the encoder. Just by doing reference counting, you can pass these buffers all the way through the system without copying them and feed them to your encoder. So this is pretty cool.

Another thing to note about the current H.264 implementation on iOS is that it supports the baseline profile. There is a ticket you can take a look at about adding support for the high profile on iOS, and another ticket about negotiating high-profile H.264 in your SDP. Feel free to take a look at those issues and vote for them if that's something you're interested in.
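To make the format question concrete, here's a small, illustrative check on a captured CVPixelBuffer. It isn't WebRTC's actual decision logic, just the distinction the pipeline cares about.

```swift
import CoreVideo

// The camera typically produces NV12 (biplanar 4:2:0) buffers. The hardware
// H.264 encoder can take those directly; the software VP8/VP9 path wants
// I420 in main memory, which means a conversion and a copy.
func isNV12(_ pixelBuffer: CVPixelBuffer) -> Bool {
    let format = CVPixelBufferGetPixelFormatType(pixelBuffer)
    return format == kCVPixelFormatType_420YpCbCr8BiPlanarVideoRange
        || format == kCVPixelFormatType_420YpCbCr8BiPlanarFullRange
}

func describeEncoderPath(for pixelBuffer: CVPixelBuffer, usingHardwareH264: Bool) {
    if usingHardwareH264 && isNV12(pixelBuffer) {
        print("Zero copy: hand the CVPixelBuffer straight to VideoToolbox.")
    } else {
        print("Conversion needed: map into main memory and convert to I420.")
    }
}
```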
All right, so that was a summary of the video pipeline. Now let's take a look at what actually happens when you pass buffers through the pipeline using a software encoder. All of this starts with a capturer. In this case, the capturer on the far left here is a camera capturer, and it's producing buffers in a format called NV12. Those buffers are also mapped to the hardware, so they're buffers that could be shared between every other piece of the system if you let them. Unfortunately, in this case, because we're using a VP8 encoder, we have to do a conversion into I420, and at the same time we do that conversion, we also put those buffers into main system memory.

If you look at the rendering side, this is actually looking pretty good in WebRTC 55. If you just use RTCEAGLVideoView, which Eric talked about in the last session, what you get is zero-copy rendering by default on iOS, which is pretty cool. It passes those buffers into the renderer, and the renderer can use them directly as OpenGL textures. Because they already exist and are already mapped into hardware, you don't need to make another copy to use them in your renderer. The renderer just runs a simple shader, which does a format conversion for us from NV12, the format we captured in, to RGBA, which we need to display to the screen.

Let's contrast that with H.264-based encoding. You can see things have gotten a little nicer here; from the colors, you can tell there's no red, which is good, because it means we don't have to do any conversions. The rendering side is pretty much the same path, and we're getting zero copy, which is great. On the encoder side, we're able to seamlessly pass these buffers into the H.264 encoder, and at that point we just let VideoToolbox encode them without making another copy.

OK, so we've now talked about the video pipeline and looked at a few basics of how it's set up on iOS. Now let's talk about what's actually included in terms of capturers that ship with WebRTC. If you've been using the Objective-C APIs to develop an iOS app, whether it's written in Swift or Objective-C, you may have noticed RTCAVFoundationVideoSource. This source allows you to use the camera; it's pretty much as simple as that. It has an Objective-C API, and it lets you use the front and rear cameras. I mentioned just front and rear, and that means you can't, at this point, use some of the new cameras that were added on the iPhone 7. If you want to use the telephoto lens, for example, you can't do that. It currently uses an older set of APIs that work on iOS 9 and below, so it's just not going to give you access to those other cameras, or to the fused camera that combines the wide and telephoto cameras on the iPhone 7. That being said, this is still a pretty good capturer. It supports a bunch of different capture formats out of the box: you can use what are called video constraints to select anywhere from about 360p to 720p. So you can tailor this to your application, whether you want high definition or standard definition; it's up to you, and it's all supported out of the box. Internally, this capturer relies on something called AVFoundationVideoCapturer, and in this case, that class is a C++ cricket::VideoCapturer subclass.
What it does is act as a bridge from the Objective-C capturer into the rest of the capture pipeline. Now, you might think you could use this to write your own external capturers and just take advantage of this C++ class to do everything you want, but unfortunately it is specific to the requirements of using AVCaptureSession and AVFoundation.

So with that said, let's talk a little bit about use cases for custom capturers. Why might you want to do this? I'll give you a couple of reasons. The first one, which is pretty important and a lot of people have asked about, is screen sharing. Another: imagine you have a video game that renders OpenGL or Metal content, and you want to share that content with someone else. To do that, you might need to write, or rather you would need to write, your own custom capturer. Another use case that could be important, depending on your product, is that maybe you have a video sitting locally on your computer or iOS device, or maybe you're streaming a video to your iOS device, and you want to rebroadcast it using WebRTC. All of these things require you to write your own capturer, and later in this talk we're going to discuss how to actually do that. Getting back to the camera use case, one thing you can do with the camera that's a little more advanced than what you get out of the box with Google's implementation is live image processing, or full manual controls: things like selecting the fused camera that combines the wide and telephoto lenses on modern iPhones, or manual exposure, zooming, plenty of things like that that just aren't supported out of the box. If you write your own capturer, you can accomplish these things for your product.

So what we did at Twilio is set out to build what I would call a generic video capturer. The actual name we gave it was the CoreVideo capturer, and the reason for that is that this capturer acts as a bridge between the Objective-C world and C++, and what it takes as input is CoreVideo buffers. It's basically taking the Core Video world of iOS and bridging it down into WebRTC. This CoreVideo capturer does pretty much everything you'd want from a WebRTC capturer: it's able to list formats, it provides metadata about the frames you're going to push through the pipeline, and it supports zero copy wherever possible.

All right, let's get into some coding and look at the API we developed for this capturer. It's pretty simple; it only has four methods. It allows you to list the formats that you're going to support, which includes things like dimensions, frame rates, and pixel formats. All of these together comprise a video format, and a list of these formats describes what your capturer actually supports. Another property we have in our generic capturer is screencast. This indicates to the downstream pipeline whether you're going to produce screen content or not, and the video pipeline below adjusts accordingly: for screen content, we want to focus on the clarity of the image rather than frame rate and motion. It also has methods to start and stop capture, which we'll get into a little further on the next slide.
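To give a feel for the shape of that API, here's a hypothetical Swift rendition of a generic capturer interface. These names are made up for the example; they are not taken from our actual SDK or from WebRTC.

```swift
import CoreMedia
import CoreVideo

// Illustrative sketch of a generic capturer interface: the four capabilities
// described above, with made-up names.
struct VideoFormat {
    let dimensions: CMVideoDimensions   // e.g. 1280 x 720
    let frameRate: Int                  // frames per second
    let pixelFormat: OSType             // e.g. kCVPixelFormatType_32ARGB
}

protocol VideoCapturer: AnyObject {
    // Every format the capturer can produce.
    var supportedFormats: [VideoFormat] { get }

    // Tells the downstream pipeline to favor clarity over motion.
    var isScreencast: Bool { get }

    // Start producing frames in `format`, possibly asynchronously on the
    // capturer's own thread; call back once startup has finished.
    func startCapture(format: VideoFormat, completion: @escaping (Error?) -> Void)

    func stopCapture()
}
```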
So when you start capture, you're handed something called a video consumer. This consumer allows you to deliver frames down to the pipeline, and it also allows you to signal when your capturer has finished starting. This diagram helps explain what's going on here. Imagine your capturer, which lives on its own thread, is going to start asynchronously. WebRTC comes up on the worker thread and tells you it's time to start capturing. When this happens, your capturer can go off on its own thread and start asynchronously; when it's done, it signals back to WebRTC that it has started. At that point, the downstream pipeline knows everything it needs to know: that you're ready to deliver frames, and that you were able to start in the format that was requested from your capturer. Once your capturer has started, you'll be producing frames, hopefully on your own thread, and you can deliver those frames over to the WebRTC worker thread, where they go through the pipeline to renderers, to encoders, and so on.

Next, I'm going to break things up a little with a live demo. A screen capturer is what we're going to be building for the rest of the talk, so before we get into the code, I'm just going to show you what this thing actually looks like and how it works. All righty. What I've done here is load up WebRTC.org in my simulator; let's make that landscape so you can all see it. As I scroll down, my capturer is just capturing the video from the screen and rendering it in the bottom corner there. You can see as I scroll that it's capturing at a low frame rate, five frames a second at this point, and giving me a captured version of this screen that I can then share with others over a peer connection.

So that's the screen capturer; let's get into how to build it. What this thing actually does is share the contents of a UIView. It does that by periodically drawing the view hierarchy, and it uses a CADisplayLink timer in order to draw at a very specific time interval. We don't want to draw a particular frame more than once; we only want to draw it a single time, so let's be efficient here. We use this display link timer, which is exactly synchronized to the v-sync on your phone. It also uses a UIGraphics image context (UIGraphicsBeginImageContextWithOptions and friends) to draw the view. This is a pretty simple solution that doesn't require you to write a lot of code or manage a lot of your own buffers, so we're just going to go with it for the purposes of an example.

Let's take a look at the drawing loop next. The drawing loop is pretty simple, and for a screen capturer it makes sense to operate on the main thread. We have a source, which is the display link timer, all the way at the far end of the diagram here. That display link timer operates on the main thread, and it calls you back periodically when it's time to draw the screen. So you go ahead and draw your UIView.
At that point, you take what you get out of this drawing, which is a UIImage, and convert it into a CVPixelBuffer, at which point you can signal that buffer over to WebRTC, let it do its thing with rendering and encoding and all that good stuff, and return control to the main run loop. There it waits for the next display link timer to fire, and you draw your view again and repeat the whole process.

The first thing we do with this screen capturer is describe what we actually support to WebRTC. Apologies if these slides are a little small in the back; we've got a lot of code here. We start by telling WebRTC whether we're a screencast or not. In this case it's a screen capturer, so obviously we are, and we return yes there. Next, we describe the formats we support. In this example capturer, we only support a single format, so describing it to WebRTC is pretty easy. We're going to rasterize at a 1x scale. As you may know if you've heard of Retina screens, iOS devices can render at a lot of different scales; for our screen capturer, we'll just render at 1x, which is a simple, not terribly CPU-intensive way of drawing the screen in order to share it with other people. The format we'll render in is called ARGB, so we indicate that to WebRTC. We also target a frame rate of five frames a second rather than 60 or something higher. Again, this is just to conserve CPU, and because you're sharing the screen with other people, you don't care as much about motion, or at least for this particular example you don't need to. So we indicate that we're capturing at five frames a second, and because of the rasterization scale, our dimensions are equal to the screen, where points are equal to pixels.

Now that we've done that, we move over to the drawing loop I showed you earlier and actually draw the UIView. The way we do this is to wrap everything in an autorelease pool to clean up any temporary memory, and then use the UIGraphics image context to draw the contents of the view into a UIImage. After we've done that, we just clean up the context, grab the image, and move on. For the next step, we have a UIImage, which is pretty good. What we can get from the UIImage is a CGImage, or Core Graphics image, and once we have that, we can get access to the underlying data that represents the pixels. Now, in this particular example, I mentioned we're using the UIGraphics context to draw, so what we're going to do here is actually make a copy. This isn't great for performance reasons, as I explained earlier in the talk, but it does simplify the example a little because we don't need to worry about providing our own buffers to draw into. We just use this built-in system functionality, grab an image out of it, and make a copy. With that copy, we create a CVPixelBuffer and pass it into WebRTC. The invocations we make here just create a pixel buffer with some existing bytes, and we pass in the data we copied from the UIImage. That copied data now has the same lifecycle as the CVPixelBuffer: when the pixel buffer has flowed all the way through the pipeline and is ready to be destroyed, we then destroy the underlying data we copied.

All right, so we've done pretty much all we need to do here, and the final step is delivering this buffer to WebRTC. We just package it up in terms of the image data and the metadata, and that's pretty simple. In our case, we don't do anything fancy when we draw, so our images are always oriented upward and there are no rotation tags; we just indicate a zero-degree rotation, so WebRTC knows it doesn't need to rotate these buffers and doesn't need to pass rotation tags on to the other parties over our peer connection. Next, we timestamp the frame, using the display link timer's timestamp as our capture time: we take that timestamp and convert it into units that make sense for WebRTC. And then, finally, we pass the buffer off into the WebRTC video pipeline. At this point we can relinquish control of the buffer and let WebRTC do its thing, retaining the buffer as needed; when it makes its way all the way through the pipeline, it's released and free to go.
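Pulling those steps together, here's a condensed Swift sketch of the whole loop. It differs slightly from the code on the slides: it uses UIGraphicsImageRenderer and draws directly into a freshly created BGRA CVPixelBuffer rather than copying bytes out of the UIImage with CVPixelBufferCreateWithBytes, and `deliverFrame` stands in for the consumer/pipeline hookup.

```swift
import UIKit
import CoreVideo

// Simplified screen capturer: draw a UIView on each CADisplayLink tick,
// copy the pixels into a CVPixelBuffer, and hand it off with a timestamp.
final class ViewCapturer: NSObject {
    private weak var view: UIView?
    private var displayLink: CADisplayLink?
    private let deliverFrame: (CVPixelBuffer, CFTimeInterval) -> Void

    init(view: UIView, deliverFrame: @escaping (CVPixelBuffer, CFTimeInterval) -> Void) {
        self.view = view
        self.deliverFrame = deliverFrame
        super.init()
    }

    func start() {
        let link = CADisplayLink(target: self, selector: #selector(tick(_:)))
        link.preferredFramesPerSecond = 5            // clarity over motion
        link.add(to: .main, forMode: .common)
        displayLink = link
    }

    func stop() {
        displayLink?.invalidate()
        displayLink = nil
    }

    @objc private func tick(_ link: CADisplayLink) {
        guard let view = view else { return }
        autoreleasepool {
            // Rasterize the view hierarchy at 1x so points equal pixels.
            let format = UIGraphicsImageRendererFormat()
            format.scale = 1
            let renderer = UIGraphicsImageRenderer(bounds: view.bounds, format: format)
            let image = renderer.image { _ in
                view.drawHierarchy(in: view.bounds, afterScreenUpdates: false)
            }
            if let buffer = makePixelBuffer(from: image) {
                // The display link's timestamp serves as the capture time.
                deliverFrame(buffer, link.timestamp)
            }
        }
    }

    // One copy: render the drawn image into a newly created BGRA buffer.
    private func makePixelBuffer(from image: UIImage) -> CVPixelBuffer? {
        guard let cgImage = image.cgImage else { return nil }
        let width = cgImage.width
        let height = cgImage.height
        var buffer: CVPixelBuffer?
        guard CVPixelBufferCreate(nil, width, height,
                                  kCVPixelFormatType_32BGRA, nil,
                                  &buffer) == kCVReturnSuccess,
              let pixelBuffer = buffer else { return nil }

        CVPixelBufferLockBaseAddress(pixelBuffer, [])
        defer { CVPixelBufferUnlockBaseAddress(pixelBuffer, []) }

        guard let context = CGContext(
            data: CVPixelBufferGetBaseAddress(pixelBuffer),
            width: width, height: height, bitsPerComponent: 8,
            bytesPerRow: CVPixelBufferGetBytesPerRow(pixelBuffer),
            space: CGColorSpaceCreateDeviceRGB(),
            bitmapInfo: CGImageAlphaInfo.premultipliedFirst.rawValue
                | CGBitmapInfo.byteOrder32Little.rawValue
        ) else { return nil }

        context.draw(cgImage, in: CGRect(x: 0, y: 0, width: width, height: height))
        return pixelBuffer
    }
}
```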
So with that said, we've completed the screen capturer and built the demo that you saw. If you follow this link, we have a couple of interesting things. The first is a CallKit example, which is available today: just follow the link to our video quick start for Swift, which shows you how to use the CallKit APIs with our SDK. What we'll be adding soon is a screen capturer example similar to what you saw today, written in Swift. So with that, thank you very much.