So we'll start our next presentation. This time we're going into the formats and codecs area. Our first talk is about Matroska and low latency streaming, by Steve Lhomme. Give him a warm welcome please.

Hello, I'm Steve Lhomme. I created, though not alone, Matroska. For those who don't know, you might know the format as MKV; that's the other name for it, while Matroska is the actual name. Matroska was created as an open format from the beginning. It was developed mostly on the web, by people who for the most part never met each other, mostly over IRC, mailing lists or the Doom9 forum. It mixes a lot of technologies that were emerging when it was created: the idea was to replace AVI (there was Ogg at the time), and we mixed XML, Unicode and the Semantic Web together to create the format. It was developed in a way inspired by what was done on the web: everything was open, everything was royalty free.

The original goal was to have a long-term storage format, something that would still be usable 10 years after it was created in 2002. In 2018 it is still used, so we reached that goal. The other goal was low overhead: since it's inspired by XML, which is very verbose, we tried to remove as much verbosity as possible. And also to be streaming friendly, because that was always part of the design. It's the base format for WebM, which was created by Google and Mozilla a few years ago. As you may know it's also used as MKV, and there's also MKA for audio-only files.

There's also, right now, an IETF working group to standardize Matroska. It's called CELLAR, which stands for Codec Encoding for Lossless Archiving and Real-time transmission. The working group is working on EBML, which is the binary XML that Matroska is using, and on Matroska itself, which is now actually split into three different documents. The numbers might be old, but at the time each project had about 13 contributors and 300 commits, mostly on documentation: cleaning up the documentation that existed and improving it. The working group started in 2015, it's still ongoing, and there's still a bit of work to be done until we have final RFCs for Matroska. The working group is also standardizing FFV1, which is a lossless video format, and FLAC, which I'm sure you all know.

So my talk today is about latency in Matroska, how it can be used, and how it compares to other solutions. First, to make sure we all talk about the same thing, here is the definition of latency I found on Wikipedia: latency is a time interval between the stimulation and response, or, from a more general point of view, a time delay between the cause and the effect of some physical change in the system being observed. So basically, here is a diagram of what happens when you do, for example, live streaming of video and audio. First, for video, you have frames which go into an encoder, and then muxing, which means putting the data in a format where it can, for example, be mixed with audio, and adding timestamps and things like that. Then it goes through the network. On the other side there's the demuxing to get the encoded frames back, which go through the decoder, and then you get the frames. Each of these steps introduces latency. I'm not going to talk about all kinds of latency, for example not the network, but about the ones that are more involved in what Matroska does. So for example when you... OK, that's supposed to be red.
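Since EBML comes up throughout the talk, here is a minimal sketch, in Python, of how an EBML element is laid out on the wire as ID, VINT-coded size, and payload. This is my own illustration under those assumptions, not code from the Matroska or EBML projects; only the 0xE7 Timestamp ID used in the example comes from the Matroska spec.

```python
def encode_vint(value: int) -> bytes:
    """Encode an unsigned integer as an EBML variable-length integer (VINT)."""
    for width in range(1, 9):                    # VINTs are 1 to 8 bytes wide
        # width-1 leading zero bits, a 1 marker bit, then 7*width value bits;
        # the all-ones value is reserved (it means "unknown size"), so skip it.
        if value < (1 << (7 * width)) - 1:
            marker = 1 << (7 * width)
            return (marker | value).to_bytes(width, "big")
    raise ValueError("value too large for an 8-byte VINT")

def encode_element(element_id: bytes, payload: bytes) -> bytes:
    """An EBML element on the wire is just: ID bytes, VINT-coded size, payload."""
    return element_id + encode_vint(len(payload)) + payload

# Example: a cluster Timestamp element (ID 0xE7 in the Matroska spec) holding 0.
print(encode_element(b"\xe7", b"\x00").hex())    # -> 'e78100'
```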
So, video encoding latency. Usually you have frames one, two, three, four, et cetera coming from the camera. They go through the video encoder, and usually, when you have something pre-recorded, the codec can encode the frames in a different order than they are received, to get better coding efficiency. But for live streaming you don't want that; you don't want frames to appear very late compared to when they're supposed to. So for live streaming you actually have to put your encoder in a mode that always sends the pictures in the same order. Basically, live streaming is always a special case of encoding and muxing and everything. It's not the general case of how most people use video, but it's growing, since most people now watch videos on YouTube or Facebook or whatever, and more and more of that goes through the web and networks, so it's becoming more and more important.

So how does Matroska store information? When you look inside a Matroska file you have first a very small header, then the meta seek, which tells you where the other parts are: when you read it at the start you can quickly jump to the part you want. Then come the clusters, the actual audio and video that you're going to read. The cues are for seeking, if you want to jump quickly to a position: if you have that information you can go straight there. Chapters are regular chapters, and tags hold all the metadata; you can tag the tracks but also the chapters with tags from there.

But in streaming you don't have most of these parts. You only have the header and the clusters. The meta seek is useless, since you're not going to seek during live streaming. The cues are useless as well, since you're not going to seek back. You cannot introduce chapters for the same reason, you're not going to seek in the file. The only one that could be used is tags, which you would put before sending the actual data.

WebM is exactly the same as Matroska; at the beginning it just says WebM instead of Matroska, and otherwise it's the same thing with some features removed, so it's basically a Matroska profile. WebM live streaming would look exactly the same as Matroska live streaming. But most people don't actually use live streaming this way; they use adaptive streaming, so that when your connection is better you get a better quality, and when your connection is bad you get a lower quality version. That's why, when you do live streaming in DASH, which is one of the adaptive streaming systems, the audio can for example be split from the video, so you can have a better quality video with a lower quality audio. But the file still looks the same: it's still a header and clusters for the audio, and a header and clusters for the video. The only special thing is the transition points, where you switch to a better or lower quality depending on bandwidth: those boundaries always have to match across all the versions of the same file with different qualities. That's what you're using when you watch YouTube or Netflix; that's the kind of technology they use.
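As a toy illustration of that reordering latency, here is a small Python sketch; the decode order is a made-up IBBP-style example and is not the output of any particular encoder.

```python
# Toy illustration of reordering latency, not tied to any real encoder API.
capture_order = [1, 2, 3, 4, 5, 6, 7]      # frames as they leave the camera
decode_order  = [1, 4, 2, 3, 7, 5, 6]      # order the encoder emits them in

delay = {}
for i, frame in enumerate(decode_order):
    # Frame N can only be emitted once every frame earlier in decode order
    # has been captured, i.e. after capture number max(decode_order[:i+1]).
    delay[frame] = max(decode_order[: i + 1]) - frame

print(delay)  # {1: 0, 4: 0, 2: 2, 3: 1, 7: 0, 5: 2, 6: 1} frames of extra delay
```

In a live, zero-latency configuration the encoder keeps decode order equal to capture order, so every delay above becomes zero, at some cost in compression efficiency.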
But live streaming is different again, because that was about regular files that are stored for a long time; live streaming is even more constrained. Live streaming is basically when you want as little time as possible, as in the definition of latency we saw before, between the moment the frames are captured and the moment the frames reach the user, so you try to remove as much as you can. So what does Matroska look like in live streaming? Basically it's a continuous stream of data coming in; it's not a file you're reading but a stream. This header part is always there and it's the same, and then it's always the same structure of clusters carrying your audio and video.

I took an example of a Fragmented MP4 file and redid it in Matroska to see the difference. The header in Matroska was 460 bytes, and then each cluster has a header, which we'll look at after this, followed by the data; that cluster header is 25 bytes. For Fragmented MP4 you also have a header and then the data: the header was 2,100 bytes, and the equivalent of a cluster in Fragmented MP4 has an overhead of around 850 bytes before the data. So you can already see Matroska saving space.

Here, in more detail, is a cluster, where you actually store the frames; when I say frame it can be video or audio, it's exactly the same thing. In each cluster you have one timecode for the cluster, and each of what's called a block holds a frame and a header. Each block header gives the frame's timestamp and other things. So basically, before you start getting the first frame of a cluster, you have 25 bytes. For Fragmented MP4 the structure is a bit more complex: you have this atom, which consists of more atoms inside and is about 1 KB, and then the actual data where you have each frame. You have to get all of that before you actually get the first frame.

So here, what I call frame latency is how much time, or how much data, which is kind of the same thing, you need before you reach the end of the first frame that you can send to your decoder. For a 40 KB frame you would have 40,025 bytes in Matroska before you can send it to the decoder. For Fragmented MP4 you have to get this 1 KB and then the 40,000 bytes. So again the latency is better in Matroska. Again, this was a real-life sample that I just transformed into Matroska to see the difference, but the pattern is the same, and the lower the bitrate, the bigger the difference it makes to use Matroska.

The other thing you have is buffer latency. As you can see, in Matroska you send one frame after the other, each with its header. In Fragmented MP4 it's different: the header for each frame is actually stored at the beginning of the fragment. In terms of latency, before you can write that part you need to know all the frames you're going to have in that whole big block. That means before you can actually send frame one you need to have all the frames and wait, as we saw earlier. So if you have fragments of two seconds in adaptive streaming, you need to cache two seconds of data before you can send anything, because before you can write this header you need to know all of that. That's, in this example, two seconds of latency, whereas in Matroska it would be just that amount of data, 25 bytes plus the size of the frame; in this case that was 20 milliseconds. So we already see that Matroska has a big advantage here.
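A quick back-of-the-envelope version of those figures in Python; the constants are the numbers quoted from the slides (cluster header, rough moof size, fragment duration), so treat them as illustrative rather than measurements of mine.

```python
# Figures quoted in the talk, used only to show how the two latencies compare.
FRAME_BYTES        = 40_000   # size of the first video frame in the sample
MKV_CLUSTER_HEADER = 25       # bytes before the first frame in a Matroska cluster
FMP4_MOOF_BYTES    = 1_000    # rough size of the 'moof' box describing a fragment

# Frame latency: bytes you must receive before the first decodable frame ends.
mkv_frame_latency  = MKV_CLUSTER_HEADER + FRAME_BYTES   # 40,025 bytes
fmp4_frame_latency = FMP4_MOOF_BYTES + FRAME_BYTES      # ~41,000 bytes

# Buffer latency: the moof lists every sample in the fragment, so the muxer has
# to buffer the whole fragment (here 2 s) before emitting anything, while a
# Matroska muxer can emit each block as soon as its frame exists (~20 ms here).
fmp4_buffer_latency_ms = 2_000
mkv_buffer_latency_ms  = 20

print(mkv_frame_latency, fmp4_frame_latency)
print(mkv_buffer_latency_ms, fmp4_buffer_latency_ms)
```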
The other format that's used a lot in streaming is the transport stream; that's the format that was used in Apple's adaptive streaming format. They insisted on using TS, but they've since moved to allowing Fragmented MP4 as well. So, the same single frame we had in Matroska, how does it work in a transport stream? One frame is cut into 184-byte chunks, and each of these 184-byte chunks is carried in a 188-byte packet, so each has a four-byte header, then a part of the frame, then another header, another part of the frame, and so on. So to get the first frame you have to get all the parts of the frame plus all the headers. In the case of the 40,000-byte frame, the overhead, the extra data you have to download to reach the end of that same frame, is 869 bytes, compared to the 25 bytes in Matroska. So there's also less latency in Matroska than in transport streams.

So basically we saw the difference between Matroska and Fragmented MP4. The advantage of Matroska is a somewhat smaller overhead on the network; there's actually only something like 400 bytes of difference between the formats, so it's not that much. But if you buffer two seconds in adaptive streaming, that's the latency that you're going to have in Fragmented MP4, whereas in Matroska you don't have it. The extra data you need to reach the end of the first frame is something like two or three percent, and that's data you could actually use for better encoding in your codec. And the buffer latency, the amount of data you need before you can send the first frame: compared to Matroska, in MP4 you need the whole fragment, let's say 20 or 40 frames, before you can actually start sending them. That's about it. I hope you have a lot of questions. I have some stickers in my bag, and you can also go to the VideoLAN booth in Building K; there's a wheel of fortune where you can win lots of stuff, and there are some stickers there too. Thank you.

Hello. How are you? I have a presentation just after you, so I'm a bit stressed, but before that I'm curious about how you compute the cluster size before sending the second and third frame. Do you know the size of the clusters when you need to write this size in the bitstream?

So I'm going to go back to show, let's say, here. Yes, as you said, since it's inspired by XML, each element in EBML has an ID, the size of the data that it contains, and then the actual data. But in live streaming there's a feature in Matroska, and actually in EBML, where you can say: this element contains data but I don't know how much, so I'm not telling you the size, and you keep reading until you find the beginning of the next cluster, and then it's a different one. There are different ways to call it, but usually we call it unknown length. So you just send that in your stream even though you don't know the size; you don't need to wait until you know it, you just send it, and on the other side the parser knows it has to handle that unknown size.
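To make that unknown-length answer concrete, here is a minimal sketch in Python; the Cluster ID comes from the Matroska spec, while the function name and stream argument are just illustration.

```python
# Sketch of the "unknown size" feature: an element declares an all-ones VINT
# as its size, so a live muxer can open a Cluster and start appending blocks
# before knowing how big it will get; a reader stops at the next Cluster ID.

CLUSTER_ID   = bytes.fromhex("1F43B675")   # Matroska Cluster element ID
UNKNOWN_SIZE = b"\xff"                     # 1-byte size VINT with all value bits set

def begin_live_cluster(stream):
    """Open a cluster whose final size is not known yet (live muxing)."""
    stream.write(CLUSTER_ID + UNKNOWN_SIZE)
    # Blocks are appended as frames arrive; there is no need to seek back and
    # patch the size field, which is what makes this work on a plain socket.
```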
In that case, how do you know at which level of Matroska you are, at the block level or at the cluster level? Is it from the Matroska IDs that you know it's a cluster?

That's how EBML works: to read an element ID you have to know what its parent is, and that parent's parent, et cetera. So when you find data that doesn't belong to the current parent, you look at the parent above, and the one above that, et cetera, and then you find where it belongs.

Another question? That's it.

MPEG-2 transport streams can be cut at the 188-byte boundary, which means that for contribution sources, before what is distributed, when you have multiple streams for resiliency going into your final encode, you can just mix and match if you've got network problems or other kinds of, say, encoder issues. Are you able or not able to do that with Matroska in a similar way, given the format?

So I have two questions. OK, so about live streaming: have there been thoughts about program metadata yet?

What do you mean by program metadata?

Such as EPG and things like that.

Well, in the case of adaptive streaming the metadata is usually not in the stream; it can be in the manifest. In adaptive streaming the manifest is an external file which describes where to find the media files, and as a general rule it isn't rewritten every time a new segment comes, and they can put all kinds of data in there; it's not related to the actual media data. But in the case of a single live stream, like I said earlier, you could actually put these tags, tags like the author or the date it was produced, all kinds of stuff. You could put them here, and because it's a live stream you could add a new one here during the streaming to actually change the data. For example, if you play a song and then the song changes, you have new metadata that you can introduce in the middle. That already exists for MP3 streaming, where you get ID3, or I don't remember the exact format, in the middle of the stream that you have to parse, and then you get the new song that's playing. It could be done the same way with Matroska.

OK, thank you.

I feel that you are mixing an explanation of the file format with what you put on the wire for doing the streaming. You are comparing Fragmented MP4, which is a file format, with streaming, which is maybe HLS or DASH or whatever. Why don't you compare to the other ways people do low latency streaming, like RTP or MJPEG, which are the low latency solutions nowadays?

Well, compared to RTP: RTP is a network thing, and there are all kinds of formats for RTP, but usually there's not even a container like Matroska involved. What I'm talking about here is either files for adaptive streaming or a file you can still write on the server while people read it, which cannot be done with Fragmented MP4. That's why it's mixed between file format and streaming format: in the case of Matroska it can be both, in the case of Fragmented MP4 it cannot. Compared to RTP, I mean, it's not the same goal. You probably do get better latency with RTP, but then you don't scale as well, or you have other issues, and that's why people on the web at least are not using it; that's not how the technology works there. RTP is used, for example, if you have Twitch or other stuff, I think Twitter live, where the client is actually sending to the server in RTP format, but then for distribution that's not a good format for them, they need something else.

Thank you, Steve.
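As an aside on the EBML parent-resolution answer in the Q&A above, here is a purely illustrative Python sketch of the idea; the tiny ID table is a hand-picked subset of Matroska element IDs, not the real schema machinery.

```python
# Purely illustrative: resolve where an element belongs by walking up the
# stack of open parents, as described in the answer above.
KNOWN_CHILDREN = {
    "Segment": {0x1F43B675: "Cluster"},
    "Cluster": {0xE7: "Timestamp", 0xA3: "SimpleBlock"},
}

def resolve(element_id, open_parents):
    """Pop parents until one of them claims this ID, mirroring EBML parsing."""
    while open_parents:
        parent = open_parents[-1]
        name = KNOWN_CHILDREN.get(parent, {}).get(element_id)
        if name is not None:
            return parent, name
        open_parents.pop()          # ID not valid here: the open parent ends
    raise ValueError(f"unknown element 0x{element_id:X}")

# Example: while inside a Cluster, a new Cluster ID closes the current one.
print(resolve(0x1F43B675, ["Segment", "Cluster"]))  # ('Segment', 'Cluster')
```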