Hey, good morning, everyone. My name is Jan. I'm here to tell you a bit about a few projects that I've done on the side over the years. In my day job, in my spare time, and every other minute when I'm not doing something else, I am a GStreamer developer, working on the framework that many appliances and many computers use for playback or delivery of multimedia. If you're processing video and audio in some way, then GStreamer is a useful tool for that, but I'll talk a bit more about that shortly.

What I wanted to talk about today is a couple of projects that I do in my spare time. Around this time of year, the GStreamer Conference is on somewhere in Europe, and I like to challenge myself before that conference by coming up with some project that I've wanted to implement for a while and then proposing a talk, thus giving myself a deadline for implementing something towards that goal. This is a couple of things along that track, with a common theme of automating my house. I've never wanted to buy a smart house or integrate a lot of other people's appliances; the challenge is to build it myself.

So I've got this vision, by now a pretty common, well-understood one, of a house that you can talk to. You can ask it to do things for you, like turn the lights on and off. You can ask it to play music for you. And it's aware of your presence and your needs, and it understands what it is you want when you ask it to do things. There are commercial offerings along that line that you can buy off the shelf, and that many of you probably have: the Google, Amazon and Apple smart assistants that you can drop into your house, with integrations for a lot of external devices you can buy. But they have a problem for a free software or open developer: they are all closed, isolated silos, and I really think we need a solid open ecosystem built around this. And they're all cloud-based, so everything you utter in your house gets streamed to cloud servers somewhere for voice recognition, and stored, and, as we've seen in the last year or two, potentially listened to by complete strangers monitoring what's going on in your house. We hear about subcontractors of Amazon hearing things going on in people's houses that they never should have heard. I'd like us to be able to take that control and own it ourselves.

And there is a bunch of free software initiatives moving in that direction. One that I like a lot is Mycroft, an open source smart speaker offering that you can download and try on a Raspberry Pi, plus various other underpinning projects that tie in with that smart speaker theme. There's the Mozilla DeepSpeech project for doing speech recognition, and there are multiple options for doing text back out to speech so it can reply to you. There's the Home Assistant framework for the lower-level control once you've got some commands you want to execute. And there are even lower levels: the MQTT message transport, which there have been some talks about elsewhere at this conference. And there's lots of cheap hardware for building this stuff on. Ten years ago, it was so much harder to try to do this yourself; now there are a million cheap SBCs and SoCs, and there are smart switches you can buy that jump on your Wi-Fi and let you take control of appliances at the electrical level.

But this talk is about GStreamer. So, what is GStreamer?
I normally give this talk, or these talks, at the GStreamer Conference or another conference where I know people are already familiar with what GStreamer is, but this is the first time I've spoken at ELC, so I'd like to talk a little bit about it. At its heart, GStreamer is a data processing framework, on top of which we have built specific data processing plugins for doing video and audio. That's not its only use; there are examples of people doing things that are not video and audio with GStreamer. But the heart of it is that you build a processing chain of components, you link them together, and they feed data buffers between themselves and do conversions of some kind. With the right plugins, you can do any kind of media transformation you can imagine, and with the right application on top of that, you can reconfigure pipelines on the fly and do quite elaborate things. There's a very simple example in the GStreamer documentation of what an Ogg/Vorbis audio player pipeline looks like inside: Ogg the container, Vorbis the audio codec. Pipelines can do audio codecs; we have plugins for recording and playback, audio conversion, resampling, all the things you'd expect. We have plugins for streaming your media across the network over different protocols, and most recently, in the last couple of years, a WebRTC stack, which means you can integrate tightly with hardware encoders to connect to web browsers without having to pull in the Google WebRTC libraries and their somewhat finicky hardware integration model.

Okay, so the first side project I wanted to talk about. Way back, seven or eight years ago, I first came across a Sonos system. It was kind of the first of these boxes where you buy individual smart speakers, drop them around your house, and they talk to each other and play synchronized audio. They give you that music in every room; you can divide your house up into zones and turn the music on and off, and they give you a remote control on your phone, or by voice through a smart speaker. It reminded me immediately of a toy that we had built almost a decade earlier, around 2005 or 2006, doing a similar sort of thing: network-connected devices that had to synchronize against each other. So I came up with this project called Aurena, and its goal was to give me that kind of distributed, house-wide media player.

How does that work in practice? One of the key pieces of GStreamer is its abstraction of playback timing and clocking. It's a very elaborate system: it lets you do all kinds of things like playing back at different rates, playing subsections of an input and time-shifting them, or playing things in reverse, the trick modes. The basic problem with trying to play sound out of separate devices is, one, how do you line up the sound so that it sounds natural? You don't want the outputs too far apart or you get echoes and it sounds like a stadium. And also, when you're playing something for a really long time, every clock in the world runs at a slightly different rate, so you get clock drift: after playing for a while, even if you started synchronized, they won't be synchronized anymore. The idea, as many of you probably know, is that you can synchronize your clocks across the network using protocols like NTP, and GStreamer has a built-in piece like that, an NTP-like protocol. These days we also have PTP and NTP implementations in GStreamer.
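For anyone who hasn't used it: the clock distribution piece is only a few calls at each end. Here's a minimal sketch using the generic GStreamer API, with an arbitrary port number picked for the example; it is not Aurena's exact code.

```c
/* Minimal sketch of GStreamer network clock distribution.
 * Build with: gcc $(pkg-config --cflags --libs gstreamer-net-1.0) */
#include <gst/gst.h>
#include <gst/net/gstnet.h>

/* Server side: answer clock queries over UDP on port 8554 (an
 * arbitrary choice for this example, on all interfaces). */
static GstNetTimeProvider *
publish_clock (void)
{
  GstClock *clock = gst_system_clock_obtain ();
  GstNetTimeProvider *provider =
      gst_net_time_provider_new (clock, NULL, 8554);
  gst_object_unref (clock);
  return provider;
}

/* Client side: create a clock that tracks the server's clock, and
 * block until the first exchanges have produced a usable estimate. */
static GstClock *
get_net_clock (const gchar *server_host)
{
  GstClock *clock =
      gst_net_client_clock_new ("net-clock", server_host, 8554, 0);
  gst_clock_wait_for_sync (clock, GST_CLOCK_TIME_NONE);
  return clock;
}
```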
So there are a couple of different ways to do it, but you can have one clock somewhere on the network, and then multiple listeners that track that clock and accurately replicate it on other machines. Now you have the key piece for synchronizing devices across the network.

Here is my professional sketch of what Aurena should look like; I don't know how visible that is up on the big screen. It's a rough sketch: there's a server somewhere with some pieces inside, command and control, configuration, a database of available media, and some interfaces that clients can connect to. Clients are either a sound output device, a controller device, or a combination of both. I built a few controllers: there's a web interface, or you can use a native GTK application, and the sound output is a simple command-line daemon that connects to the server and listens to be told what to do. Then you put it into practice and you get something that is, hopefully, synchronized across the network in some respect.

With a little prayer for doing demos in a completely new environment, let's see if I can get something here. If I scroll across all of my other available windows... okay, so I have the web browser connected to a server running on my laptop, then I have a client on my laptop, that's pimiento, and I have another little box here that's just connected over an Ethernet cable for today's purposes. You can see you've got the world's simplest interface: I can hit play, I can hit next track and get a random shuffle, I have a global volume control and individual volume controls, and I also get these little reports about the clock synchronization, so we can tell how closely these devices are going to attempt to synchronize. That's pretty good: we're getting under a millisecond of error across the Ethernet link. It gets a bit worse on Wi-Fi, up to 10 or 20 milliseconds or worse, and that's getting into audible range, but at that point you're more worried about the physical separation of the speaker devices and the time delay of the audio travelling through the air. So that's pretty good. If I hit play, I don't know quite what's going to happen... so this device is plugged into the speaker output that I've been given, but I don't hear that coming out anywhere. No, that's not coming out, but it is coming out on my laptop.

So that's Aurena as I first put it together for a GStreamer Conference a few years ago. I've added things to it since then, and we've made various improvements to the clocking, but there are still pieces I've never quite gotten around to; these projects tend to turn into sketches of something that then get left idle, and I push them forward every now and then. It can do video: if you put video files in your media collection, you get synchronized video playback as well, and at a millisecond of error, or even 10 or 20, that's frame-accurate video playback across your house. It will do multi-channel audio, but at the moment every device plays every channel. One of the things I always wanted was to add a mapping layer where you could turn each speaker you add to a room into front-left or front-right and tell it which channels to play, so there are things I could do in that direction. And zone support: everything just joins one zone at the moment, so if you want to play something different on other speakers, that's still work to do.
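To make the playback mechanics concrete: once every client slaves its pipeline to that shared clock and the server hands out one common base time, every device renders the same sample at the same moment. A minimal sketch, assuming the base time arrives over the server's control connection (how Aurena actually signals it is not shown here):

```c
#include <gst/gst.h>

/* net_clock comes from gst_net_client_clock_new(); base_time is one
 * number the server picked and sent to every client out of band. */
static void
play_synchronised (GstClock *net_clock, GstClockTime base_time,
    const gchar *uri)
{
  gchar *desc = g_strdup_printf ("playbin uri=%s", uri);
  GstElement *pipeline = gst_parse_launch (desc, NULL);
  g_free (desc);

  /* Use the shared clock instead of letting the pipeline pick one */
  gst_pipeline_use_clock (GST_PIPELINE (pipeline), net_clock);
  /* Keep the base time we set, rather than recalculating it on
   * every state change */
  gst_element_set_start_time (pipeline, GST_CLOCK_TIME_NONE);
  gst_element_set_base_time (pipeline, base_time);

  gst_element_set_state (pipeline, GST_STATE_PLAYING);
}
```

With the same clock and the same base time on every box, "running time" means the same thing everywhere, which is all the audio sinks need in order to line up their output.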
So that's Aurena. Now let's talk about another thing that I've done. It's pretty easy with GStreamer to connect up to a video camera, and it's also pretty easy to do things like standing up an RTSP server that publishes that camera onto your network. I do that pretty often: I've got a bunch of Raspberry Pis and other SBCs, you put GStreamer on them, take whatever camera they have, and publish it as RTSP. So there's a bunch of RTSP devices just hanging off my network, hand-built with whatever I had lying around, and they do things like letting me watch the sheep. (Standing one of these servers up takes only a handful of lines; there's a code sketch a little further down.) We do fisheye lenses, too; that's kind of fun, that can get you a 180-degree view that you can pan around and look at. I've also been working a bit over the last year on 360-degree support with two cameras. There's a bit more to do on that front, but we've now got a GStreamer that can take input from a 360-degree camera and add the metadata that's required to stream it live to YouTube, for example. I don't know if you've ever seen those kinds of videos on YouTube: you can grab the video and drag around to look anywhere on the playback sphere. That's kind of fun. This was linux.conf.au in January; they gave us all little Raspberry Pis as swag in the bag, so I strapped a fisheye camera onto one and had it in my badge, live streaming the conference, over WebRTC in this case.

I mentioned GStreamer has grown a WebRTC stack. At the top level it looks like a pretty normal WebRTC API translated into C; internally it does a whole lot of work for you. That's a picture of the pipeline the WebRTC stack builds internally to handle all the finicky pieces: firewall traversal and discovery (ICE negotiation), DTLS encryption, the RTP-level jitterbuffer and the UDP streaming pieces are all plugged together for you into this complicated thing.

So that's a couple of the projects. Then, I mentioned the desire to do voice control. When I came up with Aurena and was building it to do my music playback, I had in the back of my head a secret plan that took me a few years to actually get around to. The idea was that if I can synchronize sound output, I can also synchronize sound input across the network, so I can stick microphones around the house. And I had been aware for a while of this technique: if you have an array of microphones, or an array of network antennas, you can use transmission delays and arrival times at each node, plus maths, plus filtering, to extract information about where the signal came from. You can do beamforming. So I had this plan to build an array of microphones across the network using individual nodes and the network synchronization. You stream all the audio back to a central server, instead of from a central server, do the processing on the central server, and, magic, extract information about where the person is standing. That's what I really wanted: this context awareness in my house, so that when I say "turn on the lights", the house knows where I'm standing and which lights I'm near, or if I say "turn off the TV", it knows the room that I'm in and the TV that I'm standing near, even if I'm in the doorway. It's about adding context awareness. These microphone arrays are normally a hard-coded thing, but I was proposing to build a flexible microphone array with unknown configuration, and that was a bit of a challenge.
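As promised above, here's roughly what one of those RTSP camera nodes looks like with gst-rtsp-server. This is a minimal sketch: the v4l2src/x264enc pipeline string is a typical choice for a USB camera on a Pi, not necessarily what runs on my boxes.

```c
/* Minimal RTSP camera server sketch.
 * Build with: gcc $(pkg-config --cflags --libs gstreamer-rtsp-server-1.0) */
#include <gst/gst.h>
#include <gst/rtsp-server/rtsp-server.h>

int
main (int argc, char *argv[])
{
  gst_init (&argc, &argv);
  GMainLoop *loop = g_main_loop_new (NULL, FALSE);

  GstRTSPServer *server = gst_rtsp_server_new ();
  gst_rtsp_server_set_service (server, "8554");

  GstRTSPMediaFactory *factory = gst_rtsp_media_factory_new ();
  /* The media pipeline, instantiated on demand for clients */
  gst_rtsp_media_factory_set_launch (factory,
      "( v4l2src ! videoconvert ! x264enc tune=zerolatency "
      "! rtph264pay name=pay0 pt=96 )");
  /* Share one pipeline between all viewers of this camera */
  gst_rtsp_media_factory_set_shared (factory, TRUE);

  GstRTSPMountPoints *mounts = gst_rtsp_server_get_mount_points (server);
  gst_rtsp_mount_points_add_factory (mounts, "/camera", factory);
  g_object_unref (mounts);

  gst_rtsp_server_attach (server, NULL);
  /* The stream is now at rtsp://<host>:8554/camera */
  g_main_loop_run (loop);
  return 0;
}
```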
For my first attempt, I put a client on Android devices, because I had a few of those lying around, and set it up to stream their microphones to the server. And I had found a project from a university in Canada called ManyEars, designed for robot audition: autonomous robots with microphones on top. It would do this analysis of the incoming microphone signals, and their model was based around eight microphones in a fixed grid. It gave you both localization of where a sound was coming from, and tracking and splitting of individual sound sources. That's another key component: if you speak in a noisy room, you can use an array of microphones to split out the background noise.

The theory is that sound travels at 343 meters per second at 20 degrees C and normal atmospheric pressure, so a millisecond of difference between two signals means about 34 centimeters of positioning error. And that sounded fine to me: if I have roughly a millisecond of error in my clock and I can get someone to within a meter of where they're standing, that's pretty good. But the worse the clock synchronization gets, the worse the spatial accuracy.

The result of that project: complete failure. It just didn't work at all. The problem was that I was using Android in this case, and getting up to 100 milliseconds' worth of unknowns in when the sounds were arriving at the Android device, let alone when I was able to send them to the server, and that was just horrible. And there was a second problem, which is that I was putting only one, maybe two microphones on each device, and then I had to go and measure the configuration of the room manually. There was no way to build that model of the microphone positions, and that's critical. So I put it aside and left it for a few years.

Then last year I decided to have another go at this problem, and by now the situation has changed, because I can go online now and buy lots of little devices like this one: this is a four-microphone array HAT for a Raspberry Pi. You can buy two-microphone arrays; you can buy little boxes like this ReSpeaker that has six microphones built in in a hexagonal grid. So suddenly, in 2018, microphone arrays are very available, where in 2015 they were hard to find. In 2018 they're in your Amazon Echos, your Google Homes and your HomePods, already doing this kind of tracking and separation. Every one of those devices uses it to split out the noise of the TV and listen to the person speaking, and who knows what else; they're a closed box doing whatever they want. So I have lots more options for my hardware now. I'm playing with the Seeed Studio ReSpeaker, I'm playing with the ReSpeaker Core that's got a built-in ARM SoC plus the microphone array, and you've got Raspberry Pi HATs that come in four-microphone or two-microphone versions, even in the Zero form factor.

I also discovered that ManyEars had been superseded: another person at the same university has built a new piece of software called ODAS, with a similar goal but much more flexibility, and a cool interface that I'll demo in a bit. So you have one of these arrays and it can tell you which direction a person is speaking from. You have a fixed microphone configuration in the hardware now, so you can model it at each node, and you don't have to worry so much about the configuration of the room. And you can get the audio out: you can split multiple people speaking into separate audio streams. But a node generally can't tell how far away someone is; it can only tell you the direction. That's the far-field effect: the microphones are too close together to get decent triangulation.
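To put numbers on that far-field limitation (this is the standard two-microphone model, a gloss on what's described rather than formulas from the talk): for a pair of microphones spaced $d$ apart, a plane wave arriving at angle $\theta$ from broadside shows up with an inter-microphone delay of

$$\Delta t = \frac{d \sin\theta}{c}, \qquad c \approx 343\ \mathrm{m/s}.$$

With $d = 5\,\mathrm{cm}$, the largest possible delay is $d/c \approx 146\,\mu\mathrm{s}$: easily enough to resolve direction, but the wavefront curvature that would reveal distance is far smaller than anything a centimetres-wide array can measure.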
So each node on its own is not enough to tell me where a person is standing; it's just enough to tell me the direction they're in. But let me give you a little demo of ODAS, because it's very cool. Where did I leave that... this is the ODAS interface, come back over here. Okay, so I SSH over to this little microphone array and I can run the ODAS live client, which after a little minute of running we should see it find the array and connect, and then hopefully we get some kind of pop-up... hey, there are some sound sources appearing. It's hearing me as I walk around; it should show a track of me walking around the microphone, so it can tell the direction there. If I speak into it directly it probably gets a bit confused about where I am, but if I just talk from here, you can see it's tracking my position. It's also tracking that loud bit of noise offset to the side: that is the whirring of my laptop. And it has some little filters here: if I turn the threshold down, you see a lot of static noise, that's the thresholding, so turn that back up. ODAS is awesome.

So what I've been doing is taking that and wrapping it up in a GStreamer element, which required a bit of modification of the ODAS code. I don't have it quite ready as a GStreamer component to show you today, but I'm going to keep hacking on it. What that would let me do is place it into my existing Aurena setup, so that Aurena is capturing the microphones, running them through ODAS, getting out these individual sound sources, and then making them available to stream back to something else. So now I can capture the audio and do that voice recognition on it through GStreamer, which gives me a bunch more flexibility around what I do with it.

So that's ODAS as it stands. I mentioned I still have this problem: I can't solve my original goal of knowing where someone is standing when they say something, because all I have is a direction to them. I still need to know the configuration of my listener nodes. So I have this question of how you do calibration, and I came up with this answer: you have a node play a sound, and you have the other nodes listen for the sound. Because they have synchronized microphones, if you have information about exactly when that sound was transmitted, and you can receive it and measure when it was received, now you can measure the time delay. Plus, you have a microphone array, so you know the time delay and the direction it came from, by using the network clock. So now, for each node, you play a sound and get a vector in that direction; both nodes should get the same answer for the distance between them, and now you've got a locked configuration for two nodes. You do that across the whole network and solve for all the relative positions. Right, simple.
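In symbols, the scheme is roughly this (notation introduced here for illustration; the talk describes it only in words): if node $i$ emits at time $t^{\mathrm{tx}}_i$ and node $j$ hears it at $t^{\mathrm{rx}}_j$, both read off the shared network clock, then the distance between them is

$$d_{ij} = c\,\bigl(t^{\mathrm{rx}}_j - t^{\mathrm{tx}}_i\bigr),$$

and node $j$'s array also reports a unit direction-of-arrival vector $\hat{u}_{ij}$ in its own frame, so each measurement constrains $p_i = p_j + d_{ij}\,R_j\,\hat{u}_{ij}$, where the $p$'s are node positions and $R_j$ is node $j$'s unknown orientation. Collect those for every pair, in both directions, and a least-squares fit recovers all the relative positions and orientations, up to a global translation and rotation.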
There are a couple of problems. One: how do I... what information am I going to transmit?

Yep, sorry. Oh, sorry, that's... thank you. Let's get this out of the way; let me do it this way. The array, okay. Thank you.

Okay, so I have this unknown configuration, and I can't do my triangulation from that, so I have the nodes play to each other. For each one, they can get a vector relative to themselves, which gives them the ability to figure out their absolute orientation relative to the room. I showed you ODAS: it's able to figure out sources anywhere within the hemisphere above the array, and it can give you azimuth and elevation information about the source. So you can actually have nodes at an angle, strapped to the wall, and they'll still be able to solve their relative positions; it's not dependent on having them all on the same level, or flat on a desk, or anything. We play sounds, we listen, we get the direction of arrival, we get the delay, and I want to encode some identity information into each signal, which will make some of the other pieces around it easier to manage. And then you solve for the relative positions of all the nodes.

But I have some problems. How do I put information into a sound signal in a way that it's able to be received across a room, through noise? I need a super redundant signal. And there's also this currently unanswered question of what calibration volume I should use. I don't want to just blast at full volume; that gives you the best chance of receiving it on the other node, but if it's audible information then there might be people in the way, so I'm being conservative on that front at the moment.

I've also done some work on this calibration question. I found a modem package, libquiet, that uses the liquid-dsp library to do modem encoding in different forms. It has a packet structure, it has a bunch of different modulations you can choose, and it generally works: I can put in some input information, I get a sound out, I can play the sound, and if I'm lucky I can receive the sound and get the information back out again. But, one, it doesn't work too well beyond half a meter right now. It's an acoustic modem, but it assumes a channel that is free enough of distortion to recover the signal, and it uses quite a narrow-band signal, so there's more work to do on figuring out what modulation I want to use there. The other thing is that as I receive a sound on the remote node and decode data back out of it, I need to be able to apply a timestamp. So I think I need finer control over what's being generated in that audio stream, so that I know exactly what the delays are: here is the data I want to send, here is the time I want to send it at, what piece of audio did that go into, and then, on the other side, what piece of audio did the data come back out of. There are some timing-recovery things to do there, but it kind of works.
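That timestamping requirement reduces to bookkeeping between sample offsets and the shared clock (my formulation of what's being described, not existing code): if the first sample of a capture buffer has running time $t_0$ on the network clock, and the decoder reports that a symbol started at sample offset $n$ at sample rate $f_s$, then

$$t_{\mathrm{rx}} = t_0 + \frac{n}{f_s},$$

and the same mapping on the sending side pins down $t^{\mathrm{tx}}$, which together are exactly the two numbers the calibration formula above needs.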
Let's see if I can do another little demo for you. This window... actually, the one I want is back over here. This is the Aurena version, and I didn't point out these little microphone symbols here: this is the version where it also has the ability to turn on the microphone on each node, and it'll start streaming back to the server. And it has that calibrate button there. Let's see if I can get that to go... let me do this a different way. This is the question of what volume should I play at, and of course the volume is way down. Can anyone hear that? Good, it's supposed to be reasonably hard to hear. There it is. Then in this little window I have a GStreamer pipeline running, listening to the microphone, a little USB mic.

So this is what I mean about it being finicky: it played the sound, but it didn't hear it. It's really awkward to do this on the screen up there; the demo gods are not being kind to me today. This is what I mean about needing a lot more redundancy in the signal that's being generated here, so that it can be received when there's noise and distortion in the signal. There's a bunch more research to do; I think there's a good chance you could describe that as an entire research project on its own. I have seen some other interesting options: I found a web implementation that uses quite audible beeps and blips to transmit an SDP between two machines, and then uses that SDP to set up a WebRTC connection between them. So you don't need a central server: one web browser plays a sound, the other web browser listens to the sound, and they now have enough information to form a WebRTC link and transfer a file. The method they were using there looked pretty promising, and I can also go a lot lower in bit rate than what libquiet is trying to achieve, because I really only need about 100 bits per second or so, encoded into a really redundant audio signal. I'm going to keep working on that.

All right, so that's pretty much it for the couple of little side projects I wanted to talk about, to introduce the idea that GStreamer is good for building these kinds of toys, and maybe introduce you to a new idea around using arrays of microphones for doing interesting things. Does anyone have any questions about what I've just covered?

Q: Thanks for the great talk, it was very interesting. What time synchronization can you actually achieve on commodity Wi-Fi networks? They have latencies of several milliseconds; how low can you bring that down?

A: The clocking algorithm is doing a bunch of non-linear filtering, because even on my domestic Wi-Fi I have round-trip times where the Wi-Fi will go 1 millisecond, 1 millisecond, 1000 milliseconds, 1 millisecond. So the clock has a lot of outlier detection, and it's trying to do a best estimate: it uses minimum ping times to lower-bound the round trip, and then runs filtering over a longer window to try to estimate the clock within that. It's not perfect; I've been tinkering with some other filtering algorithms. But it does work pretty well. It works really well over Ethernet, and it works pretty well over Wi-Fi, certainly well enough for my "audio that doesn't sound like a stadium" kind of use case.

Q: I mean, is it like half a millisecond, or maybe lower than that, or is it unreasonable to expect something below the millisecond range?

A: I think you can't really expect sub-millisecond accuracy on general-purpose Wi-Fi, especially if there are devices running at lower speeds that take up huge chunks of airtime and cause big delays. But I pretty regularly see it under 10 milliseconds, and human ears generally can't hear a 10-millisecond difference; 10 milliseconds is 3 meters of separation between two speakers anyway.
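A minimal sketch of the minimum-RTT idea from that answer; this is illustrative only, not GStreamer's actual netclientclock code, which layers regression and windowing on top:

```c
#include <stdint.h>

/* One NTP-style exchange: local send time t1, remote receive/reply
 * time t2, local receive time t3 (all in nanoseconds). */
typedef struct { int64_t t1, t2, t3; } ClockObs;

/* Estimate the remote-minus-local clock offset from a window of
 * observations, trusting only the exchange with the smallest round
 * trip: a long RTT means the packet was delayed somewhere, so its
 * midpoint estimate is unreliable and gets ignored. */
static int64_t
estimate_offset (const ClockObs *obs, int n)
{
  int64_t best_rtt = INT64_MAX;
  int64_t offset = 0;

  for (int i = 0; i < n; i++) {
    int64_t rtt = obs[i].t3 - obs[i].t1;
    if (rtt < best_rtt) {
      best_rtt = rtt;
      /* Assume the remote timestamp sits halfway through the trip */
      offset = obs[i].t2 - (obs[i].t1 + rtt / 2);
    }
  }
  return offset;
}
```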
Q: There's another one over there. You talked about the smart home, and I can easily imagine several interesting things to do with these directional audio measurements, but how applicable is it when there's furniture in the room? How does it deal with reflections, say if there's a sofa in the middle of the room?

A: I didn't talk too much about what ODAS is doing internally. A reflection looks like a ghost signal, but at lower intensity, so it generally works pretty well, even in a room with furniture and with sounds bouncing off the walls. You get those kinds of scattergun static noise sources, but if you put the entropy threshold slider above the cutoff, they tend to disappear and you only get strong primary sources. You also generally get the primary source arriving before those reflections. And the existing ODAS code base uses a bunch of particle filters to track individual sound sources and to correlate the actual audio signal of them, so it'll work okay if you have two people circling a room and crossing each other: it will still track the right person as they cross. The one downside is that it chews pretty much a full ARM core on this box to do all that processing, but the researcher responsible for ODAS also has new papers out in the last six months with a new technique that is 40 times less CPU-intensive.

Q: In GStreamer, is there a plugin for, you touched a little bit on it, voice-to-text, and for motion detection on video?

A: There are OpenCV elements for doing motion detection on incoming video. For voice-to-text, the Mozilla DeepSpeech project, I think, has a GStreamer plugin as part of DeepSpeech, but I haven't used that. It's hard to keep track of what there are GStreamer plugins for, because they're not always within the GStreamer framework or project: the Sphinx speech-to-text has a GStreamer plugin that ships as part of that project, completely outside GStreamer, and there's a Festival plugin for going the other direction, to produce speech audio out.

Q: Those projects that you mentioned, are those engines that you can install yourself? Because I really liked your comment about being able to do all of this at home.

A: Yes, they are. Mozilla DeepSpeech has an online cloud-based service, but if you have the right level of GPU available, you can run it locally. And the lower-performance, lower-requirement things like Sphinx you can download and install; they just don't have as good a recognition rate, necessarily. That's a benefit of having these microphone arrays: you can get better speech recognition, because you can clean up and isolate the voice signal and get rid of that background noise, which automatically helps some of these engines work better as well.

Any other questions? Thank you all very much for listening.