So, we're giving a talk about Ubuntu 14.04 Long Term something... no, we're not. Sorry, not even that one. We're giving a talk on subtitling, on subtitling videos like the ones being recorded back there. We are presenting work from a student project at Hamburg University, in the computer science department. We tried to convince our students to present it themselves, but they were afraid it would be too late, or too many people, or something, so there is just one sitting over there: Florian. Hey, Florian is there. Florian is studying computer science towards his bachelor's degree, and together with five other students he is working on producing subtitles, or rather producing subtitles semi-automatically, for C3 videos. That's the data set we selected for this task because it is freely available, thank you very much for that. Of course, what we are doing will hopefully work similarly well for other videos; we just haven't tried it yet.

So what we did was look at last year's videos, most of which, or at least many of which, were transcribed. So we have the transcriptions, and for subtitling you need more than just the transcriptions, more than just the text: you also want the timing information for that text. You want to be able to load, for example, an SRT file (that seems to be a common data format for this) into your favorite video player and then see the subtitles, or maybe read them on a braille display if you can't see, or read them visually if you can't hear. So I'm kind of lost now. We got the transcriptions. But why do we want transcriptions at all, why not just use speech recognition? It turns out that speech recognition is not very good. Hey! And now I'm very glad that I can hand over to Anna for all the difficult parts of the talk.

But we're almost there anyway, right? That's good enough, I mean, don't break it any further. I don't care. The other way around, please. Maybe. Do whatever. So, I think we already had this. The switching knob doesn't work? That's funny. It is funny, isn't it? So this computer has a demo prepared; we're looking into whether that's going to work or not. It's unlikely. Please, we had at least some slides. Okay. Yeah. Okay, that's, yeah. Okay. All right, so over to Anna.

So, hi. Yeah, so this is our group. I think you've already seen Timo. This is Wolfgang, that is me, and Jonathan, Paul, Florian and Tim. Faiz couldn't be on the photo because I think he had to move on that day. So we had... I wasn't able to listen, did you already say that? Okay. So what we want to do is generate subtitles from transcriptions. We need to align them, chunk them into subtitles, and export them into whatever format we want to have afterwards. Our secondary goal is acquiring speech data, because as a university we usually don't have access to huge amounts of data. That's the main difference to the huge global players like Google, who can just tape everything you say into your phone, or Apple, who can record what you say to Siri, and actually use that to improve their performance. We don't have access to that, unfortunately. So free data is necessary to actually have free software speech recognition programs.

So, yeah, you already went into that a bit. If you try to do speech recognition, you really need to do a lot of stuff. You need to know about the world as a whole.
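By the way, on the SRT format mentioned earlier: it is just numbered blocks of "start --> end" timestamps followed by the subtitle text. Here is a minimal sketch of a writer for it, assuming we already have timed and chunked text; this is illustrative code, not the project's actual exporter:

```python
# Minimal sketch (not the project's actual code): writing aligned, chunked
# subtitles to the SRT format. Field names and sample data are illustrative.

def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, e.g. 62.5 -> 00:01:02,500."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def write_srt(chunks, path):
    """chunks: list of (start_seconds, end_seconds, text) tuples."""
    with open(path, "w", encoding="utf-8") as f:
        for i, (start, end, text) in enumerate(chunks, start=1):
            f.write(f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n\n")

write_srt([(0.0, 2.5, "So, we're giving a talk on subtitling."),
           (2.5, 5.0, "We are presenting work from a student project.")],
          "talk.srt")
```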
And beyond world knowledge, you need to know the context of what is being talked about right now. You can do all of that in your brain, because it is so powerful and optimized for exactly that, but the computer cannot really do all of that right now. Speech recognizers usually base their computation on word sequence probabilities. That means we have a large, large corpus of data which says: well, this phrase has been said, I don't know, a million times, so it is much more likely than something else that has only been observed once. And the problem here is that all of you use weird words in weird contexts, and then the speech recognition just gives up and says: well, I don't know.

To show you that this is a really hard problem: maybe you were at the talk yesterday by Jacob Appelbaum — oh no, I have to remember it correctly. What was transcribed live by a human was "GnuPG has safety faults", and what he intended to say was "GnuPG has safe defaults". That makes a difference. And it shows that even humans, when they are under time pressure to actually type things in, will miss things — for example, that he had said just before that it is good to use GnuPG. So we see that speech technology is bad at understanding what humans want to say. Timo will talk about that in detail, and I can sit down. Yeah, you can sit down.

So... oh no, there is no preview anymore. I was wondering why you were standing here. As Anna said, speech technology is bad at understanding what humans want to say, so that is a problem. What we figured out is that humans are at least relatively good at transcribing speech, apart from that example Anna just presented. But going from these transcriptions to subtitles is still a major issue. This is a screenshot I took earlier today of the Amara transcription interface, which is really good at this job, but the complexity of the interface is, I think, obvious: it is not easy to figure out what to do. You really have to work out how to use it, and it is really hard to build a good interface for this task of putting timings on the text, of figuring out what was spoken when. I think it is hard to build a good interface because it is such an unnatural task for humans. We never do this in our everyday lives; we never figure out when exactly something was spoken, we just know what was spoken. So even if we improved the user experience of the Amara interface, that wouldn't solve the underlying problem that humans aren't made for this. Humans are made for understanding, but certainly not for subtitling. So our idea was to have the computer solve the unnatural part of the task and the humans the natural part.

So the... yeah, I got all that. The idea. What happens is a kind of pipeline that is very similar to what speech recognition does. We start from the transcribed text and have to normalize things, because as people transcribe, they naturally type things like "431 BC", or it might even say "Anno Domini" if we are unlucky — but the speaker would certainly not have said "431 B period C period", or at least that's unlikely. Then we have to go from the normalized text to something more like a sequence of speech sounds, like an "f" or... what did I say, 400? An "f" for "four", something like that. Right? A sequence. And of course I say the most likely pronunciation here, but there may be multiple pronunciation variants that we want to track.
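To make these two steps a bit more concrete, here is a minimal sketch, with a hand-made expansion table and a tiny pronunciation lexicon standing in for the real text normalizer and grapheme-to-phoneme component. Everything in it (tables, phoneme strings, function names) is illustrative, not the project's actual code:

```python
# Step 1: normalize written forms into the words a speaker would actually say.
EXPANSIONS = {
    "431": "four hundred thirty-one",
    "B.C.": "before Christ",
}

def normalize(text):
    return " ".join(EXPANSIONS.get(tok, tok) for tok in text.split())

# Step 2: map each word to one or more pronunciation variants (phoneme strings).
LEXICON = {
    "four": ["F AO R"],
    "hundred": ["HH AH N D R AH D", "HH AH N D ER D"],  # two variants to track
    "thirty-one": ["TH ER T IY W AH N"],
    "before": ["B IH F AO R"],
    "christ": ["K R AY S T"],
}

def to_phonemes(text):
    return [(w, LEXICON.get(w.lower(), ["<unk>"])) for w in text.split()]

print(normalize("431 B.C."))                 # -> four hundred thirty-one before Christ
print(to_phonemes(normalize("431 B.C.")))
```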
And then, of course, we want to figure out how this sequence of speech sounds aligns with the acoustic waveform that we've got up here. That doesn't happen magically; it's a learned model, learned from lots of data — data that, as Anna was implying earlier, is severely lacking when you're working at a university, and that we need in order to have good models of how speech is spoken, of what speech sounds actually sound like. If we have all that, then we get pretty decent timing information — hopefully; that's our goal.

However, if we just follow this standard approach of figuring out things one after the other and then doing one big search over everything, it isn't robust, because the search space with all the different timing variations that could come up is just too large. So in speech recognition you generally don't do a full, complete search, you do a beam search — and in our case what we noticed is that the correct solution always falls off the beam. If you're lucky it may work on very short stretches, like one or two utterances in a row. It may also work if you have very distinctly, clearly spoken text: in a different experiment we looked at the Spoken Wikipedia data, where we have the article text and recordings of people who really try hard to speak exactly the words that are in the article. But as we noted earlier, transcriptions at C3 have to be done almost in real time, so it doesn't work as well there.

So the approach we use — it's not our own, we just use it — is to stick with the fact that people say more or less what's in the transcription. There are some stretches of speech that are not transcribed (this actually happens a lot with C3 data, it appears), and there are also transcribed words that are not spoken — people just type something and it happens not to be in the talk. It's quite funny to look at the data; we all believe it's really nice, and it is nice, but it's not perfect. So what we do is this word sequence modeling that Anna was talking about earlier, not on random or huge amounts of text but only on the one transcription that we've got. Then we run speech recognition over the full recording with this very specific recognizer and try to find landmarks: points where we have a correspondence between the recognition result and the original transcription. Then we iterate over that in a divide-and-conquer approach: we look at the stretches between landmarks and try to narrow them down as far as possible, so that in the end only those bits and pieces remain that we really can't align to each other. One problem is that it's not very fast: it runs at only about one eighth of real-time speed, so a one-hour keynote takes overnight, which is quite long in my opinion. It's not real time, anyway. So this step gives us the alignment: for every word we know the timing, where it occurs in the talk. However, we want subtitles, not word-by-word alignments, and with that I hand off to Anna for the next bit of the talk.
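The anchoring itself can be pictured roughly like this. A minimal sketch, using Python's difflib to stand in for the matching between recognizer output and transcription; the constrained recognizer itself and the recursion over audio stretches are only hinted at in comments, and none of this is the project's actual code:

```python
# Find stretches where the (biased) recognizer's output agrees with the
# transcript, treat them as anchors, and recurse into the unmatched gaps.

from difflib import SequenceMatcher

def find_landmarks(recognized_words, transcript_words, min_len=3):
    """Return matching blocks (i, j, size) where at least min_len words agree."""
    sm = SequenceMatcher(a=recognized_words, b=transcript_words, autojunk=False)
    return [m for m in sm.get_matching_blocks() if m.size >= min_len]

# Toy data: the recognizer inserted one word and dropped another.
transcript = "we want to generate subtitles from the transcription".split()
recognized = "we want to generate uh subtitles from transcription".split()

for m in find_landmarks(recognized, transcript):
    print("anchor:", " ".join(transcript[m.b:m.b + m.size]))
# Between anchors, the same procedure would be repeated on the remaining
# audio/transcript stretches until only un-alignable bits remain.
```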
The reading performance of the person who is reading the subtitle later on is also limited, so you have to take into account what can actually be read in the time available. That means we need to break those sentences up into shorter chunks, and the chunks have to be somehow meaningful, and we can see that simple heuristics don't really work for that. If you know... no, no, this is not the time for the video, sorry. So if you want to see last year's keynote again you will have to do it online later, but this example is from last year's keynote, and this is what a simple heuristic actually did. You see the subtitle here, and it ends with "...to me in a telephone". "In a telephone", what? That doesn't make sense. But if you look at the whole sentence, you see that it is "it occurred to me in a telephone call", and that makes much more sense. So you need the syntactic information that "telephone" and "call" are strongly related, and that it's a bad idea to put a chunk boundary between those two words.

Therefore we perform a full syntactic analysis of each sentence. It looks like this: it shows the dependency of each word on another word, and you get a dependency tree. That's done by some nice existing program, and it doesn't always get it right, so we have to take into account that it fails from time to time, especially because the sentences can be really long and there can be syntactic errors and so on. Then we simply search for the best chunking, that is, for the best split points in the sentence, by defining two cost functions: first the splitting cost at each point, which says, well, if you split here that will cost you 200 points, and if you split there it's only one point; and second a cost on the resulting chunk length, so a chunk of only two words is extremely bad, but a chunk of 200 words is also bad — you want to be somewhere in the middle. And with this we actually do get better subtitles.

Yeah, our results... the debug interface. Can you switch to that computer? Who does that, who's in charge of that? Should I press a button? No? No? Okay. So what you can see here is just the debug output of our system, and it shows the dependency tree. I think the tree is not really that visible — the grey parts over there, that is the dependency tree; it's perfectly clear to me, sorry. Then you can see the different chunks, and on the right — not there — you can see all the meta information we added. For example, for this first word here, "you": when the word starts, that it is word number 29, that its syntactic parent is word number 30, and so on and so on. We can reuse all of this, and we can actually jump between points and see why the system does what it does. That's important because we don't have a clear gold standard where we could say, this is wrong, this is right. We can only look at it as humans and say, well, this looks good — and if it doesn't look good, let's see why that might be the case. Yeah, and usually you would hear a sound as well, but not today, sorry. The sound is there, it's just not very loud. Okay, if you could switch back, that would be great. Okay, I think it's me talking now. We actually processed, I don't know, several videos with this, and it mostly works, and we have to keep working on it, and I think we'll go to... this, yes. Is this your part?
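The search over split points can be pictured as a small dynamic program. This is a minimal sketch only: in the project the splitting costs come from the dependency parse, whereas here they are just invented numbers, and the length penalty is an arbitrary quadratic one around an assumed ideal chunk length:

```python
def length_cost(n_words, ideal=7):
    """Penalize chunks that are much shorter or longer than an ideal length."""
    return (n_words - ideal) ** 2

def best_chunking(words, split_cost):
    """split_cost[i] is the cost of a boundary after words[i]; returns chunks."""
    n = len(words)
    best = [(0.0, -1)] + [(float("inf"), -1)] * n   # (total cost, previous cut)
    for end in range(1, n + 1):
        for start in range(end):
            cost = best[start][0] + length_cost(end - start)
            if end < n:                              # no split cost at sentence end
                cost += split_cost[end - 1]
            if cost < best[end][0]:
                best[end] = (cost, start)
    cuts, i = [], n                                  # backtrack over stored cuts
    while i > 0:
        cuts.append((best[i][1], i))
        i = best[i][1]
    return [" ".join(words[a:b]) for a, b in reversed(cuts)]

words = "and then it occurred to me in a telephone call last week".split()
costs = [1, 1, 200, 1, 5, 1, 10, 5, 200, 1, 5]       # high cost inside "telephone call"
print(best_chunking(words, costs))
# -> ['and then it occurred to me', 'in a telephone call last week']
```

Because the boundary between "telephone" and "call" carries a high cost, the search prefers a split elsewhere, which is exactly the behavior the dependency tree is meant to encourage.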
I think it would be, but... So, to sum up those results: here are some statistics on last year's keynote — we have processed more videos than just this one. Is this the five minutes before the questions? Okay, good, thank you. So, some statistics on last year's keynote: we get alignments for roughly 90% of the transcribed words. That's not all the words, unfortunately, but we have some ideas for how to improve it further. We have a big problem with applause, which is very disruptive, and with laughter. The problem with applause is that the first few words after applause — "thank you very much" — are usually not transcribed correctly. We know that from the data. We don't have any good applause model in our acoustic model, because so far we didn't get any applause... now we do — no, thanks, no need, no need, no, no, thank you. So far we didn't have any applause data, so we couldn't have it in our acoustic models. Now we have data with applause. We haven't really figured out yet where the applause is, we'll have to do some manual work on that, but this will help us build better acoustic models that include applause, and then further improve the transcriptions and, of course, the subtitles.

Our current idea is to estimate the duration of those words that we can't align. Again, on the statistics: most of the time the alignment is very precise, sometimes it's just totally off; we don't have precise numbers on that yet. I keep telling the students that we need an evaluation, because we're doing scientific work — I can talk all day about that, but the semester isn't over yet. We have more ideas: we still need to improve the chunking heuristics, and we want to go to two-line subtitles, which may sound like a small thing, but it does change the game a little bit, because then you have two different kinds of chunk boundaries that need to be integrated. And then there are all sorts of practical issues: we haven't integrated with Amara in a suitable way yet. We had originally promised that we would process the 31C3 talks; we haven't done that, I apologize, but we're working on it, and some people actually need credit for this class, so it will happen, I promise.

As for the other direction we were going for... yeah, let's go. Oh, now I'm off. Well, whatever. So we also want to use the data that we got for better... oh, that almost reminds me: language changes are also a difficulty in transcription. So, the cable behind my back... no, it's funny. Okay, no... better?
No, it's better. Okay. So we also want to use this data, since it is very freely licensed, to improve the open source speech recognition models that are out there. You need lots of data, because all those models are machine learned — they are estimated on large amounts of data — and as Anna said earlier, that has been a huge problem so far. It has already improved a lot, and we hope to get another, what is it, 30 hours or so of speech data, of talk data, into the acoustic models that are freely available at VoxForge. It will be German — it will mostly be German — and it will be English, but mostly German speakers speaking English, so that will also be an advantage for the English models for German speakers afterwards. Yeah, so that's one of the goals we have, and we think this is a perfect example of a situation where two sides gain something — actually three sides, because the students are learning something, C3 is getting transcriptions, and we are getting data.

So, to wrap up: alignment is simple for computers and hard for humans, and it's exactly the opposite for transcription, which is easy for humans and hard for computers, so we combine the two for best results. And subtitling is more than just alignment, because the splitting is actually quite difficult. Yeah, thank you very much. Oh, no, no — thank you, but not very much yet. We have plenty of ideas for future work, which at least partly line up with C3's own ideas about subtitling. I really like the idea of very low-delay subtitles: going from a few hours of delay to a few seconds, that would be marvellous. For that we could use the typing speed information: these transcriptions are actually made in Etherpads, so we have key-by-key timing information, and right now we are throwing that information away — we're just saying, okay, it could have been transcribed whenever. But usually people transcribe in sequence, and mostly during the talks. Thank you very much, whoever is transcribing right now or might be, and we're throwing half of your information away, for the record, but we intend to change that and use this typing speed information for the transcription process — for the alignment process, sorry. And of course, if we didn't have talks but more fantastic videos, we could do video analysis and all sorts of other things. It would be really nice to build a kind of information retrieval, content search facility, where you jump to the right place in the video directly when you search for, I don't know, something, say "privacy". All right, thank you very much.

Thank you very much. We're almost completely out of time, and I wanted to ask whether the Signal Angel has one really pressing question from the internet. All the other people — it would be so great if you could just ask the speakers directly, so we can free the room again for the next talk. It would be so great. Okay, then let's have one question from the internet. Did you test it with a normal movie and a transcription extracted from some available subtitles? No, we have not done that yet. We intend to, but as I said, our evaluation scripts aren't really done yet, so testing on that data would be hard. I doubt that we would beat commercial services, but we're working in that direction at least. Okay. As you go, please don't talk, as other people are still here. And this was really fast, so we can take one more question; there's one at microphone 3. Hi. Okay, so you fleetingly mentioned that the subtitles have to be readable, which means that they have to be displayed for a long enough period that people actually have time to read them,
and in classical subtitling software what happens is that you map the subtitles yourself, and the software calculates the timing for you and only allows you to use a fixed number of characters. Are you going to do something like that, or do you just plan to go on some sort of gut feeling of what is readable and what is not? And if you are going to do that, are you going to have to actually adapt your transcriptions — which is what happens in translations, where you can't keep the whole sentence and have to focus on some part of it — or not?

Yeah, basically. So, first question, whether we intend to do more than karaoke: yes, we do. Right now we try to get as precise timings as we can, but then we have to think about the usability issue, that we have to show a subtitle for a sufficient amount of time. Exactly. I haven't mentioned that, or we haven't mentioned that, but yes, we do intend to do that; it's an additional constraint. Now, to your second idea of changing the subtitles: that makes perfect sense for translations, I think, and is vital there, but we are trying to remain truthful to what the speaker said, or at least to what the transcriber transcribed, which may differ in those cases. So far I think that's beyond the scope of this current little one-semester project, because to be able to shorten things we would need exactly the kind of understanding of what is going on in the sentence that we don't have.

Thank you so much, Anna and Timo. I think you will be over there on the outside to answer more and more questions. Thank you so much for that, and this is your applause.
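As a rough illustration of the "sufficient amount of time" constraint mentioned in that last answer, here is a minimal sketch that enforces a reading-speed limit when assigning display times. The 15 characters per second rate and the minimum duration are common rules of thumb assumed here for illustration, not numbers from the talk, and the code is not the project's:

```python
# Minimal sketch of a reading-speed constraint on subtitle display times.
READING_SPEED_CPS = 15.0   # assumed readable rate, characters per second
MIN_DURATION = 1.0         # assumed minimum display time in seconds

def enforce_readability(chunks):
    """chunks: list of (start, end, text); extend end times that are too short."""
    adjusted = []
    for start, end, text in chunks:
        needed = max(MIN_DURATION, len(text) / READING_SPEED_CPS)
        adjusted.append((start, max(end, start + needed), text))
    return adjusted

print(enforce_readability([(10.0, 10.8, "it occurred to me in a telephone call")]))
# -> the chunk stays on screen for roughly 2.5 s instead of 0.8 s
# A real implementation would also have to avoid overlapping the next chunk.
```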