Good morning everybody, buenos días a todos. My name is Terry, but it's my first time at this forum, so I'm a little bit excited. Let me say a few words about me and my company. I've been working in the embedded world for, let's say, more than 25 years, and my company is a small but growing French one whose name is IoT.bzh. This is us: we are doing well, 28 engineers, which means that almost a third of my company is here today with me. We are one of the main AGL contributors, because we belong to the AGL community, and part of the team is here today too. As you can see, in the past years we have been among the most influential committers on the AGL code.

What I will talk about today is Julius. Julius is a speech recognition engine. The reason and the birth of that project was pretty much our CES demo, which we performed last month in Las Vegas. We wanted to be able to show a connectionless speech recognition engine: one that was obviously open source, that was light, that had activity on GitHub, and that was possibly easy to integrate in AGL. I had to find that one. Here is a short list of the available ones. I've taken a look at the CMU Sphinx family, but I eliminated it, basically due to the fact that it was Java code, which is not my favorite language. We also wanted it to run on the boards that we support, for instance the one from Renesas, but obviously also the one from Intel.

The architecture that we showed in our CES demo was as follows: a boat simulator that was equipped with an autopilot, and this autopilot was receiving speech orders in JSON format that were captured on a PCM device and processed by libjulius. To achieve that goal I had to hack and to cheat a little bit, after some days of trying to integrate it in an embedded world and to cross-compile it.
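The talk doesn't show the actual payload of those JSON orders; a hypothetical message from the recognizer to the autopilot could look like this (all field names are invented for illustration):

```json
{
  "agent": "julius",
  "sentence": "turn starboard fifteen",
  "score": 0.92,
  "order": { "action": "turn", "direction": "starboard", "angle": 15 }
}
```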
I reached the point where I decided to fork the project. At the end of my presentation all the GitHub links are available, and that presentation, by the way, will also be available on our website in the coming days, when I'm back.

Here is a list of some of the issues that I had to deal with. The biggest one was that it simply was not cross-compiling, and that was not fixable with the available autotools files, because things had been hard-coded and the contributors didn't want to give me the original pieces of code. Also, within the engine itself, there was no easy way to interpret the recognition results. The recognition result basically consists of a recognition tree with some weights, and the user has to deal with it by himself in order to compute a recognition score. That didn't sound good to me, and I decided to improve libjulius, the Julius library, to make that piece of information available in a neater way.

Also, by default Julius is based on an N-gram model and a dictionary of more than 320,000 lines. That means that it computes a lot, which consumes a lot of CPU, and also, when you have to choose a word from a list of 320,000, you make mistakes, and you don't want that type of mistake, for instance when you do motion control. If you want to play some music, it's okay if the song that is chosen by the engine is not the one that you wanted to listen to; but when you pilot a boat, it's not acceptable. This is an example of the word-to-phoneme mapping that is part of that huge dictionary. As I said, it's not efficient at all.

This is the single grammar file that I've been using for our project. The grammar definition is in a standard format whose name is BNF, and it always begins with the same thing. At the beginning you just describe what a sentence is: a sentence begins with a silence, then eventually an HMM.
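As a concrete illustration of the BNF-style grammar being described, here is what a minimal grammar and voca pair could look like in the standard Julius format (the words and phoneme strings here are invented; the real files are on the GitHub links given later):

```
# boat.grammar -- the sentence structure, BNF-style
S        : NS_B SENTENCE NS_E
SENTENCE : POLITE ORDER
SENTENCE : ORDER
ORDER    : TURN DIRECTION

# boat.voca -- one '%' header per category, then "word  phonemes"
% NS_B
<s>        sil
% NS_E
</s>       sil
% POLITE
please     p l iy z
% TURN
turn       t er n
% DIRECTION
port       p ao r t
starboard  s t aa r b er d
```

The `mkdfa.pl` tool shipped with Julius compiles such a pair into the finite state automaton and dictionary files (`mkdfa.pl boat` would produce `boat.dfa` and `boat.dict`).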
It's the hesitation of people who speak, and NS_E is the end of the sentence. Then you describe the sentence itself: a sentence is composed of an order, you can eventually be polite or not, and then comes the list of orders. With that piece of code you already understand that the sub-dictionary will only hold the words that are part of the grammar. So the idea is to replace that huge amount of lines with just the needed subset, and to extract all the funny combinations. When I say combinations, it's just that for one word you can have several ways to say it, depending on the accent, depending on the country you are from, and the dictionary was also dealing with that.

That's why I have introduced a new file format, which is a pre-voca one: the voca is the dictionary, and the pre-voca is a dictionary without the phonemes. I have also added an extension for compound commands, just to try to be a little clearer: it's more efficient for the grammar to recognize "get out" as a single unit than just "get" and then "out". That was not supported by the engine, and adding this feature leads to better recognition results and fewer false positives.

To summarize, the grammar generation goes this way: the user creates the pre-voca file, which is compiled to a voca one, and then a tool provided by the Julius suite compiles that to a finite state machine, a DFA, and a dictionary in its own format. This is the expected result of the pre-voca to voca computation: the numbers that are acceptable are described here, and at the end, which is missing here of course, you've got the numbers and every possible combination of phonemes. The same for "starboard".

This is probably the trickiest thing that I had to deal with: Julius didn't support a wake word implementation. A wake word is mandatory for several reasons, the ones that are listed here on that page. But there can also be legal issues without a wake
word, because in some cases, in some countries, it's not acceptable that the microphone is always on, listening to what you are saying. When you have a wake word you also get sentence integrity: I mean, when you are in the process of recognizing something, a wake word in the middle of the sentence will be ignored; and the opposite: you won't start any recognition until a wake word is recognized. That saves a lot of CPU load, and also you can dedicate the recognition process, I mean, you can do it with dedicated agents. A conclusion that I have reached, also by doing some tests, is that wake words have to be long enough. It's more efficient; you have better success when saying, for instance, "hey pilot" rather than just "pilot".

When you have a wake word, you have to implement a matching grammar in front of it. In our case, for the boat demo, the only one that I had is a grammar that matched the pilot one, but I could also have wanted the "Alexa" wake word for an Alexa voice agent, for instance. Since the wake word is a grammar by itself, a grammar composed of a single word, the wake word itself, you have to deal with two grammars.

That was also a tricky part, because even if the Julius documentation, reading some pieces of test samples, says that it is feasible to dynamically switch from one grammar to another, this seems to be a broken piece of code. To work around it, I decided to run two recognition instances: one for the wake word and one for my grammar. The issue was then to share the PCM device. I have just made two threads, and I switch from one to the other when the wake word, or the sentence, has been successfully recognized.

At the end of the first implementation, I mean, alone at my desk with my laptop, everything worked fine. But as soon as you introduce some more noise, for instance a crowd, or just people talking a few meters from you, or some music, or some sirens, or whatever you want.
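The two-instance workaround described above can be sketched as a small state machine. This is a Python toy, not the actual implementation (which is two C threads around libjulius sharing the PCM device); the class names, wake word, and grammar below are all invented:

```python
class Recognizer:
    """Toy stand-in for one Julius recognition instance."""
    def __init__(self, grammar):
        self.grammar = grammar  # set of utterances this grammar accepts
    def recognize(self, utterance):
        return utterance if utterance in self.grammar else None

class WakeWordPipeline:
    """Feed every utterance here; only one instance 'owns' the audio
    at a time, mirroring the two-thread PCM-sharing hack."""
    def __init__(self, wake_word, command_grammar):
        self.wake = Recognizer({wake_word})
        self.commands = Recognizer(command_grammar)
        self.awake = False
    def feed(self, utterance):
        if not self.awake:
            # Ignore everything until the wake word is recognized.
            self.awake = self.wake.recognize(utterance) is not None
            return None
        result = self.commands.recognize(utterance)
        self.awake = False  # hand the audio back to the wake-word instance
        return result

pipeline = WakeWordPipeline("hey pilot", {"turn port", "turn starboard"})
```

Switching back to the wake-word instance after each sentence is what gives the "sentence integrity" property: a command is only ever interpreted right after a wake word.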
You have to deal with noise. Another noise issue, one that was really unexpected, is that I had been using a microphone that was buggy. The microphone had a bug, and the recognition was performing very badly: with two peaks in the spectrum, at about 4,000 and 6,000 Hz, the recognition process needed about 30 seconds before being able to decide whether what it heard was actually speech, music, or noise. It was really better in the other case; when the noise is just constant or monotonous, the recognition success still decreases really dramatically.

This is a spectrum that I took with Audacity from a .wav capture of the microphone, without any transformation. You can see the two peaks that were disturbing the recognition. This is the first and dirty hack that worked; it really improved the performance significantly.

I had heard about Speex as a module quite a long time ago, and I have to admit that I didn't know what it was made for. By looking at the code and its features, I have seen that it implements denoising, which is not as good as they say, but also automatic gain control. This is AGC, and automatic gain control allows you to be more independent from the hardware microphone that you are using. That's mandatory when you do speech, to have at least that one, the automatic gain control. For the denoising, Speex just does the best it can.

Okay, the best way to deal with that issue, to really perform noise cancellation, is to have a noise model and to use a tool between the microphone and the engine, for instance SoX. But having a model is not always possible; for instance, you don't have a crowd near you.
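About that "first and dirty hack" against the two spectral peaks: the talk doesn't show the exact tool or settings, but the natural approach is a notch (band-reject) filter at each peak frequency. Here is a self-contained Python sketch using the standard RBJ audio-EQ-cookbook notch biquad; the sample rate, Q, and the 4 kHz test tone are chosen for illustration only:

```python
import math

def notch_coeffs(f0, fs, q=5.0):
    """RBJ audio-EQ-cookbook notch biquad, coefficients normalized by a0."""
    w0 = 2 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2 * q)
    b = [1.0, -2 * math.cos(w0), 1.0]
    a = [1 + alpha, -2 * math.cos(w0), 1 - alpha]
    return [bi / a[0] for bi in b], [ai / a[0] for ai in a]

def biquad(samples, b, a):
    """Direct-form-I filtering of a sample list."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x1, x2, y1, y2 = x, x1, y, y1
        out.append(y)
    return out

def rms(xs):
    return math.sqrt(sum(x * x for x in xs) / len(xs))

fs = 16000
# A pure 4 kHz tone standing in for one of the microphone's spurious peaks.
tone = [math.sin(2 * math.pi * 4000 * n / fs) for n in range(fs)]
b, a = notch_coeffs(4000, fs)
filtered = biquad(tone, b, a)
# After the short transient, the tone at the notch frequency is almost gone.
```

Chaining two such filters, at 4,000 and 6,000 Hz, in front of the engine would approximate the hack described above.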
You have to pay to make some sound samples. The last option is user voice learning. That's something that is available in some speech recognition engines, for instance on your phone, but I see it as really constraining for the user, because it's painful: you have to read about 25 sentences, it takes long, and after that you have to process them through several files. But that should improve the results a lot.

In terms of CPU needs, I was, let's say, disappointed, because it worked fine on the laptop, an Intel with eight cores, but it was quite impossible to make it run properly on an ARM, even on recent hardware like AArch64. You have to leverage vector calculation and the FPU; for instance, AVX is mandatory, and on ARM, even if it doesn't perform the same way, you have to use the NEON coprocessor as well. An idea to get better results is to use a DSP, which is sometimes available, for instance on the Renesas hardware.

These are the axes of improvement that I wish to work on in the coming months, if people show some interest in it: to implement grammar plugins, and what I mean by that is that the end application should be able to provide its grammar to Julius, which would eventually compile it and use it on demand; to look deeper into the dynamic grammar switching process, because for now it's just a hack; to match the Amazon Alexa wake word, in order to be compliant with applications that are already aware of the Amazon API; and also to add new grammar templates, ones for motion control, because they simply don't exist at the present time. I'm done with my presentation. I will be happy to try to answer any questions, if you've got some.

Q: Hello, hey, so I've got two questions. First, what about upstreaming your modifications to Julius? And the second question is, when will we have a package in Buildroot for that?

A: Okay, that's a good question.
I will try to answer the second one first. I really love Buildroot; it's pretty easy to integrate a package that comes from it, and I already have some contributions in Buildroot. For the first one: I already have a pull request on Julius, but the maintainers haven't accepted it yet. I have some commits missing on that pull request branch; I will add them, essentially the piece of code that computes the results tree into just a sentence and the best score, which I think is useful. It is painless for them to have that code in addition.

Q: Hi. The main question is, do you have any idea of why — can you hear me okay? Okay, now better. So why do you think that Julius has that performance problem on those platforms? Is it anything that you have been able to measure?

A: Okay, you mean, what were my criteria for performance? I didn't want, for instance, eight cores to be busy for seven seconds; that is not acceptable. Also, the time of computation was critical in terms of user experience. I've been working on HMI interfaces in the past, and by experience, when you wait more than 500 milliseconds, the user wants to say that it's crap. With speech, let's say that you can be a little more patient, and a second can be okay, but not seven. Thank you.

Q: [A question about adding more languages.]

A: Come to Brittany! Okay, you are asking if it could eventually be possible to add a new language to Julius, to have more languages in Julius.
Yes: Julius already supports Japanese, they already have a Japanese model, and they say that they are compliant with some standards for dictionaries and language models. It's not something that I have investigated much, but it should be possible to add more languages.

Q: Hi, at the start of your talk you showed a table converting words into phonemes, and we obviously had these three hundred and twenty-five thousand words or so. Where did you obtain that conversion table from, and how did you cope with the different variations of standard English: Australian, South African, Scottish, whatever?

A: The table that I have taken is the one that Julius provides as an example, and it's common English with no particular accent. To my mind, if you wanted to really better support accents, for instance the Australian ones, I don't think that it would work. In the model that I have shown, you should also be able to completely remove Julius, because I have made a kind of API, an abstraction level, between the application and the engine itself. I wish, for instance, to be compliant with the Nuance one, because I've already seen, let's say heard, what their kit can do, and I think that's a better solution, even if it's commercial and you will have to pay for it. Okay, time is up for me.