Okay, so today I'd like to talk about closed-loop training for diphone-based TTS, and about some applications to products and services at Toshiba. Firstly, I'd like to introduce myself. I finished my PhD in 1985. My PhD thesis was about a state-space approach to the synthesis of minimum-quantization-noise digital filters without limit cycles. After I finished my PhD, I joined Toshiba, and I have been working at the Toshiba Research and Development Center for more than 30 years. Maybe my career is longer than most of yours, I think. I've worked on speech coding, video coding, TTS, ASR, and spoken dialogue — many, many things.

Today I'd like to talk about TTS. We have two types of TTS systems. One of them is unit-concatenation-based speech synthesis; this includes closed-loop training for diphone-based TTS and the multiple unit selection and fusion method, which is a kind of extension of diphone-based TTS. The other type is statistical parametric speech synthesis. There we have developed speaker and language adaptation and cluster adaptive training. These two topics are for creating a wide variety of synthetic speech in terms of speaker characteristics, speaking styles, and emotions. We have also proposed subband-based models, dynamic sinusoidal models, and complex cepstrum analysis. These are alternative speech representations to the traditional mel-cepstral parameters. Maybe you know that statistical modeling introduces the so-called over-smoothing effect, and the parametric representation of the vocoder also introduces distortions into synthetic speech. So we wanted to improve the speech quality of statistical parametric speech synthesis by introducing new types of speech representations: subband-based models, dynamic sinusoidal models, and complex cepstrum analysis.

Today I'd like to explain closed-loop training and multiple unit selection and fusion, so I won't touch on statistical speech synthesis except for cluster adaptive training, because at the end of my talk I'd like to demonstrate some samples created with cluster adaptive training.

Okay, this is the outline of my talk. You will learn about diphone-based speech synthesis with TD-PSOLA, the time-domain pitch-synchronous overlap-and-add method. Then you will learn about closed-loop training of diphone units and of prosody generation models. After that, I'll talk about multiple unit selection and fusion, and introduce applications in products and services.

Okay, I will start my talk with diphone-based speech synthesis with TD-PSOLA. You will learn here the structure of TTS and the three types of speech synthesis, and I want to define what diphone-based speech synthesis is. But the other lecturers have already explained the TTS structure and the types of speech synthesis, including unit-concatenation-based speech synthesis and HMM-based or DNN-based speech synthesis, so I will skip some slides about these things. That way I can shorten my presentation and you have more time for discussions at the posters. Then I will talk about prosodic modification using TD-PSOLA and the speech quality problems caused by prosodic modification. This is a big issue in diphone-based speech synthesis.

Okay, maybe I will skip this one. You already know that TTS systems have three major processing parts: text analysis, prosody generation, and speech synthesis.
And there are three types of speech synthesis: formant synthesis, unit-concatenation-based speech synthesis, and statistical parametric speech synthesis. I want to say that diphone-based speech synthesis is one type of unit-concatenation-based speech synthesis, where the system has a single inventory entry for each diphone. When the system has multiple inventory entries for each diphone, that system is called unit-selection-based speech synthesis. And statistical parametric speech synthesis was already introduced by Hager: one kind is HMM-based, and the others are neural-network-based speech synthesis such as DNN, RNN, or LSTM.

Okay, maybe I can skip this one, but I'll just play one sample that was demonstrated by Professor Klatt of MIT in 1983. He demonstrated a full text-to-speech synthesis system based on the source-filter model. Here, the source-filter model has a source model and a synthesis filter. The source model has two kinds of excitation signals: a pulse train separated by the pitch period for voiced parts, and white noise for unvoiced parts. This source model is controlled by the voiced/unvoiced decision, the pitch F0, and gain. The synthesis filter is characterized by formant frequencies and their bandwidths, and a gain. Okay, here is a speech sample demonstrated by Professor Klatt. Doesn't work. [Sample plays:] "Speech systems are beginning to be applied in many ways, including aids for the handicapped, medical aids and teaching devices. The first kind of aid to consider is the talking aid for the vocally handicapped. According to the American Speech and Hearing Association, there are over one million people in this country who are unable to speak for one reason or another." Okay. So the speech is fairly intelligible, but the quality is rather poor.

The next one is diphone-based speech synthesis. This system has just a single inventory entry for each diphone. According to the phoneme sequence, we select — actually retrieve — diphone units from the inventory. Then we apply prosodic modification to the diphone units in order to produce speech signals with the prosody given by the prosody generation part. Then we concatenate the modified diphone units to produce synthetic speech. In this case, we have a phoneme sequence like this — this is Japanese — and we retrieve diphone units from the inventory, concatenate them, and produce synthetic speech.

Unit-selection-based speech synthesis has multiple inventory entries with different spectra, F0, and durations, and unit selection is done based on a target cost and a concatenation cost. This system produces high-quality voice, but needs large data. The target cost is defined as differences or distances between the target unit and a candidate unit in features such as the current phoneme, the previous or following phonemes, the position of the phoneme in the word, the phrase, or the sentence, and the position of stress in the word. The concatenation cost is defined as a distance between the previous unit and the current unit in terms of spectrum, F0, and duration.

And Hager just introduced statistical parametric speech synthesis, so maybe I don't need to explain this part, but here I just want to say that statistical parametric speech synthesis tries to map linguistic features to acoustic features using statistical models — in this case, HMMs. This system uses statistical models instead of hand-crafted rules.

Okay, we started a research project on TTS in 1994.
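To make the source-filter idea from the Klatt demo concrete before moving on, here is a minimal Python sketch of a cascade formant synthesizer. This is an illustration, not the actual system: the sampling rate, the formant values, and the function names are assumptions, and only the second-order resonator form is the standard Klatt-style formulation.

```python
import numpy as np

FS = 16000  # sampling rate in Hz; an assumption for this sketch

def resonator(x, freq, bw, fs=FS):
    """Second-order IIR resonator (Klatt-style) at `freq` Hz, bandwidth `bw` Hz."""
    c = -np.exp(-2 * np.pi * bw / fs)
    b = 2 * np.exp(-np.pi * bw / fs) * np.cos(2 * np.pi * freq / fs)
    a = 1.0 - b - c
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = a * x[n] + b * (y[n - 1] if n >= 1 else 0.0) \
                        + c * (y[n - 2] if n >= 2 else 0.0)
    return y

def synthesize(duration_s, f0, formants, voiced=True):
    """Drive a cascade of formant resonators with a pulse train (voiced)
    or white noise (unvoiced), as in the source-filter model above."""
    n = int(duration_s * FS)
    if voiced:
        src = np.zeros(n)
        src[::max(1, int(FS / f0))] = 1.0   # pulses spaced by the pitch period
    else:
        src = np.random.randn(n)            # white noise excitation
    y = src
    for freq, bw in formants:               # (formant frequency, bandwidth) pairs
        y = resonator(y, freq, bw)
    return y / (np.max(np.abs(y)) + 1e-9)

# A rough /a/-like vowel with illustrative formant values:
vowel = synthesize(0.3, f0=120, formants=[(700, 80), (1220, 90), (2600, 120)])
```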
This is a diphone-based text-to-speech synthesis system. We selected diphone-based speech synthesis as the baseline system. Why did we select diphone-based speech synthesis? Because we focused on applications to embedded systems. At that time, the market for GPS navigation — in-car navigation — was growing in Japan and other countries, and a voice user interface was essential for such embedded applications. And Toshiba had a good semiconductor business, so I thought we could enhance Toshiba's semiconductor business by applying TTS or ASR to in-car navigation systems. Our research objectives were to create a method to synthesize high-quality speech with a very small footprint, less than one megabyte, and with low computational complexity, less than 15 MIPS. So one megabyte of memory and 15 MIPS of computation were our objectives.

At that time, there were two major speech synthesis methods: diphone-based speech synthesis and unit-selection-based speech synthesis. We selected diphone-based speech synthesis because we wanted to apply TTS to in-car navigation, e-dictionaries, or telephones, and these devices require a small footprint and a small computational cost. So we selected this one, and we decided to improve the speech quality of diphone-based speech synthesis.

This is a speech sample of diphone-based speech synthesis. Okay — sorry, this is Japanese, but maybe you can recognize that the speech sounds muffled and the F0 is very flat. The quality is not good; the quality is terrible. And we wanted to produce this level of speech quality — this was our target. In order to improve the speech quality, I focused on prosodic modification, because I believed that prosodic modification caused the speech quality problem. Diphone-based speech synthesis needs prosodic modification of the diphone units, as I said, because the system has just one inventory entry per diphone, so we need to modify the prosody in order to synthesize speech signals with the prosody produced by the prosody generation model. And for prosodic modification, we need to modify the prosody without distortion, or with minimum distortion. You can imagine that when you speed up a recording to shorten its duration, you end up with a higher pitch, and you end up with a lower pitch when you slow the recording down. So prosodic modification is not so trivial. We need time-scale modification that changes the duration without changing the pitch, and we need pitch-scale modification that changes the pitch without changing the duration.

For this purpose, the time-domain pitch-synchronous overlap-and-add method, TD-PSOLA, was proposed. This method is a signal processing method to perform time-scale and pitch-scale modification of speech. We isolate pitch waveforms in the original signal by applying a window to the original signal. Then we perform the modification by allocating the isolated waveforms at new pitch marks. Then we overlap-and-add to resynthesize the final waveform. I will explain again. First, we have a periodic signal here. We apply a window, for example a Hanning window, to this signal to decompose it into pitch waveforms. Then we overlap-and-add at new pitch marks with a shorter pitch period to increase the pitch frequency, like this, or place the new pitch marks with a longer pitch period to decrease the pitch.
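A minimal sketch of this pitch-scale modification, assuming the pitch marks are already given. The function names and the nearest-mark waveform reuse are illustrative choices, not the production algorithm; a real TD-PSOLA implementation handles voicing boundaries and mark estimation as well.

```python
import numpy as np

def extract_pitch_waveforms(x, marks):
    """Isolate two-period, Hanning-windowed pitch waveforms centred on each
    pitch mark (assumes at least two marks in a voiced region)."""
    waves = []
    for i, m in enumerate(marks):
        left = marks[i - 1] if i > 0 else max(0, 2 * m - marks[i + 1])
        right = marks[i + 1] if i + 1 < len(marks) else min(len(x), 2 * m - left)
        seg = x[left:right] * np.hanning(right - left)
        waves.append((seg, m - left))          # windowed waveform + centre offset
    return waves

def change_pitch(x, marks, factor):
    """Pitch-scale modification: same duration, F0 multiplied by `factor`.
    Waveforms are re-placed at rescaled pitch marks and overlap-added; they
    get duplicated (factor > 1) or skipped (factor < 1) to keep the duration."""
    x = np.asarray(x, float)
    marks = np.asarray(marks)
    waves = extract_pitch_waveforms(x, marks)
    new_marks, t = [], float(marks[0])
    while t <= marks[-1]:                      # new marks spaced by period / factor
        new_marks.append(int(round(t)))
        i = min(np.searchsorted(marks, t, side="right"), len(marks) - 1)
        t += (marks[i] - marks[i - 1]) / factor
    y = np.zeros(len(x))
    for m in new_marks:                        # overlap-add the nearest original waveform
        w, c = waves[int(np.argmin(np.abs(marks - m)))]
        start = m - c
        lo, hi = max(0, start), min(len(y), start + len(w))
        y[lo:hi] += w[lo - start:hi - start]
    return y
```

Time-scale modification works the same way, except the new marks keep the original spacing while spanning a stretched or compressed time axis.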
And using TD-PSOLA, we can modify pitch and duration at the same time. First we decompose the signal into pitch waveforms. Then we duplicate or delete pitch waveforms and place the decomposed pitch waveforms at new pitch marks to resynthesize the speech waveform. Like this, we can modify pitch and duration.

Now I will try to explain how pitch modification works in the frequency domain, and why distortion is caused by prosodic modification. Here we have a periodic signal, and a periodic signal has a discrete spectrum like this: the spectrum has values only at the harmonic frequencies. When the pitch is high, the spectrum has sparse sample points, like this. Prosodic modification amounts to resampling the spectral envelope at harmonic frequencies different from the original ones, like this. So we need to extract the spectral envelope from the discrete spectrum. But it is almost impossible to extract the true spectral envelope, because it is not guaranteed that this discrete spectrum satisfies the sampling theorem. So we get large distortion for high-pitched voices, and if we apply a large modification of F0, we get large distortion.

When we started our research project, the diphone units were created by hand-crafting, and this is a trial-and-error process. We have recorded speech samples here. Then we clip a small segment from the recorded speech and select a proper unit by hand, like this: we clip a small segment, evaluate it by synthesizing a speech signal using the clipped segment, and listen to evaluate the quality of the synthesized speech. We try this process many, many times, and so we create the diphone units. This is a labor- and time-consuming process. So diphone-based speech synthesis had problems both in speech quality and in the diphone unit creation process.

In order to overcome these problems, I came up with an idea: a closed-loop training method for diphone units. The idea is very simple. The idea is that we formulate the distortion in speech synthesis as an error between the original and the synthesized speech, and then we generate the diphone units that minimize that distortion. At that time I came up with two methods of closed-loop training. Method one is called unit selection by closed-loop training, and the other is called analytic generation by closed-loop training.

Okay, so the key point of this closed-loop training is the formulation of distortion in speech synthesis. When I started on TTS in 1994, the people in the research community — the diphone-based speech synthesis community — said that there is no way to calculate that distortion, because TTS takes text as input and outputs waveforms, and there is no way to calculate a distortion between different dimensions like text and waveforms. So I introduced recorded voice as a reference, like this. Maybe you don't believe it now, but nobody was saying this kind of thing at that time. I introduced recorded voice here. Then we extract pitch and duration information from the speech data. Then we apply prosodic modification and concatenation to the candidate so that the pitch and duration match those of the recorded voice. Now we have a synthesized speech signal here, and a recorded speech signal here, and we can calculate the distortion as an error between these two signals.

So, method one: unit selection by closed-loop training. First we set the unit candidates, and then we modify the prosody of the candidates and synthesize waveforms.
Then we calculate the distortion between the synthesized and the original waveforms in the training data, and then we select the diphone units that minimize the total sum of the distortion. This is method one. Analytic generation by closed-loop training is like this: we first formulate the synthesized waveform as a function of the diphone unit, and formulate the distortion between the synthesized and original waveforms. Then we analytically generate the optimum unit that minimizes the total sum of the distortion. This is method two. So closed-loop training is a very simple idea, but very effective at minimizing the distortion.

Here is a block diagram of method one. We have a speech corpus as training data here, and we set the candidate units; they are part of the training data — the same as the speech data in the training data space. We extract the prosody from the speech signals in the training data. Then we apply prosodic modification and concatenation to the candidates. Then we calculate the distortion between the synthesized speech and the original speech, and we select the best candidate as the diphone unit. As I said, the prosodic modification is done using TD-PSOLA: we decompose the periodic signal into pitch waveforms like this, align these pitch waveforms to the pitch marks of the training data, then overlap-and-add, and we get the synthesized speech here. The distortion is evaluated based on the squared error with normalized power, like this. We could use any kind of distortion here; this is just one example.

Okay, now I will explain unit selection using closed-loop training in more detail. Suppose that we have four candidates for a synthesis unit, U1 to U4, and we have training vectors R1 to R5 in the speech corpus. Then we can calculate the error between the speech synthesized using U1 and the original speech vector R1, like this. We can calculate the errors for all combinations of candidates and training vectors, like this, and this gives the error matrix. Then we calculate the cost, defined like this. For example, if we select just one unit from the four candidates, then the total error is the sum of the errors along that row: if we use U1, the total error is the summation of E11 to E15, like this. And if we select two units from the four candidates, then for each training vector we choose the smaller error. For example, if we select U1 and U2 from the candidates, then for each column we choose the smaller of the two errors, like this: first we choose this one, next this one, then this one, this one, and this one. The total error is calculated like this, and the error is 1.8. We calculate the cost for every possible combination of two units out of four, like this. If we choose unit three and unit four, then the cost is calculated like this. And we select the two units that give the minimum total error. Okay, this is the closed-loop training method — method one, unit selection using closed-loop training.

Here we evaluated this method. In this case, we used 40 minutes of speech data, one female speaker's voice, and the number of sentences was about 600 — just 600 sentences. We created 262 diphone units. And we compared closed-loop training with open-loop training.
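A minimal sketch of this selection step. The `synthesize` callback stands in for the TD-PSOLA prosody matching described above, the power-normalized error is one reasonable reading of the slide's distortion, and the exhaustive subset search matches the small four-candidate example rather than a production-scale search.

```python
import itertools
import numpy as np

def normalized_error(y, r):
    """Squared error with power normalization: scale y by a least-squares
    gain, then compare to r relative to r's energy (an assumed reading)."""
    g = np.dot(r, y) / (np.dot(y, y) + 1e-12)
    return float(np.sum((r - g * y) ** 2) / (np.sum(r ** 2) + 1e-12))

def build_error_matrix(candidates, references, synthesize):
    """E[i, j]: distortion when candidate i is prosody-modified to match
    reference j; `synthesize(u, r)` applies the TD-PSOLA matching."""
    E = np.empty((len(candidates), len(references)))
    for i, u in enumerate(candidates):
        for j, r in enumerate(references):
            E[i, j] = normalized_error(synthesize(u, r), r)
    return E

def select_units(E, k):
    """Pick the k candidates whose per-reference minima give the lowest
    total error (exhaustive, fine for the four-candidate example)."""
    best, best_cost = None, np.inf
    for subset in itertools.combinations(range(E.shape[0]), k):
        cost = E[list(subset)].min(axis=0).sum()
        if cost < best_cost:
            best, best_cost = subset, cost
    return best, best_cost

# e.g. with a 4x5 error matrix as in the talk: select_units(E, 2)
```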
Open-loop training is like this: we have training data, we calculate the LSP parameters of all the training data, and we take the average over all the data, like this. Then we select as the synthesis unit the one whose LSP parameters are closest to the average — the closest candidate is the data used for the synthesis unit in this case. This is open-loop training.

This is the experimental result, a preference test. For female speech, closed-loop training is much better than the open-loop training method, but there is no significant difference in the case of male speech. When the pitch is high, the distortion caused by prosodic modification is larger, so for female speech, closed-loop training is much better than open-loop training.

[Question:] May I ask why there is such a difference between male and female? Was it just this specific female speech or male speech that made the difference, or would the performance be consistent independently of who the female or male speaker is? — Okay, this is just one example of the evaluation. In this case there was no significant difference for male speech between closed-loop and open-loop training, but there are many, many cases where closed-loop training is much better than open-loop training. [Question:] Could you explain a little bit what open-loop training refers to? — Open-loop training? Open-loop training means that we create a speech unit that is closest to the center of the training data.

Next is the analytic generation method. The analytic generation method doesn't assume we have unit candidates; we don't know what the optimum unit is. So here we have an unknown unit, U, and we want to describe — to formulate — the prosodic modification and the synthetic speech as a function of this unknown unit U. Then we can define the distortion between the original speech and the synthesized speech here, and we can minimize the distortion to obtain the optimum unit. This is the concept of closed-loop training: to minimize the distortion.

Okay, here is the algorithm. We prepare speech segments as training data. As initialization, we set initial synthesis unit vectors. Then we partition the training vectors into several clusters based on the nearest-neighbor condition, and we generate the optimum unit vector that minimizes the distortion in each cluster. We update the synthesis unit vectors, and we repeat the partitioning and the generation of the optimum units. I said diphone-based speech synthesis is a system that has only one single unit per diphone, but this algorithm shows that we can create more than one unit for each diphone.

Okay, the partitioning of the training vectors is done using these equations. This is a normal partitioning algorithm, but note that the partition is not based on the distance between a training vector and the centroid of the cluster; the partition is made based on the distance between the training vector and, in this case, the vector synthesized using the unit. Here we have training vectors shown by X, like this. A normal clustering algorithm works like this: we calculate the centroid of the cluster, and based on the distance between each vector and the centroid, we partition the data. But in this case, we partition the training data based on the distance between each vector and the vector synthesized using U.
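A compact sketch of this iterative procedure, folding in the closed-form least-squares update that the next part derives. The per-pair gain from the talk is omitted here to keep the sketch short, so this is an illustrative simplification under that assumption, not the exact published algorithm.

```python
import numpy as np

def optimal_unit(pairs):
    """Closed-form unit for one cluster: minimize sum ||r - A u||^2 over u,
    where A is the prosody-modification (overlap-add) matrix for each pair."""
    M = sum(A.T @ A for A, _ in pairs)
    v = sum(A.T @ r for A, r in pairs)
    return np.linalg.solve(M + 1e-9 * np.eye(M.shape[0]), v)

def closed_loop_units(training, n_units, n_iter=10):
    """training: list of (A, r) pairs; A maps a unit vector onto the pitch
    marks of reference waveform r. Alternates partitioning and generation."""
    rng = np.random.default_rng(0)
    dim = training[0][0].shape[1]
    units = [rng.standard_normal(dim) for _ in range(n_units)]
    for _ in range(n_iter):
        # Partition by distance to the SYNTHESIZED vector A u, not to a centroid.
        clusters = [[] for _ in units]
        for A, r in training:
            k = int(np.argmin([np.sum((r - A @ u) ** 2) for u in units]))
            clusters[k].append((A, r))
        # Generate the optimum unit per cluster (keep the old unit if empty).
        units = [optimal_unit(c) if c else u for c, u in zip(clusters, units)]
    return units
```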
This partitioning is different from the traditional partitioning algorithm. And the prosodic modification is done by this — okay, I already explained it, so I'll skip this one. The next step is to formulate the prosodic modification. As I said, the prosodic modification is an overlap-and-add operation on separated pitch waveforms, so this process can be written as a matrix operation, like this. This matrix represents the overlap-and-add operation. In this illustration we use a rectangular window, but actually we use a Hanning or Hamming window, so the coefficients of the matrix differ from simple zeros and ones. Anyway, we can write the overlap-and-add operation as a matrix operation like this.

So we can calculate the distortion as an error between the training vector and the synthesized vector Y, where Y is described by the matrix operation between A and U. Here, we introduce a gain to adjust the difference in amplitude between the synthesized vector and the training vector, and we optimize this gain factor G to minimize the distortion; we get the optimum gain like this. Under this optimum-gain condition, we calculate the distortion as a squared error. And the generation of the optimum unit is very simple: we calculate the distortion over all the training vectors like this, take the partial derivative of E with respect to the unit U, and set it to zero. We get linear equations like this, and we can obtain the optimum unit U by solving these equations. This is closed-loop training in an analytic way.

So here we have the evaluation. In this case, we used 40-minute-long Japanese male and female voices, and we created about 300 diphone units. We used prosody generated by a method based on closed-loop training, which I will explain later. And here is the result: method two, the analytic generation method, is much better than the selection-based method, method one. The absolute distortion values have no meaning in themselves: the distortion when we use just one unit per diphone with method one is set to one, so these values are relative to that. And in the preference test, the analytic method has a much bigger preference than the selective method for both male and female voices.

Okay, so here I want to explain why we can minimize the distortion in synthetic speech; this is an explanation in the frequency domain. Suppose that we have periodic signals — speech signals like this — that have the same spectral envelope but different F0. Then closed-loop training tries to extract the most likely spectrum over the different F0 values. The method exploits the increased number of sampling points of the spectrum. This is the reason why closed-loop training can improve the speech quality.

Okay, so now I will move on to closed-loop training of the prosody generation model. Maybe I should stop here. — Yeah, if you want; you can take a break. But are there questions? — Well, if you have questions.

Okay, so the next topic is the prosody generation model. Here I'll show you the generation model, the closed-loop training method for the F0 contour codebook, and how to select the representative vectors from the codebook. And here is the prosody generation model. We created this model as a trainable model, because we wanted to train the prosody generation using data. Here we have an F0 contour codebook with representative vectors for each accent type, and we select a vector from this codebook based on the text features, or linguistic features.
Then we apply expansion or contraction to the selected vector according to the duration. Then we move this vector upward or downward using offset-level prediction. All components — the F0 contour codebook, the vector selection, and the offset adjustment — can be trained from data. So we first generate the F0 contour codebook using closed-loop training, then we select vectors based on linguistic features, and we predict the offset based on linguistic features as well. We use the quantification theory type I model for contour vector selection and offset prediction.

Here is the F0 contour codebook, and this shows how to train it — how to get the F0 contour codebook. First we have a speech database here, and we extract the F0 contours; we set up the F0 contour codebook here and set initial values for it. Using this, we generate prosody according to the prosody generation model from the previous slide: we apply those operations — vector modification — to the initial codebook vectors, and we get the prosody generated by this model. Then we calculate the error between the original contour vector and the synthesized vector. Based on this error, the distance between the two vectors, we cluster the training vectors. Then we obtain the optimum codebook vector for each cluster. The algorithm is very similar to that for generating the optimum units in diphone-based speech synthesis. Finally, we take the approximation errors from this process as training data for the models that estimate the approximation error, used for vector selection and offset estimation. You can see the next slide for more details.

The clustering is done in this space, the same as for unit generation: we have many training vectors, and we cluster these data based on the distance between each vector and — not the centroid, but — the vector synthesized using the prosody generation model. And this is the prosody generation model: this is the pitch contour of a segment, and here is a representative vector C in the codebook. This matrix D is the duration matrix; it is actually a time-warping matrix representing this operation, the expansion and contraction of the duration. And here, B is the offset of the pitch contour. The error — the distortion between the original pitch contour and the synthesized pitch contour — is defined as a squared error, like this. We take the partial derivative of this error with respect to the codebook vector C and set it to zero; then we obtain this linear equation, and by solving this equation we can obtain the optimum representative vector of the codebook. This is the closed-loop training of the codebook vectors. The concept is the same as that of closed-loop training for speech units.

The next step is to create an estimation model that estimates the approximation error of the coding. When we use a particular vector from the codebook, we estimate the error for each vector, and then we select the best one according to this error estimate. For this estimation, we use the quantification theory type I model. Here we have training data from the approximation error corpus obtained during codebook training, and the quantification theory type I model has this estimation form.
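Before the selection-model details, here is a sketch of the codebook update just described. The linear-interpolation warping matrix and the fixed offsets are assumptions made to keep the example short; the talk also optimizes the offsets, and the actual warping matrix is whatever the duration model dictates.

```python
import numpy as np

def warp_matrix(src_len, dst_len):
    """Linear time-warping matrix D (dst_len x src_len): D @ c stretches or
    contracts a codebook contour of length src_len to the target duration."""
    D = np.zeros((dst_len, src_len))
    for t in range(dst_len):
        p = t * (src_len - 1) / max(dst_len - 1, 1)
        i, frac = int(p), p - int(p)
        D[t, i] += 1 - frac
        if i + 1 < src_len:
            D[t, i + 1] += frac
    return D

def update_codeword(cluster, dim):
    """Closed-form codebook update for one cluster: minimize
    sum_i ||y_i - D_i c - b_i||^2 over the codeword c, with the offsets b_i
    held fixed (a simplification; the talk derives the optimum offsets too)."""
    M = sum(D.T @ D for D, _, _ in cluster)
    v = sum(D.T @ (y - b) for D, y, b in cluster)
    return np.linalg.solve(M + 1e-9 * np.eye(dim), v)

# cluster entries: (warp_matrix(dim, len(y)), y, offset) for each training contour y
```

As with the unit generation, the clustering step assigns each training contour to the codeword whose warped-and-offset reconstruction is closest, then this update is applied per cluster.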
In the quantification theory type I model, the estimate is a sum of coefficients associated with the linguistic features. For example, one linguistic feature is the position of the current phrase — first, middle, or last. If the linguistic feature matches a category, then we apply the corresponding coefficient, like this; we set coefficients for the other features in the same way, and we estimate the error like this. And we train these coefficients using the training data to minimize the difference between the estimated error and the actual error, okay? A little bit complicated. For this training, we have data of the approximation error obtained when using the optimum representative vector here, and we want to estimate this error from the linguistic features; we use this error as training data for the estimation model, like this. In the same way, we estimate the offset level. For the offset level, we also use the quantification theory type I model, and here the training data is the corpus of optimum offset levels obtained during codebook training. I will skip the details of this model, but the training method is the same as the previous one.

Here we had an experiment. In this case, we used about 866 sentences of a Japanese male voice. We extracted F0 and did manual correction of the F0. Then we created 48 vectors for six accent types, so each accent type has eight representative vectors in the codebook. And we did a preference test between speech synthesized by the proposed method and speech synthesized by a conventional method; the conventional method uses a CART (decision tree) to estimate the F0 contour for each prosodic phrase. Here is the preference test result. For naturalness, the proposed training method is strongly preferred over the conventional method, and likewise in the similarity test the proposed method is strongly preferred. And this shows some examples of F0 contours: this is the original, and this line shows the contour produced by the proposed method.

Okay, and here are speech samples. As I demonstrated, plain diphone-based speech synthesis has very poor quality, but here is a speech sample created with closed-loop training. You can listen to this sample; it was created by closed-loop training, using just one unit per diphone. That means this system has just about 300 diphones — very small. And again, this was the conventional method. This is Japanese, but maybe you can recognize that the speech quality is much, much better than this one. Of course, the speakers are different, but you can hear it, okay. And here are some other samples — messages for in-car navigation systems. This one is in English: "The holiday season, expect congestion on the main road leading to the beaches. Take the first turn left, then at the second traffic light, turn right and continue on about 100 yards." Okay, these samples were created using diphones with a footprint of 500 kilobytes for Japanese and 800 kilobytes for English. Small footprints — very, very small.

Next, I'd like to talk about multiple unit selection and fusion. Firstly, I want to explain our motivation, why we developed this method. We wanted a scalable system across different corpus sizes and voice qualities. This system has a multiple-unit-selection part and a part that fuses the units into one diphone; the other processing is the same as diphone-based synthesis, so only this part differs from the previous system. We can change the number of units to be fused depending on the corpus size.
If the corpus size is big, then we select just a few units, for example two or three, and fuse them into one diphone unit. If the speech corpus is small, then we select maybe ten units and fuse them into one. This makes the synthesized speech more stable. This is the motivation.

How can we do this — how do we select multiple speech units and fuse them into one? The unit selection is the same as in unit-selection-type speech synthesis. First we do pre-selection by target cost, then search for the best path based on the target cost and the concatenation cost, like this. So we select the best path using the target cost and the concatenation cost, and then select a number of units based on the best path. Once we have found the best path, we select multiple units based on the target cost between each unit and the others, and also the concatenation cost between, for example, this one and the previous best unit, okay, like this.

Then unit fusion is done like this: we fuse several units into one. After that, we apply an overlap-and-add process to synthesize the speech signal for the voiced parts; for unvoiced parts, we just apply unit concatenation. Unit fusion works like this: we select, for example, three units, then we decompose the periodic signals into pitch waveforms like this. Then we normalize the duration and adjust the phase of each pitch waveform, like this. Then we take an average — we apply an averaging operation to these pitch waveforms — and get the fused unit, okay. [Question:] Averaged sample by sample, or fragment by fragment? — Yeah, averaged sample by sample, yes.

And here are speech samples created by this method. In this case, we used a speech corpus read by small children — elementary school girls and boys. Okay, this is also Japanese, but you can listen to the quality of the synthetic speech. Like this — and the next one. Okay, this is almost the end of this part of the presentation. Next, I'll talk about products and services using TTS, and I'll demonstrate some samples. Do you have any questions?

[Question:] I have one regarding the codebook for English: was it more or less varied than for Japanese? Because in Japanese you have these accent types for F0 modeling. — It's a good question, but I'm not an expert on language. I think English has more variation in F0 contours, and also more variation in units, so the English inventory has a bigger footprint. [Question:] And the second question: was the waveform representation simply raw samples, or some kind of coding? — Just raw samples.

Okay, so we applied TTS to maybe three kinds of applications: human-machine interfaces, including in-car navigation systems, home appliances like TVs, air conditioners, and fridges, and e-book readers. We can also use TTS for content creation services — creating content for e-learning or educational texts. And there are many smartphone applications using TTS. As a business at Toshiba, we have a middleware licensing business to navigation manufacturers, car manufacturers, game manufacturers, and e-dictionary and telephone makers. We apply TTS to Toshiba products like TVs, fridges, other home appliances, and laptop PCs. And we have microprocessors embedded with TTS and ASR middleware.
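Looping back to the unit-fusion step described above, here is a minimal sketch of the averaging core for one pitch waveform, under stated assumptions: inputs are already-isolated pitch waveforms from the selected units, duration is normalized by linear interpolation, and phase alignment uses a crude cross-correlation shift. A production system would do this per pitch waveform across the whole diphone.

```python
import numpy as np

def fuse_pitch_waveforms(waveforms, length=None):
    """Fuse several selected pitch waveforms into one: normalize length,
    align phase against the first waveform, then average sample by sample."""
    length = length or max(len(w) for w in waveforms)

    def stretch(w):
        # Duration normalization by linear interpolation to a common length.
        return np.interp(np.linspace(0, len(w) - 1, length),
                         np.arange(len(w)), np.asarray(w, float))

    ref = stretch(waveforms[0])
    fused = np.zeros(length)
    for w in waveforms:
        s = stretch(w)
        # Phase alignment: shift to the cross-correlation peak (circular, crude).
        shift = int(np.argmax(np.correlate(ref, s, mode="full"))) - (length - 1)
        fused += np.roll(s, shift)
    return fused / len(waveforms)
```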
As another type of business, Toshiba has semiconductor chips for these devices, plus software and content creation services — for example for elevators or internet services, e-learning content, and others.

First, I introduced Toshiba's closed-loop-training TTS to car manufacturers, and a famous car manufacturer adopted Toshiba's TTS in 1999. At that time, we provided TTS middleware to four vendors, and these four vendors had a 38% market share in Japanese in-car navigation systems. This market share then increased, gradually and then dramatically: in two years the market share grew to about 72%, and in 2006 Toshiba's TTS was used in about 95% of in-car navigation systems in the Japanese market. So although closed-loop-trained diphone-based speech synthesis is an old technology, that technology still survives the competition against HMM synthesis and other types of speech synthesis. This method is still used in in-car navigation, e-dictionaries, and others, and we still have a big market share, about 80 to 85%.

We also have a semiconductor chip used in in-car navigation systems and for making hands-free calls while driving. This is a chipset with Bluetooth and a microprocessor; the microprocessor has ASR, TTS, and grapheme-to-phoneme conversion functions. The ASR here handles a very small vocabulary under the specific conditions of use in the car.

And we also created a website to open TTS voice creation to the public. This was done five years ago; currently we have stopped this service. At the website, we provided a pipeline system for users to create new voices. Anyone could upload a recorded voice of about 100 sentences — small data, just 100 sentences — and the process would create the diphone units. The process completes in one hour, so anyone could get synthesized speech in their own voice one hour after uploading it to the website. We created more than 800 voices from users' voices. Here is one example, recorded by a user with his own recording device — a little bit noisy, but maybe that is a problem of the audio system. This is the original speech, and this is the synthesized speech. The similarity is not so good, but the quality is not so bad. These diphones were created from just 100 sentences.

And we have more than 50 speakers' voices for professional use; here are some samples, like this. We have different speaking styles — for example, a conversational style, like this. And as I said, our system is scalable in terms of footprint size and speech quality. Like this: if we use diphones with a 2.5-megabyte footprint, then the quality is like this, and if we have a big speech corpus, then the quality is improved. These samples are produced by just one system. The MOS — I'm not sure, but about 3.2 to 3.6 or 3.7.

And this year, we started a new cloud service for media processing. This includes content creation using TTS, transcription services using ASR, speech-to-speech translation applications, and spoken dialogue. We are providing these functions to customers. Here is a website you can access — but sorry, this is in Japanese. This is online, so we can input any text, like: "Now I'm in Crete and I'm demonstrating at the Summer School" — that is the meaning of this. And okay, great, like this. And of course, we can change the speaker, change the speech, and change the speaking style, like this.
Okay, this is an angry voice. And the next demonstration uses cluster-adaptive-training-based statistical speech synthesis, so I will explain a little about CAT-based (cluster adaptive training) statistical models. We train cluster models using cluster adaptive training. In this model, the mean of a distribution is defined as a linear combination over P clusters, like this: we have different clusters, each cluster has a different decision tree for regression, and by combining the mean vectors from the different clusters we can create the mean vector, like this. In this case, the CAT weights — the lambdas — represent a continuous voice space. And the good thing about the CAT model is that, unlike the eigenvoice model, CAT doesn't need to force the clusters to be orthogonal, and the decision trees can be independent, so we can combine different decision trees to cover more contexts efficiently, like this. And we can control the voice character using the CAT weights, like this. We have the average voice model at lambda zero, and through the differences between the cluster models and the average voice model, we can control the characteristics of the voice by changing the weights. Okay. This is the last demo. Okay. Thank you. Thank you for your attention.
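As a closing concrete note on the CAT mean combination described in this last part, here is a minimal sketch. The bias-cluster convention (cluster 0 carries the average voice with an implicit weight of 1) is an assumption for illustration; the actual weight estimation and decision-tree machinery are not shown.

```python
import numpy as np

def cat_mean(cluster_means, weights):
    """Cluster adaptive training: a state mean is a weighted sum of cluster
    mean vectors; the weight vector (lambda) places a voice in a continuous
    voice space. Convention here: cluster 0 is the bias (average-voice)
    cluster with its weight fixed to 1."""
    mu = np.asarray(cluster_means[0], float).copy()   # average-voice bias
    for lam, m in zip(weights, cluster_means[1:]):
        mu += lam * np.asarray(m, float)              # weighted cluster offsets
    return mu

# Controlling voice character by moving the weights, per the talk's description:
# mu = cat_mean([mu_bias, mu_cluster1, mu_cluster2], weights=[0.3, 0.7])
```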