First, I will present the summary paper of the Blizzard Challenge 2020. As we have mentioned, this is the 16th annual Blizzard Challenge; the first challenge was held in 2005, and this is the first challenge organized by the team at the University of Science and Technology of China. This year we released two datasets of Chinese speech for evaluation, and all evaluations were conducted online. In terms of the techniques adopted by the systems, we can see a clear trend toward using neural acoustic models and neural vocoders. I will present more details in the following slides.

For the task design, a company generously provided two datasets for system construction. The first is a Mandarin speech dataset. The transcriptions are daily news, and the speaker is a professional male native Mandarin speaker. The total duration is about 9.5 hours, the sampling rate is 48 kHz, and the audio is provided together with text transcriptions. The other dataset is Shanghainese, considering that this workshop was originally planned to be held in Shanghai. Its transcriptions are also daily news, the speaker is a professional female native Shanghainese speaker, and it contains 3 hours of speech at a 16 kHz sampling rate. In addition to audio and text transcriptions, we also provided phonetic transcriptions for both the training and the test data of this Shanghainese dataset.

Based on these two datasets, two tasks were designed. The first is the hub task, which is to build one voice in Mandarin using the provided data; each participant is required to synthesize a test set of 700 utterances, including news sentences, PSC sentences, and intelligibility sentences. The PSC sentences are sentences from the Mandarin proficiency test in China, and the intelligibility sentences are semantically meaningless ones composed of randomly chosen words. The spoke task is to build one voice in Shanghainese using the provided data; each participant is asked to synthesize a test set of 391 utterances, including news and chat sentences. The rules this year follow the rules of previous challenges, with one modification: we put a limit on the amount of external data, meaning each participant must use no more than 100 hours of audio for each task, including the provided data.

This challenge attracted 32 teams to register and download the data. Among them, 16 teams submitted hub task results and 8 teams submitted spoke task results, with a very balanced distribution between academic and industry teams. Following previous challenges, all systems are identified by anonymous letters during the evaluation; the letter A is used for natural speech, and we had no benchmark system this year. All systems this year follow the neural network-based speech synthesis approach. That means, for the first time in the challenge's history, no hidden Markov model based, unit selection, or waveform concatenation systems were submitted. Among the 16 hub task submissions, 11 teams adopted sequence-to-sequence models such as Tacotron, and 5 teams adopted DNN-based models. Neural vocoders were adopted by all systems, with WaveRNN, WaveNet, and LPCNet being popular choices. Ten of the 16 teams utilized external data, such as the DataBaker open Chinese dataset and LJ Speech.
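To make the shared pipeline behind all of these entries concrete, here is a minimal schematic in Python. It is not any team's actual code: the module bodies are empty placeholders, and the symbol set, frame counts, and hop length are illustrative assumptions only.

```python
# Schematic of the pipeline shared by the 2020 entries:
# text front-end -> acoustic model (seq2seq such as Tacotron, or a frame-level DNN)
# -> neural vocoder (WaveRNN / WaveNet / LPCNet). Shapes are illustrative only.
import numpy as np

def front_end(text: str) -> list[str]:
    """Placeholder G2P + prosody prediction: returns a phoneme/tone/prosody symbol sequence."""
    return list(text)  # dummy: one symbol per character

def acoustic_model(symbols: list[str], n_mels: int = 80, frames_per_symbol: int = 10) -> np.ndarray:
    """Placeholder acoustic model: maps N symbols to a (T, n_mels) mel spectrogram."""
    return np.zeros((len(symbols) * frames_per_symbol, n_mels), dtype=np.float32)

def vocoder(mel: np.ndarray, hop_length: int = 240) -> np.ndarray:
    """Placeholder neural vocoder: maps T mel frames to T * hop_length waveform samples."""
    return np.zeros(mel.shape[0] * hop_length, dtype=np.float32)

wav = vocoder(acoustic_model(front_end("一个测试句子")))
print(wav.shape)
```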
Okay, let's move on to the listening test. Just like in previous challenges, for the hub task we had three types of listeners: paid university students, speech experts, and volunteers. For the spoke task, because it is Shanghainese, we only had paid university students, recruited at Shanghai International Studies University. Due to the impact of COVID-19, all listeners completed the test online this year. Usually the paid listeners would take the test on site, but this year all of them took it online.

This is how the test was designed. We only selected a subset of the synthesized test sentences and used them in the formal listening test. For the hub task there are six sections, each with 17 samples; these sections cover similarity, naturalness, paragraph, and intelligibility evaluations. For the spoke task there are seven sections with nine samples in each section, covering similarity, naturalness, and intelligibility. It should be noted that, considering the complexity of the input transcriptions for Shanghainese, a dictation test was not conducted for the intelligibility evaluation of the spoke task. Instead, the listeners were asked to choose a response representing how intelligible the synthetic voice was on a scale from 1 to 5, just like a mean opinion score test. The overall completion rate of the registered listeners was 84.8%, which is quite high this year. We also discarded some results that did not seem reliable. Finally, the numbers of listeners used in the final listening test results were 370 for the hub task and 87 for the spoke task.

Okay, let's go to the results of the challenge. First, let's look at the hub task. This figure shows the box plot of the mean opinion scores on naturalness for all systems. Here, letter A indicates natural speech, and the other systems are ranked in descending order of mean naturalness. Below each system letter you can also see some colored symbols, which indicate what kind of acoustic models and neural vocoders were used. At first, I'd like to play a sample synthesized by all of these systems; you can hear them one by one. Can you hear the sample? We can, but the audio has stopped. Okay, maybe I can restart it. Yes, please. [The same sentence, "但是收取的费用最高不得超过60%" ("but the fee charged must not exceed 60%"), was then played as synthesized by each system in turn.]

As a native speaker of Mandarin Chinese, I can say that most of these samples sound quite good. Still, according to the significance analysis, no system was as natural as the reference natural speech. Among the submitted systems, system I and system O were significantly more natural than all other systems; their mean opinion score was 4.2. Looking at the techniques indicated by these labels, there is no clear conclusion on whether sequence-to-sequence models perform better than DNN-based models, because among the top systems some used DNN-based models and some used sequence-to-sequence models. Regarding the vocoder, most of the systems used autoregressive neural vocoders; only two systems used non-autoregressive ones, one being WaveGlow and the other Parallel WaveGAN. Most of the top systems used WaveRNN or WaveNet as their neural vocoders.
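As background to the ranking and significance analysis just described, here is an illustrative sketch (not the official Blizzard analysis scripts) of how per-system MOS and a pairwise non-parametric significance test can be computed. The listener scores below are invented.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical per-listener naturalness responses (1-5) for three systems.
scores = {
    "A (natural)": np.array([5, 5, 4, 5, 5, 4, 5, 5]),
    "I":           np.array([4, 5, 4, 4, 5, 4, 4, 4]),
    "O":           np.array([4, 4, 4, 5, 4, 4, 5, 4]),
}
for name, s in scores.items():
    print(f"system {name}: MOS = {s.mean():.2f}")

# Non-parametric pairwise comparison of two score distributions.
stat, p = mannwhitneyu(scores["A (natural)"], scores["I"], alternative="two-sided")
print(f"A vs I: U = {stat:.1f}, p = {p:.3f}  (p < 0.05 would indicate a significant difference)")
```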
Okay, let's continue. This figure shows the similarity results, and we have some conclusions here. No system was as similar to the target speaker as the natural speech, and system I was significantly more similar to the target speaker than all other systems except system O; the similarity mean opinion score of system I was 4.2. For intelligibility, this figure shows the pinyin error rates with tones computed from the listeners' dictation results. Several systems achieved results not significantly different from natural speech: systems D, I, L, and P. There was no significant difference among them and the natural speech in terms of intelligibility.

Okay, then let's move to the spoke task. This figure shows the naturalness results as a box plot; we had eight submissions together with natural speech, and these labels indicate their methods. At first, I will also play a sample. [A Shanghainese sample sentence was played, synthesized by each system in turn.] According to the listening test results, no system was as natural as the reference natural speech, and system I was significantly more natural than all other systems; its mean opinion score on naturalness was 4.0. For similarity, no system was as similar to the target speaker as natural speech. System E achieved a mean opinion score of 4.1 and was significantly more similar to the target speaker than all other systems except system L, that is, this one. For intelligibility, as I have said, we did not use dictation but a mean opinion score style test for the intelligibility evaluation of the spoke task. There was no significant difference between systems I and E; the intelligibility median score of system I was 5, the same as natural speech. However, in a further investigation we calculated the Pearson correlation coefficient between the naturalness and intelligibility scores of all systems and found a strong correlation of about 0.76. This means that with this kind of opinion-score intelligibility test, listeners may also be influenced by the naturalness of the speech when making their judgments. So maybe in the future we should prefer to use the dictation test.

Okay, let's go to the summary of this paper. It has been eight months from the announcement to this workshop. In this challenge we designed two tasks, 16 teams altogether submitted their results, and we recruited more than 500 listeners. The best naturalness and similarity scores of the best system were 4.2 on the hub task. If you are interested in more material from this challenge, maybe we haven't done it yet, but in the near future we will put the listening test results and the synthetic speech materials on the Blizzard Challenge website. All the Blizzard Challenge papers are also available on the festvox website. I also want to thank the many people who helped us to organize this challenge, especially all the participating teams and all the listeners. Thank you very much. Okay, thank you. So this is the presentation of the summary paper of this challenge. Now, I think we still have two minutes for questions, so if you have any questions, you can type them.
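As a side note on the correlation analysis mentioned for the spoke-task intelligibility results, the check can be sketched as below. The per-system scores here are invented; the paper's reported Pearson correlation is about 0.76.

```python
from scipy.stats import pearsonr

# Hypothetical per-system mean opinion scores for the spoke task.
naturalness     = [4.0, 3.6, 3.4, 3.1, 2.9, 2.7, 2.5, 2.2]
intelligibility = [4.6, 4.3, 4.4, 4.0, 3.9, 3.6, 3.7, 3.3]

r, p = pearsonr(naturalness, intelligibility)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")  # a strong r suggests listeners conflate the two scales
```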
Hello, everyone. My name is Lei Peng from the Unison AI Technology Corporation. Today I will give a brief introduction to what we have done for the Blizzard Challenge 2020. We are team I, and most members are my colleagues, together with teammates from Shanghai Normal University. This is our first time participating in the challenge. We built two TTS systems, for the Mandarin and Shanghainese tasks, with the same architecture. I will divide the presentation into four parts, and now let's begin.

Usually the front-end module for Mandarin TTS consists of multiple sub-modules, as shown on the slide. This time we used a rule-based text normalization module to convert non-standard forms of words into spoken forms. Meanwhile, we applied two CRF models for word segmentation and part-of-speech tagging, respectively. The G2P module converts Chinese characters into a pinyin sequence by looking up a dictionary; however, some pronunciations are difficult to determine, such as those of polyphones. We applied CRF models to the most frequently used polyphones to predict their pronunciations. The same method was also used to predict prosodic word and prosodic phrase boundaries: word form, part of speech, and word length were used as input to predict prosodic word boundaries, and when dealing with prosodic phrases, the prosodic word boundary was also taken into account. In Mandarin TTS, tone sandhi and erhua significantly affect intelligibility. Our system used grammar rules and prosodic information to handle these special situations, such as third-tone sandhi and the tone changes of "yi" and "bu". For erhua, we first listed all the non-erhua words, so that the system could read the words not on that list in the correct, rhotacized way.

A Tacotron-based sequence-to-sequence model was used to predict mel spectrograms from character sequences. The CBHG was retained as a post-processing module to further improve prediction accuracy. Chinese characters were first converted to pinyin sequences and then further represented with the International Phonetic Alphabet. Prosodic word and phrase boundary symbols were inserted between the IPA characters. A trainable embedding table was used to store speaker embeddings, and for the Mandarin data each paragraph was equipped with an embedding for style modeling. We used an autoregressive WaveNet to reconstruct the audio waveform. Audio samples were quantized with 10-bit mu-law encoding. Speaker embeddings and mel spectrograms were used as global and local conditioning, respectively. Dilated convolutions were adopted and the filter width was set to 2.

The training data was annotated before the training stage. Some extremely long sentences were divided into multiple shorter segments. Prosodic word and prosodic phrase boundary annotations were performed on the Mandarin data. Silence at the beginning and the end of each sentence was trimmed to further improve the stability of attention. What's more, we upsampled the Shanghainese audio to a sample rate of 24 kHz; the high-frequency band was a fake spectrogram generated from the original audio. In the training phase, we first trained a Tacotron-WaveNet system on our 90-hour corpus, and then the Mandarin and Shanghainese corpora were used for fine-tuning based on the pre-trained models. The mel spectrograms used for WaveNet fine-tuning were generated from Tacotron in ground-truth alignment (GTA) mode. One thing to mention is that our training corpus does not contain any Shanghainese data, but the experiments proved that the above training scheme was still effective, which means the Mandarin training data was also helpful for Shanghainese training. In the synthesis phase, Mandarin test sentences were first sent to the front-end model to get pinyin sequences and prosodic boundaries; as for Shanghainese, the provided phonetic transcriptions were used directly as the input of Tacotron.
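As an aside on the 10-bit mu-law quantization this team mentions for its WaveNet vocoder, here is a small self-contained sketch of mu-law companding and expanding; it is a generic illustration, not the team's implementation.

```python
import numpy as np

def mulaw_encode(x: np.ndarray, bits: int = 10) -> np.ndarray:
    """Compand float samples in [-1, 1] and quantize to integers in [0, 2**bits - 1]."""
    mu = 2 ** bits - 1
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)   # compand to [-1, 1]
    return ((y + 1) / 2 * mu + 0.5).astype(np.int64)           # quantize

def mulaw_decode(q: np.ndarray, bits: int = 10) -> np.ndarray:
    """Invert the quantization and expand back to float samples."""
    mu = 2 ** bits - 1
    y = 2 * q.astype(np.float64) / mu - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

x = np.linspace(-1, 1, 5)
print(mulaw_encode(x), mulaw_decode(mulaw_encode(x)).round(2))
```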
Global variance conversion was performed on the generated mel spectrograms to adjust their dynamic range. Now, after my introduction, let's take a look at the results of the system evaluation. Systems were evaluated on four dimensions, and the following is mainly about our results in each test. In the naturalness test, the MOS of our Mandarin system is 4.2. Now let's listen to the synthesized speech. [A synthesized Mandarin paragraph sample was played.] The MOS of our Shanghainese system is 4.0, and here is its synthesized speech. [A synthesized Shanghainese sample was played.]

The next part evaluates the similarity between the synthesized speech and the reference recording. In the Mandarin task our system obtained a mean opinion similarity score of 4.2, and for the Shanghainese task our system scored 3.6. The intelligibility test for Mandarin requires listeners to transcribe what they hear; the pinyin error rate with tones of our system is 9%. Moreover, our system solves most of the erhua problems and only fails on a few sentences. The intelligibility test for Shanghainese uses a scoring mechanism similar to naturalness and similarity, and our MOS there is 4.1. The paragraph test evaluated the synthesized speech from multiple aspects: listeners were asked to listen to a whole news paragraph before choosing a score from 1 to 60 for each aspect, and the mean opinion scores of our system are listed in the table.

The evaluation results have proved the effectiveness of our systems. At the same time, our systems are not perfect and there are still some problems. We use CRF models to solve most of the front-end problems for simplicity, but their prediction accuracy is limited and errors accumulate between modules. The structure of Tacotron leads to pronunciation problems such as word repetition and skipping. In addition, we use an autoregressive WaveNet model as the vocoder, but its inference speed is really slow. We have tried many ways to improve on the above problems, and our new solution has solved the instability problem and runs much faster than real time; in fact, the engine has been released for commercial use. Okay, that's all about our systems in this challenge. Thanks for your patience and listening.

Next, I am going to introduce the OPPO text-to-speech system this year. First, let me play a few audio clips from our submission; these audios are generated from different types of text in the competition. [Several synthesized samples were played.] Okay, next let me explain the overall architecture and the front-end pipeline. Like the other systems, our system consists of two phases, the training phase and the testing phase. In the training phase, we extracted acoustic and linguistic features from the prepared dataset.
We then use those features to train our front-end model, acoustic model, and vocoder in sequence. In the testing phase, we use the trained front-end and acoustic model to generate features in sequence, and finally the audio is generated by the vocoder. That's our overall architecture, and I'm going to go through the front-end module, the acoustic module, and the vocoder in turn.

The goal of our front-end system is to generate accurate phonetic and prosodic labels from the input text. These pictures show the overview of our front-end pipeline. First, all the text goes through a text normalization module which transforms the text into a standard written form. Next, we hand the normalized text over to BERT for character-level feature extraction, which extracts features containing richer context information from the character sequence; in our experiments, this brought a significant improvement compared with traditional character embeddings. Then we use the extracted character features to do word segmentation and part-of-speech tagging, and further extract word-level features and POS features from the results. Finally, we use all those features to train our prosody model and G2P model and get a prosodic label sequence and a phone label sequence.

Now I'm going to talk about how we use the features extracted by BERT to make our front-end more accurate. There are currently two basic modeling units: one is based on characters and the other is based on words. We found that the word-based approach is more precise, because words naturally carry certain prosodic boundary information, but it is easily affected by the results of word segmentation: if the word segmentation system makes a mistake, the prosody will sound bad. So we use the character-based method, which is simple and flexible and is less affected by the result of word segmentation, and we append the word information to the corresponding characters at the same time. In this way we retain the word information while alleviating the problem of being affected by the word segmentation result. So the question becomes: how do we combine word features and character features? We will introduce the core idea of our method through a simple example. Suppose we have a two-character word that means "rice" in English; you can also view it as a character sequence consisting of two characters. First, we can use BERT to extract features from the character sequence; in this case, since we have two characters, we get a character-level feature with two time steps. Then, for the word, we repeat its embedding twice to get a sequence of two word vectors, which guarantees that both characters receive the word information. Of course, if the word contains three characters, you need to repeat it three times to get a sequence of three word vectors. Since there is no word-based pre-trained BERT, we use a traditional pre-trained word embedding instead. Finally, we concatenate the word features, from the word embedding, and the character features, from BERT, to form a new feature that is the input to the prosody model and the G2P model, as in the sketch below.
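A toy PyTorch sketch of that combination (not OPPO's code; the feature sizes are assumptions):

```python
import torch

char_feats = torch.randn(2, 768)        # e.g. BERT outputs for the 2 characters of one word
word_emb   = torch.randn(1, 300)        # pre-trained embedding of that word

word_repeated = word_emb.repeat(char_feats.size(0), 1)       # (2, 300): one copy per character
combined = torch.cat([char_feats, word_repeated], dim=-1)    # (2, 1068): input to prosody / G2P models
print(combined.shape)
```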
Now let's look at our prosody model. We adopted a hierarchical structure, mainly using the prosodic word level and the prosodic phrase level, and we represent the prosodic label sequence as two-level boundary labels. Therefore, we need two different levels of feature extractors, which we denote simply as N-PW and N-PPH. As you can see from the diagram, the input is the concatenated embedding mentioned on the last slide. We feed the input into the two levels of neural network feature extractors, get the respective boundary results, and then combine the results into the final prosodic labels. One thing that needs to be said here: the output of N-PW is concatenated with the original input again at the input of N-PPH, so that the hierarchical information can be passed on. The extractor can be any sequence feature extractor, such as a Transformer or an LSTM; we use Transformers. That's all of our front-end related work.

We use Tacotron 2 to predict the mel spectrogram. For Mandarin, we built a multi-speaker model with five Mandarin Chinese datasets and one English dataset. In order to preserve the characteristics of each speaker, we add a 16-dimension speaker embedding table and concatenate the speaker embedding with the original outputs of the encoder to form the new inputs to the decoder. Then we use the converged multi-speaker model as the initial model and fine-tune it on the competition dataset. To improve robustness on long sentences, GMM attention with eight mixtures was used; at the same time, we use the guided attention loss for faster convergence of the alignment. In the end, we adopt WaveRNN as the vocoder, and a 25-hour Mandarin corpus was used to train the base model. Then we fine-tuned the model with the provided data, and finally we fine-tuned it with GTA (ground-truth aligned) features. The WaveRNN structure was similar to the original paper but with several improvements: instead of a single GRU with a dual-softmax layer predicting the fine part and the coarse part, two GRU layers and three dense layers are used. Thank you, that's our system. Thank you for listening.

Hello everyone. My name is Zou Wangzhang, and it's my great honor to give a presentation about the Tencent system for the Blizzard Challenge 2020. The Blizzard Challenge 2020 has two tasks this year. The first task is MH1; it contains nine hours of Mandarin Chinese speech data from a professional male speaker and provides text transcriptions and audio files. The second task is SS1; it contains three hours of speech data from a native Shanghainese speaker, and we are provided with text transcriptions, phonemes, and audio files. Here is an overview of our TTS system. It has three parts. The first part is front-end analysis: we have a very robust front-end supporting both Chinese and English. The second part is acoustic modeling: we have two systems for the Blizzard Challenge. The first one is the AdaDurIAN system, a variant of the DurIAN system proposed by Tencent AI Lab, for the MH1 task. The second is a GMMv2b-attention-based Tacotron 2 for the SS1 task. The last part is the neural vocoder: we adopted FeatherWave, which was proposed by our team and is a variant of WaveRNN. The right figure is the overview of our TTS system.

First, let me introduce our front-end analysis. We support customized labels and SSML. Besides, we have text normalization, which is both rule-based and model-based; the model is a bidirectional LSTM with an MLP. We also have English segmentation and Chinese segmentation: the English segmentation is based on a trigram model, and the Chinese segmentation is based on a bidirectional LSTM and CRF. We also have pinyin detection and letter detection, which decide how to pronounce pinyin or letters when they are encountered. Besides, we have prosodic structure prediction, based on bidirectional representations with an LSTM and CRF. We also have polyphone disambiguation for Chinese, based on a BERT encoder and an MLP; we have just one model for all Chinese polyphonic words. Besides, we also handle tone sandhi, and we have an English G2P model based on word-level sequence-to-sequence modeling. The right figure is the framework of our front-end analysis.
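To illustrate the kind of BERT-plus-MLP polyphone disambiguation described here, a rough sketch follows; the example character, label set, and classifier sizes are hypothetical, and the classifier is untrained.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

text = "我们都长大了"                                   # "长" is polyphonic: zhang3 here, chang2 elsewhere
pron_labels = ["zhang3", "chang2"]                      # hypothetical label set for 长
classifier = torch.nn.Sequential(torch.nn.Linear(768, 256), torch.nn.ReLU(),
                                 torch.nn.Linear(256, len(pron_labels)))

inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    hidden = bert(**inputs).last_hidden_state           # (1, seq_len, 768), includes [CLS]/[SEP]
char_index = text.index("长") + 1                       # +1 to skip the [CLS] token
logits = classifier(hidden[0, char_index])              # untrained here; trained on labeled data in practice
print(pron_labels[int(logits.argmax())])
```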
The next part is acoustic modeling. The first model is AdaDurIAN, for the MH1 task. We have phonemes, tone, stress, and prosodic symbols as input, and we also need HMM-based force-aligned durations to give guidance to the acoustic model. We have a skip encoder for robust pronunciation, and we also apply speaker identity, style codes, and a language code for more robust, controllable, and expressive TTS. The decoder contains two residual LSTMs, and we also have a time-delayed LSTM post-net. The next system is the GMMv2b-based Tacotron 2, for the SS1 task. It has fast alignment and very good stability; it doesn't need any force-aligned durations or prosodic annotations and can learn end to end. The right figure is our AdaDurIAN system, and the following equation is from the GMMv2b paper.

The last part is the neural vocoder. As claimed in the paper, a WaveRNN with hidden size 2048 shows no significant difference from WaveNet in most tests, so a bigger WaveRNN with GRU hidden size 2048 was used by us to obtain high-quality generated speech. The WaveRNN model runs at a 48 kHz sampling rate, as offered in task MH1, for high-fidelity speech synthesis, conditioned on conventional 24 kHz mel spectrograms produced by the acoustic model. For the SS1 task, a 16 kHz WaveRNN was adopted, as the corpus is sampled at 16 kHz. The FeatherWave neural vocoder we adopted is another variant of WaveRNN: we used 12-bit mu-law quantization and four subbands to efficiently model discretized waveforms and obtain faster generation. The conditioning network consists of 1x3 convolution layers with channel size 512. Before being passed into the sample network, the extracted local features are repeated to match the sampling rate of the target waveforms. Finally, as the acoustic features are extracted with hop size 240 from 24 kHz audio in task MH1, the local features are repeated 480 times to generate 48 kHz waveforms. If you want to know more details, you can refer to the paper.
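For the conditioning detail just mentioned: with hop size 240 at 24 kHz, each mel frame spans 10 ms, which is 480 samples at the 48 kHz output rate, hence the repetition factor of 480. A tiny sketch of this frame-to-sample upsampling (illustrative only, not FeatherWave's code):

```python
import numpy as np

mel = np.random.randn(100, 80)              # 100 frames, 80 mel bins (1 second at a 10 ms hop)
upsampled = np.repeat(mel, 480, axis=0)     # frame-rate -> sample-rate conditioning
print(upsampled.shape)                      # (48000, 80): one feature vector per 48 kHz output sample
```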
Here are some samples of our system. [Two synthesized Mandarin samples were played.] Now let me introduce the evaluation results. The first evaluation is naturalness: as you can see, several systems perform better than the others, and it shows that our AdaDurIAN system has superiority over most other systems. The next is the similarity test; in this section each listener scored the same set of audio against two fixed reference samples, and our system achieves the third highest score and shows significant advantages over many other participants. The third test is the pinyin error rate with tones test; several systems have similar performance here, and our system has the lowest error rate, thanks to our AdaDurIAN system. The second part is the SS1 task. The first test is naturalness: the top three systems perform similarly, which shows that our Tacotron-based speech synthesis system has superiority over most other systems. The next one is speaker similarity; obviously, system E performs best, and our system achieves the second highest score and still performs better than many other systems. The last one is intelligibility: in this test several systems perform much better than the other participants, and our system still has the third highest intelligibility score.

Let me draw the conclusion. We have three parts: front-end analysis, acoustic model, and neural vocoder. We have very robust text normalization, very high accuracy of polyphone disambiguation, and very natural prosodic structure prediction. As for the acoustic model, we have AdaDurIAN and the GMMv2b-based Tacotron. We have a skip encoder for robust neural TTS, tone embedding for more accurate pronunciation, style codes for more controllable and expressive TTS, and a pre-trained multi-speaker base model for stability; this is for the MH1 task. Finally, the GMMv2b-based attention gives stable and fast end-to-end learning. The neural vocoder FeatherWave is based on 12-bit mu-law for better quality, and on linear prediction and multi-band processing for faster inference speed. That's all. Thank you.

Hi, everyone. My name is Zhixing. On behalf of our team, I'm here to present our TTS system built for the 2020 Blizzard Challenge. Here I will briefly introduce the techniques we applied in our entry. We rely a lot on neural network models to meet the goal of the challenge this year. The synthesis pipeline is shown in (a) of the figure on the left. Prosody annotations are added into the text sequence by the prosody prediction module, which is designed to predict appropriate pauses within speech. Then the sequence is converted to a phoneme sequence and passed into the spectrogram prediction module as input. After we have the synthesized spectrogram, we feed it as input to the waveform generation module to convert it back into audio signals. Three independent sequence-to-sequence networks are trained in this framework: the prosody prediction module, the Tacotron-based text-to-spectrogram model, and the neural vocoder WaveRNN, respectively. The Tacotron-based model is shown in (b) of the figure on the left, and the prosody prediction model is shown in the right figure. The prosody prediction model adopts three individual prosody networks for different prosody annotations in a hierarchical structure to tag the prosody labels for the input sequence. For example, this is a sentence to which the prosody labels have been added, where this symbol means a prosodic word annotation. Then it is converted into a pinyin sequence and then the phoneme sequence; the final phoneme sequence is the input of our Tacotron-based model. The synthesis results of our system are not as good as the other teams' this year: we rank 14 out of 16, we achieve a MOS of 3.2 on naturalness and another 3.2 on speaker similarity, the pinyin error rate for intelligibility is 17.8%, and the overall impression score of our system is 34. Let's take a listen to some samples: the first one is for naturalness and similarity evaluation, and the second one is for the intelligibility test. [Two samples were played.] Here are the main references. Thanks for listening, I hope you enjoyed it, and please feel free to ask if you have any questions.

Hello, everyone. Here is the team from South China University of Technology. Now we present our text-to-speech system for the Blizzard Challenge 2020 hub task. As shown in the figure, our system used the Tacotron 2 architecture to generate frame-level mel spectrograms, and then we used the WaveRNN vocoder to restore the waveforms.
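For reference, frame-level mel spectrograms of the kind consumed by Tacotron 2 and WaveRNN are typically extracted along these lines; the parameters below are common defaults, not necessarily this team's configuration, and the input here is a synthetic tone rather than a real recording.

```python
import librosa
import numpy as np

y = librosa.tone(220, sr=24000, duration=1.0)                      # synthetic 1 s signal as a stand-in
mel = librosa.feature.melspectrogram(y=y, sr=24000, n_fft=1024,
                                     hop_length=256, n_mels=80)
log_mel = np.log(np.clip(mel, 1e-5, None))                         # log compression
print(log_mel.shape)                                               # (80, n_frames)
```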
Our front-end processing module contains a grapheme-to-phoneme model and a BERT-based bidirectional LSTM-CRF prosodic boundary prediction model. Since the provided training dataset contains both Chinese sentences and some English words, we modified the grapheme-to-phoneme model to support mixed Chinese-English text preprocessing, and we introduced a two-stage training method to obtain a bilingual system. In the first stage, we used one English dataset and one external Mandarin dataset to fully train the Tacotron model. Then, in the second stage, we fixed the parameters of the encoder and employed only the provided dataset to fine-tune the other parts of the Tacotron. Unfortunately, our system performed poorly: our naturalness MOS is only 2.6, and our pinyin error rate with tones is 17.1%. The voice similarity was also hurt by the unclear output speech. We attribute the problems of our system to our two-stage training method; for example, it ignored the differences between the datasets used in the two stages. This reflects a lack of consideration in our system design, but I believe we will do better next time. That is all of our presentation. Thanks for listening.

Hello, everyone. I'm Xue Haozhou from the National University of Singapore. Today I will present our paper for the Blizzard Challenge 2020; the paper title is "The NUS-HLT System for Blizzard Challenge 2020". My presentation will start with an introduction to TTS, and then I will introduce our system architecture. What is text-to-speech? A text-to-speech system is able to generate natural-sounding speech from the corresponding transcription. There are different techniques to build a TTS system: the traditional approaches include parametric synthesis and unit selection, and recently, with the development of sequence-to-sequence networks, the end-to-end approach has become a popular way to build a TTS system, with models such as Tacotron, Transformer TTS, and FastSpeech.

This is our system architecture. Our system was built based on Tacotron 2, a sequence-to-sequence, RNN-based network; however, we made some modifications to it. Our system input is phones. The phone identity is passed through an embedding layer to obtain the phone embedding, and similarly the tone identity is passed through an embedding layer to obtain the tone embedding. We concatenate the phone embedding and the tone embedding to form the new input to the encoder. A skip connection was implemented in our encoder. Additionally, we extracted word embeddings from a language model and combined them with the encoder output to serve as the input to the attention-based decoder, which predicts the mel spectrogram from the encoder output. For the Mandarin task, MH1, we used WaveRNN as the neural vocoder, because in this task we had a lot of samples to generate and WaveRNN is able to generate speech fast. For the Shanghainese task, SS1, we used WaveNet as the neural vocoder. So, this is my presentation. Thank you for your time; if you have any questions, we can discuss further. Thanks again.

Hello everyone. This is the Sogou system for the Blizzard Challenge 2020. We come from Sogou in Beijing, China, and my name is Liu Kai. This is a flowchart of our system. It mainly contains an end-to-end FastSpeech-based acoustic model and a multi-band WaveRNN-based vocoder. To be clear in advance, we used some extra text and speech data for front-end and back-end model training.
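The defining component of a FastSpeech-style acoustic model like the one just mentioned is the length regulator, which expands each phoneme encoding by a predicted duration instead of relying on attention. A minimal PyTorch sketch, with made-up durations:

```python
import torch

hidden = torch.randn(5, 256)                         # 5 phoneme encodings
durations = torch.tensor([3, 7, 4, 6, 5])            # predicted number of frames per phoneme
frames = torch.repeat_interleave(hidden, durations, dim=0)
print(frames.shape)                                  # (25, 256): frame-level input to the mel decoder
```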
Now I will introduce the details of the above system. These are the features of our system. In the text analysis section, we used rule-based and maximum entropy models for text normalization, a BiLSTM model for word segmentation, a BERT-based model for polyphone prediction, and a BERT-LSTM model for prosodic boundary prediction. We adopted FastSpeech as our acoustic model and made some improvements. Firstly, we used a sentence-level VAE to model the channel differences in the training data, while a finer-grained VAE is used to model the local prosodic information of sentences. In addition, we also used a multi-band decoder and a GAN to improve the local modeling accuracy of the mel spectrograms. These are the predicted mel spectrograms for each system version; it's clear that system D got the clearest results. We adopted WaveRNN as our base vocoder structure, with 16-bit samples at 32 kHz, a mixture-of-logistics output, and PQMF subbands, and we trained another Tacotron-like acoustic model for teacher-forcing training. Eventually, the speech quality improved greatly. Here are some demos. [Samples from system A and system D were played.] The quality from system A to system D greatly improved, and finally we submitted system D. Now let's give a summary: we used an improved FastSpeech-based acoustic model followed by a multi-band WaveRNN neural vocoder at 32 kHz, and it can generate very high-quality speech stably. In the future, we will make more attempts at waveform modeling at higher sample rates and at cross-lingual learning. Thanks.

It's my honor to introduce the Royal Flush TTS system for the Blizzard Challenge 2020. First, I will briefly introduce the tasks: there are two voices that need to be built from the released Mandarin and Shanghainese data, and our system is based on an end-to-end Tacotron 2 acoustic model followed by the LPCNet vocoder. This is the overall architecture of our system. In the training phase, firstly, we generate linguistic features, including phoneme sequences and prosody boundary labels, for the Mandarin text using G2P and prosody annotation tools. LPCNet features are adopted as the acoustic features, while mel spectrograms are adopted as intermediate features. In the synthesis phase, we adopt the front-end to convert Mandarin text into linguistic features; they are then fed into the system, and speech waveforms are generated. No external training data are used in our system. It is worth noting that we generate prosody boundary labels for the Mandarin data, while we directly use the provided Shanghainese phoneme sequences as the input of the acoustic model. Different from the original Tacotron 2 architecture, we first predict mel spectrograms, and then a CBHG module is used to transform them into LPCNet features; with this approach, we aim to improve the stability of the synthesized speech. We also apply GMM-based attention for long-form speech synthesis. Seventeen systems were evaluated for the hub task, including the natural speech A. As we can see, our system performs relatively well in the intelligibility test but still needs to be improved in naturalness and similarity. Nine systems were evaluated for the spoke task, and our system performs moderately among all teams. So the Royal Flush TTS system has been introduced. In future work, more effective front-end modules and higher-performance neural vocoders will be the major changes to our system. Thank you.
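Several of the systems above (OPPO, Tencent, Royal Flush) mention GMM-based attention for stable long-form synthesis. The sketch below shows one simplified decoder step of the general idea, monotonically advancing Gaussian mixture means over encoder positions, rather than any team's exact variant; the projection layer is untrained and the dimensions are assumptions.

```python
import torch
import torch.nn.functional as F

K, decoder_dim, memory_len = 8, 128, 50
to_params = torch.nn.Linear(decoder_dim, 3 * K)        # trained jointly with the decoder in practice

def gmm_attention_step(query, prev_mu):
    w, delta, sigma = to_params(query).chunk(3)
    w = torch.softmax(w, dim=0)                        # mixture weights
    mu = prev_mu + F.softplus(delta)                   # means can only move forward (monotonic)
    sigma = F.softplus(sigma) + 1e-4                   # positive widths
    j = torch.arange(memory_len).float().unsqueeze(1)  # (T, 1) encoder positions
    align = (w * torch.exp(-0.5 * ((j - mu) / sigma) ** 2)).sum(dim=1)
    return align / (align.sum() + 1e-8), mu            # (T,) attention weights, updated means

align, mu = gmm_attention_step(torch.randn(decoder_dim), torch.zeros(K))
print(align.shape, mu[:3])
```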
Hi, everyone. I'm so glad to be here to join the workshop and present our synthesis system for the Blizzard Challenge 2020 today. My name is Wendy He, from the text-to-speech group of the Himalaya company. Our presentation consists of three parts: the main architecture of our system, the overall results, and the conclusion and our future work. Firstly, in the graph you can see the front-end module followed by an acoustic model; finally, the mel spectrograms are transformed into the waveform by the vocoder. Let's focus on the front-end module. In order to improve rhythm and pronunciation, we use a polyphone disambiguation model and a prosodic prediction model to correct the pinyin and generate prosodic breaks for the Mandarin dataset. As shown in the figure, features of the data are extracted and then concatenated to be the input of the BiLSTM-based prosodic prediction network. After the output with prosodic labels is generated, a BiLSTM- and CNN-based sequence-to-sequence model we proposed, combined with a rule-based G2P toolkit, performs polyphone disambiguation to generate the pinyin sequence with higher accuracy. As for the acoustic model, Tacotron 2 is again our choice this time, and finally we get the waveform reconstructed by the WaveRNN vocoder. It is worth mentioning that a 512-dimension, randomly initialized, trainable speaker embedding is employed when fine-tuning on the Blizzard Mandarin dataset from a pre-trained Tacotron 2 model. As for the evaluation results, our Himalaya TTS system performs better and achieves higher quality in Mandarin synthesis than in Shanghainese synthesis. In conclusion, the final evaluation results on both tasks indicate that our system has a middle performance, slightly above the average, and we have much room for improvement in some aspects; we hope to achieve better performance in the future. That's all I want to say today. Thanks for listening.

Hello, everyone. It's my pleasure to share our project with you. We come from the Harbin Institute of Technology, Shenzhen. My name is Fu Huhao, and my team members are Zhang Yibeng, Liu Kailong, and Liu Chao. Next, we will introduce our TTS model architecture in the following three parts. The first part is the front-end. First, we normalize the original text into a more standard form and obtain the pinyin and linguistic features with open-source tools such as jieba and pypinyin. Then we use a BERT-based model to predict the prosodic boundary breaks, which helps to produce more natural speech. The second part is the sequence-to-sequence acoustic model. First, we replace the encoder with a more powerful Transformer network and introduce a local RNN to model the local information of the input text. In order to enhance the robustness of the model, we also introduce forward attention to replace the location-sensitive attention in Tacotron 2, and we add an SSIM loss to model training, which can make the spectrogram sharper. The third part is the vocoder. Considering synthesis speed, we adopted a GAN-based model, Parallel WaveGAN; it is much faster than other vocoders like WaveNet, WaveRNN, and LPCNet, but the sound quality is a little poorer. Finally, let's look at the results. Our system letter is G. The results are not very satisfying, but our system has a lot of room for improvement, both in the front-end and in the vocoder. That's all for our report. Thank you for listening.
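As an illustration of the open-source front-end tools this team mentions (jieba for word segmentation and pypinyin for grapheme-to-pinyin), a minimal sketch; the exact options the team used are not stated, so these are defaults.

```python
import jieba
from pypinyin import lazy_pinyin, Style

text = "但是收取的费用最高不得超过百分之六十"
words = jieba.lcut(text)                           # word segmentation
pinyin = lazy_pinyin(text, style=Style.TONE3)      # pinyin with tone numbers
print(words)
print(pinyin)                                      # e.g. ['dan4', 'shi4', 'shou1', 'qu3', ...]
```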
Hello, everyone. My name is Long Tong, and it is my great honor to be here to give an introduction to our speech synthesis system for the Blizzard Challenge 2020. I am a doctoral candidate from the Institute of Automation, Chinese Academy of Sciences, and my research interests include speech synthesis and emotional speech. This is the main content of my report. I will describe the CASIA speech synthesis system: about 9.5 hours of speech data from one native Mandarin speaker and 3 hours of speech from one native Shanghainese speaker were adopted as the training data for system construction this year. Our CASIA system is built based on a multi-speaker end-to-end model, and an LPCNet-based neural vocoder is adopted to improve the quality. The whole network consists of two components, as shown in this figure; it is a typical multi-speaker end-to-end speech synthesis model which includes Tacotron and LPCNet. The provided speech database is about 9.5 hours of speech from one native Mandarin Chinese speaker and contains several different speaking styles. We built a multi-speaker speech synthesis system by adding external data; the composition of the external Mandarin database is shown in the above table. In the Shanghainese task, we built a front-end system suitable for Shanghainese based on the phoneme scheme shown in this table. The phoneme labeling constructed by this method is more in line with the actual pronunciation rules of Shanghainese. The naturalness and similarity were evaluated, and the results are shown in the above figure, where our system is identified. From all these evaluation results, our system only reaches an average level, so much work remains to be done, especially on improving the quality of the synthesized speech. Thanks for listening.

Hello everyone. My name is Hu Beibei, and I'm going to present our system, the Ajmide text-to-speech system for the Blizzard Challenge 2020. Here is the overview: first I will briefly introduce the MH1 task, then I will describe our TTS system, and finally I will show the evaluation results. The MH1 task is to build a voice from a Mandarin Chinese dataset which contains about 9.5 hours of speech data, and we built our system for this task. Here is our system architecture. There are three components in the system: a BERT-based text front-end, a multi-speaker Tacotron 2 model, and a WaveRNN vocoder. In the training phase, we trained a BERT-based G2P model on an internal polyphone dataset which covers 145 Chinese polyphonic characters. The Tacotron 2 model was trained on an internal multi-speaker dataset and fine-tuned on the MH1 dataset, and GTA (ground-truth alignment) was adopted to train the WaveRNN model. Our system ID is P. The four figures on the right show the scores of our system. We achieved a good score on the pinyin-plus-tone error rate, which we believe benefits from the BERT-based G2P model; besides that, the tone sandhi processing is also important. In the other test sections we achieved middle scores. We will make more attempts to improve the system performance. Thanks for listening.
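Several teams above single out tone sandhi handling. As a toy illustration of the best-known rule, third-tone sandhi, here is a sketch over tone-numbered pinyin; real systems condition such rules on prosodic word structure rather than applying them blindly left to right.

```python
def third_tone_sandhi(syllables):
    """Toy rule: a tone-3 syllable followed by another tone-3 syllable surfaces as tone 2."""
    out = list(syllables)
    for i in range(len(out) - 1):
        if out[i].endswith("3") and out[i + 1].endswith("3"):
            out[i] = out[i][:-1] + "2"
    return out

print(third_tone_sandhi(["ni3", "hao3"]))            # ['ni2', 'hao3']
print(third_tone_sandhi(["wo3", "hen3", "hao3"]))    # naive left-to-right result; real rules need prosody
```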