Okay, hello everyone. My name is Wen-Chin Huang from Nagoya University. On behalf of the organizing team, we would like to present to you the results of the Voice Conversion Challenge 2020: intra-lingual semi-parallel and cross-lingual voice conversion. First, I will briefly introduce the tasks, the database, and the timeline for this year's challenge. Then Xiaohai Tian from National University of Singapore will give a breakdown of the participants and the submitted systems. And then finally, Yi Zhao from National Institute of Informatics, Japan will talk about the subjective evaluations and the analysis of this year's challenge, and she will give a conclusion.

All right, so as we mentioned at the very beginning of this workshop, VCC is a biennial scientific event that tries to compare different VC technologies on a common database. This year, we introduced two tasks. The first task is intra-lingual semi-parallel voice conversion. When we say intra-lingual, we mean that the source and the target languages are the same, and this year we set it to English. When we say semi-parallel, we mean that among the source and target training sets, a subset is parallel, meaning that the contents are the same, while the rest are non-parallel, meaning that they have different contents. Although methods for tackling non-parallel VC data can be directly applied, we encouraged participants to utilize this small subset of parallel data to enhance their results.

The second task is cross-lingual voice conversion. Here the source language is still English, but we have Mandarin, German, and Finnish as the target languages. During training, we provide training sentences in these languages from the target speakers. During conversion, given an input English sentence, we want the VC model to convert it as if this English sentence were spoken by these target speakers. What is challenging about this task is that the VC model has never seen these target speakers speaking English before. So the VC model is asked to somehow model, for example, the pronunciation or the speaker characteristics of the target speakers so that they can correctly speak English in a native way.

All right, so the database was based on the EMIME dataset, which contains speech from native English speakers and bilingual speakers. First, we chose four English speakers as the source speakers, two male and two female. Then we chose another four English speakers as the target speakers for task one. The criterion for choosing these eight speakers was that we manually selected them to be as perceptually discriminable as possible, so that, for example, during the listening test, the participants can more easily discriminate between the converted audio and the target audio. For task two, we chose two German, two Finnish, and two Mandarin speakers as the target speakers, one male and one female per language. The criterion here was fluency, because the English speech of the bilingual speakers would be used as the reference during the listening test, and we did not want, for example, the fluency or the accent to affect the judgments of the listeners. So we used these fluency scores, which are from the original EMIME paper, and selected the bilingual speakers who were the most fluent in English. Okay, so this is the timeline.
So basically we gave the participants about two months to build their systems based on the training data. After we released the evaluation data, we gave them about one week to convert all the audio. We then spent about two months on the listening test, which was pretty large-scale. Finally, we released the results and gave the participants about another month to write their papers. All right. Next, I will ask Xiaohai to take over the presentation.

Okay. Thank you, Wen-Chin. Hello, everyone. My name is Xiaohai Tian from National University of Singapore. Next, I will give a brief summary of the participants and the submitted systems of Voice Conversion Challenge 2020. Next page, please. Hello, Wen-Chin. Next page. Thank you. In total, 33 teams submitted their systems, including three baselines and 30 participants. As mentioned, this year's challenge has two tasks, intra-lingual and cross-lingual voice conversion. 31 teams participated in task one and 29 teams in task two, while 26 teams submitted their systems for both. Next page, please. In general, a voice conversion system mainly consists of two modules, the feature conversion model and the vocoder, so I will summarize the systems according to these two aspects. This table shows the summary of the feature conversion models used in the submitted systems. In total, there are roughly seven types of models belonging to two categories: two types belong to parallel-data voice conversion, while the other five belong to non-parallel-data solutions. Next page, please. From the chart above, we can see that for both tasks, most teams chose a non-parallel-data solution. For task one, 26 teams chose a non-parallel-data solution and three teams chose a parallel-data solution. For task two, the numbers were 24 versus two. Note that two teams did not submit a system description, so we mark them as unknown. Detailed information on the non-parallel-data solutions is shown in the chart below. PPG-based voice conversion and autoencoder-based voice conversion are the two chosen by most of the teams, followed by ASR plus TTS, systems leveraging TTS for voice conversion, and GAN-based models. In task one, PPG-based voice conversion and autoencoder-based voice conversion were each chosen by eight teams, while in task two, 11 teams used PPG-based voice conversion and eight teams used autoencoder-based voice conversion. Next page, please. Next, we will give a summary of the systems from the vocoder perspective. In total, there are 10 types of vocoders in two categories: three are traditional vocoders, and the rest are neural vocoders. For the neural vocoders specifically, three of them are autoregressive and four are non-autoregressive. Next page. From the chart above, we can see that neural vocoders were chosen by most of the teams. For task one, 11 teams chose an autoregressive neural vocoder, 14 teams chose a non-autoregressive neural vocoder, and four teams chose a traditional vocoder. For task two, 11 teams chose an autoregressive neural vocoder, 10 teams chose a non-autoregressive neural vocoder, and five teams chose traditional vocoders. Detailed information on the neural vocoders is shown in the chart below. WaveNet and Parallel WaveGAN are the two chosen by most of the teams, followed by WaveRNN, LPCNet, WaveGlow, MelGAN, and NSF. In task one, nine teams chose Parallel WaveGAN and five teams chose WaveNet. In task two, seven teams chose Parallel WaveGAN and five teams chose WaveNet. So that's all for my part.
Thank you. Next, Yi will take over the presentation.

Hi everyone. I am Yi Zhao from National Institute of Informatics in Japan. I will introduce the subjective evaluations and analysis of this challenge's results. Next page, yeah. For the listening test, we chose a crowdsourced test format to evaluate both the naturalness and the speaker similarity. To evaluate the naturalness, we used a five-point-scale MOS score. To evaluate the similarity, we used a four-point-scale score. In particular, in the second, cross-lingual task, in addition to reference speech in English, we used reference speech in the L2 language for the subjects to judge the speaker similarity across languages. We hired both English and Japanese listeners to finish two independent experiments. Next page, please. This is the naturalness result for task one. We can see that four systems performed better than the last VCC's top system, Team 11. Most of the best-performing systems used PPG entirely or partially. Team 10 and Team 13 obtained the highest MOS values among all systems, and their improvements compared with Team 11 were statistically significant. However, Team 10 and Team 13's MOS scores were still lower than that of natural speech, and the differences were statistically significant. Next page. This is the similarity result of task one. From the figure, we can see that eight systems obtained the highest speaker similarity scores among the submitted systems. There were no significant differences among the eight best-performing systems, and their improvements compared with last year's top system, Team 11, were statistically significant. Also, the similarity scores of the eight systems were not significantly different from the natural speech of the target speaker. This means the best-performing systems achieved human-level similarity, so the basic intra-lingual VC task has been solved in the sense of speaker similarity. We think that this is a historical achievement for VC research. Next page. This figure shows a scatter plot matching naturalness and the percentage of similarity to the target speaker for task one. Let's listen to some samples. In reality, the European Parliament is practicing delay tactics. In reality, the European Parliament is practicing delay tactics. In reality, the European Parliament is practicing delay tactics. In reality, the European Parliament is practicing delay tactics. In reality, the European Parliament is practicing delay tactics. In reality, the European Parliament is practicing delay tactics. In reality, the European Parliament is practicing delay tactics. This is the naturalness plot for the second task. We can see that most of the best-performing systems used PPG, like Teams 10, 13, and 25. We can also see that the systems using a combination of ASR and TTS, like Teams 27, 33, and 22, obtained good scores. Moreover, it is interesting to observe that Team 10 achieved the highest score and was not significantly different from the natural speech of the target speaker. However, it suffered a drop in naturalness of 0.75 points in MOS compared with the intra-lingual task. In fact, most teams who joined both tasks obtained a lower score in task two. This clearly indicates an increase in the complexity of the cross-lingual task. Next. This is the similarity result for task two. All of the VC systems had much lower similarity scores than natural speech, and the differences are statistically significant, which means cross-lingual VC still has room for improvement.
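As a side note on how such significance claims can be checked: the comparisons are pairwise tests on the listeners' ratings. Below is a minimal, hedged sketch using a Mann-Whitney U test on placeholder data; the challenge analysis itself may use a different test or a multiple-comparison correction, so this is purely illustrative.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Placeholder 1-5 naturalness ratings for two systems (not real data).
ratings_sys_a = rng.integers(1, 6, size=300)
ratings_sys_b = rng.integers(1, 6, size=300)

stat, p = mannwhitneyu(ratings_sys_a, ratings_sys_b)
print(f"U = {stat:.0f}, p = {p:.4f} -> significant at 5%: {p < 0.05}")
```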
Next page. This figure shows the scatter plot matching naturalness and the percentage of similarity to the target speaker for task two. We can see that there was an obvious gap between natural speech and the VC systems. Please play the samples. In reality, the European Parliament is practicing delay tactics. In reality, the European Parliament is practicing delay tactics. In reality, the European Parliament is practicing delay tactics. In reality, the European Parliament is practicing delay tactics. In reality, the European Parliament is practicing delay tactics. In reality, the European Parliament is practicing delay tactics. We also did further analysis using the listening test data. Here, we just briefly introduce the conclusions. Firstly, we compared the English and the Japanese listeners' evaluation results. We found that although there were a few inconsistencies between the two listener groups, it seems acceptable to use non-native listeners to assess the performance of cross-lingual VC systems to some extent. Secondly, the choice of the language of the reference audio is very important. Listeners generally gave lower speaker similarity scores in the case of the L2-language reference. Thirdly, we found that most of the VC systems had the highest similarity scores for the German target speakers and the lowest similarity scores for the Mandarin speakers. This may be partially due to the linguistic distances to English, but this requires further investigation. Next page. And now let's go to the conclusion. Next page. In this VCC challenge, we can see that great progress has been achieved in techniques. In the intra-lingual conversion, the best system could achieve an average naturalness MOS score as high as 4.27, while this score was only 4.1 in the last VCC challenge. Also, over 95% of converted speech samples were judged to be the same as the target speakers, while this number was only 80% in the last VCC challenge. In the cross-lingual task, we can see that the best system could achieve a naturalness MOS score as high as 4.27, the same as the intra-lingual conversion, which is surprising. But for the similarity results, only 75% of converted speech samples were judged to be the same as the target speakers. Next page. From these results, we can see that VC methods have progressed a lot. In particular, the speaker similarity scores of several systems turned out to be as high as those of the target speakers in the intra-lingual semi-parallel VC task. We think this is a historical achievement for VC research. However, we confirmed that none of them has achieved human-level naturalness, which is a pity. The cross-lingual voice conversion task is a more difficult task, and the overall naturalness and similarity scores were lower than those of the intra-lingual conversion task. This voice conversion challenge has shown that both the intra-lingual and cross-lingual VC tasks have not been completely solved. We believe that continued efforts and investment in VC research are deserved. Thank you. This is the end of our presentation.

Hello, everyone. I'm Rohan from NUS, Singapore. We'd like to talk about our work, Predictions of Subjective Ratings and Spoofing Assessments of Voice Conversion Challenge 2020 Submissions. Next slide. In this work, we investigate two aspects. The first one is: can objective assessment metrics predict human judgments on naturalness and speaker similarity?
And the second aspect is: which voice conversion technologies pose the highest threat as spoofing attacks against automatic speaker verification (ASV) and countermeasures? Next slide. So first, why do we need objective assessment? Firstly, it is complementary to the listening test. Apart from that, it is less time-consuming and more cost-effective than a large crowdsourced listening test. For instance, for VCC 2020, we had to spend about 1 million Japanese yen on the crowdsourced test. Next slide. Next slide. So coming to ASV, it has been found to be quite vulnerable to various kinds of spoofing attacks, and voice conversion technology is one such way to attack ASV. Next slide. In this regard, research on spoofing countermeasures has gained attention over the last couple of years. Some industries have even deployed spoofing countermeasures along with their voice biometrics to enhance their robustness against such attacks. Apart from that, the ASVspoof challenge series also tries to promote research on countermeasures, especially against attacks derived from voice conversion, text-to-speech, or replay. Next. Coming to our work, we use four different kinds of objective evaluation techniques. The first one is ASV, which we use to judge the speaker similarity. We use x-vector-based embeddings with PLDA scoring, and apart from that, we also use a cosine-distance-based metric. The second objective technique is the spoofing countermeasure, which is used for the real-versus-fake assessment. We use a light-CNN-based state-of-the-art system that takes LFCC as input and is trained on the ASVspoof 2019 logical access corpus. The third tool is automatic mean opinion score prediction, namely MOSNet, which is used to judge quality. It is a CNN-LSTM-based model trained on magnitude spectra following the original work, and we use two different models in this work: one trained on the previous voice conversion challenge data, VCC 2018, and another on the latest ASVspoof series challenge data, ASVspoof 2019. Lastly, we use ASR for assessing the intelligibility of the converted speech. This is a system developed by iFlytek based on sequence-to-sequence with attention. It uses 10,000 hours of recordings for the acoustic model and GB-level text for language modeling. Next, we will observe how these objective evaluation metrics can be used for judging the rankings of each team. So next I pass over to Tomi for the next part.

Okay, thank you. So let's look at the ASV system. Here we use the Kaldi x-vector recipe to score the different voice conversion entries. Generally, we treat the ASV as a black box which receives two speech inputs, an enrollment (or training) utterance and a test utterance, and this black box gives us a speaker similarity score. In VCC 2020, we compute three kinds of scores. The first one compares different natural speakers to each other, leading to a score distribution like this. The second one is the same-speaker natural distribution, shown in green. And the third one, which involves comparing the voice conversion samples with the target speakers, is shown here. Basically, we use the overlap of distributions two and three as a measure of the voice conversion success: if these two distributions completely overlap, it means an equal error rate of 50%.
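To make the overlap idea concrete, here is a minimal sketch of how such an EER could be computed from the two score distributions. This is an illustration written for this summary, not the Kaldi recipe the authors used; the variable names are assumptions.

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    # genuine:  scores of same-speaker natural trials (distribution two)
    # impostor: scores of converted-speech vs. target trials (distribution three)
    # Sweep a threshold over all observed scores and return the point
    # where false acceptance and false rejection are (nearly) equal.
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    best_gap, eer = np.inf, 0.5
    for t in thresholds:
        far = np.mean(impostor >= t)  # attack trials accepted
        frr = np.mean(genuine < t)    # genuine trials rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer  # 0.5 means the two distributions fully overlap
```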
An EER of 50% would mean that the ASV system is completely fooled into believing these are target speaker recordings. So let's look at the results, starting with the natural data. The ASV system gives an equal error rate of 0.50% without any voice conversion. When we then add the voice conversion, you can see that the equal error rate of the top systems gets close to 50%, and the false acceptance rate of most systems is close to 100%. So the ASV system gets rather confused. We also did one more analysis, which is basically how far away the voice conversion manages to move from the original source speaker; this is the one shown in gray. We see that the ASV system in this case is also confused into declaring, most of the time, that these are different speakers. So in this sense, most of the voice conversion entries successfully de-identify the original source speaker in terms of ASV. Now we can look at task two. Actually, the findings are very similar here, with a little bit lower false acceptance rates, but generally speaking similar trends. So let's look at the next result. This is the other ASV system, where we don't compute the equal error rate; this is simply the average cosine similarity between the voice conversion and the target recording. We see that most of the time we get a similarity score close to the maximum, which is one, and the findings are generally similar to the other speaker recognition experiment. Next one. When we look at the spoofing countermeasure, we also see that its equal error rates are sometimes close to 50% and generally very high. This means that the spoofing countermeasure cannot differentiate human speech from the voice-converted speech in this case. For instance, in task one, around half of the teams have an equal error rate of more than 30%, and in the second task, there are only three teams with an equal error rate of less than 10%. So now I will switch to Wen-Chin.

All right, so next I will talk about the MOSNet predictions. Compared to the true range of the MOS scores, which is from one to five, we can see that the MOSNet predictions range from 2.5 to 4.5, which is a lot smaller than the true range, and this is the same for tasks one and two. As for automatic speech recognition, we can see that around half of the teams have a word error rate of less than 20%, which means that most teams can maintain the linguistic consistency when they do conversion. All right, so finally, let's hear some audio samples so that we can understand what kind of aspect each metric is evaluating. First, we will play the ASV one. We're playing a sample that is pretty good, which means that it can easily fool the ASV system. These changes aroused orthodox opposition and sometimes government intervention. These changes aroused orthodox opposition and sometimes government intervention. And the comparison sample. These changes aroused orthodox opposition and sometimes government intervention. Next, the countermeasure. This is also a pretty good sample, meaning that it has fewer artifacts compared to other samples. I'll play the converted sample only here. This is a trade and shipping point for an overview of large merchandise forums. Okay, next, the MOSNet prediction. Contrary to the previous two samples, here we play a sample that has a low MOSNet prediction score. Almost all students who are accepted into medical schools obtain a medical degree.
Finally, we come to the ASR. This sample has low intelligibility, with a word error rate of about 91%. Chains for the new years, the Cloudhounds Association, we think it's good. So we want to emphasize that if a sample performs well on one metric, that does not mean it will perform well on another metric, and performing well on all metrics does not mean it will be perceptually good to humans. Okay, so next, after seeing the individual results of each metric, we want to know how they correlate with the subjective results. That is, we want to ask whether these metrics can predict human judgments. First, we drew these scatter plots; if you look at the appendix of the paper, you can see the complete series of these plots. Here we list two examples: the ASR and ASV equal error rate plots against task one quality and task one similarity. By looking at these plots, you might get a sense that this one has lower correlation and this one has a higher correlation, but we want to look at the precise numbers so that we can compare the correlations between different figures. So we calculated the Pearson correlation coefficients between the metrics and the subjective scores. Okay, here we can roughly separate the metrics into two categories: the ASV scores and the cosine distance can be seen as the speaker-related metrics, while the MOSNet scores and the ASR word error rate can be seen as those related to quality. First, we see which metrics have moderate coefficients for quality. What we expect is that the MOSNet predictions and ASR scores should be closely related to quality, and this is what we observed. But we can also see that, for example, the ASV scores are quite related to the subjective quality scores for task one, and for task two, the cosine distance is quite related to quality. So we might ask why this happens. The reason we suggest is that the human judgments on quality and similarity may not be independent: when we are asked to judge the quality, we are somehow influenced by the similarity of the converted speech. Looking at similarity, we can observe a similar trend: of course the ASV scores and the cosine distance are related to similarity, but the MOSNet scores are also correlated with similarity for task one. This again suggests that the judgments on quality and similarity are not independent. Finally, after seeing these individual correlations, we want to know whether we can predict the subjective scores more correctly by combining multiple metrics. What we did was simply combine multiple metrics and use multiple linear regression with these input variables to regress to the subjective scores. The observation is that for MOS, which means the quality scores, we have two metrics that are significantly explanatory: for task one MOS, we have ASR and ASV, and for task two, we have MOSNet and ASR. For similarity, we can see that only the ASV equal error rate is significantly explanatory. What we want to say here is that we think the ASV EER by itself is sufficiently explanatory for the similarity score, because it already has a pretty high correlation.
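The kind of analysis described here (per-metric Pearson correlations, then a multiple linear regression with an adjusted R-squared) can be sketched as follows. The arrays are placeholders standing in for the per-team averages from the paper; nothing here reproduces the actual numbers.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_teams = 31
# Placeholder per-team averages (the paper uses the real VCC 2020 values).
mos     = rng.uniform(2.0, 4.5, n_teams)   # subjective quality MOS
asv_eer = rng.uniform(0.0, 50.0, n_teams)  # ASV equal error rate (%)
mosnet  = rng.uniform(2.5, 4.5, n_teams)   # MOSNet predicted MOS
wer     = rng.uniform(5.0, 40.0, n_teams)  # ASR word error rate (%)

# Individual Pearson correlations between each metric and the MOS.
for name, metric in [("ASV EER", asv_eer), ("MOSNet", mosnet), ("WER", wer)]:
    r, p = pearsonr(metric, mos)
    print(f"{name}: r = {r:+.2f} (p = {p:.3f})")

# Multiple linear regression combining the metrics, with adjusted R^2.
X = np.column_stack([asv_eer, mosnet, wer])
r2 = LinearRegression().fit(X, mos).score(X, mos)
n, k = X.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(f"adjusted R^2 = {adj_r2:.2f}")
```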
Well, overall we would say that this kind of analysis is consistent with the previous analysis: the ASR and MOSNet are responsible for quality, and the ASV is responsible for similarity. We can also see that if we compare these adjusted R-squared scores with the correlations in the individual section, the prediction accuracy for quality can be improved by combining multiple metrics. However, the task two quality is a bit lower in terms of adjusted R-squared score, and this indicates that cross-lingual voice conversion is more difficult to predict with these metrics, so we will need more improvements in these metrics to predict the subjective scores correctly. All right, I'll let Tomi take over the last part of our presentation.

Okay, so as our last analysis, we look at the so-called tandem performance assessment of the spoofing countermeasure and the ASV. Let's look at how this works. Basically, in this case, we are thinking of the spoofing countermeasure as something like a gate which tries to stop the voice conversion attacks from being passed to the ASV system. There is a recent metric, the tandem detection cost function (you can find the reference below), which can be used to evaluate the combination of these two systems. Generally speaking, there are four different errors that can happen in this kind of combined system: two of them are related to user inconvenience, and two are related to security breaches, let's say this way. There are parameters, but basically you set these parameters and you get one number that tells how well the combination of the two systems works. So we computed this metric for the different voice conversion systems, and here are, from each task, the top systems identified by the highest tandem DCF. We basically identified two different patterns. One pattern happens when the automatic speaker verification system is basically not fooled at all by the voice conversion. So the ASV system rejects the voice conversion, but it is combined with a, let's say, low-performing spoofing countermeasure from that perspective. This means that, because the attack wouldn't have any impact on the ASV, we shouldn't be trying to reject it, because then the spoofing countermeasure will lead to genuine user rejections. This is what we see here, and it is associated mostly with the classic vocoder types of systems. The other pattern we see is when both of these systems have high error rates, and this takes place with the more modern neural vocoder types of systems. Both of these patterns are important to consider in this kind of evaluation. So maybe we can move on to the conclusion. Of course there are many results, but what we could perhaps conclusively say is that, of these different techniques, automatic speaker verification and ASR, which are trained on natural data and, let's say, represent a somewhat matured technology, are relatively stable metrics with good correlations with the subjective results. The other two techniques, MOSNet and spoofing countermeasures, are somewhat more experimental; they have great potential, and actually all four are contributing to the predictions, as we saw.
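Returning to the tandem assessment for a moment: the sketch below illustrates the gate-plus-ASV idea with its four error types and cost parameters. It is a deliberately simplified cost, not the official t-DCF formula from the cited reference; all function and parameter names here are assumptions.

```python
import numpy as np

def simplified_tandem_cost(cm_bonafide, cm_spoof, cm_threshold,
                           asv_p_miss, asv_p_fa_spoof,
                           prior_spoof=0.05, c_inconvenience=1.0, c_security=10.0):
    # Gate (countermeasure) errors at the chosen threshold.
    p_miss_cm = np.mean(cm_bonafide < cm_threshold)  # genuine user blocked at the gate
    p_fa_cm = np.mean(cm_spoof >= cm_threshold)      # spoof let through the gate
    # User inconvenience: genuine users rejected by the gate, or passed
    # through and then rejected by the ASV.
    inconvenience = (1 - prior_spoof) * (p_miss_cm + (1 - p_miss_cm) * asv_p_miss)
    # Security breach: spoofs that pass the gate and then fool the ASV.
    breach = prior_spoof * p_fa_cm * asv_p_fa_spoof
    return c_inconvenience * inconvenience + c_security * breach
```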
But we also know, for instance for the spoofing countermeasures, that we have issues when generalizing these systems from one database to another, so more basic work is probably needed on these systems. As far as the spoofing threat is concerned, both the traditional vocoders and the neural vocoders remain important here. One limitation of this work, which we should address in future work, is that we looked at everything at the system level, taking all the samples as aggregated statistics, but we didn't do sentence-level predictions as in the listening test. So that's all we wanted to say. Thank you very much for your attention.

Hello everyone, this is Jing-Xuan Zhang. I'm very glad to share our work on the Voice Conversion Challenge 2020 mono-lingual task. Recognition-synthesis-based voice conversion is one of the popular methods for solving non-parallel voice conversion. It consists of a recognition model for extracting some intermediate representations; the converted speech is then synthesized from the synthesis model. There are mainly two limitations of this kind of method. First, the intermediate representations, such as PPGs, may still contain some speaker-related information, which may harm the similarity of the converted speech. Second, the PPG-based features are frame-level; therefore, it is difficult to adjust the duration flexibly. ASR and TTS techniques have developed very rapidly in recent years; therefore, it is natural to build an ASR-TTS-based voice conversion system by cascading the ASR and TTS modules. In this case, the characters are actually used as the intermediate representation. There are two benefits to this kind of method. First, the characters only contain linguistic content. Second, once the characters are obtained, it is easy to adjust the duration flexibly by using a sequence-to-sequence TTS system. We first compared the iFlytek- and the ESPnet-based ASR modules, and we selected the iFlytek-based one because it achieved better performance. For the TTS module, we selected the Transformer TTS. First, the Transformer TTS is highly parallel during training, which leads to better training efficiency. Second, the self-attention structure makes it easy for the Transformer TTS to capture long-term dependencies. Besides the speaker and linguistic information, we also assume the speech contains a prosody aspect, which needs to be modeled. Therefore, we introduce a prosody encoder to extract a prosody code. During conversion, the prosody code is used to condition the Transformer TTS to transfer the prosody from the source speech. Speaker adversarial learning is used to make sure the prosody code is speaker-independent. This is the configuration of our experiment. We compared our method with a PPG-based baseline, which uses PPG-like features as the intermediate representation and was also used in the cross-lingual task of Voice Conversion Challenge 2020. From the results, we can see that the proposed method achieved better similarity and close, slightly better, naturalness compared to the PPG baseline. Then we compared our method with and without prosody transfer. For the target speaker TEF2, the version with prosody transfer achieved slightly better results. However, on average, the methods with and without prosody transfer achieved close performance, and the difference is insignificant. At last, our method achieved the best naturalness and the best similarity in the Voice Conversion Challenge 2020. Compared to the natural target, we achieved a very close similarity score.
And for the MOS score, there is still some distance between our team and the natural target. In conclusion, we presented an ASR-TTS method for Voice Conversion Challenge 2020, and our team achieved the best naturalness and similarity. A prosody transfer technique was presented, in which we extract a prosody code used to condition the TTS module. It achieved slightly better results on the TEF2 target speaker; however, the overall difference is insignificant. We suppose the reason is that the Voice Conversion Challenge 2020 data is not very expressive; therefore, the prosody transfer technique makes no significant difference. Thank you. This is the end of this presentation.

Hello, everyone. My name is Li Zhengyu, and I come from iFlytek Research. It's my honor to share our paper, Non-parallel Voice Conversion with Autoregressive Conversion Model and Duration Adjustment. This is the outline of this presentation. In Voice Conversion Challenge 2018, the N10 system, which adopted a PPG-based conversion model with a neural vocoder, was proposed for non-parallel VC and achieved state-of-the-art performance. However, when comparing the results of N10 with the target speech, a performance gap still exists. This may be due to several modeling insufficiencies in the system, such as the limited modeling capability of its LSTM-based conversion model, the quantization noise in the adopted neural vocoder, as well as the fact that the speech rate difference between speakers is not considered in this system. So, in order to improve the conversion performance, this paper proposes three improvements. Autoregressive models have shown strong modeling ability in capturing dependencies in sequential data and have been widely used in TTS, so we propose to substitute an autoregressive model for the LSTM-based conversion model and predict the mel spectrogram instead. We use a network structure similar to that of Tacotron 2. To mitigate the speech rate difference, we propose a duration adjustment strategy. Furthermore, we propose to use the high-fidelity WaveNet introduced in ClariNet as our neural vocoder, which uses a single variance-bounded Gaussian distribution to model the 24 kHz, 16-bit waveform. Both the conversion model and the neural vocoder follow a two-stage training strategy: pre-training and fine-tuning. During pre-training, multi-speaker models are trained with speaker embeddings, and during fine-tuning, a speaker-dependent model is trained on the data of each target speaker, initialized from the pre-trained model. We propose the duration adjustment strategy to mitigate the speech rate difference. During the conversion process, the phonetic feature sequences extracted from the ASR model are linearly interpolated using estimated interpolation coefficients before being fed into the conversion model. The interpolation coefficient of each conversion pair is estimated separately and obtained in two steps. In the first step, a rough coefficient is estimated by calculating the ratio of average phone durations between the source and target speakers. Then the rough coefficient is precisely adjusted until no obvious speech rate difference is perceived between the converted speech and the target speech. To evaluate the performance of our proposed method, we use the dataset of the intra-lingual task in VCC 2020. We also evaluate the performance of applying our proposed method to cross-lingual VC directly, using the dataset of the cross-lingual task in VCC 2020.
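A minimal sketch of the duration adjustment just described, assuming the linguistic features come in as a frames-by-dimensions array; the helper name and the linear-interpolation details are illustrative, not the authors' code.

```python
import numpy as np

def adjust_duration(features, coeff):
    # features: (T, D) linguistic feature sequence from the ASR model
    # coeff:    interpolation coefficient; roughly the ratio of average
    #           phone durations between target and source, then refined
    #           by listening, as described above.
    T, D = features.shape
    new_T = max(int(round(T * coeff)), 1)
    src_idx = np.linspace(0.0, T - 1, new_T)
    lo = np.floor(src_idx).astype(int)
    hi = np.minimum(lo + 1, T - 1)
    w = (src_idx - lo)[:, None]
    return (1.0 - w) * features[lo] + w * features[hi]
```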
In this experiment, 64 sentences are randomly selected for training and the remaining sentences are used for validation. For the non-parallel intra-lingual VC, three systems are built and compared. The baseline is the N10 system from VCC 2018. We also construct two systems using our proposed method, with and without the duration adjustment. We conducted subjective listening tests. First of all, we compare our proposed method without duration adjustment against the baseline system. From table one, we can see that naturalness and speaker similarity obtain 0.116 and 0.1 MOS improvements, respectively, by using the autoregressive conversion model and the high-fidelity neural vocoder. Then we evaluate the effectiveness of the duration adjustment. We can see from the results in table two that the naturalness and the speaker similarity can be further improved with the duration adjustment strategy. For the cross-lingual VC, we present the official listening test results of VCC 2020. The left figure shows the subjective listening test results for speech quality, and the right figure shows the results for speaker similarity. Our system is T10. We can see that our system obtains the best performance in speech quality and comparable performance to the best system in speaker similarity. In this paper, an improved system was proposed for non-parallel VC on the basis of the top system in VCC 2018. In our proposed method, an autoregressive conversion model was used to improve the modeling capability, a high-fidelity WaveNet-based neural vocoder was adopted to model the 24 kHz, 16-bit waveform and improve the converted speech quality, and a duration adjustment strategy was proposed to alleviate the obvious speech rate difference between source and target speakers. Experimental results validated the effectiveness of our proposed method in both converted speech naturalness and speaker similarity. Besides, by applying it directly to the cross-lingual VC in VCC 2020, it achieved state-of-the-art performance. That's all. Thank you.

Hello, everyone. We are from Samsung Research China, Beijing. We would like to present our work, Submission from SRCB for Voice Conversion Challenge 2020. Here is our outline. First, we will talk about the voice conversion problem. Then we will talk about our proposed method and experiment details. Last, we will analyze our results in this year's challenge and make a conclusion. Voice conversion modifies a source speaker's speech so that the result sounds like a target speaker's. Our work is to build that conversion function. This year's voice conversion challenge has two difficulties to overcome. Firstly, the training data provided is non-parallel. We cannot learn the conversion function in a strongly supervised manner as we could with parallel data, and non-parallel training also cannot guarantee alignment of acoustic units among the training groups, which is very helpful for acoustic modeling at finer granularity. Secondly, voice conversion becomes particularly difficult when the source and target speakers speak different languages: the phonetic foundation linking the speakers' acoustic characteristics is weaker, and of course different languages mean non-parallel data too. We need to find a solution considering both of these challenges and, meanwhile, improve the output speech's naturalness and sound quality. Related work uses automatic speech recognition, the so-called ASR module, to infer phonetic correspondences between non-parallel materials.
We consider a TTS-like approach that first computes a phonetic posteriorgram, which we call PPG, from the source utterance. The PPG can be seen as a soft decoding of the utterance into time-stamped phonetic units. It is then fed as text to synthesize speech in the voice of the target speaker. Cross-lingual voice conversion can also be implemented by mapping PPGs to and from acoustic features of different speakers, so we chose this direction to design our solution. Now, we would like to explain how the modules work. For PPG extraction, we used the CVTE Mandarin ASR model and the LibriSpeech English ASR model to extract bilingual PPGs. Since Mandarin PPGs contain more tone information, it is reasonable to augment the English PPG with the Mandarin PPG to effectively express the phonetic content. In particular, we consider concatenating English and Mandarin PPGs computed from the source utterance, regardless of its true language ID, as shown in the picture. Then the speech synthesizer converts this bilingual PPG to a mel spectrogram. The x-vector is a discriminatively trained speaker embedding widely used in speaker recognition. We train the x-vector extractor model using both Mandarin and English data, but different from the usual ones, we use a 64-dimensional x-vector instead of a 512-dimensional one, because it became easier to converge while training the model. Then we will talk about the core synthesizer. Besides the x-vector, which we call the speaker embedding, the decoder is also conditioned on a residual encoding. The residual encoder predicts a variational posterior from the source utterance, which captures unexplained information, like noise and environment. We observe that this allows training the synthesizer with noisy data and then generating clean speech. In this work, we add 14 types of noise to the VCC data at different SNRs. This gives a 15-fold augmentation of the low-resource VCC training data and helps improve the naturalness and sound quality a lot. Our internal data is much larger than the data VCC provided; without this augmentation of the VCC data, the synthesizer's training would lose its balance. We also use domain adversarial training to encourage the output of the text encoder to encode the PPG in a speaker-independent manner, by introducing a speaker classifier on top of the text encoding with a gradient reversal layer (a sketch of this trick follows below). It has shown good performance in TTS training, so it is reasonable to apply this adversarial training to the voice conversion core synthesizer's training; they have similar structures. Finally, we will talk about our vocoder. We chose WaveNet as our vocoder for its high naturalness. We trained a general model first and then fine-tuned it for each target speaker to improve its sound quality. We then add a post-processing module for vocoder noise suppression. In this module, we use a pink noise reference to construct a noise model and calculate the SNR of each frame, and then we apply Wiener filtering. We add a voice detection module and use different noise suppression intensities for voiced and non-voiced frames, respectively. This can improve the sound quality a lot. Then we will talk about experiment details. For the PPG, we use Kaldi pre-trained ASR models, including the CVTE Mandarin model and the LibriSpeech English model. For the x-vector extractor, we use the Kaldi toolkit to train the model on an internal dataset of more than 200 hours and 300 Mandarin and American English speakers.
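For reference, the gradient reversal trick mentioned above can be sketched as follows in PyTorch; the class and variable names are assumptions, and this is a generic implementation of the technique rather than the SRCB code.

```python
import torch

class GradientReversal(torch.autograd.Function):
    # Identity in the forward pass; flips the gradient sign in backward,
    # so the encoder learns to *remove* the speaker cues the classifier uses.
    @staticmethod
    def forward(ctx, x, lam=1.0):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

# Usage sketch (names are hypothetical):
#   reversed_enc = GradientReversal.apply(text_encoding, 1.0)
#   spk_logits = speaker_classifier(reversed_enc)
#   adv_loss = torch.nn.functional.cross_entropy(spk_logits, speaker_ids)
```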
To train the core synthesizer, we use two datasets. One is the VCC 2020 dataset, including eight English speakers, two Finnish speakers, two German speakers, and two Mandarin speakers. Each speaker has 70 utterances, 65 for training and five for testing. The other is our internal dataset, including 14,000 utterances from six American English speakers, three female and three male. We use a sample rate of 24,000 Hz and 80-dimensional mel spectrograms. We train our WaveNet vocoder using our internal data, more than 30,000 utterances from one female American speaker, and then fine-tune it for each target speaker. Well, in the VCC 2020 official report, there are subjective tests and objective tests; we mainly focus on the subjective results. The subjective test includes two items, naturalness and similarity. We got a MOS of 4.17 in the intra-lingual voice conversion task, which is task one, and 4.13 in the cross-lingual conversion task, which is task two, both ranking second among all the teams. These scores mean our work on generating human-like speech performs well. Then, in the similarity test, we got a MOS of 3.68 in the intra-lingual task and 2.94 in the cross-lingual task, both ranking fourth, which is slightly lower than the naturalness performance. Why? Because we found that the WaveNet vocoder indeed has good naturalness when converting mel spectrograms into waveform speech; however, we did not train the target speakers' vocoders directly, for we do not have enough clean speech from the target speakers. Instead, we first trained the general network using one American English female speaker and then fine-tuned it for each target speaker. So the general model guarantees the naturalness, but the fine-tuned model cannot guarantee the similarity to the target speaker. Considering the good speech naturalness of the vocoder, the loss of speaker similarity was acceptable. Finally, we will make a conclusion. First, we designed a PPG and TTS solution for the non-parallel and cross-lingual challenge. Then we improved sound quality and naturalness by improving the performance of the synthesizer: we use a residual encoder and data augmentation for the low-resource speakers, and we use domain adversarial training for a speaker-independent text encoder. And to improve the performance of our vocoder, we use the vocoder noise suppression post-processing. Thank you.

Hello, everyone. I'm Lian Zheng from the Institute of Automation, Chinese Academy of Sciences. It's a really great pleasure for me to be able to attend this meeting. Today, I would like to present my paper, the CASIA Voice Conversion System for the Voice Conversion Challenge 2020. My presentation contains four parts: the motivation, the proposed method, the experimental results, and the conclusions. Firstly, I will introduce the motivation of this paper. Voice conversion aims to modify the source speaker's voice to sound like that of the target speaker while keeping the linguistic content unchanged. Conventional voice conversion approaches usually need parallel training data, which contains pairs of same-transcript utterances spoken by different speakers. Parallel voice conversion first aligns acoustic units between source and target speech by dynamic time warping; then a conversion model is learned to map speech from the source to the target speaker. When parallel training data is unavailable, there are some methods for non-parallel voice conversion. The variational autoencoder has been successfully applied to non-parallel voice conversion.
However, the variational autoencoder suffers from over-smoothing. To address this problem, generative adversarial networks and their variants introduce a discriminator that penalizes these artifacts through the loss function. However, these methods are hard to train, and the discriminator's judgment may not correspond well to human auditory perception. Recently, there is another line of research that applies PPGs to non-parallel voice conversion. PPGs are frame-level linguistic representations obtained from a speaker-independent automatic speech recognition system. The PPG-based voice conversion framework mainly has two key components: the conversion model and the vocoder. The conversion model converts PPGs extracted from the source speech into acoustic features of the target speaker; then the vocoder uses these converted features to synthesize the speech waveform of the target speaker. VCC 2020 only provides a little parallel training data in the first task, while parallel training data is unavailable in the second task. To deal with this problem, we utilize the PPG-based voice conversion framework in our system. Our system uses a CBHG conversion model and the LPCNet vocoder for speech generation. The CBHG conversion model has a bank of one-dimensional convolutional filters, highway networks, and bi-directional gated recurrent units. Previous works have verified that this structure can effectively capture contextual information in feature sequences. The LPCNet vocoder combines linear prediction with recurrent neural networks, and previous works have verified that this vocoder can better control the spectral shape. To better control the prosody of the converted speech, we utilize acoustic features of the source speech as additional inputs, including the pitch, the voiced/unvoiced flag, and the band aperiodicity (BAP). In the training stage, we first extract the acoustic features, that is, pitch, BAP, and voiced/unvoiced flag, plus PPGs from the target speech; then we concatenate these features, and finally a CBHG conversion model is learned to convert them into the acoustic features. In the conversion stage, we first extract the pitch, BAP, voiced/unvoiced flag, and PPGs from the source speech. A linear conversion is applied to convert the pitch from the source speaker to the target speaker. These representations are fed into the conversion model, which predicts the converted acoustic features. Finally, we feed these converted acoustic features to the speaker-dependent LPCNet vocoder for speech generation. The vocoder influences the quality of the converted speech. In this paper, we choose the LPCNet vocoder for speech generation, using the code published by the Mozilla team with some modifications. Since Voice Conversion Challenge 2020 focuses on 24 kHz speech, we modified the original 16 kHz LPCNet to 24 kHz. To better model the high-frequency features, we increased the dimension of the BFCC features to 30; to extract a more accurate pitch trajectory, we use the pitch estimator from the WORLD vocoder. In total, we extract 32-dimensional features. Since the training data is limited in Voice Conversion Challenge 2020, we apply a speaker adaptation approach for the CBHG conversion model and the LPCNet vocoder. As for the CBHG conversion model, we first train a multi-speaker average model, where PPGs augmented with a one-hot speaker embedding vector are used as the inputs; then we adapt the pre-trained average model to the target speaker.
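Before moving on to the vocoder adaptation: the linear pitch conversion mentioned in the conversion stage above is commonly done on log-F0 statistics. The sketch below is one such recipe under that assumption; the paper only states that a linear conversion is applied, so the details here are illustrative.

```python
import numpy as np

def convert_f0(f0_src, mean_src, std_src, mean_tgt, std_tgt):
    # Statistics are the mean/std of log-F0 over each speaker's voiced
    # frames; unvoiced frames (f0 == 0) are passed through unchanged.
    f0_out = np.zeros_like(f0_src, dtype=float)
    voiced = f0_src > 0
    lf0 = np.log(f0_src[voiced])
    f0_out[voiced] = np.exp((lf0 - mean_src) / std_src * std_tgt + mean_tgt)
    return f0_out
```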
As for the LPCNet vocoder, we first train an initialization model with a multi-speaker dataset, without additional speaker embeddings; then we adapt the initialization model to the target speaker with the limited data. What's more, we increase the training samples by means of speed perturbation, a technique that changes the speech speed while keeping the pitch unchanged. We randomly choose the speed factor from 0.6 to 1.2. In VCC 2020, the quality of the speech samples and their similarity to the target speaker are evaluated using the official subjective evaluation. The organizers recruited 206 Japanese listeners and 68 English listeners to evaluate the converted speech. Subjective test results on task one are shown in this figure. Our CASIA voice conversion system is denoted as T29. This figure shows that our system achieves a MOS of 3.99 for speech quality in this task, compared with 3.79 for the baseline and 4.07 for the top system. And this figure also shows that our system achieves a similarity score of 64% in task one, compared with 79% for the baseline system and 89% for the top system. Subjective test results on task two are shown in this figure. It shows that our system achieves a MOS of 3.99 for speech quality in task two, compared with 3.80 for the baseline and 4.18 for the top system. And the right figure shows that our system achieves a similarity score of 69% in task two, compared with 59% for the baseline system and 71% for the top system. Overall, the results of the Voice Conversion Challenge 2020 rank our system in second place. Here are some examples. These changes aroused orthodox opposition and sometimes government intervention. And this is the target speaker: Moroccan agriculture enjoys special treatment when exporting to Europe. And this is our proposed method: These changes aroused orthodox opposition and sometimes government intervention. And I will show some examples from task two. These changes aroused orthodox opposition and sometimes government intervention. Die verschiedenen Berichte spiegeln diesen Mainstream wider. These changes aroused orthodox opposition and sometimes government intervention. Finally, we conclude the whole paper. In this paper, we presented our CASIA voice conversion system developed for the Voice Conversion Challenge 2020. Our system adopts the CBHG structure to convert the source-speech PPGs into target-speech acoustic features; then the LPCNet vocoder is utilized for speech generation. To better control the prosody of the converted speech, other acoustic features of the source speech, including the pitch, the voiced/unvoiced flag, and the BAP, are utilized as additional inputs. To deal with the impact of limited training samples, speaker adaptation strategies are also applied. The results of Voice Conversion Challenge 2020 rank our system in second place, with a MOS of 3.99 for speech quality and 64% for speaker similarity. That's all, and thanks for your attention.

Hi everyone, my name is Tuan Vu Ho, and I'm from the Akagi laboratory of the Japan Advanced Institute of Science and Technology. It's a pleasure for me today to introduce my study on non-parallel voice conversion based on the hierarchical latent embedding vector quantized variational autoencoder. First, I want to give a brief introduction. Voice conversion is a method to change the voice characteristics, for example age or gender, while preserving the linguistic content.
Cross-lingual voice conversion is the special case where the target speaker does not speak the source speaker's language. The vector quantized variational autoencoder (VQ-VAE) is a model that can be used for cross-lingual voice conversion, as it can disentangle the linguistic content and the speaker information. In this model, the discrete latent codes represent the linguistic information, while the speaker information is captured in the speaker embedding. However, in the conventional VQ-VAE model, the discrete latent variable only captures information at a fixed temporal scale; therefore, it may not be efficient at capturing the hierarchical structure of speech, for example phonemes, syllables, or words. So, inspired by the VQ-VAE-2 study, we propose to use a hierarchical latent structure to capture the speech structure at different temporal scales. Here is the overview of our model. Our model contains multiple stages that operate at multiple time scales: each encoder down-samples its input by a factor of two, while on the other side, each decoder up-samples its input by a factor of two. One of the features of our model is the speaker embedding, which is learnable; therefore, it can be updated during the training process. To adapt the model to the cross-lingual setting, we first randomly initialize new target speaker embeddings. Next, we fine-tune the target speaker embeddings and the latent codebooks on the target data. Finally, the voice conversion is performed using the new target embeddings and the model trained on English data. Here is the illustration of the speaker embeddings during the training process. We conducted an experiment to compare the performance of three models: the baseline model, which is the VQ-VAE model without the hierarchical structure, and our proposed model with two and three stages of hierarchical structure. For training data, we use the Voice Conversion Challenge training data and the VCTK dataset. We use the 80-dimensional mel spectrogram as the input feature, and we use Parallel WaveGAN to reconstruct the waveform. We measure the root mean square error between the modulation spectra of the target and the converted speech; the RMSE is averaged across all modulation frequencies, mel channels, and utterances. As we can see in the table, the proposed model with three stages achieved the lowest RMSE. Next, we conducted a listening test to compare the performance. As we can see, the proposed model with three stages achieved the highest preference score in both the naturalness test and the speaker similarity test. Therefore, we used the converted utterances from the three-stage model to submit to the voice conversion challenge. Our team is T20. The official results show that the model achieved the highest naturalness score among the autoencoder-based models in both task one and task two. For speaker similarity, our model achieved comparable performance with the other autoencoder-based methods; however, there is a decline in task two. And here are our conclusions: we have proposed a hierarchical latent structure to improve the performance of VQ-VAE-based cross-lingual voice conversion, and the proposed method outperforms the vanilla VQ-VAE model, achieving the best naturalness score among the autoencoder-based methods.
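To make the quantization step concrete, here is a minimal single-stage sketch of the VQ-VAE bottleneck in PyTorch; the hierarchical model applies this at several temporal resolutions. Names and shapes are assumptions, not the author's implementation.

```python
import torch

def vector_quantize(z, codebook):
    # z:        (batch, time, dim) continuous encoder output
    # codebook: (num_codes, dim) learnable embedding table
    dist = torch.cdist(z, codebook.unsqueeze(0).expand(z.size(0), -1, -1))
    codes = dist.argmin(dim=-1)       # nearest codebook entry per frame
    z_q = codebook[codes]             # (batch, time, dim) quantized latents
    z_q = z + (z_q - z).detach()      # straight-through gradient estimator
    return z_q, codes
```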
Here are some audio samples from our model. In reality, the European Parliament is practicing delay tactics. Summertime has in practice become normal time. In reality, the European Parliament is practicing delay tactics. In reality, the European Parliament is practicing delay tactics. So that's all for my presentation. Thank you.

Hi, everyone, and thanks for attending this presentation. I am Oriol, and I will talk about FastVC, a work that I did with Miloš, principal engineer at Logitech. The motivation of this work was to build a simple system for voice conversion that performs fast inference so as to allow industrial applications. However, despite these two first requirements, we didn't want to give up quality. Our work takes AutoVC as a baseline, because this model satisfies two of the three constraints that we had: first of all, it's a simple model based on conditional autoencoders, and second of all, it achieves high-quality results. Now, let's talk about FastVC. Our model achieves voice conversion with a three-stage procedure, similarly to AutoVC. Since these two models have a lot of similarities, I will focus on the main differences with respect to AutoVC in each of these stages. Let's first focus on the mel spectrogram transform. In line with the simplicity requirement, we use raw data as input. This means that we don't apply any kind of preprocessing, and we use all frequencies, differently from AutoVC. But the main difference is that we implement the mel spectrogram transform with CNNs. This means that the transform can be initialized to the exact values, but it can also be learned, thus allowing end-to-end training. The autoencoder module is the core of the voice conversion. FastVC uses one-hot encoded speaker identities instead of speaker embeddings. This means that we don't allow zero-shot voice conversion, but this was not a desired feature. Also, FastVC has different hyper-parameters in the information bottleneck than AutoVC. AutoVC applies two kinds of information bottleneck: it applies a temporal down-sampling and also a dimensionality reduction. FastVC applies twice as much temporal down-sampling and has half the dimensionality of AutoVC. This is to ensure that the latents are speaker-independent. Last of all, let's focus on the mel spectrogram inverter. FastVC uses MelGAN, which, in contrast to autoregressive models such as WaveNet (the design choice of AutoVC), allows for very fast mel spectrogram inversion. Let's now listen to some samples from the cross-lingual task. We'll hear the samples obtained with the two baselines of the voice conversion challenge, AutoVC, and our proposal, FastVC. The audiovisual sector is very important. The various reports mirror this mainstream again. The audiovisual sector is very important. The audiovisual sector is very important. The audiovisual sector is very important. The audiovisual sector is very important. Let's now jump to the objective evaluation of the proposed method. Voice conversion is an ill-posed problem; for this reason, it is difficult to assess the results of a voice conversion model with objective measures. We use PESQ to evaluate voice conversion with either the same or different speakers. This table presents the results when the same speaker was used. Note that for this comparison to be meaningful, the latent representations have to be speaker-independent; otherwise, not applying an information bottleneck to the input signal would yield perfect reconstruction. In that case, however, we would have speaker dependence in the latents, and voice conversion would not be possible. To see whether the PESQ correlated with the MOS score, a small subjective evaluation was conducted.
Here, we present the subjective results that correspond to the cross-lingual voice conversion task of the Voice Conversion Challenge. First of all, we present the quality results, where FastVC outperforms the two baselines of the challenge. Here are the similarity results for the same task; note that in this case, FastVC performs worse than the average. We attribute this to the fact that FastVC uses several datasets: in particular, we combined the Voice Conversion Challenge dataset and VCTK. VCTK has more data per speaker, so we have a data imbalance there; and in the Voice Conversion Challenge we competed in the cross-lingual task, where we additionally have a data imbalance regarding the language. Let's wrap up by reviewing the original requirements that we had. First of all, we have a simple model that is based on conditional autoencoders, which learn speech reconstruction and don't need any kind of annotation except a speaker identifier. Second of all, we have a fast model that is way faster than AutoVC, which was our baseline. And third of all, despite these two factors, we still obtain quality results. Thanks a lot for being interested in our work and watching this presentation. Please feel free to reach out if you have any doubts. You can also scan this QR code to take a look at my master's thesis, and also hear more samples where we compare FastVC to other methods. Thanks.

Hello there. I'm from the National Institute of Informatics, and I'm here to present our work for the Voice Conversion Challenge 2020: our system T07, whose title is "Latent linguistic embedding for cross-lingual text-to-speech and voice conversion." If you look at the picture here on my slides, you can see that text-to-speech and voice conversion can be considered just different interfaces for generating speech with a target's voice. In the case of cross-lingual speech generation, this is the scenario in which we generate speech with the voice of the target speaker in a language not originally spoken by them. This is also the topic of the Voice Conversion Challenge 2020. Let's look at the results of the challenge before I go into the details of our work. For task one, with the English target speakers, you can see that our system is among the top ten with the highest similarity; the quality is moderate, but not the best. For task two, the cross-lingual voice conversion, you can see that our system placed first on the similarity metric. What is interesting is that there is a huge gap in terms of quality when compared with the first and second systems. So the question is: why is there such high similarity but such a large gap in quality? What do listeners think when they give this kind of answer in a listening test? Now I will describe our system. It is an extended version of our system for voice cloning. For this system, we first need to train a robust English latent linguistic embedding by training what we call a text-speech multimodal system using data from multiple speakers. Given this well-trained latent linguistic embedding, we can use it to adapt to speakers who did not speak English at all in the adaptation step: we tune the speaker decoder and the vocoder to the target speaker. We can then perform one extra step of jointly tuning the speaker decoder and the vocoder to increase their compatibility. That is the step-by-step procedure for cloning voices with our system.
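Here is a minimal sketch of this kind of adaptation step, assuming PyTorch. The module and loader names are illustrative, not the authors' implementation: the shared linguistic encoder stays frozen, and only the speaker decoder and the neural vocoder are updated on the target speaker's data.

```python
import torch
import torch.nn.functional as F

def adapt_to_target(encoder, speaker_decoder, vocoder, target_loader, epochs=10):
    for p in encoder.parameters():
        p.requires_grad = False          # keep the latent linguistic space fixed
    params = list(speaker_decoder.parameters()) + list(vocoder.parameters())
    opt = torch.optim.Adam(params, lr=1e-4)
    for _ in range(epochs):
        for mel, wav in target_loader:   # target-speaker utterances
            with torch.no_grad():
                z = encoder(mel)         # speaker-independent linguistic codes
            mel_hat = speaker_decoder(z)
            # joint loss so the decoder and vocoder stay compatible
            loss = F.l1_loss(mel_hat, mel) + F.l1_loss(vocoder(mel_hat), wav)
            opt.zero_grad()
            loss.backward()
            opt.step()
```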
The results of the Voice Conversion Challenge show that our system can do cross-lingual voice conversion. So how about cross-lingual text-to-speech? We conducted an extra listening test, which showed that our cross-lingual text-to-speech system also has very high similarity and quality, consistent with the voice conversion results. The performance is also quite consistent between the English target speakers and the Finnish, German, and Mandarin speakers. Let's hear a sample; this is a Finnish female speaker. If we give a text input to our system, it acts as a cross-lingual text-to-speech system. During the following years, he tried unsuccessfully to get into production. But if we give a source utterance as input... During the following years, he tried unsuccessfully to get into production. ...then it acts as a cross-lingual voice conversion system. During the following years, he tried unsuccessfully to get into production. So our system can do both at the same time, we don't have to do any extra steps, and the performance between text-to-speech and voice conversion is very consistent; they share similar speaker characteristics, which we want to explore more in our future work. You can find more speech samples using the QR code right here, or visit my website and find the relevant material there. Thank you for your attention.

Hello, everyone. My name is Patrick, from Nagoya University, Japan. In this presentation, I would like to present our baseline system for the Voice Conversion Challenge 2020, based on a cyclic variational autoencoder (CycleVAE) and Parallel WaveGAN. The motivation of our system is that we want to provide an open-source baseline system for VCC 2020 that is reproducible even when using only the VCC 2020 dataset. CycleVAE itself is a robust and lightweight framework for non-parallel spectral modeling, which has potential for future deployment and use cases in real-world applications. Parallel WaveGAN (PWG) is a fast, high-quality, non-autoregressive vocoder, which provides a convenient solution for generating the converted speech waveform with higher quality than conventional vocoders. The CycleVAE-based spectral modeling is a spectral reconstruction model with VAE-based latent regularization, where the cyclic flow uses the sampled converted spectral features to further regularize the network, improving the conversion accuracy and disentangling the latent space. In developing the PWG vocoder, we actually use the CycleVAE-reconstructed spectra, because they have closer traits to the converted spectra than the original spectra; they are used to improve the PWG's robustness with respect to feature mismatches, in this case the spectral over-smoothing. At conversion time, we simply use the trained multi-speaker PWG to generate the converted speech waveform from the converted spectra.
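To make the cyclic flow concrete, here is a minimal sketch of one CycleVAE training step, assuming PyTorch; the encoder/decoder interfaces, loss weights, and the omission of the vocoder stage are illustrative, not the baseline's exact recipe.

```python
import torch
import torch.nn.functional as F

def cyclevae_step(encoder, decoder, mel_src, spk_src, spk_trg, opt):
    mu, logvar = encoder(mel_src)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
    recon = decoder(z, spk_src)        # same-speaker reconstruction path
    conv = decoder(z, spk_trg)         # sampled converted spectra
    mu_c, _ = encoder(conv)            # cyclic flow: re-encode the conversion
    cyc = decoder(mu_c, spk_src)       # convert back to the source speaker
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # the cycle term regularizes the network using the converted features
    loss = F.l1_loss(recon, mel_src) + F.l1_loss(cyc, mel_src) + 0.1 * kl
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```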
Here are the results of the Voice Conversion Challenge 2020, first for task one, the intra-lingual one. Our system is T16, and it can be seen that it provides higher-than-average speaker similarity and average naturalness; among the autoencoder-based methods, it provides very good performance in this category. This is the result for task two, the cross-lingual one: our system again provides higher-than-average speaker similarity, with slightly lower-than-average naturalness, while still providing very good performance within the autoencoder-based methods category. As future work, we want to alleviate the spectral over-smoothing with a refinement network trained after the CycleVAE, because this is a serious issue that we can address to improve quality, and we want to utilize sequence-level modeling and latent constraints to further improve similarity. Thank you.

Hello everyone. My name is Wen-Chin Huang from Nagoya University. Today I'm going to introduce the sequence-to-sequence baseline for VCC 2020. We wanted to build a system that is non-parallel and sequence-to-sequence at the same time. From the data point of view, this year's challenge mainly includes non-parallel tasks, and from the model point of view, sequence-to-sequence VC models have been shown to be very good at converting prosody. We also wanted the system to be easy to use, open source, and as competitive as possible. So we decided to cascade sequence-to-sequence ASR and TTS with ESPnet, which is a well-documented and actively developed open-source toolkit. We made use of many public datasets, and we show that, contrary to the belief that jointly optimizing a system is better, simply cascading state-of-the-art ASR and TTS models can give us a very strong system, as we will show later. This is the system overview. First, we use an ASR model to recognize the source speech into a transcription. Then we send it into the TTS model to get the acoustic features, and finally we use a neural vocoder to generate the speech waveform. There are two things worth noticing. Number one, because the VCC dataset is very limited, we used a pre-training and fine-tuning technique. Number two, the outputs of the ASR model are English characters, and we optionally convert them into phonemes; for example, we did this for the Mandarin TTS. The ASR model and the TTS model were based on the Transformer, which is the sequence-to-sequence state of the art, and for the datasets we used LibriSpeech, LibriTTS, and many other datasets that are publicly available. Finally, we used Parallel WaveGAN because it is a real-time neural vocoder. We ranked about in the top third in terms of naturalness in task one, and we ranked second in terms of similarity, which shows that sequence-to-sequence models are indeed very strong in terms of similarity. And although we performed rather poorly in terms of naturalness in task two, we could still place in the top third in terms of similarity. These results show that our system is very competitive. We list some possible directions for improvement, including more training data, more linguistic knowledge, a better multi-speaker TTS model, and a better neural vocoder. I suggest you look into the paper for more details, and I thank you for your attention.
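As a concrete illustration of the cascade just described, here is a rough sketch using ESPnet2's inference helpers; the model tags and file names are placeholders, not the baseline's actual checkpoints, and it assumes a TTS checkpoint bundled with a neural vocoder so that the output dictionary contains a waveform.

```python
import soundfile as sf
from espnet2.bin.asr_inference import Speech2Text
from espnet2.bin.tts_inference import Text2Speech

# Placeholder pretrained-model tags, resolved via espnet_model_zoo.
asr = Speech2Text.from_pretrained("placeholder/asr_model_tag")
tts = Text2Speech.from_pretrained("placeholder/tts_model_tag")

speech, fs = sf.read("source_utterance.wav")  # placeholder source utterance
text, *_ = asr(speech)[0]                     # best hypothesis: the transcription
out = tts(text)                               # acoustic features + vocoder output
sf.write("converted.wav", out["wav"].numpy(), 22050)  # assumed 22.05 kHz
```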
Hello everyone. My name is Wen-Chin Huang from Nagoya University. Today I'm going to introduce the NU entry for VCC 2020. We wanted to use the power of the large-scale listening test in this year's VCC to answer two research questions. We start from the baseline system T16, which is a combination of a frame-based CycleVAE conversion model and a non-autoregressive neural vocoder, Parallel WaveGAN (PWG). Our team has been focusing on sequence-to-sequence VC recently, and we want to know how much it can outperform frame-based VC. We also know that autoregressive (AR) neural vocoders are better than non-AR ones, and we want to know in what aspects they outperform the non-AR ones. First, for task one, we extended the Voice Transformer Network (VTN) with synthetic parallel data. VTN is a parallel VC model we developed, based on the Transformer. Because it was originally designed to tackle only parallel VC, we extended it to handle non-parallel data, since task one is non-parallel. What we did was to train TTS models to generate synthetic parallel data: we trained a TTS model for the source speaker and a TTS model for the target speaker, and we generated pseudo-parallel data pairs, as well as synthetic parallel pairs from external data, to train the final VTN model. For task two, we used the same CycleVAE, but we replaced the original PWG with a WaveNet neural vocoder, which is an AR vocoder. We also used more training data: we used the VCTK corpus, which contains a lot more data than the VCC dataset, we used data augmentation based on F0 perturbation, and we used a more complex training flow consisting of several fine-tuning procedures; for more details, we suggest you look into the paper. Looking at the results in task one, we can see that T23, which is our system, is better than the average of all the submissions in terms of similarity by about 20%, and it also outperforms the T16 baseline by about 15%. This result shows that sequence-to-sequence VC can improve similarity. For task two, the T23 system outperforms the average by about 0.5, and it outperforms T16 by about 0.6 or 0.7 in terms of naturalness. So AR neural vocoders can improve quality, or naturalness.

Hello everyone, my name is Tian Xiaohai, from the National University of Singapore. I will present our system for the Voice Conversion Challenge 2020. This is joint work with Professor Lei Xie's group at Northwestern Polytechnical University in China. Our submission is a PPG-based voice conversion system, mainly consisting of three modules: the PPG extractor, the feature conversion model, and the neural vocoder. Given a source speech, we first use the PPG extractor to extract the content information; then the conversion model is used to transform the content information into target acoustic features. Finally, the target speech is generated by the neural vocoder. The PPG extractor includes four GRU layers, followed by a bottleneck layer and a softmax output layer. In practice, the output of the bottleneck layer is used to represent the content information. Our PPG extractor is trained with a multi-speaker English corpus. The conversion model follows an encoder-decoder framework. As the bottleneck features and the acoustic features are aligned frame by frame, an attention mechanism is not used in our model. The conversion model is first trained with a multi-speaker corpus; for each target, we adapt the average model with the target's speech. A multi-band WaveRNN is used as our neural vocoder. The same as for the conversion model, it is first trained with a multi-speaker corpus, then adapted with the target speech for each target. As this year's Voice Conversion Challenge has two tasks, we use the same system for both the intra-lingual and cross-lingual tasks.
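As a rough illustration, here is a minimal sketch of a PPG extractor of this kind, assuming PyTorch; the hidden sizes and number of phone classes are illustrative, not the authors' configuration.

```python
import torch
import torch.nn as nn

class PPGExtractor(nn.Module):
    def __init__(self, n_mels=80, hidden=256, bottleneck=128, n_phones=72):
        super().__init__()
        self.gru = nn.GRU(n_mels, hidden, num_layers=4, batch_first=True)
        self.bottleneck = nn.Linear(hidden, bottleneck)
        self.out = nn.Linear(bottleneck, n_phones)

    def forward(self, mel):                        # mel: (batch, time, n_mels)
        h, _ = self.gru(mel)
        content = torch.tanh(self.bottleneck(h))   # used as the content features
        logits = self.out(content)                 # phone posteriors; softmax is
        return content, logits                     # applied via cross-entropy loss
```

At conversion time only `content` is consumed by the downstream encoder-decoder model; the softmax output exists solely so the extractor can be trained on phone labels from the multi-speaker corpus.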
For more details, please refer to our paper at the following link. Thank you for your attention.

Hi, I'm Haitong Zhang from the NetEase Games AI Lab, and I want to share our paper. Our system mainly consists of four parts. The first part is the encoder, which consists of stacks of convolutional layers and fully-connected layers, aiming to learn a high-level representation from the input features. The second part is the vector quantization part, which quantizes the continuous representation into a fixed number of embeddings by finding the nearest embedding. The third part is the time-jitter mechanism: it replaces the embedding at the current frame with that of the previous frame or the next frame, each with a probability of 0.12. The last part is the decoder, which is mainly a WaveNet-based decoder that reconstructs the waveform conditioned on the learned representation and the speaker embedding. Here are the evaluation results for task one. Our system achieves a MOS of about 3. Although the ranking is not very high compared to all the systems, it achieves the best performance among the systems in which no supervised learning, such as ASR or TTS, is involved. The second figure shows that our system achieved a MOS of about 3.28 in task two, where it also achieved the best performance among the systems without supervised learning. Here is the objective evaluation for task one. On the ASV metric, our system ranked 6th place, and on the CM EER it ranked 8th place. For the predicted MOS, our system achieved scores of about 3.9, and for ASR, our system achieved about 20% WER.

We are from Academia Sinica, Taiwan, and this is Alexander Kim. Today we are going to demonstrate our systems for VCC 2020. For VCC 2020 we present two different systems, one for the monolingual task and one for the cross-lingual task. Both systems use similar concepts: we cascade an ASR and TTS structure. For the first task, we change the output of the ASR: we replace English words with International Phonetic Alphabet symbols, called IPA, and we also train our TTS with the IPA symbols. For the second task, since a cross-lingual ASR is very hard to train, we extend the idea from the first task: we let the machine define its own phonetic token IDs, and the token IDs are used to train our TTS. In this slide we show the monolingual system structure. The outputs of the ASR are IPA sequences, and the TTS takes the sequence to generate mel spectrograms; then Parallel WaveGAN takes the mel spectrogram to generate the target waveform. For the cross-lingual task, we need to train a complete VQ-VAE first. Then we take the first part of the VQ-VAE and treat it as the ASR; the output of this ASR is the phonetic token IDs defined by the machine itself. Second, we use the phonetic token IDs to train our TTS. Finally, we cascade the ASR, the TTS, and Parallel WaveGAN together, which is shown at the bottom of the slide. Finally, here are our results. For task one, our naturalness is ranked fifth out of 18 groups, and our similarity is ranked first out of 18 groups. For the second task, our naturalness is ranked fifth out of 15 groups, and our similarity is ranked sixth out of 13 groups. Thank you for listening.
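To make the "machine-defined phonetic token ID" idea concrete, here is a rough sketch, assuming PyTorch; `encoder` and `codebook` are placeholders standing in for components of the trained VQ-VAE, not the team's actual code.

```python
import torch

@torch.no_grad()
def speech_to_token_ids(encoder, codebook, mel):
    """Use the VQ-VAE encoder as a pseudo-ASR emitting discrete token IDs."""
    z = encoder(mel)                                    # (batch, time, dim)
    dist = torch.cdist(z, codebook.weight.unsqueeze(0)) # distance to each code
    return dist.argmin(-1)                              # one token ID per frame
```

These frame-level IDs then play the role that the IPA symbols played in the monolingual system: they become the input units on which the TTS model is trained.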
Hello, my name is Victor Da Costa, and this is the Federal University of Rio de Janeiro entry for the Voice Conversion Challenge 2020. For the system description: we use a CycleGAN model to perform the conversion, a non-parallel model that we use to convert mel spectrograms, and then a MelGAN vocoder to synthesize those mel spectrograms into converted audio. CycleGAN is a non-parallel conversion model based on generative adversarial networks. As a GAN, it has a generator, which performs the conversion itself, and a discriminator, which judges the converted output. These networks are trained on an adversarial loss, but an adversarial loss alone is not enough, so CycleGAN uses a cycle-consistency loss to regularize the model, which is the distance between the original input and the input passed through the forward and backward transformations. Here is the structure of CycleGAN with all its component networks. MelGAN is a neural vocoder also based on generative adversarial networks. Its generator generates audio from spectrograms and is non-autoregressive, allowing its generation speed to be extremely fast. Its discriminator is a multi-scale discriminator, an ensemble of discriminators at different sampling rates, which allows it to analyze both long time windows and a wide frequency range. Here is the structure of MelGAN, with the generator and the multi-scale discriminator. And here is the structure of the full system, with CycleGAN converting the mel spectrograms and MelGAN synthesizing the results. Now for the results. Here are the VCC results for English listeners, with our system highlighted. We obtained below-average results, with a quality score slightly below 2 and a similarity percentage of around 50%. Results for Japanese listeners were similar. In conclusion, we obtained below-average results for both naturalness and similarity, and our own tests indicate that the conversion model is the bottleneck. The system needed more development, changing both architectural choices and hyperparameters. Thank you.
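As a concrete illustration, here is a minimal sketch of the cycle-consistency term described above, assuming PyTorch; `G_xy` and `G_yx` stand for the two CycleGAN generators, and the adversarial and identity losses are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(G_xy, G_yx, mel_x, mel_y):
    # Applying the forward and then the backward transformation
    # should recover the original mel spectrogram.
    loss_x = F.l1_loss(G_yx(G_xy(mel_x)), mel_x)
    loss_y = F.l1_loss(G_xy(G_yx(mel_y)), mel_y)
    return loss_x + loss_y
```

In training, this term is added (usually with a large weight) to the adversarial losses of both discriminators, which is what lets the model learn a conversion from non-parallel data.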