ZeroSpeech 2019: TTS without T

Task and intended goal

Young children learn to talk long before they learn to read and write. They can conduct a dialogue and produce novel sentences without being trained on an annotated corpus of speech and text or on aligned phonetic symbols. Presumably, they achieve this by recoding the input speech into their own internal phonetic representations (proto-phonemes or proto-text), which encode linguistic units in a speaker-invariant way, and by using these representations to generate output speech.

Duplicating the child’s ability would be useful for the thousands of so-called low-resource languages, which lack the textual resources and/or linguistic expertise required to build a traditional speech synthesis system. The ZeroSpeech 2019 challenge addresses this problem by proposing to build a speech synthesizer without any text or phonetic labels, hence TTS without T (text-to-speech without text). We provide raw audio for the target voice(s) in an unknown language (the Voice Dataset), but no alignment, text or labels. Participants must discover subword units in an unsupervised way (using the Unit Discovery Dataset) and align them to the voice recording in a way that works best for the purpose of synthesizing novel utterances from novel speakers (see Figure 1).

ZeroSpeech 2019 is a continuation and logical extension of the subword unit discovery tracks of ZeroSpeech 2015 and ZeroSpeech 2017: it requires participants to discover such units, and then evaluates them by assessing their performance on a novel speech synthesis task. We provide a baseline system which performs the task using two off-the-shelf components: (1) a system which discovers discrete acoustic units automatically, in the spirit of Track 1 of the Zero Resource Challenges 2015 [1] and 2017 [2], and (2) a standard TTS system.

A submission to the challenge will replace at least one of these components. Participants can construct their own end-to-end system with the joint objective of discovering subword units and producing a waveform from them. Alternatively, participants can keep one of the two baseline components and improve the other. The challenge is therefore open to ASR-only systems which contribute primarily to unit discovery, focusing on improving the embedding evaluation scores (see Evaluation metrics). Vice versa, the challenge is open to TTS-only systems which concentrate primarily on improving the quality of the synthesis from the baseline subword units. (All submissions must be complete, however, including both the resynthesized audio and the embeddings used to generate it.)

Figure 1a. General outline of the ZeroSpeech 2019 Challenge. The aim is to construct a speech synthesizer which is conditioned not on text but on automatically discovered discrete subword units. This can be viewed as a voice conversion autoencoder with a discrete bottleneck (the input is speech from any speaker, the hidden representation is discrete, the output is speech in a target voice). We evaluate both the quality of the resynthesized wave file and the compactness of the intermediate discrete code.
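To make this architecture concrete, here is a minimal, purely illustrative sketch (in PyTorch; the module names, feature choices and sizes are our own assumptions, not part of the challenge or its baseline) of a voice conversion autoencoder with a discrete bottleneck: an encoder maps input speech features to latent vectors, each vector is snapped to the nearest entry of a small codebook (the discovered "units"), and a decoder conditioned on a target-speaker embedding reconstructs acoustic features in the target voice.

```python
# Illustrative discrete-bottleneck autoencoder in the spirit of Figure 1a.
# All names and dimensions are assumptions for the sake of the sketch.
import torch
import torch.nn as nn

class DiscreteBottleneckAE(nn.Module):
    """Encoder -> small codebook (discrete units) -> decoder in a target voice."""

    def __init__(self, n_mels=80, dim=64, n_units=50, n_speakers=2):
        super().__init__()
        self.encoder = nn.GRU(n_mels, dim, batch_first=True)
        self.codebook = nn.Embedding(n_units, dim)    # inventory of discrete units
        self.spk_emb = nn.Embedding(n_speakers, dim)  # identity of the target voice
        self.decoder = nn.GRU(2 * dim, dim, batch_first=True)
        self.out = nn.Linear(dim, n_mels)

    def forward(self, mels, target_spk):
        h, _ = self.encoder(mels)                                 # (B, T, dim)
        # Nearest-neighbour quantization (straight-through gradients omitted).
        dists = (h.unsqueeze(2) - self.codebook.weight).pow(2).sum(-1)
        units = dists.argmin(-1)                                  # (B, T) discrete code
        q = self.codebook(units)                                  # (B, T, dim)
        spk = self.spk_emb(target_spk)[:, None, :].expand_as(q)   # broadcast over time
        d, _ = self.decoder(torch.cat([q, spk], dim=-1))
        return self.out(d), units                                 # reconstruction + units

# Usage sketch: 1.2 s of 80-dim mel frames converted into speaker 0's voice.
# recon, units = DiscreteBottleneckAE()(torch.randn(1, 120, 80), torch.tensor([0]))
```

In such a system, the sequence of codebook indices (or the corresponding quantized vectors) is the kind of intermediate representation that would be submitted for evaluation, and the codebook size directly trades off bitrate against reconstruction quality.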

This challenge is being prepared with the intention of running as a special session at Interspeech 2019, and it operates on CodaLab. As with the two previous challenges, it relies exclusively on open-source software and datasets, and it encourages participants to submit their code or systems. It will remain open after the workshop deadline, although the human evaluation of the synthesis systems will only be rerun if there are enough new registrations.

A limited number of papers have provided a proof of principle that TTS without T is feasible [3-4]. We use this work as a baseline for the current challenge. This baseline uses an out-of-the-box acoustic unit discovery system [6] and an out-of-the-box speech synthesizer (Merlin [11]), provided in a virtual machine. A number of more recent improvements, on both the acoustic unit discovery side and the synthesis side of the problem, could easily be used to improve on this baseline.

On the phonetic unit discovery side, several methods have been proposed (Bayesian methods [5,6], binarized autoencoders [7], binarized siamese networks [8]); in addition, a variety of speaker normalization techniques [9], as well as unsupervised lexical information [10], have been used to improve the discovered categories.

On the speech synthesis side, waveform generation has recently seen great improvements (WaveNet [12], SampleRNN [13], Tacotron 2 [14], DeepVoice3 [15], VoiceLoop [16], Transformer TTS [17], etc.), some of these systems being open source.

Additional optimization methods could be used to improve the combined baseline. Recent research has shown that jointly training ASR and TTS with reconstruction losses can improve both systems [18].

The ZR19 task is similar to the voice conversion problem [19], in which the audio of a given source speaker is converted into the voice of a target speaker without any annotation (using variational autoencoders [20], disentangling autoencoders [21], or GANs [22]). The additional constraints we bring to the table are that the target synthesis voice be trained on a small, unannotated corpus; that the system apply, at test time, to completely novel source voices, on the basis of only one, or a very small number of, very short recordings per source voice; that the input to the decoder improve over a baseline on an explicit phoneme discriminability test; and that, in the encoder-decoder architecture, the intermediate representation have a low bit rate.

Datasets

We provide datasets for two languages, one for development, one for test.

The development language is English. The test language is a surprise Austronesian language for which far fewer resources are available. Only the development language is to be used for model development and hyperparameter tuning. The exact same model and hyperparameters must be used to train on the test language. In other words, the results on the surprise language must be the output of applying exactly the same training procedure as the one applied to the development corpus; all hyperparameter selection must be done beforehand, on the development corpus, or automated and integrated into training. The goal is to build a system that generalizes out of the box, as well as possible, to new languages. Participants will treat English as if it were a low-resource language, and refrain from using preexisting resources (ASR or TTS systems pretrained on other English datasets are banned). The aim is to use only the training datasets provided for this challenge. Exploration and novel algorithms are to be rewarded: no number hacking here!

Four datasets are provided for each language:

  • The Voice Dataset contains recordings from one or two talkers, with around 2h of speech per talker. It is intended for building an acoustic model of the target voice for speech synthesis.

  • The Unit Discovery Dataset contains read speech from 100 speakers, with around 10 minutes of speech from each speaker. It is intended to allow for the construction of acoustic units.

  • The Optional Parallel Dataset contains around 10 minutes of parallel spoken utterances from the Target Voice and from other speakers. This dataset is optional; it is intended for fine-tuning voice conversion. Systems that use this dataset will be evaluated in a separate ranking.

  • The Test Dataset contains new utterances by novel speakers. For each utterance, the proposed system has to generate a transcription using the discovered subword units, and resynthesize the utterance into a waveform in the target voice(s), using only the computed transcription.

The Voice Dataset in English contains two voices (one male, one female). This is intended to allow cross-validation of the TTS part of the system across these two voices (to avoid overfitting).

                         # speakers         # utterances                  Duration (VAD)

Development Language (English)
Train Voice Dataset      1 male             970                           2h
                         1 female           2563                          2h40
Train Unit Dataset       100                5941 [16-80 per speaker]      15h40 [2-11min per speaker]
Train Parallel Dataset   10 + 1 male (*)    92                            4.3min
                         10 + 1 female (*)  98                            4.5min
Test Dataset             24                 455 [6-20 per speaker]        28min [1-4min per speaker]

Test Language (surprise Austronesian)
Train Voice Dataset      1 female           1862                          1h30
Train Unit Dataset       112                15340 [81-164 per speaker]    15h
Train Parallel Dataset   15 + 1 female      150                           8min
Test Dataset             15                 405 [10-30 per speaker]       29min [1-3min per speaker]

(*): For the Train Parallel Dataset, 10 new speakers read phrases that are also read by each speaker of the Train Voice Dataset.

Figure 1b. Summary of the datasets used in the ZeroSpeech 2019 Challenge. The Unit Discovery Dataset provides a variety of speakers and is used for unsupervised acoustic modeling (i.e., to discover speaker-invariant, language-specific discrete subword units, or symbolic embeddings). The Voice Dataset is used to train the synthesizer. The Parallel Dataset is for fine-tuning the two subsystems. These datasets consist of raw waveforms (no text, no labels beyond speaker ID).

Evaluation metrics

Participants are requested to run each test sentence through their system and to extract two types of representations. At the very end of the pipeline, participants should collect a resynthesized wav file. In the middle of the pipeline, participants should extract the sequence of vectors corresponding to the system’s embedding of the test sentence at the entry point of the synthesis component. The embedding which is passed to the decoder must be submitted. In addition, two auxiliary representations, from earlier or later steps in the system’s pipeline, may also be submitted for evaluation (see Auxiliary embeddings for more details).

The most general form for the embedding is a real-valued matrix, with rows corresponding to the time axis and columns to the dimensions of a real-valued vector. The low bit-rate constraint (see below) will militate strongly in favour of a small, finite set of quantized units. One possibility, similar to the symbolic “text” in a classical TTS system, is a one-hot representation, with each “symbol” coded as a 1 on its own dimension. Representations are not limited to one-hot encodings, as many systems (for example, end-to-end DNN systems) may wish to take advantage of the internal structure of the numerical embedding space (but, to reduce the bitrate, it is nevertheless advisable to quantize to a small discrete set of numerical “symbols”). Conceptually, a $d$-dimensional vector submitted as an embedding may thus be of at least two types:

  • $d$-dimensional one-hot vectors (zeros everywhere except for a 1 in one dimension), representing the $d$ discrete “symbols” discovered by the unit discovery system

  • $d$-dimensional continuous-valued embedding vectors, representing some transformation of the acoustic input

The file format, however, does not distinguish between these cases. The embedding file will be provided in simple text format (ASCII), one line per row (separated by newlines), each column being a numerical value, columns separated by exactly one space (see Output Preparation). The number of rows is not fixed: while you may use a fixed frame rate, no notion of “frame” is presupposed by any of the evaluations, so the length of the sequence of vectors for a given file can be whatever the encoding component of the system deems it to be. The matrix format is unaffected by this choice of representation, and the evaluation software will be applied irrespective of this distinction.
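To make the expected file layout concrete, here is a minimal Python sketch (assuming only numpy; the file names, codebook, and quantization step are purely illustrative, not part of the challenge tooling) that writes both a one-hot and a quantized continuous embedding matrix in this format:

```python
# Minimal sketch: write one embedding matrix in the required ASCII format
# (one row per line, columns separated by a single space). File names and
# the quantization step below are only illustrative.
import numpy as np

def write_embedding(matrix, path, fmt="%.5f"):
    """Save a (time x dim) matrix, one row per line, space-separated."""
    with open(path, "w") as f:
        for row in matrix:
            # Use a fixed numeric format so identical vectors always yield
            # identical character strings (this matters for the bitrate
            # dictionary described below).
            f.write(" ".join(fmt % v for v in row) + "\n")

# Example 1: a one-hot "symbolic" transcription with d = 4 discovered units.
one_hot = np.eye(4)[[0, 0, 2, 3, 1]]          # 5 time steps, 4 units
write_embedding(one_hot, "test_utt_onehot.txt", fmt="%d")

# Example 2: continuous vectors quantized to a small codebook before writing.
codebook = np.array([[-1.0, 0.0], [0.5, 0.5], [1.0, -1.0]])
frames = np.random.randn(7, 2)
nearest = np.argmin(((frames[:, None, :] - codebook[None]) ** 2).sum(-1), axis=1)
write_embedding(codebook[nearest], "test_utt_quantized.txt")
```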

For the bitrate computation, each vector will be processed as a single character string: a dictionary of all the possible values of the symbolic vectors will be constructed over the entire test set, and the index into the dictionary will be used in place of the original value. This will be used to calculate the number of bits transmitted over the duration of the test set, by computing the entropy of the symbolic vector distribution (the average number of bits transmitted per vector) and multiplying it by the total number of vectors. Be careful! We use the term symbolic vectors to remind you that the vectors should ideally belong to a small dictionary of possible values. We will automatically construct this dictionary from the character strings as they appear in the file. This means that the very close vectors ‘1.00000000 -1.0000000’ and ‘1.00000001 -1.0000001’ will be considered distinct. (For that matter, even ‘1 1’ and ‘1.0 1.0’ will be considered different, because the character strings are different.) Therefore, make sure that the numeric values of the vectors are formatted consistently, otherwise you will be penalized with a higher bitrate than you intended.

The two sets of output files (embedding and waveform) will be evaluated separately. The symbolic output will be analyzed using two metrics:

Bit rate.

Here, we assume that the entire set of audio in the test set corresponds to a sequence of vectors $U$ of length $P$: $U=[s_1, ..., s_P]$. The bit rate for $U$ is then $B(U)=\frac{-P\sum_{s \in S} p(s)\,\log_{2}\,p(s)}{D}$, where $S$ is the dictionary of distinct symbols occurring in $U$ and $p(s)$ is the probability of symbol $s$ over the test set. The numerator of $B(U)$ is $P$ times the entropy of the symbols, which gives the ideal number of bits needed to transmit the sequence of symbols $s_{1:P}$. In order to obtain a bitrate, we divide by $D$, the total duration of $U$ in seconds. Note that a fixed frame rate transcription may have a higher bitrate than a ‘textual’ representation due to the repetition of symbols across frames. For instance, the bit rate of a 5 ms framewise gold phonetic transcription is around 450 bits/sec, whereas that of a ‘textual’ transcription is around 60 bits/sec. This indicates that there is a tradeoff between the bitrate and the detail of the information provided to the TTS system, both in number of symbols and in temporal information.
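The following Python sketch illustrates this computation over a set of submitted embedding files. It only mirrors the definition above (the function name and file handling are our own); the official evaluation scripts distributed with the challenge remain authoritative.

```python
# Sketch of the bitrate computation described above (illustrative only).
# Each row of each embedding file is treated as one "symbol" (a character string).
import math
from collections import Counter

def bitrate(embedding_files, total_duration_sec):
    """Estimated bits/second over the whole test set."""
    symbols = []
    for path in embedding_files:
        with open(path) as f:
            # One symbol per line; identical strings count as the same symbol.
            symbols.extend(line.strip() for line in f if line.strip())
    counts = Counter(symbols)
    n = len(symbols)                       # P: total number of vectors
    # Entropy of the empirical symbol distribution, in bits per symbol.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # P * entropy = ideal number of bits for the whole sequence; divide by D.
    return n * entropy / total_duration_sec

# e.g. bitrate(["utt1.txt", "utt2.txt"], total_duration_sec=3.1)
```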

Unit quality.

Since it is unknown whether the discovered representations correspond to large or small linguistic units (phone states, phonemes, features, syllables, etc.), we evaluate them with a theory-neutral score, the machine ABX score, as in the previous Zero Resource challenges [23]. For a frame-wise representation, the machine ABX discriminability between ‘beg’ and ‘bag’ is defined as the probability that A and X are closer than B and X, where A and X are tokens of ‘beg’, B is a token of ‘bag’ (or vice versa), and X is uttered by a different speaker than A and B. The choice of the appropriate distance measure is up to the researcher.
In previous challenges, we used by default the average frame-wise cosine divergence of the representations of the tokens along a DTW-realigned path. The global ABX discriminability score aggregates over the entire set of minimal pairs like ‘beg’-‘bag’ to be found in the test set. We provide in the evaluation package the option of instead using a normalized Levenshtein edit distance. We give ABX scores as error rates. The ABX score of the gold phonetic transcriptions is 0 (perfect), since A, B and X are determined to be “the same” or “different” based on the gold transcription.
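As an illustration of this default distance and of the ABX decision rule, here is a minimal Python sketch (assuming numpy; the function names and the path-length normalization are our own simplifications, and the official evaluation package should be used for actual scoring):

```python
# Illustrative ABX decision with a DTW-aligned frame-wise cosine distance.
import numpy as np

def cosine_dist(x, y):
    """Cosine divergence between two frames (0 = same direction)."""
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12)

def dtw_average_cosine(a, b):
    """Average frame-wise cosine distance along the minimum-cost DTW path."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)    # accumulated distance
    steps = np.zeros((n + 1, m + 1), dtype=int)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_dist(a[i - 1], b[j - 1])
            options = [(cost[i - 1, j], steps[i - 1, j]),
                       (cost[i, j - 1], steps[i, j - 1]),
                       (cost[i - 1, j - 1], steps[i - 1, j - 1])]
            best_cost, best_steps = min(options, key=lambda t: t[0])
            cost[i, j] = d + best_cost
            steps[i, j] = best_steps + 1
    return cost[n, m] / steps[n, m]

def abx_error(a, b, x):
    """1 if X (same category as A) is wrongly closer to B than to A, else 0.
    (Ties could be counted as 0.5; omitted here for brevity.)"""
    return 1.0 if dtw_average_cosine(b, x) < dtw_average_cosine(a, x) else 0.0

# A, B, X are (time x dim) embedding matrices for e.g. 'beg', 'bag', 'beg'.
```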

Synthesis accuracy and quality.

The synthesized wave files will be evaluated by humans (native speakers of the target language) using three judgment tasks.

  1. Judges will evaluate intelligibility by orthographically transcribing the synthesized sentence. Each transcription will then be compared with the gold transcription using the edit distance, yielding a character error rate (character-wise, to take into account possible orthographic variants or phone-level substitutions; a minimal sketch of this computation is given after this list).

  2. Judges will evaluate speaker similarity using a subjective 1-to-5 scale. They will be presented with a sentence from the original target voice, another sentence from the source voice, and a resynthesized sentence from one of the systems. A response of 1 means that the resynthesized sentence has a voice very similar to the target voice, 5 that it has a voice very similar to the source.

  3. Judges will evaluate the overall quality of the synthesis on a 1 to 5 scale, yielding a Mean Opinion Score (MOS).
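As an illustration of the character error rate used in the first task, here is a minimal Python sketch of a character-level Levenshtein edit distance normalized by the length of the gold transcription (the function names and the normalization choice are ours; the actual evaluation is run by the organizers):

```python
# Illustrative character error rate (CER) for judgment task 1: Levenshtein
# edit distance between a judge's transcription and the gold transcription.
def edit_distance(ref, hyp):
    """Character-level Levenshtein distance (dynamic programming, O(len*len))."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,             # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (r != h))) # substitution
        prev = cur
    return prev[-1]

def character_error_rate(ref, hyp):
    return edit_distance(ref, hyp) / max(len(ref), 1)

# e.g. character_error_rate("the cat sat", "the bat sat")  ->  1/11
```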

For the development language, we will provide all of the evaluation software (bitrate and ABX), so that participants can test the adequacy of their full pipeline. We will not provide human evaluations, but we will provide a mock experiment reproducing the conditions of the human judgments, which participants can run themselves to get a sense of what the final evaluation on the test set will look like.

For the test language, only audio will be provided (plus the bitrate evaluation software). All ABX and human evaluations will be run once participants have submitted their embeddings and wav files through the submission portal. An automatic leaderboard will then appear on the challenge’s website.

Baseline, topline and leaderboard

As with the other ZR challenges, the primary aim is to explore interesting new ideas and algorithms, not simply to push numbers at all costs. Our leaderboard will therefore reflect this fact, as TTS without T is first and foremost a problem of tradeoff between the bit rate of the acoustic units and the quality of the synthesis. At one extreme, a voice conversion algorithm would obtain excellent synthesis, but at the cost of a very high bit rate. At the other extreme, a heavily quantized intermediate representation could provide a very low bit rate, at the cost of poor signal reconstruction. The systems will therefore be placed on a 2D graph, with bit rate on the x axis and synthesis quality on the y axis (see Figure 2).

Figure 2. 2D leaderboard, showing the potential results of the topline, baseline and submitted systems in terms of synthesis quality against bitrate. The quantization examples illustrate the kinds of tradeoffs that can arise between quality of synthesis and bitrate of the embedding.

A baseline system will be provided, consisting of a pipeline with a simple acoustic unit discovery system based on DPGMM and a speech synthesizer based on Merlin. A topline system will also be provided, consisting of an ASR system (Kaldi) piped into a TTS system (Merlin), both trained with supervision (the gold labels). The baseline system will be provided in a virtual machine (the same one containing the evaluation software for the dev language), together with the datasets, upon registration.

The Challenge is run on Codalab. Participants should register by creating a Codalab account and then downloading the datasets and virtual machine (see Getting Started). Each team should submit the symbolic embeddings on which they condition the waveform synthesis part of their model, as well as the synthesized wavs for each test sentence on both dev and test languages. Regarding the embeddings, we will have to trust the participants that this is the ONLY INFORMATION that they use to generate the waveform. (Otherwise, this challenge becomes ordinary voice transfer, not TTS.)

For reproducibility purposes (and to allow checking that the model truly runs on the submitted discrete embeddings), we strongly encourage submission of the code, binary executable, or virtual machine for the proposed system. The leaderboard will indicate, with an “open source/open science” tag, whether a submission includes its code. Because the evaluations involve human annotators, each team is allowed at most two submissions. The evaluation results will be returned to each team by email, and the leaderboard will be published once the paper submission deadline has passed.

Timeline

This challenge has been proposed as an Interspeech special session to be held during the conference (outcome known Jan 1). Attention: due to the time needed for us to run the human evaluations on the resynthesized waveforms, we will require that these waveforms be submitted to our website through Codalab 2 weeks before the Interspeech official abstract submission deadline.

Date              Event
Dec 15, 2018      Release of the website (English datasets and evaluation code)
Feb 4, 2019       Release of the Austronesian datasets (test language) and CodaLab competition
March 15, 2019    Wave file submission deadline (CodaLab)
March 25, 2019    Outcome of human evaluations + leaderboard
March 29, 2019    Abstract submission deadline
April 5, 2019     Final paper submission deadline
June 17, 2019     Paper acceptance/rejection notification
July 1, 2019      Camera-ready paper due
Sept 15-19, 2019  Interspeech conference

Scientific committee

  • Laurent Besacier

    • LIG, Univ. Grenoble Alpes, France
    • Automatic speech recognition, processing low-resourced languages, acoustic modeling, speech data collection, machine-assisted language documentation
    • email: laurent.besacier at imag.fr, https://cv.archives-ouvertes.fr/laurent-besacier
  • Alan W. Black

  • Ewan Dunbar

  • Emmanuel Dupoux

    • Ecole des Hautes Etudes en Sciences Sociales, Paris
    • Computational modeling of language acquisition, psycholinguistics, unsupervised learning of linguistic units
    • email: emmanuel.dupoux at gmail.com, http://www.lscp.net/persons/dupoux
  • Lucas Ondel

    • University of Brno,
    • Speech technology, unsupervised learning of linguistic units
    • email: iondel at fit.vutbr.cz
  • Sakti Sakriani

    • Nara Institute of Science and Technology (NAIST)
    • Speech technology, low resources languages, speech translation, spoken dialog systems
    • email:ssakti at is.naist.jp, http://isw3.naist.jp/~ssakti

Challenge Organizing Committee

  • Ewan Dunbar (Organizer)

    Researcher, EHESS,

  • Emmanuel Dupoux (Coordination)

    Researcher, EHESS & Facebook, emmanuel.dupoux@gmail.com

  • Sakti Sakriani (Datasets)

    Researcher, Nara Institute of Science and Technology (NAIST), ssakti at is.naist.jp

  • Xuan-Nga Cao (Datasets)

    Research Engineer, ENS, Paris, ngafrance at gmail.com

  • Mathieu Bernard (Website & Submission)

    Engineer, INRIA, Paris, mathieu.a.bernard at inria.fr

  • Julien Karadayi (Website & Submission)

    Engineer, ENS, Paris, julien.karadayi at gmail.com

  • Juan Benjumea (Datasets & Baselines)

    Engineer, ENS, Paris, jubenjum at gmail.com

  • Lucas Ondel (Baselines)

    Graduate Student, University of Brno & JHU,

  • Robin Algayres (Toplines)

    Engineer, ENS, Paris, rb.algayres at orange.fr

Sponsors

The ZeroSpeech 2019 challenge is funded through Facebook AI Research and an MSR grant. It is endorsed by SIG-UL, a joint ELRA-ISCA Special Interest Group on Under-resourced Languages.