# The Zero Resource Speech Challenge 2020

###### Summary

ZeroSpeech 2020 is a consolidating challenge in which participants submit systems to the ZeroSpeech 2017 (Track 1 or Track 2) or the ZeroSpeech 2019 tasks. Participants are particularly encouraged to submit to multiple tracks/challenges (unit discovery evaluated on both the 2017 and 2019 evaluations, unit discovery used as a basis for spoken term discovery).

For general background information, see the Zerospeech 2017 and Zerospeech 2019 main pages. Changes have been made to the challenge for the 2020 edition. Please read the Instructions carefully for detailed information.

• Datasets

The 2020 edition reuses the training datasets for the 2017 and 2019 challenges. The test datasets have changed to include additional files.

• Baseline and topline

The baseline and topline reference systems will not change, and will be exactly those used in the 2017 and 2019 challenges.

• Evaluation metrics

The evaluation has undergone an overhaul, which fixes bugs, inconsistencies, and speed problems. The bugs and inconsistencies in the 2017 Track 2 task evaluation tool had an impact on the scores. All of the Track 2 metrics should be expected to change somewhat, with the exception of npairs and nwords. See Track 1 for information on the 2017 Track 1 task evaluation, and Updated 2017 Track 2 task evaluation below for information on the updated Track 2 task evaluation.

• Submission format

The submission format for the 2017 Track 1 task has changed. See instructions for detailed information.

• Software

Software is provided for validating submissions and for running the evaluation on the development languages, for all three tasks. Software is provided as a Python 3 (conda) package. See instructions for detailed information. No baseline or topline systems are included in the package (the Docker containing the baseline system for the 2019 Challenge remains available from the Zerospeech 2019 site).

## Timeline

This challenge has been accepted as an Interspeech 2020 special session to be held during the conference. Due to the time needed for us to run the human evaluations on the resynthesized waveforms, we will require that these waveforms be submitted two weeks before the Interspeech official abstract submission deadline.

| Date | Event |
| --- | --- |
| Feb 7, 2020 | Release of competition materials |
| March 2, 2020 | Challenge opens on Codalab |
| April 24, 2020 | Challenge submission deadline |
| May 1, 2020 | Leaderboard published on zerospeech.com |
| July 24, 2020 | Paper acceptance/rejection notification |
| Oct. 26-29, 2020 | Interspeech Conference |

## Updated 2017 Track 2 task evaluation

All of our metrics assume a time-aligned transcription, where $T_{i,j}$ is the (phoneme) transcription corresponding to the speech fragment designated by the pair of indices $\langle i,j \rangle$ (i.e., the speech fragment between frames i and j). If the left or right edge of the fragment contains part of a phoneme, that phoneme is included in the transcription if the covered part is longer than 30 ms or, for phonemes shorter than 60 ms, more than 50% of the phoneme's duration.
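The edge rule can be read as a small predicate. The sketch below is one possible interpretation, not the evaluation toolkit's code: the function name `phoneme_included` and the convention that all times are in seconds are assumptions made for illustration.

```python
def phoneme_included(phone_start, phone_end, frag_start, frag_end):
    """Decide whether a phoneme at a fragment edge belongs to the transcription.

    Hypothetical helper; all times are assumed to be in seconds.
    """
    duration = phone_end - phone_start
    # Length of the part of the phoneme that lies inside the fragment.
    covered = max(0.0, min(phone_end, frag_end) - max(phone_start, frag_start))
    if covered > 0.030:
        return True
    # Short phonemes (< 60 ms) are kept when more than half of them is covered.
    return duration < 0.060 and covered > 0.5 * duration
```

For instance, a 100 ms phoneme of which 40 ms fall inside the fragment is kept, while a 100 ms phoneme with only 20 ms inside is dropped.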

We first define the set related to the output of the discovery algorithm:

• $C_{disc}$ : the set of discovered clusters (a cluster being a set of fragments grouped together).

From these, we can derive:

• $F_{disc}$ : the set of discovered fragments, $F_{disc} = \{ f | f \in c , c \in C_{disc} \}$

• $P_{disc}$ : the set of non overlapping discovered pairs (two fragments a and b overlap if they share more than half of their temporal extension), $P_{disc} = \{ \{a,b\} | a \in c, b \in c, \neg \textrm{overlap}(a,b), c \in C_{disc} \}$

• $P_{disc^*}$ : the set of pairwise substring completions of $P_{disc}$ , which means that we compute all of the possible minimal path realignments of the two strings, and extract all of the substring pairs along the path (e.g., for fragment pair $\langle abcd, efg \rangle$ : $\langle abc, efg \rangle$ , $\langle ab,ef \rangle$ , $\langle bc, fg \rangle$ , $\langle bcd, efg \rangle$ , etc).

• $B_{disc}$ : the set of discovered fragment boundaries (boundaries are defined in terms of i, the index of the nearest phoneme boundary in the transcription if it is less than 30 ms away or, for phonemes shorter than 60 ms, if more than 50% of its duration is covered by a fragment associated with the boundary, and -1 (wrong boundary) otherwise)
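As an illustration of how $F_{disc}$ and $P_{disc}$ follow from $C_{disc}$, here is a minimal sketch. It assumes that every fragment is a `(start, end)` tuple taken from a single recording, and that "more than half of their temporal extension" is measured against the shorter of the two fragments; both are assumptions, and the function names are hypothetical.

```python
from itertools import combinations

def overlap(a, b):
    """True if fragments a and b share more than half of the shorter one (assumed reading)."""
    shared = min(a[1], b[1]) - max(a[0], b[0])
    shorter = min(a[1] - a[0], b[1] - b[0])
    return shared > 0.5 * shorter

def discovered_fragments(clusters):
    """F_disc: every fragment that appears in some cluster."""
    return {f for cluster in clusters for f in cluster}

def discovered_pairs(clusters):
    """P_disc: unordered, non-overlapping fragment pairs drawn from the same cluster."""
    pairs = set()
    for cluster in clusters:
        for a, b in combinations(sorted(set(cluster)), 2):
            if not overlap(a, b):
                pairs.add(frozenset((a, b)))
    return pairs

clusters = [[(0.0, 1.0), (0.4, 1.2), (5.0, 6.0)]]
p_disc = discovered_pairs(clusters)
```

In this toy cluster, the first two fragments overlap too much to form a pair, so only the two pairs involving `(5.0, 6.0)` remain.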

Next, we define the gold sets:

• $F_{all}$ : the set of all possible fragments of size between 3 and 20 phonemes in the corpus.

• $P_{all}$ : the set of all possible non overlapping matching fragment pairs. $P_{all}=\{ \{a,b \}\in F_{all} \times F_{all} | T_{a} = T_{b}, \neg \textrm{overlap}(a,b)\}$ .

• $F_{goldLex}$ : the set of fragments corresponding to the corpus transcribed at the word level (gold transcription).

• $P_{goldLex}$ : the set of matching fragment pairs from $F_{goldLex}$ .

• $B_{gold}$ : the set of boundaries in the parsed corpus.

Most of our measures are defined in terms of precision, recall and F-score. Precision is the probability that an element in a discovered set of entities belongs to the gold set, and recall the probability that a gold entity belongs to the discovered set. The F-score is the harmonic mean of precision and recall.

• $\textrm{Precision}_{disc,gold} = | disc \cap gold | / | disc |$
• $\textrm{Recall}_{disc,gold} = | disc \cap gold | / | gold |$
• $\textrm{F-score}_{disc,gold} = 2 / (1/\textrm{Precision}_{disc,gold} + 1/\textrm{Recall}_{disc,gold})$
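These three definitions translate directly into set operations. The following sketch is illustrative only; the function name is hypothetical, and returning 0.0 when a set is empty or nothing matches is an assumption, since the formulas above leave those cases undefined.

```python
def precision_recall_fscore(disc, gold):
    """Set-based precision, recall and F-score (harmonic mean of the two)."""
    disc, gold = set(disc), set(gold)
    hits = len(disc & gold)
    precision = hits / len(disc) if disc else 0.0
    recall = hits / len(gold) if gold else 0.0
    # 2 / (1/P + 1/R) rewritten as 2PR / (P + R) to avoid dividing by zero.
    fscore = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, fscore

p, r, f = precision_recall_fscore({1, 2, 3, 4}, {3, 4, 5, 6, 7, 8})
```

With 2 shared elements out of 4 discovered and 6 gold, this gives a precision of 1/2, a recall of 1/3, and an F-score of 0.4.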

## Matching quality

Many spoken term discovery systems incorporate a step whereby fragments of speech are realigned and compared. Matching quality measures the accuracy of this process. Here, we use the NED/coverage metrics for evaluating that.

NED and coverage are quick to compute and give a qualitative estimate of the matching step. NED is the Normalised Edit Distance; it is equal to zero when a pair of fragments have exactly the same transcription, and 1 when they differ in all phonemes. Coverage is the fraction of the corpus containing matching pairs that has been discovered.

$$\textrm{NED} = \sum_{\langle x, y\rangle \in P_{disc}} \frac{\textrm{ned}(x, y)}{|P_{disc}|}$$
$$\textrm{Coverage} = \frac{|\textrm{cover}(P_{disc})|}{|\textrm{cover}(P_{all})|}$$

where:

$$\textrm{ned}(\langle i, j \rangle, \langle k, l \rangle) = \frac{\textrm{Levenshtein}(T_{i,j}, T_{k,l})}{\textrm{max}(j-i+1,l-k+1)}$$
$$\textrm{cover}(P) = \bigcup_{\langle i, j \rangle \in \textrm{flat}(P)}[i, j]$$
$$\textrm{flat}(P) = \{p|\exists q:\{p,q\}\in P\}$$
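A compact sketch of NED and coverage follows. It rests on a simplifying assumption that is not stated in the text: fragment indices are inclusive phoneme positions in one transcribed sequence, so the length of a fragment equals the length of its transcription. All function names are hypothetical, and this is not the official evaluation code.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

def transcription(phones, frag):
    """T_{i,j}: phonemes from position i to position j, both included."""
    i, j = frag
    return phones[i:j + 1]

def ned(phones, pairs):
    """Mean normalised edit distance over a collection of fragment pairs."""
    total = 0.0
    for a, b in pairs:
        ta, tb = transcription(phones, a), transcription(phones, b)
        total += levenshtein(ta, tb) / max(len(ta), len(tb))
    return total / len(pairs)

def cover(pairs):
    """Union of the positions spanned by any fragment appearing in a pair."""
    covered = set()
    for pair in pairs:
        for i, j in pair:
            covered.update(range(i, j + 1))
    return covered

def coverage(p_disc, p_all):
    return len(cover(p_disc)) / len(cover(p_all))

phones = "abcdxabcex"
exact = ((0, 2), (5, 7))   # "abc" vs "abc"
close = ((0, 3), (5, 8))   # "abcd" vs "abce"
score = ned(phones, [exact, close])
```

Here the first pair contributes 0 and the second 1/4, so the mean NED over both pairs is 0.125.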

## Clustering Quality

Clustering quality is evaluated using two metrics. The first metric (Grouping precision, recall and F-score) measures the intrinsic quality of the clusters in terms of their phonetic composition. This score is equivalent to the purity and inverse purity scores used for evaluating clustering. Like the Matching score, it is computed over pairs, but unlike the Matching score, it focusses on the covered part of the corpus.

$$\textrm{Grouping precision} = \sum_{t\in\textrm{types}(\textrm{flat}(P_{clus}))} freq(t, P_{clus}) \frac{|\textrm{match}(t, P_{clus} \cap P_{goldclus})|}{|\textrm{match}(t, P_{clus})|}$$
$$\textrm{Grouping recall} = \sum_{t\in\textrm{types}(\textrm{flat}(P_{goldclus}))} freq(t, P_{goldclus}) \frac{|\textrm{match}(t, P_{clus} \cap P_{goldclus})|}{|\textrm{match}(t, P_{goldclus})|}$$

where:

$$P_{clus} = \{\langle \langle i, j\rangle , \langle k, l \rangle\rangle | \exists c\in C_{disc},\langle i, j\rangle\in c \wedge \langle k, l\rangle\in c\}$$
$$P_{goldclus} = \{\langle \langle i, j\rangle , \langle k, l \rangle\rangle | \exists c_1,c_2\in C_{disc}:\langle i, j\rangle\in c_1 \wedge \langle k, l\rangle\in c_2 \wedge T_{i,j}=T_{k,l} \wedge [i,j] \cap [k,l] = \varnothing \}$$

The second metric (Type precision, recall and F-score) takes the true lexicon as the gold cluster set and is therefore much more demanding. Indeed, a system could have very pure clusters, but could systematically missegment words. Since a discovered cluster could have several transcriptions, we use all of them (rather than using some kind of centroid).

$$\textrm{Type precision} = \frac{|\textrm{types}(F_{disc}) \cap \textrm{types}(F_{goldLex})|} {|\textrm{types}(F_{disc})|}$$
$$\textrm{Type recall} = \frac{|\textrm{types}(F_{disc}) \cap \textrm{types}(F_{goldLex})|} {|\textrm{types}(F_{goldLex})|}$$
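To make the role of $\textrm{types}$ concrete, the sketch below collapses fragments into the set of their distinct transcriptions before comparing. It assumes fragments are inclusive `(i, j)` phoneme positions in a single sequence and that `transcribe` is a caller-supplied lookup; both are illustrative assumptions, and the names are hypothetical.

```python
def types(fragments, transcribe):
    """Distinct transcriptions realised by a set of fragments."""
    return {tuple(transcribe(f)) for f in fragments}

def type_scores(f_disc, f_gold_lex, transcribe):
    """Type precision and recall: overlap between discovered and gold word types."""
    disc_types = types(f_disc, transcribe)
    gold_types = types(f_gold_lex, transcribe)
    shared = disc_types & gold_types
    return len(shared) / len(disc_types), len(shared) / len(gold_types)

phones = "thecatthedog"
transcribe = lambda frag: phones[frag[0]:frag[1] + 1]
f_gold_lex = {(0, 2), (3, 5), (6, 8), (9, 11)}   # the, cat, the, dog
f_disc = {(0, 2), (6, 8), (3, 4), (9, 11)}       # the, the, ca, dog
type_precision, type_recall = type_scores(f_disc, f_gold_lex, transcribe)
```

Both occurrences of "the" count as a single type, so the discovered types are {the, ca, dog}, of which two are in the gold lexicon: precision and recall are both 2/3.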

## Parsing Quality

Parsing quality is evaluated using two metrics. The first one (Token precision, recall and F-score) evaluates how many of the word tokens were correctly segmented ( $X = F_{disc}$ , $Y = F_{goldLex}$ ). The second one (Boundary precision, recall and F-score) evaluates how many of the gold word boundaries were found ( $X = B_{disc}$ , $Y = B_{gold}$ ). These two metrics are typically correlated, but researchers usually report the first. We provide Boundary metrics for completeness, and also to enable system diagnostics.

$$\textrm{Token precision} = \frac{|F_{disc}\cap F_{goldLex}|}{|F_{disc}|}$$
$$\textrm{Token recall} = \frac{|F_{disc}\cap F_{goldLex}|}{|F_{goldLex}|}$$
$$\textrm{Boundary precision} = \frac{|B_{disc}\cap B_{gold}|}{|B_{disc}|}$$
$$\textrm{Boundary recall} = \frac{|B_{disc}\cap B_{gold}|}{|B_{gold}|}$$
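The difference between the two parsing metrics shows up on a toy example. This sketch simplifies the definitions above: fragments are assumed to be inclusive `(i, j)` phoneme positions, and a boundary is taken to be a position between phonemes (the start of a fragment, or the position just past its end), with no tolerance window and no -1 label for wrong boundaries. Names are hypothetical.

```python
def boundaries(fragments):
    """Boundary positions: each fragment's start and the position just past its end."""
    return {edge for i, j in fragments for edge in (i, j + 1)}

def precision_recall(disc, gold):
    shared = len(disc & gold)
    return shared / len(disc), shared / len(gold)

f_gold_lex = {(0, 2), (3, 5), (6, 8), (9, 11)}
f_disc = {(0, 2), (6, 8), (3, 4), (9, 11)}   # one word cut short by a phoneme

token_p, token_r = precision_recall(f_disc, f_gold_lex)
bound_p, bound_r = precision_recall(boundaries(f_disc), boundaries(f_gold_lex))
```

The truncated word costs a full token (token precision and recall are 3/4) but only one spurious boundary (boundary precision 5/6, boundary recall 1), which is why Boundary scores are the more lenient of the two.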

## Challenge Organizing Committee

• Ewan Dunbar (Organizer)

Researcher, Université de Paris / Cognitive Machine Learning, ewan.dunbar at univ-paris-diderot.fr

• Emmanuel Dupoux (Coordination)

Researcher, EHESS / Cognitive Machine Learning / Facebook, emmanuel.dupoux at gmail.com

• Mathieu Bernard (Website & Submission)

Engineer, INRIA, Paris, mathieu.a.bernard at inria.fr

• Julien Karadayi (Website & Submission)

Engineer, ENS, Paris, julien.karadayi at gmail.com

## Scientific committee

• Laurent Besacier

  • LIG, Univ. Grenoble Alpes, France
  • Automatic speech recognition, processing low-resourced languages, acoustic modeling, speech data collection, machine-assisted language documentation
  • email: laurent.besacier at imag.fr, https://cv.archives-ouvertes.fr/laurent-besacier

• Alan W. Black

• Ewan Dunbar

• Emmanuel Dupoux

  • Ecole des Hautes Etudes en Sciences Sociales / Cognitive Machine Learning / Facebook AI
  • Computational modeling of language acquisition, psycholinguistics, unsupervised learning of linguistic units
  • email: emmanuel.dupoux at gmail.com, http://www.lscp.net/persons/dupoux

• Lucas Ondel

  • University of Brno
  • Speech technology, unsupervised learning of linguistic units
  • email: iondel at fit.vutbr.cz

• Sakriani Sakti

  • Nara Institute of Science and Technology (NAIST)
  • Speech technology, low-resource languages, speech translation, spoken dialog systems
  • email: ssakti at is.naist.jp, http://isw3.naist.jp/~ssakti

## Acknowledgments

The ZeroSpeech 2020 challenge is hosted on Codalab, an open-source web-based platform for machine learning competitions.

## References

• Dunbar, E., Algayres, R., Karadayi, J., Bernard, M., Benjumea, J., Cao, X.-N., Miskic, L., Dugrain, C., Ondel, L., Black, A. W., Besacier, L., Sakti, S., Dupoux, E. (2019). The Zero Resource Speech Challenge 2019: TTS without T. In INTERSPEECH-2019.

• The references for the 2017 challenge are here.

• The references for the 2019 challenge are here.