The Zero Resource Speech Challenge 2021

ZeroSpeech 2021 is a new challenge aimed at Spoken Language Modeling. This task consists of learning language models directly from raw audio in an unknown language, without any annotation or text.

Systems are only allowed to use the raw audio of the training set as input; they can use it to discover discrete units (pseudo-text) and then train a language model on those units, or learn everything end-to-end without discrete units.

The evaluation is done through a suite of 4 black-box, zero-shot metrics probing for the quality of the learned models at 4 linguistic levels: phonetics, lexicon, syntax and semantics.

We provide several baseline systems built as the concatenation of three unsupervised components: self-supervised contrastive representation learning (CPC) [1], clustering (k-means), and language modeling (LSTM or BERT). The language models are trained on the pseudo-text obtained by clustering the learned representations.

To take into account the computing resources of participants, we distinguish two categories of submissions: low budget and high budget. ‘Low budget’ submissions use smaller models (fewer than 30M parameters) that can be trained with a single GPU in a maximum of 3 days. ‘High budget’ submissions use larger models and more GPUs.

Accordingly, our baseline language models are sorted into high and low compute budget categories. This benchmark series is about fostering new ideas, not about getting the best numbers. This is why it is perfectly fine to submit systems in the low budget category, and we encourage participants to do so.

For a detailed presentation of the benchmark, see Nguyen et al. (2020) [2]. Please cite this paper when using the benchmark.

@inproceedings{nguyen2020zerospeech,
author="Nguyen, Tu Anh and de Seyssel, Maureen and Rozé, Patricia and Rivière, Morgane and Kharitonov, Evgeny and Baevski, Alexei and Dunbar, Ewan and Dupoux, Emmanuel",
year="2020",
title="{The Zero Resource Speech Benchmark 2021: Metrics and baselines for unsupervised spoken language modeling}",
booktitle="Self-Supervised Learning for Speech and Audio Processing Workshop @ NeurIPS",
}

Datasets

As training set, participants can use either of the following two datasets:

  • LibriSpeech (the standard subsets: clean-100, clean-360, other-500)
  • Libri-light: small (600h), medium (6kh) or large (60kh).

Participants should only use the audio part of these datasets (no text, no labels!). They can, however, use a special VAD track as well as speaker IDs, which are both provided. Do not forget to mention which of the datasets you use!

We also provide dev and test sets, consisting of short files organized in subdirectories specific to each evaluation metric (see the instructions page).

Baseline

We provide the source code and recipes for training our baseline models on GitHub.

All of our baselines consist of the same pipeline: a discrete encoder followed by a language model. As discrete encoder, we provide a pretrained CPC model (trained on LibriSpeech 960) followed by k-means clustering (trained on LibriSpeech 100). This pretrained encoder used the following resources:

  • CPC: 2-3 days to train on 8 16GB GPUs
  • clustering and quantizing: 2x12h on 1 GPU.

As for the LMs (trained on LibriSpeech 960 encoded into the quantized units), we used the following resources:

  • low budget models: LSTM (22M parameters) and BERT (28M parameters): 60h on 1 GPU
  • high budget model: BERT-base (90M parameters): 40h on 32 GPUs.

Participants are allowed to use and tweak any or all parts of this pipeline.
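To make the shape of this pipeline concrete, here is a minimal sketch in Python of how raw audio could be turned into pseudo-text before language modeling. The `encoder` callable and the function names are illustrative placeholders, not part of the released baseline code; only the overall structure (CPC-style features, k-means quantization, unit sequences treated as text) follows the description above.

```python
# Sketch of the baseline pipeline shape: audio -> continuous frames -> discrete
# units via k-means -> unit sequences used as pseudo-text for language modeling.
import numpy as np
from sklearn.cluster import KMeans


def quantize(frames: np.ndarray, kmeans: KMeans) -> list[int]:
    """Map each row of a (n_frames, dim) representation to its nearest centroid id."""
    return kmeans.predict(frames).tolist()


def audio_to_pseudo_text(wav: np.ndarray, encoder, kmeans: KMeans) -> str:
    """Encode raw audio into a space-separated sequence of discrete unit ids."""
    frames = encoder(wav)            # e.g. CPC features, shape (n_frames, dim); placeholder
    units = quantize(frames, kmeans)
    return " ".join(str(u) for u in units)

# The resulting pseudo-text files can then be fed to any standard LM training
# recipe (LSTM, BERT, ...) exactly as if they were text.
```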

Evaluation metrics

As described in the paper, we use 4 metrics, each focused on a linguistic level: phonetics, lexicon, syntax and semantics. They only require participants to provide either an embedding and a distance metric, or a pseudo-probability for any given input.

  1. Acoustic level (Libri-light ABX). The ABX metric consists in computing, for two speech categories A and B (e.g., ‘bit’ vs ‘bet’), the probability that two sounds belonging to the same category are closer to one another than when they belong to different categories. The score is symmetrized and aggregated across all minimal pairs of triphones like ‘bit’ vs ‘bet’ (where the change only occurs in the middle phoneme) and turned into an error rate. This score can be computed within speaker (a, b and x are spoken by the same speaker) or across speaker (a and b are spoken by the same speaker, and x by a different one). If the phonemes have well separated representations, the ABX error will be close to 0 (5-10% errors correspond to good separation; 20-30%: some signal, but not very good separation, as for MFCC representations; 50%: chance level). A minimal code sketch of this decision rule is given after this list.

  2. Lexical level (sWUGGY spot-the-word). The models are presented with pairs consisting of an existing word (like ‘brick’) and a matching nonword, and have to assign a probability to each. The spot-the-word metric is the average accuracy, across pairs, of classifying the words and nonwords correctly based on these probabilities. Text-based LMs easily reach 95% on this task; chance is 50%.

  3. Semantic level (sSIMI similarity score). Here, the task is to compute the similarity between the representations of pairs of words and compare it to human similarity judgements. The score is a correlation coefficient: if the model perfectly predicts human judgements, the score will be 1; random models will score 0.

  4. Syntactic level (sBLIMP acceptability). As in spot-the-word, the task is to decide which of the two members of a pair of sentences is grammatical, based on the probability assigned to each sentence (e.g., “the dogs sleep” is grammatical, “the dog sleep” is not). The sBLIMP dataset is a synthesized version of the BLiMP dataset and covers a variety of syntactic phenomena. Text LMs trained on LibriSpeech score around 68% correct; humans and large text LMs are in the 80-90% range. Chance is 50%.
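As announced in item 1, here is a minimal sketch of the ABX decision rule, assuming a `dist(u, v)` function between two utterance embeddings (for example one of the pooled distances discussed in the next paragraph). This is an illustration of the principle only; the official evaluation additionally symmetrizes the score by swapping A and B and aggregates it over speakers and minimal pairs.

```python
# Sketch of the ABX decision rule for one pair of speech categories A and B.
def abx_error_rate(category_a, category_b, dist):
    """category_a / category_b: lists of embeddings for the two categories."""
    errors, total = 0.0, 0
    for x in category_a:                 # x is the token to classify
        for a in category_a:
            if a is x:
                continue                 # a must be a different token of A than x
            for b in category_b:
                total += 1
                if dist(a, x) > dist(b, x):
                    errors += 1          # x is closer to the wrong category
                elif dist(a, x) == dist(b, x):
                    errors += 0.5        # ties count as half an error
    return errors / total
```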

Metrics 1 and 3 require that participants first extract an embedding for each test input and specify a (pseudo-)distance to compare two such embeddings. As test inputs may not have the same length (different numbers of speech frames), this requires two decisions: picking a frame-wise distance (for instance the cosine distance, the angular distance, KL divergence, etc.) and a pooling method. For Metric 1, it is customary to average the frame-wise distances along the DTW alignment path as the pooling method. For Metric 3, it may make more sense to use max or mean pooling. We provide scripts to compute these distances and poolings given an embedding for each input file, and participants can use the dev set to select the best embedding, pooling, and distance for their system.
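Below is a minimal sketch, assuming numpy arrays of shape (frames, dim), of two of the pooling strategies mentioned above: mean pooling followed by a cosine distance, and averaging frame-wise cosine distances along a DTW alignment path. It is an illustration, not the official evaluation script.

```python
import numpy as np


def cosine_distance(u, v):
    """Cosine distance between two 1-D vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


def mean_pooled_distance(x, y):
    """Pool each (frames, dim) embedding by its temporal mean, then compare."""
    return cosine_distance(x.mean(axis=0), y.mean(axis=0))


def dtw_pooled_distance(x, y):
    """Average frame-wise cosine distances along the DTW alignment path."""
    n, m = len(x), len(y)
    cost = np.array([[cosine_distance(x[i], y[j]) for j in range(m)] for i in range(n)])
    acc = np.full((n, m), np.inf)        # accumulated DTW cost
    acc[0, 0] = cost[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                acc[i - 1, j] if i > 0 else np.inf,
                acc[i, j - 1] if j > 0 else np.inf,
                acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
            )
            acc[i, j] = cost[i, j] + best_prev
    # Backtrack to recover the alignment path, then average the frame costs on it.
    i, j, path = n - 1, m - 1, [(n - 1, m - 1)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1, j - 1], (i - 1, j - 1)))
        if i > 0:
            candidates.append((acc[i - 1, j], (i - 1, j)))
        if j > 0:
            candidates.append((acc[i, j - 1], (i, j - 1)))
        i, j = min(candidates)[1]
        path.append((i, j))
    return float(np.mean([cost[i, j] for i, j in path]))
```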

Metrics 2 and 4 require a (pseudo-)probability associated with each input. It is up to the participants to provide such a number (it can be any positive or negative float, hence the ‘pseudo’); in our baseline, we compute it by applying various masks to the input and computing the probability of the BERT reconstruction of the pseudo-text hidden behind the mask.
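For illustration, here is a minimal sketch of one way to derive such a pseudo-probability from a masked language model over discrete units, by masking one position at a time and summing the log-probabilities of the hidden units. The exact masking scheme of our baseline differs; the model interface assumed here (a PyTorch module returning logits of shape (1, seq_len, vocab_size)) is an assumption made for the example.

```python
import torch
import torch.nn.functional as F


def masked_lm_pseudo_logprob(model, units, mask_id):
    """Sum log-probabilities of each unit when it is masked out in turn.

    units: list of int unit ids; mask_id: id reserved for the [MASK] token.
    """
    model.eval()
    total = 0.0
    with torch.no_grad():
        for pos in range(len(units)):
            masked = torch.tensor([units], dtype=torch.long)
            masked[0, pos] = mask_id
            logits = model(masked)                      # assumed (1, seq_len, vocab_size)
            log_probs = F.log_softmax(logits[0, pos], dim=-1)
            total += log_probs[units[pos]].item()
    return total  # any float works as a pseudo-probability; higher = more likely
```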

Participants are provided with the scripts to run the 4 metrics on the dev set, and will have to submit their output files to the website to obtain results on the test set during the open phase of the challenge.

Submission format

The submission is in a simple ASCII text format. Each of the 4 metrics requires a particular format, and a validation script checks it before submission. Details are given in Submission format.

Participants are allowed a maximum of 4 submissions in total (across the low and high budget categories).

Timeline

The timeline of this challenge is aligned with Interspeech 2021. We have submitted it as a Special Challenge Session proposal. If accepted, participants will be able to submit their papers to the special session; if not, they can still submit their papers to a regular Interspeech session. We also plan to open a second session for another conference later in the year (e.g., NeurIPS), around mid May-June, to give participants more time to complete experiments, especially in the high budget category.

To give participants more time to address the difficulties of this challenge, we are opening the benchmark code and data ahead of the official Interspeech paper submission deadline. Note that we close the challenge one week before the paper submission deadline to give the organizers time to write a summary paper.

Date                  Event
Nov 25, 2020          Release of competition materials; challenge submissions open
Jan 1, 2021           Announcement of whether the challenge will have a special session
March 22, 2021        Challenge submission deadline
March 26, 2021        Interspeech first deadline
April 2, 2021         Interspeech update deadline
June 2, 2021          Paper acceptance/rejection notification
Aug 30-Sept 3, 2021   Interspeech Conference, Brno

Challenge Organizing Committee

  • Emmanuel Dupoux (Organizer)

    Researcher, EHESS / Cognitive Machine Learning / Facebook, emmanuel.dupoux at gmail.com

  • Ewan Dunbar (Organizer)

    Assistant Professor, University of Toronto, ewan.dunbar at utoronto.ca

  • Mathieu Bernard (Website & Submission)

    Engineer, INRIA, Paris, mathieu.a.bernard at inria.fr

  • Nicolas Hamilakis (Website & Submission)

    Engineer, ENS, Paris, nicolas.hamilakis at inria.fr

  • Maureen de Seyssel (Datasets & Metrics)

    PhD student, INRIA, Paris, maureen.deseyssel at gmail.com

  • Tu Anh Nguyen (Baselines)

    PhD student, INRIA/Facebook, Paris, nguyentuanh208 at gmail.com

Acknowledgments

The ZeroSpeech 2021 Benchmark has been funded by a Facebook gift, a CIFAR grant (Learning in Minds and Brains), and grants from the Agence Nationale de la Recherche (ANR-17-EURE-0017 Frontcog, ANR-10-IDEX-0001-02 PSL*, ANR-19-P3IA-0001 PRAIRIE 3IA Institute) given to E. Dupoux in his EHESS role.

The ZeroSpeech 2021 challenge is hosted on Codalab, an open-source web-based platform for machine learning competitions.


References

[1] van den Oord, A., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.

[2] Nguyen, T.A., de Seyssel, M., Rozé, P., Rivière, M., Kharitonov, E., Baevski, A., Dunbar, E., & Dupoux, E. (2020). The Zero Resource Speech Benchmark 2021: Metrics and baselines for unsupervised spoken language modeling. arXiv preprint arXiv:2011.11588.