
Results

The columns can be sorted by clicking the sort icon in each column header. A detailed view of the results is available by clicking the details icon of each row.

The columns are interpreted as follows (see Evaluation metrics for details):

Phonetic (across and within)
- ABX error rate on embeddings
- Scale is $[0, 1]$, lower is better
Lexical and Syntactic
- Mean correct / incorrect classification accuracy
- Scale is $[0, 1]$, higher is better
- For Lexical the all column is the mean accuracy over five frequency bins (based on raw frequency counts in LibriSpeech-960: OOV; 1-5; 6-20; 21-100; 101+), and the in vocab. column leaves out the OOV category. Only the all column was published in the Interspeech summary paper.
Semantic
- Human judgement correlation coefficient ($\times 100$)
- Scale is $[-100, 100]$, far from 0 is better
- Mean score across all datasets
- Semantic (Weighted): Same as Semantic with mean score weighted by the number of pairs in each dataset. Only the unweighted (Semantic) columns were published in the Interspeech summary paper.
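The aggregate columns described above can be sketched as follows. This is a minimal illustration, not the challenge's evaluation code: the function names, the dataset names, and every number are hypothetical, and it assumes that the Lexical all column is an unweighted mean over the five frequency bins.

```python
from statistics import mean

# Hypothetical per-bin lexical accuracies (illustrative values, not real results).
lexical_by_bin = {"OOV": 0.50, "1-5": 0.60, "6-20": 0.70, "21-100": 0.80, "101+": 0.90}

def lexical_all(acc_by_bin):
    """Mean accuracy over all five frequency bins (assumed unweighted)."""
    return mean(acc_by_bin.values())

def lexical_in_vocab(acc_by_bin):
    """Same mean, leaving out the OOV bin."""
    return mean(acc for name, acc in acc_by_bin.items() if name != "OOV")

# Hypothetical datasets: (correlation x 100, number of pairs in the dataset).
semantic_by_dataset = {"dataset_a": (10.0, 100), "dataset_b": (4.0, 300)}

def semantic_unweighted(scores):
    """Plain mean of the per-dataset scores."""
    return mean(score for score, _ in scores.values())

def semantic_weighted(scores):
    """Mean of the per-dataset scores, weighted by the number of pairs."""
    total_pairs = sum(n for _, n in scores.values())
    return sum(score * n for score, n in scores.values()) / total_pairs

print(lexical_all(lexical_by_bin), lexical_in_vocab(lexical_by_bin))
print(semantic_unweighted(semantic_by_dataset), semantic_weighted(semantic_by_dataset))
```

With these made-up inputs, the weighted semantic score is pulled towards the larger dataset, which is why the Semantic and Semantic (Weighted) columns can differ for the same submission.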

					Phonetic (Within)		Phonetic (Across)		Lexical		Syntactic	Semantic		Semantic (Weighted)
#		Author	Budget	Set	clean	other	clean	other	all	in vocab.		synth.	libri.	synth.	libri.