EmoBox

EmoBox is a multilingual, multi-corpus speech emotion recognition (SER) toolkit designed to streamline research in this field. It is accompanied by a curated benchmark for both intra-corpus and cross-corpus evaluation settings.

The benchmark covers two evaluation settings:

For intra-corpus evaluations, we have devised a systematic approach to data partitioning across various datasets, ensuring that researchers can conduct rigorous and comparable analyses of different SER models.
For cross-corpus evaluations, we use a foundation SER model, emotion2vec, to address annotation discrepancies and create a test set that is balanced in both speaker and emotion distribution, which to the best of our knowledge has not been achieved before in SER research.
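
To make "balanced in speaker and emotion distribution" concrete, here is a minimal sketch of one way such a subset could be drawn. It is not EmoBox's actual procedure: the function name `balanced_subset` and the record keys `id`, `speaker`, and `emotion` are hypothetical, and the emotion2vec-based step that handles annotation discrepancies is not shown. The sketch simply keeps the emotions that every speaker produced and samples the same number of utterances for each (speaker, emotion) pair.

```python
import random
from collections import defaultdict


def balanced_subset(utterances, seed=0):
    """Sample an equal number of utterances per (speaker, emotion) pair.

    `utterances` is a list of dicts with the hypothetical keys
    "id", "speaker" and "emotion". This is an illustrative sketch,
    not the selection procedure used by EmoBox.
    """
    # Group utterances by (speaker, emotion).
    cells = defaultdict(list)
    for utt in utterances:
        cells[(utt["speaker"], utt["emotion"])].append(utt)

    speakers = sorted({spk for spk, _ in cells})
    # Keep only emotions that every speaker produced at least once.
    emotions = sorted(
        emo
        for emo in {e for _, e in cells}
        if all((spk, emo) in cells for spk in speakers)
    )
    if not emotions:
        return []

    # The smallest cell decides how many utterances each pair contributes.
    per_cell = min(len(cells[(spk, emo)]) for spk in speakers for emo in emotions)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset = []
    for spk in speakers:
        for emo in emotions:
            subset.extend(rng.sample(cells[(spk, emo)], per_cell))
    return subset
```

Because every pair contributes the same count, the resulting subset is balanced across speakers and across emotions at the same time, at the cost of discarding utterances from the larger cells and any emotion that some speaker never produced.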

Based on EmoBox, we present intra-corpus SER results for 10 pre-trained speech models on 32 emotion datasets covering 14 languages, and cross-corpus SER results on 4 datasets with fully balanced test sets. To the best of our knowledge, this is the largest SER benchmark to date in terms of both language coverage and data scale. We hope that our toolkit and benchmark can facilitate SER research in the community.