Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities

Abstract. Augmenting large language models (LLMs) to understand audio -- including non-speech sounds and non-verbal speech -- is critically important for diverse real-world applications of LLMs. In this paper, we propose Audio Flamingo, a novel audio language model with 1) strong audio understanding abilities, 2) the ability to quickly adapt to unseen tasks via in-context learning and retrieval, and 3) strong multi-turn dialogue abilities. We introduce a series of training techniques, architecture design, and data strategies to enhance our model with these abilities. Extensive evaluations across various audio understanding tasks confirm the efficacy of our method, setting new state-of-the-art results.

Audio Flamingo not only achieves state-of-the-art results on audio understanding benchmarks, but also understands audio in depth.
Below we show that Audio Flamingo can understand (1) occurrences in audio, (2) the order of sounds, (3) the quality of audio, (4) loudness and its change over time, and (5) the distance of sounds.

In-Context Learning and Retrieval Augmented Generation

Audio Flamingo can be adapted to new tasks via in-context few-shot learning and retrieval augmented generation, without task-specific fine-tuning.
Below we show how Audio Flamingo can (1) borrow key words from retrieved samples on audio captioning and (2) disregard retrieved samples if they are noisy and ineffective.
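
To make the retrieval step concrete, here is a minimal sketch of how retrieved samples could be selected and placed in front of a query as few-shot context. It is an illustration under stated assumptions, not Audio Flamingo's actual pipeline: the embedding vectors, the `<audio>` placeholder token, the prompt layout, and the function names `cosine`, `retrieve`, and `build_prompt` are all hypothetical. The sketch only covers nearest-neighbor lookup and prompt assembly; deciding whether to borrow from or disregard the retrieved captions is left to the model, as described above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query_emb, store, k=2):
    """Return the k (similarity, caption) pairs most similar to the query.

    `store` is a list of (embedding, caption) pairs; how the embeddings
    are computed is outside this sketch (assumption).
    """
    scored = [(cosine(query_emb, emb), caption) for emb, caption in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(retrieved, instruction):
    """Lay out retrieved captions as few-shot examples ahead of the query.

    "<audio>" is a hypothetical placeholder marking where audio features
    would be attached; the real prompt format is not given in the source.
    """
    lines = [f"<audio> Example caption: {caption}" for _, caption in retrieved]
    lines.append(f"<audio> {instruction}")
    return "\n".join(lines)


# Toy data: 2-dimensional embeddings invented for illustration only.
store = [
    ([1.0, 0.0], "a dog barks"),
    ([0.9, 0.1], "a dog barks and a car passes"),
    ([0.0, 1.0], "rain falls"),
]
retrieved = retrieve([1.0, 0.05], store, k=2)
prompt = build_prompt(retrieved, "Describe the audio.")
```

In this layout the retrieved captions are visible to the model as context but are not forced into the output, which is compatible with both behaviors shown in the examples: reusing key words from useful neighbors and ignoring noisy ones.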

Audio Flamingo can hold multi-turn dialogues with the user.
It understands complex context and captures correlations between rounds; for example, it understands pronouns properly.

Audio Dialogue
User: Can you briefly explain what you hear in the audio?
Ground-truth: The audio primarily consists of bird sounds, including chirping and bird vocalizations. There are also sounds of other animals. Additionally, there is a woman speaking at certain intervals.
Baseline: Birds are chirping.
Audio Flamingo: The audio features a variety of sounds including bird vocalizations, mechanical sounds, impact sounds, a man speaking, and a ticking sound.

User: Are the bird sounds continuous?
Ground-truth: The bird sounds are not continuous. They occur at different intervals throughout the audio, with some overlap with the woman's speech.
Baseline: Yes, the bird sounds are continuous.
Audio Flamingo: No, the bird sounds are not continuous. They occur at different intervals throughout the audio.

User: Do the bird sounds and the person's speech overlap?
Ground-truth: Yes, there are instances where the bird sounds and the woman's speech overlap.
Baseline: Yes, the bird sounds and the woman's speech overlap.
Audio Flamingo: Yes, the bird sounds and the person's speech do overlap at certain points in the audio.

User: Is the male voice rapping or singing?
Ground-truth: The male voice is doing both, rapping and singing in the audio.
Baseline: The male voice is rapping.
Audio Flamingo: The male voice is rapping.

User: What kind of music is it?
Ground-truth: The type of music isn't specified, but it's accompanied by rapping and male singing.
Baseline: The kind of music is hip-hop.
Audio Flamingo: The music is a rap song.

User: Is the male voice the only one present?
Ground-truth: Yes, the male voice is the only one specified in the audio, both rapping and singing.
Baseline: Yes, the male voice is the only one present.
Audio Flamingo: Yes, the male voice is the only one present in the audio.

User: Is the rattle sound followed immediately by the spray sound?
Ground-truth: Yes, the rattle sound is immediately followed by the spray sound with a very short gap in between.
Baseline: Yes, the rattle sound is followed immediately by the spray sound.
Audio Flamingo: Yes, the rattle sound is followed immediately by the spray sound.

User: Does the spray sound occur only once?
Ground-truth: No, the spray sound occurs twice in the audio.
Baseline: Yes, the spray sound occurs only once.
Audio Flamingo: No, the spray sound occurs twice in the audio.

User: Is the second spray sound longer than the first one?
Ground-truth: Yes, the second spray sound lasts longer than the first one.
Baseline: Yes, the second spray sound is longer than the first one.
Audio Flamingo: Yes, the second spray sound is longer than the first one.

User: What genre does this music belong to?
Ground-truth: This music belongs to the R&B/Disco genre.
Baseline: It is not specified in the given audio clip.
Audio Flamingo: This music belongs to the R&B genre.

User: Can you tell me more about the vocals in this track?
Ground-truth: Sure, the vocals in this track are from a male singer who is singing in a higher register. His voice has been modified with effects, making it sound like a voice sample.
Baseline: The vocals in this track are described as rich and full-bodied, but no further details are given about the specific singer or the lyrics being sung in the audio clip.
Audio Flamingo: The vocals in this track are by a male singer. His voice is deep, resonant, and powerful, adding to the overall emotional impact of the song.