Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities
Abstract. Augmenting large language models (LLMs) to understand audio -- including non-speech sounds and non-verbal speech -- is critically important for diverse real-world applications of LLMs. In this paper, we propose Audio Flamingo, a novel audio language model with 1) strong audio understanding abilities, 2) the ability to quickly adapt to unseen tasks via in-context learning and retrieval, and 3) strong multi-turn dialogue abilities. We introduce a series of training techniques, architecture design, and data strategies to enhance our model with these abilities. Extensive evaluations across various audio understanding tasks confirm the efficacy of our method, setting new state-of-the-art benchmarks.
Audio Flamingo not only achieves state-of-the-art results on audio understanding benchmarks, but also understands
audio in depth.
Below we show that Audio Flamingo can understand (1) occurrences in audio, (2) the order of sounds, (3) the quality of
audio, (4) loudness and its change over time, and (5) the distance of sounds.
In-Context Learning and Retrieval Augmented Generation
Audio Flamingo can be adapted to new tasks via in-context few-shot learning and retrieval-augmented
generation, without task-specific fine-tuning.
Below we show how Audio Flamingo can (1) borrow key words from retrieved samples on audio captioning and (2)
disregard retrieved samples if they are noisy and ineffective.
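The retrieval and few-shot steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the embedding vectors, the `<audio>` placeholder token, the task instruction, and the helper names `retrieve` and `build_prompt` are all assumptions made for the example. It only shows the general idea of picking the most similar stored samples and interleaving their captions before the query.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(
    query_emb: Sequence[float],
    store: List[Tuple[Sequence[float], str]],
    k: int = 2,
) -> List[str]:
    """Return the captions of the k stored samples most similar to the query.

    `store` is a hypothetical list of (audio embedding, caption) pairs.
    """
    scored = sorted(
        ((cosine(query_emb, emb), caption) for emb, caption in store),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [caption for _, caption in scored[:k]]


def build_prompt(captions: List[str], task: str = "Describe the audio.") -> str:
    """Interleave retrieved examples before the query.

    Each `<audio>` marks where an audio clip would be attached; the token
    itself is an assumed placeholder, not the model's actual format.
    """
    lines = [f"<audio> {task} {caption}" for caption in captions]
    lines.append(f"<audio> {task}")  # the query clip, left for the model to answer
    return "\n".join(lines)


if __name__ == "__main__":
    store = [
        ([1.0, 0.0], "a dog barks"),
        ([0.0, 1.0], "rain falls"),
        ([0.9, 0.1], "a dog growls"),
    ]
    print(build_prompt(retrieve([1.0, 0.0], store, k=2)))
```

In this sketch the retrieved captions only provide context; whether the model borrows their wording or ignores them is left to the model, as in the examples on this page.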
Audio Flamingo can hold multi-round dialogues with the user.
It understands complex context and captures correlations between rounds; for example, it understands
pronouns properly.
| Audio | Dialogue |
|---|---|
| | User: Can you briefly explain what you hear in the audio?<br>Ground-truth: The audio primarily consists of bird sounds, including chirping and bird vocalizations. There are also sounds of other animals. Additionally, there is a woman speaking at certain intervals.<br>Baseline: Birds are chirping.<br>Audio Flamingo: The audio features a variety of sounds including bird vocalizations, mechanical sounds, impact sounds, a man speaking, and a ticking sound.<br>User: Are the bird sounds continuous?<br>User: Do the bird sounds and the person's speech overlap? |
| | User: Is the male voice rapping or singing?<br>Ground-truth: The male voice is doing both, rapping and singing in the audio.<br>Baseline: The male voice is rapping.<br>Audio Flamingo: The male voice is rapping.<br>User: What kind of music is it?<br>User: Is the male voice the only one present? |
| | User: Is the rattle sound followed immediately by the spray sound?<br>Ground-truth: Yes, the rattle sound is immediately followed by the spray sound with a very short gap in between.<br>Baseline: Yes, the rattle sound is followed immediately by the spray sound.<br>Audio Flamingo: Yes, the rattle sound is followed immediately by the spray sound.<br>User: Does the spray sound occur only once?<br>User: Is the second spray sound longer than the first one? |
| | User: What genre does this music belong to?<br>Ground-truth: This music belongs to the R&B/Disco genre.<br>Baseline: It is not specified in the given audio clip.<br>Audio Flamingo: This music belongs to the R&B genre.<br>User: Can you tell me more about the vocals in this track? |