Required Reading for Voice Agent Developers: A Roundup of the Most Cutting-Edge Speech Models of 2024

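Today's recommendation is a GitHub repository created by our community member BoJack. If you are following Voice Agent development and want to know which speech models are at the cutting edge, the list in this repository is well worth your attention.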

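BoJack is pursuing a PhD at Shanghai Jiao Tong University, where his research covers speech multimodality, spoken interaction systems, and self-supervised pre-training. He is also one of the authors of the recently released full-duplex speech model LSLM and the text-to-speech (TTS) model F5-TTS.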

Repository:
https://github.com/ddlBoJack/Awesome-Speech-Language-Model

Awesome-Speech-Language-Model

Papers, code, and resources for speech language models and end-to-end speech dialogue systems.

Universal Speech, Audio and Music Understanding

**Model**

LTU: Listen, Think, and Understand - ICLR 2024

https://arxiv.org/abs/2305.10790

SALMONN: Towards Generic Hearing Abilities for Large Language Models - ICLR 2024

https://arxiv.org/abs/2310.13289

LTU-AS: Joint Audio and Speech Understanding - ASRU 2023

https://arxiv.org/abs/2309.14405

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models - arXiv 2023

https://arxiv.org/abs/2311.07919

Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities - ICML 2024

https://arxiv.org/abs/2402.01831

Qwen2-Audio Technical Report - arXiv 2024

https://arxiv.org/abs/2407.10759

WavLLM: Towards Robust and Adaptive Speech Large Language Model - EMNLP 2024

https://arxiv.org/abs/2404.00656

DiVA: Distilling an End-to-End Voice Assistant Without Instruction Training Data - arXiv 2024

https://arxiv.org/abs/2410.02678

**Benchmark**

Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech - ICASSP 2024

https://arxiv.org/abs/2309.09510

AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension - ACL 2024

https://arxiv.org/abs/2402.07729

SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words - arXiv 2024

https://arxiv.org/abs/2406.13340

AudioBench: A Universal Benchmark for Audio Large Language Models - arXiv 2024

https://arxiv.org/abs/2406.16020

SALMon: A Suite for Acoustic Language Model Evaluation - arXiv 2024

https://arxiv.org/abs/2409.07437

MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark - arXiv 2024

https://www.arxiv.org/abs/2410.19168

Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks - ICLR 2025 open review

https://openreview.net/forum?id=s7lzZpAW7T

End-to-End Speech Dialogue System

**Model**

SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities - EMNLP 2023

https://arxiv.org/abs/2305.11000

GPT-4o Voice Mode - API 2024

https://openai.com/index/hello-gpt-4o/

PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems - EMNLP 2024

VITA: Towards Open-Source Interactive Omni Multimodal LLM - arXiv 2024

https://www.arxiv.org/abs/2408.05211