Room Impulse Response Generation Conditioned on Acoustic Parameters

¹Dolby Laboratories, ²KTH Royal Institute of Technology
IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA 2025)

Overview of the proposed models with their respective conditioning mechanisms. ARXA: AR transformer with cross-attention; ARCG: AR transformer with classifier guidance; MG: MaskGIT with adaLN; FM: flow matching with in-context conditioning.

Abstract

The generation of room impulse responses (RIRs) using deep neural networks has attracted growing research interest due to its applications in virtual and augmented reality, audio post-production, and related fields. Most existing approaches condition generative models on physical descriptions of a room, such as its size, shape, and surface materials. However, this reliance on geometric information limits their usability in scenarios where the room layout is unknown or when perceptual realism (how a space sounds to a listener) matters more than strict physical accuracy. In this study, we propose an alternative strategy: conditioning RIR generation directly on a set of acoustic parameters. These parameters include several measures of reverberation time and of the direct-to-reverberant ratio, both broadband and bandwise. By specifying how the space should sound rather than how it should look, our method enables more flexible and perceptually driven RIR generation. We explore both autoregressive and non-autoregressive generative models operating in the Descript Audio Codec (DAC) domain, using either discrete token sequences or continuous embeddings. Specifically, we evaluate four model types: an autoregressive transformer with cross-attention, an autoregressive transformer with classifier guidance, the MaskGIT model, and a flow matching model. Objective and subjective evaluations compare these methods with state-of-the-art alternatives. Results show that the proposed models match or outperform them, with the MaskGIT model achieving the best performance.
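
As a concrete illustration of the conditioning quantities, the sketch below estimates the two broadband parameters from a measured RIR: reverberation time, via Schroeder backward integration with a T30 fit, and direct-to-reverberant ratio. This is a minimal sketch, not the paper's code; the 2.5 ms direct-sound window and the -5 dB to -35 dB fitting range are common conventions assumed here, and the bandwise variants would apply the same steps after octave-band filtering.

import numpy as np

def rt60_from_rir(rir, fs):
    """Estimate broadband RT60 via Schroeder backward integration (T30 fit)."""
    energy = np.asarray(rir, dtype=np.float64) ** 2
    edc = np.cumsum(energy[::-1])[::-1]                  # energy decay curve
    edc_db = 10.0 * np.log10(edc / edc[0] + 1e-12)
    i5 = int(np.argmax(edc_db <= -5.0))                  # start of T30 fit range
    i35 = int(np.argmax(edc_db <= -35.0))                # end of T30 fit range
    t = np.arange(len(energy)) / fs
    slope, _ = np.polyfit(t[i5:i35], edc_db[i5:i35], 1)  # decay rate in dB/s
    return -60.0 / slope                                 # extrapolate to -60 dB

def drr_from_rir(rir, fs, direct_ms=2.5):
    """Direct-to-reverberant ratio in dB; the window width is an assumption."""
    rir = np.asarray(rir, dtype=np.float64)
    peak = int(np.argmax(np.abs(rir)))                   # direct-path arrival
    half = int(direct_ms * 1e-3 * fs)
    direct = np.sum(rir[max(0, peak - half):peak + half] ** 2)
    reverb = np.sum(rir[peak + half:] ** 2)
    return 10.0 * np.log10(direct / (reverb + 1e-12))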

Demo

This demo presents four examples from the listening test. For each example, dry audio signals were convolved with real and model-generated room impulse responses (RIRs). Each example includes the reverberant audio obtained with the reference RIR, a DAC-encoded version of that reference signal, a low-quality anchor, and the outputs of the four proposed models and two baseline methods.

For full details on the listening test setup, evaluation procedure, and models, please refer to the paper.
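
For reference, here is a minimal sketch of how such reverberant demo signals can be produced, assuming mono WAV files and the soundfile and scipy packages; the file names are placeholders, not the actual demo assets.

import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

# Placeholder file names; any mono dry recording and RIR
# at the same sample rate will do.
dry, fs = sf.read("dry_input.wav")
rir, fs_rir = sf.read("rir.wav")
assert fs == fs_rir, "resample so the sample rates match"

wet = fftconvolve(dry, rir)             # reverberant signal
wet /= np.max(np.abs(wet)) + 1e-12      # peak-normalize to avoid clipping
sf.write("wet_output.wav", wet, fs)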

Reference


@inproceedings{arellano2025rirgeneration,
    title={Room Impulse Response Generation Conditioned on Acoustic Parameters},
    author={Silvia Arellano and Chunghsin Yeh and Gautam Bhattacharya and Daniel Arteaga},
    booktitle={IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA 2025)},
    year={2025},
    note={Accepted},
    eprint={2507.12136},
    archivePrefix={arXiv},
    primaryClass={cs.SD},
    url={https://arxiv.org/abs/2507.12136}
}

Sound Credits

The dry audio samples were sourced from freesound.org and are used under the following licenses:

  • "handclaps.wav" by Anton – Freesound #345 – Licensed under Creative Commons Attribution 4.0
  • "footsteps shoes walk hard floor stone patio nice.flac" by kyles – Freesound #637555 – Licensed under Creative Commons 0
  • "Piano, Bach Fantasia, A (H6 MS).wav" by InspectorJ – Freesound #411165 – Licensed under Creative Commons Attribution 4.0
  • "ExcessiveExposure.wav" by acclivity – Freesound #33711 – Licensed under Creative Commons Attribution-NonCommercial 4.0