Query-Based Asymmetric Modeling with Decoupled Input–Output Rates for Speech Restoration
Speech restoration in real-world conditions is challenging due to compounded distortions and mismatches between input and desired output rates. Most existing systems assume a fixed and shared input–output rate, relying on external resampling that incurs redundant computation and limits generality. We address this setting by formulating speech restoration under decoupled input–output rates, and propose TF-Restormer, a query-based asymmetric modeling framework. The encoder concentrates analysis on the observed input bandwidth using a time–frequency dual-path architecture, while a lightweight decoder reconstructs missing spectral content via frequency extension queries. This design enables a single model to operate consistently across arbitrary input–output rate pairs without redundant resampling. Experiments across diverse sampling rates, degradations, and operating modes show that TF-Restormer maintains stable restoration behavior and balanced perceptual quality, including in real-time streaming scenarios.
Asymmetric encoder-decoder framework: Heavy processing on input bandwidth (encoder), lightweight frequency extension (decoder)
Frequency extension queries: learnable frequency patterns $\mathbf{q}_{\text{ext}}$ shared across frames. These enable bandwidth extension from the encoder bandwidth $F_E$ to the decoder bandwidth $F_D$ without resampling.
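To make the role of the extension queries concrete, here is a minimal sketch of how the decoder's input could be assembled. It assumes that $F_E$ and $F_D$ count frequency bins, that features are stored as frames × bins × channels, and that the queries are simply appended after the observed bins. The function name `build_decoder_queries` and the data layout are hypothetical, not taken from the source.

```python
def build_decoder_queries(encoded, q_ext, num_output_bins):
    """Append shared extension queries to each frame's encoded bins.

    encoded: list of T frames, each a list of F_E feature vectors.
    q_ext: list of learnable query vectors, one per missing bin
           (hypothetical layout, assumed for this sketch).
    num_output_bins: F_D, the number of frequency bins wanted at the output.
    """
    num_input_bins = len(encoded[0])  # F_E
    num_missing = num_output_bins - num_input_bins
    if num_missing < 0:
        raise ValueError("F_D must be at least F_E in this sketch")
    if num_missing > len(q_ext):
        raise ValueError("not enough extension queries for the requested F_D")
    extension = q_ext[:num_missing]
    # The same extension vectors are reused for every frame.
    return [frame + extension for frame in encoded]


# Toy example: T=3 frames, F_E=4 observed bins, C=2 channels, F_D=6.
encoded = [[[float(t), float(f)] for f in range(4)] for t in range(3)]
q_ext = [[0.5, -0.5], [0.25, -0.25], [0.1, -0.1]]
queries = build_decoder_queries(encoded, q_ext, num_output_bins=6)
print(len(queries), len(queries[0]), len(queries[0][0]))  # prints: 3 6 2
```

Because the queries do not depend on the frame index, the number of learned parameters is independent of utterance length, and changing the output rate in this sketch only changes how many queries are appended.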
Five multi-scale discriminators at {20, 40, 60, 80, 100} ms. Because the scales are specified in milliseconds rather than in samples, a single model can be trained across all sampling rates.
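The arithmetic behind duration-defined scales can be sketched as follows, assuming the millisecond values are analysis window lengths (the source does not state this explicitly). The helper `window_lengths_in_samples` and the example sampling rates are illustrative choices.

```python
WINDOW_MS = (20, 40, 60, 80, 100)


def window_lengths_in_samples(sample_rate_hz, window_ms=WINDOW_MS):
    """Convert window durations in milliseconds to lengths in samples."""
    return [sample_rate_hz * ms // 1000 for ms in window_ms]


for sample_rate_hz in (8000, 16000, 48000):
    print(sample_rate_hz, window_lengths_in_samples(sample_rate_hz))
# prints:
# 8000 [160, 320, 480, 640, 800]
# 16000 [320, 640, 960, 1280, 1600]
# 48000 [960, 1920, 2880, 3840, 4800]
```

Each discriminator then covers the same time span of audio at every sampling rate; only the number of samples per window changes.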
Loss: $$\mathcal{L}_s = w_{tf} \cdot \log\left(1 + \frac{|Y_{c,tf} - S_{c,tf}|}{w_{tf}}\right)$$
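For errors that are small relative to $w_{tf}$, the loss is approximately $|Y_{c,tf} - S_{c,tf}|$; for large errors it grows only logarithmically. It therefore emphasizes well-predicted regions and suppresses the influence of large deviations.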
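A small numerical sketch of this loss is given below. It treats $Y_{c,tf}$ and $S_{c,tf}$ as complex spectrogram values and $w_{tf}$ as a positive per-bin scale. How $w_{tf}$ is chosen and how the per-bin terms are reduced are not specified here, so the averaging in `mean_scaled_log_loss` is an assumption, and both function names are hypothetical.

```python
import math


def scaled_log_loss(y, s, w):
    """Per-bin loss w * log(1 + |y - s| / w).

    y, s: predicted and target complex spectrogram values for one bin.
    w: positive scale for that bin (its definition is not given in the text).
    """
    if w <= 0:
        raise ValueError("w must be positive")
    return w * math.log1p(abs(y - s) / w)


def mean_scaled_log_loss(pred, target, weights):
    """Average the per-bin loss over flat lists of bins (reduction is assumed)."""
    terms = [scaled_log_loss(y, s, w) for y, s, w in zip(pred, target, weights)]
    return sum(terms) / len(terms)


# With w = 1, a small error is penalized almost linearly,
# while a large error is strongly compressed.
small = scaled_log_loss(0.01 + 0j, 0j, 1.0)
large = scaled_log_loss(100 + 0j, 0j, 1.0)
print(small, large)
```

The gradient magnitude with respect to the error is $1 / (1 + |Y_{c,tf} - S_{c,tf}| / w_{tf})$, which is close to 1 for small errors and decays toward 0 for large ones, so outlier bins contribute less to the update.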
Speech enhancement comparison on the UNIVERSE dataset against various baseline models: