Complex Spectral Prediction for Speech Restoration
Speech restoration in real-world conditions is challenging due to compounded distortions such as clipping, band-pass filtering, digital artifacts, noise, reverberation, and low sampling rates. Existing systems, including vocoder-based approaches, often sacrifice signal fidelity, while diffusion models remain impractical for streaming. Moreover, most systems assume a fixed target sampling rate, requiring external resampling that leads to redundant computation.
We present TF-Restormer, an encoder-decoder architecture that concentrates analysis on the observed input bandwidth with a time-frequency dual-path encoder and reconstructs the missing high-frequency bands through a lightweight decoder driven by frequency extension queries. This enables efficient, universal restoration across arbitrary input and output sampling rates without redundant resampling.
To support adversarial training across diverse rates, we introduce a shared sampling-frequency-independent (SFI) STFT discriminator. TF-Restormer further supports streaming with a causal time module, and improves robustness under extreme degradations by injecting spectral inductive bias into the frequency module. Finally, we propose a scaled log-spectral loss that stabilizes optimization under severe conditions while emphasizing well-predicted spectral details.
As a single model across sampling rates, TF-Restormer consistently delivers well-balanced results in signal fidelity, semantic preservation, and perceptual quality for both offline and streaming scenarios.
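The causal time module mentioned above is what permits streaming inference. As a minimal sketch of one way frame-level causality can be enforced (a causal attention mask over time frames; this mechanism and the shape conventions are assumptions for illustration, not the paper's specification):

```python
import torch
import torch.nn as nn

class CausalTimeAttention(nn.Module):
    """Illustrative causal time module: each frame attends only to itself and
    to past frames, which is what allows frame-by-frame streaming inference."""

    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * freq_bins, time_frames, d_model) -- hypothetical layout
        t = x.size(1)
        # True entries above the diagonal block attention to future frames.
        future = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        out, _ = self.attn(x, x, x, attn_mask=future)
        return out
```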
Asymmetric encoder-decoder framework: heavy processing on the input bandwidth (encoder); lightweight frequency extension (decoder).
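A minimal sketch of this asymmetric split, assuming a PyTorch-style pipeline; the module and STFT names are placeholders rather than the released implementation:

```python
import torch

def restore(wav_in, stft_in, istft_out, encoder, decoder):
    """Hypothetical forward pass: heavy dual-path analysis on the observed
    F_E frequency bins, lightweight synthesis of the full F_D bins."""
    X = stft_in(wav_in)      # (B, 2, T, F_E): real/imag spectrum at the input rate
    H = encoder(X)           # time-frequency dual-path encoder works on F_E bins only
    Y = decoder(H)           # light decoder emits (B, 2, T, F_D) with F_D >= F_E
    return istft_out(Y)      # waveform at the target rate, no external resampling
```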
Learnable frequency extension queries $\mathbf{q}_{\text{ext}}$, shared across frames, enable bandwidth extension from the encoder bandwidth $F_E$ to the decoder bandwidth $F_D$ without resampling.
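One way these queries could be realized, as a sketch under the assumption that $\mathbf{q}_{\text{ext}}$ is a learnable embedding per missing frequency bin, broadcast over frames and appended to the encoder features along the frequency axis (shape conventions are illustrative):

```python
import torch
import torch.nn as nn

class FrequencyExtension(nn.Module):
    """Appends learnable per-bin queries so the light decoder can predict the
    missing F_D - F_E high-frequency bins without any resampling step."""

    def __init__(self, f_enc: int, f_dec: int, d_model: int):
        super().__init__()
        assert f_dec >= f_enc
        # q_ext: one learnable vector per extended bin, shared across all frames.
        self.q_ext = nn.Parameter(torch.randn(f_dec - f_enc, d_model) * 0.02)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, time, f_enc, d_model) encoder features on the observed band
        b, t, _, _ = h.shape
        q = self.q_ext.expand(b, t, -1, -1)   # broadcast the queries to every frame
        return torch.cat([h, q], dim=2)       # (batch, time, f_dec, d_model)
```

Decoder attention or convolution across the frequency axis then lets the query positions condition on the observed bins and synthesize the extended band.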
Five multi-scale SFI STFT discriminators with window lengths of {20, 40, 60, 80, 100} ms. This enables adversarial training of a single model across all sampling rates.
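A sketch of why window lengths fixed in milliseconds (rather than in samples) keep the discriminators sampling-frequency independent: the FFT size is derived from each waveform's own rate, so the same five discriminators can score outputs at any sampling rate. The network body is a placeholder, and the ms-to-FFT-size rule is an assumption for illustration.

```python
import torch
import torch.nn as nn

WINDOW_MS = (20, 40, 60, 80, 100)  # five scales, defined in milliseconds, not samples

class STFTDiscriminator(nn.Module):
    """Placeholder fully convolutional discriminator over a magnitude spectrogram,
    so variable frequency/time sizes across sampling rates are handled naturally."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 1, kernel_size=3, padding=1),
        )

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        return self.net(spec.unsqueeze(1))  # (B, F, T) -> (B, 1, F', T') score map

def multi_scale_scores(wav: torch.Tensor, fs: int, discs: nn.ModuleList):
    """Window length in ms -> n_fft in samples depends on fs, so the same
    discriminators apply to 8 kHz, 16 kHz, or 48 kHz waveforms alike."""
    scores = []
    for win_ms, disc in zip(WINDOW_MS, discs):
        n_fft = int(fs * win_ms / 1000)
        spec = torch.stft(wav, n_fft=n_fft, hop_length=n_fft // 4,
                          window=torch.hann_window(n_fft, device=wav.device),
                          return_complex=True).abs()
        scores.append(disc(spec))
    return scores

# Usage: the same discriminators score 16 kHz and 48 kHz outputs alike.
discs = nn.ModuleList(STFTDiscriminator() for _ in WINDOW_MS)
scores_16k = multi_scale_scores(torch.randn(2, 16000), 16000, discs)
scores_48k = multi_scale_scores(torch.randn(2, 48000), 48000, discs)
```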
Loss: $$\mathcal{L}_s = w_{tf} \cdot \log\left(1 + \frac{|Y_{c,tf} - S_{c,tf}|}{w_{tf}}\right)$$
Emphasizes well-predicted spectral regions while suppressing the influence of large deviations, stabilizing optimization under severe degradations.
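A direct reading of the loss above in code; the reduction over bins and the exact construction of the per-bin scale $w_{tf}$ are not specified here, so both are treated as assumptions and $w_{tf}$ is passed in explicitly:

```python
import torch

def scaled_log_spectral_loss(y: torch.Tensor, s: torch.Tensor, w: torch.Tensor,
                             eps: float = 1e-8) -> torch.Tensor:
    """L_s = w_tf * log(1 + |Y_tf - S_tf| / w_tf), reduced by a mean over bins.

    y, s: predicted and target complex spectra, e.g. shape (batch, C, T, F).
    w:    positive per-bin scales w_tf broadcastable to the same shape; how the
          paper constructs w_tf is not restated here, so it is an input.
    """
    dev = (y - s).abs()                      # |Y_{c,tf} - S_{c,tf}|
    loss = w * torch.log1p(dev / (w + eps))  # log compresses large deviations
    return loss.mean()                       # mean reduction assumed for the sketch
```

For small deviations, $w_{tf}\log(1 + |Y - S|/w_{tf}) \approx |Y - S|$, so well-predicted bins keep near-linear (L1-like) gradients, while large deviations grow only logarithmically, matching the stabilizing behavior described above.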
Speech enhancement comparison on the UNIVERSE dataset against various baseline models: