Speech Enhancement · Real-time Streaming · Open Source

TF-Restormer

Query-Based Asymmetric Modeling with Decoupled Input–Output Rates for Speech Restoration

Abstract

Speech restoration in real-world conditions is challenging due to compounded distortions and mismatches between input and desired output rates. Most existing systems assume a fixed and shared input–output rate, relying on external resampling that incurs redundant computation and limits generality. We address this setting by formulating speech restoration under decoupled input–output rates, and propose TF-Restormer, a query-based asymmetric modeling framework. The encoder concentrates analysis on the observed input bandwidth using a time–frequency dual-path architecture, while a lightweight decoder reconstructs missing spectral content via frequency extension queries. This design enables a single model to operate consistently across arbitrary input–output rate pairs without redundant resampling. Experiments across diverse sampling rates, degradations, and operating modes show that TF-Restormer maintains stable restoration behavior and balanced perceptual quality, including in real-time streaming scenarios.

System Architecture

Asymmetric encoder–decoder framework: heavy processing concentrated on the observed input bandwidth (encoder) and lightweight frequency extension (decoder)

TF-Restormer Architecture

Model Components

TF-Encoder (Heavy Processing)

  • Input: Complex STFT $\mathbf{X} \in \mathbb{R}^{F_E \times T \times 2}$ (real and imaginary channels stacked)
  • Processing: Conv2D(3×3) → LayerNorm → Freq PE
  • Architecture: $B_E = 6$ alternating TF blocks
  • Channels: $C_E = 128$ feature dimensions
  • Purpose: Analyze degraded input, extract clean representations
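The encoder's input tensor can be made concrete with a short numpy sketch. The analysis parameters below (16 kHz input, 32 ms window, 16 ms hop) are illustrative choices, not the paper's configuration; the point is the $(F_E, T, 2)$ layout with real and imaginary parts stacked.

```python
import numpy as np

# Illustrative analysis parameters (not from the paper): 16 kHz input,
# 32 ms window (512 samples), 16 ms hop (256 samples).
sr, n_fft, hop = 16000, 512, 256
F_E = n_fft // 2 + 1  # one-sided bins: 257

x = np.random.randn(sr)  # 1 second of dummy audio
window = np.hanning(n_fft)

# Frame the signal and take the one-sided FFT of each frame.
n_frames = 1 + (len(x) - n_fft) // hop
frames = np.stack([x[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
spec = np.fft.rfft(frames, axis=-1)  # (T, F_E), complex

# Stack real/imag as a trailing channel axis: (F_E, T, 2), as in the text.
X = np.stack([spec.real.T, spec.imag.T], axis=-1)
print(X.shape)
```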

TF-Decoder (Lightweight Extension)

  • Architecture: $B_D = 3$ blocks (half the encoder depth)
  • Channels: $C_D = 64$ feature dimensions
  • Extension Query: $\mathbf{q}_{\text{ext}} \in \mathbb{R}^{(F_D - F_E) \times T \times C_D}$
  • Output: Complex STFT $\mathbf{Y} \in \mathbb{R}^{F_D \times T \times 2}$
  • Purpose: Reconstruct missing high frequencies
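A minimal shape sketch of the frequency extension step, with illustrative names rather than the paper's API: encoder features cover $F_E$ bins, and the decoder appends one frame-shared learnable query per missing bin to reach the full $F_D$ grid.

```python
import numpy as np

# Shapes from the component descriptions; values below are illustrative.
F_E, F_D, T, C_D = 257, 513, 61, 64

enc_feat = np.random.randn(F_E, T, C_D)     # encoder output projected to C_D
q_ext = np.random.randn(F_D - F_E, 1, C_D)  # one learned query per new bin

# The queries are shared across frames, so broadcast them over T and
# concatenate along the frequency axis to form the decoder input.
dec_in = np.concatenate(
    [enc_feat, np.broadcast_to(q_ext, (F_D - F_E, T, C_D))], axis=0
)
print(dec_in.shape)  # (F_D, T, C_D)
```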

Time-Frequency Dual-Path Processing

Time Module
  • Macaron-style: ConvFFN + MHSA + ConvFFN
  • Multi-head attention: $H = 4$ heads
  • Processes $F$ sequences of length $T$
  • Rotary position embeddings (RoPE)
  • Streaming variant: Mamba ($d_{\text{state}} = 16$)
Frequency Module
  • Macaron-style: ConvFFN + MHSA/MHCA + ConvFFN
  • Linformer projection: $\mathbf{A}_h \in \mathbb{R}^{F_{\text{max}} \times F_{\text{proj}}}$, $F_{\text{proj}} = 512$
  • Processes $T$ sequences of length $F$
  • Cross-attention in decoder (encoder K/V)
  • ConvFFN: $K = 7$, SwiGLU activation
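The dual-path axis handling above can be sketched with a toy per-sequence operation standing in for the real modules (MHSA/Mamba in the paper): the time module treats the $(F, T, C)$ tensor as $F$ independent sequences of length $T$, and the frequency module as $T$ sequences of length $F$.

```python
import numpy as np

def seq_norm(feat, axis):
    # Toy stand-in for a sequence module: zero-mean, unit-variance
    # along the given sequence axis. Chosen only to make the axis
    # handling concrete, not the paper's actual operation.
    mu = feat.mean(axis=axis, keepdims=True)
    sd = feat.std(axis=axis, keepdims=True) + 1e-8
    return (feat - mu) / sd

F, T, C = 257, 61, 128
feat = np.random.randn(F, T, C)

# Time module: F independent sequences of length T (along axis 1).
feat = seq_norm(feat, axis=1)
# Frequency module: T independent sequences of length F (along axis 0).
feat = seq_norm(feat, axis=0)

print(feat.shape)  # shape is preserved through both paths
```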

Key Innovations

Extension Queries

Learnable frequency patterns $\mathbf{q}_{\text{ext}}$, shared across time frames, that enable bandwidth extension from $F_E$ to $F_D$ bins without external resampling.

SFI-STFT Discriminator

Five multi-scale STFT discriminators with window lengths of {20, 40, 60, 80, 100} ms. Specifying windows in milliseconds rather than samples makes the discriminators sampling-frequency independent, enabling single-model training across all sampling rates.
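Because the windows are fixed in milliseconds, their length in samples scales with the sampling rate. The arithmetic is trivial but worth making explicit:

```python
# Convert millisecond window lengths to samples for a given sampling rate.
# The {20, ..., 100} ms set is from the text; the rates are just examples.
def window_samples(sr_hz, win_ms=(20, 40, 60, 80, 100)):
    return [int(sr_hz * ms / 1000) for ms in win_ms]

print(window_samples(16000))  # [320, 640, 960, 1280, 1600]
print(window_samples(48000))  # [960, 1920, 2880, 3840, 4800]
```

The same five discriminators therefore analyze comparable time spans regardless of the input rate.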

Scaled Log-Spectral Loss

Loss: $$\mathcal{L}_s = w_{tf} \cdot \log\left(1 + \frac{|Y_{c,tf} - S_{c,tf}|}{w_{tf}}\right)$$
The log scaling is near-linear for small errors, so gradients emphasize well-predicted regions, while large deviations are compressed rather than allowed to dominate the loss.
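A sketch of the per-bin term makes the compression visible. Here $w_{tf}$ is treated as a positive per-bin scale; its exact definition follows the paper and is simply set to 1 below.

```python
import numpy as np

# Scaled log term applied to the complex spectral error magnitude.
# w is the per-bin scale w_tf (exact definition per the paper; here 1.0).
def scaled_log_loss(err_mag, w):
    return w * np.log1p(err_mag / w)

w = 1.0
# Near zero the term is ~linear in the error; for large errors it grows
# only logarithmically, damping the penalty on outlier bins.
print(scaled_log_loss(0.01, w))   # ~0.00995, close to a plain L1 term
print(scaled_log_loss(100.0, w))  # ~4.615, vs 100 for a plain L1 term
```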

Training Strategy

Phase 1: Pretraining
  • Perceptual loss (WavLM-based)
  • Scaled log-spectral loss
  • Complex domain supervision
Phase 2: Adversarial Training
  • LS-GAN loss + Feature matching
  • Human feedback loss (PESQ, UTMOS)
  • Multi-scale SFI discriminators
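The LS-GAN objectives in Phase 2 take the standard least-squares form; the sketch below shows the generic single-discriminator losses, not the paper's exact aggregation over the multi-scale SFI discriminators.

```python
import numpy as np

# Least-squares GAN losses: the discriminator pushes real outputs toward 1
# and fake outputs toward 0; the generator pushes fake outputs toward 1.
def d_loss(d_real, d_fake):
    return np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)

def g_loss(d_fake):
    return np.mean((d_fake - 1.0) ** 2)

# Toy discriminator outputs on a pair of examples.
d_real = np.array([0.9, 1.1])
d_fake = np.array([0.1, -0.1])
print(d_loss(d_real, d_fake))  # 0.02: discriminator is nearly perfect
print(g_loss(d_fake))          # 1.01: generator is far from fooling it
```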

Audio Samples

UNIVERSE Dataset Comparison

Speech enhancement comparison on the UNIVERSE dataset against several baseline models:

Spectrograms (with audio samples) are shown for:
  • Input (Noisy)
  • TF-Restormer (Ours)
  • TF-Restormer Streaming (Ours)
  • Ground Truth (Clean)
Compare with Baseline Models
  • UNIVERSE
  • UNIVERSE++
  • StoRM
  • TF-Locoformer
  • VoiceFixer
  • FINALLY