Speech Enhancement · Real-time Streaming · Open Source

TF-Restormer

Complex Spectral Prediction for Speech Restoration

Abstract

Speech restoration in real-world conditions is challenging due to compounded distortions such as clipping, band-pass filtering, digital artifacts, noise, reverberation, and low sampling rates. Existing systems, including vocoder-based approaches, often sacrifice signal fidelity, while diffusion models remain impractical for streaming. Moreover, most systems assume a fixed target sampling rate and therefore require external resampling, which introduces redundant computation.

We present TF-Restormer, an encoder-decoder architecture that concentrates analysis on the input bandwidth with a time-frequency dual-path encoder and reconstructs missing high-frequency bands through a lightweight decoder driven by frequency extension queries. This enables efficient and universal restoration across arbitrary input and output sampling rates without redundant resampling.

To support adversarial training across diverse sampling rates, we introduce a shared sampling-frequency-independent (SFI) STFT discriminator. TF-Restormer further supports streaming with a causal time module and improves robustness under extreme degradations by injecting spectral inductive bias into the frequency module. Finally, we propose a scaled log-spectral loss that stabilizes optimization under severe conditions while emphasizing well-predicted spectral details.

As a single model across sampling rates, TF-Restormer consistently delivers well-balanced results in signal fidelity, semantic preservation, and perceptual quality for both offline and streaming scenarios.

System Architecture

Asymmetric encoder-decoder framework: heavy processing on the input bandwidth (encoder), lightweight frequency extension (decoder)

TF-Restormer Architecture

Model Components

TF-Encoder (Heavy Processing)

  • Input: Complex STFT $\mathbf{X} \in \mathbb{R}^{F_E \times T \times 2}$
  • Processing: Conv2D(3×3) → LayerNorm → Freq PE
  • Architecture: $B_E = 6$ alternating TF blocks
  • Channels: $C_E = 128$ feature dimensions
  • Purpose: Analyze degraded input, extract clean representations
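
A minimal sketch of the encoder stem described above, assuming the complex STFT enters as a 2-channel real tensor and the frequency positional embedding is a learnable per-bin vector (PyTorch; module and argument names are hypothetical):

import torch
import torch.nn as nn

class EncoderStem(nn.Module):
    """Sketch of the TF-Encoder front end: Conv2D(3x3) -> LayerNorm -> frequency PE."""
    def __init__(self, c_e=128, f_e=161):
        super().__init__()
        # Real and imaginary STFT parts enter as 2 input channels.
        self.conv = nn.Conv2d(2, c_e, kernel_size=3, padding=1)
        self.norm = nn.LayerNorm(c_e)
        # Learnable frequency positional embedding, one vector per bin (assumption).
        self.freq_pe = nn.Parameter(torch.zeros(f_e, c_e))

    def forward(self, x):             # x: (batch, F_E, T, 2), complex STFT as real tensor
        x = x.permute(0, 3, 1, 2)     # -> (batch, 2, F_E, T) for Conv2d
        x = self.conv(x)              # -> (batch, C_E, F_E, T)
        x = x.permute(0, 2, 3, 1)     # -> (batch, F_E, T, C_E)
        x = self.norm(x)              # LayerNorm over the channel dimension
        return x + self.freq_pe[None, :, None, :]

The $B_E = 6$ TF blocks then operate on the resulting (batch, F_E, T, C_E) tensor.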

TF-Decoder (Lightweight Extension)

  • Architecture: $B_D = 3$ blocks (half as many as the encoder)
  • Channels: $C_D = 64$ feature dimensions
  • Extension Query: $\mathbf{q}_{\text{ext}} \in \mathbb{R}^{(F_D - F_E) \times T \times C_D}$
  • Output: Complex STFT $\mathbf{Y} \in \mathbb{R}^{F_D \times T \times 2}$
  • Purpose: Reconstruct missing high frequencies
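
A sketch of the decoder flow under these settings: extension queries are appended along the frequency axis, the $B_D$ blocks refine the joint representation, and a linear head emits real/imaginary spectra. A plain Transformer layer stands in for the actual TF blocks (which use cross-attention to encoder keys/values); all names and default sizes are assumptions:

import torch
import torch.nn as nn

class FrequencyExtensionDecoder(nn.Module):
    """Sketch: append learnable extension queries along frequency, run B_D blocks,
    and project to a 2-channel (real/imag) complex STFT with F_D bins."""
    def __init__(self, c_e=128, c_d=64, f_e=161, f_d=481, num_blocks=3):
        super().__init__()
        self.proj_in = nn.Linear(c_e, c_d)   # match encoder width to decoder width
        # Extension queries: one learnable vector per missing bin, shared across frames.
        self.q_ext = nn.Parameter(torch.randn(f_d - f_e, c_d) * 0.02)
        # Plain Transformer layers stand in for the macaron TF blocks of the paper.
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(c_d, nhead=4, batch_first=True)
             for _ in range(num_blocks)])
        self.proj_out = nn.Linear(c_d, 2)    # real/imag output per bin

    def forward(self, enc):                  # enc: (batch, F_E, T, C_E) encoder features
        b, f_e, t, _ = enc.shape
        z = self.proj_in(enc)                                      # (batch, F_E, T, C_D)
        q = self.q_ext[None, :, None, :].expand(b, -1, t, -1)      # broadcast over batch/frames
        z = torch.cat([z, q], dim=1)                               # (batch, F_D, T, C_D)
        z = z.permute(0, 2, 1, 3).reshape(b * t, -1, z.shape[-1])  # attend across frequency
        for blk in self.blocks:
            z = blk(z)
        z = z.reshape(b, t, -1, z.shape[-1]).permute(0, 2, 1, 3)
        return self.proj_out(z)                                    # (batch, F_D, T, 2)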

Time-Frequency Dual-Path Processing

Time Module

  • Macaron-style: ConvFFN + MHSA + ConvFFN
  • Multi-head attention: $H = 4$ heads
  • Processes $F$ sequences of length $T$
  • Rotary position embeddings (RoPE)
  • Streaming variant: Mamba ($d_{\text{state}} = 16$)

Frequency Module

  • Macaron-style: ConvFFN + MHSA/MHCA + ConvFFN
  • Linformer projection: $\mathbf{A}_h \in \mathbb{R}^{F_{\text{max}} \times F_{\text{proj}}}$, $F_{\text{proj}} = 512$
  • Processes $T$ sequences of length $F$
  • Cross-attention in decoder (encoder K/V)
  • ConvFFN: kernel size $K = 7$, SwiGLU activation
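
The dual-path idea is largely tensor bookkeeping: the time module treats each frequency bin as a sequence over frames, and the frequency module treats each frame as a sequence over bins. A minimal sketch, with plain Transformer layers standing in for the macaron ConvFFN/attention blocks described above:

import torch
import torch.nn as nn

def dual_path_step(x, time_module, freq_module):
    """x: (batch, F, T, C). Alternate sequence modeling along time and along frequency."""
    b, f, t, c = x.shape

    # Time module: F independent sequences of length T (frames).
    xt = x.reshape(b * f, t, c)
    x = time_module(xt).reshape(b, f, t, c)

    # Frequency module: T independent sequences of length F (bins).
    xf = x.permute(0, 2, 1, 3).reshape(b * t, f, c)
    x = freq_module(xf).reshape(b, t, f, c).permute(0, 2, 1, 3)
    return x

# Example: one encoder-style block with C_E = 128 channels and H = 4 heads.
make_layer = lambda: nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True)
out = dual_path_step(torch.randn(2, 161, 50, 128), make_layer(), make_layer())

In the streaming variant, a causal module (Mamba) replaces the time-axis attention, and the frequency module uses Linformer projections to keep attention cost bounded as $F$ grows.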

Key Innovations

Extension Queries

Learnable frequency patterns $\mathbf{q}_{\text{ext}}$ shared across frames. Enables bandwidth extension from $F_E$ to $F_D$ without resampling.
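
As an illustration of the bin counts involved, assuming a fixed 20 ms analysis window with $F = N_{\text{fft}}/2 + 1$ bins (the actual STFT settings may differ):

$$F_E = \frac{16{,}000 \times 0.02}{2} + 1 = 161, \qquad F_D = \frac{48{,}000 \times 0.02}{2} + 1 = 481,$$

so $F_D - F_E = 320$ extension-query rows cover the 8–24 kHz band that a 16 kHz input cannot represent.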

SFI-STFT Discriminator

Five multi-scale discriminators at {20, 40, 60, 80, 100} ms STFT resolutions. Enables training a single model across all sampling rates.
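
A sketch of the sampling-frequency-independent front end, assuming window and hop are fixed in milliseconds so the FFT size scales with the input rate and the same discriminator weights apply at any rate (function names are hypothetical):

import torch

def multi_scale_stfts(wav, fs, window_ms=(20, 40, 60, 80, 100)):
    """wav: (batch, samples). One magnitude spectrogram per window length,
    with the FFT size derived from the sampling rate."""
    specs = []
    for ms in window_ms:
        n_fft = int(fs * ms / 1000)
        spec = torch.stft(wav, n_fft=n_fft, hop_length=n_fft // 4,
                          window=torch.hann_window(n_fft), return_complex=True)
        specs.append(spec.abs())      # (batch, n_fft // 2 + 1, frames)
    return specs

# Each spectrogram would feed its own discriminator branch.
specs = multi_scale_stfts(torch.randn(2, 48000), fs=48000)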

Scaled Log-Spectral Loss

Loss: $$\mathcal{L}_s = w_{tf} \cdot \log\left(1 + \frac{|Y_{c,tf} - S_{c,tf}|}{w_{tf}}\right)$$
The log compression suppresses the influence of large deviations while emphasizing well-predicted spectral details, which stabilizes optimization under severe degradation.
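
A minimal sketch of this loss, assuming $w_{tf}$ is a positive per-bin weight and the loss is averaged over channels, frames, and bins (PyTorch):

import torch

def scaled_log_spectral_loss(y, s, w, eps=1e-8):
    """y, s: predicted / target complex spectra, shape (batch, C, T, F).
    w: positive per-bin weights, broadcastable to (T, F).
    Large deviations are log-compressed; small ones stay roughly linear."""
    dev = (y - s).abs()
    return (w * torch.log1p(dev / (w + eps))).mean()

# Example with unit weights.
y = torch.randn(1, 2, 50, 481, dtype=torch.complex64)
s = torch.randn(1, 2, 50, 481, dtype=torch.complex64)
loss = scaled_log_spectral_loss(y, s, torch.ones(50, 481))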

Training Strategy

Phase 1: Pretraining

  • Perceptual loss (WavLM-based)
  • Scaled log-spectral loss
  • Complex-domain supervision

Phase 2: Adversarial Training

  • LS-GAN loss + feature matching (see the sketch below)
  • Human feedback loss (PESQ, UTMOS)
  • Multi-scale SFI discriminators
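
For reference, LS-GAN and feature-matching objectives in the generic form used with multi-discriminator setups; this is a sketch and does not reflect the paper's exact weighting or branch structure:

import torch
import torch.nn.functional as F

def lsgan_generator_loss(fake_scores):
    """fake_scores: list of discriminator outputs on generated audio."""
    return sum(((d - 1) ** 2).mean() for d in fake_scores)

def lsgan_discriminator_loss(real_scores, fake_scores):
    """Least-squares GAN objective: real scores pushed to 1, fake scores to 0."""
    return sum(((r - 1) ** 2).mean() + (f ** 2).mean()
               for r, f in zip(real_scores, fake_scores))

def feature_matching_loss(real_feats, fake_feats):
    """L1 distance between intermediate discriminator features.
    real_feats / fake_feats: list (per discriminator) of lists of feature maps."""
    return sum(F.l1_loss(fk, rl.detach())
               for rls, fks in zip(real_feats, fake_feats)
               for rl, fk in zip(rls, fks))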

Audio Samples

UNIVERSE Dataset Comparison

Speech enhancement comparison on the UNIVERSE dataset against various baseline models:

Input (Noisy)

Input Spectrogram

TF-Restormer (Ours)

TF-Restormer Spectrogram

TF-Restormer Streaming (Ours)

TF-Restormer Streaming Spectrogram

Ground Truth (Clean)

Clean Spectrogram
Compare with Baseline Models

UNIVERSE

UNIVERSE Spectrogram

UNIVERSE++

UNIVERSE++ Spectrogram

StoRM

StoRM Spectrogram

TF-Locoformer

TF-Locoformer Spectrogram

VoiceFixer

VoiceFixer Spectrogram

FINALLY

FINALLY Spectrogram