autocast triggers fp16 kernel selection on the first call for each tensor
shape. Since the warmup uses short text, real requests re-trigger
selection and end up slower overall. Keeping FP32 + the conditionals cache.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Voice conditionals (s3tokenizer + voice encoder + mel embeddings) are
expensive to compute but depend only on the reference audio, not the
text. Previously they ran on every synthesis chunk — 3x wasted work for
a 3-chunk request. Now computed once at startup and reused.
Also wrap generate() in torch.amp.autocast(float16) for ~2x speedup on
all model computation (T3 LLM, S3Gen CFM, HiFiGAN vocoder).
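A minimal sketch of the new flow (Engine, prepare_conditionals and
synthesize are illustrative names, not necessarily the real engine API):

    import torch

    class Engine:
        def __init__(self, model, ref_audio_path):
            self.model = model
            # Conditionals depend only on the reference audio, so compute
            # them once at startup and reuse them for every chunk.
            self.model.prepare_conditionals(ref_audio_path)

        def synthesize(self, text_chunk):
            # fp16 autocast over the whole generate() path
            # (T3 LLM, S3Gen CFM, HiFiGAN vocoder).
            with torch.amp.autocast("cuda", dtype=torch.float16):
                return self.model.generate(text_chunk)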
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ROCm 7.2 + PyTorch 2.11.0 has a bug where PyTorch passes workspace=0 to
MIOpen convolutions, forcing fallback to the slow GemmFwdRest solver.
This caused s3gen.inference to take 15-22s instead of <5s, making
synthesis 3-4x slower than real-time audio playback.
ROCm 6.1 allocates workspace correctly so MIOpen picks fast GEMM solvers
without needing torch.compile workarounds.
Changes:
- Base image: rocm/dev-ubuntu-22.04:7.2 → 6.1
- torch 2.11.0 → 2.5.1 (rocm6.1 wheel index)
- Add pytorch_triton_rocm==3.1.0
- transformers 5.2.0 → 4.46.3, safetensors 0.5.3 → 0.4.0
- s3tokenizer unpinned → 0.3.0
- resemble-perth==1.0.1 directly (v1.0.1 is pip-installable; drop stub)
- Drop Dockerfile perth_stub steps
- Drop torch.compile and timing patches from engine.py (not needed)
- Drop multi-pass warmup from main.py (torch JIT warmup not needed)
- Drop ROCm 7.2-specific env vars from docker-compose.yml
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Warmup now uses a ~170-char representative sentence so torch.compile
JIT-compiles for typical token sequence lengths. Previously "Warmup."
compiled for very short shapes, causing a full re-compile (17s) on the
  first real HA request and pushing total synthesis past 30s (see the
  sketch after this list).
- Compile model.ve (voice encoder) in addition to s3gen — both are
convolutional and hit the MIOpen workspace=0 bug.
- Fix _patch_timing: attribute is model.ve not model.voice_encoder,
so the timing wrap was silently skipping the speaker embedding.
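Roughly what the warmup/compile wiring now looks like (the warmup sentence,
prepare() and the generate() call are illustrative; s3gen and ve are the
model attributes named above):

    import torch

    WARMUP_TEXT = (
        "The weather station reports a mild afternoon with light winds, "
        "a high of twenty one degrees, and a small chance of rain after "
        "sunset, so remember to close the windows before leaving."
    )  # a full-length sentence, close to a typical HA announcement

    def prepare(model):
        # Both convolutional submodules hit the MIOpen workspace=0 bug,
        # so compile both, not just s3gen.
        model.s3gen = torch.compile(model.s3gen, dynamic=True)
        model.ve = torch.compile(model.ve, dynamic=True)
        # One full synthesis so torch.compile sees realistic sequence
        # lengths before the first real request arrives.
        model.generate(WARMUP_TEXT)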
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
An import inside a function creates a local binding that shadows the
module-level torch import; Python then treats the name as local for the
whole function body, so torch references earlier in the same function
raise UnboundLocalError.
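Minimal reproduction of the failure mode:

    import torch

    def synthesize():
        x = torch.zeros(4)   # raises UnboundLocalError: the import below
                             # makes "torch" a local name for the whole body
        import torch         # local binding shadows the module-level import
        return x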
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The GemmFwdRest workspace=0 issue is in MIOpen itself — PyTorch's ROCm
backend does not allocate workspace for convolutions, causing HiFiGAN to
use a slow fallback solver regardless of benchmark settings.
torch.compile(s3gen, dynamic=True) replaces MIOpen's conv path with
Triton-generated kernels, bypassing the issue entirely. dynamic=True
handles variable audio lengths without recompiling per request. The warmup
triggers JIT compilation so the first HA request is fast.
Also removes fp16 autocast (Triton handles precision internally) and
cudnn.benchmark (no longer needed without MIOpen convs).
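Reduced to its essentials (setup() is an illustrative name; s3gen is the
model attribute named above):

    import torch

    def setup(model):
        # Route S3Gen's convolutions through Triton-generated kernels;
        # dynamic=True keeps one compiled graph across variable lengths.
        model.s3gen = torch.compile(model.s3gen, dynamic=True)
        # Dropped alongside this change:
        #   torch.backends.cudnn.benchmark = True  (no MIOpen convs in the hot path)
        #   fp16 autocast                          (Triton handles precision internally)
        return model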
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Warmup: run a synthesis before accepting Wyoming connections so MIOpen
benchmarks and caches all conv layer shapes. Without this, the first HA
request triggers hundreds of benchmark runs and times out.
fp16: wrap in try/except so a failed autocast retries in fp32 rather
than dropping the request silently.
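A sketch of the fp16 fallback (generate() stands in for the actual
synthesis call); the warmup simply runs this once before the Wyoming
server starts listening:

    import torch

    def synthesize(model, text):
        try:
            # fp16 first: faster on this GPU when the kernels cooperate.
            with torch.amp.autocast("cuda", dtype=torch.float16):
                return model.generate(text)
        except RuntimeError:
            # Retry in fp32 instead of silently dropping the request.
            return model.generate(text)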
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The 6700 XT has significantly higher fp16 throughput than fp32.
autocast("cuda") uses fp16 for matmuls and convolutions (HiFiGAN,
S3 tokenizer, flow matching) while keeping fp32 for precision-sensitive
ops like softmax and layer norm.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Timing showed s3gen.inference (HiFiGAN vocoder) taking 22s and ref audio
processing ~18s, both dominated by Conv1d ops hitting the MIOpen fallback.
With benchmark=False (default), PyTorch passes a ptr=0, size=0 workspace to
MIOpen, causing GemmFwdRest to fail and fall back to a slow path on every call.
With benchmark=True, PyTorch evaluates convolution algorithms with proper
workspace allocation and caches the best result via MIOPEN_USER_DB_PATH.
The first inference is slower while benchmarking; subsequent calls hit the cache.
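In code this is just the benchmark flag plus a persistent find-db location
(the cache path is illustrative):

    import os
    import torch

    # Persist MIOpen's find-db so benchmark results survive container restarts.
    os.environ.setdefault("MIOPEN_USER_DB_PATH", "/data/miopen")

    # Let MIOpen benchmark conv algorithms with real workspace allocation and
    # cache the winning solver per shape, instead of the workspace=0 fallback.
    torch.backends.cudnn.benchmark = True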
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wraps s3tokenizer, voice_encoder, and s3gen.inference with timing logs
so we can see exactly which step is consuming the missing ~33 seconds.
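The wrapper amounts to something like this (the helper name and log format
are illustrative):

    import logging
    import time

    log = logging.getLogger(__name__)

    def timed(name, fn):
        # Wrap a callable so every invocation logs its wall-clock duration.
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            log.info("%s took %.2fs", name, time.perf_counter() - start)
            return result
        return wrapper

    # e.g. model.s3gen.inference = timed("s3gen.inference", model.s3gen.inference)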
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wyoming-only server built around the official chatterbox TTS model.
Includes ROCm/AMD GPU support, sentence-level streaming, config.yaml
management, and Gitea CI for container builds.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>