Commit Graph

12 Commits

9b62fce5c5 [dev-fp16] Convert model weights to fp16 at load time
Converting t3/s3gen/ve to fp16 once at load time (sketched below) means:
- Warmup runs in fp16, covering the right dtypes for all real requests
- No per-call autocast casting overhead
- ~2x faster matrix ops and convolutions on RDNA 2 hardware
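
A minimal sketch of the load-time cast; the t3/s3gen/ve attribute names
come from this message, but the chatterbox import path and loader
signature are assumptions:

    # Sketch only: the loader API is an assumption, not the confirmed interface.
    from chatterbox.tts import ChatterboxTTS

    def load_model_fp16(device: str = "cuda"):
        model = ChatterboxTTS.from_pretrained(device=device)
        for name in ("t3", "s3gen", "ve"):
            setattr(model, name, getattr(model, name).half())  # one-time fp16 cast
        return model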

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 20:34:33 -04:00
967ed41239 Revert FP16 autocast — increases TTFA on first request
All checks were successful
Build ROCm Image / build (push) Successful in 3m21s
autocast triggers fp16 kernel selection at the first call for each tensor
shape. Since the warmup uses short text, real requests re-trigger
selection and end up slower overall. Keeping FP32 plus the conditionals cache.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 20:30:49 -04:00
29b66e24bb Cache voice conditionals and add FP16 autocast
All checks were successful
Build ROCm Image / build (push) Successful in 3m17s
Voice conditionals (s3tokenizer + voice encoder + mel embeddings) are
expensive to compute but depend only on the reference audio, not the
text. Previously they ran on every synthesis chunk — 3x wasted work for
a 3-chunk request. Now computed once at startup and reused.

Also wrap generate() in torch.amp.autocast(float16) for ~2x speedup on
all model computation (T3 LLM, S3Gen CFM, HiFiGAN vocoder).
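
A sketch of the shape of the change; prepare_conditionals is a
hypothetical name for the one-time prep step, and the loader API is
assumed:

    import torch
    from chatterbox.tts import ChatterboxTTS  # import path assumed

    model = ChatterboxTTS.from_pretrained(device="cuda")
    model.prepare_conditionals("reference.wav")  # hypothetical: run once at startup

    def synthesize_chunk(text: str):
        # Reuses the cached conditionals; fp16 autocast covers T3, CFM, HiFiGAN.
        with torch.amp.autocast("cuda", dtype=torch.float16):
            return model.generate(text)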

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 20:22:13 -04:00
8de67c8bd9 Switch to ROCm 6.1 + torch 2.5.1 to fix MIOpen workspace=0 slowness
Some checks failed
Build ROCm Image / build (push) Failing after 11s
ROCm 7.2 + PyTorch 2.11.0 has a bug where PyTorch passes workspace=0 to
MIOpen convolutions, forcing fallback to the slow GemmFwdRest solver.
This caused s3gen.inference to take 15-22s instead of <5s, making
synthesis 3-4x slower than real-time audio playback.

ROCm 6.1 allocates workspace correctly so MIOpen picks fast GEMM solvers
without needing torch.compile workarounds.

Changes:
- Base image: rocm/dev-ubuntu-22.04:7.2 → 6.1
- torch 2.11.0 → 2.5.1 (rocm6.1 wheel index)
- Add pytorch_triton_rocm==3.1.0
- transformers 5.2.0 → 4.46.3, safetensors 0.5.3 → 0.4.0
- s3tokenizer unpinned → 0.3.0
- resemble-perth==1.0.1 directly (v1.0.1 is pip-installable; drop stub)
- Drop Dockerfile perth_stub steps
- Drop torch.compile and timing patches from engine.py (not needed)
- Drop multi-pass warmup from main.py (torch JIT warmup not needed)
- Drop ROCm 7.2-specific env vars from docker-compose.yml

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 17:27:21 -04:00
169e003a34 Fix warmup text length and ve attribute for torch.compile
All checks were successful
Build ROCm Image / build (push) Successful in 3m35s
- Warmup now uses a ~170-char representative sentence so torch.compile
  JIT-compiles for typical token sequence lengths. Previously "Warmup."
  compiled for very short shapes, causing a full re-compile (17s) on the
  first real HA request and pushing total synthesis past 30s.
- Compile model.ve (voice encoder) in addition to s3gen; both are
  convolutional and hit the MIOpen workspace=0 bug (see the sketch below).
- Fix _patch_timing: attribute is model.ve not model.voice_encoder,
  so the timing wrap was silently skipping the speaker embedding.
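
A sketch of the fix (model is the loaded ChatterboxTTS instance; the
torch.compile of s3gen itself was added in the commit below):

    import torch

    # ve was previously left uncompiled; s3gen was already compiled.
    model.ve = torch.compile(model.ve, dynamic=True)

    WARMUP_TEXT = (
        "This is a representative warmup sentence, long enough that "
        "torch.compile traces the token sequence lengths a typical "
        "Home Assistant announcement will actually produce at runtime."
    )  # close to the ~170-char length described above
    _ = model.generate(WARMUP_TEXT)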

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:51:08 -04:00
5766870304 Fix UnboundLocalError: move torch._dynamo import to module level
All checks were successful
Build ROCm Image / build (push) Successful in 2m39s
An import of torch._dynamo inside a function creates a local binding for
torch that shadows the module-level torch import, breaking all earlier
torch references in the same function scope.
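
The failure mode in miniature (the specific statements are illustrative):

    import torch

    def patch_model():
        # Raises UnboundLocalError: the import below makes 'torch' a local
        # name for the whole function, shadowing the module-level import.
        torch.backends.cudnn.benchmark = True
        import torch._dynamo
        torch._dynamo.config.suppress_errors = True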

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:34:45 -04:00
7babd0584e Replace MIOpen convolution path with torch.compile on s3gen
All checks were successful
Build ROCm Image / build (push) Successful in 2m47s
The GemmFwdRest workspace=0 issue is inherent to the MIOpen path:
PyTorch's ROCm backend does not allocate workspace for convolutions, so
HiFiGAN uses a slow fallback solver regardless of benchmark settings.

torch.compile(s3gen, dynamic=True) replaces MIOpen's conv path with
Triton-generated kernels, bypassing the issue entirely. dynamic=True
handles variable audio lengths without recompiling per request. The warmup
triggers JIT compilation so the first HA request is fast.
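
The core of the change, as described (model is the loaded ChatterboxTTS
instance; "Warmup." was the warmup text at this point):

    import torch

    model.s3gen = torch.compile(model.s3gen, dynamic=True)
    _ = model.generate("Warmup.")  # JIT here, not on first HA request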

Also removes fp16 autocast (Triton handles precision internally) and
cudnn.benchmark (no longer needed without MIOpen convs).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:27:09 -04:00
bdde4a2480 Add startup warmup and make fp16 autocast fault-tolerant
All checks were successful
Build ROCm Image / build (push) Successful in 3m10s
Warmup: run a synthesis before accepting Wyoming connections so MIOpen
benchmarks and caches all conv layer shapes. Without this, the first HA
request triggers hundreds of benchmark runs and times out.

fp16: wrap synthesis in try/except so a failed autocast retries in fp32
rather than silently dropping the request.
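
A sketch of both pieces; synthesize and start_wyoming_server are assumed
names, and model is the loaded ChatterboxTTS instance:

    import torch

    def synthesize(text: str):
        try:
            with torch.amp.autocast("cuda", dtype=torch.float16):
                return model.generate(text)
        except RuntimeError:
            return model.generate(text)  # fp32 retry, don't drop the request

    synthesize("Warmup.")    # benchmark/cache conv shapes before serving
    start_wyoming_server()   # hypothetical: accept connections only after warmup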

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 13:48:41 -04:00
f20699aed3 Add fp16 autocast to synthesis for faster GPU throughput
All checks were successful
Build ROCm Image / build (push) Successful in 2m49s
The 6700 XT has significantly higher fp16 throughput than fp32.
autocast("cuda") uses fp16 for matmuls and convolutions (HiFiGAN,
S3 tokenizer, flow matching) while keeping fp32 for precision-sensitive
ops like softmax and layer norm.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 13:34:21 -04:00
514bbad0e9 Enable cudnn.benchmark to fix MIOpen workspace=0 on convolutions
Some checks failed
Build ROCm Image / build (push) Has been cancelled
Timing showed s3gen.inference (the HiFiGAN vocoder) taking 22s and
reference audio processing ~18s, both dominated by Conv1d ops hitting
the MIOpen fallback.

With benchmark=False (the default), PyTorch passes a ptr=0, size=0
workspace to MIOpen, causing GemmFwdRest to fail and fall back to a slow
path on every call.
With benchmark=True, PyTorch evaluates convolution algorithms with proper
workspace allocation and caches the best result via MIOPEN_USER_DB_PATH.

First inference will be slower while benchmarking; subsequent calls use the cache.
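
In code, the change amounts to the following; the env var comes from
this message, while the cache path is an assumption:

    import os
    import torch

    # Must be set before the first convolution so MIOpen picks it up.
    os.environ.setdefault("MIOPEN_USER_DB_PATH", "/data/miopen-cache")
    torch.backends.cudnn.benchmark = True  # on ROCm this drives MIOpen's find mode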

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 13:24:05 -04:00
bfe20b7742 Add timing instrumentation to pinpoint synthesis bottleneck
All checks were successful
Build ROCm Image / build (push) Successful in 3m21s
Wraps s3tokenizer, voice_encoder, and s3gen.inference with timing logs
so we can see exactly which step is consuming the missing ~33 seconds.
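
A minimal wrapper of the kind described (names and logger are assumptions):

    import functools
    import logging
    import time

    def timed(name, fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                logging.info("%s took %.2fs", name, time.perf_counter() - start)
        return wrapper

    # model: loaded ChatterboxTTS instance
    model.s3gen.inference = timed("s3gen.inference", model.s3gen.inference)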

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 13:09:14 -04:00
16ea2853f5 Initial implementation: Chatterbox TTS with ROCm and Wyoming
All checks were successful
Build ROCm Image / build (push) Successful in 15m27s
Wyoming-only server built around the official Chatterbox TTS model.
Includes ROCm/AMD GPU support, sentence-level streaming, config.yaml
management, and Gitea CI for container builds.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 09:51:09 -04:00