Commit Graph

5 Commits

Author SHA1 Message Date
8de67c8bd9 Switch to ROCm 6.1 + torch 2.5.1 to fix MIOpen workspace=0 slowness
Some checks failed
Build ROCm Image / build (push) Failing after 11s
ROCm 7.2 + PyTorch 2.11.0 has a bug where PyTorch passes workspace=0 to
MIOpen convolutions, forcing fallback to the slow GemmFwdRest solver.
This caused s3gen.inference to take 15-22s instead of <5s, making
synthesis 3-4x slower than real-time audio playback.

ROCm 6.1 allocates workspace correctly so MIOpen picks fast GEMM solvers
without needing torch.compile workarounds.

Changes:
- Base image: rocm/dev-ubuntu-22.04:7.2 → 6.1
- torch 2.11.0 → 2.5.1 (rocm6.1 wheel index)
- Add pytorch_triton_rocm==3.1.0
- transformers 5.2.0 → 4.46.3, safetensors 0.5.3 → 0.4.0
- s3tokenizer unpinned → 0.3.0
- resemble-perth==1.0.1 directly (v1.0.1 is pip-installable; drop stub)
- Drop Dockerfile perth_stub steps
- Drop torch.compile and timing patches from engine.py (not needed)
- Drop multi-pass warmup from main.py (torch JIT warmup not needed)
- Drop ROCm 7.2-specific env vars from docker-compose.yml

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 17:27:21 -04:00
59731084cd Multi-pass warmup and smaller chunk_size to fix HA timeout
All checks were successful
Build ROCm Image / build (push) Successful in 2m49s
torch.compile with dynamic=True still specializes per shape family on
first call. The warmup was running one text length, leaving real requests
to JIT-compile their own shapes (15-22s for first chunk). HA freezes
because it gets no AudioChunk for 22 seconds.

Fix:
- Run 3 warmup passes (short/medium/long text) so torch.compile builds
  a dynamic shape graph covering the range HA actually sends. Real
  requests then hit a cached compilation and synthesize in 3-8s.
- Reduce default chunk_size from 300 to 120 chars so the first text
  chunk is shorter, producing faster synthesis and earlier first audio.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 15:04:46 -04:00
169e003a34 Fix warmup text length and ve attribute for torch.compile
All checks were successful
Build ROCm Image / build (push) Successful in 3m35s
- Warmup now uses a ~170-char representative sentence so torch.compile
  JIT-compiles for typical token sequence lengths. Previously "Warmup."
  compiled for very short shapes, causing a full re-compile (17s) on the
  first real HA request and pushing total synthesis past 30s.
- Compile model.ve (voice encoder) in addition to s3gen — both are
  convolutional and hit the MIOpen workspace=0 bug.
- Fix _patch_timing: attribute is model.ve not model.voice_encoder,
  so the timing wrap was silently skipping the speaker embedding.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:51:08 -04:00
bdde4a2480 Add startup warmup and make fp16 autocast fault-tolerant
All checks were successful
Build ROCm Image / build (push) Successful in 3m10s
Warmup: run a synthesis before accepting Wyoming connections so MIOpen
benchmarks and caches all conv layer shapes. Without this, the first HA
request triggers hundreds of benchmark runs and times out.

fp16: wrap in try/except so a failed autocast retries in fp32 rather
than dropping the request silently.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 13:48:41 -04:00
16ea2853f5 Initial implementation: Chatterbox TTS with ROCm and Wyoming
All checks were successful
Build ROCm Image / build (push) Successful in 15m27s
Wyoming-only server built around the official chatterbox TTS model.
Includes ROCm/AMD GPU support, sentence-level streaming, config.yaml
management, and Gitea CI for container builds.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 09:51:09 -04:00