Commit Graph

8 Commits

169e003a34 Fix warmup text length and ve attribute for torch.compile
All checks were successful
Build ROCm Image / build (push) Successful in 3m35s
- Warmup now uses a ~170-char representative sentence so torch.compile
  JIT-compiles for typical token sequence lengths. Previously "Warmup."
  compiled for very short shapes, causing a full re-compile (17s) on the
  first real HA request and pushing total synthesis past 30s.
- Compile model.ve (voice encoder) in addition to s3gen — both are
  convolutional and hit the MIOpen workspace=0 bug.
- Fix _patch_timing: the attribute is model.ve, not model.voice_encoder,
  so the timing wrap was silently skipping the speaker embedding.
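
A minimal sketch of the warmup change, assuming a `model.generate(text)` entry point (the real Chatterbox call may differ):

```python
# Roughly 170 characters, mirroring a typical Home Assistant sentence,
# so torch.compile traces at realistic token sequence lengths instead of
# specializing on the tiny shapes produced by "Warmup."
WARMUP_TEXT = (
    "This is a representative warmup sentence of roughly one hundred "
    "and seventy characters, long enough to compile kernels for the "
    "token sequence lengths typical of Home Assistant requests."
)

def warm(model):
    """Run one synthesis up front so the compiled graphs already exist
    when the first real request arrives (model.generate is assumed)."""
    model.generate(WARMUP_TEXT)
```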

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:51:08 -04:00
5766870304 Fix UnboundLocalError: move torch._dynamo import to module level
All checks were successful
Build ROCm Image / build (push) Successful in 2m39s
An `import torch._dynamo` inside a function creates a local binding for
`torch` that shadows the module-level import, breaking every earlier
`torch` reference in the same function scope.
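
The scoping rule behind this fix can be reproduced with any package; a minimal sketch using `os` in place of `torch`:

```python
import os       # module-level import (stands in for `import torch`)
import os.path  # the fix: import submodules at module level too

def broken():
    # Raises UnboundLocalError: the function-local `import os.path` below
    # binds the name `os` locally for the ENTIRE function scope, shadowing
    # the module-level import even on lines before the import statement.
    sep = os.sep
    import os.path
    return sep

def fixed():
    # With the import moved to module level, `os` resolves normally.
    return os.sep
```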

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:34:45 -04:00
7babd0584e Replace MIOpen convolution path with torch.compile on s3gen
All checks were successful
Build ROCm Image / build (push) Successful in 2m47s
The GemmFwdRest workspace=0 issue is in MIOpen itself — PyTorch's ROCm
backend does not allocate workspace for convolutions, causing HiFiGAN to
use a slow fallback solver regardless of benchmark settings.

torch.compile(s3gen, dynamic=True) replaces MIOpen's conv path with
Triton-generated kernels, bypassing the issue entirely. dynamic=True
handles variable audio lengths without recompiling per request. The warmup
triggers JIT compilation so the first HA request is fast.

Also removes fp16 autocast (Triton handles precision internally) and
cudnn.benchmark (no longer needed without MIOpen convs).
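
A sketch of the call, assuming `model.s3gen` holds the convolutional stack; `torch.compile` wraps the module lazily, so kernels are generated on first call (which is why the warmup matters):

```python
import torch

def compile_s3gen(model):
    # dynamic=True asks Inductor for shape-polymorphic kernels, so
    # variable-length audio does not trigger a recompile per request.
    # On ROCm the generated Triton kernels bypass MIOpen's convolution
    # path entirely.
    model.s3gen = torch.compile(model.s3gen, dynamic=True)
    return model
```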

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:27:09 -04:00
bdde4a2480 Add startup warmup and make fp16 autocast fault-tolerant
All checks were successful
Build ROCm Image / build (push) Successful in 3m10s
Warmup: run a synthesis before accepting Wyoming connections so MIOpen
benchmarks and caches all conv layer shapes. Without this, the first HA
request triggers hundreds of benchmark runs and times out.

fp16: wrap in try/except so a failed autocast retries in fp32 rather
than dropping the request silently.
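
A sketch of the fault-tolerant path, with `model.generate` standing in for the real synthesis call:

```python
import torch

def synthesize(model, text):
    """Try fp16 autocast; on failure retry the same request in fp32."""
    try:
        # Fast path: fp16 autocast for matmuls and convolutions.
        with torch.autocast("cuda", dtype=torch.float16):
            return model.generate(text)
    except RuntimeError:
        # Rather than silently dropping the request, run it once more
        # at full fp32 precision.
        return model.generate(text)
```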

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 13:48:41 -04:00
f20699aed3 Add fp16 autocast to synthesis for faster GPU throughput
All checks were successful
Build ROCm Image / build (push) Successful in 2m49s
The 6700 XT has significantly higher fp16 throughput than fp32.
autocast("cuda") uses fp16 for matmuls and convolutions (HiFiGAN,
S3 tokenizer, flow matching) while keeping fp32 for precision-sensitive
ops like softmax and layer norm.
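
The commit wraps synthesis in `torch.autocast("cuda")`; the portable sketch below uses CPU autocast with bfloat16 (CPU autocast's native dtype) purely so it runs without a GPU, but the selection behavior is the same idea: matmuls execute in reduced precision while the tensors themselves stay fp32.

```python
import torch

a = torch.randn(4, 4)  # created outside autocast: plain fp32
with torch.autocast("cpu", dtype=torch.bfloat16):
    out = a @ a        # matmul is on autocast's reduced-precision list
print(a.dtype, out.dtype)
```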

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 13:34:21 -04:00
514bbad0e9 Enable cudnn.benchmark to fix MIOpen workspace=0 on convolutions
Some checks failed
Build ROCm Image / build (push) Has been cancelled
Timing showed s3gen.inference (HiFiGAN vocoder) taking 22s and ref audio
processing ~18s, both dominated by Conv1d ops hitting the MIOpen fallback.

With benchmark=False (the default), PyTorch passes a ptr=0, size=0 workspace
to MIOpen, causing GemmFwdRest to fail and fall back to a slow path on every call.
With benchmark=True, PyTorch evaluates convolution algorithms with proper
workspace allocation and caches the best result via MIOPEN_USER_DB_PATH.

First inference will be slower while benchmarking; subsequent calls use cache.
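
A sketch of the two settings; the cache directory is an assumption (any writable path that persists across container restarts will do):

```python
import os
import torch

# Persist MIOpen's find-db so benchmark results survive restarts.
# The path here is illustrative, not from the commit.
os.environ.setdefault(
    "MIOPEN_USER_DB_PATH", os.path.expanduser("~/.cache/miopen")
)

# With benchmark=True, PyTorch times candidate conv algorithms using a
# properly sized workspace (instead of the ptr=0/size=0 one) and reuses
# the winning algorithm for every shape it has already seen.
torch.backends.cudnn.benchmark = True
```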

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 13:24:05 -04:00
bfe20b7742 Add timing instrumentation to pinpoint synthesis bottleneck
All checks were successful
Build ROCm Image / build (push) Successful in 3m21s
Wraps s3tokenizer, voice_encoder, and s3gen.inference with timing logs
so we can see exactly which step is consuming the missing ~33 seconds.
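
A generic sketch of the instrumentation; the attribute names being wrapped (e.g. `model.s3gen.inference`) follow the commit, while the helper itself is illustrative:

```python
import functools
import logging
import time

def timed(name, fn):
    """Wrap fn so each call logs its wall-clock duration."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            logging.info("%s took %.2fs", name, time.perf_counter() - start)
    return wrapper

# Example wiring (hypothetical attribute access):
# model.s3gen.inference = timed("s3gen.inference", model.s3gen.inference)
```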

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 13:09:14 -04:00
16ea2853f5 Initial implementation: Chatterbox TTS with ROCm and Wyoming
All checks were successful
Build ROCm Image / build (push) Successful in 15m27s
Wyoming-only server built around the official chatterbox TTS model.
Includes ROCm/AMD GPU support, sentence-level streaming, config.yaml
management, and Gitea CI for container builds.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 09:51:09 -04:00