Commit Graph

8 Commits

169e003a34 Fix warmup text length and ve attribute for torch.compile
All checks were successful
Build ROCm Image / build (push) Successful in 3m35s
- Warmup now uses a ~170-char representative sentence so torch.compile
  JIT-compiles for typical token sequence lengths. Previously "Warmup."
  compiled for very short shapes, causing a full re-compile (17s) on the
  first real HA request and pushing total synthesis past 30s.
- Compile model.ve (voice encoder) in addition to s3gen — both are
  convolutional and hit the MIOpen workspace=0 bug.
- Fix _patch_timing: the attribute is model.ve, not model.voice_encoder,
  so the timing wrap was silently skipping the speaker embedding.
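
A minimal sketch of the warmup change, assuming a `model.generate(text)` entry point (the real Chatterbox call may differ):

```python
# Roughly 170 characters, mirroring a typical Home Assistant sentence,
# so torch.compile traces at realistic token sequence lengths instead of
# specializing on the tiny shapes produced by "Warmup."
WARMUP_TEXT = (
    "This is a representative warmup sentence of roughly one hundred "
    "and seventy characters, long enough to compile kernels for the "
    "token sequence lengths typical of Home Assistant requests."
)

def warm(model):
    """Run one synthesis up front so the compiled graphs already exist
    when the first real request arrives (model.generate is assumed)."""
    model.generate(WARMUP_TEXT)
```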

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:51:08 -04:00
5766870304 Fix UnboundLocalError: move torch._dynamo import to module level
All checks were successful
Build ROCm Image / build (push) Successful in 2m39s
An `import torch._dynamo` inside a function creates a local binding for
`torch` that shadows the module-level import, breaking every earlier
`torch` reference in the same function scope.
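
The scoping rule behind this fix can be reproduced with any package; a minimal sketch using `os` in place of `torch`:

```python
import os       # module-level import (stands in for `import torch`)
import os.path  # the fix: import submodules at module level too

def broken():
    # Raises UnboundLocalError: the function-local `import os.path` below
    # binds the name `os` locally for the ENTIRE function scope, shadowing
    # the module-level import even on lines before the import statement.
    sep = os.sep
    import os.path
    return sep

def fixed():
    # With the import moved to module level, `os` resolves normally.
    return os.sep
```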

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:34:45 -04:00
7babd0584e Replace MIOpen convolution path with torch.compile on s3gen
All checks were successful
Build ROCm Image / build (push) Successful in 2m47s
The GemmFwdRest workspace=0 issue is in MIOpen itself — PyTorch's ROCm
backend does not allocate workspace for convolutions, causing HiFiGAN to
use a slow fallback solver regardless of benchmark settings.

torch.compile(s3gen, dynamic=True) replaces MIOpen's conv path with
Triton-generated kernels, bypassing the issue entirely. dynamic=True
handles variable audio lengths without recompiling per request. The warmup
triggers JIT compilation so the first HA request is fast.

Also removes fp16 autocast (Triton handles precision internally) and
cudnn.benchmark (no longer needed without MIOpen convs).
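
A sketch of the call, assuming `model.s3gen` holds the convolutional stack; `torch.compile` wraps the module lazily, so kernels are generated on first call (which is why the warmup matters):

```python
import torch

def compile_s3gen(model):
    # dynamic=True asks Inductor for shape-polymorphic kernels, so
    # variable-length audio does not trigger a recompile per request.
    # On ROCm the generated Triton kernels bypass MIOpen's convolution
    # path entirely.
    model.s3gen = torch.compile(model.s3gen, dynamic=True)
    return model
```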

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:27:09 -04:00
bdde4a2480 Add startup warmup and make fp16 autocast fault-tolerant
All checks were successful
Build ROCm Image / build (push) Successful in 3m10s
Warmup: run a synthesis before accepting Wyoming connections so MIOpen
benchmarks and caches all conv layer shapes. Without this, the first HA
request triggers hundreds of benchmark runs and times out.

fp16: wrap in try/except so a failed autocast retries in fp32 rather
than dropping the request silently.
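
A sketch of the fault-tolerant path, with `model.generate` standing in for the real synthesis call:

```python
import torch

def synthesize(model, text):
    """Try fp16 autocast; on failure retry the same request in fp32."""
    try:
        # Fast path: fp16 autocast for matmuls and convolutions.
        with torch.autocast("cuda", dtype=torch.float16):
            return model.generate(text)
    except RuntimeError:
        # Rather than silently dropping the request, run it once more
        # at full fp32 precision.
        return model.generate(text)
```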

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 13:48:41 -04:00
f20699aed3 Add fp16 autocast to synthesis for faster GPU throughput
All checks were successful
Build ROCm Image / build (push) Successful in 2m49s
The 6700 XT has significantly higher fp16 throughput than fp32.
autocast("cuda") uses fp16 for matmuls and convolutions (HiFiGAN,
S3 tokenizer, flow matching) while keeping fp32 for precision-sensitive
ops like softmax and layer norm.
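
The commit wraps synthesis in `torch.autocast("cuda")`; the portable sketch below uses CPU autocast with bfloat16 (CPU autocast's native dtype) purely so it runs without a GPU, but the selection behavior is the same idea: matmuls execute in reduced precision while the tensors themselves stay fp32.

```python
import torch

a = torch.randn(4, 4)  # created outside autocast: plain fp32
with torch.autocast("cpu", dtype=torch.bfloat16):
    out = a @ a        # matmul is on autocast's reduced-precision list
print(a.dtype, out.dtype)
```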

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 13:34:21 -04:00
514bbad0e9 Enable cudnn.benchmark to fix MIOpen workspace=0 on convolutions
Some checks failed
Build ROCm Image / build (push) Has been cancelled
Timing showed s3gen.inference (HiFiGAN vocoder) taking 22s and ref audio
processing ~18s, both dominated by Conv1d ops hitting the MIOpen fallback.

With benchmark=False (the default), PyTorch passes a ptr=0, size=0 workspace
to MIOpen, causing GemmFwdRest to fail and fall back to a slow path on every call.
With benchmark=True, PyTorch evaluates convolution algorithms with proper
workspace allocation and caches the best result via MIOPEN_USER_DB_PATH.

First inference will be slower while benchmarking; subsequent calls use cache.
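
A sketch of the two settings; the cache directory is an assumption (any writable path that persists across container restarts will do):

```python
import os
import torch

# Persist MIOpen's find-db so benchmark results survive restarts.
# The path here is illustrative, not from the commit.
os.environ.setdefault(
    "MIOPEN_USER_DB_PATH", os.path.expanduser("~/.cache/miopen")
)

# With benchmark=True, PyTorch times candidate conv algorithms using a
# properly sized workspace (instead of the ptr=0/size=0 one) and reuses
# the winning algorithm for every shape it has already seen.
torch.backends.cudnn.benchmark = True
```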

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 13:24:05 -04:00
bfe20b7742 Add timing instrumentation to pinpoint synthesis bottleneck
All checks were successful
Build ROCm Image / build (push) Successful in 3m21s
Wraps s3tokenizer, voice_encoder, and s3gen.inference with timing logs
so we can see exactly which step is consuming the missing ~33 seconds.
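
A generic sketch of the instrumentation; the attribute names being wrapped (e.g. `model.s3gen.inference`) follow the commit, while the helper itself is illustrative:

```python
import functools
import logging
import time

def timed(name, fn):
    """Wrap fn so each call logs its wall-clock duration."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            logging.info("%s took %.2fs", name, time.perf_counter() - start)
    return wrapper

# Example wiring (hypothetical attribute access):
# model.s3gen.inference = timed("s3gen.inference", model.s3gen.inference)
```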

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 13:09:14 -04:00
16ea2853f5 Initial implementation: Chatterbox TTS with ROCm and Wyoming
All checks were successful
Build ROCm Image / build (push) Successful in 15m27s
Wyoming-only server built around the official chatterbox TTS model.
Includes ROCm/AMD GPU support, sentence-level streaming, config.yaml
management, and Gitea CI for container builds.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 09:51:09 -04:00