Commit Graph

12 Commits

9b62fce5c5 [dev-fp16] Convert model weights to fp16 at load time
Converting t3/s3gen/ve to fp16 once at load time (sketched below) means:
- Warmup runs in fp16, covering the right dtypes for all real requests
- No per-call autocast casting overhead
- ~2x faster matrix ops and convolutions on RDNA 2 hardware
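
A minimal sketch of the load-time cast; the t3/s3gen/ve attribute names
come from this message, but the chatterbox import path and loader
signature are assumptions:

    # Sketch only: the loader API is an assumption, not the confirmed interface.
    from chatterbox.tts import ChatterboxTTS

    def load_model_fp16(device: str = "cuda"):
        model = ChatterboxTTS.from_pretrained(device=device)
        for name in ("t3", "s3gen", "ve"):
            setattr(model, name, getattr(model, name).half())  # one-time fp16 cast
        return model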

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 20:34:33 -04:00
967ed41239 Revert FP16 autocast — increases TTFA on first request
All checks were successful
Build ROCm Image / build (push) Successful in 3m21s
autocast triggers fp16 kernel selection at the first call for each tensor
shape. Since the warmup uses short text, real requests re-trigger
selection and end up slower overall. Keeping FP32 plus the conditionals cache.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 20:30:49 -04:00
29b66e24bb Cache voice conditionals and add FP16 autocast
All checks were successful
Build ROCm Image / build (push) Successful in 3m17s
Voice conditionals (s3tokenizer + voice encoder + mel embeddings) are
expensive to compute but depend only on the reference audio, not the
text. Previously they ran on every synthesis chunk — 3x wasted work for
a 3-chunk request. Now computed once at startup and reused.

Also wrap generate() in torch.amp.autocast(float16) for ~2x speedup on
all model computation (T3 LLM, S3Gen CFM, HiFiGAN vocoder).
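
A sketch of the shape of the change; prepare_conditionals is a
hypothetical name for the one-time prep step, and the loader API is
assumed:

    import torch
    from chatterbox.tts import ChatterboxTTS  # import path assumed

    model = ChatterboxTTS.from_pretrained(device="cuda")
    model.prepare_conditionals("reference.wav")  # hypothetical: run once at startup

    def synthesize_chunk(text: str):
        # Reuses the cached conditionals; fp16 autocast covers T3, CFM, HiFiGAN.
        with torch.amp.autocast("cuda", dtype=torch.float16):
            return model.generate(text)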

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 20:22:13 -04:00
8de67c8bd9 Switch to ROCm 6.1 + torch 2.5.1 to fix MIOpen workspace=0 slowness
Some checks failed
Build ROCm Image / build (push) Failing after 11s
ROCm 7.2 + PyTorch 2.11.0 has a bug where PyTorch passes workspace=0 to
MIOpen convolutions, forcing fallback to the slow GemmFwdRest solver.
This caused s3gen.inference to take 15-22s instead of <5s, making
synthesis 3-4x slower than real-time audio playback.

ROCm 6.1 allocates workspace correctly so MIOpen picks fast GEMM solvers
without needing torch.compile workarounds.

Changes:
- Base image: rocm/dev-ubuntu-22.04:7.2 → 6.1
- torch 2.11.0 → 2.5.1 (rocm6.1 wheel index)
- Add pytorch_triton_rocm==3.1.0
- transformers 5.2.0 → 4.46.3, safetensors 0.5.3 → 0.4.0
- s3tokenizer unpinned → 0.3.0
- resemble-perth==1.0.1 directly (v1.0.1 is pip-installable; drop stub)
- Drop Dockerfile perth_stub steps
- Drop torch.compile and timing patches from engine.py (not needed)
- Drop multi-pass warmup from main.py (torch JIT warmup not needed)
- Drop ROCm 7.2-specific env vars from docker-compose.yml

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 17:27:21 -04:00
169e003a34 Fix warmup text length and ve attribute for torch.compile
All checks were successful
Build ROCm Image / build (push) Successful in 3m35s
- Warmup now uses a ~170-char representative sentence so torch.compile
  JIT-compiles for typical token sequence lengths. Previously "Warmup."
  compiled for very short shapes, causing a full re-compile (17s) on the
  first real HA request and pushing total synthesis past 30s.
- Compile model.ve (voice encoder) in addition to s3gen; both are
  convolutional and hit the MIOpen workspace=0 bug (see the sketch below).
- Fix _patch_timing: attribute is model.ve not model.voice_encoder,
  so the timing wrap was silently skipping the speaker embedding.
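
A sketch of the fix (model is the loaded ChatterboxTTS instance; the
torch.compile of s3gen itself was added in the commit below):

    import torch

    # ve was previously left uncompiled; s3gen was already compiled.
    model.ve = torch.compile(model.ve, dynamic=True)

    WARMUP_TEXT = (
        "This is a representative warmup sentence, long enough that "
        "torch.compile traces the token sequence lengths a typical "
        "Home Assistant announcement will actually produce at runtime."
    )  # close to the ~170-char length described above
    _ = model.generate(WARMUP_TEXT)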

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:51:08 -04:00
5766870304 Fix UnboundLocalError: move torch._dynamo import to module level
All checks were successful
Build ROCm Image / build (push) Successful in 2m39s
An import of torch._dynamo inside a function creates a local binding for
torch that shadows the module-level torch import, breaking all earlier
torch references in the same function scope.
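
The failure mode in miniature (the specific statements are illustrative):

    import torch

    def patch_model():
        # Raises UnboundLocalError: the import below makes 'torch' a local
        # name for the whole function, shadowing the module-level import.
        torch.backends.cudnn.benchmark = True
        import torch._dynamo
        torch._dynamo.config.suppress_errors = True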

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:34:45 -04:00
7babd0584e Replace MIOpen convolution path with torch.compile on s3gen
All checks were successful
Build ROCm Image / build (push) Successful in 2m47s
The GemmFwdRest workspace=0 issue is inherent to the MIOpen path:
PyTorch's ROCm backend does not allocate workspace for convolutions, so
HiFiGAN uses a slow fallback solver regardless of benchmark settings.

torch.compile(s3gen, dynamic=True) replaces MIOpen's conv path with
Triton-generated kernels, bypassing the issue entirely. dynamic=True
handles variable audio lengths without recompiling per request. The warmup
triggers JIT compilation so the first HA request is fast.
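
The core of the change, as described (model is the loaded ChatterboxTTS
instance; "Warmup." was the warmup text at this point):

    import torch

    model.s3gen = torch.compile(model.s3gen, dynamic=True)
    _ = model.generate("Warmup.")  # JIT here, not on first HA request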

Also removes fp16 autocast (Triton handles precision internally) and
cudnn.benchmark (no longer needed without MIOpen convs).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:27:09 -04:00
bdde4a2480 Add startup warmup and make fp16 autocast fault-tolerant
All checks were successful
Build ROCm Image / build (push) Successful in 3m10s
Warmup: run a synthesis before accepting Wyoming connections so MIOpen
benchmarks and caches all conv layer shapes. Without this, the first HA
request triggers hundreds of benchmark runs and times out.

fp16: wrap synthesis in try/except so a failed autocast retries in fp32
rather than silently dropping the request.
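
A sketch of both pieces; synthesize and start_wyoming_server are assumed
names, and model is the loaded ChatterboxTTS instance:

    import torch

    def synthesize(text: str):
        try:
            with torch.amp.autocast("cuda", dtype=torch.float16):
                return model.generate(text)
        except RuntimeError:
            return model.generate(text)  # fp32 retry, don't drop the request

    synthesize("Warmup.")    # benchmark/cache conv shapes before serving
    start_wyoming_server()   # hypothetical: accept connections only after warmup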

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 13:48:41 -04:00
f20699aed3 Add fp16 autocast to synthesis for faster GPU throughput
All checks were successful
Build ROCm Image / build (push) Successful in 2m49s
The 6700 XT has significantly higher fp16 throughput than fp32.
autocast("cuda") uses fp16 for matmuls and convolutions (HiFiGAN,
S3 tokenizer, flow matching) while keeping fp32 for precision-sensitive
ops like softmax and layer norm.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 13:34:21 -04:00
514bbad0e9 Enable cudnn.benchmark to fix MIOpen workspace=0 on convolutions
Some checks failed
Build ROCm Image / build (push) Has been cancelled
Timing showed s3gen.inference (the HiFiGAN vocoder) taking 22s and
reference audio processing ~18s, both dominated by Conv1d ops hitting
the MIOpen fallback.

With benchmark=False (the default), PyTorch passes a ptr=0, size=0
workspace to MIOpen, causing GemmFwdRest to fail and fall back to a slow
path on every call.
With benchmark=True, PyTorch evaluates convolution algorithms with proper
workspace allocation and caches the best result via MIOPEN_USER_DB_PATH.

First inference will be slower while benchmarking; subsequent calls use the cache.
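
In code, the change amounts to the following; the env var comes from
this message, while the cache path is an assumption:

    import os
    import torch

    # Must be set before the first convolution so MIOpen picks it up.
    os.environ.setdefault("MIOPEN_USER_DB_PATH", "/data/miopen-cache")
    torch.backends.cudnn.benchmark = True  # on ROCm this drives MIOpen's find mode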

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 13:24:05 -04:00
bfe20b7742 Add timing instrumentation to pinpoint synthesis bottleneck
All checks were successful
Build ROCm Image / build (push) Successful in 3m21s
Wraps s3tokenizer, voice_encoder, and s3gen.inference with timing logs
so we can see exactly which step is consuming the missing ~33 seconds.
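
A minimal wrapper of the kind described (names and logger are assumptions):

    import functools
    import logging
    import time

    def timed(name, fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                logging.info("%s took %.2fs", name, time.perf_counter() - start)
        return wrapper

    # model: loaded ChatterboxTTS instance
    model.s3gen.inference = timed("s3gen.inference", model.s3gen.inference)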

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 13:09:14 -04:00
16ea2853f5 Initial implementation: Chatterbox TTS with ROCm and Wyoming
All checks were successful
Build ROCm Image / build (push) Successful in 15m27s
Wyoming-only server built around the official Chatterbox TTS model.
Includes ROCm/AMD GPU support, sentence-level streaming, config.yaml
management, and Gitea CI for container builds.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 09:51:09 -04:00