- Merge: voice conditionals cache and warmup pre-computation from main
- Add MIOPEN_LOG_LEVEL=2 to suppress GemmFwdRest workspace=0 warnings
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

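As a sketch, the variable this commit adds looks like the following (exact placement, Dockerfile `ENV` vs. the compose `environment` block, is an assumption, not stated in the message):

```shell
# Quiet the non-actionable GemmFwdRest workspace=0 warnings while
# keeping error-level output, as described in this commit.
export MIOPEN_LOG_LEVEL=2
```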
ROCm 7.2 + PyTorch 2.11.0 has a bug where PyTorch passes workspace=0 to
MIOpen convolutions, forcing fallback to the slow GemmFwdRest solver.
This caused s3gen.inference to take 15-22s instead of <5s, making
synthesis 3-4x slower than real-time audio playback.
ROCm 6.1 allocates workspace correctly so MIOpen picks fast GEMM solvers
without needing torch.compile workarounds.
Changes:
- Base image: rocm/dev-ubuntu-22.04:7.2 → 6.1
- torch 2.11.0 → 2.5.1 (rocm6.1 wheel index)
- Add pytorch_triton_rocm==3.1.0
- transformers 5.2.0 → 4.46.3, safetensors 0.5.3 → 0.4.0
- s3tokenizer unpinned → 0.3.0
- Install resemble-perth==1.0.1 directly (v1.0.1 is pip-installable; drop stub)
- Drop Dockerfile perth_stub steps
- Drop torch.compile and timing patches from engine.py (not needed)
- Drop multi-pass warmup from main.py (torch JIT warmup not needed)
- Drop ROCm 7.2-specific env vars from docker-compose.yml
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

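The pins above correspond roughly to an install like the sketch below (the torchaudio version and the rocm6.1 wheel index URL are inferred from the bullet list, not quoted from the Dockerfile):

```shell
pip install torch==2.5.1 torchaudio==2.5.1 pytorch_triton_rocm==3.1.0 \
    --index-url https://download.pytorch.org/whl/rocm6.1
pip install transformers==4.46.3 safetensors==0.4.0 s3tokenizer==0.3.0 \
    resemble-perth==1.0.1
```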
The GemmFwdRest workspace=0 warnings are expected (PyTorch ROCm passes
null workspace; MIOpen falls back to a working solver). They are not
actionable and clutter the logs. Level 2 keeps error-level output.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Two changes:
- ulimits nofile=65536: MIOpen exhaustive search compiles many MLIR
kernels in parallel, each opening temp files in /tmp. Default container
limit (1024) is too low and ld.lld fails with 'too many open files'.
- MIOPEN_DEBUG_CONV_IMPLICIT_GEMM=0: disables the MLIR-based ImplicitGEMM
solvers that generate the failing kernels, leaving Direct/Winograd/GEMM.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

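A minimal sketch of the solver toggle (the docker run flag in the comment mirrors the compose `ulimits` setting; placement is assumed):

```shell
# Disable the MLIR-based ImplicitGEMM solvers whose parallel kernel
# compiles were tripping the open-file limit; Direct/Winograd/GEMM remain.
export MIOPEN_DEBUG_CONV_IMPLICIT_GEMM=0

# The companion fix raises the container's descriptor limit, e.g.:
#   docker run --ulimit nofile=65536:65536 ...
```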
cudnn.benchmark triggers MIOpen's exhaustive kernel search, which then
crashes when writing results to SQLite. Disabling the cache skips the write.
PyTorch's in-memory benchmark cache still applies so warmup results are
reused for all requests within a container run.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The named volume overlay was causing SQLite 'unable to open database file'
crashes. MIOpen's default cache location (~/.config/miopen) works reliably
inside the container. The startup warmup repopulates it each run.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

gfx1031 is not natively supported in ROCm 7.2. Without the override,
the GPU falls back to software emulation, causing 40+ second synthesis.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

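The override value itself is not quoted in the message; for gfx1031 (RX 6700 XT family) the commonly used setting is the gfx1030 ISA spoof below. The exact value is an assumption:

```shell
# Present the gfx1031 GPU as gfx1030, which has native ROCm support.
# 10.3.0 is the community-standard override for this chip (assumed here).
export HSA_OVERRIDE_GFX_VERSION=10.3.0
```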
PyTorch passes a ptr=0, size=0 workspace to MIOpen convolutions, causing
GemmFwdRest to warn and fall back to a slow path on every operation.
MIOPEN_DEBUG_CONV_GEMM=0 skips GEMM entirely and uses Direct/Winograd
solvers which have no workspace requirement.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

PyTorch 2.11.0 with ROCm 7.2 wheels against rocm/dev-ubuntu-22.04:latest
causes MIOpen version mismatches that force every convolution onto a slow
zero-workspace fallback path (41s synthesis). The existing working project
uses torch 2.5.1 + ROCm 6.1 successfully on the same base image.
Also remove the MIOPEN_FIND_ENFORCE override; it is unnecessary with matched versions.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Enforce=3 (SEARCH_DB_UPDATE) runs exhaustive kernel benchmarking on
every single GPU operation, making inference unusably slow. Enforce=1
searches once, writes to cache, then reuses cached results on all
subsequent calls.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

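Per the behavior described in this commit, the setting becomes:

```shell
# 1 = search once, write to cache, reuse on subsequent calls (behavior
# as described above); replaces the previous value of 3, which
# re-benchmarked every operation.
export MIOPEN_FIND_ENFORCE=1
```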
MIOPEN_FIND_ENFORCE=3 tells MIOpen to only select solvers that fit in
available workspace, eliminating the GemmFwdRest fallback warnings and
the associated performance hit. Persisting the MIOpen cache via a named
volume avoids kernel recompilation on every container start.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Update torch/torchaudio to 2.11.0 with ROCm 7.2 wheel index
- Drop torchvision (unused for TTS) and pytorch_triton_rocm (bundled in 2.11)
- Update HSA_OVERRIDE_GFX_VERSION docs; RX 7000+ natively supported in ROCm 7.2
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Wyoming-only server built around the official chatterbox TTS model.
Includes ROCm/AMD GPU support, sentence-level streaming, config.yaml
management, and Gitea CI for container builds.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>