Two changes:
- ulimits nofile=65536: MIOpen's exhaustive search compiles many MLIR
kernels in parallel, each opening temp files in /tmp. The default container
limit (1024) is too low, so ld.lld fails with 'too many open files'.
- MIOPEN_DEBUG_CONV_IMPLICIT_GEMM=0: disables the MLIR-based ImplicitGEMM
solvers that generate the failing kernels, leaving Direct/Winograd/GEMM.
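In docker-compose terms, the two changes might look like the following sketch (the `tts` service name is a placeholder):

```yaml
services:
  tts:
    ulimits:
      nofile:        # raise the per-process open-file limit for parallel MLIR kernel compiles
        soft: 65536
        hard: 65536
    environment:
      - MIOPEN_DEBUG_CONV_IMPLICIT_GEMM=0   # drop the MLIR ImplicitGEMM solvers
```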
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
cudnn.benchmark triggers MIOpen's exhaustive kernel search, which then
crashes while writing results to SQLite. Disabling the cache skips the write.
PyTorch's in-memory benchmark cache still applies, so warmup results are
reused for all requests within a container run.
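Assuming the cache is turned off via MIOpen's MIOPEN_DISABLE_CACHE switch (the commit does not name the exact mechanism), the change might be sketched as:

```yaml
environment:
  - MIOPEN_DISABLE_CACHE=1   # skip the SQLite write that crashes during exhaustive search
```

torch.backends.cudnn.benchmark stays enabled on the Python side; its in-process cache covers repeat requests.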
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The named volume overlay was causing SQLite 'unable to open database file'
crashes. MIOpen's default cache location (~/.config/miopen) works reliably
inside the container. The startup warmup repopulates it each run.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
gfx1031 is not natively supported in ROCm 7.2. Without the override,
the GPU falls back to software emulation, causing 40+ second synthesis.
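A sketch of the override, assuming the common gfx1030 spoof used for RDNA2 cards such as the RX 6700 XT (gfx1031):

```yaml
environment:
  - HSA_OVERRIDE_GFX_VERSION=10.3.0   # present gfx1031 as the supported gfx1030 target
```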
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PyTorch passes a ptr=0, size=0 workspace to MIOpen convolutions, causing
GemmFwdRest to warn and fall back to a slow path on every operation.
MIOPEN_DEBUG_CONV_GEMM=0 skips GEMM entirely and uses Direct/Winograd
solvers which have no workspace requirement.
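The change can be expressed as a one-line environment tweak:

```yaml
environment:
  - MIOPEN_DEBUG_CONV_GEMM=0   # skip GEMM solvers; Direct/Winograd need no workspace
```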
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PyTorch 2.11.0 with ROCm 7.2 wheels against rocm/dev-ubuntu-22.04:latest
causes MIOpen version mismatches that force every convolution onto a slow
zero-workspace fallback path (41s synthesis). The existing working project
uses torch 2.5.1 + ROCm 6.1 successfully on the same base image.
Also remove MIOPEN_FIND_ENFORCE override - unnecessary with matched versions.
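Pinned this way, the install could be sketched as a requirements fragment (the exact torchaudio pin is an assumption):

```
--index-url https://download.pytorch.org/whl/rocm6.1
torch==2.5.1
torchaudio==2.5.1
```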
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Enforce=3 (SEARCH_DB_UPDATE) runs exhaustive kernel benchmarking on
every single GPU operation, making inference impossibly slow. Enforce=1
searches once, writes to cache, then reuses cached results on all
subsequent calls.
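Sketched as the environment change this entry describes:

```yaml
environment:
  - MIOPEN_FIND_ENFORCE=1   # per this entry: search once, write the cache, reuse on later calls
```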
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
MIOPEN_FIND_ENFORCE=3 tells MIOpen to only select solvers that fit in
available workspace, eliminating the GemmFwdRest fallback warnings and
the associated performance hit. Persisting the MIOpen cache via a named
volume avoids kernel recompilation on every container start.
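A compose-style sketch of the two changes (the volume name and cache path are assumptions; MIOpen defaults to ~/.config/miopen for the running user):

```yaml
services:
  tts:
    environment:
      - MIOPEN_FIND_ENFORCE=3
    volumes:
      - miopen-cache:/root/.config/miopen   # persist compiled kernels across container starts
volumes:
  miopen-cache:
```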
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Update torch/torchaudio to 2.11.0 with ROCm 7.2 wheel index
- Drop torchvision (unused for TTS) and pytorch_triton_rocm (bundled in 2.11)
- Update HSA_OVERRIDE_GFX_VERSION docs; RX 7000+ natively supported in ROCm 7.2
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wyoming-only server built around the official chatterbox TTS model.
Includes ROCm/AMD GPU support, sentence-level streaming, config.yaml
management, and Gitea CI for container builds.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>