Log every event type on arrival and wrap handle_event in try/except
so silent crashes are visible. Helps diagnose the streaming protocol
hang where no logs appear after supports_synthesize_streaming=True.
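A minimal sketch of the wrapper (the handler class and the inner _handle_event helper are assumptions, not this repo's actual names):

    import logging

    from wyoming.event import Event
    from wyoming.server import AsyncEventHandler

    _LOGGER = logging.getLogger(__name__)

    class ChatterboxEventHandler(AsyncEventHandler):
        async def handle_event(self, event: Event) -> bool:
            # Log every event type on arrival so protocol stalls are visible.
            _LOGGER.debug("Event received: %s", event.type)
            try:
                return await self._handle_event(event)  # hypothetical inner handler
            except Exception:
                # Without the wrapper, an exception here can kill the connection
                # with nothing in the logs.
                _LOGGER.exception("Error while handling %s", event.type)
                return True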
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Without this flag HA buffers all audio until AudioStop before forwarding
to the media player. With it, HA streams AudioChunk events to the player
as they arrive, so playback starts on the first chunk rather than after
the full text is synthesized.
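For context, a sketch of where the flag is advertised, assuming it lives on wyoming.info.TtsProgram as in recent wyoming releases (all names, URLs, and versions below are illustrative):

    from wyoming.info import Attribution, Info, TtsProgram, TtsVoice

    ATTRIBUTION = Attribution(
        name="Resemble AI", url="https://github.com/resemble-ai/chatterbox"
    )

    wyoming_info = Info(
        tts=[
            TtsProgram(
                name="chatterbox",
                description="Chatterbox TTS",
                attribution=ATTRIBUTION,
                installed=True,
                version="1.0.0",
                supports_synthesize_streaming=True,  # stream AudioChunks as produced
                voices=[
                    TtsVoice(
                        name="default",
                        description="Default voice",
                        attribution=ATTRIBUTION,
                        installed=True,
                        version="1.0.0",
                        languages=["en"],
                    )
                ],
            )
        ]
    )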
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The GemmFwdRest workspace=0 warnings are expected (PyTorch ROCm passes
null workspace; MIOpen falls back to a working solver). They are not
actionable and clutter the logs. Level 2 keeps error-level output.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The plain Synthesize event (HA's standard TTS path) should NOT be
followed by SynthesizeStopped. That event belongs only to the streaming
protocol (SynthesizeStart/Chunk/Stop). Sending it after Synthesize
confuses HA's Wyoming client, causing it to hang indefinitely.
Also:
- Guard Synthesize path against duplicate events during streaming
- Send audio as one AudioChunk per sentence (matches working reference)
- Remove numpy import (no longer needed)
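Roughly, the resulting event handling looks like this (a sketch; _streaming, _synthesize_sentences, and _flush are assumed helper names):

    from wyoming.server import AsyncEventHandler
    from wyoming.tts import (
        Synthesize,
        SynthesizeChunk,
        SynthesizeStart,
        SynthesizeStop,
        SynthesizeStopped,
    )

    class Handler(AsyncEventHandler):  # illustrative name
        _streaming = False

        async def handle_event(self, event):
            if SynthesizeStart.is_type(event.type):
                self._streaming = True
                return True
            if SynthesizeChunk.is_type(event.type):
                await self._synthesize_sentences(SynthesizeChunk.from_event(event).text)
                return True
            if SynthesizeStop.is_type(event.type):
                await self._flush()
                # SynthesizeStopped belongs to the streaming protocol only.
                await self.write_event(SynthesizeStopped().event())
                self._streaming = False
                return True
            if Synthesize.is_type(event.type):
                if self._streaming:
                    # HA also sends a final Synthesize with the full text while
                    # streaming; ignore the duplicate.
                    return True
                await self._synthesize_sentences(Synthesize.from_event(event).text)
                # Deliberately no SynthesizeStopped after plain Synthesize.
                return True
            return True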
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
torch.compile with dynamic=True still specializes per shape family on
first call. The warmup was running one text length, leaving real requests
to JIT-compile their own shapes (15-22s for first chunk). HA freezes
because it gets no AudioChunk for 22 seconds.
Fix:
- Run 3 warmup passes (short/medium/long text) so torch.compile builds
a dynamic shape graph covering the range HA actually sends. Real
requests then hit a cached compilation and synthesize in 3-8s.
- Reduce default chunk_size from 300 to 120 chars so the first text
chunk is shorter, producing faster synthesis and earlier first audio.
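A sketch of the warmup passes (the texts and the generate() call are illustrative; the real chatterbox API may differ):

    import torch

    WARMUP_TEXTS = [
        "Hello there.",                                              # short
        "The living room lights have been turned on for you.",      # medium
        "The current temperature is twenty one degrees and the heating "
        "will switch off automatically in about fifteen minutes.",  # long
    ]

    def warmup(model):
        for text in WARMUP_TEXTS:
            with torch.inference_mode():
                model.generate(text)  # builds/extends the dynamic-shape graph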
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Warmup now uses a ~170-char representative sentence so torch.compile
JIT-compiles for typical token sequence lengths. Previously "Warmup."
compiled for very short shapes, causing a full re-compile (17s) on the
first real HA request and pushing total synthesis past 30s.
- Compile model.ve (voice encoder) in addition to s3gen — both are
convolutional and hit the MIOpen workspace=0 bug.
- Fix _patch_timing: attribute is model.ve not model.voice_encoder,
so the timing wrap was silently skipping the speaker embedding.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
An import inside a function creates a local binding that shadows the
module-level torch import, breaking every earlier torch reference in the
same function scope.
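A minimal reproduction of the failure mode:

    import torch

    def synthesize():
        # Raises UnboundLocalError: the 'import torch' below makes 'torch' a
        # local name for the entire function, shadowing the module-level import.
        x = torch.zeros(1)
        import torch
        return x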
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The GemmFwdRest workspace=0 issue is in MIOpen itself — PyTorch's ROCm
backend does not allocate workspace for convolutions, causing HiFiGAN to
use a slow fallback solver regardless of benchmark settings.
torch.compile(s3gen, dynamic=True) replaces MIOpen's conv path with
Triton-generated kernels, bypassing the issue entirely. dynamic=True
handles variable audio lengths without recompiling per request. The warmup
triggers JIT compilation so first HA request is fast.
Also removes fp16 autocast (Triton handles precision internally) and
cudnn.benchmark (no longer needed without MIOpen convs).
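Sketch of the resulting setup (attribute names follow the commit; the surrounding code is illustrative):

    import torch

    def setup(model):
        # No more cudnn.benchmark or fp16 autocast; Inductor/Triton now
        # generates the convolution kernels instead of MIOpen.
        model.s3gen = torch.compile(model.s3gen, dynamic=True)
        return model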
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two changes:
- ulimits nofile=65536: MIOpen exhaustive search compiles many MLIR
kernels in parallel, each opening temp files in /tmp. Default container
limit (1024) is too low and ld.lld fails with 'too many open files'.
- MIOPEN_DEBUG_CONV_IMPLICIT_GEMM=0: disables the MLIR-based ImplicitGEMM
solvers that generate the failing kernels, leaving Direct/Winograd/GEMM.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
cudnn.benchmark triggers MIOpen's exhaustive kernel search, which then
crashes while writing its results to SQLite. Disabling the on-disk cache
skips that write. PyTorch's in-memory benchmark cache still applies, so
warmup results are reused for all requests within a container run.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The named volume overlay was causing SQLite 'unable to open database file'
crashes. MIOpen's default cache location (~/.config/miopen) works reliably
inside the container. The startup warmup repopulates it each run.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
MIOpen crashes with SQLite 'unable to open database file' when the
directory doesn't exist at container start. mkdir + chmod 777 ensures
the path is present and writable before the named volume overlays it.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Warmup: run a synthesis before accepting Wyoming connections so MIOpen
benchmarks and caches all conv layer shapes. Without this, the first HA
request triggers hundreds of benchmark runs and times out.
fp16: wrap in try/except so a failed autocast retries in fp32 rather
than dropping the request silently.
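A sketch of the startup ordering (the loader, handler class, and port are assumptions about this project's layout):

    from functools import partial

    from wyoming.server import AsyncServer

    async def main():
        model = load_model()  # hypothetical loader
        # One synthesis before serving so MIOpen benchmarks every conv shape
        # now rather than during the first HA request.
        model.generate("This is a warmup sentence for the text to speech server.")
        server = AsyncServer.from_uri("tcp://0.0.0.0:10200")
        await server.run(partial(ChatterboxEventHandler, model))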
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The 6700 XT has significantly higher fp16 throughput than fp32.
autocast("cuda") uses fp16 for matmuls and convolutions (HiFiGAN,
S3 tokenizer, flow matching) while keeping fp32 for precision-sensitive
ops like softmax and layer norm.
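In code, the change is essentially one context manager around inference (generate() stands in for the real call):

    import torch

    def synthesize(model, text):
        with torch.autocast("cuda", dtype=torch.float16):
            # Matmuls and convolutions run in fp16; softmax, layer norm and
            # similar precision-sensitive ops stay in fp32.
            return model.generate(text)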
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously the entire synthesized audio for a sentence was sent as one
AudioChunk event. HA buffered it until it had arrived in full, so playback
didn't start until synthesis was complete. Splitting into 4096-sample chunks
lets HA begin playing as data arrives.
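Sketch of the chunking as a handler method (16-bit mono PCM and the names are assumptions):

    from wyoming.audio import AudioChunk

    CHUNK_SAMPLES = 4096
    SAMPLE_WIDTH = 2  # bytes per 16-bit sample

    async def send_audio(self, audio_bytes: bytes, rate: int):
        step = CHUNK_SAMPLES * SAMPLE_WIDTH
        for i in range(0, len(audio_bytes), step):
            await self.write_event(
                AudioChunk(
                    rate=rate,
                    width=SAMPLE_WIDTH,
                    channels=1,
                    audio=audio_bytes[i : i + step],
                ).event()
            )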
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Timing showed s3gen.inference (the HiFiGAN vocoder) taking 22s and reference
audio processing ~18s, both dominated by Conv1d ops hitting the MIOpen
fallback. With benchmark=False (the default), PyTorch passes a ptr=0 size=0
workspace to MIOpen, causing GemmFwdRest to fail and fall back to a slow path
on every call.
With benchmark=True, PyTorch evaluates convolution algorithms with proper
workspace allocation and caches the best result via MIOPEN_USER_DB_PATH.
First inference will be slower while benchmarking; subsequent calls use cache.
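One way to wire that up (the cache path is illustrative; the env var only needs to be set before the first convolution runs):

    import os

    os.environ.setdefault("MIOPEN_USER_DB_PATH", "/data/miopen")  # persisted results

    import torch

    torch.backends.cudnn.benchmark = True  # lets MIOpen benchmark conv algorithms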
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wraps s3tokenizer, voice_encoder, and s3gen.inference with timing logs
so we can see exactly which step is consuming the missing ~33 seconds.
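The wrapper is a small monkeypatch along these lines (attribute paths follow the commit; exact names are assumptions):

    import functools
    import logging
    import time

    _LOGGER = logging.getLogger(__name__)

    def _timed(name, fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            _LOGGER.info("%s took %.1fs", name, time.perf_counter() - start)
            return result
        return wrapper

    # e.g. model.s3gen.inference = _timed("s3gen.inference", model.s3gen.inference)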
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
gfx1031 is not natively supported in ROCm 7.2. Without the override, the
GPU falls back to software emulation, causing 40+ second synthesis.
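The override is the usual trick of reporting the card as gfx1030; one way to set it (the 10.3.0 value is an assumption based on common RDNA2 setups, and it must be in the environment before the ROCm runtime initializes):

    import os

    # RX 6700 XT reports gfx1031; claim gfx1030, which has official kernels.
    os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "10.3.0")

    import torch  # ROCm picks up the override when it initializes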
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PyTorch passes ptr=0 size=0 workspace to MIOpen convolutions, causing
GemmFwdRest to warn and fall back to a slow path on every operation.
MIOPEN_DEBUG_CONV_GEMM=0 skips GEMM entirely and uses Direct/Winograd
solvers which have no workspace requirement.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Using :latest was pulling a ROCm 6.x image whose MIOpen was incompatible
with our ROCm 7.2 PyTorch wheels. Pinning to the 7.2 tag gives matching
MIOpen libraries and should resolve the workspace/fallback performance issue.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PyTorch 2.11.0 with ROCm 7.2 wheels against rocm/dev-ubuntu-22.04:latest
causes MIOpen version mismatches that force every convolution onto a slow
zero-workspace fallback path (41s synthesis). The existing working project
uses torch 2.5.1 + ROCm 6.1 successfully on the same base image.
Also remove MIOPEN_FIND_ENFORCE override - unnecessary with matched versions.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Enforce=3 (SEARCH_DB_UPDATE) runs exhaustive kernel benchmarking on
every single GPU operation, making inference impossibly slow. Enforce=1
searches once, writes to cache, then reuses cached results on all
subsequent calls.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
MIOPEN_FIND_ENFORCE=3 tells MIOpen to only select solvers that fit in
available workspace, eliminating the GemmFwdRest fallback warnings and
the associated performance hit. Persisting the MIOpen cache via a named
volume avoids kernel recompilation on every container start.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
mode=max was hitting a 400 Bad Request when pushing the large ROCm
PyTorch layer (~GB) as a separate cache blob. Inline cache embeds
metadata in the already-pushed image instead, so no separate upload.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- CI: cache-from/cache-to with mode=max stores all intermediate layers
in the registry so subsequent builds skip unchanged layers (especially
the slow ROCm PyTorch download)
- Dockerfile: move COPY perth_stub.py below pip install layers so a
stub change doesn't bust the cache for everything above it
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
resemble-perth uses uv-build, which is incompatible with the old system
pip in the ROCm base image. Since watermarking is unnecessary for
self-hosted private use, stub out the perth module so chatterbox's
import is satisfied without any build complexity.
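perth_stub.py then only needs the attribute chatterbox actually uses; a sketch, with the class and method names assumed from the resemble-perth API:

    # perth_stub.py -- installed in place of the real 'perth' package.
    class PerthImplicitWatermarker:
        """No-op stand-in for resemble-perth's watermarker."""

        def apply_watermark(self, wav, sample_rate=None, **kwargs):
            return wav  # return the audio unmodified; no watermark applied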
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Update torch/torchaudio to 2.11.0 with ROCm 7.2 wheel index
- Drop torchvision (unused for TTS) and pytorch_triton_rocm (bundled in 2.11)
- Update HSA_OVERRIDE_GFX_VERSION docs; RX 7000+ natively supported in ROCm 7.2
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pip's isolated build environments don't have the uv binary available,
causing uv-build to fail. Installing with --no-build-isolation lets pip
use the already-installed uv from the system environment.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pip's isolated build environments inherit system PATH but don't get
the uv binary automatically. Symlinking via uv.find_uv_bin() makes it
available so resemble-perth's uv-build backend can execute.
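The symlink amounts to a one-liner (the target path is illustrative):

    import os

    from uv import find_uv_bin

    # Put the uv binary somewhere pip's isolated build environment will find it.
    os.symlink(find_uv_bin(), "/usr/local/bin/uv")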
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
resemble-perth uses uv as its build backend; without uv installed
the metadata-generation step fails.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Update transformers to 5.2.0 (required by official chatterbox)
- Add omegaconf (pulled by s3gen/flow.py)
- Install resemble-perth from git source
- Pin safetensors to 0.5.3
- Remove onnx (not a chatterbox dep)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
resemble-perth, conformer, s3tokenizer, onnx, spacy-pkuseg, pykakasi,
and pyloudnorm are all chatterbox deps that were skipped by --no-deps
and need to be installed explicitly.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wyoming-only server built around the official chatterbox TTS model.
Includes ROCm/AMD GPU support, sentence-level streaming, config.yaml
management, and Gitea CI for container builds.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>