Commit Graph

45 Commits

SHA1 Message Date
69f5489532 Merge branch 'main' into dev 2026-04-06 17:41:40 -04:00
f292ace76c Trigger rebuild to restore latest tag
All checks were successful
Build ROCm Image / build (push) Successful in 14m47s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 17:33:16 -04:00
766ca9d278 Fix image tagging: dev branch tags as dev, not latest
All checks were successful
Build ROCm Image / build (push) Successful in 25s
main branch → :latest + :sha
other branches → :<branch-name> + :sha

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 17:29:59 -04:00
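
The tag-selection rule above is small enough to show directly; a sketch of the equivalent logic (illustrative only; the real mapping lives in the CI workflow, and the image name here is hypothetical):

    def image_tags(branch: str, sha: str,
                   image: str = "registry.example/chatterbox-rocm") -> list[str]:
        # main gets :latest; every other branch gets :<branch-name>; both get :sha
        primary = "latest" if branch == "main" else branch
        return [f"{image}:{primary}", f"{image}:{sha}"]

    print(image_tags("main", "766ca9d278"))  # [...:latest, ...:766ca9d278]
    print(image_tags("dev", "766ca9d278"))   # [...:dev,    ...:766ca9d278]
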
9a017df4ca Trigger CI builds on dev branch
All checks were successful
Build ROCm Image / build (push) Successful in 17m21s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 17:10:56 -04:00
fe3c77ff4f Upgrade to ROCm 7.2, Python 3.11, PyTorch 2.11.0
- Base image: rocm/dev-ubuntu-22.04:6.1 → 7.2
- Python 3.10 → 3.11 via deadsnakes PPA
- torch/torchaudio: 2.5.1 → 2.11.0
- torchvision: 0.20.1 → 0.26.0
- pytorch_triton_rocm: 3.1.0 → 3.3.0
- transformers: 4.46.3 → >=4.50.0
- diffusers: 0.29.0 → >=0.32.0
- safetensors: >=0.4.1 → >=0.4.5
- config: temperature 0.8→0.9, seed 0→1960

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 17:09:56 -04:00
967ed41239 Revert FP16 autocast — increases TTFA on first request
All checks were successful
Build ROCm Image / build (push) Successful in 3m21s
autocast triggers fp16 kernel selection on the first call for each tensor
shape. Since the warmup uses short text, real requests re-trigger
selection and end up slower overall. Keeping FP32 plus the conditionals cache.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 20:30:49 -04:00
29b66e24bb Cache voice conditionals and add FP16 autocast
All checks were successful
Build ROCm Image / build (push) Successful in 3m17s
Voice conditionals (s3tokenizer + voice encoder + mel embeddings) are
expensive to compute but depend only on the reference audio, not the
text. Previously they ran on every synthesis chunk — 3x wasted work for
a 3-chunk request. Now computed once at startup and reused.

Also wrap generate() in torch.amp.autocast(float16) for ~2x speedup on
all model computation (T3 LLM, S3Gen CFM, HiFiGAN vocoder).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 20:22:13 -04:00
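
A rough sketch of the caching pattern described above; the prepare_conditionals/generate names are assumptions about the chatterbox API, not the actual method names:

    import torch

    class Engine:
        def __init__(self, model, ref_audio_path):
            self.model = model
            # Voice conditionals depend only on the reference audio, so compute
            # them once at startup instead of once per synthesis chunk.
            self.conds = model.prepare_conditionals(ref_audio_path)  # assumed API

        def synthesize(self, text):
            # fp16 autocast over the whole generate path (T3, S3Gen CFM, HiFiGAN)
            with torch.amp.autocast("cuda", dtype=torch.float16):
                return self.model.generate(text, conds=self.conds)   # assumed API
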
0fac076de1 Fix safetensors version conflict with transformers 4.46.3
All checks were successful
Build ROCm Image / build (push) Successful in 14m11s
transformers 4.46.3 requires safetensors>=0.4.1, not ==0.4.0.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 17:34:23 -04:00
8de67c8bd9 Switch to ROCm 6.1 + torch 2.5.1 to fix MIOpen workspace=0 slowness
Some checks failed
Build ROCm Image / build (push) Failing after 11s
ROCm 7.2 + PyTorch 2.11.0 has a bug where PyTorch passes workspace=0 to
MIOpen convolutions, forcing fallback to the slow GemmFwdRest solver.
This caused s3gen.inference to take 15-22s instead of <5s, making
synthesis 3-4x slower than real-time audio playback.

ROCm 6.1 allocates workspace correctly so MIOpen picks fast GEMM solvers
without needing torch.compile workarounds.

Changes:
- Base image: rocm/dev-ubuntu-22.04:7.2 → 6.1
- torch 2.11.0 → 2.5.1 (rocm6.1 wheel index)
- Add pytorch_triton_rocm==3.1.0
- transformers 5.2.0 → 4.46.3, safetensors 0.5.3 → 0.4.0
- s3tokenizer unpinned → 0.3.0
- resemble-perth==1.0.1 directly (v1.0.1 is pip-installable; drop stub)
- Drop Dockerfile perth_stub steps
- Drop torch.compile and timing patches from engine.py (not needed)
- Drop multi-pass warmup from main.py (torch JIT warmup not needed)
- Drop ROCm 7.2-specific env vars from docker-compose.yml

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 17:27:21 -04:00
23a0b914fa Add per-event logging and top-level exception catching
All checks were successful
Build ROCm Image / build (push) Successful in 6m2s
Log every event type on arrival and wrap handle_event in try/except
so silent crashes are visible. Helps diagnose the streaming protocol
hang where no logs appear after supports_synthesize_streaming=True.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 17:10:32 -04:00
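
A sketch of the pattern, assuming wyoming's AsyncEventHandler; the inner _handle dispatcher is hypothetical:

    import logging
    from wyoming.event import Event
    from wyoming.server import AsyncEventHandler

    _LOGGER = logging.getLogger(__name__)

    class ChatterboxHandler(AsyncEventHandler):
        async def handle_event(self, event: Event) -> bool:
            # Log every event type on arrival so protocol stalls are visible.
            _LOGGER.debug("event: %s", event.type)
            try:
                return await self._handle(event)   # hypothetical inner dispatcher
            except Exception:
                # Without this, a crash here is silent and HA just hangs.
                _LOGGER.exception("error while handling %s", event.type)
                return True
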
3d3e8bdabf Add supports_synthesize_streaming=True to TtsProgram
All checks were successful
Build ROCm Image / build (push) Successful in 4m56s
Without this flag HA buffers all audio until AudioStop before forwarding
to the media player. With it, HA streams AudioChunk events to the player
as they arrive, so playback starts on the first chunk rather than after
the full text is synthesized.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 16:52:02 -04:00
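
Roughly where the flag lives, assuming wyoming's Info/TtsProgram/TtsVoice dataclasses; the names, versions, and descriptions below are placeholders:

    from wyoming.info import Attribution, Info, TtsProgram, TtsVoice

    attribution = Attribution(name="Resemble AI",
                              url="https://github.com/resemble-ai/chatterbox")

    info = Info(tts=[
        TtsProgram(
            name="chatterbox",
            description="Chatterbox TTS",
            attribution=attribution,
            installed=True,
            version="1.0.0",                       # placeholder
            # The flag this commit adds: lets HA forward AudioChunk events to
            # the media player as they arrive instead of buffering to AudioStop.
            supports_synthesize_streaming=True,
            voices=[TtsVoice(
                name="default",
                description="Cloned reference voice",  # placeholder
                attribution=attribution,
                installed=True,
                version="1.0.0",                   # required (see commit 4b21d6c252)
                languages=["en"],
            )],
        )
    ])
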
d0f13dea8d Log incoming HA text in synthesis request line
All checks were successful
Build ROCm Image / build (push) Successful in 3m50s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 16:26:52 -04:00
8c5d3c4f06 Suppress MIOpen workspace warning noise via MIOPEN_LOG_LEVEL=2
Some checks failed
Build ROCm Image / build (push) Has been cancelled
The GemmFwdRest workspace=0 warnings are expected (PyTorch ROCm passes
null workspace; MIOpen falls back to a working solver). They are not
actionable and clutter the logs. Level 2 keeps error-level output.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 16:24:18 -04:00
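
MIOPEN_LOG_LEVEL is ordinary process environment, presumably set in the image or compose file here; it could equally be set before torch loads MIOpen, e.g.:

    import os

    # Per the commit, level 2 keeps error-level output while dropping the
    # expected GemmFwdRest workspace=0 warning noise.
    os.environ.setdefault("MIOPEN_LOG_LEVEL", "2")

    import torch  # imported after the variable is set
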
a196294d4a Fix Wyoming protocol: remove SynthesizeStopped from Synthesize path
Some checks failed
Build ROCm Image / build (push) Has been cancelled
The plain Synthesize event (HA's standard TTS path) should NOT be
followed by SynthesizeStopped. That event belongs only to the streaming
protocol (SynthesizeStart/Chunk/Stop). Sending it after Synthesize
confuses HA's Wyoming client, causing it to hang indefinitely.

Also:
- Guard Synthesize path against duplicate events during streaming
- Send audio as one AudioChunk per sentence (matches working reference)
- Remove numpy import (no longer needed)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 16:22:47 -04:00
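
A sketch of the corrected plain-Synthesize path, assuming wyoming's event classes; the engine object, the output format constants, and the naive sentence splitter are illustrative. Note that no SynthesizeStopped is sent here:

    from wyoming.audio import AudioChunk, AudioStart, AudioStop
    from wyoming.tts import Synthesize

    RATE, WIDTH, CHANNELS = 24000, 2, 1   # assumed 16-bit mono output

    def split_sentences(text):
        # naive splitter, purely for illustration
        return [s.strip() + "." for s in text.split(".") if s.strip()]

    async def handle_synthesize(handler, engine, synthesize: Synthesize) -> bool:
        await handler.write_event(
            AudioStart(rate=RATE, width=WIDTH, channels=CHANNELS).event())
        for sentence in split_sentences(synthesize.text):
            pcm = engine.synthesize(sentence)   # 16-bit PCM bytes (assumed)
            # One AudioChunk per sentence, matching the working reference server.
            await handler.write_event(
                AudioChunk(audio=pcm, rate=RATE, width=WIDTH,
                           channels=CHANNELS).event())
        # Plain Synthesize ends with AudioStop only; SynthesizeStopped belongs to
        # the streaming protocol (SynthesizeStart/Chunk/Stop) and confuses HA here.
        await handler.write_event(AudioStop().event())
        return True
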
59731084cd Multi-pass warmup and smaller chunk_size to fix HA timeout
All checks were successful
Build ROCm Image / build (push) Successful in 2m49s
torch.compile with dynamic=True still specializes per shape family on
first call. The warmup was running one text length, leaving real requests
to JIT-compile their own shapes (15-22s for first chunk). HA freezes
because it gets no AudioChunk for 22 seconds.

Fix:
- Run 3 warmup passes (short/medium/long text) so torch.compile builds
  a dynamic shape graph covering the range HA actually sends. Real
  requests then hit a cached compilation and synthesize in 3-8s.
- Reduce default chunk_size from 300 to 120 chars so the first text
  chunk is shorter, producing faster synthesis and earlier first audio.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 15:04:46 -04:00
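
A condensed sketch of the warmup idea; the texts and the engine API are illustrative, not the actual ones:

    CHUNK_SIZE = 120  # chars: a shorter first text chunk means earlier first audio

    WARMUP_TEXTS = [
        "Warmup.",                                                        # short
        "This medium length sentence warms up a typical request shape.",  # medium
        "This considerably longer warmup sentence exists so that torch.compile "
        "sees token sequences close to what Home Assistant actually sends.",  # long
    ]

    def warmup(engine):
        # One pass per shape family so torch.compile(dynamic=True) has built a
        # dynamic-shape graph before the first real Home Assistant request.
        for text in WARMUP_TEXTS:
            engine.synthesize(text)   # assumed engine API
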
169e003a34 Fix warmup text length and ve attribute for torch.compile
All checks were successful
Build ROCm Image / build (push) Successful in 3m35s
- Warmup now uses a ~170-char representative sentence so torch.compile
  JIT-compiles for typical token sequence lengths. Previously "Warmup."
  compiled for very short shapes, causing a full re-compile (17s) on the
  first real HA request and pushing total synthesis past 30s.
- Compile model.ve (voice encoder) in addition to s3gen — both are
  convolutional and hit the MIOpen workspace=0 bug.
- Fix _patch_timing: attribute is model.ve not model.voice_encoder,
  so the timing wrap was silently skipping the speaker embedding.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:51:08 -04:00
5766870304 Fix UnboundLocalError: move torch._dynamo import to module level
All checks were successful
Build ROCm Image / build (push) Successful in 2m39s
An import inside a function creates a local binding that shadows the
module-level torch import, breaking all earlier torch references in
the same function scope.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:34:45 -04:00
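
The failure is easy to reproduce in isolation; a minimal example of the shadowing bug and the module-level fix:

    import torch
    import torch._dynamo   # the fix: module-level import creates no local binding

    def broken():
        x = torch.zeros(1)            # UnboundLocalError: the import below makes
        import torch._dynamo          # 'torch' a *local* name for the whole
        torch._dynamo.reset()         # function body, including the line above
        return x

    def fixed():
        x = torch.zeros(1)            # fine: 'torch' resolves to the global module
        torch._dynamo.reset()
        return x
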
7babd0584e Replace MIOpen convolution path with torch.compile on s3gen
All checks were successful
Build ROCm Image / build (push) Successful in 2m47s
The GemmFwdRest workspace=0 issue is in MIOpen itself — PyTorch's ROCm
backend does not allocate workspace for convolutions, causing HiFiGAN to
use a slow fallback solver regardless of benchmark settings.

torch.compile(s3gen, dynamic=True) replaces MIOpen's conv path with
Triton-generated kernels, bypassing the issue entirely. dynamic=True
handles variable audio lengths without recompiling per request. The warmup
triggers JIT compilation so first HA request is fast.

Also removes fp16 autocast (Triton handles precision internally) and
cudnn.benchmark (no longer needed without MIOpen convs).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:27:09 -04:00
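
Roughly what the compile step amounts to; the model attribute names follow the commit messages, and the surrounding code is assumed:

    import torch

    def compile_conv_models(model):
        # Route the convolution-heavy submodules through Triton-generated kernels
        # instead of MIOpen. dynamic=True avoids a recompile for every new audio
        # length; the startup warmup then pays the JIT cost once.
        model.s3gen = torch.compile(model.s3gen, dynamic=True)
        model.ve = torch.compile(model.ve, dynamic=True)   # added in 169e003a34
        return model
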
cd33b1c161 Fix MIOpen MLIR kernel compilation crash during benchmark search
All checks were successful
Build ROCm Image / build (push) Successful in 18s
Two changes:
- ulimits nofile=65536: MIOpen exhaustive search compiles many MLIR
  kernels in parallel, each opening temp files in /tmp. Default container
  limit (1024) is too low and ld.lld fails with 'too many open files'.
- MIOPEN_DEBUG_CONV_IMPLICIT_GEMM=0: disables the MLIR-based ImplicitGEMM
  solvers that generate the failing kernels, leaving Direct/Winograd/GEMM.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:21:32 -04:00
e69b072b70 Add MIOPEN_DISABLE_CACHE=1 to prevent SQLite crash on benchmark
All checks were successful
Build ROCm Image / build (push) Successful in 19s
cudnn.benchmark triggers MIOpen exhaustive kernel search which then
crashes writing results to SQLite. Disabling the cache skips the write.
PyTorch's in-memory benchmark cache still applies so warmup results are
reused for all requests within a container run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:14:44 -04:00
7436c49d44 Remove custom MIOpen cache path — let MIOpen use its defaults
All checks were successful
Build ROCm Image / build (push) Successful in 3m25s
The named volume overlay was causing SQLite 'unable to open database file'
crashes. MIOpen's default cache location (~/.config/miopen) works reliably
inside the container. The startup warmup repopulates it each run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:04:05 -04:00
60279389f2 Create miopen_cache dir in Dockerfile before volume mount
All checks were successful
Build ROCm Image / build (push) Successful in 3m11s
MIOpen crashes with SQLite 'unable to open database file' when the
directory doesn't exist at container start. mkdir + chmod 777 ensures
the path is present and writable before the named volume overlays it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 13:52:03 -04:00
bdde4a2480 Add startup warmup and make fp16 autocast fault-tolerant
All checks were successful
Build ROCm Image / build (push) Successful in 3m10s
Warmup: run a synthesis before accepting Wyoming connections so MIOpen
benchmarks and caches all conv layer shapes. Without this, the first HA
request triggers hundreds of benchmark runs and times out.

fp16: wrap in try/except so a failed autocast retries in fp32 rather
than dropping the request silently.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 13:48:41 -04:00
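
A sketch of the fault-tolerant autocast described above; the generate call is an assumed API:

    import logging
    import torch

    _LOGGER = logging.getLogger(__name__)

    def synthesize(model, text):
        try:
            # Try fp16 first for throughput...
            with torch.amp.autocast("cuda", dtype=torch.float16):
                return model.generate(text)            # assumed API
        except Exception:
            # ...but retry in fp32 rather than dropping the request silently.
            _LOGGER.exception("fp16 synthesis failed; retrying in fp32")
            return model.generate(text)
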
f20699aed3 Add fp16 autocast to synthesis for faster GPU throughput
All checks were successful
Build ROCm Image / build (push) Successful in 2m49s
The 6700 XT has significantly higher fp16 throughput than fp32.
autocast("cuda") uses fp16 for matmuls and convolutions (HiFiGAN,
S3 tokenizer, flow matching) while keeping fp32 for precision-sensitive
ops like softmax and layer norm.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 13:34:21 -04:00
a8e3e62dbc Stream audio in 4096-sample sub-chunks for immediate HA playback
All checks were successful
Build ROCm Image / build (push) Successful in 4m20s
Previously the entire synthesized audio for a sentence was sent as one
AudioChunk event. HA buffered it until it arrived in full, so playback didn't
start until synthesis was complete. Splitting into 4096-sample chunks lets
HA begin playing as data arrives.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 13:25:54 -04:00
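
The split is a simple slice loop; a sketch assuming int16 mono numpy output and wyoming's AudioChunk event:

    import numpy as np
    from wyoming.audio import AudioChunk

    SAMPLES_PER_CHUNK = 4096
    RATE, WIDTH, CHANNELS = 24000, 2, 1   # assumed output format

    async def send_audio(handler, audio: np.ndarray):
        # audio: int16 mono samples for one sentence. Emit 4096-sample slices so
        # HA can begin playback before the whole sentence has been sent.
        for start in range(0, len(audio), SAMPLES_PER_CHUNK):
            piece = audio[start:start + SAMPLES_PER_CHUNK]
            await handler.write_event(
                AudioChunk(audio=piece.tobytes(), rate=RATE, width=WIDTH,
                           channels=CHANNELS).event())
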
514bbad0e9 Enable cudnn.benchmark to fix MIOpen workspace=0 on convolutions
Some checks failed
Build ROCm Image / build (push) Has been cancelled
Timing showed s3gen.inference (HiFiGAN vocoder) taking 22s and ref audio
processing ~18s, both dominated by Conv1d ops hitting the MIOpen fallback.

With benchmark=False (default), PyTorch passes ptr=0 size=0 workspace to
MIOpen causing GemmFwdRest to fail and fall back to a slow path every call.
With benchmark=True, PyTorch evaluates convolution algorithms with proper
workspace allocation and caches the best result via MIOPEN_USER_DB_PATH.

First inference will be slower while benchmarking; subsequent calls use cache.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 13:24:05 -04:00
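
The switch itself is a single flag set before the first inference; illustrative placement:

    import torch

    # Benchmark conv algorithms with real workspace allocation and cache the
    # winner per input shape, instead of handing MIOpen a ptr=0/size=0 workspace.
    torch.backends.cudnn.benchmark = True
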
bfe20b7742 Add timing instrumentation to pinpoint synthesis bottleneck
All checks were successful
Build ROCm Image / build (push) Successful in 3m21s
Wraps s3tokenizer, voice_encoder, and s3gen.inference with timing logs
so we can see exactly which step is consuming the missing ~33 seconds.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 13:09:14 -04:00
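
One way to get per-step timings; the wrapper is illustrative and the wrapped attribute names follow the commit messages:

    import functools
    import logging
    import time

    _LOGGER = logging.getLogger(__name__)

    def _timed(name, fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # For exact GPU numbers you would also torch.cuda.synchronize() here.
            t0 = time.perf_counter()
            result = fn(*args, **kwargs)
            _LOGGER.info("%s took %.2fs", name, time.perf_counter() - t0)
            return result
        return wrapper

    def patch_timing(model):
        # Wrap the suspected hot spots so the log shows where the ~33s goes.
        model.s3gen.inference = _timed("s3gen.inference", model.s3gen.inference)
        model.ve.forward = _timed("voice_encoder", model.ve.forward)
        return model   # the s3tokenizer call would be wrapped the same way
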
b990cacd31 Enable HSA_OVERRIDE_GFX_VERSION=10.3.0 for RX 6700 XT
All checks were successful
Build ROCm Image / build (push) Successful in 18s
gfx1031 is not natively supported in ROCm 7.2. Without the override
the GPU falls back to software emulation causing 40+ second synthesis.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 13:07:46 -04:00
2a80555c60 Disable MIOpen GEMM solver to fix null workspace fallback
All checks were successful
Build ROCm Image / build (push) Successful in 35s
PyTorch passes ptr=0 size=0 workspace to MIOpen convolutions, causing
GemmFwdRest to warn and fall back to a slow path on every operation.
MIOPEN_DEBUG_CONV_GEMM=0 skips GEMM entirely and uses Direct/Winograd
solvers which have no workspace requirement.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 13:00:21 -04:00
f15cdcf049 Pin base image to rocm/dev-ubuntu-22.04:7.2, restore torch 2.11.0
All checks were successful
Build ROCm Image / build (push) Successful in 16m4s
Using :latest was pulling a ROCm 6.x image whose MIOpen was incompatible
with our ROCm 7.2 PyTorch wheels. Pinning to the 7.2 tag gives matching
MIOpen libraries and should resolve the workspace/fallback performance issue.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 12:35:58 -04:00
b68bccb20f Revert to torch 2.5.1 + ROCm 6.1 (known working combination)
Some checks failed
Build ROCm Image / build (push) Has been cancelled
PyTorch 2.11.0 with ROCm 7.2 wheels against rocm/dev-ubuntu-22.04:latest
causes MIOpen version mismatches that force every convolution onto a slow
zero-workspace fallback path (41s synthesis). The existing working project
uses torch 2.5.1 + ROCm 6.1 successfully on the same base image.

Also remove MIOPEN_FIND_ENFORCE override - unnecessary with matched versions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 12:34:06 -04:00
7a966c8532 Fix MIOPEN_FIND_ENFORCE: 3 -> 1 (DB_UPDATE)
Some checks failed
Build ROCm Image / build (push) Has been cancelled
Enforce=3 (SEARCH_DB_UPDATE) runs exhaustive kernel benchmarking on
every single GPU operation, making inference impossibly slow. Enforce=1
searches once, writes to cache, then reuses cached results on all
subsequent calls.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 12:29:54 -04:00
f45fa0496e Fix MIOpen workspace warnings and add kernel cache persistence
Some checks failed
Build ROCm Image / build (push) Has been cancelled
MIOPEN_FIND_ENFORCE=3 tells MIOpen to only select solvers that fit in
available workspace, eliminating the GemmFwdRest fallback warnings and
the associated performance hit. Persisting the MIOpen cache via a named
volume avoids kernel recompilation on every container start.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 12:20:18 -04:00
f81e5f42fb Switch to inline cache to avoid registry blob size limits
Some checks failed
Build ROCm Image / build (push) Has been cancelled
mode=max was hitting a 400 Bad Request when pushing the large ROCm
PyTorch layer (~GB) as a separate cache blob. Inline cache embeds
metadata in the already-pushed image instead, so no separate upload.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 12:14:35 -04:00
d9a540f8e8 Add registry layer cache and fix Dockerfile cache order
Some checks failed
Build ROCm Image / build (push) Failing after 19m47s
- CI: cache-from/cache-to with mode=max stores all intermediate layers
  in the registry so subsequent builds skip unchanged layers (especially
  the slow ROCm PyTorch download)
- Dockerfile: move COPY perth_stub.py below pip install layers so a
  stub change doesn't bust the cache for everything above it

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 11:53:32 -04:00
4b21d6c252 Fix TtsVoice missing required version argument
Some checks failed
Build ROCm Image / build (push) Has been cancelled
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 11:52:14 -04:00
de6156a336 Replace resemble-perth with a no-op stub
All checks were successful
Build ROCm Image / build (push) Successful in 17m32s
resemble-perth uses uv-build which is incompatible with the old system
pip in the ROCm base image. Since watermarking is unnecessary for
self-hosted private use, stub out the perth module so chatterbox's
import is satisfied without any build complexity.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 11:16:08 -04:00
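
The stub only has to satisfy chatterbox's import and watermark call; a guess at its shape (the class and method names chatterbox expects are assumptions):

    # perth_stub.py: installed into site-packages as 'perth' so that
    # "import perth" succeeds without the real resemble-perth package.

    class PerthImplicitWatermarker:
        """No-op stand-in: returns the audio unchanged instead of watermarking it."""

        def apply_watermark(self, wav, sample_rate=None, **kwargs):
            return wav
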
dc7a3cf769 Upgrade to ROCm 7.2 and PyTorch 2.11.0
Some checks failed
Build ROCm Image / build (push) Failing after 7m25s
- Update torch/torchaudio to 2.11.0 with ROCm 7.2 wheel index
- Drop torchvision (unused for TTS) and pytorch_triton_rocm (bundled in 2.11)
- Update HSA_OVERRIDE_GFX_VERSION docs; RX 7000+ natively supported in ROCm 7.2

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 11:06:39 -04:00
d7247d31fe Install resemble-perth with --no-build-isolation
Some checks failed
Build ROCm Image / build (push) Failing after 5m28s
pip's isolated build environments don't have the uv binary available,
causing uv-build to fail. Installing with --no-build-isolation lets pip
use the already-installed uv from the system environment.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 10:58:21 -04:00
88c2084d19 Symlink uv binary to /usr/local/bin for pip build envs
Some checks failed
Build ROCm Image / build (push) Failing after 1m30s
pip's isolated build environments inherit system PATH but don't get
the uv binary automatically. Symlinking via uv.find_uv_bin() makes it
available so resemble-perth's uv-build backend can execute.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 10:53:14 -04:00
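
The symlink step presumably amounted to something like this, run during the image build (the target path is illustrative):

    import os
    import uv   # the 'uv' PyPI package; exposes find_uv_bin()

    # Put the uv binary at a fixed, well-known path so pip's isolated build
    # environments (which only inherit PATH) can execute it.
    os.symlink(uv.find_uv_bin(), "/usr/local/bin/uv")
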
5d1689e7f4 Install uv before pip deps to support uv-build backend
Some checks failed
Build ROCm Image / build (push) Failing after 4m25s
resemble-perth uses uv as its build backend; without uv installed
the metadata-generation step fails.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 10:43:26 -04:00
84e87dceb2 Fix chatterbox deps to match official pyproject.toml
Some checks failed
Build ROCm Image / build (push) Failing after 4m21s
- Update transformers to 5.2.0 (required by official chatterbox)
- Add omegaconf (pulled by s3gen/flow.py)
- Install resemble-perth from git source
- Pin safetensors to 0.5.3
- Remove onnx (not a chatterbox dep)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 10:34:42 -04:00
7b34b202da Add missing chatterbox deps to requirements
All checks were successful
Build ROCm Image / build (push) Successful in 13m41s
resemble-perth, conformer, s3tokenizer, onnx, spacy-pkuseg, pykakasi,
and pyloudnorm are all chatterbox deps that were skipped by --no-deps
and need to be installed explicitly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 10:15:08 -04:00
16ea2853f5 Initial implementation: Chatterbox TTS with ROCm and Wyoming
All checks were successful
Build ROCm Image / build (push) Successful in 15m27s
Wyoming-only server built around the official chatterbox TTS model.
Includes ROCm/AMD GPU support, sentence-level streaming, config.yaml
management, and Gitea CI for container builds.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 09:51:09 -04:00
4b15e44181 Initial commit 2026-04-05 09:38:32 -04:00