Create Kokoro TTS JavaScript library (#3)

* Set up JS project

* Finalise JS library

* Update README

* Fix package.json repository url

* Rename package -> `kokoro-js`

* Fix samples in README

* Cleanup README

* Bump `phonemizer` version

* Create web demo

* Run prettier

* Link to model used in demo

* Enable multithreading in HF space demo (~40% faster)

* Add link to demo in README

* Bump to v1.0.1
This commit is contained in:
Joshua Lochner
2025-01-16 19:50:34 +02:00
committed by GitHub
parent 757c80cc5b
commit 0a1dc5750c
37 changed files with 8820 additions and 0 deletions

4
kokoro.js/.gitignore vendored Normal file
View File

@@ -0,0 +1,4 @@
node_modules/
dist
types
LICENSE

View File

@@ -0,0 +1,2 @@
dist
types

55
kokoro.js/README.md Normal file
View File

@@ -0,0 +1,55 @@
# Kokoro TTS
<p align="center">
<a href="https://www.npmjs.com/package/kokoro-js"><img alt="NPM" src="https://img.shields.io/npm/v/kokoro-js"></a>
<a href="https://www.npmjs.com/package/kokoro-js"><img alt="NPM Downloads" src="https://img.shields.io/npm/dw/kokoro-js"></a>
<a href="https://www.jsdelivr.com/package/npm/kokoro-js"><img alt="jsDelivr Hits" src="https://img.shields.io/jsdelivr/npm/hw/kokoro-js"></a>
<a href="https://github.com/hexgrad/kokoro/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/github/license/hexgrad/kokoro?color=blue"></a>
<a href="https://huggingface.co/spaces/webml-community/kokoro-web"><img alt="Demo" src="https://img.shields.io/badge/Hugging_Face-demo-green"></a>
</p>
Kokoro is a frontier TTS model for its size of 82 million parameters (text in/audio out). This JavaScript library allows the model to be run 100% locally in the browser thanks to [🤗 Transformers.js](https://huggingface.co/docs/transformers.js). Try it out using our [online demo](https://huggingface.co/spaces/webml-community/kokoro-web)!
## Usage
First, install the `kokoro-js` library from [NPM](https://npmjs.com/package/kokoro-js) using:
```bash
npm i kokoro-js
```
You can then generate speech as follows:
```js
import { KokoroTTS } from "kokoro-js";
const model_id = "onnx-community/Kokoro-82M-ONNX";
const tts = await KokoroTTS.from_pretrained(model_id, {
dtype: "q8", // Options: "fp32", "fp16", "q8", "q4", "q4f16"
});
const text = "Life is like a box of chocolates. You never know what you're gonna get.";
const audio = await tts.generate(text, {
// Use `tts.list_voices()` to list all available voices
voice: "af_bella",
});
audio.save("audio.wav");
```
## Voices/Samples
> Life is like a box of chocolates. You never know what you're gonna get.
| Voice | Nationality | Gender | Sample |
| ------------------------ | ----------- | ------ | -------------------------------------------------------------------------------------------------------- |
| Default (`af`) | American | Female | <video controls src="https://github.com/user-attachments/assets/c183df83-58a9-4aea-8fdf-225092acec57" /> |
| Bella (`af_bella`) | American | Female | <video controls src="https://github.com/user-attachments/assets/0730fff0-22b3-458f-9675-36d313d872d6" /> |
| Nicole (`af_nicole`) | American | Female | <video controls src="https://github.com/user-attachments/assets/4ce0b3f6-eaec-4e47-901c-9d29e2b60c86" /> |
| Sarah (`af_sarah`) | American | Female | <video controls src="https://github.com/user-attachments/assets/d37dba3f-de59-44c4-bc3d-da91ea1b5a4a" /> |
| Sky (`af_sky`) | American | Female | <video controls src="https://github.com/user-attachments/assets/38230be5-881c-4407-81e6-a0b1e4101565" /> |
| Adam (`am_adam`) | American | Male | <video controls src="https://github.com/user-attachments/assets/66a4c439-e80b-4c91-8a27-ae094486a2d8" /> |
| Michael (`am_michael`) | American | Male | <video controls src="https://github.com/user-attachments/assets/79a8879d-b564-4222-b2d5-a97f783ae897" /> |
| Emma (`bf_emma`) | British | Female | <video controls src="https://github.com/user-attachments/assets/ad5eb254-1d84-4282-9d23-371d5765d820" /> |
| Isabella (`bf_isabella`) | British | Female | <video controls src="https://github.com/user-attachments/assets/ea7e6825-dad0-403c-9ece-680af04f5a25" /> |
| George (`bm_george`) | British | Male | <video controls src="https://github.com/user-attachments/assets/e09040aa-578f-40a6-b7fd-76a5b005346c" /> |
| Lewis (`bm_lewis`) | British | Male | <video controls src="https://github.com/user-attachments/assets/5d7b26bf-8900-4a9a-8ee5-a16c39bb834c" /> |
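
The voice ids in the table follow a visible pattern: the first letter encodes the accent (`a` for American, `b` for British) and the second the gender (`f` or `m`), optionally followed by `_` and a name. A minimal sketch of decoding an id under that assumption is shown below. `describeVoiceId` is a hypothetical helper for illustration and is not part of `kokoro-js`.

```js
// Hypothetical helper (not part of kokoro-js): decode a voice id such as
// "bm_george", assuming the "<accent><gender>[_<name>]" pattern seen in the table.
function describeVoiceId(id) {
  const [prefix, name = null] = id.split("_");
  const accents = { a: "American", b: "British" };
  const genders = { f: "Female", m: "Male" };
  const accent = accents[prefix[0]];
  const gender = genders[prefix[1]];
  if (prefix.length !== 2 || !accent || !gender) {
    throw new Error(`Unrecognized voice id: ${id}`);
  }
  return { accent, gender, name };
}

console.log(describeVoiceId("af")); // the default voice has no name part
console.log(describeVoiceId("bm_george"));
```

For real use, prefer the metadata the library already exposes through `tts.voices`, which does not rely on this naming convention.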

24
kokoro.js/demo/.gitignore vendored Normal file
View File

@@ -0,0 +1,24 @@
# Logs
logs
*.log
npm-debug.log*
yarn-debug.log*
yarn-error.log*
pnpm-debug.log*
lerna-debug.log*
node_modules
dist
dist-ssr
*.local
# Editor directories and files
.vscode/*
!.vscode/extensions.json
.idea
.DS_Store
*.suo
*.ntvs*
*.njsproj
*.sln
*.sw?

59
kokoro.js/demo/README.md Normal file
View File

@@ -0,0 +1,59 @@
---
title: Kokoro Text-to-Speech
emoji: 🗣️
colorFrom: indigo
colorTo: purple
sdk: static
pinned: false
license: apache-2.0
short_description: High-quality speech synthesis powered by Kokoro TTS
header: mini
models:
- onnx-community/Kokoro-82M-ONNX
custom_headers:
cross-origin-embedder-policy: require-corp
cross-origin-opener-policy: same-origin
cross-origin-resource-policy: cross-origin
---
# Kokoro Text-to-Speech
A simple React + Vite application for running [Kokoro](https://github.com/hexgrad/kokoro), a frontier text-to-speech model for its size. The model runs 100% locally in the browser using [kokoro-js](https://www.npmjs.com/package/kokoro-js) and [🤗 Transformers.js](https://www.npmjs.com/package/@huggingface/transformers)!
## Getting Started
Follow the steps below to set up and run the application.
### 1. Clone the Repository
Clone the repository from GitHub:
```sh
git clone https://github.com/hexgrad/kokoro.git
```
### 2. Navigate to the Project Directory
Change your working directory to the `kokoro.js/demo` folder:
```sh
cd kokoro/kokoro.js/demo
```
### 3. Install Dependencies
Install the necessary dependencies using npm:
```sh
npm i
```
### 4. Run the Development Server
Start the development server:
```sh
npm run dev
```
The application should now be running locally. Open your browser and go to `http://localhost:5173` to see it in action.
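
A note on the `custom_headers` in the front matter: `cross-origin-opener-policy: same-origin` together with `cross-origin-embedder-policy: require-corp` makes the page cross-origin isolated, which browsers require before exposing `SharedArrayBuffer`, and so before multi-threaded WebAssembly can run. The sketch below shows one way to check for this at runtime. `pickThreadCount` is a hypothetical helper, and how the demo's runtime actually chooses its thread count is not shown in this commit.

```js
// Hypothetical helper (an assumption, not part of the demo): decide how many
// WASM threads are worth requesting in the given global scope.
function pickThreadCount(scope, maxThreads = 4) {
  const isolated = scope.crossOriginIsolated === true && typeof scope.SharedArrayBuffer !== "undefined";
  if (!isolated) return 1; // without isolation only single-threaded WASM is available
  const cores = scope.navigator?.hardwareConcurrency ?? 1;
  return Math.max(1, Math.min(maxThreads, cores));
}

// In a page or worker you would pass `self`; plain objects stand in for it here.
console.log(pickThreadCount({ crossOriginIsolated: false }));
console.log(pickThreadCount({ crossOriginIsolated: true, SharedArrayBuffer: function () {}, navigator: { hardwareConcurrency: 8 } }));
```

When serving the demo from another host, the same two headers need to be set there for the isolated branch to be taken.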

View File

@@ -0,0 +1,35 @@
import js from "@eslint/js";
import globals from "globals";
import react from "eslint-plugin-react";
import reactHooks from "eslint-plugin-react-hooks";
import reactRefresh from "eslint-plugin-react-refresh";
export default [
{ ignores: ["dist"] },
{
files: ["**/*.{js,jsx}"],
languageOptions: {
ecmaVersion: 2020,
globals: globals.browser,
parserOptions: {
ecmaVersion: "latest",
ecmaFeatures: { jsx: true },
sourceType: "module",
},
},
settings: { react: { version: "18.3" } },
plugins: {
react,
"react-hooks": reactHooks,
"react-refresh": reactRefresh,
},
rules: {
...js.configs.recommended.rules,
...react.configs.recommended.rules,
...react.configs["jsx-runtime"].rules,
...reactHooks.configs.recommended.rules,
"react/jsx-no-target-blank": "off",
"react-refresh/only-export-components": ["warn", { allowConstantExport: true }],
},
},
];

13
kokoro.js/demo/index.html Normal file
View File

@@ -0,0 +1,13 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<link rel="icon" type="image/svg+xml" href="/hf-logo.svg" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Kokoro Text-to-Speech</title>
</head>
<body>
<div id="root"></div>
<script type="module" src="/src/main.jsx"></script>
</body>
</html>

4680
kokoro.js/demo/package-lock.json generated Normal file

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,33 @@
{
"name": "kokoro-web",
"private": true,
"version": "0.0.0",
"type": "module",
"scripts": {
"dev": "vite",
"build": "vite build",
"lint": "eslint .",
"preview": "vite preview"
},
"dependencies": {
"kokoro-js": "file:..",
"motion": "^11.12.0",
"react": "^18.3.1",
"react-dom": "^18.3.1"
},
"devDependencies": {
"@eslint/js": "^9.15.0",
"@types/react": "^18.3.12",
"@types/react-dom": "^18.3.1",
"@vitejs/plugin-react": "^4.3.4",
"autoprefixer": "^10.4.20",
"eslint": "^9.15.0",
"eslint-plugin-react": "^7.37.2",
"eslint-plugin-react-hooks": "^5.0.0",
"eslint-plugin-react-refresh": "^0.4.14",
"globals": "^15.12.0",
"postcss": "^8.4.49",
"tailwindcss": "^3.4.15",
"vite": "^6.0.1"
}
}

View File

@@ -0,0 +1,6 @@
export default {
plugins: {
tailwindcss: {},
autoprefixer: {},
},
};

File diff suppressed because one or more lines are too long


View File

@@ -0,0 +1,9 @@
<svg xmlns="http://www.w3.org/2000/svg" width="1600" height="198">
<defs>
<linearGradient id="a" x1="50%" x2="50%" y1="-10.959%" y2="100%">
<stop stop-color="#57BBC1" stop-opacity=".25" offset="0%"/>
<stop stop-color="#015871" offset="100%"/>
</linearGradient>
</defs>
<path fill="url(#a)" fill-rule="evenodd" d="M.005 121C311 121 409.898-.25 811 0c400 0 500 121 789 121v77H0s.005-48 .005-77z" transform="matrix(-1 0 0 1 1600 0)"/>
</svg>


144
kokoro.js/demo/src/App.jsx Normal file
View File

@@ -0,0 +1,144 @@
import { useRef, useState, useEffect } from "react";
import { motion } from "motion/react";
export default function App() {
// Create a reference to the worker object.
const worker = useRef(null);
const [inputText, setInputText] = useState("Life is like a box of chocolates. You never know what you're gonna get.");
const [selectedSpeaker, setSelectedSpeaker] = useState("af");
const [status, setStatus] = useState(null);
const [error, setError] = useState(null);
const [loadingMessage, setLoadingMessage] = useState("Loading model (only downloaded once)...");
const [results, setResults] = useState([]);
// We use the `useEffect` hook to set up the worker as soon as the `App` component is mounted.
useEffect(() => {
// Create the worker if it does not yet exist.
worker.current ??= new Worker(new URL("./worker.js", import.meta.url), {
type: "module",
});
// Create a callback function for messages from the worker thread.
const onMessageReceived = (e) => {
switch (e.data.status) {
// TODO: WebGPU feature checking
// case "feature-success":
// break;
// case "feature-error":
// setError(e.data.data);
// break;
case "ready":
setStatus("ready");
break;
case "complete": {
// Block scope keeps the `const` declaration local to this case.
const { audio, text } = e.data;
// Generation complete: store the result and re-enable the "Generate" button
setResults((prev) => [{ text, src: audio }, ...prev]);
setStatus("ready");
break;
}
}
};
const onErrorReceived = (e) => {
console.error("Worker error:", e);
};
// Attach the callback function as an event listener.
worker.current.addEventListener("message", onMessageReceived);
worker.current.addEventListener("error", onErrorReceived);
// Define a cleanup function for when the component is unmounted.
return () => {
worker.current.removeEventListener("message", onMessageReceived);
worker.current.removeEventListener("error", onErrorReceived);
};
}, []);
const handleSubmit = (e) => {
e.preventDefault();
setStatus("running");
worker.current.postMessage({
type: "generate",
text: inputText.trim(),
voice: selectedSpeaker,
});
};
return (
<div className="relative w-full min-h-screen bg-gradient-to-br from-gray-900 to-gray-700 flex flex-col items-center justify-center p-4 overflow-hidden font-sans">
<motion.div initial={{ opacity: 1 }} animate={{ opacity: status === null ? 1 : 0 }} transition={{ duration: 0.5 }} className="absolute w-screen h-screen justify-center flex flex-col items-center z-10 bg-gray-800/95 backdrop-blur-md" style={{ pointerEvents: status === null ? "auto" : "none" }}>
<div className="w-[250px] h-[250px] border-4 border-white shadow-[0_0_0_5px_#4973ff] rounded-full overflow-hidden">
<div className="loading-wave"></div>
</div>
<p className={`text-3xl my-5 text-center ${error ? "text-red-500" : "text-white"}`}>{error ?? loadingMessage}</p>
</motion.div>
<div className="max-w-3xl w-full space-y-8 relative z-[2]">
<div className="text-center">
<h1 className="text-5xl font-extrabold text-gray-100 mb-2 drop-shadow-lg font-heading">Kokoro Text-to-Speech</h1>
<p className="text-2xl text-gray-300 font-semibold font-subheading">
Powered by&nbsp;
<a href="https://github.com/hexgrad/kokoro" target="_blank" rel="noreferrer" className="underline">
Kokoro
</a>
&nbsp;and&nbsp;
<a href="https://huggingface.co/docs/transformers.js" target="_blank" rel="noreferrer" className="underline">
<img width="40" src="hf-logo.svg" className="inline translate-y-[-2px] me-1"></img>Transformers.js
</a>
</p>
</div>
<div className="bg-gray-800/50 backdrop-blur-sm border border-gray-700 rounded-lg p-6">
<form onSubmit={handleSubmit} className="space-y-4">
<textarea placeholder="Enter text..." value={inputText} onChange={(e) => setInputText(e.target.value)} className="w-full min-h-[100px] max-h-[300px] bg-gray-700/50 backdrop-blur-sm border-2 border-gray-600 rounded-xl resize-y text-gray-100 placeholder-gray-400 px-3 py-2 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:border-transparent" rows={Math.min(8, inputText.split("\n").length)} />
<div className="flex flex-col items-center space-y-4">
<select value={selectedSpeaker} onChange={(e) => setSelectedSpeaker(e.target.value)} className="w-full bg-gray-700/50 backdrop-blur-sm border-2 border-gray-600 rounded-xl text-gray-100 px-3 py-2 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:border-transparent">
<option value="af">Default (American Female)</option>
<option value="af_bella">Bella (American Female)</option>
<option value="af_nicole">Nicole (American Female)</option>
<option value="af_sarah">Sarah (American Female)</option>
<option value="af_sky">Sky (American Female)</option>
<option value="am_adam">Adam (American Male)</option>
<option value="am_michael">Michael (American Male)</option>
<option value="bf_emma">Emma (British Female)</option>
<option value="bf_isabella">Isabella (British Female)</option>
<option value="bm_george">George (British Male)</option>
<option value="bm_lewis">Lewis (British Male)</option>
</select>
<button type="submit" className="inline-flex justify-center items-center px-6 py-2 text-lg font-semibold bg-gradient-to-t from-blue-600 to-purple-600 hover:from-blue-700 hover:to-purple-700 transition-colors duration-300 rounded-xl text-white disabled:opacity-50" disabled={status === "running" || inputText.trim() === ""}>
{status === "running" ? "Generating..." : "Generate"}
</button>
</div>
</form>
</div>
{results.length > 0 && (
<motion.div initial={{ y: 50, opacity: 0 }} animate={{ y: 0, opacity: 1 }} transition={{ duration: 0.5 }} className="max-h-[250px] overflow-y-auto px-2 mt-4 space-y-6 relative z-[2]">
{results.map((result, i) => (
<div key={i}>
<div className="text-white bg-gray-800/70 backdrop-blur-sm border border-gray-700 rounded-lg p-4 z-10">
<span className="absolute right-5 font-bold">#{results.length - i}</span>
<p className="mb-3 max-w-[95%]">{result.text}</p>
<audio controls src={result.src} className="w-full">
Your browser does not support the audio element.
</audio>
</div>
</div>
))}
</motion.div>
)}
</div>
<div className="bg-[#015871] pointer-events-none absolute left-0 w-full h-[5%] bottom-[-50px]">
<div className="wave"></div>
<div className="wave"></div>
</div>
</div>
);
}
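
The message handling in `onMessageReceived` amounts to a small state machine: `"ready"` unlocks the UI, and `"complete"` prepends a result and unlocks it again. A minimal sketch of the same logic as a pure function is shown below, which can be easier to test than the effect closure. `applyWorkerMessage` and the state shape are hypothetical and not part of the demo.

```js
// Hypothetical pure version of the `onMessageReceived` switch in App.jsx.
// `state` is assumed to look like { status, results }.
function applyWorkerMessage(state, data) {
  switch (data.status) {
    case "ready":
      return { ...state, status: "ready" };
    case "complete": {
      const { audio, text } = data;
      // Newest result first, mirroring `setResults((prev) => [..., ...prev])`.
      return { status: "ready", results: [{ text, src: audio }, ...state.results] };
    }
    default:
      return state; // unknown messages leave the state untouched
  }
}

let state = { status: null, results: [] };
state = applyWorkerMessage(state, { status: "ready" });
state = applyWorkerMessage(state, { status: "complete", audio: "blob:demo-1", text: "Hello" });
state = applyWorkerMessage(state, { status: "complete", audio: "blob:demo-2", text: "World" });
console.log(state);
```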

View File

@@ -0,0 +1,100 @@
@tailwind base;
@tailwind components;
@tailwind utilities;
/*
* Wave animations adapted from the following two demos:
* - https://codepen.io/upasanaasopa/pen/poObEWZ
* - https://codepen.io/breakstorm00/pen/qBJZQNB
*/
*,
*:before,
*:after {
margin: 0;
padding: 0;
box-sizing: border-box;
}
.loading-wave {
position: relative;
top: 0;
width: 100%;
height: 100%;
background: #2c74b3;
border-radius: 50%;
box-shadow: inset 0 0 50px 0 rgba(0, 0, 0, 0.5);
}
.loading-wave:before,
.loading-wave:after {
content: "";
position: absolute;
top: 0;
left: 50%;
width: 200%;
height: 200%;
background: black;
transform: translate(-50%, -75%);
}
.loading-wave:before {
border-radius: 45%;
background: rgba(255, 255, 255, 1);
animation: animate 5s linear infinite;
}
.loading-wave:after {
border-radius: 40%;
background: rgba(255, 255, 255, 0.5);
animation: animate 10s linear infinite;
}
.wave {
background: url(/wave.svg) repeat-x;
position: absolute;
top: -198px;
width: 6400px;
height: 198px;
animation: wave 7s cubic-bezier(0.36, 0.45, 0.63, 0.53) infinite;
transform: translate3d(0, 0, 0);
}
.wave:nth-of-type(2) {
top: -175px;
animation:
wave 7s cubic-bezier(0.36, 0.45, 0.63, 0.53) -0.125s infinite,
swell 7s ease -1.25s infinite;
opacity: 1;
}
@keyframes wave {
0% {
margin-left: 0;
}
100% {
margin-left: -1600px;
}
}
@keyframes swell {
0%,
100% {
transform: translate3d(0, -25px, 0);
}
50% {
transform: translate3d(0, 5px, 0);
}
}
@keyframes animate {
0% {
transform: translate(-50%, -75%) rotate(0deg);
}
100% {
transform: translate(-50%, -75%) rotate(360deg);
}
}

View File

@@ -0,0 +1,10 @@
import { StrictMode } from "react";
import { createRoot } from "react-dom/client";
import "./index.css";
import App from "./App.jsx";
createRoot(document.getElementById("root")).render(
<StrictMode>
<App />
</StrictMode>,
);

View File

@@ -0,0 +1,20 @@
import { KokoroTTS } from "kokoro-js";
const model_id = "onnx-community/Kokoro-82M-ONNX";
const tts = await KokoroTTS.from_pretrained(model_id, {
dtype: "q8", // Options: "fp32", "fp16", "q8", "q4", "q4f16"
});
self.postMessage({ status: "ready" });
// Listen for messages from the main thread
self.addEventListener("message", async (e) => {
const { text, voice } = e.data;
// Generate speech
const audio = await tts.generate(text, { voice });
// Send the audio file back to the main thread
const blob = audio.toBlob();
self.postMessage({ status: "complete", audio: URL.createObjectURL(blob), text });
});

View File

@@ -0,0 +1,8 @@
/** @type {import('tailwindcss').Config} */
export default {
content: ["./index.html", "./src/**/*.{js,ts,jsx,tsx}"],
theme: {
extend: {},
},
plugins: [],
};

View File

@@ -0,0 +1,12 @@
import { defineConfig } from "vite";
import react from "@vitejs/plugin-react";
// https://vite.dev/config/
export default defineConfig({
plugins: [react()],
worker: { format: "es" },
build: {
target: "esnext",
},
logLevel: process.env.NODE_ENV === "development" ? "error" : "info",
});

2972
kokoro.js/package-lock.json generated Normal file

File diff suppressed because it is too large Load Diff

65
kokoro.js/package.json Normal file
View File

@@ -0,0 +1,65 @@
{
"name": "kokoro-js",
"version": "1.0.1",
"type": "module",
"exports": {
"types": "./types/kokoro.d.ts",
"node": {
"import": "./dist/kokoro.js",
"require": "./dist/kokoro.cjs"
},
"default": "./dist/kokoro.web.js"
},
"scripts": {
"build": "rm -rf dist types && rollup -c && tsc && cp ../LICENSE LICENSE",
"format": "prettier --write . --print-width 1000",
"test": "vitest"
},
"keywords": [
"kokoro",
"tts",
"text-to-speech"
],
"author": {
"name": "hexgrad",
"email": "hello@hexgrad.com"
},
"browser": {
"path": false,
"fs/promises": false
},
"contributors": [
"Xenova"
],
"license": "Apache-2.0",
"description": "High-quality text-to-speech for the web",
"dependencies": {
"@huggingface/transformers": "^3.3.1",
"phonemizer": "^1.2.1"
},
"devDependencies": {
"@rollup/plugin-node-resolve": "^16.0.0",
"@rollup/plugin-terser": "^0.4.4",
"prettier": "3.4.2",
"rollup": "^4.30.1",
"typescript": "^5.7.3",
"vitest": "^2.1.8"
},
"files": [
"types",
"dist",
"voices",
"README.md",
"LICENSE"
],
"homepage": "https://github.com/hexgrad/kokoro",
"repository": {
"type": "git",
"url": "git+https://github.com/hexgrad/kokoro.git"
},
"publishConfig": {
"access": "public"
},
"jsdelivr": "./dist/kokoro.web.js",
"unpkg": "./dist/kokoro.web.js"
}

View File

@@ -0,0 +1,42 @@
import terser from "@rollup/plugin-terser";
import { nodeResolve } from "@rollup/plugin-node-resolve";
const plugins = (browser) => [nodeResolve({ browser }), terser({ format: { comments: false } })];
const OUTPUT_CONFIGS = [
// Node versions
{
file: "./dist/kokoro.cjs",
format: "cjs",
},
{
file: "./dist/kokoro.js",
format: "esm",
},
// Web version
{
file: "./dist/kokoro.web.js",
format: "esm",
},
];
const WEB_SPECIFIC_CONFIG = {
onwarn: (warning, warn) => {
if (!warning.message.includes("@huggingface/transformers")) warn(warning);
},
};
const NODE_SPECIFIC_CONFIG = {
external: ["@huggingface/transformers", "phonemizer"],
};
export default OUTPUT_CONFIGS.map((output) => {
const web = output.file.endsWith(".web.js");
return {
input: "./src/kokoro.js",
output,
plugins: plugins(web),
...(web ? WEB_SPECIFIC_CONFIG : NODE_SPECIFIC_CONFIG),
};
});

90
kokoro.js/src/kokoro.js Normal file
View File

@@ -0,0 +1,90 @@
import { StyleTextToSpeech2Model, AutoTokenizer, Tensor, RawAudio } from "@huggingface/transformers";
import { phonemize } from "./phonemize.js";
import { getVoiceData, VOICES } from "./voices.js";
const STYLE_DIM = 256;
const SAMPLE_RATE = 24000;
export class KokoroTTS {
/**
* Create a new KokoroTTS instance.
* @param {import('@huggingface/transformers').StyleTextToSpeech2Model} model The model
* @param {import('@huggingface/transformers').PreTrainedTokenizer} tokenizer The tokenizer
*/
constructor(model, tokenizer) {
this.model = model;
this.tokenizer = tokenizer;
}
/**
* Load a KokoroTTS model from the Hugging Face Hub.
* @param {string} model_id The model id
* @param {Object} options Additional options
* @param {"fp32"|"fp16"|"q8"|"q4"|"q4f16"} [options.dtype="fp32"] The data type to use.
* @param {"wasm"|"webgpu"|"cpu"|null} [options.device=null] The device to run the model on.
* @param {import("@huggingface/transformers").ProgressCallback} [options.progress_callback=null] A callback function that is called with progress information.
* @returns {Promise<KokoroTTS>} The loaded model
*/
static async from_pretrained(model_id, { dtype = "fp32", device = null, progress_callback = null } = {}) {
const model = StyleTextToSpeech2Model.from_pretrained(model_id, { progress_callback, dtype, device });
const tokenizer = AutoTokenizer.from_pretrained(model_id, { progress_callback });
const info = await Promise.all([model, tokenizer]);
return new KokoroTTS(...info);
}
get voices() {
return VOICES;
}
list_voices() {
console.table(VOICES);
}
/**
* Generate audio from text.
*
* Note: The voice data is loaded on the first call for a given voice and cached for subsequent calls.
* @param {string} text The input text
* @param {Object} options Additional options
* @param {keyof typeof VOICES} [options.voice="af"] The voice style to use
* @param {number} [options.speed=1] The speaking speed
* @returns {Promise<RawAudio>} The generated audio
*/
async generate(text, { voice = "af", speed = 1 } = {}) {
if (!VOICES.hasOwnProperty(voice)) {
console.error(`Voice "${voice}" not found. Available voices:`);
console.table(VOICES);
throw new Error(`Voice "${voice}" not found. Should be one of: ${Object.keys(VOICES).join(", ")}.`);
}
const language = voice.at(0); // "a" or "b"
const phonemes = await phonemize(text, language);
const { input_ids } = this.tokenizer(phonemes, {
truncation: true,
});
// Select voice style based on number of input tokens
const num_tokens = Math.max(
input_ids.dims.at(-1) - 2, // Without padding;
0,
);
// Load voice style
const data = await getVoiceData(voice);
const offset = num_tokens * STYLE_DIM;
const voiceData = data.slice(offset, offset + STYLE_DIM);
// Prepare model inputs
const inputs = {
input_ids,
style: new Tensor("float32", voiceData, [1, STYLE_DIM]),
speed: new Tensor("float32", [speed], [1]),
};
// Generate audio
const { waveform } = await this.model(inputs);
return new RawAudio(waveform.data, SAMPLE_RATE);
}
}
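
The style lookup in `generate` treats each voice file as a flat `Float32Array` of consecutive `STYLE_DIM`-length rows and picks the row indexed by the token count (the number of input ids minus 2, which the source comment describes as "without padding"). A minimal sketch with synthetic data follows. `selectStyle` is a hypothetical name, and the bounds check is an addition for illustration that the source does not have.

```js
const STYLE_DIM = 256; // same constant as in src/kokoro.js

// Hypothetical helper mirroring the offset arithmetic in `generate`.
function selectStyle(voiceData, numInputIds) {
  const numTokens = Math.max(numInputIds - 2, 0);
  const offset = numTokens * STYLE_DIM;
  // Assumption: fail loudly instead of returning a short (or empty) slice.
  if (offset + STYLE_DIM > voiceData.length) {
    throw new RangeError(`No style row for ${numTokens} tokens`);
  }
  return voiceData.slice(offset, offset + STYLE_DIM);
}

// Synthetic voice table with 4 rows, where row r is filled with the value r.
const rows = 4;
const fakeVoice = new Float32Array(rows * STYLE_DIM);
for (let r = 0; r < rows; r++) {
  fakeVoice.fill(r, r * STYLE_DIM, (r + 1) * STYLE_DIM);
}

const style = selectStyle(fakeVoice, 5); // 5 input ids -> 3 tokens -> row 3
console.log(style.length, style[0]);
```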

197
kokoro.js/src/phonemize.js Normal file
View File

@@ -0,0 +1,197 @@
import { phonemize as espeakng } from "phonemizer";
/**
* Helper function to split a string on a regex, but keep the delimiters.
* This is required, because the JavaScript `.split()` method does not keep the delimiters,
* and wrapping in a capturing group causes issues with existing capturing groups (due to nesting).
* @param {string} text The text to split.
* @param {RegExp} regex The regex to split on.
* @returns {{match: boolean; text: string}[]} The split string.
*/
function split(text, regex) {
const result = [];
let prev = 0;
for (const match of text.matchAll(regex)) {
const fullMatch = match[0];
if (prev < match.index) {
result.push({ match: false, text: text.slice(prev, match.index) });
}
if (fullMatch.length > 0) {
result.push({ match: true, text: fullMatch });
}
prev = match.index + fullMatch.length;
}
if (prev < text.length) {
result.push({ match: false, text: text.slice(prev) });
}
return result;
}
/**
* Helper function to split numbers into phonetic equivalents
* @param {string} match The matched number
* @returns {string} The phonetic equivalent
*/
function split_num(match) {
if (match.includes(".")) {
return match;
} else if (match.includes(":")) {
let [h, m] = match.split(":").map(Number);
if (m === 0) {
return `${h} o'clock`;
} else if (m < 10) {
return `${h} oh ${m}`;
}
return `${h} ${m}`;
}
let year = parseInt(match.slice(0, 4), 10);
if (year < 1100 || year % 1000 < 10) {
return match;
}
let left = match.slice(0, 2);
let right = parseInt(match.slice(2, 4), 10);
let suffix = match.endsWith("s") ? "s" : "";
if (year % 1000 >= 100 && year % 1000 <= 999) {
if (right === 0) {
return `${left} hundred${suffix}`;
} else if (right < 10) {
return `${left} oh ${right}${suffix}`;
}
}
return `${left} ${right}${suffix}`;
}
/**
* Helper function to format monetary values
* @param {string} match The matched currency
* @returns {string} The formatted currency
*/
function flip_money(match) {
const bill = match[0] === "$" ? "dollar" : "pound";
if (isNaN(Number(match.slice(1)))) {
return `${match.slice(1)} ${bill}s`;
} else if (!match.includes(".")) {
let suffix = match.slice(1) === "1" ? "" : "s";
return `${match.slice(1)} ${bill}${suffix}`;
}
const [b, c] = match.slice(1).split(".");
const d = parseInt(c.padEnd(2, "0"), 10);
let coins = match[0] === "$" ? (d === 1 ? "cent" : "cents") : d === 1 ? "penny" : "pence";
return `${b} ${bill}${b === "1" ? "" : "s"} and ${d} ${coins}`;
}
/**
* Helper function to process decimal numbers
* @param {string} match The matched number
* @returns {string} The formatted number
*/
function point_num(match) {
let [a, b] = match.split(".");
return `${a} point ${b.split("").join(" ")}`;
}
/**
* Normalize text for phonemization
* @param {string} text The text to normalize
* @returns {string} The normalized text
*/
function normalize_text(text) {
return (
text
// 1. Handle quotes and brackets
.replace(/[‘’]/g, "'")
.replace(/«/g, "“")
.replace(/»/g, "”")
.replace(/[“”]/g, '"')
.replace(/\(/g, "«")
.replace(/\)/g, "»")
// 2. Replace uncommon punctuation marks
.replace(/、/g, ", ")
.replace(/。/g, ". ")
.replace(/！/g, "! ")
.replace(/，/g, ", ")
.replace(/：/g, ": ")
.replace(/；/g, "; ")
.replace(/？/g, "? ")
// 3. Whitespace normalization
.replace(/[^\S \n]/g, " ")
.replace(/ {2,}/g, " ")
.replace(/(?<=\n) +(?=\n)/g, "")
// 4. Abbreviations
.replace(/\bD[Rr]\.(?= [A-Z])/g, "Doctor")
.replace(/\b(?:Mr\.|MR\.(?= [A-Z]))/g, "Mister")
.replace(/\b(?:Ms\.|MS\.(?= [A-Z]))/g, "Miss")
.replace(/\b(?:Mrs\.|MRS\.(?= [A-Z]))/g, "Mrs")
.replace(/\betc\.(?! [A-Z])/gi, "etc")
// 5. Normalize casual words
.replace(/\b(y)eah?\b/gi, "$1e'a")
// 6. Handle numbers and currencies
.replace(/\d*\.\d+|\b\d{4}s?\b|(?<!:)\b(?:[1-9]|1[0-2]):[0-5]\d\b(?!:)/g, split_num)
.replace(/(?<=\d),(?=\d)/g, "")
.replace(/[$£]\d+(?:\.\d+)?(?: hundred| thousand| (?:[bm]|tr)illion)*\b|[$£]\d+\.\d\d?\b/gi, flip_money)
.replace(/\d*\.\d+/g, point_num)
.replace(/(?<=\d)-(?=\d)/g, " to ")
.replace(/(?<=\d)S/g, " S")
// 7. Handle possessives
.replace(/(?<=[BCDFGHJ-NP-TV-Z])'?s\b/g, "'S")
.replace(/(?<=X')S\b/g, "s")
// 8. Handle hyphenated words/letters
.replace(/(?:[A-Za-z]\.){2,} [a-z]/g, (m) => m.replace(/\./g, "-"))
.replace(/(?<=[A-Z])\.(?=[A-Z])/gi, "-")
// 9. Strip leading and trailing whitespace
.trim()
);
}
/**
* Escapes regular expression special characters from a string by replacing them with their escaped counterparts.
*
* @param {string} string The string to escape.
* @returns {string} The escaped string.
*/
function escapeRegExp(string) {
return string.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"); // $& means the whole matched string
}
const PUNCTUATION = ';:,.!?¡¿—…"«»“”(){}[]';
const PUNCTUATION_PATTERN = new RegExp(`(\\s*[${escapeRegExp(PUNCTUATION)}]+\\s*)+`, "g");
export async function phonemize(text, language = "a", norm = true) {
// 1. Normalize text
if (norm) {
text = normalize_text(text);
}
// 2. Split into chunks, to ensure we preserve punctuation
const sections = split(text, PUNCTUATION_PATTERN);
// 3. Convert each section to phonemes
const lang = language === "a" ? "en-us" : "en";
const ps = (await Promise.all(sections.map(async ({ match, text }) => (match ? text : (await espeakng(text, lang)).join(" "))))).join("");
// 4. Post-process phonemes
let processed = ps
// https://en.wiktionary.org/wiki/kokoro#English
.replace(/kəkˈoːɹoʊ/g, "kˈoʊkəɹoʊ")
.replace(/kəkˈɔːɹəʊ/g, "kˈəʊkəɹəʊ")
.replace(/ʲ/g, "j")
.replace(/r/g, "ɹ")
.replace(/x/g, "k")
.replace(/ɬ/g, "l")
.replace(/(?<=[a-zɹː])(?=hˈʌndɹɪd)/g, " ")
.replace(/ z(?=[;:,.!?¡¿—…"«»“” ]|$)/g, "z");
// 5. Additional post-processing for American English
if (language === "a") {
processed = processed.replace(/(?<=nˈaɪn)ti(?!ː)/g, "di");
}
return processed.trim();
}
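
Step 2 of `phonemize` depends on the `split` helper keeping punctuation as separate sections so that only the text between them is sent to the phonemizer. A minimal sketch of that chunking follows. It uses a reduced punctuation set, and an uppercase transform stands in for the eSpeak NG call; both are simplifications for illustration.

```js
// Same algorithm as `split` in src/phonemize.js: split on a global regex,
// but keep the delimiters as their own { match: true } sections.
function splitKeepingDelimiters(text, regex) {
  const result = [];
  let prev = 0;
  for (const match of text.matchAll(regex)) {
    const fullMatch = match[0];
    if (prev < match.index) {
      result.push({ match: false, text: text.slice(prev, match.index) });
    }
    if (fullMatch.length > 0) {
      result.push({ match: true, text: fullMatch });
    }
    prev = match.index + fullMatch.length;
  }
  if (prev < text.length) {
    result.push({ match: false, text: text.slice(prev) });
  }
  return result;
}

// Reduced punctuation set (assumption: the real pattern covers more marks).
const PUNCTUATION = /(\s*[;:,.!?]+\s*)+/g;
const sections = splitKeepingDelimiters("Hello, world! How are you?", PUNCTUATION);

// Stand-in for phonemization: transform only the non-punctuation sections.
const output = sections.map(({ match, text }) => (match ? text : text.toUpperCase())).join("");
console.log(sections.length, output);
```

Because the punctuation sections are passed through untouched, the output keeps the original punctuation and spacing around it.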

121
kokoro.js/src/voices.js Normal file
View File

@@ -0,0 +1,121 @@
import path from "path";
import fs from "fs/promises";
export const VOICES = Object.freeze({
af: {
// Default voice is a 50-50 mix of Bella & Sarah
name: "Default",
language: "en-us",
gender: "Female",
},
af_bella: {
name: "Bella",
language: "en-us",
gender: "Female",
},
af_nicole: {
name: "Nicole",
language: "en-us",
gender: "Female",
},
af_sarah: {
name: "Sarah",
language: "en-us",
gender: "Female",
},
af_sky: {
name: "Sky",
language: "en-us",
gender: "Female",
},
am_adam: {
name: "Adam",
language: "en-us",
gender: "Male",
},
am_michael: {
name: "Michael",
language: "en-us",
gender: "Male",
},
bf_emma: {
name: "Emma",
language: "en-gb",
gender: "Female",
},
bf_isabella: {
name: "Isabella",
language: "en-gb",
gender: "Female",
},
bm_george: {
name: "George",
language: "en-gb",
gender: "Male",
},
bm_lewis: {
name: "Lewis",
language: "en-gb",
gender: "Male",
},
});
const VOICE_DATA_URL = "https://huggingface.co/onnx-community/Kokoro-82M-ONNX/resolve/main/voices";
/**
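* Load the raw style data for a voice: from the bundled `voices/` folder when `fs` is available (Node.js), otherwise from the Hugging Face Hub, using the Cache API when possible.
* @param {keyof typeof VOICES} id The voice id
* @returns {Promise<ArrayBufferLike>} The raw bytes of the voice file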
*/
async function getVoiceFile(id) {
if (fs?.readFile) {
const file = path.resolve(import.meta.dirname ?? __dirname, `../voices/${id}.bin`);
const { buffer } = await fs.readFile(file);
return buffer;
}
const url = `${VOICE_DATA_URL}/${id}.bin`;
let cache;
try {
cache = await caches.open("kokoro-voices");
const cachedResponse = await cache.match(url);
if (cachedResponse) {
return await cachedResponse.arrayBuffer();
}
} catch (e) {
console.warn("Unable to open cache", e);
}
// No cache, or cache failed to open. Fetch the file.
const response = await fetch(url);
const buffer = await response.arrayBuffer();
if (cache) {
try {
// NOTE: We use `new Response(buffer, ...)` instead of `response.clone()` to handle LFS files
await cache.put(
url,
new Response(buffer, {
headers: response.headers,
}),
);
} catch (e) {
console.warn("Unable to cache file", e);
}
}
return buffer;
}
const VOICE_CACHE = new Map();
export async function getVoiceData(voice) {
if (VOICE_CACHE.has(voice)) {
return VOICE_CACHE.get(voice);
}
const buffer = new Float32Array(await getVoiceFile(voice));
VOICE_CACHE.set(voice, buffer);
return buffer;
}
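
`getVoiceData` caches the loaded `Float32Array` once the load has finished, so two calls that start before the first one resolves would each trigger a load. One possible variation, sketched below under that reading of the code, is to cache the pending promise instead. `memoizeAsync` and `fakeLoad` are hypothetical names for illustration only.

```js
// Hypothetical helper: memoize an async loader by key, caching the promise
// itself so that concurrent callers share a single in-flight load.
function memoizeAsync(loader) {
  const cache = new Map();
  return (key) => {
    let pending = cache.get(key);
    if (!pending) {
      pending = loader(key);
      // Forget failed loads so that a later call can retry.
      pending.catch(() => cache.delete(key));
      cache.set(key, pending);
    }
    return pending;
  };
}

// Stand-in for reading or fetching a voice file.
let loads = 0;
const fakeLoad = async (id) => {
  loads += 1;
  return new Float32Array(4);
};

const getVoice = memoizeAsync(fakeLoad);
const first = getVoice("af_bella");
const second = getVoice("af_bella"); // reuses the in-flight promise
console.log(loads, first === second);
```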

View File

@@ -0,0 +1,95 @@
import { describe, test, expect } from "vitest";
import { phonemize } from "../src/phonemize.js";
const A_TEST_CASES = new Map([
["Hello", "həlˈoʊ"],
["Test and Example", "tˈɛst ænd ɛɡzˈæmpəl"],
["«Bonjour»", '"bɔːˈʊɹ"'],
["«Test «nested» quotes»", '"tˈɛst "nˈɛstᵻd" kwˈoʊts"'],
["(Hello)", "«həlˈoʊ»"],
["(Nested (Parentheses))", "«nˈɛstᵻd «pɚɹˈɛnθəsˌiːz»»"],
["こんにちは、世界!", "dʒˈæpəniːzlˌɛɾɚ dʒˈæpəniːzlˌɛɾɚ dʒˈæpəniːzlˌɛɾɚ dʒˈæpəniːzlˌɛɾɚ dʒˈæpəniːzlˌɛɾɚ, tʃˈaɪniːzlˌɛɾɚ tʃˈaɪniːzlˌɛɾɚ!"],
["これはテストです:はい?", "dʒˈæpəniːzlˌɛɾɚ dʒˈæpəniːzlˌɛɾɚ dʒˈæpəniːzlˌɛɾɚ dʒˈæpəniːzlˌɛɾɚ dʒˈæpəniːzlˌɛɾɚ dʒˈæpəniːzlˌɛɾɚ dʒˈæpəniːzlˌɛɾɚ dʒˈæpəniːzlˌɛɾɚ: dʒˈæpəniːzlˌɛɾɚ dʒˈæpəniːzlˌɛɾɚ?"],
["Hello  World", "həlˈoʊ wˈɜːld"],
["Hello   World", "həlˈoʊ wˈɜːld"],
["Hello\n \nWorld", "həlˈoʊ wˈɜːld"],
["Dr. Smith", "dˈɑːktɚ smˈɪθ"],
["DR. Brown", "dˈɑːktɚ bɹˈaʊn"],
["Mr. Smith", "mˈɪstɚ smˈɪθ"],
["MR. Anderson", "mˈɪstɚɹ ˈændɚsən"],
["Ms. Taylor", "mˈɪs tˈeɪlɚ"],
["MS. Carter", "mˈɪs kˈɑːɹɾɚ"],
["Mrs. Johnson", "mˈɪsɪz dʒˈɑːnsən"],
["MRS. Wilson", "mˈɪsɪz wˈɪlsən"],
["Apples, oranges, etc.", "ˈæpəlz, ˈɔɹɪndʒᵻz, ɛtsˈɛtɹə"],
["Apples, etc. Pears.", "ˈæpəlz, ɛtsˈɛtɹə. pˈɛɹz."],
["Yeah", "jˈɛə"],
["yeah", "jˈɛə"],
["1990", "nˈaɪntiːn nˈaɪndi"],
["12:34", "twˈɛlv θˈɜːɾi fˈoːɹ"],
["2022s", "twˈɛnti twˈɛnti tˈuːz"],
["1,000", "wˈʌn θˈaʊzənd"],
["12,345,678", "twˈɛlv mˈɪliən θɹˈiː hˈʌndɹɪd fˈoːɹɾi fˈaɪv θˈaʊzənd sˈɪks hˈʌndɹɪd sˈɛvənti ˈeɪt"],
["$100", "wˈʌn hˈʌndɹɪd dˈɑːlɚz"],
["£1.50", "wˈʌn pˈaʊnd ænd fˈɪfti pˈɛns"],
["12.34", "twˈɛlv pˈɔɪnt θɹˈiː fˈoːɹ"],
["0.01", "zˈiəɹoʊ pˈɔɪnt zˈiəɹoʊ wˈʌn"],
["10-20", "tˈɛn tə twˈɛnti"],
["5-10", "fˈaɪv tə tˈɛn"],
["10S", "tˈɛn ˈɛs"],
["5S", "fˈaɪv ˈɛs"],
["Cat's tail", "kˈæts tˈeɪl"],
["X's mark", "ˈɛksᵻz mˈɑːɹk"],
["U.S.A.", "jˈuːˈɛsˈeɪ."],
["A.B.C", "ˈeɪbˈiːsˈiː"],
]);
const B_TEST_CASES = new Map([
["Hello", "həlˈəʊ"],
["Test and Example", "tˈɛst and ɛɡzˈampəl"],
["«Bonjour»", '"bɔːˈʊə"'],
["«Test «nested» quotes»", '"tˈɛst "nˈɛstɪd" kwˈəʊts"'],
["(Hello)", "«həlˈəʊ»"],
["(Nested (Parentheses))", "«nˈɛstɪd «pəɹˈɛnθəsˌiːz»»"],
["こんにちは、世界!", "dʒˈapəniːzlˌɛtə dʒˈapəniːzlˌɛtə dʒˈapəniːzlˌɛtə dʒˈapəniːzlˌɛtə dʒˈapəniːzlˌɛtə, tʃˈaɪniːzlˌɛtə tʃˈaɪniːzlˌɛtə!"],
["これはテストです:はい?", "dʒˈapəniːzlˌɛtə dʒˈapəniːzlˌɛtə dʒˈapəniːzlˌɛtə dʒˈapəniːzlˌɛtə dʒˈapəniːzlˌɛtə dʒˈapəniːzlˌɛtə dʒˈapəniːzlˌɛtə dʒˈapəniːzlˌɛtə: dʒˈapəniːzlˌɛtə dʒˈapəniːzlˌɛtə?"],
["Hello  World", "həlˈəʊ wˈɜːld"],
["Hello   World", "həlˈəʊ wˈɜːld"],
["Hello\n \nWorld", "həlˈəʊ wˈɜːld"],
["Dr. Smith", "dˈɒktə smˈɪθ"],
["DR. Brown", "dˈɒktə bɹˈaʊn"],
["Mr. Smith", "mˈɪstə smˈɪθ"],
["MR. Anderson", "mˈɪstəɹ ˈandəsən"],
["Ms. Taylor", "mˈɪs tˈeɪlə"],
["MS. Carter", "mˈɪs kˈɑːtə"],
["Mrs. Johnson", "mˈɪsɪz dʒˈɒnsən"],
["Apples, oranges, etc.", "ˈapəlz, ˈɒɹɪndʒɪz, ɛtsˈɛtɹə"],
["Apples, etc. Pears.", "ˈapəlz, ɛtsˈɛtɹə. pˈeəz."],
["1990", "nˈaɪntiːn nˈaɪnti"],
["12:34", "twˈɛlv θˈɜːti fˈɔː"],
["1,000", "wˈɒn θˈaʊzənd"],
["12,345,678", "twˈɛlv mˈɪliən θɹˈiː hˈʌndɹɪdən fˈɔːti fˈaɪv θˈaʊzənd sˈɪks hˈʌndɹɪdən sˈɛvənti ˈeɪt"],
["$100", "wˈɒn hˈʌndɹɪd dˈɒləz"],
["£1.50", "wˈɒn pˈaʊnd and fˈɪfti pˈɛns"],
["12.34", "twˈɛlv pˈɔɪnt θɹˈiː fˈɔː"],
["0.01", "zˈiəɹəʊ pˈɔɪnt zˈiəɹəʊ wˈɒn"],
["Cat's tail", "kˈats tˈeɪl"],
["X's mark", "ˈɛksɪz mˈɑːk"],
]);
describe("phonemize", () => {
describe("en-us", () => {
for (const [input, expected] of A_TEST_CASES) {
test(`phonemize("${input}")`, async () => {
expect(await phonemize(input)).toEqual(expected);
});
}
});
describe("en-gb", () => {
for (const [input, expected] of B_TEST_CASES) {
test(`phonemize("${input}")`, async () => {
expect(await phonemize(input, "b")).toEqual(expected);
});
}
});
});
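
The currency cases above (`$100`, `£1.50`) only show the final phonemes. The intermediate text produced by the money rule can be inspected on its own, as in the sketch below, which restates `flip_money` from `src/phonemize.js` with renamed variables and applies the same regular expression. The extra inputs `$3 million` and `$1` are illustrative and are not part of the test suite.

```js
// Restatement of `flip_money` from src/phonemize.js (variable names changed).
function flipMoney(match) {
  const bill = match[0] === "$" ? "dollar" : "pound";
  const amount = match.slice(1);
  if (Number.isNaN(Number(amount))) {
    // e.g. "3 million": keep the words and pluralise the currency
    return `${amount} ${bill}s`;
  }
  if (!amount.includes(".")) {
    return `${amount} ${bill}${amount === "1" ? "" : "s"}`;
  }
  const [whole, fraction] = amount.split(".");
  const minor = parseInt(fraction.padEnd(2, "0"), 10);
  const coins = match[0] === "$" ? (minor === 1 ? "cent" : "cents") : minor === 1 ? "penny" : "pence";
  return `${whole} ${bill}${whole === "1" ? "" : "s"} and ${minor} ${coins}`;
}

// Same pattern as the currency rule in `normalize_text`.
const MONEY = /[$£]\d+(?:\.\d+)?(?: hundred| thousand| (?:[bm]|tr)illion)*\b|[$£]\d+\.\d\d?\b/gi;
const normalized = ["$100", "£1.50", "$3 million", "$1"].map((s) => s.replace(MONEY, flipMoney));
console.log(normalized);
```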

16
kokoro.js/tsconfig.json Normal file
View File

@@ -0,0 +1,16 @@
{
"include": ["src/**/*"],
"compilerOptions": {
"checkJs": true,
"target": "esnext",
"module": "nodenext",
"moduleResolution": "nodenext",
"outDir": "types",
"strict": false,
"skipLibCheck": true,
"declaration": true,
"declarationMap": true,
"noEmit": false,
"emitDeclarationOnly": true
}
}

BIN
kokoro.js/voices/af.bin Normal file

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

BIN
kokoro.js/voices/af_sky.bin Normal file

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.