A TTS program for Linux that uses Piper for voice synthesis

miTTenS

Art by Sharaa Yippie
For Linux, and Linux only.

This is an application for controlling Piper with your keyboard. It integrates Piper with your desktop so that you have more control over it, and it works via your clipboard. It is intended for visually disabled people, such as myself, as an aid to both writing and reading. It has language detection to automatically change the language depending on the text.

Piper sounds natural and synthesizes quickly for the quality it gives. With en_US-danny-low.json, this entire README took about 9 seconds to synthesize. That is quite fast, although synthesizing whole documents isn't the intended use case for the tool. The paragraph above took 0.25s to synthesize.

This project's structure is generic enough to allow alternate TTS models to be integrated in the future.

For Developers

Structure

miTTenS works over UDP and takes user input over a port. It is event-based and uses the following packet structure:

All numbers are encoded in big endian.
---
| Length (U16) | Command (U8) | Body (Size = Length, ASCII) |
---
e.g.
---
| 0x00 0x05 | 0x01 | hello |
---

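Or the equivalent in Nim:

proc createStringForSending(a: string, operation: uint8): string =
  ## Builds a packet: Length (U16, big endian) | Command (U8) | Body (ASCII).
  # a.len is the body size in bytes, matching the Length field above.
  let length = a.len.uint16
  # Emit the high byte first so the length is big endian on any host.
  result.add chr(int(length shr 8))
  result.add chr(int(length and 0xFF))
  result.add chr(int(operation))
  result.add a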
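For clients written in other languages, the same layout can be sketched in Python. This is a minimal sketch that follows the documented layout (Length equal to the body size, big endian). The function names `encode_packet` and `send_packet` are made up for illustration, and the host and port are placeholders, since this README does not state which port miTTenS listens on.

```python
import socket
import struct


def encode_packet(command: int, body: str) -> bytes:
    """Build Length (U16, big endian) | Command (U8) | Body (ASCII)."""
    payload = body.encode("ascii")
    # ">HB" = big-endian unsigned 16-bit length, then unsigned 8-bit command.
    return struct.pack(">HB", len(payload), command) + payload


def send_packet(command: int, body: str, host: str, port: int) -> None:
    """Send one packet over UDP. host and port are placeholders (assumption)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(encode_packet(command, body), (host, port))
```

With this sketch, `encode_packet(1, "hello")` produces the bytes 0x00 0x05 0x01 followed by `hello`, matching the example above.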

The following commands exist:

- PLAY_MESSAGE = 1 --- Queues the message, in order.
- PLAY_MESSAGE_PRIORITY = 2 --- Interrupts the currently playing message and clears the queue.
- SET_LENGTH_SCALE = 3 (the body must be a big endian float32) --- Sets the current length scale.
- STOP_PLAYING = 4 --- Stops playing the current message.
- RESUME = 5 --- Continues playing the message from where it was stopped.
- SET_MODEL = 6 --- Sets the current model to the rowid in the internal database.
- INC_MODEL = 7 --- Moves to the next available model in the internal database.
- DEC_MODEL = 8 --- Moves to the previous available model in the internal database.
- SCRUB_FORWARD = 9 --- Skips forward in time in the current message. The amount skipped is the interval set in the database, which is controllable from the webcfg frontend.
- SCRUB_BACKWARDS = 10 --- See 9. Skips backward in time.
- RESTART = 11 --- Sets the time to 0, restarting the message.
- INC_LENGTH_SCALE = 12 --- Slows down the message by the amount set in the database, which is configurable from the webcfg frontend. (Requires resynthesis.)
- DEC_LENGTH_SCALE = 13 --- See 12. Speeds up the message. (Requires resynthesis.)
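The command numbers listed above can be collected into an enum on the client side. The sketch below also shows one way to build a SET_LENGTH_SCALE packet with a big endian float32 body. The names `Command` and `encode_set_length_scale` are illustrative, and the sketch assumes the Length field counts the 4 bytes of the float, which the list above does not state explicitly.

```python
import struct
from enum import IntEnum


class Command(IntEnum):
    """Command bytes as listed in this README."""
    PLAY_MESSAGE = 1
    PLAY_MESSAGE_PRIORITY = 2
    SET_LENGTH_SCALE = 3
    STOP_PLAYING = 4
    RESUME = 5
    SET_MODEL = 6
    INC_MODEL = 7
    DEC_MODEL = 8
    SCRUB_FORWARD = 9
    SCRUB_BACKWARDS = 10
    RESTART = 11
    INC_LENGTH_SCALE = 12
    DEC_LENGTH_SCALE = 13


def encode_set_length_scale(scale: float) -> bytes:
    """Build a SET_LENGTH_SCALE packet with a big-endian float32 body."""
    body = struct.pack(">f", scale)  # 4 bytes, most significant byte first
    # Assumption: Length is the size of the float body, i.e. 4.
    return struct.pack(">HB", len(body), Command.SET_LENGTH_SCALE) + body
```

For example, a scale of 1.5 is encoded as the float32 bytes 0x3F 0xC0 0x00 0x00, preceded by the length 0x00 0x04 and the command byte 0x03.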

Configuration

There is a custom file format for configuring keybinds. The keybind goes on the left side, and the command goes on the right.

#comment
Alt_L, Super_L, A -> PLAY_MESSAGE_PRIORITY
Alt_L, Super_L, X -> STOP_PLAYING
Alt_L, Super_L, C -> RESTART
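To illustrate how such a file could be read, here is a minimal Python sketch of a parser for the lines above. It is not the parser miTTenS uses. The grammar is inferred from the example only (comma-separated keys, `->`, a command name, and `#` starting a comment line), and the name `parse_keybinds` is hypothetical.

```python
def parse_keybinds(text: str) -> list[tuple[tuple[str, ...], str]]:
    """Parse 'Key, Key, Key -> COMMAND' lines into (keys, command) pairs."""
    binds = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines (assumed to start with '#').
        if not line or line.startswith("#"):
            continue
        keys_part, arrow, command = line.partition("->")
        if not arrow:
            raise ValueError(f"missing '->' in line: {raw!r}")
        keys = tuple(k.strip() for k in keys_part.split(",") if k.strip())
        binds.append((keys, command.strip()))
    return binds
```

Applied to the example file, the first entry would be `(("Alt_L", "Super_L", "A"), "PLAY_MESSAGE_PRIORITY")`.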

In addition, there is a webcfg, which should be accessible via screen readers.

Thread Structure

miTTenS follows an event-based system for managing state. The text in the above image:

Threads
- Network Thread: reads from the specified UDP port and sends the raw messages on to the Event Processing Thread.
- Event Processing Thread: processes each event and sends multiple events to manage the state of the synthesis thread. It is a separate thread so that audio can play while requests are accepted.
- Synthesis and message playing Thread: handles the Piper process and audio playing. It is manipulated by the other threads. It has to be on the original thread due to memory safety issues in ONNX Runtime, which Piper uses.
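The hand-off between these three threads can be sketched with two queues. The Python below is only an illustration of the pattern, not the project's code. A list of raw packets stands in for the UDP socket, the main thread stands in for the synthesis thread, and all function names are hypothetical.

```python
import queue
import threading


def network_thread(raw_packets, events):
    """Stand-in for the UDP reader: forwards raw packets unchanged."""
    for packet in raw_packets:
        events.put(packet)
    events.put(None)  # sentinel: no more input


def event_thread(events, synth_jobs):
    """Decodes each raw packet and forwards a job to the synthesis side."""
    while True:
        packet = events.get()
        if packet is None:
            synth_jobs.put(None)
            break
        length = int.from_bytes(packet[:2], "big")
        command = packet[2]
        body = packet[3:3 + length].decode("ascii")
        synth_jobs.put((command, body))


def run_pipeline(raw_packets):
    """Runs both worker threads; the calling thread acts as the synthesizer."""
    events, synth_jobs = queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=network_thread, args=(raw_packets, events)),
        threading.Thread(target=event_thread, args=(events, synth_jobs)),
    ]
    for worker in workers:
        worker.start()
    handled = []
    while True:
        job = synth_jobs.get()  # synthesis stays on the original thread
        if job is None:
            break
        handled.append(job)
    for worker in workers:
        worker.join()
    return handled
```

Keeping the consumer loop on the calling thread mirrors the constraint described above, where synthesis must stay on the original thread.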

Language detection

miTTenS supports language detection via a custom algorithm with no complex dependencies. It is disabled by default. Its accuracy is based upon orthography, so it gets more accurate the longer the text is. Individual words are liable to snap back to their language of origin (e.g. "Sauna" might be detected as Finnish) if there aren't significant spelling differences. Additionally, detection is less accurate between languages with similar orthographies: Ukrainian and Russian are likely to be confused, whereas Finnish and Swedish seldom will be.
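To see why short texts in similar orthographies are hard to tell apart, consider the toy Python sketch below. It is not the algorithm miTTenS uses. It only counts letters that exist in the Ukrainian alphabet but not the Russian one, and vice versa, and the function name `guess_uk_or_ru` is made up for this example.

```python
# Letters found in one alphabet but not the other.
UKRAINIAN_ONLY = set("іїєґ")
RUSSIAN_ONLY = set("ыэъё")


def guess_uk_or_ru(text: str):
    """Return 'uk', 'ru', or None when the text gives no orthographic hint."""
    lowered = text.lower()
    uk_score = sum(ch in UKRAINIAN_ONLY for ch in lowered)
    ru_score = sum(ch in RUSSIAN_ONLY for ch in lowered)
    if uk_score == ru_score:
        return None  # tie: nothing distinguishes the two languages
    return "uk" if uk_score > ru_score else "ru"
```

A short word such as "мама" contains no distinguishing letters, so the sketch returns `None`. A longer text is more likely to contain at least one of them, which is the same reason detection improves with length.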

Setting up

Current system-dependent dependencies are as follows:

- cmake >= 13
- g++ CC >= 13
- sfml2
- sqlite3
- nim (>= 2.0.0)

It also requires a non-root user in the input group. I recommend you make a specific user for this task.

sudo usermod -a -G input $user

git clone --recurse-submodules https://gitlab.com/CAlbassort/miTTenS
cd miTTenS
./install.sh

Downloading and installing models

Models can be downloaded from here

The install location is /usr/local/miTTenS, so models are installed there, specifically in /usr/local/miTTenS/models/language. The language code to use is based on this list: [here](https://gitlab.com/IAlbassort/zipfs-law-language-detection/-/blob/master/data/multiToEng.json?ref_type=heads) (the Wikipedia extension is used). Then drop each model's .onnx file into that directory, together with its identically named .json file. For example:

/usr/local/miTTenS/models/
├── en
│   ├── en_US-danny-low.json
│   └── en_US-danny-low.onnx
└── fi
    ├── fi_FI-harri-medium.json
    └── fi_FI-harri-medium.onnx
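A quick way to check that a models directory follows this layout is sketched below in Python. It is an illustrative helper, not part of miTTenS, and the name `find_models` is hypothetical. It lists, per language directory, every model that has both an .onnx file and a .json file with the same name.

```python
from pathlib import Path


def find_models(models_root):
    """Map each language directory to its models that have both files."""
    found = {}
    for lang_dir in sorted(Path(models_root).iterdir()):
        if not lang_dir.is_dir():
            continue
        names = sorted(
            onnx.stem
            for onnx in lang_dir.glob("*.onnx")
            # Only count a model if its identically named .json exists.
            if onnx.with_suffix(".json").is_file()
        )
        if names:
            found[lang_dir.name] = names
    return found
```

For the tree above, `find_models("/usr/local/miTTenS/models")` would map `en` to `en_US-danny-low` and `fi` to `fi_FI-harri-medium`. A model with a missing .json file is left out of the result.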