diff --git a/Makefile b/Makefile index a79730d..e01b5e8 100644 --- a/Makefile +++ b/Makefile @@ -12,7 +12,7 @@ install: deps venv # Install system dependencies (requires sudo) deps: - sudo apt install -y ydotool pipewire libnotify-bin python3-venv socat + sudo apt install -y ydotool ffmpeg pipewire libnotify-bin python3-venv socat # Create Python venv with faster-whisper (default backend) venv: .venv/.done diff --git a/README.md b/README.md index 6e41df6..eae8364 100644 --- a/README.md +++ b/README.md @@ -1,12 +1,17 @@ # talktype -Push-to-talk speech-to-text for Linux. Bind a keyboard shortcut, press it to -start recording, press it again to transcribe and type the text wherever your -cursor is. +Push-to-talk speech-to-text for Linux. Press a hotkey to start recording, press +it again to transcribe and type the text wherever your cursor is. No GUI, no +app to keep running — just a keyboard shortcut. -Transcription is pluggable — ships with -[faster-whisper](https://github.com/SYSTRAN/faster-whisper) by default, but you -can swap in any model or tool that reads audio and prints text. +- **Pluggable backends** — swap transcription models without changing anything else +- **Works everywhere** — GNOME, Sway, Hyprland, i3, X11 +- **~100 lines of bash** — easy to read, easy to hack on + +Ships with [faster-whisper](https://github.com/SYSTRAN/faster-whisper) by +default, plus optional [Parakeet](https://huggingface.co/nvidia/parakeet-ctc-1.1b) +and [Moonshine](https://huggingface.co/UsefulSensors/moonshine-base) backends. +Or bring your own — anything that reads a WAV and prints text works. > **Note:** This project is in early development — expect rough edges. If you > run into issues, please [open a bug](https://github.com/csheaff/talktype/issues). @@ -14,10 +19,10 @@ can swap in any model or tool that reads audio and prints text. 
## Requirements - Linux (Wayland or X11) -- PipeWire (default on most modern distros) +- Audio recorder: [ffmpeg](https://ffmpeg.org/) (preferred) or PipeWire (`pw-record`) - [ydotool](https://github.com/ReimuNotMoe/ydotool) for typing text (user must be in the `input` group — see Install) -- [socat](https://linux.die.net/man/1/socat) (only needed for server mode) +- [socat](https://linux.die.net/man/1/socat) (for server-backed transcription) For the default backend (faster-whisper): - NVIDIA GPU with CUDA (or use CPU mode — see Whisper backend options) @@ -53,6 +58,22 @@ Then **reboot** for the group change to take effect. make model ``` +## Configuration + +talktype reads `~/.config/talktype/config` on startup (follows `$XDG_CONFIG_HOME`). +This works everywhere — GNOME shortcuts, terminals, Sway, cron — no need to set +environment variables in each context. + +```bash +mkdir -p ~/.config/talktype +cat > ~/.config/talktype/config << 'EOF' +TALKTYPE_CMD="/path/to/talktype/transcribe-server transcribe" +EOF +``` + +Any `TALKTYPE_*` variable can go in this file. Environment variables still work, +but a value assigned in this file replaces one inherited from the environment. + ## Setup Bind `talktype` to a keyboard shortcut: @@ -75,21 +96,19 @@ bindsym $mod+d exec talktype ## Backends -Three backends are included. Each has a one-shot script (loads model per -invocation) and a server mode (loads model once, keeps it in memory). +Three backends are included. Server backends auto-start on first use — the +model loads once and stays in memory for fast subsequent transcriptions. ### Whisper (default) -The default backend uses [faster-whisper](https://github.com/SYSTRAN/faster-whisper). -Best with a GPU. +[faster-whisper](https://github.com/SYSTRAN/faster-whisper). Best with a GPU. +Works out of the box after `make install` with no config needed. 
-```bash -# One-shot (default, no extra setup needed) -talktype +For faster repeated use, switch to server mode in your config: -# Server mode (faster — model stays in memory) -./transcribe-server start -export TALKTYPE_CMD="/path/to/talktype/transcribe-server transcribe" +```bash +# ~/.config/talktype/config +TALKTYPE_CMD="/path/to/talktype/transcribe-server transcribe" ``` | Variable | Default | Description | @@ -99,17 +118,19 @@ export TALKTYPE_CMD="/path/to/talktype/transcribe-server transcribe" | `WHISPER_DEVICE` | `cuda` | `cuda` or `cpu` | | `WHISPER_COMPUTE` | `float16` | `float16` (GPU), `int8` or `float32` (CPU) | -### Parakeet (GPU, best accuracy) +### Parakeet (GPU, best word accuracy) [NVIDIA Parakeet CTC 1.1B](https://huggingface.co/nvidia/parakeet-ctc-1.1b) -via HuggingFace Transformers. 1.1B params, excellent accuracy. +via HuggingFace Transformers. 1.1B params, excellent word accuracy. +Note: CTC model — outputs lowercase text without punctuation. ```bash make parakeet +``` -# Server mode (recommended — 4.2GB model) -./backends/parakeet-server start -export TALKTYPE_CMD="/path/to/talktype/backends/parakeet-server transcribe" +```bash +# ~/.config/talktype/config +TALKTYPE_CMD="/path/to/talktype/backends/parakeet-server transcribe" ``` ### Moonshine (CPU, lightweight) @@ -119,25 +140,34 @@ Sensors. 61.5M params, purpose-built for CPU/edge inference. ```bash make moonshine +``` -# One-shot (fine for this small model) -export TALKTYPE_CMD="/path/to/talktype/backends/moonshine" - -# Or server mode -./backends/moonshine-server start -export TALKTYPE_CMD="/path/to/talktype/backends/moonshine-server transcribe" +```bash +# ~/.config/talktype/config +TALKTYPE_CMD="/path/to/talktype/backends/moonshine-server transcribe" ``` Set `MOONSHINE_MODEL=UsefulSensors/moonshine-tiny` for an even smaller 27M param model. +### Manual server management + +The server starts automatically on first transcription. 
You can also manage +it directly: + +```bash +./backends/parakeet-server start # start manually +./backends/parakeet-server stop # stop the server +``` + ### Custom backends Set `TALKTYPE_CMD` to any command that takes a WAV file path as its last argument and prints text to stdout: ```bash -export TALKTYPE_CMD="/path/to/my-transcriber" +# ~/.config/talktype/config +TALKTYPE_CMD="/path/to/my-transcriber" ``` Your command will be called as: `$TALKTYPE_CMD /path/to/recording.wav` diff --git a/backends/moonshine-server b/backends/moonshine-server index 10d8b45..27439b3 100755 --- a/backends/moonshine-server +++ b/backends/moonshine-server @@ -19,6 +19,11 @@ case "${1:-}" in echo "Already running (PID $(cat "$PIDFILE"))" exit 0 fi + if [ ! -x "$VENV/bin/python3" ]; then + echo "Moonshine backend not installed. Run: make moonshine" >&2 + exit 1 + fi + rm -f "$PIDFILE" "$SOCK" echo "Starting moonshine server (loading $MODEL)..." "$VENV/bin/python3" "$SCRIPT_DIR/moonshine-daemon.py" "$SOCK" "$MODEL" & PID=$! @@ -46,8 +51,7 @@ case "${1:-}" in ;; transcribe) if [ ! -S "$SOCK" ]; then - echo "Moonshine server not running. 
Start it with: backends/moonshine-server start" >&2 - exit 1 + "$0" start >&2 || exit 1 fi echo "$2" | socat - UNIX-CONNECT:"$SOCK" ;; diff --git a/backends/parakeet-daemon.py b/backends/parakeet-daemon.py index 91bdea6..c4ada61 100644 --- a/backends/parakeet-daemon.py +++ b/backends/parakeet-daemon.py @@ -3,6 +3,7 @@ import sys import socket import signal +import torch import soundfile as sf from transformers import AutoProcessor, AutoModelForCTC @@ -17,11 +18,13 @@ def transcribe(audio_path): audio, sr = sf.read(audio_path) - inputs = processor(audio, sampling_rate=sr) - inputs.to(model.device, dtype=model.dtype) - predicted_ids = model.generate(**inputs) - texts = processor.batch_decode(predicted_ids, skip_special_tokens=True) - return texts[0].strip() if texts else "" + inputs = processor(audio, sampling_rate=sr, return_tensors="pt") + inputs = inputs.to(model.device, dtype=model.dtype) + with torch.no_grad(): + logits = model(**inputs).logits + predicted_ids = torch.argmax(logits, dim=-1) + text = processor.batch_decode(predicted_ids, skip_special_tokens=True) + return text[0].strip() if text else "" def cleanup(*_): diff --git a/backends/parakeet-server b/backends/parakeet-server index feff3ab..479e432 100755 --- a/backends/parakeet-server +++ b/backends/parakeet-server @@ -18,12 +18,16 @@ case "${1:-}" in echo "Already running (PID $(cat "$PIDFILE"))" exit 0 fi + if [ ! -x "$VENV/bin/python3" ]; then + echo "Parakeet backend not installed. Run: make parakeet" >&2 + exit 1 + fi + rm -f "$PIDFILE" "$SOCK" echo "Starting parakeet server (loading model)..." "$VENV/bin/python3" "$SCRIPT_DIR/parakeet-daemon.py" "$SOCK" & PID=$! disown "$PID" echo "$PID" > "$PIDFILE" - # Wait for socket to appear for i in $(seq 1 60); do [ -S "$SOCK" ] && break sleep 1 @@ -45,10 +49,8 @@ case "${1:-}" in fi ;; transcribe) - # Called by talktype — sends audio path to the server, prints result if [ ! -S "$SOCK" ]; then - echo "Parakeet server not running. 
Start it with: backends/parakeet-server start" >&2 - exit 1 + "$0" start >&2 || exit 1 fi echo "$2" | socat - UNIX-CONNECT:"$SOCK" ;; diff --git a/talktype b/talktype index 42b9c95..8902d25 100755 --- a/talktype +++ b/talktype @@ -13,9 +13,15 @@ # set -euo pipefail +# ── Load user config (works from GNOME shortcuts, cron, etc.) ── +TALKTYPE_CONFIG="${TALKTYPE_CONFIG:-${XDG_CONFIG_HOME:-$HOME/.config}/talktype/config}" +# shellcheck disable=SC1090 +[ -f "$TALKTYPE_CONFIG" ] && source "$TALKTYPE_CONFIG" + TALKTYPE_DIR="${TALKTYPE_DIR:-${XDG_RUNTIME_DIR:-/tmp}/talktype}" PIDFILE="$TALKTYPE_DIR/rec.pid" AUDIOFILE="$TALKTYPE_DIR/rec.wav" +NOTIFYFILE="$TALKTYPE_DIR/notify.id" mkdir -p "$TALKTYPE_DIR" @@ -35,16 +41,33 @@ if [ -z "${TALKTYPE_CMD:-}" ]; then TALKTYPE_CMD="$VENV_DIR/bin/python3 $SCRIPT_DIR/transcribe $WHISPER_MODEL $WHISPER_LANG $WHISPER_DEVICE $WHISPER_COMPUTE" fi +# ── Notification helper ── +notify() { + local icon="$1" msg="$2" + local -a args=(-a TalkType -u critical -i "$icon" -p "TalkType" "$msg") + if [ -f "$NOTIFYFILE" ]; then + args+=(-r "$(cat "$NOTIFYFILE")") + fi + notify-send "${args[@]}" 2>/dev/null | head -1 > "$NOTIFYFILE" || true +} + +notify_close() { + if [ -f "$NOTIFYFILE" ]; then + notify-send -a TalkType -r "$(cat "$NOTIFYFILE")" -e "TalkType" "" 2>/dev/null || true + rm -f "$NOTIFYFILE" + fi +} + # ── Check core dependencies ── check_deps() { local missing=() command -v ydotool &>/dev/null || missing+=(ydotool) - command -v pw-record &>/dev/null || missing+=(pipewire) + command -v ffmpeg &>/dev/null || command -v pw-record &>/dev/null || missing+=("ffmpeg or pipewire") command -v notify-send &>/dev/null || missing+=(libnotify-bin) if [ ${#missing[@]} -gt 0 ]; then echo "Missing: ${missing[*]}" >&2 - notify-send -h string:x-canonical-private-synchronous:talktype -t 3000 -i dialog-error "TalkType" "Missing: ${missing[*]}" 2>/dev/null || true + notify-send -t 3000 -i dialog-error "TalkType" "Missing: ${missing[*]}" 2>/dev/null || true 
exit 1 fi } @@ -55,21 +78,26 @@ check_deps if [ -f "$PIDFILE" ]; then PID=$(cat "$PIDFILE") kill "$PID" 2>/dev/null || true - wait "$PID" 2>/dev/null || true + # Wait for recorder to finalize the file (not a child, so wait(1) won't work) + while kill -0 "$PID" 2>/dev/null; do sleep 0.05; done rm -f "$PIDFILE" + notify process-working "Transcribing..." + # Run the transcription command with the audio file as last arg TEXT=$($TALKTYPE_CMD "$AUDIOFILE") rm -f "$AUDIOFILE" if [ -z "$TEXT" ]; then - notify-send -h string:x-canonical-private-synchronous:talktype -t 1500 -i dialog-warning "TalkType" "No speech detected" 2>/dev/null || true + notify dialog-warning "No speech detected" exit 0 fi - # Type text at cursor via ydotool (works on any Wayland compositor) - ydotool type -- "$TEXT" + notify_close + + # Type text at cursor via ydotool + ydotool type --key-delay 50 -- "$TEXT" # ── Otherwise → start recording ── else @@ -84,5 +112,5 @@ else PID=$! disown "$PID" echo "$PID" > "$PIDFILE" - notify-send -h string:x-canonical-private-synchronous:talktype -t 1500 -i audio-input-microphone "TalkType" "Listening..." 2>/dev/null || true + notify audio-input-microphone "Listening..." 
fi diff --git a/test/server.bats b/test/server.bats index 9c69d40..ac9055b 100644 --- a/test/server.bats +++ b/test/server.bats @@ -73,12 +73,12 @@ start_mock_daemon() { # ── Server wrapper logic ── -@test "transcribe fails with helpful message when server not running" { - # Test each server script's transcribe command without a running server +@test "transcribe auto-start fails gracefully when backend not installed" { + # With no venv installed, transcribe should attempt auto-start and fail for server in transcribe-server backends/parakeet-server backends/moonshine-server; do run "$REPO_DIR/$server" transcribe /tmp/test.wav [ "$status" -eq 1 ] - [[ "$output" == *"not running"* ]] + [[ "$output" == *"not installed"* ]] done } diff --git a/test/talktype.bats b/test/talktype.bats index 3a89d15..3dfa480 100644 --- a/test/talktype.bats +++ b/test/talktype.bats @@ -5,6 +5,7 @@ # with simple mocks so we can test the control flow in isolation. setup() { + export TALKTYPE_CONFIG="/dev/null" export TALKTYPE_DIR="$BATS_TEST_TMPDIR/talktype" export TALKTYPE_CMD="$BATS_TEST_DIRNAME/mock-transcribe" diff --git a/transcribe-server b/transcribe-server index 4f1df0c..d887d57 100755 --- a/transcribe-server +++ b/transcribe-server @@ -22,6 +22,11 @@ case "${1:-}" in echo "Already running (PID $(cat "$PIDFILE"))" exit 0 fi + if [ ! -x "$VENV/bin/python3" ]; then + echo "Whisper backend not installed. Run: make install" >&2 + exit 1 + fi + rm -f "$PIDFILE" "$SOCK" echo "Starting whisper server (loading $WHISPER_MODEL model)..." "$VENV/bin/python3" "$SCRIPT_DIR/whisper-daemon.py" "$SOCK" "$WHISPER_MODEL" "$WHISPER_LANG" "$WHISPER_DEVICE" "$WHISPER_COMPUTE" & PID=$! @@ -49,8 +54,7 @@ case "${1:-}" in ;; transcribe) if [ ! -S "$SOCK" ]; then - echo "Whisper server not running. Start it with: transcribe-server start" >&2 - exit 1 + "$0" start >&2 || exit 1 fi echo "$2" | socat - UNIX-CONNECT:"$SOCK" ;;
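Note on config precedence: the lines added to `talktype` load the config with a plain `source`, so the way a variable is written in the config file decides whether the file or the environment wins. The sketch below is a minimal, standalone illustration of that behavior, not part of the patch. The temp files, the `load` helper, and the values `from-env` and `from-file` are hypothetical and exist only for the demonstration.

```bash
#!/usr/bin/env bash
# Illustration only: how sourcing a config file interacts with a variable
# inherited from the environment. File names and values are made up.
set -euo pipefail

tmp="$(mktemp -d)"
trap 'rm -rf "$tmp"' EXIT

# Config style 1: plain assignment (the style shown in the README).
printf '%s\n' 'TALKTYPE_CMD="from-file"' > "$tmp/plain"

# Config style 2: default assignment. It only takes effect when the
# variable is unset or empty, so an exported value is kept.
printf '%s\n' ': "${TALKTYPE_CMD:=from-file}"' > "$tmp/default"

# load CONFIG: repeat the patch's load step in a fresh bash process that
# has TALKTYPE_CMD=from-env in its environment, then print the result.
load() {
  TALKTYPE_CMD="from-env" TALKTYPE_CONFIG="$1" bash -c '
    set -euo pipefail
    [ -f "$TALKTYPE_CONFIG" ] && source "$TALKTYPE_CONFIG"
    printf "%s\n" "$TALKTYPE_CMD"
  '
}

plain_result="$(load "$tmp/plain")"
default_result="$(load "$tmp/default")"
skipped_result="$(load /dev/null)"   # not a regular file, so nothing is sourced

echo "plain assignment:   $plain_result"
echo "default assignment: $default_result"
echo "config=/dev/null:   $skipped_result"
```

Run as is, this prints `from-file` for the plain assignment and `from-env` for the other two cases. A user who wants an exported `TALKTYPE_CMD` to take priority could write the config entry in the default-assignment form. The `/dev/null` case mirrors the bats `setup()`: `[ -f /dev/null ]` is false, so the load step is skipped and the exported mock command is used.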