Merged
2 changes: 1 addition & 1 deletion Makefile
@@ -12,7 +12,7 @@ install: deps venv

# Install system dependencies (requires sudo)
deps:
sudo apt install -y ydotool pipewire libnotify-bin python3-venv socat
sudo apt install -y ydotool ffmpeg pipewire libnotify-bin python3-venv socat

# Create Python venv with faster-whisper (default backend)
venv: .venv/.done
90 changes: 60 additions & 30 deletions README.md
@@ -1,23 +1,28 @@
# talktype

Push-to-talk speech-to-text for Linux. Bind a keyboard shortcut, press it to
start recording, press it again to transcribe and type the text wherever your
cursor is.
Push-to-talk speech-to-text for Linux. Press a hotkey to start recording, press
it again to transcribe and type the text wherever your cursor is. No GUI, no
app to keep running — just a keyboard shortcut.

Transcription is pluggable — ships with
[faster-whisper](https://github.com/SYSTRAN/faster-whisper) by default, but you
can swap in any model or tool that reads audio and prints text.
- **Pluggable backends** — swap transcription models without changing anything else
- **Works everywhere** — GNOME, Sway, Hyprland, i3, X11
- **~100 lines of bash** — easy to read, easy to hack on

Ships with [faster-whisper](https://github.com/SYSTRAN/faster-whisper) by
default, plus optional [Parakeet](https://huggingface.co/nvidia/parakeet-ctc-1.1b)
and [Moonshine](https://huggingface.co/UsefulSensors/moonshine-base) backends.
Or bring your own — anything that reads a WAV and prints text works.

> **Note:** This project is in early development — expect rough edges. If you
> run into issues, please [open a bug](https://github.com/csheaff/talktype/issues).

## Requirements

- Linux (Wayland or X11)
- PipeWire (default on most modern distros)
- Audio recorder: [ffmpeg](https://ffmpeg.org/) (preferred) or PipeWire (`pw-record`)
- [ydotool](https://github.com/ReimuNotMoe/ydotool) for typing text
(user must be in the `input` group — see Install)
- [socat](https://linux.die.net/man/1/socat) (only needed for server mode)
- [socat](https://linux.die.net/man/1/socat) (for server-backed transcription)

For the default backend (faster-whisper):
- NVIDIA GPU with CUDA (or use CPU mode — see Whisper backend options)
@@ -53,6 +58,22 @@ Then **reboot** for the group change to take effect.
make model
```

## Configuration

talktype reads `~/.config/talktype/config` on startup (follows `$XDG_CONFIG_HOME`).
This works everywhere — GNOME shortcuts, terminals, Sway, cron — no need to set
environment variables in each context.

```bash
mkdir -p ~/.config/talktype
cat > ~/.config/talktype/config << 'EOF'
TALKTYPE_CMD="/path/to/talktype/transcribe-server transcribe"
EOF
```

Any `TALKTYPE_*` variable can go in this file. Environment variables still work
and are applied after the config file, so they override it.
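
A minimal sketch of how a config line can defer to a value that is already exported in the environment, assuming the config is loaded with `source` as plain shell. A plain `VAR=value` assignment in a sourced file replaces an inherited value, so the sketch uses the `${VAR:-default}` form instead. The temp file and the sample command strings are illustrative only.

```bash
#!/usr/bin/env bash
# Sketch only: the temp config path and the command strings are made up.
set -euo pipefail

cfg="$(mktemp)"

# The config line keeps an existing value and only falls back to its own
# value when the variable is unset or empty.
cat > "$cfg" << 'EOF'
TALKTYPE_CMD="${TALKTYPE_CMD:-/path/to/talktype/transcribe-server transcribe}"
EOF

# Case 1: nothing exported, so the value from the config file is used.
from_config="$(env -u TALKTYPE_CMD bash -c 'source "$1"; printf "%s" "$TALKTYPE_CMD"' _ "$cfg")"

# Case 2: the environment already sets the variable, so it wins.
from_env="$(TALKTYPE_CMD="/path/to/my-transcriber" bash -c 'source "$1"; printf "%s" "$TALKTYPE_CMD"' _ "$cfg")"

echo "config only: $from_config"
echo "env set:     $from_env"
rm -f "$cfg"
```

Writing every entry in this form keeps one-off overrides such as `TALKTYPE_CMD=... talktype` working from a terminal while the file still supplies the value for shortcuts and cron.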

## Setup

Bind `talktype` to a keyboard shortcut:
@@ -75,21 +96,19 @@ bindsym $mod+d exec talktype

## Backends

Three backends are included. Each has a one-shot script (loads model per
invocation) and a server mode (loads model once, keeps it in memory).
Three backends are included. Server backends auto-start on first use — the
model loads once and stays in memory for fast subsequent transcriptions.

### Whisper (default)

The default backend uses [faster-whisper](https://github.com/SYSTRAN/faster-whisper).
Best with a GPU.
[faster-whisper](https://github.com/SYSTRAN/faster-whisper). Best with a GPU.
Works out of the box after `make install` with no config needed.

```bash
# One-shot (default, no extra setup needed)
talktype
For faster repeated use, switch to server mode in your config:

# Server mode (faster — model stays in memory)
./transcribe-server start
export TALKTYPE_CMD="/path/to/talktype/transcribe-server transcribe"
```bash
# ~/.config/talktype/config
TALKTYPE_CMD="/path/to/talktype/transcribe-server transcribe"
```

| Variable | Default | Description |
@@ -99,17 +118,19 @@ export TALKTYPE_CMD="/path/to/talktype/transcribe-server transcribe"
| `WHISPER_DEVICE` | `cuda` | `cuda` or `cpu` |
| `WHISPER_COMPUTE` | `float16` | `float16` (GPU), `int8` or `float32` (CPU) |

### Parakeet (GPU, best accuracy)
### Parakeet (GPU, best word accuracy)

[NVIDIA Parakeet CTC 1.1B](https://huggingface.co/nvidia/parakeet-ctc-1.1b)
via HuggingFace Transformers. 1.1B params, excellent accuracy.
via HuggingFace Transformers. 1.1B params, excellent word accuracy.
Note: CTC model — outputs lowercase text without punctuation.

```bash
make parakeet
```

# Server mode (recommended — 4.2GB model)
./backends/parakeet-server start
export TALKTYPE_CMD="/path/to/talktype/backends/parakeet-server transcribe"
```bash
# ~/.config/talktype/config
TALKTYPE_CMD="/path/to/talktype/backends/parakeet-server transcribe"
```

### Moonshine (CPU, lightweight)
@@ -119,25 +140,34 @@ Sensors. 61.5M params, purpose-built for CPU/edge inference.

```bash
make moonshine
```

# One-shot (fine for this small model)
export TALKTYPE_CMD="/path/to/talktype/backends/moonshine"

# Or server mode
./backends/moonshine-server start
export TALKTYPE_CMD="/path/to/talktype/backends/moonshine-server transcribe"
```bash
# ~/.config/talktype/config
TALKTYPE_CMD="/path/to/talktype/backends/moonshine-server transcribe"
```

Set `MOONSHINE_MODEL=UsefulSensors/moonshine-tiny` for an even smaller 27M
param model.

### Manual server management

The server starts automatically on first transcription. You can also manage
it directly:

```bash
./backends/parakeet-server start # start manually
./backends/parakeet-server stop # stop the server
```

### Custom backends

Set `TALKTYPE_CMD` to any command that takes a WAV file path as its last
argument and prints text to stdout:

```bash
export TALKTYPE_CMD="/path/to/my-transcriber"
# ~/.config/talktype/config
TALKTYPE_CMD="/path/to/my-transcriber"
```

Your command will be called as: `$TALKTYPE_CMD /path/to/recording.wav`
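
A minimal sketch of a backend that follows this contract. The `my_transcriber` function, its `--fast` option, and the placeholder transcript are hypothetical; a real backend would run its model where the comment indicates. Taking the WAV path from the last argument means any leading options configured in `TALKTYPE_CMD` do not interfere.

```bash
#!/usr/bin/env bash
# Hypothetical backend: the function name, the --fast option, and the
# placeholder output are illustrative, not part of talktype.
set -euo pipefail

my_transcriber() {
    if [ "$#" -lt 1 ]; then
        echo "usage: my-transcriber [options] FILE.wav" >&2
        return 2
    fi
    local wav="${!#}"   # the WAV path is always the last argument
    if [ ! -r "$wav" ]; then
        echo "cannot read: $wav" >&2
        return 1
    fi
    # A real backend would run its model on "$wav" here.
    # Only the transcript goes to stdout; diagnostics go to stderr.
    printf '%s\n' "placeholder transcript"
}

# Demo: call it the way talktype would, with an extra leading option.
demo_wav="$(mktemp)"
demo_text="$(my_transcriber --fast "$demo_wav")"
echo "$demo_text"
rm -f "$demo_wav"
```

Keeping stdout limited to the transcript matters because whatever the command prints there is what ends up typed at the cursor.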
8 changes: 6 additions & 2 deletions backends/moonshine-server
@@ -19,6 +19,11 @@ case "${1:-}" in
echo "Already running (PID $(cat "$PIDFILE"))"
exit 0
fi
if [ ! -x "$VENV/bin/python3" ]; then
echo "Moonshine backend not installed. Run: make moonshine" >&2
exit 1
fi
rm -f "$PIDFILE" "$SOCK"
echo "Starting moonshine server (loading $MODEL)..."
"$VENV/bin/python3" "$SCRIPT_DIR/moonshine-daemon.py" "$SOCK" "$MODEL" &
PID=$!
@@ -46,8 +51,7 @@ case "${1:-}" in
;;
transcribe)
if [ ! -S "$SOCK" ]; then
echo "Moonshine server not running. Start it with: backends/moonshine-server start" >&2
exit 1
"$0" start >&2 || exit 1
fi
echo "$2" | socat - UNIX-CONNECT:"$SOCK"
;;
13 changes: 8 additions & 5 deletions backends/parakeet-daemon.py
@@ -3,6 +3,7 @@
import sys
import socket
import signal
import torch
import soundfile as sf
from transformers import AutoProcessor, AutoModelForCTC

@@ -17,11 +18,13 @@

def transcribe(audio_path):
    audio, sr = sf.read(audio_path)
    inputs = processor(audio, sampling_rate=sr)
    inputs.to(model.device, dtype=model.dtype)
    predicted_ids = model.generate(**inputs)
    texts = processor.batch_decode(predicted_ids, skip_special_tokens=True)
    return texts[0].strip() if texts else ""
    inputs = processor(audio, sampling_rate=sr, return_tensors="pt")
    inputs = inputs.to(model.device, dtype=model.dtype)
    with torch.no_grad():
        logits = model(**inputs).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    text = processor.batch_decode(predicted_ids, skip_special_tokens=True)
    return text[0].strip() if text else ""


def cleanup(*_):
10 changes: 6 additions & 4 deletions backends/parakeet-server
@@ -18,12 +18,16 @@ case "${1:-}" in
echo "Already running (PID $(cat "$PIDFILE"))"
exit 0
fi
if [ ! -x "$VENV/bin/python3" ]; then
echo "Parakeet backend not installed. Run: make parakeet" >&2
exit 1
fi
rm -f "$PIDFILE" "$SOCK"
echo "Starting parakeet server (loading model)..."
"$VENV/bin/python3" "$SCRIPT_DIR/parakeet-daemon.py" "$SOCK" &
PID=$!
disown "$PID"
echo "$PID" > "$PIDFILE"
# Wait for socket to appear
for i in $(seq 1 60); do
[ -S "$SOCK" ] && break
sleep 1
@@ -45,10 +49,8 @@ case "${1:-}" in
fi
;;
transcribe)
# Called by talktype — sends audio path to the server, prints result
if [ ! -S "$SOCK" ]; then
echo "Parakeet server not running. Start it with: backends/parakeet-server start" >&2
exit 1
"$0" start >&2 || exit 1
fi
echo "$2" | socat - UNIX-CONNECT:"$SOCK"
;;
42 changes: 35 additions & 7 deletions talktype
@@ -13,9 +13,15 @@
#
set -euo pipefail

# ── Load user config (works from GNOME shortcuts, cron, etc.) ──
TALKTYPE_CONFIG="${TALKTYPE_CONFIG:-${XDG_CONFIG_HOME:-$HOME/.config}/talktype/config}"
# shellcheck disable=SC1090
[ -f "$TALKTYPE_CONFIG" ] && source "$TALKTYPE_CONFIG"

TALKTYPE_DIR="${TALKTYPE_DIR:-${XDG_RUNTIME_DIR:-/tmp}/talktype}"
PIDFILE="$TALKTYPE_DIR/rec.pid"
AUDIOFILE="$TALKTYPE_DIR/rec.wav"
NOTIFYFILE="$TALKTYPE_DIR/notify.id"

mkdir -p "$TALKTYPE_DIR"

@@ -35,16 +41,33 @@ if [ -z "${TALKTYPE_CMD:-}" ]; then
TALKTYPE_CMD="$VENV_DIR/bin/python3 $SCRIPT_DIR/transcribe $WHISPER_MODEL $WHISPER_LANG $WHISPER_DEVICE $WHISPER_COMPUTE"
fi

# ── Notification helper ──
notify() {
local icon="$1" msg="$2"
local -a args=(-a TalkType -u critical -i "$icon" -p "TalkType" "$msg")
if [ -f "$NOTIFYFILE" ]; then
args+=(-r "$(cat "$NOTIFYFILE")")
fi
notify-send "${args[@]}" 2>/dev/null | head -1 > "$NOTIFYFILE" || true
}

notify_close() {
if [ -f "$NOTIFYFILE" ]; then
notify-send -a TalkType -r "$(cat "$NOTIFYFILE")" -e "TalkType" "" 2>/dev/null || true
rm -f "$NOTIFYFILE"
fi
}

# ── Check core dependencies ──
check_deps() {
local missing=()
command -v ydotool &>/dev/null || missing+=(ydotool)
command -v pw-record &>/dev/null || missing+=(pipewire)
command -v ffmpeg &>/dev/null || command -v pw-record &>/dev/null || missing+=("ffmpeg or pipewire")
command -v notify-send &>/dev/null || missing+=(libnotify-bin)

if [ ${#missing[@]} -gt 0 ]; then
echo "Missing: ${missing[*]}" >&2
notify-send -h string:x-canonical-private-synchronous:talktype -t 3000 -i dialog-error "TalkType" "Missing: ${missing[*]}" 2>/dev/null || true
notify-send -t 3000 -i dialog-error "TalkType" "Missing: ${missing[*]}" 2>/dev/null || true
exit 1
fi
}
@@ -55,21 +78,26 @@ check_deps
if [ -f "$PIDFILE" ]; then
PID=$(cat "$PIDFILE")
kill "$PID" 2>/dev/null || true
wait "$PID" 2>/dev/null || true
# Wait for recorder to finalize the file (not a child, so wait(1) won't work)
while kill -0 "$PID" 2>/dev/null; do sleep 0.05; done
rm -f "$PIDFILE"

notify process-working "Transcribing..."

# Run the transcription command with the audio file as last arg
TEXT=$($TALKTYPE_CMD "$AUDIOFILE")

rm -f "$AUDIOFILE"

if [ -z "$TEXT" ]; then
notify-send -h string:x-canonical-private-synchronous:talktype -t 1500 -i dialog-warning "TalkType" "No speech detected" 2>/dev/null || true
notify dialog-warning "No speech detected"
exit 0
fi

# Type text at cursor via ydotool (works on any Wayland compositor)
ydotool type -- "$TEXT"
notify_close

# Type text at cursor via ydotool
ydotool type --key-delay 50 -- "$TEXT"

# ── Otherwise → start recording ──
else
@@ -84,5 +112,5 @@ else
PID=$!
disown "$PID"
echo "$PID" > "$PIDFILE"
notify-send -h string:x-canonical-private-synchronous:talktype -t 1500 -i audio-input-microphone "TalkType" "Listening..." 2>/dev/null || true
notify audio-input-microphone "Listening..."
fi
6 changes: 3 additions & 3 deletions test/server.bats
@@ -73,12 +73,12 @@ start_mock_daemon() {

# ── Server wrapper logic ──

@test "transcribe fails with helpful message when server not running" {
# Test each server script's transcribe command without a running server
@test "transcribe auto-start fails gracefully when backend not installed" {
# With no venv installed, transcribe should attempt auto-start and fail
for server in transcribe-server backends/parakeet-server backends/moonshine-server; do
run "$REPO_DIR/$server" transcribe /tmp/test.wav
[ "$status" -eq 1 ]
[[ "$output" == *"not running"* ]]
[[ "$output" == *"not installed"* ]]
done
}

1 change: 1 addition & 0 deletions test/talktype.bats
@@ -5,6 +5,7 @@
# with simple mocks so we can test the control flow in isolation.

setup() {
export TALKTYPE_CONFIG="/dev/null"
export TALKTYPE_DIR="$BATS_TEST_TMPDIR/talktype"
export TALKTYPE_CMD="$BATS_TEST_DIRNAME/mock-transcribe"

8 changes: 6 additions & 2 deletions transcribe-server
@@ -22,6 +22,11 @@ case "${1:-}" in
echo "Already running (PID $(cat "$PIDFILE"))"
exit 0
fi
if [ ! -x "$VENV/bin/python3" ]; then
echo "Whisper backend not installed. Run: make install" >&2
exit 1
fi
rm -f "$PIDFILE" "$SOCK"
echo "Starting whisper server (loading $WHISPER_MODEL model)..."
"$VENV/bin/python3" "$SCRIPT_DIR/whisper-daemon.py" "$SOCK" "$WHISPER_MODEL" "$WHISPER_LANG" "$WHISPER_DEVICE" "$WHISPER_COMPUTE" &
PID=$!
@@ -49,8 +54,7 @@ case "${1:-}" in
;;
transcribe)
if [ ! -S "$SOCK" ]; then
echo "Whisper server not running. Start it with: transcribe-server start" >&2
exit 1
"$0" start >&2 || exit 1
fi
echo "$2" | socat - UNIX-CONNECT:"$SOCK"
;;