
Add speech-to-text, text-to-speech, and ElevenLabs provider #472

Open
patrickdet wants to merge 1 commit into agentjido:main from patrickdet:feat/speech-transcription-elevenlabs

Conversation

@patrickdet

Summary

Adds TTS and STT to req_llm, plus an ElevenLabs provider.

Speech (ReqLLM.Speech)

Text-to-speech through the standard prepare_request(:speech, ...) pipeline. Any provider that implements the operation works.

{:ok, result} = ReqLLM.speak("openai:tts-1", "Hello world", voice: "alloy")
File.write!("hello.mp3", result.audio)

Options cover voice selection, speed, output format (mp3/wav/opus/flac/aac/pcm), language hints, and provider-specific options such as OpenAI's instructions parameter for gpt-4o-mini-tts.
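A sketch of passing several of these options together (the speed and format keyword names here are assumptions for illustration; check the ReqLLM.Speech docs for the exact option names):

```elixir
# Assumed option names (speed:, format:) -- illustrative only.
{:ok, result} =
  ReqLLM.speak("openai:tts-1", "Hello world",
    voice: "alloy",
    speed: 1.1,
    format: :wav
  )

File.write!("hello.wav", result.audio)
```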

Transcription (ReqLLM.Transcription)

Speech-to-text. Takes file paths, raw binary, or base64 audio. Returns text with optional segment timing.

{:ok, result} = ReqLLM.transcribe("groq:whisper-large-v3-turbo", "recording.mp3")
result.text #=> "Hello world"
result.segments #=> [%{text: "Hello world", start_second: 0.0, end_second: 1.2}]

ElevenLabs provider

Speech-only. The ElevenLabs API differs from OpenAI's /audio/speech in several ways:

  • Voice ID in the URL path (/v1/text-to-speech/{voiceId})
  • xi-api-key header instead of Bearer auth
  • Output format as a query param, not body field
  • text/model_id instead of input/model
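Given those differences, the underlying HTTP request presumably looks roughly like the following Req sketch (URL, header, and body shapes are taken from the list above; the voice ID and variable names are illustrative, not the provider module's actual internals):

```elixir
# Illustrative sketch only -- not the provider's real implementation.
voice_id = "your-voice-id"  # voice ID goes in the URL path
api_key = System.fetch_env!("ELEVENLABS_API_KEY")

Req.post!(
  "https://api.elevenlabs.io/v1/text-to-speech/#{voice_id}",
  params: [output_format: "mp3_44100_128"],           # query param, not a body field
  headers: [{"xi-api-key", api_key}],                 # not Bearer auth
  json: %{text: "Hello!", model_id: "eleven_multilingual_v2"}  # text/model_id keys
)
```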

voice_settings (stability, similarity_boost, style, speed) are passed through provider_options. The provider is auto-discovered at startup.

{:ok, result} = ReqLLM.speak(
  %{id: "eleven_multilingual_v2", provider: :elevenlabs},
  "Hello!",
  provider_options: [stability: 0.5, similarity_boost: 0.8]
)

Integration tests

Tagged :integration, excluded by default. Tested against real APIs:

  • ElevenLabs TTS: default voice, voice_settings, language codes
  • OpenAI TTS: tts-1, wav output
  • Groq STT: generate-then-transcribe pattern (OpenAI TTS makes audio, Groq whisper transcribes it) so we don't commit binary fixtures
ELEVENLABS_API_KEY=... OPENAI_API_KEY=... GROQ_API_KEY=... \
  mix test --include integration test/req_llm/integration/
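The generate-then-transcribe pattern used in the Groq STT tests can be sketched using only the public API shown above (the temp-file name is illustrative):

```elixir
# OpenAI TTS produces the audio fixture at runtime...
{:ok, speech} = ReqLLM.speak("openai:tts-1", "Hello world", voice: "alloy")
path = Path.join(System.tmp_dir!(), "fixture.mp3")
File.write!(path, speech.audio)

# ...and Groq whisper transcribes it, so no binary fixture is committed.
{:ok, transcript} = ReqLLM.transcribe("groq:whisper-large-v3-turbo", path)
transcript.text
```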

Test plan

  • ElevenLabs unit tests pass (18 tests)
  • Full suite passes (2370 tests, 0 failures)
  • Integration tests pass against real ElevenLabs, OpenAI, and Groq APIs (8/8)

patrickdet force-pushed the feat/speech-transcription-elevenlabs branch from 8e6ea10 to 3d7830c on March 1, 2026 at 21:11
Add speech-to-text transcription (ReqLLM.Transcription) and
text-to-speech generation (ReqLLM.Speech) with provider-agnostic
pipelines that work via prepare_request(:transcription/:speech).

Add ElevenLabs as a speech-only provider with its unique API format
(voice ID in URL path, xi-api-key header, format as query param).

Integration tests verify TTS (ElevenLabs + OpenAI) and STT (Groq
whisper via generate-then-transcribe pattern) against real APIs.
Tagged :integration and excluded by default.
patrickdet force-pushed the feat/speech-transcription-elevenlabs branch from 3d7830c to 9ad9d93 on March 1, 2026 at 21:15