feat: add multi-provider audio generation support #135

Draft
esafwan wants to merge 4 commits into develop from feature/audio-generation-tts

@esafwan commented Feb 7, 2026

Add text-to-speech functionality using LiteLLM's speech() API for multi-provider audio generation support.

Features:

  • Supports OpenAI (tts-1, tts-1-hd), Gemini, ElevenLabs, Azure, Vertex AI, AWS Polly
  • Auto-selects a TTS model based on the provider when no model is specified
  • Supports voice, speed, and format parameters
  • Creates Agent Message with kind="Audio" and generated_audio field
  • Emits WebSocket events for real-time updates
  • Follows same pattern as image generation handler

Technical Details:

  • Saves audio files to Frappe File Manager
  • Returns comprehensive metadata (url, file_id, message_id, voice, format)
  • Adds generated_audio field to Agent Message DocType
  • Adds "Audio" kind option to Agent Message
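
For reviewers unfamiliar with LiteLLM's speech() API, the call described above might look roughly like the sketch below. The helper names `build_speech_kwargs` and `synthesize` are hypothetical and not taken from this PR, and the sketch assumes the returned object exposes `read()` like the OpenAI SDK's binary response.

```python
from typing import Any, Dict, Optional


def build_speech_kwargs(
    text: str,
    model: str,
    voice: Optional[str] = None,
    speed: Optional[float] = None,
    response_format: Optional[str] = None,
) -> Dict[str, Any]:
    """Collect speech() arguments, leaving out options the caller did not set."""
    kwargs: Dict[str, Any] = {"model": model, "input": text}
    optional = {"voice": voice, "speed": speed, "response_format": response_format}
    kwargs.update({key: value for key, value in optional.items() if value is not None})
    return kwargs


def synthesize(text: str, model: str, **options: Any) -> bytes:
    """Hypothetical wrapper: call litellm.speech() and return the raw audio bytes."""
    import litellm  # imported lazily so the helper above works without LiteLLM installed

    response = litellm.speech(**build_speech_kwargs(text, model, **options))
    # Assumption: the response behaves like the OpenAI SDK's binary response content.
    return response.read()
```

Dropping unset options lets each provider apply its own defaults instead of receiving explicit `None` values.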

@esafwan changed the title from "feat: add LiteLLM-based audio generation (TTS) handler" to "feat: add multi-provider audio generation support" on Feb 7, 2026

Add a _get_default_tts_model() function to auto-select an appropriate
TTS model based on the provider.

Supports:
- OpenAI, Azure: tts-1
- Google, Gemini: gemini-2.5-flash-preview-tts
- ElevenLabs: eleven_multilingual_v2
- AWS: polly
- MiniMax: speech-01

Enables automatic model selection for optimal TTS results per provider.
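
The provider-to-model mapping listed above can be sketched as a plain lookup table. This is an illustration only: the function name `default_tts_model_for`, the lowercase provider keys, and the `tts-1` fallback for unknown providers are assumptions, not the PR's actual code.

```python
from typing import Optional

# Provider -> default TTS model, copied from the list in the commit message.
_DEFAULT_TTS_MODELS = {
    "openai": "tts-1",
    "azure": "tts-1",
    "google": "gemini-2.5-flash-preview-tts",
    "gemini": "gemini-2.5-flash-preview-tts",
    "elevenlabs": "eleven_multilingual_v2",
    "aws": "polly",
    "minimax": "speech-01",
}


def default_tts_model_for(provider: Optional[str], fallback: str = "tts-1") -> str:
    """Return the default TTS model for a provider name (case-insensitive).

    The fallback for unknown or missing providers is an assumption of this sketch.
    """
    key = (provider or "").strip().lower()
    return _DEFAULT_TTS_MODELS.get(key, fallback)
```

A table keeps the per-provider defaults in one place, so adding a provider is a one-line change.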

Add a handle_generate_audio() function for text-to-speech conversion
using LiteLLM's speech() API.

Features:
- Multi-provider support (OpenAI, Gemini, ElevenLabs, Azure, AWS, etc.)
- Auto-selects a TTS model based on the provider when no model is specified
- Supports voice, speed, and format parameters
- Creates Agent Message with kind="Audio" and generated_audio field
- Saves audio files to Frappe File Manager
- Emits WebSocket events for real-time updates
- Returns comprehensive metadata (url, file_id, message_id, voice, format)

Follows same pattern as image generation handler for consistency.
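
The handler flow described above (synthesize, save the file, create the message, emit an event, return metadata) can be sketched with the Frappe and LiteLLM pieces passed in as plain callables. Everything here is hypothetical: the function name, the callable signatures, the `audio_generated` event name, and the `alloy` and `mp3` defaults are assumptions made for illustration.

```python
from typing import Any, Callable, Dict, Tuple


def handle_generate_audio_sketch(
    text: str,
    synthesize: Callable[[str, str, str], bytes],
    save_file: Callable[[str, bytes], Tuple[str, str]],
    create_message: Callable[..., str],
    publish: Callable[[str, Dict[str, Any]], None],
    voice: str = "alloy",          # assumed default voice
    response_format: str = "mp3",  # assumed default format
) -> Dict[str, Any]:
    """Run the TTS flow and return the metadata fields named in the commit message."""
    # 1. Convert the text to audio bytes (LiteLLM's speech() in the real handler).
    audio = synthesize(text, voice, response_format)
    # 2. Store the bytes (Frappe File Manager in the real handler).
    file_id, url = save_file(f"generated_audio.{response_format}", audio)
    # 3. Record an Agent Message that points at the stored file.
    message_id = create_message(kind="Audio", generated_audio=url)
    result = {
        "url": url,
        "file_id": file_id,
        "message_id": message_id,
        "voice": voice,
        "format": response_format,
    }
    # 4. Notify listeners (WebSocket event in the real handler).
    publish("audio_generated", result)
    return result
```

Injecting the dependencies keeps the sketch runnable without a Frappe site and makes the order of the steps easy to test with fakes.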

Add support for audio messages in the Agent Message DocType:
- Add "Audio" to kind options
- Add generated_audio field (Attach type) for storing generated audio files
- Field is conditionally displayed when kind="Audio"

Enables storing and displaying audio generation results in chat messages.
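
As a rough picture of the DocType change, the sketch below shows a field definition in the shape Frappe uses for DocType fields, plus a helper that appends "Audio" to a newline-separated Select options string. The label text and the helper name `add_select_option` are assumptions, and the existing kind options are not shown in this PR, so none are listed here.

```python
# Field definition in the style of a Frappe DocType "fields" entry.
# The label is an assumption; fieldname, fieldtype, and the display
# condition follow the commit message.
GENERATED_AUDIO_FIELD = {
    "fieldname": "generated_audio",
    "fieldtype": "Attach",
    "label": "Generated Audio",
    "depends_on": 'eval:doc.kind=="Audio"',  # show only for audio messages
}


def add_select_option(options: str, new_option: str) -> str:
    """Append an option to a newline-separated Select options string, once."""
    values = options.split("\n") if options else []
    if new_option not in values:
        values.append(new_option)
    return "\n".join(values)
```

The helper is safe to run repeatedly, which matters if the option is added from a migration hook.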

Add a create_generate_audio_tool() function to register the generate_audio
tool in the Agent Tool Function DocType.

Features:
- Creates/updates generate_audio tool with proper parameters
- Registers the "Audio Generation" tool type if it does not already exist
- Defines parameters: input, voice, model, speed, response_format
- Idempotent: can be called multiple times safely
- Integrated into after_install() and after_migrate() hooks

Enables AI agents to use text-to-speech functionality.
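
The create-or-update behaviour described above can be illustrated with an in-memory registry standing in for the Agent Tool Function DocType. This is a sketch under assumptions: the JSON-Schema-style parameter layout, the parameter types, marking only `input` as required, and the `upsert_tool` helper are all hypothetical.

```python
from typing import Any, Dict

# Hypothetical tool definition; parameter names come from the commit message,
# the types and the "required" list are assumptions.
GENERATE_AUDIO_TOOL: Dict[str, Any] = {
    "name": "generate_audio",
    "tool_type": "Audio Generation",
    "parameters": {
        "type": "object",
        "properties": {
            "input": {"type": "string"},
            "voice": {"type": "string"},
            "model": {"type": "string"},
            "speed": {"type": "number"},
            "response_format": {"type": "string"},
        },
        "required": ["input"],
    },
}


def upsert_tool(registry: Dict[str, Dict[str, Any]], tool: Dict[str, Any]) -> str:
    """Create the tool if missing, otherwise overwrite it; safe to call repeatedly."""
    existed = tool["name"] in registry
    registry[tool["name"]] = dict(tool)
    return "updated" if existed else "created"
```

Calling `upsert_tool` from both an install hook and a migrate hook leaves exactly one `generate_audio` entry, which is the idempotency property the commit message describes.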

@esafwan force-pushed the feature/audio-generation-tts branch 2 times, most recently from aaf4144 to 408fdfd on February 7, 2026 17:41