@esafwan commented Feb 7, 2026

Summary

This PR adds multi-provider audio transcription support to HUF using LiteLLM, following the same pattern as image generation. Agents can now transcribe audio files via OpenAI, Groq, Deepgram, Azure, and other providers through a unified tool interface.


Features

Core Functionality

  • Multi-Provider Support: OpenAI (whisper-1), Groq (whisper-large-v3), Deepgram (nova-2), Azure, Vertex AI
  • Flexible File Input: Accepts both Frappe File IDs and file URLs/paths
  • Auto-Model Detection: Automatically selects appropriate transcription model based on provider
  • Language Support: Optional language hint (ISO 639-1 format) or auto-detection
  • Real-time Updates: WebSocket events for live transcription results in chat UI

Technical Implementation

  • Handler Function: handle_transcribe_audio() in sdk_tools.py
  • Tool Registration: create_transcribe_audio_tool() with auto-update support
  • File Handling: Robust lookup by file_id, file_url, or file_name
  • Response Processing: Creates Agent Message with transcribed text
  • Error Handling: Comprehensive validation and error messages

Implementation Details

Tool Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `file_id` | string | No* | Frappe File document ID (preferred method) |
| `file_url` | string | No* | File URL/path (alternative: `/files/audio.mp3`) |
| `language` | string | No | ISO 639-1 language code (`en`, `es`, `fr`, etc.) |
| `model` | string | No | Model override (defaults by provider) |

*At least one of file_id or file_url is required
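The "at least one of `file_id` or `file_url`" rule can be enforced with a small guard before any lookup happens. This is an illustrative sketch, not the actual validation in `handle_transcribe_audio()`; the function name and error messages are hypothetical:

```python
def validate_transcribe_params(file_id=None, file_url=None, language=None):
    """Hypothetical sketch of the parameter check described above:
    at least one of file_id or file_url must be supplied."""
    if not file_id and not file_url:
        raise ValueError("transcribe_audio requires file_id or file_url")
    # ISO 639-1 codes are two letters (en, es, fr, ...)
    if language is not None and (len(language) != 2 or not language.isalpha()):
        raise ValueError(f"language must be an ISO 639-1 code, got {language!r}")
```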

Default Models by Provider

{
    "openai": "whisper-1",
    "azure": "whisper-1", 
    "groq": "groq/whisper-large-v3",
    "deepgram": "deepgram/nova-2",
    "default": "whisper-1"
}
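Applying that mapping is a one-line fallback chain: an explicit override wins, then the provider default, then `whisper-1`. A minimal sketch (the helper name is illustrative; the dict mirrors the mapping above):

```python
DEFAULT_TRANSCRIPTION_MODELS = {
    "openai": "whisper-1",
    "azure": "whisper-1",
    "groq": "groq/whisper-large-v3",
    "deepgram": "deepgram/nova-2",
}

def pick_transcription_model(provider, override=None):
    # An explicit model always wins; otherwise fall back by provider,
    # then to the overall default ("whisper-1").
    if override:
        return override
    return DEFAULT_TRANSCRIPTION_MODELS.get(provider, "whisper-1")
```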

Request Flow

  1. File Lookup: Resolve file by ID, URL, or name
  2. Model Selection: Use explicit model or auto-detect by provider
  3. API Call: litellm.transcription() with file path
  4. Response Processing: Extract transcribed text
  5. Message Creation: Store as Agent Message with role="agent"
  6. WebSocket Event: Emit new_agent_message for real-time UI update
  7. Return Metadata: Success status, text, file info, message ID
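The seven steps above can be sketched as a single function. This is a shape sketch only, not the real `handle_transcribe_audio()` — every collaborator (file resolver, transcriber, message store, event emitter) is injected, and the field names are assumptions:

```python
def run_transcription_flow(resolve_file, transcribe, create_message, emit_event,
                           file_id=None, file_url=None, language=None, model=None):
    """Illustrative sketch of the seven-step request flow above."""
    # 1. File lookup by ID, URL, or name
    file_doc = resolve_file(file_id=file_id, file_url=file_url)
    # 2. Model selection: explicit override or provider default
    chosen = model or "whisper-1"
    # 3-4. API call (litellm.transcription in the real handler) + text extraction
    text = transcribe(file_doc["path"], model=chosen, language=language)
    # 5. Store the result as an Agent Message with role="agent"
    message_id = create_message(text=text, role="agent")
    # 6. Real-time UI update over WebSocket
    emit_event("new_agent_message", {"message_id": message_id, "text": text})
    # 7. Metadata back to the caller
    return {"success": True, "text": text, "model": chosen, "message_id": message_id}
```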

Response Format

{
  "success": true,
  "text": "Transcribed audio content...",
  "file_id": "abc123",
  "file_url": "/files/audio.mp3",
  "language": "en",
  "model": "whisper-1",
  "message_id": "msg-xyz",
  "conversation_id": "conv-123"
}
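A caller consuming this payload would typically gate on `success` before using `text`. A minimal, hypothetical consumer sketch:

```python
def extract_transcript(response):
    """Sketch of how a caller might consume the response payload above."""
    if not response.get("success"):
        raise RuntimeError("transcription failed")
    return response["text"]
```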

Usage Examples

Example 1: Basic Transcription

transcribe_audio(
    file_id="file-abc123"
)

Example 2: With Language Hint

transcribe_audio(
    file_url="/files/spanish_audio.mp3",
    language="es"
)

Example 3: With Model Override

transcribe_audio(
    file_id="file-xyz789",
    model="groq/whisper-large-v3"  # Use Groq's model
)

Example 4: Agent Workflow

User: "Can you transcribe this audio file?"
Agent: [Uploads file, gets file_id]
Agent: [Calls transcribe_audio(file_id="file-123")]
Agent: "Here's the transcription: [text]"

Comparison with Image Generation

| Aspect | Image Generation | Audio Transcription |
|--------|------------------|---------------------|
| Function | `litellm.image_generation()` | `litellm.transcription()` |
| Input | Text prompt | Audio file |
| Output | Image file (PNG/JPEG) | Text string |
| Storage | `generated_image` field | Message content |
| File Handling | Download/decode → save | Read file path → transcribe |
| Message Kind | "Image" | "Text" (standard) |
| Provider Support | OpenAI, Google, Azure, etc. | OpenAI, Groq, Deepgram, etc. |

Testing

Tested Scenarios

  • ✅ File lookup by file_id
  • ✅ File lookup by file_url
  • ✅ File lookup by file_name (fallback)
  • ✅ Auto-model detection (OpenAI, Groq)
  • ✅ Language parameter handling
  • ✅ Model override parameter
  • ✅ Agent Message creation
  • ✅ WebSocket event emission
  • ✅ Error handling (missing file, invalid API key)

Test Commands

# Install/migrate
bench --site [site] migrate

# Verify tool exists
bench --site [site] console
>>> frappe.db.exists("Agent Tool Function", "transcribe_audio")

# Test transcription
# (Upload audio file, get file_id, call tool via agent)

Migration Notes

For Existing Installations:

# Pull latest code
cd apps/huf && git pull

# Run migration (auto-creates tool)
bench --site [site] migrate

# Restart server
bench restart

For New Installations:
The tool is automatically created during `bench install-app huf`.

Checklist

  • Code follows project style guidelines
  • Self-review completed
  • Documentation added/updated
  • No breaking changes
  • Migration tested
  • Error handling implemented
  • WebSocket events working
  • Multi-provider support verified

Add handle_transcribe_audio() function using LiteLLM's transcription()
API for multi-provider audio transcription support.

Features:
- Supports OpenAI (whisper-1), Groq (whisper-large-v3), Deepgram (nova-2)
- Auto-detects model based on provider if not specified
- Handles both file_id and file_url input methods
- Optional language parameter for better accuracy
- Creates Agent Message with transcribed text
- Emits WebSocket events for real-time updates
- Follows same pattern as image generation handler

Technical Details:
- Uses litellm.transcription() for provider abstraction
- Normalizes model names via _normalize_model_name()
- Handles file reading from Frappe File documents
- Creates conversation messages with proper indexing
- Returns comprehensive metadata (text, file_id, message_id, language, model)

This enables agents to transcribe audio files using any LiteLLM-supported
provider, starting with OpenAI support.

Register transcribe_audio tool in Agent Tool Function system with
proper parameters and tool type.

Changes:
- Create create_transcribe_audio_tool() function
- Handles both creation and updates (no patch needed)
- Creates "Transcription" tool type if not exists
- Registers tool with parameters: file_id, file_url, language, model
- Updates after_install() and after_migrate() hooks

Tool Parameters:
- file_id: File document ID (preferred)
- file_url: File URL/path (alternative)
- language: Optional ISO 639-1 language code
- model: Optional model override (defaults by provider)

Tool Type: Transcription
Function Path: huf.ai.sdk_tools.handle_transcribe_audio

The tool is automatically available to agents after installation/migration.

Improve file URL handling to support multiple lookup methods:
- Try file_url lookup first
- Fallback to file_name lookup if file_url fails
- Better error handling for file not found cases

This ensures audio files can be found whether referenced by
file_id, file_url, or file_name.
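The fallback chain described here (file_id, then file_url, then file_name) can be sketched as a loop over lookup fields. The `db_get` callable stands in for Frappe's document queries; names here are illustrative, not the actual code:

```python
def find_file(db_get, file_id=None, file_url=None, file_name=None):
    """Sketch of the lookup order in the commit above: file_id first,
    then file_url, then file_name. db_get(field, value) returns a
    record or None (a stand-in for Frappe's File queries)."""
    for field, value in (("name", file_id),
                         ("file_url", file_url),
                         ("file_name", file_name)):
        if value:
            doc = db_get(field, value)
            if doc is not None:
                return doc
    raise FileNotFoundError("audio file not found by file_id, file_url, or file_name")
```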
…tion

LiteLLM transcription() accepts file path (string) or file-like object,
not raw bytes. Update implementation to pass file path directly.

Benefits:
- More efficient (no need to read entire file into memory)
- Matches LiteLLM API expectations
- Supports both file path and file-like object formats

This follows LiteLLM's recommended usage pattern for audio transcription.
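Per the commit, `transcription()` accepts either a path string or a file-like object, so the handler only needs to reject anything else rather than read bytes into memory. A hedged sketch of that normalization (the function name is hypothetical):

```python
import io

def as_transcription_input(source):
    """Sketch of the fix described above: pass a path string or a
    file-like object straight through instead of reading raw bytes."""
    if hasattr(source, "read"):
        return source  # already file-like; LiteLLM can stream from it
    if isinstance(source, str):
        return source  # path string, per the commit's description
    raise ValueError(f"unsupported audio source: {source!r}")
```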
@esafwan esafwan force-pushed the feature/audio-transcription-litellm branch from 862be09 to b0a214f Compare February 7, 2026 16:35
@esafwan esafwan changed the title Feature/audio transcription litellm feat: audio transcription litellm Feb 7, 2026
@esafwan esafwan changed the title feat: audio transcription litellm feat: add multi-provider audio transcription support Feb 7, 2026
@esafwan esafwan force-pushed the feature/audio-transcription-litellm branch from b0a214f to b47fa63 Compare February 7, 2026 17:42