@esafwan commented Feb 7, 2026

Summary

This PR adds multi-provider audio transcription support to HUF using LiteLLM, following the same pattern as image generation. Agents can now transcribe audio files via OpenAI, Groq, Deepgram, Azure, and other providers through a unified tool interface.


Features

Core Functionality

  • Multi-Provider Support: OpenAI (whisper-1), Groq (whisper-large-v3), Deepgram (nova-2), Azure, Vertex AI
  • Flexible File Input: Accepts both Frappe File IDs and file URLs/paths
  • Auto-Model Detection: Automatically selects appropriate transcription model based on provider
  • Language Support: Optional language hint (ISO 639-1 format) or auto-detection
  • Real-time Updates: WebSocket events for live transcription results in chat UI

Technical Implementation

  • Handler Function: handle_transcribe_audio() in sdk_tools.py
  • Tool Registration: create_transcribe_audio_tool() with auto-update support
  • File Handling: Robust lookup by file_id, file_url, or file_name
  • Response Processing: Creates Agent Message with transcribed text
  • Error Handling: Comprehensive validation and error messages

Implementation Details

Tool Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `file_id` | string | No* | Frappe File document ID (preferred method) |
| `file_url` | string | No* | File URL/path (alternative: `/files/audio.mp3`) |
| `language` | string | No | ISO 639-1 language code (`en`, `es`, `fr`, etc.) |
| `model` | string | No | Model override (defaults by provider) |

*At least one of file_id or file_url is required
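The "at least one of `file_id` or `file_url`" rule can be enforced with a small guard before any lookup happens. This is an illustrative sketch, not the actual validation in `handle_transcribe_audio()`; the function name and error messages are hypothetical:

```python
def validate_transcribe_params(file_id=None, file_url=None, language=None):
    """Hypothetical sketch of the parameter check described above:
    at least one of file_id or file_url must be supplied."""
    if not file_id and not file_url:
        raise ValueError("transcribe_audio requires file_id or file_url")
    # ISO 639-1 codes are two letters (en, es, fr, ...)
    if language is not None and (len(language) != 2 or not language.isalpha()):
        raise ValueError(f"language must be an ISO 639-1 code, got {language!r}")
```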

Default Models by Provider

{
    "openai": "whisper-1",
    "azure": "whisper-1", 
    "groq": "groq/whisper-large-v3",
    "deepgram": "deepgram/nova-2",
    "default": "whisper-1"
}
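Applying that mapping is a one-line fallback chain: an explicit override wins, then the provider default, then `whisper-1`. A minimal sketch (the helper name is illustrative; the dict mirrors the mapping above):

```python
DEFAULT_TRANSCRIPTION_MODELS = {
    "openai": "whisper-1",
    "azure": "whisper-1",
    "groq": "groq/whisper-large-v3",
    "deepgram": "deepgram/nova-2",
}

def pick_transcription_model(provider, override=None):
    # An explicit model always wins; otherwise fall back by provider,
    # then to the overall default ("whisper-1").
    if override:
        return override
    return DEFAULT_TRANSCRIPTION_MODELS.get(provider, "whisper-1")
```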

Request Flow

  1. File Lookup: Resolve file by ID, URL, or name
  2. Model Selection: Use explicit model or auto-detect by provider
  3. API Call: litellm.transcription() with file path
  4. Response Processing: Extract transcribed text
  5. Message Creation: Store as Agent Message with role="agent"
  6. WebSocket Event: Emit new_agent_message for real-time UI update
  7. Return Metadata: Success status, text, file info, message ID
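The seven steps above can be sketched as a single function. This is a shape sketch only, not the real `handle_transcribe_audio()` — every collaborator (file resolver, transcriber, message store, event emitter) is injected, and the field names are assumptions:

```python
def run_transcription_flow(resolve_file, transcribe, create_message, emit_event,
                           file_id=None, file_url=None, language=None, model=None):
    """Illustrative sketch of the seven-step request flow above."""
    # 1. File lookup by ID, URL, or name
    file_doc = resolve_file(file_id=file_id, file_url=file_url)
    # 2. Model selection: explicit override or provider default
    chosen = model or "whisper-1"
    # 3-4. API call (litellm.transcription in the real handler) + text extraction
    text = transcribe(file_doc["path"], model=chosen, language=language)
    # 5. Store the result as an Agent Message with role="agent"
    message_id = create_message(text=text, role="agent")
    # 6. Real-time UI update over WebSocket
    emit_event("new_agent_message", {"message_id": message_id, "text": text})
    # 7. Metadata back to the caller
    return {"success": True, "text": text, "model": chosen, "message_id": message_id}
```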

Response Format

{
  "success": true,
  "text": "Transcribed audio content...",
  "file_id": "abc123",
  "file_url": "/files/audio.mp3",
  "language": "en",
  "model": "whisper-1",
  "message_id": "msg-xyz",
  "conversation_id": "conv-123"
}
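A caller consuming this payload would typically gate on `success` before using `text`. A minimal, hypothetical consumer sketch:

```python
def extract_transcript(response):
    """Sketch of how a caller might consume the response payload above."""
    if not response.get("success"):
        raise RuntimeError("transcription failed")
    return response["text"]
```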

Usage Examples

Example 1: Basic Transcription

transcribe_audio(
    file_id="file-abc123"
)

Example 2: With Language Hint

transcribe_audio(
    file_url="/files/spanish_audio.mp3",
    language="es"
)

Example 3: With Model Override

transcribe_audio(
    file_id="file-xyz789",
    model="groq/whisper-large-v3"  # Use Groq's model
)

Example 4: Agent Workflow

User: "Can you transcribe this audio file?"
Agent: [Uploads file, gets file_id]
Agent: [Calls transcribe_audio(file_id="file-123")]
Agent: "Here's the transcription: [text]"

Comparison with Image Generation

| Aspect | Image Generation | Audio Transcription |
|--------|------------------|---------------------|
| Function | `litellm.image_generation()` | `litellm.transcription()` |
| Input | Text prompt | Audio file |
| Output | Image file (PNG/JPEG) | Text string |
| Storage | `generated_image` field | Message content |
| File Handling | Download/decode → save | Read file path → transcribe |
| Message Kind | "Image" | "Text" (standard) |
| Provider Support | OpenAI, Google, Azure, etc. | OpenAI, Groq, Deepgram, etc. |

Testing

Tested Scenarios

  • ✅ File lookup by file_id
  • ✅ File lookup by file_url
  • ✅ File lookup by file_name (fallback)
  • ✅ Auto-model detection (OpenAI, Groq)
  • ✅ Language parameter handling
  • ✅ Model override parameter
  • ✅ Agent Message creation
  • ✅ WebSocket event emission
  • ✅ Error handling (missing file, invalid API key)

Test Commands

# Install/migrate
bench --site [site] migrate

# Verify tool exists
bench --site [site] console
>>> frappe.db.exists("Agent Tool Function", "transcribe_audio")

# Test transcription
# (Upload audio file, get file_id, call tool via agent)

Migration Notes

For Existing Installations:

# Pull latest code
cd apps/huf && git pull

# Run migration (auto-creates tool)
bench --site [site] migrate

# Restart server
bench restart

For New Installations:
The tool is automatically created during `bench install-app huf`.

Checklist

  • Code follows project style guidelines
  • Self-review completed
  • Documentation added/updated
  • No breaking changes
  • Migration tested
  • Error handling implemented
  • WebSocket events working
  • Multi-provider support verified

Add handle_transcribe_audio() function using LiteLLM's transcription()
API for multi-provider audio transcription support.

Features:
- Supports OpenAI (whisper-1), Groq (whisper-large-v3), Deepgram (nova-2)
- Auto-detects model based on provider if not specified
- Handles both file_id and file_url input methods
- Optional language parameter for better accuracy
- Creates Agent Message with transcribed text
- Emits WebSocket events for real-time updates
- Follows same pattern as image generation handler

Technical Details:
- Uses litellm.transcription() for provider abstraction
- Normalizes model names via _normalize_model_name()
- Handles file reading from Frappe File documents
- Creates conversation messages with proper indexing
- Returns comprehensive metadata (text, file_id, message_id, language, model)

This enables agents to transcribe audio files using any LiteLLM-supported
provider, starting with OpenAI support.

Register transcribe_audio tool in Agent Tool Function system with
proper parameters and tool type.

Changes:
- Create create_transcribe_audio_tool() function
- Handles both creation and updates (no patch needed)
- Creates "Transcription" tool type if not exists
- Registers tool with parameters: file_id, file_url, language, model
- Updates after_install() and after_migrate() hooks

Tool Parameters:
- file_id: File document ID (preferred)
- file_url: File URL/path (alternative)
- language: Optional ISO 639-1 language code
- model: Optional model override (defaults by provider)

Tool Type: Transcription
Function Path: huf.ai.sdk_tools.handle_transcribe_audio

The tool is automatically available to agents after installation/migration.

Improve file URL handling to support multiple lookup methods:
- Try file_url lookup first
- Fallback to file_name lookup if file_url fails
- Better error handling for file not found cases

This ensures audio files can be found whether referenced by
file_id, file_url, or file_name.
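The fallback chain described here (file_id, then file_url, then file_name) can be sketched as a loop over lookup fields. The `db_get` callable stands in for Frappe's document queries; names here are illustrative, not the actual code:

```python
def find_file(db_get, file_id=None, file_url=None, file_name=None):
    """Sketch of the lookup order in the commit above: file_id first,
    then file_url, then file_name. db_get(field, value) returns a
    record or None (a stand-in for Frappe's File queries)."""
    for field, value in (("name", file_id),
                         ("file_url", file_url),
                         ("file_name", file_name)):
        if value:
            doc = db_get(field, value)
            if doc is not None:
                return doc
    raise FileNotFoundError("audio file not found by file_id, file_url, or file_name")
```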
…tion

LiteLLM transcription() accepts file path (string) or file-like object,
not raw bytes. Update implementation to pass file path directly.

Benefits:
- More efficient (no need to read entire file into memory)
- Matches LiteLLM API expectations
- Supports both file path and file-like object formats

This follows LiteLLM's recommended usage pattern for audio transcription.
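Per the commit, `transcription()` accepts either a path string or a file-like object, so the handler only needs to reject anything else rather than read bytes into memory. A hedged sketch of that normalization (the function name is hypothetical):

```python
import io

def as_transcription_input(source):
    """Sketch of the fix described above: pass a path string or a
    file-like object straight through instead of reading raw bytes."""
    if hasattr(source, "read"):
        return source  # already file-like; LiteLLM can stream from it
    if isinstance(source, str):
        return source  # path string, per the commit's description
    raise ValueError(f"unsupported audio source: {source!r}")
```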
@esafwan esafwan force-pushed the feature/audio-transcription-litellm branch from 862be09 to b0a214f Compare February 7, 2026 16:35
@esafwan esafwan changed the title Feature/audio transcription litellm feat: audio transcription litellm Feb 7, 2026
@esafwan esafwan changed the title feat: audio transcription litellm feat: add multi-provider audio transcription support Feb 7, 2026
@esafwan esafwan force-pushed the feature/audio-transcription-litellm branch from b0a214f to b47fa63 Compare February 7, 2026 17:42