TTS/STT: Feature integration #513

@nishika26

Is your feature request related to a problem? Please describe.
Currently, Kaapi only supports text-based user inputs. This limits our ability to:

  • Process audio and voice-based inputs from users
  • Support voice-enabled chatbot experiences for partners

Describe the solution you'd like:
Phase 1: TTS/STT Exploration & Benchmarking

  • Model evaluation: Benchmark existing Indic voice models (Hindi, Tamil, Bengali, etc.) on:

    • Accuracy (WER - Word Error Rate for STT, MOS - Mean Opinion Score for TTS)
    • Latency (real-time vs batch processing)
    • Cost per inference
    • Language/dialect coverage
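
For reference, the WER metric named above is conventionally the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference transcript. A minimal sketch follows, assuming plain whitespace tokenization; the function name `word_error_rate` is illustrative and not part of Kaapi, and a real benchmark for Indic languages would likely also need text normalization (punctuation, Unicode forms, numerals) before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = word-level edit distance / number of reference words.

    Assumption: words are separated by whitespace and compared exactly,
    with no normalization applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("मेरा नाम राम है", "मेरा नाम श्याम है"))
```

Note that WER can exceed 1.0 when the hypothesis contains many inserted words, so it should not be treated as a percentage capped at 100.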

Phase 2: Platform Integration

  • Unified API extension:

    • Add voice model support to /configs endpoint (similar to existing LLM provider configs)
    • Extend /llm/call to handle audio inputs/outputs with automatic transcription
    • Support both streaming (real-time) and batch audio processing
  • Evaluation API enhancement:

    • Add audio-specific evaluation metrics (WER, latency, speaker diarization accuracy)
    • Enable quick A/B testing of different TTS/STT providers
    • Support voice dataset management (upload, version, annotate)
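
As a rough illustration of what A/B testing of STT providers could report, the sketch below aggregates per-utterance results into a mean WER and a 95th-percentile latency per provider. Everything here is an assumption for discussion: the `Sample` type, the `summarize` function, and the provider names are hypothetical and do not reflect any existing Kaapi API or data model.

```python
import math
from dataclasses import dataclass
from statistics import mean


@dataclass
class Sample:
    """One evaluated utterance (hypothetical record shape)."""
    wer: float         # word error rate for this utterance
    latency_ms: float  # end-to-end transcription latency in milliseconds


def p95(values: list[float]) -> float:
    """95th percentile using the nearest-rank method (non-empty input)."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def summarize(results: dict[str, list[Sample]]) -> dict[str, dict[str, float]]:
    """Aggregate per-provider samples into mean WER and p95 latency.

    Assumes every provider has at least one sample.
    """
    return {
        provider: {
            "mean_wer": mean(s.wer for s in samples),
            "p95_latency_ms": p95([s.latency_ms for s in samples]),
        }
        for provider, samples in results.items()
    }


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not benchmark results.
    results = {
        "provider_a": [Sample(0.1, 200.0), Sample(0.3, 400.0)],
        "provider_b": [Sample(0.2, 100.0), Sample(0.2, 150.0)],
    }
    for provider, summary in summarize(results).items():
        print(provider, summary)
```

Reporting a tail latency alongside the mean WER keeps the accuracy and real-time criteria from Phase 1 visible in the same comparison.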
