Conversation

@narsimhaReddyJuspay (Contributor) commented Feb 3, 2026

  • Added TTS fallback for ElevenLabs and Cartesia
  • Added Slack alerts after triggering TTS fallback
  • Control fallback via feature flags

Summary by CodeRabbit

  • New Features

    • Added automatic TTS service fallback when the primary provider encounters failures
    • Implemented background health monitoring for TTS services with failure detection
    • Added configurable failure thresholds and timeout settings for TTS operations
  • Chores

    • Extended dynamic configuration system to support TTS fallback management

coderabbitai bot commented Feb 3, 2026

Walkthrough

This PR introduces a TTS health observer system that monitors service health, detects provider failures through provider-aware detection strategies, and automatically triggers fallback to alternate providers via Redis state updates and Slack alerts. The system integrates an asynchronous health-check loop into the Breeze Buddy pipeline and adds dynamic configuration accessors for fallback thresholds and timeout settings.

Changes

  • TTS Health Observer Module (app/ai/voice/tts/tts_health_observer.py):
    New module implementing stateful TTS health monitoring with provider-aware failure detection for ElevenLabs (audio bytes) and Cartesia (ErrorFrame), per-provider failure counters, automatic fallback orchestration via Redis/Slack, timeout checks, and a background health-check loop.
  • Agent Integration (app/ai/voice/agents/breeze_buddy/agent/__init__.py, app/ai/voice/agents/breeze_buddy/agent/pipeline.py):
    Starts an asynchronous TTS health-check loop at pipeline initialization as a named task with cleanup on termination; always prepares and includes the TTS health observer in the observer list for both dev and non-dev environments.
  • TTS Module Re-exports (app/ai/voice/tts/__init__.py, app/ai/voice/agents/breeze_buddy/tts/__init__.py):
    The main TTS module re-exports health observer components; the Breeze Buddy TTS module adds fallback logic to resolve the actual provider via a swap map when the configured provider fails, applied across both voice-specific and default service selection paths.
  • Dynamic Configuration (app/core/config/dynamic.py):
    Adds four async accessors for runtime fallback configuration: BB_TTS_FALLBACK_ENABLED(), BB_TTS_FAILURE_THRESHOLD(), BB_TTS_AUDIO_TIMEOUT_SECONDS(), and BB_TTS_FAILED_PROVIDER().
  • Config Store (app/services/live_config/store.py):
    Adds an async set_config(key, value) function to update a single runtime config value in Redis, with logging and failure handling.

Sequence Diagram

sequenceDiagram
    participant Pipeline as Breeze Buddy Pipeline
    participant Observer as TTS Health Observer
    participant Config as Dynamic Config
    participant TTS as TTS Service
    participant Redis as Redis Store
    participant Slack as Slack Alerts

    Pipeline->>Observer: start_tts_health_check_loop() on init
    Observer->>Observer: Begin async monitoring task

    loop TTS Request Cycle
        TTS->>Observer: TTSStartedFrame (with provider)
        TTS->>Observer: TTSAudioRawFrame (audio bytes) or ErrorFrame
        TTS->>Observer: TTSStoppedFrame
        Observer->>Observer: Track failure count per provider
    end

    loop Health Check Interval
        Observer->>Config: Check BB_TTS_FAILURE_THRESHOLD
        Observer->>Config: Check BB_TTS_AUDIO_TIMEOUT_SECONDS
        Observer->>Observer: Evaluate timeout conditions
        
        alt Failure Threshold Exceeded
            Observer->>Config: Get BB_TTS_FAILED_PROVIDER
            Observer->>Config: Get BB_TTS_FALLBACK_ENABLED
            alt Fallback Enabled
                Observer->>Redis: Update swap state for provider
                Observer->>Slack: Alert on fallback trigger
                Note over TTS: Subsequent requests use swapped provider
            end
        end
    end

    Pipeline->>Observer: Cancel health check task on termination
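The pipeline integration described in the walkthrough (a named health-check task started at initialization and cancelled on termination) follows a standard asyncio pattern. A minimal sketch, with the loop body and the task name as illustrative placeholders:

```python
import asyncio

async def tts_health_check_loop(interval: float = 1.0) -> None:
    """Illustrative monitoring loop; the real checks live in tts_health_observer."""
    try:
        while True:
            # ... evaluate per-provider failure counts and audio timeouts here ...
            await asyncio.sleep(interval)
    except asyncio.CancelledError:
        # Runs when the pipeline cancels the task on termination.
        raise

async def main() -> None:
    # Named task, so it is identifiable in debug output and task dumps.
    task = asyncio.create_task(tts_health_check_loop(0.01), name="tts-health-check")
    await asyncio.sleep(0.05)
    # On termination: cancel and await so the CancelledError cleanup runs.
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    print(task.get_name(), task.cancelled())  # tts-health-check True

asyncio.run(main())
```

Awaiting the cancelled task (rather than fire-and-forget cancellation) matters: it guarantees the loop's cleanup has finished before the pipeline tears down shared resources.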

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested Reviewers

  • badri-singhal
  • murdore
  • swaroopvarma1

Poem

🐰 A health check hops through the TTS stream,
Detecting when providers fail to seem,
With Redis and Slack to spread the news,
And fallback magic when voices blues,
Now Breeze Buddy bounces, forever on cue!

🚥 Pre-merge checks | ✅ 3 passed
  • Description Check: ✅ Passed (check skipped because CodeRabbit's high-level summary is enabled)
  • Title Check: ✅ Passed (the title clearly summarizes the main change, adding TTS fallback support for Breeze Buddy, which aligns with the core objective of the pull request)
  • Docstring Coverage: ✅ Passed (docstring coverage is 96.88%, above the required 80.00% threshold)




@narsimhaReddyJuspay force-pushed the tts-fallback-support branch 4 times, most recently from ddbe08b to 363206f on February 3, 2026 at 08:08
- added tts fallback for eleven labs and cartesia
- added slack alerts after triggering tts fallback
- control fallback with feature flags
coderabbitai bot left a comment
Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
app/ai/voice/agents/breeze_buddy/tts/__init__.py (1)

156-221: ⚠️ Potential issue | 🟡 Minor

Normalize provider names before comparisons to keep fallback reliable.

If BB_TTS_FAILED_PROVIDER / BREEZE_BUDDY_TTS_SERVICE ever come in mixed case, the equality checks won’t match and fallback won’t trigger. Lowercasing both values makes the routing more robust.

Proposed fix
-    failed_provider = await BB_TTS_FAILED_PROVIDER()
+    failed_provider = (await BB_TTS_FAILED_PROVIDER()).lower()
...
-    configured_service = await BB_TTS_SERVICE()
+    configured_service = (await BB_TTS_SERVICE()).lower()
🤖 Fix all issues with AI agents
In `@app/ai/voice/tts/tts_health_observer.py`:
- Around line 122-131: The _detect_provider function currently defaults unknown
source names to TTSProvider.ELEVENLABS which can misattribute errors; update
_detect_provider to explicitly handle unknown providers by checking source_name
(in function _detect_provider) against known values (TTSProvider.ELEVENLABS,
TTSProvider.CARTESIA) and when none match either log a warning with the
source_name and return a sentinel (e.g., None) or raise a specific exception, or
alternatively add the new provider to the TTSProvider enum and handle it
explicitly; ensure callers of _detect_provider (wherever it's used) are updated
to handle the None/exception/new-enum case to avoid silently assuming
ElevenLabs.
- Around line 320-336: The _trigger_fallback logic currently ignores the return
value of the first set_config call; modify _trigger_fallback to capture both
results (the set for "BB_TTS_FAILED_PROVIDER" and the set for
"BB_TTS_FALLBACK_ENABLED") and only mark self._fallback_triggered = True, reset
self._failure_counts[failed_service] and call await
self._send_fallback_alert(...) when both set_config calls succeed; keep the
order (set failed provider first, then enable fallback) and log distinct errors
when either write fails (include which key failed) instead of proceeding when
only one succeeded.

In `@app/services/live_config/store.py`:
- Around line 238-269: set_config currently does a non-atomic read-modify-write
on FEATURE_FLAGS_KEY using client.get/client.set which can cause lost updates
when concurrent writers (e.g., set_config and fetch_and_update_feature_flags)
run; make the update atomic by either converting storage to a Redis HASH and
using HSET/HGET/HGETALL for individual flag updates (update normalize_key(key)
via HSET FEATURE_FLAGS_KEY normalized_key value) or implement optimistic locking
with WATCH/MULTI/EXEC around FEATURE_FLAGS_KEY in set_config: WATCH
FEATURE_FLAGS_KEY, read raw, modify the JSON, MULTI -> set -> EXEC and retry on
EXEC failure; reference the existing helpers get_redis_service, normalize_key
and the competing function fetch_and_update_feature_flags when adding the atomic
retry logic.
🧹 Nitpick comments (2)
app/ai/voice/agents/breeze_buddy/tts/__init__.py (1)

149-154: Deduplicate provider/voice maps to avoid drift.

VOICE_TO_PROVIDER (and PROVIDER_TO_VOICE) are defined in two functions. Consider promoting them to module-level constants to keep mapping changes in one place.

Also applies to: 276-286
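The deduplication suggested above is mechanical. A sketch of module-level constants (provider names match the PR; the voice IDs are placeholders, and deriving one map from the other guarantees they cannot drift):

```python
# Module-level constants so the voice<->provider mapping lives in one place.
# Voice IDs below are illustrative placeholders, not values from the diff.
PROVIDER_TO_VOICE: dict[str, str] = {
    "elevenlabs": "voice-el-default",
    "cartesia": "voice-ca-default",
}

# Derive the inverse mapping instead of maintaining it by hand.
VOICE_TO_PROVIDER: dict[str, str] = {v: k for k, v in PROVIDER_TO_VOICE.items()}
```

Both functions then import the constants, and a mapping change touches exactly one dict.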

app/ai/voice/tts/tts_health_observer.py (1)

363-385: Drop unused current_service parameter or include it in the alert.

Comment on lines +122 to +131
def _detect_provider(self, source_name: str) -> TTSProvider:
    """Detect TTS provider from source class name."""
    name_lower = source_name.lower()
    if "elevenlabs" in name_lower:
        return TTSProvider.ELEVENLABS
    elif "cartesia" in name_lower:
        return TTSProvider.CARTESIA
    else:
        # Default to ElevenLabs detection strategy (more conservative)
        return TTSProvider.ELEVENLABS

⚠️ Potential issue | 🟠 Major

Handle non‑ElevenLabs/Cartesia providers explicitly.

Defaulting unknown providers to ElevenLabs can misattribute failures and trigger incorrect fallback/alerts when Sarvam (or future providers) is configured. Consider explicitly handling unknown providers (log + ignore) or adding them to the enum/config instead of defaulting.


Comment on lines +320 to +336
async def _trigger_fallback(self, failed_service: str):
    """Enable fallback in Redis and notify."""
    alt_service = TTS_SWAP_MAP.get(failed_service, "unknown")

    # Set the failed provider FIRST to avoid race conditions
    await set_config("BB_TTS_FAILED_PROVIDER", failed_service)
    # THEN enable the fallback flag
    success = await set_config("BB_TTS_FALLBACK_ENABLED", True)
    if success:
        self._fallback_triggered = True
        self._failure_counts[failed_service] = 0
        logger.warning(
            f"🔄 TTS FALLBACK TRIGGERED: {failed_service} → {alt_service}"
        )
        await self._send_fallback_alert(failed_service, alt_service)
    else:
        logger.error("[TTS Health] Failed to update Redis for fallback!")

⚠️ Potential issue | 🟠 Major

Verify both Redis updates before marking fallback triggered.

_trigger_fallback ignores the result of setting BB_TTS_FAILED_PROVIDER. If that write fails but fallback is enabled, routing can switch without an accurate failed-provider value. Check both updates and only mark fallback as triggered when both succeed.

Proposed fix
-        await set_config("BB_TTS_FAILED_PROVIDER", failed_service)
-        # THEN enable the fallback flag
-        success = await set_config("BB_TTS_FALLBACK_ENABLED", True)
-        if success:
+        failed_set = await set_config("BB_TTS_FAILED_PROVIDER", failed_service)
+        fallback_set = False
+        if failed_set:
+            fallback_set = await set_config("BB_TTS_FALLBACK_ENABLED", True)
+        if failed_set and fallback_set:
             self._fallback_triggered = True
             self._failure_counts[failed_service] = 0
             logger.warning(
                 f"🔄 TTS FALLBACK TRIGGERED: {failed_service} → {alt_service}"
             )
             await self._send_fallback_alert(failed_service, alt_service)
         else:
-            logger.error("[TTS Health] Failed to update Redis for fallback!")
+            logger.error(
+                "[TTS Health] Failed to update Redis for fallback "
+                f"(failed_provider={failed_set}, fallback_enabled={fallback_set})"
+            )

Comment on lines +238 to +269
async def set_config(key: str, value: Any) -> bool:
    """Update a single config value in Redis at runtime.

    This is used for runtime updates like TTS fallback triggers.
    Note: This does NOT update DevCycle - that remains the source of truth.
    When DevCycle syncs, it may overwrite values set through this function.

    Args:
        key: Configuration key to update
        value: New value to set

    Returns:
        True if update was successful, False otherwise
    """
    try:
        redis = await get_redis_service()
        client = await redis.get_client()

        # Load existing flags
        raw = await client.get(FEATURE_FLAGS_KEY)
        all_flags = json.loads(raw) if raw else {}

        # Update the specific key (normalize to match DevCycle format)
        normalized_key = normalize_key(key)
        all_flags[normalized_key] = value

        # Write back
        await client.set(FEATURE_FLAGS_KEY, json.dumps(all_flags))
        logger.info(f"Runtime config updated: {normalized_key} = {value}")
        return True
    except Exception as e:
        logger.error(f"Failed to set config {key}: {e}")
        return False

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's find and read the store.py file
fd -t f "store.py" --path "*live_config*"

Repository: juspay/clairvoyance

Length of output: 296


🏁 Script executed:

# Look at the actual file to verify the snippet
cat -n app/services/live_config/store.py | sed -n '230,280p'

Repository: juspay/clairvoyance

Length of output: 2126


🏁 Script executed:

# Search for other places where FEATURE_FLAGS_KEY is written
rg "FEATURE_FLAGS_KEY" -B 2 -A 2

Repository: juspay/clairvoyance

Length of output: 1947


🏁 Script executed:

# Check if WATCH/MULTI/transactions are used elsewhere in the file
rg "WATCH|MULTI|EXEC|watch|multi|exec" app/services/live_config/

Repository: juspay/clairvoyance

Length of output: 45


🏁 Script executed:

# Check for concurrent patterns - see how Redis client is used
rg "get_redis_service|get_client" -B 2 -A 5 app/services/live_config/

Repository: juspay/clairvoyance

Length of output: 2771


🏁 Script executed:

# Look at the DevCycle sync operation around line 203
cat -n app/services/live_config/store.py | sed -n '195,215p'

Repository: juspay/clairvoyance

Length of output: 788


🏁 Script executed:

# Check if there's any locking mechanism or mutex elsewhere in the codebase
rg "asyncio.Lock|Lock\(\)|threading.Lock|mutex" app/services/live_config/

Repository: juspay/clairvoyance

Length of output: 45


🏁 Script executed:

# Check how frequently set_config and sync are called
rg "set_config\(" -B 2 -A 2

Repository: juspay/clairvoyance

Length of output: 932


🏁 Script executed:

# Find the DevCycle sync operation that writes to FEATURE_FLAGS_KEY
rg "def.*sync|def.*update.*flag" app/services/live_config/ -A 10 | head -50

Repository: juspay/clairvoyance

Length of output: 807


🏁 Script executed:

# Check if there are any background tasks that sync DevCycle
rg "devcycle|DevCycle" app/services/live_config/ -B 2 -A 5

Repository: juspay/clairvoyance

Length of output: 10253


🏁 Script executed:

# Look at the tts_health_observer context more - this shows the race condition
cat -n app/ai/voice/tts/tts_health_observer.py | grep -B 5 -A 5 "set_config"

Repository: juspay/clairvoyance

Length of output: 1111


Guard against lost updates in concurrent config writes.

set_config performs a read-modify-write on FEATURE_FLAGS_KEY without atomic guarantees. With multiple writers (e.g., set_config from TTS health observer and fetch_and_update_feature_flags from DevCycle sync both executing concurrently), updates can be lost—for example, one process reads the full flags dict, another process reads it and writes back, then the first process overwrites with its stale copy. The codebase already acknowledges this risk in the docstring ("When DevCycle syncs, it may overwrite values set through this function"), but there is no atomic mechanism to prevent it.

Consider using Redis transactions (WATCH/MULTI/EXEC for optimistic locking) or storing flags as individual Redis HASHes instead of a single JSON string to avoid RMW cycles.
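A toy in-memory model makes the difference between the two schemes concrete (this is plain dict code, not the Redis client API): with one JSON blob under a single key, a writer holding a stale snapshot silently drops a concurrent update, while per-field writes (HSET semantics) have no read-modify-write cycle to race on.

```python
import json

# Scheme A (current): the whole flags dict serialized under one key -> RMW race.
store_a = {"feature_flags": json.dumps({"x": 1})}

def set_config_rmw(snapshot: str, key: str, value) -> str:
    """Writer that read `snapshot` earlier; writing it back loses concurrent updates."""
    flags = json.loads(snapshot)
    flags[key] = value
    return json.dumps(flags)

stale = store_a["feature_flags"]  # writer 1 takes its snapshot
store_a["feature_flags"] = set_config_rmw(store_a["feature_flags"], "y", 2)  # writer 2 commits
store_a["feature_flags"] = set_config_rmw(stale, "z", 3)  # writer 1 overwrites with stale copy
print(json.loads(store_a["feature_flags"]))  # {'x': 1, 'z': 3} -- 'y' was lost

# Scheme B (suggested): one hash field per flag (HSET semantics) -> no RMW.
store_b = {"x": 1}
store_b["y"] = 2  # writer 2
store_b["z"] = 3  # writer 1; both updates survive
print(store_b)    # {'x': 1, 'y': 2, 'z': 3}
```

In real Redis, scheme B is `HSET feature_flags <field> <value>`; the WATCH/MULTI/EXEC alternative keeps the JSON blob but retries the whole RMW cycle whenever the watched key changed underneath it.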

🧰 Tools
🪛 Ruff (0.14.14)

[warning] 267-267: Consider moving this statement to an else block

(TRY300)


[warning] 268-268: Do not catch blind exception: Exception

(BLE001)

