Test scripts to verify A2E (Audio2Expression) lip sync quality with Japanese
audio input, before investing in ZIP motion replacement or VHAP Japanese FLAME
params.
Includes:
- generate_test_audio.py: EdgeTTS Japanese/English/Chinese audio samples
- test_a2e_cpu.py: A2E model loading, Wav2Vec2 feature extraction, ZIP validation
- save_a2e_output.py: Capture A2E 52-dim ARKit blendshape output
- analyze_blendshapes.py: Lip sync quality scoring and language comparison
- setup_oac_env.py: Auto-detect known OpenAvatarChat issues (CPU mode, deps, config)
- chat_with_lam_jp.yaml: Corrected config (Gemini API + EdgeTTS ja-JP-NanamiNeural)
- run_all_tests.py: Master test runner
- TEST_PROCEDURE.md: Step-by-step test procedure
https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
Fix RuntimeError: Input data type <class 'list'> is not supported.
- diagnose_onnx_error.py: Tests SileroVAD ONNX, SenseVoice, data flow
- patch_vad_handler.py: Fixes timestamp[0] NoneType bug, adds defensive numpy
type checking on ONNX inputs, handles 2/3-output model variants
- setup_oac_env.py: Adds VAD handler bug detection (check 7/7)
https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
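The defensive type check described above can be pictured roughly as follows. This is a minimal sketch, not the actual patch: to_onnx_input and unpack_vad_outputs are hypothetical helper names, the float32 dtype is an assumption about what the VAD model expects, and the meaning assigned to the 2-output and 3-output variants (a single state tensor versus separate h/c tensors) is likewise an assumption.

```python
import numpy as np


def to_onnx_input(value, dtype=np.float32):
    """Coerce a list/tuple/array into a contiguous ndarray of the given dtype."""
    if value is None:
        raise ValueError("ONNX input is None")
    # np.asarray accepts plain Python lists, which ONNX Runtime rejects as-is.
    return np.ascontiguousarray(np.asarray(value, dtype=dtype))


def unpack_vad_outputs(outputs):
    """Normalize model variants that return 2 or 3 outputs.

    Assumption: 2 outputs are (prob, state), 3 outputs are (prob, h, c).
    """
    if len(outputs) == 2:
        prob, state = outputs
        return prob, (state,)
    if len(outputs) == 3:
        prob, h, c = outputs
        return prob, (h, c)
    raise ValueError(f"unexpected number of ONNX outputs: {len(outputs)}")
```

Converting at the boundary keeps the handler tolerant of upstream code that hands over Python lists instead of arrays.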
Simple test script that verifies environment, model files, data_bundle.py fix, Wav2Vec2 loading, and A2E module import. https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
Gemini's OpenAI-compatible API sometimes returns delta.content as dict/list instead of string, causing TypeError in set_main_data(). This patch script detects and safely converts non-string content before passing to data_bundle. https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
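A minimal sketch of the kind of conversion this patch performs, under stated assumptions: coerce_content is a hypothetical name, and the choice to read a "text" key from dict content is an assumption for illustration, not a documented property of the Gemini response.

```python
import json


def coerce_content(content):
    """Return a string for any delta.content value (str, None, dict, list, other)."""
    if content is None:
        return ""
    if isinstance(content, str):
        return content
    if isinstance(content, dict):
        # Assumption: a text-bearing dict carries its text under "text".
        text = content.get("text")
        if isinstance(text, str):
            return text
        # Otherwise serialize the dict so the payload stays visible in logs.
        return json.dumps(content, ensure_ascii=False)
    if isinstance(content, (list, tuple)):
        # Concatenate the string form of each part.
        return "".join(coerce_content(part) for part in content)
    return str(content)
```

With this in place, an error dict is passed along as readable text instead of raising TypeError further down.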
gemini-2.0-flash returns 404 "no longer available to new users". The error dict then cascades into the set_main_data TypeError. https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
SenseVoice auto-detection defaults to Chinese (<|zh|>), causing Japanese speech
to be misrecognized as Chinese text. This patch forces language="ja" in the
generate() call.
- patch_asr_language.py: Auto-patches asr_handler_sensevoice.py
- chat_with_lam_jp.yaml: Added language: "ja" to SenseVoice config
- TEST_PROCEDURE.md: Added Step 4.5 for patch application
https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
Instead of creating a separate config file, this script patches the existing
working config/chat_with_lam.yaml with 3 changes:
1. TTS voice → ja-JP-NanamiNeural
2. LLM system_prompt → Japanese
3. ASR language → ja
https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
Root cause analysis from production logs:
- 1st ASR call: rtf=0.629 (1.25s) - OK
- 2nd ASR call: rtf=15.027 (29.83s) - GPU memory exhausted, CPU fallback
- fastrtc 60s timeout triggers, resets frame pipeline → system unresponsive
Fix: Add torch.cuda.empty_cache() + gc.collect() after each SenseVoice and LAM
inference to free GPU memory between calls. Also adds startup wrapper with
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True.
https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
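If rtf is processing time divided by audio duration (an assumption about the log format), both calls handled roughly the same 2 s of audio: 1.25 / 0.629 ≈ 1.99 s and 29.83 / 15.027 ≈ 1.99 s, so the slowdown came from the execution path rather than from longer input. The cleanup itself can be sketched as a context manager. release_gpu_memory is a hypothetical name, and the torch import is optional so the sketch also runs where PyTorch is not installed.

```python
import gc
from contextlib import contextmanager

try:
    import torch  # optional: the sketch still runs without PyTorch
except ImportError:
    torch = None


@contextmanager
def release_gpu_memory():
    """Run an inference block, then free garbage and cached CUDA blocks."""
    try:
        yield
    finally:
        # Drop unreachable Python objects first so their tensors can be freed.
        gc.collect()
        if torch is not None and torch.cuda.is_available():
            # Return cached, unused blocks from PyTorch's allocator to the driver.
            torch.cuda.empty_cache()
```

Each inference call would then be wrapped in "with release_gpu_memory():". PYTORCH_CUDA_ALLOC_CONF has to be set in the environment before the process initializes CUDA, which is why the commit uses a startup wrapper for it.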
Create the missing Audio2Expression inference service that bridges
gourmet-support backend (which already has A2E hooks in /api/tts/synthesize)
with the actual Wav2Vec2 + LAM A2E decoder pipeline.
Services:
- audio2exp-service: Flask API accepting MP3 audio, returning 52-dim ARKit
blendshape coefficients at 30fps. Includes Wav2Vec2 feature extraction and
fallback mode when A2E decoder is unavailable.
- Frontend ExpressionManager: Maps A2E blendshapes to GVRM bone system, syncing
with audio playback via currentTime.
Architecture: TTS → MP3 → audio2exp-service → 52-dim blendshapes → frontend
https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
The a2e_engine now searches multiple patterns for the checkpoint:
- models/LAM_audio2exp_streaming.tar (flat, user's actual layout)
- models/LAM_audio2exp/pretrained_models/*.tar (OpenAvatarChat layout)
- models/LAM_audio2exp/*.tar (intermediate layout)
Falls back to rglob search if none match.
https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
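The search order above can be sketched with pathlib. find_checkpoint is a hypothetical name, and taking the first match in sorted order when a pattern matches several files is an assumption; the commit does not say how ties are resolved.

```python
from pathlib import Path


def find_checkpoint(models_dir):
    """Return the first checkpoint matching the known layouts, else None."""
    root = Path(models_dir)
    patterns = [
        "LAM_audio2exp_streaming.tar",            # flat layout
        "LAM_audio2exp/pretrained_models/*.tar",  # OpenAvatarChat layout
        "LAM_audio2exp/*.tar",                    # intermediate layout
    ]
    for pattern in patterns:
        matches = sorted(root.glob(pattern))
        if matches:
            return matches[0]
    # Last resort: recursive search for any .tar under the models directory.
    matches = sorted(root.rglob("*.tar"))
    return matches[0] if matches else None
```

Trying the specific layouts first keeps the result predictable; the recursive search only runs when none of them exist.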
Full drop-in replacement for gourmet-sp's concierge-controller.ts with
Audio2Expression integration applied.
Key changes marked with ★ comments:
- ExpressionManager import and initialization
- session_id added to /api/tts/synthesize requests
- A2E expression data used for lip sync when available
- FFT-based lip sync preserved as fallback
- Proper cleanup in stopAvatarAnimation() and dispose()
https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
Replaces the scaffold version with the real concierge-controller.ts from gourmet-sp (claude/test-concierge-modal-rewGs branch). A2E integration is already built-in via applyExpressionFromTts() + lamAvatarController. https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
uvicorn is an ASGI server (FastAPI/Starlette) and cannot serve Flask (WSGI). This caused the Cloud Run container to fail to start and listen on the port, resulting in deployment timeout. https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
Covers all components: backend (gourmet-support), frontend (gourmet-sp), audio2exp-service, A2E frontend patches, official HF Spaces ZIP generation procedure, test suite, deployment config, and end-to-end data flow diagrams. https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
The audio2exp-service returns frames as arrays of numbers (number[][]),
but applyExpressionFromTts expected objects with a .weights property
({weights: number[]}[]), causing TypeError and empty frame buffer.
Changed f.weights[i] to frameData[i] to match the actual backend format.
https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
…AvatarController)
The previous implementation used window.lamAvatarController which doesn't exist
in this codebase, causing lip sync to completely fail (buffer=0, jaw=0,
mouth=0). Additionally, the data format was wrong (f.weights[i] vs the actual
number[][] response).
Now uses ExpressionManager (vrm-expression-manager.ts) which:
- Correctly handles the number[][] frame format from audio2exp-service
- Syncs to audioElement.currentTime for accurate lip sync timing
- Maps ARKit blendshapes (jawOpen, mouthFunnel, etc.) to GVRM bone system
- Calls renderer.updateLipSync() directly
Changes:
- Import ExpressionManager and initialize in init()
- Replace lamAvatarController dependency with ExpressionManager
- Add expressionManager.stop() in stopAvatarAnimation()
- All 5 call sites (speakTextGCP, speakResponseInChunks x2, shop TTS x2) now
correctly drive lip sync through ExpressionManager
https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
The import '../avatar/vrm-expression-manager' caused a Vite build error because that file doesn't exist in gourmet-sp's src/scripts/avatar/. Solution: inline the ExpressionManager class directly into concierge-controller.ts. This eliminates the need to copy a separate file into gourmet-sp and avoids import resolution issues. The ARKIT_INDEX map is trimmed to only the 7 mouth-related blendshapes actually used for lip sync (jawOpen, mouthFunnel, mouthPucker, etc.) https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
Root cause: this.guavaRenderer doesn't exist on CoreController.
LAMAvatar.astro has its own animation loop with buffer/ttsActive state.
The ExpressionManager approach was the wrong architecture entirely.
Correct approach: use window.lamAvatarController exposed by LAMAvatar.astro
- setExternalTtsPlayer(): links ttsPlayer so LAMAvatar can track playback
- queueExpressionFrames(): feeds A2E frames into LAMAvatar's buffer
- clearFrameBuffer(): clears buffer on stop/new segment
Changes:
- Remove inlined ExpressionManager class (120 lines of dead code)
- Restore lamAvatarController.setExternalTtsPlayer() with retry (500ms x 20)
- applyExpressionFromTts: convert number[][] → {name: value}[] and queue
- stopAvatarAnimation: call clearFrameBuffer() to close mouth
Console should now show:
- "[Concierge] ✅ Linked ttsPlayer with LAMAvatar controller"
- "[Concierge] A2E: N frames queued @ 30fps"
- LAM Health: buffer>0, ttsActive=true during speech
https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
… code
Read the ACTUAL LAMAvatar.astro, lam-websocket-manager.ts, and
audio-sync-player.ts from gourmet-sp to understand the real architecture.
Key findings:
- LAMAvatar.getExpressionData() is called at 60fps by renderer
- It reads frameBuffer[floor(ttsPlayer.currentTime * frameRate)]
- Requires: externalTtsPlayer linked, frameBuffer filled, ttsActive=true
- ttsActive is set by play event (requires setExternalTtsPlayer first)
4 chains must ALL work for lip sync:
Chain1: Backend must return expression data (needs AUDIO2EXP_SERVICE_URL)
Chain2: setExternalTtsPlayer must link ttsPlayer with LAMAvatar
Chain3: applyExpressionFromTts must convert & queue frames
Chain4: LAMAvatar renders from frameBuffer synced to currentTime
Added diagnostic logs at each chain point:
[A2E Chain1] expression received or null (backend config issue)
[A2E Chain2] setExternalTtsPlayer success or LAMAvatar not found
[A2E Chain3] frames queued with jawOpen sample value
https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
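The lookup frameBuffer[floor(ttsPlayer.currentTime * frameRate)] can be worked through with a small sketch. It is written in Python purely for illustration (the real code is in LAMAvatar.astro); frame_index is a hypothetical name, and clamping to the last frame is an assumption about how playback past the end of the buffer is handled.

```python
import math


def frame_index(current_time, frame_rate, frame_count):
    """Index of the frame to render at current_time, or None if no frames."""
    if frame_count <= 0:
        return None
    index = math.floor(current_time * frame_rate)
    # Assumption: hold the first/last frame instead of indexing out of range.
    return max(0, min(index, frame_count - 1))
```

At 30 fps, a playback position of 0.5 s selects frame 15. Because the renderer polls at 60 fps, each 30 fps frame is read about twice in a row, which is why the lookup depends only on currentTime and not on how often it is called.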
…meBuffer, support both frame formats
Compared with the ORIGINAL gourmet-sp concierge-controller.ts (from
claude/test-concierge-modal-rewGs branch) and found 2 bugs:
1. stopAvatarAnimation() called clearFrameBuffer() which resets
fadeOutStartTime=null, breaking LAMAvatar's graceful 200ms fade-out.
The ORIGINAL code trusts LAMAvatar's own ended event handler.
→ Removed clearFrameBuffer() from stopAvatarAnimation()
2. Frame data format mismatch:
- Original gourmet-sp: f.weights[i] (expects {weights: number[]}[])
- audio2exp-service: number[][] (raw arrays)
→ Now supports BOTH formats: Array.isArray(f) ? f : f.weights
Key fact: before A2E changes, lip sync was working via the renderer's
built-in FFT analysis. The A2E code path was dead code (AUDIO2EXP_SERVICE_URL
not set). These changes ensure A2E is a pure overlay that doesn't break
the existing FFT lip sync.
https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
Root cause: When AUDIO2EXP_SERVICE_URL is set, the backend returns
expression data. The original code's applyExpressionFromTts used
f.weights[i] on raw number[] arrays, causing TypeError → caught by
outer try/catch → isAISpeaking=false → STT worked (lucky bug).
My both-format fix removed this error, so audio playback proceeds.
But if the browser blocks autoplay (fires play then immediate pause),
onended never fires → playPromise never resolves → initializeSession
hangs → buttons never enabled → STT completely broken.
Fix: Add onpause deadlock prevention to ALL 8 play-and-wait patterns,
matching the existing pattern in ack playback (line 588):
    this.ttsPlayer.onpause = () => {
      if (this.ttsPlayer.currentTime < 0.1) done();
    };
This detects "play then immediate pause" (autoplay block) and resolves
the promise, preventing deadlock. Normal mid-playback pauses (currentTime
> 0.1) are not affected.
https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
Minimize the diff against the original gourmet-sp concierge-controller.ts.
The only substantive change is the applyExpressionFromTts method:
- Frame format: f.weights[i] → Array.isArray(f) ? f : (f.weights || [])
(supports the number[][] format from audio2exp-service)
- Errors are handled as non-fatal via try/catch
- All other methods (speakTextGCP, STT, sendMessage, etc.) are identical to the
original
https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
…ration
Previous patches removed all GVRM renderer integration (import, guavaRenderer,
setupAudioAnalysis, startLipSyncLoop) and replaced with non-existent
window.lamAvatarController calls, causing all A2E data to be silently dropped
and lip sync to degrade to basic jaw flapping.
This rewrite is based on the actual production concierge-controller.ts with
minimal A2E additions:
- Restore GVRM import, guavaRenderer, setupAudioAnalysis, startLipSyncLoop
- Add a2eFrames/a2eFrameRate/a2eNames properties for expression storage
- Add setA2EFrames() to store expression data from TTS response
- Add computeMouthOpenness() to convert 52-dim ARKit blendshapes to scalar
- Modify startLipSyncLoop() to use A2E frames when available, FFT as fallback
- Override speakTextGCP() with inline fetch to include session_id
- Add session_id to ALL TTS requests (ack, chunks, shop flow)
https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
…t GVRM)
Root cause: The patch was based on gourmet-support's concierge-controller.ts
which uses GVRM renderer, but the actual deployed frontend (gourmet-sp) uses
LAMAvatar.astro with a completely different rendering pipeline.
Previous patch problems:
- Added GVRM import/renderer that doesn't exist in gourmet-sp
- Missing linkTtsPlayer() - LAMAvatar never received ttsPlayer reference
-> ttsActive=false, buffer=0, lip sync completely dead
- Added setupAudioAnalysis()/startLipSyncLoop() for FFT - unnecessary with
LAMAvatar
- Called clearFrameBuffer() in stopAvatarAnimation() - breaks LAMAvatar fade-out
Fix: Use the exact gourmet-sp version which correctly:
- Links ttsPlayer to LAMAvatar via setExternalTtsPlayer() in init()
- Sends A2E frames via applyExpressionFromTts() ->
lamAvatarController.queueExpressionFrames()
- Lets LAMAvatar handle all lip sync rendering internally
- Does NOT call clearFrameBuffer() in stopAvatarAnimation()
https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
…rpolate frames
Changes to applyExpressionFromTts():
1. Mouth blendshape amplification: Scale jawOpen (1.4x), mouthFunnel/Pucker
(1.5x), mouthSmile (1.3x), mouthStretch (1.2x) etc. for more visible Japanese
vowel distinctions (あ/い/う/え/お)
2. Frame interpolation: 30fps→60fps via linear interpolation between
consecutive frames, matching the renderer's ~60fps render loop for smoother
animation
3. Diagnostic logging: jawOpen/mouthFunnel/mouthSmile max/avg values logged per
expression segment for live quality monitoring
4. LinkTtsPlayer retry: Multiple retry attempts (500ms, 1s, 2s, 4s) with
logging to reliably connect ttsPlayer to LAMAvatar even with async
initialization
Quality context: A2E streaming model (wav2vec2-base-960h, no transformer)
produces subtle Japanese phoneme variations. Frontend amplification makes these
visible.
https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
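Items 1 and 2 above can be sketched as two small functions. They are written in Python for illustration only (the actual change is in the TypeScript controller); amplify and interpolate_2x are hypothetical names, and clamping amplified values to [0, 1] is an assumption, since the commit does not state how out-of-range values are handled.

```python
def amplify(frame, names, gains):
    """Scale selected blendshapes by a per-name gain, clamped to [0, 1]."""
    return [
        min(1.0, max(0.0, value * gains.get(name, 1.0)))
        for name, value in zip(names, frame)
    ]


def interpolate_2x(frames):
    """Insert the midpoint between consecutive frames (30 fps -> ~60 fps)."""
    if not frames:
        return []
    out = []
    for a, b in zip(frames, frames[1:]):
        out.append(list(a))
        # Linear interpolation at t = 0.5 between neighbouring frames.
        out.append([(x + y) / 2.0 for x, y in zip(a, b)])
    out.append(list(frames[-1]))
    return out
```

N input frames become 2N - 1 output frames, so the queued frame rate has to be reported as 60 for the currentTime-based lookup to stay in sync with the audio.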
… objects)
The user rewrote audio2exp-service with a2e_engine.py (Flask) which returns
frames as plain arrays [[0.1, ...], ...] instead of the old FastAPI format
[{"weights": [0.1, ...]}, ...].
Frontend now detects both formats: Array.isArray(f) ? f : f.weights
https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
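The same two-format handling is useful to any client of the service, so here is a minimal sketch of the check in Python. It mirrors the frontend expression Array.isArray(f) ? f : f.weights; frame_weights is a hypothetical name, and returning an empty list for unrecognized input is an assumption.

```python
def frame_weights(frame):
    """Return the weight list for either frame format.

    Accepts a plain array ([0.1, ...]) or an object ({"weights": [0.1, ...]}).
    """
    if isinstance(frame, (list, tuple)):
        return list(frame)
    if isinstance(frame, dict):
        return list(frame.get("weights") or [])
    # Assumption: unknown shapes are treated as an empty frame, not an error.
    return []
```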
Step 1: Add __testLipSync() diagnostic to concierge-controller.ts patch
- Generates 5 Japanese vowel patterns (あいうえお) with known ARKit values
- Creates silent WAV audio, queues frames to LAMAvatar, plays through ttsPlayer
- Verifies whether renderer supports full 52-dim blendshapes
Step 3: Fix a2e_engine.py to use the proper LAM INFER pipeline
- Restore LAM_Audio2Expression module (engines, models, utils, configs)
- Rewrite _load_a2e_decoder → _try_load_infer_pipeline using INFER.build()
- Use infer_streaming_audio() with context for chunked processing
- Includes full postprocessing: smooth_mouth, frame_blending, savitzky_golay,
symmetrize, eye_blinks
- Falls back to Wav2Vec2 energy-based approximation when INFER unavailable
- Add librosa, scipy, addict to requirements.txt
- Add libsndfile to Dockerfile
https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
Three issues fixed during local testing:
1. transformers v5.x requires ignore_mismatched_sizes=True and
attn_implementation="eager" for Wav2Vec2Model.from_pretrained()
2. HuggingFace checkpoint is double-wrapped (tar.gz containing
pretrained_models/lam_audio2exp_streaming.tar) - auto-extract
3. Bare except in infer.py swallowed tracebacks and crashed on uninitialized
output_dict - now logs actual error and recovers
Result: audio2exp-service starts with mode="infer" and produces 52-dim ARKit
blendshapes from audio input.
https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
Exclude downloaded model weights (wav2vec2, LAM checkpoint ~1.1GB) from version control. https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
Flask's app.run() auto-loads .env files when python-dotenv is installed, which crashes with UnicodeDecodeError if a non-UTF-8 .env exists in the path. Pass load_dotenv=False since env vars are set externally. https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
Problem: In concierge mode, TTS audio was completely skipped for text input
because speakResponseInChunks passed isTextInput=true as the skipAudio
parameter to speakTextGCP. The avatar should always speak responses in
concierge mode regardless of input method.
Changes:
- Remove isTextInput guard from speakResponseInChunks (TTS plays for both text
and voice input when speaker is enabled)
- Fix all speakTextGCP calls to use skipAudio=false in concierge paths
- Add play() error handling (onerror + catch) to prevent silent hangs when
browser autoplay policy blocks playback
- Remove !isTextInput guards from shop intro TTS path
https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
Comprehensive handoff covering:
- Owner's true goals (iPhone SE standalone, no backend GPU, production alpha)
- LAM technical architecture (paper + WebGL SDK + A2E)
- Past session mistakes and warnings for next AI
- Current system state and completed work
- Unresolved architecture decisions
- Priority action items
https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
…timeout
Root cause: warmup inference (infer_streaming_audio with dummy audio) hangs
indefinitely on CPU, preventing engine._ready from becoming True. The health
endpoint returns {"engine_ready":false,"status":"loading"} forever.
Changes:
- app.py: load engine in background thread so gunicorn responds to Cloud Run
startup probes immediately; health endpoint returns 200 with status
"loading"/"healthy"/"error" accordingly
- a2e_engine.py: add SIGALRM-based timeout (default 120s) to warmup inference
so a hang doesn't block engine initialization forever
- Dockerfile: fix PORT from 8081 to 8080 to match Cloud Run config, switch CMD
to shell form for ${PORT} variable expansion, add torchaudio CPU install
- start.sh: quote bind address for safety
https://claude.ai/code/session_01TUEGRBQaaga67AXVGbNUs3
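The background-loading change in app.py can be pictured as follows. This is a minimal sketch using only the standard library; EngineState and load_in_background are hypothetical names, and the real service wires the status into a Flask health route that is not shown here.

```python
import threading


class EngineState:
    """Tracks engine loading so the health endpoint can answer immediately."""

    def __init__(self):
        self.ready = False
        self.error = None

    def status(self):
        if self.error is not None:
            return "error"
        return "healthy" if self.ready else "loading"


def load_in_background(state, load_fn):
    """Run load_fn in a daemon thread and record the outcome on state."""

    def worker():
        try:
            load_fn()
            state.ready = True
        except Exception as exc:  # keep the reason for the health response
            state.error = str(exc)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

Because the web worker never blocks on model loading, startup probes get a response right away and the status string tells the caller which phase the engine is in.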
Root causes:
1. /app/models was empty in container (.gitignore excludes models/) so wav2vec2
fallback tried downloading ~360MB from HuggingFace at runtime, which hangs
indefinitely on Cloud Run
2. signal.SIGALRM used in background thread (Python only allows signals in main
thread) causing ValueError
3. No timeout on engine loading - health check returns "loading" forever
Fixes:
- Download wav2vec2-base-960h during Docker build (baked into image)
- Set HF_HUB_OFFLINE=1 to prevent runtime HuggingFace downloads
- Install CPU-only PyTorch before requirements.txt to avoid downloading GPU
version first
- Replace signal.SIGALRM with threading.Event for warmup timeout
- Add ENGINE_LOAD_TIMEOUT (300s) - health check switches to error state after
timeout instead of staying in "loading" forever
- Health check returns elapsed_seconds for debugging
- Error state returns 503 instead of 200
https://claude.ai/code/session_01TUEGRBQaaga67AXVGbNUs3
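A threading.Event-based timeout works from any thread, unlike signal.SIGALRM. The following is a minimal sketch of that pattern, not the code in a2e_engine.py; run_with_timeout is a hypothetical name.

```python
import threading


def run_with_timeout(fn, timeout_s):
    """Run fn() in a worker thread; return (finished, result_or_None)."""
    done = threading.Event()
    box = {}

    def worker():
        try:
            box["result"] = fn()
        except Exception as exc:
            box["error"] = exc
        finally:
            done.set()

    threading.Thread(target=worker, daemon=True).start()
    # wait() returns False if the timeout elapsed before the worker finished.
    finished = done.wait(timeout_s)
    if finished and "error" in box:
        raise box["error"]
    return finished, box.get("result")
```

One limitation to keep in mind: a timed-out worker is not stopped, it is only abandoned. The caller can move on and mark warmup as skipped, but the thread keeps consuming CPU until fn() returns.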
Root causes of audio2exp-service stuck in "loading" state:
1. Dockerfile saved wav2vec2 in HF cache format (cache_dir) but
_find_wav2vec_dir() expects standard format with config.json at root.
Fix: use save_pretrained() to save in standard directory format.
2. When wav2vec2 not found locally, code fell back to HuggingFace model ID
("facebook/wav2vec2-base-960h") which hangs at runtime because
HF_HUB_OFFLINE=1 is set. Fix: fail fast instead of attempting download.
3. gunicorn --timeout 120 too short for model loading on CPU, and
--threads 4 causes signal-related thread-safety issues.
Fix: --timeout 300 --threads 1.
https://claude.ai/code/session_01TUEGRBQaaga67AXVGbNUs3
Root cause: sync worker with 1 thread + heavy CPU model loading in daemon
thread. The daemon thread holds the GIL for minutes (import torch,
INFER.build), preventing gunicorn from sending heartbeats to the arbiter. After
timeout seconds, the arbiter kills the worker, master exits, and Cloud Run
restarts the container, producing an infinite restart loop.
Changes:
- Switch to gthread worker class with 2 threads so health checks can respond
even while model loading holds the GIL
- Increase timeout from 300 to 600 seconds (model loading takes 10+ min on CPU)
- Set WARMUP_TIMEOUT=0 to skip warmup inference on CPU (saves minutes)
- Add WARMUP_TIMEOUT<=0 check to skip warmup entirely when set to 0
https://claude.ai/code/session_01TUEGRBQaaga67AXVGbNUs3
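The worker settings above could equally be expressed as a gunicorn config file, which is plain Python. This is a sketch of an equivalent gunicorn.conf.py, not the repository's actual start script (which passes command-line flags); workers = 1 and the PORT default of 8080 are assumptions.

```python
# gunicorn.conf.py (sketch)
import os

# Cloud Run injects PORT; 8080 is assumed as the default here.
bind = f"0.0.0.0:{os.environ.get('PORT', '8080')}"

# Threaded worker so a health check can be served while the model loads.
worker_class = "gthread"
workers = 1   # assumption: a single process, so the model is loaded once
threads = 2

# Seconds of worker silence the arbiter tolerates before killing it.
timeout = 600
```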
Add timing logs for each initialization step (import, config parse, model build, device transfer) to identify where the engine hangs during Cloud Run startup. https://claude.ai/code/session_01TUEGRBQaaga67AXVGbNUs3
jawOpen and mouthLowerDown amplification was causing unnatural jaw pulling and overly wide mouth opening. Reduced all amplification factors (jawOpen 1.4→0.55, mouthLowerDown 1.3→0.5) while preserving vowel distinction (funnel/pucker/smile). Added EMA temporal smoothing (α=0.6) to eliminate abrupt mouth shape transitions. https://claude.ai/code/session_01TUEGRBQaaga67AXVGbNUs3
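EMA temporal smoothing with α=0.6 can be sketched as below, in Python for illustration (the change itself is in the frontend). ema_smooth is a hypothetical name, and the convention that α weights the current frame (smoothed = α * current + (1 - α) * previous) is an assumption; the commit does not say which term α applies to.

```python
def ema_smooth(frames, alpha=0.6):
    """Exponential moving average over a sequence of blendshape frames."""
    smoothed = []
    prev = None
    for frame in frames:
        if prev is None:
            # The first frame has no history, so it passes through unchanged.
            cur = list(frame)
        else:
            cur = [alpha * x + (1.0 - alpha) * p for x, p in zip(frame, prev)]
        smoothed.append(cur)
        prev = cur
    return smoothed
```

Under that convention a step from 0 to 1 becomes 0.6 on the first frame and 0.84 on the second, so abrupt transitions are spread over a few frames at the cost of a small lag.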