Claude/test a2e japanese audio j9 vbt by mirai-gpro · Pull Request #86 · aigc3d/LAM

mirai-gpro · 2026-02-21T02:09:14Z

No description provided.

Test scripts to verify A2E (Audio2Expression) lip sync quality with Japanese audio input, before investing in ZIP motion replacement or VHAP Japanese FLAME params.

Includes:
- generate_test_audio.py: EdgeTTS Japanese/English/Chinese audio samples
- test_a2e_cpu.py: A2E model loading, Wav2Vec2 feature extraction, ZIP validation
- save_a2e_output.py: Capture A2E 52-dim ARKit blendshape output
- analyze_blendshapes.py: Lip sync quality scoring and language comparison
- setup_oac_env.py: Auto-detect known OpenAvatarChat issues (CPU mode, deps, config)
- chat_with_lam_jp.yaml: Corrected config (Gemini API + EdgeTTS ja-JP-NanamiNeural)
- run_all_tests.py: Master test runner
- TEST_PROCEDURE.md: Step-by-step test procedure

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM

Fix RuntimeError: Input data type <class 'list'> is not supported.
- diagnose_onnx_error.py: Tests SileroVAD ONNX, SenseVoice, data flow
- patch_vad_handler.py: Fixes timestamp[0] NoneType bug, adds defensive numpy type checking on ONNX inputs, handles 2/3-output model variants
- setup_oac_env.py: Adds VAD handler bug detection (check 7/7)

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM

Simple test script that verifies environment, model files, data_bundle.py fix, Wav2Vec2 loading, and A2E module import. https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM

Gemini's OpenAI-compatible API sometimes returns delta.content as dict/list instead of string, causing TypeError in set_main_data(). This patch script detects and safely converts non-string content before passing to data_bundle. https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
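The conversion described above could look roughly like the following. This is a minimal sketch, not the patch script itself: the helper name `coerce_delta_content` and the assumption that a text-bearing dict stores its text under a `"text"` key are illustrative and do not come from the PR.

```python
import json
from typing import Any


def coerce_delta_content(content: Any) -> str:
    """Return a plain string for any delta.content value (hypothetical helper)."""
    if content is None:
        return ""
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        # Convert each part in turn and join the results.
        return "".join(coerce_delta_content(part) for part in content)
    if isinstance(content, dict):
        # Assumption: a text-bearing dict keeps its text under "text".
        text = content.get("text")
        if isinstance(text, str):
            return text
        # Anything else (for example an error payload) is serialized as JSON.
        return json.dumps(content, ensure_ascii=False)
    return str(content)
```

Calling a helper like this before `set_main_data()` means the data bundle only ever sees strings, so an error dict from the API shows up as readable text instead of raising a TypeError.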

gemini-2.0-flash returns 404 "no longer available to new users". The error dict then cascades into the set_main_data TypeError. https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM

SenseVoice auto-detection defaults to Chinese (<|zh|>), causing Japanese speech to be misrecognized as Chinese text. This patch forces language="ja" in the generate() call.
- patch_asr_language.py: Auto-patches asr_handler_sensevoice.py
- chat_with_lam_jp.yaml: Added language: "ja" to SenseVoice config
- TEST_PROCEDURE.md: Added Step 4.5 for patch application

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM

Instead of creating a separate config file, this script patches the existing working config/chat_with_lam.yaml with 3 changes:
1. TTS voice → ja-JP-NanamiNeural
2. LLM system_prompt → Japanese
3. ASR language → ja

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM

Root cause analysis from production logs:
- 1st ASR call: rtf=0.629 (1.25s) - OK
- 2nd ASR call: rtf=15.027 (29.83s) - GPU memory exhausted, CPU fallback
- fastrtc 60s timeout triggers, resets frame pipeline → system unresponsive

Fix: Add torch.cuda.empty_cache() + gc.collect() after each SenseVoice and LAM inference to free GPU memory between calls. Also adds startup wrapper with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True.

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
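A minimal sketch of the cleanup pattern described in this commit. The wrapper name `run_inference` and the helper `release_gpu_memory` are hypothetical; only `gc.collect()`, `torch.cuda.empty_cache()` and the `PYTORCH_CUDA_ALLOC_CONF` setting come from the commit message. The `torch` import is guarded so the sketch also runs on a machine without PyTorch.

```python
import gc
import os

# Set the allocator option early, before torch is imported, so the CUDA
# allocator can see it. setdefault keeps any value supplied by the environment.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

try:
    import torch
except ImportError:  # lets the sketch run without PyTorch installed
    torch = None


def release_gpu_memory() -> int:
    """Drop unreachable Python objects, then return cached CUDA blocks."""
    collected = gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()
    return collected


def run_inference(infer, *args, **kwargs):
    """Call `infer` and always release memory afterwards (hypothetical wrapper)."""
    try:
        return infer(*args, **kwargs)
    finally:
        release_gpu_memory()
```

Running `gc.collect()` first matters: `empty_cache()` can only release blocks whose tensors are no longer referenced.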

Create the missing Audio2Expression inference service that bridges gourmet-support backend (which already has A2E hooks in /api/tts/synthesize) with the actual Wav2Vec2 + LAM A2E decoder pipeline.

Services:
- audio2exp-service: Flask API accepting MP3 audio, returning 52-dim ARKit blendshape coefficients at 30fps. Includes Wav2Vec2 feature extraction and fallback mode when A2E decoder is unavailable.
- Frontend ExpressionManager: Maps A2E blendshapes to GVRM bone system, syncing with audio playback via currentTime.

Architecture: TTS → MP3 → audio2exp-service → 52-dim blendshapes → frontend

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM

The a2e_engine now searches multiple patterns for the checkpoint:
- models/LAM_audio2exp_streaming.tar (flat, user's actual layout)
- models/LAM_audio2exp/pretrained_models/*.tar (OpenAvatarChat layout)
- models/LAM_audio2exp/*.tar (intermediate layout)

Falls back to rglob search if none match.

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
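The search order above can be sketched with `pathlib`. The function name `find_checkpoint` and the choice to return the alphabetically first match are assumptions for illustration; the three patterns and the `rglob` fallback are taken from the commit message.

```python
from pathlib import Path
from typing import Optional

# Patterns relative to the models directory, in priority order.
CHECKPOINT_PATTERNS = (
    "LAM_audio2exp_streaming.tar",            # flat layout
    "LAM_audio2exp/pretrained_models/*.tar",  # OpenAvatarChat layout
    "LAM_audio2exp/*.tar",                    # intermediate layout
)


def find_checkpoint(models_dir: Path) -> Optional[Path]:
    """Return the first checkpoint found, or None (hypothetical helper)."""
    for pattern in CHECKPOINT_PATTERNS:
        matches = sorted(models_dir.glob(pattern))
        if matches:
            return matches[0]
    # Fallback: recursive search for any .tar file below models_dir.
    fallback = sorted(models_dir.rglob("*.tar"))
    return fallback[0] if fallback else None
```

Trying the specific layouts before the recursive search keeps the result predictable when several `.tar` files exist under the models directory.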

Full drop-in replacement for gourmet-sp's concierge-controller.ts with Audio2Expression integration applied. Key changes marked with ★ comments:
- ExpressionManager import and initialization
- session_id added to /api/tts/synthesize requests
- A2E expression data used for lip sync when available
- FFT-based lip sync preserved as fallback
- Proper cleanup in stopAvatarAnimation() and dispose()

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM

Replaces the scaffold version with the real concierge-controller.ts from gourmet-sp (claude/test-concierge-modal-rewGs branch). A2E integration is already built-in via applyExpressionFromTts() + lamAvatarController. https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM

uvicorn is an ASGI server (FastAPI/Starlette) and cannot serve Flask (WSGI). This caused the Cloud Run container to fail to start and listen on the port, resulting in deployment timeout. https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM

Covers all components: backend (gourmet-support), frontend (gourmet-sp), audio2exp-service, A2E frontend patches, official HF Spaces ZIP generation procedure, test suite, deployment config, and end-to-end data flow diagrams. https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM

The audio2exp-service returns frames as arrays of numbers (number[][]), but applyExpressionFromTts expected objects with a .weights property ({weights: number[]}[]), causing TypeError and empty frame buffer. Changed f.weights[i] to frameData[i] to match the actual backend format. https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM

…AvatarController)

The previous implementation used window.lamAvatarController which doesn't exist in this codebase, causing lip sync to completely fail (buffer=0, jaw=0, mouth=0). Additionally, the data format was wrong (f.weights[i] vs the actual number[][] response).

Now uses ExpressionManager (vrm-expression-manager.ts) which:
- Correctly handles the number[][] frame format from audio2exp-service
- Syncs to audioElement.currentTime for accurate lip sync timing
- Maps ARKit blendshapes (jawOpen, mouthFunnel, etc.) to GVRM bone system
- Calls renderer.updateLipSync() directly

Changes:
- Import ExpressionManager and initialize in init()
- Replace lamAvatarController dependency with ExpressionManager
- Add expressionManager.stop() in stopAvatarAnimation()
- All 5 call sites (speakTextGCP, speakResponseInChunks x2, shop TTS x2) now correctly drive lip sync through ExpressionManager

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM

The import '../avatar/vrm-expression-manager' caused a Vite build error because that file doesn't exist in gourmet-sp's src/scripts/avatar/. Solution: inline the ExpressionManager class directly into concierge-controller.ts. This eliminates the need to copy a separate file into gourmet-sp and avoids import resolution issues. The ARKIT_INDEX map is trimmed to only the 7 mouth-related blendshapes actually used for lip sync (jawOpen, mouthFunnel, mouthPucker, etc.) https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM

Root cause: this.guavaRenderer doesn't exist on CoreController. LAMAvatar.astro has its own animation loop with buffer/ttsActive state. The ExpressionManager approach was completely wrong architecture.

Correct approach: use window.lamAvatarController exposed by LAMAvatar.astro
- setExternalTtsPlayer(): links ttsPlayer so LAMAvatar can track playback
- queueExpressionFrames(): feeds A2E frames into LAMAvatar's buffer
- clearFrameBuffer(): clears buffer on stop/new segment

Changes:
- Remove inlined ExpressionManager class (120 lines of dead code)
- Restore lamAvatarController.setExternalTtsPlayer() with retry (500ms x 20)
- applyExpressionFromTts: convert number[][] → {name: value}[] and queue
- stopAvatarAnimation: call clearFrameBuffer() to close mouth

Console should now show:
- "[Concierge] ✅ Linked ttsPlayer with LAMAvatar controller"
- "[Concierge] A2E: N frames queued @ 30fps"
- LAM Health: buffer>0, ttsActive=true during speech

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM

… code

Read the ACTUAL LAMAvatar.astro, lam-websocket-manager.ts, and audio-sync-player.ts from gourmet-sp to understand the real architecture.

Key findings:
- LAMAvatar.getExpressionData() is called at 60fps by renderer
- It reads frameBuffer[floor(ttsPlayer.currentTime * frameRate)]
- Requires: externalTtsPlayer linked, frameBuffer filled, ttsActive=true
- ttsActive is set by play event (requires setExternalTtsPlayer first)

4 chains must ALL work for lip sync:
- Chain1: Backend must return expression data (needs AUDIO2EXP_SERVICE_URL)
- Chain2: setExternalTtsPlayer must link ttsPlayer with LAMAvatar
- Chain3: applyExpressionFromTts must convert & queue frames
- Chain4: LAMAvatar renders from frameBuffer synced to currentTime

Added diagnostic logs at each chain point:
- [A2E Chain1] expression received or null (backend config issue)
- [A2E Chain2] setExternalTtsPlayer success or LAMAvatar not found
- [A2E Chain3] frames queued with jawOpen sample value

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
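The lookup in the key findings, frameBuffer[floor(ttsPlayer.currentTime * frameRate)], can be illustrated as follows. The real code lives in the TypeScript frontend; this Python sketch only mirrors the arithmetic. The function name `frame_for_time` and the clamping to the last frame are assumptions, not behavior confirmed by the PR.

```python
import math
from typing import List, Optional, Sequence


def frame_for_time(
    frame_buffer: Sequence[List[float]],
    current_time: float,
    frame_rate: float = 30.0,
) -> Optional[List[float]]:
    """Pick the expression frame for a playback position in seconds."""
    if not frame_buffer:
        return None  # nothing queued yet
    index = math.floor(current_time * frame_rate)
    # Assumption: clamp so playback past the end of the buffer holds the last frame.
    index = max(0, min(index, len(frame_buffer) - 1))
    return frame_buffer[index]
```

Because the index is derived from the audio clock rather than a frame counter, the renderer can run at 60fps against a 30fps buffer and stay in sync: each buffered frame is simply read for two consecutive render ticks.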

…meBuffer, support both frame formats

Compared with the ORIGINAL gourmet-sp concierge-controller.ts (from claude/test-concierge-modal-rewGs branch) and found 2 bugs:

1. stopAvatarAnimation() called clearFrameBuffer() which resets fadeOutStartTime=null, breaking LAMAvatar's graceful 200ms fade-out. The ORIGINAL code trusts LAMAvatar's own ended event handler.
   → Removed clearFrameBuffer() from stopAvatarAnimation()
2. Frame data format mismatch:
   - Original gourmet-sp: f.weights[i] (expects {weights: number[]}[])
   - audio2exp-service: number[][] (raw arrays)
   → Now supports BOTH formats: Array.isArray(f) ? f : f.weights

Key fact: before A2E changes, lip sync was working via the renderer's built-in FFT analysis. The A2E code path was dead code (AUDIO2EXP_SERVICE_URL not set). These changes ensure A2E is a pure overlay that doesn't break the existing FFT lip sync.

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM

Root cause: When AUDIO2EXP_SERVICE_URL is set, the backend returns expression data. The original code's applyExpressionFromTts used f.weights[i] on raw number[] arrays, causing TypeError → caught by outer try/catch → isAISpeaking=false → STT worked (lucky bug).

My both-format fix removed this error, so audio playback proceeds. But if the browser blocks autoplay (fires play then immediate pause), onended never fires → playPromise never resolves → initializeSession hangs → buttons never enabled → STT completely broken.

Fix: Add onpause deadlock prevention to ALL 8 play-and-wait patterns, matching the existing pattern in ack playback (line 588):
this.ttsPlayer.onpause = () => { if (this.ttsPlayer.currentTime < 0.1) done(); };

This detects "play then immediate pause" (autoplay block) and resolves the promise, preventing deadlock. Normal mid-playback pauses (currentTime > 0.1) are not affected.

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
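The deadlock and its fix can be modeled outside the browser. The sketch below is a Python analogue of the play-and-wait pattern, not the TypeScript from the PR: `StubPlayer` is a hypothetical stand-in for the audio element, and `start` is a hypothetical callback that simulates what the browser does after play() is requested.

```python
import asyncio
from typing import Callable, Optional


class StubPlayer:
    """Hypothetical stand-in for an HTMLAudioElement."""

    def __init__(self) -> None:
        self.current_time = 0.0
        self.onended: Optional[Callable[[], None]] = None
        self.onpause: Optional[Callable[[], None]] = None


async def play_and_wait(player: StubPlayer, start: Callable[[StubPlayer], None]) -> str:
    """Wait until playback ends or is blocked immediately after starting."""
    finished = asyncio.get_running_loop().create_future()

    def done(reason: str) -> None:
        if not finished.done():  # resolve only once
            finished.set_result(reason)

    def on_pause() -> None:
        # A pause right at the start is treated as an autoplay block;
        # a pause later in playback leaves the wait in place.
        if player.current_time < 0.1:
            done("autoplay-blocked")

    player.onended = lambda: done("ended")
    player.onpause = on_pause
    start(player)
    return await finished
```

Without the `on_pause` handler, a blocked start never fires `onended`, so the awaited future would never resolve. That is the hang described above.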

Minimize the diff against the original gourmet-sp concierge-controller.ts. The only substantive change is to the applyExpressionFromTts method:
- Frame format: f.weights[i] → Array.isArray(f) ? f : (f.weights || []) (supports the number[][] format from audio2exp-service)
- Errors are handled as non-fatal via try/catch
- All other methods (speakTextGCP, STT, sendMessage, etc.) are identical to the original

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM

…ration

Previous patches removed all GVRM renderer integration (import, guavaRenderer, setupAudioAnalysis, startLipSyncLoop) and replaced with non-existent window.lamAvatarController calls, causing all A2E data to be silently dropped and lip sync to degrade to basic jaw flapping.

This rewrite is based on the actual production concierge-controller.ts with minimal A2E additions:
- Restore GVRM import, guavaRenderer, setupAudioAnalysis, startLipSyncLoop
- Add a2eFrames/a2eFrameRate/a2eNames properties for expression storage
- Add setA2EFrames() to store expression data from TTS response
- Add computeMouthOpenness() to convert 52-dim ARKit blendshapes to scalar
- Modify startLipSyncLoop() to use A2E frames when available, FFT as fallback
- Override speakTextGCP() with inline fetch to include session_id
- Add session_id to ALL TTS requests (ack, chunks, shop flow)

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM

…t GVRM)

Root cause: The patch was based on gourmet-support's concierge-controller.ts which uses GVRM renderer, but the actual deployed frontend (gourmet-sp) uses LAMAvatar.astro with a completely different rendering pipeline.

Previous patch problems:
- Added GVRM import/renderer that doesn't exist in gourmet-sp
- Missing linkTtsPlayer() - LAMAvatar never received ttsPlayer reference -> ttsActive=false, buffer=0, lip sync completely dead
- Added setupAudioAnalysis()/startLipSyncLoop() for FFT - unnecessary with LAMAvatar
- Called clearFrameBuffer() in stopAvatarAnimation() - breaks LAMAvatar fade-out

Fix: Use the exact gourmet-sp version which correctly:
- Links ttsPlayer to LAMAvatar via setExternalTtsPlayer() in init()
- Sends A2E frames via applyExpressionFromTts() -> lamAvatarController.queueExpressionFrames()
- Lets LAMAvatar handle all lip sync rendering internally
- Does NOT call clearFrameBuffer() in stopAvatarAnimation()

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM

…rpolate frames

Changes to applyExpressionFromTts():
1. Mouth blendshape amplification: Scale jawOpen (1.4x), mouthFunnel/Pucker (1.5x), mouthSmile (1.3x), mouthStretch (1.2x) etc. for more visible Japanese vowel distinctions (あ/い/う/え/お)
2. Frame interpolation: 30fps→60fps via linear interpolation between consecutive frames, matching the renderer's ~60fps render loop for smoother animation
3. Diagnostic logging: jawOpen/mouthFunnel/mouthSmile max/avg values logged per expression segment for live quality monitoring
4. LinkTtsPlayer retry: Multiple retry attempts (500ms, 1s, 2s, 4s) with logging to reliably connect ttsPlayer to LAMAvatar even with async initialization

Quality context: A2E streaming model (wav2vec2-base-960h, no transformer) produces subtle Japanese phoneme variations. Frontend amplification makes these visible.

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
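Amplification and 2x interpolation can be sketched as below. The gains are the ones quoted in this commit message. Everything else is an assumption for illustration: the function names, matching blendshape names by prefix (so that mouthSmileLeft and mouthSmileRight share the mouthSmile gain), and clamping results to [0, 1]. The actual change is in the TypeScript frontend; Python is used here only to show the arithmetic.

```python
from typing import Dict, List

Frame = Dict[str, float]

# Gains quoted in the commit message, keyed by blendshape name prefix.
GAINS = {
    "jawOpen": 1.4,
    "mouthFunnel": 1.5,
    "mouthPucker": 1.5,
    "mouthSmile": 1.3,
    "mouthStretch": 1.2,
}


def amplify(frame: Frame) -> Frame:
    """Scale mouth blendshapes and clamp to [0, 1] (clamping is an assumption)."""
    out: Frame = {}
    for name, value in frame.items():
        gain = next((g for prefix, g in GAINS.items() if name.startswith(prefix)), 1.0)
        out[name] = min(1.0, max(0.0, value * gain))
    return out


def interpolate_2x(frames: List[Frame]) -> List[Frame]:
    """Insert the midpoint between consecutive frames (30fps to roughly 60fps)."""
    if len(frames) < 2:
        return list(frames)
    out: List[Frame] = []
    for a, b in zip(frames, frames[1:]):
        out.append(a)
        out.append({k: (a[k] + b.get(k, a[k])) / 2.0 for k in a})
    out.append(frames[-1])
    return out
```

With n input frames this yields 2n - 1 output frames, so a consumer that indexes by time would need to treat the result as a 60fps buffer.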

… objects) The user rewrote audio2exp-service with a2e_engine.py (Flask) which returns frames as plain arrays [[0.1, ...], ...] instead of the old FastAPI format [{"weights": [0.1, ...]}, ...]. Frontend now detects both formats: Array.isArray(f) ? f : f.weights https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
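The dual-format detection is a one-liner in the frontend (Array.isArray(f) ? f : f.weights). The same normalization is shown here in Python as a minimal sketch. The function name `normalize_frames` and the empty-list default for a missing `weights` key are assumptions.

```python
from typing import Any, List, Sequence


def normalize_frames(frames: Sequence[Any]) -> List[List[float]]:
    """Accept [[0.1, ...], ...] or [{"weights": [0.1, ...]}, ...] and return plain lists."""
    normalized: List[List[float]] = []
    for f in frames:
        if isinstance(f, (list, tuple)):
            # Plain-array format returned by the Flask a2e_engine service.
            normalized.append([float(v) for v in f])
        elif isinstance(f, dict):
            # Older object format with a "weights" property.
            normalized.append([float(v) for v in f.get("weights", [])])
        else:
            raise TypeError(f"unsupported frame type: {type(f).__name__}")
    return normalized
```

Normalizing once at the boundary means later steps such as queueing or amplification only have to handle one shape.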

Step 1: Add __testLipSync() diagnostic to concierge-controller.ts patch
- Generates 5 Japanese vowel patterns (あいうえお) with known ARKit values
- Creates silent WAV audio, queues frames to LAMAvatar, plays through ttsPlayer
- Verifies whether renderer supports full 52-dim blendshapes

Step 3: Fix a2e_engine.py to use the proper LAM INFER pipeline
- Restore LAM_Audio2Expression module (engines, models, utils, configs)
- Rewrite _load_a2e_decoder → _try_load_infer_pipeline using INFER.build()
- Use infer_streaming_audio() with context for chunked processing
- Includes full postprocessing: smooth_mouth, frame_blending, savitzky_golay, symmetrize, eye_blinks
- Falls back to Wav2Vec2 energy-based approximation when INFER unavailable
- Add librosa, scipy, addict to requirements.txt
- Add libsndfile to Dockerfile

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM

Three issues fixed during local testing:
1. transformers v5.x requires ignore_mismatched_sizes=True and attn_implementation="eager" for Wav2Vec2Model.from_pretrained()
2. HuggingFace checkpoint is double-wrapped (tar.gz containing pretrained_models/lam_audio2exp_streaming.tar) - auto-extract
3. Bare except in infer.py swallowed tracebacks and crashed on uninitialized output_dict - now logs actual error and recovers

Result: audio2exp-service starts with mode="infer" and produces 52-dim ARKit blendshapes from audio input.

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM

Exclude downloaded model weights (wav2vec2, LAM checkpoint ~1.1GB) from version control. https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM

Flask's app.run() auto-loads .env files, which crashes with UnicodeDecodeError if a non-UTF-8 .env exists in the path. Pass load_dotenv=False since env vars are set externally. https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM

Problem: In concierge mode, TTS audio was completely skipped for text input because speakResponseInChunks passed isTextInput=true as the skipAudio parameter to speakTextGCP. The avatar should always speak responses in concierge mode regardless of input method.

Changes:
- Remove isTextInput guard from speakResponseInChunks (TTS plays for both text and voice input when speaker is enabled)
- Fix all speakTextGCP calls to use skipAudio=false in concierge paths
- Add play() error handling (onerror + catch) to prevent silent hangs when browser autoplay policy blocks playback
- Remove !isTextInput guards from shop intro TTS path

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM

Comprehensive handoff covering:
- Owner's true goals (iPhone SE standalone, no backend GPU, production alpha)
- LAM technical architecture (paper + WebGL SDK + A2E)
- Past session mistakes and warnings for next AI
- Current system state and completed work
- Unresolved architecture decisions
- Priority action items

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM

claude added 30 commits February 20, 2026 03:00

Add standalone A2E Japanese audio test script

081f904

Simple test script that verifies environment, model files, data_bundle.py fix, Wav2Vec2 loading, and A2E module import. https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM

Update Gemini model to gemini-2.5-flash (2.0-flash deprecated)

b50178e

gemini-2.0-flash returns 404 "no longer available to new users". The error dict then cascades into the set_main_data TypeError. https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM

chore: add .gitignore for audio2exp-service model files

a8a68c3

Exclude downloaded model weights (wav2vec2, LAM checkpoint ~1.1GB) from version control. https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM

claude added 2 commits February 22, 2026 07:25

mirai-gpro wants to merge 32 commits into aigc3d:master from mirai-gpro:claude/test-a2e-japanese-audio-j9VBT
