Feature/live transcription websocket #175
base: main
Conversation
Summary of Changes

Hello @saksham-jain177, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request significantly enhances MeetMemo's capabilities by introducing real-time, live audio transcription. Users can now speak into their microphone and see their words converted to text instantly, moving beyond the previous batch-only transcription model. The implementation leverages WebSockets for efficient, continuous audio streaming and integrates seamlessly with the existing Whisper transcription infrastructure, optimizing for low-latency performance with smaller models and serialized inference.
Code Review
This pull request introduces a significant and well-structured feature for live audio transcription using WebSockets. However, critical security vulnerabilities have been identified, including a lack of authentication on the WebSocket endpoint and a global inference lock that creates a Denial of Service (DoS) risk. There's also potential for information exposure and path traversal/SSRF due to insufficient input validation. Furthermore, a critical performance issue exists in the backend's audio buffer handling, and a deprecated API is used on the frontend. Addressing these security and performance concerns is crucial for a robust and scalable feature.
inference_lock = asyncio.Lock()


class LiveService:
    """Service for handling live audio transcription via WebSockets."""

    def __init__(self, settings: Settings, transcription_service: TranscriptionService):
        self.settings = settings
        self.transcription_service = transcription_service
        self.audio_buffer = bytearray()
        self.sample_rate = 16000
        self.channels = 1
        self.sample_width = 2  # 16-bit PCM

        # Minimum audio duration for inference (in seconds)
        self.min_chunk_duration = 0.5
        self.min_chunk_bytes = int(
            self.sample_rate
            * self.channels
            * self.sample_width
            * self.min_chunk_duration
        )

        # State for incremental transcription
        self.full_transcript = []
        self.current_model_name = "small"  # Default for live mode
        self.current_language = None

    async def process_audio(self, pcm_bytes: bytes) -> Optional[Dict]:
        """
        Process incoming PCM audio bytes and return transcript updates if available.
        """
        self.audio_buffer.extend(pcm_bytes)

        if len(self.audio_buffer) < self.min_chunk_bytes:
            return None

        # If we have enough audio, trigger inference
        return await self._run_inference()

    async def _run_inference(self) -> Optional[Dict]:
        """
        Run Whisper inference on the current audio buffer.
        """
        if not self.audio_buffer:
            return None

        # Convert bytearray to numpy array (float32 normalized to [-1, 1])
        audio_np = (
            np.frombuffer(self.audio_buffer, dtype=np.int16).astype(np.float32)
            / 32768.0
        )

        async with inference_lock:
This section of code introduces a critical Denial of Service (DoS) vulnerability. The global asyncio.Lock for Whisper inference, combined with a lack of authentication and rate limiting, allows a single malicious user to monopolize transcription capacity. Additionally, the audio_buffer grows indefinitely, leading to increasing latency and memory usage, which will cause severe performance degradation in long-running live sessions. The buffer should be consumed and cleared on each inference run for scalable processing and to mitigate performance issues.
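A minimal sketch of that suggested fix, assuming _run_inference is the buffer's only consumer and leaving aside the overlap/context handling a production version would need: snapshot the accumulated PCM and clear the shared buffer before the slow inference call, so it cannot grow without bound across a long session.

    async def _run_inference(self) -> Optional[Dict]:
        if not self.audio_buffer:
            return None

        # Consume the buffer: copy the pending PCM and reset it so each pass
        # transcribes only new audio instead of the whole session so far.
        chunk = bytes(self.audio_buffer)
        self.audio_buffer.clear()

        # Same int16 -> float32 normalization as above, applied to the snapshot.
        audio_np = np.frombuffer(chunk, dtype=np.int16).astype(np.float32) / 32768.0

        async with inference_lock:
            ...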
@router.websocket("/transcribe/live")
async def websocket_endpoint(
    websocket: WebSocket, live_service: LiveService = Depends(get_live_service)
):
As noted in the review summary, this WebSocket endpoint accepts connections without any authentication or rate limiting, so any client can open a live session and tie up the globally serialized transcription pipeline.
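One hedged way to gate the endpoint, assuming a shared-secret token passed as a query parameter; EXPECTED_TOKEN and its environment-variable source are illustrative, since MeetMemo's actual auth scheme isn't shown in this diff:

import os

EXPECTED_TOKEN = os.environ.get("LIVE_WS_TOKEN")  # hypothetical shared secret

@router.websocket("/transcribe/live")
async def websocket_endpoint(
    websocket: WebSocket, live_service: LiveService = Depends(get_live_service)
):
    # Reject unauthenticated clients before accepting the socket.
    if websocket.query_params.get("token") != EXPECTED_TOKEN:
        await websocket.close(code=1008)  # 1008 = policy violation
        return
    await websocket.accept()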
// Use ScriptProcessor for simplicity in downsampling/buffering
// Buffer size of 4096 at 16kHz is ~250ms
const processor = audioContext.createScriptProcessor(4096, 1, 1);
processorRef.current = processor;
The createScriptProcessor API is deprecated and should not be used. It runs on the main UI thread, which can lead to audio glitches and UI stuttering, especially during heavy processing. Its use is strongly discouraged, and it may be removed from browsers in the future.
The modern and recommended replacement is AudioWorklet. It runs audio processing in a separate thread, ensuring the main thread remains responsive and providing a much better user experience. Please migrate this logic to use an AudioWorklet for robust and performant audio processing.
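A rough sketch of that migration; the module path, processor name, and onAudioChunk callback are illustrative, and the downsampling the current code performs is elided:

// pcm-processor.js (hypothetical path), runs on the audio rendering thread
class PCMProcessor extends AudioWorkletProcessor {
  process(inputs) {
    const channel = inputs[0][0];
    if (channel) {
      // Ship a copy of the Float32 samples to the main thread; the
      // int16 conversion and WebSocket send stay where they are today.
      this.port.postMessage(channel.slice());
    }
    return true; // keep the processor alive
  }
}
registerProcessor("pcm-processor", PCMProcessor);

// In the component, replacing createScriptProcessor:
await audioContext.audioWorklet.addModule("/pcm-processor.js");
const workletNode = new AudioWorkletNode(audioContext, "pcm-processor");
workletNode.port.onmessage = (e) => onAudioChunk(e.data); // hypothetical handler
source.connect(workletNode);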
model = data.get("model")
language = data.get("language")
live_service.set_config(model=model, language=language)
The model parameter from the client-provided configuration is used directly to load a Whisper model without validation. Since the underlying WhisperModel library can accept filesystem paths or remote repository IDs, an attacker could potentially use this to probe the local filesystem or trigger unauthorized outbound network requests (SSRF).
model = data.get("model")
if model not in ["tiny", "base", "small", "medium", "turbo"]:
    model = "small"
language = data.get("language")
live_service.set_config(model=model, language=language)

except Exception as e:
    logger.error(f"WebSocket error: {e}", exc_info=True)
    try:
        await websocket.send_json({"type": "error", "message": str(e)})
Sending raw exception messages to the client can leak sensitive internal information such as file paths, server configuration, or dependency details. This information can be used by an attacker to gain a deeper understanding of the system's architecture and facilitate further attacks.
await websocket.send_json({"type": "error", "message": "An internal error occurred"})

<style>{`
  .pulse-animation {
    animation: pulse 1.5s ease-in-out infinite;
  }
  @keyframes pulse {
    0% { opacity: 1; transform: scale(1); }
    50% { opacity: 0.7; transform: scale(1.05); }
    100% { opacity: 1; transform: scale(1); }
  }
  .transcript-container::-webkit-scrollbar {
    width: 4px;
  }
  .transcript-container::-webkit-scrollbar-thumb {
    background: #ccc;
    border-radius: 2px;
  }
`}</style>
Embedding a raw <style> tag with CSS inside a React component is generally discouraged. These styles are injected globally into the document's <head>, which can lead to style collisions and makes them difficult to debug as the application grows. This approach also misses out on the benefits of modern CSS tooling like linting, autoprefixing, and code splitting.
For better maintainability and to avoid potential conflicts, consider using a more robust styling solution such as CSS Modules (e.g., creating a LiveAudioControls.module.css file) or a CSS-in-JS library like styled-components or Emotion.
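For instance, a minimal sketch of the CSS Modules route, assuming the LiveAudioControls.module.css file name suggested above; class names are locally scoped at build time:

/* LiveAudioControls.module.css */
.pulseAnimation {
  animation: pulse 1.5s ease-in-out infinite;
}

// In the component, reference the class via the imported styles object:
import styles from "./LiveAudioControls.module.css";

<div className={styles.pulseAnimation}>…</div>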
Summary
Adds real-time live audio transcription to MeetMemo using WebSockets, enabling microphone-based speech-to-text with progressive updates, visual audio feedback, and seamless integration into the existing Whisper pipeline.
Fixes #141
Changes Made
Context / Rationale
MeetMemo previously supported only batch transcription via uploaded or recorded audio files. This PR introduces true real-time transcription, allowing users to see speech converted to text as they speak, without intermediate uploads.
The implementation reuses the existing Whisper infrastructure while optimizing for live use:
- defaulting to the smaller "small" Whisper model for lower-latency live inference
- serializing inference behind a global asyncio.Lock so concurrent sessions share the model safely
- buffering incoming 16 kHz mono 16-bit PCM and running inference only once at least 0.5 seconds of audio has accumulated
This approach balances correctness, performance, and minimal architectural disruption.
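As a usage sketch, a client could exercise the new endpoint like this; the initial JSON config message and JSON-encoded replies are assumptions inferred from the handler snippets above, and the host/port are placeholders:

import asyncio
import json
import websockets  # third-party client library, chosen for this sketch

async def stream_pcm(path: str = "sample.pcm") -> None:
    uri = "ws://localhost:8000/transcribe/live"
    async with websockets.connect(uri) as ws:
        # Keys mirror the handler's data.get("model") / data.get("language").
        await ws.send(json.dumps({"model": "small", "language": "en"}))
        with open(path, "rb") as f:
            # 16,000 bytes ≈ 0.5 s of 16 kHz mono 16-bit PCM, the server's minimum chunk.
            while chunk := f.read(16000):
                await ws.send(chunk)
        # Print transcript updates pushed by the server until it closes.
        async for message in ws:
            print(json.loads(message))

asyncio.run(stream_pcm())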
Related Docs or References
FastAPI Application Checklist (Delete if PR is not relevant)
- /health endpoint is implemented and returns 200 OK
- API docs are accessible via /docs or /redoc
- Branch follows the naming convention (feature/*, bugfix/*) — do not use dev directly

General Checklist