
Conversation

@saksham-jain177

Summary

Adds real-time live audio transcription to MeetMemo using WebSockets, enabling microphone-based speech-to-text with progressive transcript updates, visual audio-level feedback, and integration into the existing Whisper pipeline.

Fixes #141


Changes Made

  • Added a FastAPI WebSocket endpoint for live audio transcription (/ws/transcribe/live)
  • Implemented chunked, low-latency Whisper inference for streaming audio
  • Introduced LiveAudioControls and useLiveTranscription hook for real-time capture, waveform/audio-level visualization, and transcript display
  • Integrated live transcription UI into the existing upload flow
  • Updated Nginx configuration to support WebSocket upgrades and long-lived connections
  • Added a backend verification script for WebSocket audio streaming

Context / Rationale

MeetMemo previously supported only batch transcription via uploaded or recorded audio files. This PR introduces true real-time transcription, allowing users to see speech converted to text as they speak, without intermediate uploads.

The implementation reuses the existing Whisper infrastructure while optimizing for live use:

  • Uses smaller Whisper models by default to reduce latency
  • Serializes inference to avoid GPU/VRAM contention
  • Skips diarization in live mode to preserve responsiveness

This approach balances correctness, performance, and minimal architectural disruption.
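
To make the last two points concrete, here is a minimal sketch of the chunked, serialized inference pattern. It assumes faster-whisper's WhisperModel (the library referenced in the review below) and 16 kHz, 16-bit mono PCM input; the helper names are illustrative and not the code added by this PR.

    import asyncio

    import numpy as np

    # One module-level lock serializes Whisper inference so concurrent live
    # sessions do not contend for the same GPU/VRAM (a single shared model
    # instance per process is assumed here).
    inference_lock = asyncio.Lock()


    def _blocking_transcribe(model, audio, language):
        # faster-whisper returns a lazy segment generator; materialize it inside
        # the worker thread so no decoding happens on the event loop.
        segments, _info = model.transcribe(audio, language=language, beam_size=1)
        return " ".join(segment.text.strip() for segment in segments)


    async def transcribe_live_chunk(model, pcm_bytes, language=None):
        """Transcribe one chunk of 16 kHz, 16-bit mono PCM (illustrative helper)."""
        # Normalize int16 PCM to float32 in [-1, 1], the input format Whisper expects.
        audio = np.frombuffer(pcm_bytes, dtype=np.int16).astype(np.float32) / 32768.0

        async with inference_lock:
            # The Whisper call is blocking, so run it in a worker thread.
            return await asyncio.to_thread(_blocking_transcribe, model, audio, language)

Because diarization is skipped on this path, per-chunk latency stays close to the raw decode time of the small model.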


Related Docs or References


FastAPI Application Checklist (Delete if PR is not relevant)

  • API follows RESTful principles (nouns in routes, proper use of verbs)
  • All endpoints are async and use non-blocking I/O
  • /health endpoint is implemented and returns 200 OK
  • Long-running operations support both job polling (e.g., via /status/{job_id} or /progress/{job_id}) and optional webhooks (if a callback_url is provided); a sketch of the webhook branch follows this checklist.
    • If callback_url is present in the request payload, the service will POST job results to the specified URL upon completion.
    • If callback_url is not provided, the client can retrieve status and results via polling endpoints.
  • Job results are persisted or recoverable if needed
  • API schema (OpenAPI) is exposed and browsable at /docs or /redoc
  • Branch name follows conventions (e.g., feature/*, bugfix/*) — do not use dev directly
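
For the polling/webhook item above, a minimal sketch of the webhook branch, assuming httpx for the outbound request; the payload shape and function name are illustrative.

    import httpx

    async def deliver_result(job_id: str, result: dict, callback_url: str | None) -> None:
        """Post results to callback_url when provided; otherwise polling serves them."""
        if callback_url:
            # Webhook branch: push the finished result to the caller's endpoint.
            async with httpx.AsyncClient(timeout=10.0) as client:
                await client.post(callback_url, json={"job_id": job_id, "result": result})
        # Either way the result stays persisted so /status/{job_id} can return it.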

General Checklist

  • I have tested these changes locally
  • I have updated relevant documentation or added comments where needed
  • I have linked relevant issues and tagged reviewers
  • I have followed coding conventions and naming standards

@gemini-code-assist

Summary of Changes

Hello @saksham-jain177, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances MeetMemo's capabilities by introducing real-time, live audio transcription. Users can now speak into their microphone and see their words converted to text instantly, moving beyond the previous batch-only transcription model. The implementation leverages WebSockets for efficient, continuous audio streaming and integrates seamlessly with the existing Whisper transcription infrastructure, optimizing for low-latency performance with smaller models and serialized inference.

Highlights

  • Backend WebSocket Endpoint: A new FastAPI WebSocket endpoint (/ws/transcribe/live) has been added to handle real-time audio streams for transcription, enabling continuous communication between client and server.
  • Live Transcription Service: A LiveService was introduced on the backend to manage audio buffering, perform chunked, low-latency Whisper inference, and handle dynamic model/language configuration for optimal real-time performance.
  • Frontend UI and Logic: New React components (LiveAudioControls) and a custom hook (useLiveTranscription) were developed to capture microphone audio, visualize audio levels, display progressive transcripts, and manage the WebSocket connection from the client side.
  • Nginx Configuration Update: Nginx was updated to correctly proxy WebSocket traffic to the backend, ensuring long-lived connections and proper header handling for the new real-time transcription feature.
  • Dependency Management and Verification: The websockets library was added to requirements.txt, a new dependency injection for LiveService was created, and a backend verification script was added to test the WebSocket audio streaming.
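
For orientation, a stripped-down example of what such a verification client could look like, using the websockets library against /ws/transcribe/live. The host and port, the config message shape, and the reply format are assumptions rather than the script shipped in this PR.

    import asyncio
    import json

    import websockets

    async def stream_silence(url="ws://localhost:8000/ws/transcribe/live"):
        """Send a few 250 ms chunks of silent 16 kHz, 16-bit mono PCM and print one reply."""
        chunk = bytes(4000 * 2)  # 4000 samples * 2 bytes = 250 ms of silence at 16 kHz
        async with websockets.connect(url) as ws:
            # Optional JSON config frame; the accepted fields depend on the endpoint.
            await ws.send(json.dumps({"model": "small", "language": None}))
            for _ in range(8):
                await ws.send(chunk)          # binary PCM frame
                await asyncio.sleep(0.25)     # roughly real-time pacing
            print(await ws.recv())            # first transcript update (or error)

    if __name__ == "__main__":
        asyncio.run(stream_silence())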



@gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant and well-structured feature for live audio transcription using WebSockets. However, critical security vulnerabilities have been identified, including a lack of authentication on the WebSocket endpoint and a global inference lock that creates a Denial of Service (DoS) risk. There's also potential for information exposure and path traversal/SSRF due to insufficient input validation. Furthermore, a critical performance issue exists in the backend's audio buffer handling, and a deprecated API is used on the frontend. Addressing these security and performance concerns is crucial for a robust and scalable feature.

Comment on lines +16 to +69
inference_lock = asyncio.Lock()


class LiveService:
    """Service for handling live audio transcription via WebSockets."""

    def __init__(self, settings: Settings, transcription_service: TranscriptionService):
        self.settings = settings
        self.transcription_service = transcription_service
        self.audio_buffer = bytearray()
        self.sample_rate = 16000
        self.channels = 1
        self.sample_width = 2  # 16-bit PCM

        # Minimum audio duration for inference (in seconds)
        self.min_chunk_duration = 0.5
        self.min_chunk_bytes = int(
            self.sample_rate
            * self.channels
            * self.sample_width
            * self.min_chunk_duration
        )

        # State for incremental transcription
        self.full_transcript = []
        self.current_model_name = "small"  # Default for live mode
        self.current_language = None

    async def process_audio(self, pcm_bytes: bytes) -> Optional[Dict]:
        """
        Process incoming PCM audio bytes and return transcript updates if available.
        """
        self.audio_buffer.extend(pcm_bytes)

        if len(self.audio_buffer) < self.min_chunk_bytes:
            return None

        # If we have enough audio, trigger inference
        return await self._run_inference()

    async def _run_inference(self) -> Optional[Dict]:
        """
        Run Whisper inference on the current audio buffer.
        """
        if not self.audio_buffer:
            return None

        # Convert bytearray to numpy array (float32 normalized to [-1, 1])
        audio_np = (
            np.frombuffer(self.audio_buffer, dtype=np.int16).astype(np.float32)
            / 32768.0
        )

        async with inference_lock:

security-high

This section of code introduces a critical Denial of Service (DoS) vulnerability. The global asyncio.Lock for Whisper inference, combined with a lack of authentication and rate limiting, allows a single malicious user to monopolize transcription capacity. Additionally, the audio_buffer grows indefinitely, leading to increasing latency and memory usage, which will cause severe performance degradation in long-running live sessions. The buffer should be consumed and cleared on each inference run for scalable processing and to mitigate performance issues.
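
A rough sketch of the consume-and-clear pattern this comment asks for, reusing names from the excerpt above; _transcribe_array is a hypothetical blocking helper around the existing Whisper call:

    async def _run_inference(self) -> Optional[Dict]:
        """Consume the buffered audio on each run so memory and latency stay bounded."""
        if len(self.audio_buffer) < self.min_chunk_bytes:
            return None

        # Take ownership of the pending audio and reset the buffer. A production
        # version might keep a short overlap (~200 ms) so words are not cut at
        # chunk boundaries.
        pcm = bytes(self.audio_buffer)
        self.audio_buffer.clear()

        audio_np = np.frombuffer(pcm, dtype=np.int16).astype(np.float32) / 32768.0

        async with inference_lock:
            # _transcribe_array is a hypothetical blocking helper around the
            # Whisper call; asyncio.to_thread keeps it off the event loop.
            text = await asyncio.to_thread(self._transcribe_array, audio_np)

        self.full_transcript.append(text)
        return {"type": "partial", "text": text}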

Comment on lines +17 to +20
@router.websocket("/transcribe/live")
async def websocket_endpoint(
    websocket: WebSocket, live_service: LiveService = Depends(get_live_service)
):

security-high

The WebSocket endpoint for live transcription lacks any authentication or authorization mechanism. This allows any unauthenticated user to connect and utilize expensive ML inference resources (Whisper), which could lead to unauthorized resource consumption and significant operational costs.
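
One possible shape for the missing check is to validate a credential before accepting the upgrade. In this sketch the query-parameter token and verify_token() are placeholders, not code from this PR:

    @router.websocket("/transcribe/live")
    async def websocket_endpoint(
        websocket: WebSocket, live_service: LiveService = Depends(get_live_service)
    ):
        # Reject unauthenticated clients before accepting the connection.
        token = websocket.query_params.get("token")
        if not token or not verify_token(token):  # verify_token() is hypothetical
            await websocket.close(code=1008)  # 1008 = policy violation
            return

        await websocket.accept()
        ...  # existing streaming loop continues here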

Comment on lines +94 to +97
// Use ScriptProcessor for simplicity in downsampling/buffering
// Buffer size of 4096 at 16kHz is ~250ms
const processor = audioContext.createScriptProcessor(4096, 1, 1);
processorRef.current = processor;

high

The createScriptProcessor API is deprecated and should not be used. It runs on the main UI thread, which can lead to audio glitches and UI stuttering, especially during heavy processing. Its use is strongly discouraged, and it may be removed from browsers in the future.

The modern and recommended replacement is AudioWorklet. It runs audio processing in a separate thread, ensuring the main thread remains responsive and providing a much better user experience. Please migrate this logic to use an AudioWorklet for robust and performant audio processing.

Comment on lines +44 to +46
model = data.get("model")
language = data.get("language")
live_service.set_config(model=model, language=language)

security-medium

The model parameter from the client-provided configuration is used directly to load a Whisper model without validation. Since the underlying WhisperModel library can accept filesystem paths or remote repository IDs, an attacker could potentially use this to probe the local filesystem or trigger unauthorized outbound network requests (SSRF).

                        model = data.get("model")
                        if model not in ["tiny", "base", "small", "medium", "turbo"]:
                            model = "small"
                        language = data.get("language")
                        live_service.set_config(model=model, language=language)

    except Exception as e:
        logger.error(f"WebSocket error: {e}", exc_info=True)
        try:
            await websocket.send_json({"type": "error", "message": str(e)})

security-medium

Sending raw exception messages to the client can leak sensitive internal information such as file paths, server configuration, or dependency details. This information can be used by an attacker to gain a deeper understanding of the system's architecture and facilitate further attacks.

            await websocket.send_json({"type": "error", "message": "An internal error occurred"})

Comment on lines +132 to +148
<style>{`
  .pulse-animation {
    animation: pulse 1.5s ease-in-out infinite;
  }
  @keyframes pulse {
    0% { opacity: 1; transform: scale(1); }
    50% { opacity: 0.7; transform: scale(1.05); }
    100% { opacity: 1; transform: scale(1); }
  }
  .transcript-container::-webkit-scrollbar {
    width: 4px;
  }
  .transcript-container::-webkit-scrollbar-thumb {
    background: #ccc;
    border-radius: 2px;
  }
`}</style>

medium

Embedding a raw <style> tag with CSS inside a React component is generally discouraged. These styles are injected globally into the document's <head>, which can lead to style collisions and makes them difficult to debug as the application grows. This approach also misses out on the benefits of modern CSS tooling like linting, autoprefixing, and code splitting.

For better maintainability and to avoid potential conflicts, consider using a more robust styling solution such as CSS Modules (e.g., creating a LiveAudioControls.module.css file) or a CSS-in-JS library like styled-components or Emotion.

