Feature/live transcription websocket #175
base: main
Conversation
Summary of Changes

Hello @saksham-jain177, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request significantly enhances MeetMemo's capabilities by introducing real-time, live audio transcription. Users can now speak into their microphone and see their words converted to text instantly, moving beyond the previous batch-only transcription model. The implementation leverages WebSockets for efficient, continuous audio streaming and integrates seamlessly with the existing Whisper transcription infrastructure, optimizing for low-latency performance with smaller models and serialized inference.
Code Review
This pull request introduces a significant and well-structured feature for live audio transcription using WebSockets. However, critical security vulnerabilities have been identified, including a lack of authentication on the WebSocket endpoint and a global inference lock that creates a Denial of Service (DoS) risk. There's also potential for information exposure and path traversal/SSRF due to insufficient input validation. Furthermore, a critical performance issue exists in the backend's audio buffer handling, and a deprecated API is used on the frontend. Addressing these security and performance concerns is crucial for a robust and scalable feature.
inference_lock = asyncio.Lock()


class LiveService:
    """Service for handling live audio transcription via WebSockets."""

    def __init__(self, settings: Settings, transcription_service: TranscriptionService):
        self.settings = settings
        self.transcription_service = transcription_service
        self.audio_buffer = bytearray()
        self.sample_rate = 16000
        self.channels = 1
        self.sample_width = 2  # 16-bit PCM

        # Minimum audio duration for inference (in seconds)
        self.min_chunk_duration = 0.5
        self.min_chunk_bytes = int(
            self.sample_rate
            * self.channels
            * self.sample_width
            * self.min_chunk_duration
        )

        # State for incremental transcription
        self.full_transcript = []
        self.current_model_name = "small"  # Default for live mode
        self.current_language = None

    async def process_audio(self, pcm_bytes: bytes) -> Optional[Dict]:
        """
        Process incoming PCM audio bytes and return transcript updates if available.
        """
        self.audio_buffer.extend(pcm_bytes)

        if len(self.audio_buffer) < self.min_chunk_bytes:
            return None

        # If we have enough audio, trigger inference
        return await self._run_inference()

    async def _run_inference(self) -> Optional[Dict]:
        """
        Run Whisper inference on the current audio buffer.
        """
        if not self.audio_buffer:
            return None

        # Convert bytearray to numpy array (float32 normalized to [-1, 1])
        audio_np = (
            np.frombuffer(self.audio_buffer, dtype=np.int16).astype(np.float32)
            / 32768.0
        )

        async with inference_lock:
This section of code introduces a critical Denial of Service (DoS) vulnerability. The global asyncio.Lock for Whisper inference, combined with a lack of authentication and rate limiting, allows a single malicious user to monopolize transcription capacity. Additionally, the audio_buffer grows indefinitely, leading to increasing latency and memory usage, which will cause severe performance degradation in long-running live sessions. The buffer should be consumed and cleared on each inference run for scalable processing and to mitigate performance issues.
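A minimal sketch of that suggested fix, assuming _run_inference is the buffer's only consumer and leaving aside the overlap/context handling a production version would need: snapshot the accumulated PCM and clear the shared buffer before the slow inference call, so it cannot grow without bound across a long session.

    async def _run_inference(self) -> Optional[Dict]:
        if not self.audio_buffer:
            return None

        # Consume the buffer: copy the pending PCM and reset it so each pass
        # transcribes only new audio instead of the whole session so far.
        chunk = bytes(self.audio_buffer)
        self.audio_buffer.clear()

        # Same int16 -> float32 normalization as above, applied to the snapshot.
        audio_np = np.frombuffer(chunk, dtype=np.int16).astype(np.float32) / 32768.0

        async with inference_lock:
            ...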
@router.websocket("/transcribe/live")
async def websocket_endpoint(
    websocket: WebSocket, live_service: LiveService = Depends(get_live_service)
):
As noted in the review summary, this WebSocket endpoint accepts connections without any authentication or rate limiting, so any client can open a live session and tie up the globally serialized transcription pipeline.
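One hedged way to gate the endpoint, assuming a shared-secret token passed as a query parameter; EXPECTED_TOKEN and its environment-variable source are illustrative, since MeetMemo's actual auth scheme isn't shown in this diff:

import os

EXPECTED_TOKEN = os.environ.get("LIVE_WS_TOKEN")  # hypothetical shared secret

@router.websocket("/transcribe/live")
async def websocket_endpoint(
    websocket: WebSocket, live_service: LiveService = Depends(get_live_service)
):
    # Reject unauthenticated clients before accepting the socket.
    if websocket.query_params.get("token") != EXPECTED_TOKEN:
        await websocket.close(code=1008)  # 1008 = policy violation
        return
    await websocket.accept()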
// Use ScriptProcessor for simplicity in downsampling/buffering
// Buffer size of 4096 at 16kHz is ~250ms
const processor = audioContext.createScriptProcessor(4096, 1, 1);
processorRef.current = processor;
The createScriptProcessor API is deprecated and should not be used. It runs on the main UI thread, which can lead to audio glitches and UI stuttering, especially during heavy processing. Its use is strongly discouraged, and it may be removed from browsers in the future.
The modern and recommended replacement is AudioWorklet. It runs audio processing in a separate thread, ensuring the main thread remains responsive and providing a much better user experience. Please migrate this logic to use an AudioWorklet for robust and performant audio processing.
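A rough sketch of that migration; the module path, processor name, and onAudioChunk callback are illustrative, and the downsampling the current code performs is elided:

// pcm-processor.js (hypothetical path), runs on the audio rendering thread
class PCMProcessor extends AudioWorkletProcessor {
  process(inputs) {
    const channel = inputs[0][0];
    if (channel) {
      // Ship a copy of the Float32 samples to the main thread; the
      // int16 conversion and WebSocket send stay where they are today.
      this.port.postMessage(channel.slice());
    }
    return true; // keep the processor alive
  }
}
registerProcessor("pcm-processor", PCMProcessor);

// In the component, replacing createScriptProcessor:
await audioContext.audioWorklet.addModule("/pcm-processor.js");
const workletNode = new AudioWorkletNode(audioContext, "pcm-processor");
workletNode.port.onmessage = (e) => onAudioChunk(e.data); // hypothetical handler
source.connect(workletNode);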
model = data.get("model")
language = data.get("language")
live_service.set_config(model=model, language=language)
The model parameter from the client-provided configuration is used directly to load a Whisper model without validation. Since the underlying WhisperModel library can accept filesystem paths or remote repository IDs, an attacker could potentially use this to probe the local filesystem or trigger unauthorized outbound network requests (SSRF).
model = data.get("model")
if model not in ["tiny", "base", "small", "medium", "turbo"]:
    model = "small"
language = data.get("language")
live_service.set_config(model=model, language=language)

except Exception as e:
    logger.error(f"WebSocket error: {e}", exc_info=True)
    try:
        await websocket.send_json({"type": "error", "message": str(e)})
Sending raw exception messages to the client can leak sensitive internal information such as file paths, server configuration, or dependency details. This information can be used by an attacker to gain a deeper understanding of the system's architecture and facilitate further attacks.
await websocket.send_json({"type": "error", "message": "An internal error occurred"})

<style>{`
  .pulse-animation {
    animation: pulse 1.5s ease-in-out infinite;
  }
  @keyframes pulse {
    0% { opacity: 1; transform: scale(1); }
    50% { opacity: 0.7; transform: scale(1.05); }
    100% { opacity: 1; transform: scale(1); }
  }
  .transcript-container::-webkit-scrollbar {
    width: 4px;
  }
  .transcript-container::-webkit-scrollbar-thumb {
    background: #ccc;
    border-radius: 2px;
  }
`}</style>
Embedding a raw <style> tag with CSS inside a React component is generally discouraged. These styles are injected globally into the document's <head>, which can lead to style collisions and makes them difficult to debug as the application grows. This approach also misses out on the benefits of modern CSS tooling like linting, autoprefixing, and code splitting.
For better maintainability and to avoid potential conflicts, consider using a more robust styling solution such as CSS Modules (e.g., creating a LiveAudioControls.module.css file) or a CSS-in-JS library like styled-components or Emotion.
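For instance, a minimal sketch of the CSS Modules route, assuming the LiveAudioControls.module.css file name suggested above; class names are locally scoped at build time:

/* LiveAudioControls.module.css */
.pulseAnimation {
  animation: pulse 1.5s ease-in-out infinite;
}

// In the component, reference the class via the imported styles object:
import styles from "./LiveAudioControls.module.css";

<div className={styles.pulseAnimation}>…</div>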
Summary
Adds real-time live audio transcription to MeetMemo using WebSockets, enabling microphone-based speech-to-text with progressive updates, visual audio feedback, and seamless integration into the existing Whisper pipeline.
Fixes #141
Changes Made
Context / Rationale
MeetMemo previously supported only batch transcription via uploaded or recorded audio files. This PR introduces true real-time transcription, allowing users to see speech converted to text as they speak, without intermediate uploads.
The implementation reuses the existing Whisper infrastructure while optimizing for live use:
- defaulting to the smaller "small" Whisper model for lower-latency live inference
- serializing inference behind a global asyncio.Lock so concurrent sessions share the model safely
- buffering incoming 16 kHz mono 16-bit PCM and running inference only once at least 0.5 seconds of audio has accumulated
This approach balances correctness, performance, and minimal architectural disruption.
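As a usage sketch, a client could exercise the new endpoint like this; the initial JSON config message and JSON-encoded replies are assumptions inferred from the handler snippets above, and the host/port are placeholders:

import asyncio
import json
import websockets  # third-party client library, chosen for this sketch

async def stream_pcm(path: str = "sample.pcm") -> None:
    uri = "ws://localhost:8000/transcribe/live"
    async with websockets.connect(uri) as ws:
        # Keys mirror the handler's data.get("model") / data.get("language").
        await ws.send(json.dumps({"model": "small", "language": "en"}))
        with open(path, "rb") as f:
            # 16,000 bytes ≈ 0.5 s of 16 kHz mono 16-bit PCM, the server's minimum chunk.
            while chunk := f.read(16000):
                await ws.send(chunk)
        # Print transcript updates pushed by the server until it closes.
        async for message in ws:
            print(json.loads(message))

asyncio.run(stream_pcm())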
Related Docs or References
FastAPI Application Checklist (Delete if PR is not relevant)
- /health endpoint is implemented and returns 200 OK
- API docs are accessible via /docs or /redoc
- Branch follows the naming convention (feature/*, bugfix/*) — do not use dev directly

General Checklist