feature(api): diarized transcription api #4
base: main
Conversation
Pull request overview
This PR adds a diarized transcription API that persists transcriptions with vector embeddings to Postgres/pgvector, with optional AI-powered cleaning, summarization, and sentiment analysis.
- Added Postgres/pgvector integration for storing transcriptions and embeddings
- Implemented AI enrichment pipeline with multi-chunk processing for long transcriptions
- Enhanced REST API to support persistence, diarization, and AI processing parameters
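
The multi-chunk processing mentioned above can be illustrated with a minimal sketch. The PR itself uses langchain-text-splitters; the `chunk_text` helper, chunk size, and overlap below are illustrative, not the PR's code:

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    """Split a long transcription into overlapping chunks.

    Overlap preserves context across chunk boundaries so each chunk
    can be cleaned/summarized independently and the results joined.
    """
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
        start += size - overlap  # step forward, keeping `overlap` chars of context
    return chunks

# A 2500-char transcript with size=1000/overlap=100 yields 3 chunks
parts = chunk_text("x" * 2500)
```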
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 17 comments.
Summary per file:
| File | Description |
|---|---|
| db.py | New database layer implementing Postgres schema management, transcript persistence, and vector embedding storage |
| call_processor.py | New subprocess worker for AI enrichment processing (cleaning, summarization, sentiment) |
| tests/test_db.py | Comprehensive test suite for database operations including schema initialization, validation, and embedding replacement |
| api.py | Enhanced transcription endpoint with persistence, AI processing, channel naming, and improved error handling |
| ai.py | Refactored AI processing with chunking support, multi-stage pipeline (clean → summarize → sentiment), and detailed logging |
| tests/test_api.py | Updated API tests to validate persistence, multichannel diarization, timeout handling, and error cases |
| tests/conftest.py | Updated mock response to include paragraphs transcript structure for diarization support |
| requirements.txt | Added dependencies: langchain-text-splitters, pgvector, psycopg[binary] |
| deepgram_connector.py | Removed deprecated AI summary processing code |
| README.md | Added documentation for Postgres configuration, REST API parameters, and usage examples |
db.py

```python
raise RuntimeError(
    f"Postgres vectorstore is not configured; missing env vars: {', '.join(missing)}"
)

# psycopg conninfo format
return f"host={host} port={port} user={user} password={password} dbname={dbname}"
```
Copilot AI · Dec 23, 2025
Potential security issue: database password is included in the connection string without escaping. If the password contains special characters (spaces, quotes, etc.), the conninfo string could be malformed or allow injection attacks. Consider using psycopg's dictionary-style connection parameters or the make_conninfo() function which properly escapes values, rather than string formatting.
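
In psycopg 3 this is handled by `psycopg.conninfo.make_conninfo()`, which escapes each value. As a self-contained illustration of the quoting libpq expects, here is a hypothetical helper (not from the PR; prefer `make_conninfo()` in real code):

```python
def quote_conninfo_value(value: str) -> str:
    """Quote a libpq conninfo value.

    Per libpq rules: backslash-escape backslashes and single quotes,
    then wrap the value in single quotes so spaces cannot split it
    into extra keyword=value pairs.
    """
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"

def build_conninfo(**params: str) -> str:
    # Quote every value so special characters cannot malform the string
    # or inject additional connection keywords
    return " ".join(f"{key}={quote_conninfo_value(val)}" for key, val in params.items())

conninfo = build_conninfo(host="db", port="5432", user="app",
                          password="p@ss 'word'", dbname="calls")
```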
tests/test_api.py (outdated)
```python
def test_missing_uniqueid(self, client, valid_wav_content):
    """Test that missing uniqueid is rejected when persistence is requested."""
    response = client.post(
        "/api/get_transcription",
        files={"file": ("test.wav", valid_wav_content, "audio/wav")},
        data={"persist": "true"},
    )

    assert response.status_code == 400
    assert "uniqueid" in response.json()["detail"]
```
Copilot AI · Dec 23, 2025
The test doesn't mock db.is_configured() to return False, so the test depends on the actual environment configuration. The test should explicitly control whether the database is considered configured to avoid flaky tests depending on environment state. Consider adding 'patch("api.db.is_configured", return_value=False)' to isolate the test from environment variables.
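
A minimal sketch of the isolation pattern the comment recommends. The `is_configured` below is a stand-in for `api.db.is_configured`, not the PR's code; the point is that `unittest.mock.patch` pins the return value so the test no longer depends on whatever environment variables happen to be set:

```python
import os
from unittest import mock

def is_configured() -> bool:
    # Stand-in: the real function checks Postgres env vars, so its
    # result varies by machine — exactly what makes the test flaky
    return "POSTGRES_HOST" in os.environ

# Patching pins the value for the duration of the test, regardless
# of the caller's environment
with mock.patch(f"{__name__}.is_configured", return_value=False):
    assert is_configured() is False

with mock.patch(f"{__name__}.is_configured", return_value=True):
    assert is_configured() is True
```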
@copilot open a new pull request to apply changes based on this feedback
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…nment (#23)

* Initial plan
* Mock db.is_configured in test_missing_uniqueid to isolate from environment
  Co-authored-by: Stell0 <4547897+Stell0@users.noreply.github.com>
* Add .gitignore and remove build artifacts from repository
  Co-authored-by: Stell0 <4547897+Stell0@users.noreply.github.com>

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Stell0 <4547897+Stell0@users.noreply.github.com>
NethServer/dev#7143