feat(sortformer): add Sortformer streaming diarization #249
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Implements NVIDIA Streaming Sortformer 4-speaker diarization model with CoreML:
- SortformerDiarizer: Main diarization class with streaming and complete file processing
- SortformerModels: CoreML model loading (separate PreEncoder+Head or combined pipeline)
- SortformerModules: State management and streaming update logic
- SortformerConfig: Configuration for NVIDIA (1.04s latency) and low-latency modes
- SortformerTypes: Result types, segment extraction, and median filtering

Key features:
- Native Swift mel spectrogram matching NeMo's AudioToMelSpectrogramPreprocessor
- cpuOnly CoreML inference for numerical consistency with Python
- RTTM ground truth loading for DER evaluation
- Simple frame-level DER calculation with permutation search

CLI commands:
- `sortformer`: Run streaming diarization on audio files
- `sortformer-benchmark`: Evaluate on AMI SDM corpus with --native-preprocessing flag

Current performance: 40% DER on ES2004a (vs 24.5% Python reference). The gap is in state management: the spkcache/fifo update logic needs further work.

(cherry picked from commit 9f98fe9)
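A rough driver sketch for the streaming path described above; `SortformerConfig`, `SortformerDiarizer`, and the member names used here (`loadModels()`, `process(_:)`, `segments`) are taken or inferred from this commit message and should be treated as assumptions rather than the finalized API.

```swift
import Foundation

// Hypothetical streaming driver; API names are assumptions from the
// commit message, not the finalized public interface.
func runStreamingDiarization(_ samples: [Float]) async throws {
    let config = SortformerConfig()                  // default NVIDIA 1.04s-latency mode (assumed)
    let diarizer = SortformerDiarizer(config: config)
    try await diarizer.loadModels()                  // load the CoreML PreEncoder + Head (assumed name)

    // Feed audio in fixed-size chunks, as a microphone callback would.
    let chunkSize = 16_000                           // 1 s of 16 kHz mono audio
    var start = 0
    while start < samples.count {
        let end = min(start + chunkSize, samples.count)
        let result = try await diarizer.process(Array(samples[start..<end]))
        for segment in result.segments {
            print("speaker \(segment.speaker): \(segment.start)s to \(segment.end)s")
        }
        start = end
    }
}
```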
- Add Repo.sortformer pointing to alexwengg/diar-streaming-sortformer-coreml
- Add ModelNames.Sortformer with model file names
- Add SortformerModels.loadFromHuggingFace() for automatic download and caching
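A minimal sketch of the auto-download path added here; the async/throws call shape and the initializer that accepts the downloaded models are assumptions, not confirmed signatures.

```swift
// Assumed call shape for the HuggingFace download path; the actual
// signature and return type may differ.
let models = try await SortformerModels.loadFromHuggingFace()   // downloads from alexwengg/diar-streaming-sortformer-coreml and caches
let diarizer = SortformerDiarizer(models: models)
```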
- Add CALLHOME dataset support to SortformerBenchmark
- Add getCALLHOMEFiles() for loading CALLHOME English audio files
- Update loadRTTMGroundTruth() for CALLHOME RTTM format
- Improve state management in SortformerModules
- Add progress file resume support for benchmarks
- Benchmark results: 20.29% DER on 140 CALLHOME English files (low-latency)
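For reference, the RTTM ground-truth files mentioned above use the standard NIST layout, one segment per line: `SPEAKER <file> <channel> <onset> <duration> <NA> <NA> <speaker> <NA> <NA>`. A minimal parsing sketch follows; the `RTTMSegment` type and `parseRTTM` helper are hypothetical and not necessarily what `loadRTTMGroundTruth()` uses internally.

```swift
import Foundation

// Hypothetical segment type for illustration; the benchmark's own
// ground-truth representation may differ.
struct RTTMSegment {
    let file: String
    let start: Double      // onset in seconds
    let duration: Double   // seconds
    let speaker: String
}

// Parses standard RTTM lines such as:
// SPEAKER ES2004a 1 34.27 3.41 <NA> <NA> MEE013 <NA> <NA>
func parseRTTM(_ contents: String) -> [RTTMSegment] {
    contents.split(whereSeparator: \.isNewline).compactMap { line in
        let fields = line.split(whereSeparator: \.isWhitespace)
        guard fields.count >= 8, fields[0] == "SPEAKER",
              let start = Double(String(fields[3])),
              let duration = Double(String(fields[4])) else { return nil }
        return RTTMSegment(file: String(fields[1]),
                           start: start,
                           duration: duration,
                           speaker: String(fields[7]))
    }
}
```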
Add test_ami_nemo.py and test_callhome_nemo.py for comparing Swift CoreML Sortformer against the reference NeMo Python implementation.

AMI benchmark results (16 files, low-latency config):
- Swift CoreML: 35.0% average DER
- NeMo Python: 35.8% average DER
- Swift wins on 9/16 files, NeMo wins on 7/16 files

Config: chunk_len=6, right_context=1, fifo=40, spkcache=120 (~480ms latency)
Swift CoreML vs NeMo Python comparison on AMI test set (16 files):
- swift_ami_results.json: Swift CoreML results (35.0% avg DER)
- nemo_ami_results.json: NeMo Python results (35.8% avg DER)
- Add gradientDescent config matching SGD2718's Streaming-Sortformer-Conversion (chunk_right_context=7, fifo_len=40, spkcache_len=188, period=31); see the sketch after this list
- Add --gradient-descent flag to sortformer-benchmark command
- Update NeMo benchmark script to use gradient descent config for fair comparison
- Add AMI benchmark results for both Swift (30.8% DER) and NeMo (29.2% DER)

Benchmark comparison (16 files, gradient descent config):
- Swift CoreML: 30.8% avg DER, 8.2x RTFx
- NeMo Python: 29.2% avg DER, 1.2x RTFx
- Swift is ~7x faster with comparable accuracy (+1.5% DER)
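A sketch of how the `gradientDescent` preset above might look on the Swift side; the camelCase property names are assumptions mapped from the snake_case parameters listed in the commit (`chunk_right_context`, `fifo_len`, `spkcache_len`, `period`), not the actual SortformerConfig members.

```swift
// Hypothetical preset shape; the real SortformerConfig properties may
// be named differently.
extension SortformerConfig {
    static var gradientDescent: SortformerConfig {
        var config = SortformerConfig()
        config.chunkRightContext = 7    // frames of future context per chunk
        config.fifoLength = 40          // FIFO buffer length (frames)
        config.spkcacheLength = 188     // speaker cache length (frames)
        config.period = 31              // cache compression period
        return config
    }
}
```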
Add StreamingUpdateResult to return both confirmed and tentative predictions from streamingUpdate(). Tentative predictions are for frames still within the right context window; they may change when the next chunk arrives with more future context.

With rightContext=7 and 80ms frames, this provides a 560ms earlier preview of speaker predictions for real-time UI display.

Changes:
- Add StreamingUpdateResult struct with confirmed/tentative arrays
- Update SortformerChunkResult with tentativeProbabilities field
- streamingUpdate() now extracts right context as tentative
- Streaming process() passes tentative to result
- Batch processComplete() uses confirmed only (no impact on DER)
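A sketch of the confirmed/tentative split described above; the field names come from the commit but the array shapes are assumed, and the 560 ms preview simply follows from rightContext (7 frames) times the 80 ms frame hop.

```swift
// Assumed shape of the streaming result; the real struct may carry
// additional per-chunk metadata.
struct StreamingUpdateResult {
    /// Per-frame speaker probabilities that will not be revised again.
    let confirmed: [[Float]]
    /// Probabilities for frames still inside the right-context window;
    /// they may change once the next chunk supplies more future context.
    let tentative: [[Float]]
}

// rightContext = 7 frames at 80 ms each gives the UI a 0.56 s preview
// ahead of the confirmed output.
let previewSeconds = 7 * 0.080   // 0.56
```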
- Make --gradient-descent automatically use HuggingFace models by default
- Add --local flag to override and use local models instead
- Add --hf flag for explicit HuggingFace download (clears cache)
- Display "HuggingFace models" in mode output when using HF
- Use correct model directory for gradient descent config
- Refactor streamingUpdate to streamUpdate with cleaner implementation
- Port streaming logic from Gradient Descent's MLTensor implementation
- Add bounds validation guard in processMelFeatures
- Fix loop exit condition for frame processing
- Simplify SortformerCommand to use HuggingFace downloads by default
- Remove unused variables and debug code
Claude finished @Alex-Wengg's task: Reviewing PR: feat(sortformer): add Sortformer streaming diarization
Review Complete ✅

I've conducted a thorough review of this substantial PR adding Sortformer streaming diarization. The implementation shows excellent architecture and impressive performance (125x RTFx, 31.1% DER), but there are critical blocking issues that need to be addressed.

Critical Issues (Blockers):
Strengths:
Recommendation: The detailed review has been submitted as a non-blocking comment to the PR.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
The correct error message will now be thrown
Fixed mel spectrogram thread safety Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- SortformerConfig.swift → ModelConfig.swift
- SortformerDiarizer.swift → DiarizerManager.swift
- SortformerModels.swift → DiarizerInference.swift
- SortformerModules.swift → StateUpdater.swift
- SortformerTypes.swift → DataTypes.swift
- SortformerModels → DiarizerInference
- SortformerModules → StateUpdater
- SortformerDiarizer → Pipeline
…lineStruct.swift
- DataTypes.swift: Config, State, Results, Errors
- TimelineStruct.swift: SortformerTimeline, SortformerSegment
Force-pushed from a85a9b8 to d192abe
Runs ES2004a with nvidia-high-latency config on PRs
Force-pushed from d192abe to a624060
…rity

Keep simple file names (Pipeline.swift, DiarizerInference.swift, StateUpdater.swift, Timeline.swift) while maintaining Sortformer-prefixed class names for a clear public API:
- SortformerDiarizer (main entry point)
- SortformerModels (CoreML model loading/inference)
- SortformerModules (state management and cache compression)
- SortformerTimeline (results and segments)

This follows Swift naming conventions where types need descriptive prefixes, since Swift doesn't have module-level namespaces.
- DiarizerInference.swift → Models.swift (contains SortformerModels)
- Pipeline.swift → Diarizer.swift (contains SortformerDiarizer)
Files now match their primary class names:
- SortformerDiarizer.swift (SortformerDiarizer)
- SortformerModels.swift (SortformerModels)
- SortformerModules.swift (SortformerModules)
- SortformerTimeline.swift (SortformerTimeline)
- SortformerTypes.swift (config structs)
More descriptive name that matches the NeMo class it mirrors.
- SortformerDiarizer.swift → SortformerDiarizerPipeline.swift
- SortformerModels → SortformerModelInference (file and class)
@@ -0,0 +1,174 @@
name: Sortformer High-Latency Benchmark
why is it "high latency"? is there a low latency one lol
there are configurations called .nvidiaLowLatency and .default, but the benchmark was taking too long on those.
IS1009c 37.3 12.7 0.9 23.7 4/4 125.1
TS3003d 38.4 31.9 0.2 6.3 4/4 124.1
TS3003a 41.8 36.8 0.6 4.4 4/4 123.7
----------------------------------------------------------------------
this is quite high? what's the baseline
There wasn't a baseline on those specific ones since the other online diarizer was benchmarked with a different subset. It appears that the primary source of error here is missed speech (it does not do well with quiet audio) rather than speaker confusion. However, the other online diarizer's worst-case performance is around 78% in its best configuration due to significant speaker confusion.
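For context on the miss/confusion breakdown discussed here, DER is the sum of the three error durations scored against total reference speech time:

```latex
\mathrm{DER} = \frac{T_{\text{miss}} + T_{\text{false alarm}} + T_{\text{confusion}}}{T_{\text{reference speech}}}
```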
/// Sortformer provides end-to-end streaming diarization with 4 fixed speaker slots,
/// achieving ~11% DER on DI-HARD III in real-time.
the benchmarks.md file contradicts this? says average is closer to 30%?
different benchmark and no fine-tuning
We should update this line or delete it
Add nemo_ami_benchmark.py for comparing the Swift/CoreML implementation against NVIDIA's original NeMo Sortformer model on the AMI SDM dataset.

Includes:
- Batch inference using nvidia/diar_sortformer_4spk-v1 model
- DER computation using pyannote.metrics
- Support for all 16 AMI test meetings
- JSON output for results comparison
- README with configuration settings and usage instructions

Config settings match Swift SortformerConfig.nvidiaHighLatency:
- 30.4s total context (48 + 56 + 56 encoder frames)
- 80ms frame duration
- 4 speaker maximum

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replaced it with a working one
Summary
Adds Sortformer streaming speaker diarization based on NVIDIA's NeMo Sortformer model.
Features
Benchmarks
CLI Commands
- `sortformer` - Run streaming diarization on audio files
- `sortformer-benchmark` - Run benchmarks on AMI/CALLHOME datasets

Test plan
- `sortformer` on sample audio files
- `sortformer-benchmark --single-file ES2004a`