
Conversation

Alex-Wengg (Contributor) commented Jan 4, 2026

Summary

Adds Sortformer streaming speaker diarization based on NVIDIA's NeMo Sortformer model.

Features

  • SortformerDiarizer: Real-time streaming speaker diarization with 4-speaker support (see the usage sketch after this list)
  • SortformerTimeline: Timeline-based output for tracking speaker segments
  • Tentative predictions: Real-time preview of speaker activity before finalization
  • HuggingFace integration: Automatic model download from FluidInference/sortformer-4spk-v1
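
A minimal usage sketch: the type names (SortformerDiarizer, SortformerModels, SortformerConfig) and loadFromHuggingFace() come from this PR, but the initializer, `process` signature, and result fields are assumptions and may differ from the merged API.

```swift
import Foundation

// Hypothetical wiring; exact signatures are assumptions, not the final API.
func diarize(_ samples: [Float]) async throws {
    // Downloads and caches FluidInference/sortformer-4spk-v1 on first use.
    let models = try await SortformerModels.loadFromHuggingFace()
    let diarizer = SortformerDiarizer(models: models, config: .default)

    // Stream fixed-size chunks of 16 kHz mono audio (chunk size illustrative).
    let chunkSize = 16_000
    for start in stride(from: 0, to: samples.count, by: chunkSize) {
        let chunk = Array(samples[start..<min(start + chunkSize, samples.count)])
        let result = try await diarizer.process(chunk)
        // Each result carries confirmed speaker probabilities plus a tentative
        // preview for frames still inside the right-context window.
        _ = result
    }
}
```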

Benchmarks

  • AMI dataset benchmark support with DER calculation
  • CALLHOME benchmark support
  • NeMo Python comparison scripts for validation
  • Performance: ~125x RTFx, competitive DER on AMI dataset

CLI Commands

  • sortformer - Run streaming diarization on audio files
  • sortformer-benchmark - Run benchmarks on AMI/CALLHOME datasets

Test plan

  • Run sortformer on sample audio files
  • Run sortformer-benchmark --single-file ES2004a
  • Verify HuggingFace model download works

Alex-Wengg and others added 29 commits December 24, 2025 16:01
Implements NVIDIA Streaming Sortformer 4-speaker diarization model with CoreML:

- SortformerDiarizer: Main diarization class with streaming and complete file processing
- SortformerModels: CoreML model loading (separate PreEncoder+Head or combined pipeline)
- SortformerModules: State management and streaming update logic
- SortformerConfig: Configuration for NVIDIA (1.04s latency) and low-latency modes
- SortformerTypes: Result types, segment extraction, and median filtering

Key features:
- Native Swift mel spectrogram matching NeMo's AudioToMelSpectrogramPreprocessor
- cpuOnly CoreML inference for numerical consistency with Python
- RTTM ground truth loading for DER evaluation
- Simple frame-level DER calculation with permutation search

CLI commands:
- `sortformer`: Run streaming diarization on audio files
- `sortformer-benchmark`: Evaluate on AMI SDM corpus with --native-preprocessing flag

Current performance: 40% DER on ES2004a (vs. 24.5% Python reference).
The gap is in state management: the spkcache/fifo update logic needs further work.

(cherry picked from commit 9f98fe9)
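
For reference, the frame-level DER with permutation search mentioned in the commit above can be sketched roughly as below. This is illustrative only; the benchmark code also handles RTTM alignment, thresholds, and related details omitted here.

```swift
// `ref` and `hyp` are [frame][speaker] activity matrices with 4 fixed slots.
func frameLevelDER(ref: [[Bool]], hyp: [[Bool]]) -> Double {
    let numSpeakers = 4
    let frames = min(ref.count, hyp.count)
    let refSpeech = (0..<frames).reduce(0) { $0 + ref[$1].filter { $0 }.count }
    guard refSpeech > 0 else { return 0 }

    var bestError = Double.infinity
    // Try every mapping of hypothesis slots onto reference slots (4! = 24).
    for perm in permutations(of: Array(0..<numSpeakers)) {
        var error = 0
        for t in 0..<frames {
            let nRef = ref[t].filter { $0 }.count
            let nHyp = hyp[t].filter { $0 }.count
            var nCorrect = 0
            for s in 0..<numSpeakers where ref[t][s] && hyp[t][perm[s]] {
                nCorrect += 1
            }
            // Missed speech, false alarms, and confusion collapse into this term.
            error += max(nRef, nHyp) - nCorrect
        }
        bestError = min(bestError, Double(error))
    }
    return bestError / Double(refSpeech)
}

func permutations<T>(of items: [T]) -> [[T]] {
    guard items.count > 1 else { return [items] }
    return items.indices.flatMap { i -> [[T]] in
        var rest = items
        let head = rest.remove(at: i)
        return permutations(of: rest).map { [head] + $0 }
    }
}
```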
- Add Repo.sortformer pointing to alexwengg/diar-streaming-sortformer-coreml
- Add ModelNames.Sortformer with model file names
- Add SortformerModels.loadFromHuggingFace() for automatic download and caching
- Add CALLHOME dataset support to SortformerBenchmark
- Add getCALLHOMEFiles() for loading CALLHOME English audio files
- Update loadRTTMGroundTruth() for CALLHOME RTTM format
- Improve state management in SortformerModules
- Add progress file resume support for benchmarks
- Benchmark results: 20.29% DER on 140 CALLHOME English files (low-latency)
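
The RTTM ground truth referenced above uses the standard `SPEAKER <file> <chan> <onset> <dur> ...` line format. A bare-bones parser might look like the sketch below; the PR's loadRTTMGroundTruth() additionally handles CALLHOME-specific details not shown here.

```swift
struct RTTMSegment {
    let speaker: String
    let start: Double
    let end: Double
}

// Parses "SPEAKER <file> <chan> <onset> <dur> <NA> <NA> <speaker> <NA> [<NA>]" lines.
func parseRTTM(_ text: String) -> [RTTMSegment] {
    text.split(whereSeparator: \.isNewline).compactMap { line in
        let fields = line.split(whereSeparator: \.isWhitespace)
        guard fields.count >= 8, fields[0] == "SPEAKER",
              let onset = Double(fields[3]),
              let duration = Double(fields[4]) else { return nil }
        return RTTMSegment(speaker: String(fields[7]),
                           start: onset,
                           end: onset + duration)
    }
}
```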
Add test_ami_nemo.py and test_callhome_nemo.py for comparing the Swift CoreML
Sortformer against the reference NeMo Python implementation.

AMI benchmark results (16 files, low-latency config):
- Swift CoreML: 35.0% average DER
- NeMo Python: 35.8% average DER
- Swift wins on 9/16 files, NeMo wins on 7/16 files

Config: chunk_len=6, right_context=1, fifo=40, spkcache=120 (~480ms latency)
Swift CoreML vs NeMo Python comparison on AMI test set (16 files):
- swift_ami_results.json: Swift CoreML results (35.0% avg DER)
- nemo_ami_results.json: NeMo Python results (35.8% avg DER)
- Add gradientDescent config matching SGD2718's Streaming-Sortformer-Conversion
  (chunk_right_context=7, fifo_len=40, spkcache_len=188, period=31)
- Add --gradient-descent flag to sortformer-benchmark command
- Update NeMo benchmark script to use gradient descent config for fair comparison
- Add AMI benchmark results for both Swift (30.8% DER) and NeMo (29.2% DER)

Benchmark comparison (16 files, gradient descent config):
- Swift CoreML: 30.8% avg DER, 8.2x RTFx
- NeMo Python:  29.2% avg DER, 1.2x RTFx
- Swift is ~7x faster with comparable accuracy (+1.5% DER)
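
To make the latency implied by those parameters concrete, here is an illustrative mapping assuming the 80 ms encoder frame duration used elsewhere in this PR; the field names are placeholders, not the actual SortformerConfig properties.

```swift
// Placeholder names mirroring the gradient-descent parameters listed above.
struct StreamingParams {
    var chunkRightContext: Int  // future frames withheld before confirming
    var fifoLen: Int            // recent-frame FIFO length
    var spkcacheLen: Int        // compressed speaker-cache length
    var period: Int             // frames between speaker-cache updates
}

let gradientDescent = StreamingParams(
    chunkRightContext: 7, fifoLen: 40, spkcacheLen: 188, period: 31)

// Each encoder frame spans 80 ms, so the right context alone adds
// 7 * 0.08 = 0.56 s of algorithmic latency before frames are confirmed.
let rightContextLatency = Double(gradientDescent.chunkRightContext) * 0.08
```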
Add StreamingUpdateResult to return both confirmed and tentative
predictions from streamingUpdate(). Tentative predictions are for
frames still within the right context window - they may change when
the next chunk arrives with more future context.

With rightContext=7 and 80ms frames, this provides 560ms earlier
preview of speaker predictions for real-time UI display.

Changes:
- Add StreamingUpdateResult struct with confirmed/tentative arrays
- Update SortformerChunkResult with tentativeProbabilities field
- streamingUpdate() now extracts right context as tentative
- Streaming process() passes tentative to result
- Batch processComplete() uses confirmed only (no impact on DER)
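
A rough sketch of the confirmed/tentative split described in this commit; names and shapes are illustrative rather than the PR's exact types.

```swift
// Frame probabilities are [frame][speaker]; the last `rightContext` frames of
// a chunk are reported as tentative because they can still change once the
// next chunk supplies more future context.
struct StreamingUpdateResult {
    var confirmed: [[Float]]   // stable: will not be revised
    var tentative: [[Float]]   // preview: inside the right-context window
}

func splitPredictions(_ probs: [[Float]], rightContext: Int) -> StreamingUpdateResult {
    let cut = max(0, probs.count - rightContext)
    return StreamingUpdateResult(
        confirmed: Array(probs[..<cut]),
        tentative: Array(probs[cut...]))
}
// With rightContext = 7 and 80 ms frames, the tentative slice surfaces
// speaker activity roughly 560 ms before it is confirmed.
```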
- Make --gradient-descent automatically use HuggingFace models by default
- Add --local flag to override and use local models instead
- Add --hf flag for explicit HuggingFace download (clears cache)
- Display "HuggingFace models" in mode output when using HF
- Use correct model directory for gradient descent config
- Refactor streamingUpdate to streamUpdate with cleaner implementation
- Port streaming logic from Gradient Descent's MLTensor implementation
- Add bounds validation guard in processMelFeatures
- Fix loop exit condition for frame processing
- Simplify SortformerCommand to use HuggingFace downloads by default
- Remove unused variables and debug code
claude bot commented Jan 4, 2026

Claude finished @Alex-Wengg's task.


Reviewing PR: feat(sortformer): add Sortformer streaming diarization

Todo List

  • Start pending GitHub review
  • Get PR diff information
  • Analyze code changes for review points
  • Add inline review comments for specific issues
  • Submit comprehensive review as "COMMENT" type

Review Complete ✅

I've conducted a thorough review of this substantial PR adding Sortformer streaming diarization. The implementation shows excellent architecture and impressive performance (125x RTFx, 31.1% DER), but there are critical blocking issues that need to be addressed:

Critical Issues (Blockers):

  • @unchecked Sendable usage in SortformerDiarizer.swift:26 and ANEMemoryOptimizer.swift:7 directly violates repository CLAUDE.md rules
  • These classes need to be converted to actors or implement proper Sendable conformance

Strengths:

  • Well-designed streaming architecture with proper state management
  • Performance optimization through demand-driven audio preprocessing
  • Clear API design and comprehensive configuration options
  • Good documentation and Swift 6 compatibility (except Sendable violations)

Recommendation:
Core implementation is solid, but the thread-safety violations must be resolved before merging. Converting to actors would be the most appropriate solution.

The detailed review has been submitted as a non-blocking comment to the PR.
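
For illustration, a minimal sketch of the actor-based approach the review recommends, using placeholder names rather than the PR's actual classes:

```swift
// Instead of `final class ...: @unchecked Sendable`, actor isolation
// serializes access to the mutable streaming state, so no manual locking
// or unchecked conformance is needed. Names here are hypothetical.
actor StreamingDiarizerState {
    private var speakerCache: [[Float]] = []
    private var fifo: [[Float]] = []

    func append(chunkEmbeddings: [[Float]], fifoLimit: Int) {
        fifo.append(contentsOf: chunkEmbeddings)
        if fifo.count > fifoLimit {
            // Spill the oldest frames into the speaker cache.
            let overflow = fifo.count - fifoLimit
            speakerCache.append(contentsOf: fifo.prefix(overflow))
            fifo.removeFirst(overflow)
        }
    }
}
```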

SGD2718 and others added 11 commits January 5, 2026 16:34
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
The correct error message will now be thrown
Fixed mel spectrogram thread safety

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- SortformerConfig.swift → ModelConfig.swift
- SortformerDiarizer.swift → DiarizerManager.swift
- SortformerModels.swift → DiarizerInference.swift
- SortformerModules.swift → StateUpdater.swift
- SortformerTypes.swift → DataTypes.swift
- SortformerModels → DiarizerInference
- SortformerModules → StateUpdater
- SortformerDiarizer → Pipeline
…lineStruct.swift

- DataTypes.swift: Config, State, Results, Errors
- TimelineStruct.swift: SortformerTimeline, SortformerSegment
Alex-Wengg force-pushed the sortformer-v2 branch 2 times, most recently from a85a9b8 to d192abe on January 5, 2026 21:37
Runs ES2004a with nvidia-high-latency config on PRs
…rity

Keep simple file names (Pipeline.swift, DiarizerInference.swift, StateUpdater.swift, Timeline.swift)
while maintaining Sortformer-prefixed class names for clear public API:

- SortformerDiarizer (main entry point)
- SortformerModels (CoreML model loading/inference)
- SortformerModules (state management and cache compression)
- SortformerTimeline (results and segments)

This follows Swift naming conventions where types need descriptive prefixes
since Swift doesn't have module-level namespaces.
- DiarizerInference.swift → Models.swift (contains SortformerModels)
- Pipeline.swift → Diarizer.swift (contains SortformerDiarizer)
Files now match their primary class names:
- SortformerDiarizer.swift (SortformerDiarizer)
- SortformerModels.swift (SortformerModels)
- SortformerModules.swift (SortformerModules)
- SortformerTimeline.swift (SortformerTimeline)
- SortformerTypes.swift (config structs)
More descriptive name that matches the NeMo class it mirrors.
- SortformerDiarizer.swift → SortformerDiarizerPipeline.swift
- SortformerModels → SortformerModelInference (file and class)
Alex-Wengg requested a review from BrandonWeng January 5, 2026 23:41
@@ -0,0 +1,174 @@
name: Sortformer High-Latency Benchmark
Member

why is it "high latency"? is there a low latency one lol

Collaborator

there are configurations called .nvidiaLowLatency and .default, but the benchmark was taking too long on those.

IS1009c 37.3 12.7 0.9 23.7 4/4 125.1
TS3003d 38.4 31.9 0.2 6.3 4/4 124.1
TS3003a 41.8 36.8 0.6 4.4 4/4 123.7
----------------------------------------------------------------------
Member

this is quite high? what's the baseline

Collaborator

There wasn't a baseline on those specific ones since the other online diarizer was benchmarked with a different subset. It appears that the primary source of error here is missed speech (it does not do well with quiet audio) rather than speaker confusion. However, the other online diarizer's worst-case performance is around 78% in its best configuration due to significant speaker confusion.

Comment on lines +8 to +9
/// Sortformer provides end-to-end streaming diarization with 4 fixed speaker slots,
/// achieving ~11% DER on DI-HARD III in real-time.
Member

the benchmarks.md file contradicts this? says average is closer to 30%?

Collaborator

different benchmark and no fine-tuning

Contributor Author

We should update this line or delete it

Alex-Wengg and others added 2 commits January 7, 2026 16:51
Add nemo_ami_benchmark.py for comparing Swift/CoreML implementation
against NVIDIA's original NeMo Sortformer model on AMI SDM dataset.

Includes:
- Batch inference using nvidia/diar_sortformer_4spk-v1 model
- DER computation using pyannote.metrics
- Support for all 16 AMI test meetings
- JSON output for results comparison
- README with configuration settings and usage instructions

Config settings match Swift SortformerConfig.nvidiaHighLatency:
- 30.4s total context (48 + 56 + 56 encoder frames)
- 80ms frame duration
- 4 speaker maximum

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replaced it with a working one