
Conversation

Alex-Wengg (Contributor) commented Jan 4, 2026

Summary

Adds Sortformer streaming speaker diarization based on NVIDIA's NeMo Sortformer model.

Features

  • SortformerDiarizer: Real-time streaming speaker diarization with 4-speaker support (see the usage sketch after this list)
  • SortformerTimeline: Timeline-based output for tracking speaker segments
  • Tentative predictions: Real-time preview of speaker activity before finalization
  • HuggingFace integration: Automatic model download from FluidInference/sortformer-4spk-v1
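
A minimal usage sketch: the type names (SortformerDiarizer, SortformerModels, SortformerConfig) and loadFromHuggingFace() come from this PR, but the initializer, `process` signature, and result fields are assumptions and may differ from the merged API.

```swift
import Foundation

// Hypothetical wiring; exact signatures are assumptions, not the final API.
func diarize(_ samples: [Float]) async throws {
    // Downloads and caches FluidInference/sortformer-4spk-v1 on first use.
    let models = try await SortformerModels.loadFromHuggingFace()
    let diarizer = SortformerDiarizer(models: models, config: .default)

    // Stream fixed-size chunks of 16 kHz mono audio (chunk size illustrative).
    let chunkSize = 16_000
    for start in stride(from: 0, to: samples.count, by: chunkSize) {
        let chunk = Array(samples[start..<min(start + chunkSize, samples.count)])
        let result = try await diarizer.process(chunk)
        // Each result carries confirmed speaker probabilities plus a tentative
        // preview for frames still inside the right-context window.
        _ = result
    }
}
```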

Benchmarks

  • AMI dataset benchmark support with DER calculation
  • CALLHOME benchmark support
  • NeMo Python comparison scripts for validation
  • Performance: ~125x RTFx, competitive DER on AMI dataset

CLI Commands

  • sortformer - Run streaming diarization on audio files
  • sortformer-benchmark - Run benchmarks on AMI/CALLHOME datasets

Test plan

  • Run sortformer on sample audio files
  • Run sortformer-benchmark --single-file ES2004a
  • Verify HuggingFace model download works

Alex-Wengg and others added 29 commits December 24, 2025 16:01
Implements NVIDIA Streaming Sortformer 4-speaker diarization model with CoreML:

- SortformerDiarizer: Main diarization class with streaming and complete file processing
- SortformerModels: CoreML model loading (separate PreEncoder+Head or combined pipeline)
- SortformerModules: State management and streaming update logic
- SortformerConfig: Configuration for NVIDIA (1.04s latency) and low-latency modes
- SortformerTypes: Result types, segment extraction, and median filtering

Key features:
- Native Swift mel spectrogram matching NeMo's AudioToMelSpectrogramPreprocessor
- cpuOnly CoreML inference for numerical consistency with Python
- RTTM ground truth loading for DER evaluation
- Simple frame-level DER calculation with permutation search

CLI commands:
- `sortformer`: Run streaming diarization on audio files
- `sortformer-benchmark`: Evaluate on AMI SDM corpus with --native-preprocessing flag

Current performance: 40% DER on ES2004a (vs. 24.5% Python reference).
The gap is in state management: the spkcache/fifo update logic needs further work.

(cherry picked from commit 9f98fe9)
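
For reference, the frame-level DER with permutation search mentioned in the commit above can be sketched roughly as below. This is illustrative only; the benchmark code also handles RTTM alignment, thresholds, and related details omitted here.

```swift
// `ref` and `hyp` are [frame][speaker] activity matrices with 4 fixed slots.
func frameLevelDER(ref: [[Bool]], hyp: [[Bool]]) -> Double {
    let numSpeakers = 4
    let frames = min(ref.count, hyp.count)
    let refSpeech = (0..<frames).reduce(0) { $0 + ref[$1].filter { $0 }.count }
    guard refSpeech > 0 else { return 0 }

    var bestError = Double.infinity
    // Try every mapping of hypothesis slots onto reference slots (4! = 24).
    for perm in permutations(of: Array(0..<numSpeakers)) {
        var error = 0
        for t in 0..<frames {
            let nRef = ref[t].filter { $0 }.count
            let nHyp = hyp[t].filter { $0 }.count
            var nCorrect = 0
            for s in 0..<numSpeakers where ref[t][s] && hyp[t][perm[s]] {
                nCorrect += 1
            }
            // Missed speech, false alarms, and confusion collapse into this term.
            error += max(nRef, nHyp) - nCorrect
        }
        bestError = min(bestError, Double(error))
    }
    return bestError / Double(refSpeech)
}

func permutations<T>(of items: [T]) -> [[T]] {
    guard items.count > 1 else { return [items] }
    return items.indices.flatMap { i -> [[T]] in
        var rest = items
        let head = rest.remove(at: i)
        return permutations(of: rest).map { [head] + $0 }
    }
}
```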
- Add Repo.sortformer pointing to alexwengg/diar-streaming-sortformer-coreml
- Add ModelNames.Sortformer with model file names
- Add SortformerModels.loadFromHuggingFace() for automatic download and caching
- Add CALLHOME dataset support to SortformerBenchmark
- Add getCALLHOMEFiles() for loading CALLHOME English audio files
- Update loadRTTMGroundTruth() for CALLHOME RTTM format
- Improve state management in SortformerModules
- Add progress file resume support for benchmarks
- Benchmark results: 20.29% DER on 140 CALLHOME English files (low-latency)
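
The RTTM ground truth referenced above uses the standard `SPEAKER <file> <chan> <onset> <dur> ...` line format. A bare-bones parser might look like the sketch below; the PR's loadRTTMGroundTruth() additionally handles CALLHOME-specific details not shown here.

```swift
struct RTTMSegment {
    let speaker: String
    let start: Double
    let end: Double
}

// Parses "SPEAKER <file> <chan> <onset> <dur> <NA> <NA> <speaker> <NA> [<NA>]" lines.
func parseRTTM(_ text: String) -> [RTTMSegment] {
    text.split(whereSeparator: \.isNewline).compactMap { line in
        let fields = line.split(whereSeparator: \.isWhitespace)
        guard fields.count >= 8, fields[0] == "SPEAKER",
              let onset = Double(fields[3]),
              let duration = Double(fields[4]) else { return nil }
        return RTTMSegment(speaker: String(fields[7]),
                           start: onset,
                           end: onset + duration)
    }
}
```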
Add test_ami_nemo.py and test_callhome_nemo.py for comparing the Swift CoreML
Sortformer against the reference NeMo Python implementation.

AMI benchmark results (16 files, low-latency config):
- Swift CoreML: 35.0% average DER
- NeMo Python: 35.8% average DER
- Swift wins on 9/16 files, NeMo wins on 7/16 files

Config: chunk_len=6, right_context=1, fifo=40, spkcache=120 (~480ms latency)
Swift CoreML vs NeMo Python comparison on AMI test set (16 files):
- swift_ami_results.json: Swift CoreML results (35.0% avg DER)
- nemo_ami_results.json: NeMo Python results (35.8% avg DER)
- Add gradientDescent config matching SGD2718's Streaming-Sortformer-Conversion
  (chunk_right_context=7, fifo_len=40, spkcache_len=188, period=31)
- Add --gradient-descent flag to sortformer-benchmark command
- Update NeMo benchmark script to use gradient descent config for fair comparison
- Add AMI benchmark results for both Swift (30.8% DER) and NeMo (29.2% DER)

Benchmark comparison (16 files, gradient descent config):
- Swift CoreML: 30.8% avg DER, 8.2x RTFx
- NeMo Python:  29.2% avg DER, 1.2x RTFx
- Swift is ~7x faster with comparable accuracy (+1.5% DER)
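
To make the latency implied by those parameters concrete, here is an illustrative mapping assuming the 80 ms encoder frame duration used elsewhere in this PR; the field names are placeholders, not the actual SortformerConfig properties.

```swift
// Placeholder names mirroring the gradient-descent parameters listed above.
struct StreamingParams {
    var chunkRightContext: Int  // future frames withheld before confirming
    var fifoLen: Int            // recent-frame FIFO length
    var spkcacheLen: Int        // compressed speaker-cache length
    var period: Int             // frames between speaker-cache updates
}

let gradientDescent = StreamingParams(
    chunkRightContext: 7, fifoLen: 40, spkcacheLen: 188, period: 31)

// Each encoder frame spans 80 ms, so the right context alone adds
// 7 * 0.08 = 0.56 s of algorithmic latency before frames are confirmed.
let rightContextLatency = Double(gradientDescent.chunkRightContext) * 0.08
```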
Add StreamingUpdateResult to return both confirmed and tentative
predictions from streamingUpdate(). Tentative predictions are for
frames still within the right context window - they may change when
the next chunk arrives with more future context.

With rightContext=7 and 80ms frames, this provides 560ms earlier
preview of speaker predictions for real-time UI display.

Changes:
- Add StreamingUpdateResult struct with confirmed/tentative arrays
- Update SortformerChunkResult with tentativeProbabilities field
- streamingUpdate() now extracts right context as tentative
- Streaming process() passes tentative to result
- Batch processComplete() uses confirmed only (no impact on DER)
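
A rough sketch of the confirmed/tentative split described in this commit; names and shapes are illustrative rather than the PR's exact types.

```swift
// Frame probabilities are [frame][speaker]; the last `rightContext` frames of
// a chunk are reported as tentative because they can still change once the
// next chunk supplies more future context.
struct StreamingUpdateResult {
    var confirmed: [[Float]]   // stable: will not be revised
    var tentative: [[Float]]   // preview: inside the right-context window
}

func splitPredictions(_ probs: [[Float]], rightContext: Int) -> StreamingUpdateResult {
    let cut = max(0, probs.count - rightContext)
    return StreamingUpdateResult(
        confirmed: Array(probs[..<cut]),
        tentative: Array(probs[cut...]))
}
// With rightContext = 7 and 80 ms frames, the tentative slice surfaces
// speaker activity roughly 560 ms before it is confirmed.
```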
- Make --gradient-descent automatically use HuggingFace models by default
- Add --local flag to override and use local models instead
- Add --hf flag for explicit HuggingFace download (clears cache)
- Display "HuggingFace models" in mode output when using HF
- Use correct model directory for gradient descent config
- Refactor streamingUpdate to streamUpdate with cleaner implementation
- Port streaming logic from Gradient Descent's MLTensor implementation
- Add bounds validation guard in processMelFeatures
- Fix loop exit condition for frame processing
- Simplify SortformerCommand to use HuggingFace downloads by default
- Remove unused variables and debug code
claude bot commented Jan 4, 2026

Claude finished @Alex-Wengg's task.


Reviewing PR: feat(sortformer): add Sortformer streaming diarization

Todo List

  • Start pending GitHub review
  • Get PR diff information
  • Analyze code changes for review points
  • Add inline review comments for specific issues
  • Submit comprehensive review as "COMMENT" type

Review Complete ✅

I've conducted a thorough review of this substantial PR adding Sortformer streaming diarization. The implementation shows excellent architecture and impressive performance (125x RTFx, 31.1% DER), but there are critical blocking issues that need to be addressed:

Critical Issues (Blockers):

  • @unchecked Sendable usage in SortformerDiarizer.swift:26 and ANEMemoryOptimizer.swift:7 directly violates repository CLAUDE.md rules
  • These classes need to be converted to actors or implement proper Sendable conformance

Strengths:

  • Well-designed streaming architecture with proper state management
  • Performance optimization through demand-driven audio preprocessing
  • Clear API design and comprehensive configuration options
  • Good documentation and Swift 6 compatibility (except Sendable violations)

Recommendation:
Core implementation is solid, but the thread-safety violations must be resolved before merging. Converting to actors would be the most appropriate solution.

The detailed review has been submitted as a non-blocking comment to the PR.
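
For illustration, a minimal sketch of the actor-based approach the review recommends, using placeholder names rather than the PR's actual classes:

```swift
// Instead of `final class ...: @unchecked Sendable`, actor isolation
// serializes access to the mutable streaming state, so no manual locking
// or unchecked conformance is needed. Names here are hypothetical.
actor StreamingDiarizerState {
    private var speakerCache: [[Float]] = []
    private var fifo: [[Float]] = []

    func append(chunkEmbeddings: [[Float]], fifoLimit: Int) {
        fifo.append(contentsOf: chunkEmbeddings)
        if fifo.count > fifoLimit {
            // Spill the oldest frames into the speaker cache.
            let overflow = fifo.count - fifoLimit
            speakerCache.append(contentsOf: fifo.prefix(overflow))
            fifo.removeFirst(overflow)
        }
    }
}
```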

SGD2718 and others added 11 commits January 5, 2026 16:34
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
The correct error message will now be thrown
Fixed mel spectrogram thread safety

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- SortformerConfig.swift → ModelConfig.swift
- SortformerDiarizer.swift → DiarizerManager.swift
- SortformerModels.swift → DiarizerInference.swift
- SortformerModules.swift → StateUpdater.swift
- SortformerTypes.swift → DataTypes.swift
- SortformerModels → DiarizerInference
- SortformerModules → StateUpdater
- SortformerDiarizer → Pipeline
…lineStruct.swift

- DataTypes.swift: Config, State, Results, Errors
- TimelineStruct.swift: SortformerTimeline, SortformerSegment
Alex-Wengg force-pushed the sortformer-v2 branch 2 times, most recently from a85a9b8 to d192abe on January 5, 2026 21:37
Runs ES2004a with nvidia-high-latency config on PRs
…rity

Keep simple file names (Pipeline.swift, DiarizerInference.swift, StateUpdater.swift, Timeline.swift)
while maintaining Sortformer-prefixed class names for clear public API:

- SortformerDiarizer (main entry point)
- SortformerModels (CoreML model loading/inference)
- SortformerModules (state management and cache compression)
- SortformerTimeline (results and segments)

This follows Swift naming conventions where types need descriptive prefixes
since Swift doesn't have module-level namespaces.
- DiarizerInference.swift → Models.swift (contains SortformerModels)
- Pipeline.swift → Diarizer.swift (contains SortformerDiarizer)
Files now match their primary class names:
- SortformerDiarizer.swift (SortformerDiarizer)
- SortformerModels.swift (SortformerModels)
- SortformerModules.swift (SortformerModules)
- SortformerTimeline.swift (SortformerTimeline)
- SortformerTypes.swift (config structs)
More descriptive name that matches the NeMo class it mirrors.
- SortformerDiarizer.swift → SortformerDiarizerPipeline.swift
- SortformerModels → SortformerModelInference (file and class)
Alex-Wengg requested a review from BrandonWeng January 5, 2026 23:41
@@ -0,0 +1,174 @@
name: Sortformer High-Latency Benchmark
Member

why is it "high latency"? is there a low latency one lol

Collaborator

there are configurations called .nvidiaLowLatency and .default, but the benchmark was taking too long on those.

IS1009c 37.3 12.7 0.9 23.7 4/4 125.1
TS3003d 38.4 31.9 0.2 6.3 4/4 124.1
TS3003a 41.8 36.8 0.6 4.4 4/4 123.7
----------------------------------------------------------------------
Member

this is quite high? what's the baseline

Collaborator

There wasn't a baseline on those specific ones since the other online diarizer was benchmarked with a different subset. It appears that the primary source of error here is missed speech (it does not do well with quiet audio) rather than speaker confusion. However, the other online diarizer's worst-case performance is around 78% in its best configuration due to significant speaker confusion.

Comment on lines +8 to +9
/// Sortformer provides end-to-end streaming diarization with 4 fixed speaker slots,
/// achieving ~11% DER on DI-HARD III in real-time.
Member

the benchmarks.md file contradicts this? says average is closer to 30%?

Collaborator

different benchmark and no fine-tuning

Contributor Author

We should update this line or delete it

Alex-Wengg and others added 2 commits January 7, 2026 16:51
Add nemo_ami_benchmark.py for comparing Swift/CoreML implementation
against NVIDIA's original NeMo Sortformer model on AMI SDM dataset.

Includes:
- Batch inference using nvidia/diar_sortformer_4spk-v1 model
- DER computation using pyannote.metrics
- Support for all 16 AMI test meetings
- JSON output for results comparison
- README with configuration settings and usage instructions

Config settings match Swift SortformerConfig.nvidiaHighLatency:
- 30.4s total context (48 + 56 + 56 encoder frames)
- 80ms frame duration
- 4 speaker maximum

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replaced it with a working one