
feat(subtitles): refactor ai segmentation core protocol#963

Draft
taiiiyang wants to merge 9 commits into main from feat/subtitles-ai-segmentation-core

Conversation

@taiiiyang (Collaborator) commented Feb 14, 2026

Type of Changes

  • ✨ New feature (feat)
  • πŸ› Bug fix (fix)
  • πŸ“ Documentation change (docs)
  • πŸ’„ UI/style change (style)
  • ♻️ Code refactoring (refactor)
  • ⚑ Performance improvement (perf)
  • βœ… Test related (test)
  • πŸ”§ Build or dependencies update (build)
  • πŸ”„ CI/CD related (ci)
  • 🌐 Internationalization (i18n)
  • 🧠 AI model related (ai)
  • πŸ”„ Revert a previous commit (revert)
  • πŸ“¦ Other changes that do not modify src or test files (chore)

Description

This PR refactors the AI subtitle segmentation core into a stricter, testable line-protocol pipeline.

What changed

  • Replace the loose VTT parsing path with an explicit from-to | text line protocol.
  • Add protocol utilities for:
    • input cleaning,
    • protocol parsing,
    • contiguous coverage validation.
  • Add post-processing/refinement pipeline:
    • multilingual-friendly merge/split heuristics,
    • cue timeline guard normalization,
    • final fragment rebuild from indexed units.
  • Update background segmentation flow:
    • line-protocol response cleaning,
    • cache key schema versioning for protocol migration safety.
  • Upgrade segmentation system prompt to align model output with protocol and readability constraints.
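For reference, the line protocol and its parsing can be sketched roughly as follows (a minimal illustration; the type and function names here are hypothetical, and the actual parser in protocol.ts may differ):

```typescript
interface SegmentationUnit {
  from: number // index of the first source fragment (inclusive)
  to: number // index of the last source fragment (inclusive)
  text: string // AI-produced text for the merged range
}

// Illustrative parser for lines like "0-2 | Hello world.":
// lines that do not match the protocol are silently skipped.
function parseProtocolLines(response: string): SegmentationUnit[] {
  const result: SegmentationUnit[] = []
  for (const rawLine of response.split('\n')) {
    const line = rawLine.trim()
    const pipeIndex = line.indexOf('|')
    if (pipeIndex === -1)
      continue
    const match = /^(\d+)-(\d+)$/.exec(line.slice(0, pipeIndex).trim())
    if (!match)
      continue
    const text = line.slice(pipeIndex + 1).trim()
    if (text)
      result.push({ from: Number(match[1]), to: Number(match[2]), text })
  }
  return result
}
```

Because the model references fragment indices instead of timestamps, final cue timings can be rebuilt from the source fragments rather than trusted from model output.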

Scope note

  • This is the bottom PR in a stacked subtitle refactor series.
  • It intentionally focuses on AI segmentation core behavior only.

Related Issue

N/A

How Has This Been Tested?

  • Added unit tests
  • Verified through manual testing

Test details:

  • Added src/utils/subtitles/__tests__/ai-segmentation.test.ts covering protocol parsing, coverage validation, timeline rebuild, and merge/split behavior.
  • Branch passes repository pre-push checks under project hooks.

Screenshots

N/A

Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly if necessary
  • My code follows the code style of this project
  • My changes do not break existing functionality
  • If my code was generated by AI, I have proofread and improved it as necessary.

Additional Information

Added changeset: .changeset/feat-subtitles-ai-segmentation-core-protocol.md (@read-frog/extension: patch).


changeset-bot bot commented Feb 14, 2026

🦋 Changeset detected

Latest commit: a82afc5

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package:

| Name | Type |
| --- | --- |
| @read-frog/extension | Patch |


dosubot bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label Feb 14, 2026
github-actions bot added the feat label Feb 14, 2026

dosubot bot commented Feb 14, 2026

Related Documentation

Checked 14 published document(s) in 1 knowledge base(s). No updates required.


taiiiyang marked this pull request as draft February 14, 2026 07:22
taiiiyang marked this pull request as ready for review February 14, 2026 07:23
cubic-dev-ai bot left a comment


No issues found across 7 files


claude bot commented Feb 14, 2026

PR Review: feat(subtitles): refactor ai segmentation core protocol

Overall: Well-structured refactor. The VTT-to-line-protocol migration is a solid architectural improvement; index-based referencing eliminates the fragile timestamp-extraction problem that the old prompt had to painstakingly explain to the LLM. The decomposition into protocol.ts, refine.ts, guards.ts is clean and follows single-responsibility well.


Code Quality & Architecture

Good:

  • Clear separation: parsing → validation → refinement → rebuild → guards. Each stage is independently testable.
  • validateSegmentationUnits called twice (after parse and after refine) is a smart defensive pattern: refinement bugs can't silently corrupt coverage.
  • Immutable patterns throughout (shallow copies before mutation). Consistent.
  • Cache schema versioning (seg-line-v1) is the right way to handle format migration: existing caches won't serve stale VTT results.
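The cache versioning called out above can be sketched like this (the key layout is an assumption for illustration; only the seg-line-v1 version string comes from the PR):

```typescript
// Hypothetical sketch: prefixing the cache key with a schema version
// guarantees entries written by the old VTT pipeline are never read
// back by the new line-protocol pipeline.
const AI_SEGMENTATION_CACHE_SCHEMA_VERSION = 'seg-line-v1'

function buildSegmentationCacheKey(videoId: string, blockHash: string): string {
  return `${AI_SEGMENTATION_CACHE_SCHEMA_VERSION}:${videoId}:${blockHash}`
}
```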

Issues:

  1. Duplicated normalizeSpaces and hasStrongBoundary across refine.ts and guards.ts: identical implementations in both files. Extract to a shared util (e.g. ai-segmentation/utils.ts) to satisfy DRY. These aren't trivial one-liners that justify duplication.

  2. isTinyCueText in guards.ts vs measureTextLengthUnits in refine.ts: these measure text length with nearly identical logic (word count for spaced text, char count otherwise) but use different thresholds and slightly different detection heuristics (the /\s/ test vs the shouldUseWordLikeMeasure check that also verifies Latin and excludes CJK). The guards version doesn't distinguish CJK from Latin, which means a 6-character CJK string (potentially 2-3 words of meaning) and a 6-character Latin string (like "hello!") are treated identically. Consider whether isTinyCueText should reuse measureTextLengthUnits for consistency.

  3. splitLongUnitsBySignals only splits once: it finds the single best split point per unit. A unit with 40+ text-length-units and two strong boundaries would only be split into two parts. This may be intentional to keep things simple, but the prompt tells the AI to target 6-14 words / 10-18 CJK chars. If the AI produces a single massive unit, a single binary split may still leave both halves over the soft limit. Worth documenting that this is a deliberate single-pass design, or consider making it recursive.
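The length measurement referenced in point 2 behaves roughly like this (a sketch of the described heuristic; the shipped thresholds, CJK ranges, and detection details differ):

```typescript
// Sketch of the word-vs-character heuristic described above:
// space-separated Latin-like text is measured in words, CJK text
// (and unspaced text generally) in characters.
function measureLengthUnits(text: string): number {
  const normalized = text.trim().replace(/\s+/g, ' ')
  if (normalized === '')
    return 0
  const hasCjk = /[\u4E00-\u9FFF\u3040-\u30FF\uAC00-\uD7AF]/.test(normalized)
  if (!hasCjk && /\s/.test(normalized))
    return normalized.split(' ').length // word count for spaced text
  return Array.from(normalized.replace(/\s/g, '')).length // char count otherwise
}
```

Under a heuristic like this, "hello!" measures 6 units by character count because it has no spaces, which is exactly the kind of guards-vs-refine inconsistency described above.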


Potential Bugs

  1. SENTENCE_END_PATTERN includes comma: the regex /[,.。？?！!；;…؟۔\n]$/ matches trailing commas as a "strong boundary". This means hasStrongBoundary("well,") returns true, which prevents merging in both refine and guards. The test at line 181 explicitly asserts this behavior. However, commas are not sentence-ending punctuation; "well," is a fragment that typically continues. The PR's own prompt rules say "strong boundaries: clear punctuation (. ? ! ; :)", yet comma isn't listed and the code treats it as one. This is inherited from the existing SENTENCE_END_PATTERN constant, but it contradicts the prompt's own guidance and may cause over-fragmentation of clauses separated by commas.

  2. buildFragmentsFromUnits can crash on out-of-bounds access: source[unit.from] and source[unit.to] are accessed without bounds checking. If refineSegmentationUnits produces a unit with indices outside the source array (shouldn't happen after validation, but defensive code would be better), this throws a cryptic Cannot read properties of undefined. The validateSegmentationUnits call before it should prevent this, but since refinement runs between validation and build, and validation is called again after refinement, this is probably safe in practice. Just noting the implicit dependency on call order.

  3. cleanLineProtocolResponse regex for think tags: the pattern only strips content before the first closing think tag. If a model outputs multiple think blocks, only the first is stripped and subsequent think blocks leak into parsed output. Minor edge case, but models with chain-of-thought sometimes produce multiple think blocks.
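The comma behavior from point 1 is easy to reproduce; a sketch (the pattern is as quoted above, while the helper shape is an assumption):

```typescript
// The shared pattern the review quotes; note the leading comma in the
// character class, which makes a trailing comma count as a boundary.
const SENTENCE_END_PATTERN = /[,.。？?！!；;…؟۔\n]$/

// Illustrative boundary check as described in the review.
function hasStrongBoundary(text: string): boolean {
  return SENTENCE_END_PATTERN.test(text.trim())
}
```

Dropping the leading comma from the character class would make fragments like "well," mergeable again while keeping genuine sentence enders.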


Performance

  1. No concerns for typical workloads. The refine pipeline is O(n) for merging and O(n * m) for splitting where m = fragments in a single unit. Given subtitle blocks are typically under 500 fragments, this is fine.

Security

  1. No concerns. This code processes locally-stored subtitle data and AI-generated text. No user-facing injection vectors. The protocol parser silently ignores non-matching lines, which is good defensive behavior.

Test Coverage

Good:

  • Tests cover the key scenarios: parsing, validation gaps, rebuild, CJK merging, long unit splitting, no-split cases, guards merge/keep/boundary behavior.
  • Edge cases like numbered list prefixes and varied whitespace are tested.

Missing coverage:

  • cleanFragmentsForAi and formatFragmentsToJson have no tests.
  • cleanLineProtocolResponse (in the background file): no test for markdown fence stripping or think tag handling.
  • parseLineProtocolToUnits with from > to (e.g. 3-1 | text): the parser accepts it but the validator catches it. Worth testing the parser alone.
  • parseLineProtocolToUnits with all-invalid input: should test the thrown error.
  • validateSegmentationUnits with overlapping ranges: currently only a gap is tested.
  • No test for normalizeTimeline (overlap correction in guards).
  • splitLongUnitsBySignals with English text: all split tests use CJK.
  • No integration-level test of aiSegmentBlock with a mocked sendMessage.

Summary

| Area | Rating |
| --- | --- |
| Architecture | 🟢 Good: clean decomposition; index-based protocol is a clear improvement |
| Code quality | 🟡 Acceptable: some DRY violations (normalizeSpaces, hasStrongBoundary) |
| Correctness | 🟡 Acceptable: comma-as-boundary is debatable; single-split limitation is a design choice |
| Tests | 🟡 Acceptable: core logic tested, but several utility functions and edge cases lack coverage |
| Security | 🟢 No issues |

Recommendations (ordered by priority):

  1. Extract normalizeSpaces and hasStrongBoundary into a shared ai-segmentation/utils.ts
  2. Re-evaluate whether comma should be a "strong boundary" or add a comment explaining the rationale
  3. Add tests for cleanLineProtocolResponse, cleanFragmentsForAi, and parser error/edge cases
  4. Consider whether single-pass splitting is sufficient or document the limitation



claude bot commented Feb 14, 2026

Code Review: feat(subtitles): refactor ai segmentation core protocol

Overall Assessment

Good refactor. Replacing loose VTT parsing with a structured from-to | text line protocol is a meaningful improvement: it moves from an error-prone timestamp-based format to index-based fragment references, making validation and post-processing possible. The code is well-decomposed into protocol/refine/guards layers with clear responsibilities.


Code Quality

Strengths:

  • Clean module decomposition: protocol.ts (parse/validate), refine.ts (merge/split heuristics), guards.ts (final timeline normalization). Each has a clear single responsibility.
  • validateSegmentationUnits is a solid invariant check: contiguous, non-overlapping, full-coverage validation catches AI output errors early.
  • Cache key versioning (AI_SEGMENTATION_CACHE_SCHEMA_VERSION) is a smart addition for protocol migration safety.
  • Test coverage is good for the core pipeline paths.
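The invariant check described above amounts to roughly the following (a hypothetical implementation; only the contiguous, non-overlapping, full-coverage behavior comes from the review):

```typescript
interface SegmentationUnit { from: number; to: number; text: string }

// Sketch: units must start at index 0, be contiguous and
// non-overlapping, and end exactly at the last source fragment.
function validateUnits(units: SegmentationUnit[], sourceLength: number): void {
  let expected = 0
  for (const unit of units) {
    if (unit.from !== expected)
      throw new Error(`coverage gap or overlap at index ${expected}`)
    if (unit.to < unit.from || unit.to >= sourceLength)
      throw new Error(`invalid range ${unit.from}-${unit.to}`)
    expected = unit.to + 1
  }
  if (expected !== sourceLength)
    throw new Error(`incomplete coverage: ended at ${expected} of ${sourceLength}`)
}
```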

Issues:

  1. Duplicated normalizeSpaces and hasStrongBoundary functions: both guards.ts:8-10,12-14 and refine.ts:15-17,19-21 define identical normalizeSpaces and hasStrongBoundary helpers. Extract these to a shared utility (e.g., ai-segmentation/utils.ts).

  2. SENTENCE_END_PATTERN includes comma: the regex /[,.。？?！!；;…؟۔\n]$/ treats comma as a sentence-end boundary. This means hasStrongBoundary("well,") returns true, which is tested and intentional, but comma is not a sentence-ending punctuation mark. The test at line 371 ('treats comma as strong boundary from shared sentence-end pattern') acknowledges this as inherited behavior. Worth noting that this may cause under-merging of comma-ended fragments like "however," or "in fact,". Not a blocker since this is pre-existing behavior from the shared constant.

  3. parseLineProtocolToUnits silently skips malformed lines: lines that have a pipe but an unparseable range are silently dropped. This is reasonable for robustness against AI noise, but if all lines except junk are dropped, the function throws. Consider logging skipped lines for debugging AI output issues, especially during the early rollout of this new protocol.

  4. splitLongUnitsBySignals only performs a single split: for very long units (e.g., 40+ length units), only one binary split is attempted. If the AI produces a single massive unit covering many fragments, the result may still exceed the soft length limits from the prompt. Consider recursive or iterative splitting until all sub-units are under the threshold.


Potential Bugs

  1. buildFragmentsFromUnits can crash on out-of-bounds access: at refine.ts:238-239, source[unit.from] and source[unit.to] are accessed without bounds checking. While validateSegmentationUnits is called before buildFragmentsFromUnits in the pipeline, if someone calls buildFragmentsFromUnits independently with bad data, it will produce undefined.start. Since this is an internal function in a pipeline that validates first, this is low risk but worth a defensive note.

  2. enforceCueGuards applies normalizeTimeline twice: at guards.ts:93, the implementation is normalizeTimeline(mergeTinyCues(normalizeTimeline(fragments))). The inner normalizeTimeline fixes overlaps before merging, then the outer one fixes overlaps introduced by merging. This is correct but should have a comment explaining why the double application is needed, to avoid confusing future maintainers.


Performance

  1. splitLongUnitsBySignals linear scan is fine: the loop at refine.ts:193-205 iterates over all possible split points within a unit. Given subtitle fragment counts (typically dozens to low hundreds), this is negligible.

  2. composeTextFromSourceRange rebuilds text by joining: called per split candidate during evaluation. For large units this means repeated string concatenation, but again, fragment counts are small enough that this is not a concern.

No performance issues worth acting on.


Security

No security concerns. This code processes subtitle text locally, has no injection vectors, and doesn't interact with external systems beyond the existing AI provider integration.


Test Coverage

Good coverage of:

  • Protocol parsing (valid lines, numbered lines, extra whitespace)
  • Coverage validation (complete, gaps)
  • Timeline rebuild from indexes
  • Merge heuristics (short cues, CJK)
  • Split heuristics (by pause/boundary, no safe split)
  • Guard merge/keep behavior with boundary punctuation

Missing coverage:

  • Edge case: parseLineProtocolToUnits with all-invalid lines (verify the thrown error message)
  • Edge case: single-fragment input (from === to, e.g., 0-0 | word)
  • Edge case: validateSegmentationUnits with overlapping ranges (e.g., 0-2 then 1-3)
  • Edge case: buildFragmentsFromUnits with empty source array
  • The cleanLineProtocolResponse function in ai-segmentation.ts (background) has no tests; worth adding for the markdown fence stripping and think-tag removal.

Prompt Design

The new system prompt is cleaner and more structured than the old VTT-based one. The index-based protocol (from-to | text) is a better fit for AI output since it avoids timestamp extraction errors. The length guidance with CJK vs space-separated thresholds is thoughtful.

One suggestion: the prompt says "No markdown code blocks" but cleanLineProtocolResponse still strips them. This is correct defensive coding; keep both.


Summary

| Category | Rating |
| --- | --- |
| Architecture | ✅ Clean decomposition |
| Correctness | ✅ Solid with validation pipeline |
| Tests | 🟡 Good core coverage, some edge cases missing |
| Performance | ✅ No issues |
| Security | ✅ No concerns |

Actionable items (ordered by priority):

  1. Extract duplicated normalizeSpaces/hasStrongBoundary to shared module
  2. Add comment explaining double normalizeTimeline in enforceCueGuards
  3. Consider recursive splitting for very long units
  4. Add tests for edge cases listed above

taiiiyang marked this pull request as draft February 14, 2026 09:15

claude bot commented Feb 14, 2026

Code Review: feat(subtitles): refactor ai segmentation core protocol

Overall Assessment

Good refactor. Replacing the loose VTT parsing with a strict from-to | text line protocol is a clear improvement: the index-based protocol eliminates the timestamp extraction ambiguity that plagued the VTT approach and makes validation and testing much more tractable. The separation into protocol/refine/guards modules follows SRP well.


Code Quality

Strengths:

  • Clean module decomposition: protocol.ts (parsing/validation), refine.ts (merge/split heuristics), guards.ts (post-processing safety net). Each has a single responsibility.
  • The validateSegmentationUnits function provides airtight coverage validation: gaps, overlaps, out-of-bounds, and incomplete coverage are all caught.
  • Cache key versioning (AI_SEGMENTATION_CACHE_SCHEMA_VERSION) is a smart approach to invalidate stale caches during protocol migration.
  • Pure functions throughout refine/guards: easy to test and reason about.

Suggestions:

  1. cleanLineProtocolResponse regex redundancy (protocol.ts:15):

    cleaned = cleaned.replace(/```[\w-]*\n?/g, '').replace(/```\n?/g, '')

    The second .replace(/```\n?/g, '') is redundant; the first pattern [\w-]* already matches zero-or-more word chars, which includes the empty string case (closing ```). The first replace alone handles both opening and closing fences. Minor, but worth simplifying.

  2. parseLineProtocolToUnits silently discards malformed lines (protocol.ts:41-78):
    This is mostly fine for robustness against AI model noise, but consider: if the AI returns 0-1 | Hello followed by 2-3 | (empty text), the empty-text line is silently skipped, and validateSegmentationUnits will catch the gap. The error message will say "coverage gap at index 2" which doesn't hint that the AI returned the range but with empty text. Not blocking, but a debug log here could help troubleshoot production issues.

  3. composeTextFromSourceRange in refine.ts:13-31 joins source fragment texts with spaces, which works for space-separated languages but produces "你好 世界" for CJK. The refinement text used after splitting will therefore differ from what the AI originally produced. This seems intentional (the split path recomposes from source), but worth noting that the text field in refined units may contain spaces between CJK characters that weren't in the AI output. Not necessarily wrong; just a semantic note.


Potential Bugs

  1. SENTENCE_END_PATTERN comma removal (subtitles.ts:12):
    Removing , from SENTENCE_END_PATTERN changes behavior in scrolling-asr-parser.ts and optimizer.ts, not just in the new AI segmentation code. Previously, commas were treated as sentence-end boundaries in the ASR parser and optimizer. Verify this doesn't degrade non-AI segmentation paths (the heuristic optimizer). The PR description doesn't mention this as an intentional behavioral change outside the AI path.

  2. parseLineProtocolToUnits pipe-in-text edge case (protocol.ts:50-56):
    The parser uses indexOf('|') to find the first pipe. If the AI-generated text itself contains a | character (e.g., 0-5 | A | B choice), the captured text will be A | B choice, which is correct since it takes everything after the first pipe. Good. But if the left side of the first pipe doesn't match the range pattern, the entire line is skipped even if a later pipe would match. This is unlikely in practice but worth documenting.

  3. enforceCueGuards double normalization (guards.ts:84):

    return normalizeTimeline(mergeTinyCues(normalizeTimeline(fragments), sourceLanguage))

    The comment explains the double normalization, but mergeTinyCues uses Math.max(previous.end, fragment.end) when merging, which can only extend end, never create new overlaps. The second normalizeTimeline pass would only be needed if mergeTinyCues could introduce overlaps, which it doesn't seem to. Consider verifying whether the outer normalizeTimeline is actually needed.


Performance

  • The split scoring loop in splitLongUnitsBySignals iterates over all fragment indices within a unit. For typical subtitle chunks (tens of fragments), this is negligible. Fine.
  • measureTextLengthUnits uses Array.from() for CJK character counting, which allocates an array each time. For the hot path (called per-fragment in scoring), [...normalized].length or a manual codepoint counter would avoid the intermediate array, but given the small string sizes in subtitles this is not a real concern.

Security

No concerns. The code processes subtitle text from AI model output and doesn't inject it into DOM unsafely or pass it to shell/eval. The cleanLineProtocolResponse strips markdown fences and think tags, which is appropriate sanitization for the protocol parsing path.


Test Coverage

Good coverage of the core protocol pipeline:

  • Protocol parsing (valid, numbered, invalid)
  • Response cleaning (markdown fences, think tags)
  • Validation (contiguous coverage, gaps, overlaps, out-of-bounds)
  • Timeline rebuild
  • Merge/split refinement (CJK merge, pause-based split, no-split fallback)
  • Cue guards (tiny merge, boundary preservation, comma/colon non-boundary)

Gaps to consider:

  • No test for cleanFragmentsForAi or formatFragmentsToJson. These are simple, but the new i field in the JSON format is a contract with the AI prompt; a test asserting the JSON shape would catch accidental regressions.
  • No test for the end-to-end aiSegmentBlock orchestration (though this requires mocking sendMessage, so deferring integration-level testing is understandable).
  • No test for normalizeTimeline directly; it's covered indirectly through enforceCueGuards, but a test with overlapping timelines would make the behavior explicit.
  • The cleanLineProtocolResponse test doesn't cover the case where <think> tags appear without a closing </think>; the current regex would leave the think content in the output. Worth testing if thinking models could produce partial tags.

Minor Nits

  • refine.ts constants (MERGE_MAX_LENGTH_UNITS, SPLIT_MIN_TOTAL_UNITS, etc.) are well-named but undocumented. A brief comment explaining the units (words for English, characters for CJK) would help future contributors tune these thresholds.
  • Consider exporting SegmentationUnit type from a barrel index.ts or re-exporting from ai-segmentation.ts if other modules will need it in the stacked PR series.

taiiiyang marked this pull request as ready for review February 14, 2026 11:40
dosubot bot added the app: browser extension Related to browser extension label Feb 14, 2026
cubic-dev-ai bot left a comment


No issues found across 13 files


claude bot commented Feb 14, 2026

PR Review: feat(subtitles): refactor ai segmentation core protocol

Overall Assessment

This is a well-structured refactor that replaces a fragile VTT-based AI output protocol with a stricter, index-based line protocol (from-to | text). The decomposition into protocol.ts, refine.ts, and guards.ts is clean and follows the project's SOLID principles. Net deletion of code while adding more functionality and tests is a good sign.


Code Quality & Architecture

Strengths:

  • Clean separation of concerns: protocol parsing, validation, refinement, and timeline guards are each in their own module with single responsibilities.
  • The index-based protocol (0-1 | Hello world.) is fundamentally more robust than VTT timestamp parsing: it eliminates an entire class of timestamp-alignment bugs that the old approach was susceptible to.
  • validateSegmentationUnits provides strong invariant checking (contiguous, non-overlapping, in-bounds coverage). Running it both before and after refinement (ai-segmentation.ts:21-27) is a good defensive pattern.
  • Cache key versioning (AI_SEGMENTATION_CACHE_SCHEMA_VERSION = 'seg-line-v1') is a smart way to avoid serving stale cached data after the protocol change.
  • Pure functions throughout: refine.ts and guards.ts are entirely side-effect-free and highly testable.

Suggestions:

  • The SENTENCE_END_PATTERN change (removing comma , from the pattern in subtitles.ts:12) is a behavioral change that affects not just the new code but also optimizer.ts and scrolling-asr-parser.ts. This is the right call semantically (commas are not sentence boundaries), but worth calling out that it affects the optimizer and YouTube ASR parser paths, not just the new AI segmentation path. Confirm this doesn't regress the non-AI subtitle processing.

Potential Bugs / Issues

  1. cleanLineProtocolResponse double-regex for code fences (protocol.ts:15):

    cleaned = cleaned.replace(/```[\w-]*\n?/g, '').replace(/```\n?/g, '')

    The first regex [\w-]* already matches zero-or-more word chars, so it matches plain ``` too. The second .replace(/```\n?/g, '') is redundant; the first pass already handles it. Not a bug, but dead code.

  2. composeTextFromSourceRange silently skips falsy text (refine.ts:24-27):

    const fragmentText = source[index]?.text
    if (fragmentText) { parts.push(fragmentText) }

    If a fragment has text: "" (empty string after cleaning), it's silently skipped. This is fine given cleanFragmentsForAi filters them out, but the coupling is implicit: if someone calls composeTextFromSourceRange on uncleaned fragments, the recomposed text would silently lose fragments. Consider a comment noting the precondition.

  3. <think> tag stripping only handles </think> (protocol.ts:17):
    The regex /<\/think>([\s\S]*)/ extracts everything after </think>. If the model outputs <think>reasoning</think> but also has content before the <think> tag, that content is preserved (since the match fails and afterThink defaults to cleaned). This seems correct for known model behaviors, but if a model outputs protocol lines both before and after <think> tags, the lines before would be lost. This edge case seems unlikely in practice.

  4. Merge can chain: mergeOverFragmentedUnits merges A→B when A is tiny, then the merged AB could still be checked against C. This is intentional (the merged unit is now larger and won't satisfy MERGE_MAX_LENGTH_UNITS), but worth noting that chained merges are bounded by the length check. Good design.
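Points 1 and 3 above could both be addressed in the cleaner; a sketch (names and regexes are illustrative, not the shipped code):

```typescript
// Illustrative response cleaner: one fence regex suffices, and keeping
// only the content after the LAST </think> also handles repeated
// think blocks.
function cleanResponse(raw: string): string {
  // [\w-]* matches zero characters too, so this also strips bare ``` fences.
  let cleaned = raw.replace(/```[\w-]*\n?/g, '')
  const lastClose = cleaned.lastIndexOf('</think>')
  if (lastClose !== -1)
    cleaned = cleaned.slice(lastClose + '</think>'.length)
  return cleaned.trim()
}
```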


Performance

  • The refinement pipeline is O(n) for merge and O(n*m) for split (where m is the fragment span of long units). Given subtitle chunks are bounded by PROCESS_LOOK_AHEAD_MS (60s windows), this is fine; n and m are small.
  • enforceCueGuards runs normalizeTimeline twice (before and after merge). This is intentional per the comment, and with small cue counts, the double pass is negligible.
  • No concerns here; the new code is at least as efficient as the old VTT parser, likely more so since it avoids regex-heavy VTT parsing.

Security

  • No user input is directly interpolated into HTML, SQL, or shell commands.
  • The cleanLineProtocolResponse function strips <think> tags and markdown fences, which is appropriate sanitization for AI model output.
  • The sendMessage IPC boundary between content script and background worker is maintained correctly.
  • No concerns.

Test Coverage

What's covered well:

  • Protocol parsing (valid lines, numbered lines, markdown fences, think tags, invalid input)
  • Validation (contiguous coverage, gaps, overlaps, out-of-bounds, inverted ranges)
  • Timeline rebuild from units (happy path and out-of-bounds error)
  • Merge behavior (CJK tiny cues, boundary punctuation)
  • Split behavior (pause + punctuation boundary, no-split when no boundary exists, tiny gaps without punctuation)
  • Guard enforcement (tiny cue merging, strong boundary preservation, comma/colon not treated as strong)

Gaps to consider:

  • No test for cleanFragmentsForAi or formatFragmentsToJson; these were moved from ai-segmentation.ts to protocol.ts without test additions. They're simple, but since this PR adds a test file, it would be low-cost to cover them.
  • No test for evaluateSplitScore edge cases like equal-length splits (balance bonus/penalty threshold of 8 units difference).
  • No integration-level test for the full aiSegmentBlock pipeline (parse → validate → refine → validate → build → guards). Individual steps are tested, but end-to-end composition is only validated in production. Given the sendMessage IPC dependency this is understandable, but a test with a mocked sendMessage could catch composition bugs.
  • The measureTextLengthUnits test only covers explicit 'en' and 'zh'; no coverage for other CJK languages (ja, ko, th) or edge cases like empty strings.

Minor Nits

  • refine.ts has several magic number thresholds (MERGE_MAX_LENGTH_UNITS = 5, SPLIT_MIN_TOTAL_UNITS = 20, etc.) that are well-named constants but lack brief comments explaining the reasoning. A one-liner per constant would help future contributors tune them.
  • The evaluateSplitScore function's scoring weights (gap ≥250ms → +2, boundary → +2, imbalance >8 → -1) are domain heuristics. Consider documenting these in a brief comment block at the top of the function.
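Documenting those weights could be as simple as naming them; a sketch (the candidate shape and the 120ms tier's +1 mapping are assumptions based on the thresholds quoted in these reviews):

```typescript
// Sketch of the split-point scoring heuristic described above: long
// pauses and strong punctuation attract the split, imbalance penalizes it.
interface SplitCandidate {
  gapMs: number // silence between the two fragments at the split point
  endsWithBoundary: boolean // left side ends with strong punctuation
  leftUnits: number // text-length units left of the split
  rightUnits: number // text-length units right of the split
}

function scoreSplit(candidate: SplitCandidate): number {
  let score = 0
  if (candidate.gapMs >= 250)
    score += 2 // long pause: strongly preferred split point
  else if (candidate.gapMs >= 120)
    score += 1 // medium pause: mildly preferred (assumed mapping)
  if (candidate.endsWithBoundary)
    score += 2 // punctuation boundary on the left side
  if (Math.abs(candidate.leftUnits - candidate.rightUnits) > 8)
    score -= 1 // heavily imbalanced halves
  return score
}
```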

Verdict

Solid refactor. The line protocol is a clear improvement over VTT parsing for AI output, the validation is rigorous, and the merge/split heuristics are well-bounded. The main actionable items are: (1) verify the comma removal from SENTENCE_END_PATTERN doesn't regress optimizer/ASR paths, (2) consider adding tests for cleanFragmentsForAi/formatFragmentsToJson, and (3) remove the redundant second regex in cleanLineProtocolResponse.

dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Feb 14, 2026

claude bot commented Feb 14, 2026

PR Review: feat(subtitles): refactor ai segmentation core protocol

Overall this is a well-structured refactor that replaces a fragile VTT-based AI response format with a stricter index-based line protocol. The decomposition into protocol.ts, refine.ts, guards.ts, and index.ts follows SRP well, and the pipeline is easier to reason about. Good test coverage for the new modules.

Code Quality and Architecture

Strengths:

  • Clean separation of concerns: parsing (protocol.ts), post-processing (refine.ts), timeline normalization (guards.ts), and orchestration (index.ts).
  • The index-based from-to | text protocol eliminates the timestamp drift bugs that plagued the old VTT approach -- timestamps are now derived from source fragments, not from AI output. This is the right call.
  • validateSegmentationUnits is called both after parsing AND after refinement -- good defensive design.
  • Cache key versioning (AI_SEGMENTATION_CACHE_SCHEMA_VERSION) prevents stale cached VTT results from being misinterpreted as line protocol.
  • The refine module merge/split heuristics are language-aware with sensible defaults.

Suggestions:

  1. cleanLineProtocolResponse regex redundancy (protocol.ts:15): The first regex with [\w-]* already matches zero-or-more word chars (including zero), so the second .replace is redundant -- the first pattern already handles bare triple-backticks (where [\w-]* matches zero chars). Consider removing the second replace for clarity.

  2. composeTextFromSourceRange silently skips falsy text (refine.ts:25): The check if (fragmentText) would skip a fragment with text "0" since it is falsy. Since cleanFragmentsForAi already filters empty text upstream this is fine in practice, but consider using fragmentText != null for correctness since "0" is valid subtitle text.

  3. SENTENCE_END_PATTERN comma removal (subtitles.ts:12): Removing comma is the right choice for isStrongSentenceBoundary since commas are not sentence boundaries. The only other consumer is optimizer.ts where the change also looks correct (flush-on-comma would produce over-fragmented subtitles).

  4. Magic numbers in evaluateSplitScore (refine.ts:125-133): The gap thresholds (250ms, 120ms, 60ms) and score adjustments (+2, +1, -1) are hardcoded inline. They are localized to this function so it is manageable, but extracting them as named constants (like the other thresholds at the top of the file) would make future tuning easier.
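
On point 4, extraction might look like the following sketch. Note the mapping of each threshold to each score adjustment is a guess from the numbers listed above, not the PR's actual logic:

```typescript
// Hypothetical named constants for the inline values cited in point 4.
const SPLIT_GAP_STRONG_MS = 250
const SPLIT_GAP_MEDIUM_MS = 120
const SPLIT_GAP_MIN_MS = 60
const SPLIT_SCORE_STRONG_GAP_BONUS = 2
const SPLIT_SCORE_MEDIUM_GAP_BONUS = 1
const SPLIT_SCORE_TINY_GAP_PENALTY = -1

// Illustrative scorer: larger inter-fragment gaps make a split more attractive.
function scoreGapMs(gapMs: number): number {
  if (gapMs >= SPLIT_GAP_STRONG_MS) return SPLIT_SCORE_STRONG_GAP_BONUS
  if (gapMs >= SPLIT_GAP_MEDIUM_MS) return SPLIT_SCORE_MEDIUM_GAP_BONUS
  if (gapMs < SPLIT_GAP_MIN_MS) return SPLIT_SCORE_TINY_GAP_PENALTY
  return 0
}
```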

Potential Bugs / Edge Cases

  1. mergeOverFragmentedUnits -- chain merging (refine.ts:90-94): When unit A merges into B, the merged result expands and becomes the candidate for the next comparison, so A+B could merge again with C if the original B was short. This greedy forward merge lets merged text grow well past the merge target. The downstream splitLongUnitsBySignals would then split it, but only once (single-pass). For a pathological case of many tiny fragments, you could end up with one very large unit that cannot be split (if no boundary signals exist). Seems acceptable given the SPLIT_MIN_SCORE guard, but worth being aware of.

  2. normalizeTimeline does not handle single-element edge case (guards.ts:35-37): The loop starts at index 1, so for a single-element array, the end < start check on that element is never reached. If a single fragment has end < start, it passes through uncorrected. Consider adding a check for the first element.
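
On point 2, a sketch of the suggested first-element guard -- the cue shape and clamping policy here are assumptions for illustration, not the actual guards.ts code:

```typescript
interface TimedCue {
  start: number // milliseconds
  end: number
}

function normalizeTimelineSketch(cues: TimedCue[]): TimedCue[] {
  const result = cues.map(cue => ({ ...cue }))
  // Guard the first element too, so a lone inverted cue is still corrected.
  if (result.length > 0 && result[0].end < result[0].start) {
    result[0].end = result[0].start
  }
  for (let i = 1; i < result.length; i++) {
    // A cue may not start before the previous one ends.
    if (result[i].start < result[i - 1].end) result[i].start = result[i - 1].end
    // A cue may not end before it starts.
    if (result[i].end < result[i].start) result[i].end = result[i].start
  }
  return result
}
```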

Performance

  • The overall pipeline is O(n) for most operations, O(n * k) for split scoring where k is the span of a single unit. Well within acceptable bounds for subtitle-sized inputs.
  • normalizeTimeline is called twice in enforceCueGuards -- intentional and documented, cost is negligible.

Test Coverage

Good coverage for: protocol parsing (valid/invalid/numbered/spaced), response cleaning (markdown fences, think tags), validation (gaps, overlaps, out-of-bounds, empty), timeline rebuild, merge behavior (CJK short cues), split behavior (pause boundaries, no-boundary cases), guard enforcement (tiny cue merge, strong boundary preservation, comma/colon not-strong).

Missing coverage:

  • cleanFragmentsForAi -- no direct test for newline/whitespace normalization and empty-text filtering
  • formatFragmentsToJson -- no test verifying the i index field is correctly added
  • Full aiSegmentBlock orchestration (understandable since it requires mocking sendMessage)
  • normalizeTimeline edge cases: overlapping fragments, single fragment with end < start
  • measureTextLengthUnits with edge cases like empty string, whitespace-only string
  • mergeOverFragmentedUnits chain-merge scenario (many consecutive tiny units)

Security

No concerns -- this code processes locally-sourced subtitle data and AI model output. cleanLineProtocolResponse properly strips markdown and think tags. No user input flows into executable paths.

Summary

This is a solid refactor. The line protocol is a meaningful improvement over VTT parsing -- it removes an entire class of timestamp-alignment bugs by making the AI only responsible for grouping, not timing. The post-processing pipeline (merge -> split -> guards) is well-layered and each step has clear invariants enforced by validation.

Main actionable items:

  1. Fix the falsy-text check in composeTextFromSourceRange ("0" edge case)
  2. Consider the single-element normalizeTimeline edge case
  3. Add a few more unit tests for cleanFragmentsForAi, formatFragmentsToJson, and chain-merge scenarios

@taiiiyang taiiiyang marked this pull request as draft February 15, 2026 09:27
@claude

claude bot commented Feb 15, 2026

Code Review: feat(subtitles): refactor ai segmentation core protocol

Overall Assessment

Solid refactor. Replacing the loose VTT parsing with an explicit index-based line protocol (from-to | text) is a meaningful improvement β€” it makes validation deterministic and eliminates the timestamp-alignment errors inherent in the old approach. The module decomposition into protocol.ts, refine.ts, guards.ts, and index.ts follows SRP well.


Code Quality

Good:

  • Clean separation: protocol parsing β†’ validation β†’ refinement β†’ timeline rebuild β†’ cue guards. Each stage is independently testable.
  • validateSegmentationUnits is thorough β€” checks ordering, contiguity, bounds, and completeness in a single linear pass.
  • Cache key versioning (AI_SEGMENTATION_CACHE_SCHEMA_VERSION) prevents stale cache mismatches during protocol migration.
  • Double validation (before and after refinement) in index.ts:37-40 is a smart defensive pattern.

Suggestions:

  1. SENTENCE_END_PATTERN includes comma β€” The pattern /[,,.。??!!οΌ›;β€¦ΨŸΫ”\n]$/ now includes , and ,. But the function wrapping it is called isStrongSentenceBoundary. Commas are generally weak boundaries (as confirmed by the test 'does not treat comma as strong boundary for cue merge guards'). This works because guards.ts independently checks gap/duration before merging, but the naming is misleading. The pattern itself is shared across optimizer.ts and scrolling-asr-parser.ts where comma is treated as a sentence-end signal for flushing β€” so the semantics differ by callsite. Consider either:

    • Renaming to isSentenceEndPunctuation (more accurate), or
    • Splitting into two patterns: one for strong boundaries (.?!) and one for all sentence-end punctuation.
  2. cleanLineProtocolResponse regex overlap β€” protocol.ts:15:

    cleaned = cleaned.replace(/```[\w-]*\n?/g, '').replace(/```\n?/g, '')

    The second .replace() is redundant β€” the first pattern [\w-]* already matches zero or more word/dash chars, which includes the empty string. The second call would only match backticks already removed by the first. Not a bug, just dead code.

  3. composeTextFromSourceRange silently skips empty fragments β€” refine.ts:59:

    if (fragmentText) {
      parts.push(fragmentText)
    }

    Since cleanFragmentsForAi already filters empty fragments, this guard is probably fine in practice, but the truthiness check silently drops any falsy text -- "0" is rejected just like "", even though "0" is valid subtitle text. Since the upstream filter already handles empties, you could simplify to an unconditional push, or add an explicit comment noting the assumption.

  4. Single-pass split may leave very long cues β€” refine.ts:207-208 intentionally does only one split per long unit. The comment explains the rationale (semantic continuity), but with extreme AI output (e.g., a 50+ word unit with no good boundary), users could see uncomfortably long subtitles. The test 'keeps long unit when no safe split boundary exists' covers this. Just noting that this is a conscious trade-off, not a bug.
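
On suggestion 1, the two-pattern option might look like the following sketch -- the character sets are inferred from the pattern quoted above and are an assumption, not the PR's code:

```typescript
// Strong boundaries: terminal punctuation that genuinely ends a sentence.
const STRONG_BOUNDARY_PATTERN = /[.。??!!…ΨŸΫ”\n]$/
// All sentence-end punctuation, including weak signals like commas,
// for callsites (optimizer, scrolling ASR parser) that flush on any of them.
const SENTENCE_END_PUNCTUATION_PATTERN = /[,,.。??!!οΌ›;β€¦ΨŸΫ”\n]$/

function isStrongBoundary(text: string): boolean {
  return STRONG_BOUNDARY_PATTERN.test(text)
}

function isSentenceEndPunctuation(text: string): boolean {
  return SENTENCE_END_PUNCTUATION_PATTERN.test(text)
}
```

This keeps the comma-flushing behavior available where it is wanted while letting the cue-merge guards ask the narrower question.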


Potential Bugs

  1. mergeOverFragmentedUnits is not idempotent with chained merges β€” When unit A merges into unit B, the merged result becomes the new "previous" for the next comparison. If A was 4 words and B was 4 words (both under targetMinUnits=11), the merged AB (8 words) could then also merge with C (3 words) in the same pass. This cascading behavior is likely intended for the "merge short English units" test case, but it means a sequence of many tiny fragments could cascade into a single very long cue. The MERGE_MAX_COMBINED_UNITS check (refine.ts:103) caps this, but the cap is checked against the accumulated text which is recomputed each iteration β€” so a chain of 3-word units could merge up to 24 words in one pass. This seems acceptable given the constants, but worth being aware of.

  2. measureTextLengthUnits for CJK strips all spaces β€” utils.ts:34:

    return Array.from(normalized.replace(/\s+/g, '')).length

    This counts every non-space character including punctuation. For CJK length limits (hard limit 22 chars), punctuation like γ€‚οΌŸοΌ counts toward the limit. This is probably correct for subtitle display width, but differs from the prompt's guidance of "10-18 chars" which might mean readable characters only. Minor β€” just a consistency note.


Performance

No concerns. All operations are O(n) linear passes over fragment arrays. The split scoring loop (refine.ts:222) is O(k) per unit where k is the fragment span β€” bounded by chunk size from the pipeline, so effectively constant.


Security

  1. cleanLineProtocolResponse think-tag stripping β€” protocol.ts:17:

    const [, afterThink = cleaned] = cleaned.match(/<\/think>([\s\S]*)/) || []

    This strips everything up to the first closing </think> tag but doesn't handle nested or repeated think tags -- a later </think> survives inside the greedy capture. Since this is processing AI model output (not user input), the risk is low, but a malicious or confused model could inject <think> tags within the protocol output. The fallback (afterThink = cleaned) handles the no-match case correctly.

  2. No user-controlled input reaches these functions directly β€” all input comes from AI model responses or internal fragment data. No injection risks identified.
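
On point 1, reproducing the quoted destructuring in isolation shows both the correct no-match fallback and the repeated-tag behavior:

```typescript
// Reproduces the quoted line from protocol.ts:17 as a standalone function.
function stripThink(cleaned: string): string {
  const [, afterThink = cleaned] = cleaned.match(/<\/think>([\s\S]*)/) || []
  return afterThink
}
```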


Test Coverage

Good coverage of the new protocol pipeline:

  • Protocol parsing (valid lines, numbered prefixes, markdown fences, think tags, invalid input)
  • Validation (contiguous coverage, gaps, overlaps, out-of-bounds, invalid ranges)
  • Timeline rebuild (normal case, out-of-bounds error)
  • Merge behavior (CJK short merge, English short merge with punctuation noise)
  • Split behavior (pause + boundary split, no-split when no safe boundary, no-split on tiny gaps)
  • Cue guards (tiny merge, strong boundary with pause, comma/colon not strong)

Missing coverage:

  • No test for cleanFragmentsForAi (newline/whitespace normalization, empty fragment filtering)
  • No test for formatFragmentsToJson (index i field inclusion)
  • No test for the full aiSegmentBlock integration (would need mocking sendMessage)
  • No test for normalizeTimeline in isolation (overlap correction, zero-duration edge case)
  • No edge case for measureTextLengthUnits with mixed CJK/Latin text
  • The evaluateSplitScore balance penalty (Math.abs(leftLength - rightLength) > 8) is not directly tested

Minor Nits

  • refine.ts has 282 lines with many threshold constants. Consider extracting the constants into a named config object (e.g., const MERGE_CONFIG = { ... }) to group related values and reduce the parameter-passing overhead through the helper functions.
  • The prompt in prompt.ts is well-structured. The explicit "Output ONLY protocol lines" instruction should reduce model hallucination compared to the old VTT format.

Verdict

The refactor achieves its goal: a stricter, more testable segmentation pipeline with proper validation at each stage. The index-based protocol eliminates the timestamp-alignment class of bugs from the old VTT approach. Ship it with the naming nit on isStrongSentenceBoundary addressed (or documented), and consider backfilling tests for cleanFragmentsForAi and formatFragmentsToJson.

@github-actions

github-actions bot commented Mar 3, 2026

This PR has been inactive for 14 days and is now marked as stale.
It will be automatically closed in 30 days if no further activity occurs.

@github-actions github-actions bot added the Stale label Mar 3, 2026
@claude

claude bot commented Mar 3, 2026

PR Review: feat(subtitles): refactor ai segmentation core protocol

Overall Assessment

Well-structured refactor that replaces loose VTT parsing with an explicit from-to | text line protocol. The modular decomposition into protocol.ts, guards.ts, refine.ts, and index.ts follows the project's SOLID principles. Tests are solid. A few issues worth addressing before merge.


Code Quality & Architecture

Strengths:

  • Clean separation of concerns: protocol parsing, validation, refinement, and guards each have their own module
  • Pure functions throughout β€” refine.ts and guards.ts have zero side effects, making them highly testable
  • Constants are well-named and co-located with their usage context (CJK vs space-separated thresholds)
  • The single-pass split strategy in splitLongUnitsBySignals is a good pragmatic choice over recursive splitting

Issues:

  1. SENTENCE_END_PATTERN includes comma (src/utils/constants/subtitles.ts:12)

    export const SENTENCE_END_PATTERN = /[,,.。??!!οΌ›;β€¦ΨŸΫ”\n]$/
    

    Comma (,) is treated as a "strong sentence boundary" β€” but this contradicts the test at lines 281-292 of ai-segmentation.test.ts which explicitly says "does not treat comma as strong boundary." The test passes only because enforceCueGuards merges tiny cues despite the boundary. This is semantically confusing: isStrongSentenceBoundary("well,") returns true, yet commas are "not strong boundaries" in the guards logic. Consider renaming to isSentenceEndPunctuation or splitting into isStrongBoundary (period/question/exclamation only) vs isWeakBoundary (comma/colon) to avoid this conceptual mismatch.

  2. cleanLineProtocolResponse double-strips markdown fences (protocol.ts:15)

    cleaned = cleaned.replace(/```[\w-]*\n?/g, "").replace(/```\n?/g, "")

    The first regex [\w-]* already matches the empty string (zero-width match of *), so ``` without a language tag is matched by the first replace. The second replace is redundant. Not a bug, just dead code.

  3. Cache key doesn't include the system prompt version (ai-segmentation.ts:46-50)

    const cacheKey = Sha256Hex(
      jsonContentHash,
      JSON.stringify(providerConfig),
      AI_SEGMENTATION_CACHE_SCHEMA_VERSION,
    )

    If DEFAULT_SUBTITLES_SEGMENTATION_SYSTEM_PROMPT changes in a future update without bumping AI_SEGMENTATION_CACHE_SCHEMA_VERSION, stale cached results from the old prompt will be served. Consider including a hash of the prompt in the cache key, or documenting that AI_SEGMENTATION_CACHE_SCHEMA_VERSION must be bumped on prompt changes.
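
On issue 3, one way to fold the prompt in -- sha256Hex below is a stand-in for the project's Sha256Hex (whose variadic use is inferred from the quoted call), and buildCacheKey is a hypothetical name:

```typescript
import { createHash } from "node:crypto"

// Stand-in for the project's Sha256Hex: hash the NUL-joined parts.
function sha256Hex(...parts: string[]): string {
  return createHash("sha256").update(parts.join("\u0000")).digest("hex")
}

// Folding the prompt text into the key means a prompt edit invalidates the
// cache automatically, with no manual schema-version bump required.
function buildCacheKey(
  jsonContentHash: string,
  providerConfigJson: string,
  schemaVersion: string,
  systemPrompt: string,
): string {
  return sha256Hex(jsonContentHash, providerConfigJson, schemaVersion, systemPrompt)
}
```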


Potential Bugs

  1. validateSegmentationUnits called twice β€” second call may fail after refinement (index.ts:40)

    const refinedUnits = refineSegmentationUnits(units, cleanedFragments, sourceLanguage)
    validateSegmentationUnits(refinedUnits, cleanedFragments.length)

    mergeOverFragmentedUnits in refine.ts merges adjacent units, which reduces coverage count. The post-refinement units still cover all indices 0..N-1, so this works. But splitLongUnitsBySignals splits a unit {from: A, to: B} into {from: A, to: splitIndex-1} and {from: splitIndex, to: B} β€” this preserves contiguous coverage. Verified correct, but worth a comment explaining why the second validation is safe (since it's not obvious).

  2. composeTextFromSourceRange silently skips missing fragments (refine.ts:58-60)

    const fragmentText = source[index]?.text
    if (fragmentText) { parts.push(fragmentText) }

    If source[index] exists but has text: "", the fragment is silently dropped. This is consistent with cleanFragmentsForAi filtering empty text, but if a fragment with whitespace-only text somehow survives cleaning, it would be silently omitted. Low risk but worth noting.

  3. SegmentationPipeline.findNextChunk may create non-contiguous chunks (segmentation-pipeline.ts:120-123)

    return this.rawFragments.filter(
      f => f.start >= firstUnprocessed.start && f.start < windowEnd
        && !this.segmentedRawStarts.has(f.start),
    )

    If some fragments within the window were already processed (from a previous chunk), the returned chunk will have gaps. These non-contiguous fragments are then passed to aiSegmentBlock, where validateSegmentationUnits expects contiguous coverage over cleanedFragments.length. This should be fine since the AI re-indexes from 0 based on the cleaned input β€” but verify that partial re-processing of overlapping windows doesn't produce unexpected results.


Performance

  1. normalizeSpaces called repeatedly during merge (refine.ts:94, 147)
    Each merge iteration calls normalizeSpaces on the growing combined text. For a chain of N merges, this is O(N * total_text_length). Not a real concern for subtitle cues (small N), but a StringBuilder-style approach would be cleaner if this ever scales.

  2. Test for measureTextLengthUnits with CJK (ai-segmentation.test.ts:52)

    expect(measureTextLengthUnits("hello world", "zh")).toBe(10)

    This counts "helloworld" (spaces removed) = 10 ASCII chars. This is correct per the implementation which strips spaces for CJK and counts Array.from(...). But semantically, counting ASCII chars as if they were CJK characters is odd. English text measured with CJK rules would yield inflated counts. Not a bug β€” just a design note that CJK measurement assumes the text is actually CJK.


Security

No injection vectors found. The AI output is parsed via strict regex patterns (RANGE_PATTERN, LIST_PREFIX_PATTERN) and never interpolated into code or DOM unsanitized. The <think> tag stripping is appropriate defense against thinking-model output leakage. Cache keys use SHA-256 hashing. No concerns here.


Test Coverage

Good coverage of:

  • Protocol parsing (valid lines, numbered prefixes, edge cases)
  • Response cleaning (markdown fences, think tags)
  • Validation (contiguous coverage, gaps, overlaps, bounds)
  • Timeline rebuilding from source indexes
  • Merge behavior (CJK, English, short cues)
  • Split behavior (by pause/boundary signals, no-split fallback)
  • Guard behavior (tiny cues, strong boundaries with/without pause, comma/colon not treated as strong)

Missing coverage:

  • cleanFragmentsForAi β€” no test for newline/whitespace normalization or empty fragment filtering
  • formatFragmentsToJson β€” no test for the JSON serialization format
  • aiSegmentBlock integration (understandably harder to test due to async message passing)
  • Edge case: what happens when units after refinement is empty (all units merged into one, then split fails)?
  • enforceCueGuards with overlapping input timelines (the normalizeTimeline path)
  • SegmentationPipeline class β€” no unit tests for chunk finding, failure fallback, or restart behavior

Minor Suggestions

  • Consider adding @internal or @visibleForTesting annotations to exports that are only public for testing (e.g., composeTextFromSourceRange is private but its callers are exported)
  • The magic numbers in evaluateSplitScore (250, 120, 60, 8) could be named constants for consistency with the rest of the file
  • The prompt in prompt.ts is excellent β€” clear constraints, good examples, explicit rules about ellipsis and connector handling

Summary

| Area | Rating |
| --- | --- |
| Architecture | Good β€” clean module decomposition |
| Code quality | Good β€” pure functions, clear naming |
| Test coverage | Good for core logic, gaps in integration/edge cases |
| Bugs | Low risk β€” one semantic mismatch (#1), one potential chunk contiguity issue (#6) |
| Performance | Fine for current scale |
| Security | No concerns |

Recommendation: Address #1 (comma boundary semantics) and #3 (cache key prompt versioning) before merging. The rest are minor improvements that can be follow-ups.

@github-actions github-actions bot removed the Stale label Mar 5, 2026