
@NicolasBuchin (Collaborator)

This PR introduces a piecewise extension algorithm as an option for single-end reads. Piecewise takes chains and performs alignment only between the anchors composing the chain, plus two "x-drop" alignments: one before the first anchor of the chain and one after the last. X-drop is a stopping rule for alignment extension: the extension continues as long as the alignment score does not fall more than X below the best score seen so far; once it does, the extension stops.
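
To make the stopping rule concrete, here is a minimal, self-contained sketch (a simplified one-dimensional scoring walk, not the blockaligner implementation; the names XDropResult and xdrop_extend are illustrative):

```c++
#include <cstddef>
#include <vector>

struct XDropResult {
    int best_score = 0;
    std::size_t best_end = 0;  // how far the extension reached before stopping
};

// Walk over per-position score contributions and stop once the running score
// falls more than `x_drop` below the best score seen so far.
XDropResult xdrop_extend(const std::vector<int>& position_scores, int x_drop) {
    XDropResult result;
    int score = 0;
    for (std::size_t i = 0; i < position_scores.size(); ++i) {
        score += position_scores[i];
        if (score > result.best_score) {
            result.best_score = score;
            result.best_end = i + 1;
        } else if (result.best_score - score > x_drop) {
            break;  // X-drop condition met: stop extending
        }
    }
    return result;
}
```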

In its current state, piecewise is enabled with the flag --pw. The Aligner class now has an align_piecewise function that uses a new class, Piecewise::Aligner, which contains the entire implementation of the piecewise algorithm.

For the global alignments and the x-drop alignments, we use the C bindings of the Rust library blockaligner. There are two helper functions for calling global and x-drop alignments through the C bindings, as well as three helper functions that convert CIGARs from blockaligner into the CIGAR representation used by strobealign.
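
The actual blockaligner C API and strobealign's Cigar class are not reproduced here, but the general shape of such a conversion helper might look like the following sketch, assuming the binding exposes run-length (operation, length) pairs (the OpLen type and to_cigar_string are hypothetical):

```c++
#include <cstdint>
#include <string>
#include <vector>

struct OpLen {          // hypothetical: one run of identical CIGAR operations
    char op;            // 'M', '=', 'X', 'I', 'D', ...
    uint32_t len;
};

// Convert run-length operations into a CIGAR string, optionally collapsing
// '='/'X' into 'M' and merging adjacent runs of the same operation.
std::string to_cigar_string(const std::vector<OpLen>& ops, bool eqx_to_m) {
    std::string cigar;
    char prev_op = 0;
    uint32_t run = 0;
    for (OpLen o : ops) {
        char op = o.op;
        if (eqx_to_m && (op == '=' || op == 'X')) {
            op = 'M';  // collapse match/mismatch into M
        }
        if (op == prev_op) {
            run += o.len;  // merge with the previous run
        } else {
            if (run > 0) cigar += std::to_string(run) + prev_op;
            prev_op = op;
            run = o.len;
        }
    }
    if (run > 0) cigar += std::to_string(run) + prev_op;
    return cigar;
}
```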

The piecewise algorithm is split into three parts: align_before_first_anchor, align_after_last_anchor, and piecewise_extension_alignment. The first two make the x-drop calls and handle the edges of the reference/query. The main function, piecewise_extension_alignment, calls the other two and handles the global alignments between anchors, with a Hamming-alignment heuristic for the segments between anchors.
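
As a rough illustration of the per-gap step, here is a hedged sketch: when the query and reference segments between two consecutive anchors have equal length and few mismatches, a Hamming comparison is enough and no global alignment call is needed; otherwise the two segments still go through a global alignment. The Anchor type, field names, and threshold are assumptions, not strobealign's actual API.

```c++
#include <cstddef>
#include <string_view>

// Hypothetical anchor type: start/end coordinates on query and reference.
struct Anchor {
    std::size_t query_start, query_end;
    std::size_t ref_start, ref_end;
};

std::size_t hamming_mismatches(std::string_view a, std::string_view b) {
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        mismatches += (a[i] != b[i]);
    }
    return mismatches;
}

// Returns true if the gap between `prev` and `next` is handled by the cheap
// Hamming check; false means a full global alignment of the two segments
// (e.g. through the blockaligner C bindings) is still needed.
// Assumes `next` starts at or after `prev` ends on both sequences.
bool try_hamming_between(const Anchor& prev, const Anchor& next,
                         std::string_view query, std::string_view ref,
                         std::size_t max_mismatches) {
    std::string_view q = query.substr(prev.query_end, next.query_start - prev.query_end);
    std::string_view r = ref.substr(prev.ref_end, next.ref_start - prev.ref_end);
    return q.size() == r.size() && hamming_mismatches(q, r) <= max_mismatches;
}
```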

Heuristics:

  1. piecewise_extension_alignment contains an early-return heuristic. It addresses a problem in strobealign where spurious off-diagonal anchors are added to chains. From my analysis, these spurious anchors are created when the strobemer linkage window shrinks because it reaches the end of the query. When this pattern is detected, the heuristic stops the global alignments between anchors and instead calls align_after_last_anchor as if the end of the chain had been reached. Here is an example of a chain containing such spurious anchors where the heuristic helps find the true best alignment, confirmed by SSW:
[figure: chain_id=0_score=255 40]

To explain the strobemer linkage issue I also made this quick figure:
[figure]
In red, it shows a strobemer match that would have been created if the query were longer; since the query ends there, the strobemer linkage window is shortened, creating the green strobemer instead, which happens to match another strobemer on the reference. The anchors in diagonal position show that the red strobemer, which should have been created, is missing and is replaced by the green, off-diagonal one.

Even though the current heuristic catches this, the problem originates in strobemer linkage and could be fixed there.

  2. Chains are not perfectly on the best alignment path. In piecewise_extension_alignment, before the piecewise process starts, another function, remove_spurious_anchors, is called. It aims to remove the spurious anchors inside the chain. The function checks whether a cluster of anchors creates an indel that is later canceled by another indel of the opposite size (for indels x, y we have x ≈ -y), within a tolerance currently set to 5 (indels smaller than 5 are considered unimportant; only indels larger than 5 count). If two indels cancel each other, the anchors in between are removed from the chain (a sketch of this check appears at the end of this section). Here is an example where this pruning removes spurious anchors from a chain:
[figure: chain_id=1_score=246 49]

remove_spurious_anchors also checks the edges of the chain to prune spurious anchors that are not inside the chain but at its ends. It looks at the first/last 10% of anchors and removes them if they create an indel in the chain. Here is an example of the heuristic removing spurious anchors:
[figure: chain_id=0_score=362 00]
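
Here is a minimal sketch of the indel-cancellation check described in point 2. Only the x ≈ -y idea with tolerance 5 comes from the description above; the Anchor fields, function names, and exact pruning policy are assumptions.

```c++
#include <cstdlib>
#include <vector>

struct Anchor {
    int query_start;
    int ref_start;
};

// Diagonal shift ("indel") introduced when going from anchor a to anchor b.
int indel_between(const Anchor& a, const Anchor& b) {
    return (b.ref_start - a.ref_start) - (b.query_start - a.query_start);
}

// Drop anchors that sit between two indels of roughly opposite size.
std::vector<Anchor> remove_cancelling_indels(std::vector<Anchor> chain, int tolerance = 5) {
    for (std::size_t i = 0; i + 1 < chain.size(); ++i) {
        int x = indel_between(chain[i], chain[i + 1]);
        if (std::abs(x) <= tolerance) continue;  // small indels are ignored
        // Look ahead for an indel y of roughly opposite size (x ~= -y).
        for (std::size_t j = i + 1; j + 1 < chain.size(); ++j) {
            int y = indel_between(chain[j], chain[j + 1]);
            if (std::abs(x + y) <= tolerance) {
                // The two indels cancel: erase the anchors strictly between them.
                chain.erase(chain.begin() + i + 1, chain.begin() + j + 1);
                break;
            }
        }
    }
    return chain;
}
```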

Notes:

  • Chaining now returns a struct called Chain, which replaces Nams (a hypothetical sketch of such a struct appears at the end of these notes).

  • Piecewise currently cannot guarantee finding the best alignment, especially compared to SSW, but it is much faster. To test improvements to piecewise, I made a small dataset for fruit fly that contains reads which will create chains with spurious anchors not handled by the current heuristics inside piecewise.
    bad_chains.fastq.gz
    Here is also a small dataset for fruit fly with reads that don't align as well with piecewise as with SSW, which should be fixable. I believe the issue comes from giving blockaligner block sizes that are too small.
    problems.fastq.gz

  • To generate the graphs, I output the anchors, chains, and alignment information via --trace, then use a custom parser/plotter tool (https://github.com/NicolasBuchin/extract_chains.git) to draw the figures. The tool takes a log file containing the --trace output.
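
Regarding the Chain struct mentioned above: the PR text does not show its definition, so the following is only a hypothetical sketch of what such a struct might bundle; all field names are assumptions and may differ from the actual code.

```c++
#include <vector>

// Hypothetical anchor type, as used in the sketches above.
struct Anchor {
    int query_start;
    int ref_start;
    int length;
};

// Hypothetical Chain struct: the anchors in chain order plus chain-level data.
struct Chain {
    std::vector<Anchor> anchors;  // anchors composing the chain
    float score;                  // collinear chaining score
    int reference_index;          // which reference the chain maps to
    bool is_revcomp;              // orientation of the chain
};
```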

Current Todo list before merging:

  • Benchmark piecewise thoroughly
  • Find new heuristics / refine the current ones to improve performance and close the gap between SSW's and piecewise's alignments.

NicolasBuchin and others added 30 commits September 7, 2025 14:43
…plicated) instead of the HashMap and Set data structures.
…ort in one place (where sorting seems to take most of the time). Unclear if pdqsort is faster on my data though (ChrX only); perhaps it will be for larger references?
…tion.

This commit factors the function collinear_chaining into first the chaining (still named collinear_chaining) and then the traceback (named extract_chains_from_dp).

This enables finding the globally optimal chain score (both FW and RC) before the backtrack, which was not done in previous commits.

This commit, however, does not use the global score, but keeps the previous individual scores using best_score[is_revcomp] (instead of float max_score = std::max(best_score[0], best_score[1]);).

The reason is that, while faster, the global score reduces the alignment score significantly on the (one) dataset I am testing on. However, I believe one could potentially introduce the global score while relaxing --vp to compensate. Further analysis is needed.

Lastly, while this commit should only be a refactoring (keeping identical results), it still changes results. This is because previous commits had `int new_score = dp[j] + score;` while I believe it should be `float new_score = dp[j] + score;`. I verified that this change has a non-negligible effect on the chains returned.
Also, the score cutoff in SE mode uses the actual score and not n_hits.
This leaves accuracy mostly unchanged (though there are a couple of
differences both up and down in SIM6), but will improve both SE and PE
mapping-only score when using chains (this is unmerged).

Is-new-baseline: yes