
@NicolasBuchin (Collaborator)

This PR introduces a piecewise extension algorithm as an option for single-end reads. Piecewise takes chains and performs alignment only between the anchors composing the chain, plus two "x-drop" alignments: one before the first anchor of the chain and one after the last. X-drop is a stopping rule for alignment extension: the extension continues as long as the alignment score does not fall more than X below the best score seen so far; once it does, the extension stops.
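
To make the stopping rule concrete, here is a minimal, self-contained sketch (a simplified one-dimensional scoring walk, not the blockaligner implementation; the names XDropResult and xdrop_extend are illustrative):

```c++
#include <cstddef>
#include <vector>

struct XDropResult {
    int best_score = 0;
    std::size_t best_end = 0;  // how far the extension reached before stopping
};

// Walk over per-position score contributions and stop once the running score
// falls more than `x_drop` below the best score seen so far.
XDropResult xdrop_extend(const std::vector<int>& position_scores, int x_drop) {
    XDropResult result;
    int score = 0;
    for (std::size_t i = 0; i < position_scores.size(); ++i) {
        score += position_scores[i];
        if (score > result.best_score) {
            result.best_score = score;
            result.best_end = i + 1;
        } else if (result.best_score - score > x_drop) {
            break;  // X-drop condition met: stop extending
        }
    }
    return result;
}
```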

In its current state, piecewise is enabled with the flag --pw. The Aligner class now has an align_piecewise function that uses a new class, Piecewise::Aligner, which contains the entire implementation of the piecewise algorithm.

For the global alignments and the x-drop alignments, we use the C bindings of the Rust library blockaligner. There are two helper functions for calling global and x-drop alignments through the C bindings, as well as three helper functions that convert CIGARs from blockaligner into the CIGAR representation used by strobealign.
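
The actual blockaligner C API and strobealign's Cigar class are not reproduced here, but the general shape of such a conversion helper might look like the following sketch, assuming the binding exposes run-length (operation, length) pairs (the OpLen type and to_cigar_string are hypothetical):

```c++
#include <cstdint>
#include <string>
#include <vector>

struct OpLen {          // hypothetical: one run of identical CIGAR operations
    char op;            // 'M', '=', 'X', 'I', 'D', ...
    uint32_t len;
};

// Convert run-length operations into a CIGAR string, optionally collapsing
// '='/'X' into 'M' and merging adjacent runs of the same operation.
std::string to_cigar_string(const std::vector<OpLen>& ops, bool eqx_to_m) {
    std::string cigar;
    char prev_op = 0;
    uint32_t run = 0;
    for (OpLen o : ops) {
        char op = o.op;
        if (eqx_to_m && (op == '=' || op == 'X')) {
            op = 'M';  // collapse match/mismatch into M
        }
        if (op == prev_op) {
            run += o.len;  // merge with the previous run
        } else {
            if (run > 0) cigar += std::to_string(run) + prev_op;
            prev_op = op;
            run = o.len;
        }
    }
    if (run > 0) cigar += std::to_string(run) + prev_op;
    return cigar;
}
```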

The piecewise algorithm is split into three parts: align_before_first_anchor, align_after_last_anchor, and piecewise_extension_alignment. The first two make the x-drop calls and handle the edges of the reference/query. The main function, piecewise_extension_alignment, calls the other two and handles the global alignments between anchors, with a Hamming-alignment heuristic for the segments between anchors.
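
As a rough illustration of the per-gap step, here is a hedged sketch: when the query and reference segments between two consecutive anchors have equal length and few mismatches, a Hamming comparison is enough and no global alignment call is needed; otherwise the two segments still go through a global alignment. The Anchor type, field names, and threshold are assumptions, not strobealign's actual API.

```c++
#include <cstddef>
#include <string_view>

// Hypothetical anchor type: start/end coordinates on query and reference.
struct Anchor {
    std::size_t query_start, query_end;
    std::size_t ref_start, ref_end;
};

std::size_t hamming_mismatches(std::string_view a, std::string_view b) {
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        mismatches += (a[i] != b[i]);
    }
    return mismatches;
}

// Returns true if the gap between `prev` and `next` is handled by the cheap
// Hamming check; false means a full global alignment of the two segments
// (e.g. through the blockaligner C bindings) is still needed.
// Assumes `next` starts at or after `prev` ends on both sequences.
bool try_hamming_between(const Anchor& prev, const Anchor& next,
                         std::string_view query, std::string_view ref,
                         std::size_t max_mismatches) {
    std::string_view q = query.substr(prev.query_end, next.query_start - prev.query_end);
    std::string_view r = ref.substr(prev.ref_end, next.ref_start - prev.ref_end);
    return q.size() == r.size() && hamming_mismatches(q, r) <= max_mismatches;
}
```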

Heuristics:

  1. piecewise_extension_alignment contains an early-return heuristic. It addresses a problem in strobealign where spurious off-diagonal anchors are added to chains. From my analysis, these spurious anchors are created when the strobemer linkage window shrinks because it reaches the end of the query. When this pattern is detected, the heuristic stops the global alignments between anchors and instead calls align_after_last_anchor as if the end of the chain had been reached. Here is an example of a chain containing such spurious anchors where the heuristic helps find the true best alignment, confirmed by SSW:
[figure: chain_id=0_score=255 40]

To explain the strobemer linkage issue I also made this quick figure:
[figure]
In red, it shows a strobemer match that would have been created if the query were longer; since the query ends there, the strobemer linkage window is shortened, creating the green strobemer instead, which happens to match another strobemer on the reference. The anchors in diagonal position show that the red strobemer, which should have been created, is missing and is replaced by the green, off-diagonal one.

Even though the current heuristic catches this, the problem originates in strobemer linkage and could be fixed there.

  2. Chains are not perfectly on the best alignment path. In piecewise_extension_alignment, before the piecewise process starts, another function, remove_spurious_anchors, is called. It aims to remove the spurious anchors inside the chain. The function checks whether a cluster of anchors creates an indel that is later canceled by another indel of the opposite size (for indels x, y we have x ≈ -y), within a tolerance currently set to 5 (indels smaller than 5 are considered unimportant; only indels larger than 5 count). If two indels cancel each other, the anchors in between are removed from the chain (a sketch of this check appears at the end of this section). Here is an example where this pruning removes spurious anchors from a chain:
[figure: chain_id=1_score=246 49]

remove_spurious_anchors also checks the edges of the chain to prune spurious anchors that are not inside the chain but at its ends. It looks at the first/last 10% of anchors and removes them if they create an indel in the chain. Here is an example of the heuristic removing spurious anchors:
[figure: chain_id=0_score=362 00]
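
Here is a minimal sketch of the indel-cancellation check described in point 2. Only the x ≈ -y idea with tolerance 5 comes from the description above; the Anchor fields, function names, and exact pruning policy are assumptions.

```c++
#include <cstdlib>
#include <vector>

struct Anchor {
    int query_start;
    int ref_start;
};

// Diagonal shift ("indel") introduced when going from anchor a to anchor b.
int indel_between(const Anchor& a, const Anchor& b) {
    return (b.ref_start - a.ref_start) - (b.query_start - a.query_start);
}

// Drop anchors that sit between two indels of roughly opposite size.
std::vector<Anchor> remove_cancelling_indels(std::vector<Anchor> chain, int tolerance = 5) {
    for (std::size_t i = 0; i + 1 < chain.size(); ++i) {
        int x = indel_between(chain[i], chain[i + 1]);
        if (std::abs(x) <= tolerance) continue;  // small indels are ignored
        // Look ahead for an indel y of roughly opposite size (x ~= -y).
        for (std::size_t j = i + 1; j + 1 < chain.size(); ++j) {
            int y = indel_between(chain[j], chain[j + 1]);
            if (std::abs(x + y) <= tolerance) {
                // The two indels cancel: erase the anchors strictly between them.
                chain.erase(chain.begin() + i + 1, chain.begin() + j + 1);
                break;
            }
        }
    }
    return chain;
}
```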

Notes:

  • Chaining now returns a struct called Chain, which replaces Nams (a hypothetical sketch of such a struct appears at the end of these notes).

  • Piecewise currently cannot guarantee finding the best alignment, especially compared to SSW, but it is much faster. To test improvements to piecewise, I made a small dataset for fruit fly that contains reads which will create chains with spurious anchors not handled by the current heuristics inside piecewise.
    bad_chains.fastq.gz
    Here is also a small dataset for fruit fly with reads that don't align as well with piecewise as with SSW, which should be fixable. I believe the issue comes from giving blockaligner block sizes that are too small.
    problems.fastq.gz

  • To generate the graphs, I output the anchors, chains, and alignment information via --trace, then use a custom parser/plotter tool (https://github.com/NicolasBuchin/extract_chains.git) to draw the figures. The tool takes a log file containing the --trace output.
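
Regarding the Chain struct mentioned above: the PR text does not show its definition, so the following is only a hypothetical sketch of what such a struct might bundle; all field names are assumptions and may differ from the actual code.

```c++
#include <vector>

// Hypothetical anchor type, as used in the sketches above.
struct Anchor {
    int query_start;
    int ref_start;
    int length;
};

// Hypothetical Chain struct: the anchors in chain order plus chain-level data.
struct Chain {
    std::vector<Anchor> anchors;  // anchors composing the chain
    float score;                  // collinear chaining score
    int reference_index;          // which reference the chain maps to
    bool is_revcomp;              // orientation of the chain
};
```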

Current Todo list before merging:

  • Benchmark piecewise thoroughly
  • Find new heuristics / refine the current ones to improve performance and close the gap between SSW's and piecewise's alignments.

NicolasBuchin and others added 30 commits September 7, 2025 14:43
…plicated) instead of the HashMap and Set data structures.
…ort in one place (where sorting seems to take most of the time). Unclear if pdqsort is faster on my data though (ChrX only); perhaps it will be for larger references?
…tion.

This commit factors the function collinear_chaining into first the chaining (still named collinear_chaining) and then the traceback (named extract_chains_from_dp).

This enables finding the globally optimal chain score (both FW and RC) before the backtrack, which was not done in previous commits.

This commit, however, does not use the global score, but keeps the previous individual scores using best_score[is_revcomp] (instead of float max_score = std::max(best_score[0], best_score[1]);).

The reason is that, while faster, the global score reduces the alignment score significantly on the (one) dataset I am testing on. However, I believe one could potentially introduce the global score while relaxing --vp to compensate. Further analysis is needed.

Lastly, while this commit should only be a refactoring (keeping identical results), it still changes results. This is because previous commits had `int new_score = dp[j] + score;` while I believe it should be `float new_score = dp[j] + score;`. I verified that this change has a non-negligible effect on the chains returned.
Also, the score cutoff in SE mode uses the actual score and not n_hits.
This leaves accuracy mostly unchanged (though there are a couple of
differences both up and down in SIM6), but will improve both SE and PE
mapping-only score when using chains (this is unmerged).

Is-new-baseline: yes