Piecewise #533
Open
NicolasBuchin wants to merge 66 commits into main from piecewise
Conversation
…plicated) instead of the HashMap and Set DS.
…ort in one place (where sorting seems to take most of the time). Unclear if pdqsort is faster on my data though (ChrX only); perhaps it will be for larger references?
…tion. This commit factors the function collinear_chaining into first chaining (still named collinear_chaining), then traceback (named extract_chains_from_dp). This enables finding the globally optimal chain score (both FW and RC) before traceback, which was not done in previous commits. This commit however does not use the global score, but keeps the previous individual scores using best_score[is_revcomp] (instead of `float max_score = std::max(best_score[0], best_score[1]);`). The reason is that, while faster, the global score reduces the alignment score significantly on the (one) dataset I am testing on. However, I believe one could potentially introduce the global score while relaxing --vp to compensate. Further analysis is needed.

Lastly, while this commit should only be a refactoring (keeping identical results), it still changes results. This is because previous commits had `int new_score = dp[j] + score;` while I believe it should be `float new_score = dp[j] + score;`. I verified that this change has a non-negligible effect on the chains returned.
Is-new-baseline: yes
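To make the int-vs-float point in the commit message above concrete, here is a minimal sketch of a chaining DP update. The surrounding loop, the scoring, and the names are simplified placeholders, not the actual strobealign code:

```cpp
#include <cstddef>
#include <vector>

// Simplified chaining DP update illustrating the type issue. `dp[j]` and
// `score` are floats; storing their sum in an int truncates the fractional
// part and can change which predecessor j wins, hence different chains.
void update_cell(std::vector<float>& dp, size_t i, size_t j, float score) {
    float new_score = dp[j] + score;  // previously: int new_score = dp[j] + score;
    if (new_score > dp[i]) {
        dp[i] = new_score;
    }
}
```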
Also, the score cutoff in SE mode uses the actual score and not n_hits. This leaves accuracy mostly unchanged (though there are a couple of differences, both up and down, in SIM6), but will improve both SE and PE mapping-only score when using chains (this is unmerged). Is-new-baseline: yes
This PR introduces a piecewise extension algorithm as an option for single-end reads. Piecewise takes chains and performs alignment only between the anchors composing the chain, as well as two "x-drop" alignments before the first anchor of the chain and after the last. X-drop is an alignment stopping rule where extension continues as long as the alignment score does not fall more than X below the best score seen so far; if it does, the extension stops.
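For illustration, here is a minimal, self-contained sketch of such an x-drop stopping rule on a gap-free extension. This is not the implementation used in the PR (which delegates x-drop alignment to blockaligner); the scoring values and names are assumptions:

```cpp
// Minimal sketch of an x-drop stopping rule (illustration only). Extension
// proceeds base by base and stops once the running score drops more than
// `xdrop` below the best score seen so far.
#include <algorithm>
#include <string>

struct XdropResult {
    int best_score = 0;    // best score reached before stopping
    size_t best_len = 0;   // number of query bases consumed at the best score
};

// Hypothetical gap-free extension: +1 for a match, -2 for a mismatch.
XdropResult xdrop_extend(const std::string& query, const std::string& ref, int xdrop) {
    XdropResult res;
    int score = 0;
    size_t n = std::min(query.size(), ref.size());
    for (size_t i = 0; i < n; ++i) {
        score += (query[i] == ref[i]) ? 1 : -2;
        if (score > res.best_score) {
            res.best_score = score;
            res.best_len = i + 1;
        }
        if (score < res.best_score - xdrop) {
            break;  // x-drop condition: fell too far below the best score
        }
    }
    return res;
}
```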
In the current state, piecewise is available with the flag `--pw`. The `Aligner` class now has an `align_piecewise` function that calls a new class, `Piecewise::Aligner`. This class contains the entire implementation of the piecewise algorithm.

For the global alignments and the x-drop alignments, we use the C bindings of the Rust library blockaligner. There are two helper functions for calling global and x-drop alignments using the C bindings, as well as three helper functions to convert CIGARs from blockaligner to the CIGARs used in strobealign.
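The exact helper signatures and blockaligner C API calls are not reproduced here, but as a hedged sketch of one part of the CIGAR handling involved: stitching per-segment CIGARs into one read-level CIGAR requires merging adjacent operations of the same type. The types and function names below are assumptions, not strobealign's actual code:

```cpp
// Self-contained sketch of CIGAR stitching between piecewise segments.
#include <cstdint>
#include <string>
#include <vector>

struct CigarOp {
    char op;        // 'M', '=', 'X', 'I', 'D', 'S', ...
    uint32_t len;
};

// Append `piece` to `dst`, merging the boundary operation if both sides
// end/start with the same op. This is what concatenating per-gap alignments
// into one read-level CIGAR requires.
void append_cigar(std::vector<CigarOp>& dst, const std::vector<CigarOp>& piece) {
    for (const CigarOp& c : piece) {
        if (!dst.empty() && dst.back().op == c.op) {
            dst.back().len += c.len;
        } else {
            dst.push_back(c);
        }
    }
}

// Render as a standard CIGAR string, e.g. "37=1X62=".
std::string cigar_to_string(const std::vector<CigarOp>& cigar) {
    std::string s;
    for (const CigarOp& c : cigar) {
        s += std::to_string(c.len);
        s += c.op;
    }
    return s;
}
```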
The piecewise algorithm is split into three parts: `align_before_first_anchor`, `align_after_last_anchor`, and `piecewise_extension_alignment`. The first two handle making the x-drop calls as well as verifying the edges of the reference/query. The main function, `piecewise_extension_alignment`, calls the two previous ones and also handles the global alignments in between anchors, with a global Hamming alignment heuristic.
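As a rough illustration of one plausible reading of the Hamming heuristic for the gaps between consecutive anchors (when the query and reference gaps have equal length, a cheap mismatch count can replace a full global alignment), here is a self-contained sketch. The anchor fields, the threshold parameter, and the function name are assumptions:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Assumed anchor representation: half-open coordinates on query and reference.
struct Anchor {
    size_t query_start, query_end;
    size_t ref_start, ref_end;
};

// Decide whether the gap between two consecutive, non-overlapping anchors can
// be handled by a Hamming comparison instead of a global alignment. Returns
// true if the shortcut applies (same gap length and few enough mismatches).
bool try_hamming_shortcut(const std::string& query, const std::string& ref,
                          const Anchor& prev, const Anchor& curr,
                          size_t max_mismatches /* hypothetical threshold */) {
    size_t qlen = curr.query_start - prev.query_end;
    size_t rlen = curr.ref_start - prev.ref_end;
    if (qlen != rlen) {
        return false;  // gap lengths differ: an indel is involved, need global alignment
    }
    size_t mismatches = 0;
    for (size_t i = 0; i < qlen; ++i) {
        if (query[prev.query_end + i] != ref[prev.ref_end + i]) {
            ++mismatches;
        }
    }
    return mismatches <= max_mismatches;
}
```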
Heuristics:

To explain the strobemer linkage issue, I also made this quick figure:

It shows in red a strobemer match that would have been created if the query were longer; but since the query ends there, the window for strobemer linkage is shortened, creating instead the green strobemer, which happens to match another strobemer on the reference. The anchors shown at the diagonal position show that the red strobemer, which should have been created, is missing and replaced by the green one, which is off-diagonal.
Even though the current heuristic catches this, the problem originates in strobemer linkage and could be fixed there.
In `piecewise_extension_alignment`, before starting the piecewise process, another function is called, `remove_spurious_anchors`. This function aims at removing spurious anchors inside the chain. It checks whether a cluster of anchors creates an indel that is later canceled by another indel of the opposite size (for indels x, y we have x ≈ -y), within a tolerance value set at 5 for now (any indel of 5 or less is considered unimportant, and only an indel > 5 counts). If two indels cancel each other, the anchors in between are removed from the chain. Here is an example where this pruning helps remove spurious anchors from a chain:

`remove_spurious_anchors` also verifies the edges of the chain to prune spurious anchors that are not inside the chain but at its ends. It looks at the first/last 10% of anchors and removes them if they create an indel in the chain. Here is an example of the heuristic removing spurious anchors:
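Separately from the example figures, here is a rough sketch of the indel-cancellation check described above. The anchor fields, the diagonal-based indel computation, and the scan order are assumptions; only the tolerance of 5 is taken from the description:

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Assumed anchor representation: start positions on query and reference.
struct Anchor {
    int query_start;
    int ref_start;
};

// The "indel" introduced between two consecutive anchors is the change in
// diagonal (ref offset minus query offset) from one anchor to the next.
static int indel_between(const Anchor& a, const Anchor& b) {
    return (b.ref_start - b.query_start) - (a.ref_start - a.query_start);
}

// If anchors i..i+1 introduce an indel of size x (|x| > tolerance) and a
// later pair j..j+1 introduces an indel of roughly -x (within the tolerance),
// the anchors strictly between them are treated as spurious and dropped.
std::vector<Anchor> remove_cancelling_indels(std::vector<Anchor> chain, int tolerance = 5) {
    for (size_t i = 0; i + 1 < chain.size(); ++i) {
        int x = indel_between(chain[i], chain[i + 1]);
        if (std::abs(x) <= tolerance) {
            continue;  // small shifts are ignored
        }
        for (size_t j = i + 1; j + 1 < chain.size(); ++j) {
            int y = indel_between(chain[j], chain[j + 1]);
            if (std::abs(x + y) <= tolerance) {
                // x and y cancel each other: drop the anchors between the
                // two indel events and keep scanning the shortened chain.
                chain.erase(chain.begin() + i + 1, chain.begin() + j + 1);
                break;
            }
        }
    }
    return chain;
}
```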
Notes:

Chaining now returns a struct called `Chain`, which replaces `Nam`s.
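Purely as an illustration of what such a struct might carry (none of these fields are confirmed by the PR text; everything below is an assumption):

```cpp
#include <vector>

// Assumed anchor type: matching positions on query and reference.
struct Anchor {
    int query_start, query_end;
    int ref_start, ref_end;
};

// Hypothetical shape of a Chain: an ordered list of anchors plus
// chain-level bookkeeping, replacing the per-match Nam records.
struct Chain {
    std::vector<Anchor> anchors;  // anchors in chain order
    int ref_id = 0;               // reference the chain lies on
    float score = 0.0f;           // chaining score
    bool is_revcomp = false;      // orientation of the match
};
```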
Piecewise currently cannot retrieve the best alignment with a 100% guarantee, especially compared to SSW, but it is much faster. To test improvements to piecewise, I made a small dataset for fruitfly which contains reads that create chains with spurious anchors not handled by the current heuristics inside piecewise.
bad_chains.fastq.gz
Here is also a small dataset for fruitfly with reads that don't align as well with piecewise as with SSW, which should be fixable. I believe the issue comes from giving blockaligner block sizes that are too small.
problems.fastq.gz
To generate the graphs, I output the anchors, chains, and alignment information in the `--trace` output, then used a custom parser/plotter tool, https://github.com/NicolasBuchin/extract_chains.git, to draw the figures from that information; the tool takes in a log file with the output from `--trace`.

Current Todo list before merging: