This notebook will track the progress of my work in the lab.
To build the panMAN using Dockerfile, see panmania/build_panman/ directory.
To build the panMAN see notes on 2/2/2025.
PanMAN built using Dockerfile is broken. Currently using Summit's docker image.
Today I want to start by examining the alignments of sequences on a panMAN. I will write scripts that would compare the global alignments of sequences from the panMAN to the alignments of sequences using mafft. I expect to see that the mafft alignments should be better than from the panMAN because panMAN alignments are from MSA of all the sequences while mafft only compares two sequences at a time. Big differences, however, indicate particular problematic alignments, likely on the block level.
Inside evaluate_alignments, parse_newick_to_pairs.py is a script that would parse the newick tree and output the
parent-child pairs in depth first order. The sequences of each parent-child pair will be compared.
Following commands will generate the pairs, run the alignment comparison, and plot the alignment differences.
python parse_newick_to_pairs.py data/hiv/hiv_panman.newick > data/hiv/hiv20000_pairs.tsv
sbatch evaluate_alignments/evaluate_alignments_hiv.sh
python evaluate_alignments/plot_alignment_diff.py \
evaluate_alignments/out/hiv20000_alignment_differences.tsv \
evaluate_alignments/data/hiv/hiv20000_pairs.tsv \
hiv \
evaluate_alignments/out/hiv20000_alignment_differences.png
For results, see panmania/evaluate_alignments.
I realized that there are newer versions of panmans availble. I'm going to build the latest version of panman and look at the alignments comparison again for newer panmans.
Build using Sumit's Docker image.
Set up
cd /private/groups/corbettlab/alan/panmania/
git clone https://github.com/TurakhiaLab/panman.git
cd panman
Pull the PanMAN docker image from DockerHub
docker pull swalia14/panman:latest
Run the docker container and mount the panman directory
docker run -it -v .:/local_panman swalia14/panman:latest
Edit the CMakeLists.txt by adding the following lines between TARGET_LINK_LIBRARIES(...) and
target_include_directories(...)
add_custom_command(TARGET panmanUtils POST_BUILD
COMMAND mkdir -p ${CMAKE_BINARY_DIR}/lib
COMMAND cp /usr/local/lib/libcapnp*.so* ${CMAKE_BINARY_DIR}/lib/
COMMAND cp /usr/local/lib/libkj*.so* ${CMAKE_BINARY_DIR}/lib/
COMMAND cp /usr/lib/x86_64-linux-gnu/libboost*.so* ${CMAKE_BINARY_DIR}/lib/
COMMAND cp /usr/lib/x86_64-linux-gnu/libprotobuf*.so* ${CMAKE_BINARY_DIR}/lib/
COMMAND cp ${CMAKE_BINARY_DIR}/tbb_cmake_build/tbb_cmake_build_subdir_release/libtbb*.so* ${CMAKE_BINARY_DIR}/lib/
)
Run the install script
./install/installationUbuntu.sh
Add this line to .bashrc and now panmanUtils can be ran outside the docker container
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/private/groups/corbettlab/alan/panmania/panman/build/lib
Will develop panMANIA using this container for now.
- Make a figure that shows the correlation between the pseudo-chaining score and the alignment score.
- Fix
position_info_by_haplotypebug inheuristic_v6.py. - Complete the alignment quality assessment for new panMANs.
I have started writing the specific aim section of my thesis proposal. For aim 1, other than figures for accuracy assessment, I think I can also have figure that shows the correlation between the pseudo-chaining score and the alignment score.
When I'm in the progress of formalizing the method for calling consensus, I noticed a pretty significant bug in
heuristic_v6.py that gives inconsistent position_info_by_haplotype for the same haplotype.
position_info_by_haplotype is a dictionary that stores other haplotypes' position info at a specific aligned position.
Will need to fix this before I can continue with the formalization.
The panMAN gives Exceeded message traversal limit error for SARS and HIV trees. The reason is that the default message
traversal limit is too small.
To fix this, replace uint64_t traversalLimitInWords = 8 * 1024 * 1024 * 128 in
/home/capnproto-c++-1.0.2/src/capnp/message.h with uint64_t traversalLimitInWords = 8 * 1024 * 1024 * 256.
Then run command below to rebuild capnproto.
make clean && ./configure && make -j && make install
As a continuation of 2/1/2025, I need to assess the alignment quality of the new panMANs. Writing and split the fasta
files for nodes on the panMANs are currently taking too long for SARS_20000 and HIV20000.
- Complete the alignment quality assessment for new panMANs.
- Make a figure that illustrates the effectiveness of pseudo-chaining scores at discerning between sequences.
-
Make a figure that shows the correlation between the pseudo-chaining score and the alignment score.need to find a better way to illustrate this -
FixWas redefiningposition_info_by_haplotypebug inheuristic_v6.py.posin debug print statements.
To make the pseudo-chaining score vs sequence similarity figure (refer to 2/3/2025), I'm going to write a script that
would give all 150-bp kmers from a random node (preferably containing no ambiguous bases, OM857280.1) and generate 50
mutated kmers from each kmer, each containg 1 - 50 mutations. Then I will calculate align them using minimap2 and
pseudo-chaining and compare their scores. Scripts and data are placed in
/private/groups/corbettlab/alan/lab_notebook/panmama/pseudo_chaining-vs-sequence_similarity.
The final figure is saved in panmama/pseudo_chaining-vs-sequence_similarity/pseudo_chaining_vs_seq_similarity.png. It
doesn't look very good... A better to illustrate the effectiveness of pseudochaining is to show how it's able to discern
between haplotypes, since we only care about if it's sensitive enough to tell different sequences apart.
- Finish writing the first draft of the specific aim section of my thesis proposal.
- Really do finish formalizing the method for calling consensus.
I talked a little about panMANIA then the consensus calling step with Russ. Russ gave a really good sugesstion on assuming there no back mutations. See notes below in panMAMA section.
I just plotted the alignment differences for the new SARS and RSV panMANs (HIV is skipped because it's taking too long. Seems like the aligned sequences in the new HIV panMAN are very long). The new SARS panMAN looks pretty identical to the old SARs panMAN but the new RSV panMAN looks worse than the old one. See panmania/evaluate_alignments for more details.
Maybe it is worth to also fix the alignments in panMAN for Aim 2.
Notes below need to be beautified and executed. This is a place holder so I don't forget
I have an idea to identify problematic block states. Will try to implement and see how it goes.
- Record positions of mismatches in the panMAN alignment. Merge positions into ranges. Identify nearby ranges with identical lengths of mismatches.
- Nearby ranges with identical lengths of mismatches are likely to be problematic block states.
- Try to align those two ranges together and see how well they align.
Notes below need to be beautified and executed. This is a place holder so I don't forget
Russ has a good idea to call consensus
- Assume there's not back mutation
- If there is a cryptic mutation, assign that cryptic mutation heuristically
- If there are haplotypes where that cryptic mutation is assigned to that haplotype only, assign it.
- If not use maximum likelihood.
- If no cryptic mutation, assign reference
- Really do finish formalizing the method for calling consensus.
-
Finish writing the first draft of the specific aim section of my thesis proposal.See thesis_proposal/directory for the draft
I finished the first draft of my specific aims section for my thesis proposal. Aim 3 still needs a lot of work. See thesis_proposal/ section for the first draft.
With the limited time I have today, I will write up the script for identifying problematic block states that cause large chunks of misalignment. See panmania/evaluate_alignments/ for details
- Really do finish formalizing the method for calling consensus.
- Quantify the prevalence of problematic block states
Working on quantifying the prevalence of problematic block states
-
Really do finish formalizing the method for calling consensus. -
Quantify the prevalence of problematic block states
Today I will try to finish formalizing consensus calling...
Finally finished formalizing consensus calling. See panmama/consensus_calling/ for sample scripts.
Pseudocode:
for line in vcf_all:
possible_alleles = all alleles with depth > abundance - tolerance
split reads into groups by what other haplotypes they are assigned to
if an allele-determining group (reads assigned to only the current haplotype) exists:
assign the allele aligned in the allele-determining group
else
iterate through the groups of reads and calculate the likelihoods of the current haplotype having each possible
allele
split the groups of reads futher into groups by the number of haplotypes they are assigned to.
within each group split by the number of haplotypes they are assigned to, select the subgroup with the highest
likelihood ratio. This is the group's representative likelihood ratio
accumulate the likelihood ratio for each possible allele across the groups
if highest accumulated likelihood ratio - second highest accumluated likelihood ratio > threshold:
assign allele with the highest accumulate likelihood ratio
else:
assign reference allele
I fininally finished quantifying the prevalence of problematic block states (see panmania/evaluate_alignments/). It looks like it's definitely worth fixing the misaligned blocks or representing them differently (a different layer of block coordinate?). Will talk to Russ about it.
Fixing this should also substantially improve the runtime of panmap and panMAMA.
- Characterize errors in consensus calling in panMAMA
-
Develop a formal pipeline for evaluating panMAN internal alignments
Today I will manually look through the output of consensus calling and characterize the errors. See panmama/consensus_calling/ for detail.
Idea: perhaps I can also calculate the average sequence similarity of the reference and reads supporting each allele.
I think I'll also assess and plot the threshold for assigning alleles.
Steps for developing a formal pipeline for testing panMAN internal alignments.
-
Add a function in panMAN to output the ranges of all blocks
-
Add a function in panMAN to randomly selectly N pairs of nodes and output their aligned sequences and unaligned sequences
-
Run my other scripts detailed in panmania/evaluate_alignments/
-
Characterize errors in consensus calling in panMAMA
See panmama/consensus_calling/ for details.
- Improve panMAMA consensus calling
I fixed several issues in the consensus calling step.
-
Implemented more clever way to handle when there is only one read group or all read groups support the same allele.
-
Partially incorporated read to reference sequence similairty into consensus calling. Currently support raw counts but can take position into account as well (counting reads mapped to the same position only once).
-
Still need to implement the 2nd round of ambiguity resolution.
-
Need to implement a way to handle indels.
- Finalize panMAMA consensus calling
For the past few days, I've been improving panMAMA consensus calling. I think I've improved it enough to start finalizing the method.
Some of the trivial things to finish are:
-
implement indel handling
-
Add a likelihood ratio test for cases where more than one haplotype could possibly have the cryptic mutation
-
Perhaps output a file that lists all possible variants and their likelihoods in different haplotypes
Columns:
- #Type (cryptic, non-cryptic)
- REF
- ALT
- INFO (in the format Chrom:Pos:LikelihoodRatio)
I think I will also plot a PR curve for the consensus calling. nvm I tried and it looked ugly.
-
Finalize panMAMA consensus calling (see notes on 2/19/2025)
-
Optimize panMAMA runtime
-
Evaluate panMAMA using real data
-
Implement homopolymer compressed syncmers
-
Implement sorted and merged syncmer/kminmer index if possible
-
Make a figure of read assignments for RussSee panmama/illustrating_figures_andor_diagrams/
I'm going to do some final optimization of panMAMA runtime. This will be done on silverbullet for efficiency.
Keeping the command and options here so I don't have to re-write them.
panmap example.panman example_R1.fastq example_R2.fastq --place-per-read --redo-read-threshold 0 --em-filter-round 2 \
--remove-threshold 0.01 --rounds-remove 5 --preem-filter-method mbc --save-kminmer-binary-coverage
After meeting with Richard Durbin, we got some good ideas to further improve accuracy and speed. We can compress homopolymers when sketching syncmers, and this could reduce syncmer errors due to homopolymer related sequencing errors. Additionally, we can implement a sorted and merged syncmer/kminmer index as direct hash lookup, while has a O(1) lookup time, is much slower than linear lookup of adjacent hash values.
-
Optimize panMAMA runtime (high priority)
-
Modify panmap build and src files to read new panMANs (high priority)
-
Evaluate panMAMA using real data (high priority)
-
Finalize panMAMA consensus calling (see notes on 2/19/2025) (high priority)
-
Implement homopolymer compressed syncmers (low priority)
-
Implement sorted and merged syncmer/kminmer index if possible (low priority)
I'm going to graph the distribution of the types of kminmer changes. Specifically, the number of kminmer changes for a read and the type of kminmer changes (Whether the change had added a kminmer to the chain, removed a kminmer from the chain, or didn't change the chain). See panmama/run_time_optimization/ for details.
-
Optimize panMAMA runtime and memory usage (high priority)
-
Modify panmap build and src files to read new panMANs (high priority)
-
Evaluate panMAMA using real data (high priority)
-
Finalize panMAMA consensus calling (see notes on 2/19/2025) (high priority)
-
Implement homopolymer compressed syncmers (low priority)
-
Implement sorted and merged syncmer/kminmer index if possible (low priority)
I will install and test out a profiler (linaro map) today to see where panMAMA is spending most of its time. See panmama/run_time_optimization/ for more details.
-
Optimize panMAMA runtime and memory usage (high priority)
-
Modify panmap build and src files to read new panMANs (high priority)
-
Evaluate panMAMA using real data (high priority)
-
Finalize panMAMA consensus calling (see notes on 2/19/2025) (high priority)
-
Implement homopolymer compressed syncmers (low priority)
-
Implement sorted and merged syncmer/kminmer index if possible (low priority)
I think I've finished optimizing panMAMA runtime and memory usage for the time being. Now, panMAMA scores 1 million reads against 20,000-sample SARS tree in 5 minutes using 9GB of memory. EM step takes ~14 minutes.
Now to evaluate panMAMA using real data. See panmama/real_data/ for details.
-
Evaluate panMAMA using real data (high priority)
-
Implement coverage filter for panMAMA (high priority)
-
Modify panmap build and src files to read new panMANs (high priority)
I have been trying different ways to analyze the RSV data that Marc sent me. They are looking a little better now that I ran reads generated from RSVA priemrs and RSVB primers separately. Assuming Marc's assignment of mixture/isolated is correct, the false positives from panMAMA are because of low depth over all or high depth of likely contaminants at one or two loci. This is actually very useful information, as it's probably a good idea to consider coverage in the tool during demixing (set a filter to ignore haplotypes with coverage less than some threshold, [at least for amplicon samples]).
See my notes for details.
There are some cases I'm not sure when the samples would have 40-50% coverage with low depth in both RSV A and RSV B, I have these highlighted in bright yellow in the notes column. I also found two instances of "false negatives" where I had assigned them to be unmixed while you assigned them to be mixed:
RSV00174
RSV00220
After manual inspection of these two samples on the IGV, I actually found them to be unmixed (their alignments on IGV look like the false positives where they have high depth at only one or two loci). They are also highlighted in bright yellow.
It seems like a good idea for me to also incorporate kminmer overlap into prefiltering the nodes before EM (at least for amplicon samples). Most of the false positives are also cases where contaminant read stacks map to one or two loci on one of the RSV types.
Will stare at the google sheet for a while to see if I can find any other patterns.
-
Evaluate panMAMA using real data (high priority)
-
Implement coverage filter for panMAMA (high priority)
-
Look over Faith's code and add more plots (high priority)
-
Modify panmap build and src files to read new panMANs (high priority)
I discussed with Marc our reasonings for determining whether the RSV samples are mixed or unmixed. My logic is that if there are truly both RSVA and RSVB present in sample, then the reads generated from RSVA primers should map well to RSVA ref and cover most of the genome, and the reads generated from RSVB primers should map well to RSVB ref and cover most of the genome. On the other hand, samples where reads only map well to a few regions on the genome likely contain contaminants that get amplified in the PCR process. Marc, on the other hand, offers an alternative interpretation that we should believe what the data is telling us:
The target genomes in any of the samples (not just the mixed/co-infected ones) may not always be full-length (due to sample degradation over time(?)), and may not always be present in equimolar amounts (along the genome for one subtype). For me [Marc], I find it much harder to understand how RSVA or RSVB contaminants could be introduced in an essentially stochastic way among 192 samples all collected on different dates in a variety of locations and processed on different dates. You might want to discuss these alternatives (and others) with Russ.
I will discuss this with Russ.
Thanks to Faith Okamoto's amazing rotation project, I have some really good preliminary data and code to work with. I copied over Faith's google docs to my google drive in case they some how disappear from the internet.
I will generate some additional figures as preliminary data for my advancement.
See panmania/imputation/ for details.
-
Look over Faith's code and add more plots (high priority)See panmania/imputation/ for details -
Evaluate panMAMA using real data (high priority)
-
Implement coverage filter for panMAMA (high priority)
-
Modify panmap build and src files to read new panMANs (high priority)
I've generated some additional plots for my advancement. See panmania/imputation/ for details. In a nutshell, I made 2 figures, 1 to show the prevalence of missing data in different panMANs in average number of Ns per genome, and 1 to show the distribution of the sizemissing data neighborhoods. Missing data neighorhoods are coordinates with missing data that occur in contiguous nodes.
I will also begin exploring existing annotation tools to get a sense of how to start with panMAN annotation.
See panmania/annotation/ for details.
It's been a while since I've added to this notebook. I think I should probably start again.
Today, I am going to continue to explore ways to address the memory problem when running panMAMA with the 8M tree.
I tried out sorting the reads using a debruijn graph yesterday. The idea is that reads whose kminmers are close to each other on the graph will tend to be updated together and have similar updates. This way, it will both improve the cache locality during place and allows clever tricks to reduce the memory of score-annotated index.
Memory can be reduced by having a 32-bit startIndex, 16-bit scoreDelta, and 64-bit trailingScoresDelta that stores the differences in scoreDelta with respect to the 16-bit scoreDelta in the trailing 16 sorted reads, using 4 bits per reads. This offers ~6.58 bits per read (in best case scenario) as opposed to 64-bit per read I had previously, with 32-bit readIndex and 32-bit scoreDelta.
Here is a table of estimated memory reduction under different threads using 1 million SARS reads and 20K-SARS tree. The old method, which uses 64 bits per read, will use ~1,459 Mb. I've also attached the time for placement for additional information.
Note: 20K-SARS tree doesn't have any memory issues. I'm using it here to measure relatively how much memory will be reduce. The problem is the 8M-SARS tree. 8M-SARS tree currently is F'ed-up and being fixed by Sumit.
| Threads | Time (s) | Memory (Mb) | Fraction of original | Median Thread Time | Min Thread Time | Max Thread Time | (Max - Min) Thread Time |
|---|---|---|---|---|---|---|---|
| 1 | 183 | 241 | 0.164 | 1 | 1 | 1 | 0 |
| 2 | 155 | 241 | 0.165 | 155 | 132 | 155 | 22 |
| 4 | 77 | 241 | 0.164 | 46 | 29 | 77 | 47 |
| 8 | 89 | 242 | 0.165 | 33 | 21 | 89 | 56 |
| 16 | 52 | 242 | 0.165 | 16 | 13 | 52 | 39 |
| 32 | 65 | 242 | 0.165 | 10 | 8 | 65 | 57 |
| 64 | 39 | 241 | 0.164 | 6 | 5 | 40 | 35 |
It seems like we will get ~6x smaller memory from the read scores, which is not bad. However, dividing up the reads evenly after sorting has caused some unbalanced workload between the threads. Table below provides runtime information with shuffled reads for reference (the original version before sorting).
| Threads | Time (s) | Median Thread Time | Min Thread Time | Max Thread Time | (Max - Min) Thread Time |
|---|---|---|---|---|---|
| 1 | 265 | 265 | 265 | 265 | 0 |
| 2 | 110 | 110 | 104 | 110 | 5 |
| 4 | 58 | 56 | 54 | 57 | 3 |
| 8 | 33 | 30 | 29 | 33 | 4 |
| 16 | 19 | 18 | 17 | 19 | 3 |
| 32 | 14 | 11 | 11 | 14 | 3 |
The unbalanced workflow between threads increases the runtime too much for multiple threads. 7X reduced memory is good but might not be worth it in smaller trees, such as the SARS-20K, RSV-4K, and HIV-20K. After talking with Russ, we decided to implement a low-memory mode, which will be recommended to turn on for the 8M-SARS tree but not for the smaller trees.
Implementation in progress: 5368e898b7da68a3b7f1c358df2b484c1141fc8f
It's probably a good idea to also store the number of tailing deltas stored to avoid redundant operations when getting
read scores for EM. I can also add a uint16_t in the struct without adding memory used because the 2 bytes of memory
was part of the padding in the struct that will now be used for the new uint64_t.
I didn't take into account the cases when there are reads without any updates in the trailing delta. Now trailing scoreDeltas are within -7 to +7 to the first scoreDelta (was -8 to +7 in the implementation from yesterday). That saves me one integer, -8, to represent skipped scoreDelta in a group. This means the memory improve will be slightly less than the estimation from 9/19/2025.
struct readScoreDeltaLowMemory {
uint64_t trailingDelta = 0;
uint32_t readIndex;
uint16_t numTrailing = 0;
int16_t scoreDelta;
void encodeTrailingDelta(int16_t scoreDeltaDiff, uint32_t indexToEncode);
int16_t decodeTrailingDelta(uint32_t offset);
};Now, to score 1 million SARS reads against the SARS-20K tree panMAMA uses ~~~241~~ ~250 Mb of memory, which is ~5.9 times smaller than the original approach to score read scoreDeltas individually.
This is implemented in commit: 928a7d15a53afcfa8296b38a554d3c457c0c4401
Seems like Sumit has fixed the SARS-8M panMAN. I will take a look at it today.
Hmmm... I got this error when trying to build the MGSR index for the new SARS-8M panMAN
stepping right over more than one block from 0 to -1
After many debug lines and long time of waiting, I found that I had a hidden bug that seemed tohave never cause a problem in RSV-4K, SARS-20K, HIV-20K, and the old SARS-8M.
After fixing the bug, I rebuilt the RSV-4K and SARS-20K index and compared them to the old index before I fixed the bug. Indeed they are identical. Just to make sure, I will compare them to brute force again tomorrow.
MGSR index finally built for the new SARS-8M tree. I will take a look at it tomorrow.
Last night, I scored 100K reads against the SARS-8M tree using both the low-memory mode and the normal mode.
Using 8 threads:
Normal mode took 7352 secs (122.5 mins), using max res 220 Gbs
Low-mem mode took 7115 secs (118.5 mins), using max res 81 Gbs
There are 82,640 nodes with significant kminmer overlap coefficient (using the default settin to include the top 1000 rank).
Inside silverbullet:/scratch1/alan/goodmap/panmap/build, the commands used are:
./bin/panmap ../panmans/sars_8M.panman \
../test_data/sars/rep1/sars20000_5hap-a_100000_rep1_R1.fastq \
../test_data/sars/rep1 sars20000_5hap-a_100000_rep1_R2.fastq \
-m sars_8M.new.pmai --cpus 8
./bin/panmap ../panmans/sars_8M.panman \
../test_data/sars/rep1/sars20000_5hap-a_100000_rep1_R1.fastq \
../test_data/sars/rep1 sars20000_5hap-a_100000_rep1_R2.fastq \
-m sars_8M.new.pmai --cpus 8 --low-memory
This doesn't look very good tbh... I scheduled a meeting on 9/29/2025 with Yatish, Russ, and Alex to go over potential strategies. I will make a slide deck to go over the panMAMA program step-by-step.
Pranav's scoring function looks quite interesting. I already store the maximum parsimony score of each read, and it's trivial to also store the EPP (number of genomes that have the maximum parsimony with the read).
For the sake of testing, I only scored the nodes that are already selected using the kminmer overlap coefficient method. Since the overlap coefficient method is quite sensitive, it will pick up a lot of nodes, and I want to see how well the node scores can filter out some of the noises. Similar to Pranav's scoring function, each read's score depends on its maximum parsimony across the entire tree and how many nodes share the paximum parsimony score.
Essentially the score for a specific read i is
Si = (Ri.kminmers.size() - Ri.maxScore + 1) * pow(Ri.epp, 2)
And the score for a specific node j is the sum of read scores of reads that maximally map to the node, or in pseudocode:
curNodeScores
nodeScore = 0
for (i = 0 ... len(reads)):
if (curNodeScores[i] == reads[i].maxScore && reads[i].maxScore > 0):
nodeScore += (reads[i].kminmers.size() - reads[i].maxScore + 1) * pow(reads[i].epp, 2)
I haven't implemented the regularization by fraction of sites covered but it already looks pretty good as it is.
I made some minor changes in panmap (see commit 2ae3b7c85528b05b51b80955437e823218b50c2897aea060e789dafde7cdad4e855d0ea7e6f3c10afff14e03f12b609f1877d845299862cb9b71b12d) to output some stats on the
node scores, and wrote a bash script
panmama/node_scores/run_panmap_for_test_node_scores.sh
to run it on some sample data
# In termial
reps=(rep1 rep2 rep3 rep4 rep5 rep6 rep7 rep8 rep9 rep10)
for rep in "${reps[@]}"; do
bash run_panmap_for_test_node_scores.sh /scratch1/alan/goodmap/panmap/build/bin/panmap /scratch1/alan/goodmap/panmap/panmans/sars_20000_optimized.panman /scratch1/alan/goodmap/panmap/build/sars_20000.pmai /scratch1/alan/goodmap/panmap/test_data/sars/$rep/ &
doneI will take a look at the results tomorrow.
The bash script seems to have run correctly, and the output files are all there.
I will write a script to ouptut and plot whether true haplotypes are selected by kminmer overlap coefficients and if they are, how their node scores rank within the selected nodes.
Actually, I think I will rebuilt and rerun the panmap to output the node scores with more precision. Using commit
97aea060e789dafde7cdad4e855d0ea7e6f3c10afff14e03f12b609f1877d845299862cb9b71b12d.
Here is the script to output stats on true haplotypes kminmer overlap coefficients and node scores panmama/node_scores/getRank.py.
To run it, for example
python3 getRank.py \
/scratch1/alan/goodmap/panmap/test_data/sars/panman_outputs/rep1/sars20000_5hap-a_abundance_rep1.tsv \
/scratch1/alan/goodmap/panmap/test_data/sars/rep1/sars20000_5hap-a_200000_rep1.testScores.txt \
<(sort -k3,3 -gr /scratch1/alan/goodmap/panmap/test_data/sars/rep1/sars20000_5hap-a_200000_rep1.testScores.txt)The output (after piping it to column -t):
Denmark/DCGC-636002/2022|OY803981.1|2022-12-05 0.5 1.0 0 393 342.41096 0 1
Germany/Molecular_surveillance_of_SARS-CoV-2_in_Germany/2021|OV351922.1|2021-09-17 0.2 1.0 0 393 13.38391 3 4
Scotland/QEUH-3C94804/2022|OW532842.1|2022-03-27 0.15 1.0 0 393 47.15048 2 3
USA/MA-CDCBI-CRSP_AP7MZ6THJQWE6Q7J/2023|OQ727617.1|2023-03-18 0.1 1.0 0 393 121.36997 1 2
England/OXON-AD71F/2020|OX589494.1|2020-04-04 0.05 0.9998 1 396 0.94889 359 578
I then wrote a script panmama/node_scores/run_getRank.sh to run getRank.py for my samples.
reps=(rep1 rep2 rep3 rep4 rep5 rep6 rep7 rep8 rep9 rep10)
for rep in "${reps[@]}"; do
bash run_getRank.sh getRank.py /scratch1/alan/goodmap/panmap/test_data/sars/$rep/ /scratch1/alan/goodmap/panmap/test_data/sars/panman_outputs/$rep &
doneI found that all of my samples with 80K reads actually contain different haplotypes from the true haplotypes? I probably forgot to delete them before re-simulation from a long time ago. Anyway, I just removed them.
I think I will actually output kminmer overlap coefficient and node scores of all the nodes on the SARS-20K tree. Using
commit fff14e03f12b609f1877d845299862cb9b71b12d instead.
After manual inspection, it seems like Prana's node scoring scheme is pretty promising if I were to pre-select based on kminmer overlap coefficients. I do think it's worth it regularize the node scores using kminmer coverage, which I think will improve both sensitivity and specificity and might free us from pre-selection by kminmer overlap coefficients entirely
Although, I found something weird... I was expecting that, at least for high coverage, in samples with only a single haplotype,
the correct haplotype should have the highest node score. However, for rep10 of my samples, the correct haplotype does
not have the highest node score.
==> sars20000_1hap-a_100000_rep10_rank_stats.tsv <==
England/MILK-9A8502/2020|OA972423.1|2020-09-02 1.0 1.0 0 7 34.7493051276 2 3 2 3
==> sars20000_1hap-a_10000_rep10_rank_stats.tsv <==
England/MILK-9A8502/2020|OA972423.1|2020-09-02 1.0 1.0 0 1 3.1636666008 1 2 1 2
==> sars20000_1hap-a_200000_rep10_rank_stats.tsv <==
England/MILK-9A8502/2020|OA972423.1|2020-09-02 1.0 1.0 0 9 65.2952724646 3 4 3 4
==> sars20000_1hap-a_20000_rep10_rank_stats.tsv <==
England/MILK-9A8502/2020|OA972423.1|2020-09-02 1.0 1.0 0 1 5.9588101007 1 2 1 2
==> sars20000_1hap-a_2000_rep10_rank_stats.tsv <==
England/MILK-9A8502/2020|OA972423.1|2020-09-02 1.0 0.998377611 2 3 0.6550883658 38 39 1 2
==> sars20000_1hap-a_40000_rep10_rank_stats.tsv <==
England/MILK-9A8502/2020|OA972423.1|2020-09-02 1.0 1.0 0 1 12.4419547827 1 2 1 2
I might have made some sense if only one sample doesn't have the correct haplotype ranked highest, due to sheer luck that a sequencing error caused a read or two to map better to the wrong node, but all the samples are like this.
Running for file in sars20000_1hap-a_*_rep10.testScores.txt; do echo ">${file}"; sort -k 3,3 -gr $file | head -n 5; done
>sars20000_1hap-a_100000_rep10.testScores.txt
England/CAMC-1172A52/2021|OD972409.1|2021-01-25 0.9943170286 35.2495987447
England/QEUH-9CFC6A/2020|OA994276.1|2020-09-15 0.9967539055 35.0967812923
England/MILK-9A8502/2020|OA972423.1|2020-09-02 1 34.7493051276
England/PHEC-14D2C4/2020|OX658315.1|2020-11-12 0.9939135727 34.7215691619
England/QEUH-B419FE/2020|OC996183.1|2020-11-08 0.9957412290 34.3945398909
>sars20000_1hap-a_10000_rep10.testScores.txt
England/QEUH-9FFDAA/2020|OA998403.1|2020-09-30 0.9979716024 4.1104912017
England/MILK-9A8502/2020|OA972423.1|2020-09-02 1 3.1636666008
England/CAMC-1172A52/2021|OD972409.1|2021-01-25 0.9912725797 3.1612628447
England/PHEC-14D2C4/2020|OX658315.1|2020-11-12 0.9939135727 3.1610533713
England/QEUH-96E052/2020|OA967653.1|2020-08-18 0.9989858012 3.1107656447q
>sars20000_1hap-a_200000_rep10.testScores.txt
England/CAMC-1172A52/2021|OD972409.1|2021-01-25 0.9943170286 66.2484996600
England/QEUH-9CFC6A/2020|OA994276.1|2020-09-15 0.9997971191 66.0141781118
England/PHEC-14D2C4/2020|OX658315.1|2020-11-12 0.9951308582 65.7404822238
England/MILK-9A8502/2020|OA972423.1|2020-09-02 1.0000000000 65.2952724646
England/QEUH-B419FE/2020|OC996183.1|2020-11-08 0.9959440276 65.1060260050
>sars20000_1hap-a_20000_rep10.testScores.txt
England/QEUH-9CFC6A/2020|OA994276.1|2020-09-15 0.9975654291 6.8180396934
England/MILK-9A8502/2020|OA972423.1|2020-09-02 1 5.9588101007
England/CAMC-1172A52/2021|OD972409.1|2021-01-25 0.9912725797 5.9534744836
England/PHEC-14D2C4/2020|OX658315.1|2020-11-12 0.9939135727 5.9525848546
England/ALDP-B24FC2/2020|OB994572.1|2020-11-04 0.9983779400 5.9253104840
>sars20000_1hap-a_2000_rep10.testScores.txt
USA/NV-CDC-QDX26257699/2021|OK250130.1|2021-06-27 0.9629254457 1.1328065150
England/MILK-B393BF/2020|OC996628.1|2020-11-01 0.9697400487 1.1200297175
SouthAfrica/NHLS-UCT-GS-D051/2021|OM765744.1|2021-07-08 0.9484094617 1.0110808329
Denmark/DCGC-641260/2023|OY797037.1|2023-02-01 0.9163256956 1.0102208789
Germany/IMS-10116-CVDP-DFD838F5-4D85-4B5B-AE67-0724E1E8F1FA/2021|OU078334.1|2021-02-03 0.9691495839 1.0089380969
>sars20000_1hap-a_40000_rep10.testScores.txt
England/CAMC-1172A52/2021|OD972409.1|2021-01-25 0.9924903592 12.7652863894
England/MILK-9A8502/2020|OA972423.1|2020-09-02 1 12.4419547827
England/PHEC-14D2C4/2020|OX658315.1|2020-11-12 0.9939135727 12.4300685190
node_6130 0.9989860069 12.2138878733
England/OXON-F888DC/2020|OY954521.1|2020-10-06 0.9939418417 12.2138875445
I will use the 10K read sample to investigate the issue. England/QEUH-9FFDAA/2020|OA998403.1|2020-09-30 has a much
higher node score than England/MILK-9A8502/2020|OA972423.1|2020-09-02.
I modified panmap to output some info on the read scores that contribute to each of the two haplotypes.
I found that the read below, contributes 1.0 to England/QEUH-9FFDAA/2020|OA998403.1|2020-09-30 but does not contribute
to England/MILK-9A8502/2020|OA972423.1|2020-09-02 at all. For simplicity, I will refer to
England/QEUH-9FFDAA/2020|OA998403.1|2020-09-30 as QEUH, and England/MILK-9A8502/2020|OA972423.1|2020-09-02 as
MILK.
@England_MILK_9A8502_2020_OA972423_1_2020_09_02_181_7/2
AGGGTTATGATTTTGGAAGCGCTCTGAAAAACAGCAAGAAGTGCAACGCCAACAATAAGCCATCCGAAAGGGAGTGAGGCTTGTATCGGTATCGTTGCAGTAGCGCGAACAAAATCTGAAGGAGTAGCATCGTTGATTTCACCTTGCTTCA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF-FFFFFF-FFFFFFF8FFF-FF8FFFFFFFFFFFFFFFFF8FFFFFFFF-FFFFFFFFFF---FFFFFFFFFFFFFF-F-
I then used minimap2 to align this read to both haplotypes.
minimap2 -a --MD sars_original/England_MILK_9A8502_2020_OA972423_1_2020_09_02.unaligned.fasta ../test_data/sars/exp.fastq
England_MILK_9A8502_2020_OA972423_1_2020_09_02_181_7/2 16 England/MILK-9A8502/2020|OA972423.1|2020-09-02 25436 60 151M * 0 0 TGAAGCAAGGTGAAATCAACGATGCTACTCCTTCAGATTTTGTTCGCGCTACTGCAACGATACCGATACAAGCCTCACTCCCTTTCGGATGGCTTATTGTTGGCGTTGCACTTCTTGCTGTTTTTCAGAGCGCTTCCAAAATCATAACCCT -F-FFFFFFFFFFFFFF---FFFFFFFFFF-FFFFFFFF8FFFFFFFFFFFFFFFFF8FF-FFF8FFFFFFF-FFFFFF-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:1 ms:i:296 AS:i:296 nn:i:0 tp:A:P cm:i:24 s1:i:129 s2:i:0 de:f:0.0066 MD:Z:19G131 rl:i:0
minimap2 -a --MD sars_original/England_QEUH_9FFDAA_2020_OA998403_1_2020_09_30.unaligned.fasta ../test_data/sars/exp.fastq
England_MILK_9A8502_2020_OA972423_1_2020_09_02_181_7/2 16 England/QEUH-9FFDAA/2020|OA998403.1|2020-09-30 25436 60 151M * 0 0 TGAAGCAAGGTGAAATCAACGATGCTACTCCTTCAGATTTTGTTCGCGCTACTGCAACGATACCGATACAAGCCTCACTCCCTTTCGGATGGCTTATTGTTGGCGTTGCACTTCTTGCTGTTTTTCAGAGCGCTTCCAAAATCATAACCCT -F-FFFFFFFFFFFFFF---FFFFFFFFFF-FFFFFFFF8FFFFFFFFFFFFFFFFF8FF-FFF8FFFFFFF-FFFFFF-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 ms:i:302 AS:i:302 nn:i:0 tp:A:P cm:i:28 s1:i:150 s2:i:0 de:f:0 MD:Z:151 rl:i:0
It seems the sequencing error did cause QEUH to map better than MILK...
Could it just be that the MILK genome is so close to its relatives that, more than usually, sequencing errors can
cause reads to map better to other nodes. I will run panmap with no-single mode to remove reads with obvious
sequencing errors.
Indeed, MILK is ranked first in node scores when panmap was run with no-single turned on.
$ sort -k3,3 -gr panmap.testScores.txt | column -t
England/MILK-9A8502/2020|OA972423.1|2020-09-02 1 2.8943746447
England/CAMC-1172A52/2021|OD972409.1|2021-01-25 0.9912725797 2.8922007995
England/PHEC-14D2C4/2020|OX658315.1|2020-11-12 0.9939135727 2.8919996823
England/QEUH-96E052/2020|OA967653.1|2020-08-18 0.9989858012 2.8472196151
node_6123 0.9989768774 2.8472193844
I will also try with the 100K and 200K read samples.
100K sample:
$ sort -k3,3 -gr panmap.testScores.txt | column -t
England/MILK-9A8502/2020|OA972423.1|2020-09-02 1 32.8271963293
England/CAMC-1172A52/2021|OD972409.1|2021-01-25 0.9912725797 32.8044433507
England/PHEC-14D2C4/2020|OX658315.1|2020-11-12 0.9939135727 32.8012638459
England/ALDP-A2E3AF/2020|OA991257.1|2020-10-05 0.9955348082 32.2821446972
England/QEUH-96E052/2020|OA967653.1|2020-08-18 0.9989858012 32.2653056183
200K sample:
$ sort -k3,3 -gr panmap.testScores.txt | column -t
England/CAMC-1172A52/2021|OD972409.1|2021-01-25 0.9924903592 62.8683882201
England/PHEC-14D2C4/2020|OX658315.1|2020-11-12 0.9943193346 62.6107755427
England/MILK-9A8502/2020|OA972423.1|2020-09-02 1 62.4128674700
England/QEUH-9CFC6A/2020|OA994276.1|2020-09-15 0.9963481436 62.1861284750
England/ALDP-A6E6AA/2020|OB982816.1|2020-10-15 0.9975659229 61.7590675371
It seems like 200K sample has a little bit more sequencing error because of its large size. Nevertheless, it also seems
that all the top ranked node scores are MILK's close relatives, as they are all sequenced from England with very close
dates to each other. I think I will actually also regularize the scores using coverage. I cannot use the existing
kminmer overlap coefficient because a kminmer is considered covered as long it's in the read sample. For the purose of
regularization, a kminmer should only be considered covered if it's present in a read that has maximum parsimony at that
node. Hope that makes sense.....
Today I will come up with and develop a way to calculate the kminmer coverage. I will do a second light-weight tree traversal to dynamically calculate it.
I will put the kminmer coverage function in ThreadsManager for now.
I added a computeKminmerCoverage() function in ThreadsManger class. This function will only count a k-min-mer being
covered at a node when it has at least one read that has a maximum score with the node. It's quite difficult to engineer
a way to efficiently do this. Nevertheless, it seems like this coverage actually doesn't differ from the overlap
coefficient too much. I think I will hold off this approach for now.
For book-keeping purpose, computeKminmerCoverage() is implemented in commit:
2e18748a0f23ee7ec69b2630985c5a9458e78837
Here are some keypoints from it:
- PanMAMA's runtime is actually not bad and quite similar to WEPP
- Yatish
suggestedmentioned the option of alternative scoring and deconvolution method.- They tried both EM and Freyja and found that Freyja consistently performed better than EM.
- Alternative deconlution method might allow us to avoid chaining.
- Can pre-filter the probable haplotypes (like the current overlap coeffcient method), then condense the tree for only
the selected nodes
- Still perform score update but on a much smaller scale.
- Can group reads by their initial mapping position on the root (or even pseudo-root) node, then "build" smaller trees, each representing a contiguous small section of the genome on which the read group maps to.
Russ and Yatish think that I'd be solid if I can improve my performance by 5-10X.
I will first try to make a subtree that contains only nodes that have mutations relevent to my read sample. I created
a new branch mgsr-optimization and will work from there.
I need an efficient way to collapse the k-min-mer mutations if I want to collapse the nodes. To do this, I think I need the indexed seed insertions, deletions, and substitutions to be stored together instead of
separately. I would also need to sort them by their positions. I will start with that.
I finally finished this update. Now I'm storing a single seedDeltas<SeedDelta> list per node in the index. Each
SeedDelta contains the seedIndex and isDeleted. This list is sorted by the startPos of seedInfos[seedIndex].
This way, I know that if two consecutive SeedDeltas have the same startPos, they are substitutions. As before, this
allows me to backtrack the reference kminmers using the same index...
This is a bit trickier because my current implementation requires that my coordDeltas at node to be processed
sequentially as indexed. This means that I cannot just sort them to make the collapse easy. I might need to change how
gaps are represented or updated...
During building, a gapRunUpdates vector was processed to produce the coordDeltas vector. I didn't use
gapRunUpdates directly for the index because I want to leave the messy computing in indexing so that I can just apply
the gapMap changes according to coordDeltas. Unlike coordDeltas, gapRunUpdates doesn't need to be processed in a
specific order. To be able to collapse the nodes, I might have to store gapRunUpdates in the index instead. I don't
think the performance will be impacted more than trivially.
I just implemented the new gapMap update, see 10/1/2025. Everything seems to be working correctly.
I implemented a MgsrLiteTree class and a MgsrLiteNode class for better organization of the panMAMA index data.
Now MgsrLiteTree owns the seedInfos instead of ThreadsManger or mgsrPlacer. And MgsrLiteNodes own
their respective indexed seed updates, including seedDeltas, gapRunDeltas, and invertedBlocks, in MgsrLiteTree.
This will set up a better code structure for collaping the tree for optimization.
Same as the changes made on 10/2/2025, This will set up a better code structure for collaping the tree for optimization.
I tried splitting the tree by segment size of [1000, 5000, 10000, 50000, 100000, 500000], and each tree segment is
collapsed to only include nodes that have at least one k-min-mer mutation in its range. I also give it either a 500 or
0.1 * segmentSize overlap offset so that we can include reads at the boundries of the segment when splitting the reads
to each segment.
binSize overlap numTrees totalNodes avgTreeSize medianTree minTree minTreeRange maxTree maxTreeRange
1000 500 3446 658944 191 10 0 21500,22500 75 1157000,1158000
1000 100 1915 364805 190 10 0 21600,22600 74 1152900,1153900
5000 500 383 239527 625 123 134 886500,891500 401 1156500,1161500
5000 500 383 239527 625 123 134 886500,891500 401 1156500,1161500
10000 500 182 195632 1074 417 612 1064000,1074000 681 1149500,1159500
10000 1000 192 204953 1067 408 636 1062000,1072000 698 1152000,1162000
50000 500 35 121376 3467 1639 1761 990000,1040000 1944 1138500,1188500
50000 5000 39 130549 3347 1543 21723 1710000,1760000 1930 1125000,1175000
100000 500 18 96420 5356 2878 25280 1691500,1791500 2987 1094500,1194500
100000 10000 20 105363 5268 3246 25542 1710000,1810000 3314 1080000,1180000
500000 500 4 65668 16417 24597 27053 1498500,1998500 24597 999000,1499000
500000 50000 4 65182 16295 21293 29192 1350000,1850000 21293 900000,1400000
binSize overlap numTrees totalNodes avgTreeSize medianTree minTree minTreeRange maxTree maxTreeRange
1000 500 285 200645202 704018 647251 662308 75500,76500 118214 1500,2500
1000 100 159 110509365 695027 644707 661926 75600,76600 110471 1800,2800
5000 500 32 51089087 1596533 1541060 1355099 45000,50000 417769 0,5000
5000 500 32 51089087 1596533 1541060 1355099 45000,50000 417769 0,5000
10000 500 15 37001282 2466752 2279767 1522025 19000,29000 1099764 0,10000
10000 1000 16 39027983 2439248 2417613 1353098 18000,28000 1090724 0,10000
50000 500 3 17701937 5900645 6053592 6053592 49500,99500 7013652 99000,149000
50000 5000 4 21570598 5392649 6053592 7219395 135000,185000 6053592 90000,140000
100000 500 2 14357690 7178845 7354519 7354519 99500,199500 7003171 0,100000
100000 10000 2 14718565 7359282 7364046 7354519 0,100000 7364046 90000,190000
500000 500 1 8560783 8560783 8560783 8560783 0,500000 8560783 0,500000
500000 50000 1 8560783 8560783 8560783 8560783 0,500000 8560783 0,500000
To roughly estimate this, I get all the global reference positions that each read's k-min-mers can map to, then calcualte the read mapping range between the min and max its reference match positions. This measures how far apart on the reference a read can potentially map to.
From working directory, /scratch1/alan/goodmap/panmap/build, I used
../test_data/sars/rep1/sars20000_5hap-a_100000_rep1_R*.fastq as a test sample.
The results show that most of the reads have very close mapping range (which is what we want) on the SARS 20K tree. However, that doesn't seem like the case for the SARS 8M tree and RSV 4K tree.
It seems like a non-trivial amount of reads would not map to a small segument of the SARS 8M tree, so grouping the reads by their mapping positions then splitting the tree would not work very well.
Therefore, we might have to resolve to reducing the search space early. Today, I will write some scripts to run tests on how well k-min-mer overlap coefficient is able to pre-filter read samples of various compositions:
- Original and mutated haplotype
- All original haplotypes
- All mutated haplotypes
- Mix of original and mutated haplotypes
- Sequencing type
- shot-gun
- tailed amplicon
- Sequencing depth
I will make all combinations of (#SNPs, #Haplotypes, %Mutated, SeqType, Depth)
#SNP: 0 1 5 10 20
#haplotypes: 1 5 10 20 50 100
%Mutated: 0% 20% 50% 70% 100%
SeqType: shot-gun tiled-amplicon
Depth: 10X 50X 100X 500X 1000X
Added --dump-sequences and --simulate-snps options to panmap. Just for the purpose of testing if correct nodes
will be selected, SNPs are uniformally simulated, e.g. base A can mutate into C, G, and T with equal probabilities.
Added --dump-random-nodeIDs option to output a specified number of nodes. This is for selecting random haplotypes to simulate
the mixed samples
Added --random-seed option to specify the seed for the RNG. This is for replicating the results for future reference.
This is implemented in commit b96f016a79afca83c8e8b45b86f339cd367062e3
I just finished writing a script,
simulate_and_test_overlap_coefficient.sh, to
simulate reads with parameters described above. Currently, for %Mutated, I am only simulate either 0% or 100%. I will
deal with mixed mutations tomorrow. See below for how to run the script... I pulled #SNP parameter out for simpler
parallelization on silverbullet.
for num_snp in 0 1 5 10 20; do
bash simulate_and_test_overlap_coefficient.sh \
/scratch1/alan/goodmap/panmap/build/bin/panmap \
/scratch1/alan/goodmap/panmap/panmans/sars_20000_optimized.panman \
/scratch1/alan/goodmap/panmap/build/sars_20000.pmai \
/home/alan/tools/SWAMPy/src/simulate_metagenome.py \
/home/alan/tools/SWAMPy/primer_sets/nimagenV2.bed \
/home/alan/tools/SWAMPy/ref/MN908947.3.fasta \
/home/alan/tools/jvarkit/dist/jvarkit.jar \
/scratch1/alan/lab_notebook/panmama/overlap_coefficients/out \
delete_intermediate \
$num_snp > /dev/null 2>&1 &
done
This is currently running on silverbullet. I will take a look at the results tomorrow.
Here are some stats of the results (for full results, see panmama/overlap_coefficients/shotgun.results and
panmama/overlap_coefficients/amplicon.results):
num_hap num_snp seq_type depth min_rank max_rank median_rank
1 0 0 500 0 0 0
1 1 0 500 0 0 0
1 5 0 500 0 0 0
1 10 0 500 0 0 0
1 20 0 500 0 0 0
5 0 0 500 0 89 0
5 1 0 500 0 0 0
5 5 0 500 0 133 12
5 10 0 500 0 21 0
5 20 0 500 0 27 0
10 0 0 500 0 119 0.5
10 1 0 500 0 25 0
10 5 0 500 0 119 0
10 10 0 500 0 27 0
10 20 0 500 0 1199 0
20 0 0 500 0 491 0
20 1 0 500 0 190 0
20 5 0 500 0 321 0
20 10 0 500 0 509 0
20 20 0 500 0 417 51.5
50 0 0 500 0 728 0
50 1 0 500 0 252 0
50 5 0 500 0 1014 0
50 10 0 500 0 508 0
50 20 0 500 0 905 0
100 0 0 500 0 1340 0
100 1 0 500 0 1771 0
100 5 0 500 0 1490 18
100 10 0 500 0 1241 82.5
100 20 0 500 0 1617 7
1 0 0 1000 0 0 0
1 1 0 1000 0 0 0
1 5 0 1000 0 0 0
1 10 0 1000 0 0 0
1 20 0 1000 0 0 0
5 0 0 1000 0 23 0
5 1 0 1000 0 4 0
5 5 0 1000 0 0 0
5 10 0 1000 0 26 0
5 20 0 1000 0 249 0
10 0 0 1000 0 189 0
10 1 0 1000 0 164 0
10 5 0 1000 0 49 0
10 10 0 1000 0 39 0
10 20 0 1000 0 303 37.5
20 0 0 1000 0 75 0
20 1 0 1000 0 62 0
20 5 0 1000 0 297 0
20 10 0 1000 0 65 0
20 20 0 1000 0 481 0
50 0 0 1000 0 172 0
50 1 0 1000 0 570 0
50 5 0 1000 0 292 0
50 10 0 1000 0 203 0
50 20 0 1000 0 519 0
100 0 0 1000 0 482 0
100 1 0 1000 0 695 0
100 5 0 1000 0 593 0
100 10 0 1000 0 937 0
100 20 0 1000 0 1460 0
num_hap num_snp seq_type depth min_rank max_rank median_rank
1 0 1 500 23 23 23
1 1 1 500 0 0 0
1 5 1 500 17 17 17
1 10 1 500 0 0 0
1 20 1 500 47 47 47
5 0 1 500 232 1233 906
5 1 1 500 255 2729 1713
5 5 1 500 56 643 439
5 10 1 500 253 1877 1262
5 20 1 500 58 2315 586
10 0 1 500 62 5714 692
10 1 1 500 149 2062 564.5
10 5 1 500 129 2039 968
10 10 1 500 227 2828 1245.5
10 20 1 500 37 3582 1883.5
20 0 1 500 134 5274 998.5
20 1 1 500 96 2915 1407.5
20 5 1 500 105 5027 1116
20 10 1 500 140 4749 2079
20 20 1 500 43 3918 1274.5
50 0 1 500 66 6687 1686
50 1 1 500 52 6232 2124
50 5 1 500 14 9003 2020.5
50 10 1 500 132 7184 1785
50 20 1 500 101 6261 1905.5
100 0 1 500 111 8461 1982.5
100 1 1 500 165 8915 1910
100 5 1 500 29 5538 1839
100 10 1 500 70 8112 1938
100 20 1 500 128 8900 1708.5
1 0 1 1000 7 7 7
1 1 1 1000 0 0 0
1 5 1 1000 0 0 0
1 10 1 1000 2 2 2
1 20 1 1000 212 212 212
5 0 1 1000 455 6224 1780
5 1 1 1000 139 2042 1268
5 5 1 1000 452 1295 724
5 10 1 1000 260 1910 475
5 20 1 1000 118 877 367
10 0 1 1000 27 1847 1217
10 1 1 1000 70 2403 256.5
10 5 1 1000 72 3154 459.5
10 10 1 1000 267 2365 890
10 20 1 1000 93 2856 1199
20 0 1 1000 117 6584 1576.5
20 1 1 1000 200 2605 1064
20 5 1 1000 339 6378 1577
20 10 1 1000 41 4245 1566.5
20 20 1 1000 37 3512 1590.5
50 0 1 1000 31 4911 1356
50 1 1 1000 310 5102 1636
50 5 1 1000 70 7058 1792
50 10 1 1000 26 7466 1518
50 20 1 1000 4 6954 1507
100 0 1 1000 32 8745 1696
100 1 1 1000 18 6383 1608
100 5 1 1000 284 5937 3426.5
100 10 1 1000 82 4691 2008
100 20 1 1000 92 7673 2611
I also just realized that I completely forgot to run it with replicates, so I will make sure to include that in my subsequent runs.
I'm not very confident about the results, especially for the tiled amplicon sequencing. The median true haplotype overlap coefficient ranks are pretty low, indicating that true haplotypes are not being selected very well. While it performs a little bit better for the shotgun sequencing samples, a LOT of other nodes would also be selected. I think I need to come up with a better way to reduce the search space.
I started trying WEPP's node scoring approach again with the tiled amplicon data. I tried it manually (running commands on terminal) and found that it's very good with some haplotypes for also very bad for others. After reading the WEPP paper on biorxiv, I found that there's one step that I missed for scoring the nodes: from highest scoring node, after removing the top node, I need to update other node scores by subtracting from them read scores that are also parsimonious for the node just removed. I will work on this tomorrow.
I suddenly realized that Pranav was simulating and working with tiled amplicon data, meaning that much more reads can be
deduplicated. I then simulated 100K amplicon data and ran my current panMAMA implementation on it. As expected,
runtime is much shorter for the amplicon samples than shotgun sequencing samples. Using 8 threads, normal mode took
24.6 minutes, while low-memory mode took 29.5 minutes. In comparison, 100K shotgun samples took 120 minutes on 8
threads. Maybe, what we currently have is pretty good afterall? I will discuss this with Russ tomorrow.
Nevertheless, I still need to implement a better way to reduce the search space. One WIP is the WEPP scoring approach.
After discussing with Russ, he also thinks that my current implementation might be pretty good already. I will simulate
1.5 million amplicon and shotgun sequencing reads and run my current panMAMA implementation on them. I will run them
on the SARS 8M tree on 64 threads.
Mode SeqType Runtime
--------------------------------------------
Normal Amplicon 2342.3s (39.04 mins)
LowMem Amplicon 2801.53s (46.68 mins)
LowMem Shotgun 16246.8s (4.512 hrs)
This is actually not bad compared to WEPP. This should also be further improved after reducing the search space before the scoring step. Shotgun sequencing samples took much longer than amplicon samples because shotgun reads are much less collaspible than amplicon reads. 1,500,000 shotgun reads were collapsed to 374,400 unique k-min-mer sets (~25% of original reads), while 1,401,619 amplicon reads were collapsed to 57,922 unique k-min-mer sets (~4.13% of original reads).
eppwas not updated correctly. I was only incrementing it when a read score was updated. This causedeppto be undercounted.- I didn't account for when nodes are identical or indistinguishable by their k-min-mer sets. I was treating them as separate nodes when identical nodes should be treated as the same node.
I arbitrarily used test_data/sars/rep3/sars20000_10hap-a_100000_rep3_R*.fastq as a test sample. And true haplotypes'
rank for their node scores were:
Rank NodeID Score
-------------------------------------------------------------------------------------------------------
3675 USA/CA-CHLA-PLM46323971/2020|MZ722413.1|2020-12-09 0.1845768297
5244 USA/IL-CDC-QDX40817725/2022|OP411780.1|2022-08-28 0.0764590824
2 England/MILK-344FEB3/2022|OV817379.1|2022-01-26 57.8659772952
1 Denmark/DCGC-599763/2022|OY836217.1|2022-10-17 85.0571468377
3 USA/CO-CDPHE-41411769/2023|PP031687.1|2023-09-25 54.3541009750
4 USA/CO-CDPHE-2007061387/2020|OK659067.1|2020-07-06 25.4734981362
7 USA/WA-S20280/2022|ON660803.1|2022-05-20 8.3954038314
8 USA/FL-CDC-LC0637436/2022|ON608372.1|2022-05-12 8.1637073062
10 Germany/Molecular_surveillance_of_SARS-CoV-2_in_Germany/2021|OV342561.1|2021-09-10 6.2916502660
5 USA/CA-CDC-STM-XAH3WETJM/2022|OP911649.1|2022-11-14 19.0574859274
It's very weird that the the first 2 haplotypes are ranked very low. I then filtered out the reads for only
MZ722413.1. Surprisingly, even with reads that are only from MZ722413.1, it still ranked 636 for its node score.
And it's not even selected if I were to progressively exhaust the reads during the selection process. The same happened
for OP411780.1, which ranked 1016 for its node score and was also not selected if I were to progressively exhaust
the reads during the selection process. As a positive control, I filtered out the reads for only OV342561.1. As
expected, it was ranked 1 for its node score.
I then investigated how the reads map to MZ722413.1. Of all reads that were simulated from MZ722413.1, after
collapsing them, 10,528 unique reads map parismoniously to MZ722413.1, while 1,277 map parsimoniously elsewhere,
meaning that they were sequencing errors. Additionally, among the 1,277 reads that map parsimoniously elsewhere, the
vast majority of them have very low epp value, meaning that they are parsimonious for very few nodes, which boost up
their score weights for other nodes. This was also observed for OP411780.1. OV817379.1, on the other hand, while
also have reads that map parsimoniously elsewhere and have very low epp values, some reads that do map parsimoniously to
OV817379.1 also have very low epp values, meaning that they are parsimonious for many nodes, which downgrades their
score weights for other nodes.
I then tried pruning the reads that have obvious errors using the --skip-singleton option. This boosted up the rank of
MZ722413.1 to 39 for its node score. The same thing happened for OP411780.1, whose node score rank was improved
to 39.
What if I used perfect reads with no errors? I simulated 100,000 perfect reads for the same set of 10 haplotypes, and here are their ranks for their node scores:
Rank NodeID Score
-------------------------------------------------------------------------------------------------------
118 USA/CA-CHLA-PLM46323971/2020|MZ722413.1|2020-12-09 0.2122666691
337 USA/IL-CDC-QDX40817725/2022|OP411780.1|2022-08-28 0.0570905193
3 England/MILK-344FEB3/2022|OV817379.1|2022-01-26 56.9054583197
1 Denmark/DCGC-599763/2022|OY836217.1|2022-10-17 94.1630006842
2 USA/CO-CDPHE-41411769/2023|PP031687.1|2023-09-25 64.9913636546
4 USA/CO-CDPHE-2007061387/2020|OK659067.1|2020-07-06 34.9137689944
8 USA/WA-S20280/2022|ON660803.1|2022-05-20 6.9469028815
13 USA/FL-CDC-LC0637436/2022|ON608372.1|2022-05-12 3.6754517996
7 Germany/Molecular_surveillance_of_SARS-CoV-2_in_Germany/2021|OV342561.1|2021-09-10 8.0952134279
5 USA/CA-CDC-STM-XAH3WETJM/2022|OP911649.1|2022-11-14 17.6248774466
After filtering the reads separately for MZ722413.1 and OP411780.1, each filtered sample has the true haplotype
ranked 1 by node score. I also noticed that these two haplotypes do not have parsimonious reads with low
epp values (the lowest epp for MZ722413.1 is 25 and for OP411780.1 is 33; for comparison, OV817379.1 has an epp value
of 1 for 24 collapsed reads). In other words, all the reads for these two haplotypes are parsimonious for a non-trivial
number of nodes, so their node scores can get "drowned out" by other true nodes that have more "unique" reads on the tree.
Note that this is all done on the SARS 20K tree, which is much less balanced than the SARS 8M tree. I will run more tests on the SARS 8M tree. I will also need to generate some consistent datasets on phoenix.
Dockerfile and CMakeLists.txt can be found in panmama/10_9_2025.
To build the docker image
docker build -t panmap-dev .
To enter the interactive shell
docker run --rm -it \
-v $(pwd):/panmap \
-w /panmap \
panmap-dev
To build panmap fresh from inside the container
mkdir -p /panmap/build && cd /panmap/build
cmake ..
make -j 32
A simple example:
docker run --rm \
-v $(pwd):/panmap \
-w /panmap \
--user $(id -u):$(id -g) \
panmap-dev \
bash -c "echo rawrrr > tiger"
Using the --user flag so that tiger is owned by the user instead of root
To run panmap from the host
docker run --rm \
-v $(pwd):/panmap \
-v /scratch1/alan/goodmap/panmap/panmans:/panmap/panmans \
-v /scratch1/alan/goodmap/panmap/test_data:/panmap/test_data \
-w /panmap \
--user $(id -u):$(id -g) \
panmap-dev \
bash -c "/panmap/build/bin/panmap /panmap/panmans/sars_20000_optimized.panman \
/panmap/test_data/sars/rep1/sars20000_5hap-a_100000_rep1_R*.fastq \
-m /panmap/panmans/sars_20000.pmai \
--cpus 4"
To rebuild panmap from the host
docker run --rm \
-v $(pwd):/panmap \
-w /panmap \
panmap-dev \
bash -c "cd /panmap/build && make -j 32"
Not using the --user flag to avoid permission issues
Nice. It also built on Phoenix. Now I can run more tests on Phoenix.
❗ Reminder: Evaluation scripts and figures for genotyping with mutation spectrum is in panmap's main
branch (commit f239d06) in dev/denotype_eval1.
Today I will write a script on phoenix to generate benchmark data. This script should be generalizable to any tree with customizable parameters.
Borrowed from 10/6/2025
- Original and mutated haplotype
- All original haplotypes
- All mutated haplotypes
- Mix of original and mutated haplotypes
- Sequencing type
- shot-gun
- tailed amplicon
- Sequencing depth
I will make all combinations of (#SNPs, #Haplotypes, %Mutated, SeqType, Depth)
#SNP: 0 5 10 20
#haplotypes: 1 5 10 50 100
%Mutated: 20% 50% 70% 100% If #SNP > 0
SeqType: shot-gun(0) tiled-amplicon(1)
#Reads: 100000 500000 1000000 2000000
Each combination has 5 replicates
Use panmama/benchmark/gencomb.py to generate combinations of parameters.
python3 panmama/benchmark/gencomb.py \
--snps 0 5 10 20 \
--haplotypes 1 5 10 50 100 \
--percent-mutated 0.2 0.4 0.8 1.0 \
--seq-types shotgun amplicon \
--num-reads 100000 500000 1000000 2000000 \
--num-rep 5 | head | column -t
seq_type haplotype snp percent_mutated num_reads rep
shotgun 1 0 0 100000 0
shotgun 1 0 0 100000 1
shotgun 1 0 0 100000 2
shotgun 1 0 0 100000 3
shotgun 1 0 0 100000 4
shotgun 1 0 0 500000 0
shotgun 1 0 0 500000 1
shotgun 1 0 0 500000 2
shotgun 1 0 0 500000 3
Use panmama/benchmark/genreads.sh to generate reads.
bash panmama/benchmark/genreads.sh \
--seqtype shotgun \
--numhap 5 \
--numsnp 10 \
--permut 0.4 \
--numreads 1000 \
--rep 0 \
--cpus 4 \
--panmap /private/groups/corbettlab/alan/panmap/ \
--panman /private/groups/corbettlab/alan/panmap/panmans/sars_optimized.panman \
--pmi /private/groups/corbettlab/alan/panmap/panmans/sars_optimized.panman.pmi \
--random-seed random \
--out-prefix panmama/benchmark/test
or to simulate amplicon reads
bash panmama/benchmark/genreads.sh \
--seqtype amplicon \
--numhap 5 \
--numsnp 10 \
--permut 0.4 \
--numreads 1000 \
--rep 0 \
--cpus 4 \
--panmap /private/groups/corbettlab/alan/panmap/ \
--panman /private/groups/corbettlab/alan/panmap/panmans/sars_optimized.panman \
--pmi /private/groups/corbettlab/alan/panmap/panmans/sars_optimized.panman.pmi \
--random-seed random \
--swampy /private/home/bzhan146/tools/SWAMPy/src/simulate_metagenome.py \
--reference-primer-bed-file /private/home/bzhan146/tools/SWAMPy/primer_sets/nimagenV2.bed \
--reference-fasta-file /private/home/bzhan146/tools/SWAMPy/ref/MN908947.3.fasta \
--jvarkit /private/home/bzhan146/tools/jvarkit/dist/jvarkit.jar \
--out-prefix panmama/benchmark/test
I just implemented a fast node scoring function, scoreReads() with scoreReadsHelper(). It traverses the tree and
only counts the number of matching k-min-mers in each read instead of pseudo-chaining them. This should make the read
and node scoring step much faster as a pre-filtering step. The plan is to still do full chaining after narrowing down
to the probable nodes and collapsing the tree.
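A minimal Python sketch of the idea. The real scoreReads()/scoreReadsHelper() are C++ and operate on panMAN seed indexes; the Node class, the gained/lost delta sets, and the consistency of those deltas are assumptions here:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    gained: set = field(default_factory=set)  # k-min-mers added vs. parent
    lost: set = field(default_factory=set)    # k-min-mers removed vs. parent
    children: list = field(default_factory=list)

def score_reads(node, active, read_seeds, scores):
    # apply this node's seed deltas to the running k-min-mer set
    active |= node.gained
    active -= node.lost
    # fast score: count matching k-min-mers per read, no pseudo-chaining
    scores[node.name] = [len(r & active) for r in read_seeds]
    for child in node.children:
        score_reads(child, active, read_seeds, scores)
    # undo deltas before returning to the parent
    # (assumes gained seeds were absent and lost seeds were present)
    active -= node.gained
    active |= node.lost

root = Node("root", gained={"ACG", "CGT"})
root.children.append(Node("c1", gained={"GTA"}, lost={"ACG"}))
scores = {}
score_reads(root, set(), [{"ACG", "GTA"}], scores)
print(scores)  # {'root': [1], 'c1': [1]}
```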
Quick follow up on 10/9/2025
Getting sidetracked a bit, but I wonder if the two true nodes, MZ722413.1 and OP411780.1, have low scores because
reads that parsimoniously map to them also parsimoniously map to many inferred internal nodes, which drives up the epp
and reduces the weights of the reads that map to the true nodes.
Not counting internal nodes when counting epp for reads, the node score ranks for the true haplotypes from sample
test_data/sars/rep3/sars20000_10hap-a_100000_rep3_R*.fastq:
2292 USA/CA-CHLA-PLM46323971/2020|MZ722413.1|2020-12-09 0.3971806677
3833 USA/IL-CDC-QDX40817725/2022|OP411780.1|2022-08-28 0.1865463517
2 England/MILK-344FEB3/2022|OV817379.1|2022-01-26 58.1048931881
1 Denmark/DCGC-599763/2022|OY836217.1|2022-10-17 85.2544216228
3 USA/CO-CDPHE-41411769/2023|PP031687.1|2023-09-25 54.7220291402
4 USA/CO-CDPHE-2007061387/2020|OK659067.1|2020-07-06 27.1159401016
8 USA/WA-S20280/2022|ON660803.1|2022-05-20 8.3984476356
6 USA/FL-CDC-LC0637436/2022|ON608372.1|2022-05-12 10.3158523194
9 Germany/Molecular_surveillance_of_SARS-CoV-2_in_Germany/2021|OV342561.1|2021-09-10 6.8616225312
5 USA/CA-CDC-STM-XAH3WETJM/2022|OP911649.1|2022-11-14 19.1232183505
It looks a little bit better but still not very good. Never mind then...
Since the fast mode, which doesn't pseudo-chain the k-min-mer matches, outputs very similar results to the normal mode, Russ and I decided to make fast mode the default option for speed. I will reserve a whole node on phoenix to do some runtime benchmarks.
sbatch /private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/fast-mode.sh --array=0-3
I did a little bit of optimization and now scoring 1.5 million amplicon short reads on the SARS 8M tree takes ~8 minutes using 32 threads.
As discussed on 10/9/2025, some nodes are not selected using WEPP scoring methods because they are very similar to other nodes on the tree, meaning that all of the reads that map parsimoniously to them have very high EPP values, which lowers their scores and makes them more susceptible to sequencing errors.
Russ and I discussed some other options:
- Find local optima of sum read scores
- Find local optima of WEPP node scores
- Regularize EPP by giving it a "span value", which measures the genetic distance between the parsimonious nodes... e.g. if all the parsimonious nodes of a read are neighbors, give the read weight a positive correction to increase the node scores of those nodes (see the sketch after this list).
- Combine options 2 and 3
I will be trying out these options in the next couple of days
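For option 3, a rough sketch of what a span-corrected read weight could look like, approximating genetic distance by the DFS-index range of the read's parsimonious nodes. The formula and the alpha constant are assumptions, not a settled design:

```python
def spanned_weight(epp, dfs_indices, alpha=1.0):
    """Boost a read's 1/epp weight when its parsimonious nodes cluster tightly."""
    span = max(dfs_indices) - min(dfs_indices) + 1
    tightness = epp / span  # 1.0 when all parsimonious nodes are contiguous in DFS order
    return (1.0 / epp) * (1.0 + alpha * tightness)

# A read parsimonious to 10 contiguous neighbors keeps more weight than one
# parsimonious to 10 scattered nodes:
print(spanned_weight(10, list(range(100, 110))))                      # contiguous -> 0.2
print(spanned_weight(10, [5, 900, 4000, 7000, 12000,
                          15000, 18000, 19000, 19500, 19990]))        # scattered -> ~0.1
```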
I can actually compute the EPP during scoring with minimum overhead. During scoring, if a read has a new max score, I
can set a read.lastParsimoniousNode to the current node. Downstream, when the read's score is changed from max to
a low score, I can calculate the number of parsimoniously mapped nodes by
epp += curNode.dfsIndex - read.lastParsimoniousNode.dfsIndex. If a new max score is reached, I can simply reset the
epp.
As implemented:
```cpp
auto& lastParsimoniousNode = curRead.lastParsimoniousNode;
if (score > maxScore) {
    // reset epp and assign current node as lastParsimoniousNode
    curRead.epp = 0;
    lastParsimoniousNode = node;
} else if (score < maxScore) {
    // collect epp and reset lastParsimoniousNode
    if (lastParsimoniousNode != nullptr) {
        curRead.epp += node->dfsIndex - lastParsimoniousNode->dfsIndex;
        lastParsimoniousNode = nullptr;
    }
} else {
    if (lastParsimoniousNode == nullptr) {
        lastParsimoniousNode = node;
    }
}
```
This method would require me to handle identical nodes on the tree more carefully, which is a WIP right now.
Today I finished collapsing the tree.
WIP: Efficient counting of read EPP during scoring. It's actually a bit trickier than described on 10/14/2025.
After roughly a day, I finally implemented a correct and efficient counting of read EPPs during scoring. Now the program is able to score the SARS 8M tree and simultaneously count the read EPPs in ~6.5 minutes using 32 threads.
For bookkeeping, this is implemented in commit ccc3bcc73b31ba686b4a43d8fedef5180d1f1582.
Now I will experiment with various methods to reduce search space, as described on 10/14/2025.
Just implemented calculation of the sum of raw read matches for each node on the tree (commit
eff3a812d9d76f69f0f04e78fb98221b4998d6c9), which took ~180s for 1.5 million amplicon short reads on the SARS 8M tree.
Will work on the WEPP score tomorrow.
Just implemented the calculation of WEPP node scores in commit 2a327e40c1a6787b56971a148a989e7ac00a9d16. The dynamic
calculations of sum raw read matches and WEPP scores are confirmed to be correct after checking against brute force on
the SARS 20K tree.
For a given combination of num_snps, num_haps, percent_mutated, and replicate_number, generate the same set of haplotypes and abundances for amplicon and shotgun reads with various depth.
sbatch /private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/gendata.sh \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/data_sars_20K \
/private/groups/corbettlab/alan/panmap/panmans/sars_optimized.panman \
/private/groups/corbettlab/alan/panmap/panmans/sars_20000.pmai
and
sbatch /private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/gendata.sh \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/data_sars_8M \
/private/groups/corbettlab/alan/panmap/panmans/sars_8M.panman \
/private/groups/corbettlab/alan/panmap/panmans/sars_8M.pmai
I compressed the outputs into data_sars_20K.tar.gz and data_sars_8M.tar.gz and removed the original output dirs for
now. I also scp'ed the *tar.gz to alan@silverbullet:/scratch1/alan/lab_notebook/panmama/benchmark.
I forgot to change the panman path when I copied over the bash command to run gendata.sh and ended up generating
samples of SARS 20K haplotypes in the data_sars_8M/ dir. Anyway, regenerating data rn.
I wrote several node scoring schemes for selecting probable nodes.
sbatch /private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/run_panmap_score_nodes_sars_20K.sh \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/node_scores_out \
/private/groups/corbettlab/alan/panmap/ \
/private/groups/corbettlab/alan/panmap/panmans/sars_optimized.panman \
/private/groups/corbettlab/alan/panmap/panmans/sars_20000.pmai \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/data_sars_20K
sbatch /private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/run_panmap_score_nodes_large_trees.sh \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/node_scores_out \
/private/groups/corbettlab/alan/panmap/ \
/private/groups/corbettlab/alan/panmap/panmans/sars_8M.panman \
/private/groups/corbettlab/alan/panmap/panmans/sars_8M.pmai \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/data_sars_8M
I just implemented the full WEPP scoring scheme, which includes updating the rest of the node scores every time a top scoring haplotype was selected. This was not implemented in the previous commit described on 10/17/2025.
The node scoring step, and especially the node score updating step, were not as fast as I'd like, so I spent the last two days making them fast enough to my satisfaction.
Hmmm.. It seems like the penalty of 1/epp^2, which is used in WEPP, might be too aggressive. I tried 1/epp and it
actually doesn't look too bad. I will send this to phoenix and do a couple rounds of testing.
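A quick numeric comparison of the two penalties, just to make the aggressiveness concrete (plain Python, values from the epp observations above):

```python
# 1/epp vs 1/epp^2: the quadratic penalty crushes reads with moderate EPP.
for epp in (1, 5, 25, 33):
    print(f"epp={epp:3d}  1/epp={1/epp:.4f}  1/epp^2={1/epp**2:.6f}")
# epp=25 (typical for MZ722413.1's reads) keeps 4% of full weight under
# 1/epp but only 0.16% under 1/epp^2.
```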
Use grep_true.sh to view the node score ranking of an output nodeScores.tsv file.
cd /private/groups/corbettlab/alan/lab_notebook/panmama/benchmark && \
bash grep_true.sh node_scores_out/sars_optimized_amplicon_10_0_0_1500000_1.nodeScores.tsv data_sars_20K/
Then run it on all the outputs
cd /private/groups/corbettlab/alan/lab_notebook/panmama/benchmark && \
bash grep_true_all.sh node_scores_out/ > node_scores_ranks.txt
Only generating shotgun sequencing for now.
sbatch /private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/gendata.sh \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/data_hiv \
/private/groups/corbettlab/alan/panmap/panmans/hiv_optimized.panman \
/private/groups/corbettlab/alan/panmap/panmans/hiv_optimized.pmai
sbatch /private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/gendata.sh \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/data_rsv \
/private/groups/corbettlab/alan/panmap/panmans/rsv_optimized.panman \
/private/groups/corbettlab/alan/panmap/panmans/rsv_optimized.pmai
and testing node scoring schemes on phoenix
sbatch /private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/run_panmap_score_nodes_hiv_20K.sh \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/node_scores_out_hiv \
/private/groups/corbettlab/alan/panmap/ \
/private/groups/corbettlab/alan/panmap/panmans/hiv_optimized.panman \
/private/groups/corbettlab/alan/panmap/panmans/hiv_optimized.pmai \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/data_hiv
On the SARS 20K tree I found some nodes that have identical sequences but are non-neighbors, and their actual respective neighbors are quite different from themselves. These are not collapsed during the initial tree collapse because they are not neighbors.
One (hopefully easy and not computationally intensive) fix would be to first sort the nodes by their raw read seed matches and raw maximum-placement read seed matches. Groups of nodes with identical scores are potential identical nodes. I can then find their LCA and apply read score deltas to each of them to see if they have identical nodes. I hope this wouldn't be too computationally intensive because I expect identical-score node groups to be small.
I will do pairwise comparisons of all collapsed leaf and internal nodes (including leaf node vs leaf node and leaf node vs internal node, BUT NOT internal node vs internal node)... Using collapsed leaf nodes avoids comparing neighboring nodes, which would unsurprisingly be identical.
Actually, there would be too many comparisons... Comparing all leaf node pairs would take (N * (N - 1)) / 2 comparisons. For N = 20,000, that's 199,990,000 comparisons!
I have a way to reduce the search space: I will use a random node score file and first find potential identical groups that have identical raw read seed matches and identical raw maximum-placement read seed matches.
cd /private/groups/corbettlab/alan/lab_notebook/panmama/pairwise_comparison &&
python3 search_potential_identical.py panmap.nodeScores.tsv > potential_identical_pairs_sars20k.tsv
sbatch pairwise_comparison.sh \
potential_identical_pairs_sars20k.tsv \
/private/groups/corbettlab/alan/lab_notebook/panmama/pairwise_comparison/sars_fasta \
pairwise_comparison_out_sars20k
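For reference, a minimal sketch of the grouping step that search_potential_identical.py performs, assuming the nodeScores.tsv carries node id, raw read seed matches, and raw maximum-placement seed matches as its first three columns (the column layout and header row are assumptions):

```python
import csv
from collections import defaultdict
from itertools import combinations

def potential_identical_pairs(tsv_path):
    """Group nodes by (raw seed matches, raw max-placement matches); yield
    within-group pairs only, instead of all N*(N-1)/2 leaf pairs."""
    groups = defaultdict(list)
    with open(tsv_path) as f:
        next(f, None)  # skip header row (assumed present)
        for row in csv.reader(f, delimiter="\t"):
            node_id, raw, raw_max = row[0], row[1], row[2]
            groups[(raw, raw_max)].append(node_id)
    for members in groups.values():
        if len(members) > 1:  # identical scores -> potentially identical nodes
            yield from combinations(members, 2)
```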
Pairs of potential identical nodes are evenly distributed for each task array, and distance stats are printed to each
task array's individual output. After the job is complete, I will cat all the outputs together.
All the pairwise comparisons are complete and I concatenated each task array's output to a final all_comparisons.tsv
cd /private/groups/corbettlab/alan/lab_notebook/panmama/pairwise_comparison &&
python3 get_identical_groups.py pairwise_comparison_out_sars20k/all_comparisons.tsv > sars_20k_identical_group_indices.tsv
It outputs a tsv file where first column is the sequence ids and the second column is the identical group ids. For now, I skipped over groups containing only internal nodes. Nodes are considered identical if their snps, snps_ambiguous, gaps, and gaps_edge_corrected are all 0 (this can be modified to be more lenient, such as allowing snps_ambiguous or gaps as long as gaps_edge_corrected is 0).
I did find many groups of identical leaf nodes on the SARS 20K tree. I also did the same for RSV 4K and HIV 20K. RSV 4K has far fewer, while HIV 20K actually has a lot more.
I will simulate SARS data that are more realistic. Before, I had randomly selected 100 nodes from the tree. Realistically, wastewater data should contain many similar haplotypes, so I'm going to select random clusters of nodes to simulate reads from.
Using panmap commit c0aeb3a631aa0ab9a1dfb2d149d6d6362a316c58 to select the random clusters of nodes.
For each cluster, I randomly select a node, then use Dijkstra's algorithm to select its closest N neighbors (sketched below). I'm not sure whether to use weighted branches (and how to weight the branch lengths), but for simplicity I'm using branch lengths weighted by the number of k-min-mer updates for now.
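A minimal sketch of the cluster selection, assuming the tree is given as a weighted adjacency map; the real implementation lives in the panmap commit above:

```python
import heapq

def closest_n(adj, start, n):
    """Dijkstra from `start` over {node: [(neighbor, weight), ...]};
    returns the n closest nodes (including start)."""
    dist = {start: 0.0}
    heap = [(0.0, start)]
    selected = []
    while heap and len(selected) < n:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        selected.append(u)
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return selected

adj = {"A": [("B", 1.0), ("C", 4.0)], "B": [("A", 1.0), ("C", 1.5)],
       "C": [("A", 4.0), ("B", 1.5)]}
print(closest_n(adj, "A", 2))  # ['A', 'B']
```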
sbatch /private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/gendata_clustered_nodes.sh \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/data_sars_20K_clustered \
/private/groups/corbettlab/alan/panmap/panmans/sars_20000_twilight_dipper.panman \
/private/groups/corbettlab/alan/panmap/panmans/sars_20000_twilight_dipper.pmai
sbatch /private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/gendata_clustered_nodes.sh \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/data_sars_8M_clustered \
/private/groups/corbettlab/alan/panmap/panmans/sars_8M.panman \
/private/groups/corbettlab/alan/panmap/panmans/sars_8M.pmai
What to do tomorrow:
- Check on read simulation. Resubmit jobs if needed
- Try out how the new read samples look for node selection
- Keep thinking of new ways to select probable nodes
Results look much better using reads simulated from clusters of nodes. On the sars_20K tree, I was able to select 95% to 100% of the true nodes from which the reads were simulated, for both shotgun and amplicon reads. However, it doesn't look very good on the 8M tree: for a random sample I looked at, only 9 out of 50 of the true nodes were selected. Nevertheless, I will do a formal evaluation with replicates on phoenix.
I also devised another scoring method that uses seed weights instead of read weights, where a seed weight is proportional to its rarity on the tree, similar to the rarity of a read's parsimony placement. At first look, it's not bad, but also not as good as the read-weighted scoring method on the 20K tree. I will include this as part of the evaluation described above (a sketch of the idea follows).
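A toy sketch of the seed-weighted idea, assuming a seed's weight is the reciprocal of the number of nodes carrying it; the actual weighting panMAMA uses may differ:

```python
from collections import Counter

def seed_weights(node_seed_sets):
    """Weight each seed by its rarity across the tree: 1 / #nodes carrying it."""
    counts = Counter(s for seeds in node_seed_sets.values() for s in seeds)
    return {s: 1.0 / c for s, c in counts.items()}

def node_score(read_seeds, node_seeds, weights):
    """Score a node by the total weight of seeds it shares with the read."""
    return sum(weights[s] for s in read_seeds & node_seeds)

nodes = {"A": {"s1", "s2"}, "B": {"s2", "s3"}, "C": {"s2", "s4"}}
w = seed_weights(nodes)
print(node_score({"s1", "s2"}, nodes["A"], w))  # rare s1 (1.0) + common s2 (1/3)
```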
sbatch /private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/run_panmap_score_nodes_sars_20K.sh \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/node_scores_out_sars20K \
/private/groups/corbettlab/alan/panmap/ \
/private/groups/corbettlab/alan/panmap/panmans/sars_20000_twilight_dipper.panman \
/private/groups/corbettlab/alan/panmap/panmans/sars_20000_twilight_dipper.pmai \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/data_sars_20K_clustered
sbatch /private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/run_panmap_score_nodes_sars_8M.sh \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/node_scores_out_sars8M \
/private/groups/corbettlab/alan/panmap/ \
/private/groups/corbettlab/alan/panmap/panmans/sars_8M.panman \
/private/groups/corbettlab/alan/panmap/panmans/sars_8M.pmai \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/data_sars_8M_clustered
It seems like the node seed score update step was taking too long on the 8M tree. All of the jobs for the seed-weighted node scoring scheme ran out of time. I will make it faster.
I was wrong about seed-weighted node scoring on 10/30/2025. After doing a formal evaluation, it actually performed better than read-weighted node scoring in more cases than not.
python3 eval_node_scores.py node_scores_out_sars20K/ > node_scores_out_sars20K_all.tsv
(head -n1 node_scores_out_sars20K_all.tsv && tail -n+2 node_scores_out_sars20K_all.tsv | sort -t '_' -k2,2n -k6,6n -k5,5n -k 1,1) > tmp
mv tmp node_scores_out_sars20K_all.tsv
Out of 60 simulated cases:
- 15 cases of 100,000 shotgun reads
- equal: 7
- seed-weight: 3
- read-weight: 5
- 15 cases of 1,500,000 shotgun reads
- equal: 6
- seed-weight: 7
- read-weight: 2
- 15 cases of 100,000 amplicon reads
- equal: 2
- seed-weight: 8
- read-weight: 5
- 15 cases of 1,500,000 amplicon reads
- equal: 8
- seed-weight: 5
- read-weight: 2
I think I will also simulate some PERFECT reads to see how PCR and sequencing errors affect the scoring and node selection. If I get much better node selection on error-free reads, I will spend a little more time on identifying and removing likely errors.
sbatch /private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/gendata_clustered_nodes_perfect.sh \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/data_sars_20K_clustered \
/private/groups/corbettlab/alan/panmap/panmans/sars_20000_twilight_dipper.panman \
/private/groups/corbettlab/alan/panmap/panmans/sars_20000_twilight_dipper.pmai
sbatch /private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/gendata_clustered_nodes_perfect.sh \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/data_sars_8M_clustered \
/private/groups/corbettlab/alan/panmap/panmans/sars_8M.panman \
/private/groups/corbettlab/alan/panmap/panmans/sars_8M.pmai
It seems like the SARS 8M tree has some fundamental limitations in terms of its phylogenetic correctness. This arises primarily from edge sequences: two very similar genomes (a few SNPs apart) can be positioned very far apart on the tree when one genome includes edges in its assembly while the other does not. Edge sequences can also create more seed hits for genomes that include them, so relying on seed hits, per our method, can also reduce specificity.
In terms of what to do next, I think I will do some final testing on the 8M tree to see if perfect reads without any sequencing or PCR errors can improve node selection performance. If they do, then I know the problem is primarily sequencing or PCR errors creating false parsimonious matches, which jack up the scores of false nodes so that they are selected over the true nodes.
After talking with Russ. I will instead start focusing on more divergent genomes, such as HIV and bacterial genomes (TB? etc.). We may also need to build our own panMANs of other divergent pathogen species.
To select out relevant reads from shotgun sequencing, use Kraken or discard reads without significant hits during read scoring in panMAMA.
Last bit of attempt/investigation on the 8M tree
- Finish the new faster seed-weighted node score update.
- Run node selection on error-free reads.
- Evaluate the results
More divergent tree
- Find SARS/HIV/etc. isolate samples and create artificial mixtures to test for demixing and genome deconvolution
- Find real mixed samples for divergent genomes, such as HIV and TB
- Build our own panMANs if necessary
I just implemented a much faster way to compute seed-weighted scores. I also found a bug in the original code, so I will rerun the node scores for all the simulated samples.
sbatch /private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/run_panmap_score_nodes_sars_20K.sh \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/node_scores_out_sars20K \
/private/groups/corbettlab/alan/panmap/ \
/private/groups/corbettlab/alan/panmap/panmans/sars_20000_twilight_dipper.panman \
/private/groups/corbettlab/alan/panmap/panmans/sars_20000_twilight_dipper.pmai \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/data_sars_20K_clustered
job id: 21214699 21217044
sbatch /private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/run_panmap_score_nodes_sars_8M.sh \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/node_scores_out_sars8M \
/private/groups/corbettlab/alan/panmap/ \
/private/groups/corbettlab/alan/panmap/panmans/sars_8M.panman \
/private/groups/corbettlab/alan/panmap/panmans/sars_8M.pmai \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/data_sars_8M_clustered
job id: 21214720 21217989
Also running with perfect samples
sbatch /private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/run_panmap_score_nodes_sars_20K_perfect.sh \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/node_scores_out_sars20K \
/private/groups/corbettlab/alan/panmap/ \
/private/groups/corbettlab/alan/panmap/panmans/sars_20000_twilight_dipper.panman \
/private/groups/corbettlab/alan/panmap/panmans/sars_20000_twilight_dipper.pmai \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/data_sars_20K_clustered
job id: 21215120 21217046
sbatch /private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/run_panmap_score_nodes_sars_8M_perfect.sh \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/node_scores_out_sars8M \
/private/groups/corbettlab/alan/panmap/ \
/private/groups/corbettlab/alan/panmap/panmans/sars_8M.panman \
/private/groups/corbettlab/alan/panmap/panmans/sars_8M.pmai \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/data_sars_8M_clustered
job id: 21215142
Last bit of attempt/investigation on the 8M tree
- Run node selection on error-free reads.
- Evaluate the results
More divergent tree
- Find SARS/HIV/etc. isolate samples and create artificial mixtures to test for demixing and genome deconvolution
- Find real mixed samples for divergent genomes, such as HIV and TB
- Build our own panMANs if necessary
It doesn't seem like the amplicon reads were error-free. Gonna take a look rn.
Yup, just adding --errfree in art_runner.py is not enough. I needed to convert the error-free SAM output from art
to fastqs.
Rerunning scripts to generate real error free amplicon reads.
sbatch \
--mem=30000 \
--cpus-per-task=8 \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/gendata_clustered_nodes_perfect.sh \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/data_sars_20K_clustered \
/private/groups/corbettlab/alan/panmap/panmans/sars_20000_twilight_dipper.panman \
/private/groups/corbettlab/alan/panmap/panmans/sars_20000_twilight_dipper.pmai
job id: 21225800
sbatch \
--mem=60000 \
--cpus-per-task=8 \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/gendata_clustered_nodes_perfect.sh \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/data_sars_8M_clustered \
/private/groups/corbettlab/alan/panmap/panmans/sars_8M.panman \
/private/groups/corbettlab/alan/panmap/panmans/sars_8M.pmai
job id: 21226337
And rerunning panMAMA with the perfect reads
sbatch /private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/run_panmap_score_nodes_sars_20K_perfect.sh \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/node_scores_out_sars20K \
/private/groups/corbettlab/alan/panmap/ \
/private/groups/corbettlab/alan/panmap/panmans/sars_20000_twilight_dipper.panman \
/private/groups/corbettlab/alan/panmap/panmans/sars_20000_twilight_dipper.pmai \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/data_sars_20K_clustered
job id: 21226880
sbatch /private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/run_panmap_score_nodes_sars_8M_perfect.sh \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/node_scores_out_sars8M \
/private/groups/corbettlab/alan/panmap/ \
/private/groups/corbettlab/alan/panmap/panmans/sars_8M.panman \
/private/groups/corbettlab/alan/panmap/panmans/sars_8M.pmai \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/data_sars_8M_clustered
job id: 21226921 21228336
I think I will use the 10 SARS samples Nico found.
Here is a list of potential samples (to be updated)
| Project/Sample ID | Strategy | Layout | #Samples | Data/Paper link |
|---|---|---|---|---|
| PRJNA1118440 | HIV-1 Probe-capture WGS | paired | 82 | Coldbeck-Shackley et al. |
| PRJNA644953 | HIV-1 partial pol gene (Protease, Reverse Transcriptase, and Integrase) RT-PCR | single | 99 | data |
There's not much data available for HIV-2
Use commands below to retrieve fasta sequences from ncbi
Download a single sequence
efetch -db nuccore -id ON284258.1 -format fasta > ON284258.1.fasta
Download multiple sequences from a file (one accession per line)
cat accessions.txt | efetch -db nuccore -format fasta > sequences.fasta
Or download multiple specific accessions
efetch -db nuccore -id ON284258.1,OL672836.1,MW580573.1 -format fasta > sequences.fasta
Perfect amplicon reads (without sequencing/PCR errors) actually did quite well during the node selection step. So I do think I will spend a bit more time handling amplicon errors.
I used Claude to write a watered-down Python version of ivar trim that outputs a tsv file, in which the first column
is the read name and the second column is the name of the primer it's assigned to.
During the query initialization step, I will first process reads grouped by their primer template and mask probable errors using the depth of the primer stack as a reference.
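A rough sketch of the planned masking, assuming reads within a primer stack can be compared column-wise (same template start, equal length) and that a base supported by less than a small fraction of the stack is a probable error; both the pileup model and the threshold are assumptions:

```python
from collections import Counter

def mask_stack(reads, min_frac=0.02):
    """Mask bases whose within-stack frequency falls below min_frac."""
    depth = len(reads)
    columns = [Counter(col) for col in zip(*reads)]  # per-position base counts
    masked = []
    for read in reads:
        out = [b if columns[i][b] / depth >= min_frac else "N"
               for i, b in enumerate(read)]
        masked.append("".join(out))
    return masked

stack = ["ACGT"] * 99 + ["ACTT"]  # one read with a likely error at position 2
print(mask_stack(stack)[-1])  # 'ACNT' -> the singleton base is masked
```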
sbatch \
--mem=30000 \
--cpus-per-task=8 \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/gendata_clustered_nodes_amplicon_stack.sh \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/data_sars_20K_clustered \
/private/groups/corbettlab/alan/panmap/panmans/sars_20000_twilight_dipper.panman \
/private/groups/corbettlab/alan/panmap/panmans/sars_20000_twilight_dipper.pmai
job id: 21271306
sbatch \
--mem=60000 \
--cpus-per-task=8 \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/gendata_clustered_nodes_amplicon_stack.sh \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/data_sars_8M_clustered \
/private/groups/corbettlab/alan/panmap/panmans/sars_8M.panman \
/private/groups/corbettlab/alan/panmap/panmans/sars_8M.pmai
job id: 21326790
For fair comparison, I will also rerun panmap on the same newly simulated datasets with and without error removal:
sbatch /private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/run_panmap_score_nodes_sars_20K.sh \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/node_scores_out_sars20K \
/private/groups/corbettlab/alan/panmap/ \
/private/groups/corbettlab/alan/panmap/panmans/sars_20000_twilight_dipper.panman \
/private/groups/corbettlab/alan/panmap/panmans/sars_20000_twilight_dipper.pmai \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/data_sars_20K_clustered
job id: 21327409
sbatch /private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/run_panmap_score_nodes_sars_8M.sh \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/node_scores_out_sars8M \
/private/groups/corbettlab/alan/panmap/ \
/private/groups/corbettlab/alan/panmap/panmans/sars_8M.panman \
/private/groups/corbettlab/alan/panmap/panmans/sars_8M.pmai \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/data_sars_8M_clustered
job id: 21327420
sbatch /private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/run_panmap_score_nodes_sars_20K_error_detect.sh \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/node_scores_out_sars20K \
/private/groups/corbettlab/alan/panmap/ \
/private/groups/corbettlab/alan/panmap/panmans/sars_20000_twilight_dipper.panman \
/private/groups/corbettlab/alan/panmap/panmans/sars_20000_twilight_dipper.pmai \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/data_sars_20K_clustered
job id: 21368167
sbatch /private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/run_panmap_score_nodes_sars_8M_error_detect.sh \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/node_scores_out_sars8M \
/private/groups/corbettlab/alan/panmap/ \
/private/groups/corbettlab/alan/panmap/panmans/sars_8M.panman \
/private/groups/corbettlab/alan/panmap/panmans/sars_8M.pmai \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/data_sars_8M_clustered
job id: 21368178 21368977
sbatch /private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/run_panmap_score_nodes_sars_20K.sh \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/node_scores_out_sars20K \
/private/groups/corbettlab/alan/panmap/ \
/private/groups/corbettlab/alan/panmap/panmans/sars_20000_twilight_dipper.panman \
/private/groups/corbettlab/alan/panmap/panmans/sars_20000_twilight_dipper.pmai \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/data_sars_20K_clustered
job id: 21476247
sbatch /private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/run_panmap_score_nodes_sars_8M.sh \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/node_scores_out_sars8M \
/private/groups/corbettlab/alan/panmap/ \
/private/groups/corbettlab/alan/panmap/panmans/sars_8M.panman \
/private/groups/corbettlab/alan/panmap/panmans/sars_8M.pmai \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/data_sars_8M_clustered
job id: 21476259
sbatch /private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/run_panmap_score_nodes_sars_20K_error_detect.sh \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/node_scores_out_sars20K \
/private/groups/corbettlab/alan/panmap/ \
/private/groups/corbettlab/alan/panmap/panmans/sars_20000_twilight_dipper.panman \
/private/groups/corbettlab/alan/panmap/panmans/sars_20000_twilight_dipper.pmai \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/data_sars_20K_clustered
job id: 21476275
sbatch /private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/run_panmap_score_nodes_sars_8M_error_detect.sh \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/node_scores_out_sars8M \
/private/groups/corbettlab/alan/panmap/ \
/private/groups/corbettlab/alan/panmap/panmans/sars_8M.panman \
/private/groups/corbettlab/alan/panmap/panmans/sars_8M.pmai \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/data_sars_8M_clustered
job id: 21476287 21492432 21497204
sbatch /private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/run_panmap_score_nodes_sars_20K_perfect.sh \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/node_scores_out_sars20K \
/private/groups/corbettlab/alan/panmap/ \
/private/groups/corbettlab/alan/panmap/panmans/sars_20000_twilight_dipper.panman \
/private/groups/corbettlab/alan/panmap/panmans/sars_20000_twilight_dipper.pmai \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/data_sars_20K_clustered
job id: 21476302
sbatch /private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/run_panmap_score_nodes_sars_8M_perfect.sh \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/node_scores_out_sars8M \
/private/groups/corbettlab/alan/panmap/ \
/private/groups/corbettlab/alan/panmap/panmans/sars_8M.panman \
/private/groups/corbettlab/alan/panmap/panmans/sars_8M.pmai \
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/data_sars_8M_clustered
job id: 21476314
So the read-seed-weighted scoring method didn't work out well. While it performed better than the read-weighted method on the SARS 20K tree, it's comparably worse on the SARS 8M tree. During my discussion with Russ and Alex, we considered additionally weighting the reads with phylogenetic information (similar to autolin... information theory... how much phylogenetic information does a read's parsimony tell me?). I had thought about something like this but decided not to pursue it because the trees are not quite good enough.
Wait... Now thinking back on it, since SARS 8M tree was built using UShER MAT, the tree topology might be good enough to do this.
- Investigate the phylogeny of nodes that reads are parsimonious to on the 8M tree
- If it looks good: pursue phylogenetic information approach mentioned above
- Restore EM function on panMAMA
- Test on more divergent genomes!
- Simulate RSV amplicon reads and compare panMAMA to WEPP
  - Figure out from the art_illumina page how to simulate amplicon reads
  - Write a script to simulate amplicon reads for RSV and other more divergent genomes
- Modify WEPP to also use shotgun reads
- Test if WEPP can run with HIV shotgun reads and compare to panMAMA
- Simulate RSV amplicon reads and compare panMAMA to WEPP
After pulling the WEPP docker...
Enter docker container:
docker run -it pranavgangwar/wepp:latest
Use modified scripts for shotgun sequencing:
docker run -it \
-v /home/alan/tools/WEPP/src/WEPP/qc_preprocess.py:/WEPP/src/WEPP/qc_preprocess.py \
-v /home/alan/tools/WEPP/config/config.yaml:/WEPP/config/config.yaml \
-v /home/alan/tools/WEPP/workflow/rules/qc.smk:/WEPP/workflow/rules/qc.smk \
-v /home/alan/tools/WEPP/shotgun_reads:/data \
-v /scratch1/alan/hiv_usher:/hiv_usher \
pranavgangwar/wepp:latest
Run with SHOTGUN=True tag inside container:
snakemake --config DIR=shotgun_reads FILE_PREFIX=test_run PRIMER_BED=RSVA_all_primers_best_hits.bed TREE=hiv_K03455.pb REF=K03455.fasta CLADE_IDX=-1 SHOTGUN=True --cores 32 --use-conda
python3 calculate_distance_score.py \
--tree-in test/hiv_K03455.pb \
--true-in test/true_abundance.tsv \
--estm-in test/estm_abundance_wepp.tsv \
--tool WEPP \
--fasta-dir /scratch1/alan/panmania/backup/panman/panmans/info/hiv_original/ \
--prefix test
I'm actually quite unsure about how to compare panMAMA and WEPP... It's easy if an estimated haplotype is a sampled
haplotype because I can either do mafft or use the usher MAT to calculate the distance between the estimated haplotype
and its closest true haplotype. The problem is when an estimated haplotype is an internal node because panMAN and MAT
do not have the same internal nodes. panMAN also has the full genomic sequences of the internal nodes while MAT only has
the SNPs with respect to the reference. So I'm not sure how to deal with gaps....
I also noticed that bte.get_haplotype(), which "returns the complete set of mutations (haplotype) the indicated node
has with respect to the reference," also has mutations at positions where a node has a gap with respect to the
reference. I think this might be due to how bte.get_haplotype() gets the mutations. It might accumulate
all the branch mutations from the root to the target node, and some of the branch mutations on the path are inferred
mutations on the gap coordinates of the target node. Additionally, when I place the same sample onto the tree after
changing the sample ID, the new sample's mutations from bte.get_haplotype() don't have any mutations at the gapped
positions.... Very weird
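A toy model of what I suspect bte.get_haplotype() does; this is my guess at the behavior, not bte's actual code:

```python
def get_haplotype(path_mutations):
    """path_mutations: list (root -> node) of {position: alt_base} per branch.
    Accumulate branch mutations, letting later branches override earlier ones."""
    haplotype = {}
    for branch in path_mutations:
        haplotype.update(branch)
    return haplotype

# An inferred mutation at position 250 on an internal branch survives into the
# reported haplotype even if the target genome is gapped at 250:
path = [{100: "T"}, {250: "G"}, {100: "C"}]
print(get_haplotype(path))  # {100: 'C', 250: 'G'}
```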
- Remind everybody the background and purpose of panMAMA
- Same as the seminar talk
- Transition to the problems I've solved and am still solving
- New index building function: show a quick overview of the method and a graph of the time it takes to index each genome
- Improve scoring time
- de-duplicating reads
- Multi-threading
- collapsing nodes
- removing seed mutations within each node
- fast mode
- remove most of the unnecessary block mutations
- Memory problem
- Even with score-annotated index, memory still seems to be a problem for sars 8M tree
- Use de bruijn graph to group reads with similar seeds together and store neighboring reads deltas
- Select probable nodes
- Why overlap coefficients no longer work
- Alternative methods
- WEPP method
- Can we make it better?
- seed-weighted method -> nope
- read-seed-weighted method -> better than seed-weighted.. but still nope
- what to do next?
- Various panMAN-related problems
- panGraph sucks
- explain what panGraph is and how panMAN uses panGraph
- F'd up block mutations
- How I somewhat fixed them
- Identical samples are not close to each other
- Ways to fix them
- Should-be-sister nodes are not close because their genome edges are also included in the construction
Lab meeting yesterday was a success.
I just wrote a script on silverbullet that makes it easier to run WEPP.
I will run it on all the hiv samples I have on phoenix and compare the results between WEPP and panMAMA
I ran all of the simulated HIV samples I have on WEPP and compared the results to panMAMA. Weighted Haplotype/Peak distances were calculated using calculate_distance_score.py.
Gathered all the measurements:
cd /private/groups/corbettlab/alan/lab_notebook/panmama/WEPP/WEPP_results/results
(echo -e "hap\trep\tmama_whd\tmama_wpd\twepp_whd\twepp_wpd" && for hap in 1 2 3 5 10; do for rep in 1 2 3 4 5 6 7 8 9 10; do wepp_distances=$(tail -n+2 "hiv20000_${hap}hap-a_60000_rep${rep}/hiv_${hap}hap_rep${rep}.wepp.distance_sum.tsv" | cut -f 2,3); mama_distances=$(tail -n+2 "hiv20000_${hap}hap-a_60000_rep${rep}/hiv_${hap}hap_rep${rep}.mama.distance_sum.tsv" | cut -f 2,3); echo -e -n "${hap}\t${rep}\t${mama_distances}\t"; echo -e "$wepp_distances"; done; done;) > ../../distances.tsv
| hap | rep | mama_whd | mama_wpd | wepp_whd | wepp_wpd |
|---|---|---|---|---|---|
| 1 | 1 | 0.000 | 0.000 | 0.000 | 0.000 |
| 1 | 2 | 0.000 | 0.000 | 0.000 | 0.000 |
| 1 | 3 | 0.000 | 0.000 | 0.000 | 0.000 |
| 1 | 4 | 0.000 | 0.000 | 0.000 | 0.000 |
| 1 | 5 | 0.000 | 0.000 | 0.000 | 0.000 |
| 1 | 6 | 0.000 | 0.000 | 0.000 | 0.000 |
| 1 | 7 | 0.000 | 0.000 | 154.000 | 154.000 |
| 1 | 8 | 0.000 | 0.000 | 0.000 | 0.000 |
| 1 | 9 | 0.000 | 0.000 | 0.000 | 0.000 |
| 1 | 10 | 0.000 | 0.000 | 0.000 | 0.000 |
| 2 | 1 | 0.000 | 0.274 | 0.000 | 0.290 |
| 2 | 2 | 0.000 | 0.000 | 0.000 | 0.016 |
| 2 | 3 | 0.000 | 0.000 | 0.000 | 0.855 |
| 2 | 4 | 0.000 | 0.000 | 0.000 | 10.817 |
| 2 | 5 | 0.000 | 0.000 | 0.000 | 0.000 |
| 2 | 6 | 0.000 | 0.000 | 11.400 | 13.376 |
| 2 | 7 | 0.000 | 0.000 | 0.000 | 5.442 |
| 2 | 8 | 0.000 | 0.157 | 0.000 | 0.740 |
| 2 | 9 | 0.000 | 0.000 | 0.000 | 0.000 |
| 2 | 10 | 0.000 | 0.000 | 0.000 | 0.000 |
| 3 | 1 | 0.000 | 0.000 | 0.000 | 9.486 |
| 3 | 2 | 0.000 | 0.000 | 0.300 | 7.708 |
| 3 | 3 | 0.000 | 0.000 | 11.100 | 13.487 |
| 3 | 4 | 0.000 | 0.000 | 49.600 | 24.703 |
| 3 | 5 | 0.000 | 0.109 | 5.700 | 9.067 |
| 3 | 6 | 0.000 | 0.000 | 3.400 | 13.945 |
| 3 | 7 | 0.000 | 0.000 | 0.600 | 38.168 |
| 3 | 8 | 0.000 | 0.053 | 0.200 | 21.974 |
| 3 | 9 | 0.000 | 0.000 | 0.000 | 26.607 |
| 3 | 10 | 0.000 | 0.147 | 0.000 | 0.068 |
| 5 | 1 | 0.000 | 0.000 | 0.000 | 37.111 |
| 5 | 2 | 0.250 | 0.250 | 0.050 | 14.396 |
| 5 | 3 | 0.000 | 0.000 | 0.000 | 21.012 |
| 5 | 4 | 0.000 | 0.000 | 15.400 | 75.145 |
| 5 | 5 | 0.000 | 0.000 | 4.300 | 45.224 |
| 5 | 6 | 0.000 | 0.000 | 0.000 | 78.194 |
| 5 | 7 | 0.000 | 0.000 | 0.000 | 32.938 |
| 5 | 8 | 0.000 | 0.000 | 0.500 | 21.572 |
| 5 | 9 | 0.000 | 0.000 | 0.000 | 14.293 |
| 5 | 10 | 0.000 | 0.000 | 2.000 | 46.489 |
| 10 | 1 | 1.800 | 1.754 | 3.150 | 47.646 |
| 10 | 2 | 0.100 | 0.122 | 1.850 | 67.739 |
| 10 | 3 | 0.000 | 0.556 | 3.700 | 68.996 |
| 10 | 4 | 0.100 | 0.151 | 5.950 | 60.707 |
| 10 | 5 | 0.000 | 0.000 | 1.200 | 109.556 |
| 10 | 6 | 0.000 | 0.000 | 3.750 | 71.918 |
| 10 | 7 | 0.000 | 0.000 | 0.000 | 81.815 |
| 10 | 8 | 0.000 | 0.000 | 0.300 | 102.034 |
| 10 | 9 | 0.050 | 0.133 | 300.900 | 38.395 |
| 10 | 10 | 0.000 | 0.389 | 0.050 | 37.351 |
Results show that panMAMA outperforms WEPP for HIV samples, likely because HIV genomes are too divergent to be demixed using SNP information alone. PanMAMA shows better performance in both the WHD and WPD metrics, which measure sensitivity and specificity, respectively. The improvement is particularly significant for specificity.
Yatish and Sumit sent us the vertebrate mito panMAT. I generated a metadata file containing the samples' taxonomic information. Out of the total 15,655 samples, there are 7,893 unique species, 854 families, and 157 orders.
I visually inspected the panMAT using Taxonium and found that, while most samples cluster as expected, there are a non-trivial number of outliers, and several expected clades are fragmented into multiple separate clusters.
- The highlighted nodes represent order Primates. There are two separate small primate clusters that are positioned distinctly from the major primate cluster. Additionally, non-human great apes are also quite far away from the human samples.
(For validation, I aligned a human mitochondrial sample (J01415.2) to a pileated gibbon sample (AB504749.1) from the nearest clade to the human sample and to a chimpanzee sample (D38113.1) using MAFFT. The genetic distance between human and chimpanzee mitochondrial sequences is about half that between human and gibbon sequences.)
- The highlighted nodes belong to order Anura, which contains all frogs and toads. The Anura samples are distributed in several distinct clusters. There are also multiple non-Anura species, such as caecilians, reptiles, and rodents, interspersed within the Anura-major clades.
I then compared tree topology between the mito panMAT and the TimeTree. I was able to identify 6,231 overlapping unique species between the two trees. After removing duplicate samples and excluding non-overlapping species, I calculated a normalized Robinson-Foulds distance of 0.302 (unweighted RF distance: 7,380; maximum RF distance: 24,456).
I wonder if these discrepancies are due to differences in the recorded starting positions of the circular mitochondrial genome. In many MAFFT alignments I have inspected, both samples have very similar sequence length, but one sample often has an extended gap sequence at the beginning of the aligned sequence while the other sample has an extended gap sequence at the end, which is what you'd expect if two circular genomes are linearized at different positions.
I did a quick experiment on the human mitochondrial sample (J01415.2) and the chimpanzee sample (D38113.1) mentioned above. MAFFT alignment of the original NCBI fastas showed that the two fastas differ by 1380 SNPs and 1160 gaps. I then manually rotated the human mitochondrial sequence by moving the region at the beginning where the chimpanzee sequence contains a stretch of gaps to the end of the sequence. MAFFT alignment of the rotated human fasta showed 1,456 SNPs and 26 gaps, or ~8.6% difference, which is consistent with the literature.
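The rotation itself is trivial; a minimal sketch, where the offset would come from the length of the leading gap stretch observed in the MAFFT alignment:

```python
def rotate(seq: str, offset: int) -> str:
    """Re-linearize a circular genome at a different origin."""
    offset %= len(seq)
    return seq[offset:] + seq[:offset]

print(rotate("AAACCCGGGTTT", 3))  # 'CCCGGGTTTAAA'
```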
I also confirmed from the v_mtdna.aln.gz file that neither the human mito sample nor the chimpanzee sample was rotated for the construction of the panMAT.
- Investigate the phylogeny of nodes that reads are parsimonious to on the 8M tree
- If it looks good: pursue phylogenetic information approach mentioned above
- implement syncmer only index
- Restore EM function on panMAMA
- Modify WEPP to also use shotgun reads
- Test if WEPP can run with HIV shotgun reads and compare to panMAMA
I restored the EM function on panMAMA. For now, I will add the option to either use overlap coefficients or read-weighted node scores to select probable nodes.
- Investigate the phylogeny of nodes that reads are parsimonious to on the 8M tree
- If it looks good: pursue phylogenetic information approach mentioned above
- implement syncmer only index
- Restore EM function on panMAMA
- Modify WEPP to also use shotgun reads
- Test if WEPP can run with HIV shotgun reads and compare to panMAMA
Index building broke when I set l to 1 or 2... Due to some conceptual misunderstanding, I thought l would have to be at least 3, so I coded the index building function to assume l >= 3...
Now the index also builds correctly for l = 1 and l = 2. l = 1 is equivalent to implementing a syncmer-only index.
I will make a new to-do list for the next couple of weeks
- Formalize evaluation metrics
- Update seed sketching function to take in any seed parameters
- Compare accuracy of different combinations of l, k, s using the evaluation metric
- Implement t offsets for syncmers
- Work on SAPP application
- Look into ancient DNA read error correction
I will write a skeleton slurm script that will generate combinations of indices for panMAMA.
I have also just modified the seed sketching function to sketch syncmers with all l, k, s, t, and open values.
Simulate reads
sbatch /private/groups/corbettlab/alan/lab_notebook/panmama/v_mtdna/gendata.sh \
/data/tmp/ \
/private/groups/corbettlab/alan/lab_notebook/panmama/v_mtdna/input_data/v_mtdna.panman \
/private/groups/corbettlab/alan/lab_notebook/panmama/v_mtdna/input_data/v_mtdna.pmai
Run panmap
sbatch /private/groups/corbettlab/alan/lab_notebook/panmama/v_mtdna/run_panmap.sh \
/private/groups/corbettlab/alan/lab_notebook/panmama/v_mtdna/out \
/private/groups/corbettlab/alan/panmap/ \
/private/groups/corbettlab/alan/lab_notebook/panmama/v_mtdna/input_data/v_mtdna.panman \
/private/groups/corbettlab/alan/lab_notebook/panmama/v_mtdna/input_data/v_mtdna.pmai \
/private/groups/corbettlab/alan/lab_notebook/panmama/v_mtdna/sim_reads
I've also just added a simulate_and_run.py (real path:
/private/groups/corbettlab/alan/lab_notebook/panmama/benchmark/simulate_and_run.py) that will simulate reads and run
panMAMA then delete all intermediate files.
Run
python3 ../../benchmark/simulate_and_run.py \
../input_data/v_mtdna.panman \
../input_data/v_mtdna.pmai \
1 \
--selection node_scores
Out of the 652 samples that we got from Bianca, 520 samples do not align to the mammoth mito at all.
I grabbed a random sample ar1_10.mammoth.fq and tried it on panMAMA.
I first aligned it to a mammoth sample JF912199.1 on the tree and found that only two reads align to it, but the
alignment quality is very good. One read perfectly matches the reference while the other has only one SNP difference.
However, the node scores and the abundance estimation don't look too good... Either the sample is heavily mixed with
other haplotypes or the program is not performing correctly. I grabbed the most abundant haplotype from the output,
MW232459.1, and used bwa to align all the reads to it, and none of the reads mapped. I then aligned the reads
to other haplotypes, and most reads that do align appear to have very low complexity, i.e., they are highly repetitive, aka junk.
I think I will do some pre-processing to remove low-complexity reads. Additionally, I should also toss out reads whose maximum parsimony is below a certain threshold.
I will try bbduk to remove reads with low complexity
Using bbduk and setting the entropy threshold to 0.7 seems good enough:
~/tools/BBTools/bbduk.sh in=ar1_10.mammoth.fq out=filtered.fq outm=low_complexity.fq entropy=0.7
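For intuition, a rough Python approximation of the k-mer entropy that bbduk thresholds; bbduk's exact windowed formula differs, and this only illustrates why repetitive reads score low:

```python
import math
from collections import Counter

def kmer_entropy(seq: str, k: int = 5) -> float:
    """Shannon entropy of the read's k-mer composition, normalized to [0, 1]."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    total = len(kmers)
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    max_h = math.log2(total) if total > 1 else 1.0  # cap: all k-mers distinct
    return h / max_h

print(kmer_entropy("ATATATATATATATATATAT"))  # 0.25, junk -> filtered at 0.7
print(kmer_entropy("ACGTTGCAGCTAGGCTAACG"))  # 1.0, complex -> kept
```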
I will set a parameter to throw away reads whose maximum parsimony score is less than a specified threshold (default: 0.5 of the total seeds).
Now with ar1_10.mammoth.fq, two reads remain after removing reads with low maximum parsimony scores, and they are reads
from the mammoth genomes
I will try with another sample, cr5_10.mammoth.fq... Also worked. Very good. I will now go on phoenix and run the
pipeline with all the reads from Bianca.
panmama_ancient_mito.sh maps all the read samples to the vertebrate mito tree. Outputs are in ancient_mito_out/. Job
array id: 22686096.
panmama_ancient_mammoth_mito.sh maps all the read samples to the mammoth mito tree. Outputs are in
ancient_mammoth_mito_out/. Job array id: 22687109.
While many samples were estimated to contain only mammoth mito, there still seem to be some contaminants that were not filtered out. I will also have panmama output the reads assigned to each haplotype estimated to be present, as well as more information on the parsimony scores of the reads remaining after the low-score removal step.
Rerun panmama_ancient_mito.sh (22756229) and panmama_ancient_mammoth_mito.sh (22756235)
I may have to manually look at each read that's assigned outside of the Elephantidae family. I will first filter out the good outputs where all reads are placed to the Elephantidae family.
cd /private/groups/corbettlab/alan/lab_notebook/panmama/v_mtdna
python3 gen_stats_mito_out.py ancient_mito_out
When I was manually investigating samples with reads placing parsimoniously outside of the mammoth "clade", I found this
sample ar1_2 that has some reads placing on other non-mammoth Elephantidae nodes, specifically in the genus
Loxodonta. This may or may not indicate that the sample contains reads from a mixture of haplotypes, because the reads
might all map better to an ancestor of both Loxodonta and Mammuthus. Will look into this further when we have a
better tree.
I visually inspected the outputs of several samples, and panMAMA and pathphynder have either the same placement or very close placements. Since pathphynder labels nodes differently from panMAN, I will need to first convert the node names before I can write a script to compare the outputs of all the samples.
Sumit just sent us the newly improved vertebrate mito panman. I also found and fixed a bug in the early exit of panMAMA
when no reads have a significant placement score. Rerunning panMAMA on the mammoth tree (22920510) and the new vertebrate
tree (22920457).
Among cases where both PanMAMA and Pathphynder produced results, PanMAMA and Pathphynder identified identical nodes in approximately 70% of samples, and nodes differing by a single branch in 14% of samples.
I will also try to run pathphynder myself to see how fast it is.
Referring to panmama/v_mtdna/mito_assignment_stats.tsv, I still see what seem to be either mixed or unfiltered read
contaminations in the panMAMA results using the vertebrate mito tree. I think the latter is probably true. I will take a
deeper look at what the reads assigned outside of the mammoth "clade" look like.
Look for key words like: eDNA, metagenome, metagenomics, and commonly targeted mitochondrial regions like 16S, COI, Cyt b and 12S
COI, 12S, and Cyt b can better discriminate metazoan species than 16S, as 16S can also target bacteria and swamp the data with bacterial reads. On the other hand, we can try to use panMAMA to filter 16S reads for metazoans or other organisms of interest.
From Claude:
Fish-focused surveys: MiFish primers (12S) are the clear winner
General vertebrate diversity (fish, amphibians, mammals, birds): 12S with broader primers (e.g., Riaz primers, Vert-12S)
Terrestrial vertebrates: 12S or 16S can work; some researchers prefer 16S for mammals
High-resolution species identification with tissue samples: COI or Cyt b
Multi-kingdom surveys: 18S (nuclear, eukaryotic universal) or combined 16S bacterial + 12S/18S eukaryotic approaches
Are we able to filter out the HIV reads from the rest of the reads we don't care about?
SRX23959018: metabarcoding of fisheries samples: eDNA in catch water to estimate species composition
Can look at the entire bioproject. Filter out junk and estimate what's in the fisheries?
I tried different combinations of the --discard parameter in panMAMA (discard reads with placement scores lower than a
threshold) and the entropyk value in bbduk (kmer window size used to measure entropy) to see which combination produces
better read assignments. I've also added more information on the intersection of assigned reads between panMAMA and
Pathphynder.
I will come up with a metric to measure the performance and try out different hyperparameters for maximum performance on both the vertebrate tree and the mammoth tree.
I calculated the precision, recall, and F1 scores for the different --discard parameters in panMAMA and entropyk
values in bbduk.
I tried entropyk values of 4, 5, and 6 at --discard thresholds of 0.4, 0.5, 0.6, 0.7, and 0.8.
It seems like entropyk=5 is the obvious winner. Between discard thresholds of 0.6 and 0.8, there's a tradeoff
between precision and recall, which is not surprising. The F1 score shows that a discard threshold of 0.7 has a slight
edge.
Precision-recall curve using entropyk=5
Below is the raw data
| MPS | Entropyk | Precision | Recall | F1 |
|---|---|---|---|---|
| 0.4 | 4 | 0.1135 | 0.3157 | 0.1670 |
| 0.4 | 5 | 0.2928 | 0.9613 | 0.4489 |
| 0.4 | 6 | 0.0795 | 0.4631 | 0.1357 |
| 0.5 | 4 | 0.2605 | 0.3068 | 0.2818 |
| 0.5 | 5 | 0.6778 | 0.9636 | 0.7958 |
| 0.5 | 6 | 0.1522 | 0.4476 | 0.2272 |
| 0.6 | 4 | 0.3025 | 0.3035 | 0.3030 |
| 0.6 | 5 | 0.8676 | 0.9547 | 0.9091 |
| 0.6 | 6 | 0.2449 | 0.4403 | 0.3148 |
| 0.7 | 4 | 0.3094 | 0.2950 | 0.3020 |
| 0.7 | 5 | 0.9519 | 0.9298 | 0.9407 |
| 0.7 | 6 | 0.3572 | 0.4324 | 0.3912 |
| 0.8 | 4 | 0.3184 | 0.2830 | 0.2997 |
| 0.8 | 5 | 0.9728 | 0.8712 | 0.9192 |
| 0.8 | 6 | 0.4380 | 0.4179 | 0.4277 |
| MPS | Entropyk | Precision | Recall | F1 |
|---|---|---|---|---|
| 0.3 | 5 | 0.0573 | 0.9527 | 0.1081 |
| 0.4 | 5 | 0.2928 | 0.9613 | 0.4489 |
| 0.5 | 5 | 0.6778 | 0.9636 | 0.7958 |
| 0.6 | 5 | 0.8676 | 0.9547 | 0.9091 |
| 0.7 | 5 | 0.9519 | 0.9298 | 0.9407 |
| 0.8 | 5 | 0.9728 | 0.8712 | 0.9192 |
| 0.9 | 5 | 0.9793 | 0.7340 | 0.8391 |
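Sanity check: F1 is the harmonic mean of precision and recall, reproduced here for the best row (MPS=0.7, entropyk=5):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.9519, 0.9298), 4))  # 0.9407, matching the table
```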
There might be some reads that could be in the Elephantidae family but are not mappable when a Mammuthus primigenius reference is used. I will need to look at the samples individually.
I will run the same analysis again when Sumit sends a vertebrate tree with more primigenius reads merged in.
I'm starting to look at the samples individually to look for reads potentially from other Elephantidae family members.
| Interesting samples |
|---|
| ar1_2 |
Actually, another feature I can add for read filtering is throwing away reads that map to spurious nodes on the tree, based on node phylogenetic significance. Will try it out maybe tomorrow. And maybe skip the EM step entirely when filtering and assigning reads on divergent trees like this one?
I think I will try to make a pretty figure today. Strategy:
- Condense the tree by families
- subsample the family tree
- plot the family tree and highlight the Elephantidae family
- plot the detailed Elephantidae family tree and annotate with reads assigned
Footnotes
- This is a reminder created before I deleted bzhan146@emerald:/private/groups/corbettlab/alan/panmap on phoenix. ↩







