A Streamlit application that estimates how likely each DNA sequence is to have been generated by a reference Markov model. It uses a 2nd-order Markov model (trinucleotides) to compute the log-probability of each DNA sequence and compares it against a simulated distribution to produce either a Z-score or a rank percentile.
This app:
- Builds a 2nd-order Markov model from a reference genome (FASTA).
- Calculates a log-probability for each input DNA sequence under that model.
- Simulates multiple random sequences of the same length to derive a distribution (mean & std or percentile).
- Provides a score (either a Z-score or Rank percentile).
- Offers PDF and Excel reports, plus optional distance metric comparison.
- The user uploads (or uses a default) reference genome in FASTA format.
- Counts 2-mers and 3-mers in the reference genome.
This process computes probabilities:
- P(2-mer)
- P(3-mer | 2-mer)
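The counting step above can be sketched as follows. This is a minimal illustration, not the app's actual code: the names `count_kmers` and `build_model` are hypothetical, the genome is assumed to be a single string, and k-mers containing symbols other than A, C, G, T are skipped (an assumption; the app may treat them differently).

```python
from collections import Counter

VALID = set("ACGT")

def count_kmers(genome, k):
    """Count overlapping k-mers, skipping any that contain non-ACGT symbols."""
    counts = Counter()
    for i in range(len(genome) - k + 1):
        kmer = genome[i:i + k]
        if set(kmer) <= VALID:
            counts[kmer] += 1
    return counts

def build_model(genome):
    """Return (p2, p3): p2[dimer] = P(2-mer), p3[dimer][base] = P(base | dimer)."""
    genome = genome.upper()
    dimers = count_kmers(genome, 2)
    trimers = count_kmers(genome, 3)

    total = sum(dimers.values())
    p2 = {d: c / total for d, c in dimers.items()}

    # Normalise each 3-mer count by the total of 3-mers sharing its 2-base prefix.
    prefix_totals = Counter()
    for t, c in trimers.items():
        prefix_totals[t[:2]] += c
    p3 = {}
    for t, c in trimers.items():
        p3.setdefault(t[:2], {})[t[2]] = c / prefix_totals[t[:2]]
    return p2, p3

# Toy reference, far shorter than a real genome.
p2, p3 = build_model("ACGTACGTACGA")
```

Here `p3["CG"]` holds the conditional distribution of the base that follows the 2-mer `CG` in the toy reference.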
The log-probability for a sequence S of length L is computed as:
log P(S) = log P(S[1..2]) + sum from i = 3 to L of log P(S[i] | S[i-2], S[i-1])
Computations are done in log space to safely handle small probabilities.
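As a rough sketch of this formula, the function below sums log-probabilities over a sequence. The name `log_prob` and the `floor` parameter (a tiny probability substituted for 2-mers or 3-mers never seen in the reference) are assumptions made for illustration; the app may handle unseen k-mers differently. A uniform toy model stands in for a real reference.

```python
import math
from itertools import product

def log_prob(seq, p2, p3, floor=1e-12):
    """log P(S) = log P(S[1..2]) + sum over i >= 3 of log P(S[i] | S[i-2], S[i-1])."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("sequence must contain at least 2 bases")
    # Initial 2-mer term.
    logp = math.log(max(p2.get(seq[:2], 0.0), floor))
    # One conditional term per remaining base (0-based index 2 onward).
    for i in range(2, len(seq)):
        context, base = seq[i - 2:i], seq[i]
        logp += math.log(max(p3.get(context, {}).get(base, 0.0), floor))
    return logp

# Uniform toy model: every 2-mer has P = 1/16, every transition has P = 1/4.
bases = "ACGT"
uniform_p2 = {a + b: 1 / 16 for a, b in product(bases, repeat=2)}
uniform_p3 = {d: {b: 0.25 for b in bases} for d in uniform_p2}

score = log_prob("ACGTAC", uniform_p2, uniform_p3)
```

Under the uniform model every base contributes log(1/4), so `score` equals `6 * math.log(0.25)`. Summing logs instead of multiplying probabilities is what keeps long sequences from underflowing to zero.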
- For a sequence of length L, the app generates N_sim random sequences.
- These form an empirical distribution of log P(S) / L.
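The simulation step could look roughly like the sketch below. It assumes the random sequences are sampled from the Markov model itself, which the text above does not state explicitly. The helper names and the biased toy model are hypothetical, and the sampler assumes every 2-mer context has at least one successor in `p3`.

```python
import math
import random
import statistics
from itertools import product

def sample_sequence(length, p2, p3, rng):
    """Draw one sequence (length >= 2): first 2-mer from p2, then base by base from p3."""
    dimers = list(p2)
    seq = rng.choices(dimers, weights=[p2[d] for d in dimers])[0]
    while len(seq) < length:
        nxt = p3[seq[-2:]]
        options = list(nxt)
        seq += rng.choices(options, weights=[nxt[b] for b in options])[0]
    return seq

def per_base_log_prob(seq, p2, p3):
    """Length-normalised score log P(S) / L."""
    logp = math.log(p2[seq[:2]])
    for i in range(2, len(seq)):
        logp += math.log(p3[seq[i - 2:i]][seq[i]])
    return logp / len(seq)

# Toy model: uniform 2-mers, AT-biased transitions (illustrative values only).
p2 = {a + b: 1 / 16 for a, b in product("ACGT", repeat=2)}
p3 = {d: {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4} for d in p2}

rng = random.Random(0)   # fixed seed so the run is repeatable
n_sim, length = 200, 50
sims = [per_base_log_prob(sample_sequence(length, p2, p3, rng), p2, p3)
        for _ in range(n_sim)]
mu = statistics.mean(sims)
sigma = statistics.stdev(sims)
```

`sims` is the empirical distribution of log P(S) / L for length-50 sequences, and `mu` and `sigma` summarise it.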
- Z-score: Z = (log P(S) / L - μ) / σ, where μ and σ are the mean and standard deviation of the simulated log P(S) / L values.
- Rank Percentile: the percentile of simulated values that fall below the computed log P(S) / L.
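Both scores can be computed from a list of simulated values in a few lines. The sketch below is illustrative only: the function names are hypothetical, the sample standard deviation is used (the app may use the population form), and the simulated values are made up for the example.

```python
import statistics

def z_score(observed, simulated):
    """Z = (observed - mean) / std over the simulated log P(S) / L values."""
    mu = statistics.mean(simulated)
    sigma = statistics.stdev(simulated)  # sample std; an assumption
    if sigma == 0:
        raise ValueError("simulated values have zero variance")
    return (observed - mu) / sigma

def rank_percentile(observed, simulated):
    """Percentage of simulated values strictly below the observed value."""
    below = sum(1 for v in simulated if v < observed)
    return 100.0 * below / len(simulated)

# Made-up simulated log P(S) / L values, for illustration only.
simulated = [-1.40, -1.35, -1.30, -1.25, -1.20]
z = z_score(-1.20, simulated)
pct = rank_percentile(-1.28, simulated)
```

A sequence scoring well above the simulated mean gets a positive Z-score and a high percentile; the percentile makes no assumption about the shape of the simulated distribution.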
If a user provides a table with a distance metric identified by seq_id, the app merges data by seq_id and plots Z-score vs. Distance.
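The merge amounts to a join on seq_id. The sketch below uses plain dictionaries and keeps only identifiers present in both tables (an inner join, which is an assumption about the app's behaviour); `merge_by_seq_id` and the sample values are hypothetical.

```python
def merge_by_seq_id(z_scores, distances):
    """Inner join of two {seq_id: value} mappings, sorted by seq_id."""
    shared = sorted(z_scores.keys() & distances.keys())
    return [(sid, z_scores[sid], distances[sid]) for sid in shared]

# Illustrative values only.
z_scores = {"seq1": 1.2, "seq2": -0.4, "seq3": 2.5}
distances = {"seq1": 0.10, "seq3": 0.75, "seq4": 0.33}

merged = merge_by_seq_id(z_scores, distances)
```

Each tuple in `merged` is one point for a Z-score vs. distance scatter plot; sequences missing from either table are dropped.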
- Multiple Input Methods:
- Upload multi-FASTA, paste sequences, or upload ZIP with FASTA files.
- Adaptive Simulation:
- Simulation number N_sim adapts to sequence length.
- Reports:
- PDF classification table.
- Excel report with top 100 sequences.
- Optional:
- Merge Z-score with additional distance metric.
- Download plots (PNG) of distributions or Z-score vs. distance.

The distance comparison lets you compare how “in-model” a sequence is (Z-score) with an independent measure, such as Euclidean distance in a feature space or evolutionary distance. You can quickly see whether high Z-score sequences also appear close or far under the other metric, which helps in multi-metric analysis and can guide further filtering or interpretation.
Ensure Python 3.7+ is installed, then run:

    pip install -r requirements.txt

In the repository folder:

    streamlit run markovAlexanderApproach.py

Open http://localhost:8501 in a browser.
Choose the default snippet or upload your own FASTA file. Multi-FASTA files are also accepted.
- Upload multi-FASTA
- Paste sequences directly
- Upload a ZIP of FASTA files
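For readers curious how these inputs reduce to the same data, here is a minimal standard-library sketch that turns multi-FASTA text, or a ZIP of FASTA files, into a `{seq_id: sequence}` mapping. The function names are hypothetical, and the choices made here (the identifier is the first word of the header, files are UTF-8) are assumptions rather than the app's documented behaviour.

```python
import io
import zipfile

def parse_fasta(text):
    """Parse (multi-)FASTA text into {seq_id: sequence}."""
    records = {}
    seq_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            # Fall back to a generated id when the header is empty.
            seq_id = words[0] if words else f"seq{len(records) + 1}"
            records[seq_id] = []
        elif seq_id is not None:
            records[seq_id].append(line.upper())
    return {sid: "".join(parts) for sid, parts in records.items()}

def parse_fasta_zip(data):
    """Parse every file inside a ZIP archive (given as bytes) as FASTA."""
    records = {}
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for name in zf.namelist():
            if name.endswith("/"):  # skip directory entries
                continue
            records.update(parse_fasta(zf.read(name).decode("utf-8")))
    return records

# Build a small ZIP in memory to exercise the parser.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.fa", ">s1 first record\nACGT\nAC\n")
    zf.writestr("b.fa", ">s2\nggtt\n")
zip_records = parse_fasta_zip(buf.getvalue())
```

Pasted sequences can go through `parse_fasta` directly, so all three input methods end in the same mapping.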
Choose either Z-score or Rank Percentile.
- View classification results
- Download classification as PDF or Excel
If using Z-score, upload a distance table identified by seq_id. Download merged data plots (PNG) and Excel files.
If you want to quickly test this application without providing your own data, you can use the example files in the test_data folder.
Within the Streamlit interface:
- Reference Genome: Upload the multi-FASTA file ExampleSequences.fa from the test_data folder. This will serve as your sample reference genome.
- Sequences to Classify: Select “ZIP” as the input type, then upload seqInterested.zip from the test_data folder. This ZIP contains multiple sequences to be classified.
Additionally, the Excel file named EuclideanGCA_000333975.2_ASM33397v2_genomic.fna.xlsx provides Euclidean Distance values for these same sequences. You can use it to compare the Z-score against an external Euclidean Distance, and it also illustrates the format required for any external distance file. If you want to see how the Euclidean Distance relates to the Z-score, simply download and upload this Excel file in the app.
Note: This comparison feature is only available if you have selected “Z-score” as the scoring method.