DNA Markov Model Classifier

A Streamlit application that estimates how likely each DNA sequence is to have been generated by a reference Markov model. It uses a 2nd-order Markov model (each nucleotide conditioned on the two before it, i.e. trinucleotide statistics) to compute a log-probability for each DNA sequence and compares it against a simulated distribution to produce either a Z-score or a rank percentile.

Overview

This app:

Builds a 2nd-order Markov model from a reference genome (FASTA).
Calculates a log-probability for each input DNA sequence under that model.
Simulates multiple random sequences of the same length to derive a reference distribution (mean and standard deviation, or percentile ranks).
Reports a score for each sequence (either a Z-score or a rank percentile).
Offers PDF and Excel reports, plus an optional comparison against an external distance metric.

Markov Model & Algorithm Details

1. Reference Genome

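The user uploads a reference genome in FASTA format or uses the default one.
The app counts the 2-mers and 3-mers in the reference genome.

From these counts it computes two sets of probabilities:

P(2-mer): the probability of each dinucleotide, used for the first two bases of a sequence.
P(3-mer | 2-mer): the probability of the third nucleotide given the preceding 2-mer, i.e. P(S[i] | S[i-2], S[i-1]).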
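
The counting step above can be sketched roughly as follows. This is a minimal illustration, not the code in markovAlexanderApproach.py: the function name `build_model`, the `pseudocount` smoothing, and the choice to ignore k-mers containing characters other than A, C, G, T are all assumptions made for the example.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"

def build_model(reference, pseudocount=1.0):
    """Estimate P(2-mer) and P(base | 2-mer) from a reference string.

    The pseudocount is an assumption of this sketch; it keeps unseen
    k-mers from getting probability zero (and log(0) later on).
    """
    seq = reference.upper()
    dimers = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    trimers = Counter(seq[i:i + 3] for i in range(len(seq) - 2))

    # Only the 16 pure-ACGT contexts are kept; k-mers with N etc. are ignored.
    contexts = ["".join(p) for p in product(ALPHABET, repeat=2)]

    total = sum(dimers[c] for c in contexts) + pseudocount * len(contexts)
    initial = {c: (dimers[c] + pseudocount) / total for c in contexts}

    transition = {}
    for c in contexts:
        row_total = sum(trimers[c + b] for b in ALPHABET) + pseudocount * len(ALPHABET)
        transition[c] = {b: (trimers[c + b] + pseudocount) / row_total for b in ALPHABET}
    return initial, transition

initial, transition = build_model("ACGTACGTTTGACCA")
```

Here `initial` plays the role of P(2-mer) and `transition[context][base]` the role of P(3-mer | 2-mer). Each row of `transition` sums to 1, so it is a proper conditional distribution over the next base.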

2. Score Calculation

Log Probability

The log-probability of a sequence S of length L is computed as:

log P(S) = log P(S[1..2]) + sum from i = 3 to L of log P(S[i] | S[i-2], S[i-1])

Computations are done in log space so that the product of many small probabilities does not underflow.
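
As an illustration of the formula, the sketch below scores a sequence against a toy model. The function name `log_probability` and the uniform toy model are assumptions made for this example; the app's own model comes from the reference genome.

```python
import math
from itertools import product

def log_probability(seq, initial, transition):
    """log P(S) = log P(S[1..2]) + sum over i >= 3 of log P(S[i] | S[i-2], S[i-1])."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("a 2nd-order model needs at least two nucleotides")
    logp = math.log(initial[seq[:2]])              # first two bases
    for i in range(2, len(seq)):                   # 0-based index of the third base onward
        logp += math.log(transition[seq[i - 2:i]][seq[i]])
    return logp

# Toy model (hypothetical): every 2-mer and every transition is equally likely.
contexts = ["".join(p) for p in product("ACGT", repeat=2)]
initial = {c: 1 / 16 for c in contexts}
transition = {c: {b: 0.25 for b in "ACGT"} for c in contexts}

logp = log_probability("ACGTAC", initial, transition)
per_base = logp / len("ACGTAC")   # the length-normalised score log P(S) / L
```

Under this uniform toy model every base contributes log(0.25), so `per_base` equals log(0.25) for any sequence. A model estimated from a real genome would give different sequences different scores.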

Simulation

For a sequence of length L, the app generates N_sim random sequences of the same length.
Their scores form an empirical distribution of log P(S) / L.
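
One way to build such a distribution is sketched below. It assumes the random sequences are sampled from the reference Markov model itself; the README does not spell out the sampling scheme, so treat that, along with the function name `simulate_scores` and the toy model, as assumptions of the example.

```python
import math
import random
from itertools import product

def simulate_scores(length, n_sim, initial, transition, seed=0):
    """Sample n_sim sequences from the model and return their log P(S) / L values."""
    rng = random.Random(seed)
    contexts = list(initial)
    weights = [initial[c] for c in contexts]
    scores = []
    for _ in range(n_sim):
        seq = rng.choices(contexts, weights=weights)[0]   # first two bases
        logp = math.log(initial[seq])
        while len(seq) < length:
            row = transition[seq[-2:]]                    # P(base | last two bases)
            base = rng.choices(list(row), weights=list(row.values()))[0]
            logp += math.log(row[base])
            seq += base
        scores.append(logp / length)
    return scores

# Toy model (hypothetical): uniform 2-mers, A/T-rich transitions.
contexts = ["".join(p) for p in product("ACGT", repeat=2)]
initial = {c: 1 / 16 for c in contexts}
transition = {c: {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4} for c in contexts}

scores = simulate_scores(length=50, n_sim=200, initial=initial, transition=transition)
```

The log-probability is accumulated while sampling, so each simulated sequence is scored in a single pass.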

Z-score or Percentile

Z-score:

Z = (log P(S) / L - μ) / σ

where μ and σ are the mean and standard deviation of log P(S) / L across the simulations.

Rank Percentile: The percentage of simulated values that fall below the observed log P(S) / L.
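
Both scores can be computed from a list of simulated values in a few lines. The sketch below uses made-up numbers and hypothetical function names. It also assumes the sample standard deviation and a strict "below" comparison, which may differ from the choices made in the app.

```python
import statistics

def z_score(observed, simulated):
    """Z = (observed - mean) / std, using the sample standard deviation (an assumption)."""
    mu = statistics.mean(simulated)
    sigma = statistics.stdev(simulated)
    if sigma == 0:
        raise ValueError("simulated scores have zero variance")
    return (observed - mu) / sigma

def rank_percentile(observed, simulated):
    """Percentage of simulated values strictly below the observed value."""
    below = sum(1 for s in simulated if s < observed)
    return 100.0 * below / len(simulated)

# Illustrative values of log P(S) / L, not real output from the app.
simulated = [-1.40, -1.38, -1.35, -1.33, -1.30]
observed = -1.34

z = z_score(observed, simulated)
pct = rank_percentile(observed, simulated)   # 3 of 5 simulated values are below -1.34
```

A positive Z-score or a high percentile means the sequence scores better under the model than most simulated sequences of the same length.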

3. Optional Euclidean Distance Comparison

If the user provides a table containing a distance metric keyed by seq_id, the app merges it with the scores on seq_id and plots Z-score vs. distance.
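
The merge amounts to an inner join on seq_id. A dependency-free sketch is shown below; the function name, the dictionary layout, and the example values are hypothetical, and the app itself may well use a DataFrame merge instead.

```python
def merge_by_seq_id(z_scores, distances):
    """Inner join of two {seq_id: value} mappings; ids missing from either side are dropped."""
    shared = sorted(z_scores.keys() & distances.keys())
    return [(seq_id, z_scores[seq_id], distances[seq_id]) for seq_id in shared]

# Illustrative values only.
z_scores = {"seq1": 0.8, "seq2": -2.1, "seq3": 1.4}
distances = {"seq1": 0.12, "seq2": 0.57, "seq4": 0.30}

merged = merge_by_seq_id(z_scores, distances)
# merged rows are (seq_id, z_score, distance), ready to plot as Z-score vs. distance
```

Only sequences present in both tables survive the join, so a seq_id that is misspelled in the distance file will silently drop out of the plot.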

Features

Multiple Input Methods:
- Upload multi-FASTA, paste sequences, or upload ZIP with FASTA files.
Adaptive Simulation:
- Simulation number N_sim adapts to sequence length.
Reports:
- PDF classification table.
- Excel report with top 100 sequences.
Optional:
- Merge Z-score with additional distance metric.
- Download plots (PNG) of distributions or Z-score vs. distance.

The distance comparison lets you set how “in-model” a sequence is (its Z-score) against an independent measure, such as Euclidean distance in a feature space or an evolutionary distance. You can quickly see whether high-Z-score sequences are also close or far by the other metric, which helps guide further filtering or interpretation in a multi-metric analysis.

Usage

1. Install Dependencies

Ensure Python 3.7+ is installed, then run:

pip install -r requirements.txt

2. Run the App

In the repository folder:

streamlit run markovAlexanderApproach.py

Open http://localhost:8501 in a browser.

3. Reference Genome

Choose a default snippet or upload your own FASTA file. Multi-FASTA files are also accepted.

4. Sequences to Classify

Upload multi-FASTA
Paste sequences directly
Upload a ZIP of FASTA files
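
For readers who want to prepare or check input files outside the app, the sketch below reads multi-FASTA text and a ZIP of FASTA files using only the standard library. The function names are hypothetical and the parsing rules (first word of the header as the id, UTF-8 text) are assumptions; the app's own loader may behave differently.

```python
import io
import zipfile

def parse_fasta(text):
    """Return {seq_id: sequence} from multi-FASTA text."""
    records = {}
    seq_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            # Use the first word of the header as the id; fall back to a counter.
            seq_id = header[0] if header else f"seq{len(records) + 1}"
            records[seq_id] = []
        elif seq_id is not None:
            records[seq_id].append(line.upper())
    return {key: "".join(parts) for key, parts in records.items()}

def parse_fasta_zip(zip_bytes):
    """Return {seq_id: sequence} from every file inside a ZIP archive."""
    records = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):          # skip directory entries
                continue
            records.update(parse_fasta(archive.read(name).decode("utf-8")))
    return records

pasted = parse_fasta(">s1 example\nACGT\nacg\n>s2\nTTGA\n")
```

Sequence lines are joined and upper-cased, so a record wrapped over several lines comes back as one string.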

5. Scoring Method

Choose either Z-score or Rank Percentile.

6. Results & Reports

View classification results
Download classification as PDF or Excel

7. Optional: Merge Z-score with Distance

If using Z-score, upload a distance table keyed by seq_id, then download the merged plots (PNG) and Excel files.

Testing the App

If you want to quickly test this application without providing your own data, you can use the example files in the test_data folder.

Within the Streamlit interface:

Reference Genome: Upload the multi-FASTA file ExampleSequences.fa from the test_data folder. This will serve as your sample reference genome.
Sequences to Classify: Select “ZIP” as the input type, then upload seqInterested.zip from the test_data folder. This ZIP contains multiple sequences to be classified.

Additionally, the Excel file EuclideanGCA_000333975.2_ASM33397v2_genomic.fna.xlsx provides Euclidean distance values for these same sequences and illustrates the format required for any external distance file. To see how the Euclidean distance relates to the Z-score, download this file and upload it in the app.

Note: This comparison feature is only available if you have selected “Z-score” as the scoring method.
