alejanner/HiddenMarkovModels

DNA Markov Model Classifier

A Streamlit application that estimates how likely it is that each DNA sequence was generated by a reference Markov model. It uses a 2nd-order Markov model (trinucleotides) to compute the log-probability of each DNA sequence, then compares that value against a simulated distribution to produce either a Z-score or a percentile.


Table of Contents

  1. Overview
  2. Markov Model & Algorithm Details
  3. Features
  4. Usage
  5. Testing the App

Overview

This app:

  • Builds a 2nd-order Markov model from a reference genome (FASTA).
  • Calculates a log-probability for each input DNA sequence under that model.
  • Simulates multiple random sequences of the same length to derive a distribution (mean & std or percentile).
  • Provides a score (either a Z-score or Rank percentile).
  • Offers PDF and Excel reports, plus optional distance metric comparison.

Markov Model & Algorithm Details

1. Reference Genome

  • The user uploads (or uses a default) reference genome in FASTA format.
  • The app counts 2-mers and 3-mers in the reference genome.

From these counts it estimates the probabilities:

  • P(2-mer)
  • P(3-mer | 2-mer)
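The counting step above can be sketched as follows. This is a minimal illustration, not the app's actual code: the function name `build_model` is hypothetical, and the sketch assumes plain frequency estimates with no pseudocounts and no special handling of ambiguous bases such as N.

```python
from collections import Counter

def build_model(reference: str):
    """Estimate P(2-mer) and P(3-mer | 2-mer) from k-mer counts."""
    seq = reference.upper()
    pair_counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    triple_counts = Counter(seq[i:i + 3] for i in range(len(seq) - 2))

    # P(2-mer): relative frequency of each 2-mer in the reference
    total_pairs = sum(pair_counts.values())
    p_pair = {kmer: c / total_pairs for kmer, c in pair_counts.items()}

    # P(3-mer | 2-mer): normalise each 3-mer by all 3-mers sharing its 2-mer prefix
    prefix_totals = Counter()
    for kmer, c in triple_counts.items():
        prefix_totals[kmer[:2]] += c
    p_next = {kmer: c / prefix_totals[kmer[:2]] for kmer, c in triple_counts.items()}
    return p_pair, p_next

p_pair, p_next = build_model("ACGTACGTAC")
print(p_pair["AC"], p_next["ACG"])
```

In this toy reference every occurrence of AC that has a following base is followed by G, so the conditional probability of ACG given AC is 1.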

2. Score Calculation

Log Probability

The log-probability for a sequence S of length L is computed as:

log P(S) = log P(S[1..2]) + sum from i = 3 to L of log P(S[i] | S[i-2], S[i-1])

Computations are done in log space to safely handle small probabilities.
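As a rough sketch of this formula, the function below sums log terms over a sequence. The name `log_probability` and the `floor` guard for unseen k-mers are assumptions made for the example; the README does not say how the app treats k-mers that are absent from the reference. A uniform toy model is used so the result is easy to check by hand.

```python
import math
from itertools import product

def log_probability(seq, p_pair, p_next, floor=1e-12):
    """log P(S) = log P(S[1..2]) + sum over i of log P(S[i] | S[i-2], S[i-1])."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two bases")
    # `floor` is a hypothetical guard that avoids log(0) for unseen k-mers
    logp = math.log(max(p_pair.get(seq[:2], 0.0), floor))
    for i in range(2, len(seq)):
        trinucleotide = seq[i - 2:i + 1]
        logp += math.log(max(p_next.get(trinucleotide, 0.0), floor))
    return logp

# Toy uniform model: every 2-mer has probability 1/16, every next base 1/4
bases = "ACGT"
p_pair = {a + b: 1 / 16 for a, b in product(bases, repeat=2)}
p_next = {a + b + c: 1 / 4 for a, b, c in product(bases, repeat=3)}

score = log_probability("ACGTAC", p_pair, p_next)
per_base = score / len("ACGTAC")  # log P(S) / L
print(score, per_base)
```

Under the uniform model each base contributes log(1/4), so the per-base score equals log(0.25) regardless of the sequence.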

Simulation

  • For sequence length L, the app generates N_sim random sequences.
  • These form an empirical distribution of log P(S) / L.
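One way to generate such sequences is to sample them from the fitted model itself. The sketch below assumes that approach; the README does not state how the app draws its random sequences, and `simulate_sequence`, the toy probabilities, and the uniform fallback for unseen contexts are all hypothetical. Scoring each simulated sequence with log P(S) / L would then give the empirical distribution.

```python
import random
from itertools import product

BASES = "ACGT"

def simulate_sequence(length, p_pair, p_next, rng):
    """Draw one sequence of `length` bases from a 2nd-order Markov model."""
    pairs = list(p_pair)
    # Start from a 2-mer drawn according to P(2-mer)
    seq = rng.choices(pairs, weights=[p_pair[p] for p in pairs])[0]
    while len(seq) < length:
        context = seq[-2:]
        weights = [p_next.get(context + b, 0.0) for b in BASES]
        if sum(weights) == 0:
            # Unseen context: fall back to uniform (an assumption of this sketch)
            weights = [1.0] * len(BASES)
        seq += rng.choices(BASES, weights=weights)[0]
    return seq[:length]

# Toy model: uniform 2-mers, next base biased towards A regardless of context
next_base = {"A": 0.4, "C": 0.3, "G": 0.2, "T": 0.1}
p_pair = {a + b: 1 / 16 for a, b in product(BASES, repeat=2)}
p_next = {a + b + c: next_base[c] for a, b, c in product(BASES, repeat=3)}

rng = random.Random(0)  # fixed seed so the run is repeatable
n_sim, length = 200, 50
simulated = [simulate_sequence(length, p_pair, p_next, rng) for _ in range(n_sim)]
print(simulated[0])
```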

Z-score or Percentile

  • Z-score:

Z = (log P(S) / L - μ) / σ

where μ and σ are the mean and standard deviation from simulations.

  • Rank Percentile: The percentage of simulated values that fall below the computed log P(S) / L.
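Both scores can be illustrated with a few lines of standard-library Python. The function names and the five simulated values are made up for the example, and the sketch uses the sample standard deviation; whether the app uses the sample or population form is not stated here.

```python
import statistics

def z_score(value, simulated):
    """Z = (value - mean) / std of the simulated per-base log-probabilities."""
    mu = statistics.mean(simulated)
    sigma = statistics.stdev(simulated)  # sample std (assumption of this sketch)
    return (value - mu) / sigma

def rank_percentile(value, simulated):
    """Percentage of simulated values strictly below the observed value."""
    below = sum(1 for s in simulated if s < value)
    return 100.0 * below / len(simulated)

# Hypothetical per-base scores, i.e. values of log P(S) / L
simulated = [-1.40, -1.38, -1.42, -1.36, -1.44]
observed = -1.37

z = z_score(observed, simulated)
pct = rank_percentile(observed, simulated)
print(round(z, 3), pct)  # about 0.949 and 80.0
```

Here four of the five simulated values lie below the observed score, giving a percentile of 80, and the observed score sits a little under one standard deviation above the simulated mean.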

3. Optional Euclidean Distance Comparison

If a user provides a table with a distance metric identified by seq_id, the app merges data by seq_id and plots Z-score vs. Distance.
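The merge amounts to an inner join on seq_id. The sketch below shows the idea with plain dictionaries; the column names `z_score` and `distance` and the sample values are assumptions, and with pandas the same join would be a `DataFrame.merge` on `seq_id`.

```python
def merge_by_seq_id(scores, distances):
    """Inner-join two lists of row dicts on their 'seq_id' field."""
    distance_by_id = {row["seq_id"]: row["distance"] for row in distances}
    return [
        {**row, "distance": distance_by_id[row["seq_id"]]}
        for row in scores
        if row["seq_id"] in distance_by_id  # drop sequences without a distance
    ]

# Hypothetical classification results and external distance table
scores = [
    {"seq_id": "seq1", "z_score": 0.8},
    {"seq_id": "seq2", "z_score": -2.1},
    {"seq_id": "seq3", "z_score": 0.1},
]
distances = [
    {"seq_id": "seq1", "distance": 0.12},
    {"seq_id": "seq2", "distance": 0.95},
]

merged = merge_by_seq_id(scores, distances)
# Each merged row is one (distance, z_score) point of the scatter plot
points = [(row["distance"], row["z_score"]) for row in merged]
print(points)
```

Sequences that appear in only one of the two tables (seq3 here) are left out of the plot in this sketch.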


Features

  • Multiple Input Methods:
    • Upload multi-FASTA, paste sequences, or upload ZIP with FASTA files.
  • Adaptive Simulation:
    • Simulation number N_sim adapts to sequence length.
  • Reports:
    • PDF classification table.
    • Excel report with top 100 sequences.
  • Optional:
    • Merge Z-score with additional distance metric.
    • Download plots (PNG) of distributions or Z-score vs. distance. This comparison lets you set how “in-model” a sequence is (Z-score) against a completely different measure, such as Euclidean distance in a feature space or evolutionary distance. You can quickly see whether sequences with a high Z-score also appear “close” or “far” by the other metric, which helps in multi-metric analysis and can guide further filtering or interpretation.

Usage

1. Install Dependencies

Ensure Python 3.7+ is installed, then run:

pip install -r requirements.txt

2. Run the App

In the repository folder:

streamlit run markovAlexanderApproach.py

Open http://localhost:8501 in a browser.

3. Reference Genome

Choose a default snippet or upload your own FASTA file. Multi-FASTA files are also accepted.

4. Sequences to Classify

  • Upload multi-FASTA
  • Paste sequences directly
  • Upload a ZIP of FASTA files
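All three input methods reduce to parsing FASTA text into named sequences. The sketch below shows one way to do that for pasted text and for a ZIP archive; `parse_fasta` and `parse_fasta_zip` are hypothetical names, and the sketch assumes UTF-8 files and takes the first word of each header as the sequence ID.

```python
import io
import zipfile

def parse_fasta(text):
    """Return {seq_id: sequence} from (multi-)FASTA text."""
    records, seq_id = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            # ID = first word of the header; unnamed records get a generated ID
            seq_id = parts[0] if parts else f"seq{len(records) + 1}"
            records[seq_id] = []
        elif seq_id is not None:
            records[seq_id].append(line.upper())
    return {name: "".join(chunks) for name, chunks in records.items()}

def parse_fasta_zip(data: bytes):
    """Parse every file inside a ZIP archive as FASTA and pool the records."""
    records = {}
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):  # skip directory entries
                continue
            records.update(parse_fasta(archive.read(name).decode("utf-8")))
    return records

# Build a small ZIP in memory to exercise the parser
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("a.fa", ">seq1 demo\nACGT\nAC\n")
    archive.writestr("b.fa", ">seq2\nggta\n")

records = parse_fasta_zip(buffer.getvalue())
print(records)
```

Records with the same ID in different files overwrite each other in this sketch; a real loader would need a rule for such collisions.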

5. Scoring Method

Choose either Z-score or Rank Percentile.

6. Results & Reports

  • View classification results
  • Download classification as PDF or Excel

7. Optional: Merge Z-score with Distance

If using Z-score, upload a distance table identified by seq_id. You can then download the resulting plots (PNG) and the merged data (Excel).

Testing the App

If you want to quickly test this application without providing your own data, you can use the example files in the test_data folder.

Within the Streamlit interface:

  • Reference Genome: Upload the multi-FASTA file ExampleSequences.fa from the test_data folder. This will serve as your sample reference genome.

  • Sequences to Classify: Select “ZIP” as the input type, then upload seqInterested.zip from the test_data folder. This ZIP contains multiple sequences to be classified.

Additionally, the Excel file named EuclideanGCA_000333975.2_ASM33397v2_genomic.fna.xlsx provides Euclidean Distance values for these same sequences. You can use it to compare the Z-score against an external Euclidean Distance, and it also illustrates the format required for any external distance file. To see how the Euclidean Distance relates to the Z-score, download this Excel file and upload it in the app.

Note: This comparison feature is only available if you have selected “Z-score” as the scoring method.
