From 6651fbd3698b264842f5fad784e6180d9753d000 Mon Sep 17 00:00:00 2001
From: Yuwei Sun
Date: Tue, 13 May 2025 10:53:24 -0400
Subject: [PATCH 1/3] Update Readme

---
 README.md | 71 +++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 56 insertions(+), 15 deletions(-)

diff --git a/README.md b/README.md
index e74560e..690bf31 100644
--- a/README.md
+++ b/README.md
@@ -4,25 +4,55 @@
 
 Execution of *ref*orm requires a reference sequence (fasta), reference annotation (GFF or GTF), the novel sequences to be added (fasta), and corresponding novel annotations (GFF or GTF). A user provides as arguments the name of the modified chromosome and either the position at which the novel sequence is inserted, or the upstream and downstream sequences flanking the novel sequences. This results in the addition and/or deletion of sequence from the reference in the modified fasta file. In addition to the novel annotations, any changes to the reference annotations that result from deleted or interrupted sequence are incorporated into the modified gff. Importantly, modified gff and fasta files include a record of the modifications.
 
+In addition to the editing functionality described above, *ref*orm also supports the addition of novel chromosomes. This requires a reference sequence (FASTA) and annotation (GFF or GTF), along with the novel chromosome to be added. Users must provide the novel sequence (FASTA) and its corresponding annotation (GFF or GTF). The new chromosome is appended to the reference files, and all associated annotations are incorporated accordingly.
+
 Learn more at https://gencore.bio.nyu.edu/reform/
 
 ## Usage
 
-*ref*orm requires Python3, pgzip and Biopython v1.78 or higher.
+*ref*orm requires Python3 and Biopython v1.78 or higher.
 
-Install pgzip and biopython if you don't already have it:
+Install biopython if you don't already have it:
 
-`pip install pgzip`
 `pip install biopython`
 
+Reform supports reading and writing .gz files using gzip. To accelerate compression and decompression, it optionally supports pgzip, a parallel implementation of gzip. Users must install pgzip separately to enable this feature.
+
+Install pgzip if you don't already have it:
+
+`pip install pgzip`
+
 Invoke the python script:
 
+```bash
+### Edit a sequence within position
+python3 reform.py
+  --chrom= \
+  --position=,, \
+  --in_fasta=,, \
+  --in_gff=,, \
+  --ref_fasta= \
+  --ref_gff=
 ```
+
+```bash
+### Edit a sequence within upstream & downstream
 python3 reform.py
   --chrom= \
-  --position= \
-  --in_fasta= \
-  --in_gff= \
+  --upstream_fasta=, , \
+  --downstream_fasta=,, \
+  --in_fasta=,, \
+  --in_gff=,, \
+  --ref_fasta= \
+  --ref_gff=
+```
+
+```bash
+### Append a novel chromosome sequence
+python3 reform.py
+  --new_chrom= \
+  --in_fasta=,, \
+  --in_gff=,, \
   --ref_fasta= \
   --ref_gff=
 ```
@@ -31,15 +61,17 @@ python3 reform.py
 
 `chrom` ID of the chromsome to modify
 
-`position` Position in chromosome at which to insert . Can use `-1` to add to end of chromosome. Note: Either position, or upstream AND downstream sequence must be provided. **Note: Position is 0-based**
+`new_chrom` ID of the novel chromsome to append
+
+`position` Position in chromosome at which to insert . Can use `-1` to add to end of chromosome. Note: Either position, or upstream AND downstream sequence must be provided. To perform multiple edits in one run, provide multiple positions separated by commas (e.g., 0,5,-1). **Note: Position is 0-based**
 
-`upstream_fasta` Path to Fasta file with upstream sequence. Note: Either position, or upstream AND downstream sequence must be provided.
+`upstream_fasta` Paths to Fasta file with upstream sequence. Note: Either position, or upstream AND downstream sequence must be provided.
 
-`downstream_fasta` Path to Fasta file with downstream sequence. Note: Either position, or upstream AND downstream sequence must be provided.
+`downstream_fasta` Paths to Fasta file with downstream sequence. Note: Either position, or upstream AND downstream sequence must be provided.
 
-`in_fasta` Path to new sequence to be inserted into reference genome in fasta format.
+`in_fasta` Paths to new sequence to be inserted into reference genome in fasta format.
 
-`in_gff` Path to GFF file describing new fasta sequence to be inserted.
+`in_gff` Paths to GFF file describing new fasta sequence to be inserted.
 
 `ref_fasta` Path to reference fasta file.
 
@@ -50,10 +82,10 @@ python3 reform.py
 ```
 python3 reform.py
   --chrom="I" \
-  --upstream_fasta="data/up.fa" \
-  --downstream_fasta="data/down.fa" \
-  --in_fasta="data/new.fa" \
-  --in_gff="data/new.gff" \
+  --upstream_fasta="data/up1.fa,data/up2.fa,data/up3.fa" \
+  --downstream_fasta="data/down1.fa,data/down2.fa,data/down3.fa" \
+  --in_fasta="data/new1.fa,data/new2.fa,data/new3.fa" \
+  --in_gff="data/new1.gff,data/new2.gff,data/new3.gff" \
   --ref_fasta="data/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa" \
   --ref_gff="data/Saccharomyces_cerevisiae.R64-1-1.34.gff3"
 ```
@@ -64,3 +96,12 @@ python3 reform.py
 
 `reformed.gff3` Modified GFF file.
 
+## Test
+After local deployment or modification, you can use `test_reform.py` to verify the functionality of Reform.
+This script contains an automated test suite using the Python `unittest` framework. It verifies the correctness of Reform across a variety of genome editing scenarios.
+
+To run all tests:
+
+```bash
+python3 test_reform.py
+```
\ No newline at end of file

From 603119d264c81286e091ec977d2b47d784065dfc Mon Sep 17 00:00:00 2001
From: Eric
Date: Mon, 19 May 2025 11:13:18 -0400
Subject: [PATCH 2/3] Update Format

---
 README.md | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/README.md b/README.md
index 42c173c..32b75f2 100644
--- a/README.md
+++ b/README.md
@@ -24,9 +24,9 @@ Install pgzip if you don't already have it:
 
 Invoke the python script:
 
-```bash
+```
 ### Edit a sequence within position
-python3 reform.py
+python3 reform.py \
   --chrom= \
   --position=,, \
   --in_fasta=,, \
@@ -35,11 +35,11 @@ python3 reform.py
   --ref_gff=
 ```
 
-```bash
+```
 ### Edit a sequence within upstream & downstream
-python3 reform.py
+python3 reform.py \
   --chrom= \
-  --upstream_fasta=, , \
+  --upstream_fasta=,, \
   --downstream_fasta=,, \
   --in_fasta=,, \
   --in_gff=,, \
@@ -47,9 +47,9 @@ python3 reform.py
   --ref_gff=
 ```
 
-```bash
+```
 ### Append a novel chromosome sequence
-python3 reform.py
+python3 reform.py \
   --new_chrom= \
   --in_fasta=,, \
   --in_gff=,, \
@@ -80,7 +80,7 @@ python3 reform.py
 ## Example
 
 ```
-python3 reform.py
+python3 reform.py \
   --chrom="I" \
   --upstream_fasta="data/up1.fa,data/up2.fa,data/up3.fa" \
   --downstream_fasta="data/down1.fa,data/down2.fa,data/down3.fa" \
@@ -104,4 +104,4 @@ To run all tests:
 
 ```bash
 python3 test_reform.py
-```
\ No newline at end of file
+```

From 0898ffe1b2581c3e497f16c01d58dfc05fa5273f Mon Sep 17 00:00:00 2001
From: mkhalfan
Date: Thu, 5 Jun 2025 14:52:34 -0400
Subject: [PATCH 3/3] Update README.md

---
 README.md | 123 +++++++++++++++++++++++++++++++-----------------------
 1 file changed, 71 insertions(+), 52 deletions(-)

diff --git a/README.md b/README.md
index 32b75f2..63a8a1c 100644
--- a/README.md
+++ b/README.md
@@ -1,10 +1,15 @@
 # reform
 
-[*ref*orm](https://gencore.bio.nyu.edu/reform/) is a python-based command line tool that allows for fast, easy and robust editing of reference genome sequence and annotation files.
+[*ref*orm](https://gencore.bio.nyu.edu//) is a Python-based command-line tool for fast, robust, and flexible editing of reference genome sequence and annotation files.
 
-Execution of *ref*orm requires a reference sequence (fasta), reference annotation (GFF or GTF), the novel sequences to be added (fasta), and corresponding novel annotations (GFF or GTF). A user provides as arguments the name of the modified chromosome and either the position at which the novel sequence is inserted, or the upstream and downstream sequences flanking the novel sequences. This results in the addition and/or deletion of sequence from the reference in the modified fasta file. In addition to the novel annotations, any changes to the reference annotations that result from deleted or interrupted sequence are incorporated into the modified gff. Importantly, modified gff and fasta files include a record of the modifications.
+To perform an edit, *ref*orm requires a reference genome (FASTA), its annotation file (GFF or GTF), a novel sequence to be inserted (FASTA), and the corresponding annotation (GFF or GTF). The user specifies either:
 
-In addition to the editing functionality described above, *ref*orm also supports the addition of novel chromosomes. This requires a reference sequence (FASTA) and annotation (GFF or GTF), along with the novel chromosome to be added. Users must provide the novel sequence (FASTA) and its corresponding annotation (GFF or GTF). The new chromosome is appended to the reference files, and all associated annotations are incorporated accordingly.
+- the chromosome and the position at which to insert the novel sequence, or
+- the chromosome along with the upstream and downstream flanking sequences.
+
+The result is a modified reference genome (FASTA) and annotation file (GFF), incorporating the novel sequence and its annotations. Any reference annotations affected by the insertion or deletion are automatically updated. All modifications are documented within the output files.
+
+In addition to modifying existing chromosomes, *ref*orm also supports appending entirely new chromosomes. In this mode, users provide the novel chromosome’s sequence and annotations, which are added to the reference genome and integrated into the annotation file.
 
 Learn more at https://gencore.bio.nyu.edu/reform/
 
@@ -16,78 +21,93 @@ Install biopython if you don't already have it:
 
 `pip install biopython>=1.78`
 
-Reform supports reading and writing .gz files using gzip. To accelerate compression and decompression, it optionally supports pgzip, a parallel implementation of gzip. Users must install pgzip separately to enable this feature.
+*ref*orm supports reading and writing .gz files using gzip. To accelerate compression and decompression, it optionally supports pgzip, a parallel implementation of gzip. Users must install pgzip separately to enable this feature.
 
-Install pgzip if you don't already have it:
+*Optional:* Install pgzip if you don't already have it:
 
 `pip install pgzip`
 
 Invoke the python script:
 
 ```
-### Edit a sequence within position
+### Minimal Example (Single Edit)
 python3 reform.py \
   --chrom= \
-  --position=,, \
-  --in_fasta=,, \
-  --in_gff=,, \
-  --ref_fasta= \
-  --ref_gff=
+  --position= \
+  --in_fasta= \
+  --in_gff= \
+  --ref_fasta= \
+  --ref_gff=
 ```
 
-```
-### Edit a sequence within upstream & downstream
-python3 reform.py \
-  --chrom= \
-  --upstream_fasta=,, \
-  --downstream_fasta=,, \
-  --in_fasta=,, \
-  --in_gff=,, \
-  --ref_fasta= \
-  --ref_gff=
-```
+## Parameters
 
-```
-### Append a novel chromosome sequence
-python3 reform.py \
-  --new_chrom= \
-  --in_fasta=,, \
-  --in_gff=,, \
-  --ref_fasta= \
-  --ref_gff=
-```
+- `chrom`: ID of the chromosome to **modify**. **Required** unless `new_chrom` is specified. Cannot be used together with `new_chrom`.
 
-## Parameters
+- `new_chrom`: ID of the novel chromosome to **append**. **Required** if you're adding a new chromosome. Cannot be used together with `chrom`.
 
-`chrom` ID of the chromsome to modify
+- `position`: 0-based insertion position(s) in the reference chromosome where `in_fasta` should be inserted. Use `-1` to insert at the end of the chromosome. For **multiple edits**, provide a comma-separated list (e.g., `0,5,-1`). **Note:** Either `position`, or both `upstream_fasta` and `downstream_fasta`, must be provided.
 
-`new_chrom` ID of the novel chromsome to append
+- `upstream_fasta`: Path(s) to FASTA file(s) containing the upstream flanking sequence(s) for insertion. For **multiple edits**, provide a comma-separated list (e.g., `up1.fa,up2.fa,up3.fa`). Must be used with `downstream_fasta`. Cannot be used together with `position`.
 
-`position` Position in chromosome at which to insert . Can use `-1` to add to end of chromosome. Note: Either position, or upstream AND downstream sequence must be provided. To perform multiple edits in one run, provide multiple positions separated by commas (e.g., 0,5,-1). **Note: Position is 0-based**
+- `downstream_fasta`: Path(s) to FASTA file(s) containing the downstream flanking sequence(s) for insertion. For **multiple edits**, provide a comma-separated list (e.g., `down1.fa,down2.fa,down3.fa`). Must be used with `upstream_fasta`. Cannot be used together with `position`.
 
-`upstream_fasta` Paths to Fasta file with upstream sequence. Note: Either position, or upstream AND downstream sequence must be provided.
+- `in_fasta`: Path(s) to FASTA file(s) containing the new sequence(s) to insert. For multiple edits, provide a comma-separated list. **The number of entries must match the number of `position` values or the number of upstream/downstream pairs.**
 
-`downstream_fasta` Paths to Fasta file with downstream sequence. Note: Either position, or upstream AND downstream sequence must be provided.
+- `in_gff`: Path(s) to GFF3 file(s) describing the `in_fasta` sequence(s). For multiple edits, provide a comma-separated list. **The number of entries must match the number of `in_fasta` files.**
 
-`in_fasta` Paths to new sequence to be inserted into reference genome in fasta format.
+- `ref_fasta` Path to the reference genome FASTA file.
 
-`in_gff` Paths to GFF file describing new fasta sequence to be inserted.
+- `ref_gff` Path to the reference genome annotation (GFF3 or GTF) file.
 
-`ref_fasta` Path to reference fasta file.
+## Examples
 
-`ref_gff` Path to reference gff file.
+### Single Edit by Position
 
-## Example
+```
+python3 reform.py \
+  --chrom="I" \
+  --position=1500 \
+  --in_fasta="data/edit.fa" \
+  --in_gff="data/edit.gff" \
+  --ref_fasta="data/ref.fa" \
+  --ref_gff="data/ref.gff3"
+```
+
+### Single Edit with Upstream/Downstream Flanks
 
 ```
 python3 reform.py \
   --chrom="I" \
-  --upstream_fasta="data/up1.fa,data/up2.fa,data/up3.fa" \
-  --downstream_fasta="data/down1.fa,data/down2.fa,data/down3.fa" \
-  --in_fasta="data/new1.fa,data/new2.fa,data/new3.fa" \
-  --in_gff="data/new1.gff,data/new2.gff,data/new3.gff" \
-  --ref_fasta="data/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa" \
-  --ref_gff="data/Saccharomyces_cerevisiae.R64-1-1.34.gff3"
+  --upstream_fasta="data/up.fa" \
+  --downstream_fasta="data/down.fa" \
+  --in_fasta="data/edit.fa" \
+  --in_gff="data/edit.gff" \
+  --ref_fasta="data/ref.fa" \
+  --ref_gff="data/ref.gff3"
+```
+
+### Batch Edits (Multiple Positions)
+
+```
+python3 reform.py \
+  --chrom="I" \
+  --position=1000,2500,3000 \
+  --in_fasta="data/edit1.fa,data/edit2.fa,data/edit3.fa" \
+  --in_gff="data/edit1.gff,data/edit2.gff,data/edit3.gff" \
+  --ref_fasta="data/ref.fa" \
+  --ref_gff="data/ref.gff3"
+```
+
+### Append a Novel Chromosome
+
+```
+python3 reform.py \
+  --new_chrom="new_chr1" \
+  --in_fasta="data/new1.fa" \
+  --in_gff="data/new1.gff" \
+  --ref_fasta="data/ref.fa" \
+  --ref_gff="data/ref.gff3"
 ```
 
 ## Output
@@ -96,12 +116,11 @@ python3 reform.py \
 
 `reformed.gff3` Modified GFF file.
 
-## Test
-After local deployment or modification, you can use `test_reform.py` to verify the functionality of Reform.
-This script contains an automated test suite using the Python `unittest` framework. It verifies the correctness of Reform across a variety of genome editing scenarios.
+## Tests
+After local deployment or modification, you can run `test_reform.py` to verify the functionality of *ref*orm. This script contains an automated test suite built with Python’s `unittest` framework and validates *ref*orm across a range of genome editing scenarios.
 
 To run all tests:
 
 ```bash
-python3 test_.py
+python3 test_.py
 ```
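
Note on the parameter rules introduced in PATCH 3/3: the README now states several constraints on option combinations (`chrom` versus `new_chrom`, `position` versus the upstream/downstream pair, and matching entry counts for comma-separated lists). The sketch below restates those rules as a small Python check. It is an illustration only, not code from `reform.py`: the helper names `split_list` and `check_args` are hypothetical, and the real tool may validate its arguments differently or more strictly.

```python
from typing import List, Optional


def split_list(value: Optional[str]) -> List[str]:
    """Split a comma-separated option value into trimmed, non-empty entries."""
    if not value:
        return []
    return [item.strip() for item in value.split(",") if item.strip()]


def check_args(chrom: Optional[str] = None,
               new_chrom: Optional[str] = None,
               position: Optional[str] = None,
               upstream_fasta: Optional[str] = None,
               downstream_fasta: Optional[str] = None,
               in_fasta: Optional[str] = None,
               in_gff: Optional[str] = None) -> int:
    """Hypothetical helper: return the number of edits implied by the
    options, or raise ValueError if a rule from the README is violated."""
    # chrom and new_chrom are mutually exclusive, and one is required.
    if bool(chrom) == bool(new_chrom):
        raise ValueError("provide exactly one of chrom or new_chrom")

    fastas, gffs = split_list(in_fasta), split_list(in_gff)
    if not fastas:
        raise ValueError("in_fasta is required")
    # in_gff must have as many entries as in_fasta.
    if len(gffs) != len(fastas):
        raise ValueError("in_gff and in_fasta must have the same number of entries")

    # Appending a new chromosome needs no insertion target.
    if new_chrom:
        return len(fastas)

    positions = split_list(position)
    ups, downs = split_list(upstream_fasta), split_list(downstream_fasta)
    if positions and (ups or downs):
        raise ValueError("position cannot be combined with upstream/downstream")
    if positions:
        targets = len(positions)
    elif ups and downs:
        if len(ups) != len(downs):
            raise ValueError("upstream_fasta and downstream_fasta must pair up")
        targets = len(ups)
    else:
        raise ValueError("provide position, or both upstream_fasta and downstream_fasta")

    # One in_fasta entry per position or per upstream/downstream pair.
    if targets != len(fastas):
        raise ValueError("in_fasta entries must match the number of insertion targets")
    return targets
```

For instance, the batch example from the README (three positions, three `in_fasta` files, three `in_gff` files) would pass this check as three edits, while supplying both `chrom` and `new_chrom` would be rejected.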