This is an instruction for analyzing functional genome of Xanthomonas citri pv. citri CQ13 using RB-TnSeq1.
The workflow consists of:
-
- Pre-process sequencing data.[
Fastp] - Get barcodes and FASTA files with transposon-truncated reads.[
get_bc_gdna.py&SeqKit] - Map to the reference genome.[
blastn-short] - Select best hit and Match insertion site with gene info.[
select_best_hit.py] - Match barcodes with consistently-inserted genes. [
match_barcode_gene.R]
- Pre-process sequencing data.[
-
- Pre-process sequencing data.[
Fastp] - Get barcodes and Calculate the count of each barcode.[
get_bc_count.py] - Compute strain fitness and gene fitness.[
fitness_calculator.R]
- Pre-process sequencing data.[
- Miniconda (optional, for packages installation)
- Fastp
- SeqKit
- BLAST
- Python
- R, Rstudio (R packages:
tidyverse,rio,plyr) - Excel
- Operating system: Linux
- SSH tool: finalshell (optional, better to have if using remote server)
This step processes sequencing data through quality control, trimming of adapters using fastp.
fastp -i Read1.fq.gz -I Read2.fq.gz -o example_R1.fq.gz -O example_R2.fq.gz -h fastp_report.html
Then, combine pair-end sequencing data into one FASTQ file.
cat example_R1.fq.gz example_R2.fq.gz > example.fq.gz
This step identifies barcodes and genomic DNA flanking the barcoded transposon accroding to the following model. Only reads with 17bp barcode and genomic DNA with length 15bp or more will be considered. Finally, convert the genomic DNA FASTQ file to FASTA for mapping.
GTTCGAATTCNNNNNNNNNNNNNNNNNGAGCTCACTTGTGTATAAGAGTCAG
This model is designed based on the 3' end of our transposon.
GTTCGAATTCis a 10bp sequence located upstream of the barcode on the transposon.NNNNNNNNNNNNNNNNNis the 17bp barcode.GAGCTCACTTGTGTATAAGAGTCAGis the junction between barcode and genomic DNA.
Helper Message
Run python .\Scripts\get_bc_gdna.py -h on Linux terminal
Helper massage is shown:
usage: get_bc_gdna.py [-h] -i -o_bc -o_gdna
Identify barcode and genomic DNA.
optional arguments:
-h, --help show this help message and exit
-i , --input Filename, Tn-Seq sequencing data processed by fastp.
-o_bc , --output_barcode
Filename, an output table of readID and barcode.
-o_gdna , --output_gdna
Filename, an output FASTQ file of genomic DNA flanking barcoded transposons.
Run get_bc_gdna.py on examlple (Linux terminal)
python .\Scripts\get_bc_gdna.py -i .\Example_file\1.1_Pre-process\example.fq.gz -o_bc .\Example_file\1.2_Get_barcode_gdna\table_id_bc.txt -o_gdna .\Example_file\1.2_Get_barcode_gdna\gdna.fq
In order to perform blastn with FASTA file, convert FASTQ file to FASTA file with SeqKit.
seqkit fq2fa gdna.fq -o gdna.fa
This step maps the genomic DNA FASTA file to the Xac CQ13 reference genome downloaded from NCBI. Here, we use the genome sequences instead of genome coding sequences, considering the crucial regulatory roles that ncRNAs play in cellular processes.
- Create database from CQ13 genome sequences (a multi-FASTA file: 1 chromosome + 2 plasmids).
makeblastdb -in CQ13.fna -dbtype nucl -parse_seqids -out db/CQ13
- Perform BLASTN search against the database.
blastn -task blastn-short -query gdna.fa -db db/CQ13 -evalue 1e-5 -outfmt 6 -out ./Example_file/1.3_Blastn-short/blastn_short_result.txt -num_threads 20 -mt_mode 1
blastn application options:
-task blastn-short: blastn supports "blastn-short" task, optimized for sequences less than 30 nucleotides.
-query <filename>: Query FASTA filename.(gdna.fa)
-out <filename>: Output mapping result filename.(blastn_short_result.txt)
-db <database_name>: BLASTN database name.(db/CQ13)
-evalue <value>: Expect value (E) for saving hits, usually use 1e-5, default value is 10.
-outfmt <string>: Set a specified output format.(outfmt 6)
-mt_mode 1 -num_threads <int>: Enable multi-threading by setting the -mt_mode to 1 (default value is 0). Then, specify the number of threads used with the "-num_threads" option(-num_threads 20).
BLASTN Tabular Outformat 6 (Click for more info about outfmt 6)
Column Headers
qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
| Col | Field | Brief description |
|---|---|---|
| 1 | qseqid | query sequence id |
| 2 | sseqid | subject or reference genome sequence id |
| 3 | pident | percentage of identical matches |
| 4 | length | alignment length |
| 5 | mismatch | number of mismatches |
| 6 | gapopen | number of gap openings |
| 7 | qstart | start of alignment in query |
| 8 | qend | end of alignment in query |
| 9 | sstart | start of alignment in subject |
| 10 | send | end of alignment in subject |
| 11 | evalue | expect value |
| 12 | bitscore | bit score |
- Select readID(qsedid) with only one best hit.
- Match insertion sites (i.e., sstart of the blastn result) with the positional information of genes.
-
Input files
-
BLASTN mapping result.
-
Gene table, an excel file containing positional information of genes.
Gene table is generated by extracting
scaffoldId(1st col), begin(4th col), end(5th col), attribute(9th col)from gff file downloaded from NCBI.
-
-
Output files
unique_best_hit.txt: the blastn result of reads with only one best hit will be generated at the working directory.not_unique_best_hit.txt: the blastn result of reads with multiple best hits (>=2) will be generated at the working directory.- The Matching result of insertion sites of reads with only one best hit.
Helper Message
Run python .\Scripts\select_best_hit.py -h on Linux terminal
Helper massage is shown:
usage: select_best_hit.py [-h] -i -gt -o
Select genomic DNA with only one best hit, and match the insertion site with gene info.
optional arguments:
-h, --help show this help message and exit
-i , --input Filename, blastn result file.
-gt , --gene_table Filename, an excel file providing gene location (i.e., scaffold, begin, end, description).
-o , --output Filename, the output matching result of genomic DNA with only one best hit.
Run select_best_hit.py on examlple (Linux terminal)
python .\Scripts\select_best_hit.py -i .\Example_file\1.3_Blastn-short\blastn_short_result.txt -gt .\Example_file\1.4_Select_best_hit_match_gene_info\gene_table.xlsx -o .\Example_file\1.4_Select_best_hit_match_gene_info\best_hit_match_gene_res.txt
The following information will be printed on the terminal.
For chromosome, the number of insertions located at:
only one gene: 5 (1.000000).
intergenic region: 0 (0.000000).
overlapping region of genes: 0 (0.000000).
-----------------------------------------------------------------------
For pXAC33, the number of insertions located at:
only one gene: 1 (0.500000).
intergenic region: 1 (0.500000).
overlapping region of genes: 0 (0.000000).
-----------------------------------------------------------------------
For pXAC64, the number of insertions located at:
only one gene: 6 (0.750000).
intergenic region: 2 (0.250000).
overlapping region of genes: 0 (0.000000).
This step mainly cantains three steps:
- Provide barcodes with inserted genes using the mapping result.
- Match each barcode with a unique mutated gene.
- Standards for consistently-inserted genes
-
Type1: A Gene that all reads of a barcode map to.
-
Type2: A Gene that is the primary mapping site of all reads of a barcode.
Requirement for primary mapping location1:
- Matching time >= 10;
- Matching percentage >= 75%;
- P(Second most frequency mapping location) <= 1/8 * P(Primary mapping location)
-
- Standards for consistently-inserted genes
- Select central insertions.
Run match_barcode_gene.R in Rstudio.
See 1.1
This step identifies barcode from BarSeq sequencing data, and then calculate the count of each barcode.
Helper Message
Run python .\Scripts\get_bc_count.py -h on Linux terminal
Helper massage is shown:
usage: get_bc_count.py [-h] -i -o_bc -o_count
Identify barcode and genomic DNA.
optional arguments:
-h, --help show this help message and exit
-i , --input Filename, BarSeq sequencing data processed by fastp.
-o_bc , --output_barcode
Filename, an output table of readID and barcode.
-o_count , --output_count
Filename, an output table of barcode and counts
Run get_bc_count.py on examlple (Linux terminal)
python .\Scripts\get_bc_count.py -i .\Example_file\2.2_Get_barcode_count\test.fq.gz -o_bc .\Example_file\2.2_Get_barcode_count\readid_barcode.txt -o_count .\Example_file\2.2_Get_barcode_count\barcode_count.txt
This step merges barcode counts result from BarSeq sequencing data with Tn-Seq matching result. Finally, the strain fitness and gene fitness can be computed according to the abundance of each barcode and its inserted gene.
Run fitness_calculator.R in Rstudio.
An example file folder can be found in here.