Barcoded Functional Genome Analysis

This is an instruction for analyzing functional genome of Xanthomonas citri pv. citri CQ13 using RB-TnSeq¹.

The workflow consists of:

Tn-Seq analysis
- Pre-process sequencing data.[Fastp]
- Get barcodes and FASTA files with transposon-truncated reads.[get_bc_gdna.py&SeqKit]
- Map to the reference genome.[blastn-short]
- Select best hit and Match insertion site with gene info.[select_best_hit.py]
- Match barcodes with consistently-inserted genes. [match_barcode_gene.R]
BarSeq analysis
- Pre-process sequencing data.[Fastp]
- Get barcodes and Calculate the count of each barcode.[get_bc_count.py]
- Compute strain fitness and gene fitness.[fitness_calculator.R]

Requirements

Miniconda (optional, for packages installation)
Fastp
SeqKit
BLAST
Python
R, Rstudio (R packages: tidyverse, rio, plyr)
Excel
Operating system: Linux
SSH tool: finalshell (optional, better to have if using remote server)

1. Tn-Seq analysis

1.1 Pre-process sequencing data

This step processes sequencing data through quality control, trimming of adapters using fastp.

fastp -i Read1.fq.gz -I Read2.fq.gz -o example_R1.fq.gz -O example_R2.fq.gz -h fastp_report.html

Then, combine pair-end sequencing data into one FASTQ file.

cat example_R1.fq.gz example_R2.fq.gz > example.fq.gz

1.2 Get barcodes and FASTA files with transposon-truncated reads(only genomic DNA remained)

This step identifies barcodes and genomic DNA flanking the barcoded transposon accroding to the following model. Only reads with 17bp barcode and genomic DNA with length 15bp or more will be considered. Finally, convert the genomic DNA FASTQ file to FASTA for mapping.

GTTCGAATTCNNNNNNNNNNNNNNNNNGAGCTCACTTGTGTATAAGAGTCAG

This model is designed based on the 3' end of our transposon.

GTTCGAATTC is a 10bp sequence located upstream of the barcode on the transposon.
NNNNNNNNNNNNNNNNN is the 17bp barcode.
GAGCTCACTTGTGTATAAGAGTCAG is the junction between barcode and genomic DNA.

Helper Message

Run python .\Scripts\get_bc_gdna.py -h on Linux terminal

Helper massage is shown:

usage: get_bc_gdna.py [-h] -i  -o_bc  -o_gdna

Identify barcode and genomic DNA.

optional arguments:
  -h, --help            show this help message and exit
  -i , --input          Filename, Tn-Seq sequencing data processed by fastp.
  -o_bc , --output_barcode
                        Filename, an output table of readID and barcode.
  -o_gdna , --output_gdna
                        Filename, an output FASTQ file of genomic DNA flanking barcoded transposons.

Run get_bc_gdna.py on examlple (Linux terminal)

python .\Scripts\get_bc_gdna.py -i .\Example_file\1.1_Pre-process\example.fq.gz -o_bc .\Example_file\1.2_Get_barcode_gdna\table_id_bc.txt -o_gdna .\Example_file\1.2_Get_barcode_gdna\gdna.fq

In order to perform blastn with FASTA file, convert FASTQ file to FASTA file with SeqKit.

seqkit fq2fa gdna.fq -o gdna.fa

1.3 Map to the reference genome

This step maps the genomic DNA FASTA file to the Xac CQ13 reference genome downloaded from NCBI. Here, we use the genome sequences instead of genome coding sequences, considering the crucial regulatory roles that ncRNAs play in cellular processes.

Create database from CQ13 genome sequences (a multi-FASTA file: 1 chromosome + 2 plasmids).

makeblastdb -in CQ13.fna -dbtype nucl -parse_seqids -out db/CQ13

Perform BLASTN search against the database.

blastn -task blastn-short -query gdna.fa -db db/CQ13 -evalue 1e-5 -outfmt 6 -out ./Example_file/1.3_Blastn-short/blastn_short_result.txt -num_threads 20 -mt_mode 1

blastn application options:

-task blastn-short: blastn supports "blastn-short" task, optimized for sequences less than 30 nucleotides.
-query <filename>: Query FASTA filename.(gdna.fa)
-out <filename>: Output mapping result filename.(blastn_short_result.txt)
-db <database_name>: BLASTN database name.(db/CQ13)
-evalue <value>: Expect value (E) for saving hits, usually use 1e-5, default value is 10.
-outfmt <string>: Set a specified output format.(outfmt 6)
-mt_mode 1 -num_threads <int>: Enable multi-threading by setting the -mt_mode to 1 (default value is 0). Then, specify the number of threads used with the "-num_threads" option(-num_threads 20).

BLASTN Tabular Outformat 6 (Click for more info about outfmt 6)

Column Headers

qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

Col	Field	Brief description
1	qseqid	query sequence id
2	sseqid	subject or reference genome sequence id
3	pident	percentage of identical matches
4	length	alignment length
5	mismatch	number of mismatches
6	gapopen	number of gap openings
7	qstart	start of alignment in query
8	qend	end of alignment in query
9	sstart	start of alignment in subject
10	send	end of alignment in subject
11	evalue	expect value
12	bitscore	bit score

1.4 Select best hit and Match insertion site with gene info

Select readID(qsedid) with only one best hit.
Match insertion sites (i.e., sstart of the blastn result) with the positional information of genes.

Input files
- BLASTN mapping result.
- Gene table, an excel file containing positional information of genes.
  
  Gene table is generated by extracting scaffoldId(1st col), begin(4th col), end(5th col), attribute(9th col) from gff file downloaded from NCBI.
Output files
- unique_best_hit.txt : the blastn result of reads with only one best hit will be generated at the working directory.
- not_unique_best_hit.txt : the blastn result of reads with multiple best hits (>=2) will be generated at the working directory.
- The Matching result of insertion sites of reads with only one best hit.

Helper Message

Run python .\Scripts\select_best_hit.py -h on Linux terminal

Helper massage is shown:

usage: select_best_hit.py [-h] -i  -gt  -o

Select genomic DNA with only one best hit, and match the insertion site with gene info.

optional arguments:
  -h, --help           show this help message and exit
  -i , --input         Filename, blastn result file.
  -gt , --gene_table   Filename, an excel file providing gene location (i.e., scaffold, begin, end, description).
  -o , --output        Filename, the output matching result of genomic DNA with only one best hit.

Run select_best_hit.py on examlple (Linux terminal)

python .\Scripts\select_best_hit.py -i .\Example_file\1.3_Blastn-short\blastn_short_result.txt -gt .\Example_file\1.4_Select_best_hit_match_gene_info\gene_table.xlsx -o .\Example_file\1.4_Select_best_hit_match_gene_info\best_hit_match_gene_res.txt

The following information will be printed on the terminal.

For chromosome, the number of insertions located at:
  only one gene: 5 (1.000000).
  intergenic region: 0 (0.000000).
  overlapping region of genes: 0 (0.000000).
-----------------------------------------------------------------------
For pXAC33, the number of insertions located at:
  only one gene: 1 (0.500000).
  intergenic region: 1 (0.500000).
  overlapping region of genes: 0 (0.000000).
-----------------------------------------------------------------------
For pXAC64, the number of insertions located at:
  only one gene: 6 (0.750000).
  intergenic region: 2 (0.250000).
  overlapping region of genes: 0 (0.000000).

1.5 Match barcodes with consistently-inserted genes

This step mainly cantains three steps:

Provide barcodes with inserted genes using the mapping result.
Match each barcode with a unique mutated gene.
- Standards for consistently-inserted genes
  - Type1: A Gene that all reads of a barcode map to.
  - Type2: A Gene that is the primary mapping site of all reads of a barcode.
    
    Requirement for primary mapping location¹:
    1. Matching time >= 10;
    2. Matching percentage >= 75%;
    3. P(Second most frequency mapping location) <= 1/8 * P(Primary mapping location)
Select central insertions.

Run match_barcode_gene.R in Rstudio.

2. BarSeq analysis

2.1 Pre-process sequencing data

See 1.1

2.2 Get barcodes and Calculate the count of each barcode

This step identifies barcode from BarSeq sequencing data, and then calculate the count of each barcode.

Helper Message

Run python .\Scripts\get_bc_count.py -h on Linux terminal

Helper massage is shown:

usage: get_bc_count.py [-h] -i  -o_bc  -o_count
Identify barcode and genomic DNA.

optional arguments:
  -h, --help            show this help message and exit
  -i , --input          Filename, BarSeq sequencing data processed by fastp.
  -o_bc , --output_barcode
                        Filename, an output table of readID and barcode.
  -o_count , --output_count
                        Filename, an output table of barcode and counts

Run get_bc_count.py on examlple (Linux terminal)

python .\Scripts\get_bc_count.py -i .\Example_file\2.2_Get_barcode_count\test.fq.gz -o_bc .\Example_file\2.2_Get_barcode_count\readid_barcode.txt -o_count .\Example_file\2.2_Get_barcode_count\barcode_count.txt

2.3 Compute strain fitness and gene fitness

This step merges barcode counts result from BarSeq sequencing data with Tn-Seq matching result. Finally, the strain fitness and gene fitness can be computed according to the abundance of each barcode and its inserted gene.

Run fitness_calculator.R in Rstudio.

Example file

An example file folder can be found in here.

Reference

Wetmore, K. M. et al. Rapid quantification of mutant fitness in diverse bacteria by sequencing randomly bar-coded transposons. mBio 6, e00306-00315 (2015). ↩ ↩²

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
Example_file		Example_file
Scripts		Scripts
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Barcoded Functional Genome Analysis

Requirements

1. Tn-Seq analysis

1.1 Pre-process sequencing data

1.2 Get barcodes and FASTA files with transposon-truncated reads(only genomic DNA remained)

1.3 Map to the reference genome

BLASTN Tabular Outformat 6 (Click for more info about outfmt 6)

1.4 Select best hit and Match insertion site with gene info

1.5 Match barcodes with consistently-inserted genes

2. BarSeq analysis

2.1 Pre-process sequencing data

2.2 Get barcodes and Calculate the count of each barcode

2.3 Compute strain fitness and gene fitness

Example file

Reference

About

Uh oh!

Releases

Packages

Languages

GriffyGe/Barcoded_functional_genomics

Folders and files

Latest commit

History

Repository files navigation

Barcoded Functional Genome Analysis

Requirements

1. Tn-Seq analysis

1.1 Pre-process sequencing data

1.2 Get barcodes and FASTA files with transposon-truncated reads(only genomic DNA remained)

1.3 Map to the reference genome

BLASTN Tabular Outformat 6 (Click for more info about outfmt 6)

1.4 Select best hit and Match insertion site with gene info

1.5 Match barcodes with consistently-inserted genes

2. BarSeq analysis

2.1 Pre-process sequencing data

2.2 Get barcodes and Calculate the count of each barcode

2.3 Compute strain fitness and gene fitness

Example file

Reference

Footnotes

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages