6. Command Line Interface (CLI)

The command line usage is as follows:

Usage: perl runPipeline.pl [options] -c config.txt -p 'reads1.fastq reads2.fastq' -o out_directory
Version 1.1
Input File:
        -u            Unpaired reads, Single end reads in fastq

        -p            Paired reads in two fastq files and separate by space in quote

        -c            Config File
Output:
        -o            Output directory.

Options:
        -ref          Reference genome file in fasta

        -primer       A pair of Primers sequences in strict fasta format

        -cpu          number of CPUs (default: 8)

        -version      print version
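
As an illustration of the usage message above, the following sketch assembles one paired-end and one single-end invocation and prints them instead of running them. The file names (config.txt, reads1.fastq, reads2.fastq, single.fastq, Reference.fna) and the output directory names are placeholders, not files shipped with EDGE; only flags listed in the usage message are used.

```shell
#!/bin/sh
# Placeholder inputs; substitute your own files.
config=config.txt
outdir=out_directory

# Paired-end: both FASTQ files go in ONE quoted argument, separated by a space.
paired_cmd="perl runPipeline.pl -c $config -p 'reads1.fastq reads2.fastq' -o $outdir -cpu 4"

# Single-end: use -u instead of -p; -ref supplies an optional reference genome.
single_cmd="perl runPipeline.pl -c $config -u single.fastq -ref Reference.fna -o ${outdir}_se -cpu 4"

# Print the commands for review before running either of them.
echo "$paired_cmd"
echo "$single_cmd"
```

Printing the command first makes it easy to confirm that the two paired files sit inside a single pair of quotes, which is the detail most often mistyped.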

A config file (see the example in the section below; the Graphic User Interface (GUI) generates the config automatically), reads files in FASTQ format, and an output directory are required when running from the command line. Based on the configuration file, if all modules are turned on, EDGE runs the following steps. Each step consists of at least one command-line script or program.

Data QC
Host Removal QC
De novo Assembling
Reads Mapping To Contig
Reads Mapping To Reference Genomes
Taxonomy Classification on All Reads or unMapped to Reference Reads
Map Contigs To Reference Genomes
Variant Analysis
Contigs Taxonomy Classification
Contigs Annotation
ProPhage detection
PCR Assay Validation
PCR Assay Adjudication
Phylogenetic Analysis
Generate JBrowse Tracks
HTML report

6.1. Configuration File

The config file is a text file with the following information. If you are going to do host removal, you need to build a host index for it and change the FASTA file path in the config file.

[Count Fastq]
DoCountFastq=auto

[Quality Trim and Filter]
## boolean, 1=yes, 0=no
DoQC=1
##Target quality level for trimming
q=5
##Minimum length of a read after trimming
min_L=50
##Average quality cutoff
avg_q=0
##"N" base cutoff. A trimmed read with more than this number of consecutive "N" bases is discarded.
n=1
##Low complexity filter ratio, Maximum fraction of mono-/di-nucleotide sequence
lc=0.85
## Trim reads with adapters or contamination sequences
adapter=/PATH/adapter.fasta
## phiX filter, boolean, 1=yes, 0=no
phiX=0
## Cut # bp from 5' end before quality trimming/filtering
5end=0
## Cut # bp from 3' end before quality trimming/filtering
3end=0

[Host Removal]
## boolean, 1=yes, 0=no
DoHostRemoval=1
## Add more Host= lines to remove reads from multiple hosts
Host=/PATH/all_chromosome.fasta
similarity=90

[Assembly]
## boolean, 1=yes, 0=no
DoAssembly=1
##Bypass assembly and use pre-assembled contigs
assembledContigs=
minContigSize=200
## spades or idba_ud
assembler=idba_ud
idbaOptions="--pre_correction  --mink 31"
## for spades
singleCellMode=
pacbioFile=
nanoporeFile=

[Reads Mapping To Contigs]
# Reads mapping to contigs
DoReadsMappingContigs=auto

[Reads Mapping To Reference]
# Reads mapping to reference
DoReadsMappingReference=0
bowtieOptions=
# reference genbank or fasta file
reference=
MapUnmappedReads=0

[Reads Taxonomy Classification]
## boolean, 1=yes, 0=no
DoReadsTaxonomy=1
## If a reference genome exists, only unmapped reads are used for Taxonomy Classification. Set AllReads=1 to use all reads instead.
AllReads=0
enabledTools=gottcha-genDB-b,gottcha-speDB-b,gottcha-strDB-b,gottcha-genDB-v,gottcha-speDB-v,gottcha-strDB-v,metaphlan,bwa,kraken_mini

[Contigs Mapping To Reference]
# Contig mapping to reference
DoContigMapping=auto
## identity cutoff
identity=85
MapUnmappedContigs=0

[Variant Analysis]
DoVariantAnalysis=auto

[Contigs Taxonomy Classification]
DoContigsTaxonomy=1

[Contigs Annotation]
## boolean, 1=yes, 0=no
DoAnnotation=1
# kingdom: Archaea Bacteria Mitochondria Viruses
kingdom=Bacteria
contig_size_cut_for_annotation=700
## support tools: Prokka or RATT
annotateProgram=Prokka
annotateSourceGBK=

[ProPhage Detection]
DoProPhageDetection=1

[Phylogenetic Analysis]
DoSNPtree=1
## Available choices are Ecoli, Yersinia, Francisella, Brucella, Bacillus
SNPdbName=Ecoli
## FastTree or RAxML
treeMaker=FastTree
## SRA accessions ByRun, ByExp, BySample, ByStudy
SNP_SRA_ids=

[Primer Validation]
DoPrimerValidation=1
maxMismatch=1
primer=

[Primer Adjudication]
## boolean, 1=yes, 0=no
DoPrimerDesign=0
## desired primer tm
tm_opt=59
tm_min=57
tm_max=63
## desired primer length
len_opt=20
len_min=18
len_max=27
## reject primer having Tm < tm_diff difference with background Tm
tm_diff=5
## display # top results for each target
top=5

[Generate JBrowse Tracks]
DoJBrowse=1

[HTML Report]
DoHTMLReport=1
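
Because the config is plain text, it can be adjusted with standard shell tools before a run. The sketch below works on a small temporary excerpt of the config rather than a real file; the host FASTA path /data/hosts/human.fasta is a hypothetical example. It sets the Host= path, switches one module off, and then lists every module switch (the keys beginning with "Do").

```shell
#!/bin/sh
# Work on a small excerpt of the config so the sketch is self-contained.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
[Host Removal]
DoHostRemoval=1
Host=/PATH/all_chromosome.fasta
similarity=90

[Contigs Annotation]
DoAnnotation=1
kingdom=Bacteria
EOF

# Point Host= at the host FASTA (hypothetical path) and switch annotation off.
# A temp file plus mv is used instead of "sed -i", whose syntax differs
# between GNU and BSD sed.
host_fasta=/data/hosts/human.fasta
sed -e "s|^Host=.*|Host=$host_fasta|" \
    -e 's|^DoAnnotation=.*|DoAnnotation=0|' "$cfg" > "$cfg.new" && mv "$cfg.new" "$cfg"

# List every module switch (keys starting with "Do") and its current value.
switches=$(grep -E '^Do[A-Za-z]+=' "$cfg")
echo "$switches"
```

Listing the switches before launching a run is a quick way to confirm which of the steps will actually execute.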

6.2. Test Run

EDGE provides an example data set: an E. coli MiSeq dataset that has been subsampled to approximately 10-fold coverage.

In the EDGE home directory,

cd testData
sh runTest.sh

Snapshot from the terminal.

See Output

6.3. Descriptions of each module

Each module comes with default parameters. To see a program's optional parameters, run it with the -h or -help flag and no other arguments.

Data QC

Required step? No
Command example
perl $EDGE_HOME/scripts/illumina_fastq_QC.pl  -p 'Ecoli_10x.1.fastq Ecoli_10x.2.fastq'  -q 5 -min_L 50 -avg_q 5 -n 0 -lc 0.85 -d QcReads -t 10
What it does

Quality control

Read filtering

Read trimming

Expected input

Paired-end/Single-end reads in FASTQ format

Expected output

QC.1.trimmed.fastq

QC.2.trimmed.fastq

QC.unpaired.trimmed.fastq

QC.stats.txt

QC_qc_report.pdf
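
When modules are run by hand, it can help to confirm that a step produced its expected files before starting the next one. The helper below is a minimal sketch, not part of EDGE: check_outputs is a hypothetical function name, and QcReads is the directory passed to -d in the command example above.

```shell
#!/bin/sh
# Report which expected files are missing from an output directory.
# Usage: check_outputs DIR FILE...
# Returns 0 when every file exists, otherwise the number of missing files.
check_outputs() {
    dir=$1; shift
    missing=0
    for f in "$@"; do
        if [ ! -e "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Example: verify the Data QC outputs listed above.
check_outputs QcReads QC.1.trimmed.fastq QC.2.trimmed.fastq \
    QC.unpaired.trimmed.fastq QC.stats.txt QC_qc_report.pdf \
    || echo "Data QC output incomplete"
```

The same function can be reused for any other module by passing that module's output directory and the file names from its "Expected output" list.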

Host Removal QC

Required step? No
Command example
perl $EDGE_HOME/scripts/host_reads_removal_by_mapping.pl  -p 'QC.1.trimmed.fastq QC.2.trimmed.fastq' -u QC.unpaired.trimmed.fastq -ref human_chromosomes.fasta -o QcReads -cpu 10
What it does

Read filtering

Expected input

Paired-end/Single-end reads in FASTQ format

Expected output

host_clean.1.fastq

host_clean.2.fastq

host_clean.mapping.log

host_clean.unpaired.fastq

host_clean.stats.txt

IDBA Assembling

Required step? No
Command example
fq2fa --merge host_clean.1.fastq  host_clean.2.fastq  pairedForAssembly.fasta
idba_ud  --num_threads 10 -o AssemblyBasedAnalysis/idba --pre_correction pairedForAssembly.fasta
What it does

Iterative k-mer de novo assembly. It performs well on isolates as well as metagenomes, but it may not work well on very large genomes.

Expected input

Paired-end/Single-end reads in FASTA format

Expected output

contig.fa

scaffold.fa (input paired end)

Reads Mapping To Contig

Required step? No
Command example
perl $EDGE_HOME/scripts/runReadsToContig.pl  -p 'host_clean.1.fastq host_clean.2.fastq' -d AssemblyBasedAnalysis/readsMappingToContig -pre readsToContigs  -ref AssemblyBasedAnalysis/contigs.fa
What it does

Mapping reads to assembled contigs

Expected input

Paired-end/Single-end reads in FASTQ format

Assembled Contigs in Fasta format

Output Directory

Output prefix

Expected output

readsToContigs.alnstats.txt

readsToContigs_coverage.table

readsToContigs_plots.pdf

readsToContigs.sort.bam

readsToContigs.sort.bam.bai

Reads Mapping To Reference Genomes

Required step? No
Command example:
perl $EDGE_HOME/scripts/runReadsToGenome.pl  -p 'host_clean.1.fastq host_clean.2.fastq'  -d ReadsBasedAnalysis -pre readsToRef -ref Reference.fna
What it does

Mapping reads to reference genomes

SNPs/Indels calling

Expected input

Paired-end/Single-end reads in FASTQ format

Reference genomes in Fasta format

Output Directory

Output prefix

Expected output

readsToRef.alnstats.txt

readsToRef_plots.pdf

readsToRef_refID.coverage

readsToRef_refID.gap.coords

readsToRef_refID.window_size_coverage

readsToRef.ref_windows_gc.txt

readsToRef.raw.bcf

readsToRef.sort.bam

readsToRef.sort.bam.bai

readsToRef.vcf
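
The readsToRef.vcf file is plain text in VCF format, where header lines start with "#" and each remaining line is one variant record. As a quick sanity check, the sketch below counts records in total and per reference sequence. It runs on a toy VCF with invented records that stands in for the real file; replace "$vcf" with the path to readsToRef.vcf in practice.

```shell
#!/bin/sh
# Toy VCF standing in for readsToRef.vcf (invented records, illustration only).
vcf=$(mktemp)
{
    printf '##fileformat=VCFv4.1\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf 'refA\t101\t.\tA\tG\t50\t.\t.\n'
    printf 'refA\t250\t.\tC\tT\t60\t.\t.\n'
    printf 'refB\t42\t.\tG\tGA\t30\t.\t.\n'
} > "$vcf"

# Header lines start with "#"; every other line is one variant record.
total=$(grep -vc '^#' "$vcf")

# Records per reference sequence (column 1), printed as "name count".
per_ref=$(grep -v '^#' "$vcf" | cut -f1 | sort | uniq -c | awk '{print $2, $1}')

echo "variants: $total"
echo "$per_ref"
```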

Taxonomy Classification on All Reads or unMapped to Reference Reads

Required step? No
Command example:
perl $EDGE_HOME/scripts/microbial_profiling/microbial_profiling_configure.pl $EDGE_HOME/scripts/microbial_profiling/microbial_profiling.settings.tmpl gottcha-speDB-b > microbial_profiling.settings.ini
perl $EDGE_HOME/scripts/microbial_profiling/microbial_profiling.pl -o  Taxonomy -s microbial_profiling.settings.ini -c 10 UnmappedReads.fastq
What it does

Taxonomy Classification using multiple tools, including BWA mapping to NCBI Refseq, metaphlan, kraken, GOTTCHA.

Unify the various output formats and generate reports

Expected input

Reads in FASTQ format

Configuration text file (generated by microbial_profiling_configure.pl)

Expected output

Summary EXCEL and text files.

Heatmaps comparing the tools

Radar charts comparing the tools

Krona and tree-style plots for each tool.

Map Contigs To Reference Genomes

Required step? No
Command example:
perl $EDGE_HOME/scripts/nucmer_genome_coverage.pl  -e 1 -i 85 -p contigsToRef Reference.fna contigs.fa
What it does

Mapping assembled contigs to reference genomes

SNPs/Indels calling

Expected input

Reference genome in Fasta Format

Assembled contigs in Fasta Format

Output prefix

Expected output

contigsToRef_avg_coverage.table

contigsToRef.delta

contigsToRef_query_unUsed.fasta

contigsToRef.snps

contigsToRef.coords

contigsToRef.log

contigsToRef_query_novel_region_coord.txt

contigsToRef_ref_zero_cov_coord.txt

Variant Analysis

Required step? No
Command example:
perl $EDGE_HOME/scripts/SNP_analysis.pl -genbank Reference.gbk -SNP contigsToRef.snps -format nucmer
perl $EDGE_HOME/scripts/gap_analysis.pl -genbank Reference.gbk -gap  contigsToRef_ref_zero_cov_coord.txt
What it does

Analyze variants and gap regions using the annotation file.

Expected input

Reference in GenBank format

SNPs/INDELs/Gaps files from “Map Contigs To Reference Genomes”

Expected output

contigsToRef.SNPs_report.txt

contigsToRef.Indels_report.txt

GapVSReference.report.txt

Contigs Taxonomy Classification

Required step? No
Command example:
perl $EDGE_HOME/scripts/contig_classifier_by_bwa/contig_classifier_by_bwa.pl --db $EDGE_HOME/database/bwa_index/NCBI-Bacteria-Virus.fna --threads 10 --prefix OutputCT --input contigs.fa
What it does

Taxonomy Classification on contigs using BWA mapping to NCBI Refseq

Expected input

Contigs in Fasta format

NCBI Refseq genomes bwa index

Output prefix

Expected output

prefix.assembly_class.csv

prefix.assembly_class.top.csv

prefix.ctg_class.csv

prefix.ctg_class.LCA.csv

prefix.ctg_class.top.csv

prefix.unclassified.fasta

Contig Annotation

Required step? No
Command example:
prokka --force --prefix PROKKA --outdir Annotation contigs.fa
What it does

The rapid annotation of prokaryotic genomes.

Expected input

Assembled Contigs in Fasta format

Output Directory

Output prefix

Expected output

Prokka produces GFF3, GBK, and SQN files that are ready for editing in Sequin and, ultimately, for submission to GenBank/DDBJ/ENA.

ProPhage detection

Required step? No
Command example:
perl $EDGE_HOME/scripts/phageFinder_prepare.pl -o Prophage -p Assembly Annotation/PROKKA.gff Annotation/PROKKA.fna
$EDGE_HOME/thirdParty/phage_finder_v2.1/bin/phage_finder_v2.1.sh Assembly
What it does

Identify and classify prophages within prokaryotic genomes.

Expected input

Annotated Contigs GenBank file

Output Directory

Output prefix

Expected output

phageFinder_summary.txt

PCR Assay Validation

Required step? No
Command example:
perl $EDGE_HOME/scripts/pcrValidation/validate_primers.pl -ref contigs.fa -primer primers.fa -mismatch 1 -output AssayCheck
What it does

In silico PCR primer validation by sequence alignment.

Expected input

Assembled Contigs/Reference in Fasta format

Output Directory

Output prefix

Expected output

pcrContigValidation.log

pcrContigValidation.bam

PCR Assay Adjudication

Required step? No
Command example:
perl $EDGE_HOME/scripts/pcrAdjudication/pcrUniquePrimer.pl --input contigs.fa  --gff3 PCR.Adjudication.primers.gff3
What it does

Design unique primer pairs for input contigs.

Expected input

Assembled Contigs in Fasta format

Output gff3 file name

Expected output

PCR.Adjudication.primers.gff3

PCR.Adjudication.primers.txt

Phylogenetic Analysis

Required step? No
Command example:
perl $EDGE_HOME/scripts/prepare_SNP_phylogeny.pl -o output/SNP_Phylogeny/Ecoli -tree FastTree -db Ecoli -n output -cpu 10 -p QC.1.trimmed.fastq QC.2.trimmed.fastq -c contigs.fa -s QC.unpaired.trimmed.fastq
perl $EDGE_HOME/scripts/SNPphy/runSNPphylogeny.pl output/SNP_Phylogeny/Ecoli/SNPphy.ctrl
What it does

Perform SNP identification against selected pre-built SNPdb or selected genomes

Build SNP based multiple sequence alignment for all and CDS regions

Generate Tree file in newick/PhyloXML format

Expected input

SNPdb path or genomesList

Fastq reads files

Contig files

Expected output

SNP-based phylogenetic multiple sequence alignment

SNP-based phylogenetic tree in newick/PhyloXML format

SNP information table

Generate JBrowse Tracks

Required step? No
Command example:
perl $EDGE_HOME/scripts/edge2jbrowse_converter.pl --in-ref-fa Reference.fna --in-ref-gff3 Reference.gff --proj_outdir EDGE_project_dir
What it does

Convert several EDGE outputs into JBrowse tracks for visualizing the contigs and the reference, respectively.

Expected input

EDGE project output Directory

Expected output

EDGE post-processed files for JBrowse tracks in the JBrowse directory.

Tracks configuration files in the JBrowse directory.

HTML Report

Required step? No
Command example:
perl $EDGE_HOME/scripts/munger/outputMunger_w_temp.pl EDGE_project_dir
What it does

Generate summary statistics and plots in an interactive HTML report page.

Expected input

EDGE project output Directory

Expected output

report.html

6.4. Other command-line utility scripts

To extract the FASTA sequences of a given taxon from the contig classification result:

cd /home/edge_install/edge_ui/EDGE_output/41/AssemblyBasedAnalysis/Taxonomy
perl /home/edge_install/scripts/contig_classifier_by_bwa/extract_fasta_by_taxa.pl -fasta ../contigs.fa -csv ProjectName.ctg_class.top.csv -taxa "Enterobacter cloacae" > Ecloacae.contigs.fa

To extract unmapped/mapped reads fastq from the bam file:

cd /home/edge_install/edge_ui/EDGE_output/41/AssemblyBasedAnalysis/readsMappingToContig
# extract unmapped reads
perl /home/edge_install/scripts/bam_to_fastq.pl -unmapped readsToContigs.sort.bam
# extract mapped reads
perl /home/edge_install/scripts/bam_to_fastq.pl -mapped readsToContigs.sort.bam

To extract mapped reads fastq of a specific contig/reference from the bam file:

cd /home/edge_install/edge_ui/EDGE_output/41/AssemblyBasedAnalysis/readsMappingToContig
perl /home/edge_install/scripts/bam_to_fastq.pl -id ProjectName_00001 -mapped readsToContigs.sort.bam
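
To repeat the per-contig extraction for several contigs, the IDs can be read from the FASTA headers and turned into one command per contig. The sketch below only prints the commands. It uses a toy FASTA with invented headers in place of contigs.fa, and it assumes the contig ID is the first word of each header line. How bam_to_fastq.pl names its output files is not described here, so check whether successive runs overwrite each other before running the commands in one directory.

```shell
#!/bin/sh
# Toy FASTA standing in for contigs.fa (invented headers, illustration only).
fasta=$(mktemp)
printf '>ProjectName_00001 len=1200\nACGT\n>ProjectName_00002 len=800\nGGCC\n' > "$fasta"

bam=readsToContigs.sort.bam
script=/home/edge_install/scripts/bam_to_fastq.pl

# Contig ID = header line without ">" and without anything after the first space.
ids=$(grep '^>' "$fasta" | sed -e 's/^>//' -e 's/[[:space:]].*$//')

# Same flags as the single-contig example above; one command per contig.
cmds=$(for id in $ids; do
    echo "perl $script -id $id -mapped $bam"
done)

echo "$cmds"
```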