GFOLD V1.1.4

NAME

gfold - Generalized fold change for ranking differentially expressed genes from RNA-seq data.

GFOLD is especially useful when no replicate is available. GFOLD generalizes the fold change by considering the posterior distribution of log fold change, such that each gene is assigned a reliable fold change. It overcomes the shortcoming of p-value that measures the significance of whether a gene is differentially expressed under different conditions instead of measuring relative expression changes, which are more interesting in many studies. It also overcomes the shortcoming of fold change that suffers from the fact that the fold change of genes with low read count are not so reliable as that of genes with high read count, even these two genes show the same fold change.

CITATION

Feng J, Meyer CA, Wang Q, Liu JS, Liu XS, Zhang Y. GFOLD: a generalized fold change for ranking differentially expressed genes from RNA-seq data. Bioinformatics 2012

DOWNLOAD

Downloading the latest GFOLD code and annotation data from https://github.com/knowledgefold/gfold is preferred.

Here is the link for downloading GFOLD code for people who cannot access bitbucket gfold.V1.1.4.tar.gz. The file might not be the latest version.

SYNOPSIS

gfold JOBS OPTIONS

EXAMPLES

Example 1: Count reads and rank genes

In the following example, hg19Ref.gtf is the ucsc knownGene table for hg19; sample1.sam and sample2.sam are the mapped reads in SAM format.

gfold count -ann hg19Ref.gtf -tag sample1.sam -o sample1.read_cnt
gfold count -ann hg19Ref.gtf -tag sample2.sam -o sample2.read_cnt
gfold diff -s1 sample1 -s2 sample2 -suf .read_cnt -o sample1VSsample2.diff

Example 2: Count reads

This example utilizes samtools to produce mapped reads in SAM format from BAM format.

samtools view sample1.bam | gfold count -ann hg19Ref.gtf -tag stdin -o sample1.read_cnt

Example 3: Identify differentially expressed genes without replicates

Suppose there are two samples: sample1 and sample2 with corresponding read count file being sample1.read_cnt sample2.read_cnt. This example finds differentially expressed genes using default parameters on two samples

gfold diff -s1 sample1 -s2 sample2 -suf .read_cnt -o sample1VSsample2.diff

Example 4: Identify differentially expressed genes with replicates

This example finds differentially expressed genes using default parameters on two group of samples.

gfold diff -s1 sample1,sample2,sample3 -s2 sample4,sample5,sample6 -suf .read_cnt -o 123VS456.diff

Example 5: Identify differentially expressed genes with replicates only in one condition

This example finds differentially expressed genes using default parameters on two group of samples. Only the first group contains replicates. In this case, the variance estimated based on the first group will be used as the variance of the second group.

gfold diff -s1 sample1,sample2 -s2 sample3 -suf .read_cnt -o 12VS3.diff

JOBS

-h: Print help information
count: Given the gene annotation in GTF/GPF/BED format and mapped short reads in SAM/BED format, count the number of reads mapped to each gene. Because of possible overlapping of multiple genes, a read could be mapped to the overlaped region of multiple genes. In this case, a read is counted multiple times with each time for each gene. Furthermore, if a gene is on multiple chromosomes or different strands of the same chromosome, only exons on one strand of one chromosome (the one appear first in the annotation file) will be assigned to this gene. Exons not on this strand of the chromosome will be discarded.
diff: For each gene, calculate GFOLD value and other statistics. diff accepts the output of count as the input. Please refer to the output format of count for more information about the input format. If you are not satisfied with the strategy adopted by count, you can generate gene read counts by yourself. In the input for diff 'GeneSymbol' and 'Read Count' are critical. If you fake 'Gene exon length', the RPKM calculated will not be valid. You can fake other columns to make gfold run. Further more the order of gene names should be the same for different samples. The third column of the output of count (gene length) only influences the RPKM in the output of diff. If it is missing, RPKM will not be generated by diff. diff does not use the forth column of the output of count.

OPTIONS

-ann <file>

Gene annotation file in GTF/GPF/BED format. Note that the knownGene table downloaded from UCSC is in GPF format. For job count only.

-annf <GTF/GPF/BED>

The format of gene annotation file. Default GTF (Gene Transfer Format). For job count only. For GTF format, in the last column, 'gene_id' and 'transcript_id' are required. 'gene_name' and other IDs are optional. For BED format, please provide format which contains 6 columns with the ID at 4'th column. For GPF (Gene Prediction Format), in short, from UCSC Table Browser, the 'knownGene' table with all fields is in GPF and the 'refGene' table without the first column is in GPF format. Note that for either 'knownGene' or 'refGene' table, the downloaded file would contain a header which should be removed before calling GFOLD. More specifically, a file in GPF format contains 12 columns separated by TABs (adapted from UCSC):

name	Name of gene
chrom	Reference sequence chromosome or scaffold
strand	+ or - for strand
txStart	Transcription start position
txEnd	Transcription end position
cdsStart	Coding region start
cdsEnd	Coding region end
exonCount	Number of exons
exonStarts	Exon start positions separated by commas
exonEnds	Exon end positions separated by commas
proteinID	UniProt display ID for Known Genes, UniProt accession or RefSeq protein ID for UCSC Genes
alignID	Unique identifier for each (known gene, alignment position) pair

-tag <file>

Short reads in SAM format. 'stdin' stands for standard input stream. For job count only.

-tagf <SAM/BED>

The format of short reads. Default SAM. For job count only.

-s <T/F>

Whether is the sequencing data strand specific? T stands for strand specific. Default F. If you are not clear about this, using default parameter should be OK even for the strand specific case. For job count only.

-acc <T/F>

When no replicate is available, whether to use accurate method to calculate GFOLD value. T stands for accurate which depends on sequencing depth and slower, F stands for MCMC. Default T. For job diff only.

-o <file>

The file for output for all jobs. For job diff, two output files will be generated. The first is <file> and the second is <file>.ext

-s1 <file>

The prefix for gene read count of the 1st group output by count. Multiple prefixes are separated by commas. For job diff only. If you have gene read count generated by other ways instead of job count, make sure that the format are the same for all files. Each file contains two columns corresponding to gene names and read counts separated by a TAB. All files are sorted by gene names and have the same number of lines.

-s2 <file>

The prefix for gene read count of the 2st group output by count. Multiple prefixes are separated by commas. For job diff only.

-d <file>

Gene description file. The first two columns of the file will be used. The first column should be gene descriptions; The second column should be gene IDs, e.g. the first column of the output of Count. The gene description file can be downloaded from: http://uswest.ensembl.org/biomart/martview . Please select the correct ID type during downloading such that the IDs match the gene IDs appearing in -ann

-suf <string>

The suffix for gene read count file specified by -s1 and -s2. For job diff only.

-sc <num>

The significant cutoff for fold change. Default 0.01. For job diff only.

-bi <num>

For MCMC, the iterations for burn-in phase. Default 1000. For job diff only.

-si <num>

For MCMC, the iterations for sampling phase. Default 1000. For job diff only.

-r <num>

The maximum number of selected pairs for calculating empirical FDR. Default 20. For job diff only.

-v <num>

Verbos level. A larger value gives more information of the running process. Default 2.

-norm <Count/DESeq/NO>

The way to do normalization. 'Count' stands for normalization by total number of mapped reads. 'DESeq' stands for the normalization proposed by DESeq. 'NO' stands for no normalization. You can also specifiy a list of normalization constant separated by commas. E.g. 1.2,2.1,1.0,2.0. Note that the number of constants should be the same as the total number of samples (group1 and group2) and the order should be for -s1 followed by for -s2. GFOLD using normalization constants not by directly multiplication (scaling up) nor division (scaling down). The normalization constants will be built into the model. In the model, division or multiplication has no difference. Default 'DESeq'.

OUTPUT FORMAT

All fields in a output file are separated by TABs.

For JOB count:

The output file contains 5 columns:

GeneSymbol:
For BED file, this is the 4'th column. For GPF file, this is the first column. For GTF format, this corresponds to 'gene_id' if it exists, 'NA' otherwise.
GeneName:
For BED file, this is always 'NA'. For GPF file, this is the 12'th column. For GTF format, this corresponds to 'gene_name' if it exists, 'NA' otherwise.
Read Count:
The number of reads mapped to this gene.
Gene exon length:
The length sum of all the exons of this gene.
RPKM:
The expression level of this gene in RPKM.

For JOB diff:

The first output file contains 7 columns:

#GeneSymbol:
Gene symbols. The order of gene symbol is the same as what appearing in the read count file.
GeneName:
Gene name. The order of gene name is the same as what appearing in the read count file.
GFOLD:
GFOLD value for every gene. The GFOLD value could be considered as a reliable log2 fold change. It is positive/negative if the gene is up/down regulated. The main usefulness of GFOLD is to provide a biological meanlingful ranking of the genes. The GFOLD value is zero if the gene doesn't show differential expression. If the log2 fold change is treated as a random variable, a positive GFOLD value x means that the probability of the log2 fold change (2nd/1st) being larger than x is (1 - the parameter specified by -sc); A negative GFOLD value x means that the probability of the log2 fold change (2st/1nd) being smaller than x is (1 - the parameter specified by -sc). If this file is sorted by this column in descending order then genes ranked at the top are differentially up-regulated and genes ranked at the bottom are differentially down-regulated. Note that a gene with GFOLD value 0 should never be considered differentially expressed. However, it doesn't mean that all genes with non-negative GFOLD value are differentially expressed. For taking top differentially expressed genes, the user is responsible for selecting the cutoff.
E-FDR:
Empirical FDR based on replicates. It is always 1 when no replicates are available.
log2fdc:
log2 fold change. If no replicate is available, and -acc is T, log2 fold change is based on read counts and normalization constants. Otherwise, log2 fold change is based on the sampled expression level from the posterior distribution.
1stRPKM:
The RPKM for the first condition. It is available only if gene length is available. If multiple replicates are available, the RPKM is calculated simply by summing over replicates. Because RPKM is acturally using sequencing depth as the normalization constant, log2 fold change based on RPKM could be different from the log2fdc field.
2ndRPKM:
The RPKM for the second condition. It is available only if gene length is available. Please refer to 1stRPKM for more information.

The second output file (.ext) contains the normalized read counts and gene description

#GeneSymbol:
Gene symbols. The order of gene symbol is the same as what appearing in the read count file.
Normalized Read Counts:
Multiple columns (the same number of input samples) contains the read count normalized by the calculated normalization constants. They are only muturally comparable and the absolute values are meaningless.
Gene Description:
The last column contains the associated gene description if the description file -d is provided.

AUTHOR

Jianxing Feng (jianxing.tongji@gmail.com)

GFOLD V1.1.4