9 FCS-GSEA analysis
9.1 FCS method introduction
ORA method is easy to conduct, but it will lose information when genes’ differences are slight. For example, if a gene is essential in some pathway but is low-expressed, it will be filtered by fold change cutoff.
Unlike ORA, FCS tools do not set a threshold for differentially expressed genes. Instead, it gives each detected gene a differential expression score and then evaluates whether the scores are more positive or negative than expected by chance for each gene set.
According to article “Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges”:
FCS approaches include GSEA, GlobalTest, sigPathway, SAFE, GSA, PADOG, PCOT2, FunCluster, SAM-GS, Category etc.
Next, we will go through the popular Gene Set Enrichment Analysis (GSEA) tool, which uses FCS permutation approaches to determine whether a gene set is significantly associated with higher or lower scores.
9.2 GSEA intruduction
To do GSEA, user must have an ordered gene rank list (e.g., genes with decreasing logFC order).
Three steps in GSEA:
Calculate the enrichment score (ES): represents the amount to which the genes in the set are over-represented at either the top or bottom of the list. GSEA starts at the top of the ranked gene list to calculate the enrichment score. If a gene is a member of the candidate gene set, it adds to a running sum. Otherwise, it subtracts.
Estimate the statistical significance of the ES: this calculation is done by a phenotypic-based permutation test in order to produce a null distribution for the ES.
Adjust for multiple hypothesis testing for when a large number of gene sets are being analyzed at one time: the enrichment scores for each set are normalized, and a false discovery rate is calculated.
If you are interested in GSEA detailed procedure, please visit: https://www.pathwaycommons.org/guide/primers/data_analysis/gsea OR https://github.com/crazyhottommy/RNA-seq-analysis/blob/master/GSEA_explained.md
9.3 Basic usage
The simplest arguments are:
-
id
: pre-ranked genelist with decreasing order (e.g. logFC order/correlation order). Entrez, Ensembl and Symbol are accepted -
geneset
: gene set is a two-column data frame with term id and gene id. (It’s recommended to use geneset) -
p_cutoff
: numeric of cutoff for both pvalue and adjusted pvalue, default is 0.05. -
q_cutoff
: numeric of cutoff for qvalue, default is 0.15.
9.3.1 1st step: prepare pre-ranked gene list
## 948 1638 158471 10610 6947 100133941
## 5.780170 5.633027 4.683610 3.875120 3.357670 3.322533
9.3.2 2nd step: prepare gene set
gs <- geneset::getGO(org = "human",ont = "mf")
9.3.3 3rd step: GSEA analysis
gse <- genGSEA(genelist = geneList, geneset = gs)
Now, let’s look at the result:
It is a list mainly including analysis result (gsea_df
), input genelist (genelist
) and input geneset (geneset
).
class(gse)
## [1] "list"
names(gse)
## [1] "gsea_df" "genelist" "geneset" "exponent" "org"
head(gse$genelist)
## ID logfc
## 1 CD36 5.780170
## 2 DCT 5.633027
## 3 PRUNE2 4.683610
## 4 ST6GALNAC2 3.875120
## 5 TCN1 3.357670
## 6 CD24 3.322533
head(gse$geneset)
## mf gene
## 1 GO:0000009 PIGV
## 2 GO:0000009 ALG12
## 3 GO:0000010 PDSS1
## 4 GO:0000010 PDSS2
## 5 GO:0000014 ENDOG
## 6 GO:0000014 ERCC1
head(gse$gsea_df, 5)
## Hs_MF_ID Description setSize
## GO:0140097 GO:0140097 catalytic activity, acting on DNA 207
## GO:0140098 GO:0140098 catalytic activity, acting on RNA 350
## GO:0008094 GO:0008094 ATP-dependent activity, acting on DNA 97
## GO:0016887 GO:0016887 ATP hydrolysis activity 422
## GO:0004386 GO:0004386 helicase activity 147
## enrichmentScore NES pvalue p.adjust qvalue
## GO:0140097 -0.5894504 -2.428361 7.555161e-17 8.884870e-14 7.380200e-14
## GO:0140098 -0.4747857 -2.065130 4.912469e-14 2.888532e-11 2.399353e-11
## GO:0008094 -0.6697833 -2.480446 7.069515e-13 2.771250e-10 2.301933e-10
## GO:0016887 -0.4341955 -1.929356 7.323023e-12 2.152969e-09 1.788359e-09
## GO:0004386 -0.5805801 -2.284009 1.724354e-11 3.583755e-09 2.976839e-09
## rank leading_edge
## GO:0140097 1298 tags=30%, list=7%, signal=28%
## GO:0140098 2872 tags=38%, list=15%, signal=33%
## GO:0008094 1172 tags=37%, list=6%, signal=35%
## GO:0016887 2528 tags=29%, list=14%, signal=26%
## GO:0004386 1295 tags=31%, list=7%, signal=29%
## geneID
## GO:0140097 3980/348654/1786/1105/10856/1660/7150/63922/5889/10728/57697/7156/27301/7517/10146/5932/56652/3978/54984/5810/5424/54107/8607/5422/2956/7374/11144/7153/5985/4436/8458/2237/5426/64858/5982/1763/5984/10714/1736/5111/55345/4172/84515/8091/83990/4171/4176/5983/10721/79075/55247/4174/4173/79915/4175/8438/54821/5427/641/5888/7516/9156
## GO:0140098 138428/122665/5435/60625/5976/57472/55278/10623/3396/167227/7015/91801/8731/8846/80745/80119/10557/55695/10556/79828/60528/91893/29883/55661/339175/23517/22907/80746/22894/171568/51163/11128/1653/6832/57505/55308/55794/5437/5511/79066/221078/11102/115708/85463/51010/114049/112479/5438/54512/201626/9836/25885/23016/124454/57062/55798/221830/96764/9533/117246/10799/29063/10940/79039/79731/27292/55280/661/54913/51651/9879/55687/55621/57696/56339/29960/55157/10849/56931/246243/54931/64794/57647/1660/9775/23210/54606/2193/54148/57697/81875/9704/5433/27161/56919/1662/24140/79691/10146/54888/10621/10212/114034/28960/92935/64216/10171/4839/64425/4234/56915/10436/90459/25819/80324/10785/10622/8886/1665/51728/10535/11218/10056/84172/87178/10248/23404/2237/5393/9188/51106/5557/83990/9156
## GO:0008094 1105/10856/1660/63922/5889/57697/7517/10146/56652/8607/2956/11144/7153/5985/4436/8458/5982/1763/5984/4172/84515/83990/4171/4176/5983/10721/79075/4174/4173/79915/4175/8438/54821/641/5888/7516
## GO:0016887 22880/3327/80119/10131/730211/23347/55661/23517/3313/22907/3308/55636/3799/64222/1653/6832/5704/9126/55308/55794/391634/23400/154664/3312/81570/5700/338322/5706/4643/57062/4292/5981/6782/3320/10694/23195/79039/547/8243/3329/5702/3324/55210/9879/6891/6950/57696/10575/1105/10856/3309/7203/64794/57647/1660/908/9775/22824/63922/54606/57697/9704/56919/1662/3800/23046/2181/10146/10576/10212/2182/10051/56652/1836/10112/10592/3323/51182/490/8607/8886/1665/11218/22948/5985/81930/3835/4436/10808/80179/9188/166378/3832/11004/9928/5982/1763/9585/5984/3833/63979/146909/4172/9493/84515/83990/56992/4171/4176/5983/4174/4173/79915/4175/9319/91607/8438/29028/4998/54821/641/5888
## GO:0004386 9879/57696/1105/10856/64794/57647/1660/9775/63922/54606/57697/9704/56919/1662/10146/10212/56652/8607/8886/1665/11218/5985/8458/9188/5982/1763/5984/55345/4172/84515/83990/4171/4176/5983/10721/79075/4174/4173/4175/3070/91607/8438/54821/641/5888
## geneID_symbol
## GO:0140097 LIG3/GEN1/DNMT1/CHD1/RUVBL2/DHX9/TOP1/CHTF18/RAD51C/PTGES3/FANCM/TOP3A/APEX2/XRCC3/G3BP1/RBBP8/TWNK/LIG1/PINX1/RAD1/POLD1/POLE3/RUVBL1/POLA1/MSH6/UNG/DMC1/TOP2A/RFC5/MSH2/TTF2/FEN1/POLE/DCLRE1B/RFC2/DNA2/RFC4/POLD3/DKC1/PCNA/ZGRF1/MCM3/MCM8/HMGA2/BRIP1/MCM2/MCM7/RFC3/POLQ/DSCC1/NEIL3/MCM5/MCM4/ATAD5/MCM6/RAD54L/ERCC6L/POLE2/BLM/RAD51/XRCC2/EXO1
## GO:0140098 PTRH1/RNASE8/POLR2F/DHX35/UPF1/CNOT6/QRSL1/POLR3C/MRPL58/DCP2/TERT/ALKBH8/RNMT/ALKBH1/THUMPD2/PIF1/RPP38/NSUN5/RPP30/METTL8/ELAC2/FDXACB1/CNOT7/DDX27/METTL2A/MTREX/DHX30/TSEN2/DIS3/POLR3H/DBR1/POLR3A/DDX1/SUPV3L1/AARS2/DDX19A/DDX28/POLR2H/PPP1R8/METTL16/NSUN6/RPP14/TRMT61A/ZC3H12C/EXOSC3/BUD23/ERI2/POLR2I/EXOSC4/PDE12/LCMT2/POLR1A/EXOSC7/EARS2/DDX24/METTL2B/POLR1F/TGS1/POLR1C/FTSJ3/RPP40/ZCCHC4/POP1/DDX54/NARS2/DIMT1/CWF19L1/POLR3D/RPP25/PTRH2/DDX46/TRMU/TRMT1/DDX55/METTL3/MRM2/DARS2/POLR1G/DUS3L/RNASEH1/TRMT10C/DDX31/DHX37/DHX9/EIF4A3/JMJD6/DDX56/FARSA/MRPL39/FANCM/ISG20L2/DHX34/POLR2D/AGO2/DHX33/DDX10/FTSJ1/QTRT2/G3BP1/NSUN2/POLR3F/DDX39A/TOE1/DCPS/MARS2/TFB2M/RCL1/NOP2/POLR1E/METTL1/EXOSC5/EMG1/ERI1/NOCT/PUS1/WDR4/POLR3G/DDX18/DHX15/POLR3K/RNASEH2A/DDX20/FARSB/POLR1B/PNPT1/POP7/EXOSC2/FEN1/EXOSC9/DDX21/TFB1M/PRIM1/BRIP1/EXO1
## GO:0008094 CHD1/RUVBL2/DHX9/CHTF18/RAD51C/FANCM/XRCC3/G3BP1/TWNK/RUVBL1/MSH6/DMC1/TOP2A/RFC5/MSH2/TTF2/RFC2/DNA2/RFC4/MCM3/MCM8/BRIP1/MCM2/MCM7/RFC3/POLQ/DSCC1/MCM5/MCM4/ATAD5/MCM6/RAD54L/ERCC6L/BLM/RAD51/XRCC2
## GO:0016887 MORC2/HSP90AB3P/PIF1/TRAP1/HSP90AA5P/SMCHD1/DDX27/MTREX/HSPA9/DHX30/HSPA4/CHD7/KIF5B/TOR3A/DDX1/SUPV3L1/PSMC4/SMC3/DDX19A/DDX28/HSP90AB2P/ATP13A2/ABCA13/HSPA8/CLPB/PSMC1/NLRP10/PSMC6/MYO1E/DDX24/MLH1/RFC1/HSPA13/HSP90AA1/CCT8/MDN1/DDX54/KIF1A/SMC1A/HSPD1/PSMC3/HSP90AA2P/ATAD3A/DDX46/TAP2/TCP1/DDX55/CCT4/CHD1/RUVBL2/HSPA5/CCT3/DDX31/DHX37/DHX9/CCT6A/EIF4A3/HSPA4L/CHTF18/DDX56/FANCM/DHX34/DHX33/DDX10/KIF5C/KIF21B/ACSL3/G3BP1/CCT2/DDX39A/ACSL4/SMC4/TWNK/SLC26A2/KIF20A/SMC2/HSP90AA4P/HSPA14/ATP2B1/RUVBL1/DDX18/DHX15/DDX20/CCT5/RFC5/KIF18A/KIF22/MSH2/HSPH1/MYO19/DDX21/SPATA5/KIF11/KIF2C/KIF14/RFC2/DNA2/KIF20B/RFC4/KIFC1/FIGNL1/KIF18B/MCM3/KIF23/MCM8/BRIP1/KIF15/MCM2/MCM7/RFC3/MCM5/MCM4/ATAD5/MCM6/TRIP13/SLFN11/RAD54L/ATAD2/ORC1/ERCC6L/BLM/RAD51
## GO:0004386 DDX46/DDX55/CHD1/RUVBL2/DDX31/DHX37/DHX9/EIF4A3/CHTF18/DDX56/FANCM/DHX34/DHX33/DDX10/G3BP1/DDX39A/TWNK/RUVBL1/DDX18/DHX15/DDX20/RFC5/TTF2/DDX21/RFC2/DNA2/RFC4/ZGRF1/MCM3/MCM8/BRIP1/MCM2/MCM7/RFC3/POLQ/DSCC1/MCM5/MCM4/MCM6/HELLS/SLFN11/RAD54L/ERCC6L/BLM/RAD51
## Count
## GO:0140097 62
## GO:0140098 134
## GO:0008094 36
## GO:0016887 122
## GO:0004386 45
About the
gsea_df
result
Description
: Gene set namesetSize
: number of genes with gene-level statistic values. To be more specific, if we input a pathway gene set with 58 genes (e.g. HALLMARK_MYC_TARGETS_V2 ), while our gene list only have 54 of it, so the result will only show setSize = 54enrichmentScore
: also called ES, same as in Broad GSEA implementation. It reflects the degree to which a gene set is overrepresented at the top or bottom of a ranked list of genes.NES
: normalized enrichment score is the primary statistic for examining gene set enrichment results. By normalizing the enrichment score, GSEA accounts for differences in gene set size and in correlations between gene sets and the expression dataset. Therefore, NES can be used to compare analysis results across gene sets. A positive normalized enrichment scores (NES) will indicate that genes in set S will be mostly represented at the top of your list (logFC > 0 or up-regulated genes).rank
: The position in the ranked list at which the maximum enrichment score occurred. If gene sets achieve the maximum enrichment score near the top or bottom of the ranked list, the rank at max is either very small or very large.-
leading_edge
: includes three statistics-
tags
: The percentage of gene hits before (for positive ES) or after (for negative ES) the peak in the running enrichment score, which indicates the percentage of genes contributing to the enrichment score. -
list
: The percentage of genes in the ranked gene list before (for positive ES) or after (for negative ES) the peak in the running enrichment score. This gives an indication of where in the list the enrichment score is attained. -
signal
: If the gene set is entirely within the first N positions in the list, then the signal strength is maximal or 100%. If the gene set is spread throughout the list, then the signal strength decreases towards 0%.
-
geneID & geneID_symbol
: If input symbols are mixed with aliases, a new column “geneID_symbol” will be added; if all symbols are official, only “geneID” column will return
9.4 Advanced usage
9.4.1 Additional arguments
Please refer to ORA part
9.4.2 Simplify GO GSEA result
Please refer to ORA part
9.5 Export GSEA result
Genekitr
provides an easy way to export analysis result for editing and sharing.
Multiple data are saved as different sheets in one excel file. Besides, column names are automatically formatted with bold style.
genekitr::expoSheet(data_list = gse,
data_name = names(gse),
filename = "gsea_result.xlsx",
dir = "./")