7 Get gene sets

Gene sets and statistical methods are the two central components of gene enrichment analysis (GEA).

To facilitate GEA, I developed the package geneset, which provides a comprehensive list of gene set (GS) libraries that are updated monthly.

7.1 Geneset package introduction

The R package curates gene sets from GO (BP, CC and MF), KEGG (pathway, module, enzyme, network, drug and disease), WikiPathways, MSigDB, EnrichrDB, Reactome, MeSH, DisGeNET, Disease Ontology (DO), Network of Cancer Gene (NCG) (versions 6 and 7) and COVID-19.

It supports both model and non-model species.

For more details, please refer to this site.

  • GO supports 143 species
  • KEGG supports 8213 species
  • MeSH supports 71 species
  • MSigDB supports 20 species
  • WikiPathways supports 16 species
  • Reactome supports 11 species
  • EnrichrDB supports 5 species
  • Disease-related gene sets (DO, NCG, DisGeNET and COVID-19) support human only

7.2 Get GO geneset

7.2.1 GO introduction

According to Wikipedia, “Ontologies consist of detectable or directly observable representations of things and the relationships between those things.”

GO is short for Gene Ontology. GO analysis looks for associations between gene products and GO terms. GO has three domains:

  • Biological Processes (BP)
    • A biological process represents a specific objective that the organism is genetically programmed to achieve.
  • Molecular Functions (MF)
    • A molecular process that can be carried out by the action of a single macromolecular machine, usually via direct physical interactions with other molecular entities.
  • Cellular Components (CC)
    • A location, relative to cellular compartments and structures, occupied by a macromolecular machine when it carries out a molecular function.

GO terms are built in a directed acyclic graph with a parent-child relationship.
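
The parent-child structure can be made concrete with a small sketch. The term IDs below are hypothetical placeholders rather than real GO terms, and get_ancestors is an illustrative helper, not part of the geneset package. It shows that a term in a directed acyclic graph may have more than one parent, and how all ancestors of a term can be collected by walking the parent links upward.

```r
# Toy DAG: each term maps to its direct parents (hypothetical IDs, not real GO terms).
parents <- list(
  term_D = c("term_B", "term_C"),  # a child can have several parents
  term_B = "term_A",
  term_C = "term_A",
  term_A = character(0)            # the root has no parents
)

# Collect every ancestor of a term by following parent links upward.
get_ancestors <- function(term, parents) {
  found <- character(0)
  queue <- parents[[term]]
  while (length(queue) > 0) {
    current <- queue[1]
    queue <- queue[-1]
    if (!current %in% found) {
      found <- c(found, current)
      queue <- c(queue, parents[[current]])
    }
  }
  found
}

ancestors_D <- get_ancestors("term_D", parents)
print(ancestors_D)
```

Because term_A is reachable through both term_B and term_C, the helper tracks visited terms so that each ancestor is reported only once.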

For a more comprehensive introduction to GO, see https://advaitabio.com/faq-items/understanding-gene-ontology/ or http://geneontology.org/docs/ontology-documentation/

7.2.2 Usage

The arguments include:

  • org: organism name
  • ont: choose from “bp”, “mf” and “cc”

The result is a list with four parts:

  • geneset (formatted as a data frame): two columns containing GO term IDs and the matched gene IDs
  • geneset_name (formatted as a data frame): two columns containing GO term IDs and the matched GO descriptions
  • organism: stores the org information
  • type: stores the ont information

gs <- getGO(org = "human", ont = "mf")
str(gs)
## List of 4
##  $ geneset     :'data.frame':    280115 obs. of  2 variables:
##   ..$ mf  : chr [1:280115] "GO:0000009" "GO:0000009" "GO:0000010" "GO:0000010" ...
##   ..$ gene: chr [1:280115] "PIGV" "ALG12" "PDSS1" "PDSS2" ...
##  $ geneset_name:'data.frame':    4878 obs. of  2 variables:
##   ..$ go_id: chr [1:4878] "GO:0000009" "GO:0000010" "GO:0000014" "GO:0000016" ...
##   ..$ Term : chr [1:4878] "alpha-1,6-mannosyltransferase activity" "trans-hexaprenyltranstransferase activity" "single-stranded DNA endodeoxyribonuclease activity" "lactase activity" ...
##  $ organism    : chr "hsapiens"
##  $ type        : chr "mf"
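
As a minimal sketch of how the two data frames can be used together, the example below builds small stand-ins for gs$geneset and gs$geneset_name (rows copied from the output above, so it runs without downloading anything) and then reshapes them with base R. The variable names genes_by_term and annotated are my own.

```r
# Toy stand-in for gs$geneset (same two-column layout as the output above).
geneset <- data.frame(
  mf   = c("GO:0000009", "GO:0000009", "GO:0000010", "GO:0000010"),
  gene = c("PIGV", "ALG12", "PDSS1", "PDSS2"),
  stringsAsFactors = FALSE
)

# Toy stand-in for gs$geneset_name.
geneset_name <- data.frame(
  go_id = c("GO:0000009", "GO:0000010"),
  Term  = c("alpha-1,6-mannosyltransferase activity",
            "trans-hexaprenyltranstransferase activity"),
  stringsAsFactors = FALSE
)

# One character vector of genes per GO term.
genes_by_term <- split(geneset$gene, geneset$mf)

# Attach the readable description to every term-gene pair.
annotated <- merge(geneset, geneset_name, by.x = "mf", by.y = "go_id")
print(annotated)
```

The named-list form is convenient for tools that expect one gene vector per set, while the merged table is easier to read and filter.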

7.3 Get KEGG geneset

7.3.1 KEGG introduction

KEGG is short for “Kyoto Encyclopedia of Genes and Genomes,” a collection of databases dealing with genomes, biological pathways, diseases, drugs, and chemical substances.

The pathway maps are classified into the following sections:

  1. Metabolism
  2. Genetic information processing (transcription, translation, replication and repair, etc.)
  3. Environmental information processing (membrane transport, signal transduction, etc.)
  4. Cellular processes (cell growth, cell death, cell membrane functions, etc.)
  5. Organismal systems (immune system, endocrine system, nervous system, etc.)
  6. Human diseases
  7. Drug development

Figure 7.1: KEGG overview. Figure taken from https://paintomics.readthedocs.io/en/stable/1_kegg/.

7.3.2 Usage

The arguments include:

  • org: organism name (e.g. “hsa”)
  • category: choose from “pathway”, “module”, “enzyme”, “disease” (human only), “drug” (human only) or “network” (human only)

gs <- getKEGG(org = "hsa", category = "pathway")
str(gs)
## List of 4
##  $ geneset     :'data.frame':    35570 obs. of  2 variables:
##   ..$ id  : chr [1:35570] "hsa00010" "hsa00010" "hsa00010" "hsa00010" ...
##   ..$ gene: chr [1:35570] "10327" "124" "125" "126" ...
##  $ geneset_name:'data.frame':    551 obs. of  2 variables:
##   ..$ id  : chr [1:551] "hsa00010" "hsa00020" "hsa00030" "hsa00040" ...
##   ..$ name: chr [1:551] "Glycolysis / Gluconeogenesis" "Citrate cycle (TCA cycle)" "Pentose phosphate pathway" "Pentose and glucuronate interconversions" ...
##  $ organism    : chr "hsapiens"
##  $ type        : chr "kegg"
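
A quick sanity check after retrieving pathways is to count how many genes each pathway contains. The sketch below uses a toy stand-in for the returned data frames. The hsa00010 rows are copied from the output above, while the single hsa00020 row is only illustrative.

```r
# Toy stand-in for gs$geneset.
geneset <- data.frame(
  id   = c("hsa00010", "hsa00010", "hsa00010", "hsa00020"),
  gene = c("10327", "124", "125", "1431"),  # last row is illustrative only
  stringsAsFactors = FALSE
)

# Toy stand-in for gs$geneset_name.
geneset_name <- data.frame(
  id   = c("hsa00010", "hsa00020"),
  name = c("Glycolysis / Gluconeogenesis", "Citrate cycle (TCA cycle)"),
  stringsAsFactors = FALSE
)

# Count genes per pathway and look up each pathway name by ID.
sizes <- table(geneset$id)
size_table <- data.frame(
  id      = names(sizes),
  name    = geneset_name$name[match(names(sizes), geneset_name$id)],
  n_genes = as.integer(sizes),
  stringsAsFactors = FALSE
)
print(size_table)
```

Note that the gene column holds numeric-looking IDs stored as character strings, so they should be compared as strings rather than numbers.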

7.4 Get MeSH geneset

7.4.1 MeSH introduction

Medical Subject Headings (MeSH) is a comprehensive controlled vocabulary for indexing journal articles and books in the life sciences. It serves as a thesaurus that facilitates searching.

7.4.2 Usage

The arguments include:

  • org: organism name (e.g. “human”)
  • method: method of mapping MeSH IDs to gene IDs; choose one from “gendoo”, “gene2pubmed” or “RBBH” (mainly for some minor species)
  • category: MeSH descriptor category; for more details, refer to “How to use MeSH-related Packages”

gs <- getMesh(org = "human", method = "gendoo", category = "A")
str(gs)
## List of 4
##  $ geneset     :'data.frame':    3031273 obs. of  2 variables:
##   ..$ id  : chr [1:3031273] "D000001" "D000001" "D000001" "D000001" ...
##   ..$ gene: chr [1:3031273] "100" "10016" "1003" "10076" ...
##  $ geneset_name:'data.frame':    30194 obs. of  2 variables:
##   ..$ id  : chr [1:30194] "D000001" "D000002" "D000003" "D000004" ...
##   ..$ name: chr [1:30194] "Calcimycin" "Temefos" "Abattoirs" "Abbreviations as Topic" ...
##  $ organism    : chr "hsapiens"
##  $ type        : chr "mesh"
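
With millions of term-gene pairs, it is often useful to drop very small or very large gene sets before an enrichment analysis. The following is a minimal sketch on toy data. Only the D000001 rows are copied from the output above, and the other memberships and the min_size and max_size thresholds are made up for illustration.

```r
# Toy stand-in for gs$geneset (rows beyond D000001 are invented).
geneset <- data.frame(
  id   = c("D000001", "D000001", "D000001", "D000002", "D000003", "D000003"),
  gene = c("100", "10016", "1003", "100", "10076", "1003"),
  stringsAsFactors = FALSE
)

# Hypothetical size limits; choose values that suit your analysis.
min_size <- 2
max_size <- 500

# Number of genes in each set, as a named integer vector.
set_size <- lengths(split(geneset$gene, geneset$id))

# Keep only the sets whose size falls inside the limits.
keep_ids <- names(set_size)[set_size >= min_size & set_size <= max_size]
filtered <- geneset[geneset$id %in% keep_ids, ]
print(filtered)
```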

7.5 Get MSigDB geneset

7.5.1 MSigDB introduction

MSigDB (the Molecular Signatures Database) is the natural companion to GSEA. Its 32880 gene sets are organized into 9 major collections and several sub-collections:

  • H: hallmark gene sets (50 gene sets)
  • C1: positional gene sets (299 gene sets)
    • by chromosome: chr1 => MT
  • C2: curated gene sets (6366 gene sets)
    • CGP (chemical and genetic perturbations, 3384 gene sets)
    • CP (canonical pathways, 2982 gene sets) includes BioCarta, KEGG, PID, Reactome and WikiPathways
  • C3: regulatory target gene sets (3726 gene sets)
    • MIR (microRNA targets, 2598 gene sets)
    • TFT (all transcription factor targets, 1128 gene sets)
  • C4: computational gene sets (858 gene sets)
    • CGN (cancer gene neighborhoods, 427 gene sets)
    • CM (cancer modules, 431 gene sets)
  • C5: ontology gene sets (15473 gene sets) includes BP, CC and MF
  • C6: oncogenic signature gene sets (189 gene sets)
  • C7: immunologic signature gene sets (5219 gene sets)
    • IMMUNESIGDB (ImmuneSigDB gene sets, 4872 gene sets)
    • VAX (vaccine response gene sets, 347 gene sets)
  • C8: cell type signature gene sets (700 gene sets)

7.5.2 Usage

The arguments include:

  • org: organism name (e.g. “human”)
  • category: choose from “H”, “C1”, “C2-CGP”, “C2-CP-BIOCARTA”, “C2-CP-KEGG”, “C2-CP-PID”, “C2-CP-REACTOME”, “C2-CP-WIKIPATHWAYS”, “C3-MIR-MIRDB”, “C3-MIR-MIR_Legacy”, “C3-TFT-GTRD”, “C3-TFT-TFT_Legacy”, “C4-CGN”, “C4-CM”, “C5-GO-BP”, “C5-GO-CC”, “C5-GO-MF”, “C5-HPO”, “C6”, “C7-IMMUNESIGDB”, “C7-VAX”, “C8”

The result is a list with four parts:

  • geneset (formatted as a data frame): two columns containing pathway IDs and the matched gene IDs
  • geneset_name: NA (the pathway IDs and names are identical, so no separate name table is returned)
  • organism: stores the org information
  • type: stores the gene set source (“msigdb”)

gs <- getMsigdb(org = "human", category = "H")
str(gs)
## List of 4
##  $ geneset     :'data.frame':    8209 obs. of  2 variables:
##   ..$ gs_name    : chr [1:8209] "HALLMARK_ADIPOGENESIS" "HALLMARK_ADIPOGENESIS" "HALLMARK_ADIPOGENESIS" "HALLMARK_ADIPOGENESIS" ...
##   ..$ entrez_gene: int [1:8209] 19 11194 10449 33 34 35 47 50 51 112 ...
##  $ geneset_name: logi NA
##  $ organism    : chr "hsapiens"
##  $ type        : chr "msigdb"
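
Unlike the GO or KEGG results, this result has integer gene IDs and no name table. When results from several sources are processed by the same downstream code, it can help to bring them into one layout first. The helper below, standardize_geneset, is a hypothetical convenience function of my own and not part of the geneset package. It is shown on a toy list that mimics the structure above.

```r
# Toy stand-in for the list returned above (first rows of the output).
gs <- list(
  geneset = data.frame(
    gs_name     = rep("HALLMARK_ADIPOGENESIS", 4),
    entrez_gene = c(19L, 11194L, 10449L, 33L),
    stringsAsFactors = FALSE
  ),
  geneset_name = NA,
  organism     = "hsapiens",
  type         = "msigdb"
)

# Hypothetical helper: rename columns to id/gene, store both as character,
# and build a name table from the IDs when none is provided.
standardize_geneset <- function(gs) {
  out <- gs$geneset
  names(out) <- c("id", "gene")
  out$id   <- as.character(out$id)
  out$gene <- as.character(out$gene)
  if (!is.data.frame(gs$geneset_name)) {
    ids <- unique(out$id)
    gs$geneset_name <- data.frame(id = ids, name = ids, stringsAsFactors = FALSE)
  }
  gs$geneset <- out
  gs
}

std <- standardize_geneset(gs)
str(std$geneset)
```

Storing Entrez IDs as character avoids accidental numeric comparisons and matches the layout returned for the other sources in this chapter.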

7.6 Get WikiPathways geneset

7.6.1 WikiPathways introduction

WikiPathways was established to facilitate the contribution and maintenance of pathway information by the biology community. Each month it produces a set of pathways as .gmt files on https://wikipathways-data.wmcloud.org/.

7.6.2 Usage

Only the organism name (org) is required.

gs <- getWiki(org = "human")
str(gs)
## List of 4
##  $ geneset     :'data.frame':    30871 obs. of  2 variables:
##   ..$ id  : chr [1:30871] "WP5187" "WP5187" "WP5187" "WP5187" ...
##   ..$ gene: chr [1:30871] "7098" "8792" "51284" "64135" ...
##  $ geneset_name:'data.frame':    746 obs. of  2 variables:
##   ..$ id  : chr [1:746] "WP5187" "WP5143" "WP2916" "WP4871" ...
##   ..$ name: chr [1:746] "mRNA vaccine activation of dendritic cell and induction of IFN-1" "GDNF signaling" "Interactome of polycomb repressive complex 2 (PRC2) " "Kisspeptin/kisspeptin receptor system in the ovary" ...
##  $ organism    : chr "hsapiens"
##  $ type        : chr "wikipathway"
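
Since WikiPathways distributes its pathways as .gmt files, it may be handy to turn the two data frames back into that layout: one line per gene set, with the set ID, a description, and then the genes, all separated by tabs. The sketch below does this on toy data. The WP5187 rows are copied from the output above, while the WP5143 row is only illustrative and the output file name is hypothetical.

```r
# Toy stand-in for gs$geneset.
geneset <- data.frame(
  id   = c("WP5187", "WP5187", "WP5143"),
  gene = c("7098", "8792", "2668"),  # last row is illustrative only
  stringsAsFactors = FALSE
)

# Toy stand-in for gs$geneset_name.
geneset_name <- data.frame(
  id   = c("WP5187", "WP5143"),
  name = c("mRNA vaccine activation of dendritic cell and induction of IFN-1",
           "GDNF signaling"),
  stringsAsFactors = FALSE
)

# Build one tab-separated GMT line per gene set: id, description, genes.
genes_by_set <- split(geneset$gene, geneset$id)
gmt_lines <- vapply(names(genes_by_set), function(set_id) {
  description <- geneset_name$name[match(set_id, geneset_name$id)]
  paste(c(set_id, description, genes_by_set[[set_id]]), collapse = "\t")
}, character(1))

# To save the result, uncomment the next line (hypothetical file name).
# writeLines(gmt_lines, "wikipathways_human.gmt")
cat(gmt_lines, sep = "\n")
```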

7.7 Get Reactome geneset

7.7.1 Reactome introduction

Reactome is a free online database of biological pathways.

7.7.2 Usage

Only the organism name (org) is required.

gs <- getReactome(org = "human")
str(gs)
## List of 4
##  $ geneset     :'data.frame':    125362 obs. of  2 variables:
##   ..$ id  : chr [1:125362] "R-HSA-1059683" "R-HSA-1059683" "R-HSA-1059683" "R-HSA-1059683" ...
##   ..$ gene: chr [1:125362] "3569" "3570" "3572" "3716" ...
##  $ geneset_name:'data.frame':    2566 obs. of  2 variables:
##   ..$ id  : chr [1:2566] "R-HSA-1059683" "R-HSA-109581" "R-HSA-109582" "R-HSA-109606" ...
##   ..$ name: chr [1:2566] "Interleukin-6 signaling" "Apoptosis" "Hemostasis" "Intrinsic Pathway for Apoptosis" ...
##  $ organism    : chr "hsapiens"
##  $ type        : chr "reactome"
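
A common question is the reverse lookup: which pathways contain a given gene? The following is a small sketch on toy data. The R-HSA-1059683 rows are copied from the output above, whereas "toy-pathway-B" and its members are invented for illustration, and pathways_for_gene is a hypothetical helper rather than a geneset function.

```r
# Toy stand-in for gs$geneset.
geneset <- data.frame(
  id   = c(rep("R-HSA-1059683", 4), "toy-pathway-B", "toy-pathway-B"),
  gene = c("3569", "3570", "3572", "3716", "3716", "9999"),
  stringsAsFactors = FALSE
)

# Toy stand-in for gs$geneset_name.
geneset_name <- data.frame(
  id   = c("R-HSA-1059683", "toy-pathway-B"),
  name = c("Interleukin-6 signaling", "Toy pathway B (hypothetical)"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the ID and name of every pathway containing a gene.
pathways_for_gene <- function(gene_id, geneset, geneset_name) {
  hit_ids <- unique(geneset$id[geneset$gene == gene_id])
  geneset_name[geneset_name$id %in% hit_ids, , drop = FALSE]
}

hits_3716 <- pathways_for_gene("3716", geneset, geneset_name)
hits_3569 <- pathways_for_gene("3569", geneset, geneset_name)
print(hits_3716)
```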

7.8 Get Enrichr geneset

7.8.1 Enrichr introduction

Enrichr is a comprehensive resource for curated gene sets and a search engine that accumulates biological knowledge for further biological discoveries.

7.8.2 Usage

The arguments include:

  • org: organism name (e.g. “human”)
  • library: choose one library name from geneset::enrichr_metadata (e.g. “COVID-19_Related_Gene_Sets”)

gs <- getEnrichrdb(org = "human", library = "COVID-19_Related_Gene_Sets")
str(gs)
## List of 4
##  $ geneset     :'data.frame':    60289 obs. of  2 variables:
##   ..$ id  : Factor w/ 205 levels "COVID19-E protein host PPI from Krogan",..: 1 1 1 1 1 1 2 2 2 2 ...
##   ..$ gene: chr [1:60289] "BRD4" "BRD2" "SLC44A2" "ZC3H18" ...
##  $ geneset_name: logi NA
##  $ organism    : chr "hsapiens"
##  $ type        : chr "enrichrdb"
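
In this result the id column is a factor rather than a character vector, which matters once the table is subset: a factor keeps its unused levels, so split() returns empty gene sets for IDs that were filtered out. The sketch below illustrates this on toy data. The gene symbols are copied from the output above, but the set names are placeholders.

```r
# Toy stand-in for gs$geneset with a factor id column (placeholder set names).
geneset <- data.frame(
  id   = factor(c("set_A", "set_A", "set_B", "set_C")),
  gene = c("BRD4", "BRD2", "SLC44A2", "ZC3H18"),
  stringsAsFactors = FALSE
)

# Remove one gene set; the factor still remembers the level "set_B".
subset_gs <- geneset[geneset$id != "set_B", ]

# Splitting on the raw factor yields an empty entry for the removed set.
with_empty <- split(subset_gs$gene, subset_gs$id)

# Dropping unused levels (or converting to character) avoids the empty entry.
clean <- split(subset_gs$gene, droplevels(subset_gs$id))

print(lengths(with_empty))
print(lengths(clean))
```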

7.9 Get Human disease-related geneset

For now, we support human disease annotation data from Disease Ontology (DO), DisGeNET, Network of Cancer Gene (NCG) versions 6 and 7, and COVID-19.

Only the source name is required; choose from “do”, “ncg_v7”, “ncg_v6”, “disgenet” and “covid19”.

  • do: The Disease Ontology has been developed as a standardized ontology for human disease with the purpose of providing the biomedical community with consistent, reusable and sustainable descriptions of human disease terms.

  • ncg_v7 & ncg_v6: Human Network of Cancer Gene (NCG) is a manually curated collection of cancer genes, healthy drivers and their properties.

  • disgenet: DisGeNET is a discovery platform containing one of the largest publicly available collections of genes and variants associated to human diseases. DisGeNET integrates data from expert curated repositories, GWAS catalogues, animal models and the scientific literature. DisGeNET data are homogeneously annotated with controlled vocabularies and community-driven ontologies. Additionally, several original metrics are provided to assist the prioritization of genotype–phenotype relationships.

  • covid19: The COVID-19 Drug and Gene Set Library. A collection of drug and gene sets related to COVID-19 research contributed by the community.

gs <- getHgDisease(source = "do")
str(gs)
## List of 4
##  $ geneset     :'data.frame':    387300 obs. of  2 variables:
##   ..$ id  : chr [1:387300] "DOID:0001816" "DOID:0001816" "DOID:0001816" "DOID:0001816" ...
##   ..$ gene: chr [1:387300] "238" "672" "675" "387119" ...
##  $ geneset_name:'data.frame':    11878 obs. of  2 variables:
##   ..$ id  : chr [1:11878] "DOID:0001816" "DOID:0002116" "DOID:0014667" "DOID:0040001" ...
##   ..$ name: chr [1:11878] "Angiosarcoma" "Pterygium" "Disease of metabolism" "Shrimp allergy" ...
##  $ organism    : chr "hsapiens"
##  $ type        : chr "do"
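
To give a rough idea of how a retrieved gene set combines with a statistical method, the sketch below runs a one-sided hypergeometric test (a common choice for over-representation analysis) in base R. Everything here is invented: the gene universe, the disease gene set and the query list are placeholder identifiers, and this is not necessarily the test used by any particular enrichment package.

```r
# Invented data: a universe of 1000 genes, a 40-gene disease set, a 20-gene query.
universe    <- paste0("g", 1:1000)
disease_set <- paste0("g", 1:40)
query       <- c(paste0("g", 1:5), paste0("g", 501:515))

# Number of query genes that fall in the disease set.
overlap <- length(intersect(query, disease_set))

# P(X >= overlap) when drawing length(query) genes from the universe.
p_value <- phyper(
  overlap - 1,                                # phyper gives P(X <= q), hence overlap - 1
  length(disease_set),                        # "successes" in the universe
  length(universe) - length(disease_set),     # "failures" in the universe
  length(query),                              # number of draws
  lower.tail = FALSE
)
print(p_value)
```

In a real analysis the universe should be restricted to the genes that could have been detected, and the p-values from testing many gene sets need multiple-testing correction.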