4 Gene ID conversion

If you have handled genomics analysis, then you may spend much time converting gene identifiers to other types.

For example, gene set enrichment analysis is very common in omics downstream process which is mainly based on MsigDb. According to its release note, since MSigDB 7.0, identifiers for genes are mapped to their HGNC approved Gene Symbol and NCBI Gene ID through annotations extracted from Ensembl’s BioMart data service, and will be updated at each MSigDB release with the latest available version of Ensembl. Currently, MSigDB 7.5 has updated human gene annotations to Ensembl v105.

Hence, id conversion could help keeping update with the latest annotations.

The most common gene id types are symbol (e.g. TP53),NCBI Entrez (e.g. 7157) and Ensembl (e.g. ENSG00000141510). ID conversion among different types is very common and there are two main concerns: performance and flexibility.

4.1 Current tools

Currently various tools are available, however different methods has their drawbacks:

  • biomaRt: The R package is frequently used to access up-to-date Ensembl database. It has to access online database each time and built-in parameters are too many.
  • clusterProfiler: The R package depends on Bioconductor annotation packages to extract NCBI gene. NCBI database updates every day while Bioconductor package updates nearly every six months. Besides, it only supports 20 organisms and user should download and load annotation data before mapping.
  • gProfiler: The web server supports 40 types of IDs for more than 60 species. But it only supports one-to-one conversion, for example, if user want to convert gene symbol to both Entrez and Ensembl, it could not happen. Besides, the output result is not convenient to further process.
  • DAVID: The web server interface is not user-friendly.
  • UniProt: The large protein database may support almost all species but user needs to specify input id type and only converts to protein id.
  • GIDcon: The web server only covers human, mouse and rat. User must seperates gene ids with comma if does batch query. Five symbols will take 8 secs to convert.

Considered above problems, genekitr combines both Ensembl and NCBI database to provide more comprehensive result while optimizes running speed.

Also it supports 195 vertebrate species, 120 plant species and 2 bacteria species (see also session 1.1).

4.2 Basic usage

transId provides five arguments:

  • id: gene id (symbol, Entrez or Ensembl), protein id (see also session 4.2) or microarray id (see also session 5.1)

  • transTo: the conversion target type. User could select more than one from “symbol”, “entrez”, “ensembl” or “uniprot”. Besides, abbreviation is acceptable (e.g. “ens” is fine for “ensembl” but “en” will raise error because it could not separate from “entrez”)

  • org: organism name, default is human

  • unique: TRUE or FALSE (default). If TRUE, only returns one matched id.

  • keepNA: TRUE or FALSE (default). If some id has no matched one, keep it or not.

# human genes
id <- c("AKT3", "SSX6P", "FAKE_ID", "BCC7")
transId(id, transTo = "ens")
##   input_id         ensembl
## 1     AKT3 ENSG00000275199
## 2     AKT3 ENSG00000117020
## 3    SSX6P ENSG00000171483
## 4     BCC7 ENSG00000141510
# fruit fly genes
transId(
  id = c("10178780", "10178777", "10178786", "10178792"),
  transTo = "sym", org = "fly"
)
##   input_id          symbol
## 1 10178780         CR33929
## 2 10178777         CG42835
## 3 10178786 Su(Ste):CR40820
## 4 10178792         CG42694

See the human example above, due to NA was removed, the result order is slightly different from input id.

If you want to get the same order, you can set keepNA = TRUE:

transId(id, transTo = "ens", keepNA = TRUE)
##   input_id         ensembl
## 1     AKT3 ENSG00000275199
## 2     AKT3 ENSG00000117020
## 3    SSX6P ENSG00000171483
## 4  FAKE_ID            <NA>
## 5     BCC7 ENSG00000141510

If you want to get one-to-one mapping, you can set unique = TRUE and the result with more comprehensive information will return:

transId(id, transTo = "ens", keepNA = TRUE, unique = TRUE)
##    input_id         ensembl
## 21     AKT3 ENSG00000117020
## 1     SSX6P ENSG00000171483
## 2   FAKE_ID            <NA>
## 3      BCC7 ENSG00000141510

4.3 Features

4.3.1 f1: convert to several types

transTo could accepts more than one type:

transId(id, transTo = c("ens", "ent"))
##   input_id         ensembl entrezid
## 1     AKT3 ENSG00000275199    10000
## 2     AKT3 ENSG00000117020    10000
## 3    SSX6P ENSG00000171483   280657
## 4     BCC7 ENSG00000141510     7157

4.3.2 f2: distinguish from symbol and alias

One big common problem for current tools is confusion of gene symbol and gene alias.

Gene alias is an alternative name of official gene symbol, like “BCC7” is actually alias for “TP53”. Sometimes gene alias is more widely used than official symbol. For example, “PDL1” is the alias for “CD274” in fact.

Due to combination of NCBI and Ensembl metadata, transId could easily recognize gene symbol and alias by setting transTo = 'symbol'.

transId(c("BCC7", "TP53", "PD1", "PDL1", "TET2"), "sym")
##   input_id symbol
## 1     BCC7   TP53
## 2     TP53   TP53
## 3      PD1  PDCD1
## 4      PD1   SNCA
## 5      PD1 SPATA2
## 6     PDL1  CD274
## 7     TET2   TET2

4.3.3 f3: no worry for Ensembl version

An Ensembl stable ID consists of five parts: ENS(species)(object type)(identifier). (version) For example, ENSG00000141510.11 is TP53 while the version .11 may confuse many tools. However using transId, user will not worry this issue any more.

transId("ENSG00000141510.11", "symbol")
##          input_id symbol
## 1 ENSG00000141510   TP53

4.4 Excercise 1: DEG result

The built-in example data is differentially expressed genes from: GSE42872

data(deg, package = "genekitr")
table(deg$change)
## 
##   down stable     up 
##    451  17009    579
id <- deg$entrezid
length(id)
## [1] 18039

4.4.1 clusterProfiler::bitr

User needs to specify input id type and converted type from columns(org.Hs.eg.db).

# User must firstly install the org.Hs.eg.db
# BiocManager::install('org.Hs.eg.db')
library(org.Hs.eg.db)
library(clusterProfiler)
AnnotationDbi::columns(org.Hs.eg.db)
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GO"           "GOALL"        "IPI"          "MAP"          "OMIM"        
## [16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"       "UNIGENE"     
## [26] "UNIPROT"
bitr_res <- bitr(id, fromType = "ENTREZID", toType = "SYMBOL", OrgDb = "org.Hs.eg.db")
bitr_sym <- unique(bitr_res$SYMBOL)
length(bitr_sym)
## [1] 18036

4.4.2 biomaRt::getBM

The initializing process depends mainly on Ensembl’s web access speed. Usually, it may take tens of seconds.

# initializing...
ensembl <- biomaRt::useEnsembl(
  biomart = "genes",
  dataset = "hsapiens_gene_ensembl"
)
# conversion
biomart_res <- biomaRt::getBM(
  attributes = c("entrezgene_id", "external_gene_name"),
  filters = c("entrezgene_id"),
  mart = ensembl,
  values = id
)
biomart_sym <- unique(biomart_res$external_gene_name) %>%
  stringi::stri_remove_empty()
length(biomart_sym)
## [1] 16085

4.4.3 genekitr::transId

transid_res <- genekitr::transId(id, transTo = "sym", unique = T)
transid_sym <- unique(transid_res$symbol)
length(transid_sym)
## [1] 18039

4.4.4 compare three results

plotVenn(list(
  bitr = bitr_sym,
  transid = transid_sym,
  biomart = biomart_sym
))

4.4.4.1 bitr vs transId

There are 117 genes only in bitr result:

bitr_sym[!bitr_sym %in% transid_sym] %>% head(5)
## [1] "C19orf71"     "LINC00476"    "LOC100129484" "PP2672"       "LOC100130449"

Let’s take a look for the first one LINC00337, which is converted from gene 148645

bitr_res %>%
  filter(SYMBOL %in% "LINC00337") %>%
  pull(ENTREZID)
## [1] "148645"

When we search it on NCBI and we could see the official symbol is “ICMT-DT” now while “LINC00337” is an alias. When we look back into transId result, it keeps update with NCBI web.

transid_res %>%
  filter(input_id %in% "148645") %>%
  pull(symbol)
## [1] "ICMT-DT"

Besides, there are 118 unique symbols in transId result:

transid_sym[!transid_sym %in% bitr_sym] %>% head(5)
## [1] "TEKTIP1"     "ERCC6L2-AS1" "LINC02983"   "LINC02961"   "GPC1-AS1"

The above gene ICMT-DT is listed.

4.4.4.2 biomart vs transId

There are 13 genes only in biomart result:

biomart_sym[!biomart_sym %in% transid_sym] %>% head()
## [1] "KLRK1"     "FLJ37453"  "C11orf49"  "C19orf71"  "ZNF761"    "LINC01588"
biomart_res %>%
  filter(external_gene_name %in% "KLRK1") %>%
  pull(entrezgene_id)
## [1] 100528032

When we search NCBI 100528032 and we could see the official symbol is “KLRC4-KLRK1”. When we look back into transId and bitr result, they are the same with NCBI web.

bitr_res %>%
  filter(ENTREZID == "100528032") %>%
  pull(SYMBOL)
## [1] "KLRC4-KLRK1"
transid_res %>%
  filter(input_id %in% "100528032") %>%
  pull(symbol)
## [1] "KLRC4-KLRK1"

4.5 Excercise 2: mixture of gene symbol and alias

Compared with Entrez and Ensembl, gene symbol and alias is hard to distinguish. In real science life, the impact of misunderstanding both names is huge.

Takehara et al. (2018) had to retract the published paper by its authors just because they misused “TAZ” as the actual research target “WWTR1” in which “TAZ” is an alias of “WWTR1”.

The similar issues have published in paper: The risks of using unapproved gene symbols. Of course, it is not only happens in human research. So we must raise alert of gene alias.

4.5.1 HGNChelper vs transId

HGNChelper is mainly built for human research, so here we will give simple human gene list as example.

id <- c("TBET", "B220", "BCC7", "PD1", "PDL1")

transid_res <- genekitr::transId(id, transTo = "sym", unique = FALSE)
transid_res
##   input_id symbol
## 1     TBET  TBX21
## 2     B220  PTPRC
## 3     BCC7   TP53
## 4      PD1  PDCD1
## 5      PD1   SNCA
## 6      PD1 SPATA2
## 7     PDL1  CD274
hgnc_res <- HGNChelper::checkGeneSymbols(id)
hgnc_res
##      x Approved          Suggested.Symbol
## 1 TBET    FALSE                      <NA>
## 2 B220    FALSE                      <NA>
## 3 BCC7    FALSE                      <NA>
## 4  PD1    FALSE PDCD1 /// SNCA /// SPATA2
## 5 PDL1    FALSE                     CD274

The first three genes are omitted in HGNChelper.

4.5.2 GeneSymbolThesarus vs transId

GeneSymbolThesarus is a function of Seurat which finds current gene symbols from the HUGO Gene Nomenclature Committee (HGNC). It will search online so the speed will not fast.

## Attaching SeuratObject
## 
## Attaching package: 'Seurat'
## The following object is masked from 'package:DT':
## 
##     JS
start <- Sys.time()
seurat_res <- Seurat::GeneSymbolThesarus(symbols = id)
end <- Sys.time()
(end - start)
## Time difference of 57.88858 secs
seurat_res
##    PDL1 
## "CD274"

Seurat only returns one records.

4.5.3 alias2Symbol vs transId

alias2Symbol is a function of limma which depends on organism annotation package of Bioconductor.

limma_res <- limma::alias2Symbol(id, species = "Hs")
# compare with transId result
id
## [1] "TBET" "B220" "BCC7" "PD1"  "PDL1"
limma_res
## [1] "CD274"  "TBX21"  "PDCD1"  "PTPRC"  "SNCA"   "TP53"   "SPATA2"
transid_res$symbol
## [1] "TBX21"  "PTPRC"  "TP53"   "PDCD1"  "SNCA"   "SPATA2" "CD274"

It seems that alias2Symbol result remains disorder from input id, which may cause misunderstanding for users, while transId keeps the original order (if user want to get one-to-one match, just pass unique = TRUE to function).