why clusterProfiler fails

Recently, there are some comments said that sometimes clusterProfiler failed in KEGG enrichment analysis.

kaji331 compared cluserProfiler with GeneAnswers and found that clusterProfiler gives larger p values. The result forces me to test it.

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
require(GeneAnswers)
data('humanGeneInput')
y <- geneAnswersBuilder(humanGeneInput, 'org.Hs.eg.db', 
                        categoryType='KEGG', testType='hyperG', 
                        pvalueT=0.1, geneExpressionProfile=humanExpr, 
                        verbose=FALSE)
yy <- y@enrichmentInfo
 
require(clusterProfiler)
x <- enrichKEGG(humanGeneInput$GeneID, pvalueCutoff=0.2, 
                qvalueCutoff=0.2, minGSSize=1)
xx <- summary(x)
 
id <- sub("hsa", "", xx$ID)
idx <- id %in% rownames(yy)
 
p.clusterProfiler <- xx$pvalue[idx]
p.GeneAnswers <- yy[id[idx],]$"p value"
> cor(p.clusterProfiler, p.GeneAnswers)
[1] 0.9996165
> p.clusterProfiler - p.GeneAnswers
 [1]  1.029789e-04 -3.588252e-05 -4.623010e-05  1.079117e-04 -1.075746e-04
 [6] -1.077398e-04 -3.774637e-04 -2.849278e-04 -4.197993e-04  7.588155e-04
[11] -3.702141e-04  2.314721e-03 -5.695641e-04 -5.940830e-04 -4.923697e-04
[16] -5.560738e-04 -5.884079e-04  2.011138e-03

Here, I used the dataset, humanGeneInput, provided by GeneAnswers. There are 19 pathways have p values below 0.1 by GeneAnswers and 18 pathways have p values below 0.1 by clusterProfiler. 18 of them are the same and p values are highly correlated with very small differences.

Enrichment analyses
are tested by geometric model, I believe all packages do it in the same way. The difference should be attributed to annotation file and I found background ratio are different.

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
> xx[1,]
               ID Description GeneRatio  BgRatio     pvalue   p.adjust
hsa04110 hsa04110  Cell cycle      6/51 128/5894 0.00075424 0.02191169
             qvalue                      geneID Count
hsa04110 0.01683117 894/595/4175/6502/4173/9134     6
> yy[1,]
      genes in Category percent in the observed List percent in the genome
04110                 6                    0.1176471            0.02112436
      fold of overrepresents odds ratio      p value
04110                5.56926   6.441808 0.0006512612
 
 
> allEG = unique(unlist(as.list(org.Hs.egPATH2EG)))
> length(get("04110", org.Hs.egPATH2EG))
[1] 124
> length(allEG)
[1] 5870
> length(get("04110", org.Hs.egPATH2EG))/length(allEG)
[1] 0.02112436

In the above example, the BgRatio of clusterProfiler is 128/5894, slightly larger than 0.02112436 calculated by GeneAnswers. This is why clusterProfiler gives a p value larger than GeneAnswers (In this particular pathway of this particular example, clusterProfiler not always produce larger p values).

The annotation file used by clusterProfiler is KEGG.db, while GeneAnswers uses org.Hs.eg.db. This is the reason of the difference and it's not a big deal.

It's not a good idea to give the background ratio in decimal format as GeneAnswers does. We have no idea of the number of total genes in background with ratio like 0.02112436. Some software packages may use the gene number of the whole genome as background. If the total gene number is setting to 18000, the background ratio turn out to be 124/18000=0.006888889. With 0.006888889, it's hard to detect the background gene number was set to a larger number. In this scenario, all the p values will become very small, and many false positive were reported.

clusterProfiler and GeneAnswers estimate the total genes in the right way, but there are many packages may setting it to all the genes, no matter they have annotation or not. For instance, some software may setting the background to 20,000 when testing human genes.

I still remember in 2007, I used princeton's web server, http://go.princeton.edu/cgi-bin/GOTermFinder, to test GO enrichment. I found the background gene number was set to a very large number and all the p values were extremely small. When writing this blog post, I submit a job to this web server and found the gene number of human is setting to 45758. After I found this, I never use this web server again. The result is extremely "significant"! The web sever has an option, "Enter the number of products estimated for your organism". This is ridiculous. It's hard to estimate for ordinary users and it is impossible to estimate it correctly since we don't know the annotation file in used. The default number, 45758 for human, is too large, result in many false positive.

Input a gene list as background is useful, for instance we may use all genes that are detected in microarray experiment or all proteins identified by MS as background. This is an effective way to eliminate bias of using the whole genome. Use the whole genome is not mean to use the total number of gene in the genome as background. By default, KEGG enrichment analyses of clusterProfiler and GeneAnswers are using the whole genome, that is about 5900 genes for human, which are all genes have pathway annotation in the whole genome. Others without annotation are abandoned. This itself is a bias actually, but it's the realistic. If we include all genes no matter they have annotation or not, and setting 20000 or even more as background of human genes, definitely we will get significant result! But it is man-made significant.

As I replied to kaji331, if p values obtained by different software are different, trust the one produce larger p values. Many user will choose a software give a more "significant" result, but trust me, you are making type I error. Conservatively, always choose the one give you larger p-values if you know nothing about the calculation. Software that never "fail" is untrustable.

p5rn7vb

Related Posts

Leave a comment ?

17 Comments.

  1. Thanks for your testing and answers. I'll try to test my data again.

    Reply

  2. 我后来发现似乎是我用clusterProfiler时GeneID数据类型是integer,enrichKEGG没有报错,直接就出来很少的结果,而且和GeneAnswers差别巨大;用as.character强制类型转换后两个包就没有太多区别了

    Reply

    ygc Hong Kong Mozilla Firefox Ubuntu Linux Reply:

    必須要character啊,謝謝反饋,看來不能依賴用戶傳入正確的輸入,我要在函數裏強制as.character才行。

    Reply

    ygc Hong Kong Mozilla Firefox Ubuntu Linux Reply:

    看了一下,現在的release版本裏有強制as.character啊,你用的是什麼版本?

    Reply

    kaji331 China Mozilla Firefox Windows Reply:

    因为Ubuntu14.04里面R是3.0.2,因此安装的2.13版bioconductor,不是最新版

    Reply

    ygc Hong Kong Mozilla Firefox Ubuntu Linux Reply:

    再过一个多月,Bioconductor 3.0都要发布了,终究是要升级的 :mrgreen:

    Reply

    kaji331 China Mozilla Firefox Ubuntu Linux Reply:

    唉,升级一堆包,兼容性是个大问题,有些以前能用的脚本后来就不能用了,或者有隐藏的结果错误,还要改,太烦了

    Reply

  3. 快来帮帮我啊。我现在做enrichGO 和enrichKEGG都有问题。

    当我用子类的时候,enrichGO的结果为0.整体是有结果。

    enrichKEGG。我根本就不能运行成功。但是选取一般分gene的时候却可以运行成功。这是为什么啊?

    Reply

    ygc Hong Kong Mozilla Firefox Ubuntu Linux Reply:

    如果你沒搞錯,事實上沒顯著性,那就沒結果了,這也是本文的主要內容 :mrgreen:

    至於第二個問題,你必須要check你的input,之前有人郵件問我,後來發現是混了人和小鼠ID做爲輸入的。

    Reply

  4. The Entrez Gene IDs can be used for enrichGO
    But when applying to the enrichKEGG. There is an error:

    Error in if (min(p) 1) { :
    missing value where TRUE/FALSE needed
    In addition: There were 50 or more warnings (use warnings() to see the first 50)

    I think the input is OK.

    Reply

    ygc Hong Kong Mozilla Firefox Ubuntu Linux Reply:

    show me a sample of your ID.

    Reply

  5. "27031"

    Reply

    ygc Hong Kong Mozilla Firefox Ubuntu Linux Reply:

    are u kidding me! u have more than 500 genes, at least you should give me a sample of size 20 or 30.

    As you are using an old version of R and bioconductor which is not supported anymore, please upgrade your R and bioconductor to the latest release version, and give me the output of your sessionInfo() like the below output of mine. I won't answer any of your questions if you can't do this.

    > sessionInfo()
    R version 3.1.1 (2014-07-10)
    Platform: x86_64-unknown-linux-gnu (64-bit)
    
    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
     [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
     [9] LC_ADDRESS=C               LC_TELEPHONE=C            
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
    
    attached base packages:
    [1] parallel  stats     graphics  grDevices utils     datasets  methods  
    [8] base     
    
    other attached packages:
    [1] clusterProfiler_1.12.0 AnnotationDbi_1.26.0   GenomeInfoDb_1.0.2    
    [4] Biobase_2.24.0         BiocGenerics_0.10.0    RSQLite_0.11.4        
    [7] DBI_0.2-7              ggplot2_1.0.0         
    
    loaded via a namespace (and not attached):
     [1] colorspace_1.2-4 digest_0.6.4     DO.db_2.8.0      DOSE_2.2.1      
     [5] GO.db_2.14.0     GOSemSim_1.22.0  grid_3.1.1       gtable_0.1.2    
     [9] igraph_0.7.1     IRanges_1.22.10  KEGG.db_2.14.0   MASS_7.3-33     
    [13] munsell_0.4.2    plyr_1.8.1       proto_0.3-10     qvalue_1.38.0   
    [17] Rcpp_0.11.2      reshape2_1.4     scales_0.2.4     stats4_3.1.1    
    [21] stringr_0.6.2    tcltk_3.1.1      tools_3.1.1     
    

    Reply

  6. 不要生气。我真是有问题问。我已经升级了版本。等会重新运行> sessionInfo()
    R version 3.1.1 (2014-07-10)
    Platform: x86_64-w64-mingw32/x64 (64-bit)

    locale:
    [1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252 LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
    [5] LC_TIME=English_Australia.1252

    attached base packages:
    [1] stats graphics grDevices utils datasets methods base
    >

    Reply

    ygc Hong Kong Mozilla Firefox Ubuntu Linux Reply:

    give me a sample of your gene IDs by the following command:

    > sample(gene, 30)

    I will test it.

    Reply

  7. 非常感谢你,我在升级的版本上都运行了。没问题了。谢谢你!

    Reply

    ygc Hong Kong Mozilla Firefox Ubuntu Linux Reply:

    我猜你的gene ID有可能是factor,而输入是需要character的,新版中强制as.character,所以尽可能用新版,老版本首先是不支持了,再者你遇到的问题可能在新版中早已解决。 :cool:

    Reply

Leave a Comment


NOTE - You can use these HTML tags and attributes:
<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>