why clusterProfiler fails

Recently, there are some comments said that sometimes clusterProfiler failed in KEGG enrichment analysis.

kaji331 compared cluserProfiler with GeneAnswers and found that clusterProfiler gives larger p values. The result forces me to test it.

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
require(GeneAnswers)
data('humanGeneInput')
y <- geneAnswersBuilder(humanGeneInput, 'org.Hs.eg.db', 
                        categoryType='KEGG', testType='hyperG', 
                        pvalueT=0.1, geneExpressionProfile=humanExpr, 
                        verbose=FALSE)
yy <- y@enrichmentInfo
 
require(clusterProfiler)
x <- enrichKEGG(humanGeneInput$GeneID, pvalueCutoff=0.2, 
                qvalueCutoff=0.2, minGSSize=1)
xx <- summary(x)
 
id <- sub("hsa", "", xx$ID)
idx <- id %in% rownames(yy)
 
p.clusterProfiler <- xx$pvalue[idx]
p.GeneAnswers <- yy[id[idx],]$"p value"
> cor(p.clusterProfiler, p.GeneAnswers)
[1] 0.9996165
> p.clusterProfiler - p.GeneAnswers
 [1]  1.029789e-04 -3.588252e-05 -4.623010e-05  1.079117e-04 -1.075746e-04
 [6] -1.077398e-04 -3.774637e-04 -2.849278e-04 -4.197993e-04  7.588155e-04
[11] -3.702141e-04  2.314721e-03 -5.695641e-04 -5.940830e-04 -4.923697e-04
[16] -5.560738e-04 -5.884079e-04  2.011138e-03

Here, I used the dataset, humanGeneInput, provided by GeneAnswers. There are 19 pathways have p values below 0.1 by GeneAnswers and 18 pathways have p values below 0.1 by clusterProfiler. 18 of them are the same and p values are highly correlated with very small differences.

Enrichment analyses
are tested by geometric model, I believe all packages do it in the same way. The difference should be attributed to annotation file and I found background ratio are different.

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
> xx[1,]
               ID Description GeneRatio  BgRatio     pvalue   p.adjust
hsa04110 hsa04110  Cell cycle      6/51 128/5894 0.00075424 0.02191169
             qvalue                      geneID Count
hsa04110 0.01683117 894/595/4175/6502/4173/9134     6
> yy[1,]
      genes in Category percent in the observed List percent in the genome
04110                 6                    0.1176471            0.02112436
      fold of overrepresents odds ratio      p value
04110                5.56926   6.441808 0.0006512612
 
 
> allEG = unique(unlist(as.list(org.Hs.egPATH2EG)))
> length(get("04110", org.Hs.egPATH2EG))
[1] 124
> length(allEG)
[1] 5870
> length(get("04110", org.Hs.egPATH2EG))/length(allEG)
[1] 0.02112436

In the above example, the BgRatio of clusterProfiler is 128/5894, slightly larger than 0.02112436 calculated by GeneAnswers. This is why clusterProfiler gives a p value larger than GeneAnswers (In this particular pathway of this particular example, clusterProfiler not always produce larger p values).

The annotation file used by clusterProfiler is KEGG.db, while GeneAnswers uses org.Hs.eg.db. This is the reason of the difference and it's not a big deal.

It's not a good idea to give the background ratio in decimal format as GeneAnswers does. We have no idea of the number of total genes in background with ratio like 0.02112436. Some software packages may use the gene number of the whole genome as background. If the total gene number is setting to 18000, the background ratio turn out to be 124/18000=0.006888889. With 0.006888889, it's hard to detect the background gene number was set to a larger number. In this scenario, all the p values will become very small, and many false positive were reported.

clusterProfiler and GeneAnswers estimate the total genes in the right way, but there are many packages may setting it to all the genes, no matter they have annotation or not. For instance, some software may setting the background to 20,000 when testing human genes.

I still remember in 2007, I used princeton's web server, http://go.princeton.edu/cgi-bin/GOTermFinder, to test GO enrichment. I found the background gene number was set to a very large number and all the p values were extremely small. When writing this blog post, I submit a job to this web server and found the gene number of human is setting to 45758. After I found this, I never use this web server again. The result is extremely "significant"! The web sever has an option, "Enter the number of products estimated for your organism". This is ridiculous. It's hard to estimate for ordinary users and it is impossible to estimate it correctly since we don't know the annotation file in used. The default number, 45758 for human, is too large, result in many false positive.

Input a gene list as background is useful, for instance we may use all genes that are detected in microarray experiment or all proteins identified by MS as background. This is an effective way to eliminate bias of using the whole genome. Use the whole genome is not mean to use the total number of gene in the genome as background. By default, KEGG enrichment analyses of clusterProfiler and GeneAnswers are using the whole genome, that is about 5900 genes for human, which are all genes have pathway annotation in the whole genome. Others without annotation are abandoned. This itself is a bias actually, but it's the realistic. If we include all genes no matter they have annotation or not, and setting 20000 or even more as background of human genes, definitely we will get significant result! But it is man-made significant.

As I replied to kaji331, if p values obtained by different software are different, trust the one produce larger p values. Many user will choose a software give a more "significant" result, but trust me, you are making type I error. Conservatively, always choose the one give you larger p-values if you know nothing about the calculation. Software that never "fail" is untrustable.

m4s0n501

Related Posts

Leave a comment ?

38 Comments.

  1. Thanks for your testing and answers. I'll try to test my data again.

    Reply

  2. 我后来发现似乎是我用clusterProfiler时GeneID数据类型是integer,enrichKEGG没有报错,直接就出来很少的结果,而且和GeneAnswers差别巨大;用as.character强制类型转换后两个包就没有太多区别了

    Reply

    ygc Hong Kong Mozilla Firefox Ubuntu Linux Reply:

    必須要character啊,謝謝反饋,看來不能依賴用戶傳入正確的輸入,我要在函數裏強制as.character才行。

    Reply

    ygc Hong Kong Mozilla Firefox Ubuntu Linux Reply:

    看了一下,現在的release版本裏有強制as.character啊,你用的是什麼版本?

    Reply

    kaji331 China Mozilla Firefox Windows Reply:

    因为Ubuntu14.04里面R是3.0.2,因此安装的2.13版bioconductor,不是最新版

    Reply

    ygc Hong Kong Mozilla Firefox Ubuntu Linux Reply:

    再过一个多月,Bioconductor 3.0都要发布了,终究是要升级的 :mrgreen:

    Reply

    kaji331 China Mozilla Firefox Ubuntu Linux Reply:

    唉,升级一堆包,兼容性是个大问题,有些以前能用的脚本后来就不能用了,或者有隐藏的结果错误,还要改,太烦了

    Reply

  3. 快来帮帮我啊。我现在做enrichGO 和enrichKEGG都有问题。

    当我用子类的时候,enrichGO的结果为0.整体是有结果。

    enrichKEGG。我根本就不能运行成功。但是选取一般分gene的时候却可以运行成功。这是为什么啊?

    Reply

    ygc Hong Kong Mozilla Firefox Ubuntu Linux Reply:

    如果你沒搞錯,事實上沒顯著性,那就沒結果了,這也是本文的主要內容 :mrgreen:

    至於第二個問題,你必須要check你的input,之前有人郵件問我,後來發現是混了人和小鼠ID做爲輸入的。

    Reply

  4. The Entrez Gene IDs can be used for enrichGO
    But when applying to the enrichKEGG. There is an error:

    Error in if (min(p) 1) { :
    missing value where TRUE/FALSE needed
    In addition: There were 50 or more warnings (use warnings() to see the first 50)

    I think the input is OK.

    Reply

    ygc Hong Kong Mozilla Firefox Ubuntu Linux Reply:

    show me a sample of your ID.

    Reply

  5. "27031"

    Reply

    ygc Hong Kong Mozilla Firefox Ubuntu Linux Reply:

    are u kidding me! u have more than 500 genes, at least you should give me a sample of size 20 or 30.

    As you are using an old version of R and bioconductor which is not supported anymore, please upgrade your R and bioconductor to the latest release version, and give me the output of your sessionInfo() like the below output of mine. I won't answer any of your questions if you can't do this.

    > sessionInfo()
    R version 3.1.1 (2014-07-10)
    Platform: x86_64-unknown-linux-gnu (64-bit)
    
    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
     [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
     [9] LC_ADDRESS=C               LC_TELEPHONE=C            
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
    
    attached base packages:
    [1] parallel  stats     graphics  grDevices utils     datasets  methods  
    [8] base     
    
    other attached packages:
    [1] clusterProfiler_1.12.0 AnnotationDbi_1.26.0   GenomeInfoDb_1.0.2    
    [4] Biobase_2.24.0         BiocGenerics_0.10.0    RSQLite_0.11.4        
    [7] DBI_0.2-7              ggplot2_1.0.0         
    
    loaded via a namespace (and not attached):
     [1] colorspace_1.2-4 digest_0.6.4     DO.db_2.8.0      DOSE_2.2.1      
     [5] GO.db_2.14.0     GOSemSim_1.22.0  grid_3.1.1       gtable_0.1.2    
     [9] igraph_0.7.1     IRanges_1.22.10  KEGG.db_2.14.0   MASS_7.3-33     
    [13] munsell_0.4.2    plyr_1.8.1       proto_0.3-10     qvalue_1.38.0   
    [17] Rcpp_0.11.2      reshape2_1.4     scales_0.2.4     stats4_3.1.1    
    [21] stringr_0.6.2    tcltk_3.1.1      tools_3.1.1     
    

    Reply

  6. 不要生气。我真是有问题问。我已经升级了版本。等会重新运行> sessionInfo()
    R version 3.1.1 (2014-07-10)
    Platform: x86_64-w64-mingw32/x64 (64-bit)

    locale:
    [1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252 LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
    [5] LC_TIME=English_Australia.1252

    attached base packages:
    [1] stats graphics grDevices utils datasets methods base
    >

    Reply

    ygc Hong Kong Mozilla Firefox Ubuntu Linux Reply:

    give me a sample of your gene IDs by the following command:

    > sample(gene, 30)

    I will test it.

    Reply

  7. 非常感谢你,我在升级的版本上都运行了。没问题了。谢谢你!

    Reply

    ygc Hong Kong Mozilla Firefox Ubuntu Linux Reply:

    我猜你的gene ID有可能是factor,而输入是需要character的,新版中强制as.character,所以尽可能用新版,老版本首先是不支持了,再者你遇到的问题可能在新版中早已解决。 :cool:

    Reply

  8. It would be much more helpful to the community if people posted in English instead of Chinese. Anyway...

    I am trying to run compareCluster. It doesn't matter for what function I apply, I always get this error:

    The following `from` values were not present in `x`: .id

    I'm not sure if it is something related to the format I am sending the genesList. I'd appreciate some help.
    If you could please send me an email. I could also reply in more details on how I'm inputing the arguments.

    Thanks in advance

    Reply

    ygc Hong Kong Mozilla Firefox Mac OS Reply:

    in enrichXXX function, the input is a character vector, while in compareCluster, the input should be a list of character vectors. It seems you did not sending the input in right format.

    Please check you input, which should be something like:

    > lapply(gcSample, head)
    $X1
    [1] "4597"  "7111"  "5266"  "2175"  "755"   "23046"
    
    $X2
    [1] "23450" "5160"  "7126"  "26118" "8452"  "3675"
    
    $X3
    [1] "894"   "7057"  "22906" "3339"  "10449" "6566"
    
    $X4
    [1] "5573"  "7453"  "5245"  "23450" "6500"  "4926"
    
    $X5
    [1] "5982" "7318" "6352" "2101" "8882" "7803"
    
    $X6
    [1] "5337"  "9295"  "4035"  "811"   "23365" "4629"
    
    $X7
    [1] "2621" "2665" "5690" "3608" "3550" "533"
    
    $X8
    [1] "2665" "4735" "1327" "3192" "5573" "9528"
    

    Reply

  9. Thanks for the fast reply

    I believe that's exactly how I did.

    Look, this is how my geneList parameter:
    [[1]]
    11727902_a_at 11746668_a_at 11722645_s_at 11762867_at 11725127_at 11752399_a_at 11722848_a_at
    "2037" "223" "54680" "5244" "2033" "8880" "5530"
    11718805_a_at 11725934_x_at 11739639_at 11731024_s_at 11731521_at 11732281_s_at 11745275_x_at
    "23369" "3309" "51755" "6314" "9640" "27086" "83737"
    11748645_a_at 11744150_s_at 11747207_a_at 11739563_a_at 11718562_a_at 11739832_a_at 11718166_a_at
    "22821" "9663" "1994" "3708" "1788" "79890" "5934"
    11740212_x_at 11716194_a_at
    "5294" "29072"

    [[2]]
    11736481_at 11728117_a_at 11745809_s_at 11726561_at 11738252_at 11755171_x_at 11715315_s_at
    "79865" "57188" "9674" "10284" "3274" "57662" "27342"
    11747426_a_at 11718923_a_at 11752245_at 11757110_a_at 11761504_at 11722909_a_at 11757433_a_at
    "90060" "9922" "84623" "100129827" "54937" "352954" "5987"
    11749842_a_at 11760720_x_at 11724380_a_at 11749861_a_at 11721783_a_at 11762963_at 11730233_s_at
    "5663" "113278" "57224" "5345" "23315" "8407" "1501"
    11733390_a_at 11732495_at 11730809_at 11735642_at 11744886_a_at 11738921_at 11725879_a_at
    "186" "79000" "26045" "388581" "9043" "58160" "23304"
    11749239_s_at 11716751_a_at 11751192_a_at 11746256_x_at 11751450_a_at 11741558_a_at 11730304_at
    "79894" "2799" "50859" "5005" "54890" "3216" "2175"
    11723440_a_at 11752238_x_at 11722789_a_at 11748694_x_at 11728152_a_at 11726323_a_at 11743471_a_at
    "2043" "64718" "5868" "5868" "3017" "51" "23001"
    11747307_a_at 11733103_a_at 11737894_a_at 11739134_a_at 11724598_at 11732455_at
    "2664" "10714" "9990" "1290" "151516" "3772"

    the way I created it was:
    clusterContainer = list(module17_Entrez,module44_Entrez).

    Then I realise that there is a difference on how your gcSample is displayed and my clusterContainer, and I noticed that this way I am doing I am actually making a list of lists.

    So I converted the internal lists to vectors:
    clusterContainer = list(as.vector(module17_Entrez),as.vector(module44_Entrez)).

    and now it looks more similar to what gcSample has, but still I get the same error :sad:

    [[1]]
    [1] "2037" "223" "54680" "5244" "2033" "8880" "5530" "23369" "3309" "51755" "6314" "9640" "27086"
    [14] "83737" "22821" "9663" "1994" "3708" "1788" "79890" "5934" "5294" "29072"

    [[2]]
    [1] "79865" "57188" "9674" "10284" "3274" "57662" "27342" "90060"
    [9] "9922" "84623" "100129827" "54937" "352954" "5987" "5663" "113278"
    [17] "57224" "5345" "23315" "8407" "1501" "186" "79000" "26045"
    [25] "388581" "9043" "58160" "23304" "79894" "2799" "50859" "5005"
    [33] "54890" "3216" "2175" "2043" "64718" "5868" "5868" "3017"
    [41] "51" "23001" "2664" "10714" "9990" "1290" "151516" "3772"

    Reply

    ygc Hong Kong Mozilla Firefox Mac OS Reply:

    the input should be a named list.

    you should, as an example:

    ?View Code RSPLUS
    1
    
    name(clusterContainer) <- c("cluster_1", "cluster_2")

    and then clusterContainer is ready to go.

    Reply

  10. in that way it works for other functions, but for enrichKEGG I get the same error

    Reply

    ygc Hong Kong Mozilla Firefox Mac OS Reply:

    if then, send your data to me by email. I will check what's going on.

    Reply

  11. 请问:
    cnetplot(kk, categorySize="geneNum", foldChange=geneList), 弄不明白,geneList是一个什么样的数据结构,我要用自己的FC 值分析啊! :?:

    Reply

    ygc Hong Kong Mozilla Firefox Mac OS Reply:

    首先它是数值型的,在示例中,是log2(ratio),这是用来标示上下调的数值,然后cnetplot得知道这些数值是那些基因/蛋白的,所以它们有个名字,名字就是相应的entrez gene ID。

    ?View Code RSPLUS
    1
    2
    3
    4
    5
    
    > head(geneList)
        4312     8318    10874    55143    55388      991
    4.572613 4.514594 4.418218 4.144075 3.876258 3.677857
    > names(head(geneList))
    [1] "4312"  "8318"  "10874" "55143" "55388" "991"

    Reply

  12. 谢谢您的回复,我从results.lim分别提取EntrezID和对应的logFC
    a <- c(results.lim.dn[, "EntrezID"])
    a <- as.character(a)
    b <- c(results.lim.dn[, "logFC"])
    接下来该如何把a,b组合成geneList结构呢 :?: :?: :?: :?:

    Reply

    ygc Hong Kong Mozilla Firefox Mac OS Reply:

    ?View Code RSPLUS
    1
    
    names(b) <- a

    那么b就是了。

    Reply

    amiee2406 China Safari Mac OS Reply:

    谢谢! :razz:

    Reply

  13. 在作KEGG分析时,好奇怪:
    kk <- enrichKEGG(dg, organism="human",
    pvalueCutoff=1,
    pAdjustMethod="BH",
    universe = universe,
    qvalueCutoff=1,
    minGSSize=1,
    readable=TRUE)
    结果显示共有80obs.,其中明明有11条pvalue <0.05的pathway,可设置pvalueCutoff= 0.05(试过比1小的其它数值),kegg均显示为0 obs..
    :?: :?: :?:

    Reply

    ygc Hong Kong Mozilla Firefox Mac OS Reply:

    p.adjust也会被pvalueCutoff卡的。 :cool:

    Reply

    amiee2406 China Safari Mac OS Reply:

    好像明白了 :shock:

    Reply

    amiee2406 China Safari Mac OS Reply:

    明白了 :razz:

    Reply

  14. 新年好!
    当进行GSEA分析时,我在对nPerm、minGSSize两个参数的设置上有些迷糊:1.这两个参数高因有何具体依据?2.nPerm是随机组合次数吗?当我对Affymetrix HG-U133_Plus_2芯片进行GSEA kegg分析时,nPerm设置为1000,花了相当长时间?

    Reply

    ygc Hong Kong Mozilla Firefox Mac OS Reply:

    minGSSize设置的是gene set小于阈值就不进行计算,nPerm是permutation的次数,permutation是computational intensive的,时间长是肯定的。

    Reply

  15. 另外,nPerm和minGSSize两个参数是如何影响ES值的,当设置不同数值时,发现富集到的KEGG通路不仅数量上不同,内容也不同??
    谢谢!

    Reply

    ygc Hong Kong Mozilla Firefox Mac OS Reply:

    minGSSize不影响,nPerm肯定会影响的,nPerm取值比较大(比如说>1000)的时候,差别是可以忽略的。

    Reply

  16. 您好!gseKEGG命令得到的结果中,setSize含义是什么。它的值和什么参数有关呢! :?: 随意调整nperm和minGSSize值,当nperm=100,ninGSSize=120,得到KEGG3条;nPerm=1000, minGSSize=10,得到kegg6条;nPerm=1000, minGSSize=5,得到kegg7条。不同参数,不仅KEGG数目不同,种类也不同??而结果显示的相同pathway所对应的setSize和ES值都是不变的,那么到底依据什么来设置这些参数 :?:

    Reply

    ygc Hong Kong Mozilla Firefox Mac OS Reply:

    setSize只是告诉你这个gene Set有多大而已,它和参数没有关系。minGSSize和nPerm上面已经说了,你看一下结果,再看我说的,就理解了。

    Reply

Leave a Comment


NOTE - You can use these HTML tags and attributes:
<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>