Bug of R package ChIPpeakAnno

I used R package ChIPpeakAnno for annotating peaks, and found that it handle the DNA strand in the wrong way. Maybe the developers were from the computer science but not biology background.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
> require(ChIPpeakAnno)
> packageVersion("ChIPpeakAnno")
[1] '2.10.0'
> peak <- RangedData(space="chr1", IRanges(24736757, 24737528))
> data(TSS.human.GRCh37)
> ap <- annotatePeakInBatch(peak, Annotation=TSS.human.GRCh37)
> ap
RangedData with 1 row and 9 value columns across 1 space
                     space               ranges |        peak      strand
                  <factor>            <IRanges> | <character> <character>
1 ENSG00000001461        1 [24736757, 24737528] |           1           +
                          feature start_position end_position insideFeature
                      <character>      <numeric>    <numeric>   <character>
1 ENSG00000001461 ENSG00000001461       24742284     24799466      upstream
                  distancetoFeature shortestDistance fromOverlappingOrNearest
                          <numeric>        <numeric>              <character>
1 ENSG00000001461             -5527             4756             NearestStart

In this example, I defined a peak ranging from chr1:24736757 to chr1:24737528 and annotated the peak using ChIPpeakAnno package.

It returns that the nearest gene is ENSG00000001461, whose gene symbol is NIPAL3.

1
2
3
4
5
> require(org.Hs.eg.db)
> gene.ChIPpeakAnno <- select(org.Hs.eg.db, key=ap$feature, keytype="ENSEMBL", columns=c("ENSEMBL", "ENTREZID", "SYMBOL"))
> gene.ChIPpeakAnno
          ENSEMBL ENTREZID SYMBOL
1 ENSG00000001461    57185 NIPAL3

When looking at the peak in Genome Browser, I found the nearest gene is STPG1.
Screenshot 2014-01-13 22.00.46

The gene symbol, STPG1, was converted to ENTREZID for future processing.

1
2
3
4
> gene.nearest <- select(org.Hs.eg.db, key="STPG1", keytype="SYMBOL", columns=c("ENSEMBL", "ENTREZID", "SYMBOL"))
> gene.nearest
  SYMBOL         ENSEMBL ENTREZID
1  STPG1 ENSG00000001460    90529

We can query the coordination of these two genes, and compare their distances to the peak.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
> require(TxDb.Hsapiens.UCSC.hg19.knownGene)
> knownGene <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by="gene")
> 
> gene.ChIPpeakAnno.info <- knownGene[[gene.ChIPpeakAnno$ENTREZID]]
> gene.ChIPpeakAnno.info
GRanges with 4 ranges and 2 metadata columns:
      seqnames               ranges strand |     tx_id     tx_name
         <Rle>            <IRanges>  <Rle> | <integer> <character>
  [1]     chr1 [24742245, 24781314]      + |       618  uc010oek.2
  [2]     chr1 [24742245, 24799473]      + |       619  uc001bjh.3
  [3]     chr1 [24742245, 24799473]      + |       620  uc009vrc.3
  [4]     chr1 [24782628, 24792864]      + |       621  uc001bji.3
  ---
  seqlengths:
                    chr1                  chr2 ...        chrUn_gl000249
               249250621             243199373 ...                 38502

After getting the information of the gene annotated by ChIPpeakAnno, I also found that the ranges of the gene is slightly different from the one returned by annotatePeakInBatch function. This may due to the variation of different versions and it's not a big deal.

As the gene, NIPAL3, is encoded in + strand, the nearest distance is:

1
2
> min(abs(start(peak) - start(gene.ChIPpeakAnno.info)))
[1] 5488

While the gene, STPG1, is encoded in - strand, the end of the gene coordination is actually the start position of the gene and the start of the gene coordination is the end position. So the distance should be calculated by the end coordination.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
> gene.nearest.info <- knownGene[[gene.nearest$ENTREZID]]
> gene.nearest.info
GRanges with 6 ranges and 2 metadata columns:
      seqnames               ranges strand |     tx_id     tx_name
         <Rle>            <IRanges>  <Rle> | <integer> <character>
  [1]     chr1 [24683489, 24700300]      - |      4706  uc010oej.2
  [2]     chr1 [24683489, 24740262]      - |      4707  uc001bja.3
  [3]     chr1 [24683489, 24740262]      - |      4708  uc001bjb.3
  [4]     chr1 [24683489, 24740262]      - |      4709  uc001bjc.3
  [5]     chr1 [24683489, 24741587]      - |      4710  uc001bjd.3
  [6]     chr1 [24695211, 24718169]      - |      4711  uc001bjf.3
  ---
  seqlengths:
                    chr1                  chr2 ...        chrUn_gl000249
               249250621             243199373 ...                 38502
> min(abs(end(peak) - end(knownGene[[gene.nearest$ENTREZID]])))
[1] 2734

STPG1 is more close to the peak than NIPAL3. Those genes encoded in - strand can't be well-handled by ChIPpeakAnno package.

I look careful to the source code of annotatePeakInBatch function.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
    TSS.ordered <- AnnotationData
    rm(AnnotationData)
    if (!length(rownames(TSS.ordered))) {
        rownames(TSS.ordered) = formatC(1:dim(TSS.ordered)[1],
            width = nchar(dim(TSS.ordered)[1]), flag = "0")
    }
    if (length(TSS.ordered$strand) == length(start(TSS.ordered))) {
        r2 = cbind(rownames(TSS.ordered), start(TSS.ordered),
            end(TSS.ordered), as.character(TSS.ordered$strand))
    }
    else {
        TSS.ordered$strand = rep("+", length(start(TSS.ordered)))
        r2 = cbind(rownames(TSS.ordered), start(TSS.ordered),
            end(TSS.ordered), rep("+", length(start(TSS.ordered))))
    }
    colnames(r2) = c("feature_id", "start_position", "end_position",
        "strand")

The AnnotationData object is provided by the package, or query from biomaRt, the length(TSS.ordered$strand) == length(start(TSS.ordered) should be TRUE, and this code should works fine. But when the expression return FALSE, the function should complain this, with warning message showing in the screen or even stop running. But the author just simply assign all the genes encoded in + strand, this is nonsense.

Then for calculating the feauture location:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
    if (FeatureLocForDistance == "middle" || FeatureLocForDistance ==
        "m") {
        FeatureLoc = unlist(lapply(1:dim(r2)[1], function(i) {
            round(mean(c(as.numeric(r2[i, 2]), as.numeric(r2[i,
                3]))))
        }))
        FeatureLocForDistance = "middle"
    }
    else if (FeatureLocForDistance == "start" || FeatureLocForDistance ==
        "s") {
        FeatureLoc = r2[, 2]
        FeatureLocForDistance = "start"
    }
    else if (FeatureLocForDistance == "end" || FeatureLocForDistance ==
        "e") {
        FeatureLoc = r2[, 3]
        FeatureLocForDistance = "end"
    }
    else if (FeatureLocForDistance == "geneEnd" || FeatureLocForDistance ==
        "g") {
        FeatureLoc = unlist(lapply(1:dim(r2)[1], function(i) {
            if (as.character(r2[i, 4]) == "+" || as.character(r2[i,
                4]) == "1" || as.character(r2[i, 4]) == "*") {
                r2[i, 3]
            }
            else {
                r2[i, 2]
            }
        }))
        FeatureLocForDistance = "geneEnd"
    }
    else {
        FeatureLoc = unlist(lapply(1:dim(r2)[1], function(i) {
            if (as.character(r2[i, 4]) == "+" || as.character(r2[i,
                4]) == "1" || as.character(r2[i, 4]) == "*") {
                r2[i, 2]
            }
            else {
                r2[i, 3]
            }
        }))
        FeatureLocForDistance = "TSS"
    }

Only "geneEnd" and "TSS" types consider the strand information. For "middle", omits the strand information should be fine, but for "start" and "end", the "start" should be "end" and "end" should be "start" when the gene is located at - strand. Again, the annotatePeakInBatch function omits the strand information of genes/features.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
        TSS.ordered$FeatureLoc = FeatureLoc
        myPeakList$PeakLoc = PeakLoc
        plusAnno = TSS.ordered[as.character(TSS.ordered$strand) %in%
            c("+", "*", "1"), ]
        minusAnno = TSS.ordered[as.character(TSS.ordered$strand) %in%
            c("-1", "-"), ]
        r1 = do.call(rbind, lapply(seq_len(numberOfChromosome),
            function(i) {
                chr = allChr[i]
                if (chr %in% allChr.Anno) {
                  featureStart = as.numeric(TSS.ordered[chr]$FeatureLoc)
                  peakLoc = as.numeric(myPeakList[chr]$PeakLoc)
                  peakStart = as.numeric(start(myPeakList[chr]))
                  peakEnd = as.numeric(end(myPeakList[chr]))
                  name = rownames(myPeakList[chr])
                  peakRanges = IRanges(start = peakStart, end = peakEnd,
                    names = name)
                  featureID = rownames(TSS.ordered[chr])
                  featureRanges = IRanges(start = featureStart,
                    end = featureStart, names = featureID)
                  nearestFeature = featureRanges[nearest(peakRanges,
                    featureRanges)]
                  data.frame">data.frame(name = name, feature_id = names(nearestFeature))
                }
            }))

For identifying the nearest feature, the author use:

nearestFeature = featureRanges[nearest(peakRanges,
                    featureRanges)]

The "start" of the features should be used when it were encoded in + strand and "end" when features were encoded in - strand. But the author use nearest function which use both "start" and "end" of the interval to calculate the distance by default.

Considering the "start" and "end" records may reverse when the genes/features were encoded in - strand and the way ChIPpeakAnno calcultes the distance is too simple by using nearest function, I can't trust their results.

It's not hard to implement a function to annotate peaks. Preparing the gene annotation and calculating the distances among genes and peaks for finding the nearest gene of the peak. That's it.

Related Posts

Leave a comment ?

16 Comments.

  1. Good and important bug catch! Maybe one more reason to have certain R packages peer-reviewed...

    Cheers,
    Andrej

    Reply

    ruikeoln United States Unknow Browser Unknow Os Reply:

    Any reseacher in the field of biology, regardless of his/ her previous background, should be equiped with minimum common knowledges of biology, e.g., what is DNA, that are the plus/ minus strands, what are the codons, what are diploids, what are introns...

    Reply

  2. António Domingues Germany Unknow Browser Unknow Os

    Have you contacted the package maintainers about this?

    Reply

    ygc Hong Kong Unknow Browser Unknow Os Reply:

    yes. waiting for their bug fix, then.

    Reply

  3. Is there any better alternative or do one best go about implement it oneself?

    Reply

    ygc Hong Kong Unknow Browser Unknow Os Reply:

    you may try HOMER

    Reply

    ygc Hong Kong Mozilla Firefox Mac OS Reply:

    I have develop my own R package for ChIP peak annotation, and the package is now available in Bioconductor. You can try it if you are interested. http://www.bioconductor.org/packages/2.14/bioc/html/ChIPseeker.html

    Reply

  4. "Better to light a candle than curse the darkness", so glad to see you reported the bug. However, referring to it as "sucks" won't win friends. It's always possible the maintainers had a reason for selecting different results from what you expect (or maybe not). R is open source, and "peer review" happens specifically because experts around the world test and report bugs. To answer KPJ: there are many genetics packages, so there may be other distance functions available; but people should feel free to copy&mod code packages as they see fit.

    Reply

    ygc Hong Kong Unknow Browser Unknow Os Reply:

    Yes, there may be other distance functions available. But all functions should consider the DNA strand. If the strand information were just omitted, the distance function is incorrect.

    I agree with you, and will change the "sucks" to another word. Thanks for your reply.

    Reply

  5. Dear Guangchuang,

    Thank you very much for the feedback! Any feedbacks and responses sent to bioconductor@r-project.org are archived and web searchable. I would appreciate if you could send email to Bioconductor list for feedbacks forward.

    If you type ?annotatePeakInBatch, you will notice that FeatureLocForDistance is set to TSS by default, meaning that using start of feature when feature is on plus strand and using end of feature when feature is on minus strand. I think this is consistent with your expectation.

    The discrepancy described by you is due to the different annotation source used. You used transcript coordinates instead of gene coordinates stored in TSS.human.NCBI36. Another difference is that you used end(peak) to calculate distance while annotatePeakInBatch uses start(peak) by default. As a user, you can set PeakLocForDistance as start, end or middle of the peak. However, the result will be different. Please type ?annotatePeakInBatch to get the details to customize your parameter setting to fit your needs. Also annotation from different databases of different versions will be slightly different as you already pointed out. You can use getAnnotation function in ChIPpeakAnno to obtain the most recent Ensemble annotation. Alternatively, you could download annotation from UCSC genome browser and convert it to RangedData as annotation source. You also can download transcript coordinates to feed into annotatePeakInBatch to get the nearest transcript instead of gene.

    To test whether there exists this bug in the annotatePeakInBatch , let us use your example with a simple annotation consisting only two genes with the exact coordinates provided by you instead of using the built-in annotation. Looks like you are interested in peak end to TSS, so I will set PeakLocForDistance = "end" here.

    You can see that the results obtained, by customizing the parameters in annoatePeakInBatch, is consistent with what you get.

    require(ChIPpeakAnno)
    packageVersion("ChIPpeakAnno")
    peak <- RangedData(space="chr1", IRanges(24736757, 24737528))
    
    TSS <- RangedData(space="chr1", IRanges(start=c(24742284 ,24683489,24683489,24683489,24695211), end = c( 24799466, 24700300,24740262, 24741587,24718169), names=c("NIPAL3", "STPG1-1","STPG1-2","STPG1-3","STPG1-4")), strand=c("+","-", "-", "-","-"))
    TSS
    RangedData with 5 rows and 1 value column across 1 space
               space               ranges |      strand
                         | 
    NIPAL3      chr1 [24742284, 24799466] |           +
    STPG1-1     chr1 [24683489, 24700300] |           -
    STPG1-2     chr1 [24683489, 24740262] |           -
    STPG1-3     chr1 [24683489, 24741587] |           -
    STPG1-4     chr1 [24695211, 24718169] |           -
    ap <- annotatePeakInBatch(peak, Annotation=TSS, PeakLocForDistance="end")
     ap
    RangedData with 1 row and 9 value columns across 1 space
                 space               ranges |        peak      strand     feature start_position end_position insideFeature distancetoFeature shortestDistance
                           |                                 
    1 STPG1-2     chr1 [24736757, 24737528] |           1           -     STPG1-2       24683489     24740262        inside              2734             2734
              fromOverlappingOrNearest
                           
    1 STPG1-2             NearestStart
    

    Hope this clears things for you. Please do not hesitate to send feedbacks to Bioconductor list if you have additional questions or suggestions.

    Many thanks for the detailed examples!

    Best regards,

    Julie

    Reply

    ygc Hong Kong Unknow Browser Unknow Os Reply:

    I don't understand why setting PeakLocForDistance parameter.

    My point is it should use peak "start" when feature in + strand and use peak "end" when feature in - strand.

    Reply

  6. Dear Guangchuang,
    Thank you very much for the feedback! I posted the clarification. However, it says that "Your comment is awaiting moderation".
    Is this your personal website?
    I believe that your feedback and my clarification together will help users understand the parameter settings in annotatePeakInBatch more.

    Best regards,

    Julie

    Reply

  7. Guangchuang,

    Thanks for the clarification!

    annotatePeakInBatch uses gene "start" when gene is in + strand and use gene "end" when gene is in - strand to calculate distance from peak to TSS. Peaks from ChIP-seq/ChIP-chip is strand-less and peaks are not necessary always at the upstream of genes, so it is rather arbitrary to pick peak start for genes located in + and peak end for genes located in - strand . annotatePeakInBatch lets users to specify PeakLocForDistance (start, middle, end) regardless the gene strand to calculate distance.

    If you want to use peak start to calculate distance to genes located in + strand, but use peak end to calculate distance to genes located in - strand, I suggest you create two annotation files, one for + strand and the other for - strand, and specify PeakLocForDistance ="start" for + annotation and PeakLocForDistance="end" for - annotation.

    Hope this is helpful. Thanks!

    Best regards,

    Julie

    Reply

    ygc Hong Kong Unknow Browser Unknow Os Reply:

    Let me explain my point of view, if we know the strand of a peak, then what we would like to know is only the nearest feature located in the same strand.

    As we both know, peaks are strand-less and we don't have that information.

    Then we assume the peak could be located in + strand and - strand. When we consider it is in + strand, we identify the nearest feature by min(abs(peakStart-featureStart)). While when we consider it is in - strand, we should calculate the distance by min(abs(peakEnd-featureEnd)) as the "End" coordination is actually the start when it is in the - strand no matter it is a feature or a peak.

    If a feature didn't have strand information, it make sense to set PeakLocForDistance="middle".

    But if we have the feature strand information, I don't agree with your approach by setting the parameter PeakLocForDistance to "start" or "end". The function should automatically decide using "start" or "end" considering the feature strand information.

    Bests,
    Guangchuang

    Reply

    Puriney United States Unknow Browser Unknow Os Reply:

    maybe this figure (http://www.nature.com/nmeth/journal/v6/n11s/fig_tab/nmeth.1371_F4.html) from Nature could illustrate the point.

    Reply

  8. ChIPseeker for ChIP peak Annotation | YGC United States WordPress Unknow Os - pingback on April 13, 2014 at 9:02 pm

Leave a Comment


NOTE - You can use these HTML tags and attributes:
<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>