## Bug of R package ChIPpeakAnno

I used R package ChIPpeakAnno for annotating peaks, and found that it handle the DNA strand in the wrong way. Maybe the developers were from the computer science but not biology background.

?View Code RSPLUS
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17  > require(ChIPpeakAnno) > packageVersion("ChIPpeakAnno") [1] '2.10.0' > peak <- RangedData(space="chr1", IRanges(24736757, 24737528)) > data(TSS.human.GRCh37) > ap <- annotatePeakInBatch(peak, Annotation=TSS.human.GRCh37) > ap RangedData with 1 row and 9 value columns across 1 space space ranges | peak strand | 1 ENSG00000001461 1 [24736757, 24737528] | 1 + feature start_position end_position insideFeature 1 ENSG00000001461 ENSG00000001461 24742284 24799466 upstream distancetoFeature shortestDistance fromOverlappingOrNearest 1 ENSG00000001461 -5527 4756 NearestStart

In this example, I defined a peak ranging from chr1:24736757 to chr1:24737528 and annotated the peak using ChIPpeakAnno package.

It returns that the nearest gene is ENSG00000001461, whose gene symbol is NIPAL3.

?View Code RSPLUS
 1 2 3 4 5  > require(org.Hs.eg.db) > gene.ChIPpeakAnno <- select(org.Hs.eg.db, key=ap$feature, keytype="ENSEMBL", columns=c("ENSEMBL", "ENTREZID", "SYMBOL")) > gene.ChIPpeakAnno ENSEMBL ENTREZID SYMBOL 1 ENSG00000001461 57185 NIPAL3 When looking at the peak in Genome Browser, I found the nearest gene is STPG1. The gene symbol, STPG1, was converted to ENTREZID for future processing. ?View Code RSPLUS  1 2 3 4  > gene.nearest <- select(org.Hs.eg.db, key="STPG1", keytype="SYMBOL", columns=c("ENSEMBL", "ENTREZID", "SYMBOL")) > gene.nearest SYMBOL ENSEMBL ENTREZID 1 STPG1 ENSG00000001460 90529 We can query the coordination of these two genes, and compare their distances to the peak. ?View Code RSPLUS  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16  > require(TxDb.Hsapiens.UCSC.hg19.knownGene) > knownGene <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by="gene") > > gene.ChIPpeakAnno.info <- knownGene[[gene.ChIPpeakAnno$ENTREZID]] > gene.ChIPpeakAnno.info GRanges with 4 ranges and 2 metadata columns: seqnames ranges strand | tx_id tx_name | [1] chr1 [24742245, 24781314] + | 618 uc010oek.2 [2] chr1 [24742245, 24799473] + | 619 uc001bjh.3 [3] chr1 [24742245, 24799473] + | 620 uc009vrc.3 [4] chr1 [24782628, 24792864] + | 621 uc001bji.3 --- seqlengths: chr1 chr2 ... chrUn_gl000249 249250621 243199373 ... 38502

After getting the information of the gene annotated by ChIPpeakAnno, I also found that the ranges of the gene is slightly different from the one returned by annotatePeakInBatch function. This may due to the variation of different versions and it's not a big deal.

As the gene, NIPAL3, is encoded in + strand, the nearest distance is:

?View Code RSPLUS
 1 2  > min(abs(start(peak) - start(gene.ChIPpeakAnno.info))) [1] 5488

While the gene, STPG1, is encoded in - strand, the end of the gene coordination is actually the start position of the gene and the start of the gene coordination is the end position. So the distance should be calculated by the end coordination.

?View Code RSPLUS
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17  > gene.nearest.info <- knownGene[[gene.nearest$ENTREZID]] > gene.nearest.info GRanges with 6 ranges and 2 metadata columns: seqnames ranges strand | tx_id tx_name | [1] chr1 [24683489, 24700300] - | 4706 uc010oej.2 [2] chr1 [24683489, 24740262] - | 4707 uc001bja.3 [3] chr1 [24683489, 24740262] - | 4708 uc001bjb.3 [4] chr1 [24683489, 24740262] - | 4709 uc001bjc.3 [5] chr1 [24683489, 24741587] - | 4710 uc001bjd.3 [6] chr1 [24695211, 24718169] - | 4711 uc001bjf.3 --- seqlengths: chr1 chr2 ... chrUn_gl000249 249250621 243199373 ... 38502 > min(abs(end(peak) - end(knownGene[[gene.nearest$ENTREZID]]))) [1] 2734

STPG1 is more close to the peak than NIPAL3. Those genes encoded in - strand can't be well-handled by ChIPpeakAnno package.

I look careful to the source code of annotatePeakInBatch function.

?View Code RSPLUS
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17   TSS.ordered <- AnnotationData rm(AnnotationData) if (!length(rownames(TSS.ordered))) { rownames(TSS.ordered) = formatC(1:dim(TSS.ordered)[1], width = nchar(dim(TSS.ordered)[1]), flag = "0") } if (length(TSS.ordered$strand) == length(start(TSS.ordered))) { r2 = cbind(rownames(TSS.ordered), start(TSS.ordered), end(TSS.ordered), as.character(TSS.ordered$strand)) } else { TSS.ordered$strand = rep("+", length(start(TSS.ordered))) r2 = cbind(rownames(TSS.ordered), start(TSS.ordered), end(TSS.ordered), rep("+", length(start(TSS.ordered)))) } colnames(r2) = c("feature_id", "start_position", "end_position", "strand") The AnnotationData object is provided by the package, or query from biomaRt, the length(TSS.ordered$strand) == length(start(TSS.ordered) should be TRUE, and this code should works fine. But when the expression return FALSE, the function should complain this, with warning message showing in the screen or even stop running. But the author just simply assign all the genes encoded in + strand, this is nonsense.

Then for calculating the feauture location:

?View Code RSPLUS
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43   if (FeatureLocForDistance == "middle" || FeatureLocForDistance == "m") { FeatureLoc = unlist(lapply(1:dim(r2)[1], function(i) { round(mean(c(as.numeric(r2[i, 2]), as.numeric(r2[i, 3])))) })) FeatureLocForDistance = "middle" } else if (FeatureLocForDistance == "start" || FeatureLocForDistance == "s") { FeatureLoc = r2[, 2] FeatureLocForDistance = "start" } else if (FeatureLocForDistance == "end" || FeatureLocForDistance == "e") { FeatureLoc = r2[, 3] FeatureLocForDistance = "end" } else if (FeatureLocForDistance == "geneEnd" || FeatureLocForDistance == "g") { FeatureLoc = unlist(lapply(1:dim(r2)[1], function(i) { if (as.character(r2[i, 4]) == "+" || as.character(r2[i, 4]) == "1" || as.character(r2[i, 4]) == "*") { r2[i, 3] } else { r2[i, 2] } })) FeatureLocForDistance = "geneEnd" } else { FeatureLoc = unlist(lapply(1:dim(r2)[1], function(i) { if (as.character(r2[i, 4]) == "+" || as.character(r2[i, 4]) == "1" || as.character(r2[i, 4]) == "*") { r2[i, 2] } else { r2[i, 3] } })) FeatureLocForDistance = "TSS" }

Only "geneEnd" and "TSS" types consider the strand information. For "middle", omits the strand information should be fine, but for "start" and "end", the "start" should be "end" and "end" should be "start" when the gene is located at - strand. Again, the annotatePeakInBatch function omits the strand information of genes/features.

?View Code RSPLUS
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25   TSS.ordered$FeatureLoc = FeatureLoc myPeakList$PeakLoc = PeakLoc plusAnno = TSS.ordered[as.character(TSS.ordered$strand) %in% c("+", "*", "1"), ] minusAnno = TSS.ordered[as.character(TSS.ordered$strand) %in% c("-1", "-"), ] r1 = do.call(rbind, lapply(seq_len(numberOfChromosome), function(i) { chr = allChr[i] if (chr %in% allChr.Anno) { featureStart = as.numeric(TSS.ordered[chr]$FeatureLoc) peakLoc = as.numeric(myPeakList[chr]$PeakLoc) peakStart = as.numeric(start(myPeakList[chr])) peakEnd = as.numeric(end(myPeakList[chr])) name = rownames(myPeakList[chr]) peakRanges = IRanges(start = peakStart, end = peakEnd, names = name) featureID = rownames(TSS.ordered[chr]) featureRanges = IRanges(start = featureStart, end = featureStart, names = featureID) nearestFeature = featureRanges[nearest(peakRanges, featureRanges)] data.frame(name = name, feature_id = names(nearestFeature)) } }))

For identifying the nearest feature, the author use:

nearestFeature = featureRanges[nearest(peakRanges,
featureRanges)]


The "start" of the features should be used when it were encoded in + strand and "end" when features were encoded in - strand. But the author use nearest function which use both "start" and "end" of the interval to calculate the distance by default.

Considering the "start" and "end" records may reverse when the genes/features were encoded in - strand and the way ChIPpeakAnno calcultes the distance is too simple by using nearest function, I can't trust their results.

It's not hard to implement a function to annotate peaks. Preparing the gene annotation and calculating the distances among genes and peaks for finding the nearest gene of the peak. That's it.

#### Related Posts

1. Good and important bug catch! Maybe one more reason to have certain R packages peer-reviewed...

Cheers,
Andrej

Any reseacher in the field of biology, regardless of his/ her previous background, should be equiped with minimum common knowledges of biology, e.g., what is DNA, that are the plus/ minus strands, what are the codons, what are diploids, what are introns...

2. António Domingues

yes. waiting for their bug fix, then.

3. Is there any better alternative or do one best go about implement it oneself?

you may try HOMER

I have develop my own R package for ChIP peak annotation, and the package is now available in Bioconductor. You can try it if you are interested. http://www.bioconductor.org/packages/2.14/bioc/html/ChIPseeker.html

4. "Better to light a candle than curse the darkness", so glad to see you reported the bug. However, referring to it as "sucks" won't win friends. It's always possible the maintainers had a reason for selecting different results from what you expect (or maybe not). R is open source, and "peer review" happens specifically because experts around the world test and report bugs. To answer KPJ: there are many genetics packages, so there may be other distance functions available; but people should feel free to copy&mod code packages as they see fit.

Yes, there may be other distance functions available. But all functions should consider the DNA strand. If the strand information were just omitted, the distance function is incorrect.

I agree with you, and will change the "sucks" to another word. Thanks for your reply.

5. Dear Guangchuang,

Thank you very much for the feedback! Any feedbacks and responses sent to bioconductor@r-project.org are archived and web searchable. I would appreciate if you could send email to Bioconductor list for feedbacks forward.

If you type ?annotatePeakInBatch, you will notice that FeatureLocForDistance is set to TSS by default, meaning that using start of feature when feature is on plus strand and using end of feature when feature is on minus strand. I think this is consistent with your expectation.

The discrepancy described by you is due to the different annotation source used. You used transcript coordinates instead of gene coordinates stored in TSS.human.NCBI36. Another difference is that you used end(peak) to calculate distance while annotatePeakInBatch uses start(peak) by default. As a user, you can set PeakLocForDistance as start, end or middle of the peak. However, the result will be different. Please type ?annotatePeakInBatch to get the details to customize your parameter setting to fit your needs. Also annotation from different databases of different versions will be slightly different as you already pointed out. You can use getAnnotation function in ChIPpeakAnno to obtain the most recent Ensemble annotation. Alternatively, you could download annotation from UCSC genome browser and convert it to RangedData as annotation source. You also can download transcript coordinates to feed into annotatePeakInBatch to get the nearest transcript instead of gene.

To test whether there exists this bug in the annotatePeakInBatch , let us use your example with a simple annotation consisting only two genes with the exact coordinates provided by you instead of using the built-in annotation. Looks like you are interested in peak end to TSS, so I will set PeakLocForDistance = "end" here.

You can see that the results obtained, by customizing the parameters in annoatePeakInBatch, is consistent with what you get.

require(ChIPpeakAnno)
packageVersion("ChIPpeakAnno")
peak <- RangedData(space="chr1", IRanges(24736757, 24737528))

TSS <- RangedData(space="chr1", IRanges(start=c(24742284 ,24683489,24683489,24683489,24695211), end = c( 24799466, 24700300,24740262, 24741587,24718169), names=c("NIPAL3", "STPG1-1","STPG1-2","STPG1-3","STPG1-4")), strand=c("+","-", "-", "-","-"))
TSS
RangedData with 5 rows and 1 value column across 1 space
space               ranges |      strand
|
NIPAL3      chr1 [24742284, 24799466] |           +
STPG1-1     chr1 [24683489, 24700300] |           -
STPG1-2     chr1 [24683489, 24740262] |           -
STPG1-3     chr1 [24683489, 24741587] |           -
STPG1-4     chr1 [24695211, 24718169] |           -
ap <- annotatePeakInBatch(peak, Annotation=TSS, PeakLocForDistance="end")
ap
RangedData with 1 row and 9 value columns across 1 space
space               ranges |        peak      strand     feature start_position end_position insideFeature distancetoFeature shortestDistance
|
1 STPG1-2     chr1 [24736757, 24737528] |           1           -     STPG1-2       24683489     24740262        inside              2734             2734
fromOverlappingOrNearest

1 STPG1-2             NearestStart


Hope this clears things for you. Please do not hesitate to send feedbacks to Bioconductor list if you have additional questions or suggestions.

Many thanks for the detailed examples!

Best regards,

Julie

I don't understand why setting PeakLocForDistance parameter.

My point is it should use peak "start" when feature in + strand and use peak "end" when feature in - strand.

6. Dear Guangchuang,
Thank you very much for the feedback! I posted the clarification. However, it says that "Your comment is awaiting moderation".
I believe that your feedback and my clarification together will help users understand the parameter settings in annotatePeakInBatch more.

Best regards,

Julie

7. Guangchuang,

Thanks for the clarification!

annotatePeakInBatch uses gene "start" when gene is in + strand and use gene "end" when gene is in - strand to calculate distance from peak to TSS. Peaks from ChIP-seq/ChIP-chip is strand-less and peaks are not necessary always at the upstream of genes, so it is rather arbitrary to pick peak start for genes located in + and peak end for genes located in - strand . annotatePeakInBatch lets users to specify PeakLocForDistance (start, middle, end) regardless the gene strand to calculate distance.

If you want to use peak start to calculate distance to genes located in + strand, but use peak end to calculate distance to genes located in - strand, I suggest you create two annotation files, one for + strand and the other for - strand, and specify PeakLocForDistance ="start" for + annotation and PeakLocForDistance="end" for - annotation.

Best regards,

Julie

Let me explain my point of view, if we know the strand of a peak, then what we would like to know is only the nearest feature located in the same strand.

As we both know, peaks are strand-less and we don't have that information.

Then we assume the peak could be located in + strand and - strand. When we consider it is in + strand, we identify the nearest feature by min(abs(peakStart-featureStart)). While when we consider it is in - strand, we should calculate the distance by min(abs(peakEnd-featureEnd)) as the "End" coordination is actually the start when it is in the - strand no matter it is a feature or a peak.

If a feature didn't have strand information, it make sense to set PeakLocForDistance="middle".

But if we have the feature strand information, I don't agree with your approach by setting the parameter PeakLocForDistance to "start" or "end". The function should automatically decide using "start" or "end" considering the feature strand information.

Bests,
Guangchuang