Category Archives: Biology

ChIPseeker for ChIP peak annotation

ChIPpeakAnno WAS the only R package for ChIP peak annotation. I used it for annotating peak in my recent study.

I found it does not consider the strand information of genes. I reported the bug to the authors, but they are reluctant to change.

So I decided to develop my own package, ChIPseeker, and it's now available in Bioconductor.
Read more »

old habits die hard

Screenshot 2014-01-23 01.10.07
从2011年1月我就在实验室的QQ群里发群邮件说IPI关门,时至今日,已经关门3年了,主页上一直停留在关门大吉的那一刻。
Screenshot 2014-01-23 00.03.17

我不断在邮件里, lab meeting上强调要换成uniprot来搜库,然而时至今日,依然还是有很多的人在使用IPI,想想真可怕,实验室真是100年不更新一下数据啊。
Read more »

Bug of R package ChIPpeakAnno

I used R package ChIPpeakAnno for annotating peaks, and found that it handle the DNA strand in the wrong way. Maybe the developers were from the computer science but not biology background.

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
> require(ChIPpeakAnno)
> packageVersion("ChIPpeakAnno")
[1] '2.10.0'
> peak < - RangedData(space="chr1", IRanges(24736757, 24737528))
> data(TSS.human.GRCh37)
> ap < - annotatePeakInBatch(peak, Annotation=TSS.human.GRCh37)
> ap
RangedData with 1 row and 9 value columns across 1 space
                     space               ranges |        peak      strand
                  <factor>            <iranges> | <character> </character><character>
1 ENSG00000001461        1 [24736757, 24737528] |           1           +
                          feature start_position end_position insideFeature
                      </character><character>      <numeric>    </numeric><numeric>   <character>
1 ENSG00000001461 ENSG00000001461       24742284     24799466      upstream
                  distancetoFeature shortestDistance fromOverlappingOrNearest
                          <numeric>        </numeric><numeric>              <character>
1 ENSG00000001461             -5527             4756             NearestStart
</character></numeric></character></numeric></character></iranges></factor>

In this example, I defined a peak ranging from chr1:24736757 to chr1:24737528 and annotated the peak using ChIPpeakAnno package.

It returns that the nearest gene is ENSG00000001461, whose gene symbol is NIPAL3.

?View Code RSPLUS
1
2
3
4
5
> require(org.Hs.eg.db)
> gene.ChIPpeakAnno < - select(org.Hs.eg.db, key=ap$feature, keytype="ENSEMBL", columns=c("ENSEMBL", "ENTREZID", "SYMBOL"))
> gene.ChIPpeakAnno
          ENSEMBL ENTREZID SYMBOL
1 ENSG00000001461    57185 NIPAL3

When looking at the peak in Genome Browser, I found the nearest gene is STPG1.
Screenshot 2014-01-13 22.00.46
Read more »

Mortal Fibonacci Rabbits

Problem
Recall the definition of the Fibonacci numbers from “Rabbits and Recurrence Relations”, which followed the recurrence relation Fn=Fn−1+Fn−2 and assumed that each pair of rabbits reaches maturity in one month and produces a single pair of offspring (one male, one female) each subsequent month.

Our aim is to somehow modify this recurrence relation to achieve a dynamic programming solution in the case that all rabbits die out after a fixed number of months. See Figure 4 for a depiction of a rabbit tree in which rabbits live for three months (meaning that they reproduce only twice before dying).

Given: Positive integers n≤100 and m≤20.

Return: The total number of pairs of rabbits that will remain after the n-th month if all rabbits live for m months.

Sample Dataset
6 3

Sample Output
4

假定一对初生的兔子需要一个月才成熟,成熟的一对兔子每月产雌雄一对小兔子,Fibonacci数模拟兔子的增长, \(F_n = F_{n-1} + F_{n-2}\) 这条著名的公式,用来讲递归最好。
Rabbits and Recurrence Relations一题中,假定每月产子数为k对小兔子,这样就变成了Fibonacci数的通用形式 \(F_n = F_{n-1} + k \cdot F_{n-2}\) , 从这里可以看出 \(F_{n-1}\) 项是原来的兔子对数,而 \(k \cdot F_{n-2}\) 是新生的兔子对数。

这个模型假定兔子是永生的,显然不符合常理,这道题假设兔子的寿命是m个月,所以每个月的兔子对数就变成了 \(F_n = F_{n-1} + F_{n-2} - F_{n_{die}}\) ,即上个月的兔子对数加上新生的兔子对数再减去死亡的兔子对数。我一开始就是这么做的,根据我从数列中找出的规律 \( F_n = F_{n-1} + F_{n-2} - F_{n-m-1}\) .
Read more »

Consensus and Profile

Problem
A matrix is a rectangular table of values divided into rows and columns. An m×n matrix has m rows and n columns. Given a matrix A, we write Ai,j to indicate the value found at the intersection of row i and column j.

Say that we have a collection of DNA strings, all having the same length n. Their profile matrix is a 4×n matrix P in which P1,j represents the number of times that 'A' occurs in the jth position of one of the strings, P2,j represents the number of times that C occurs in the jth position, and so on (see below).

A consensus string c is a string of length n formed from our collection by taking the most common symbol at each position; the jth symbol of c therefore corresponds to the symbol having the maximum value in the j-th column of the profile matrix. Of course, there may be more than one most common symbol, leading to multiple possible consensus strings.

		A T C C A G C T
		G G G C A A C T
		A T G G A T C T
DNA Strings	A A G C A A C C
		T T G G A A C T
		A T G C C A T T
		A T G G C A C T

	    A   5 1 0 0 5 5 0 0
Profile	    C   0 0 1 4 2 0 6 1
	    G   1 1 6 3 0 1 0 0
	    T   1 5 0 0 0 1 1 6

Consensus	A T G C A A C T

Given: A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.
Return: A consensus string and profile matrix for the collection. (If several possible consensus strings exist, then you may return any one of them.)

Sample Dataset
>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT

Sample Output
ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6

FASTA format was introduced previously in Computing GC content, in which there is no need to store the sequence. In this problem, the situation is a little more complicated since the storage issue should be considered. The storage issue is easy to solve if using high level languages, for instance R, which support memory dynamic allocation.

Here, I will using C to implement readFASTA function. The FASTA file of this problem is easy to parse for all its description lines are in equal length and all the sequence lines are also in equal length. But it is not the true in reality. The description/sequence lines are very common to have unequal lengths, and the sequence lines can contain return characters. All these situations should be considered when implementing a general function.

If we predefined character array to store the description and sequence line, the drawback is obvious. If the array size is not large enough, the function will lost generality, but if it is large, the function will waste a lot of memory.

To solve the storage issue, I use Link List to store description and sequence lines, re-store them in character array and then pack them in SEQ structure. The sequence number in FASTA file is also unknown, so the SEQ structures are also organized as Link List. It's easy to implement a wrapper function to return the SEQ structure in pointer array if necessary.
Read more »