Category Archives: Biology

proper use of GOSemSim

One day, I am looking for R packages that can analyze PPI and after searching, I found the ppiPre package in CRAN.

2014-11-22-215959_895x256_scrot

The function of this package is not impressive, and I already knew some related works, including http://intscore.molgen.mpg.de/. The authors of this webserver contacted me for the usages of GOSemSim when they developing it.

What makes me curious is that the ppiPre package can calculate GO semantic similarity and supports 20 species exactly like GOSemSim. I opened the source tarball, and surprisingly found that its sources related to semantic similarity calculation are totally copied from GOSemSim.

GOSemSim was firstly released in 2008 Bioconductor 2.4 (at that time, devel version) and published in Bioinformatics in 2010. After compared the sources, I found the sources in ppiPre were copied from GOSemSim version 1.6.8 which released in 2010 Bioconductor 2.6.
Read more »

Yearly Topic Trend in PubMed

I found an R script in R-blogger that can be used to track PubMed trend. The script needs Perl package TGen-EUtils to perform query and it is not available now.

It's not difficult to query Pubmed record in R. We can use RCurl package to fetch and use XML package to parse the downloaded record as shown in stackoverflow.

Before I write my own script, I found that there is a well written package, RISmed, that provides many functions to access the NCBI databases.

I write a wrapper function called getPubmedTrend, which import EUtilsSummary and QueryCount from RISmed, to track PubMed trend. Another function called plotPubmedTrend was also implemented for visualizing the trend. These two functions is available in my toy package, yplots.
Read more »

SIR Model of Epidemics

The SIR model divides the population to three compartments: Susceptible, Infected and Recovered. If the disease dynamic fits the SIR model, then the flow of individuals is one direction from the susceptible group to infected group and then to the recovered group. All individuals are assumed to be identical in terms of their susceptibility to infection, infectiousness if infected and mixing behaviour associated with disease transmission.

We defined:

\(S_t\) = the number of susceptible individuals at time t

\(I_t\) = the number of infected individuals at time t

\(R_t\) = the number of recovered individuals at time t

Suppose on average every infected individual will contact \(\gamma\) person, and \(\kappa\) percent of these \(\gamma\) person will be infected. Then on average there are \(\beta = \gamma \times \kappa\) person will be infected an infected individual.

So with infected number \(I_t\) , they will infected \(\beta I_t\) individuals. Since not all people are susceptible, this number should multiple to the percentage of susceptible individuals. Therefore, \(I_t\) infected individuals will infect \(\beta \frac{S_t}{N} I_t\) individuals.

Another parameter \(\alpha\) describes the percentage of infected individuals to recover in a time period. That is on average, it takes \(1/\alpha\) periods for an infected person to recover.
Read more »

multiple annotation in ChIPseeker

m4s0n501

Nearest gene annotation

Almost all annotation software calculate the distance of a peak to the nearest TSS and assign the peak to that gene. This can be misleading, as binding sites might be located between two start sites of different genes or hit different genes which have the same TSS location in the genome.

The function annotatePeak provides option to assign genes with a max distance cutoff and all genes within this distance were reported for each peak.

Read more »

hierachical clustering with mlass

构建phylogenetic tree最简单的方法莫过于UPGMA (unweighted pair group method with arithmetic mean), 因为这个方法假定所有的evolution rate是一样的,构建出来的树是有bias的,所以在phylogenetic research上应用很少,但由于简单,通常会使用它来介绍一些基本概念。

UPGMA实际上是使用average linkage的层次聚类,说到聚类,不得不吐槽一下,当年某人在暨大要开个生物信息学的课,叫我去讲聚类分析,我准备好了slides,结果还没轮到我去讲,那课已经结束了,哥长这么大,就没遇到这么放鸽子的!于是我准备的Cluster Analysis and its Applications,从此至终没有讲过。

课表上可以看到,其它所有的课都是在讲怎么使用软件,只有我是准备讲代码的,我觉得既然要讲聚类分析,最起码就应该讲清楚层次聚类和kmeans,而要讲清楚原理,莫过于把最基本的东西,通过结构化的代码实现,通过讲代码来演示计算过程和原理。

kmeans之前实现过,所以是现成的。对于层次聚类,本来想通过分析hclust函数来讲解,结果发现R里的hclust函数实际上是调用了fortran的代码,这种老古董的语言,反正也是看不懂,于是自己写。
Read more »