Tag Archives: R

auto-complete in ESS

Auto complete is good, it can save you times in typing and prevent typo sometimes.


RStudio now supports function arguments in auto complete. ESS's auto complete is more advance, it supports help page.

Read more »


proper use of GOSemSim

One day, I am looking for R packages that can analyze PPI and after searching, I found the ppiPre package in CRAN.


The function of this package is not impressive, and I already knew some related works, including http://intscore.molgen.mpg.de/. The authors of this webserver contacted me for the usages of GOSemSim when they developing it.

What makes me curious is that the ppiPre package can calculate GO semantic similarity and supports 20 species exactly like GOSemSim. I opened the source tarball, and surprisingly found that its sources related to semantic similarity calculation are totally copied from GOSemSim.

GOSemSim was firstly released in 2008 Bioconductor 2.4 (at that time, devel version) and published in Bioinformatics in 2010. After compared the sources, I found the sources in ppiPre were copied from GOSemSim version 1.6.8 which released in 2010 Bioconductor 2.6.
Read more »

Yearly Topic Trend in PubMed

I found an R script in R-blogger that can be used to track PubMed trend. The script needs Perl package TGen-EUtils to perform query and it is not available now.

It's not difficult to query Pubmed record in R. We can use RCurl package to fetch and use XML package to parse the downloaded record as shown in stackoverflow.

Before I write my own script, I found that there is a well written package, RISmed, that provides many functions to access the NCBI databases.

I write a wrapper function called getPubmedTrend, which import EUtilsSummary and QueryCount from RISmed, to track PubMed trend. Another function called plotPubmedTrend was also implemented for visualizing the trend. These two functions is available in my toy package, yplots.
Read more »

SIR Model of Epidemics

The SIR model divides the population to three compartments: Susceptible, Infected and Recovered. If the disease dynamic fits the SIR model, then the flow of individuals is one direction from the susceptible group to infected group and then to the recovered group. All individuals are assumed to be identical in terms of their susceptibility to infection, infectiousness if infected and mixing behaviour associated with disease transmission.

We defined:

\(S_t\) = the number of susceptible individuals at time t

\(I_t\) = the number of infected individuals at time t

\(R_t\) = the number of recovered individuals at time t

Suppose on average every infected individual will contact \(\gamma\) person, and \(\kappa\) percent of these \(\gamma\) person will be infected. Then on average there are \(\beta = \gamma \times \kappa\) person will be infected an infected individual.

So with infected number \(I_t\) , they will infected \(\beta I_t\) individuals. Since not all people are susceptible, this number should multiple to the percentage of susceptible individuals. Therefore, \(I_t\) infected individuals will infect \(\beta \frac{S_t}{N} I_t\) individuals.

Another parameter \(\alpha\) describes the percentage of infected individuals to recover in a time period. That is on average, it takes \(1/\alpha\) periods for an infected person to recover.
Read more »

multiple annotation in ChIPseeker

Nearest gene annotation

Almost all annotation software calculate the distance of a peak to the nearest TSS and assign the peak to that gene. This can be misleading, as binding sites might be located between two start sites of different genes or hit different genes which have the same TSS location in the genome.

The function annotatePeak provides option to assign genes with a max distance cutoff and all genes within this distance were reported for each peak.

Read more »