Category Archives: Biology

install 454 GS Data Analysis Software on ubuntu

Roche's installer is a catastrophe: they only provide rpm packages of the software for the 454 GS FLX (version 2.9). Although the package contains setup.sh, the script is useless since it is actually a binary payload.

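I ran setup.sh, and it threw an error about not finding /sbin/lspci. In Debian-derived distributions, the lspci command is not installed in /sbin (it is typically in /usr/bin; which lspci shows the actual path). This issue is easy to solve by adding a soft link at /sbin/lspci that points to the real location.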

The second error message that pops up says: "Error: Could not execute command: type rocks 2>&1". I used the command sudo ln -s /bin/true /bin/rocks to work around it.

The third error is about missing libraries: zlib.i386, libXi.i386, libXtst.i386, and libXaw.i386.
Since my OS is 64-bit Ubuntu 14.04 LTS, I used sudo apt-get install ia32-libs to install the 32-bit compatibility libraries.

The fourth error is weird: the installer can't find /bin/sh, which is available on all Unix-like systems. Since Debian links sh to dash, while most other Linux distributions link it to bash, I changed the link to bash, but the error persisted.

I couldn't figure out how to solve the fourth error, so I tried to install the rpm packages directly with rpm -ivh, but the error did not change.

insertion size

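For sequencing, the DNA has to be fragmented to build a library. Adaptors are ligated to these fragments so that they can be amplified. Illumina sequencing comes in two modes, single end and paired end, which read the fragment from one end or from both ends, respectively.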

fragment                  ========================================
fragment + adaptors    ~~~========================================~~~
SE read                   --------->
PE reads                R1--------->                    <---------R2
unknown gap                         ....................

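The insert is not the unknown gap between R1 and R2. The concept predates NGS: it was already in use when we built vectors in E. coli, and it refers to the sequence between the adaptors. The unknown gap is called the inner mate instead: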

PE reads      R1--------->                    <---------R2
fragment     ~~~========================================~~~
insert          ========================================
inner mate                ....................
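
The relationship between the lengths in the diagram is simple arithmetic. A minimal sketch in R, assuming R1 and R2 have the same length; the function name inner_mate and the numbers are made up for illustration and do not come from a real library:

```r
# Hypothetical values for illustration only
read_length <- 100   # length of R1 and of R2, in bp
insert_size <- 300   # sequence between the adaptors, in bp

# inner mate = insert minus the two sequenced ends
inner_mate <- function(insert_size, read_length) {
  insert_size - 2 * read_length
}

inner_mate(insert_size, read_length)   # 100
inner_mate(150, read_length)           # -50
```

A negative inner mate means that the insert is shorter than the two reads combined, so R1 and R2 overlap.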


The spread of new mutations

Genetic drift is the term used in population genetics to refer to the statistical drift over time of gene frequencies in a population due to random sampling effects in the formation of successive generations.

In a narrower sense, genetic drift refers to the expected population dynamics of neutral alleles (those defined as having no positive or negative impact on reproductive fitness), which are predicted to eventually be lost (0% frequency) or fixed (100% frequency) in the absence of other mechanisms affecting allele distributions.

The most important keyword in the definition of genetic drift is random sampling effects. The figure below illustrates this idea. The surviving individuals do not necessarily have a selective advantage. They are sampled at random.
[Figure: beetles_mech3]
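
Random sampling alone is enough to move allele frequencies, which a small simulation can show. The following is a minimal sketch of a Wright-Fisher style model, assuming a diploid population of constant size N, non-overlapping generations, and a neutral allele; the function name simulate_drift and all parameter values are illustrative choices, not part of the original post:

```r
# Neutral drift: each generation is a random sample of the previous one.
simulate_drift <- function(N = 100, p0 = 0.5, generations = 200) {
  freq <- numeric(generations + 1)
  freq[1] <- p0
  for (g in seq_len(generations)) {
    # the 2N allele copies of the next generation are drawn at random
    # according to the current allele frequency
    copies <- rbinom(1, size = 2 * N, prob = freq[g])
    freq[g + 1] <- copies / (2 * N)
  }
  freq
}

set.seed(1)
# five independent populations, one column per population
traj <- replicate(5, simulate_drift())

# to look at the trajectories:
# matplot(traj, type = "l", xlab = "generation", ylab = "allele frequency")
```

No fitness term appears anywhere in the code, yet each trajectory wanders. Once a frequency reaches 0 or 1 it stays there, because sampling from a population with a single allele can only return that allele.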

why clusterProfiler fails

Recently, several comments have reported that clusterProfiler sometimes fails in KEGG enrichment analysis.

kaji331 compared clusterProfiler with GeneAnswers and found that clusterProfiler gives larger p values. That result prompted me to test it myself.

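library(GeneAnswers)
data('humanGeneInput')
data('humanExpr')  # expression profile passed to geneAnswersBuilder below
# KEGG enrichment with GeneAnswers (hypergeometric test)
y <- geneAnswersBuilder(humanGeneInput, 'org.Hs.eg.db', 
                        categoryType='KEGG', testType='hyperG', 
                        pvalueT=0.1, geneExpressionProfile=humanExpr, 
                        verbose=FALSE)
yy <- y@enrichmentInfo
 
library(clusterProfiler)
# KEGG enrichment with clusterProfiler on the same gene list
x <- enrichKEGG(humanGeneInput$GeneID, pvalueCutoff=0.2, 
                qvalueCutoff=0.2, minGSSize=1)
xx <- summary(x)
 
# strip the "hsa" prefix so the IDs match the row names used by GeneAnswers
id <- sub("hsa", "", xx$ID)
idx <- id %in% rownames(yy)
 
# p values of the pathways reported by both packages
p.clusterProfiler <- xx$pvalue[idx]
p.GeneAnswers <- yy[id[idx],]$"p value"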
> cor(p.clusterProfiler, p.GeneAnswers)
[1] 0.9996165
> p.clusterProfiler - p.GeneAnswers
 [1]  1.029789e-04 -3.588252e-05 -4.623010e-05  1.079117e-04 -1.075746e-04
 [6] -1.077398e-04 -3.774637e-04 -2.849278e-04 -4.197993e-04  7.588155e-04
[11] -3.702141e-04  2.314721e-03 -5.695641e-04 -5.940830e-04 -4.923697e-04
[16] -5.560738e-04 -5.884079e-04  2.011138e-03

Here, I used the dataset humanGeneInput, provided by GeneAnswers. GeneAnswers reports 19 pathways with p values below 0.1, and clusterProfiler reports 18. Those 18 pathways are shared between the two, and their p values are highly correlated, with very small differences.
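
For readers unfamiliar with the test both packages are asked to run here (testType='hyperG'), the following is a minimal sketch of a hypergeometric over-representation p value in base R. It assumes the standard formulation, P(X >= k); the counts are hypothetical and are not taken from humanGeneInput, and this is not the internal code of either package:

```r
# Hypothetical counts for illustration only
N <- 20000  # genes in the background (universe)
M <- 150    # background genes annotated to the pathway
n <- 300    # genes in the input list
k <- 10     # input genes annotated to the pathway

# P(X >= k): probability of drawing k or more pathway genes by chance
p_enrich <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)

# the same tail probability, summed explicitly from the point probabilities
p_check <- sum(dhyper(k:min(n, M), M, N - M, n))
all.equal(p_enrich, p_check)
```

The p value depends on all four counts, including the background size N and the annotation count M, so two tools only agree exactly when they use the same values for these.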

enrichment map

In PLOB's QQ group, someone asked how to change the color of an enrichment map in Cytoscape. I was very curious how an enrichment map can help to interpret enrichment results. It took me 2 hours to implement it in R, and I was surprised that the enrichment map turned out better than anticipated.

[Figure: Screenshot 2014-07-30 22.20.07]

Now, in the development versions of clusterProfiler, DOSE, and ReactomePA, you can use the enrichMap function to generate an enrichment map from enrichment results obtained by the hypergeometric test or by gene set enrichment analysis.
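
The general idea behind an enrichment map is a network whose nodes are enriched gene sets and whose edges connect gene sets that share genes. A minimal base R sketch of that idea, assuming Jaccard similarity as the overlap measure; the gene sets and the helper jaccard are hypothetical, and this is not the code used by enrichMap:

```r
# Hypothetical gene sets for illustration only
gene_sets <- list(
  pathwayA = c("g1", "g2", "g3", "g4"),
  pathwayB = c("g3", "g4", "g5"),
  pathwayC = c("g6", "g7")
)

# Jaccard similarity: shared genes / all genes in either set
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# one row per pair of gene sets
pairs <- t(combn(names(gene_sets), 2))
edges <- data.frame(
  from = pairs[, 1],
  to = pairs[, 2],
  similarity = vapply(
    seq_len(nrow(pairs)),
    function(i) jaccard(gene_sets[[pairs[i, 1]]], gene_sets[[pairs[i, 2]]]),
    numeric(1)
  ),
  stringsAsFactors = FALSE
)

# keep only pairs that share genes; these become the edges of the map
edges <- edges[edges$similarity > 0, ]
edges
```

Here only pathwayA and pathwayB end up connected, since they share g3 and g4. Related pathways cluster together in the resulting network, which is what makes the map easier to read than a flat table of terms.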