Tag Archives: R - Page 8

clusterProfiler in Bioconductor 2.8

In recently years, high-throughput experimental techniques such as microarray and mass spectrometry can identify many lists of genes and gene products. The most widely used strategy for high-throughput data analysis is to identify different gene clusters based on their expression profiles. Another commonly used approach is to annotate these genes to biological knowledge, such as Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG), and identify the statistically significantly enriched categories. These two different strategies were implemented in many bioconductor packages, such as Mfuzz and BHC for clustering analysis and GOstats for GO enrichment analysis.

After clustering analysis, researchers not only want to determine whether there is a common theme to a particular gene cluster, but also would like to compare the biological themes among gene clusters, which have different expression profiles.

There is no existing tools to bridge this gap, and I have designed an R package clusterProfiler, for comparing functional profiles among gene clusters.

For any further details, please refer to: http://bioconductor.org/packages/devel/bioc/html/clusterProfiler.html

m4s0n501

Machine Learning Ex2 - Linear Regression

Thanks to the post by al3xandr3, I found OpenClassroom. In addition, thanks to Andrew Ng and his lectures, I took my first course in machine learning. These videos are quite easy to follow. Exercise 2 requires implementing gradient descent algorithm to model data with linear regression.
Read more »

The easiest way to get UTR sequence

I just figure out the way to query UTR sequences from ensembl by biomart tool.

It is very simple compared with using bioperl to parse gbk files to extract UTR sequences.

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
require(biomaRt)
require(org.Hs.eg.db)
 
ensembl = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
 
eg <- mappedkeys(org.Hs.egGO)
 
utr <- getSequence(id=eg, type="entrezgene", seqType="3utr", mart=ensembl)
 
outfile <- file("human-3utr.fa", "w")
for (i in 1:nrow(utr)) {
	h = paste(c(">", utr[i,2]), collapse="")
	writeLines(h, outfile)
	writeLines(utr[i,1], outfile)
}
close(outfile)

Estimate Probability and Quantile

Simple root finding and one dimensional integrals algorithms were implemented in previous posts.

These algorithms can be used to estimate the cumulative probabilities and quantiles.

Here, take normal distribution as an example.

Normal distribution is defined as:

#probability density function
y.dnorm <- function(x, mean=0, sd=1) exp(-(x-mean)^2/(2*sd^2))/sqrt(2*pi*sd^2) 

The cumulative probabilities can be estimated by integrating the PDF function. Here, using function *simpson_v2*, which implemented in previous post, for integral calculation.

y.cdf <- function(pdf=y.dnorm, q) simpson_v2(pdf, -Inf, q)

The quantile function is mathematically the inverse of the cumulative distribution function. Here, the quantile was estimated by using newton-raphson algorithm to find the root of function CDF(q) - p = 0.

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
y.qf <- function(pdf=y.dnorm, p, tol=1e-7, niter=100) {
	x0 = 0
	for (i in 1:niter) {
		fx <- y.cdf(pdf, x0) - p
		x = x0 - fx/pdf(x0)
		if (abs(fx) < tol)
			return(x)
		x0 = x
	}
	stop("exceeded allowed number of iterations")	
}
> y.dnorm(1)
[1] 0.2419707
> x=y.cdf(pdf=y.dnorm, q=1.64)
> y.qf(pdf=y.dnorm, p=x)
[1] 1.64
> qnorm(p=x)
[1] 1.636615

Single variable optimization

Optimization means to seek minima or maxima of a funtion within a given defined domain.

If a function reach its maxima or minima, the derivative at that point is approaching to 0. If we apply Newton-Raphson method for root finding to f', we can get the optimizing f.

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
f2df < - function(fun) {
	fun.list = as.list(fun)
	var <- names(fun.list[1])
	fun.exp = fun.list[[2]] 
	diff.fun = D(fun.exp, var) 
	df = list(x=0, diff.fun) 
	df = as.function(df) 
	return(df)
}
 
newton <- function(fun, x0, tol=1e-7, niter=100) { 
	df = f2df(fun) 
	for (i in 1:niter) { 
		x = x0 - fun(x0)/df(x0) 
		if (abs(fun(x)) < tol) 
			return(x) 
		x0 = x      
	} 
	stop("exceeded allowed number of iterations") 
}
 
newton_optimize <- function(fun, x0, tol=1e-7, niter=100) {
	df <- f2df(fun)
	x = newton(df, x0, tol, niter)
	ddf <- f2df(df)
	if (ddf(x) > 0) {
		cat ("minima:\t", x, "\n")
	} else {
		cat ("maxima:\t", x, "\n")
	}
	return(x)
}

Read more »