Tag Archives: R - Page 8

The easiest way to get UTR sequence

I just figure out the way to query UTR sequences from ensembl by biomart tool.

It is very simple compared with using bioperl to parse gbk files to extract UTR sequences.

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
require(biomaRt)
require(org.Hs.eg.db)
 
ensembl = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
 
eg <- mappedkeys(org.Hs.egGO)
 
utr <- getSequence(id=eg, type="entrezgene", seqType="3utr", mart=ensembl)
 
outfile <- file("human-3utr.fa", "w")
for (i in 1:nrow(utr)) {
	h = paste(c(">", utr[i,2]), collapse="")
	writeLines(h, outfile)
	writeLines(utr[i,1], outfile)
}
close(outfile)

Estimate Probability and Quantile

Simple root finding and one dimensional integrals algorithms were implemented in previous posts.

These algorithms can be used to estimate the cumulative probabilities and quantiles.

Here, take normal distribution as an example.

Normal distribution is defined as:

#probability density function
y.dnorm <- function(x, mean=0, sd=1) exp(-(x-mean)^2/(2*sd^2))/sqrt(2*pi*sd^2) 

The cumulative probabilities can be estimated by integrating the PDF function. Here, using function *simpson_v2*, which implemented in previous post, for integral calculation.

y.cdf <- function(pdf=y.dnorm, q) simpson_v2(pdf, -Inf, q)

The quantile function is mathematically the inverse of the cumulative distribution function. Here, the quantile was estimated by using newton-raphson algorithm to find the root of function CDF(q) - p = 0.

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
y.qf <- function(pdf=y.dnorm, p, tol=1e-7, niter=100) {
	x0 = 0
	for (i in 1:niter) {
		fx <- y.cdf(pdf, x0) - p
		x = x0 - fx/pdf(x0)
		if (abs(fx) < tol)
			return(x)
		x0 = x
	}
	stop("exceeded allowed number of iterations")	
}
> y.dnorm(1)
[1] 0.2419707
> x=y.cdf(pdf=y.dnorm, q=1.64)
> y.qf(pdf=y.dnorm, p=x)
[1] 1.64
> qnorm(p=x)
[1] 1.636615

Single variable optimization

Optimization means to seek minima or maxima of a funtion within a given defined domain.

If a function reach its maxima or minima, the derivative at that point is approaching to 0. If we apply Newton-Raphson method for root finding to f', we can get the optimizing f.

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
f2df < - function(fun) {
	fun.list = as.list(fun)
	var <- names(fun.list[1])
	fun.exp = fun.list[[2]] 
	diff.fun = D(fun.exp, var) 
	df = list(x=0, diff.fun) 
	df = as.function(df) 
	return(df)
}
 
newton <- function(fun, x0, tol=1e-7, niter=100) { 
	df = f2df(fun) 
	for (i in 1:niter) { 
		x = x0 - fun(x0)/df(x0) 
		if (abs(fun(x)) < tol) 
			return(x) 
		x0 = x      
	} 
	stop("exceeded allowed number of iterations") 
}
 
newton_optimize <- function(fun, x0, tol=1e-7, niter=100) {
	df <- f2df(fun)
	x = newton(df, x0, tol, niter)
	ddf <- f2df(df)
	if (ddf(x) > 0) {
		cat ("minima:\t", x, "\n")
	} else {
		cat ("maxima:\t", x, "\n")
	}
	return(x)
}

Read more »

one-dimensional integrals

p5rn7vb

The foundamental idea of numerical integration is to estimate the area of the region in the xy-plane bounded by the graph of function f(x). The integral was esimated by divide x to small intervals, then add all the small approximations to give a total approximation.

Numerical integration can be done by trapezoidal rule, simpson's rule and quadrature rules. R has a built-in function integrate, which performs adaptive quadrature.

Trapezoidal rule works by approximating the region under the graph f(x) as a trapezoid and calculating its area.

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
trapezoid < - function(fun, a, b, n=100) {
	# numerical integral of fun from a to b
	# using the trapezoid rule with n subdivisions
	# assume a < b and n is a positive integer
	h <- (b-a)/n
	x <- seq(a, b, by=h)
	y <- fun(x)
	s <- h * (y[1]/2 + sum(y[2:n]) + y[n+1]/2)
	return(s)
}

Simpson's rule subdivides the interval [a,b] into n subintervals, where n is even, then on each consecutive pairs of subintervals, it approximates the behaviour of f(x) by a parabola (polynomial of degree 2) rather than by the straight lines used in the trapezoidal rule.

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
simpson < - function(fun, a, b, n=100) {
	# numerical integral using Simpson's rule
	# assume a < b and n is an even positive integer
	h <- (b-a)/n
	x <- seq(a, b, by=h)
	if (n == 2) {
		s <- fun(x[1]) + 4*fun(x[2]) +fun(x[3])
	} else {
		s <- fun(x[1]) + fun(x[n+1]) + 2*sum(fun(x[seq(2,n,by=2)])) + 4 *sum(fun(x[seq(3,n-1, by=2)]))
	}
	s <- s*h/3
	return(s)
}

Read more »

Project Euler -- Problem 187

http://projecteuler.net/index.php?section=problems&id=187

A composite is a number containing at least two prime factors. For example, 15 = 3 × 5; 9 = 3 × 3; 12 = 2 × 2 × 3.

There are ten composites below thirty containing precisely two, not necessarily distinct, prime factors: 4, 6, 9, 10, 14, 15, 21, 22, 25, 26.

How many composite integers, n < 10^(8), have precisely two, not necessarily distinct, prime factors?

library(gmp)
bign <- 10^8
n <- 1:floor((bign -1)/2)
p=1:10000
p = p[as.logical(isprime(p))]
for (i in p) {
	n <- n[as.logical(n %% i)]
}
idx <- as.logical(isprime(n))
allp <- c(p,n[idx])
count <- 0
for (i in allp[allp < sqrt(bign)]) {
	count <- count + length(allp[allp < bign/i])
	allp = allp[-1]
}
print(count)

----
[1] 17427258
user system elapsed
449.67 72.40 522.87

Not fast enough...