Tag Archives: R - Page 8

Machine Learning Ex2 - Linear Regression

Thanks to the post by al3xandr3, I found OpenClassroom. In addition, thanks to Andrew Ng and his lectures, I took my first course in machine learning. These videos are quite easy to follow. Exercise 2 requires implementing gradient descent algorithm to model data with linear regression.
Read more »

zv7qrnb

The easiest way to get UTR sequence

I just figure out the way to query UTR sequences from ensembl by biomart tool.

It is very simple compared with using bioperl to parse gbk files to extract UTR sequences.

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
require(biomaRt)
require(org.Hs.eg.db)
 
ensembl = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
 
eg <- mappedkeys(org.Hs.egGO)
 
utr <- getSequence(id=eg, type="entrezgene", seqType="3utr", mart=ensembl)
 
outfile <- file("human-3utr.fa", "w")
for (i in 1:nrow(utr)) {
	h = paste(c(">", utr[i,2]), collapse="")
	writeLines(h, outfile)
	writeLines(utr[i,1], outfile)
}
close(outfile)

Estimate Probability and Quantile

Simple root finding and one dimensional integrals algorithms were implemented in previous posts.

These algorithms can be used to estimate the cumulative probabilities and quantiles.

Here, take normal distribution as an example.

Normal distribution is defined as:

#probability density function
y.dnorm <- function(x, mean=0, sd=1) exp(-(x-mean)^2/(2*sd^2))/sqrt(2*pi*sd^2) 

The cumulative probabilities can be estimated by integrating the PDF function. Here, using function *simpson_v2*, which implemented in previous post, for integral calculation.

y.cdf <- function(pdf=y.dnorm, q) simpson_v2(pdf, -Inf, q)

The quantile function is mathematically the inverse of the cumulative distribution function. Here, the quantile was estimated by using newton-raphson algorithm to find the root of function CDF(q) - p = 0.

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
y.qf <- function(pdf=y.dnorm, p, tol=1e-7, niter=100) {
	x0 = 0
	for (i in 1:niter) {
		fx <- y.cdf(pdf, x0) - p
		x = x0 - fx/pdf(x0)
		if (abs(fx) < tol)
			return(x)
		x0 = x
	}
	stop("exceeded allowed number of iterations")	
}
> y.dnorm(1)
[1] 0.2419707
> x=y.cdf(pdf=y.dnorm, q=1.64)
> y.qf(pdf=y.dnorm, p=x)
[1] 1.64
> qnorm(p=x)
[1] 1.636615

Single variable optimization

Optimization means to seek minima or maxima of a funtion within a given defined domain.

If a function reach its maxima or minima, the derivative at that point is approaching to 0. If we apply Newton-Raphson method for root finding to f', we can get the optimizing f.

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
f2df < - function(fun) {
	fun.list = as.list(fun)
	var <- names(fun.list[1])
	fun.exp = fun.list[[2]] 
	diff.fun = D(fun.exp, var) 
	df = list(x=0, diff.fun) 
	df = as.function(df) 
	return(df)
}
 
newton <- function(fun, x0, tol=1e-7, niter=100) { 
	df = f2df(fun) 
	for (i in 1:niter) { 
		x = x0 - fun(x0)/df(x0) 
		if (abs(fun(x)) < tol) 
			return(x) 
		x0 = x      
	} 
	stop("exceeded allowed number of iterations") 
}
 
newton_optimize <- function(fun, x0, tol=1e-7, niter=100) {
	df <- f2df(fun)
	x = newton(df, x0, tol, niter)
	ddf <- f2df(df)
	if (ddf(x) > 0) {
		cat ("minima:\t", x, "\n")
	} else {
		cat ("maxima:\t", x, "\n")
	}
	return(x)
}

Read more »

one-dimensional integrals

The foundamental idea of numerical integration is to estimate the area of the region in the xy-plane bounded by the graph of function f(x). The integral was esimated by divide x to small intervals, then add all the small approximations to give a total approximation.

Numerical integration can be done by trapezoidal rule, simpson's rule and quadrature rules. R has a built-in function integrate, which performs adaptive quadrature.

Trapezoidal rule works by approximating the region under the graph f(x) as a trapezoid and calculating its area.

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
trapezoid < - function(fun, a, b, n=100) {
	# numerical integral of fun from a to b
	# using the trapezoid rule with n subdivisions
	# assume a < b and n is a positive integer
	h <- (b-a)/n
	x <- seq(a, b, by=h)
	y <- fun(x)
	s <- h * (y[1]/2 + sum(y[2:n]) + y[n+1]/2)
	return(s)
}

Simpson's rule subdivides the interval [a,b] into n subintervals, where n is even, then on each consecutive pairs of subintervals, it approximates the behaviour of f(x) by a parabola (polynomial of degree 2) rather than by the straight lines used in the trapezoidal rule.

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
simpson < - function(fun, a, b, n=100) {
	# numerical integral using Simpson's rule
	# assume a < b and n is an even positive integer
	h <- (b-a)/n
	x <- seq(a, b, by=h)
	if (n == 2) {
		s <- fun(x[1]) + 4*fun(x[2]) +fun(x[3])
	} else {
		s <- fun(x[1]) + fun(x[n+1]) + 2*sum(fun(x[seq(2,n,by=2)])) + 4 *sum(fun(x[seq(3,n-1, by=2)]))
	}
	s <- s*h/3
	return(s)
}

Read more »