The first part of Exercise 5.1 requires implementing a regularized version of linear regression.

Adding a regularization parameter can prevent over-fitting when fitting a high-order polynomial.

Plot the data:

```r
x <- read.table("ex5Linx.dat")
y <- read.table("ex5Liny.dat")

x <- x[,1]
y <- y[,1]

require(ggplot2)
d <- data.frame(x=x,y=y)
p <- ggplot(d, aes(x,y)) + geom_point(colour="red", size=3)
```

I will fit a 5th-order polynomial; the hypothesis is:

\( h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4 + \theta_5 x^5 \)
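To make the hypothesis concrete, here is a minimal sketch that evaluates a polynomial hypothesis as a matrix product. It uses degree 2 instead of 5 to keep the numbers small; the helper mirrors the mapFeature function defined later in this post, and x.demo and theta.demo are made-up values for illustration only.

```r
## Build the polynomial features x^0, x^1, ..., x^degree, one column per power.
mapFeature <- function(x, degree = 5) sapply(0:degree, function(i) x^i)

x.demo <- c(2, 3)                        # made-up inputs
X.demo <- mapFeature(x.demo, degree = 2) # one row per input
theta.demo <- c(1, 0, 2)                 # made-up parameters: h(x) = 1 + 0*x + 2*x^2

## The hypothesis for every input at once is a single matrix product.
h.demo <- as.vector(X.demo %*% theta.demo)
print(X.demo)
print(h.demo)
```

Each row of X.demo holds the powers of one input, so the matrix product computes the hypothesis for all examples in one step.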

The idea of regularization is to impose Occam's razor on the solution: penalizing large values of \( \theta \) shrinks the parameters, so the higher-order features contribute only slightly to the fit.

For that, the cost function is defined as:

\( J(\theta) = \frac{1}{2m} [\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^n \theta_j^2] \)
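For reference, differentiating this cost with respect to each parameter gives the gradient that a gradient descent update needs. This is a sketch derived from the cost above, assuming \( x_j^{(i)} \) denotes the j-th polynomial feature of the i-th example (so \( x_0^{(i)} = 1 \)); \( \theta_0 \) is not penalized because the regularization sum starts at index 1:

\( \frac{\partial J}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_0^{(i)} \)

\( \frac{\partial J}{\partial \theta_j} = \frac{1}{m} [\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} + \lambda \theta_j], \quad j = 1, \dots, n \)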

**Gradient Descent with regularization parameters:**

First, I implemented a gradient descent algorithm to find \( \theta \).

```r
## map a vector x to the polynomial features x^0, x^1, ..., x^degree
mapFeature <- function(x, degree=5) {
    return(sapply(0:degree, function(i) x^i))
}

## hypothesis function
h <- function(theta, x) {
    #sapply(1:m, function(i) theta %*% x[i,])
    toReturn <- x %*% t(theta)
    return(toReturn)
}

## cost function
J <- function(theta, x, y, lambda=1) {
    m <- length(y)
    r <- theta^2
    r[1] <- 0
    ## the regularization term is scaled by 1/(2m) too, as in the formula for J
    j <- 1/(2*m) * (sum((h(theta, x)-y)^2) + lambda*sum(r))
    return(j)
}

gradDescent <- function(theta, x, y, alpha=0.1, niter=1000, lambda=1) {
    m <- length(y)
    for (i in 1:niter) {
        ## theta_0 is not regularized
        tt <- theta
        tt[1] <- 0
        dj <- 1/m * (t(h(theta,x)-y) %*% x + lambda * tt)
        theta <- theta - alpha * dj
    }
    return(theta)
}
```
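The cost function J is never called by the descent loop itself, but it is handy for checking the gradient. The following is a minimal sketch of such a check, assuming made-up data rather than the exercise files; cost, costGrad and numGrad are illustrative names, not part of the code above. It compares the analytic gradient with central finite differences of the cost.

```r
mapFeature <- function(x, degree = 5) sapply(0:degree, function(i) x^i)

## Regularized cost, with theta as a plain numeric vector.
cost <- function(theta, X, y, lambda = 1) {
    m <- length(y)
    r <- theta^2
    r[1] <- 0                                   # the intercept is not regularized
    (sum((X %*% theta - y)^2) + lambda * sum(r)) / (2 * m)
}

## Analytic gradient of the cost.
costGrad <- function(theta, X, y, lambda = 1) {
    m <- length(y)
    tt <- theta
    tt[1] <- 0
    as.vector(t(X) %*% (X %*% theta - y) + lambda * tt) / m
}

## Central finite differences: perturb one coordinate at a time.
numGrad <- function(f, theta, eps = 1e-6) {
    sapply(seq_along(theta), function(k) {
        e <- replace(numeric(length(theta)), k, eps)
        (f(theta + e) - f(theta - e)) / (2 * eps)
    })
}

## Made-up data, used only for this check.
set.seed(1)
x.demo <- seq(-1, 1, length.out = 7)
y.demo <- sin(2 * x.demo) + rnorm(7, sd = 0.1)
X.demo <- mapFeature(x.demo)
theta.demo <- c(0.5, -0.2, 0.3, 0.1, -0.4, 0.2)

analytic <- costGrad(theta.demo, X.demo, y.demo)
numeric.est <- numGrad(function(th) cost(th, X.demo, y.demo), theta.demo)
max.diff <- max(abs(analytic - numeric.est))
print(max.diff)
```

If the two gradients disagree by more than rounding error, the cost and the update rule are not consistent with each other.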

**Fitting the data with the gradient descent algorithm:**

```r
x <- mapFeature(x)
theta <- matrix(rep(0,6), nrow=1)
theta <- gradDescent(theta, x, y)

x.test <- seq(-1,1, 0.001)
y.test <- mapFeature(x.test) %*% t(theta)

## the fitted curve gets its own data frame, since the plot's data holds only the training points
p + geom_line(data=data.frame(x=x.test, y=as.vector(y.test)), aes(x, y), colour="blue")
```

The resulting plot shows that the fitted model follows the data well.
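Besides looking at the plot, one way to check that gradient descent has converged is to record the cost after every update. Below is a minimal sketch under stated assumptions: the data are made up, and gradDescentTrace is an illustrative variant of the descent loop, not the function used above.

```r
mapFeature <- function(x, degree = 5) sapply(0:degree, function(i) x^i)

cost <- function(theta, X, y, lambda = 1) {
    m <- length(y)
    r <- theta^2
    r[1] <- 0                                   # the intercept is not regularized
    (sum((X %*% theta - y)^2) + lambda * sum(r)) / (2 * m)
}

## Gradient descent that also records the cost after each update.
gradDescentTrace <- function(theta, X, y, alpha = 0.1, niter = 1000, lambda = 1) {
    m <- length(y)
    history <- numeric(niter)
    for (i in seq_len(niter)) {
        tt <- theta
        tt[1] <- 0
        grad <- as.vector(t(X) %*% (X %*% theta - y) + lambda * tt) / m
        theta <- theta - alpha * grad
        history[i] <- cost(theta, X, y, lambda)
    }
    list(theta = theta, history = history)
}

## Made-up data, used only for this illustration.
set.seed(1)
x.demo <- seq(-1, 1, length.out = 7)
y.demo <- sin(2 * x.demo) + rnorm(7, sd = 0.1)

fit <- gradDescentTrace(rep(0, 6), mapFeature(x.demo), y.demo)
print(fit$history[c(1, 10, 100, 1000)])
```

A cost that keeps falling and then flattens out suggests the step size alpha is small enough; a cost that grows means alpha should be reduced.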

**Normal Equation with regularization parameters:**

The exercise also requires implementing the normal equation with the regularization term added, that is:

\( \theta = (X^T X + \lambda \begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix})^{-1} X^T y \)

```r
## normal equations
normEq <- function(x, y, lambda) {
    n <- ncol(x)
    ## extra regularization terms
    r <- lambda * diag(n)
    r[1,1] <- 0
    theta <- solve(t(x) %*% x + r) %*% t(x) %*% y
    return(theta)
}
```
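A common variation is to hand the right-hand side to solve() directly instead of forming the inverse explicitly, which is generally preferable numerically. The sketch below assumes made-up data; normEqSolve is an illustrative name. It also checks that the gradient of the regularized cost vanishes at the returned solution.

```r
mapFeature <- function(x, degree = 5) sapply(0:degree, function(i) x^i)

## Regularized normal equation, solving the linear system directly.
normEqSolve <- function(X, y, lambda) {
    r <- lambda * diag(ncol(X))
    r[1, 1] <- 0                                # do not penalize the intercept
    solve(crossprod(X) + r, crossprod(X, y))    # crossprod(X) is t(X) %*% X
}

## Made-up data, used only for this illustration.
set.seed(1)
x.demo <- seq(-1, 1, length.out = 7)
y.demo <- sin(2 * x.demo) + rnorm(7, sd = 0.1)
X.demo <- mapFeature(x.demo)

lambda.demo <- 1
theta.demo <- normEqSolve(X.demo, y.demo, lambda.demo)

## At the minimum of the cost, the gradient should be zero up to rounding error.
reg <- c(0, theta.demo[-1])
grad.demo <- as.vector(crossprod(X.demo, X.demo %*% theta.demo - y.demo) +
                       lambda.demo * reg) / length(y.demo)
print(max(abs(grad.demo)))
```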

I tried three different lambda values to see how they influence the fit.

```r
## melt() is provided by the reshape2 package (assumed here)
require(reshape2)

lambda <- c(0,1,10)
theta <- sapply(lambda, normEq, x=x, y=y)
x.test <- seq(-1,1, 0.001)
yy <- sapply(1:3, function(i) mapFeature(x.test) %*% theta[,i])
yy <- melt(yy)
yy[,1] <- rep(x.test, 3)
colnames(yy) <- c("X", "lambda", "Y")
yy$lambda <- factor(yy$lambda, labels=unique(lambda))
p+geom_line(data=yy,aes(X,Y, group=lambda, colour=lambda))
```

With lambda=0, the fit follows the original points very closely (the red line), and of course it is over-fitting.

As lambda increases, the fit becomes less tight and the model generalizes better, which prevents over-fitting.

The figure also leads to the conclusion that when lambda is too large, the model will under-fit.
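The same trend can be read from numbers instead of a plot. The sketch below assumes made-up data rather than the exercise files, and the names shrink.demo and lambdas are illustrative. For each lambda it reports the size of the penalized parameters and the training error.

```r
mapFeature <- function(x, degree = 5) sapply(0:degree, function(i) x^i)

## Regularized normal equation returning theta as a plain vector.
normEq <- function(X, y, lambda) {
    r <- lambda * diag(ncol(X))
    r[1, 1] <- 0                                # do not penalize the intercept
    as.vector(solve(crossprod(X) + r, crossprod(X, y)))
}

## Made-up data, used only for this illustration.
set.seed(1)
x.demo <- seq(-1, 1, length.out = 7)
y.demo <- sin(2 * x.demo) + rnorm(7, sd = 0.1)
X.demo <- mapFeature(x.demo)

lambdas <- c(0, 1, 10)
shrink.demo <- t(sapply(lambdas, function(l) {
    theta <- normEq(X.demo, y.demo, l)
    c(lambda = l,
      theta.norm = sqrt(sum(theta[-1]^2)),              # size of penalized parameters
      train.mse = mean((X.demo %*% theta - y.demo)^2))  # error on the training points
}))
print(shrink.demo)
```

A shrinking theta.norm together with a growing train.mse is the numerical counterpart of the curve moving from over-fitting towards under-fitting as lambda grows.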

Reference:

Machine Learning Course

Exercise 5

I'm having trouble replicating your results. In your code shown I have two questions: (1) when is the cost function, J, called? and (2) shouldn't the "i:niter" in gradDescent function be "1:niter"?

andy Reply:

October 26th, 2011 at 3:26 am

I forgot to mention: I really enjoy your Open Classroom posts!

ygc Reply:

October 26th, 2011 at 9:48 am

1. yes, the cost function is defined as J.

2. you are right, sorry for the typo.

I am glad that you find it useful.

andy Reply:

October 26th, 2011 at 10:22 am

But shouldn't J be called somewhere in the code?

ygc Reply:

October 26th, 2011 at 10:26 am

Actually, the derivative of J is needed, but not J itself.

You can see it as the variable *dj* in the function *gradDescent*.

Running your code as posted I get "Error: object 'm' not found" and "Error: object 'lambda' not found" if I set m to some dummy value -- so my question really is where are those variables set, or initialized, besides in *J*?

ygc Reply:

October 27th, 2011 at 12:41 pm

Corrected, please run it again.

The issue was caused by the definition of the function *gradDescent*.

I forgot to set the values.

andy Reply:

October 28th, 2011 at 3:11 am

Thanks! It works fine now. Sorry for the pestering

In the gradDescent function definition, shouldn't it say tt[1] <- 0 instead of tt[0] <- 0?

ygc Reply:

January 29th, 2012 at 12:40 am

yes.
