Machine Learning Ex 5.1 - Regularized Linear Regression

The first part of Exercise 5.1 requires implementing a regularized version of linear regression.

Adding a regularization term helps prevent over-fitting when fitting a high-order polynomial.

Plot the data:

x <- read.table("ex5Linx.dat")
y <- read.table("ex5Liny.dat")

x <- x[,1]
y <- y[,1]

require(ggplot2)
d <- data.frame(x=x, y=y)
p <- ggplot(d, aes(x, y)) + geom_point(colour="red", size=3)
print(p)   ## display the scatter plot


I will fit a 5th-order polynomial; the hypothesis is:
\( h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4 + \theta_5 x^5 \)

The idea of regularization is to impose Occam's razor on the solution by scaling down the \( \theta \) values, so that the higher-order features contribute only a small amount to the fit.

For that, the cost function is defined as:
\( J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^n \theta_j^2 \right] \)

Gradient Descent with regularization parameters:

First, I implemented a gradient descent algorithm to find \( \theta \).
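For reference, here is the regularized gradient the update step uses (this is not spelled out in the original exercise text here, but it follows directly from the cost function above); the bias term \( \theta_0 \) is treated separately because it is not regularized:

\( \frac{\partial J}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})\, x_0^{(i)} \)

\( \frac{\partial J}{\partial \theta_j} = \frac{1}{m} \left[ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})\, x_j^{(i)} + \lambda \theta_j \right] \quad \text{for } j \ge 1 \)

This is exactly what the variable *dj* computes in the code below: *tt* is a copy of \( \theta \) with its first element set to zero, so the bias term receives no penalty.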

mapFeature <- function(x, degree=5) {
    ## map x to the polynomial features x^0, x^1, ..., x^degree
    return(sapply(0:degree, function(i) x^i))
}

## hypothesis function
h <- function(theta, x) {
    return(x %*% t(theta))
}

## regularized cost function (theta[1], the bias term, is not penalized)
J <- function(theta, x, y, lambda=1) {
    m <- length(y)
    r <- theta^2
    r[1] <- 0
    j <- 1/(2*m) * (sum((h(theta, x)-y)^2) + lambda*sum(r))
    return(j)
}

## gradient descent with regularization
gradDescent <- function(theta, x, y, alpha=0.1, niter=1000, lambda=1) {
    m <- length(y)
    for (i in 1:niter) {
        tt <- theta
        tt[1] <- 0   ## do not regularize the bias term
        dj <- 1/m * (t(h(theta,x)-y) %*% x + lambda * tt)
        theta <- theta - alpha * dj
    }
    return(theta)
}
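To make the feature mapping concrete, here is a quick illustration (my addition, not part of the original post) of what mapFeature produces: each row contains the powers \( x^0 \) through \( x^5 \) of one observation.

## illustrative only: each row is (1, x, x^2, x^3, x^4, x^5)
round(mapFeature(c(-1, 0, 0.5, 1)), 3)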

Fitting the data with the gradient descent algorithm:

x <- mapFeature(x)
theta <- matrix(rep(0,6), nrow=1)
theta <- gradDescent(theta, x, y)

x.test <- seq(-1, 1, 0.001)
y.test <- mapFeature(x.test) %*% t(theta)

p + geom_line(data=data.frame(x=x.test, y=as.vector(y.test)), aes(x, y), colour="blue")

As shown above, the fitted model follows the data well.
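As a quick sanity check (my addition, not part of the original exercise), we can confirm that the regularized cost J actually dropped during training by evaluating it at the zero initialization and at the fitted \( \theta \):

## illustrative only: the regularized cost should be much lower after training
theta0 <- matrix(rep(0, 6), nrow=1)
J(theta0, x, y, lambda=1)   ## cost at the zero initialization
J(theta, x, y, lambda=1)    ## cost at the fitted theta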

Normal Equation with regularization parameters:
The exercise also requires implementing the normal equation with the regularization term added, that is:
\( \theta = \left( X^T X + \lambda \begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix} \right)^{-1} X^T y \)

## normal equation with regularization
normEq <- function(x, y, lambda) {
    n <- ncol(x)
    ## extra regularization term (the bias term is not penalized)
    r <- lambda * diag(n)
    r[1,1] <- 0
    theta <- solve(t(x) %*% x + r) %*% t(x) %*% y
    return(theta)
}
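Before moving on, a small consistency check (my addition): with \( \lambda = 1 \) the closed-form solution should be close to the \( \theta \) found by gradient descent above, provided gradient descent has run long enough to converge.

## illustrative only: closed-form vs. gradient descent solution at lambda = 1
theta.ne <- normEq(x, y, lambda=1)
cbind(closed.form=as.vector(theta.ne), grad.descent=as.vector(theta))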

I try 3 different lambda values to see how they influence the fit.

require(reshape2)

lambda <- c(0, 1, 10)
theta <- sapply(lambda, normEq, x=x, y=y)

x.test <- seq(-1, 1, 0.001)
yy <- sapply(1:3, function(i) mapFeature(x.test) %*% theta[,i])
yy <- melt(yy)
yy[,1] <- rep(x.test, 3)
colnames(yy) <- c("X", "lambda", "Y")
yy$lambda <- factor(yy$lambda, labels=unique(lambda))
p + geom_line(data=yy, aes(X, Y, group=lambda, colour=lambda))
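To see the shrinkage numerically as well as graphically (this snippet is my addition), print the fitted coefficients side by side; the higher-order terms shrink noticeably as \( \lambda \) grows.

## illustrative only: higher-order coefficients shrink as lambda grows
colnames(theta) <- paste0("lambda=", lambda)
round(theta, 4)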


With lambda=0, the fit (the red line) passes very close to the original points, which is of course over-fitting.

As lambda increases, the fit becomes less tight and more general, which prevents over-fitting.

The figure also suggests that when lambda is too large, the model will under-fit.

Reference:
Machine Learning course (OpenClassroom), Exercise 5


Comments:

  1. I'm having trouble replicating your results. In your code shown I have two questions: (1) when is the cost function, J, called? and (2) shouldn't the "i:niter" in gradDescent function be "1:niter"?

    andy replies:

    I forgot to mention: I really enjoy your Open Classroom posts!

    ygc replies:

    1. Yes, the cost function is defined as J.
    2. You are right, sorry for the typo.

    I am glad that you find it useful.

    andy replies:

    But shouldn't J be called somewhere in the code?

    ygc replies:

    Actually, only the derivative of J is needed, not J itself;
    see the variable *dj* in the function *gradDescent*.

  2. Running your code as posted I get "Error: object 'm' not found", and "Error: object 'lambda' not found" if I set m to some dummy value -- so my question really is: where are those variables set or initialized, besides in *J*?

    ygc replies:

    Corrected, please run it again.

    The issue was caused by the definition of the function *gradDescent*;
    I forgot to set those values inside the function.

    andy replies:

    Thanks! It works fine now. Sorry for the pestering :)

  3. In the gradDescent function definition, shouldn't it say tt[1] <- 0 instead of tt[0] <- 0?

    ygc replies:

    Yes.
