Exercise 4 required implementing Logistic Regression using Newton's Method.

The dataset in use is 80 students and their grades of 2 exams, 40 students were admitted to college and the other 40 students were not. We need to implement a binary classification model to estimates college admission based on the student's scores on these two exams.

**plot the data**

^{?}View Code RSPLUS

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 | x <- read.table("ex4x.dat",header=F, stringsAsFactors=F) x <- cbind(rep(1, nrow(x)), x) colnames(x) <- c("X0", "Exam1", "Exam2") x <- as.matrix(x) y <- read.table("ex4y.dat",header=F, stringsAsFactors=F) y <- y[,1] ## plotting data d <- data.frame">data.frame(x, y = factor(y, levels=c(0,1), labels=c("Not admitted","Admitted" ) ) ) require(ggplot2) p <- ggplot(d, aes(x=Exam1, y=Exam2)) + geom_point(aes(shape=y, colour=y)) + xlab("Exam 1 score") + ylab("Exam 2 score") |

**Logistic Regression**

We first need to define our Hypothesis Function that return values between[0,1]，suitable for binary classification.

\(h_\theta(x) = g(\theta^T x) = \frac{1}{ 1 + e ^{- \theta^T x} }\)

function g is the sigmoid function, and function h return the probability of y=1：

\( h_\theta(x) = P (y=1 | x; \theta) \)

What we need is to compute \( \theta \) ，to find out the proper Hypothesis Function.

Similar to the linear regression，We defined the cost function, which estimate the error of hypothesis function fitting the sample data, at a given \( \theta\) .

The cost function was defined as:

\(J(\theta) = \frac{1}{m} \sum_{i=1}^m ((-y)log(h_\theta(x)) - (1 - y) log(1- h_\theta(x)) )\)

To determine the most suitable hypothesis function, we need to find the \( \theta\) value which minimize the value of \(J(\theta)\) . This can be achieved by the Newton's method, by finding the root of the derivative function of the cost function.

And the \( \theta\) can be updated by:

\(\theta^{(t+1)} = \theta^{(t)} - H^{-1} \nabla_{\theta}J\)

the gradient and Hessian are defined as:

\(\nabla_{\theta}J = \frac{1}{m} \sum_{i=1}^m (h_\theta(x) - y) x\)

\(H = \frac{1}{m} \sum_{i=1}^m [h_\theta(x) (1 - h_\theta(x)) x^T x]\)

The above equations were implemented using R:

^{?}View Code RSPLUS

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 | ### Newton's Method ## sigmoid function g <- function(z) { 1/(1+exp(-z)) } ## hypothesis function h <- function(theta, x) { g(x %*% theta) } ## cost function J <- function(theta, x, y) { m <- length(y) s <- sapply(1:m, function(i) y[i]*log(h(theta,x[i,])) + (1-y[i])*log(1-h(theta,x[i,])) ) j <- -1/m * sum(s) return(j) } ## gradient grad <- function(theta, x, y) { m <- length(y) g <- 1/m * t(x) %*% (h(theta,x)-y) return(g) } ## Hessian Hessian <- function(theta, x) { m <- nrow(x) H <- 1/m * t(x) %*% x * diag(h(theta,x)) * diag(1-h(theta,x)) return(H) } |

The first question need to determine how many iteration until convergence.

^{?}View Code RSPLUS

1 2 3 4 5 6 7 8 9 10 11 12 | theta <- rep(0, ncol(x)) j <- rep(0,10) for (i in 1:10) { theta <- theta - solve(Hessian(theta,x)) %*% grad(theta,x,y) j[i] <- J(theta,x,y) } ggplot()+ aes(x=1:10,y=j)+ geom_point(colour="red")+ geom_path()+xlab("Iteration")+ ylab("Cost J") |

As illustrated in the above figure, Newton's method converge very fast, only 4-5 iterations was needed.

The second question:What is the probability that a student with a score of 20 on Exam 1 and a score of 80 on Exam 2 will not be admitted?

> (1 - g(c(1, 20, 80) %*% theta))* 100 [,1] [1,] 64.24722

In our model, we predicted that the probability of the student will not admitted is 64%.

At last, we calculate our classification model based on \( \theta^T x = 0\) , and visualize the fit as below:

^{?}View Code RSPLUS

1 2 3 4 5 | x1 <- c(min(x[,2]),max(x[,2])) x2 <- -1/theta[3,] * (theta[2,]*x1+theta[1,]) a <- (x2[2]-x2[1])/(x1[2]-x1[1]) b <- x2[2]-a*x1[2] p+geom_abline(slope=a, intercept=b) |

Fantastic.

**References:**

Machine Learning Course

Exercise 4

some other efforts in implementing these exercises were collected in

http://www.statalgo.com/stanford-machine-learning/, which provides extra information and is highly recommended.

Reply

Thanks for the mention! Enjoying your posts as well.

Reply

YGC,thank you for your share. I have a problem, please help me at your convenience. When I use the gradient descent method to solve EX4, it can not converge.My code are as follows:

x=read.table("E:/DM/text_book/ML_ex/ex4x.dat", header=F)

y=read.table("E:/DM/text_book/ML_ex/ex4y.dat", header=F)

x=cbind(rep(1,length(length(y))), x)

x=as.matrix(x)

y=y[,1]

cost_f 1e-10 && iter<50000){

cost_old=cost_f(beta)

beta_new=beta-alpha*t(x)%*%(1/(1+exp(-x%*%beta))-y)

cost_new=cost_f(beta_new)

alpha1=alpha

if(cost_new<cost_old) {

cost=cost_old

i=i+1

while(cost_newcost_old){

cost=cost_new

alpha=alpha/2

beta_new=beta-alpha*t(x)%*%(1/(1+exp(-x%*%beta))-y)

cost_new=cost_f(beta_new)

while(cost_new<cost){

cost=cost_new

alpha=alpha/2

beta_new=beta-alpha*t(x)%*%(1/(1+exp(-x%*%beta))-y)

cost_new=cost_f(beta_new)

j=j+1

}

alpha=alpha*2

}

beta=beta-alpha*t(x)%*%(1/(1+exp(-x%*%beta))-y)

cost=cost_f(beta)

epsion=abs(beta)

iter=iter+1

alpha=alpha0

}

print(beta)

print(cost)

print(iter)

print(i)

print(j)

Reply

ygc Reply:

December 23rd, 2011 at 5:13 pm

you may set a lower alpha value.

Reply

hi please any logistic rgerssion implementation using c# thank u

Reply