Exercise 4 required implementing Logistic Regression using Newton's Method.
The dataset in use is 80 students and their grades of 2 exams, 40 students were admitted to college and the other 40 students were not. We need to implement a binary classification model to estimates college admission based on the student's scores on these two exams.
plot the data
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 | x <- read.table("ex4x.dat",header=F, stringsAsFactors=F) x <- cbind(rep(1, nrow(x)), x) colnames(x) <- c("X0", "Exam1", "Exam2") x <- as.matrix(x) y <- read.table("ex4y.dat",header=F, stringsAsFactors=F) y <- y[,1] ## plotting data d <- data.frame">data.frame(x, y = factor(y, levels=c(0,1), labels=c("Not admitted","Admitted" ) ) ) require(ggplot2) p <- ggplot(d, aes(x=Exam1, y=Exam2)) + geom_point(aes(shape=y, colour=y)) + xlab("Exam 1 score") + ylab("Exam 2 score") |
Logistic Regression

We first need to define our Hypothesis Function that return values between[0,1],suitable for binary classification.
h_\theta(x) = g(\theta^T x) = \frac{1}{ 1 + e ^{- \theta^T x} }$
function g is the sigmoid function, and function h return the probability of y=1:
h_\theta(x) = P (y=1 | x; \theta) $
What we need is to compute \theta $,to find out the proper Hypothesis Function.
Similar to the linear regression,We defined the cost function, which estimate the error of hypothesis function fitting the sample data, at a given \theta$.
The cost function was defined as:
J(\theta) = \frac{1}{m} \sum_{i=1}^m ((-y)log(h_\theta(x)) - (1 - y) log(1- h_\theta(x)) )$
To determine the most suitable hypothesis function, we need to find the \theta$ value which minimize the value of J(\theta)$. This can be achieved by the Newton's method, by finding the root of the derivative function of the cost function.
And the \theta$ can be updated by:
\theta^{(t+1)} = \theta^{(t)} - H^{-1} \nabla_{\theta}J$
the gradient and Hessian are defined as:
\nabla_{\theta}J = \frac{1}{m} \sum_{i=1}^m (h_\theta(x) - y) x$
H = \frac{1}{m} \sum_{i=1}^m [h_\theta(x) (1 - h_\theta(x)) x^T x]$
The above equations were implemented using R:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 | ### Newton's Method ## sigmoid function g <- function(z) { 1/(1+exp(-z)) } ## hypothesis function h <- function(theta, x) { g(x %*% theta) } ## cost function J <- function(theta, x, y) { m <- length(y) s <- sapply(1:m, function(i) y[i]*log(h(theta,x[i,])) + (1-y[i])*log(1-h(theta,x[i,])) ) j <- -1/m * sum(s) return(j) } ## gradient grad <- function(theta, x, y) { m <- length(y) g <- 1/m * t(x) %*% (h(theta,x)-y) return(g) } ## Hessian Hessian <- function(theta, x) { m <- nrow(x) H <- 1/m * t(x) %*% x * diag(h(theta,x)) * diag(1-h(theta,x)) return(H) } |
The first question need to determine how many iteration until convergence.
1 2 3 4 5 6 7 8 9 10 11 12 | theta <- rep(0, ncol(x)) j <- rep(0,10) for (i in 1:10) { theta <- theta - solve(Hessian(theta,x)) %*% grad(theta,x,y) j[i] <- J(theta,x,y) } ggplot()+ aes(x=1:10,y=j)+ geom_point(colour="red")+ geom_path()+xlab("Iteration")+ ylab("Cost J") |

As illustrated in the above figure, Newton's method converge very fast, only 4-5 iterations was needed.
The second question:What is the probability that a student with a score of 20 on Exam 1 and a score of 80 on Exam 2 will not be admitted?
> (1 - g(c(1, 20, 80) %*% theta))* 100
[,1]
[1,] 64.24722
In our model, we predicted that the probability of the student will not admitted is 64%.
At last, we calculate our classification model based on \theta^T x = 0$, and visualize the fit as below:
1 2 3 4 5 | x1 <- c(min(x[,2]),max(x[,2])) x2 <- -1/theta[3,] * (theta[2,]*x1+theta[1,]) a <- (x2[2]-x2[1])/(x1[2]-x1[1]) b <- x2[2]-a*x1[2] p+geom_abline(slope=a, intercept=b) |
Fantastic.
References:
Machine Learning Course
Exercise 4



some other efforts in implementing these exercises were collected in
http://www.statalgo.com/stanford-machine-learning/, which provides extra information and is highly recommended.
Reply
Thanks for the mention! Enjoying your posts as well.
Reply
YGC,thank you for your share. I have a problem, please help me at your convenience. When I use the gradient descent method to solve EX4, it can not converge.My code are as follows:
x=read.table("E:/DM/text_book/ML_ex/ex4x.dat", header=F)
y=read.table("E:/DM/text_book/ML_ex/ex4y.dat", header=F)
x=cbind(rep(1,length(length(y))), x)
x=as.matrix(x)
y=y[,1]
cost_f 1e-10 && iter<50000){
cost_old=cost_f(beta)
beta_new=beta-alpha*t(x)%*%(1/(1+exp(-x%*%beta))-y)
cost_new=cost_f(beta_new)
alpha1=alpha
if(cost_new<cost_old) {
cost=cost_old
i=i+1
while(cost_newcost_old){
cost=cost_new
alpha=alpha/2
beta_new=beta-alpha*t(x)%*%(1/(1+exp(-x%*%beta))-y)
cost_new=cost_f(beta_new)
while(cost_new<cost){
cost=cost_new
alpha=alpha/2
beta_new=beta-alpha*t(x)%*%(1/(1+exp(-x%*%beta))-y)
cost_new=cost_f(beta_new)
j=j+1
}
alpha=alpha*2
}
beta=beta-alpha*t(x)%*%(1/(1+exp(-x%*%beta))-y)
cost=cost_f(beta)
epsion=abs(beta)
iter=iter+1
alpha=alpha0
}
print(beta)
print(cost)
print(iter)
print(i)
print(j)
Reply
ygc
Reply:
December 23rd, 2011 at 5:13 pm
you may set a lower alpha value.
Reply
hi
please any logistic rgerssion implementation using c#
thank u
Reply