Thursday 16 January 2020

Logistic Loss Function implementation in NN (part 2)

[Image: Colossus - Computer Science Museum, Bletchley Park]
This is part 2 on loss functions. It follows the same approach as the previous post, but is expanded to two input neurons and incorporates a logistic function for a classification problem.
R code is at the end.

Again, this does not use matrices, clever optimisation or any form of regularisation. It is meant to be accessible programmatically and mathematically, so you can see how the weights and the classification evolve incrementally. You can adjust the learning rate, and add more features and data points by copying and pasting the derivative code. I've used xy, x^2, y^2, sine and cosine for interesting visuals and results. The derivative was worked out using a mix of other posts, Symbolab and Wikipedia; the working out can be found at the very end.

For this demonstration, we will use a basic NN with a loss function and a logistic function to computationally solve a colour classification problem using two input variables, x and y.


We want to classify the following dots - red (zero) on the top, blue (one) at the bottom:


Components
As before, a basic neural network, like many optimisation algorithms, is built with the following steps.

1. Initialise 
2. Compute 
3. Measure
4. Adjust
5. Terminate

The differences this time are that the data set has changed and that the partial derivative now includes a logistic function, which is explained further on.

Our starting data set is:
 x   y  z
10  12  0
10  10  1
20  19  1
20  24  0

Our input values are x and y coordinates and our value to be predicted is z.

Logistic Function
Because our prediction is a binary value, we will be using the logistic (also called the sigmoid) function.
$$f(x)={1\over(1+e^{-x} )}$$

This function is a compressor: it takes in any value of x and squeezes it down to a number between 0 and 1.

Example:

$$f(10)=  {1\over{1+e^{-10}}} = {1\over{1+{1\over{e^{10}}}}}= {1\over{1+{1\over{22026}}}} = {1\over{1.000045}} \approx 1.0 $$


$$f(1)=  {1\over{1+e^{-1}}} = {1\over{1+{1\over{e^{1}}}}}= {1\over{1+{1\over{2.71828}}}} = {1\over{1.3678}} \approx 0.73 $$


$$f(-10)=  {1\over{1+e^{-(-10)}}} = {1\over{1+e^{10}}}= {1\over{1+22026}} = {1\over{22027}} \approx 0.00005 \approx 0 $$
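
These values can be reproduced with a few lines of R. A minimal sketch, using the same logit helper that appears in the full script at the end:

#Logistic (sigmoid) function: squeezes any real number into the range (0,1)
logit <- function(x) {
  return (1/(1+exp(-(x))))
}

logit(10)    # 0.9999546  -> approximately 1
logit(1)     # 0.7310586  -> approximately 0.73
logit(-10)   # 0.0000454  -> approximately 0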


Because we’re mixing several different input value ranges into our loss function, any large numbers can potentially blow up our computations and make tuning the learning rate a little harder. So, we must scale the x and y values down to a range closer to 0 to 1. To scale, I’ve used a lower bound of -2 and an upper bound of 2. I could have picked any other small range, but I’ve chosen these two values.

We apply the following transformation to our data:
$$x=>{(x-min(x))}*{new\_upper-new\_lower\over{max(x)-min(x)}}+new\_lower$$

Here min(x) and max(x) are the bounds of the source range of x, and new_lower and new_upper are the limits of the destination range the values are being converted to, namely -2 to 2. (In the R script at the end, the source range is fixed at 0 to 40 rather than taken from the data itself; that is what produces the transformed values below.)

We take a given value of x and subtract the smallest value of the source range to find x’s relative position within that range. We then multiply it by the ratio of the new range to the old range, and finally add back the lower bound of the new range.
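
As a quick sketch, the scaleRange helper from the full script reproduces the transformed values below when the source range is taken as 0 to 40:

#Rescale x from the range [from_min, from_max] to the range [to_min, to_max]
scaleRange <- function (x,from_min,from_max,to_min,to_max)
{
x <- (x - from_min) * (to_max - to_min) / (from_max - from_min) + to_min
return(x)
}

scaleRange(c(10,10,20,20), 0, 40, -2, 2)   # -1.0 -1.0  0.0  0.0   (scaled x)
scaleRange(c(12,10,19,24), 0, 40, -2, 2)   # -0.8 -1.0 -0.1  0.4   (scaled y)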

Our transformed values are as follows:

 x    y  z
-1 -0.8  0
-1 -1.0  1
 0 -0.1  1
 0  0.4  0

1. Initialize

We instantiate our weights (W)
Weight for the x input variable (w1) = 1
Weight for the y input variable (w2) = 1
The learning rate (α) = .5
The bias (b) = 2
The bias weight (w3) = 1

If we are to visualise the network, it would look something like this:


2. Compute

We fetch our first record: (x, y, z) = (-1, -0.8, 0)
 x    y  z
-1 -0.8  0
-1 -1.0  1
 0 -0.1  1
 0  0.4  0

We multiply our first x value by the weight w1 to yield w1x: -1x1=-1 => w1x=-1
We multiply our first y value by the weight w2 to yield w2y: -.8x1=-0.8 => w2y=-0.8
We multiply our bias value by the bias weight w3 to yield w3b: 2x1=2 => w3b=2
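
A minimal sketch of this step in R, using the initialised values from step 1 (the variable names are just illustrative):

w1 <- 1; w2 <- 1; w3 <- 1   #weights
b  <- 2                     #bias
x  <- -1; y <- -0.8         #first scaled record

w1x <- w1*x   # -1
w2y <- w2*y   # -0.8
w3b <- w3*b   #  2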

3. Measure
We now measure the difference between the actual value of z=0 and the predicted value of this newly instantiated network.

At present, our network makes predictions using the following formula:
$$ {1\over{1+e^{-x}}}  => {1\over{1+e^{-(W_1 X+W_2 Y+W_3 b)}}} $$

Using our initialised variables, our prediction is 
$$ ž  = {1\over{1+e^{-((1)(-1)+(1)(-0.8)+(1)(2))}}}=0.54983 $$

For the network to learn, it needs to tune its predictions through the weights. We work out by how much using a loss function:

$$(\text{real value} - \text{NN predicted value})^2$$
or
$$(z-ž)^2$$
or
$$(z-{1\over{1+e^{-(W_1 X+W_2 Y+W_3 b)}}})^2$$
So if we substitute our values (w1x = -1, w2y = -0.8, w3b = 2) and the actual value (z = 0) into the above, our error is:
$$(z-{1\over{1+e^{-(W_1 X+W_2 Y+W_3 b)}}})^2=>(0-{1\over{1+e^{-(-1-0.8+2)}}})^2=>(0-{1\over{1+e^{-0.2}}})^2$$
$$(0-{1\over{1.81873}})^2=>(0-0.5498)^2=>0.3023$$
Our error is 0.3023 for the first data point.
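
Continuing the R sketch from step 2, the prediction and the squared error for this record can be checked as follows (zhat is just an illustrative name for the prediction ž):

z     <- 0                                  #actual value for the first record
zhat  <- 1/(1+exp(-(w1x + w2y + w3b)))      #prediction: 0.549834
error <- (z - zhat)^2                       #squared error: 0.302318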

4. Adjust

Again, like in the first post on loss functions, we need the partial derivative of the loss function with respect to each of the weights:
$$(z-{1\over{1+e^{-(W_1 X+W_2 Y+W_3 b)}}})^2$$

For brevity, the final partial derivative is placed below. If you want the full workings out, skip to the end.
$$\text{Let }f(wx)=f(W_1 X+W_2 Y+W_3 b)={1\over{1+e^{-(W_1 X+W_2 Y+W_3 b)}}}$$
Then
$$\text{Error}=(z-f(wx))^2$$
$${\partial{Error}\over{\partial{W_1}}}=-2\,(z-f(wx))\cdot f(wx)\cdot(1-f(wx))\cdot x$$
This is our partial derivative, and the same form gives the derivative for every other weight. Because we want to move down the error surface, the adjustment ∇W1 that we add to a weight is the negative of this gradient.
Now we can make our first weight adjustment in a similar manner to the OLS post.
$$\text{Logistic function: }f(wx)=f(W_1 X+W_2 Y+W_3 b)=f(-1-.8+2)= .5498$$
$$∇W_1= 2\,(z-f(wx))\cdot f(wx)\cdot(1-f(wx))\cdot x$$
$$=2\cdot(0-.5498)\cdot(.5498)\cdot(1-.5498)\cdot(-1)$$
$$=2\cdot(-.5498)\cdot(.5498)\cdot(.4502)\cdot(-1)=0.272$$
Now that we have our change, we can adjust W1 using this adjustment and our learning rate
α = .5

So our new weight is:
$$W_1 = W_1 + α∇W_1 => (1) + (.5)(0.272) = 1+(0.136) => 1.136$$
W2 and W3 are computed in exactly the same way. Owing to the increased size of the partial derivative, I will leave an R console printout of the weights further on.
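
Continuing the same R sketch, the three adjustments and weight updates for the first record look like this; the results agree with the first line of the console printout below (adj and alpha are illustrative names):

adj   <- 2*(z - zhat)*zhat*(1 - zhat)   #common factor of the adjustment: -0.272186
alpha <- 0.5                            #learning rate

w1 <- w1 + alpha*adj*x   # 1 + 0.5*( 0.272186) = 1.136093
w2 <- w2 + alpha*adj*y   # 1 + 0.5*( 0.217749) = 1.108874
w3 <- w3 + alpha*adj*b   # 1 + 0.5*(-0.544372) = 0.727814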

5. Terminate

The neural network then passes in the next record's values (x, y) and carries out steps 2, 3 and 4 with the new weights from above. Once all the data has been used, the process continues from the first record again until all the epochs are completed. An epoch is a single pass of the entire dataset through steps 2 to 4.

Because we’re running 500 epochs, we’re not done processing yet. For brevity, however, I will only include the R printout:

w1: 1.13609302657525  w2: 1.1088744212602   w3: 0.727813946849507
w1: 0.988396129296515 w2: 0.961177523981466 w3: 1.02320774140697
w1: 0.988396129296515 w2: 0.959820053896929 w3: 1.05035714309772
w1: 0.988396129296515 w2: 0.933597153055977 w3: 0.919242638892959
w1: 1.11949168599472  w2: 1.03847359841454  w3: 0.65705152549655
...............
w1: 1.04296060447026  w2: 0.783478305387954 w3: 0.439546216434397
w1: 0.897875953093364 w2: 0.638393654011055 w3: 0.729715519188197
w1: 0.897875953093364 w2: 0.635235101469541 w3: 0.792886570018465
w1: 0.897875953093364 w2: 0.594408140160519 w3: 0.588751763473355
epoch: 5 w1: 0.9 w2: 0.59 w3: 0.59 error: 0.3800738174159
……… epoch readout omitted 
w1: 9.41114740924591 w2: -9.57178533392144 w3: 0.538001948131482
w1: 9.48636747379138 w2: -9.51160928228506 w3: 0.387561819040533
w1: 9.42007593702699 w2: -9.57790081904946 w3: 0.52014489256932
w1: 9.42007593702699 w2: -9.57915632568083 w3: 0.545255025196754
w1: 9.42007593702699 w2: -9.58053573491764 w3: 0.538357979012715

epoch: 500 w1: 9.42 w2: -9.58 w3: 0.54 error: 0.0568526922745829



Our final weights W1, W2 and W3 are 9.42, -9.58 and 0.54, and a visual progression is displayed below.




The dashed line at the bottom of the image is the loss over time. The dotted line through the middle is the 0.5 separation boundary.
R Script
#Load libraries
library(rgl)
library(ggplot2)

#set working directory for images
setwd("C:\\nonlinear")

#Scale function 
scaleRange <- function (x,from_min,from_max,to_min,to_max)
{
x <- (x - from_min) * (to_max - to_min) / (from_max - from_min) + to_min
return(x)
}

#data generator
generateCluster <- function(x,y,variance) {
  x_offset <- rnorm(1,x,variance)
  y_offset <- rnorm(1,y,variance)
  return (c(x_offset,y_offset))
}

logit <- function(x) {
  return (1/(1+exp(-(x))))
}

#Our data frame
dfXY <- data.frame(x=1:4, y=1:4, z=0)

dfXY[1,]$x=10
dfXY[1,]$y=12
dfXY[1,]$z=0

dfXY[2,]$x=10
dfXY[2,]$y=10
dfXY[2,]$z=1

dfXY[3,]$x=20
dfXY[3,]$y=19
dfXY[3,]$z=1

dfXY[4,]$x=20
dfXY[4,]$y=24
dfXY[4,]$z=0

#Init weights
w1 <- 1
w2 <- 1
w3 <- 1
b <- 2

#create a weights frame for reporting
dfWeights <- data.frame(x=w1, y=w2, z=2)

#Parameters
learningRate <- .5
epoch <- 500

#Scale values down to -2 to 2
dfXY$sx <- NULL
dfXY$sx <- scaleRange(dfXY$x,0,40,-2,2)
dfXY$sy <- NULL
dfXY$sy <- scaleRange(dfXY$y,0,40,-2,2)

#Create a graph frame of all mxn
dfCol <- data.frame(ix=1:900, x=0, y=0, col="")

#Begin iterating through frame to set x and y values of graph matrix
xcount <- 1 
ycount <- 1

for (z in 1:nrow(dfCol))
{
  dfCol[z,]$x <- xcount
  dfCol[z,]$y <- ycount
  if (xcount < 30)
  {
  xcount <- xcount+1
  }
  else
  {
  xcount <- 1
  ycount <- ycount+1
  }
}

#Scale the values down
dfCol$sx  <- scaleRange(dfCol$x,0,40,-2,2)
dfCol$sy  <- scaleRange(dfCol$y,0,40,-2,2)
ec <- 0

#Loop through epochs
for (ec in 1:epoch) {

  error <- 0

  ww1 <- 0
  ww2 <- 0
  ww3 <- 0

  #Loop through data set
  for (loopCount in 1:nrow(dfXY)) {

    #Assign scaled values
    x <- dfXY[loopCount,]$sx
    y <- dfXY[loopCount,]$sy
    
    #Set the true/false value
    z <- dfXY[loopCount,]$z

    #Product of weight and data
    w1x   <- w1*x
    w2y   <- w2*y
    w3b   <- w3*b
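
    #Weight adjustments: 2*(z - prediction)*prediction*(1 - prediction)*input,
    #i.e. the negative gradient of the squared error, so each one is added to its weight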

    ww1 <- 2*(z-logit(w1x+w2y+w3b))*(logit(w1x+w2y+w3b))*(1-(logit(w1x+w2y+w3b)))*x
    ww2 <- 2*(z-logit(w1x+w2y+w3b))*(logit(w1x+w2y+w3b))*(1-(logit(w1x+w2y+w3b)))*y
    ww3 <- 2*(z-logit(w1x+w2y+w3b))*(logit(w1x+w2y+w3b))*(1-(logit(w1x+w2y+w3b)))*b
    
    #Adjust weights
    w1 <- w1+(learningRate*ww1)
    w2 <- w2+(learningRate*ww2)
    w3 <- w3+(learningRate*ww3)
    
    print(paste("w1:",w1,"w2:",w2,"w3:",w3))

    #Measure the errors
    error <- error+abs((z-(1/(1+exp(-(w1x+w2y+w3b)))))^2)
  }
  
  #Record the weights and the accumulated error at the end of each epoch
  if (ec > 1)  {
    dfWeights <- rbind(dfWeights,data.frame(x=w1, y=w2, z=error))
  }
  
  #Print out visuals so the evolution can be seen
  if (ec %% 10 == 0 || ec == 5) {
print(paste("epoch:",ec,"w1:",round(w1,2),"w2:",round(w2,2),"w3:",round(w3,2),"error:",error/nrow(dfXY)))

    dfCol$yhat <- logit(dfCol$sx*w1+dfCol$sy*w2+b*w3)
    
    #Map each prediction to one of 20 colours on a red-to-blue ramp
    #(equivalent to binning yhat into 0.05-wide intervals)
    pal <- colorRampPalette(c("#cc0000","#2952a3"))(20)
    dfCol$col <- pal[pmin(pmax(ceiling(dfCol$yhat*20), 1), 20)]

    p <- ggplot(data=dfCol) + 
      geom_point(aes(x=dfCol$x, y=dfCol$y), color=dfCol$col, size=10,shape=15) + 
      geom_point(data=dfXY,aes(x=dfXY$x, y=dfXY$y), color=ifelse(dfXY$z==1,"skyblue","red"), size=2) + 
      ggtitle(paste("NN Color Classification - Error:",round(error,2),"epoch:",ec)) + 
      xlab("x")+ylab("y")+xlim(c(0,30))+ylim(c(0,30))+
      geom_line(data=dfCol[dfCol$yhat >= .4 & dfCol$yhat <= .6,],aes(x=dfCol[dfCol$yhat >= .4 & dfCol$yhat <= .6,]$x,y=dfCol[dfCol$yhat >= .4 & dfCol$yhat <= .6,]$y),linetype = "dotted", color="black",alpha=0.5) +
      geom_line(data=dfWeights,aes(scaleRange(1:nrow(dfWeights),0,nrow(dfWeights),2,30),scaleRange(dfWeights$z,0,max(dfWeights$z),2,10)),linetype = "dashed",alpha=0.5) +
      geom_text(data = dfWeights[nrow(dfWeights),], aes(label = round(error,2), x = 30, y = scaleRange(z,0,2,2,30)),color="white") +
      theme_bw() +
      theme(
        plot.background = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        panel.grid = element_blank(),
        axis.ticks.x=element_blank(),
        axis.ticks.y=element_blank(),
        axis.text.x=element_blank(),
        axis.text.y=element_blank()
      )
    
    ggsave(paste0(ec,".jpg"), plot = p, width = 180, height = 120, limitsize = TRUE, units="mm")

    file.copy(paste0(ec,".jpg"), "current.jpg", overwrite = TRUE )
  }
}

References and Full PDE of loss function 
With assistance from: https://www.symbolab.com/

For a fun live version of this post click here

Full partial derivative working out: 

$$\text{Let }f(wx)={1\over{1+e^{-(W_1 X+W_2 Y+W_3 b)}}}$$
then
$$\text{Error}=(z-f(wx))^2$$
Applying the chain rule:
$${\partial{Error}\over{\partial{W_1}}}=2\,(z-f(wx))\cdot{\partial\over{\partial{W_1}}}\big(z-f(wx)\big)=-2\,(z-f(wx))\cdot f'(wx)\cdot X$$
since the derivative of W1X + W2Y + W3b with respect to W1 is X. The derivative of the logistic function is
$$f'(wx)=f(wx)\,(1-f(wx))$$
so
$${\partial{Error}\over{\partial{W_1}}}=-2\,(z-f(wx))\cdot f(wx)\cdot(1-f(wx))\cdot X$$
The partial derivatives with respect to W2 and W3 take the same form, with X replaced by Y and b respectively.

