Tuesday, 14 January 2020

Loss Function implementation in NN (part 1)

BOMBE - Computer Sci Museum - Bletchley Park
This post is 1 of 2 posts illustrating how a loss function works with a single neuron neural network. Post 2 will add an additional neuron to make things more interesting.

This is a conceptual implementation of neural network, it's a small one that does not use matrices or clever optimisation, but is written to be accessible in terms of programming and mathematics for learning purposes to understand the most atomic part of an NN as well as an atomic part of other loss/error functions. All the R Code is at the end.

Although facets of NN’s are in here, the take away is how a loss function and its partial derivatives works along with impact on the optimisation algorithm. Loss functions are featured in many different machine learning methods and understanding a loss function will help understand a lot of other methods.

Conceptually, think of a neural network as a recursive function – like the Newton Method, but with partial derivatives, all of which must be minimised in terms of the weights with respect to a loss function.  Each pass of the function yields a value by which we increase or decrease weights leading to a smaller loss or cost in terms of the error.

For this demonstration, I will be using a basic NN and loss function to computationally solve an OLS. I won’t make predictions per se - but by solving the weights, we solve the ordinary least square line.

A least squares function is usually found in the following form:
$$Ε· = 𝑏_0 + 𝑏_1π‘₯$$
Analytically, this can be solved easily using the variance and co-variances of x and y. How to do that analytically can be found here

We can also solve this, and more complicated systems, using a single neuron or a sprawling layered network of them.

Components 
A basic neural network, like many optimisation algorithms, is built with the following steps.

1. Initialise 
2. Compute 
3. Measure
4. Adjust
5. Terminate

These steps are general and anecdotal, more advanced algorithms can branch off at any one of the above steps.

Depending on the design, steps 2 to 4 are executed repeatedly until some conclusion is reached.

For our neural network, we are wanting to find b0 and b1 algorithmically for our least square line
$$Ε· = 𝑏_0 + 𝑏_1π‘₯$$

Our starting data set is:
x  y 
1  2
2  1
3  4
4  3

1. Initialize 

We instantiate our weights (W)  - these can be also be considered coefficients, they impact input.
Weight for the x input variable (w1x) = 1
The learning rate (Ξ±) = .01 - this impacts the rate the algorithm tunes the weights.
The bias (b) = 2
The bias weight (w2b) = 1

Because our prediction is a continuous value, we are not making use of a sigmoid function, we want the least square coefficients(b0,b1) which are the equivalent of w1 and w2 in this example. Sigmoid will be posted in part 2.

If we are to visualise the network, it would look something like this:


Except, our activation function - F(X) - is not a logistic function but the product of the data, bias and weights which will become clearer further on. 

2. Compute 

We fetch our first record – (x,y)(1,2) 
x  y 
1  2
2  1
3  4
4  3

We multiply our first x value by the weight w1 to yield w1x: 1x1=1 => w1x=1 
We multiply our bias value b by the bias weight w2 to yield w2b: 1x1=1 => w2b=1 

3. Measure 
We now measure the difference between the actual value of y=2 and the predicted value of this newly instantiated network. 

At present, our network makes predictions using the following formula: 
$$π‘Š_1𝑋 + π‘Š_2𝑏$$  
(πΏπ‘œπ‘œπ‘˜ π‘“π‘Žπ‘šπ‘–π‘™π‘–π‘Žπ‘Ÿ? 𝑏0 + 𝑏1π‘₯

If we substitute our data along with the newly instantiated weights into the above formula, our model makes the following prediction: 

$$π‘Š_1𝑋 + π‘Š_2𝑏 => (1)(1) + (1)(2) => 3$$ 

For it to learn, it needs to tune it’s predictions through the weights. This is how we work out by how much: 
$$(\text{𝒓𝒆𝒂𝒍 𝒗𝒂𝒍𝒖𝒆 π‘šπ‘–π‘›π‘’π‘  𝒏𝒏 π’‘π’“π’†π’…π’Šπ’„π’•π’†π’… 𝒗𝒂𝒍𝒖𝒆})^𝟐$$ 
or 
$$(y − Ε·)^𝟐$$
or
$$(𝐲 − (π‘ΎπŸπ‘Ώ + π‘ΎπŸπ’ƒ))^2$$

So if we substitute for our first value (x=1) and our actual value (y=2) into the above, our error is:
$$(y − π‘Š1𝑋 − π‘Š2𝑏)^𝟐 => ((2) − 3)^𝟐 => (−1)^𝟐 => 1$$

Our error is 1 (2-3=-1)

So the question is, by how much must we change W1 and W2 to reduce our error of 1 to something smaller? 

4. Adjust 

This is where we need to formulate our partial derivative of our error function in terms of each of the weights. 

A derivative is the rate of change of a function. Think f(x)=2x. If x is 1, f(1) will be 2. If x is 2, f(2) will be 4, f(3)=6 and so on. Our rate of change, or the derivative of f(x)=2x is 2

A partial derivative is similar to the above, but the derivative is now subject to 2 or more variables. We get the derivative of one parameter (say x) while fixing the remaining parameters (say y). We say we get the derivative of function f(x,y) in terms of x – because we’re interested in the rate of change of x alone while y is treated as constant. See the graph below to intuit how the rate of change for x varies at each value of y (hence the fixing)


If we want to find the rate of change for x, we have to hold y in place, because the rate of change for x is potentially different for all values of y.  

Because our error function has other values in it apart from our value w1 (the weight we want to adjust to improve the error rate) we need to get the partial derivative to compute the adjustment. We will do this next.
$$ 𝑬𝒓𝒓𝒐𝒓 = (Ε· − π‘Š_1𝑋 − π‘Š_2𝑏)^𝟐 $$

$${{πœ•πΈπ‘Ÿπ‘Ÿπ‘œπ‘Ÿ}\over{πœ•π‘Š1}} = 2(−𝑿)(Ε· − π‘Š_1𝑋 − π‘Š_2𝑏) $$
$$= −2𝑿(Ε· − π‘Š_1𝑋 − π‘Š_2𝑏)$$

This is our partial derivative and we will use it again to find the weight for the bias.

$${{πœ•πΈπ‘Ÿπ‘Ÿπ‘œπ‘Ÿ}\over{πœ•π‘Š2}} = 2(−b)(Ε· − π‘Š_1𝑋 − π‘Š_2𝑏) $$
$$= −2b(Ε· − π‘Š_1𝑋 − π‘Š_2𝑏)$$

Now we can make our first weight adjustments

$$∇π‘Š1: − 2𝑋(Ε· − π‘Š_1𝑋 − π‘Š_2𝑏) => −2(1)((2) − (1)(1) − (1)(2)) => −2(−1) => 𝟐$$
$$∇π‘Š2: − 2𝑏(Ε· − π‘Š_1𝑋 − π‘Š_2𝑏) => −2(2)((2) − (1)(1) − (1)(2)) => −4(−1) => πŸ’$$

The method above falls under stochastic gradient descent – when you hear gradient descent, you can think partial derivatives 

Now that we have our change, we can adjust W1 and W2  using our adjustments and our learning rate  Ξ± = .01
So our new weights are:

$$W_1 = W_1 - Ξ±∇W_1 => (1) – (.01)(2) = 1-(0.02)=>.98$$
$$W_2 = W_2 - Ξ±∇W_2 => (1) – (.01)(4) = 1-(0.04)=>.96$$

5. Terminate
A neural network will pass in the next set of record values (x,y) and carry out step 2, 3, and 4 with the new weights from above (.98 & .96). Once all the data is used, the process will continue from the first record again until all the epochs are completed. An epoch is a single pass of the entire dataset through step 2 to 4.

Because we’re running 100 epochs, we’re not done and so we go back to step 2 using weights W1 = .98 and W2 = .96

Continuation 

I will do one more record and then provide the read out from an R script: (x,y)=(2,1)

$$∇π‘Š_1: − 2𝑋(Ε· − π‘Š_1𝑋 − π‘Š_2𝑏) => −2(2)((1) − (.98)(2) − (.96)(2)) => −4(−2.880) = 11.52$$
$$∇π‘Š_2: − 2𝑏(Ε· − π‘Š_1𝑋 − π‘Š_2𝑏) => −2(2)((1) − (.98)(2) − (.96)(2)) => −4(−2.880) = 11.52$$

And our new weights are:

$$W_1 = W_1 - Ξ±∇W_1 => (.98) – (.01)(11.52) = .98-(0.115) =>.865$$
$$W_2 = W_2 - Ξ±∇W_2 => (.96) – (.01)(11.52) = .96-(0.115) =>.845$$

Readout from 1-100 epochs 
x: 1 y: 2 epoch: 1   w1 0.98 : w2 0.96 
x: 2 y: 1 epoch: 1   w1 0.86 : w2 0.84 
x: 3 y: 4 epoch: 1   w1 0.85 : w2 0.83 
x: 4 y: 3 epoch: 1   w1 0.68 : w2 0.75 
x: 1 y: 2 epoch: 2   w1 0.68 : w2 0.74
x: 2 y: 1 epoch: 2   w1 0.61 : w2 0.67 
x: 3 y: 4 epoch: 2   w1 0.66 : w2 0.7 
x: 4 y: 3 epoch: 2   w1 0.57 : w2 0.66 
x: 1 y: 2 epoch: 3   w1 0.58 : w2 0.67 
x: 2 y: 1 epoch: 3   w1 0.52 : w2 0.61
x: 3 y: 4 epoch: 3   w1 0.59 : w2 0.66 
x: 4 y: 3 epoch: 3   w1 0.54 : w2 0.63 
x: 1 y: 2 epoch: 4   w1 0.54 : w2 0.64 
x: 2 y: 1 epoch: 4   w1 0.49 : w2 0.58 
x: 3 y: 4 epoch: 4   w1 0.57 : w2 0.64 
x: 4 y: 3 epoch: 4   w1 0.52 : w2 0.62 
..... 
x: 1 y: 2 epoch: 99  w1 0.57 : w2 0.56 
x: 2 y: 1 epoch: 99  w1 0.52 : w2 0.51 
x: 3 y: 4 epoch: 99  w1 0.6  : w2 0.57 
x: 4 y: 3 epoch: 99  w1 0.56 : w2 0.55 

x: 1 y: 2 epoch: 100 w1 0.57 : w2 0.56
x: 2 y: 1 epoch: 100 w1 0.52 : w2 0.51
x: 3 y: 4 epoch: 100 w1 0.6  : w2 0.57 
x: 4 y: 3 epoch: 100 w1 0.56 : w2 0.54 

Our final weights W1 and W2  are .56 and .54 

If I pass the same data independently through a generic least squares model in R to find the gradient (done using R’s lm() for shortcut) I find the following:
Coefficients: 
(Intercept)            x   
        1.0          0.6   
W2 corresponds to our intercept of 1.0 (b0*w2=2*.54~1)
And our W1 (.56) corresponds to our x of .6

Our algorithm increments are visually represented below, and we can see the progression of accuracy per epoch:

If we carry out the same process using 100 data points with 3000 epochs, and a learning rate of .00001 with w1 and w2 initialised to -10 and -20, we will see the following: 

Our starting point is a line with a steep gradient and an intercept around 1 
Incrementally, the intercept shifts and the gradient changes 




By the end of the last epoch, the line is suitably fitted to the data where the data was generated with an intercept of 50 and a gradient of 1.5 – which is basically the same as below. 

> lm(dfXY, formula = y~.) 
Call: lm(formula = y ~ ., data = dfXY) 
Coefficients: 
(Intercept)            x        
49.658        1.566   

A note on learning rate:  Learning rate will allow for the error function to fall into a minima and reach the smallest allowable global error. If the learning rate is too high, the weights will cycle back and forth, continually overshooting the minima and fail to allow an adequate window for the algorithm to converge on the minima. The flip flopping will usually cause weights to spiral out of control and ‘blow up’ – weight or weights hit INF or NANs 

If the coefficients are converged and it can be visually seen that the line settles in a position that is clearly not efficient, it means the algorithm has found a local minima, and not the global minima. This can be corrected using advanced analytic and programming methods 

A note on weights: In the above notes, weights are declared as separate variables and all the moving parts reside in a variable. This is done so that it is easy to follow. However, in real implementations of NNs, the weights per neuron and the input data are stored in matrices and their dot product are used to create the prediction. This is programmatically neater and generally computationally efficient. 

RScript
library(ggplot2)  
setwd("C:\\MyImageFolder")  
#Noise generator 
generatePairs <- function(x,b0,b1,corre) 
{   
  err <- sample(rnorm(100,0,corre),1)   
  y <- (  b0 + b1*x + err  )   
  return(y) 
} 
 
#Holding frame 
dfXY <- data.frame(x=1:100, y=1:100) 
 
#Create a noisy scatter plot with an intercept and a gradient 
for (x in 1:100) 
{   
  dfXY[x,]$y <- generatePairs(x,50,1.5,20) 
} 
 
#Set variables 
learningRate <- .00001 
epoch <- 3000 
b0 <- 2 
w1 <- -10 
w2 <- -20 
 
#Loop through epochs 
for (z in 1:epoch) 
{   
  error <- 0 
 
  #loop through data points   
  for (loopCount in 1:nrow(dfXY)) 
  {     
    #assign variables     
    w1x <- w1*dfXY[loopCount,]$x     
    w2b <- w2*b0 
 
    #partial derivative value     
    pdw1x <- -2*dfXY[loopCount,]$x*(dfXY[loopCount,]$y-w1x-w2b)     
    pdw2b <- -2*b0*(dfXY[loopCount,]$y-w1x-w2b) 
 
    #error     
    error <- error +(dfXY[loopCount,]$y-w1x-w2b)^2 
 
    #Reassign weights     
    w1 <- w1-(learningRate*pdw1x)     
    w2 <- w2-(learningRate*pdw2b)   
  } 
 
  #Create plots at x epochs   
  if (z %% 200 == 0 || z == 1 || z == 2 || z == 3 || z == 4 || z == 5 || z == 10 || z == 15 || z == 20 || z == 30 || z == 50 || z == 100) 
  { 
    print(paste("x:",dfXY[loopCount,]$x,"y:",y,"epoch:",z,"w1x",w1,": w2b",w2,"error:",error)) 
 
    ggplot(data=dfXY,aes(x=dfXY$x,y=dfXY$y))+       
    geom_point()+       
    geom_line(aes(x=1:nrow(dfXY),w2*b0+w1*(1:nrow(dfXY))))+       
    ggtitle(paste("OLS b0 & b1 search - Ξ£Error:",round(error,2),"epoch:",z,"b0:",round(w2*b0,2)," b1:",round(w1,2)))+       
    xlim(c(0,220))+       
    ylim(c(0,220))+       
    ggsave(paste0(z,".jpg"), plot = last_plot(), width = 180, height = 120,limitsize = TRUE, units="mm") 
 
    #For monitoring progression     
    file.copy(paste0(z,".jpg"), "current.jpg", overwrite = TRUE )   
  } 
} 

1 comment: