Thursday 16 January 2020

Logistic Loss Function implementation in NN (part 2)

Colossus - Computer Sci Museum - Bletchley Park
This is part 2 on loss functions. This post is like the last, but it is expanded to include 2 neurons and incorporates a logistic function for a classification problem.
R code is at the end.

Again, this does not use matrices, clever optimisation or any form of regularisation. It is aimed to be accessible programmatically and mathematically so one can see how weights and classification work in an incremental fashion. You can adjust the learning rate as well as add in more features and data points by copying and pasting the derivative code. I've used xy, x^2, y^2, sin and cosine for interesting visuals and results. The derivative was worked out using a mix of other posts, Symbolab and Wikipedia. That working out can be found at the very end.

For this demonstration, we will be using a basic NN with a loss function and a logistic function to computationally solve a colour classification problem using the variables x and y.


We want to classify the following dots - red (zero) on the top, blue (one) at the bottom:


Components
As before, a basic neural network, like many optimisation algorithms, is built with the following steps.

1. Initialise 
2. Compute 
3. Measure
4. Adjust
5. Terminate

The difference this time is that the data is slightly different and the partial derivative has been expanded to include a logistic function, which will be covered further on.

Our starting data set is:
x y z
10 12 0
10 10 1
20 19 1
20 24 0

Our input values are x and y coordinates and our value to be predicted is z.

Logistic Function
Because our prediction is a binary value, we will be using the logistic (also called the sigmoid) function.
$$f(x)={1\over{1+e^{-x}}}$$

This function is a compressor: it takes in any value of x and squeezes it down to a number between 0 and 1.

Example:

$$f(10)=  {1\over{1+e^{-10}}} = {1\over{1+{1\over{e^{10}}}}}= {1\over{1+{1\over{22026}}}} = {1\over{1.000045}} =~1.0 $$


$$f(1)=  {1\over{1+e^{-1}}} = {1\over{1+{1\over{e^{1}}}}}= {1\over{1+{1\over{2.71828}}}} = {1\over{1.3678}} =~0.73 $$


$$f(-10)=  {1\over{1+e^{-(-10)}}} = {1\over{1+{e^{10}}}}= {1\over{1+{22026}}} = {1\over{22027}} =~0.00004 \approx 0 $$
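
These values are easy to check in R. Below is a minimal sketch of the logistic function (the same logit helper that appears in the full script at the end) evaluated at the three points above:

#Logistic (sigmoid) function - squeezes any real number into (0,1)
logit <- function(x) {
  return (1/(1+exp(-x)))
}

logit(10)   # 0.9999546 - effectively 1
logit(1)    # 0.7310586 - roughly 0.73
logit(-10)  # 0.0000454 - effectively 0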


Because we’re mixing multiple different input value ranges into our loss function, any large numbers can potentially blow up our computations and make tuning the learning rate a little harder. So, we scale the x and y values down to a range closer to 0-1. To scale, I’ve used the lower bound -2 and the upper bound 2. I could have picked any other small range, but I’ve chosen these two values.

We apply the following transformation to our data:
$$x=>{(x-min(x))}*{new\_upper-new\_lower\over{max(x)-min(x)}}+new\_lower$$

Where min(x) and max(x) are the lower and upper bounds of the source range (the script at the end uses 0 and 40 for both x and y), and new_lower and new_upper are the limits of the destination range the values are being converted to, namely -2 to 2.

We’re taking a given value of x and subtracting the lower bound of the source range to find x’s relative position within it. We then multiply by the ratio of the new range to the old range.
We then add back the lowest value of the new range.
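
As a quick check, here is the same transformation in R (the scaleRange helper from the script at the end), applied to the first record with a source range of 0 to 40:

#Rescale a value from [from_min, from_max] to [to_min, to_max]
scaleRange <- function (x,from_min,from_max,to_min,to_max)
{
  x <- (x - from_min) * (to_max - to_min) / (from_max - from_min) + to_min
  return(x)
}

scaleRange(10, 0, 40, -2, 2)  # -1
scaleRange(12, 0, 40, -2, 2)  # -0.8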

Our transformed values are as follows:

x  y z
-1 -0.8  0
-1 -1.0  1
 0 -0.1  1
 0  0.4  0

1. Initialize

We instantiate our weights (W)
Weight for the x input variable (w1) = 1
Weight for the y input variable (w2) = 1
The learning rate (α) = .5
The bias (b) = 2
The bias weight (w3) = 1

If we are to visualise the network, it would look something like this:


2. Compute

We fetch our first record – (x,y,z)(-1,-0.8,0)
x  y  z
-1 -0.8  0
-1 -1.0  1
 0 -0.1  1
 0  0.4  0

We multiply our first x value by the weight w1 to yield w1x: -1x1=-1 => w1x=-1
We multiply our first y value by the weight w2 to yield w2y: -.8x1=-0.8 => w2y=-0.8
We multiply our bias value by the bias weight w3 to yield w3b: 2x1=2 => w3b=2

3. Measure
We now measure the difference between the actual value of z=0 and the predicted value of this newly instantiated network.

At present, our network makes predictions using the following formula:
$$ {1\over{1+e^{-x}}}  => {1\over{1+e^{-(W_1 X+W_2 Y+W_3 b)}}} $$

Using our initialised variables, our prediction is 
$$ ž  = {1\over{1+e^{-((1)(-1)+(1)(-0.8)+(1)(2))}}}=0.54983 $$

For it to learn, it needs to tune its predictions through the weights. We work out by how much using a loss function:

$$(\text{real value} - \text{nn predicted value})^2$$
or
$$(z-ž)^2$$
or
$$(z-{1\over{1+e^{-(W_1 X+W_2 Y+W_3 b)}}})^2$$
So if we substitute our values (w1x=-1, w2y=-0.8, w3b=2) and our actual value (z=0) into the above, our error is:
$$(z-{1\over{1+e^{-(W_1 X+W_2 Y+W_3 b)}}})^2=>(0-{1\over{1+e^{-(-1-0.8+2)}}})^2=>(0-{1\over{1+e^{-0.2}}})^2$$
$$(0-{1\over{1.81873}})^2=>(0-0.5498)^2=>0.3023$$
Our error is 0.3023 for the first data point.
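
The same measurement as a small R sketch, using the first scaled record and the freshly initialised weights:

#First scaled record and the initial weights
x <- -1; y <- -0.8; z <- 0
w1 <- 1; w2 <- 1; w3 <- 1; b <- 2

z_hat <- 1/(1+exp(-(w1*x + w2*y + w3*b)))  # 0.5498...
error <- (z - z_hat)^2                     # 0.3023...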

4. Adjust

Again, like in the first post on loss functions, we need the partial derivatives, with respect to each of the weights, of the loss function
$$(ž-{1\over{1+e^{-(W_1 X+W_2 Y+W_3 b)}}})^2$$

For brevity, the final partial derivative is given below. If you want the full working out, skip to the end.
$$\text{Let }f(wx)=f(W_1 X+W_2 Y+W_3 b)={1\over{1+e^{-(W_1 X+W_2 Y+W_3 b)}}}$$
Then
$$\text{Error}=(z-f(wx))^2$$
$${\partial{Error}\over{\partial{W_1}}}=-2.(z-f(wx)).f(wx).(1-f(wx)).(x)$$
This is our partial derivative. The adjustment we apply to a weight is the negative of its partial derivative (so that the error shrinks), and we will use the same form for all the other weights.
Now we can make our first weight adjustments in a similar manner to the OLS post.
$$\text{Logistic function: }f(wx)=f(W_1 X+W_2 Y+W_3 b)=f(-1-.8+2)= .5498$$
$$∇W_1= -{\partial{Error}\over{\partial{W_1}}}=2.(z-f(wx)).f(wx).(1-f(wx)).(x)$$
$$=2.(0-.5498).(.5498).(1-.5498).(-1)$$
$$=2.(-.5498).(.5498).(.4502).(-1)=0.272$$
Now that we have our change, we can adjust W1  using our adjustments and our learning rate
α = .5

So our new weight is:
$$W_1 = W_1 + α∇W_1 => (1) + (.5)(0.272) = 1+(0.136)=> 1.136$$
W2 and W3 are computed in exactly the same way. Owing to the increased size of the partial derivative, I will leave an R console printout of the weights further on.
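
As a sketch, the adjustment and update for all three weights on the first record looks like this in R; the resulting values match the first line of the console printout further on:

logit <- function(x) 1/(1+exp(-x))

x <- -1; y <- -0.8; z <- 0; b <- 2
w1 <- 1; w2 <- 1; w3 <- 1
alpha <- 0.5

f <- logit(w1*x + w2*y + w3*b)   # 0.5498

#Adjustment per weight - the negative of dError/dW
dw1 <- 2*(z - f)*f*(1 - f)*x     #  0.272
dw2 <- 2*(z - f)*f*(1 - f)*y     #  0.218
dw3 <- 2*(z - f)*f*(1 - f)*b     # -0.544

w1 <- w1 + alpha*dw1             # 1.136
w2 <- w2 + alpha*dw2             # 1.109
w3 <- w3 + alpha*dw3             # 0.728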

5. Terminate

A neural network will pass in the next set of record values (x,y) and carry out steps 2, 3 and 4 with the new weights from above. Once all the data is used, the process will continue from the first record again until all the epochs are completed. An epoch is a single pass of the entire dataset through steps 2 to 4.

Because we’re running 500 epochs, we’re not done processing. However, for brevity, I will include the R printout only:

w1: 1.13609302657525  w2: 1.1088744212602   w3: 0.727813946849507
w1: 0.988396129296515 w2: 0.961177523981466 w3: 1.02320774140697
w1: 0.988396129296515 w2: 0.959820053896929 w3: 1.05035714309772
w1: 0.988396129296515 w2: 0.933597153055977 w3: 0.919242638892959
w1: 1.11949168599472  w2: 1.03847359841454  w3: 0.65705152549655
...............
w1: 1.04296060447026  w2: 0.783478305387954 w3: 0.439546216434397
w1: 0.897875953093364 w2: 0.638393654011055 w3: 0.729715519188197
w1: 0.897875953093364 w2: 0.635235101469541 w3: 0.792886570018465
w1: 0.897875953093364 w2: 0.594408140160519 w3: 0.588751763473355
epoch: 5 w1: 0.9 w2: 0.59 w3: 0.59 error: 0.3800738174159
……… epoch readout omitted 
w1: 9.41114740924591 w2: -9.57178533392144 w3: 0.538001948131482
w1: 9.48636747379138 w2: -9.51160928228506 w3: 0.387561819040533
w1: 9.42007593702699 w2: -9.57790081904946 w3: 0.52014489256932
w1: 9.42007593702699 w2: -9.57915632568083 w3: 0.545255025196754
w1: 9.42007593702699 w2: -9.58053573491764 w3: 0.538357979012715

epoch: 500 w1: 9.42 w2: -9.58 w3: 0.54 error: 0.0568526922745829



Our final weights W1, W2 and W3 are 9.42, -9.58 and .54, and a visual progression is displayed below.




The dashed line at the bottom of the image is the loss function over time. The dotted line through the middle is the .5 separation boundary.
RScript
#Load libraries
library(rgl)
library(ggplot2)

#set working directory for images
setwd("C:\\nonlinear")

#Scale function 
scaleRange <- function (x,from_min,from_max,to_min,to_max)
{
x <- (x - from_min) * (to_max - to_min) / (from_max - from_min) + to_min
return(x)
}

#data generator
generateCluster <- function(x,y,variance) {
  x_offset <- rnorm(1,x,variance)
  y_offset <- rnorm(1,y,variance)
  return (c(x_offset,y_offset))
}

logit <- function(x) {
  return (1/(1+exp(-(x))))
}

#Our data frame
dfXY <- data.frame(x=1:4, y=1:4, z=0)

dfXY[1,]$x=10
dfXY[1,]$y=12
dfXY[1,]$z=0

dfXY[2,]$x=10
dfXY[2,]$y=10
dfXY[2,]$z=1

dfXY[3,]$x=20
dfXY[3,]$y=19
dfXY[3,]$z=1

dfXY[4,]$x=20
dfXY[4,]$y=24
dfXY[4,]$z=0

#Init weights
w1 <- 1
w2 <- 1
w3 <- 1
b <- 2

#create a weights frame for reporting
dfWeights <- data.frame(x=w1, y=w2, z=2)

#Parameters
learningRate <- .5
epoch <- 500

#Scale values down to -2 to 2
dfXY$sx <- NULL
dfXY$sx <- scaleRange(dfXY$x,0,40,-2,2)
dfXY$sy <- NULL
dfXY$sy <- scaleRange(dfXY$y,0,40,-2,2)

#Create a graph frame of all mxn
dfCol <- data.frame(ix=1:900, x=0, y=0, col="")

#Begin iterating through frame to set x and y values of graph matrix
xcount <- 1 
ycount <- 1

for (z in 1:nrow(dfCol))
{
  dfCol[z,]$x <- xcount
  dfCol[z,]$y <- ycount
  if (xcount < 30)
  {
  xcount <- xcount+1
  }
  else
  {
  xcount <- 1
  ycount <- ycount+1
  }
}

#Scale the values down
dfCol$sx  <- scaleRange(dfCol$x,0,40,-2,2)
dfCol$sy  <- scaleRange(dfCol$y,0,40,-2,2)
ec <- 0

#Loop through epochs
for (ec in 1:epoch) {

  error <- 0

  ww1 <- 0
  ww2 <- 0
  ww3 <- 0

  #Loop through data set
  for (loopCount in 1:nrow(dfXY)) {

    #Assign scaled values
    x <- dfXY[loopCount,]$sx
    y <- dfXY[loopCount,]$sy
    
    #Set the true/false value
    z <- dfXY[loopCount,]$z

    #Product of weight and data
    w1x   <- w1*x
    w2y   <- w2*y
    w3b   <- w3*b

    ww1 <- 2*(z-logit(w1x+w2y+w3b))*(logit(w1x+w2y+w3b))*(1-(logit(w1x+w2y+w3b)))*x
    ww2 <- 2*(z-logit(w1x+w2y+w3b))*(logit(w1x+w2y+w3b))*(1-(logit(w1x+w2y+w3b)))*y
    ww3 <- 2*(z-logit(w1x+w2y+w3b))*(logit(w1x+w2y+w3b))*(1-(logit(w1x+w2y+w3b)))*b
    
    #Adjust weights
    w1 <- w1+(learningRate*ww1)
    w2 <- w2+(learningRate*ww2)
    w3 <- w3+(learningRate*ww3)
    
    print(paste("w1:",w1,"w2:",w2,"w3:",w3))

    #Measure the errors
    error <- error+abs((z-(1/(1+exp(-(w1x+w2y+w3b)))))^2)
  }
  
  #Add the weight and error of the last epoch
  if (ec %% 1 == 0 && ec > 1)  {
    dfWeights <- rbind(dfWeights,data.frame(x=w1, y=w2, z=error))
  }
  
  #Print out visuals so the evolution can be seen
  if (ec %% 10 == 0 || ec == 5 || ec == 10 || ec == 50 || ec == 100) {
print(paste("epoch:",ec,"w1:",round(w1,2),"w2:",round(w2,2),"w3:",round(w3,2),"error:",error/nrow(dfXY)))

    dfCol$yhat <- logit(dfCol$sx*w1+dfCol$sy*w2+b*w3)
    
    #Map yhat (0-1) onto a 20-step red-to-blue palette in 0.05 bands
    palette20 <- colorRampPalette(c("#cc0000","#2952a3"))(20)
    dfCol$col <- palette20[cut(dfCol$yhat, breaks = seq(0, 1, by = 0.05),
                               labels = FALSE, include.lowest = TRUE)]

    ggplot(data=dfCol) + 
      geom_point(aes(x=dfCol$x, y=dfCol$y), color=dfCol$col, size=10,shape=15) + 
      geom_point(data=dfXY,aes(x=dfXY$x, y=dfXY$y), color=ifelse(dfXY$z==1,"skyblue","red"), size=2) + 
      ggtitle(paste("NN Color Classification - Error:",round(error,2),"epoch:",ec)) + 
      xlab("x")+ylab("y")+xlim(c(0,30))+ylim(c(0,30))+
      geom_line(data=dfCol[dfCol$yhat >= .4 & dfCol$yhat <= .6,],aes(x=dfCol[dfCol$yhat >= .4 & dfCol$yhat <= .6,]$x,y=dfCol[dfCol$yhat >= .4 & dfCol$yhat <= .6,]$y),linetype = "dotted", color="black",alpha=0.5) +
      geom_line(data=dfWeights,aes(scaleRange(1:nrow(dfWeights),0,nrow(dfWeights),2,30),scaleRange(dfWeights$z,0,max(dfWeights$z),2,10)),linetype = "dashed",alpha=0.5) +
      geom_text(data = dfWeights[nrow(dfWeights),], aes(label = round(error,2), x = 30, y = scaleRange(z,0,2,2,30)),color="white") +
      theme_bw() +
      theme(
        plot.background = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        panel.grid = element_blank(),
        axis.ticks.x=element_blank(),
        axis.ticks.y=element_blank(),
        axis.text.x=element_blank(),
        axis.text.y=element_blank()
      )
    
    ggsave(paste0(ec,".jpg"), plot = last_plot(), width = 180, height = 120, limitsize = TRUE, units="mm")

    file.copy(paste0(ec,".jpg"), "current.jpg", overwrite = TRUE )
  }
}

References and Full PDE of loss function 
With assistance from: https://www.symbolab.com/

For a fun live version of this post click here

Full partial derivative working out: 

$$\text{Let }f(wx)={1\over{1+e^{-(W_1 X+W_2 Y+W_3 b)}}}$$
then
$$\text{Error}=(z-f(wx))^2 \text{ and we want } {\partial{Error}\over{\partial{W_1}}}={\partial\over{\partial{W_1}}}(z-f(wx))^2$$
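
The remaining chain rule steps are sketched below, writing u=W_1 X+W_2 Y+W_3 b so that f(wx)=f(u):
$${\partial{Error}\over{\partial{W_1}}}=2.(z-f(u)).{\partial\over{\partial{W_1}}}(z-f(u))=-2.(z-f(u)).f'(u).{\partial{u}\over{\partial{W_1}}}$$
$$f'(u)={e^{-u}\over{(1+e^{-u})^2}}=f(u).(1-f(u)) \text{ and } {\partial{u}\over{\partial{W_1}}}=X$$
$${\partial{Error}\over{\partial{W_1}}}=-2.(z-f(u)).f(u).(1-f(u)).X$$
The adjustment used in step 4 is the negative of this quantity, which is why the code adds 2.(z-f(wx)).f(wx).(1-f(wx)).x to W1 (and the analogous quantities to W2 and W3).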


Tuesday 14 January 2020

Loss Function implementation in NN (part 1)

BOMBE - Computer Sci Museum - Bletchley Park
This post is 1 of 2 posts illustrating how a loss function works with a single neuron neural network. Post 2 will add an additional neuron to make things more interesting.

This is a conceptual implementation of a neural network. It's a small one that does not use matrices or clever optimisation, but it is written to be accessible in terms of programming and mathematics, for learning purposes, to understand the most atomic part of an NN as well as an atomic part of other loss/error functions. All the R code is at the end.

Although facets of NNs are in here, the takeaway is how a loss function and its partial derivatives work, along with their impact on the optimisation algorithm. Loss functions feature in many different machine learning methods, and understanding a loss function will help in understanding a lot of other methods.

Conceptually, think of a neural network as a recursive function, like Newton's method, but using the partial derivatives of a loss function, which must be minimised with respect to the weights. Each pass of the function yields a value by which we increase or decrease the weights, leading to a smaller loss or cost in terms of the error.

For this demonstration, I will be using a basic NN and loss function to computationally solve an OLS regression. I won't make predictions per se, but by solving for the weights, we solve for the ordinary least squares line.

A least squares function is usually found in the following form:
$$ŷ = b_0 + b_1 x$$
Analytically, this can be solved easily using the variances and covariances of x and y. How to do that analytically can be found here.

We can also solve this, and more complicated systems, using a single neuron or a sprawling layered network of them.

Components 
A basic neural network, like many optimisation algorithms, is built with the following steps.

1. Initialise 
2. Compute 
3. Measure
4. Adjust
5. Terminate

These steps are general and informal; more advanced algorithms can branch off at any one of the above steps.

Depending on the design, steps 2 to 4 are executed repeatedly until some conclusion is reached.

For our neural network, we want to find b0 and b1 algorithmically for our least squares line
$$ŷ = b_0 + b_1 x$$

Our starting data set is:
x  y 
1  2
2  1
3  4
4  3

1. Initialize 

We instantiate our weights (W) - these can also be considered coefficients; they act on the inputs.
Weight for the x input variable (w1) = 1
The learning rate (α) = .01 - this controls the rate at which the algorithm tunes the weights.
The bias (b) = 2
The bias weight (w2) = 1

Because our prediction is a continuous value, we are not making use of a sigmoid function; we want the least squares coefficients (b0, b1), which are the equivalent of w1 and w2 in this example. The sigmoid will be covered in part 2.

If we are to visualise the network, it would look something like this:


Except, our activation function - F(X) - is not a logistic function but simply the weighted sum of the data and the bias, which will become clearer further on.

2. Compute 

We fetch our first record – (x,y)(1,2) 
x  y 
1  2
2  1
3  4
4  3

We multiply our first x value by the weight w1 to yield w1x: 1x1=1 => w1x=1 
We multiply our bias value b by the bias weight w2 to yield w2b: 2x1=2 => w2b=2

3. Measure 
We now measure the difference between the actual value of y=2 and the predicted value of this newly instantiated network. 

At present, our network makes predictions using the following formula: 
$$W_1 X + W_2 b$$
(Look familiar? It has the same form as b0 + b1x.)

If we substitute our data along with the newly instantiated weights into the above formula, our model makes the following prediction: 

$$W_1 X + W_2 b => (1)(1) + (1)(2) => 3$$

For it to learn, it needs to tune its predictions through the weights. This is how we work out by how much:
$$(\text{real value} - \text{nn predicted value})^2$$
or
$$(y - ŷ)^2$$
or
$$(y - (W_1 X + W_2 b))^2$$

So if we substitute for our first value (x=1) and our actual value (y=2) into the above, our error is:
$$(y - W_1 X - W_2 b)^2 => ((2) - 3)^2 => (-1)^2 => 1$$

Our error is 1 (2-3=-1)
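
The same computation as a quick R sketch, using the first record and the initial weights:

#First record and the freshly initialised weights
x <- 1; y <- 2
w1 <- 1; w2 <- 1; b <- 2

y_hat <- w1*x + w2*b    # 3
error <- (y - y_hat)^2  # 1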

So the question is, by how much must we change W1 and W2 to reduce our error of 1 to something smaller? 

4. Adjust 

This is where we need to formulate our partial derivative of our error function in terms of each of the weights. 

A derivative is the rate of change of a function. Think f(x)=2x. If x is 1, f(1) will be 2. If x is 2, f(2) will be 4, f(3)=6 and so on. Our rate of change, or the derivative, of f(x)=2x is 2.

A partial derivative is similar to the above, but the derivative is now subject to 2 or more variables. We take the derivative with respect to one parameter (say x) while fixing the remaining parameters (say y). We say we take the derivative of the function f(x,y) in terms of x, because we're interested in the rate of change of x alone while y is treated as a constant. See the graph below to intuit how the rate of change for x varies at each value of y (hence the fixing).


If we want to find the rate of change for x, we have to hold y in place, because the rate of change for x is potentially different for all values of y.  
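
A small concrete example may help; take a function of two variables and differentiate with respect to one of them at a time:
$$f(x,y)=x^2y,\qquad {\partial{f}\over{\partial{x}}}=2xy,\qquad {\partial{f}\over{\partial{y}}}=x^2$$
The rate of change in the x direction depends on which value y is held at, which is exactly why y must be fixed.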

Because our error function has other values in it apart from our value w1 (the weight we want to adjust to improve the error rate) we need to get the partial derivative to compute the adjustment. We will do this next.
$$\text{Error} = (y - W_1 X - W_2 b)^2$$

$${\partial{Error}\over{\partial{W_1}}} = 2(-X)(y - W_1 X - W_2 b)$$
$$= -2X(y - W_1 X - W_2 b)$$

This is our partial derivative and we will use it again to find the weight for the bias.

$${\partial{Error}\over{\partial{W_2}}} = 2(-b)(y - W_1 X - W_2 b)$$
$$= -2b(y - W_1 X - W_2 b)$$

Now we can make our first weight adjustments

$$∇W_1: -2X(y - W_1 X - W_2 b) => -2(1)((2) - (1)(1) - (1)(2)) => -2(-1) => 2$$
$$∇W_2: -2b(y - W_1 X - W_2 b) => -2(2)((2) - (1)(1) - (1)(2)) => -4(-1) => 4$$

The method above falls under stochastic gradient descent – when you hear gradient descent, you can think partial derivatives 

Now that we have our change, we can adjust W1 and W2  using our adjustments and our learning rate  α = .01
So our new weights are:

$$W_1 = W_1 - α∇W_1 => (1) – (.01)(2) = 1-(0.02)=>.98$$
$$W_2 = W_2 - α∇W_2 => (1) – (.01)(4) = 1-(0.04)=>.96$$
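
In R, this first adjustment is just a few lines, sketched below with the values used above:

x <- 1; y <- 2
w1 <- 1; w2 <- 1; b <- 2
alpha <- 0.01

#Partial derivatives of the squared error
grad_w1 <- -2*x*(y - w1*x - w2*b)  # 2
grad_w2 <- -2*b*(y - w1*x - w2*b)  # 4

#Step against the gradient
w1 <- w1 - alpha*grad_w1           # 0.98
w2 <- w2 - alpha*grad_w2           # 0.96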

5. Terminate
A neural network will pass in the next set of record values (x,y) and carry out steps 2, 3 and 4 with the new weights from above (.98 & .96). Once all the data is used, the process will continue from the first record again until all the epochs are completed. An epoch is a single pass of the entire dataset through steps 2 to 4.

Because we’re running 100 epochs, we’re not done and so we go back to step 2 using weights W1 = .98 and W2 = .96

Continuation 

I will do one more record and then provide the read out from an R script: (x,y)=(2,1)

$$∇W_1: -2X(y - W_1 X - W_2 b) => -2(2)((1) - (.98)(2) - (.96)(2)) => -4(-2.88) = 11.52$$
$$∇W_2: -2b(y - W_1 X - W_2 b) => -2(2)((1) - (.98)(2) - (.96)(2)) => -4(-2.88) = 11.52$$

And our new weights are:

$$W_1 = W_1 - α∇W_1 => (.98) – (.01)(11.52) = .98-(0.115) =>.865$$
$$W_2 = W_2 - α∇W_2 => (.96) – (.01)(11.52) = .96-(0.115) =>.845$$

Readout from 1-100 epochs 
x: 1 y: 2 epoch: 1   w1 0.98 : w2 0.96 
x: 2 y: 1 epoch: 1   w1 0.86 : w2 0.84 
x: 3 y: 4 epoch: 1   w1 0.85 : w2 0.83 
x: 4 y: 3 epoch: 1   w1 0.68 : w2 0.75 
x: 1 y: 2 epoch: 2   w1 0.68 : w2 0.74
x: 2 y: 1 epoch: 2   w1 0.61 : w2 0.67 
x: 3 y: 4 epoch: 2   w1 0.66 : w2 0.7 
x: 4 y: 3 epoch: 2   w1 0.57 : w2 0.66 
x: 1 y: 2 epoch: 3   w1 0.58 : w2 0.67 
x: 2 y: 1 epoch: 3   w1 0.52 : w2 0.61
x: 3 y: 4 epoch: 3   w1 0.59 : w2 0.66 
x: 4 y: 3 epoch: 3   w1 0.54 : w2 0.63 
x: 1 y: 2 epoch: 4   w1 0.54 : w2 0.64 
x: 2 y: 1 epoch: 4   w1 0.49 : w2 0.58 
x: 3 y: 4 epoch: 4   w1 0.57 : w2 0.64 
x: 4 y: 3 epoch: 4   w1 0.52 : w2 0.62 
..... 
x: 1 y: 2 epoch: 99  w1 0.57 : w2 0.56 
x: 2 y: 1 epoch: 99  w1 0.52 : w2 0.51 
x: 3 y: 4 epoch: 99  w1 0.6  : w2 0.57 
x: 4 y: 3 epoch: 99  w1 0.56 : w2 0.55 

x: 1 y: 2 epoch: 100 w1 0.57 : w2 0.56
x: 2 y: 1 epoch: 100 w1 0.52 : w2 0.51
x: 3 y: 4 epoch: 100 w1 0.6  : w2 0.57 
x: 4 y: 3 epoch: 100 w1 0.56 : w2 0.54 

Our final weights W1 and W2  are .56 and .54 

If I pass the same data independently through a generic least squares model in R to find the gradient and intercept (done using R's lm() as a shortcut), I find the following:
Coefficients: 
(Intercept)            x   
        1.0          0.6   
W2 corresponds to our intercept of 1.0 (b*W2 = 2*.54 ≈ 1)
And our W1 (.56) corresponds to the x coefficient of .6

Our algorithm increments are visually represented below, and we can see the progression of accuracy per epoch:

If we carry out the same process using 100 data points, 3000 epochs, a learning rate of .00001, and w1 and w2 initialised to -10 and -20, we will see the following:

Our starting point is a line with a steep gradient and an intercept around 1 
Incrementally, the intercept shifts and the gradient changes 




By the end of the last epoch, the line is suitably fitted to the data where the data was generated with an intercept of 50 and a gradient of 1.5 – which is basically the same as below. 

> lm(dfXY, formula = y~.) 
Call: lm(formula = y ~ ., data = dfXY) 
Coefficients: 
(Intercept)            x        
49.658        1.566   

A note on learning rate: the learning rate allows the error function to fall into a minimum and reach the smallest achievable global error. If the learning rate is too high, the weights will cycle back and forth, continually overshooting the minimum and never giving the algorithm an adequate window to converge. The flip-flopping will usually cause the weights to spiral out of control and 'blow up', with one or more weights hitting Inf or NaN.
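
To see the blow-up for yourself, here is a small sketch that re-runs the 4-point example from above with a learning rate of .5 instead of .01; the weights swing back and forth with growing magnitude rather than settling:

dfXY <- data.frame(x = c(1, 2, 3, 4), y = c(2, 1, 4, 3))
w1 <- 1; w2 <- 1; b <- 2
learningRate <- .5   #far too large for this problem

for (ec in 1:10) {
  for (i in 1:nrow(dfXY)) {
    resid <- dfXY$y[i] - w1*dfXY$x[i] - w2*b
    w1 <- w1 - learningRate*(-2*dfXY$x[i]*resid)
    w2 <- w2 - learningRate*(-2*b*resid)
  }
  print(paste("epoch:", ec, "w1:", w1, "w2:", w2))
}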

If the coefficients have converged and it can be visually seen that the line settles in a position that is clearly not a good fit, it means the algorithm has found a local minimum, not the global minimum. This can be corrected using more advanced analytic and programming methods.

A note on weights: in the notes above, the weights are declared as separate variables so all the moving parts are easy to follow. However, in real implementations of NNs, the weights per neuron and the input data are stored in matrices and their dot product is used to create the prediction. This is programmatically neater and generally more computationally efficient.
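
As a small sketch of that idea (not how the script below is written), the predictions for all four records can be produced with a single matrix product:

#Each row holds (x, bias); W holds (w1, w2)
X <- matrix(c(1, 2,
              2, 2,
              3, 2,
              4, 2), ncol = 2, byrow = TRUE)
W <- c(0.56, 0.54)   #the final weights from the run above

y_hat <- X %*% W     #one prediction per row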

RScript
library(ggplot2)  
setwd("C:\\MyImageFolder")  
#Noise generator 
generatePairs <- function(x,b0,b1,corre) 
{   
  err <- sample(rnorm(100,0,corre),1)   
  y <- (  b0 + b1*x + err  )   
  return(y) 
} 
 
#Holding frame 
dfXY <- data.frame(x=1:100, y=1:100) 
 
#Create a noisy scatter plot with an intercept and a gradient 
for (x in 1:100) 
{   
  dfXY[x,]$y <- generatePairs(x,50,1.5,20) 
} 
 
#Set variables 
learningRate <- .00001 
epoch <- 3000 
b0 <- 2 
w1 <- -10 
w2 <- -20 
 
#Loop through epochs 
for (z in 1:epoch) 
{   
  error <- 0 
 
  #loop through data points   
  for (loopCount in 1:nrow(dfXY)) 
  {     
    #assign variables     
    w1x <- w1*dfXY[loopCount,]$x     
    w2b <- w2*b0 
 
    #partial derivative value     
    pdw1x <- -2*dfXY[loopCount,]$x*(dfXY[loopCount,]$y-w1x-w2b)     
    pdw2b <- -2*b0*(dfXY[loopCount,]$y-w1x-w2b) 
 
    #error     
    error <- error +(dfXY[loopCount,]$y-w1x-w2b)^2 
 
    #Reassign weights     
    w1 <- w1-(learningRate*pdw1x)     
    w2 <- w2-(learningRate*pdw2b)   
  } 
 
  #Create plots at x epochs   
  if (z %% 200 == 0 || z == 1 || z == 2 || z == 3 || z == 4 || z == 5 || z == 10 || z == 15 || z == 20 || z == 30 || z == 50 || z == 100) 
  { 
    print(paste("x:",dfXY[loopCount,]$x,"y:",dfXY[loopCount,]$y,"epoch:",z,"w1:",w1,"w2:",w2,"error:",error))
 
    p <- ggplot(data=dfXY, aes(x=x, y=y)) +
      geom_point() +
      geom_line(aes(x=1:nrow(dfXY), y=w2*b0+w1*(1:nrow(dfXY)))) +
      ggtitle(paste("OLS b0 & b1 search - ΣError:",round(error,2),"epoch:",z,"b0:",round(w2*b0,2)," b1:",round(w1,2))) +
      xlim(c(0,220)) +
      ylim(c(0,220))

    ggsave(paste0(z,".jpg"), plot = p, width = 180, height = 120, limitsize = TRUE, units="mm")
 
    #For monitoring progression     
    file.copy(paste0(z,".jpg"), "current.jpg", overwrite = TRUE )   
  } 
}