Sunday 26 August 2018

Least Squares Line in R (or on pen and paper)


I like to know how things work. I am not 100% comfortable in reusing plugins, packs or technologies without at least drilling to some depth into the underlying supporting knowledge. I do this so I can maybe understand it's parameters and how the component behaves. Ever wondered how a simple linear regression line in Excel or R works? This post will cover a simpler version of a linear regression model.

Least Squares Method

The Least Squares approach is a basic linear regression method for 2 variables that visually represents their relationship. This model will tell us how strongly 2 variables are coupled and approximately by how much one variable (y) changes in reaction to the amount of change in the other variable (x). Usually, in a variable pair, one variable is considered a dependent (y) while the other is considered independent (x). The method below can be done on paper for small data sets, but usually processing will be done in R or Excel for larger sets.

To build a least square line, I need to have some data. To keep in line with an earlier post, I'll stick to SA crime and economic data. I will try to see if there is a relationship between common burglary and the country’s gross domestic product (GDP). These 2 variables are complex outputs of 2 very broad processes - so don't look too deeply into the findings.

I found this data on tradingeconomics.com and the SAPS statistics resources page and manually plugged it together as shown below.

Year,      x  ,y
2005-01-01,247,25351
2006-01-01,261,25148
2007-01-01,287,22456
2008-01-01,297,20410
2009-01-01,375,19842
2010-01-01,416,18007
2011-01-01,396,15826
2012-01-01,366,15404
2013-01-01,350,15579
2014-01-01,317,17379
2015-01-01,295,18051
2016-01-01,349,17367

x is GDP in billions of dollars. y is the number of burglaries recorded during the financial year.
The data on a scatter plot is below:




R Script
#Load data into string
strCrimeGDP <- "
Year,      x  ,y
2005-01-01,247,25351
2006-01-01,261,25148
2007-01-01,287,22456
2008-01-01,297,20410
2009-01-01,375,19842
2010-01-01,416,18007
2011-01-01,396,15826
2012-01-01,366,15404
2013-01-01,350,15579
2014-01-01,317,17379
2015-01-01,295,18051
2016-01-01,349,17367"  

#Create a dataframe
dfGDPC <- read.delim(textConnection(strCrimeGDP),header=TRUE,sep=",",strip.white=TRUE)
dfGDPC$label <- paste0("(",dfGDPC$x,",",dfGDPC$y,")")

ggplot(data=dfGDPC, aes(x=x, y=y)) +
    geom_point(color='darkblue') +
    geom_text(aes(label=label),hjust=0, vjust=-1) +
    ggtitle(paste0("SA Common Burglary and GDP")) +
    ylab("Common Burglary (y)") +
    xlab("GDP (x)")

Now, I need the mean, variance and standard deviations of both X and Y followed by their covariance.

Mean x
`\bar{x} = \frac{ sum_(i=1)^n x }{n} = \frac{ 3956 }{12} = 329.66`


Mean y
`\bar{y} = \frac{ sum_(i=1)^n y }{n} = \frac{ 230820 }{12} = 19235`


x Variance
`Sx^2 = 1/(n-1)(sum_(i=1)^n x^2 - (sum_(i=1)^n x)^2/n)=1/(12-1)(1335976 - (3956)^2/12) = 2892.24`


x Standard deviation
`Sx = \sqrt{Sx^2} = \sqrt{2892.24} = 53.78`


y Variance
`Sy^2 = 1/(n-1)(sum_(i=1)^n y^2 - (sum_(i=1)^n y)^2/n)=1/(12-1)(4573823818 - (230820)^2/12) = 12181920`


y Standard deviation
`Sy = \sqrt{Sy^2} = \sqrt{ 12181920 } = 3490.261`


Covariance
`Sxy = 1/(n-1)(sum_(i=1)^n xy - ((sum_(i=1)^n x)(sum_(i=1)^n y))/n)=1/(12-1)( 74516510 - ((3956)( 230820))/12) = -143377.3`

With covariance being negative, this tells us that when x increases (GDP), y decreases (Burglaries)
With the values in hand, I substitute these values into the least squares formula:

`\hat{y} = b_0 + b_1x`

where 
`b_1 = \frac{Sxy}{Sx2}`

`b_0 = \bary - b_1\barx`

`x = {sum_(i=1)^n x}/n`


Substitute
`b_1 = frac{Sxy}{Sx^2} = frac{-143377.3}{2892.24} = -49.57`
and
`b_0 = \bary-b_1\barx = 19235 - -49.57(329.66) = 35576.25`

The line is now
`\hat{y} = 35576.25 -49.57\bar{x}`

We can now plug values in for X to find Y and draw a line. I’ve picked `x_1` = 247 and `x_2` = 416. Substituting these into the formula, I get `y_1` = 23332.46 and `y_2` = 14955.13

With the line, the scatter plot now looks like this


R Script

dfPoints <- data.frame(x=c(247,416),y=c(23332.46,14955.13),label=c("(247,23332)","(416,14955)"))
ggplot(data=dfGDPC, aes(x=x, y=y)) + 
    geom_point(color='darkblue') + 
    geom_text(aes(label=label),hjust=0, vjust=-1) +
    geom_point(data=dfPoints, aes(x=x, y=y),color='red',size=2) + 
    geom_text(data=dfPoints, aes(label=label),hjust=0, vjust=-1) +
    geom_line(data=dfPoints,aes(x=x,y=y))
    ggtitle(paste0("SA Common Burglary ~ GDP")) +
    ylab("Common Burglary (y)") +
    xlab("GDP (x)")

Now to tie the whole thing up from beginning to end.

R Script

#Load data into string
strCrimeGDP <- "
Year,      x  ,y
2005-01-01,247,25351
2006-01-01,261,25148
2007-01-01,287,22456
2008-01-01,297,20410
2009-01-01,375,19842
2010-01-01,416,18007
2011-01-01,396,15826
2012-01-01,366,15404
2013-01-01,350,15579
2014-01-01,317,17379
2015-01-01,295,18051
2016-01-01,349,17367"   

#Create a dataframe
dfGDPC <- read.delim(textConnection(strCrimeGDP),header=TRUE,sep=",",strip.white=TRUE)
dfGDPC$label <- paste0("(",dfGDPC$x,",",dfGDPC$y,")")

#Create a dataframe
dfGDPC <- read.delim(textConnection(strCrimeGDP),header=TRUE,sep=",",strip.white=TRUE)
dfGDPC$label <- paste0("(",dfGDPC$x,",",dfGDPC$y,")")

#Compute the mean and standard deviation for X and Y
MeanX <- mean(dfGDPC$x)
MeanY <- mean(dfGDPC$y)
StdX <- sd(dfGDPC$x)
StdY <- sd(dfGDPC$y)

#Compute the variance of X
varX <- sum((dfGDPC$x-MeanX)^2)/(nrow(dfGDPC)-1)

#Compute the covariance of X and Y
covXY <- (1/(nrow(dfGDPC)-1))*(sum(dfGDPC$x*dfGDPC$y) - ( (sum(dfGDPC$x)*sum(dfGDPC$y))/(nrow(dfGDPC))   ))

#Compute the slope of the regression line or m of y = mx + c
slope <- covXY/varX

#Compute the intercept of y = mx + c
intercept <- MeanY-(slope*MeanX)

#Compute the coefficient of correlation between X and Y (r)
CoefCor <- covXY/(StdX*StdY)

#Compute the coefficient of determination between X and Y (r^2)
CoefDet <- CoefCor^2

#Load the graphic
library(ggplot2)

dfPoints <- data.frame(x=c(247,416),y=c(23332.46,14955.13),label=c("(247,23332)","(416,14955)"))

plot <- ggplot(data=dfGDPC, aes(x=x, y=y)) + 
        geom_point(color='darkblue') + 
 geom_text(aes(label=label),hjust=0, vjust=-1) +
 geom_point(data=dfPoints, aes(x=x, y=y),color='red',size=2) + 
 geom_text(data=dfPoints, aes(label=label),hjust=0, vjust=-1) +
 geom_abline(intercept = intercept, slope = slope, color="red", linetype="dashed", size=0.5) + 
 ggtitle(paste0("SA Common Burglary ~ GDP : \u0176 = b0 + b1x\u0304 / Coef Correlation : ", round(CoefCor,2)," / Coeff Determination:",round(CoefDet,2))) +
 ylab("Common Burglary (y)") +
 xlab("GDP (x)")
  
plot



Now, we don’t really have to do so many calculations from the ground up. It can all be done automatically for you in GGPLOT using an inbuilt linear regression model - lm through geom_smooth. Thats right, there is no need to calculate covariance or the slope and intercept. All one has to do is plug the data into GGPLOT and invoke geom_smooth with method=lm

R Script
ggplot(dfGDPC,aes(x,y))+
       geom_point(color='darkblue') +
       geom_text(aes(label=label),hjust=0, vjust=-1) +
       geom_point(data=dfPoints, aes(x=x, y=y),color='red',size=2) +
       geom_text(data=dfPoints, aes(label=label),hjust=0, vjust=-1) +
       geom_smooth(method='lm',formula=y~x) +
       ggtitle(paste0("SA Common Burglary ~ GDP")) +
       ylab("Common Burglary (y)") +
       xlab("GDP (x)")




The linear regression line cuts through the red points that were calculated manually, meaning the calculations are accurate.

So does GDP impact the amount of common burglaries? Possibly, to some degree. Again, these 2 measurements are extracted from 2 very general and very complicated processes. SA is effectively 2 countries in 1 - a large formal and informal sector. Being so, the the relationship could be incidental.

To play devils advocate, the measurement for common burglaries may be subject to a number of issues - it is widely understood that many people don't bother reporting common burglaries so the number of reported incidents may be dwindling in relation to an increasing GDP reinforcing the inverse relationship. In reality, the true number of burglaries could be increasing along with GDP but we won't see it owing to under reporting. As for GDP, these increases might be increasing owing to an increase in government spending - not necessarily job creation, one of the antidotes of unemployment and crime. Government spending makes up ~50% of SA GDP and has been steadily increasing for the last decade using borrowed money. This crowds out the private sector, the sector which creates more sustainable employment opportunities than the public sector. With rampant corruption, these borrowed amounts (pushing up our GDP) will not be efficiently discharged into the economy and will not have the swaying impact one would imagine on unemployment.