Correlation and Regression


Introduction

The data below give the marks obtained by 10 students who took a Math test and a Computer test.

Students A B C D E F G H I J
X Marks (Math) 15 18 3 24 9 6 6 15 12 12
Y Marks (Computer) 10 15 1 13 13 6 2 11 13 16

Is there a connection between the marks obtained by the 10 students in the Math and Computer tests? A starting point is to plot the marks of both subjects in a scatter diagram.

Now calculating the means, we get
\(\bar{X}=\frac{120}{10}=12\)
\(\bar{Y}=\frac{100}{10}=10\)

Using these means to divide the graph into four quadrants, it is clear that the bottom-right and top-left areas are largely vacant, so the points tend to run from the bottom left to the top right.
In this example, most of the points lie in the 1st and 3rd quadrants, where the product
\((x-\bar{X})(y-\bar{Y})\)
is positive.

The problem is to find a way to measure how strong this tendency is. To answer this question, we proceed further.
Here
\(\frac{x-\bar{x}}{s_x}\)
gives the normalized distance of each x from \(\bar{x}\), making it unit-free.

Also
\(\frac{y-\bar{y}}{s_y}\)
gives the normalized distance of each y from \(\bar{y}\), making it unit-free.

So
\(\frac{1}{n} \displaystyle \sum \left (\frac{x-\bar{x}}{s_x} \right ) \left ( \frac{y-\bar{y}}{s_y} \right )\)
gives the normalized product moment, which is the value of the correlation.


The value of correlation (r) gives a measure of how close the points are to lying on a straight line.

  1. r = 1 indicates that all the points lie exactly on a straight line with positive gradient
  2. r = -1 gives the same information for a line with negative gradient
  3. r = 0 tells us that there is no linear connection between the two sets of data
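As a concrete check, the normalized product moment can be computed for the ten students above. This is a small sketch in Python, using the population standard deviations (divide by n) implied by the \(\frac{1}{n}\) formula.

```python
# Pearson's r for the ten students via the normalized product moment.
X = [15, 18, 3, 24, 9, 6, 6, 15, 12, 12]
Y = [10, 15, 1, 13, 13, 6, 2, 11, 13, 16]

n = len(X)
mx, my = sum(X) / n, sum(Y) / n                  # 12 and 10
sx = (sum((x - mx) ** 2 for x in X) / n) ** 0.5  # 6.0
sy = (sum((y - my) ** 2 for y in Y) / n) ** 0.5  # 5.0

r = sum((x - mx) / sx * ((y - my) / sy) for x, y in zip(X, Y)) / n
print(round(r, 2))  # 0.71
```

The value 0.71 confirms the fairly strong upward tendency seen in the scatter diagram.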

The illustration shows that quantifying the relationship between variables, called correlation, is essential to benefit from the study of that relationship. There are two basic methods of measuring correlation: the graphical method and the algebraic method.




Scatter Diagram: Graphic method

The scatter diagram is the graphic method of measuring correlation. It is a diagrammatic representation of bivariate data used to ascertain the relationship between two variables. Under this method the given data are plotted on graph paper. Once the values are plotted, the graph reveals the type of correlation between variables X and Y. Note, however, that the correlation is affected by every point.

[Scatter diagrams: Strong Positive Correlation, Low Positive Correlation, No Correlation, Low Negative Correlation, Strong Negative Correlation]



Correlation

Correlation is a technique to measure the strength of association between two variables, say X and Y. The intensity of the correlation is expressed by a number, called the coefficient of correlation, denoted by r. The value of the correlation lies between -1 and 1 (inclusive).

  1. The coefficient of correlation was first introduced by Galton (1886)
  2. Formalized by Karl Pearson (1896)
  3. Developed and extended by Fisher (1935)
  4. The main idea is to compute an index (number) which reflects how closely two variables are related to each other.
  5. If two variables are related such that both increase or both decrease together, the correlation is positive
  6. If an increase in one variable is associated with a decrease in the other variable, the correlation is negative



Types of Correlation coefficient
There are two main types of correlation coefficients: Pearson's product moment correlation coefficient and Spearman's rank correlation coefficient. The correct choice of coefficient depends on the types of variables being studied. The different types of correlation coefficient are given in the table below.

             | Quantitative   | Ordinal       | Nominal
Quantitative | Pearson's      | Biserial      | Point Biserial
Ordinal      | Biserial       | Spearman rho  | Rank Biserial
Nominal      | Point Biserial | Rank Biserial | Phi



Interpretation of correlation coefficients

Generally, the coefficient of correlation is positive, negative, or zero. If the correlation is positive, the variables are related such that both increase or both decrease together. If the correlation is negative, an increase in one variable is associated with a decrease in the other, and vice versa. If the correlation is zero, the variables are not linearly related. In addition, the magnitude of the correlation is interpreted as follows.
Correlation coefficients whose magnitude \(|r|\) lies

between 0.8 and 1.0 is very high (perfect) correlation
between 0.6 and 0.8 is high correlation
between 0.4 and 0.6 is moderate correlation
between 0.2 and 0.4 is low correlation
between 0.0 and 0.2 is very low (no) correlation
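The verbal scale above can be captured in a small helper; `interpret_r` is an illustrative name, and assigning boundary values to the higher band is an assumption.

```python
def interpret_r(r):
    """Map the magnitude of r to the verbal scale above (boundaries go to the higher band)."""
    m = abs(r)
    if m >= 0.8:
        return "very high"
    if m >= 0.6:
        return "high"
    if m >= 0.4:
        return "moderate"
    if m >= 0.2:
        return "low"
    return "very low"

print(interpret_r(0.71))   # high
print(interpret_r(-0.35))  # low
```

Note that the sign of r is ignored here: the scale grades only the strength of the association, not its direction.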



Coefficient of determination

The correlation coefficient, measuring the linear relationship between two variables, indicates the amount of variation in one variable accounted for by the other. A better measure for this purpose is provided by the square of the correlation coefficient, known as the “coefficient of determination”, which can be interpreted as the ratio of the explained variance to the total variance:
\(r^2 =\frac{\text{explained variance}}{\text{total variance}}\)
Similarly, the coefficient of non-determination is
\(1-r^2 \)
Thus
The square of the correlation coefficient is called the coefficient of determination. If r is obtained between two variables \(X\) and \(Y\), then \(r^2\) is the fraction of the variation in \(Y\) that is explained by \(X\).

For example, if the correlation between “Math score” and “Anxiety” is \(r=-0.4\), then \(r^2=0.16\), meaning 16% of the variability in Math score and Anxiety “overlaps”, in opposite directions.
In other words, the variation in Math score explains 16% of the variation in Anxiety score. The remaining 84% of the variation in Anxiety score is explained by other variables not included in the model.
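The arithmetic of this example can be checked directly:

```python
r = -0.4
r2 = r ** 2              # coefficient of determination: explained share
print(round(r2, 2))      # 0.16 -> 16% of the variation is shared
print(round(1 - r2, 2))  # 0.84 -> coefficient of non-determination
```

Squaring removes the sign, so a correlation of +0.4 would give the same 16% of explained variation.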




Properties of coefficient of Correlation

As correlation measures the strength of association between two variables, the major properties of the correlation coefficient can be summarized in the following points:

  1. The correlation coefficient lies between \(-1\) and \(+1\)
  2. Independent from unit of measurement
  3. Independent of origin and scale
  4. Symmetrical i.e., \(r_{xy} = r_{yx}\)



Limitation of Correlation

A key thing to remember is that correlation does not mean that a change in one variable causes a change in the other. Sales of personal computers and athletic shoes have both risen strongly over the years and there is a high correlation between them, but it cannot be assumed that buying computers causes people to buy athletic shoes (or vice versa).

The second caution is that the Pearson correlation technique works best with linear relationships: as one variable gets larger, the other gets larger (or smaller) in direct proportion. It does not work well with curvilinear relationships (in which the relationship does not follow a straight line). An example of a curvilinear relationship is age and health care. They are related, but the relationship doesn't follow a straight line. Young children and older people both tend to use much more health care than teenagers or young adults.

  1. r is a measure of linear relationship only. There may be an exact connection between X and Y, but if it is not a straight line, r is of no help.
  2. Correlation does not imply causality. A survey may show a strong correlation between left-footedness and mental mathematics ability, yet neither causes the other.
  3. A single unusual (freak) result may have a strong effect on the value of r.

Pearson's Product Moment Correlation: Algebraic Method

Karl Pearson’s method of calculating the coefficient of correlation is based on the covariance of the two variables in a series. This method is widely used in practice, and the coefficient is denoted by the symbol \(r\). It is used when both variables being studied are normally distributed and quantitative in scale. For a correlation between variables \(X\) and \(Y\), the formula for calculating Pearson's correlation coefficient is given by

Variance method: \( r=\frac{Cov(X,Y)}{\sigma_x \sigma_y}\), where \(Cov(X,Y) =\frac{1}{n} \sum (x-\bar{x})(y-\bar{y})\)

Deviation method: \( r=\frac{ \sum xy}{\sqrt{\sum x^2} \sqrt{\sum y^2}}\), where \(x=X-\bar{X}\) and \(y=Y-\bar{Y}\)

Raw-score method: \( r=\frac{n \sum XY-\sum X \sum Y}{\sqrt{n\sum X^2-(\sum X)^2} \sqrt{n\sum Y^2-(\sum Y)^2}}\)
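The three forms are algebraically equivalent. A quick sketch on a toy dataset (the data here are an assumption, not from the text) confirms they return the same r.

```python
X = [1, 2, 3, 4, 5]
Y = [2, 1, 4, 3, 5]
n = len(X)
mx, my = sum(X) / n, sum(Y) / n

# Variance method: Cov(X, Y) / (sigma_x * sigma_y)
cov = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / n
sx = (sum((x - mx) ** 2 for x in X) / n) ** 0.5
sy = (sum((y - my) ** 2 for y in Y) / n) ** 0.5
r1 = cov / (sx * sy)

# Deviation method, with x = X - mean(X), y = Y - mean(Y)
dx = [x - mx for x in X]
dy = [y - my for y in Y]
r2 = sum(a * b for a, b in zip(dx, dy)) / (
    sum(a * a for a in dx) ** 0.5 * sum(b * b for b in dy) ** 0.5)

# Raw-score method, straight from the sums
r3 = (n * sum(x * y for x, y in zip(X, Y)) - sum(X) * sum(Y)) / (
    (n * sum(x * x for x in X) - sum(X) ** 2) ** 0.5
    * (n * sum(y * y for y in Y) - sum(Y) ** 2) ** 0.5)

print(round(r1, 4), round(r2, 4), round(r3, 4))  # all three identical
```

In hand calculations the raw-score method is usually preferred because it avoids working with fractional deviations.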




Example 1
Calculate the correlation coefficient of the marks in Mathematics and Statistics for eight students given below.
Marks in Math (X) 67 68 65 68 72 72 69 71
Marks in Stat (Y) 65 66 67 67 68 69 70 72
Solution
Based on the data given above, we can simplify the calculation by subtracting \(65\) from each value of both X and Y (the deviations are again denoted X and Y in the table below). The table of calculation is given below.
X Y X Y \(X^2\) \(Y^2\) XY
67 65 2 0 4 0 0
68 66 3 1 9 1 3
65 67 0 2 0 4 0
68 67 3 2 9 4 6
72 68 7 3 49 9 21
72 69 7 4 49 16 28
69 70 4 5 16 25 20
71 72 6 7 36 49 42
\(\sum X=32\) \(\sum Y=24\) \(\sum X^2=172\) \(\sum Y^2=108\) \(\sum XY=120\)

Now, using the raw-score formula, the correlation coefficient is
\( r=\frac{N \sum XY-\sum X \sum Y}{\sqrt{N\sum X^2-(\sum X)^2} \sqrt{N\sum Y^2-(\sum Y)^2}}\)
or \( r=\frac{8 \times 120-32 \times 24}{\sqrt{8 \times 172-(32)^2} \sqrt{8 \times 108-(24)^2}}=0.60\)
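Example 1 can be verified on the original (unshifted) marks, since subtracting a constant changes neither the covariance nor the variances:

```python
# Verifying Example 1 with the raw-score formula on the original marks.
X = [67, 68, 65, 68, 72, 72, 69, 71]
Y = [65, 66, 67, 67, 68, 69, 70, 72]
n = len(X)

num = n * sum(x * y for x, y in zip(X, Y)) - sum(X) * sum(Y)
den = ((n * sum(x * x for x in X) - sum(X) ** 2) ** 0.5
       * (n * sum(y * y for y in Y) - sum(Y) ** 2) ** 0.5)
print(round(num / den, 2))  # 0.6
```

The same 0.60 comes out whether we use the raw marks or the deviations from 65, which is exactly why the shift was allowed in the worked solution.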




Spearman’s Rank Correlation

When quantification of variables is difficult, such as beauty, leadership ability, or knowledge of a person, the method of rank correlation is useful. It was developed by the British psychologist Charles Edward Spearman in 1904. In this method ranks are allotted to each element in either ascending or descending order. The correlation coefficient between these two series of ranks is popularly called “Spearman’s Rank Correlation” and denoted by \(\rho\). It is appropriate when one or both variables are skewed or ordinal in scale. For a correlation between variables \(X\) and \(Y\), the formula for calculating Spearman's rho is
\(\rho=1-\frac{6\displaystyle \sum_{i=1}^n d_i^2}{n(n^2-1)}\)
where
\(d_i\)= the difference between the ranks of corresponding variables
\(n=\) number of observations

NOTE
If there are tied ranks, we give each tied value the mean of the ranks it would have received if the values were not tied. In this case, we use the formula below.
\(\rho=1-\frac{6 \left [\displaystyle \sum_{i=1}^n d_i^2+ \sum_k \frac{m_k(m_k^2-1)}{12}\right ]}{n(n^2-1)}\)
where
\(m_k=\) the number of times the \(k\)-th tied value is repeated
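A sketch of the tied-ranks procedure (mean ranks plus the correction term). The helper names `ranks`, `tie_term`, and `spearman_tied` are illustrative, not a standard API.

```python
from collections import Counter

def ranks(values):
    """Ranks with 1 = smallest; tied values receive the mean of their ranks."""
    first = {}
    for pos, v in enumerate(sorted(values), start=1):
        first.setdefault(v, pos)              # first position of each value
    m = Counter(values)
    # a value repeated m times starting at position p gets rank p + (m - 1)/2
    return [first[v] + (m[v] - 1) / 2 for v in values]

def tie_term(values):
    """Correction sum m_k(m_k^2 - 1)/12 over each group of tied values."""
    return sum(m * (m * m - 1) / 12 for m in Counter(values).values() if m > 1)

def spearman_tied(X, Y):
    n = len(X)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(X), ranks(Y)))
    return 1 - 6 * (d2 + tie_term(X) + tie_term(Y)) / (n * (n * n - 1))

print(spearman_tied([1, 2, 2, 4], [1, 2, 3, 4]))  # 0.9
```

With no ties, both correction terms vanish and the function reduces to the plain formula above.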




Proof of \(\rho=1-\frac{6\displaystyle \sum_{i=1}^n d_i^2}{n(n^2-1)}\)

Consider a bivariate sample \((x_i,y_i)\) for \(i=1,2, \cdots,n\), and replace each \(x_i\) and \(y_i\) by its rank; each rank series is then a permutation of the same sequence of numbers \(1,2,3,\cdots,n\).
Thus
\(\bar{x}=\frac{\displaystyle \sum_{i=1}^n i}{n}\)
or\(\bar{x}= \frac{1+2+\cdots+n}{n}\)
or\(\bar{x}=\frac{n+1}{2}\)

Similarly,
\(s_x^2=\frac{1}{n} \displaystyle \sum_{i=1}^n (i^2)-(\bar{x})^2\)

or\(s_x^2=\frac{1}{n}\frac{n(n+1)(2n+1)}{6}-(\frac{n+1}{2})^2\)

or\(s_x^2=\frac{(n+1)(2n+1)}{6}-(\frac{n+1}{2})^2\)

or\(s_x^2=(\frac{n+1}{2}) \left [\frac{(2n+1)}{3}-\frac{n+1}{2} \right ]\)

or\(s_x^2=(\frac{n+1}{2})[\frac{n-1}{6}] \)

or\(s_x^2=\frac{n^2-1}{12}\)

So, we have
\(\bar{x}=\bar{y}=\frac{n+1}{2}\)
\(s_x^2=s_y^2=\frac{n^2-1}{12}\)
Next, since \(\bar{x}=\bar{y}\), consider the rank difference
\(d_i= x_i-y_i=(x_i-\bar{x})-(y_i-\bar{y})\)
Therefore
\(\displaystyle \frac{1}{n} \sum_{i=1}^n d_i^2= \frac{1}{n} \sum_{i=1}^n [(x_i-\bar{x})-(y_i-\bar{y})]^2\)

or\(\displaystyle \frac{1}{n}\sum_{i=1}^n d_i^2= s_x^2+s_y^2-2.r. s_xs_y\)

or\(\displaystyle \frac{1}{n}\sum_{i=1}^n d_i^2= 2s_x^2-2.r. s_x^2\)

or\(\displaystyle \frac{1}{n}\sum_{i=1}^n d_i^2= 2s_x^2(1-r)\)

or\(\displaystyle \frac{1}{n}\sum_{i=1}^n d_i^2= 2 \frac{n^2-1}{12}. (1-r)\)

or\(\displaystyle \frac{1}{n}\sum_{i=1}^n d_i^2= \frac{n^2-1}{6}. (1-r)\)

or\( \frac{ \displaystyle 6\sum_{i=1}^n d_i^2}{n(n^2-1)}= (1-r)\)

or\(r=1- \frac{ \displaystyle 6\sum_{i=1}^n d_i^2}{n(n^2-1)}\)
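The derivation can be checked numerically: for untied ranks, the \(d^2\) formula and Pearson's r computed directly on the ranks coincide (here using the ranks that appear in Example 2).

```python
# Numerical check: 1 - 6*sum(d^2)/(n(n^2-1)) equals Pearson's r on the ranks.
rx = [6, 7, 5, 3, 4, 2, 1]   # ranks of X in Example 2
ry = [7, 5, 6, 4, 1, 3, 2]   # ranks of Y in Example 2
n = len(rx)

rho_formula = 1 - 6 * sum((a - b) ** 2 for a, b in zip(rx, ry)) / (n * (n**2 - 1))

mean = (n + 1) / 2           # both rank series have mean (n+1)/2
var = (n**2 - 1) / 12        # and variance (n^2-1)/12, as derived above
cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry)) / n
rho_pearson = cov / var

print(round(rho_formula, 4), round(rho_pearson, 4))  # identical values
```

This is the content of the proof: Spearman's rho is simply Pearson's r applied to the ranks.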




Example 2
Calculate the rank correlation between the scores in Mathematics and IQ for seven students given below.
Score in Math (X)52 51 53 55 54 56 57
Score in IQ (Y)61 63 62 64 67 65 66

Based on the data given above, we can simplify the calculation by subtracting \(50\) from each value of \(X\) and \(60\) from each value of \(Y\); this shift does not change the ranks.
Now, the table of calculation is given below.

X Y X-50 Y-60 Rank of X Rank of Y d: Rx-Ry Square of rank difference: \(d^2\)
52 61 2 1 6 7 -1 1
51 63 1 3 7 5 2 4
53 62 3 2 5 6 -1 1
55 64 5 4 3 4 -1 1
54 67 4 7 4 1 3 9
56 65 6 5 2 3 -1 1
57 66 7 6 1 2 -1 1
\(\sum d^2=18\)

Now, using the formula, the correlation coefficient is
\(\rho=1-\frac{6\displaystyle \sum_{i=1}^n d_i^2}{n(n^2-1)}\)
or \(\rho=1-\frac{6 \times 18}{7(7^2-1)}=0.68\)
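A short script reproduces Example 2 end to end, ranking both score lists with 1 for the highest value, as in the table:

```python
# Recomputing Example 2 from the raw scores.
X = [52, 51, 53, 55, 54, 56, 57]
Y = [61, 63, 62, 64, 67, 65, 66]

def desc_ranks(v):
    """Descending ranks (1 = highest); fine here because there are no ties."""
    order = sorted(v, reverse=True)
    return [order.index(x) + 1 for x in v]

rx, ry = desc_ranks(X), desc_ranks(Y)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
n = len(X)
rho = 1 - 6 * d2 / (n * (n * n - 1))
print(d2, round(rho, 2))  # 18 0.68
```

Ranking in ascending order instead would flip both rank series and leave \(\rho\) unchanged.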




Regression equations of two variables

Regression analysis is a statistical tool to estimate (or predict) the unknown values of dependent variable from the known values of independent variable.
The variable that forms the basis for predicting another variable is known as the Independent Variable and the variable that is predicted is known as dependent variable.
For Example,
in \(Y=a+bX\)
one can obtain value of \(Y\) by putting the value of \(X\)
So,
X is called independent variable
Y is called dependent variable
Therefore
Regression is a technique to measure “dependence of one variable upon other variable”.




Regression Equation

Let X be the independent variable and Y the dependent variable; then the regression equation of Y on X is
\(Y=a +b X\) (1)
where \(a\) and \(b\) are constants:
\(a\) represents the y-intercept
\(b\) represents the slope of the line
To better understand this, compare \(Y=b X+a\) with \(Y=m X+c\); then it can be said that
\(a=c\) represents the y-intercept
\(b=m\) represents the slope of the line
To compute the values of the constants \(a\) and \(b\), the corresponding normal equations are given below.
Normal equation for \(a\) [taking the sum of both sides of (1)]:
\( \sum Y=na+b \sum X \) (2)
Normal equation for \(b\) [multiplying equation (1) by X and summing both sides]:
\( \sum XY=a \sum X+ b \sum X^2\) (3)
Solving (2) and (3) to find ‘a’ and ‘b’ , we get
\( b =\frac{\sum XY -\frac{\sum X \sum Y}{n}}{\sum X^2-\frac{(\sum X)^2}{n}}\)

or\( b =\frac{\sum XY -n \bar{X}\bar{Y}}{\sum X^2-n \bar{X}^2} \)

And
\( a =\bar{Y}-b \bar{X}\)
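A minimal sketch of these solved formulas (with \(a=\bar{Y}-b\bar{X}\)); `fit_line` is a hypothetical helper name.

```python
def fit_line(X, Y):
    """Least-squares fit Y = a + bX via the solved normal equations."""
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    b = (sum(x * y for x, y in zip(X, Y)) - n * mx * my) / (
        sum(x * x for x in X) - n * mx * mx)
    a = my - b * mx            # a = Ybar - b * Xbar
    return a, b

a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])  # data lying exactly on Y = 1 + 2X
print(a, b)
```

Because the sample points here lie exactly on a line, the fit recovers the intercept 1 and slope 2 without residual error.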




The Method of Least Square
Let us consider a data set given as
X: 4 8 12
Y: 8 1 6
Let us plot these data in a graph, and try to estimate a best fit line. The two possibilities are given below.
From the two possible estimating lines, summing the errors of the two estimations, we get:
First Graph:
8 - 6 = 2
1 - 5 = -4
6 - 4 = 2
Error = 0
Second Graph:
8 - 2 = 6
1 - 5 = -4
6 - 8 = -2
Error = 0

This shows that summing the individual differences is NOT a reliable way to judge the goodness of fit of an estimating line.
Therefore
we proceed with the absolute value of each error to judge which estimating line fits best, using a new example. The result is as follows.

From the two possible estimating lines, summing the absolute errors of the two estimations, we get:
First Graph:
|4 - 4| = 0
|7 - 3| = 4
|2 - 2| = 0
Error = 4
Second Graph:
|4 - 5| = 1
|7 - 4| = 3
|2 - 3| = 1
Error = 5

This shows that summing the absolute values of the individual differences is also NOT a reliable way to judge the goodness of fit of an estimating line, because:
according to the data, Graph 2 has the better fit, but
the absolute-value criterion says Graph 1 has the better fit.
Therefore
we proceed by summing the square of each error to judge which estimating line fits best. This is called the least squares method. The result is as follows.

First Graph:
(4 - 4)² = 0
(7 - 3)² = 16
(2 - 2)² = 0
Error = 16
Second Graph:
(4 - 5)² = 1
(7 - 4)² = 9
(2 - 3)² = 1
Error = 11

This shows that summing the squares of the individual differences IS a reliable way to judge the goodness of fit of an estimating line, because:
according to the data, Graph 2 has the better fit, and
the squared-error criterion also says Graph 2 has the better fit.
Therefore
the least squares method is the best way to judge the goodness of fit of an estimating line.
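The three error criteria can be compared side by side; a sketch using the observed values 4, 7, 2 and the two lines' estimates from the tables above:

```python
# Comparing the two candidate lines under three error rules.
observed = [4, 7, 2]
fit1 = [4, 3, 2]   # first graph's estimates
fit2 = [5, 4, 3]   # second graph's estimates

def errors(obs, fit):
    """Return (raw sum, absolute sum, squared sum) of the residuals."""
    e = [o - f for o, f in zip(obs, fit)]
    return sum(e), sum(abs(x) for x in e), sum(x * x for x in e)

print(errors(observed, fit1))  # (4, 4, 16)
print(errors(observed, fit2))  # (1, 5, 11)
```

Only the squared-error column (16 vs 11) picks the second line, matching the conclusion above; it also penalizes one large miss more heavily than several small ones, which is why least squares is preferred.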




Coefficient of Regression equation

Let X be the independent variable and Y the dependent variable; then the regression equation of Y on X is
\(Y=a +b X\) (1)
The quantity \(a\) in the regression equation (1) is called the y-intercept or origin (threshold) coefficient. Here, \(a\) is the average value of Y when X is zero.
Next, the quantity \(b\) in the regression equation (1) is called the slope coefficient.
Since there are two regression equations,
Y on X, given as \(Y=a+bX\)
X on Y, given as \(X=c+dY\)
we have two regression coefficients:
Regression coefficient of X on Y, written \(b_{xy}\)
Regression coefficient of Y on X, written \(b_{yx}\)
These can be summarized as below

  1. \( Y= a + b_{yx} X \)
  2. \( Y-\bar{Y}=b_{yx} (X-\bar{X})\)
  3. \( b_{yx} = \frac{Cov(X,Y)}{V(X)} \)
  4. \( b_{yx} = r \times \frac{s_y}{s_x} \)
  5. \( r = \pm \sqrt{b_{xy} \times b_{yx} } \)



Properties of Regression Coefficients
The major properties of regression coefficients can be summarized into following bullets:
  1. Regression coefficients are independent of changes of origin but not of scale.
  2. If one of the regression coefficients is greater than unity, the other must be less than unity.
  3. Both regression coefficients have the same sign as the correlation coefficient.
  4. Each regression line passes through the point of means \((\bar{X}, \bar{Y})\).
  5. The two regression lines intersect at the point of means.
  6. The two regression lines coincide if \( r=\pm 1\).
  7. The two regression lines are perpendicular if \( r=0\).
  8. The arithmetic mean of the regression coefficients is greater than or equal to the correlation coefficient r, provided that r > 0.
    \( \frac{ b_{xy} + b_{yx} }{2} \ge r \)
  9. The geometric mean of the regression coefficients is the correlation coefficient (taking the common sign of the coefficients),
    i.e. \( r = \sqrt{b_{xy} \times b_{yx} } \)
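Properties 8 and 9 can be verified numerically on the Introduction's data (population variances assumed):

```python
# Checking b_yx = Cov/V(X), b_xy = Cov/V(Y), r = sqrt(b_xy * b_yx), and AM >= r.
X = [15, 18, 3, 24, 9, 6, 6, 15, 12, 12]
Y = [10, 15, 1, 13, 13, 6, 2, 11, 13, 16]
n = len(X)
mx, my = sum(X) / n, sum(Y) / n
cov = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / n
vx = sum((x - mx) ** 2 for x in X) / n
vy = sum((y - my) ** 2 for y in Y) / n

b_yx = cov / vx
b_xy = cov / vy
r = cov / (vx ** 0.5 * vy ** 0.5)

print(round((b_xy * b_yx) ** 0.5, 4) == round(r, 4))  # geometric mean equals r
print((b_xy + b_yx) / 2 >= r)                         # arithmetic mean >= r
```

Both checks print True for this dataset, where r is positive.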



Difference Between Correlation and Regression
Below mentioned are a few key differences between these two aspects.
Correlation | Regression
‘Correlation’ determines the interconnection or co-relationship between the variables. | ‘Regression’ explains how an independent variable is numerically associated with the dependent variable.
There is no distinction between independent and dependent variables. | The dependent and independent variables are distinct.
The primary objective is to find a quantitative/numerical value expressing the association between the values. | The primary intent is to estimate the value of a random variable based on the value of a fixed variable.
Correlation stipulates the degree to which the two variables move together. | Regression specifies the effect of a unit change in the known variable (X) on the estimated variable (Y).
Correlation helps in establishing the connection between the two variables. | Regression helps in estimating one variable’s value based on another given value.
Example 3
Calculate the regression equations of Score in IQ (Y) on Score in Math (X), and of X on Y, from the following data.
Score in Math (X)8 10 9 12 10 11
Score in IQ (Y)2 2 3 5 5 6

Solution
Assuming \(X\) as the independent variable and \(Y\) as the dependent variable, the regression equation of \(Y\) on \(X\) is
\(Y=a +b X\), where \(a\) and \(b\) are constants.
To compute the values of the constants \(a\) and \(b\), the corresponding normal equations are
\(\sum Y=na+b \sum X\) (1)
\(\sum XY= a \sum X +b \sum X^2\) (2)
Based on the data given above, the table of calculation is given below.

X Y \(X^2\) \(Y^2\) XY
8 2 64 4 16
10 2 100 4 20
9 3 81 9 27
12 5 144 25 60
10 5 100 25 50
11 6 121 36 66
\(\sum X=60\) \(\sum Y=23\) \(\sum X^2=610\) \(\sum Y^2=103\) \(\sum XY=239\)

Based on the table of calculation (with \(n=6\) students), the normal equations are
\(23 = 6 a + 60 b\) (3)
\(239 = 60 a + 610 b\) (4)
Solving the two equations (3) and (4), we get
\(a = -5.17, b = 0.90\)
Hence, the regression equation of \(Y\) on \(X\) is
\(Y=-5.17 +0.90 X\)
Next, assuming \(Y\) as the independent variable and \(X\) as the dependent variable, the regression equation of \(X\) on \(Y\) is
\(X=c +d Y\), where \(c\) and \(d\) are constants.
To compute the values of the constants \(c\) and \(d\), the corresponding normal equations are
\(\sum X=nc+d \sum Y\) (5)
\(\sum XY= c \sum Y +d \sum Y^2\) (6)
Based on the table of calculation, the normal equations are
\(60 = 6 c + 23 d\) (7)
\(239= 23 c + 103 d \) (8)
Solving the two equations (7) and (8), we get
\(c =7.67, d=0.61\)
Hence, the regression equation of \(X\) on \(Y\) is
\(X=7.67 +0.61 Y\)
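As a cross-check, the coefficients of both regression equations can be recomputed directly from the tabulated sums; note that \(n=6\), the number of students.

```python
# Recomputing Example 3's coefficients from the sums (n = 6).
X = [8, 10, 9, 12, 10, 11]
Y = [2, 2, 3, 5, 5, 6]
n = len(X)
sx, sy = sum(X), sum(Y)                    # 60, 23
sxx = sum(x * x for x in X)                # 610
syy = sum(y * y for y in Y)                # 103
sxy = sum(x * y for x, y in zip(X, Y))     # 239

b = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # slope of Y on X
a = sy / n - b * sx / n                         # intercept of Y on X
d = (n * sxy - sx * sy) / (n * syy - sy * sy)   # slope of X on Y
c = sx / n - d * sy / n                         # intercept of X on Y

print(round(a, 2), round(b, 2))   # -5.17 0.9
print(round(c, 2), round(d, 2))   # 7.67 0.61
```

Both numerators share the quantity \(n\sum XY - \sum X \sum Y\); only the denominator switches between the X and Y sums of squares.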
NOTE
The regression equations of X on Y and of Y on X do NOT necessarily return the same pair of values, as a single linear equation does. For example,

Consider a linear equation
\(2x+3y=12\)
Here, equation for x is
\(x=\frac{12-3y}{2}\)
If we put \(y=2\) then we get \(x=3\)
Next, the equation for y is
\(y=\frac{12-2x}{3}\)
If we put \(x=3\) then we get \(y=2\)
Here, we get same pair of values.

But, this situation may not happen in a pair of regression equation.
Based on the example given above, the regression equation for x is
\(X=7.67 +0.61 Y\)
If we put \(y=2\), then we get \(x=8.89\)
Next, based on the example given above, the regression equation for y is
\(Y=-5.17 +0.90 X\)
If we put \(x=8.89\), then we get \(y=2.83\)
Here, we get a different pair of values.

Therefore, regression equation of X on Y and Y on X do not necessarily estimate same value


Exercise

  1. The coefficient of correlation between X and Y is 0.3. Their covariance is 9. The variance of X is 16. Find the standard deviation of the Y series.
  2. Find the two regression equation of X on Y and Y on X from the following data:
    X : 10 12 16 11 15 14 20 22
    Y : 15 18 23 14 20 17 25 28
  3. The data below gives marks obtained by 10 students taking exam on math and computer test.
    Students A B C D E F G H I J
    X Marks (Math) 15 18 3 24 9 6 6 15 12 12
    Y Marks (Computer) 10 15 1 13 13 6 2 11 13 16
    Is there a connection between the marks obtained by 10 students in Math and computer test?
  4. Suppose you have calculated two months of attendance of five randomly selected students. Their unit test results are presented in a table, which is given below:
    X 10 20 30 40 50
    Y 20 30 40 30 50
    Calculate the Pearson's correlation coefficient of the above data and interpret the correlation. What conclusion do you draw from the correlation coefficient?
  5. In a laboratory experiment on a correlation research study, the equations of the two regression lines were found to be 2X-Y+1=0 and 3X-2Y+7=0. Find the means of X and Y. Also work out the values of the regression coefficients and the coefficient of correlation between the two variables X and Y. Given the variance of X = 9, find the standard deviation of Y.
  6. The coefficient of rank correlation of the marks obtained by 10 students in statistics and accountancy was found to be 0.8. It was later discovered that the difference in ranks in the two subjects obtained by one of the students was wrongly taken as 7 instead of 9. Find the correct coefficient of rank correlation.
  7. Find the correlation between X and Y.
    X Good Excellent Good Excellent Excellent Excellent
    Y Poor Good Poor Excellent Very Good Good
  8. Imagine you are a secondary level teacher conducting research on the correlation between students' self-reported study habits and their academic achievement. Determine whether you would use Pearson's correlation or Spearman's rank correlation for this study, and justify your choice. Discuss potential benefits and challenges associated with your chosen method, emphasizing how the results could inform teaching strategies.
