 Introduction
 Scatter Diagram: Graphic method
 Correlation
 Types of Correlation coefficient
 Interpretation of correlation coefficients
 Coefficient of determination
 Properties of coefficient of Correlation
 Pearson's product moment correlation
 Spearman’s Rank Correlation
 Regression equations of two variables
 The Method of Least Square
 Coefficient of Regression equation
 Properties of Regression Coefficients
 Exercise
Introduction
The data below give the marks obtained by 10 students in a Math test and a Computer test.
Students  A  B  C  D  E  F  G  H  I  J 
X Marks (Math)  15  18  3  24  9  6  6  15  12  12 
Y Marks (Computer)  10  15  1  13  13  6  2  11  13  16 
Is there a connection between the marks obtained by the 10 students in Math and in Computer? A starting point is to plot the marks for both subjects in a scatter diagram.
Now calculating the means, we get
\(\bar{X}=\frac{120}{10}=12\)
\(\bar{Y}=\frac{100}{10}=10\)
Using these means to divide the graph into four quadrants, it is clear that the bottom-right and top-left regions are largely vacant.
So there is a tendency for the points to run from bottom left to top right.
In this example, most of the points (those in the 1st and 3rd quadrants) give a positive value of
\((x-\bar{X})(y-\bar{Y})\)
The problem is to find a way to measure how strong this tendency is. To answer this question, we proceed further.
Here
\(\frac{x-\bar{x}}{s_x}\)
gives normalized distance of each x from \(\bar{x}\) and makes it unit free.
Also
\(\frac{y-\bar{y}}{s_y}\)
gives normalized distance of each y from \(\bar{y}\) and makes it unit free.
So
\(\frac{1}{n} \displaystyle \sum \left (\frac{x-\bar{x}}{s_x} \right ) \left ( \frac{y-\bar{y}}{s_y} \right )\)
gives normalized product moment, which is the value of correlation.
The value of correlation (r) gives a measure of how close the points are to lying on a straight line.
 r = 1 indicates that all the points lie exactly on a straight line with positive gradient
 r = −1 gives the same information, with the line having negative gradient
 r = 0 tells us that there is no linear connection at all between the two sets of data
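As a quick numerical check, the normalized-product-moment recipe above can be sketched in Python, applied to the marks table (a minimal sketch, using the population standard deviation, i.e. dividing by \(n\)):

```python
import math

# Marks of the 10 students from the table above
x = [15, 18, 3, 24, 9, 6, 6, 15, 12, 12]   # Math
y = [10, 15, 1, 13, 13, 6, 2, 11, 13, 16]  # Computer

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n  # 12 and 10, as computed above
s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / n)
s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / n)

# Normalized product moment: the mean of the products of z-scores
r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(x, y)) / n
print(round(r, 2))  # 0.71
```

The value r ≈ 0.71 confirms the visual impression: a fairly strong positive tendency from bottom left to top right.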
The illustration shows that quantifying the relationship between variables, called correlation, is essential if we are to benefit from studying that relationship. There are two basic methods of measuring correlation: the graphical method and the algebraic method.
Scatter Diagram: Graphic method
Scatter Diagram is a graphic method of measuring correlation. It is a diagrammatic representation of bivariate data used to ascertain the relationship between two variables. Under this method the given data are plotted on graph paper. Once the values are plotted, the graph reveals the type of correlation between the variables X and Y. Note, however, that every point affects the correlation, so a single unusual point can distort the picture.
[Scatter diagrams: Strong Positive Correlation | Low Positive Correlation | No Correlation | Low Negative Correlation | Strong Negative Correlation]
Correlation
Correlation is a technique to measure the strength of association between two variables, say X and Y. The intensity (degree) of the correlation is expressed by a number, called the coefficient of correlation, and it is denoted by r. The value of the correlation lies between −1 and +1 (inclusive).
 The coefficient of correlation was first introduced by Galton (1886)
 Formalized by Karl Pearson (1896)
 Developed/extended by Fisher (1935)
 The main idea is to compute an index (number) which reflects how much two variables are related to each other.
 If two variables are related such that both increase or both decrease, then the correlation is positive
 If increase in any one variable is associated with decrease in the other variable, the correlation is negative
Types of Correlation coefficient
There are two main types of correlation coefficients: Pearson's product moment correlation coefficient and Spearman's rank correlation coefficient. The correct choice of coefficient depends on the types of variables being measured. The different combinations are given in the table below.
  Quantitative  Ordinal  Nominal
Quantitative  Pearson's  Biserial  Point Biserial 
Ordinal  Biserial  Spearman rho  Rank Biserial 
Nominal  Point Biserial  Rank Biserial  Phi 
Interpretation of correlation coefficients
Generally, the coefficient of correlation is positive, negative, or zero. If the correlation is positive, the variables are related such that both increase or both decrease together. If the correlation is negative, an increase in one variable is associated with a decrease in the other, and vice versa. If the correlation is zero, the variables are not linearly related. Moreover, the magnitude of the correlation is interpreted as follows.
Correlation coefficients whose magnitude \(|r|\) lies
between 0.8 and 1.0  is very high (perfect) correlation 
between 0.6 and 0.8  is high correlation 
between 0.4 and 0.6  is moderate correlation 
between 0.2 and 0.4  is low correlation 
between 0.0 and 0.2  is very low (no) correlation 
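The verbal labels in this table can be captured in a small helper function (a sketch; how to treat the boundary values 0.2, 0.4, … is a convention, here assigned to the lower band):

```python
def interpret(r):
    """Map the magnitude of a correlation coefficient to a verbal label."""
    m = abs(r)
    if m > 0.8:
        return "very high"
    if m > 0.6:
        return "high"
    if m > 0.4:
        return "moderate"
    if m > 0.2:
        return "low"
    return "very low"

print(interpret(0.71))   # high
print(interpret(-0.35))  # low
```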
Coefficient of determination
The correlation coefficient measures the linear relationship between two variables, but it does not directly indicate the amount of variation in one variable accounted for by the other. A better measure for this purpose is provided by the square of the correlation coefficient, known as the “coefficient of determination”. This can be interpreted as the ratio of the explained variance to the total variance:
\(r^2 =\frac{\text{explained variance}}{\text{total variance}}\)
Similarly, the coefficient of non-determination is
\(1-r^2 \)
Thus
The square of the correlation coefficient is called the coefficient of determination. If r is obtained from two variables \(X\) and \(Y\), then \(r^2\) is the fraction of the variation in \(Y\) that is explained by \(X\).
For example, if the correlation between “Math score” and “Anxiety” is \(r=-0.4\), then \(r^2=0.16\): 16% of the variability in Math score and Anxiety “overlaps”, in opposite directions.
Based on this example, a coefficient of determination of 0.16 is obtained. It can be interpreted as: the variation in Anxiety accounts for 16% of the variation in Math score. The remaining 84% of the variation in Math score is explained by other variables not included in the model.
Properties of coefficient of Correlation
As correlation measures the strength of association between two variables, the major properties of correlation coefficients can be summarized in the following bullets:
 The correlation coefficient lies between \(-1\) and \(+1\)
 Independent from unit of measurement
 Independent of origin and scale
 Symmetrical i.e., \(r_{xy} = r_{yx}\)
Limitations of Correlation
A key thing to remember is that correlation does not mean that a change in one variable causes a change in the other. Sales of personal computers and athletic shoes have both risen strongly over the years, and there is a high correlation between them, but it cannot be assumed that buying computers causes people to buy athletic shoes (or vice versa).
The second caution is that the Pearson correlation technique works best with linear relationships: as one variable gets larger, the other gets larger (or smaller) in direct proportion. It does not work well with curvilinear relationships (in which the relationship does not follow a straight line). An example of a curvilinear relationship is age and health care. They are related, but the relationship doesn't follow a straight line. Young children and older people both tend to use much more health care than teenagers or young adults.
 r is a measure of linear relationship only. There may be an exact connection between X and Y, but if it is not a straight-line relationship, r is of no help.
 Correlation does not imply causality. A survey might show a strong correlation between being left-footed and skill at mental mathematics, yet neither causes the other.
 A single unusual (freak) result may have a strong effect on the value of r
Pearson's product moment correlation: Algebraic Method
Karl Pearson’s method of calculating coefficient of correlation is based on the
covariance of the two variables in a series. This method is widely used in practice and
the coefficient of correlation is denoted by the symbol \(r\). It is used when both variables being studied are normally distributed and quantitative in scale. For a correlation between variables \(X\) and \(Y\), the formula for calculating Pearson's correlation coefficient is given by
\( r=\frac{Cov(X,Y)}{\sigma_x \sigma_y}\) where \(Cov(X,Y) =\frac{1}{n} \sum (x-\bar{x})(y-\bar{y})\) [Variance method]
\( r=\frac{ \sum xy}{\sqrt{\sum x^2} \sqrt{\sum y^2}}\), where \(x\) and \(y\) denote deviations from the means [Deviation method]
\( r=\frac{n \sum XY-\sum X \sum Y}{\sqrt{n\sum X^2-(\sum X)^2} \sqrt{n\sum Y^2-(\sum Y)^2}}\) [Raw method]
Example 1
Calculate the correlation coefficient of the marks in Mathematics and Statistics for the eight students given below.
Marks in Math (X)  67  68  65  68  72  72  69  71
Marks in Stat (Y)  65  66  67  67  68  69  70  72 
Based on the data given above, we can simplify the calculation by subtracting \(65\) from each value of both X and Y. The table of calculation is given below.
X  Y  X  Y  \(X^2\)  \(Y^2\)  XY 
67  65  2  0  4  0  0 
68  66  3  1  9  1  3 
65  67  0  2  0  4  0 
68  67  3  2  9  4  6 
72  68  7  3  49  9  21 
72  69  7  4  49  16  28 
69  70  4  5  16  25  20 
71  72  6  7  36  49  42 
\(\sum X=32\)  \(\sum Y=24\)  \(\sum X^2=172\)  \(\sum Y^2=108\)  \(\sum XY=120\) 
Now, using the formula, the correlation coefficient is
\( r=\frac{N \sum XY-\sum X \sum Y}{\sqrt{N\sum X^2-(\sum X)^2} \sqrt{N\sum Y^2-(\sum Y)^2}}\)
or \( r=\frac{8 \times 120-32 \times 24}{\sqrt{8 \times 172-(32)^2} \sqrt{8 \times 108-(24)^2}}=0.60\)
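All three formulas (variance, deviation and raw method) must give the same answer; the sketch below checks this on the marks data of Example 1 (Pearson's r is unchanged by the shift of 65, so the raw values are used directly):

```python
import math

x = [67, 68, 65, 68, 72, 72, 69, 71]  # Marks in Math
y = [65, 66, 67, 67, 68, 69, 70, 72]  # Marks in Stat
n = len(x)
xb, yb = sum(x) / n, sum(y) / n

# Variance method: r = Cov(X, Y) / (sigma_x * sigma_y)
cov = sum((a - xb) * (b - yb) for a, b in zip(x, y)) / n
sx = math.sqrt(sum((a - xb) ** 2 for a in x) / n)
sy = math.sqrt(sum((b - yb) ** 2 for b in y) / n)
r1 = cov / (sx * sy)

# Deviation method: x and y taken as deviations from the means
dx = [a - xb for a in x]
dy = [b - yb for b in y]
r2 = sum(a * b for a, b in zip(dx, dy)) / (
    math.sqrt(sum(a * a for a in dx)) * math.sqrt(sum(b * b for b in dy)))

# Raw method: works directly on the observations
sx_, sy_ = sum(x), sum(y)
sxx, syy, sxy = sum(a * a for a in x), sum(b * b for b in y), sum(a * b for a, b in zip(x, y))
r3 = (n * sxy - sx_ * sy_) / (
    math.sqrt(n * sxx - sx_ ** 2) * math.sqrt(n * syy - sy_ ** 2))

print(round(r1, 2), round(r2, 2), round(r3, 2))  # 0.6 0.6 0.6
```

All three agree with the hand computation, r ≈ 0.60.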
Spearman’s Rank Correlation
When quantification of variables is difficult, as with the beauty of a person, leadership ability, knowledge, etc., the method of rank correlation is useful. It was developed by the British psychologist Charles Edward Spearman in 1904. In this method ranks are allotted to each element, either in ascending or in descending order. The correlation coefficient between the two series of allotted ranks is popularly called “Spearman's Rank Correlation” and is denoted by \(\rho\). It is appropriate when one or both variables are skewed or ordinal in scale. For a correlation between variables \(X\) and \(Y\), the formula for calculating Spearman's rho is given by
\(\rho=1-\frac{6\displaystyle \sum_{i=1}^n d_i^2}{n(n^2-1)}\)
where
\(d_i\)= the difference between the ranks of corresponding variables
\(n=\) number of observations
NOTE
If there are tied ranks, we assign each tied observation the mean of the ranks it would occupy if the values were not tied. In this case, we use the formula below.
\(\rho=1-\frac{6 \left [\displaystyle \sum_{i=1}^n d_i^2+ \sum_k \frac{m_k(m_k^2-1)}{12}\right ]}{n(n^2-1)}\)
where
\(m_k=\) the number of observations tied at the \(k\)-th repeated value (summed over the tie groups in both series)
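A sketch of the tie-handling recipe in Python (the helper names `midranks`, `tie_correction` and `spearman_tied` are illustrative, not standard library calls): ties receive the mean of the ranks they would occupy, and the correction term \(\sum m_k(m_k^2-1)/12\) is accumulated over the tie groups of both series.

```python
from collections import Counter

def midranks(values):
    """Rank the values in ascending order; tied observations share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1          # mean of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def tie_correction(values):
    """Sum of m*(m^2 - 1)/12 over every group of m tied values."""
    return sum(m * (m * m - 1) / 12 for m in Counter(values).values())

def spearman_tied(x, y):
    n = len(x)
    rx, ry = midranks(x), midranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    correction = tie_correction(x) + tie_correction(y)
    return 1 - 6 * (d2 + correction) / (n * (n * n - 1))

# Illustrative (hypothetical) scores with one tie in x
print(spearman_tied([1, 2, 2, 4], [1, 2, 3, 4]))  # 0.9
```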
Proof of \(\rho=1-\frac{6\displaystyle \sum_{i=1}^n d_i^2}{n(n^2-1)}\)
Consider a bivariate sample \((x_i,y_i)\) for \(i=1,2, \cdots,n\). When \(x_i\) and \(y_i\) are replaced by their ranks, each series of ranks is a permutation of the same sequence of numbers \(1,2,3,\cdots,n\).
Thus
\(\bar{x}=\frac{\displaystyle \sum_{i=1}^n i}{n}\)
or \(\bar{x}=\frac{1+2+\cdots+n}{n}\)
or \(\bar{x}=\frac{n+1}{2}\)
Similarly,
\(s_x^2=\frac{1}{n} \displaystyle \sum_{i=1}^n i^2-(\bar{x})^2\)
or \(s_x^2=\frac{1}{n}\cdot\frac{n(n+1)(2n+1)}{6}-\left(\frac{n+1}{2}\right)^2\)
or \(s_x^2=\frac{(n+1)(2n+1)}{6}-\left(\frac{n+1}{2}\right)^2\)
or \(s_x^2=\left(\frac{n+1}{2}\right) \left [\frac{2n+1}{3}-\frac{n+1}{2} \right ]\)
or \(s_x^2=\left(\frac{n+1}{2}\right)\left(\frac{n-1}{6}\right) \)
or \(s_x^2=\frac{n^2-1}{12}\)
So, we have
\(\bar{x}=\bar{y}=\frac{n+1}{2}\)
\(s_x^2=s_y^2=\frac{n^21}{12}\)
Next, we consider the difference of deviations
\(d_i= (x_i-\bar{x})-(y_i-\bar{y})\)
which, since \(\bar{x}=\bar{y}\), is simply the difference of the ranks, \(x_i-y_i\).
Therefore
\(\displaystyle \frac{1}{n} \sum_{i=1}^n d_i^2= \frac{1}{n} \sum_{i=1}^n [(x_i-\bar{x})-(y_i-\bar{y})]^2\)
or \(\displaystyle \frac{1}{n}\sum_{i=1}^n d_i^2= s_x^2+s_y^2-2 r s_x s_y\)
or \(\displaystyle \frac{1}{n}\sum_{i=1}^n d_i^2= 2s_x^2-2 r s_x^2\)
or \(\displaystyle \frac{1}{n}\sum_{i=1}^n d_i^2= 2s_x^2(1-r)\)
or \(\displaystyle \frac{1}{n}\sum_{i=1}^n d_i^2= 2 \cdot \frac{n^2-1}{12} (1-r)\)
or \(\displaystyle \frac{1}{n}\sum_{i=1}^n d_i^2= \frac{n^2-1}{6} (1-r)\)
or \( \frac{ \displaystyle 6\sum_{i=1}^n d_i^2}{n(n^2-1)}= 1-r\)
or \(r=1- \frac{ \displaystyle 6\sum_{i=1}^n d_i^2}{n(n^2-1)}\)
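The identity just proved can be verified numerically: for any pair of (untied) rank sequences, the shortcut formula agrees with Pearson's r computed on the ranks. A quick sketch:

```python
import math
import random

def pearson(x, y):
    """Pearson's r via the deviation method."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    num = sum((a - xb) * (b - yb) for a, b in zip(x, y))
    den = math.sqrt(sum((a - xb) ** 2 for a in x) * sum((b - yb) ** 2 for b in y))
    return num / den

random.seed(1)
n = 10
rx = list(range(1, n + 1))       # ranks 1..n
ry = rx[:]
random.shuffle(ry)               # an arbitrary permutation: no ties

d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
rho = 1 - 6 * d2 / (n * (n * n - 1))

print(abs(rho - pearson(rx, ry)) < 1e-12)  # True: the two agree
```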
Example 2
Calculate the rank correlation of the scores in Mathematics and IQ for the seven students given below.
Score in Math (X)  52  51  53  55  54  56  57
Score in IQ (Y)  61  63  62  64  67  65  66 
Based on the data given above, we can simplify the calculation by subtracting \(50\) from each value of \(X\) and \(60\) from each value of \(Y\).
Now, the table of calculation is given below.
X  Y  X−50  Y−60  Rank of X  Rank of Y  d = Rx − Ry  Square of rank difference: \(d^2\)
52  61  2  1  6  7  −1  1
51  63  1  3  7  5  2  4
53  62  3  2  5  6  −1  1
55  64  5  4  3  4  −1  1
54  67  4  7  4  1  3  9
56  65  6  5  2  3  −1  1
57  66  7  6  1  2  −1  1
\(\sum d^2=18\) 
Now, using the formula, the rank correlation coefficient is
\(\rho=1-\frac{6\displaystyle \sum_{i=1}^n d_i^2}{n(n^2-1)}\)
or \(\rho=1-\frac{6 \times 18}{7(7^2-1)}=0.68\)
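The same result can be checked in Python; since there are no ties here, ranks can be assigned by position in the sorted order (rank 1 for the highest score, as in the table above):

```python
x = [52, 51, 53, 55, 54, 56, 57]  # Score in Math
y = [61, 63, 62, 64, 67, 65, 66]  # Score in IQ

def ranks_desc(values):
    """Rank 1 for the largest value, rank n for the smallest (no ties assumed)."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

rx, ry = ranks_desc(x), ranks_desc(y)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
n = len(x)
rho = 1 - 6 * d2 / (n * (n * n - 1))
print(d2, round(rho, 2))  # 18 0.68
```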
Regression equations of two variables
Regression analysis is a statistical tool to estimate (or predict) the
unknown values of dependent variable from the known values of independent
variable.
The variable that forms the basis for predicting another variable is known as
the Independent Variable and the variable that is predicted is known as dependent
variable.
For Example,
in \(Y=a+bX\)
one can obtain value of \(Y\) by putting the value of \(X\)
So,
X is called independent variable
Y is called dependent variable
Therefore
Regression is a technique to measure “dependence of one variable upon other variable”.
Regression Equation
Let \(X\) be an independent variable and \(Y\) a dependent variable; then the regression equation of Y on X is
\(Y=a +b X\) (1)
where \(a\) and \(b\) are constants:
\(a\) represents the y-intercept
\(b\) represents the slope of the line
To better understand this, compare \(Y=b X+a\) with the familiar \(Y=m X+c\); then it can be said that
\(a=c\) represents the y-intercept
\(b=m\) represents the slope of the line
To compute the values of these constant ‘a’ and ‘b’, corresponding normal equations are given below
Normal equation for \(a\) [taking the sum of both sides of (1), we get]:
\( \sum Y=na+b \sum X \) (2)
Normal equation for \(b\) [multiplying equation (1) by X and taking the sum of both sides, we get]:
\( \sum XY=a \sum X+ b \sum X^2\) (3)
Solving (2) and (3) for \(a\) and \(b\), we get
\( b =\frac{\sum XY- \frac{\sum X \sum Y}{n}}{\sum X^2-\frac{(\sum X)^2}{n}}\)
or \( b =\frac{\sum XY- n \bar{X}\bar{Y}}{\sum X^2-n \bar{X}^2} \)
And
\( a =\bar{Y}-b \bar{X}\)
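The formula for \(b\), together with \(a=\bar{Y}-b\bar{X}\), is easy to apply directly; a short sketch on a small made-up data set (the values are hypothetical, chosen only for illustration):

```python
x = [1, 2, 3, 4]      # hypothetical X values
y = [2, 3, 5, 6]      # hypothetical Y values

n = len(x)
xb, yb = sum(x) / n, sum(y) / n

# b = (sum(XY) - n*Xbar*Ybar) / (sum(X^2) - n*Xbar^2)
b = (sum(a * c for a, c in zip(x, y)) - n * xb * yb) / (
    sum(a * a for a in x) - n * xb * xb)
a = yb - b * xb       # a = Ybar - b*Xbar

print(a, b)  # 0.5 1.4
```

The fitted line is \(Y = 0.5 + 1.4X\); both versions of the formula for \(b\) give the same value, since they differ only by a factor of \(n\) in numerator and denominator.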
The Method of Least Square
Let us consider a data set given as
X  4  8  12
Y  6  1  6
[Graphs: First Graph and Second Graph — two candidate estimating lines through the data, with the individual errors summed]
It shows that the process of summing the individual differences to calculate the error is NOT a reliable way to judge the goodness of fit of an estimating line.
Therefore
We proceed further, using the absolute value of each error to judge which estimating line has the best goodness of fit, in a new example. The result is as follows.
[Graphs: First Graph and Second Graph — the same comparison using the absolute value of each error]
It shows that summing the absolute values of the individual differences is also NOT a reliable way to judge the goodness of fit of an estimating line, because:
according to the data, Graph 2 has the best fit, but
the absolute values suggest that Graph 1 has the best fit.
Therefore
We proceed further, summing the square of each error to judge which estimating line has the best goodness of fit. This is called the least squares method. The result is as follows.
[Graphs: First Graph and Second Graph — the same comparison using squared errors]
It shows that summing the squares of the individual differences IS a reliable way to judge the goodness of fit of an estimating line, because:
according to the data, Graph 2 has the best fit, and
the squared errors also identify Graph 2 as the best fit.
Therefore
the Least Squares Method is the best way to judge the goodness of fit of an estimating line.
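The argument can be replayed numerically on the data set above. Any line through the point of means \((\bar{X},\bar{Y})=(8,\,13/3)\) has errors summing to (essentially) zero, so the raw sum cannot distinguish a good line from a bad one, while the sum of squares can. (The steep comparison line, with slope 1, is an arbitrary choice for illustration.)

```python
# Data from the least-squares discussion above
x = [4, 8, 12]
y = [6, 1, 6]

def errors(a, b):
    """Vertical errors y - (a + b*x) for the candidate line Y = a + b*X."""
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

y_bar = sum(y) / len(y)                  # 13/3, with x_bar = 8

flat = errors(y_bar, 0.0)                # the least-squares line for this data (slope 0)
steep = errors(y_bar - 8 * 1.0, 1.0)     # an arbitrary line with slope 1 through (8, y_bar)

for e in (flat, steep):
    print(round(sum(e), 10), round(sum(v * v for v in e), 2))
# Both raw error sums are 0, but the squared-error totals differ sharply
```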
Coefficient of Regression equation
Let \(X\) be an independent variable and \(Y\) a dependent variable; then the regression equation of Y on X is
\(Y=a +b X\) (1)
The quantity \(a\) in the regression equation (1) is called the y-intercept, or origin (threshold) coefficient, of Y. Here, \(a\) is the average value of Y when X is zero.
Next, the quantity \(b\) in the regression equation (1) is called slope coefficient.
Since there are two regression equations,
Y on X, given as \(Y=a+bX\)
X on Y, given as \(X=c+dY\)
Therefore, we have two regression coefficients.
Regression Coefficient of X on Y, symbolically written as \(b_{xy}\)
Regression Coefficient of Y on X, symbolically written as \(b_{yx}\)
Which can be summarized as below
 \( Y= a + b_{yx} X \)
 \( Y\bar{Y}=b_{yx} (X\bar{X})\)
 \( b_{yx} = \frac{Cov(X,Y)}{V(X)} \)
 \( b_{yx} = r \times \frac{s_y}{s_x} \)
 \( r = \pm \sqrt{b_{xy} \times b_{yx} } \)
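These relations are easy to verify numerically. The sketch below uses the small data set from the exercise section and population (divide-by-\(n\)) variances:

```python
import math

x = [10, 12, 16, 11, 15, 14, 20, 22]
y = [15, 18, 23, 14, 20, 17, 25, 28]
n = len(x)
xb, yb = sum(x) / n, sum(y) / n

cov = sum((a - xb) * (b - yb) for a, b in zip(x, y)) / n
vx = sum((a - xb) ** 2 for a in x) / n   # V(X)
vy = sum((b - yb) ** 2 for b in y) / n   # V(Y)

b_yx = cov / vx                          # slope of the regression of Y on X
b_xy = cov / vy                          # slope of the regression of X on Y
r = cov / math.sqrt(vx * vy)

assert abs(b_yx - r * math.sqrt(vy / vx)) < 1e-12   # b_yx = r * s_y / s_x
assert abs(r - math.sqrt(b_xy * b_yx)) < 1e-12      # r = sqrt(b_xy * b_yx)  (r > 0 here)
assert (b_xy + b_yx) / 2 >= r                       # AM of the coefficients >= r
print("all identities hold")
```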
Properties of Regression Coefficients
The major properties of regression coefficients can be summarized in the following bullets:
 Regression coefficients are independent of the change of origin but not of scale.
 If one of the regression coefficients is greater than unity, the other must be less than unity.
 Both regression coefficients have the same sign as the correlation coefficient.
 The regression line always passes through the point of means \((\bar{X},\bar{Y})\)
 The point of intersection of the two regression lines is \((\bar{X},\bar{Y})\)
 The two regression lines coincide if \( r=\pm 1\)
 The two regression lines are perpendicular if \( r=0\)
 The arithmetic mean of the regression coefficients is at least the correlation coefficient r, provided that r > 0, i.e.
\( \frac{ b_{xy} + b_{yx} }{2} \ge r \)
 The geometric mean of the regression coefficients is the correlation coefficient, i.e. \( r = \sqrt{b_{xy} \times b_{yx} } \)
Difference Between Correlation and Regression
Below are a few key differences between these two techniques.
Correlation  Regression
‘Correlation’ determines the interconnection, or co-relationship, between the variables.  ‘Regression’ explains how an independent variable is numerically associated with the dependent variable.
Correlation makes no distinction between the independent and dependent variables.  Regression treats the dependent and independent variables differently.
The primary objective is to find a quantitative/numerical value expressing the association between the values.  The primary intent is to estimate the values of a variable based on the values of the fixed variable.
Correlation stipulates the degree to which both of the variables can move together.  Regression specifies the effect of the change in unit, in the known variable (X) on the evaluated variable (Y). 
Correlation helps in establishing the connection between the two variables.  Regression helps in estimating a variable’s value based on another given value.
Example 3
Calculate the regression equation of Score in IQ on Score in Math from the following data.
Score in Math (X)  8  10  9  12  10  11
Score in IQ (Y)  2  2  3  5  5  6 
Solution
Assuming \(X\) as the independent variable and \(Y\) as the dependent variable, the regression equation of \(Y\) on \(X\) is
\(Y=a +b X\), where \(a\) and \(b\) are constants
To compute the values of these constants \(a\) and \(b\), the corresponding normal equations are
\(\sum Y=na+b \sum X\) (1)
\(\sum XY= a \sum X +b \sum X^2\) (2)
Based on the data given above, the table of calculation is given below.
X  Y  \(X^2\)  \(Y^2\)  XY 
8  2  64  4  16 
10  2  100  4  20 
9  3  81  9  27 
12  5  144  25  60 
10  5  100  25  50 
11  6  121  36  66 
\(\sum X=60\)  \(\sum Y=23\)  \(\sum X^2=610\)  \(\sum Y^2=103\)  \(\sum XY=239\) 
Based on the table of calculation (with \(n=6\) students), the normal equations are
\(23 = 6 a + 60b\) (3)
\(239 = 60 a + 610 b\) (4)
Solving the two equations (3) and (4), we get
\(a = -5.17, b = 0.9\)
Hence, the regression equation of \(Y\) on \(X\) is
\(Y=-5.17 +0.9 X\)
Next, assuming \(Y\) as the independent variable and \(X\) as the dependent variable, the regression equation of \(X\) on \(Y\) is
\(X=c +d Y\), where \(c\) and \(d\) are constants
To compute the values of these constants \(c\) and \(d\), the corresponding normal equations are
\(\sum X=nc+d \sum Y\) (5)
\(\sum XY= c \sum Y +d \sum Y^2\) (6)
Based on the table of calculation (with \(n=6\)), the normal equations are
\(60 = 6 c + 23 d\) (7)
\(239= 23 c + 103d \) (8)
Solving the two equations (7) and (8), we get
\(c =7.67, d=0.61\)
Hence, the regression equation of \(X\) on \(Y\) is
\(X=7.67 +0.61 Y\)
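The arithmetic of Example 3 can be double-checked by solving both pairs of normal equations directly from the data (note that \(n=6\), the number of students):

```python
x = [8, 10, 9, 12, 10, 11]   # Score in Math
y = [2, 2, 3, 5, 5, 6]       # Score in IQ
n = len(x)

sx, sy = sum(x), sum(y)                      # 60, 23
sxx = sum(a * a for a in x)                  # 610
syy = sum(b * b for b in y)                  # 103
sxy = sum(a * b for a, b in zip(x, y))       # 239

# Y on X: solve  sy = n*a + b*sx  and  sxy = a*sx + b*sxx
b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
a = sy / n - b * sx / n

# X on Y: solve  sx = n*c + d*sy  and  sxy = c*sy + d*syy
d = (n * sxy - sx * sy) / (n * syy - sy ** 2)
c = sx / n - d * sy / n

print(round(a, 2), round(b, 2))  # -5.17 0.9
print(round(c, 2), round(d, 2))  # 7.67 0.61
```

As a cross-check, \(\sqrt{b \cdot d}=\sqrt{0.9 \times 0.607}\approx 0.74\), which is the Pearson correlation of the two score series.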
NOTE
The regression equation of X on Y and the regression equation of Y on X do NOT necessarily give the same estimates, unlike an ordinary linear equation.
An ordinary linear equation can be inverted exactly: solving \(Y=a+bX\) for \(X\) recovers the same line.
But this may not happen with a pair of regression equations: in general they are two different lines, and they coincide only when \( r=\pm 1\).
Exercise
 The coefficient of correlation between X and Y is 0.3. Their covariance is 9. The variance of X is 16. Find the standard deviation of the Y series.
 Find the two regression equations of X on Y and Y on X from the following data:
X  10  12  16  11  15  14  20  22
Y  15  18  23  14  20  17  25  28
 The data below gives the marks obtained by 10 students taking exams in Math and Computer. Calculate the correlation coefficient between the two sets of marks.
Students  A  B  C  D  E  F  G  H  I  J
X Marks (Math)  15  18  3  24  9  6  6  15  12  12
Y Marks (Computer)  10  15  1  13  13  6  2  11  13  16
 Suppose you have calculated two months of attendance of five randomly selected students. Their unit test results are presented in the table below:
X  10  20  30  40  50
Y  20  30  40  30  50
 In a laboratory experiment on a correlation research study, the equations of the two regression lines were found to be 2X−Y+1=0 and 3X−2Y+7=0. Find the means of X and Y. Also work out the values of the regression coefficients and the coefficient of correlation between the two variables X and Y. Given the variance of X = 9, find the standard deviation of Y.
 The coefficient of rank correlation of the marks obtained by 10 students in statistics and accountancy was found to be 0.8. It was later discovered that the difference in ranks in the two subjects obtained by one of the students was wrongly taken as 7 instead of 9. Find the correct coefficient of rank correlation.
 Find the correlation between X and Y.
X  Good  Excellent  Good  Excellent  Excellent  Excellent
Y  Poor  Good  Poor  Excellent  Very Good  Good
 Imagine you are a secondary level teacher conducting research on the correlation between students' self-reported study habits and their academic achievement. Determine whether you would use Pearson's correlation or Spearman's rank correlation for this study, and justify your choice. Discuss potential benefits and challenges associated with your chosen method, emphasizing how the results could inform teaching strategies.