17.1.11.2 Algorithm (Correlation Coefficient)

Several correlation coefficients are available, each appropriate under different circumstances. The most frequently used is Pearson's product moment correlation coefficient.

Correlation Coefficients

Pearson's product moment correlation coefficient

Pearson's product moment correlation coefficient measures the strength of the linear relationship between two variables.

Let \sigma _x\, and \sigma _y\, be the standard deviations of two random variables X and Y respectively. Then Pearson's product moment correlation coefficient between the variables is

\rho _{x,y}=\frac{cov(X,Y)}{\sigma _x\sigma _y}=\frac{E((X-E(X))(Y-E(Y)))}{\sigma _x\sigma _y}

where E(.) denotes the expected value of the variable, and cov(.) means covariance.

To use this method, one should make sure that the interval data comes from paired observations, and that the variables are normally distributed. The data should not contain any extreme values, because they are apt to affect the result. Pearson's product moment correlation coefficient could sometimes be misleadingly small when the variables have a non-linear relationship.
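As an illustration, the definition above can be estimated from paired samples by replacing the expected values with sample means. The following is a minimal sketch, not the implementation used by the software; the function name pearson_r and the sample data are assumptions made for this example.

```python
import math

def pearson_r(x, y):
    """Sample estimate of Pearson's correlation for paired observations (hypothetical helper)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least two points")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of centered products. The common 1/n factor of the covariance
    # and of the two variances cancels in the ratio, so it is omitted.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Exactly linear data gives a coefficient of 1.
print(pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))   # 1.0
# A perfect but non-linear (quadratic) relationship gives 0.
print(pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))  # 0.0
```

The second call shows the caveat mentioned above: y is fully determined by x, yet the coefficient is zero because the relationship is not linear.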

Spearman Rank Correlation Coefficient

Spearman Rank correlation coefficient is a non-parametric measure; therefore, it is suitable for data that is not normally distributed. Because it depends only on ranks, it is better than Pearson's coefficient at detecting a monotonic non-linear relationship between two variables. It can be defined as

r^{\prime }=1-\frac{6\sum d^2}{N(N^2-1)}

where d is the difference between the statistical ranks of corresponding values of the two variables, and N is the number of paired observations.

Because statistical rank is just the ordinal number of a value in a list, Spearman Rank correlation coefficient can be computed even when actual values of the variables are unknown.
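The rank-difference formula can be sketched as follows. This is an illustrative example only: the helper names ranks and spearman_r are assumptions, and the code assumes that neither variable contains tied values, which is the case in which the formula above applies directly.

```python
def ranks(values):
    """Rank each value from 1 (smallest) to N; assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_r(x, y):
    """Spearman rank correlation from rank differences (no-ties sketch)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least two points")
    n = len(x)
    # d is the difference between the ranks of corresponding observations.
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# A monotonic but non-linear relationship (y = x^3) still gives 1.
print(spearman_r([1, 2, 3, 4, 5], [1, 8, 27, 64, 125]))  # 1.0
# Fully reversed order gives -1.
print(spearman_r([1, 2, 3, 4, 5], [5, 4, 3, 2, 1]))      # -1.0
```

Only the ordering of the values enters the computation, so any data that can be ranked can be used, even when the underlying values are not known.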

Kendall correlation coefficient

Kendall correlation coefficient, or Kendall tau, is equivalent to Spearman R in terms of its assumptions and statistical power. However, Kendall correlation coefficient has a more intuitive interpretation, and its algebraic structure is simpler. Furthermore, it does not require ordering of the data before the computation.

Kendall correlation coefficient can be computed by

t=\frac{C-D}{\sqrt{q}}

where C is the number of concordant pairs (pairs of observations for which the differences in the two variables have the same sign), D is the number of discordant pairs (pairs of observations for which the differences have opposite signs), and q is defined under Kendall type in Significance of R.
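The pair counting can be sketched as below. This is a minimal illustration rather than the software's implementation: the names tie_sum and kendall_tau are assumptions, q is computed from the definition given under Kendall type in Significance of R, and pairs that are tied in either variable are assumed to count as neither concordant nor discordant.

```python
import math
from collections import Counter

def tie_sum(values):
    """tau = sum of t_k * (t_k - 1) over groups of tied values (untied values add 0)."""
    return sum(t * (t - 1) for t in Counter(values).values())

def kendall_tau(x, y):
    """Kendall correlation as (C - D) / sqrt(q) (hypothetical helper)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least two points")
    n = len(x)
    concordant = discordant = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            # Sign of the product of the two differences classifies the pair.
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    pairs = n * (n - 1) / 2
    q = (pairs - tie_sum(x) / 2) * (pairs - tie_sum(y) / 2)
    if q == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return (concordant - discordant) / math.sqrt(q)

# Five concordant pairs and one discordant pair out of six, no ties.
print(kendall_tau([1, 2, 3, 4], [1, 3, 2, 4]))
```

In this example C = 5, D = 1 and q = 6 × 6 = 36, so the coefficient is 4/6. The double loop compares every pair directly, which is why no prior sorting of the data is needed.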

Significance of R

Pearson and Spearman types

For Pearson and Spearman correlation types, let


t = |r\sqrt{\frac{N-2}{1-r^2}}|

where r is the correlation of two variables and N is the number of observations.

Then t follows a t-distribution with N-2 degrees of freedom. The two-tailed significance level can be calculated as:


p=2(1-\mbox{tcdf} (t,N-2))\;
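The two formulas above can be sketched with the standard library alone. In this illustration the function names t_cdf and correlation_p_value are assumptions, and tcdf is approximated by numerical integration (Simpson's rule after the substitution x = sqrt(df)·tan(θ)) rather than by whatever routine the software actually uses.

```python
import math

def t_cdf(t, df, steps=2000):
    """Approximate tcdf(t, df) for t >= 0 by Simpson's rule (steps must be even).

    With x = sqrt(df) * tan(theta), the t density integrates as
    const * cos(theta)**(df - 1) over the bounded range [0, atan(t / sqrt(df))].
    """
    theta_max = math.atan(t / math.sqrt(df))
    const = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(math.pi)
    h = theta_max / steps
    total = 1.0 + math.cos(theta_max) ** (df - 1)   # endpoint values; cos(0) ** k == 1
    for k in range(1, steps):
        weight = 4 if k % 2 else 2
        total += weight * math.cos(k * h) ** (df - 1)
    return 0.5 + const * total * h / 3

def correlation_p_value(r, n):
    """Return (t, p) for a Pearson or Spearman correlation r from n observations."""
    if n < 3:
        raise ValueError("at least three observations are needed (N - 2 > 0)")
    if abs(r) >= 1.0:
        # Assumption of this sketch: a perfect correlation is reported as p = 0.
        return math.inf, 0.0
    t = abs(r * math.sqrt((n - 2) / (1 - r * r)))
    p = 2 * (1 - t_cdf(t, n - 2))
    return t, p

t, p = correlation_p_value(0.9, 4)
print(t, p)
```

Because t is defined with an absolute value, only the upper tail is integrated and then doubled, which gives the two-tailed level directly.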

Kendall type

For Kendall correlation type, let


z=\frac{r\sqrt{q}}{\sqrt{v}}

where

v_0 = N(N-1)(2N+5)\;
\tau = \sum_{k} t_k (t_k-1)\;
\tau_1 = \sum_{k} t_k (t_k-1)(t_k-2)\;
\tau_2 = \sum_{k} t_k (t_k-1)(2t_k+5)\;
t_k \mbox{ is the number of tied values in the kth group of ties for a variable.}\;
q=(N(N-1)/2-\tau(i)/2)(N(N-1)/2-\tau(j)/2)\;
v=(v_0-\tau_2(i)-\tau_2(j))/18 + \tau (i)\tau (j)/(2N(N-1)) + \tau_1 (i)\tau_1 (j)/(9N(N-1)(N-2)) \;
r \mbox{ is the correlation between variable } i \mbox{ and variable } j. \;

Then z approximately follows a standard normal distribution, and the two-tailed significance level is:


p=2(1-\mbox{normcdf} (\mbox{abs} (z)))\;
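The quantities defined above can be assembled as in the following sketch. It is an illustration under stated assumptions, not the software's implementation: the names tie_terms and kendall_significance are hypothetical, normcdf is evaluated with the error function from the Python standard library, and pairs tied in either variable are assumed to contribute nothing to C − D.

```python
import math
from collections import Counter

def tie_terms(values):
    """Return (tau, tau_1, tau_2) for one variable from its groups of tied values."""
    counts = list(Counter(values).values())   # t_k for each distinct value
    tau = sum(t * (t - 1) for t in counts)
    tau1 = sum(t * (t - 1) * (t - 2) for t in counts)
    tau2 = sum(t * (t - 1) * (2 * t + 5) for t in counts)
    return tau, tau1, tau2

def kendall_significance(x, y):
    """Return (r, z, p) for the Kendall correlation of paired samples x and y."""
    n = len(x)
    if len(y) != n or n < 3:
        raise ValueError("x and y must be paired and contain at least three points")
    c_minus_d = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            c_minus_d += (s > 0) - (s < 0)    # +1 concordant, -1 discordant, 0 tied
    tau_i, tau1_i, tau2_i = tie_terms(x)
    tau_j, tau1_j, tau2_j = tie_terms(y)
    pairs = n * (n - 1) / 2
    q = (pairs - tau_i / 2) * (pairs - tau_j / 2)
    v0 = n * (n - 1) * (2 * n + 5)
    v = ((v0 - tau2_i - tau2_j) / 18
         + tau_i * tau_j / (2 * n * (n - 1))
         + tau1_i * tau1_j / (9 * n * (n - 1) * (n - 2)))
    if q == 0 or v <= 0:
        raise ValueError("statistic is undefined when a variable is constant")
    r = c_minus_d / math.sqrt(q)
    z = r * math.sqrt(q) / math.sqrt(v)
    normcdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p = 2 * (1 - normcdf)
    return r, z, p

r, z, p = kendall_significance([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
print(r, z, p)
```

Since r·sqrt(q) equals C − D, the statistic z is the difference between the concordant and discordant counts divided by the square root of v, and the tie terms reduce v when either variable has repeated values.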