17.1.4.3 Algorithms (CrossTabs)



CrossTabs analysis is also known as contingency table analysis. This tool examines whether an association exists between variables and, if so, how strong it is.

CrossTabs Method

  • Frequency Counts
  • Marginal and Cell
  • Chi-Square Tests Table
  • Fisher's Exact Test Table (2 x 2 only)
  • Measures of Association
  • Measures of Agreement
  • Odds Ratio and Relative Risk (2 x 2 only)
  • Cochran-Mantel-Haenszel

Frequency Counts

Define

X_i are the distinct values of the row variable in ascending order, i.e. X_1 < X_2 < \cdots < X_R
Y_j are the distinct values of the column variable in ascending order, i.e. Y_1 < Y_2 < \cdots < Y_C
f_{ij} is the frequency of cell (i,j)
r_i = \sum_{j=1}^{C}f_{ij} is the subtotal of the ith row
c_j = \sum_{i=1}^{R}f_{ij} is the subtotal of the jth column
N = \sum_{j=1}^{C}c_j = \sum_{i=1}^{R}r_i is the total count.

Marginal and Cell

Statistic: Formula and Explanation
Count: f_{ij}
Expected Count: E_{ij} = \frac{r_i c_j}{N}
Row Percent: 100 \times \frac{f_{ij}}{r_i}
Column Percent: 100 \times \frac{f_{ij}}{c_j}
Total Percent: 100 \times \frac{f_{ij}}{N}
Residual: R_{ij} = f_{ij} - E_{ij}
Std. Residual: StdR_{ij} = \frac{R_{ij}}{\sqrt{E_{ij}}}
Adj. Residual: AdjR_{ij} = \frac{R_{ij}}{\sqrt{E_{ij}\left(1-\frac{r_i}{N}\right)\left(1-\frac{c_j}{N}\right)}}
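
The per-cell quantities above can be illustrated with a short Python sketch. The function name `cell_statistics` and the list-of-rows input format are assumptions made for this example; they are not part of the tool.

```python
import math

def cell_statistics(table):
    """Per-cell statistics for an R x C table of counts given as a list of rows."""
    R, C = len(table), len(table[0])
    r = [sum(row) for row in table]                                  # row subtotals r_i
    c = [sum(table[i][j] for i in range(R)) for j in range(C)]       # column subtotals c_j
    N = sum(r)                                                       # total count
    result = []
    for i in range(R):
        result_row = []
        for j in range(C):
            f = table[i][j]
            e = r[i] * c[j] / N                                      # expected count E_ij
            resid = f - e                                            # residual R_ij
            result_row.append({
                "count": f,
                "expected": e,
                "row_pct": 100 * f / r[i],
                "col_pct": 100 * f / c[j],
                "total_pct": 100 * f / N,
                "residual": resid,
                "std_residual": resid / math.sqrt(e),
                "adj_residual": resid / math.sqrt(
                    e * (1 - r[i] / N) * (1 - c[j] / N)),
            })
        result.append(result_row)
    return result

stats = cell_statistics([[10, 20], [30, 40]])
```

For the 2 x 2 example, the first cell has expected count 30 * 40 / 100 = 12 and residual -2.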

Chi-Square Statistics

Statistic: Formula and Explanation; Degrees of Freedom
Pearson Chi-Square: \chi_p^2 = \sum_{ij} \frac{(f_{ij}-E_{ij})^2}{E_{ij}}; degrees of freedom: (R-1)(C-1)
Likelihood Ratio: \chi_{LR}^2 = -2\sum_{ij} f_{ij} \ln (E_{ij}/f_{ij}); degrees of freedom: (R-1)(C-1)
Linear Association: \chi_{LA}^2 = (N-1)r^2, where r is the Pearson correlation coefficient; degrees of freedom: 1
Continuity Correction: \chi_C^2 = \frac{N(|f_{11}f_{22}-f_{12}f_{21}|-0.5N)^2}{r_1r_2c_1c_2} I(|f_{11}f_{22}-f_{12}f_{21}|>0.5N), where I(\cdot) is the indicator function; calculated only for a 2 x 2 table; degrees of freedom: 1
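
As an illustration, the Pearson, likelihood ratio, and continuity-corrected statistics can be computed as in the following sketch. The function name `chi_square_statistics` is hypothetical, the linear association statistic is omitted, and treating a zero cell as contributing 0 to the likelihood ratio is an assumption of this example.

```python
import math

def chi_square_statistics(table):
    """Pearson, likelihood ratio and (2 x 2 only) continuity-corrected chi-square."""
    R, C = len(table), len(table[0])
    r = [sum(row) for row in table]
    c = [sum(table[i][j] for i in range(R)) for j in range(C)]
    N = sum(r)
    pearson = 0.0
    lr = 0.0
    for i in range(R):
        for j in range(C):
            f = table[i][j]
            e = r[i] * c[j] / N
            pearson += (f - e) ** 2 / e
            if f > 0:  # assumed convention: an empty cell adds nothing to the sum
                lr += -2 * f * math.log(e / f)
    result = {"pearson": pearson, "likelihood_ratio": lr, "df": (R - 1) * (C - 1)}
    if R == 2 and C == 2:
        d = abs(table[0][0] * table[1][1] - table[0][1] * table[1][0])
        # The indicator I(d > 0.5N) sets the statistic to 0 when the correction
        # would overshoot.
        if d > 0.5 * N:
            result["continuity"] = N * (d - 0.5 * N) ** 2 / (r[0] * r[1] * c[0] * c[1])
        else:
            result["continuity"] = 0.0
    return result

chi = chi_square_statistics([[10, 20], [30, 40]])
```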

Fisher's Exact Test

This test is useful when some expected cell counts are low (less than 5). It is calculated only for a 2 x 2 table. Suppose we have the following table:

X_1 X_2 Subtotal/Total
Y_1 n_1 n_3 n_1+n_3
Y_2 n_2 n_4 n_2+n_4
Subtotal/Total n_1+n_2 n_3+n_4 N

Under the null hypothesis of independence, the count of the first cell, N_1, follows a hypergeometric distribution with probability

Pr(N_1=n_1) = \frac{(n_1+n_2)!(n_3+n_4)!(n_1+n_3)!(n_2+n_4)!}{N!n_1!n_2!n_3!n_4!}, \max(0,n_1-n_4)\leq N_1 \leq \min(n_1+n_2,n_1+n_3).

One-Sided Test

The one-sided significance level is calculated by

p(left-sided test) = Pr(N_1\leq n_1)
p(right-sided test) = Pr(N_1\geq n_1)

Two-Sided Test

The two-sided significance level is

p_2 = p_1 + p_3

where

p_{1}= Pr(N_1\leq n_1), if n_{1}\leq (n_{1}+n_{2})(n_{1}+n_{3})/N
p_{1}= Pr(N_1\geq n_1), if n_{1}>(n_{1}+n_{2})(n_{1}+n_{3})/N


p_3 = \sum_{\substack{n_1+1 \leq x \leq \min(n_1+n_2,\,n_1+n_3) \\ Pr(N_1=x) \leq Pr(N_1=n_1)}} Pr(N_1=x)
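
A minimal sketch of these calculations follows, using the cell labels n_1 to n_4 from the table above. The function name `fisher_exact_2x2` is hypothetical. When n_1 lies above its expected value, the sketch sums p_3 over the lower tail instead; that mirrored case, and the small tolerance used when comparing probabilities, are assumptions of this example.

```python
from math import comb

def fisher_exact_2x2(n1, n2, n3, n4):
    """One-sided and two-sided Fisher exact p-values for a 2 x 2 table."""
    N = n1 + n2 + n3 + n4
    col1 = n1 + n2          # subtotal of the first column
    row1 = n1 + n3          # subtotal of the first row
    lo = max(0, n1 - n4)    # smallest admissible value of N_1
    hi = min(col1, row1)    # largest admissible value of N_1

    def pmf(x):
        # Hypergeometric probability Pr(N_1 = x) with all margins fixed.
        return comb(col1, x) * comb(N - col1, row1 - x) / comb(N, row1)

    probs = {x: pmf(x) for x in range(lo, hi + 1)}
    left = sum(p for x, p in probs.items() if x <= n1)
    right = sum(p for x, p in probs.items() if x >= n1)

    p_obs = probs[n1]
    if n1 <= col1 * row1 / N:
        p1 = left
        other_tail = [p for x, p in probs.items() if x > n1]
    else:
        p1 = right
        other_tail = [p for x, p in probs.items() if x < n1]  # assumed mirrored case
    # Relative tolerance guards against floating-point ties.
    p3 = sum(p for p in other_tail if p <= p_obs * (1 + 1e-9))
    return {"left": left, "right": right, "two_sided": min(1.0, p1 + p3)}

result = fisher_exact_2x2(3, 1, 1, 3)
```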

Measures of Association

Define

D_r = N^2 - \sum_{i=1}^{R}r_i^2
D_c = N^2 - \sum_{j=1}^{C}c_j^2
C_{ij} = \sum_{h<i}\sum_{k<j}f_{hk}+\sum_{h>i}\sum_{k>j}f_{hk}
D_{ij} = \sum_{h<i}\sum_{k>j}f_{hk}+\sum_{h>i}\sum_{k<j}f_{hk}
P = \sum_{ij}f_{ij}C_{ij}
Q = \sum_{ij}f_{ij}D_{ij}
r_i = \sum_{j=1}^{C}f_{ij} is the subtotal of the ith row
c_j = \sum_{i=1}^{R}f_{ij} is the subtotal of the jth column
N = \sum_{j=1}^{C}c_j = \sum_{i=1}^{R}r_i is the total count.
Statistic: Formula and Explanation; Standard Error
Phi Coefficient: \phi = \sqrt{\chi_p^2/N} for tables other than 2 x 2, with values in [0,M], where M = \min(\sqrt{R-1},\sqrt{C-1}). For a 2 x 2 table, \phi equals the Pearson correlation coefficient r.
Cramer's V: V = \sqrt{\frac{\chi_p^2}{N(\min\{R,C\}-1)}}
Contingency Coefficient: CC = \sqrt{\frac{\chi_p^2}{\chi_p^2+N}}
Gamma: \gamma = \frac{P-Q}{P+Q}; standard error: \frac{2}{P+Q}\sqrt{\sum_{ij}f_{ij}(C_{ij}-D_{ij})^2-\frac{1}{N}(P-Q)^2}
Kendall Tau-b: \tau_b = \frac{P-Q}{\sqrt{D_rD_c}}; standard error: 2\sqrt{\frac{1}{D_rD_c}\left[\sum_{ij}f_{ij}(C_{ij}-D_{ij})^2-\frac{1}{N}(P-Q)^2\right]}
Tau-c: \tau_c = \frac{(P-Q)q}{N^2(q-1)}, where q = \min\{R,C\}; standard error: \frac{2q}{N^2(q-1)}\sqrt{\sum_{ij}f_{ij}(C_{ij}-D_{ij})^2-\frac{1}{N}(P-Q)^2}
Somers' D, C|R: d_{C|R} = \frac{P-Q}{D_r}; standard error: \frac{2}{D_r}\sqrt{\sum_{ij}f_{ij}(C_{ij}-D_{ij})^2-\frac{1}{N}(P-Q)^2}
Somers' D, R|C: d_{R|C} = \frac{P-Q}{D_c}; standard error: \frac{2}{D_c}\sqrt{\sum_{ij}f_{ij}(C_{ij}-D_{ij})^2-\frac{1}{N}(P-Q)^2}
Somers' D, Symmetric: d = 2\frac{P-Q}{D_c+D_r}; standard error: \frac{4}{D_c+D_r}\sqrt{\sum_{ij}f_{ij}(C_{ij}-D_{ij})^2-\frac{1}{N}(P-Q)^2}
Lambda, C|R: \lambda_{C|R} = \frac{1}{N-c_m}\left(\sum_{i=1}^{R}f_{im}-c_m\right), where f_{im} is the largest count in the ith row and c_m is the largest column subtotal; standard error: \sqrt{ \frac{ N - \displaystyle\sum_{i=1}^{R} f_{im} }{ (N-c_m)^3 } \left(\sum_{i=1}^{R} f_{im} + c_m -2\sum_{i=1}^{R} (f_{im}|l_i=l) \right) }, where l_i is the column index of f_{im} and l is the column index of c_m.

Lambda, R|C: \lambda_{R|C} = \frac{1}{N-r_m}\left(\sum_{j=1}^{C}f_{mj}-r_m\right), where f_{mj} is the largest count in the jth column and r_m is the largest row subtotal; standard error: \sqrt{ \frac{ N - \displaystyle\sum_{j=1}^{C} f_{mj} }{ (N-r_m)^3 } \left(\sum_{j=1}^{C} f_{mj} + r_m -2\sum_{j=1}^{C} (f_{mj}|k_j=k) \right) }, where k_j is the row index of f_{mj} and k is the row index of r_m.

Lambda, Symmetric: \lambda = \frac { \displaystyle \sum_{i=1}^{R}f_{im} + \sum_{j=1}^{C}f_{mj} - c_m - r_m }{2N-r_m-c_m}; standard error: \frac{1}{w^2} \sqrt{ wvy - 2w^2\left( N-\sum_{i=1}^{R} (f_{im}|i=k_{l_i}) \right) - 2v^2(N-f_{kl}) }, where w=2N-r_m-c_m, v = 2N - \sum_{i=1}^{R}f_{im} - \sum_{j=1}^{C}f_{mj}, x = \sum_{i=1}^R (f_{im}|l_i=l) + \sum_{j=1}^C (f_{mj}|k_j=k) + f_{km} + f_{ml}, and y = 8N - w - v - 2x.

Uncertainty Coefficient, C|R: U_{C|R} = \frac{U(X)+U(Y)-U(XY)}{U(Y)}, where U(X) = -\sum_{i=1}^{R}\frac{r_i}{N}\ln\frac{r_i}{N}, U(Y) = -\sum_{j=1}^{C}\frac{c_j}{N}\ln\frac{c_j}{N}, and U(XY) = -\sum_{ij}\frac{f_{ij}}{N}\ln\frac{f_{ij}}{N}; standard error: \frac{1}{NU(Y)}\sqrt{P-N\left(U(X)+U(Y)-U(XY)\right)^2}, where, for the uncertainty coefficients only, P = \sum_{ij}f_{ij}\left[\ln\left(\frac{r_ic_j}{f_{ij}N}\right)\right]^2
Uncertainty Coefficient, R|C: U_{R|C} = \frac{U(X)+U(Y)-U(XY)}{U(X)}; standard error: \frac{1}{NU(X)}\sqrt{P-N\left(U(X)+U(Y)-U(XY)\right)^2}
Uncertainty Coefficient, Symmetric: U = 2\frac{U(X)+U(Y)-U(XY)}{U(X)+U(Y)}; standard error: \frac{2}{N(U(X)+U(Y))}\sqrt{P-N\left(U(X)+U(Y)-U(XY)\right)^2}
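
The concordance-based measures in this table (Gamma, the Kendall taus, and the asymmetric Somers' D values) all derive from P, Q, D_r, and D_c. The following sketch shows one way to compute them directly from the definitions. The function name `ordinal_association` and the dictionary keys are hypothetical, and the brute-force double loop is chosen for clarity rather than speed.

```python
import math

def ordinal_association(table):
    """Gamma, tau-b, tau-c and asymmetric Somers' D from an R x C table of counts."""
    R, C = len(table), len(table[0])
    r = [sum(row) for row in table]
    c = [sum(table[i][j] for i in range(R)) for j in range(C)]
    N = sum(r)
    D_r = N * N - sum(x * x for x in r)
    D_c = N * N - sum(x * x for x in c)
    P = Q = 0
    for i in range(R):
        for j in range(C):
            # C_ij: counts in cells above-left or below-right of (i, j)
            conc = sum(table[h][k] for h in range(R) for k in range(C)
                       if (h < i and k < j) or (h > i and k > j))
            # D_ij: counts in cells above-right or below-left of (i, j)
            disc = sum(table[h][k] for h in range(R) for k in range(C)
                       if (h < i and k > j) or (h > i and k < j))
            P += table[i][j] * conc
            Q += table[i][j] * disc
    q = min(R, C)
    return {
        "P": P,
        "Q": Q,
        "gamma": (P - Q) / (P + Q),
        "tau_b": (P - Q) / math.sqrt(D_r * D_c),
        "tau_c": q * (P - Q) / (N * N * (q - 1)),
        "somers_d_col_given_row": (P - Q) / D_r,
        "somers_d_row_given_col": (P - Q) / D_c,
    }

measures = ordinal_association([[10, 20], [30, 40]])
```

For the 2 x 2 example, P = 800 and Q = 1200, so Gamma is -0.2.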

Measures of Agreement

This table is calculated only when two conditions are satisfied: (1) the table is square, i.e. R=C, and (2) the row variable and the column variable have the same set of values.

The Kappa statistic is calculated by

 \kappa = \frac{N\sum_{i=1}^{R}f_{ii} - \sum_{i=1}^{R}r_ic_i}{N^2 - \sum_{i=1}^{R}r_ic_i}

The standard error is estimated by:

SE_1 = \frac{1}{1-p_e} \sqrt{ \frac{A+B-C}{N} }.

where p_e = \frac{ \sum_{i=1}^R r_i c_i }{ N^2 },  A = \sum_{i=1}^R \frac{f_{ii}}{N} \left( 1-\frac{(r_i+c_i)(1- \kappa)}{N} \right)^2,
B = (1-\kappa)^2 \sum_{i=1}^R \sum_{j=1, j \ne i}^{C} \frac{f_{ij} (c_i+r_j)^2}{N^3} and C = \Bigl( \kappa - p_e( 1-\kappa ) \Bigr)^2.

The corresponding asymptotic standard error under the null hypothesis \kappa = 0 is given by

SE_0 = \sqrt{\frac{1}{N\left(N^2 - \sum_{i=1}^{R}r_ic_i\right)^2} \left[N^2\sum_{i=1}^{R}r_ic_i + \left(\sum_{i=1}^{R}r_ic_i\right)^2 - N \sum_{i=1}^{R}r_ic_i(r_i+c_i)\right]}
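
The Kappa statistic and its standard error under the null hypothesis can be sketched as follows. The function name `kappa_with_null_se` is hypothetical, and the sketch assumes that the input already satisfies the two conditions above (square table, matching categories).

```python
import math

def kappa_with_null_se(table):
    """Cohen's kappa and its asymptotic standard error under kappa = 0."""
    R = len(table)
    r = [sum(row) for row in table]
    c = [sum(table[i][j] for i in range(R)) for j in range(R)]
    N = sum(r)
    diag = sum(table[i][i] for i in range(R))                  # sum of f_ii
    S = sum(r[i] * c[i] for i in range(R))                     # sum of r_i * c_i
    kappa = (N * diag - S) / (N * N - S)
    T = sum(r[i] * c[i] * (r[i] + c[i]) for i in range(R))     # sum of r_i c_i (r_i + c_i)
    se0 = math.sqrt((N * N * S + S * S - N * T) / (N * (N * N - S) ** 2))
    return kappa, se0

kappa, se0 = kappa_with_null_se([[20, 5], [10, 15]])
```

In the example, observed agreement is 0.7 and chance agreement is 0.5, which gives a Kappa of 0.4.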

Another related statistic is Bowker's test statistic, which is used to test H_0: p_{ij} = p_{ji} for all pairs (i,j). If R>2, the statistic is calculated as

Bo = \sum_{i=1}^{R} \sum_{j=1}^{i-1}\frac{(f_{ij}-f_{ji})^2}{f_{ij}+f_{ji}}

For large samples, Bo asymptotically follows a chi-square distribution with R(R-1)/2 degrees of freedom.

Note that for a 2 x 2 table, Bowker's test is equivalent to McNemar's test, so only Bowker's test is reported.
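
A short sketch of the Bowker statistic is given below. The function name `bowker_statistic` is hypothetical, and skipping a pair with f_{ij} + f_{ji} = 0 (to avoid division by zero) is an assumption of this example, not something stated above.

```python
def bowker_statistic(table):
    """Bowker's symmetry statistic and its degrees of freedom for a square table."""
    R = len(table)
    stat = 0.0
    for i in range(R):
        for j in range(i):                      # each off-diagonal pair once (j < i)
            total = table[i][j] + table[j][i]
            if total > 0:                       # assumed: empty pairs are skipped
                stat += (table[i][j] - table[j][i]) ** 2 / total
    return stat, R * (R - 1) // 2

bo, df = bowker_statistic([[10, 5, 2], [3, 12, 4], [6, 1, 8]])
```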

Odds Ratio and Relative Risk

These statistics are calculated only for a 2 x 2 table.

Odds Ratio

The Odds Ratio is calculated as

OR = \frac{f_{11}f_{22}}{f_{12}f_{21}}

Relative Risk

The Relative Risks are given by

P(Y_1|X_1)/P(Y_1|X_2) = \frac{f_{11}(f_{21}+f_{22})}{f_{21}(f_{11}+f_{12})}
P(Y_1|X_2)/P(Y_1|X_1) = \frac{f_{21}(f_{11}+f_{12})}{f_{11}(f_{21}+f_{22})}
P(Y_2|X_1)/P(Y_2|X_2) = \frac{f_{12}(f_{21}+f_{22})}{f_{22}(f_{12}+f_{11})}
P(Y_2|X_2)/P(Y_2|X_1) = \frac{f_{22}(f_{12}+f_{11})}{f_{12}(f_{21}+f_{22})}
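
These ratios translate directly into code. The sketch below uses a hypothetical function name and hypothetical dictionary keys, and assumes that no cell count is zero.

```python
def odds_ratio_and_relative_risks(f11, f12, f21, f22):
    """Odds ratio and the four relative risks of a 2 x 2 table (rows X, columns Y)."""
    r1, r2 = f11 + f12, f21 + f22               # row subtotals
    return {
        "odds_ratio": (f11 * f22) / (f12 * f21),
        "rr_y1_x1_over_x2": (f11 * r2) / (f21 * r1),   # P(Y_1|X_1) / P(Y_1|X_2)
        "rr_y1_x2_over_x1": (f21 * r1) / (f11 * r2),   # P(Y_1|X_2) / P(Y_1|X_1)
        "rr_y2_x1_over_x2": (f12 * r2) / (f22 * r1),   # P(Y_2|X_1) / P(Y_2|X_2)
        "rr_y2_x2_over_x1": (f22 * r1) / (f12 * r2),   # P(Y_2|X_2) / P(Y_2|X_1)
    }

ratios = odds_ratio_and_relative_risks(10, 20, 30, 40)
```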

Cochran-Mantel-Haenszel

Define

K is the number of layers
f_{ijk} is the frequency in the ith row, jth column and kth layer
c_{jk} = \sum_{i=1}^{R} f_{ijk} is the subtotal of the jth column in the kth layer
r_{ik} = \sum_{j=1}^{C} f_{ijk} is the subtotal of the ith row in the kth layer
n_{k} = \sum_{i=1}^{R}\sum_{j=1}^{C} f_{ijk} is the total of the kth layer
E_{ijk} = \frac{r_{ik}c_{jk}}{n_k} is the expected frequency of the cell in the ith row and jth column of the kth layer
\hat{p}_{ik} = \frac{f_{i1k}}{r_{ik}}, d_k = \hat{p}_{1k} - \hat{p}_{2k}, \hat{p}_{k} = \frac{c_{1k}}{n_{k}}

Mantel-Haenszel statistic

The Mantel-Haenszel statistic is given by

MH = \left(\sum_{k=1}^{K}\frac{r_{1k}r_{2k}}{n_k-1} \hat{p}_{k}(1-\hat{p}_{k}) \right)^{-1/2}\left(\big|\sum_{k=1}^{K} (f_{11k}-E_{11k})\big|-0.5\right)sgn\left(\sum_{k=1}^{K} (f_{11k}-E_{11k})\right)

where sgn is the sign function sgn(x) = I(x>0)-I(x<0)+0*I(x=0).
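
The Mantel-Haenszel statistic can be sketched for a list of 2 x 2 layers as follows. The function name `mantel_haenszel_statistic` and the nested-list input format are assumptions made for this example.

```python
import math

def mantel_haenszel_statistic(layers):
    """Signed, continuity-corrected Mantel-Haenszel statistic.

    layers: list of 2 x 2 tables, each given as [[f11, f12], [f21, f22]].
    """
    diff = 0.0   # sum over layers of f_11k - E_11k
    var = 0.0    # sum over layers of r_1k r_2k / (n_k - 1) * p_k (1 - p_k)
    for (f11, f12), (f21, f22) in layers:
        r1, r2 = f11 + f12, f21 + f22
        c1 = f11 + f21
        n = r1 + r2
        p = c1 / n
        diff += f11 - r1 * c1 / n
        var += r1 * r2 / (n - 1) * p * (1 - p)
    sgn = (diff > 0) - (diff < 0)               # sign function; 0 when diff == 0
    return sgn * (abs(diff) - 0.5) / math.sqrt(var)

mh = mantel_haenszel_statistic([[[10, 5], [4, 11]], [[8, 6], [3, 13]]])
```

Swapping the two rows in every layer reverses the sign of the statistic while leaving its magnitude unchanged.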


Breslow-Day statistic

The Breslow-Day statistic is

BD = \sum_{k=1}^{K} V_k \left[f_{11k}-\hat{f}_{11k}\right]^2

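where V_k = \frac{1}{\hat{f}_{11k}}+\frac{1}{\hat{f}_{12k}}+\frac{1}{\hat{f}_{21k}}+\frac{1}{\hat{f}_{22k}} and \hat{f}_{ijk} is the expected frequency of cell (i,j) in the kth layer under the hypothesis of a common odds ratio.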

Tarone’s Statistic

Tarone's statistic is

T = \sum_{k=1}^{K} V_k \left[f_{11k}-\hat{f}_{11k}\right]^2- \frac{\left[\sum_{k=1}^{K}\left(f_{11k}-\hat{f}_{11k}\right)\right]^2}{\sum_{k=1}^{K}\frac {1}{V_k} }

where V_k = \frac{1}{\hat{f}_{11k}}+\frac{1}{\hat{f}_{12k}}+\frac{1}{\hat{f}_{21k}}+\frac{1}{\hat{f}_{22k}}.
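
The following sketch illustrates both homogeneity statistics under stated assumptions. The expected count \hat{f}_{11k} is taken to be the value that keeps the layer margins fixed and reproduces a supplied common odds ratio, obtained by solving a quadratic equation. The correction term squares the summed differences, which is the form in which Tarone's adjustment is usually stated. The function names `expected_f11` and `breslow_day_and_tarone` are hypothetical.

```python
import math

def expected_f11(r1, c1, n, odds_ratio):
    """Expected f_11 with fixed margins r1, c1, n that reproduces odds_ratio."""
    if math.isclose(odds_ratio, 1.0):
        return r1 * c1 / n
    # Solve x (n - r1 - c1 + x) = odds_ratio (r1 - x)(c1 - x) for x.
    a = 1.0 - odds_ratio
    b = (n - r1 - c1) + odds_ratio * (r1 + c1)
    c = -odds_ratio * r1 * c1
    root = math.sqrt(b * b - 4 * a * c)
    lo, hi = max(0, r1 + c1 - n), min(r1, c1)
    for x in ((-b + root) / (2 * a), (-b - root) / (2 * a)):
        if lo - 1e-9 <= x <= hi + 1e-9:         # keep the admissible root
            return x
    raise ValueError("no admissible root for the given margins")

def breslow_day_and_tarone(layers, odds_ratio):
    """Breslow-Day statistic and Tarone's corrected version for 2 x 2 x K data."""
    bd = 0.0
    sum_diff = 0.0
    sum_inv_v = 0.0
    for (f11, f12), (f21, f22) in layers:
        r1, c1 = f11 + f12, f11 + f21
        n = f11 + f12 + f21 + f22
        e11 = expected_f11(r1, c1, n, odds_ratio)
        e12, e21 = r1 - e11, c1 - e11
        e22 = n - r1 - c1 + e11
        v = 1 / e11 + 1 / e12 + 1 / e21 + 1 / e22     # V_k
        bd += v * (f11 - e11) ** 2
        sum_diff += f11 - e11
        sum_inv_v += 1 / v
    tarone = bd - sum_diff ** 2 / sum_inv_v
    return bd, tarone

bd, tarone = breslow_day_and_tarone([[[10, 5], [4, 11]], [[8, 6], [3, 13]]], 214 / 38)
```

The odds ratio 214/38 passed in the usage line is the Mantel-Haenszel estimate for these two layers. When every layer has the same table and the supplied odds ratio equals the layer odds ratio, both statistics are zero.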

Common Odds Ratio

For a 2×2×K table, the odds ratio at the kth layer is OR_{k}. Assuming that a true common odds ratio exists, that is, OR_{1}=OR_{2}=\cdots=OR_{K}, the Mantel-Haenszel estimator of the common odds ratio is

\hat OR_{MH}=\frac{\sum_{k=1}^{K}\frac{f_{11k} f_{22k}}{n_{k}}}{\sum_{k=1}^{K}\frac{f_{12k} f_{21k}}{n_{k}}}

The asymptotic variance for ln(\hat OR_{MH}) is:

\hat Var[ln(\hat OR_{MH})]=\frac{\sum_{k=1}^{K}\frac{(f_{11k}+f_{22k})f_{11k} f_{22k}}{n_{k}^2}}{2\left(\sum_{k=1}^{K}\frac{f_{11k} f_{22k}}{n_{k}}\right)^2}+\frac{\sum_{k=1}^{K}\frac{(f_{11k}+f_{22k})f_{12k} f_{21k}+(f_{12k}+f_{21k})f_{11k} f_{22k}}{n_{k}^2}}{2\sum_{k=1}^{K}\frac{f_{11k} f_{22k}}{n_{k}}\sum_{k=1}^{K}\frac{f_{12k} f_{21k}}{n_{k}}}+\frac{\sum_{k=1}^{K}\frac{(f_{12k}+f_{21k})f_{12k} f_{21k}}{n_{k}^2}}{2\left(\sum_{k=1}^{K}\frac{f_{12k} f_{21k}}{n_{k}}\right)^2}

The lower confidence limit (LCL) and upper confidence limit (UCL) for ln(\hat OR_{MH}) are:

ln(\hat OR_{MH})-z(\alpha/2)\sqrt{\hat Var[ln(\hat OR_{MH})]} and ln(\hat OR_{MH})+z(\alpha/2)\sqrt{\hat Var[ln(\hat OR_{MH})]}

where z(\alpha/2) is the upper \alpha/2 critical value of the standard normal distribution.
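
A sketch of the common odds ratio estimate and its confidence limits on the log scale follows. The function name `mh_common_odds_ratio` and the default `alpha=0.05` are assumptions of this example. The variance uses the Robins-Breslow-Greenland form, in which the first and third denominators are twice the squared sums, so that a single layer reduces to the familiar 1/f_{11} + 1/f_{12} + 1/f_{21} + 1/f_{22}.

```python
import math
from statistics import NormalDist

def mh_common_odds_ratio(layers, alpha=0.05):
    """Mantel-Haenszel common odds ratio, variance of its log, and log-scale limits.

    layers: list of 2 x 2 tables, each given as [[f11, f12], [f21, f22]].
    """
    s_r = s_s = s_pr = s_cross = s_qs = 0.0
    for (f11, f12), (f21, f22) in layers:
        n = f11 + f12 + f21 + f22
        r_k = f11 * f22 / n
        s_k = f12 * f21 / n
        p_k = (f11 + f22) / n
        q_k = (f12 + f21) / n
        s_r += r_k
        s_s += s_k
        s_pr += p_k * r_k
        s_cross += p_k * s_k + q_k * r_k
        s_qs += q_k * s_k
    or_mh = s_r / s_s
    var_log = (s_pr / (2 * s_r ** 2)
               + s_cross / (2 * s_r * s_s)
               + s_qs / (2 * s_s ** 2))
    z = NormalDist().inv_cdf(1 - alpha / 2)     # upper alpha/2 normal critical value
    half_width = z * math.sqrt(var_log)
    log_or = math.log(or_mh)
    return or_mh, var_log, (log_or - half_width, log_or + half_width)

or_mh, var_log, (lcl, ucl) = mh_common_odds_ratio(
    [[[10, 5], [4, 11]], [[8, 6], [3, 13]]])
```

Exponentiating the two limits gives a confidence interval for the common odds ratio itself.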