# 17.1.4.3 Algorithms (CrossTabs)

CrossTabs, also called contingency tables, is used to examine whether an association exists between variables and how strong that association is.

## CrossTabs Method

• Frequency Counts
• Marginal and Cell
• Chi-Square Tests Table
• Fisher's Exact Test Table (2 x 2 only)
• Measures of Association
• Measures of Agreement
• Odds Ratio and Relative Risk (2 x 2 only)
• Cochran-Mantel-Haenszel

### Frequency Counts

Define

$X_i$ are the distinct values of the row variable in ascending order, i.e. $X_1 < X_2 < \cdots < X_R$
$Y_j$ are the distinct values of the column variable in ascending order, i.e. $Y_1 < Y_2 < \cdots < Y_C$
$f_{ij}$ is the frequency of cell $(i,j)$
$r_i = \sum_{j=1}^{C}f_{ij}$ is the subtotal of the $i$th row
$c_j = \sum_{i=1}^{R}f_{ij}$ is the subtotal of the $j$th column
$N = \sum_{j=1}^{C}c_j = \sum_{i=1}^{R}r_i$ is the total count.
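
The definitions above can be illustrated with a minimal Python sketch that builds $f_{ij}$, $r_i$, $c_j$ and $N$ from two sequences of paired observations. The function name `frequency_table` and the sample data are hypothetical and are not part of the tool itself.

```python
from collections import Counter

def frequency_table(row_values, col_values):
    """Cross-tabulate two equal-length sequences of observations."""
    row_levels = sorted(set(row_values))   # X_1 < X_2 < ... < X_R
    col_levels = sorted(set(col_values))   # Y_1 < Y_2 < ... < Y_C
    counts = Counter(zip(row_values, col_values))
    # f[i][j] is the number of observations equal to (X_i, Y_j)
    f = [[counts[(x, y)] for y in col_levels] for x in row_levels]
    r = [sum(row) for row in f]            # row subtotals r_i
    c = [sum(col) for col in zip(*f)]      # column subtotals c_j
    n = sum(r)                             # total count N
    return row_levels, col_levels, f, r, c, n

rows = ["a", "b", "a", "b", "a"]
cols = [1, 2, 2, 2, 1]
row_levels, col_levels, f, r, c, n = frequency_table(rows, cols)
print(f)  # [[2, 1], [0, 2]]
```

Sorting the distinct values gives the ascending order of levels that the definitions assume.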

### Marginal and Cell

| Statistics | Formula and Explanation |
|---|---|
| Count | $f_{ij}$ |
| Expected Count | $E_{ij} = \frac{r_i c_j}{N}$ |
| Row Percent | $100 \times \frac{f_{ij}}{r_i}$ |
| Column Percent | $100 \times \frac{f_{ij}}{c_j}$ |
| Total Percent | $100 \times \frac{f_{ij}}{N}$ |
| Residual | $R_{ij} = f_{ij} - E_{ij}$ |
| Std. Residual | $StdR_{ij} = \frac{R_{ij}}{\sqrt{E_{ij}}}$ |
| Adj. Residual | $AdjR_{ij} = \frac{R_{ij}}{\sqrt{E_{ij}\left(1-\frac{r_i}{N}\right)\left(1-\frac{c_j}{N}\right)}}$ |
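
As an illustration of the marginal and cell formulas, the following sketch computes every statistic for each cell of a small table. It assumes the table is given as a list of rows with no empty row or column; `cell_statistics` is a hypothetical helper name.

```python
from math import sqrt

def cell_statistics(table):
    """Return a dict of cell statistics for each cell of a table of counts."""
    r = [sum(row) for row in table]          # row subtotals r_i
    c = [sum(col) for col in zip(*table)]    # column subtotals c_j
    n = sum(r)                               # total count N
    result = []
    for i, row in enumerate(table):
        out_row = []
        for j, f in enumerate(row):
            expected = r[i] * c[j] / n
            residual = f - expected
            out_row.append({
                "count": f,
                "expected": expected,
                "row_pct": 100 * f / r[i],
                "col_pct": 100 * f / c[j],
                "total_pct": 100 * f / n,
                "residual": residual,
                "std_residual": residual / sqrt(expected),
                "adj_residual": residual / sqrt(
                    expected * (1 - r[i] / n) * (1 - c[j] / n)),
            })
        result.append(out_row)
    return result

cells = cell_statistics([[10, 20], [30, 40]])
print(cells[0][0]["expected"])  # 12.0
```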

### Chi-Square Statistics

| Statistics | Formula and Explanation | Degrees of Freedom |
|---|---|---|
| Pearson Chi-Square | $\chi_p^2 = \sum_{ij} \frac{(f_{ij}-E_{ij})^2}{E_{ij}}$ | $(R-1)(C-1)$ |
| Likelihood Ratio | $\chi_{LR}^2 = -2\sum_{ij} f_{ij} \ln (E_{ij}/f_{ij})$ | $(R-1)(C-1)$ |
| Linear Association | $\chi_{LA}^2 = (N-1)r^2$, where $r$ is the Pearson correlation coefficient. | $1$ |
| Continuity Correction | $\chi_C^2 = \frac{N(\lvert f_{11}f_{22}-f_{12}f_{21}\rvert-0.5N)^2}{r_1r_2c_1c_2} I(\lvert f_{11}f_{22}-f_{12}f_{21}\rvert>0.5N)$, calculated only for 2 x 2 tables. | $1$ |
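
A minimal sketch of the Pearson, likelihood ratio and continuity-corrected statistics follows. It assumes that empty cells contribute nothing to the likelihood ratio sum, and it omits the linear association statistic and the p-values. `chi_square_statistics` is a hypothetical name.

```python
from math import log

def chi_square_statistics(table):
    r = [sum(row) for row in table]
    c = [sum(col) for col in zip(*table)]
    n = sum(r)
    pearson = 0.0
    likelihood = 0.0
    for i, row in enumerate(table):
        for j, f in enumerate(row):
            e = r[i] * c[j] / n                 # expected count E_ij
            pearson += (f - e) ** 2 / e
            if f > 0:                           # assumption: f = 0 contributes 0
                likelihood += -2 * f * log(e / f)
    result = {
        "pearson": pearson,
        "likelihood_ratio": likelihood,
        "df": (len(r) - 1) * (len(c) - 1),
    }
    if len(r) == 2 and len(c) == 2:             # continuity correction: 2 x 2 only
        det = abs(table[0][0] * table[1][1] - table[0][1] * table[1][0])
        if det > 0.5 * n:                       # indicator I(|...| > 0.5N)
            result["continuity"] = (n * (det - 0.5 * n) ** 2
                                    / (r[0] * r[1] * c[0] * c[1]))
        else:
            result["continuity"] = 0.0
    return result

chi = chi_square_statistics([[10, 20], [30, 40]])
print(round(chi["pearson"], 4), round(chi["continuity"], 4))
```

For a 2 x 2 table the Pearson value can be cross-checked against the shortcut $N(f_{11}f_{22}-f_{12}f_{21})^2/(r_1r_2c_1c_2)$.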

### Fisher's Exact Test

This test is useful when some expected cell counts are low (less than 5). It is calculated only for 2 x 2 tables. Suppose we have the following table:

| | $X_1$ | $X_2$ | Subtotal/Total |
|---|---|---|---|
| $Y_1$ | $n_1$ | $n_3$ | $n_1+n_3$ |
| $Y_2$ | $n_2$ | $n_4$ | $n_2+n_4$ |
| Subtotal/Total | $n_1+n_2$ | $n_3+n_4$ | $N$ |

Under the null hypothesis of independence, and with the margins held fixed, the count of the first cell $N_1$ follows a hypergeometric distribution with probability

$Pr(N_1=n_1) = \frac{(n_1+n_2)!(n_3+n_4)!(n_1+n_3)!(n_2+n_4)!}{N!n_1!n_2!n_3!n_4!}$, $\max(0,n_1-n_4)\leq N_1 \leq \min(n_1+n_2,n_1+n_3)$.

#### One-Sided Test

The significance level of the one-sided test is calculated by

p(left-sided test) =$Pr(N_1\leq n_1)$
p(right-sided test) =$Pr(N_1\geq n_1)$

#### Two-Sided Test

The two-sided significance level is

$p_2 = p_1 + p_3$

where

$p_{1}= Pr(N_1\leq n_1)$, if $n_{1}\leq (n_{1}+n_{2})(n_{1}+n_{3})/N$
$p_{1}= Pr(N_1\geq n_1)$, if $n_{1}>(n_{1}+n_{2})(n_{1}+n_{3})/N$

$p_3 = \sum_{x} Pr(N_1=x)$, where the sum runs over all $x$ between $n_1+1$ and $\min(n_1+n_2,n_1+n_3)$ that satisfy $Pr(N_1=x) \leq Pr(N_1=n_1)$.
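
The following sketch shows one way to evaluate the one-sided and two-sided significance levels. It writes the hypergeometric probability with binomial coefficients, which is algebraically equal to the factorial form above. When $n_1$ exceeds its expected value, the sketch takes $p_3$ from the lower tail; that mirrored case is an assumption, as is the small tolerance used to treat floating-point ties as equal. `fisher_exact_2x2` is a hypothetical name.

```python
from math import comb

def fisher_exact_2x2(n1, n2, n3, n4):
    """n1, n2 are the X_1 column counts; n3, n4 are the X_2 column counts."""
    col1 = n1 + n2                      # first column subtotal
    row1 = n1 + n3                      # first row subtotal
    n = n1 + n2 + n3 + n4
    lo = max(0, n1 - n4)                # smallest possible N_1
    hi = min(col1, row1)                # largest possible N_1

    def pmf(x):
        # Hypergeometric probability Pr(N_1 = x) with fixed margins
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = pmf(n1)
    left = sum(pmf(x) for x in range(lo, n1 + 1))    # Pr(N_1 <= n1)
    right = sum(pmf(x) for x in range(n1, hi + 1))   # Pr(N_1 >= n1)

    if n1 <= col1 * row1 / n:
        p1, opposite = left, range(n1 + 1, hi + 1)
    else:
        # Assumption: mirror image of the documented case
        p1, opposite = right, range(lo, n1)
    p3 = sum(pmf(x) for x in opposite if pmf(x) <= p_obs * (1 + 1e-9))
    return left, right, min(1.0, p1 + p3)

left, right, two_sided = fisher_exact_2x2(3, 1, 1, 3)
print(round(two_sided, 4))  # 0.4857
```

In this example the possible values 0 to 4 of $N_1$ have probabilities 1/70, 16/70, 36/70, 16/70 and 1/70, so the two-sided level is 34/70.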

### Measures of Association

Define

$D_r = N^2 - \sum_{i=1}^{R}r_i^2$
$D_c = N^2 - \sum_{j=1}^{C}c_j^2$
$C_{ij} = \sum_{h<i}\sum_{k<j}f_{hk}+\sum_{h>i}\sum_{k>j}f_{hk}$
$D_{ij} = \sum_{h<i}\sum_{k>j}f_{hk}+\sum_{h>i}\sum_{k<j}f_{hk}$
$P = \sum_{ij}f_{ij}C_{ij}$
$Q = \sum_{ij}f_{ij}D_{ij}$
The measures and their standard errors are listed below.

| Statistics | Formula and Explanation | Standard Error |
|---|---|---|
| Phi Coefficient | $\phi = \sqrt{\chi_p^2/N}$, calculated for tables that are not 2 x 2; for a 2 x 2 table, $\phi$ equals the Pearson correlation coefficient $r$. The value lies in $[0,M]$, where $M = \min(\sqrt{R-1},\sqrt{C-1})$. | |
| Cramer's V | $V = \sqrt{\frac{\chi_p^2}{N(\min\{R,C\}-1)}}$ | |
| Contingency Coefficient | $CC = \sqrt{\frac{\chi_p^2}{\chi_p^2+N}}$ | |
| Gamma | $\gamma = \frac{P-Q}{P+Q}$ | $\frac{2}{P+Q}\sqrt{\sum_{ij}f_{ij}(C_{ij}-D_{ij})^2-\frac{1}{N}(P-Q)^2}$ |
| Kendall Tau-b | $\tau_b = \frac{P-Q}{\sqrt{D_rD_c}}$ | $2\sqrt{\frac{1}{D_rD_c}\left[\sum_{ij}f_{ij}(C_{ij}-D_{ij})^2-\frac{1}{N}(P-Q)^2\right]}$ |
| Tau-c | $\tau_c = \frac{(P-Q)q}{N^2(q-1)}$, where $q = \min\{R,C\}$ | $\frac{2q}{N^2(q-1)}\sqrt{\sum_{ij}f_{ij}(C_{ij}-D_{ij})^2-\frac{1}{N}(P-Q)^2}$ |
| Somers' D, C$\mid$R | $d_{C \mid R} = \frac{P-Q}{D_r}$ | $\frac{2}{D_r}\sqrt{\sum_{ij}f_{ij}(C_{ij}-D_{ij})^2-\frac{1}{N}(P-Q)^2}$ |
| Somers' D, R$\mid$C | $d_{R \mid C} = \frac{P-Q}{D_c}$ | $\frac{2}{D_c}\sqrt{\sum_{ij}f_{ij}(C_{ij}-D_{ij})^2-\frac{1}{N}(P-Q)^2}$ |
| Somers' D, Symmetric | $d = 2\frac{P-Q}{D_c+D_r}$ | $\frac{4}{D_c+D_r}\sqrt{\sum_{ij}f_{ij}(C_{ij}-D_{ij})^2-\frac{1}{N}(P-Q)^2}$ |
| Lambda, C$\mid$R | $\lambda_{C \mid R} = \frac{1}{N-c_m}\left(\sum_{i=1}^{R}f_{im}-c_m\right)$, where $f_{im}$ is the largest count in the $i$th row and $c_m$ is the largest column subtotal. | $\sqrt{ \frac{ N - \displaystyle\sum_{i=1}^{R} f_{im} }{ (N-c_m)^3 } \left(\sum_{i=1}^{R} f_{im} + c_m -2\sum_{i=1}^{R} (f_{im} \mid l_i=l) \right) }$, where $l_i$ is the column index of $f_{im}$ and $l$ is the column index of $c_m$. |
| Lambda, R$\mid$C | $\lambda_{R \mid C} = \frac{1}{N-r_m}\left(\sum_{j=1}^{C}f_{mj}-r_m\right)$, where $f_{mj}$ is the largest count in the $j$th column and $r_m$ is the largest row subtotal. | $\sqrt{ \frac{ N - \displaystyle\sum_{j=1}^{C} f_{mj} }{ (N-r_m)^3 } \left(\sum_{j=1}^{C} f_{mj} + r_m -2\sum_{j=1}^{C} (f_{mj} \mid k_j=k) \right) }$, where $k_j$ is the row index of $f_{mj}$ and $k$ is the row index of $r_m$. |
| Lambda, Symmetric | $\lambda = \frac { \displaystyle \sum_{i=1}^{R}f_{im} + \sum_{j=1}^{C}f_{mj} - c_m - r_m }{2N-r_m-c_m}$ | $\frac{1}{w^2} \sqrt{ wvy - 2w^2\left( N-\sum_{i=1}^{R} (f_{im} \mid i=k_{l_i}) \right) - 2v^2(N-f_{kl}) }$, where $w=2N-r_m-c_m$, $v = 2N - \sum_{i=1}^{R}f_{im} - \sum_{j=1}^{C}f_{mj}$, $x = \sum_{i=1}^R (f_{im} \mid l_i=l) + \sum_{j=1}^C (f_{mj} \mid k_j=k) + f_{km} + f_{ml}$, and $y = 8N - w - v - 2x$. |
| Uncertainty, C$\mid$R | $U_{C \mid R} = \frac{U(X)+U(Y)-U(XY)}{U(Y)}$, where $U(X) = -\sum_{i=1}^{R}\frac{r_i}{N}\ln\frac{r_i}{N}$, $U(Y) = -\sum_{j=1}^{C}\frac{c_j}{N}\ln\frac{c_j}{N}$, and $U(XY) = -\sum_{ij}\frac{f_{ij}}{N}\ln\frac{f_{ij}}{N}$ | $\frac{1}{NU(Y)}\sqrt{P-N\left(U(X)+U(Y)-U(XY)\right)^2}$, where, for the uncertainty coefficients only, $P = \sum_{ij}f_{ij}\left[\ln\left(\frac{r_ic_j}{f_{ij}N}\right)\right]^2$ |
| Uncertainty, R$\mid$C | $U_{R \mid C} = \frac{U(X)+U(Y)-U(XY)}{U(X)}$ | $\frac{1}{NU(X)}\sqrt{P-N\left(U(X)+U(Y)-U(XY)\right)^2}$ |
| Uncertainty, Symmetric | $U = 2\frac{U(X)+U(Y)-U(XY)}{U(X)+U(Y)}$ | $\frac{2}{N(U(X)+U(Y))}\sqrt{P-N\left(U(X)+U(Y)-U(XY)\right)^2}$ |
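
The ordinal measures (Gamma, Tau-b, Tau-c and Somers' D) all derive from $P$ and $Q$. The sketch below computes them by brute force. It takes $C_{ij}$ as the total count in cells lying above-left or below-right of cell $(i,j)$, and $D_{ij}$ as the total count in cells lying above-right or below-left. `ordinal_measures` is a hypothetical name, and the quadruple loop favors clarity over speed.

```python
from math import sqrt

def ordinal_measures(f):
    R, C = len(f), len(f[0])
    r = [sum(row) for row in f]
    c = [sum(col) for col in zip(*f)]
    n = sum(r)
    P = Q = spread = 0
    for i in range(R):
        for j in range(C):
            conc = disc = 0
            for h in range(R):
                for k in range(C):
                    if (h < i and k < j) or (h > i and k > j):
                        conc += f[h][k]        # contributes to C_ij
                    elif (h < i and k > j) or (h > i and k < j):
                        disc += f[h][k]        # contributes to D_ij
            P += f[i][j] * conc
            Q += f[i][j] * disc
            spread += f[i][j] * (conc - disc) ** 2
    d_r = n * n - sum(x * x for x in r)
    d_c = n * n - sum(x * x for x in c)
    q = min(R, C)
    root = sqrt(spread - (P - Q) ** 2 / n)     # term shared by the standard errors
    return {
        "gamma": (P - Q) / (P + Q),
        "gamma_se": 2 / (P + Q) * root,
        "tau_b": (P - Q) / sqrt(d_r * d_c),
        "tau_c": (P - Q) * q / (n * n * (q - 1)),
        "somers_c_given_r": (P - Q) / d_r,
        "somers_r_given_c": (P - Q) / d_c,
        "somers_symmetric": 2 * (P - Q) / (d_c + d_r),
    }

m = ordinal_measures([[10, 20], [30, 40]])
print(m["gamma"])  # -0.2
```

For this table $P = 800$ and $Q = 1200$, so every measure is negative. For a 2 x 2 table, Tau-b equals $(f_{11}f_{22}-f_{12}f_{21})/\sqrt{r_1r_2c_1c_2}$, which is a useful cross-check.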

### Measures of Agreement

This table is calculated only when two conditions are satisfied: (1) the table is square, i.e. $R=C$, and (2) the row variable and the column variable take the same set of values.

The Kappa statistic is calculated by

$\kappa = \frac{N\sum_{i=1}^{R}f_{ii} - \sum_{i=1}^{R}r_ic_i}{N^2 - \sum_{i=1}^{R}r_ic_i}$

The standard error is estimated by:

$SE_1 = \frac{1}{1-p_e} \sqrt{ \frac{A+B-C}{N} }$.

where $p_e = \frac{ \sum_{i=1}^R r_i c_i }{ N^2 }$, $A = \sum_{i=1}^R \frac{f_{ii}}{N} \left( 1-\frac{(r_i+c_i)(1- \kappa)}{N} \right)^2$,
$B = (1-\kappa)^2 \sum_{i=1}^R \sum_{j=1, j \ne i}^{C} \frac{f_{ij} (r_i+c_j)^2}{N^3}$ and $C = \Bigl( \kappa - p_e( 1-\kappa ) \Bigr)^2$.

The corresponding asymptotic standard error under the null hypothesis $\kappa = 0$ is given by

$SE_0 = \sqrt{\frac{1}{N\left(N^2 - \sum_{i=1}^{R}r_ic_i\right)^2} \left[N^2\sum_{i=1}^{R}r_ic_i + \left(\sum_{i=1}^{R}r_ic_i\right)^2 - N \sum_{i=1}^{R}r_ic_i(r_i+c_i)\right]}$

Another related statistic is Bowker's test, which is used to test $H_0: p_{ij} = p_{ji}$ for all pairs $(i,j)$. If $R>2$, the statistic is calculated as

$Bo = \sum_{i=1}^{R} \sum_{j=1}^{i-1} \frac{(f_{ij}-f_{ji})^2}{f_{ij}+f_{ji}}$

For large samples, $Bo$ asymptotically follows a chi-square distribution with $0.5R(R-1)$ degrees of freedom.

Note that for a 2 x 2 table, Bowker's test is equivalent to McNemar's test, so only Bowker's test is given.
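
The agreement statistics can be sketched as follows. The code computes $\kappa$, the null standard error $SE_0$ and Bowker's statistic, taking the latter in its usual form $\sum_{i>j}(f_{ij}-f_{ji})^2/(f_{ij}+f_{ji})$. Skipping pairs with $f_{ij}+f_{ji}=0$ is an assumption, and `kappa_and_bowker` is a hypothetical name.

```python
from math import sqrt

def kappa_and_bowker(f):
    R = len(f)                                    # square table, R == C
    r = [sum(row) for row in f]
    c = [sum(col) for col in zip(*f)]
    n = sum(r)
    agree = sum(f[i][i] for i in range(R))        # sum of diagonal counts
    chance = sum(r[i] * c[i] for i in range(R))   # sum of r_i * c_i
    kappa = (n * agree - chance) / (n * n - chance)
    se0 = sqrt((n * n * chance + chance ** 2
                - n * sum(r[i] * c[i] * (r[i] + c[i]) for i in range(R)))
               / (n * (n * n - chance) ** 2))
    bowker = 0.0
    for i in range(R):
        for j in range(i):                        # each off-diagonal pair once
            pair = f[i][j] + f[j][i]
            if pair > 0:                          # assumption: skip empty pairs
                bowker += (f[i][j] - f[j][i]) ** 2 / pair
    return kappa, se0, bowker, R * (R - 1) // 2

kappa, se0, bowker, df = kappa_and_bowker([[20, 5], [10, 15]])
print(kappa)  # 0.4
```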

### Odds Ratio and Relative Risk

These statistics are calculated only for 2 x 2 tables.

#### Odds Ratio

The Odds Ratio is calculated as

$OR = \frac{f_{11}f_{22}}{f_{12}f_{21}}$

#### Relative Risk

The Relative Risks are given by

$P(Y_1|X_1)/P(Y_1|X_2) = \frac{f_{11}(f_{21}+f_{22})}{f_{21}(f_{11}+f_{12})}$
$P(Y_1|X_2)/P(Y_1|X_1) = \frac{f_{21}(f_{11}+f_{12})}{f_{11}(f_{21}+f_{22})}$
$P(Y_2|X_1)/P(Y_2|X_2) = \frac{f_{12}(f_{21}+f_{22})}{f_{22}(f_{12}+f_{11})}$
$P(Y_2|X_2)/P(Y_2|X_1) = \frac{f_{22}(f_{12}+f_{11})}{f_{12}(f_{21}+f_{22})}$
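
A short sketch of the odds ratio and the first and third relative risks follows; the other two relative risks are their reciprocals. It assumes that no cell is zero, and `odds_ratio_and_relative_risks` is a hypothetical name.

```python
def odds_ratio_and_relative_risks(f11, f12, f21, f22):
    r1 = f11 + f12                        # first row subtotal
    r2 = f21 + f22                        # second row subtotal
    odds_ratio = (f11 * f22) / (f12 * f21)
    rr_y1 = (f11 / r1) / (f21 / r2)       # P(Y_1|X_1) / P(Y_1|X_2)
    rr_y2 = (f12 / r1) / (f22 / r2)       # P(Y_2|X_1) / P(Y_2|X_2)
    return odds_ratio, rr_y1, rr_y2

odds_ratio, rr_y1, rr_y2 = odds_ratio_and_relative_risks(10, 20, 30, 40)
print(round(odds_ratio, 4), round(rr_y1, 4), round(rr_y2, 4))
```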

### Cochran-Mantel-Haenszel

Let

$K$ be the number of layers
$f_{ijk}$ be the frequency in the ith row, jth column and kth layer
$c_{jk} = \sum_{i=1}^{R} f_{ijk}$ be the jth column, kth layer subtotal
$r_{ik} = \sum_{j=1}^{C} f_{ijk}$ be the ith row, kth layer subtotal
$n_{k} = \sum_{i=1}^{R}\sum_{j=1}^{C} f_{ijk}$ be the kth layer subtotal
$E_{ijk} = \frac{r_{ik}c_{jk}}{n_k}$ be the expected frequency of the ith row jth column kth layer cell
$\hat{p}_{ik} = \frac{f_{i1k}}{r_{ik}}, d_k = \hat{p}_{1k} - \hat{p}_{2k}, \hat{p}_{k} = \frac{c_{1k}}{n_{k}}$

#### Mantel-Haenszel statistic

The Mantel-Haenszel statistic is given by

$MH = \left(\sum_{k=1}^{K}\frac{r_{1k}r_{2k}}{n_k-1} \hat{p}_{k}(1-\hat{p}_{k}) \right)^{-1/2}\left(\big|\sum_{k=1}^{K} (f_{11k}-E_{11k})\big|-0.5\right)sgn\left(\sum_{k=1}^{K} (f_{11k}-E_{11k})\right)$

where sgn is the sign function $sgn(x) = I(x>0)-I(x<0)+0*I(x=0)$.
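
The Mantel-Haenszel statistic can be sketched for a stack of 2 x 2 layers as below. The function follows the formula as written, including the 0.5 continuity correction, and assumes every layer has $n_k > 1$. `mantel_haenszel_statistic` is a hypothetical name.

```python
from math import sqrt

def mantel_haenszel_statistic(layers):
    """layers: list of 2 x 2 tables [[f11, f12], [f21, f22]], one per layer."""
    diff_sum = 0.0
    var_sum = 0.0
    for (f11, f12), (f21, f22) in layers:
        r1, r2 = f11 + f12, f21 + f22      # row subtotals r_1k, r_2k
        c1 = f11 + f21                     # first column subtotal c_1k
        n = r1 + r2                        # layer total n_k
        p = c1 / n                         # p-hat_k
        diff_sum += f11 - r1 * c1 / n      # f_11k - E_11k
        var_sum += r1 * r2 / (n - 1) * p * (1 - p)
    sign = (diff_sum > 0) - (diff_sum < 0) # sgn of the summed differences
    return sign * (abs(diff_sum) - 0.5) / sqrt(var_sum)

mh = mantel_haenszel_statistic([[[10, 20], [30, 40]], [[15, 5], [10, 20]]])
print(round(mh, 4))
```

In this example the summed difference is $(10-12)+(15-10)=3$, so the numerator is $2.5$.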

#### Breslow-Day statistic

The Breslow-Day statistic is

$BD = \sum_{k=1}^{K} V_k \left[f_{11k}-\hat{f}_{11k}\right]^2$

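where $V_k = \frac{1}{\hat{f}_{11k}}+\frac{1}{\hat{f}_{12k}}+\frac{1}{\hat{f}_{21k}}+\frac{1}{\hat{f}_{22k}}$ and $\hat{f}_{ijk}$ is the expected frequency of cell $(i,j)$ in the $k$th layer under the hypothesis of a common odds ratio.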

#### Tarone’s Statistic

Tarone’s statistic is

$T = \sum_{k=1}^{K} V_k \left[f_{11k}-\hat{f}_{11k}\right]^2- \frac{\left(\sum_{k=1}^{K}\left[f_{11k}-\hat{f}_{11k}\right]\right)^2}{\sum_{k=1}^{K}\frac {1}{V_k} }$

where $V_k = \frac{1}{\hat{f}_{11k}}+\frac{1}{\hat{f}_{12k}}+\frac{1}{\hat{f}_{21k}}+\frac{1}{\hat{f}_{22k}}$.

#### Common Odds Ratio

For a 2×2×K table, the odds ratio at the $k$th layer is $OR_{k}$. Assuming that a true common odds ratio exists, that is, $OR_{1}=OR_{2}=\cdots=OR_{K}$, the Mantel-Haenszel estimator of the common odds ratio is

$\hat OR_{MH}=\frac{\sum_{k=1}^{K}\frac{f_{11k} f_{22k}}{n_{k}}}{\sum_{k=1}^{K}\frac{f_{12k} f_{21k}}{n_{k}}}$

The asymptotic variance of $\ln(\hat OR_{MH})$ is:

$\hat Var[\ln(\hat OR_{MH})]=\frac{\sum_{k=1}^{K}\frac{(f_{11k}+f_{22k})f_{11k} f_{22k}}{n_{k}^2}}{2\left(\sum_{k=1}^{K}\frac{f_{11k} f_{22k}}{n_{k}}\right)^2}+\frac{\sum_{k=1}^{K}\frac{(f_{11k}+f_{22k})f_{12k} f_{21k}+(f_{12k}+f_{21k})f_{11k} f_{22k}}{n_{k}^2}}{2\sum_{k=1}^{K}\frac{f_{11k} f_{22k}}{n_{k}}\sum_{k=1}^{K}\frac{f_{12k} f_{21k}}{n_{k}}}+\frac{\sum_{k=1}^{K}\frac{(f_{12k}+f_{21k})f_{12k} f_{21k}}{n_{k}^2}}{2\left(\sum_{k=1}^{K}\frac{f_{12k} f_{21k}}{n_{k}}\right)^2}$

The lower confidence limit (LCL) and upper confidence limit (UCL) for $\ln(\hat OR_{MH})$ are:

$\ln(\hat OR_{MH})-z_{\alpha/2}\sqrt{\hat Var[\ln(\hat OR_{MH})]}$ and $\ln(\hat OR_{MH})+z_{\alpha/2}\sqrt{\hat Var[\ln(\hat OR_{MH})]}$

where $z_{\alpha/2}$ is the upper $\alpha/2$ critical value of the standard normal distribution.
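
The common odds ratio and its confidence limits can be sketched as below. The variance is taken in the Robins-Breslow-Greenland form, in which the first and last denominators are twice the square of the corresponding sum. With a single layer this reduces to $1/f_{11}+1/f_{12}+1/f_{21}+1/f_{22}$, which gives a convenient check. The name `common_odds_ratio` and the default `alpha=0.05` are assumptions.

```python
from math import log, sqrt
from statistics import NormalDist

def common_odds_ratio(layers, alpha=0.05):
    """layers: list of 2 x 2 tables [[f11, f12], [f21, f22]], one per layer."""
    sum_r = sum_s = 0.0
    sum_pr = sum_ps_qr = sum_qs = 0.0
    for (f11, f12), (f21, f22) in layers:
        n = f11 + f12 + f21 + f22
        p = (f11 + f22) / n               # diagonal share
        q = (f12 + f21) / n               # off-diagonal share
        r = f11 * f22 / n
        s = f12 * f21 / n
        sum_r += r
        sum_s += s
        sum_pr += p * r
        sum_ps_qr += p * s + q * r
        sum_qs += q * s
    or_mh = sum_r / sum_s
    var_log = (sum_pr / (2 * sum_r ** 2)
               + sum_ps_qr / (2 * sum_r * sum_s)
               + sum_qs / (2 * sum_s ** 2))
    z = NormalDist().inv_cdf(1 - alpha / 2)   # upper alpha/2 critical value
    half_width = z * sqrt(var_log)
    return or_mh, var_log, log(or_mh) - half_width, log(or_mh) + half_width

or_mh, var_log, lcl, ucl = common_odds_ratio(
    [[[10, 20], [30, 40]], [[15, 5], [10, 20]]])
single = common_odds_ratio([[[10, 20], [30, 40]]])
print(round(or_mh, 4))  # 1.4286
```

The limits returned are on the log scale; exponentiating them gives limits for the odds ratio itself.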