void nag_mv_canon_var (Nag_Weightstype weight, Integer n, Integer m, const double x[], Integer tdx, const Integer isx[], Integer nx, const Integer ing[], Integer ng, const double wt[], Integer nig[], double cvm[], Integer tdcvm, double e[], Integer tde, Integer *ncv, double cvx[], Integer tdcvx, double tol, Integer *irankx, NagError *fail)

3 Description

Let a sample of $n$ observations on $n_{x}$ variables in a data matrix come from $n_{g}$ groups with $n_{1}, n_{2}, \dots, n_{n_{g}}$ observations in each group, $\sum n_{i} = n$. Canonical variate analysis finds the linear combination of the $n_{x}$ variables that maximizes the ratio of between-group to within-group variation. The variables formed, the canonical variates, can then be used to discriminate between groups.

The canonical variates can be calculated from the eigenvectors of the within-group sums of squares and cross-products matrix. However, nag_mv_canon_var (g03acc) calculates the canonical variates by means of a singular value decomposition (SVD) of a matrix $V$. Let the data matrix with variable (column) means subtracted be $X$, and let its rank be $k$; then the $k$ by $(n_{g} - 1)$ matrix $V$ is given by:

$$V = Q_{X}^{T} Q_{g},$$

where $Q_{g}$ is an $n$ by $(n_{g} - 1)$ orthogonal matrix that defines the groups and $Q_{X}$ is the first $k$ columns of the orthogonal matrix $Q$, taken either from the $QR$ decomposition of $X$, $X = QR$, if $X$ is of full column rank, i.e., $k = n_{x}$, or otherwise from the SVD of $X$, $X = Q D P^{T}$.

Let the SVD of $V$ be:

$$V = U_{x} Δ U_{g}^{T};$$

then the nonzero elements of the diagonal matrix $Δ$, $δ_{i}$, for $i = 1, 2, \dots, l$, are the $l$ canonical correlations associated with the $l$ canonical variates, where $l = \min (k, n_{g} - 1)$.

The eigenvalues, $λ_{i}^{2}$, of the within-group sums of squares matrix are given by:

$$λ_{i}^{2} = \frac{δ_{i}^{2}}{1 - δ_{i}^{2}},$$

and the value of $π_{i} = λ_{i}^{2} / \sum λ_{i}^{2}$ gives the proportion of variation explained by the $i$th canonical variate. The values of the $π_{i}$ give an indication as to how many canonical variates are needed to adequately describe the data, i.e., the dimensionality of the problem.
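The relationship between the canonical correlations, the eigenvalues and the proportions of variation can be sketched in C as follows. This is an illustrative helper only, not part of the NAG Library; the name `cv_eigen_and_proportion` and its return convention are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a NAG routine).
 * Converts l canonical correlations delta[i] into
 *   lambda2[i] = delta[i]^2 / (1 - delta[i]^2)
 *   pi[i]      = lambda2[i] / sum_j lambda2[j]
 * Returns 0 on success, or -1 if a correlation lies outside [0, 1),
 * because the ratio is undefined when delta[i] = 1. */
int cv_eigen_and_proportion(const double delta[], size_t l,
                            double lambda2[], double pi[])
{
    double total = 0.0;
    size_t i;

    assert(delta != NULL && lambda2 != NULL && pi != NULL);

    for (i = 0; i < l; ++i) {
        double d2 = delta[i] * delta[i];
        if (delta[i] < 0.0 || d2 >= 1.0)
            return -1;
        lambda2[i] = d2 / (1.0 - d2);
        total += lambda2[i];
    }
    /* If every correlation is zero there is no variation to apportion. */
    for (i = 0; i < l; ++i)
        pi[i] = (total > 0.0) ? lambda2[i] / total : 0.0;
    return 0;
}
```

The rejected case $δ_{i} = 1$ corresponds to the situation in which the variables separate the groups exactly.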

To test for a significant dimensionality greater than $i$, the $χ^{2}$ statistic:

$$\left(n - 1 - n_{g} - \frac{1}{2} (k - n_{g})\right) \sum_{j = i + 1}^{l} \log (1 + λ_{j}^{2})$$

can be used. This is asymptotically distributed as a $χ^{2}$ distribution with $(k - i) (n_{g} - 1 - i)$ degrees of freedom. If the test for $i = h$ is not significant, then the remaining tests for $i > h$ should be ignored.

The loadings for the canonical variates are calculated from the matrix $U_{x}$. This matrix is scaled so that the canonical variates have unit within-group variance.

In addition to the canonical variate loadings, the means for each canonical variate are calculated for each group.

Weights can be used with the analysis, in which case the weighted means are subtracted from each column and then each row is scaled by an amount $\sqrt{w_{i}}$, where $w_{i}$ is the weight for the $i$th observation (row).

4 References

Chatfield C and Collins A J (1980) Introduction to Multivariate Analysis Chapman and Hall

Gnanadesikan R (1977) Methods for Statistical Data Analysis of Multivariate Observations Wiley

Hammarling S (1985) The singular value decomposition in multivariate statistics SIGNUM Newsl. 20(3) 2–25

Kendall M G and Stuart A (1979) The Advanced Theory of Statistics (3 Volumes) (4th Edition) Griffin

5 Arguments

1: $weight$ – Nag_Weightstype Input

On entry: indicates the type of weights to be used in the analysis.

$weight = Nag_NoWeights$: No weights are used.
$weight = Nag_Weightsfreq$: The weights are treated as frequencies and the effective number of observations is the sum of the weights.
$weight = Nag_Weightsvar$: The weights are treated as being inversely proportional to the variance of the observations and the effective number of observations is the number of observations with nonzero weights.

Constraint: $weight = Nag_NoWeights$, $Nag_Weightsfreq$ or $Nag_Weightsvar$.

2: $n$ – Integer Input

On entry: the number of observations, $n$.

Constraint: $n \geq nx + ng$.

3: $m$ – Integer Input

On entry: the total number of variables, $m$.

Constraint: $m \geq nx$.

4: $x [n \times tdx]$ – const double Input

On entry: $x [(i - 1) \times tdx + j - 1]$ must contain the $i$th observation for the $j$th variable, for $i = 1, 2, \dots, n$ and $j = 1, 2, \dots, m$.

5: $tdx$ – Integer Input

On entry: the stride separating matrix column elements in the array x.

Constraint: $tdx \geq m$.
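The storage convention for x, in which consecutive observations (rows) are tdx elements apart, can be illustrated with a small accessor. The function `obs_value` is a hypothetical helper introduced for this sketch and is not part of the library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a NAG routine).
 * Returns the i-th observation of the j-th variable (both 1-based, as in
 * the text) from a row-major array whose rows are tdx elements apart. */
double obs_value(const double x[], size_t tdx, size_t i, size_t j)
{
    assert(i >= 1 && j >= 1 && j <= tdx);
    return x[(i - 1) * tdx + (j - 1)];
}
```

With $m = 3$ and $tdx = 4$, for instance, each row carries one unused trailing element, and the third variable of the second observation is found at index $(2 - 1) \times 4 + 3 - 1 = 6$.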

6: $isx [m]$ – const Integer Input

On entry: $isx [j - 1]$ indicates whether or not the $j$th variable is to be included in the analysis. If $isx [j - 1] > 0$, then the variable contained in the $j$th column of x is included in the canonical variate analysis, for $j = 1, 2, \dots, m$.

Constraint: $isx [j - 1] > 0$ for nx values of $j$.

7: $nx$ – Integer Input

On entry: the number of variables in the analysis, $n_{x}$.

Constraint: $nx \geq 1$.

8: $ing [n]$ – const Integer Input

On entry: $ing [i - 1]$ indicates which group the $i$th observation is in, for $i = 1, 2, \dots, n$. The effective number of groups is the number of groups with nonzero membership.

Constraint: $1 \leq ing [i - 1] \leq ng$, for $i = 1, 2, \dots, n$.

9: $ng$ – Integer Input

On entry: the number of groups, $n_{g}$.

Constraint: $ng \geq 2$.

10: $wt [n]$ – const double Input

On entry: if $weight = Nag_Weightsfreq$ or $Nag_Weightsvar$ then the elements of wt must contain the weights to be used in the analysis. If $wt [i - 1] = 0.0$ then the $i$th observation is not included in the analysis.

Constraints:

$wt [i - 1] \geq 0.0$, for $i = 1, 2, \dots, n$;
$\sum_{i = 1}^{n} wt [i - 1] \geq nx +$ effective number of groups.
Note: if $weight = Nag_NoWeights$ then wt is not referenced and may be NULL.

11: $nig [ng]$ – Integer Output

On exit: $nig [j - 1]$ gives the number of observations in group $j$, for $j = 1, 2, \dots, n_{g}$.
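The group counts and the effective number of groups can be pictured with the following sketch for the unweighted case. It is an illustration only: `count_groups` is a hypothetical name, plain `int` stands in for the library's Integer type, and how the library treats zero-weight observations when counting is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a NAG routine), unweighted case only.
 * ing[i] holds the 1-based group label of observation i. On return,
 * nig[g] is the number of observations carrying label g + 1.
 * Returns the effective number of groups (those with at least one
 * observation), or -1 if a label lies outside 1..ng. */
int count_groups(const int ing[], size_t n, int ng, int nig[])
{
    int g, effective = 0;
    size_t i;

    assert(ng >= 2);
    for (g = 0; g < ng; ++g)
        nig[g] = 0;
    for (i = 0; i < n; ++i) {
        if (ing[i] < 1 || ing[i] > ng)
            return -1;
        ++nig[ing[i] - 1];
    }
    for (g = 0; g < ng; ++g)
        if (nig[g] > 0)
            ++effective;
    return effective;
}
```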

12: $cvm [ng \times tdcvm]$ – double Output

On exit: $cvm [(i - 1) \times tdcvm + j - 1]$ contains the mean of the $j$th canonical variate for the $i$th group, for $i = 1, 2, \dots, n_{g}$ and $j = 1, 2, \dots, l$; the remaining columns, if any, are used as workspace.

13: $tdcvm$ – Integer Input

On entry: the stride separating matrix column elements in the array cvm.

Constraint: $tdcvm \geq nx$.

14: $e [\min (nx, ng - 1) \times tde]$ – double Output

On exit: the statistics of the canonical variate analysis.

$e [(i - 1) \times tde]$, the canonical correlations, $δ_{i}$, for $i = 1, 2, \dots, l$.
$e [(i - 1) \times tde + 1]$, the eigenvalues of the within-group sum of squares matrix, $λ_{i}^{2}$, for $i = 1, 2, \dots, l$.
$e [(i - 1) \times tde + 2]$, the proportion of variation explained by the $i$th canonical variate, for $i = 1, 2, \dots, l$.
$e [(i - 1) \times tde + 3]$, the $χ^{2}$ statistic for the $i$th canonical variate, for $i = 1, 2, \dots, l$.
$e [(i - 1) \times tde + 4]$, the degrees of freedom for the $χ^{2}$ statistic for the $i$th canonical variate, for $i = 1, 2, \dots, l$.
$e [(i - 1) \times tde + 5]$, the significance level for the $χ^{2}$ statistic for the $i$th canonical variate, for $i = 1, 2, \dots, l$.
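One way to read the six columns of e, and to apply the rule from Section 3 that tests following the first non-significant one should be ignored, is sketched below. The enumerators, `cv_stat` and `cv_significant_dims` are hypothetical names introduced for this example, and the threshold `alpha` is a choice left to the caller rather than anything the library prescribes.

```c
#include <assert.h>

/* Hypothetical column labels for the array e (not defined by NAG). */
enum { CV_CORR = 0, CV_EIGEN = 1, CV_PROP = 2,
       CV_CHISQ = 3, CV_DF = 4, CV_SIG = 5 };

/* Return statistic `col` for the i-th canonical variate (i is 1-based). */
double cv_stat(const double e[], int tde, int i, int col)
{
    assert(tde >= 6 && i >= 1 && col >= CV_CORR && col <= CV_SIG);
    return e[(i - 1) * tde + col];
}

/* Count the leading rows of e whose significance level is below alpha,
 * stopping at the first row that is not significant; later rows are
 * ignored even if their significance level is small. */
int cv_significant_dims(const double e[], int tde, int ncv, double alpha)
{
    int i;

    for (i = 1; i <= ncv; ++i)
        if (cv_stat(e, tde, i, CV_SIG) >= alpha)
            break;
    return i - 1;
}
```

Here `ncv` is the number of canonical variates returned by the function, so only the first `ncv` rows of e are inspected.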

15: $tde$ – Integer Input

On entry: the stride separating matrix column elements in the array e.

Constraint: $tde \geq 6$.

16: $ncv$ – Integer * Output

On exit: the number of canonical variates, $l$. This will be the minimum of $n_{g} - 1$ and the rank of x.

17: $cvx [nx \times tdcvx]$ – double Output

On exit: the canonical variate loadings. $cvx [(i - 1) \times tdcvx + j - 1]$ contains the loading coefficient for the $i$th variable on the $j$th canonical variate, for $i = 1, 2, \dots, n_{x}$ and $j = 1, 2, \dots, l$; the remaining columns, if any, are used as workspace.

18: $tdcvx$ – Integer Input

On entry: the stride separating matrix column elements in the array cvx.

Constraint: $tdcvx \geq ng - 1$.

19: $tol$ – double Input

On entry: the value of tol is used to decide if the variables are of full rank and, if not, what is the rank of the variables. The smaller the value of tol the stricter the criterion for selecting the singular value decomposition. If a non-negative value of tol less than machine precision is entered, then the square root of machine precision is used instead.

Constraint: $tol \geq 0.0$.

20: $irankx$ – Integer * Output

On exit: the rank of the dependent variables.

If the variables are of full rank then $irankx = nx$.

If the variables are not of full rank then irankx is an estimate of the rank of the dependent variables. irankx is calculated as the number of singular values greater than $tol \times$ (largest singular value).
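The rank estimate described for the rank-deficient case amounts to a simple count, sketched here. The function `estimate_rank` is a hypothetical illustration; it takes the singular values as given and does not include the substitution of the square root of machine precision for very small tol.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a NAG routine).
 * Counts the singular values that exceed tol * (largest singular value).
 * The singular values in sv[] are assumed to be non-negative. */
size_t estimate_rank(const double sv[], size_t n, double tol)
{
    double largest = 0.0;
    size_t i, rank = 0;

    assert(tol >= 0.0);
    for (i = 0; i < n; ++i)
        if (sv[i] > largest)
            largest = sv[i];
    for (i = 0; i < n; ++i)
        if (sv[i] > tol * largest)
            ++rank;
    return rank;
}
```

If every singular value is zero the count is zero, which corresponds to the NE_RANK_ZERO condition listed in Section 6.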

21: $fail$ – NagError * Input/Output

The NAG error argument (see Section 3.7 in How to Use the NAG Library and its Documentation).
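The scalar constraints listed for the arguments above can be collected into a single pre-call check, as in the following sketch. The function `cv_check_args` and its string return values are hypothetical, and plain `int` is used in place of the library's Integer type as an assumption; the library performs its own checks and reports them through fail, so this is only a convenience for catching mistakes early.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a NAG routine).
 * Returns NULL if every documented scalar constraint holds, otherwise a
 * short description of the first constraint that is violated. */
const char *cv_check_args(int n, int m, int tdx, const int isx[], int nx,
                          int ng, int tdcvm, int tde, int tdcvx, double tol)
{
    int j, selected = 0;

    assert(isx != NULL);
    if (nx < 1)          return "nx >= 1";
    if (ng < 2)          return "ng >= 2";
    if (n < nx + ng)     return "n >= nx + ng";
    if (m < nx)          return "m >= nx";
    if (tdx < m)         return "tdx >= m";
    if (tdcvm < nx)      return "tdcvm >= nx";
    if (tdcvx < ng - 1)  return "tdcvx >= ng - 1";
    if (tde < 6)         return "tde >= 6";
    if (tol < 0.0)       return "tol >= 0.0";

    /* isx must flag exactly nx of the m variables for inclusion. */
    for (j = 0; j < m; ++j)
        if (isx[j] > 0)
            ++selected;
    if (selected != nx)  return "isx[j-1] > 0 for nx values of j";
    return NULL;
}
```

For a problem with nine observations, three variables and three groups, the smallest strides that pass are $tdx = 3$, $tdcvm = 3$, $tde = 6$ and $tdcvx = 2$.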

6 Error Indicators and Warnings

NE_2_INT_ARG_LT: On entry, $m = 〈value〉$ while $nx = 〈value〉$. These arguments must satisfy $m \geq nx$.

On entry, $tdcvm = 〈value〉$ while $nx = 〈value〉$. These arguments must satisfy $tdcvm \geq nx$.

On entry, $tdcvx = 〈value〉$ while $ng = 〈value〉$. These arguments must satisfy $tdcvx \geq ng - 1$.

On entry, $tdx = 〈value〉$ while $m = 〈value〉$. These arguments must satisfy $tdx \geq m$.
NE_3_INT_ARG_CONS: On entry, $n = 〈value〉$, $nx = 〈value〉$ and $ng = 〈value〉$. These arguments must satisfy $n \geq nx + ng$.
NE_ALLOC_FAIL: Dynamic memory allocation failed.
NE_BAD_PARAM: On entry, argument weight had an illegal value.
NE_CANON_CORR_1: A canonical correlation is equal to one. This will happen if the variables provide an exact indication as to which group every observation is allocated.
NE_GROUPS: Either the effective number of groups is less than two or the effective number of groups plus the number of variables, nx, is greater than the effective number of observations.
NE_INT_ARG_LT: On entry, $ng = 〈value〉$.
Constraint: $ng \geq 2$.

On entry, $nx = 〈value〉$.
Constraint: $nx \geq 1$.

On entry, $tde = 〈value〉$.
Constraint: $tde \geq 6$.
NE_INTARR_INT: On entry, $ing [〈value〉] = 〈value〉$, $ng = 〈value〉$. Constraint: $1 \leq ing [i - 1] \leq ng$, for $i = 1, 2, \dots, n$.
NE_INTERNAL_ERROR: An internal error has occurred in this function. Check the function call and any array sizes. If the call is correct then please contact NAG for assistance.
NE_NEG_WEIGHT_ELEMENT: On entry, $wt [〈value〉] = 〈value〉$.
Constraint: When referenced, all elements of wt must be non-negative.
NE_RANK_ZERO: The rank of the variables is zero. This will happen if all the variables are constants.
NE_REAL_ARG_LT: On entry, tol must not be less than $0.0$: $tol = 〈value〉$.
NE_SVD_NOT_CONV: The singular value decomposition has failed to converge. This is an unlikely error exit.
NE_VAR_INCL_INDICATED: The number of variables, nx, in the analysis $= 〈value〉$, while the number of variables included in the analysis via array $isx = 〈value〉$.
Constraint: these two numbers must be the same.
NE_WT_ARGS: The wt array argument must not be NULL when the weight argument indicates weights.

7 Accuracy

As the computation uses orthogonal matrices and a singular value decomposition, rather than the traditional approach of forming a sums of squares matrix and computing its eigenvalue decomposition, nag_mv_canon_var (g03acc) should be less affected by ill-conditioned problems.

8 Parallelism and Performance

nag_mv_canon_var (g03acc) is not threaded in any implementation.

9 Further Comments

None.

10 Example

A sample of nine observations, each consisting of three variables plus group indicator, is read in. There are three groups. An unweighted canonical variate analysis is performed and the results printed.

NAG Library Function Document

nag_mv_canon_var (g03acc)
