Algorithms (Principal Component Analysis)
Principal Component Analysis examines the relationships among variables. It can be used, for example, to reduce the number of variables in regression and clustering.
Each principal component is a linear combination of the variables, chosen to maximize variance. Let X be the n by p data matrix (n observations, p variables) and let S be its covariance matrix. Consider a linear combination of the variables
$z_1 = \sum_{i=1}^{p} a_{1i} x_i$
where $x_i$ is the ith variable and $a_{1i}$, $i = 1, 2, \ldots, p$, are the linear combination coefficients for $z_1$. The coefficients can be written as a column vector $a_1$, normalized so that $a_1^T a_1 = 1$. The variance of $z_1$ is then $a_1^T S a_1$.
The vector $a_1$ is found by maximizing this variance, and $z_1$ is called the first principal component. The second principal component is found in the same way, by maximizing:
- $a_2^T S a_2$ subject to the constraints $a_2^T a_2 = 1$ and $a_2^T a_1 = 0$
This gives a second principal component that is orthogonal to the first. The remaining principal components are derived in a similar way. In fact, the coefficients can be calculated from the eigenvectors of the matrix S. Origin uses different methods depending on how missing values are excluded.
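As a rough illustration of the maximization described above, the following NumPy sketch checks numerically that the eigenvector of S with the largest eigenvalue attains the maximum variance among unit-length coefficient vectors. The data, the random seed, and the variable names (`a1`, `a2`, `z1`, `var_z1`) are illustrative assumptions and not part of Origin.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: n = 200 observations of p = 3 correlated variables.
mix = np.array([[2.0, 0.5, 0.0],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 0.5]])
X = rng.normal(size=(200, 3)) @ mix
S = np.cov(X, rowvar=False)            # p x p sample covariance matrix

# eigh returns eigenvalues of a symmetric matrix in ascending order,
# so re-sort them (and the eigenvectors) in descending order.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

a1, a2 = eigvecs[:, 0], eigvecs[:, 1]  # coefficient vectors a_1 and a_2
z1 = X @ a1                            # first principal component z_1
var_z1 = a1 @ S @ a1                   # variance of z_1, i.e. a_1^T S a_1

# Compare against many random unit vectors: none should exceed var_z1.
trial = rng.normal(size=(1000, 3))
trial /= np.linalg.norm(trial, axis=1, keepdims=True)
trial_vars = np.einsum("ij,jk,ik->i", trial, S, trial)
```

Here `var_z1` coincides with the largest eigenvalue of S, and `a2` satisfies both constraints: unit length and orthogonality to `a1`.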
Listwise Exclusion of Missing Values
An observation containing one or more missing values is excluded from the analysis. The matrix passed to SVD is then derived from X according to the matrix type chosen for the analysis.
Matrix Type for Analysis
- Covariance matrix: let $X_s$ be the matrix X with each column's mean subtracted and each column scaled by $1/\sqrt{n-1}$.
- Correlation matrix: let $X_s$ be the matrix X with each column's mean subtracted and each column scaled by $1/(\sqrt{n-1}\,\sigma_i)$, where $\sigma_i$ is the standard deviation of the ith variable.
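The two scalings can be sketched as follows. The function name `analysis_matrix` and its `matrix_type` argument are hypothetical, and the scale factors $1/\sqrt{n-1}$ and $1/(\sqrt{n-1}\,\sigma_i)$ are assumed here because they make $X_s^T X_s$ reproduce the covariance or correlation matrix respectively.

```python
import numpy as np

def analysis_matrix(X, matrix_type="covariance"):
    """Center the columns of X and scale them for SVD (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    centered = X - X.mean(axis=0)          # subtract each column's mean
    scale = np.sqrt(n - 1)                 # covariance scaling: 1/sqrt(n-1)
    if matrix_type == "correlation":
        # Additionally divide by each column's sample standard deviation.
        scale = scale * X.std(axis=0, ddof=1)
    elif matrix_type != "covariance":
        raise ValueError("matrix_type must be 'covariance' or 'correlation'")
    return centered / scale

# Hypothetical complete data (rows with missing values already removed).
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4)) * np.array([1.0, 5.0, 0.2, 3.0])

Xs_cov = analysis_matrix(X, "covariance")
Xs_corr = analysis_matrix(X, "correlation")
```

With these scalings, `Xs_cov.T @ Xs_cov` matches `np.cov(X, rowvar=False)` and `Xs_corr.T @ Xs_corr` matches `np.corrcoef(X, rowvar=False)`.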
Quantities to Compute
Perform SVD on $X_s$:
$X_s = V \Lambda P^T$
where V is an n by p matrix with $V^T V = I$, P is a p by p matrix, and $\Lambda$ is a diagonal matrix with diagonal elements $s_i$, $i = 1, 2, \ldots, p$.
- $\lambda_i = s_i^2$ is the ith eigenvalue. Eigenvalues are sorted in descending order. The proportion of variance explained by the ith principal component is $\lambda_i / \sum_{j=1}^{p} \lambda_j$.
- Eigenvectors are also known as loadings or coefficients for principal components. Each column of P is the eigenvector corresponding to one eigenvalue, and hence to one principal component.
- Note that the sign of an eigenvector obtained from SVD is not unique. Origin normalizes the sign by forcing the sum of each column to be positive.
- Each column of $\sqrt{n-1}\, V \Lambda$ contains the scores for the corresponding principal component. Scores are missing for any observation that contains missing values.
- Note that, for this method, the variance of the scores for each principal component equals the corresponding eigenvalue.
- Standardized scores are the scores for each principal component rescaled to have unit variance.
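The quantities listed above can be sketched with NumPy's SVD. This is a minimal illustration, not Origin's implementation: the function name `pca_svd` is hypothetical, the scores are taken as $\sqrt{n-1}\,V\Lambda$ on the assumption that this is what makes their variance equal the eigenvalue, and the input is assumed to be an already centered and scaled matrix with n at least p.

```python
import numpy as np

def pca_svd(X_s):
    """PCA quantities from the SVD of a centered, scaled matrix X_s (sketch)."""
    n, p = X_s.shape
    # Thin SVD: X_s = V diag(s) P^T, with s returned in descending order.
    V, s, Pt = np.linalg.svd(X_s, full_matrices=False)
    P = Pt.T

    # Sign convention: make the sum of each eigenvector column positive.
    # Flipping a column of P requires flipping the same column of V.
    signs = np.where(P.sum(axis=0) < 0, -1.0, 1.0)
    P = P * signs
    V = V * signs

    eigvals = s ** 2                       # lambda_i = s_i^2
    explained = eigvals / eigvals.sum()    # proportion of variance
    scores = np.sqrt(n - 1) * V * s        # variance equals eigvals
    std_scores = np.sqrt(n - 1) * V        # unit variance
    return eigvals, explained, P, scores, std_scores

# Hypothetical complete data, prepared for a covariance-matrix analysis.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3)) @ np.array([[3.0, 1.0, 0.0],
                                          [0.0, 1.0, 0.5],
                                          [0.0, 0.0, 0.3]])
Xc = X - X.mean(axis=0)
X_s = Xc / np.sqrt(X.shape[0] - 1)

eigvals, explained, P, scores, std_scores = pca_svd(X_s)
```

For the covariance case the scores computed this way agree with projecting the centered data onto the eigenvectors, `Xc @ P`.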
Pairwise Exclusion of Missing Values
An observation is excluded only from the calculation of the covariance or correlation between a pair of variables, and only if it has a missing value in either of those two variables.
Eigenvalues and eigenvectors are calculated from the covariance or correlation matrix S:
$S = P D P^T$
where P is a p by p matrix and D is a diagonal matrix with diagonal elements $\lambda_i$, $i = 1, 2, \ldots, p$.
- $\lambda_i$ is the ith eigenvalue, corresponding to the ith principal component. Eigenvalues are sorted in descending order.
- Note that eigenvalues can be negative when missing values are excluded pairwise, which is not meaningful for principal components. Origin sets the loadings and scores to zero for a negative eigenvalue.
- Each column of P is the eigenvector corresponding to one eigenvalue, and hence to one principal component.
- Note that the sign of an eigenvector is not unique. Origin normalizes the sign by forcing the sum of each column to be positive.
- The scores are given by $X_0 P$, where $X_0$ is the matrix X with each column's mean subtracted.
- Scores are missing for any observation that contains missing values.
- Note that, for this method, the variance of the scores for each principal component may not equal the corresponding eigenvalue.
- Standardized scores are the scores for each principal component scaled by the square root of its eigenvalue.
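A minimal sketch of the pairwise approach follows, covering only the covariance case. The helper names `pairwise_cov` and `pca_pairwise` are hypothetical, and several details are assumptions rather than documented Origin behavior: each pairwise covariance uses means computed over that pair's complete observations, and the centering for the scores uses each column's mean over its non-missing values.

```python
import numpy as np

def pairwise_cov(X):
    """Covariance matrix using, for each pair, only rows complete in both."""
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    S = np.empty((p, p))
    for i in range(p):
        for j in range(i, p):
            ok = ~np.isnan(X[:, i]) & ~np.isnan(X[:, j])
            xi, xj = X[ok, i], X[ok, j]
            cov = ((xi - xi.mean()) * (xj - xj.mean())).sum() / (ok.sum() - 1)
            S[i, j] = S[j, i] = cov
    return S

def pca_pairwise(X):
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    S = pairwise_cov(X)
    eigvals, P = np.linalg.eigh(S)             # ascending order
    order = np.argsort(eigvals)[::-1]          # re-sort descending
    eigvals, P = eigvals[order], P[:, order]
    P = P * np.where(P.sum(axis=0) < 0, -1.0, 1.0)  # positive column sums
    P[:, eigvals < 0] = 0.0                    # zero loadings if eigenvalue < 0

    X0 = X - np.nanmean(X, axis=0)             # assumed centering
    missing = np.isnan(X).any(axis=1)
    scores = np.full((n, p), np.nan)           # missing rows stay NaN
    scores[~missing] = X0[~missing] @ P
    return S, eigvals, P, scores

# Hypothetical data with a few missing entries.
rng = np.random.default_rng(3)
X_full = rng.normal(size=(60, 3))
X = X_full.copy()
X[[2, 10], 0] = np.nan
X[[5, 10, 40], 2] = np.nan

S, eigvals, P, scores = pca_pairwise(X)
missing = np.isnan(X).any(axis=1)
```

Because each entry of S may be based on a different subset of rows, S is not guaranteed to be positive semidefinite, which is how negative eigenvalues can arise.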
Bartlett's test tests the equality of the remaining p-k eigenvalues. It is available only when the analysis matrix is the covariance matrix.
The test statistic approximately follows a $\chi^2$ distribution with $(p-k-1)(p-k+2)/2$ degrees of freedom.
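The text here does not show the test statistic itself. A commonly quoted form of Bartlett's statistic for the equality of the last p-k eigenvalues is $\left(n - 1 - \frac{2p+5}{6}\right)\left\{-\sum_{i=k+1}^{p} \ln \lambda_i + (p-k)\ln\left(\frac{1}{p-k}\sum_{i=k+1}^{p} \lambda_i\right)\right\}$. The sketch below assumes that form; whether Origin uses exactly this expression is not confirmed by the text, and the function name `bartlett_remaining` is hypothetical.

```python
import numpy as np

def bartlett_remaining(eigvals, n, k):
    """Bartlett statistic and degrees of freedom for the last p-k eigenvalues.

    Assumes the commonly quoted form of the statistic (see lead-in).
    """
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]  # descending
    p = lam.size
    q = p - k                        # number of remaining eigenvalues
    if q < 2:
        raise ValueError("need at least two remaining eigenvalues")
    rest = lam[k:]
    if np.any(rest <= 0):
        raise ValueError("remaining eigenvalues must be positive")
    # q*ln(mean) - sum(ln) is zero when all remaining eigenvalues are equal
    # and positive otherwise (arithmetic mean >= geometric mean).
    stat = (n - 1 - (2 * p + 5) / 6) * (q * np.log(rest.mean()) - np.log(rest).sum())
    df = (q - 1) * (q + 2) // 2
    return stat, df

stat_equal, df_equal = bartlett_remaining([4.0, 2.0, 2.0, 2.0], n=50, k=1)
stat_all, df_all = bartlett_remaining([4.0, 3.0, 2.0, 1.0], n=50, k=0)
```

A p-value could then be obtained from the upper tail of the chi-square distribution, for example with `scipy.stats.chi2.sf(stat, df)`.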