17.7.1 Principal Component Analysis

Principal Component Analysis (PCA) is used to explain the variance-covariance structure of a set of variables through linear combinations. It is often used as a dimensionality-reduction technique.

[Figure: PCA biplot]

Goals

There are two primary reasons for using PCA:

  • Data Reduction
    PCA is most commonly used to condense the information contained in a large number of original variables into a smaller set of new composite dimensions, with a minimum loss of information.
  • Interpretation
    PCA can be used to discover important features of a large data set. It often reveals relationships that were previously unsuspected, thereby allowing interpretations that would not ordinarily result.

PCA is typically used as an intermediate step in data analysis when the number of input variables is otherwise too large for useful analysis.

Processing Procedure

Preparing Analysis Data

PCA works best on variables that are strongly correlated. If the relationships between variables are weak, PCA does little to reduce the data. Check the correlation matrix to decide: as a rule of thumb, if most of the correlation coefficients are smaller than 0.3 in absolute value, PCA will not help.
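
The 0.3 rule of thumb can be checked before running the analysis. The following is a minimal sketch in plain Python, not part of the PCA tool itself; the function names (pearson, weak_correlation_share) and the sample data are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def weak_correlation_share(columns, threshold=0.3):
    """Fraction of variable pairs whose |r| is below `threshold`."""
    pairs = list(combinations(columns, 2))
    weak = sum(1 for x, y in pairs if abs(pearson(x, y)) < threshold)
    return weak / len(pairs)

# Hypothetical data: three variables observed on six cases.
data = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [2.1, 3.9, 6.2, 8.1, 9.8, 12.2],  # nearly proportional to the first
    [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],   # only weakly related to the others
]
share = weak_correlation_share(data)
print(f"{share:.0%} of pairwise correlations are below 0.3")
```

If the reported share is well above one half, the variables are probably too weakly related for PCA to condense them usefully.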

Selecting Principal Methods

The Number of Principal Components

Deciding how many components to retain is always a judgment call. Refer to the scree plot and the Eigenvalues of the Correlation Matrix for guidance.
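
One common way to read the eigenvalues is to convert them into proportions of total variance and keep the leading components until a chosen cumulative share is reached. The sketch below assumes an 80% target, which is only an illustrative convention and not a setting taken from the tool; the eigenvalues and function names are hypothetical.

```python
def variance_explained(eigenvalues):
    """Return (proportion, cumulative proportion) per component, largest first."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    rows, running = [], 0.0
    for ev in ordered:
        running += ev
        rows.append((ev / total, running / total))
    return rows

def components_to_retain(eigenvalues, target=0.8):
    """Smallest number of leading components whose cumulative share reaches `target`."""
    table = variance_explained(eigenvalues)
    for k, (_, cumulative) in enumerate(table, start=1):
        if cumulative >= target - 1e-12:  # small tolerance for rounding error
            return k
    return len(table)

# Hypothetical eigenvalues of a 4-variable correlation matrix (they sum to 4).
eigenvalues = [2.4, 1.0, 0.4, 0.2]
k = components_to_retain(eigenvalues)  # assumed 80% target
for i, (prop, cum) in enumerate(variance_explained(eigenvalues), start=1):
    print(f"PC{i}: {prop:.1%} of variance, {cum:.1%} cumulative")
print("components retained:", k)
```

The scree plot shows the same eigenvalues graphically; the point where the curve levels off is often used as an alternative cutoff.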

Start From Correlation Matrix or Covariance Matrix

The correlation matrix is simply the covariance matrix of the standardized variables, that is, with all variances rescaled to one. When the variables are measured on similar scales, the covariance matrix is generally preferred, because standardizing discards the information carried by the differences in variance. The correlation matrix is recommended when the variables are measured on different scales.
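
The effect of scale can be illustrated with a small sketch. It assumes two hypothetical variables recorded in very different units and compares how much of the total variance each one contributes under the two matrices; the data and function names are made up for illustration.

```python
from math import sqrt

def covariance_matrix(columns):
    """Sample covariance matrix (n - 1 denominator) of a list of variables."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    return [
        [sum((a - mi) * (b - mj) for a, b in zip(ci, cj)) / (n - 1)
         for cj, mj in zip(columns, means)]
        for ci, mi in zip(columns, means)
    ]

def correlation_matrix(columns):
    """Covariance matrix rescaled so that every variance equals one."""
    cov = covariance_matrix(columns)
    sd = [sqrt(cov[i][i]) for i in range(len(cov))]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(len(cov))]
            for i in range(len(cov))]

# Hypothetical data: height in metres and mass in grams for five cases.
height_m = [1.60, 1.70, 1.75, 1.80, 1.90]
mass_g = [55000.0, 68000.0, 66000.0, 80000.0, 90000.0]

cov = covariance_matrix([height_m, mass_g])
corr = correlation_matrix([height_m, mass_g])

# Share of the total variance (matrix trace) contributed by each variable.
cov_share = [cov[i][i] / (cov[0][0] + cov[1][1]) for i in range(2)]
corr_share = [corr[i][i] / (corr[0][0] + corr[1][1]) for i in range(2)]
print("covariance matrix shares: ", cov_share)
print("correlation matrix shares:", corr_share)
```

With the covariance matrix, the variable with the larger numeric scale accounts for almost all of the variance and would dominate the first component; with the correlation matrix, both variables contribute equally.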

Exclude Missing Values Listwise or Pairwise

Whether to exclude missing data listwise or pairwise depends on the nature of the missing values. If only a few values are missing and they are confined to a single variable, it often makes sense to delete each affected row entirely. This is listwise exclusion. If values are missing in two or more variables, it is typically best to use pairwise exclusion, which omits a case only from the calculations that involve its missing variable.
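
The difference between the two options can be seen by counting how many cases each one leaves available. This is a minimal sketch with hypothetical data, using None to mark a missing value; it illustrates the general idea and is not a description of how the tool handles missing values internally.

```python
from itertools import combinations

def listwise_rows(rows):
    """Keep only the rows that have no missing value (None) in any variable."""
    return [r for r in rows if all(v is not None for v in r)]

def pairwise_counts(rows):
    """For each pair of variables, count the rows where both values are present."""
    n_vars = len(rows[0])
    return {
        (i, j): sum(1 for r in rows if r[i] is not None and r[j] is not None)
        for i, j in combinations(range(n_vars), 2)
    }

# Hypothetical data: five cases, three variables, None marks a missing value.
rows = [
    (1.0, 2.0, 3.0),
    (2.0, None, 1.0),
    (3.0, 4.0, None),
    (4.0, 5.0, 6.0),
    (None, 6.0, 2.0),
]
complete = listwise_rows(rows)
counts = pairwise_counts(rows)
print("cases kept listwise:", len(complete))
print("cases available per variable pair:", counts)
```

Here the missing values are spread over all three variables, so listwise exclusion discards more than half of the cases, while pairwise exclusion still has three cases for every pair of variables.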

Performing Principal Component Analysis

  • Select Statistics: Multivariate Analysis: Principal Component Analysis
    Or
  • Type pca -d in the Script window


Topics covered in this section: