17.7.3.1 The Hierarchical Cluster Analysis Dialog Box


Dialog Theme

Load or save Dialog Theme. Additionally, generate script for the X-Function using the current dialog box settings.

Recalculate

Set the Recalculate mode.

Input

Variables Select data for the Hierarchical Cluster Analysis. Data in each column corresponds to a variable and each row to an observation.
Observation Labels Select labels for observations. If labels are chosen, they are shown as X axis tick labels in the dendrogram. Enabled only when the objects to cluster are observations. If the label column is a Text column, it is set as categorical.

Settings

Specify the settings for the Hierarchical Cluster Analysis.

Cluster Specify the type of objects to cluster.
  • Observations
Cluster observations. Rows in the input data are classified into groups.
  • Variables
Cluster variables. Columns in the input data are classified into groups.

Note that the available distance types depend on the type of objects to cluster.

Cluster Method Select the linkage method to calculate the distance between a cluster and a new cluster. Six methods are available.
  • Nearest neighbor
The minimum of the two distances between a cluster and each of the two clusters merged into the new cluster. Also called single linkage.
  • Furthest neighbor
The maximum of the two distances between a cluster and each of the two clusters merged into the new cluster. Also called complete linkage.
  • Group average
The mean of the two distances between a cluster and each of the two clusters merged into the new cluster.
  • Centroid
Clusters are produced that maximize the distance between the centers of clusters.
  • Median
The median distance between an item in one cluster and an item in the other cluster.
  • Ward
Clusters are produced that minimize the within-cluster variance.


To learn more about linkage methods, see the algorithm of linkage methods.
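As an illustration of the first three linkage rules, the sketch below updates the distance from a cluster k to the cluster formed by merging p and q. The function name `updated_distance` and its parameters are hypothetical, not part of Origin. The size-weighted form of the group average is an assumption; with equal cluster sizes it reduces to the plain mean of the two distances.

```python
def updated_distance(d_kp, d_kq, method, n_p=1, n_q=1):
    """Distance from cluster k to the cluster formed by merging p and q.

    d_kp, d_kq: distances from k to p and from k to q before the merge.
    n_p, n_q:   sizes of p and q (used only by the group average rule).
    """
    if method == "nearest":    # single linkage: keep the smaller distance
        return min(d_kp, d_kq)
    if method == "furthest":   # complete linkage: keep the larger distance
        return max(d_kp, d_kq)
    if method == "average":    # size-weighted mean (assumed weighting)
        return (n_p * d_kp + n_q * d_kq) / (n_p + n_q)
    raise ValueError(f"unsupported method: {method}")
```

For example, if k is 2.0 away from p and 5.0 away from q, nearest neighbor gives 2.0 and furthest neighbor gives 5.0 as the distance to the merged cluster.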

Distance Type Select a distance type in the Hierarchical Cluster Analysis.

For observations to cluster, six methods are available:

  • Euclidean
The square root of the sum of the squared differences between two observations.
  • Squared Euclidean
The sum of the squared differences between two observations.
  • City block
The sum of the absolute differences between two observations. Also known as Manhattan distance.
  • Cosine
The difference between 1 and the cosine coefficient of two observations. The cosine coefficient is the cosine of the angle between the two observations treated as vectors.
  • Pearson correlation
The difference between 1 and the correlation of two observations.
  • Jaccard
The difference between 1 and the Jaccard coefficient of two observations. For binary data, the Jaccard coefficient equals the ratio of the size of the intersection to the size of the union of two observations.
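The six definitions above can be written out directly. This is a minimal sketch in plain Python, not Origin's implementation; the function names are hypothetical, each observation is assumed to be a list of numbers with no missing values, and `jaccard_distance` assumes binary data where any nonzero value counts as present.

```python
import math

def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def euclidean(x, y):
    return math.sqrt(squared_euclidean(x, y))

def city_block(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_distance(x, y):
    # 1 minus the cosine of the angle between the two vectors
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

def pearson_distance(x, y):
    # 1 minus the Pearson correlation of the two observations
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sy = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

def jaccard_distance(x, y):
    # 1 minus |intersection| / |union| for binary observations
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return 1.0 - both / either
```

The cosine, Pearson, and Jaccard sketches divide by a quantity that is zero for degenerate input (an all-zero or constant observation), so real code would need to guard those cases.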


For variables to cluster, two methods are available. Missing values are excluded in a pairwise manner to calculate the correlation.

  • Correlation
The difference between 1 and the correlation of two variables.
  • Absolute correlation
The difference between 1 and the absolute correlation of two variables.
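The pairwise handling of missing values can be illustrated as follows. This is a sketch under stated assumptions: the function name `correlation_distance` is hypothetical, missing values are represented as `None`, and the correlation is the Pearson correlation computed only from rows where both variables have a value.

```python
import math

def correlation_distance(u, v, absolute=False):
    """1 - r (or 1 - |r|) for two variables given as columns u and v."""
    # Pairwise exclusion: keep only rows where both values are present.
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    n = len(pairs)
    mean_u = sum(a for a, _ in pairs) / n
    mean_v = sum(b for _, b in pairs) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in pairs)
    su = math.sqrt(sum((a - mean_u) ** 2 for a, _ in pairs))
    sv = math.sqrt(sum((b - mean_v) ** 2 for _, b in pairs))
    r = cov / (su * sv)
    return 1.0 - (abs(r) if absolute else r)
```

Two perfectly negatively correlated variables are as far apart as possible under Correlation (distance 2) but coincide under Absolute correlation (distance 0), which is the practical difference between the two options.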


To learn how to calculate distance, see the distance algorithm.

Standardize Variables Specify the method to standardize variables. Available only when objects to cluster are observations.
  • None
Variables are not standardized.
  • Z scores (standardize to N(0, 1))
Variables are standardized with zero mean and unit standard deviation.
  • Normalize to (0,1)
Variables are scaled to the range from 0 to 1.
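The two standardization options can be sketched per column as below. The function names are hypothetical, and the use of the sample standard deviation (n - 1 denominator) for Z scores is an assumption; this page does not say which denominator Origin uses.

```python
import statistics

def z_scores(column):
    # Shift to zero mean and scale to unit standard deviation.
    mean = statistics.mean(column)
    sd = statistics.stdev(column)  # sample SD (n - 1); an assumption
    return [(v - mean) / sd for v in column]

def normalize_01(column):
    # Rescale so the minimum maps to 0 and the maximum maps to 1.
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]
```

Standardizing matters mostly when variables are on different scales, since otherwise the variable with the largest spread dominates distances such as Euclidean.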
Number of Clusters Specify the number of clusters. The value should be greater than 0 and no more than the number of effective observations (when clustering observations) or variables (when clustering variables).
Find Clustroid by Specify the method to find the clustroid, that is, the most or least representative variable or observation in a cluster.
  • Sum of distances
Find Clustroid using the sum of distances measured from all other observations/variables in the cluster. In a cluster, the most representative variable/observation would have the minimum Sum of distances; the least representative variable/observation would have the maximum Sum of distances.
  • Maximum distance
Find Clustroid using the Maximum distance among all distances measured from other observations/variables in the cluster. In a cluster, the most representative variable/observation would have the smallest Maximum distance; the least representative variable/observation would have the biggest Maximum distance.
  • Sum of squares of distances
Find Clustroid using the sum of the squares of distances measured from all other observations/variables in the cluster. In a cluster, the most representative variable/observation would have the minimum Sum of squares of distances; the least representative variable/observation would have the maximum Sum of squares of distances.
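The three criteria differ only in how the distances from one member to the others are combined into a score. A minimal sketch, assuming a full symmetric distance matrix given as a list of lists; the function name `find_clustroid` and the criterion keys are hypothetical.

```python
def find_clustroid(dist, members, criterion="sum"):
    """Return (most_representative, least_representative) for one cluster.

    dist:    full symmetric distance matrix as a list of lists.
    members: indices of the objects that belong to the cluster.
    """
    def score(i):
        ds = [dist[i][j] for j in members if j != i]
        if criterion == "sum":
            return sum(ds)
        if criterion == "max":
            return max(ds, default=0.0)
        if criterion == "sum_squares":
            return sum(d * d for d in ds)
        raise ValueError(f"unsupported criterion: {criterion}")

    scores = {i: score(i) for i in members}
    # Smallest score is the most representative, largest the least.
    return min(scores, key=scores.get), max(scores, key=scores.get)
```

For three points at positions 0, 1, and 3 on a line, the middle point has the smallest score under all three criteria, so it is the most representative member.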

Quantities

Specify the quantities to calculate for the Hierarchical Cluster Analysis. Note that descriptive statistics and cluster membership are included in the result of Hierarchical Cluster Analysis by default.

Dissimilarity Matrix Specify whether to output the distance matrix. For a large number of objects, the distance matrix will be shown in a sheet instead of the report.
Cluster Stages Specify whether to output the cluster stages. In each stage two clusters are merged to a new cluster.
Cluster Center Specify whether to calculate cluster centers. It is available only when objects to cluster are observations. When a standardization method is chosen in Standardize Variables of the Settings branch, cluster centers are calculated from standardized variables.
Distance between Cluster Centers Specify whether to calculate the distances between cluster centers. It is available only when objects to cluster are observations.
Distance between Observations and Clusters Specify whether to calculate the distance between each observation and cluster centers. It is available only when objects to cluster are observations.
Clustroid Info Specify whether to list the most/least representative variable or observation.
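To make the center-related quantities concrete, the sketch below takes a cluster center to be the per-variable mean of the observations in the cluster and measures the distance from an observation to a center with the Euclidean distance. Both choices are assumptions made for illustration, and the function names are hypothetical.

```python
import math

def cluster_centers(rows, labels):
    """Map each cluster label to the mean of its observations, per variable."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return {
        label: [sum(col) / len(col) for col in zip(*members)]
        for label, members in groups.items()
    }

def distance_to_center(row, center):
    # Euclidean distance between one observation and a cluster center
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(row, center)))
```

If a standardization method is selected, `rows` would hold the standardized values, consistent with the note under Cluster Center.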

Plot

Specify whether and how to show the dendrogram.

Dendrogram Specify whether to show the dendrogram. Note that the default dendrogram can be exchanged for a more dynamic "Phylogenetic Tree" in which nodes and subtrees can be highlighted and swapped.
Show Y Axis with
  • Distance
Distance as computed by Distance Type.
  • Similarity
Similarity is computed as 100*(1-d/dmax), where d is the distance and dmax is the maximum distance for all observations, i.e. the last distance calculation in the Cluster Stages table. If you have opted to plot a separate graph for each cluster (Plot tab > Show Dendrogram button), then dmax is the maximum for all graphs.
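A small worked example of the similarity formula, using a hypothetical list of merge distances in the order they would appear in the Cluster Stages table:

```python
def similarity(d, d_max):
    # 100 * (1 - d / dmax): 100 at distance 0, 0 at the maximum distance
    return 100.0 * (1.0 - d / d_max)

# Hypothetical merge distances, one per cluster stage.
stage_distances = [1.0, 2.0, 8.0]
d_max = stage_distances[-1]  # last distance in the stages table
similarities = [similarity(d, d_max) for d in stage_distances]
```

The final merge always lands at similarity 0, and merges at small distances land near 100, so the Y axis runs in the opposite direction to the Distance option.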


Show Dendrogram Specify whether to show the dendrogram in a single graph or in separate graphs for clusters. Enabled only when Dendrogram is checked.
  • in a single graph
Show the dendrogram in a single graph. Different clusters are shown in different colors.
  • in separate graphs for clusters
Show the dendrogram in separate graphs for clusters. Each cluster is output to a separate graph.
Orientation Specify the orientation of the dendrogram. Enabled only when Dendrogram is checked.
  • Vertical
Plot the dendrogram vertically.
  • Horizontal
Plot the dendrogram horizontally.
  • Circular
Plot a circular dendrogram.

Output Settings

Specify the destination of output results for the Hierarchical Cluster Analysis.

Cluster Report Specify the sheet for the Hierarchical Cluster Analysis report. The default value is a new sheet in the workbook of input data.
Cluster Membership Specify the sheet for cluster membership and distance between observations and clusters. The default value is a new sheet in the workbook of input data.