Spring 2018
Both aim to simplify the data via a small number of summaries
K-means
hierarchical clustering
K-means clustering is an approach for partitioning a dataset into K distinct, non-overlapping clusters.
\(C_1, \ldots, C_K\): sets containing the indices of the observations in each cluster.
A good cluster is one for which the within-cluster variation is as small as possible
Within-cluster variation (squared Euclidean distance)
\[W(C_k) = \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p}(x_{ij} - x_{i'j})^2\]
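As a sanity check, the within-cluster variation defined above can be computed directly. A minimal NumPy sketch (the function name is ours, not from the lecture):

```python
import numpy as np

def within_cluster_variation(X):
    """W(C_k): sum of squared Euclidean distances over all ordered
    pairs of observations in the cluster, divided by cluster size.
    X holds the observations of one cluster, one per row."""
    n = X.shape[0]
    # (n, n, p) array of coordinate differences for every ordered pair (i, i')
    diffs = X[:, None, :] - X[None, :, :]
    return np.sum(diffs ** 2) / n

# tiny example: three observations in two dimensions
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(within_cluster_variation(X))
```

Note that the double sum runs over ordered pairs, so each distinct pair is counted twice; this is the convention the formula above uses.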
\[\underset{C_1, \ldots, C_K}{\text{minimize}}\left\{\sum_{k=1}^{K}W(C_k)\right\}\]
\[\underset{C_1, \ldots, C_K}{\text{minimize}}\left\{\sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p}(x_{ij} - x_{i'j})^2 \right\}\]
Difficult problem: \(K^n\) ways to partition \(n\) observations into \(K\) clusters.
Fortunately, there is a simple algorithm that can provide a local optimum
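The algorithm in question is the usual alternating scheme (often called Lloyd's algorithm): assign each observation to its nearest centroid, recompute centroids, and repeat until the assignments stop changing. A minimal NumPy sketch, assuming squared Euclidean distance:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Lloyd's algorithm: a local-search heuristic for the K-means
    objective; it converges to a local (not global) optimum."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # initialize centroids as K randomly chosen distinct observations
    centroids = X[rng.choice(X.shape[0], size=K, replace=False)].copy()
    labels = np.full(X.shape[0], -1)
    for _ in range(n_iter):
        # assignment step: each observation goes to its nearest centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stable: a local optimum has been reached
        labels = new_labels
        # update step: recompute each centroid as its cluster mean
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:  # leave an empty cluster's centroid unchanged
                centroids[k] = members.mean(axis=0)
    return labels, centroids
```

Because the result depends on the random initialization, it is common to run the algorithm from several starts and keep the solution with the smallest objective value.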
Show that the algorithm on the previous slide is guaranteed to decrease the value of the objective
\[\underset{C_1, \ldots, C_K}{\text{minimize}}\left\{\sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p}(x_{ij} - x_{i'j})^2 \right\}\]
at each step.
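Hint: one standard route uses the identity relating pairwise squared distances within a cluster to squared deviations from the cluster mean \(\bar{x}_{kj} = \frac{1}{|C_k|}\sum_{i \in C_k} x_{ij}\):

\[\frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p}(x_{ij} - x_{i'j})^2 = 2 \sum_{i \in C_k} \sum_{j=1}^{p}(x_{ij} - \bar{x}_{kj})^2\]

With the objective written in this form, the centroid-update step cannot increase it (the mean minimizes the sum of squared deviations for fixed assignments), and the assignment step cannot increase it either (each observation moves to the centroid closest to it).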
Perform K-means clustering on the New York Times stories data used at the end of lecture 1.
Data can be downloaded from
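A sketch of what this exercise might look like with scikit-learn. The short list of strings below is a hypothetical stand-in for the downloaded NYT stories; replace it with the actual data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# placeholder documents standing in for the NYT stories
stories = [
    "stocks fell as markets reacted to the interest rate news",
    "the central bank raised interest rates to calm markets",
    "the team won the championship game last night",
    "the quarterback threw three touchdowns in the game",
]

# represent each story as a TF-IDF vector
X = TfidfVectorizer(stop_words="english").fit_transform(stories)

# cluster the stories; n_init repeats the algorithm from several
# random starts and keeps the best local optimum
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```

Inspecting the terms with the largest centroid weights in each cluster is a useful way to interpret the resulting groups.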
Does not require us to commit to a particular choice of K in advance
Produces an attractive tree-based representation called a dendrogram
Not always suited for an arbitrary dataset
This explains why hierarchical clustering can sometimes yield worse results than K-means for a given number of clusters
Euclidean distance is the most common dissimilarity measure to use.
But there are other options
Need to extend the concept of dissimilarity between pairs of observations to dissimilarity between pairs of groups of observations
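The standard ways to do this are the linkage functions: complete (largest pairwise dissimilarity), single (smallest), and average (mean). A small sketch, assuming Euclidean distance between observations:

```python
import numpy as np

def group_dissimilarity(A, B, linkage="complete"):
    """Extend pointwise dissimilarity to two groups of observations.
    A and B hold one observation per row; Euclidean distance assumed."""
    # all pairwise Euclidean distances between the two groups
    d = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2))
    if linkage == "complete":   # largest pairwise distance
        return d.max()
    if linkage == "single":     # smallest pairwise distance
        return d.min()
    if linkage == "average":    # mean of all pairwise distances
        return d.mean()
    raise ValueError(f"unknown linkage: {linkage}")

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0]])
print(group_dissimilarity(A, B, "single"))    # 3.0
print(group_dissimilarity(A, B, "complete"))  # 4.0
```

The choice of linkage changes the shape of the resulting dendrogram: single linkage tends to produce long "chained" clusters, while complete and average linkage tend to give more balanced ones.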
Perform hierarchical clustering on the New York Times stories data used at the end of lecture 1.
Data can be downloaded from
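A sketch of the exercise using SciPy's hierarchical clustering routines. The small matrix `X` below is a hypothetical stand-in for the feature vectors of the NYT stories; load the actual data in its place:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# placeholder observations standing in for the NYT story vectors
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])

# build the full tree with complete linkage and Euclidean distance
Z = linkage(X, method="complete", metric="euclidean")

# cut the tree to obtain, e.g., two clusters (labels start at 1)
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

Calling `scipy.cluster.hierarchy.dendrogram(Z)` plots the tree itself, which is often more informative than any single cut.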