
Clustering & Dimensionality Reduction

Learn unsupervised techniques for discovering patterns and visualizing high-dimensional data



Unsupervised learning finds patterns in data without labels. The two main tasks are:

  • Clustering: Group similar data points together
  • Dimensionality Reduction: Compress data to fewer dimensions while preserving structure

These techniques are essential for exploratory data analysis, visualization, preprocessing, and discovering hidden patterns.

    K-Means Clustering

    K-Means is the most widely used clustering algorithm. It partitions data into K clusters by:

    1. Initialize K random centroids
    2. Assign each point to the nearest centroid
    3. Recalculate centroids as the mean of assigned points
    4. Repeat steps 2-3 until convergence

    K-Means is fast and works well on spherical, evenly-sized clusters. But you need to choose K in advance.

    python
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    import numpy as np

    # Generate synthetic clustered data
    X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

    # Fit K-Means
    kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
    kmeans.fit(X)

    print(f"Cluster centers:\n{kmeans.cluster_centers_}")
    print(f"Labels (first 10): {kmeans.labels_[:10]}")
    print(f"Inertia (sum of squared distances): {kmeans.inertia_:.2f}")
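The four steps above can also be sketched from scratch in plain NumPy. This is a minimal illustration for intuition, not scikit-learn's actual implementation, and `kmeans_sketch` is a made-up name:

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=42):
    """Bare-bones K-Means following steps 1-4."""
    rng = np.random.default_rng(seed)
    # 1. Initialize K random centroids by picking K distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assign each point to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recalculate centroids as the mean of assigned points
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster goes empty
                new_centroids[j] = members.mean(axis=0)
        # 4. Repeat until the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Tiny demo: two obvious groups
X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
centroids, labels = kmeans_sketch(X_demo, k=2)
print(labels)
```

Real implementations add smarter initialization (k-means++) and multiple restarts, which is what n_init controls in scikit-learn.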

    The Elbow Method: Choosing K

    The elbow method runs K-Means for a range of K values and plots the inertia (sum of squared distances from each point to its centroid). The "elbow" where the curve bends is a good choice for K -- beyond that point, adding more clusters gives diminishing returns.

    python
    from sklearn.cluster import KMeans
    import numpy as np

    # Try different values of K (X is the blobs data from above)
    inertias = []
    K_range = range(1, 11)

    for k in K_range:
        km = KMeans(n_clusters=k, random_state=42, n_init=10)
        km.fit(X)
        inertias.append(km.inertia_)

    # Print inertia values to find the elbow
    print("K  |  Inertia")
    print("-" * 25)
    for k, inertia in zip(K_range, inertias):
        bar = "=" * int(inertia / max(inertias) * 40)
        print(f"{k:2d} | {inertia:8.1f}  {bar}")
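Reading the elbow off a printout is subjective. One simple heuristic (a sketch, not a scikit-learn feature) is to take the second difference of the inertia curve, which is largest where the curve bends most sharply:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

inertias = [
    KMeans(n_clusters=k, random_state=42, n_init=10).fit(X).inertia_
    for k in range(1, 11)
]

# Second difference: inertia[k+1] - 2*inertia[k] + inertia[k-1].
# It is large and positive where the curve bends sharply (the elbow).
curvature = np.diff(inertias, 2)           # entries correspond to K = 2..9
elbow_k = int(np.argmax(curvature)) + 2    # shift index back to a K value
print(f"Suggested K: {elbow_k}")
```

More robust variants exist (e.g. the kneedle algorithm); on real data, cross-check any suggested K with the silhouette score.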

    Silhouette Score: Validating Clusters

    The silhouette score measures how similar a point is to its own cluster compared to other clusters. It ranges from -1 to 1:

  • +1: Point is well-matched to its cluster and far from others (perfect)
  • 0: Point is on the boundary between clusters
  • -1: Point is likely in the wrong cluster

    python
    from sklearn.metrics import silhouette_score

    # Compute silhouette score for different K
    print("K  |  Silhouette Score")
    print("-" * 30)
    for k in range(2, 8):
        km = KMeans(n_clusters=k, random_state=42, n_init=10)
        labels = km.fit_predict(X)
        score = silhouette_score(X, labels)
        bar = "=" * int(score * 40)
        print(f"{k:2d} | {score:.4f}  {bar}")
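Concretely, the score for a single point i is s(i) = (b - a) / max(a, b), where a is the mean distance to the other points in i's own cluster and b is the smallest mean distance to any other cluster. A sketch that checks this definition against scikit-learn's silhouette_samples (the helper name is made up here):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)
labels = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(X)

def silhouette_one_point(X, labels, i):
    """s(i) = (b - a) / max(a, b), computed straight from the definition."""
    dists = np.linalg.norm(X - X[i], axis=1)
    own = labels == labels[i]
    # self-distance is 0, so dividing the sum by (size - 1) averages over the others
    a = dists[own].sum() / (own.sum() - 1)
    # b: smallest mean distance to any other cluster
    b = min(dists[labels == c].mean() for c in set(labels) if c != labels[i])
    return (b - a) / max(a, b)

manual = silhouette_one_point(X, labels, 0)
reference = silhouette_samples(X, labels)[0]
print(f"manual: {manual:.6f}  sklearn: {reference:.6f}")
```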

    K-Means Assumptions and Limitations

    K-Means assumes clusters are spherical and roughly equal in size, so it struggles with elongated, irregular, or overlapping clusters. It is also sensitive to initialization (use multiple restarts via n_init together with the default init="k-means++") and to outliers. Always visualize your clusters to check whether K-Means is appropriate.
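The initialization sensitivity is easy to probe: compare a single purely random initialization against k-means++ with restarts (a small experiment on the blobs data; exact numbers depend on the seeds):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

# One purely random initialization vs. k-means++ with 10 restarts
random_once = KMeans(n_clusters=4, init="random", n_init=1, random_state=0).fit(X)
plus_plus = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(X)

print(f"random init, 1 run:     inertia = {random_once.inertia_:.1f}")
print(f"k-means++, 10 restarts: inertia = {plus_plus.inertia_:.1f}")
```

On easy data both often land on the same solution; on harder data the single random run can get stuck in a worse local minimum, which shows up as a higher inertia.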

    DBSCAN: Density-Based Clustering

    DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds clusters based on density rather than distance to centroids. It has two key parameters:

  • eps: The maximum distance between two points for them to be considered neighbors
  • min_samples: Minimum number of points required to form a dense region (core point)

    DBSCAN's advantages:

  • Does NOT require you to specify K
  • Can find clusters of arbitrary shape
  • Automatically identifies outliers as noise (label = -1)

    python
    from sklearn.cluster import DBSCAN, KMeans
    from sklearn.datasets import make_moons
    from sklearn.metrics import silhouette_score
    from sklearn.preprocessing import StandardScaler

    # Generate non-spherical clusters that K-Means struggles with
    X_moons, y_moons = make_moons(n_samples=300, noise=0.1, random_state=42)
    X_moons_scaled = StandardScaler().fit_transform(X_moons)

    # DBSCAN clustering
    dbscan = DBSCAN(eps=0.3, min_samples=5)
    labels = dbscan.fit_predict(X_moons_scaled)

    n_clusters = len(set(labels) - {-1})
    n_noise = list(labels).count(-1)

    print(f"Number of clusters found: {n_clusters}")
    print(f"Noise points: {n_noise}")
    print(f"Labels (unique): {sorted(set(labels))}")

    # Compare with K-Means on the same data
    km = KMeans(n_clusters=2, random_state=42, n_init=10)
    km_labels = km.fit_predict(X_moons_scaled)
    print(f"\nK-Means silhouette: {silhouette_score(X_moons_scaled, km_labels):.4f}")
    print(f"DBSCAN silhouette:  {silhouette_score(X_moons_scaled, labels):.4f}")

    When to Use DBSCAN over K-Means

    Use DBSCAN when: (1) you don't know how many clusters exist, (2) clusters have irregular shapes, (3) you need to identify outliers/noise, (4) clusters have varying densities. Use K-Means when clusters are roughly spherical and you have a good guess for K. Always scale your data before DBSCAN since it is distance-based.
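The scaling advice matters because eps is a single distance threshold applied across all features, so a feature with a much larger range dominates the neighborhoods. A hypothetical illustration, pretending one moons feature was recorded in 100x larger units:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

X_moons, _ = make_moons(n_samples=300, noise=0.1, random_state=42)
X_stretched = X_moons * np.array([100.0, 1.0])  # hypothetical unit mismatch

labels_raw = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_stretched)
labels_scaled = DBSCAN(eps=0.3, min_samples=5).fit_predict(
    StandardScaler().fit_transform(X_stretched)
)

for name, labels in [("unscaled", labels_raw), ("scaled", labels_scaled)]:
    n_clusters = len(set(labels) - {-1})
    n_noise = list(labels).count(-1)
    print(f"{name:8s}: {n_clusters} clusters, {n_noise} noise points")
```

Standardizing undoes the stretch, so the scaled run behaves like the earlier moons example, while the unscaled run marks far more points as noise because no single eps suits both axes at once.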

    PCA: Principal Component Analysis

    PCA is the most popular dimensionality reduction technique. It finds new axes (principal components) that capture the maximum variance in the data.

    Key ideas:

  • The first principal component captures the most variance
  • Each subsequent component captures the most remaining variance, orthogonal to all previous ones
  • You can reduce 100 features to 2-3 components for visualization while retaining most of the information
  • The explained variance ratio tells you how much information each component captures

    python
    from sklearn.decomposition import PCA
    from sklearn.datasets import load_digits
    from sklearn.preprocessing import StandardScaler

    # Load the digits dataset (64 features = 8x8 pixel images)
    X, y = load_digits(return_X_y=True)
    X_scaled = StandardScaler().fit_transform(X)

    # Reduce from 64 dimensions to 2 for visualization
    pca = PCA(n_components=2)
    X_2d = pca.fit_transform(X_scaled)

    print(f"Original shape: {X_scaled.shape}")
    print(f"Reduced shape:  {X_2d.shape}")
    print(f"Variance explained by 2 components: {sum(pca.explained_variance_ratio_):.4f}")

    # How many components to retain 95% variance?
    pca_full = PCA(n_components=0.95)  # Keep 95% of variance
    X_95 = pca_full.fit_transform(X_scaled)
    print(f"Components for 95% variance: {pca_full.n_components_}")

    # Show cumulative explained variance
    pca_all = PCA().fit(X_scaled)
    cumvar = pca_all.explained_variance_ratio_.cumsum()
    for n_comp in [2, 5, 10, 20, 30]:
        print(f"  {n_comp} components: {cumvar[n_comp-1]:.4f} variance explained")

    What PCA Maximizes

    PCA finds the directions (components) along which the data varies the most. The first component points in the direction of maximum variance, the second in the direction of maximum remaining variance (orthogonal to the first), and so on. This is equivalent to finding the eigenvectors of the data's covariance matrix.
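That equivalence can be verified numerically: the components are the top eigenvectors of the covariance matrix, and the explained variances are the matching eigenvalues. A sketch (eigenvector signs are arbitrary, so the comparison uses absolute values):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Eigendecomposition of the covariance matrix, sorted by descending eigenvalue
cov = np.cov(X_scaled, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# svd_solver="full" avoids the small error of the randomized solver
pca = PCA(n_components=2, svd_solver="full").fit(X_scaled)

print("eigenvalues:        ", eigvals[:2])
print("explained variance: ", pca.explained_variance_)
vectors_match = all(
    np.allclose(np.abs(pca.components_[i]), np.abs(eigvecs[:, i]), atol=1e-8)
    for i in range(2)
)
print("eigenvectors match components:", vectors_match)
```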

    t-SNE and UMAP: Nonlinear Visualization

    PCA is linear -- it can only project the data onto straight axes, so it misses curved structure. For complex datasets, nonlinear methods often produce better visualizations.

    t-SNE (t-distributed Stochastic Neighbor Embedding)

  • Excellent for visualization (typically 2D)
  • Preserves local neighborhood structure
  • Non-deterministic (different runs give different results)
  • Slow on large datasets
  • Not suitable for general dimensionality reduction (only visualization)
  • The perplexity parameter controls the balance between local and global structure

    UMAP (Uniform Manifold Approximation and Projection)

  • Faster than t-SNE, scales better
  • Preserves both local AND global structure
  • More deterministic with a random_state
  • Can be used for general dimensionality reduction (not just 2D)

    python
    from sklearn.manifold import TSNE
    from sklearn.datasets import load_digits
    from sklearn.preprocessing import StandardScaler

    # Load digits dataset
    X, y = load_digits(return_X_y=True)
    X_scaled = StandardScaler().fit_transform(X)

    # t-SNE reduction to 2D
    tsne = TSNE(
        n_components=2,
        perplexity=30,
        random_state=42,
        max_iter=1000,  # called n_iter before scikit-learn 1.5
    )
    X_tsne = tsne.fit_transform(X_scaled)

    print(f"t-SNE output shape: {X_tsne.shape}")
    print(f"t-SNE KL divergence: {tsne.kl_divergence_:.4f}")

    # Show cluster separation for each digit
    for digit in range(10):
        mask = y == digit
        center = X_tsne[mask].mean(axis=0)
        print(f"  Digit {digit}: center at ({center[0]:.1f}, {center[1]:.1f})")

    t-SNE Pitfalls

    Do not over-interpret t-SNE plots. Distances between clusters are not meaningful -- only neighborhood relationships within clusters. t-SNE is non-deterministic (set random_state for reproducibility). Cluster sizes in t-SNE plots don't reflect actual cluster sizes. Always run with multiple perplexity values to check stability.
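Following that advice can be as simple as re-embedding at a few perplexities. An illustrative sketch on a 300-sample subset to keep the runtime low (exact coordinates vary by scikit-learn version):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_small = StandardScaler().fit_transform(X)[:300]  # small subset for speed

# Re-embed at several perplexities; cluster structure that is stable
# across runs is more trustworthy than any single plot
for perplexity in (5, 30, 50):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42)
    emb = tsne.fit_transform(X_small)
    print(f"perplexity={perplexity:2d}: shape={emb.shape}, "
          f"KL divergence={tsne.kl_divergence_:.3f}")
```

Note that KL divergences at different perplexities are not directly comparable (each perplexity defines a different target distribution); the check is whether the same groups of points stay together across runs.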