
Clustering & Dimensionality Reduction

Learn unsupervised techniques for discovering patterns and visualizing high-dimensional data



Unsupervised learning finds patterns in data without labels. The two main tasks are:

  • Clustering: Group similar data points together
  • Dimensionality Reduction: Compress data to fewer dimensions while preserving structure

These techniques are essential for exploratory data analysis, visualization, preprocessing, and discovering hidden patterns.

    K-Means Clustering

    K-Means is the most widely used clustering algorithm. It partitions data into K clusters by:

    1. Initialize K random centroids
    2. Assign each point to the nearest centroid
    3. Recalculate centroids as the mean of assigned points
    4. Repeat steps 2-3 until convergence

    K-Means is fast and works well on spherical, evenly-sized clusters. But you need to choose K in advance.

    python
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    import numpy as np

    # Generate synthetic clustered data
    X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

    # Fit K-Means
    kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
    kmeans.fit(X)

    print(f"Cluster centers:\n{kmeans.cluster_centers_}")
    print(f"Labels (first 10): {kmeans.labels_[:10]}")
    print(f"Inertia (sum of squared distances): {kmeans.inertia_:.2f}")
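The four steps above can also be sketched from scratch in plain NumPy. This is a minimal illustration for intuition, not scikit-learn's actual implementation, and `kmeans_sketch` is a made-up name:

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=42):
    """Bare-bones K-Means following steps 1-4."""
    rng = np.random.default_rng(seed)
    # 1. Initialize K random centroids by picking K distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assign each point to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recalculate centroids as the mean of assigned points
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster goes empty
                new_centroids[j] = members.mean(axis=0)
        # 4. Repeat until the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Tiny demo: two obvious groups
X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
centroids, labels = kmeans_sketch(X_demo, k=2)
print(labels)
```

Real implementations add smarter initialization (k-means++) and multiple restarts, which is what n_init controls in scikit-learn.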

    The Elbow Method: Choosing K

    The elbow method runs K-Means for a range of K values and plots the inertia (sum of squared distances from each point to its centroid). The "elbow" where the curve bends is a good choice for K -- beyond that point, adding more clusters gives diminishing returns.

    python
    from sklearn.cluster import KMeans
    import numpy as np

    # Try different values of K (X is the blobs data from above)
    inertias = []
    K_range = range(1, 11)

    for k in K_range:
        km = KMeans(n_clusters=k, random_state=42, n_init=10)
        km.fit(X)
        inertias.append(km.inertia_)

    # Print inertia values to find the elbow
    print("K  |  Inertia")
    print("-" * 25)
    for k, inertia in zip(K_range, inertias):
        bar = "=" * int(inertia / max(inertias) * 40)
        print(f"{k:2d} | {inertia:8.1f}  {bar}")
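Reading the elbow off a printout is subjective. One simple heuristic (a sketch, not a scikit-learn feature) is to take the second difference of the inertia curve, which is largest where the curve bends most sharply:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

inertias = [
    KMeans(n_clusters=k, random_state=42, n_init=10).fit(X).inertia_
    for k in range(1, 11)
]

# Second difference: inertia[k+1] - 2*inertia[k] + inertia[k-1].
# It is large and positive where the curve bends sharply (the elbow).
curvature = np.diff(inertias, 2)           # entries correspond to K = 2..9
elbow_k = int(np.argmax(curvature)) + 2    # shift index back to a K value
print(f"Suggested K: {elbow_k}")
```

More robust variants exist (e.g. the kneedle algorithm); on real data, cross-check any suggested K with the silhouette score.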

    Silhouette Score: Validating Clusters

    The silhouette score measures how similar a point is to its own cluster compared to other clusters. It ranges from -1 to 1:

  • +1: Point is well-matched to its cluster and far from others (perfect)
  • 0: Point is on the boundary between clusters
  • -1: Point is likely in the wrong cluster

    python
    from sklearn.metrics import silhouette_score

    # Compute silhouette score for different K
    print("K  |  Silhouette Score")
    print("-" * 30)
    for k in range(2, 8):
        km = KMeans(n_clusters=k, random_state=42, n_init=10)
        labels = km.fit_predict(X)
        score = silhouette_score(X, labels)
        bar = "=" * int(score * 40)
        print(f"{k:2d} | {score:.4f}  {bar}")
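Concretely, the score for a single point i is s(i) = (b - a) / max(a, b), where a is the mean distance to the other points in i's own cluster and b is the smallest mean distance to any other cluster. A sketch that checks this definition against scikit-learn's silhouette_samples (the helper name is made up here):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)
labels = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(X)

def silhouette_one_point(X, labels, i):
    """s(i) = (b - a) / max(a, b), computed straight from the definition."""
    dists = np.linalg.norm(X - X[i], axis=1)
    own = labels == labels[i]
    # self-distance is 0, so dividing the sum by (size - 1) averages over the others
    a = dists[own].sum() / (own.sum() - 1)
    # b: smallest mean distance to any other cluster
    b = min(dists[labels == c].mean() for c in set(labels) if c != labels[i])
    return (b - a) / max(a, b)

manual = silhouette_one_point(X, labels, 0)
reference = silhouette_samples(X, labels)[0]
print(f"manual: {manual:.6f}  sklearn: {reference:.6f}")
```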

    K-Means Assumptions and Limitations

    K-Means assumes clusters are spherical and roughly equal in size, so it struggles with elongated, irregular, or overlapping clusters. It is also sensitive to initialization (use multiple restarts via n_init together with the default init="k-means++") and to outliers. Always visualize your clusters to check whether K-Means is appropriate.
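The initialization sensitivity is easy to probe: compare a single purely random initialization against k-means++ with restarts (a small experiment on the blobs data; exact numbers depend on the seeds):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

# One purely random initialization vs. k-means++ with 10 restarts
random_once = KMeans(n_clusters=4, init="random", n_init=1, random_state=0).fit(X)
plus_plus = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(X)

print(f"random init, 1 run:     inertia = {random_once.inertia_:.1f}")
print(f"k-means++, 10 restarts: inertia = {plus_plus.inertia_:.1f}")
```

On easy data both often land on the same solution; on harder data the single random run can get stuck in a worse local minimum, which shows up as a higher inertia.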

    DBSCAN: Density-Based Clustering

    DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds clusters based on density rather than distance to centroids. It has two key parameters:

  • eps: The maximum distance between two points for them to be considered neighbors
  • min_samples: Minimum number of points required to form a dense region (core point)

    DBSCAN's advantages:

  • Does NOT require you to specify K
  • Can find clusters of arbitrary shape
  • Automatically identifies outliers as noise (label = -1)

    python
    from sklearn.cluster import DBSCAN, KMeans
    from sklearn.datasets import make_moons
    from sklearn.metrics import silhouette_score
    from sklearn.preprocessing import StandardScaler

    # Generate non-spherical clusters that K-Means struggles with
    X_moons, y_moons = make_moons(n_samples=300, noise=0.1, random_state=42)
    X_moons_scaled = StandardScaler().fit_transform(X_moons)

    # DBSCAN clustering
    dbscan = DBSCAN(eps=0.3, min_samples=5)
    labels = dbscan.fit_predict(X_moons_scaled)

    n_clusters = len(set(labels) - {-1})
    n_noise = list(labels).count(-1)

    print(f"Number of clusters found: {n_clusters}")
    print(f"Noise points: {n_noise}")
    print(f"Labels (unique): {sorted(set(labels))}")

    # Compare with K-Means on the same data
    km = KMeans(n_clusters=2, random_state=42, n_init=10)
    km_labels = km.fit_predict(X_moons_scaled)
    print(f"\nK-Means silhouette: {silhouette_score(X_moons_scaled, km_labels):.4f}")
    print(f"DBSCAN silhouette:  {silhouette_score(X_moons_scaled, labels):.4f}")

    When to Use DBSCAN over K-Means

    Use DBSCAN when: (1) you don't know how many clusters exist, (2) clusters have irregular shapes, (3) you need to identify outliers/noise, (4) clusters have varying densities. Use K-Means when clusters are roughly spherical and you have a good guess for K. Always scale your data before DBSCAN since it is distance-based.
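The scaling advice matters because eps is a single distance threshold applied across all features, so a feature with a much larger range dominates the neighborhoods. A hypothetical illustration, pretending one moons feature was recorded in 100x larger units:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

X_moons, _ = make_moons(n_samples=300, noise=0.1, random_state=42)
X_stretched = X_moons * np.array([100.0, 1.0])  # hypothetical unit mismatch

labels_raw = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_stretched)
labels_scaled = DBSCAN(eps=0.3, min_samples=5).fit_predict(
    StandardScaler().fit_transform(X_stretched)
)

for name, labels in [("unscaled", labels_raw), ("scaled", labels_scaled)]:
    n_clusters = len(set(labels) - {-1})
    n_noise = list(labels).count(-1)
    print(f"{name:8s}: {n_clusters} clusters, {n_noise} noise points")
```

Standardizing undoes the stretch, so the scaled run behaves like the earlier moons example, while the unscaled run marks far more points as noise because no single eps suits both axes at once.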

    PCA: Principal Component Analysis

    PCA is the most popular dimensionality reduction technique. It finds new axes (principal components) that capture the maximum variance in the data.

    Key ideas:

  • The first principal component captures the most variance
  • Each subsequent component captures the most remaining variance, orthogonal to all previous ones
  • You can reduce 100 features to 2-3 components for visualization while retaining most of the information
  • The explained variance ratio tells you how much information each component captures

    python
    from sklearn.decomposition import PCA
    from sklearn.datasets import load_digits
    from sklearn.preprocessing import StandardScaler

    # Load the digits dataset (64 features = 8x8 pixel images)
    X, y = load_digits(return_X_y=True)
    X_scaled = StandardScaler().fit_transform(X)

    # Reduce from 64 dimensions to 2 for visualization
    pca = PCA(n_components=2)
    X_2d = pca.fit_transform(X_scaled)

    print(f"Original shape: {X_scaled.shape}")
    print(f"Reduced shape:  {X_2d.shape}")
    print(f"Variance explained by 2 components: {sum(pca.explained_variance_ratio_):.4f}")

    # How many components to retain 95% variance?
    pca_full = PCA(n_components=0.95)  # Keep 95% of variance
    X_95 = pca_full.fit_transform(X_scaled)
    print(f"Components for 95% variance: {pca_full.n_components_}")

    # Show cumulative explained variance
    pca_all = PCA().fit(X_scaled)
    cumvar = pca_all.explained_variance_ratio_.cumsum()
    for n_comp in [2, 5, 10, 20, 30]:
        print(f"  {n_comp} components: {cumvar[n_comp-1]:.4f} variance explained")

    What PCA Maximizes

    PCA finds the directions (components) along which the data varies the most. The first component points in the direction of maximum variance, the second in the direction of maximum remaining variance (orthogonal to the first), and so on. This is equivalent to finding the eigenvectors of the data's covariance matrix.
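That equivalence can be verified numerically: the components are the top eigenvectors of the covariance matrix, and the explained variances are the matching eigenvalues. A sketch (eigenvector signs are arbitrary, so the comparison uses absolute values):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Eigendecomposition of the covariance matrix, sorted by descending eigenvalue
cov = np.cov(X_scaled, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# svd_solver="full" avoids the small error of the randomized solver
pca = PCA(n_components=2, svd_solver="full").fit(X_scaled)

print("eigenvalues:        ", eigvals[:2])
print("explained variance: ", pca.explained_variance_)
vectors_match = all(
    np.allclose(np.abs(pca.components_[i]), np.abs(eigvecs[:, i]), atol=1e-8)
    for i in range(2)
)
print("eigenvectors match components:", vectors_match)
```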

    t-SNE and UMAP: Nonlinear Visualization

    PCA is linear -- it can only project the data onto straight axes, so it misses curved structure. For complex datasets, nonlinear methods often produce better visualizations.

    t-SNE (t-distributed Stochastic Neighbor Embedding)

  • Excellent for visualization (typically 2D)
  • Preserves local neighborhood structure
  • Non-deterministic (different runs give different results)
  • Slow on large datasets
  • Not suitable for general dimensionality reduction (only visualization)
  • The perplexity parameter controls the balance between local and global structure

    UMAP (Uniform Manifold Approximation and Projection)

  • Faster than t-SNE, scales better
  • Preserves both local AND global structure
  • More deterministic with a random_state
  • Can be used for general dimensionality reduction (not just 2D)

    python
    from sklearn.manifold import TSNE
    from sklearn.datasets import load_digits
    from sklearn.preprocessing import StandardScaler

    # Load digits dataset
    X, y = load_digits(return_X_y=True)
    X_scaled = StandardScaler().fit_transform(X)

    # t-SNE reduction to 2D
    tsne = TSNE(
        n_components=2,
        perplexity=30,
        random_state=42,
        max_iter=1000,  # called n_iter before scikit-learn 1.5
    )
    X_tsne = tsne.fit_transform(X_scaled)

    print(f"t-SNE output shape: {X_tsne.shape}")
    print(f"t-SNE KL divergence: {tsne.kl_divergence_:.4f}")

    # Show cluster separation for each digit
    for digit in range(10):
        mask = y == digit
        center = X_tsne[mask].mean(axis=0)
        print(f"  Digit {digit}: center at ({center[0]:.1f}, {center[1]:.1f})")

    t-SNE Pitfalls

    Do not over-interpret t-SNE plots. Distances between clusters are not meaningful -- only neighborhood relationships within clusters. t-SNE is non-deterministic (set random_state for reproducibility). Cluster sizes in t-SNE plots don't reflect actual cluster sizes. Always run with multiple perplexity values to check stability.
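Following that advice can be as simple as re-embedding at a few perplexities. An illustrative sketch on a 300-sample subset to keep the runtime low (exact coordinates vary by scikit-learn version):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_small = StandardScaler().fit_transform(X)[:300]  # small subset for speed

# Re-embed at several perplexities; cluster structure that is stable
# across runs is more trustworthy than any single plot
for perplexity in (5, 30, 50):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42)
    emb = tsne.fit_transform(X_small)
    print(f"perplexity={perplexity:2d}: shape={emb.shape}, "
          f"KL divergence={tsne.kl_divergence_:.3f}")
```

Note that KL divergences at different perplexities are not directly comparable (each perplexity defines a different target distribution); the check is whether the same groups of points stay together across runs.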