W8 Unsupervised Learning
Notes
K-means
- Process (see the sketch after this list):
  - Randomly initialize K cluster centroids.
  - Assign each data point to its nearest cluster centroid.
  - Move each cluster centroid to the mean of all samples assigned to it.
  - Repeat the last two steps until the decrease in the distortion cost falls below a threshold.
  - If no sample is assigned to the k-th centroid, just eliminate that centroid (or re-initialize it randomly).
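A minimal NumPy sketch of the steps above; the function name `kmeans`, its parameters, and the empty-cluster handling are illustrative assumptions, not taken from the course code.

```python
import numpy as np

def kmeans(X, K, max_iters=100, tol=1e-6, seed=0):
    """K-means on an (m, n) data matrix X; returns centroids, labels, and cost."""
    rng = np.random.default_rng(seed)
    # Randomly initialize centroids as K distinct training examples.
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    prev_cost = np.inf
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Distortion cost J: mean squared distance to the assigned centroid.
        cost = (dists[np.arange(len(X)), labels] ** 2).mean()
        # Update step: move each centroid to the mean of its assigned points;
        # a centroid with no assigned points is re-initialized to a random example.
        for k in range(K):
            members = X[labels == k]
            centroids[k] = members.mean(axis=0) if len(members) else X[rng.integers(len(X))]
        # Stop once the cost decreases by less than the threshold.
        if prev_cost - cost < tol:
            break
        prev_cost = cost
    return centroids, labels, cost
```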
- Optimization Objectives
  - Distortion cost function: J(c^(1), …, c^(m), mu_1, …, mu_K) = (1/m) * sum_i ||x^(i) - mu_{c^(i)}||^2, the mean squared distance from each point to its assigned centroid.
  - Use multiple random initializations and keep the clustering with the lowest J to avoid local optima. This helps most for small K (roughly 2-10); for large K it makes little difference.
  - Elbow method: plot the cost function J as a function of the number of clusters K and look for the "elbow" (see the sketch below).
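A usage sketch of the two points above, reusing the `kmeans()` function from the previous block; the synthetic data and the ranges of seeds and K values are illustrative assumptions.

```python
import numpy as np

# Illustrative synthetic data; in practice X is your (m, n) feature matrix.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.5, size=(100, 2)) for loc in (0.0, 3.0, 6.0)])

# Multiple random initializations: keep the run with the lowest distortion J.
runs = [kmeans(X, K=3, seed=s) for s in range(50)]
centroids, labels, best_cost = min(runs, key=lambda r: r[2])

# Elbow method: record the best J for each K and look for the "elbow" in J vs. K.
costs_per_K = {K: min(kmeans(X, K, seed=s)[2] for s in range(10)) for K in range(1, 11)}
```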
Dimensionality Reduction
- Aim: 1. Data compression 2. Visualization
- Principal Component Analysis (PCA)
- Apply feature scaling / mean normalization before performing PCA.
- PCA procedure (see the sketch below):
  - Compute the covariance matrix Sigma = (1/m) * X' * X.
  - Compute eigenvectors: [U, S, V] = svd(Sigma) (or eig(Sigma)).
  - Ureduce = [u_1, u_2, …, u_k], the first k columns of U; these represent the k-dimensional features reduced from the n-dimensional originals.
  - Project each example: z = transpose(Ureduce) * x.
- Choose the smallest k such that 99% of the variance is retained.
- The diagonal matrix S from [U, S, V] = svd(Sigma) is used to check the retained variance: the sum of the first k diagonal entries (S11 + S22 + … + Skk) divided by the sum of all n diagonal entries must be >= 0.99.
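A minimal NumPy sketch of the PCA procedure and the variance-retained check above; `pca_reduce` and `variance_to_retain` are illustrative names, not from the course code.

```python
import numpy as np

def pca_reduce(X, variance_to_retain=0.99):
    """Reduce (m, n) data X to k dimensions, keeping the requested variance."""
    # Mean normalization (and, if features have very different scales,
    # also divide by the standard deviation).
    X_norm = X - X.mean(axis=0)
    m = len(X_norm)
    # Covariance matrix Sigma = (1/m) * X' X, then its SVD.
    Sigma = (X_norm.T @ X_norm) / m
    U, S, Vt = np.linalg.svd(Sigma)
    # Smallest k such that (S11 + ... + Skk) / (S11 + ... + Snn) >= threshold.
    retained = np.cumsum(S) / S.sum()
    k = int(np.searchsorted(retained, variance_to_retain) + 1)
    # Project onto the first k principal components: z = Ureduce' * x.
    Ureduce = U[:, :k]
    Z = X_norm @ Ureduce
    return Z, Ureduce, k
```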
Do not use PCA to prevent overfitting (it discards information without using any information from y); use regularization instead.
- Vocabulary: aptitude (资质)