The key difference is whether there is a target. Supervised learning has a y label; unsupervised learning does not, so given, say, 10 unlabeled items we can only separate them by similarity.
- Goals of unsupervised learning
- discover subgroups
- is there a better way to view the data?
- is there a visualization that reveals underlying structure in the data?
Principal components analysis and clustering
Challenge of Unsupervised learning
- there is no simple goal like prediction, so the analysis is more subjective and success is harder to assess. Example applications:
- Breast cancer patients grouped by gene expression measurements
- shoppers characterized by their browsing and purchase histories
- movies grouped by ratings assigned by movie viewers
Another advantage
- A lot easier to obtain unlabeled data
- Lots of images on the web
- unlabeled data can often be collected automatically by machine, whereas labels usually require human annotation
- e.g., labeling movie reviews as positive or negative requires human judgment
- it's sometimes difficult for a machine to detect sarcasm
Principal Components Analysis
- produces a low-dimensional representation of a dataset that captures as much of the variation as possible
- the 1st component is the direction of highest variance
- the 2nd component is the direction of highest variance among those uncorrelated with the 1st
- one of the most widely used tools
- Z_1 is a linear combination of the features (see the formulation after this list)
- we normalize the loadings so that their sum of squares is 1
- in the figure, the green solid line is the direction of highest variance in the features (the 1st component)
- the blue dashed line has the highest variance among directions uncorrelated with the 1st component (the 2nd component)
- If you only have 2 variables, you’ll only have two components
- PCA requires the variables to be centered at mean 0
- we are looking for the linear combination of features with the highest variance
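In symbols (the standard formulation, assuming each feature has been centered at mean 0, so the mean of squares below is the sample variance), the first principal component is

```latex
Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \cdots + \phi_{p1} X_p,
\qquad
\max_{\phi_{11},\dots,\phi_{p1}} \frac{1}{n}\sum_{i=1}^{n}\Big(\sum_{j=1}^{p}\phi_{j1}x_{ij}\Big)^{2}
\quad \text{subject to} \quad \sum_{j=1}^{p}\phi_{j1}^{2}=1 .
```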
Geometry of PCA
- the loading vector defines the direction of the 1st component
- we replace each observation with its score: how far along the 1st-component direction its projection lies (see the sketch below)
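A minimal numpy sketch of this projection step, on made-up correlated data (the toy mixing matrix is just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[2.0, 0.6], [0.6, 1.0]])  # toy correlated data

Xc = X - X.mean(axis=0)                 # center: PCA requires mean-0 variables
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
phi1 = Vt[0]                            # first loading vector (unit length)

z1 = Xc @ phi1                          # scores: position along the 1st-component direction
approx = np.outer(z1, phi1)             # each point replaced by its projection onto that line
```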
Further principal components
- because each new component must be uncorrelated with the previous ones, we still look for where the remaining variance is highest
- formally: maximize variance subject to being uncorrelated with all previous components (see the formulation below)
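In the same notation, the second component maximizes variance subject to being uncorrelated with Z_1, which amounts to the loading vectors being orthogonal:

```latex
Z_2 = \sum_{j=1}^{p}\phi_{j2} X_j ,
\qquad
\max \ \mathrm{Var}(Z_2)
\quad \text{subject to} \quad \sum_{j=1}^{p}\phi_{j2}^{2}=1
\ \text{ and } \ \mathrm{Cor}(Z_1, Z_2)=0 \ \big(\Leftrightarrow \phi_2 \perp \phi_1\big).
```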
US Arrests
- What are the loading vectors?
- the 1st PC has positive loadings on the 3 types of crime, so it roughly measures overall crime rate (see the sketch below)
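A hedged Python sketch of inspecting those loadings; it assumes statsmodels can fetch R's USArrests data over the network via get_rdataset, and the overall sign of a loading vector is arbitrary:

```python
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# columns: Murder, Assault, UrbanPop, Rape
usarrests = sm.datasets.get_rdataset("USArrests").data
X = StandardScaler().fit_transform(usarrests)   # the variables are on very different scales

pca = PCA().fit(X)
# 1st loading vector: same-sign weights on the three crime variables,
# so PC1 roughly measures overall crime rate (the sign itself is arbitrary)
print(pca.components_[0])
```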
Another Interpretation of Principal Components
- approximating the data with a lower-dimensional representation
- the first 2 principal components span the hyperplane closest to the data
- we want the projected data to be as spread out as possible
- in linear regression, we minimize the vertical distances from the points to the fitted line; in PCA, we minimize the perpendicular distances to the hyperplane (see the comparison below)
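Side by side, the two criteria (regression's vertical errors vs PCA's perpendicular reconstruction error over the first M components):

```latex
\text{regression:}\ \min_{\beta_0,\beta_1}\sum_{i=1}^{n}\big(y_i-\beta_0-\beta_1 x_i\big)^2
\qquad
\text{PCA:}\ \min_{\phi,\,z}\ \sum_{i=1}^{n}\sum_{j=1}^{p}\Big(x_{ij}-\sum_{m=1}^{M} z_{im}\phi_{jm}\Big)^{2}
```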
Scaling of the Variables Matters
- we need to standardize the variables so everything is measured on a comparable scale (otherwise high-variance variables dominate the components)
- proportion of variance explained (PVE): if the 1st 2 principal components explain, say, 96% of the variance, we can just use those 2 (see the sketch after this list)
- cross-validation doesn't directly help choose the number of components, because there is no y variable
- however, if the principal components are used as features in a supervised model, we then have a supervising response, and cross-validation can help decide how many components to use
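A minimal scikit-learn sketch of the PVE idea, on synthetic data with some built-in redundancy (the data-generating choices are only for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)   # redundancy: a few PCs will dominate

pca = PCA().fit(StandardScaler().fit_transform(X))
pve = pca.explained_variance_ratio_              # proportion of variance explained per PC
print(pve.round(3))
print(pve.cumsum().round(3))                     # cumulative PVE guides how many PCs to keep
```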
PCA vs Clustering
- clustering looks for homogeneous subgroups of observations
- finding such subgroups is the task of segmentation via clustering
- K-means: the number of subgroups K is specified in advance
- hierarchical clustering: we don't need to know the number of clusters in advance
K-means clustering
- partitions the observations into a pre-specified number of clusters (K)
- the clusters are non-overlapping subsets: each observation belongs to exactly one cluster
- we want the partition with the smallest total within-cluster variation (e.g., squared Euclidean distances within each cluster)
- Algorithm: randomly assign each observation to one of the K clusters, then compute each cluster's centroid
- assign each observation to the cluster whose centroid is closest
- move each centroid to the mean of the observations in its cluster
- repeat the reassignment and centroid updates until the assignments stop changing (see the sketch after this list)
- finds a local minimum (a valley), but not necessarily the global minimum; the objective function is not convex
- the resulting clustering can differ with different random initializations, so run the algorithm several times and keep the best solution
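A from-scratch sketch of the algorithm above (simplified: it assumes no cluster ever becomes empty). Since each run only reaches a local minimum, a common remedy is to rerun with several seeds and keep the solution with the smallest within-cluster sum of squares.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain K-means: random initial assignment, then alternate
    centroid updates and reassignments until the labels stop changing."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=len(X))          # step 1: randomly assign each point to a cluster
    for _ in range(n_iter):
        # step 2a: move each centroid to the mean of its cluster's points
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # step 2b: reassign each point to the nearest centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):     # stop once assignments no longer change
            break
        labels = new_labels
    return labels, centroids
```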
Hierarchical Clustering