Unsupervised Learning

The difference is the presence or absence of targets: supervised learning has a y label, while unsupervised learning has none, so given, say, 10 items we separate them by their similarities.

  • unlike the goal of supervised learning (predicting a y), unsupervised learning aims to:
  • discover subgroups in the data
  • find a better way to view the data
  • find a visualization that shows underlying information

Two main techniques: principal components analysis and clustering.

Challenge of Unsupervised learning

  • there is no simple goal such as prediction, and no response variable against which to check our answers
  • examples where it is useful: breast cancer patients grouped by gene expression measurements
  • shoppers characterized by their browsing and purchase histories
  • movies grouped by ratings assigned by movie viewers

Another advantage

  • unlabeled data are a lot easier to obtain than labeled data
  • e.g., there are lots of unlabeled images on the web
  • the raw data can be collected by machine, but labels usually require human judgment
  • e.g., labeling movie reviews by movie quality
  • it’s sometimes difficult for a machine to recognize sarcasm

Principal Components Analysis

  • a low-dimensional representation of a dataset
  • the 1st component is the direction of highest variance
  • the 2nd component is the direction of highest variance among those uncorrelated with the 1st
  • one of the most widely used tools
  • Z_1 is a linear combination of the features: Z_1 = φ_11 X_1 + φ_21 X_2 + … + φ_p1 X_p
  • we normalize so that the sum of squared loadings is 1: φ_11² + … + φ_p1² = 1 (see the sketch after this list)
  • in the figure, the green solid line is the direction of highest variance of the features
  • the blue dashed line has the highest variance among directions uncorrelated with the first component
  • if you only have 2 variables, you’ll only have two components
  • PCA needs the variables centered at mean 0
  • we are looking for the linear combination that has the highest variance
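A minimal sketch of these ideas in Python, assuming scikit-learn is available; the synthetic data and the names (rng, X, pca) are illustrative, not from the lecture.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    # Synthetic correlated features so the components differ in variance
    X = rng.normal(size=(100, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])

    pca = PCA(n_components=2)       # scikit-learn centers the data internally
    Z = pca.fit_transform(X)        # scores Z_1, Z_2 for each observation

    phi1 = pca.components_[0]       # first loading vector
    print(np.sum(phi1 ** 2))        # sum of squared loadings is 1
    print(pca.explained_variance_)  # variance captured by each component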

Geometry of PCA

  • the loading vector φ_1 defines the direction of the 1st component
  • we replace each point with its score: the signed distance of its projection along the 1st component direction (see the sketch after this list)
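A short numpy sketch of the projection step; the points and the unit-norm loading vector phi1 below are made up for illustration.

    import numpy as np

    X = np.array([[2.0, 1.0], [0.0, 0.5], [-2.0, -1.5]])
    Xc = X - X.mean(axis=0)                      # center at mean 0, as PCA requires

    phi1 = np.array([2.0, 1.0]) / np.sqrt(5.0)   # hypothetical unit-norm loading vector

    z1 = Xc @ phi1                               # scores: signed distance along phi1
    proj = np.outer(z1, phi1)                    # each point replaced by its projection
    print(z1)
    print(proj)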

Further principal components

  • because each further component must be uncorrelated with the previous ones, we still look for the direction where the variance is highest
  • maximizes variance subject to being uncorrelated with all previous components (checked numerically in the sketch below)
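A quick numerical check of that uncorrelatedness, assuming scikit-learn; the random data are illustrative.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))

    Z = PCA().fit_transform(X)                    # all principal component scores
    print(np.corrcoef(Z, rowvar=False).round(6))  # off-diagonal correlations are ~0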

US Arrests

  • what are the loading vectors?
  • the 1st principal component has roughly equal positive loadings on the 3 types of crime, so it measures the overall crime rate (see the sketch below)
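A hedged sketch of this example: it assumes the USArrests data (which ship with R, not Python) have been exported to a local USArrests.csv, so the file path is an assumption.

    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # Assumed local export of R's USArrests data: columns Murder, Assault,
    # UrbanPop, Rape, with state names as the index
    df = pd.read_csv("USArrests.csv", index_col=0)
    X = StandardScaler().fit_transform(df)  # variables are on different scales

    pca = PCA().fit(X)
    loadings = pd.DataFrame(pca.components_.T, index=df.columns,
                            columns=["PC1", "PC2", "PC3", "PC4"])
    print(loadings)  # signs are arbitrary; PC1 weights the three crimes similarly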

Another Interpretation of Principal Components

  • approximating the data using a lower-dimensional representation
  • we are looking for the hyperplane (e.g., the plane spanned by the 2 largest principal components) that lies closest to the data
  • equivalently, we want the projected data to be as spread out as possible
  • in linear regression, we minimize the vertical distance from each point to the fitted line; in PCA, we minimize the perpendicular distance to the hyperplane (see the sketch after this list)
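A numpy/scikit-learn sketch of the approximation view; the random data are illustrative.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(2)
    X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))
    Xc = X - X.mean(axis=0)

    pca = PCA(n_components=2).fit(Xc)
    Z = pca.transform(Xc)              # scores on the first two components
    X_hat = pca.inverse_transform(Z)   # rank-2 approximation of the data

    # Reconstruction error = sum of squared perpendicular distances to the plane
    print(np.sum((Xc - X_hat) ** 2))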

Scaling of the Variables Matters

  • we need to scale by standardizing, so every variable is measured on an equal scale
  • proportion of variance explained (PVE): if the first 2 principal components explain 96% of the variance, we can just use those 2 (see the sketch after this list)
  • cross-validation doesn’t directly help choose the number of components, because there is no y variable
  • but if the components are later fed into a supervised model, we can use cross-validation to decide how many to use; at that point we have a supervising response
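A sketch of PVE and the effect of scaling, assuming scikit-learn; the data, with one deliberately huge-scale feature, are made up for illustration.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(3)
    X = rng.normal(size=(100, 4)) * np.array([1.0, 1.0, 1.0, 100.0])

    print(PCA().fit(X).explained_variance_ratio_)  # unscaled: feature 4 dominates
    Xs = StandardScaler().fit_transform(X)
    pve = PCA().fit(Xs).explained_variance_ratio_  # standardized: balanced PVE
    print(pve)
    print(np.cumsum(pve))                          # cumulative PVE per component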

PCA vs Clustering

  • clustering looks for homogeneous subgroups among the observations
  • e.g., market segmentation: the task of splitting customers into subgroups via clustering
  • K-means: the number of subgroups K is specified in advance
  • hierarchical clustering: we don’t need to know the number of clusters in advance

K-means clustering

  • it finds clusters given a pre-specified number of clusters K
  • clusters are non-overlapping subsets: each observation belongs to exactly one cluster
  • we want the partition where the within-cluster variation is as small as possible
  • algorithm: randomly assign each observation a number from 1 to K, then compute each cluster’s centroid
  • We assign each observation to the cluster whose centroid is closest.
  • We then move each centroid to the mean of its respective cluster.
  • We then reassign observations using the new centroids, and repeat until the assignments stop changing.
  • Finds a local minimum (a valley), but not necessarily the global minimum: the objective function is not convex.
  • The clustering can differ depending on the random initial assignment, so run the algorithm several times and keep the best result (see the sketch below).
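A sketch of K-means with multiple random starts, assuming scikit-learn; n_init reruns the algorithm and keeps the solution with the lowest within-cluster sum of squares. The three-cluster data are made up.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(4)
    # Three illustrative clusters centered at -3, 0, and 3
    X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (-3.0, 0.0, 3.0)])

    km = KMeans(n_clusters=3, n_init=20, random_state=0).fit(X)
    print(km.inertia_)          # within-cluster sum of squares at the best start
    print(km.cluster_centers_)  # final centroids
    print(km.labels_[:10])      # cluster assignments of the first observations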

Hierarchical Clustering