Saturday 13 April 2019

PCA


Say you have a dataset with missing names and you want to quickly identify the relations between the variables. One way is to identify the 'Principal Components'.
> library(ggplot2)
> data(mpg)
> # get the numeric columns only for this easy demo
> data <- mpg[, c("displ", "year", "cyl", "cty", "hwy")]
> prcomp(data, scale. = TRUE)

Standard deviations:
  [1] 1.8758132 1.0069712 0.5971261 0.2658375 0.2002613

Rotation:
               PC1         PC2        PC3         PC4         PC5
displ  0.49818034 -0.07540283  0.4897111  0.70386376 -0.10435326
year   0.06047629 -0.98055060 -0.1846807 -0.01604536  0.02233245
cyl    0.49820578 -0.04868461  0.5028416 -0.68062021  0.18255766
cty   -0.50575849 -0.09911736  0.4348234  0.15195854  0.72264881
hwy   -0.49412379 -0.14366800  0.5330619 -0.13410105 -0.65807527

Here is how you interpret the result:

(1) The standard deviations come from the diagonal matrix in the middle of the singular value decomposition (the singular values, divided by sqrt(n - 1)). Their squares tell you how much of the total variance in the matrix each 'Principal Component' (or layer) explains. For example,
1.8758132^2 / (1.8758132^2 + 1.0069712^2 + 0.5971261^2 + 0.2658375^2 + 0.2002613^2) ≈ 70%
which indicates that the first component alone already explains about 70% of the variance in the whole matrix.
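
If you want that share for every component at once, here is a minimal sketch. It only reuses the standard deviations printed above (typed in by hand, so it runs without ggplot2); the variable names are just for illustration.

```r
# Standard deviations copied from the prcomp() output above
sdev <- c(1.8758132, 1.0069712, 0.5971261, 0.2658375, 0.2002613)

# Share of the total variance explained by each component
var_explained <- sdev^2 / sum(sdev^2)
print(round(var_explained, 3))

# Running total: how much the first k components explain together
print(round(cumsum(var_explained), 3))
```

On a fitted object, summary(prcomp(data, scale. = TRUE)) should report the same proportions.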
(2) Now let's look at the first column in the rotation matrix / V:
          PC1      
displ  0.49818034
year   0.06047629
cyl    0.49820578
cty   -0.50575849
hwy   -0.49412379

We can see that displ has a positive relation with cyl and a negative relation with cty and hwy. In this dominant layer, year contributes very little (its loading is close to zero).

This makes sense: the more displacement or cylinders your car has, the lower its MPG is likely to be.
Here is the plot between the variables, just for your information.
> pairs(data)
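
One more aside on the SVD connection mentioned in (1) and (2). The sketch below checks that the standard deviations are the singular values divided by sqrt(n - 1), and that the rotation matrix is V up to the sign of each column. As an assumption, it uses base R's mtcars as a stand-in dataset so it runs without ggplot2; the same check should work on the mpg columns used above.

```r
# Stand-in data (assumption): a few numeric columns from base R's mtcars
X <- scale(mtcars[, c("disp", "cyl", "mpg", "hp")])  # center and scale, like scale. = TRUE

p <- prcomp(X)  # PCA on the already scaled data
s <- svd(X)     # X = U D V'

# Standard deviations = singular values / sqrt(n - 1)
sdev_ok <- isTRUE(all.equal(p$sdev, s$d / sqrt(nrow(X) - 1)))

# Rotation = V, compared in absolute value to ignore column sign flips
rot_ok <- isTRUE(all.equal(abs(unname(p$rotation)), abs(s$v)))

print(c(sdev_ok = sdev_ok, rot_ok = rot_ok))
```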
