Principal Component Analysis on LandClass: Australia Wildfires
In light of the recent CallForCode Spot challenge on Australian wildfires, I will use PCA to analyze land classes and their relation to wildfires. The Australian regions are:
- NSW = New South Wales
- NT = Northern Territory
- QL = Queensland
- SA = South Australia
- TA = Tasmania
- VI = Victoria
- WA = Western Australia
The land class dataset is a single snapshot (from 2015 only) that breaks each region down into the percentage of its area covered by each terrain type. There are 14 terrain types, which together describe each state's landscape.
The main difference between evergreen and deciduous trees is that deciduous trees shed all their leaves every autumn, while evergreens keep foliage year-round. Herbaceous vegetation consists of plants with non-woody stems.
Principal Component Analysis, in short, is a way to summarize a continuous dataset by creating new features (components), each a linear combination of the original variables. If I have 14 features (like above), then at most 14 components can be derived, and never more than the number of observations. However, PCA can often reduce a dataset to 4–5 components that explain the majority of the variance. PCA can also help with regression, because the components are uncorrelated with each other, which removes multicollinearity among the predictors.
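As a rough illustration of the ideas above, here is a minimal sketch of PCA done with a plain NumPy singular value decomposition. The data is a hypothetical stand-in (random numbers, not the land class dataset), and the SVD route is an assumption made for the sketch rather than the method used in this analysis; scikit-learn's `PCA` class exposes comparable quantities through `explained_variance_ratio_` and `components_`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 50 observations of 14 correlated features,
# built from 4 hidden factors plus a little noise.
factors = rng.normal(size=(50, 4))
mixing = rng.normal(size=(4, 14))
X = factors @ mixing + 0.1 * rng.normal(size=(50, 14))

# Standardize each feature (zero mean, unit variance) before PCA.
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# PCA via singular value decomposition of the standardized data.
# Rows of Vt are the principal axes; there are at most 14 of them.
U, S, Vt = np.linalg.svd(X_std, full_matrices=False)

# Variance explained by each component, as a fraction of the total.
explained_variance = S**2 / (X_std.shape[0] - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Smallest number of components whose cumulative share reaches 95%.
cumulative = np.cumsum(explained_ratio)
n_components = int(np.searchsorted(cumulative, 0.95) + 1)

# Project the data onto the retained components.
scores = X_std @ Vt[:n_components].T

# Correlation matrix of the component scores: off-diagonals are ~0.
score_corr = np.corrcoef(scores, rowvar=False)

print("components kept:", n_components)
print("cumulative variance:", np.round(cumulative[:n_components], 3))
```

The near-zero off-diagonal entries of `score_corr` are what makes the components convenient regression inputs: each one carries information the others do not.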
The components retained here explain at least 95% of the variance in the original dataset. The transformed values themselves are not very interpretable, so I will now try to decipher each component by looking at its loading values. A common definition of a loading takes each component's eigenvector and scales it by the square root of the corresponding eigenvalue.
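To make the idea of loadings concrete, here is a minimal sketch of one common way to compute them: eigenvectors of the correlation matrix scaled by the square root of their eigenvalues. The feature names and data are hypothetical, and this may not match the exact calculation used for the results discussed in this post.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical feature names and data, for illustration only.
features = ["forest", "shrubland", "cropland", "urban", "water"]
X = rng.normal(size=(30, 5))
X[:, 1] -= 0.8 * X[:, 0]  # shrubland falls as forest rises
X[:, 3] += 0.9 * X[:, 2]  # urban rises with cropland

# Standardize so every feature has zero mean and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Eigen-decomposition of the feature correlation matrix.
corr = np.corrcoef(X_std, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(corr)

# eigh returns ascending order; sort so the largest variance comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Loadings: eigenvectors scaled by the square root of their eigenvalues.
# Rows are original features, columns are principal components.
loadings = eigenvectors * np.sqrt(eigenvalues)

for name, row in zip(features, loadings[:, :2]):
    print(f"{name:>10}  PCA_1={row[0]:+.2f}  PCA_2={row[1]:+.2f}")
```

With standardized inputs, each loading equals the correlation between an original feature and a component, so a large positive or negative value says that feature moves strongly with (or against) the component. The overall sign of a component is arbitrary, so only the relative signs within a column carry meaning.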
Loading values can help interpret principal components. For instance, PCA_1 can be interpreted as wooded marshlands: it has positive coefficients for evergreen forests, wetlands, and bodies of water, and negative coefficients for shrubs and herbaceous vegetation.
PCA_2 can be interpreted as outback-like regions with some water bodies. The positive coefficients on bare vegetation and water bodies, together with the lack of forests, made me think of an outback scene with some patches of water.
PCA_3 may represent human-settled regions, given the positive coefficients on cropland and urban/built-up land.
Finally, PCA_4 can be interpreted as park-like plains. In conclusion, principal components can be interpreted by examining their loading values.