Principal Component Analysis on LandClass: Australia Wildfires

Photo by Joshua Hibbert on Unsplash
https://en.wikipedia.org/wiki/Template:Australia_states_imagemap

In light of the recent CallForCode Spot challenge on Australian wildfires, I will use PCA to analyze land classes and their relation to wildfires. The Australian regions are:
- NSW = New South Wales
- NT = Northern Territory
- QL = Queensland
- SA = South Australia
- TA = Tasmania
- VI = Victoria
- WA = Western Australia

The land class dataset is a single snapshot (from 2015 only) that breaks each region down into the percentage of its area covered by each terrain type. There are 14 terrain types, which together describe the landscape of each state.

https://gist.github.com/albertum1/cc251eac3251589c84d2473f549ffb08

The main difference between evergreen and deciduous plants is that deciduous plants shed all their leaves every autumn, while evergreens keep foliage throughout the year. Herbaceous vegetation consists of plants with non-woody stems.

Principal Component Analysis, in short, is a way to summarize a continuous dataset by creating new features (components), each a linear combination of the original variables, ordered by how much of the variance they explain. If I have 14 features (like above), then at most 14 components can be derived, and never more than the number of observations. In practice, PCA can often reduce a dataset like this to 4 to 5 components that explain the majority of the variance. PCA can also help with regression because the components are uncorrelated with each other, which removes multicollinearity among the predictors.
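To make the summary above concrete, here is a minimal sketch of PCA computed with numpy's SVD. It assumes a synthetic 7 x 14 table of random percentages standing in for the land class data (the real values are not used), and it centers the columns without rescaling them, which is a choice rather than something taken from the original analysis. All variable names are illustrative.

```python
import numpy as np

# Synthetic stand-in for the land class table: 7 regions x 14 terrain
# percentages (random values, NOT the real 2015 measurements).
rng = np.random.default_rng(0)
raw = rng.random((7, 14))
X = 100 * raw / raw.sum(axis=1, keepdims=True)  # each row sums to 100%

# Center each column; PCA is defined on mean-centered data.
Xc = X - X.mean(axis=0)

# SVD of the centered data: the rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Variance captured by each component and its share of the total.
explained_variance = S**2 / (X.shape[0] - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Component scores: the data expressed in the new features.
scores = Xc @ Vt.T

# Scores of different components are uncorrelated, so their covariance
# matrix is diagonal up to floating point error.
score_cov = np.cov(scores, rowvar=False)
off_diagonal = score_cov - np.diag(np.diag(score_cov))

print(np.round(np.cumsum(explained_ratio), 3))
print(np.abs(off_diagonal).max())
```

With only 7 rows, centering leaves at most 6 components that carry any variance, so the last entry of `explained_ratio` is effectively zero even though there are 14 columns.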

The retained components explain at least 95% of the variance in the original dataset. The transformed values themselves are not very interpretable, so I will now try to decipher each component by looking at its loading values, the weights that each original land class carries in that component.
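A minimal sketch of both steps, picking the number of components that reaches 95% of the variance and then computing loadings, might look like the following. It assumes one common convention, in which a loading is the eigenvector scaled by the square root of its eigenvalue. This convention is an assumption and not necessarily the one used in the original analysis; some people instead call the raw eigenvectors (`components_` in scikit-learn) the loadings. The data is again a synthetic placeholder.

```python
import numpy as np

# Synthetic placeholder data (7 regions x 14 land classes), not the real table.
rng = np.random.default_rng(1)
X = rng.random((7, 14))
Xc = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
eigenvalues = S**2 / (X.shape[0] - 1)
cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()

# Smallest number of components whose cumulative share reaches 95%.
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)

# Loadings under the assumed convention: eigenvector * sqrt(eigenvalue).
# Shape is (n_features, n_keep): one column per retained component.
loadings = Vt[:n_keep].T * np.sqrt(eigenvalues[:n_keep])

print(n_keep, loadings.shape)
```

Under this convention the squared loadings in each column sum to that component's eigenvalue, which gives a quick sanity check on the result.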

Loading values can help interpret principal components. For instance, PCA_1 can be interpreted as wooded marshlands, since it has positive coefficients for evergreen forests, wetlands, and bodies of water, and negative coefficients for shrubs and herbaceous vegetation.
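Reading a component this way amounts to sorting its loadings and looking at the extremes. The sketch below shows the idea with a hypothetical helper, `strongest_features`, and invented loading values. Only the signs follow the description of PCA_1 above; the magnitudes and the "cropland" entry are made up for illustration.

```python
# Hypothetical loadings for one component; the numbers are invented for
# illustration and are not the values from the analysis.
example_loadings = {
    "evergreen forest": 0.45,
    "wetland": 0.30,
    "water bodies": 0.25,
    "cropland": 0.05,
    "shrubs": -0.40,
    "herbaceous vegetation": -0.35,
}


def strongest_features(loadings, top_n=3):
    """Return the top_n most positive and most negative features."""
    ranked = sorted(loadings.items(), key=lambda item: item[1], reverse=True)
    positive = [name for name, value in ranked[:top_n] if value > 0]
    negative = [name for name, value in ranked[::-1][:top_n] if value < 0]
    return positive, negative


positive, negative = strongest_features(example_loadings)
print("pushes the component up:", positive)
print("pushes the component down:", negative)
```

Features with loadings near zero contribute little either way, so the interpretation rests on the few classes at each end of the ranking.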

PCA_2 can be interpreted as outback-like regions with some water bodies. The positive coefficients on bare vegetation and water bodies, together with the lack of forests, made me think of an outback scene with some patches of water.

PCA_3 may represent human-settled regions, given the positive coefficients on cropland and urban / built up.

Finally, PCA_4 appears to represent park-like plains. In conclusion, principal components can be interpreted by calculating their loading values and reading which original features carry the largest positive and negative weights.

Hello! My name is Albert Um.
