Between July 2019 and March 2020, Australia experienced one of its worst wildfire seasons on record. To draw attention to the problem, the Call for Code: Australian Wildfires challenge launched in November 2020. The competition's objective is to forecast daily wildfires by region for February 2021. Contest information is provided here: https://community.ibm.com/community/user/datascience/blogs/susan-malaika/2020/11/10/call-for-code-spot-challenge-for-wildfires
The competition provides competitors with a stack of CSVs: wildfires, vegetation, and weather. The hosts compiled these datasets by aggregating petabytes of raster and vector data. It's essential to understand how the hosts compressed these datasets before using them for modeling.
- Historical Wildfires (dt: 2005–Jan. 2021)
The dataset contains daily wildfire statistics by region from 2005 to 2021. It includes the dependent variable, ‘estimated_fire_area,’ which I will need to predict for February 2021. Multiplying the scan and track values from the MCD14DL dataset returns the estimated fire area. MCD14DL is a tabulated dataset created from the MODIS thermal anomalies raster images. The scan and track values are needed because pixel resolution grows with distance from the satellite: a MODIS pixel at nominal 1 km resolution is not uniformly 1 km, and pixels farther from the satellite's nadir cover larger areas. The estimated areas are then grouped and summed by region and day.
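As a sketch of how I understand the hosts' aggregation (the column names below are my assumptions, not the official MCD14DL schema), the estimated fire area can be reproduced by multiplying scan and track per detection and then summing by region and day:

```python
import pandas as pd

# Hypothetical MCD14DL-style fire detections; column names are assumed.
detections = pd.DataFrame({
    "region": ["NSW", "NSW", "QLD"],
    "date": ["2020-01-01", "2020-01-01", "2020-01-01"],
    "scan": [1.2, 1.0, 1.5],   # along-scan pixel size (km)
    "track": [1.1, 1.0, 1.3],  # along-track pixel size (km)
})

# Estimated fire area per detection is scan * track (km^2).
detections["estimated_fire_area"] = detections["scan"] * detections["track"]

# Aggregate: sum the estimated areas by region and day.
daily = (detections
         .groupby(["region", "date"], as_index=False)["estimated_fire_area"]
         .sum())
print(daily)
```

This mirrors the compression step: many detections collapse into one row per region per day, which is exactly where the granularity is lost.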
- Historical Weather (dt: 2005–2021)
The weather dataset contains the daily mean, min, max, and standard deviation of weather variables (precipitation, humidity, soil water content, solar radiation, temperature, and wind speed). The original data come from ERA5, a global reanalysis of recorded climate observations. I believe the hosts aggregated by region and day and compressed the result to a CSV.
- Vegetation Index (dt: 2005–Jan. 2021)
The vegetation dataset contains monthly vegetation values (min, max, mean, variance) for each region. The original data, MOD13Q1, include the NDVI bands, which are, once again, aggregated by region.
The above gif shows NDVI seasonality in Australia for 2020. I included it to show how massive the original datasets are before compression. Aggregation has compressed petabytes of data into a workable CSV, but at the cost of granularity.
For my features, I used the weather and vegetation means, log-scaled wildfire areas, and two feature-engineered surface areas.
- Weather & Vegetation Means
Due to the compression, I assumed the mean distributions of weather and vegetation are uniform within each region: the average precipitation, humidity, soil water content, etc., are treated as the same at any point in a selected region. I'll address this incorrect assumption in my future steps.
- Log-Scaled Wildfires
Wildfires, like most natural disasters, follow a power-law-like distribution: on most days there are no disasters, but when one occurs, it can be enormous. I log-transformed the areas to get a more normal-looking distribution, and I'll exponentiate back after predicting.
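The post doesn't state the exact transform, so as an assumption here is a `log1p` sketch, which handles the many zero-fire days that a plain log cannot:

```python
import numpy as np

# Daily estimated fire areas (km^2); zeros are common, so log1p
# (log(1 + x)) is a safe choice of log transform for this sketch.
areas = np.array([0.0, 0.5, 3.2, 150.0, 0.0])

log_areas = np.log1p(areas)      # model works in this log space
recovered = np.expm1(log_areas)  # exponentiate back after predicting
```

The `expm1` inverse recovers the original scale exactly, including the zeros.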
- Surface Area
The wildfire area at time t has a strong correlation with the area at time t + 1. I think this relationship is somewhat indirect: fire expansion correlates with the surface of the fire more than with the burning area itself. Imagine watching a yule log on Christmas morning; the fire spreads along the surface of the existing fire, not in proportion to its area.
It might be impossible to recover the actual surface area because the wildfires were totaled by region. However, I can make a few assumptions about the area and create two features: Surface Area Conglomerated and Surface Area Separated.
The first assumes the fire area is conglomerated into one giant fireball. If the area is a square, its perimeter is four times the square root of the area. Since four is a constant, I drop it and keep the square root of the area.
The second assumes the total area is made up of separated, equal-sized pixels, with no pixel touching another. The surface area can then be calculated accordingly, and if there are no fires, the surface area is 0.
Maybe some combination of these two features can generalize the actual surface area.
The objective of the challenge was to forecast daily fire areas by region for February 2021. The last day of weather statistics I had was January 18. Therefore, I needed to output a minimum of 41 days (January 18 + 41 days = February 28).
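The 41-day horizon can be double-checked with date arithmetic:

```python
from datetime import date

last_weather_day = date(2021, 1, 18)  # last day with weather statistics
target_end = date(2021, 2, 28)        # end of the February forecast window

# Days between the last observed day and the end of February 2021.
horizon = (target_end - last_weather_day).days
print(horizon)  # 41
```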
Before I can feed the data into a convolutional neural network, I need to slice the training set into sequences whose input and output indices increase incrementally by 1. Envision a sliding window capturing the input and output steps as it slides forward by 1.
The original dataset, made up of days and features, should return two matrices X and y.
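The sliding window above can be sketched as follows (the window lengths here are illustrative; the actual model uses 120 input steps and 41 output steps):

```python
import numpy as np

def make_windows(series, n_in, n_out):
    """Slice a (days, features) array into (X, y) windows that shift by 1."""
    X, y = [], []
    for start in range(len(series) - n_in - n_out + 1):
        X.append(series[start : start + n_in])
        y.append(series[start + n_in : start + n_in + n_out])
    return np.array(X), np.array(y)

data = np.arange(20, dtype=float).reshape(10, 2)  # 10 days, 2 features
X, y = make_windows(data, n_in=4, n_out=2)
print(X.shape, y.shape)  # (5, 4, 2) (5, 2, 2)
```

Each window's output steps immediately follow its input steps, and consecutive windows overlap by all but one day.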
I utilized a Dilated Convolutional Neural Network very similar to the WaveNet model that predicts soundwave sequences.
I windowed my dataset with an input step of 120 days and an output of 41. I slingshotted the 41 days, meaning the output steps are independent of each other (t+1 is independent of t+2). This prediction assumption seems faulty, but the model can still produce some “good” results. I'll make sure to include an autoregressive element in my future steps.
The nodes of the Conv1D layers are calculated from neighboring inputs that shift incrementally. Adding padding prevents the layers from losing their original shape, and adding dilation reduces the number of hidden layers required. Within the hidden layers, inputs are skipped at the dilation rate, and increasing the dilation rate per layer reduces the total number of layers needed.
I don't believe the grey nodes are calculated during the fit, but I included them to represent the model architecture. The orange boxes with 0s represent the padding used to retain the original shape.
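To make concrete why dilation cuts down the layer count: with kernel size k and dilation rates d_i, a stack of dilated causal Conv1D layers has a receptive field of 1 + Σ (k − 1)·d_i. The kernel size and dilation schedule below are my assumptions for illustration, not the exact architecture:

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of stacked dilated causal Conv1D layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# WaveNet-style doubling dilations: 4 layers already see 16 time steps...
print(receptive_field(2, [1, 2, 4, 8]))  # 16
# ...while undilated layers need 15 of them for the same coverage.
print(receptive_field(2, [1] * 15))      # 16
```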
For my next steps, I plan to incorporate an autoregressive element into the model. I would also like to reaggregate the original raw data (wildfires, weather, vegetation) to administrative level 2. Reaggregating to smaller shapes will provide a more granular dataset while still compressing the enormous raster images.