For this blog, I will run a hypothesis test if the population count around a well affects its functionality. I will be using the dataset from the Tanzania Water Pump Challenge hosted by Data-Driven and the WorldPop population estimations to feature engineer population estimations in a 1km, 5km, and 10km radius around the well.
- Pump it Up: Data Mining the Water Table
- Introduction to challenge and dataset
- WorldPop Estimation
- Peoples per pixel(‘ppp’; raster image of counts)
- Population analysis
- Kruskal Wallis test for medians
Pump it Up: Data Mining the Water Table
I will use the dataset from the Pump it Up Water Challenge hosted by Data-Driven. The competition’s objective is to predict the functionality of water pumps in Tanzania. It’s a classification problem with three classes: functional, needs repairs, non-functional.
The features include the date recorded, geographic locations, and many characteristics of the pump. The ‘population’ feature is one of the provided features. However, the distance and extent of the population counts are not provided. To simply put, I don’t know how the population counts were accumulated.
The population around the well ranges from 0 to 30500, of which half is less than 25. I suspect the population counts' land coverage was small, explaining to 0 counts when there is no one around ~100 feet and larger counts when the pump is in a town.
Since I don’t know the coverage extent when the population counts were compiled, I will aggregate population estimates for each well by three circles(1km, 5km, 10km) using the WorldPop estimations. The estimations utilize census data(as micro available) and VMAP0/1 and map population count to pixels(100m resolution). The output of their population project is raster images of countries separated by year.
The Tanzania Water Pump Challenge Dataset contains coordinates of the water pump and recorded data. I can aggregate population counts using custom shapes created with GeoPandas’ helper function(buffer) and the WorldPop population estimations.
It’s also important to note the competition moderators recorded the water pumps’ observations in different years with a range between 2001 and 2013(2001, 2002, 2003, 2011, 2012, and 2013 specifically). I split the dataset by year into six different data frames. The gist below represents an example of compiling the sums using only one tif file(one year).
It is interesting to visually see that the pumps that need repairs have a higher median surrounding population than the non-functional or functional pumps. The greater median might be explained if the pumps that need repairs but are not often used are not reported as needs repair.
If the y dependent, functionality, had a subclass of usage(for example, low usage or high usage), then the dependents will be: non-functional low usage, non-functional high usage, needs repair low usage, needs repair high usage, functional low usage, functional high usage. I can speculate the needs repair low usage might be biased to be reported as functional or non-functional to discourage complexity.
I will like to run a hypothesis test for medians to see if the population medians are different among the samples of the non-functional, needs repair, and functional classes. I standardize my alpha due to the high number of observations of 59400—the idea of standardizing alpha, when high observations, was derived here from Daniel Lakens's blog.
The population estimates of 1km around the pumps fail to reject the hypothesis that the median population counts surrounding non-functional and functional pumps are equal. More notably, the test rejects the hypothesis that the medians are equal for populations surrounding 5km or 10km.
I should take this analysis with caution as the population counts are estimates using micro census data and digital mappings to predict people counts. Nonetheless, with feature engineering, I can interpret the data in more detail.