This post is about the first assignment of Machine Learning for Data Analysis by Wesleyan University on Coursera. Over the past month I have been mining the dataset of the pumpItUp challenge on DrivenData. The challenge asks participants to predict the functioning status of water pumps in Tanzania, using data from Taarifa. For the challenge I did most of the mining in R, since I found an online tutorial that basically leads you by the hand through the process. Since the Coursera class uses Python instead of R, I thought I would make a first attempt at translating the R code into Python. Later on it will be interesting to compare how Python and R solve the same problem in the same way. The first difference is already here: the R tutorial uses a random forest to predict the test data, but I have to use a decision tree, since that is the requirement of this week's assignment.
Decision tree analysis was performed to test nonlinear relationships among a series of explanatory variables and a categorical response variable. The dataset is from the pumpItUp challenge on DrivenData. For the present analyses, the entropy ‘goodness of split’ criterion was used to grow the tree and a cost complexity algorithm was used for pruning the full tree into a final subtree.
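To make the entropy criterion a bit more concrete, here is a minimal sketch of how the entropy of a node, and the information gain of a candidate split, can be computed. The function names and the toy labels are my own assumptions for illustration; they are not taken from the assignment script or from the pump dataset.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting `parent` into `left` and `right`."""
    total = len(parent)
    weighted = (len(left) / total) * entropy(left) + (len(right) / total) * entropy(right)
    return entropy(parent) - weighted


# Made-up toy labels, NOT taken from the pump dataset.
parent = ["functional"] * 4 + ["non functional"] * 4
left = ["functional"] * 3 + ["non functional"] * 1
right = ["functional"] * 1 + ["non functional"] * 3

print(entropy(parent))
print(information_gain(parent, left, right))
```

When growing the tree, the split with the largest information gain is the one preferred at each node.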
The explanatory variables ‘longitude’, ‘latitude’, and ‘construction_year’ were included as possible contributors to a classification tree model evaluating ‘status_group’, which reflects the functioning status of water pumps (my response variable).
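For readers who want to see the general shape of such a model, below is a rough sketch using scikit-learn's `DecisionTreeClassifier` with the entropy criterion. It is not my actual script: the handful of rows are invented purely so that the example runs on its own, and the column order (longitude, latitude, construction_year) is an assumption.

```python
from sklearn.tree import DecisionTreeClassifier

# Invented rows in the order [longitude, latitude, construction_year].
# These are NOT records from the pump dataset.
X = [
    [35.1, -6.2, 2005],
    [36.0, -3.4, 2010],
    [38.5, -7.9, 1985],
    [39.2, -5.0, 1990],
    [34.8, -8.8, 2001],
    [38.9, -4.1, 1978],
]
# Invented status_group labels, one per row above.
y = [
    "functional",
    "functional",
    "non functional",
    "non functional",
    "functional",
    "non functional",
]

# criterion="entropy" selects the entropy 'goodness of split' measure.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)

# Predict on the training rows just to show the call; a real analysis
# would hold out a separate test set.
predictions = clf.predict(X)
print(list(predictions))
```

With real data, the three explanatory columns would be read from the challenge files and the predictions evaluated on rows the tree has not seen.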
Construction_year was the first variable to separate the sample into two subgroups. Pumps built from 1998 onwards (range 1998 to 2013, M=2006, SD=4.16) were more likely to be functioning than pumps built earlier (67.17% vs. 46.75%).
Among pumps built from 1998 onwards, a further subdivision was made on the longitude variable: those located at a longitude below 37.44 degrees were more likely to be functional. Overall, the model classified 64.50% of the sample correctly: 71.97% of the functional pumps, 27.96% of the functional pumps needing repair, and 60.94% of the non-functional pumps.
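The overall and per-class percentages are simply the share of correctly classified pumps, computed over the whole sample and within each true class. A small sketch of that calculation follows; the helper names and the toy labels are hypothetical and only there to illustrate the arithmetic.

```python
from collections import Counter


def overall_accuracy(actual, predicted):
    """Fraction of all samples whose predicted label matches the true one."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def per_class_accuracy(actual, predicted):
    """For each true class, the fraction of its samples predicted correctly."""
    totals = Counter(actual)
    correct = Counter(a for a, p in zip(actual, predicted) if a == p)
    return {label: correct[label] / totals[label] for label in totals}


# Made-up toy labels, NOT results from the pump dataset.
actual = [
    "functional", "functional", "functional",
    "non functional", "non functional",
    "functional needs repair",
]
predicted = [
    "functional", "functional", "non functional",
    "non functional", "functional",
    "functional",
]

print(overall_accuracy(actual, predicted))
print(per_class_accuracy(actual, predicted))
```

A class with few samples, like the pumps needing repair here, can score poorly on its own even when the overall figure looks acceptable.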
The script I used is rather uninteresting, so rather than pasting it below I am placing a link to the GitHub repository hosting it.