The random forest algorithm is the topic of the second assignment of Machine Learning for Data Analysis by Wesleyan University on Coursera. This assignment extends the previous one: besides using a random forest instead of a decision tree, I included more variables than before, among them categorical variables. In the previous assignment I had excluded all non-numerical variables because of the error ‘ValueError: could not convert string to float: communal standpipe‘. It is not a serious error, but it means that categorical variables must first be converted into their corresponding dummies (an operation I did not expect to have to request, since the R language does the conversion automatically). To keep things simple I had opted to use only numerical variables, which require no transformation into dummies. Another reason not to transform the variables was that the plot of the decision tree would have looked very cluttered with more than 30 variables. Including only the three numerical variables therefore seemed an optimal solution to the problem.
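For this assignment the dummy conversion was needed, so here is a minimal sketch of how it could be done with pandas. The DataFrame name `df` and the assumption that it already holds the raw categorical columns are mine, not taken from the original notebook.

```python
import pandas as pd

# Categorical predictors that scikit-learn cannot use as raw strings.
categorical_cols = ['extraction_type_group', 'quality_group',
                    'quantity', 'waterpoint_type']

# get_dummies creates one 0/1 column per level, e.g. 'quantity_dry' or
# 'waterpoint_type_communal standpipe', and drops the original columns.
df = pd.get_dummies(df, columns=categorical_cols)
```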
A random forest analysis was performed to evaluate the importance of a series of explanatory variables in predicting a polytomous categorical response variable: pump functionality (‘functional’, ‘functional needs repair’, and ‘non functional’). The variables ‘longitude’, ‘latitude’, ‘extraction_type_group’, ‘quality_group’, ‘quantity’, ‘waterpoint_type’, and ‘construction_year’ were included as possible predictors.
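A rough sketch of that fit with scikit-learn is below, building on the dummy-encoded `df` from the previous snippet. The label column name ‘status_group’, the 60/40 split, and the 25 trees are assumptions for illustration, not necessarily the settings used in the assignment.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Keep the numerical predictors plus the dummy columns created above.
feature_cols = [c for c in df.columns
                if c in ('longitude', 'latitude', 'construction_year')
                or c.startswith(('extraction_type_group_', 'quality_group_',
                                 'quantity_', 'waterpoint_type_'))]

predictors = df[feature_cols]
target = df['status_group']          # assumed name of the label column

X_train, X_test, y_train, y_test = train_test_split(
    predictors, target, test_size=0.4, random_state=123)

clf = RandomForestClassifier(n_estimators=25, random_state=123)
clf.fit(X_train, y_train)
print('test accuracy:', clf.score(X_test, y_test))
```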
The bar plot shows the six explanatory variables with the highest relative importance scores: longitude, latitude, construction_year, dry (a dummy variable created from one of the levels of the ‘quantity’ variable: ‘enough’, ‘dry’, ‘insufficient’, ‘seasonal’, ‘unknown’), enough (another dummy from ‘quantity’), and communal standpipe (a dummy variable created from one of the levels of the ‘waterpoint_type’ variable). The accuracy of the random forest was 80.64%, and growing multiple trees rather than a single tree added little to the overall accuracy of the model, suggesting that interpretation of a single decision tree may be appropriate.
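A plot like that can be produced from the fitted forest's `feature_importances_` attribute; the following sketch assumes the `clf` and `X_train` objects from the snippet above.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Pair each predictor with its importance score from the fitted forest.
importances = pd.Series(clf.feature_importances_, index=X_train.columns)

# Plot only the six most important predictors, as in the bar plot above.
importances.sort_values(ascending=False).head(6).plot(kind='bar')
plt.ylabel('relative importance')
plt.tight_layout()
plt.show()
```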
In comparison with last week's outcome, the classification accuracy was higher when using a random forest than when using a decision tree. This could also be an artifact, since including more variables in the model will likely increase the classification accuracy. A true test of the two modeling algorithms would compare them on the same variables using cross-validation. One could also argue that a more interesting step would be to upload the classification of the test set to the DrivenData website, but I will leave these two exciting tests for another day.
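For completeness, here is one way such a cross-validated comparison could look, reusing `predictors` and `target` from the earlier sketch; the choice of 5 folds and the model settings are arbitrary assumptions on my part.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

models = {'decision tree': DecisionTreeClassifier(random_state=123),
          'random forest': RandomForestClassifier(n_estimators=25,
                                                  random_state=123)}

# Score both models on the same predictors with 5-fold cross-validation.
for name, model in models.items():
    scores = cross_val_score(model, predictors, target, cv=5)
    print('%s: %.4f mean accuracy' % (name, scores.mean()))
```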
The Python code for the analysis is on GitHub.