This is the third assignment for the Machine Learning for Data Analysis course by Wesleyan University on Coursera. I applied the least absolute shrinkage and selection operator (LASSO) to the DrivenData data set pumpItUp. LASSO is a technique that performs variable selection by shrinking the coefficients of ‘useless’ variables toward zero, often to exactly zero. Applying this method I am not literally shrinking a pump, but it feels a bit like it. Moreover, I abandoned my secondary goal of translating into Python the DataCamp tutorial that mines the pumpItUp data set with R. I abandoned it because the differences between that mining process and the examples given in class are too large. Maybe I will pick up the translation challenge when the course is finished, but for now the deadlines are too tight and I would not finish in time.
To keep my assignment in line with the example on the Coursera website, I recoded my response variable from three levels (functional, functional needs repair, non-functional) into a binary response variable (functional, non-functional). All the functional-needs-repair pumps became non-functional pumps.
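The recoding described above can be sketched in a few lines of pandas. This is a minimal illustration on toy rows, not the assignment's actual code: the column names `status_group` and `functional` and the exact label spellings are assumptions.

```python
import pandas as pd

# Toy labels with the three status levels named in the text.
# The column name and label spellings are assumptions for illustration.
labels = pd.DataFrame({
    "status_group": [
        "functional",
        "functional needs repair",
        "non functional",
        "functional",
    ]
})

# 1 = functional, 0 = everything else, so pumps that need repair
# are counted as non-functional.
labels["functional"] = (labels["status_group"] == "functional").astype(int)

print(labels["functional"].tolist())
```

Comparing against the single "functional" label keeps the rule explicit: anything that is not fully functional ends up in the non-functional class.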
A lasso regression analysis was conducted to identify the subset of variables, from a pool of 34 categorical and quantitative predictor variables, that best predicted a binary categorical response variable reflecting whether water pumps in Tanzania (as recorded in the Taarifa data) were functioning or malfunctioning. Longitude, latitude and construction_year were continuous variables. The categorical predictors extraction_type_group, quality_group, quantity and waterpoint_type were all recoded as binary dummy variables. All predictor variables were standardized to have a mean of zero and a standard deviation of one.
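As a rough sketch of the preparation step described above, the snippet below dummy-codes one categorical predictor and standardizes all columns. The four rows are made-up values used only to show the mechanics, and the frame names `pumps`, `predictors` and `scaled` are hypothetical.

```python
import pandas as pd

# Made-up rows; only the column names come from the text.
pumps = pd.DataFrame({
    "longitude": [34.9, 38.5, 37.2, 31.1],
    "latitude": [-9.8, -6.2, -3.4, -1.8],
    "construction_year": [1999, 2010, 2009, 1986],
    "quantity": ["enough", "dry", "enough", "insufficient"],
})

# One binary (0/1) column per level of the categorical predictor.
predictors = pd.get_dummies(pumps, columns=["quantity"], dtype=float)

# Standardize every predictor to mean 0 and standard deviation 1.
# ddof=0 uses the population standard deviation.
scaled = (predictors - predictors.mean()) / predictors.std(ddof=0)

print(scaled.round(2))
```

Standardizing matters for LASSO because the penalty is applied to the size of the coefficients, so predictors on different scales would otherwise be penalized unevenly.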
Data were randomly split into a training set that included 70% of the observations (N=41580) and a test set that included 30% of the observations (N=17820). The least angle regression algorithm with 10-fold cross-validation was used to estimate the lasso regression model on the training set, and the model was validated on the test set. The change in the cross-validation mean squared error at each step was used to identify the best subset of predictor variables.
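The split and the cross-validated fit described above might look like the following with scikit-learn. This is a sketch on synthetic data, not the pump data: the sample size, the random seeds and the way the 0/1 target is generated are all assumptions. Note that the 0/1 response is treated as a numeric target, which is what a squared-error criterion implies.

```python
import numpy as np
from sklearn.linear_model import LassoLarsCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the standardized predictors (assumption).
rng = np.random.default_rng(0)
n_samples, n_features = 500, 8
X = rng.standard_normal((n_samples, n_features))

# Hypothetical 0/1 response driven by the first two columns plus noise.
signal = 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.standard_normal(n_samples)
y = (signal > 0).astype(float)

# 70% training set, 30% test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123
)

# Lasso fitted with least angle regression and 10-fold cross-validation.
model = LassoLarsCV(cv=10).fit(X_train, y_train)

# Penalty chosen by cross-validation, and the mean squared error
# averaged over the 10 folds at each candidate penalty.
print(model.alpha_)
print(model.mse_path_.mean(axis=1))
```

The selected penalty is the one that minimizes the fold-averaged error, which is the quantity tracked at each step in the description above.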
Of the 34 predictor variables, 29 were retained in the selected model. During the estimation process, ‘dry’, ‘communal standpipe’, ‘nira/tanira’, ‘enough’, ‘other’, ‘improved spring’ and ‘afridev’ were most strongly associated with pump functioning. ‘Dry’ and ‘other’ were negatively associated with pump functioning. ‘Communal standpipe’, ‘nira/tanira’, ‘enough’, ‘improved spring’ and ‘afridev’ were positively associated with pump functioning. The other retained predictors contributed to a lesser extent, so I decided not to report them. Together, these 29 variables accounted for 23.16% of the variance in the functional pump response variable.
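For readers who want to see how retained predictors, their signs and the explained variance can be read off a fitted model, here is a hedged sketch. The data are synthetic, the column names merely borrow three labels from the text plus two hypothetical noise columns, and the numbers it prints have nothing to do with the results reported above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoLarsCV
from sklearn.model_selection import train_test_split

# Synthetic predictors; names are illustrative only.
rng = np.random.default_rng(1)
names = ["dry", "enough", "afridev", "noise_a", "noise_b"]
X = pd.DataFrame(rng.standard_normal((600, len(names))), columns=names)

# Hypothetical 0/1 response: 'dry' lowers it, 'enough' and 'afridev' raise it.
signal = -1.5 * X["dry"] + 1.0 * X["enough"] + 0.5 * X["afridev"]
y = (signal + rng.standard_normal(600) > 0).astype(float)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123
)
model = LassoLarsCV(cv=10).fit(X_train, y_train)

# Predictors with a non-zero coefficient are the ones the model retained.
coefs = pd.Series(model.coef_, index=names)
retained = coefs[coefs != 0]

# Rank by absolute size; the sign gives the direction of the association.
ranked = retained.reindex(retained.abs().sort_values(ascending=False).index)
print(ranked)

# R-squared: share of variance in the response explained by the model.
r2_train = model.score(X_train, y_train)
r2_test = model.score(X_test, y_test)
print(r2_train, r2_test)
```

Because the predictors are standardized, the absolute size of each coefficient gives a rough ranking of how strongly each retained variable is associated with the response.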
The Python code is on GitHub.