Clustering Pumps [mlw4]

This is the fourth and last assignment of Machine Learning for Data Analysis by Wesleyan University on Coursera. My assignment diverges quite a bit from the instructor's approach, since I wanted exactly three clusters to match the three pump functionality classes (functional, functional needs repair, and non-functional) of the pumpItUp challenge on DrivenData. I tried to follow the road map for this assignment as closely as I could, but I stopped where it no longer made sense. The first part was successful.

A k-means cluster analysis was conducted to identify underlying subgroups of pumps based on their similarity of responses on 11 variables that represent characteristics that could have an impact on the functionality of the pump (i.e., whether the pump was functional, functional needing repair, non-functional). Clustering variables included ‘longitude’, ‘latitude’, ‘extraction_type_group’, ‘quality_group’, ‘quantity’, ‘waterpoint_type’, ‘construction_year’. Of these variables, the categorical ones were recoded into dummy variables with a binary encoding, yielding 34 variables in total. All clustering variables were standardized to have a mean of 0 and a standard deviation of 1.
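The preprocessing described above can be sketched roughly as follows. This is a minimal illustration, not the code used for the assignment: the four-row sample is made up, and it assumes the binary encoding was a one-hot encoding done with `pandas.get_dummies` and that standardization was done with scikit-learn's `StandardScaler`.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny made-up sample (hypothetical values); the real data come from DrivenData.
pumps = pd.DataFrame({
    "longitude": [34.9, 35.3, 37.5, 38.5],
    "latitude": [-9.9, -2.1, -3.8, -11.2],
    "construction_year": [1999, 2010, 2009, 1986],
    "quantity": ["enough", "insufficient", "enough", "dry"],
})

# One-hot encode the categorical column: one 0/1 column per category.
encoded = pd.get_dummies(pumps, columns=["quantity"], dtype=float)

# Standardize every column to mean 0 and (population) standard deviation 1.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(encoded),
    columns=encoded.columns,
    index=encoded.index,
)
```

Each categorical variable expands into as many columns as it has categories, which is how a handful of original variables can grow into 34 clustering variables.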

Data were randomly split into a training set that included 70% of the observations (N=41580) and a test set that included 30% of the observations (N=17820). A series of k-means cluster analyses was conducted on the training data for k = 1 to 25 clusters, using Euclidean distance. The figure below shows the proportion of variance (r-square) in the clustering variables accounted for by each of the 25 cluster solutions. The plotted elbow curve guides the choice of the number of clusters to interpret.
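A hedged sketch of the split and the elbow computation follows. The data are synthetic, the variable names other than `clus_train` are placeholders, and r-square is computed here as one minus the within-cluster sum of squares divided by the total sum of squares, which is one common definition and may differ from the quantity plotted in the original figure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the standardized clustering variables.
rng = np.random.default_rng(0)
features = rng.normal(size=(300, 5))

# 70/30 split, mirroring the proportions used in the text.
clus_train, clus_test = train_test_split(features, test_size=0.3, random_state=123)

# Total sum of squared distances to the overall mean of the training data.
total_ss = ((clus_train - clus_train.mean(axis=0)) ** 2).sum()

# Fit k-means for each k and record the share of variance explained.
k_values = range(1, 11)
r_square = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=123).fit(clus_train)
    # inertia_ is the within-cluster sum of squared Euclidean distances.
    r_square.append(1 - model.inertia_ / total_ss)
```

Plotting `r_square` against `k_values` gives the elbow curve; the elbow is the point after which adding clusters yields only small gains.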

The elbow curve suggests that the 3- or 4-cluster solutions might be interpreted. Beyond 10 clusters there does not seem to be a major gain from adding more. The results reported below are for the 3-cluster solution. I chose three clusters because the target variable also has three categories. To find out how accurately k-means clustering can classify pump functionality, I compared the training labels with the labels obtained by k-means clustering. K-means performs poorly, with an accuracy of 33.07%. Below is the confusion matrix:

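```python
import sklearn.metrics

# Share of pumps whose cluster id equals their class label
sklearn.metrics.accuracy_score(targetTrain, clusterLabels)
# 0.3307

# Rows: true classes; columns: cluster ids
sklearn.metrics.confusion_matrix(targetTrain, clusterLabels)
# array([[ 7433, 14441,   625],
#        [  760,  2157,   139],
#        [ 3822,  8041,  4162]])
```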
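One caveat about this comparison: k-means numbers its clusters arbitrarily, so cluster 0 need not correspond to class 0. A possible check, sketched below, is to look for the one-to-one pairing of classes and clusters that maximizes agreement. The sketch uses the confusion matrix reported above and SciPy's `linear_sum_assignment`; this matching step is an addition for illustration and was not part of the original analysis.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Confusion matrix reported above: rows are true classes, columns are cluster ids.
cm = np.array([
    [7433, 14441, 625],
    [760, 2157, 139],
    [3822, 8041, 4162],
])

# Accuracy when cluster ids are taken at face value (the diagonal).
raw_accuracy = np.trace(cm) / cm.sum()

# Best one-to-one class-to-cluster pairing (Hungarian algorithm).
rows, cols = linear_sum_assignment(cm, maximize=True)
matched_accuracy = cm[rows, cols].sum() / cm.sum()
```

On this matrix the diagonal reproduces the reported 33.07%, and the best pairing raises the figure to roughly 46.6%, so the conclusion that k-means separates the three classes poorly still holds.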

The example assignment on Coursera uses canonical discriminant analysis (or principal component analysis, PCA) to reduce the clustering variables to a smaller, more interpretable set. I tried the same approach on my data even though the classification accuracy was low. Indeed, the two PCA components together explain only about 17% of the variance, which leaves little hope that PCA will help.

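```python
from sklearn.decomposition import PCA

# Project the standardized training data onto two principal components
pca = PCA(n_components=2)
pcaFit = pca.fit_transform(clus_train)

# Fraction of the total variance captured by each component
pca.explained_variance_ratio_
# array([ 0.09983554,  0.07262318])
```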

I also tried to reduce the variables using linear discriminant analysis (LDA). Dimensionality reduction might not make much sense since I only use three clusters. However, it could still be interesting to find the features that discriminate most among the clusters. Given the poor performance of the clustering, LDA does not add much. The plots below show the PCA and LDA solutions to the cluster classification.
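For reference, a minimal sketch of the LDA step on synthetic data is given below. With three groups LDA yields at most two discriminant axes. The ranking of features by absolute weight is an illustrative heuristic, and names such as `cluster_labels` and `top_features` are placeholders rather than names from the original code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in for the standardized training data.
rng = np.random.default_rng(0)
clus_train = rng.normal(size=(200, 6))

# Three k-means clusters serve as the groups to discriminate.
cluster_labels = KMeans(n_clusters=3, n_init=10, random_state=123).fit_predict(clus_train)

# Project onto the two discriminant axes available for three groups.
lda = LinearDiscriminantAnalysis(n_components=2)
lda_scores = lda.fit_transform(clus_train, cluster_labels)

# Rank features by the size of their weights on the discriminant axes;
# larger absolute weights suggest a bigger role in separating the clusters.
weights = np.abs(lda.scalings_[:, :2])
top_features = weights.sum(axis=1).argsort()[::-1]
```

Because the inputs are standardized, the weights are on a comparable scale, which is what makes this rough ranking meaningful.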

Lastly, the example assignment uses ANOVA to determine whether the groups yielded by the clustering differ from one another. I understand it is an important test, but I think it is not relevant in my case because the classification rate is as low as 33%. Had it been 93%, the test would have made sense, especially for identifying the discriminant features.

The full code is available on GitHub.