Parallel coordinates can be very helpful for understanding relationships among more than two variables. The first time I encountered parallel coordinates I did not understand their potential, until I saw Alberto Cairo’s slopegraph. In that slopegraph Cairo color-coded the direction of each line: red lines go up, blue lines go down, and the linear relationship between education level and obesity becomes crystal clear. Cairo’s slopegraph is also interesting because it shows the importance of plotting all the points rather than an average line: had only the average line been plotted, no relationship between the variables would have been apparent.
I decided to try Cairo’s approach myself with parallel coordinates, but I was soon disappointed: the parallel coordinates function I found in R did not color-code the direction of the slopes. After some trial and error, however, I understood what I was doing wrong. Below is how I got the function to plot the way I wanted, but first a short intro to the dataset I used.
The dataset has three columns: 1) measured average reaction times [RT], 2) measured percentage of correct responses [acc] and 3) measured average response to a questionnaire [Quest]. In total there were 68 respondents. Dataset and R code are available on github.
require(MASS) # contains parcoord, the parallel coordinates plot function
meanBehav = read.table('data', header = TRUE)
parcoord(meanBehav)
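In case the 'data' file from github is not at hand, a purely synthetic stand-in with the same column names lets you run the snippets below. The ranges here are invented and do not reflect the real measurements:

```r
# Synthetic stand-in for the github dataset -- illustrative only.
# Column names match the real data; the values and ranges are invented.
set.seed(1)
meanBehav = data.frame(
  RT    = rnorm(68, mean = 500, sd = 60),  # made-up reaction times (ms)
  acc   = runif(68, min = 60, max = 100),  # made-up percentage correct
  Quest = rnorm(68, mean = 50, sd = 15)    # made-up questionnaire scores
)
```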
Plotted like this, the graph looks quite crude. But adding color is simple:
colVect = 1 + (0 : (nrow(meanBehav) - 1)) %/% round(nrow(meanBehav) / 2) # 2 groups
parcoord(meanBehav, col = colVect)
Adding labels to the plot helps in identifying the range of the displayed data. Moreover, one can invert the order of the variables very easily; this might help when trying to explore relationships among variables, or in highlighting one relationship over others.
colVect = 1 + (0 : (nrow(meanBehav) - 1)) %/% round(nrow(meanBehav) / 3) # 3 groups
parcoord(meanBehav[, c('acc', 'Quest', 'RT')], var.label = TRUE, col = colVect)
The graph is difficult to interpret: all the lines seem to go in every direction and the colors seem to be picked randomly. But that is not the case; the colors follow the order of the rows in the dataset: the first third of the rows is black, the second third is red, and the last third is green. Since that order is specified in the dataset, not in the plot, the first third of the lines on the first axis is not black, the second third red and the last third green. Interpreting the color coding (and a potential relationship?) would be simpler if the colors were sorted according to one of the plotted variables. Then we could expect 1) black lines for the smallest values along the given variable, 2) red lines for the middle values and 3) green lines for the highest values, or vice versa. Ordering the dataset along one variable allows this transformation and (hopefully) helps interpretation:
orderedQuest = sort(meanBehav$Quest, index.return = TRUE)$ix
parcoord(meanBehav[orderedQuest, c('Quest', 'acc', 'RT')], var.label = TRUE, col = colVect)
Note that I have also reordered the columns so that they start with ‘Quest’, the variable used by the sort command. This way the first axis shows black lines first, then red, then green.
Maybe now a more interesting pattern is emerging. The majority of the black lines go upward and then downward. The green lines appear to have a wider spread than the red and black lines, and the red lines seem to be the most narrowly spread of the three. But again there is no clear relationship between the behavioral performance (RT and accuracy) and the subjective measure (response to the questionnaire).
And what about color-coding the direction of the lines? This requires a few more steps, but it is straightforward. I rescaled each value using the minimum and maximum of its variable, compared the rescaled value with the rescaled value of the same respondent in the other variable, and stored the result in a vector of ‘up’ and ‘down’ values. Then I recoded those values into colors to plot. (NOTE: one could skip creating the ‘up’/‘down’ vector, but it makes it easy to double-check that the procedure is working.) This operation is complex to describe, but the code is actually quite simple:
dirSlopes = ifelse((meanBehav$RT - min(meanBehav$RT)) / max(meanBehav$RT) <
                   (meanBehav$Quest - min(meanBehav$Quest)) / max(meanBehav$Quest),
                   'up', 'down')
colDir = ifelse(dirSlopes == 'up', 4, 2) # 4 = blue (up), 2 = red (down)
parcoord(meanBehav[ , c('RT', 'Quest', 'acc')], var.label = TRUE, col = colDir)
legend('top', legend = c('up', 'down'), text.col = c(4, 2), bty = 'n')
In spite of everything the display remains pretty dense. To reduce the complexity I thought of assigning a dotted line type to the ‘flat’ lines, i.e. the lines with a slope smaller than 5%, since I regard flat lines as less interesting than sloped ones. The procedure to “hide” the flat lines boils down to 1) computing the rescaled values of each point for both variables, 2) subtracting the two, and 3) assigning the dotted type wherever the difference lies between minus 0.05 and plus 0.05.
tmp = cbind((meanBehav$RT - min(meanBehav$RT)) / max(meanBehav$RT),
            (meanBehav$Quest - min(meanBehav$Quest)) / max(meanBehav$Quest))
flats = (tmp[, 1] - tmp[, 2]) > -.05 & (tmp[, 1] - tmp[, 2]) < .05
linesType = rep('solid', nrow(meanBehav))
linesType[flats] = 'dotted'
parcoord(meanBehav[ , c('RT', 'Quest', 'acc')], var.label = TRUE, col = colDir, main = 'directions', lty = linesType)
Of course this could be extended further with line thickness, or with a more complex expression that also considers, for example, the slope of the second segment. But 1) I wanted to keep it simple and 2) it is trivial to extend this approach to include one or more dimensions.
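As a sketch of the line-thickness idea (my own extension, not part of the walkthrough above): the same rescaled difference used for the colors can be mapped onto line widths, so that steeper segments are drawn thicker. The toy data here merely stands in for meanBehav; use the real data frame if you have it:

```r
require(MASS)
set.seed(2)
# Toy stand-in for meanBehav -- substitute the real data frame if available
meanBehav = data.frame(RT    = rnorm(68, 500, 60),
                       acc   = runif(68, 60, 100),
                       Quest = rnorm(68, 50, 15))
relRT    = (meanBehav$RT - min(meanBehav$RT)) / max(meanBehav$RT)
relQuest = (meanBehav$Quest - min(meanBehav$Quest)) / max(meanBehav$Quest)
colDir   = ifelse(relRT < relQuest, 4, 2)          # 4 = blue (up), 2 = red (down)
slopeMag = abs(relRT - relQuest)                   # steepness of the RT -> Quest segment
lineWidths = 0.5 + 2.5 * slopeMag / max(slopeMag)  # map onto widths between 0.5 and 3
parcoord(meanBehav[, c('RT', 'Quest', 'acc')], var.label = TRUE,
         col = colDir, lwd = lineWidths)
```

The lwd argument is simply passed through parcoord to matplot, so no extra machinery is needed.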
I would like, however, to end with parallel coordinates in combination with clustering. For me this combination was quite instructive, because I finally saw clearly the importance of scaling before applying a clustering algorithm. When I simply clustered the data and plotted the results on parallel coordinates, everything was driven by the questionnaire measure. After being joyful for a day or two, thinking that I had actually found something interesting in my data, my eye fell on the labels of the plot and I saw the enormous difference in scales among the variables. The ‘Quest’ variable was driving all the clustering because its scale was so much bigger than that of the other two variables. Needless to say, after scaling the data before clustering, the dominance of the questionnaire measure was gone. In spite of having to give up my sensational finding, this discovery was very instructive for me. Below is the code for color-coded parallel coordinates based on cluster results, with and without scaling.
clusterR = hclust(dist(meanBehav))
colDir = cutree(clusterR, 3) + 1
par(mfrow = c(1, 2))
parcoord(meanBehav, var.label = TRUE, col = colDir, main = 'unscaled')
scaledVals = scale(meanBehav[ , c('RT', 'Quest', 'acc')])
scaledClusters = hclust(dist(scaledVals))
colDir = cutree(scaledClusters, 3) + 1
parcoord(scaledVals, var.label = TRUE, col = colDir, main = 'scaled')
par(mfrow = c(1, 1))
Did you notice the dominance of the Quest variable in the definition of the clusters? It is apparent when looking at the Quest column in the unscaled graph: the three groups change color according to the magnitude of the Quest values (and without any ordering of the variable, as done above).
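A quick toy check of this effect (synthetic data, not the real dataset): when one column’s scale dwarfs the others, the Euclidean distances, and hence the hierarchical clustering, are essentially determined by that column alone, and scale() removes the imbalance:

```r
set.seed(3)
toy = data.frame(small1 = rnorm(60),            # unit scale
                 small2 = rnorm(60),            # unit scale
                 big    = rnorm(60, sd = 100))  # dwarfs the other two columns
# Unscaled: the pairwise distances are dominated by 'big',
# so the resulting clusters essentially track 'big' alone
groupsRaw = cutree(hclust(dist(toy)), 3)
# Scaled: every column now has standard deviation 1,
# so no single variable dominates the distances anymore
groupsScaled = cutree(hclust(dist(scale(toy))), 3)
apply(scale(toy), 2, sd)
```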
Next step? Adding interactivity with d3.js.