First of all, let me make clear that this post is about identifying cheaters who fill in questionnaires with fictitious answers. It does not describe how to determine whether your (or your friend’s) lover is cheating on you (or your friend). The method I describe below will not work for that unless, of course, you possess data with similar characteristics: a matrix filled in with values ranging between 1 and 7. At the bottom of the post I discuss whether sex has anything to do with someone’s predisposition to cheat, but by sex I mean gender differences. Nothing else. There I report the proportion of men and the proportion of women who cheated on an on-line questionnaire. Beware that a gender predisposition to cheat on an on-line questionnaire is unrelated to the cheating participant’s sexual escapades. Unless, of course, a trait cheater is similar to a state cheater.
This post describes the assignment for week four of the data analysis and interpretation Coursera course that I am currently following. In a previous post I described how I got to this dataset of cheaters (i.e., people who provided fictitious answers to an on-line questionnaire they were asked to respond to). First of all I will determine how many participants cheated:
    import numpy
    print(len(numpy.unique(cheaters['subIdx'])))
    >>> 28
28 out of 1198 is not that bad (it’s 2.3%). Still disappointing that they cheated, but not too bad. And hey, I am having fun because of those 28 persons, so I guess I should be thankful?
Now let’s check which cheaters are the worst, defined by the number of questionnaires in which they cheated:
    # 1 - count the rows (one per questionnaire cheated on) for each participant
    tmp = df['subIdx'].value_counts()
    # 2 - summarize the counts per number of questionnaires cheated on
    unique, counts = numpy.unique(tmp.values, return_counts=True)
    print(numpy.asarray((unique, counts)).T)
    >>>array([[ 1, 18],
              [ 2,  5],
              [ 3,  5]])
18 participants cheated in one questionnaire, 5 in two and 5 in three. I am puzzled by the number of people who cheated in only one questionnaire, though. Why bother cheating on one questionnaire only? Now I am intrigued, and I would like to determine whether they cheated in the last questionnaire (questionnaire number 3). Cheating on questionnaire number three could imply that participants got bored by filling in the previous two, so it would be understandable if they filled in the third with fictitious answers.
To identify the questionnaire in which these participants cheated we need to loop through ‘tmp’, because ‘df’ and ‘tmp’ have different lengths.
    # initialize an empty list to store the responses
    quests = []
    for idx in tmp.index[tmp.values == 1]:
        quests.append(df['quest'][df['subIdx'] == idx].values)
    unique, counts = numpy.unique(quests, return_counts=True)
    print(numpy.asarray((unique, counts)).T)
    >>>[['Q1' '2']
     ['Q2' '14']
     ['Q3' '2']]
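The same tally can probably be done without an explicit loop. The sketch below is only an illustration: `toy` is a hypothetical stand-in for ‘df’ with made-up rows, and it assumes ‘df’ holds one row per participant and questionnaire flagged as cheating.

```python
import pandas as pd

# Hypothetical toy data with the same two columns used in this post;
# the values are invented purely for illustration.
toy = pd.DataFrame({
    'subIdx': [3, 3, 7, 12, 12, 12, 20, 31],
    'quest': ['Q1', 'Q2', 'Q2', 'Q1', 'Q2', 'Q3', 'Q3', 'Q2'],
})

# Number of flagged questionnaires per participant
n_cheated = toy['subIdx'].value_counts()

# Ids of the participants flagged exactly once
once = n_cheated[n_cheated == 1].index

# Keep only their rows and count the questionnaire they cheated on
quest_counts = toy.loc[toy['subIdx'].isin(once), 'quest'].value_counts()
print(quest_counts)
```

Selecting the rows with `isin` sidesteps the length mismatch between the two objects, because the filter is applied to the rows of the data frame itself.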
That is interesting! Only 2 participants cheated on the last questionnaire, 2 cheated on the first one, and 14 on the second. Of course, it might be that they cheated in a different way in the other two questionnaires. Looking at the specific responses of these people might reveal new cheating patterns, in the sense of sequences of numbers which do not represent genuine responses. But I will leave this for another time. Instead, let’s take a quick look at whether there was a preferred response among the cheaters.
    df['value'].value_counts()
    >>>
    1    21
    3     8
    4     5
    5     3
    2     3
    6     2
    7     1

    df['value'].value_counts(normalize=True)
    >>>
    1    0.488372
    3    0.186047
    4    0.116279
    5    0.069767
    2    0.069767
    6    0.046512
    7    0.023256
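As a rough check on whether these counts depart from what evenly spread answers would give, one could compute a chi-square goodness-of-fit statistic by hand. This is only a sketch: it treats the 43 responses as independent, which is an assumption (some come from the same participant), and the critical value 12.592 (5% level, 6 degrees of freedom) is taken from standard chi-square tables, not computed here.

```python
# Response counts reported above (response value -> times chosen)
observed = {1: 21, 2: 3, 3: 8, 4: 5, 5: 3, 6: 2, 7: 1}

n = sum(observed.values())            # total number of responses
expected = n / float(len(observed))   # equal expected count per option

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi2 = sum((o - expected) ** 2 / expected for o in observed.values())

# Assumption: 5% critical value for 6 degrees of freedom (7 options - 1),
# taken from a standard chi-square table.
CRITICAL_5PCT_DF6 = 12.592

print(chi2)
print(chi2 > CRITICAL_5PCT_DF6)
```

A statistic well above the critical value would point to the responses not being spread evenly over the seven options, in line with the dominance of one value in the counts.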
Response 1 was the most common. Responses 3 and 4 followed as the next most frequent, probably because they sit in the middle of the list of options; then came the other response numbers. But now let’s explore the more exciting question: did the gender of the participants play a role in their predisposition toward cheating?
    # find the unique ids of the cheaters and match them to the
    # 'data' database containing the sex information.
    data['sex'][df['subIdx'].unique()].value_counts()
    >>>
    f    19
    m     8
19 females and 8 males. Although these values might suggest that women cheated more than men, they are likely a reflection of the fact that more women than men completed the questionnaires.
    data['sex'].value_counts()
    >>>
    f    849
    m    349
Transforming those numbers into proportions yields a more reliable estimate of the proportion of men and women who cheat.
    prop = data['sex'][df['subIdx'].unique()].value_counts() / data['sex'].value_counts()
    print(prop)
    >>>
    f    0.022379
    m    0.022923
2.24% vs. 2.29% suggests that there is not much of a difference between women and men in cheating predisposition. At least when it comes to filling in on-line questionnaires. Graphically this can be represented as:
    import matplotlib.pyplot as plt

    # start with the share of women among all participants, then
    # append the proportion of women who cheated
    meansWomen = [data['sex'].value_counts(normalize=True)['f']]
    meansWomen.append(prop['f'])
    # same for men
    meansMen = [data['sex'].value_counts(normalize=True)['m']]
    meansMen.append(prop['m'])

    nPercent = 2
    index = numpy.arange(nPercent)
    fig, ax = plt.subplots()
    barWidth = 0.35
    opacity = 0.4

    # one pair of bars (men, women) per type of proportion
    rects1 = plt.bar(index, meansMen, barWidth,
                     alpha=opacity, color='b', label='Men')
    rects2 = plt.bar(index + barWidth, meansWomen, barWidth,
                     alpha=opacity, color='r', label='Women')

    plt.xlabel('Type of proportion')
    plt.ylabel('Proportion')
    plt.title('Scores by Proportion Type and Gender')
    plt.xticks(index + barWidth,
               ('relative to gr. cheaters', 'relative to ALL participants'))
    plt.legend()
    plt.tight_layout()
    plt.show()
Plotting the two types of proportion together shows that there are no gender differences among cheaters when the proportion is relative to the total number of participants.
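To put a number on how small the gap between 2.24% and 2.29% is, one option is a two-proportion z-test on the counts reported above (19 of 849 women, 8 of 349 men). The sketch below is my own addition, not part of the course assignment; with only 8 male cheaters the normal approximation is rough, so read the result as indicative only.

```python
import math

# Counts reported above: cheaters and totals per gender
cheat_f, total_f = 19, 849
cheat_m, total_m = 8, 349

p_f = cheat_f / float(total_f)
p_m = cheat_m / float(total_m)

# Pooled proportion under the null hypothesis of equal cheating rates
p_pool = (cheat_f + cheat_m) / float(total_f + total_m)

# Standard error of the difference between the two proportions
se = math.sqrt(p_pool * (1 - p_pool) * (1.0 / total_f + 1.0 / total_m))

z = (p_f - p_m) / se

# Two-sided p-value from the standard normal distribution
p_value = math.erfc(abs(z) / math.sqrt(2))

print(z)
print(p_value)
```

A z value close to zero and a large p-value would be consistent with the reading that women and men cheated at about the same rate in this sample.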