# Cheat Hunt [DAI III]

The title of this post is inspired by the title of a movie, Witch Hunt, which I have not seen, but I do like the sound of it. I decided to change the dataset I am exploring for the data management and visualization course (if you need an introduction, check this previous post). I changed it because it is not interesting to do the assignments with an already clean dataset. In fact, this week's assignment requires pure data management, that is, 1) identification and removal of missing values, 2) computation of new variables, etc. Since my previous dataset was already clean and only had three variables, I had nothing to do for the assignment. In the previous assignment I had already come up with a new variable, and I was not able to invent another one. But then I got a fantastic idea.

In week three of the data management and visualization course (DAI III) we are supposed to explore the dataset we are using. I thought of twisting this exploration into a search: a search for cheaters, hence the cheat hunt! How do I know the dataset contains cheaters? I do not! But I suspect there might be some. Why? Because the dataset I am using is part of a battery of tests I administered to 1213 first-year psychology students. It is a collection of three questionnaires with 35, 18 and 23 questions, respectively. Since the students complete the questionnaires in exchange for study credits, some of them might simply take the shortcut to the credits: fill in the questionnaires as quickly as possible, get the credits and move on. And here comes the exciting question of this exploration: can I spot the cheaters? Below I explain what I tried and whether I found any, but if you have other ideas, comment along.

```python
import pandas
import numpy
data = pandas.read_csv('allData', low_memory=False, sep = ' ')
# F_1 is character, should be numeric
data['F_1'] = pandas.to_numeric(data['F_1'], errors='coerce')
data.shape
>>> (1213, 79)
```
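To see what `errors='coerce'` does, here is a minimal sketch on a made-up series (the values are invented for illustration and are not from my dataset): anything that cannot be parsed as a number becomes NaN, which is why such entries later show up as missing values.

```python
import pandas

# made-up values for illustration only; 'x' stands for a stray non-numeric entry
raw = pandas.Series(['3', '7', 'x', '1'])

# errors='coerce' turns unparseable entries into NaN instead of raising an error
coerced = pandas.to_numeric(raw, errors='coerce')

# number of entries that could not be converted
print(coerced.isnull().sum())
```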

There we go: 1213 participants with 79 variables each. Now I will search for participants with missing values. Instead of recoding them I will exclude them from the study, since I do not know why the values are missing. In fact, I designed the study myself and did not include any condition in which missing values could occur, so the presence of missing values indicates that something went wrong. Since I have quite a few observations, I opt to remove those participants rather than keep them in with missing values. I will also exclude participants who scored a 0, since possible scores should be between 1 and 7. To identify and remove these participants I use a trick called ‘logical indexing’. It works as follows: search for missing values, marking each missing value as true and everything else as false. Then sum these values across the columns of each row (false is coded as 0 and true as 1, so a row with one missing value sums to 1 and a row with none sums to 0). Finally, use the resulting vector to delete all the participants whose sum is above 0. Here is how the procedure works:

```python
# find missing values (True where a cell is missing, False otherwise)
data.isnull()
# sum across the columns of each row (axis=1)
data.sum(axis=1, numeric_only=True)
# count the missing values in each row (True counts as 1, False as 0)
data.isnull().sum(axis=1)
# if a row has no missing values its sum is 0, so this flags the other rows
hasMissing = data.isnull().sum(axis=1) > 0
# indexing the dataframe with that boolean vector I should
# get in return the person(s) who have missing values
data['idx'][hasMissing]
>>> 816    817
# remove them from the dataframe
data = data[~hasMissing]
data.shape
>>> (1212, 79)
```
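The same logical indexing should also cover the participants who scored a 0, since valid answers run from 1 to 7. A minimal sketch on a made-up toy dataframe (the column names `q1` to `q3` and all the values are hypothetical, not my real data):

```python
import pandas

# toy data: three participants, three hypothetical answer columns
toy = pandas.DataFrame({'idx': [1, 2, 3],
                        'q1': [4, 0, 7],
                        'q2': [2, 5, 1],
                        'q3': [6, 3, 2]})
answerCols = ['q1', 'q2', 'q3']

# True for every participant who has at least one 0 among the answers
hasZero = (toy[answerCols] == 0).sum(axis=1) > 0

# keep only the participants without a 0
toyClean = toy[~hasZero]
print(toyClean['idx'].tolist())
```

Restricting the comparison to the answer columns matters: an identifier or a coded background variable could legitimately be 0.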

BRILLIANT! The same idea can be applied to other types of (numeric) data to check, for example, whether there are people who filled in only 1s, only 2s, only 7s, etc. Such a sequence of responses is probably not genuine, so below I write a bit of code to identify people who might have responded in such a way (i.e., cheaters!).

```python
# 3 arrays with indexes of the answers (e.g., answer to question 1, 2, ... 76)
Q1 = numpy.arange(1-1, 35)
Q2 = numpy.arange(35 + 1 -1, 35 + 18)
Q3 = numpy.arange(35 + 18 + 1 -1, 35 + 18 + 23)
# NOTE: dum and nObs were not defined in the snippet as posted. Assumption:
# dum is a 2-D array with one row per participant whose first 76 columns
# are the answers in questionnaire order; nObs is the number of participants.
dum = data.values
nObs = dum.shape[0]
# prepare a dictionary of lists to store the cheaters (if any)
cheaters = {'subIdx':[], 'value':[], 'quest':[]}
# loop through all possible responses (i.e., 1-7)
for possibleScore in range(1, 8):
    Q1Ans = numpy.ones(35) * possibleScore
    Q2Ans = numpy.ones(18) * possibleScore
    Q3Ans = numpy.ones(23) * possibleScore
    # loop through all the participants
    for subIdx in range(0, nObs):
        # check questionnaire 1
        if (sum(dum[subIdx][Q1] == Q1Ans) == 35):
            cheaters['subIdx'].append(subIdx)
            cheaters['value'].append(possibleScore)
            cheaters['quest'].append('Q1')
        # check questionnaire 2
        if (sum(dum[subIdx][Q2] == Q2Ans) == 18):
            cheaters['subIdx'].append(subIdx)
            cheaters['value'].append(possibleScore)
            cheaters['quest'].append('Q2')
        # check questionnaire 3
        if (sum(dum[subIdx][Q3] == Q3Ans) == 23):
            cheaters['subIdx'].append(subIdx)
            cheaters['value'].append(possibleScore)
            cheaters['quest'].append('Q3')

df = pandas.DataFrame(cheaters)
```

It is a long piece of code, but what it does is simple. It loops through the range of possible answers (1, 2, 3, 4, 5, 6, 7). For each questionnaire it creates an array repeating the given answer (e.g., 1, 1, 1, 1, … , 1). Then it compares that array, element by element, with the answers the participant gave to that questionnaire. Every matching answer yields a true, which counts as 1, so the comparison sums to 35, 18 or 23, respectively, only if a participant responded to a given questionnaire with only 1s, or only 2s, and so on up to only 7s. Neat, isn’t it? Nicely enough, the ‘algorithm’ also checks each questionnaire separately, in case participants started out willingly, giving genuine answers, and got bored (evil) along the way. Were there cheaters?

```
print(df)
# columns below: row number, quest (questionnaire), subIdx (participant), value (repeated answer)
0     Q1      67      1
1     Q2      67      1
2     Q3      67      1
3     Q1     131      1
4     Q2     160      1
5     Q2     181      1
6     Q2     187      1
7     Q2     229      1
8     Q2     277      1
9     Q2     355      1
10    Q2     357      1
11    Q2     394      1
12    Q2     501      1
13    Q1     520      1
14    Q2     520      1
15    Q3     520      1
16    Q2     612      1
17    Q2     614      1
18    Q2     702      1
19    Q2     939      1
20    Q2    1158      1
21    Q1     394      2
22    Q1     564      2
23    Q3     608      2
24    Q1     144      3
25    Q2     245      3
26    Q2     486      3
27    Q2     564      3
28    Q2     602      3
29    Q2     823      3
30    Q1    1207      3
31    Q2    1207      3
32    Q2      27      4
33    Q3     245      4
34    Q3     486      4
35    Q3     823      4
36    Q3    1207      4
37    Q1     533      5
38    Q2     533      5
39    Q3     564      5
40    Q3      27      6
41    Q3     195      6
42    Q3     533      7
```
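The same check can be written without looping over every possible score and every participant: a participant gave the same answer throughout a questionnaire exactly when every answer in it equals the first one. A minimal sketch of that alternative on made-up data (the array `answers`, the tiny block sizes and the helper `find_straightliners` are hypothetical and only illustrate the idea; this is not the code that produced the table above):

```python
import numpy

def find_straightliners(answers, blocks):
    """Return (row, questionnaire, value) for every block answered with a single value."""
    found = []
    for name, cols in blocks.items():
        block = answers[:, cols]
        # compare every answer in the block with the first answer of the same row
        same = (block == block[:, [0]]).all(axis=1)
        for row in numpy.flatnonzero(same):
            found.append((int(row), name, int(block[row, 0])))
    return found

# made-up answers: 3 participants, two tiny questionnaires of 3 and 2 questions
answers = numpy.array([[1, 1, 1, 4, 5],
                       [2, 3, 2, 6, 6],
                       [7, 7, 7, 7, 7]])
blocks = {'Q1': numpy.arange(0, 3), 'Q2': numpy.arange(3, 5)}

print(find_straightliners(answers, blocks))
```

Unlike the loop over the scores 1 to 7, this version flags any constant block, whatever its value, so out-of-range answers should be removed first.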

Maybe I should actually analyze the cheaters, rather than the data set…
Any other comments or ideas on how to identify cheaters?