Loading data and frequencies [DAI-II]

This is the second assignment for the Coursera Data Management and Visualization ‘challenge’ (here is the introduction). I call the course a challenge because it has been a bit challenging to keep up with the weekly deadlines (and this is only the second week), but I am happy that I am still submitting the assignments close to the deadline. The goal of this second assignment is to load the data set and explore it with some descriptive statistics. Below I adapted the sample text for this assignment.

A sample of 68 young adults performed a comparative visual search task and completed an online version of a perfectionism questionnaire. The following code loads the data set.

import pandas

# Load the space-separated data file into a DataFrame
data = pandas.read_csv('data', low_memory=False, sep=" ")
print(data.columns)

print(len(data))          # number of observations (rows)
print(len(data.columns))  # number of variables (columns)
print(len(data['RT']))    # number of observations in the RT column
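
Before computing any frequencies, it also helps to peek at the first rows and at the types pandas inferred for each column. A minimal sketch, assuming the same data frame as above:

print(data.head())    # first five rows
print(data.dtypes)    # inferred type of each column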

The data represent averaged scores per participant, so presenting them as frequencies or proportions is not very useful: there is mostly one observation per value.

print('counts for RT')
c1 = data['RT'].value_counts(sort=False)
print(c1)

print('percentages for RT')
p1 = data['RT'].value_counts(sort=False, normalize=True)
print(p1)

print('counts for acc')
c2 = data['acc'].value_counts(sort=False)
print(c2)

print('percentages for acc')
p2 = data['acc'].value_counts(sort=False, normalize=True)
print(p2)

print('counts for Quest')
c3 = data['Quest'].value_counts(sort=False)
print(c3)

print('percentages for Quest')
p3 = data['Quest'].value_counts(sort=False, normalize=True)
print(p3)
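
A quick way to confirm that these frequency tables are uninformative is to count unique values directly: if nearly every value is unique, almost every count is 1. A small check, assuming the column names used above:

# If nunique() is close to the number of rows, the frequency table is mostly 1s
print(data['RT'].nunique(), 'unique RTs out of', len(data), 'rows')
print(data['acc'].nunique(), 'unique accuracies out of', len(data), 'rows')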

Grouping variables would also not make sense, since there are no groups. At this point the insights from this exploration are rather uninteresting, so I will try to make it more interesting by defining groups based on the perfectionism scores. A quick-and-dirty way to create groups from questionnaire scores is a median split (see McClelland et al. 2015 for a discussion of the appropriateness of this method). Participants scoring above the median form the ‘high’ perfectionism group, those scoring at or below it the ‘low’ group.

# True for participants scoring above the median questionnaire score
data['highPerf'] = data['Quest'] > data['Quest'].median()
print(data['highPerf'].value_counts())
print(data.groupby('highPerf').size() * 100 / len(data))  # group sizes as percentages
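
For more readable output, the boolean flag can also be mapped onto explicit labels; a small sketch (the perfGroup column name is just my choice, not part of the data set):

# Hypothetical helper column with 'high'/'low' labels instead of True/False
data['perfGroup'] = data['highPerf'].map({True: 'high', False: 'low'})
print(data['perfGroup'].value_counts())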

36 participants score at or below the median of 95.0 and 32 score above it, corresponding to 53% and 47% of the participants, respectively (with the strict greater-than comparison, scores equal to the median fall into the ‘low’ group). After splitting the data into two groups, distributions can be computed for each group. Since frequency distributions within these groups are also uninteresting (the observations remain mostly unique), I computed mean reaction time and accuracy for each group:

# Mean accuracy and reaction time for the 'high' perfectionism group
highGr = data[data['highPerf'] == True]
print(highGr['acc'].mean())
print(highGr['RT'].mean())

# Mean accuracy and reaction time for the 'low' perfectionism group
lowGr = data[data['highPerf'] == False]
print(lowGr['acc'].mean())
print(lowGr['RT'].mean())
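
The same summaries can be computed in one step with groupby, which avoids the two manual subsets; a sketch assuming the columns used throughout this post:

# Mean RT and accuracy for the low (False) and high (True) perfectionism groups
print(data.groupby('highPerf')[['RT', 'acc']].mean())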

Python code and data are here. This previous post describes how I used parallel coordinates in R to display the same data set.
