Quantitative Reasoning & Problem Solving
© 2014 Pacific Crest
Step: 6. Clean the data
Explanation: What can you do with missing values, values outside of valid ranges, and other potential problems?
  1 USA Today: No changes needed.
  2 ESRL: Data is present for the continuous period; no changes needed.
  3 CDIAC: The scale differences between the studies (50 years of daily readings with ESRL and 1,000-year ‘chunks’ with CDIAC) don’t matter for my purposes. No changes needed.

Step: 7. Determine readiness
Explanation: Does the data meet the following parameters?
  1. Meets your needs
  2. Generated in an unbiased way
  3. Has been adjusted to meet your criteria for use
  1 USA Today: 1. Yes 2. Seems to be 3. Didn’t need to be
  2 ESRL: 1. Yes 2. Seems to be 3. Yes (with graphs)
  3 CDIAC: 1. Yes 2. Seems to be 3. Yes (with graphs)
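
The questions asked in Step 6 (missing values, values outside of valid ranges) can be checked mechanically before you decide whether any cleaning is needed. The Python sketch below is one possible way to do that, not a prescribed method. It assumes the readings are held in a list in which None marks a missing value. The function name audit_readings, the valid range, and the sample values are all hypothetical and are not taken from the USA Today, ESRL, or CDIAC data.

```python
from typing import Optional, Sequence


def audit_readings(
    readings: Sequence[Optional[float]],
    valid_min: float,
    valid_max: float,
) -> dict:
    """Report where values are missing or out of range; the data is not modified."""
    # Positions where no value was recorded (None is the assumed missing marker).
    missing = [i for i, value in enumerate(readings) if value is None]
    # Positions where a value exists but falls outside the assumed valid range.
    out_of_range = [
        i
        for i, value in enumerate(readings)
        if value is not None and not (valid_min <= value <= valid_max)
    ]
    return {
        "missing": missing,
        "out_of_range": out_of_range,
        "needs_cleaning": bool(missing or out_of_range),
    }


# Made-up readings for illustration only (hypothetical, not from any real source).
sample = [315.2, None, 316.0, -999.0, 317.4]
report = audit_readings(sample, valid_min=0.0, valid_max=1000.0)
print(report)
```

The function only flags positions; it does not delete or alter anything. Deciding what to do with a flagged value (remove it, keep it, or ask the people who collected it) remains a judgment call that belongs to the person cleaning the data.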
Oops! Avoiding Common Errors
● Accepting the data as-is instead of challenging its quality
Example: Survey data contains partial results (participants chose to fill out only what mattered to them), and the parts that were filled out are not consistent from one participant to the next.
Why? The way the survey was filled out could be very biased, with participants focusing on the areas that best matched their self-interest. It is important to constantly ask “Why?” when looking at data. If you can identify the reasons and rationale for how the survey was conducted, the data generated from that survey has greater validity for your use.
● Errors in the data continue after cleanup is completed
Example: A scientific team collected continuous data, but seemingly forgot to remove the calibration data. You’re nearly certain that the first three readings should be identified as calibration data (the data obtained while using the measuring equipment in its default and zeroed state) because the readings, across every data column, are identical.
Why? Throwing out anomalies is bad science, but keeping verifiable errors in the data leads to incorrect inferences and conclusions. This is an easy problem to fix; you simply check with any of the team members and verify that those three readings were equipment tests and should be removed during your data cleaning process.
● Missing metadata
Example: According to the US Census Bureau, median American household income in 2000 was $41,262 per year. In 2012, it was $50,099. Obviously median American households were better off in 2012 than they were 12 years earlier.
Why? The metadata must help establish a consistency in the data so that comparisons of data across time, space, iterations, groups, etc., can be made. This means the relationship