© 2014 Pacific Crest
5.2 Using Data from Other Sources
Purpose
The role this topic plays in quantitative reasoning
The large amount of existing data available to us means that data can play a significant role in quantitative
reasoning. The ability to effectively obtain and use data produced by others is a critical skill when it
comes to solving quantitative problems.
Data warehouses are used for supporting decision making. It is important for these data warehouses to
ensure their data is accurate and valid to avoid incorrect conclusions being drawn. Duplicated or missing
information often produces incorrect or misleading statistics, for instance. The time-honored truism of
“garbage in, garbage out” applies when it comes to data.
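To see how duplicated or missing entries can distort a statistic, consider the short Python sketch below. The records, the variable names, and the choice to treat a missing value as zero in the "naive" calculation are all hypothetical and chosen only for illustration.

```python
from statistics import mean

# Hypothetical survey records: (respondent_id, hours_of_sleep).
# Respondent 3 was entered twice; respondent 4 has no value recorded.
records = [(1, 7.0), (2, 6.0), (3, 10.0), (3, 10.0), (4, None)]

# Naive calculation: keep the duplicate and count the missing value as 0.
naive_values = [0.0 if hours is None else hours for _, hours in records]
naive_mean = mean(naive_values)

# Cleaned calculation: keep one record per respondent, drop missing values.
unique = {}
for respondent_id, hours in records:
    unique.setdefault(respondent_id, hours)  # first record for each id wins
clean_values = [hours for hours in unique.values() if hours is not None]
clean_mean = mean(clean_values)

print(f"naive mean: {naive_mean:.2f}")  # 6.60
print(f"clean mean: {clean_mean:.2f}")  # 7.67
```

The same five rows give two different averages, and the only difference is how the duplicate and the missing value were handled.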
When obtaining data from different sources, it is important that you first verify that the data meets your
needs. The next step is to identify its structure and its identifiers. Finally, you must clarify any issues
with the data.
Data is often very discipline-dependent. For example, data generated and used in the field of medicine
has properties that differ from the properties of data used in mathematics research. Medical data is often
dependent on many real-life variables, and errors in that data can be life-threatening. When using data
across disciplines, we must be careful to determine how transferable the data is.
Before data is used, it must go through a cleanup. This process includes reviewing the data to ensure
that values are in the appropriate range, that the number of observations covers the dimensions expected,
and that the number of variables anticipated is present.
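The three cleanup checks just described can be sketched in a few lines of Python. This is one possible approach, not a standard procedure: the sample rows, the expected column names, the expected row count, and the valid ranges are all assumptions made up for this example, and check_data is a hypothetical helper.

```python
# Hypothetical data set: each dictionary is one observation.
rows = [
    {"id": 1, "age": 34, "height_cm": 172.0},
    {"id": 2, "age": 29, "height_cm": 1.65},    # possibly recorded in meters
    {"id": 3, "age": 212, "height_cm": 180.0},  # age outside a plausible range
]

# Assumed expectations about the data (not taken from any real source).
expected_columns = {"id", "age", "height_cm"}
expected_row_count = 4
valid_ranges = {"age": (0, 120), "height_cm": (50, 250)}


def check_data(rows, expected_columns, expected_row_count, valid_ranges):
    """Return a list of human-readable issues found in the rows."""
    issues = []
    # Check 1: is the number of observations what we expected?
    if len(rows) != expected_row_count:
        issues.append(
            f"expected {expected_row_count} observations, found {len(rows)}"
        )
    for i, row in enumerate(rows):
        # Check 2: are all anticipated variables present?
        missing = expected_columns - row.keys()
        if missing:
            issues.append(f"row {i}: missing variables {sorted(missing)}")
        # Check 3: is each value inside its appropriate range?
        for name, (low, high) in valid_ranges.items():
            value = row.get(name)
            if value is not None and not (low <= value <= high):
                issues.append(f"row {i}: {name}={value} outside [{low}, {high}]")
    return issues


issues = check_data(rows, expected_columns, expected_row_count, valid_ranges)
for issue in issues:
    print(issue)
```

A check like this only flags values for review. Deciding whether a flagged height is a units problem or a data-entry error still requires judgment and knowledge of how the data was collected.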
Learning Goals
What you should learn while completing this activity
1. Determine if your purpose for use of the obtained data aligns with the original purpose behind its
generation.
2. Identify all the issues contained in a set of data.
3. Use various techniques for cleaning and validating data.
4. Document the remaining issues in the metadata for others to consider when they use the data.
Discovery
Finding out for yourself
From the companion website, pick an interesting data set for exploration. With your data, determine the
following:
1. What does each column represent?
2. Who collected the data and why?
3. Are the values for each variable within the expected range?
4. Are there missing values?
5. What units are represented by each variable?
6. What types of bias may exist with the data?
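If your data set is a CSV file, a short script can give you a first look at questions 3 and 4. The sketch below is only a starting point: the sample text, its column names, and the profile helper are hypothetical, and a blank cell is assumed to mean a missing value.

```python
import csv
import io

# Hypothetical CSV contents; in practice you would read your own file.
sample_csv = """city,temp_f,rainfall_in
Portland,54,1.2
Boise,,0.3
Reno,61,
"""


def profile(csv_text):
    """Count missing cells and find the min and max of each numeric column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    summary = {name: {"missing": 0, "values": []} for name in reader.fieldnames}
    for row in reader:
        for name, raw in row.items():
            if raw is None or raw.strip() == "":
                summary[name]["missing"] += 1  # blank cell treated as missing
                continue
            try:
                summary[name]["values"].append(float(raw))
            except ValueError:
                pass  # non-numeric text such as a city name
    report = {}
    for name, info in summary.items():
        values = info["values"]
        report[name] = {
            "missing": info["missing"],
            "min": min(values) if values else None,
            "max": max(values) if values else None,
        }
    return report


report = profile(sample_csv)
for name, stats in report.items():
    print(name, stats)
```

The remaining questions, such as who collected the data, what the units are, and what bias may exist, cannot be answered by a script. Look for them in the documentation that accompanies the data.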