Know your data
Overview
Teaching: 5 min
Exercises: 5 min
Questions
What information is in our data?
What kinds of questions can our data answer?
What kinds of questions cannot be answered by our data?
Objectives
Think about the data and its potential uses.
What the data says and what it doesn’t say
The data
Each row includes the following information:
- Branch ID
- Branch Name
- Number of Holds
- Title
- Author
- As of Date
- Web URL
Given a row like this one:
Branch ID | Branch Name | Number of Holds | Title | Author | As of Date | Web URL |
---|---|---|---|---|---|---|
EPLLON | Londonderry Branch | 36 | The girl on the train / Paula Hawkins | Hawkins Paula | 03/16/2015 12:00:00 AM | http://epl.bibliocommons.com/search?t=smart&q=the%20girl%20on |
We can interpret this in words as “In the week preceding March 16, 2015, The girl on the train by Paula Hawkins was one of the top-ten most requested items at the Londonderry Branch. There were 36 holds for this item.”
Note that for this week and for this branch, there will be nine other rows containing the other top-ten titles.
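As a minimal sketch of this reading, the Python snippet below turns the sample row into a sentence. The `describe` function is a hypothetical helper, not part of the lesson, and it assumes that `Title` always follows the "title / author" pattern and that dates always look like the one in the sample row. The `Web URL` column is left out because it is not needed here.

```python
from datetime import datetime

# The sample row from the lesson, keyed by column name (Web URL omitted).
row = {
    "Branch ID": "EPLLON",
    "Branch Name": "Londonderry Branch",
    "Number of Holds": "36",
    "Title": "The girl on the train / Paula Hawkins",
    "Author": "Hawkins Paula",
    "As of Date": "03/16/2015 12:00:00 AM",
}


def describe(row):
    """Build a one-sentence reading of a single top-ten row (hypothetical helper)."""
    # Assumption: Title is formatted as "<title> / <author>", as in the sample row.
    title, _, author = row["Title"].partition(" / ")
    # Assumption: dates follow the MM/DD/YYYY hh:mm:ss AM/PM pattern.
    as_of = datetime.strptime(row["As of Date"], "%m/%d/%Y %I:%M:%S %p")
    holds = int(row["Number of Holds"])
    return (
        f"In the week preceding {as_of:%B} {as_of.day}, {as_of.year}, "
        f"{title} by {author} had {holds} holds at the {row['Branch Name']}."
    )


sentence = describe(row)
print(sentence)
```

Note that the sentence can only report what one row records: it says nothing about holds at other branches or in other weeks.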
Questions about the data
Think about the data as a collection, and contemplate the following questions:
Question
Notice that the above row does not give us enough information to tell what position in the top ten The girl on the train held at the Londonderry Branch on March 16, 2015. But can we figure this out from this data set?
Answer
Yes. In general, to answer questions like this, we need to find all of the items in the top ten for that week at the Londonderry Branch (there should be 10 of them). We can then order them by the number of holds to find out where The girl on the train sat that week. In our case, our data already has some organization to it, and we can answer this question by looking at the first ten rows of our data file.
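The filter-then-sort approach can be sketched in Python as follows. Only the first row comes from the lesson; the other rows, and the function name `position_in_top_ten`, are invented for illustration, so the resulting rank says nothing about the real data.

```python
# Each row is (branch_name, as_of_date, title, number_of_holds).
# Only the first row is from the lesson; the rest are invented for illustration.
rows = [
    ("Londonderry Branch", "03/16/2015", "The girl on the train / Paula Hawkins", 36),
    ("Londonderry Branch", "03/16/2015", "Hypothetical title A", 50),
    ("Londonderry Branch", "03/16/2015", "Hypothetical title B", 20),
    ("Hypothetical Branch", "03/16/2015", "The girl on the train / Paula Hawkins", 41),
]


def position_in_top_ten(rows, branch, date, title):
    """Return the 1-based rank of `title` by holds for one branch and date, or None."""
    # Step 1: keep only the rows for this branch and this week.
    matching = [r for r in rows if r[0] == branch and r[1] == date]
    # Step 2: order them from most holds to fewest.
    ranked = sorted(matching, key=lambda r: r[3], reverse=True)
    # Step 3: find where the title sits in that ordering.
    for position, row in enumerate(ranked, start=1):
        if row[2] == title:
            return position
    return None


rank = position_in_top_ten(
    rows, "Londonderry Branch", "03/16/2015", "The girl on the train / Paula Hawkins"
)
print(rank)
```

Titles with the same number of holds keep their original file order in this sketch, so the reported position for tied titles is somewhat arbitrary.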
Question
Is there any information in our data set that is either redundant or highly correlated?
Answer
Ideally, the Branch ID and the Branch Name fields should be pretty closely correlated. The Title data also includes author information. The Web URL will probably hold similar information to Title.
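One way to test the first of these claims is to check that every Branch ID is paired with exactly one Branch Name. The sketch below does this on a tiny list of pairs; `EPLXXX` and its branch name are invented for illustration, and the function names are hypothetical.

```python
from collections import defaultdict

# (Branch ID, Branch Name) pairs. The EPLLON pair is from the lesson;
# EPLXXX / "Hypothetical Branch" is invented for illustration.
pairs = [
    ("EPLLON", "Londonderry Branch"),
    ("EPLLON", "Londonderry Branch"),
    ("EPLXXX", "Hypothetical Branch"),
]


def names_per_id(pairs):
    """Collect the set of distinct branch names seen for each branch ID."""
    mapping = defaultdict(set)
    for branch_id, branch_name in pairs:
        mapping[branch_id].add(branch_name)
    return dict(mapping)


def id_determines_name(pairs):
    """True if every branch ID is always paired with the same branch name."""
    return all(len(names) == 1 for names in names_per_id(pairs).values())


consistent = id_determines_name(pairs)
print(consistent)
```

If the check fails on real data, the IDs with more than one name are a good place to look for spelling variants or data-entry errors.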
Question
If I have a date and a title, can I add up the holds in the rows in the data that match to get an indication of the total number of holds for that item at that date for all of EPL?
Answer
Not really. This data is for those branches for which the title was in the top 10 holds. This will not include data for branches where the title was not in the top 10 titles, so we will not get the total number of holds.
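To make the limitation concrete, here is a small Python sketch that adds up the holds reported for one title on one date. The second and third rows and the name `reported_holds` are invented for illustration. The sum only covers branches where the title reached the top ten, so it is best read as a lower bound on the system-wide total.

```python
# Only the first row is from the lesson; the others are invented for illustration.
rows = [
    {"Branch Name": "Londonderry Branch", "As of Date": "03/16/2015",
     "Title": "The girl on the train / Paula Hawkins", "Number of Holds": 36},
    {"Branch Name": "Hypothetical Branch", "As of Date": "03/16/2015",
     "Title": "The girl on the train / Paula Hawkins", "Number of Holds": 12},
    {"Branch Name": "Hypothetical Branch", "As of Date": "03/16/2015",
     "Title": "Hypothetical title A", "Number of Holds": 30},
]


def reported_holds(rows, title, date):
    """Sum holds over the rows matching a title and date.

    Branches where the title missed the top ten have no row at all,
    so their holds are silently absent from this sum.
    """
    return sum(
        r["Number of Holds"]
        for r in rows
        if r["Title"] == title and r["As of Date"] == date
    )


lower_bound = reported_holds(rows, "The girl on the train / Paula Hawkins", "03/16/2015")
print(lower_bound)
```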
Question
If I have a branch name and a title, can I add up the number of holds over the weeks to get the total number of people requesting that item at that branch?
Answer
No. Suppose you wait longer than a week for your hold to come in. You will then be included in at least two different weeks' worth of data, and you will be counted at least twice.
Question
Can I use the data to determine how many weeks a title was in the top 10 at a particular branch?
Answer
Yes. You would look for the rows where the title and the branch match and count the number of rows.
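A minimal Python sketch of this count is shown below. Only the 03/16/2015 row is from the lesson; the other rows and the function name `weeks_in_top_ten` are invented for illustration. The sketch counts distinct dates instead of raw rows, which is an added precaution in case a file contains duplicate rows.

```python
# Only the first row is from the lesson; the others are invented for illustration.
rows = [
    {"Branch Name": "Londonderry Branch", "As of Date": "03/16/2015",
     "Title": "The girl on the train / Paula Hawkins"},
    {"Branch Name": "Londonderry Branch", "As of Date": "03/23/2015",
     "Title": "The girl on the train / Paula Hawkins"},
    {"Branch Name": "Londonderry Branch", "As of Date": "03/30/2015",
     "Title": "Hypothetical title A"},
    {"Branch Name": "Hypothetical Branch", "As of Date": "03/30/2015",
     "Title": "The girl on the train / Paula Hawkins"},
]


def weeks_in_top_ten(rows, branch, title):
    """Count the distinct weekly dates on which `title` appears for `branch`."""
    dates = {
        r["As of Date"]
        for r in rows
        if r["Branch Name"] == branch and r["Title"] == title
    }
    return len(dates)


weeks = weeks_in_top_ten(
    rows, "Londonderry Branch", "The girl on the train / Paula Hawkins"
)
print(weeks)
```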
Key Points
It’s important to think about what questions your data can answer.