Facets
Overview
Teaching: 10 min
Exercises: 10 minQuestions
How can we use facets to explore our data?
How can we find problems in our raw data?
Objectives
Use facets to summarize data from a column.
Using Facets
Facets are one of the most useful features of OpenRefine and can help both get an overview of the data in a project as well as helping you bring more consistency to the data. OpenRefine supports faceted browsing as a mechanism for
- seeing a big picture of your data, and
- filtering down to just the subset of rows that you want to change in bulk.
A ‘Facet’ groups all the like values that appear in a column, and then allow you to filter the data by these values and edit values across many records at the same time.
Although you may not be familiar with the term ‘facet’, you have most likely experienced this kind of filtering before. Two great websites that offer searching using facets:
One type of Facet is called a ‘Text facet’. This groups all the identical text values in a column and lists each value with the number of records it appears in. The facet information always appears in the left hand panel in the OpenRefine interface.
Here we will use faceting to look for potential errors in data entry in the Branch Name
column. We know for a fact that there are only 22 branches at EPL that have participated in this data set.
- Scroll over to the
Branch Name
column. - Click the down arrow and choose
Facet
>Text facet
. - In the left panel, you’ll now see a box containing every unique value in the
Branch Name
column along with a number representing how many times that value occurs in the column. - Try sorting this facet by name and by count.
Exercise
Do you notice any problems with the data? What are they?
Solution
- There are 48 different values … but only 22 branches!
Calder Branch
is likely the same branch asCalder
.Woodcroft Branch
is likely the same branch as bothWoodcroft (Westmount)
andWoodcroft (Westmount) Branch
.MacEwan Lending Machine
refers to the same place asMacEwan University
andMacEwan University Lending Machine
.- There are
1230
rows with an empty branch name.- and …
Hover the mouse over one of the names in the Facet
list. You should see that you have an edit
function available.
You could use this to fix an error immediately, and OpenRefine will ask whether you want to make the same correction to every value it finds like that one. But OpenRefine offers even better ways to find and fix these errors, which we’ll use instead. We’ll learn about these when we talk about clustering.
To Fix!
We have now identified some data to clean:
- There are inconsistencies in the branch names.
We’ll explore some strategies to tackle this in a later section.
Using Facets to filter data
Facets give an excellent summary of the data in a data field, but they are also useful for choosing what data we want to look at.
Try it!
Use
include / exclude
to select only entries that match selected branches.
- In the facet (left margin), click on one of the names, such as
Abbotsfield - Penny McKee
. Notice that when you click on the name, or hover over it, there are entries to the right foredit
andinclude
.- Click
include
. This will explicitly include this branch, and exclude others that are not explicitly included. Notice that the option now changes toexclude
.- Click
include
andexclude
on the other branches and notice how the entries appear and disappear from the table.- Invert: You can look at data at every branch except for the currently selected facets. The
invert
button to the right ofBranch Name
makes this easy. Pressing it twice returns the previously selected facets.- Reset: After playing around with some of the facets, remove all selected facets so that we have a full dataset to work with. The
reset
link to the right ofBranch Name
in the facet panel will quickly do this.
Dates and timelines
Lets look at our date column As of Date
. We would like to facet this column, and there is a facet
type called ‘Timeline facet’ that would probably be a good fit, but when try to add this type
of facet, we get bad results (and we notice that the facet panel says there are 33321 rows of
type Non-time
).
The issue here is that the As of Date
column is considered to hold text data, and not
time/date data. We can add a Text facet
instead, and doing so we discover how many unique choices
for dates there are in the dataset. (You may get a ‘too many to display error’, and a Set choice
count limit
link. Use the link to increase the limit to 1000).
Look at the ordering of these facets. Are these in chronological order? The day-month-year formatting of the dates ensures that dates on the first of the month are first, ordered by the month the date is in, then the year.
We can fix this by converting the type of the data in the column to ‘date’ type.
- Remove all facets on the
As of Date
column. - Convert the type of the data in the column to
date
usingEdit cells
>Common transforms
>To date
from theAs of Date
column menu. Notice the data in the column is now green. - Use faceting to produce a timeline display for
As of Date
. Play with the handles to filter the data by date - Create a second facet on the date column using the
Custom Text Facet
option with the expressionvalue.toString()
. This produces a new text facet for the dates, but the values are in chronological order! Play with the handles of the timeline facet to watch these new facet values change.
Questions about our data
- Using the custom text facet created above, how many different
As of Date
values there are in the data set?- What are the first and last weeks that data was collected?
- Are there any periods where data was not collected?
Solution
There are 164 unique dates in the dataset.
The first data collected was from February 2015, and the most recent data is from October 2018.
Notice there are holes in the data for late 2016 and the early half of 2017.
More on Facets
As well as
Text
andTimeline
facets, OpenRefine also supports a range of other types of facet. These include:
- Numeric facets
- Custom facets
- Scatterplot facets
Numeric and Scatterplot facets display graphs instead of lists of values. The numeric facet graph includes ‘drag and drop’ controls you can use to set a start and end range to filter the data displayed. These facets are explored further in Examining Numbers in OpenRefine
Custom facets are a range of different types of facets. Some of the default custom facets are:
- Word facet - this breaks down text into words and counts the number of records each word appears in
- Duplicates facet - this results in a binary facet of ‘true’ or ‘false’. Rows appear in the ‘true’ facet if the value in the selected column is an exact match for a value in the same column in another row
- Text length facet - creates a numeric facet based on the length (number of characters) of the text in each row for the selected column. This can be useful for spotting incorrect or unusual data in a field where specific lengths are expected (e.g. if the values are expected to be years, any row with a text length more than 4 for that column is likely to be incorrect)
- Facet by blank - a binary facet of ‘true’ or ‘false’. Rows appear in the ‘true’ facet if they have no data present in that column. This is useful when looking for rows missing key data.
Exercise
Add a text facet on the ‘Branch Name’ column on top of the ‘As of Date’ timeline facet construced above. Do you see any patterns on the timeline graph when looking at branches with duplicate names?
Solution
Quite often the coding of branch name will change over time. For example,
Calder Branch
is used in earlier data records in the set, whereasCalder
is used in later data samples.
Key Points
Faceting is a useful technique to explore data in a column
A column with dates can be converted to date type for faceting.