Facets

Overview

Teaching: 10 min
Exercises: 10 min
Questions
  • How can we use facets to explore our data?

  • How can we find problems in our raw data?

Objectives
  • Use facets to summarize data from a column.

Using Facets

Facets are one of the most useful features of OpenRefine and can help both get an overview of the data in a project as well as helping you bring more consistency to the data. OpenRefine supports faceted browsing as a mechanism for

A ‘Facet’ groups all the like values that appear in a column, and then allow you to filter the data by these values and edit values across many records at the same time.

Although you may not be familiar with the term ‘facet’, you have most likely experienced this kind of filtering before. Two great websites that offer searching using facets:

One type of Facet is called a ‘Text facet’. This groups all the identical text values in a column and lists each value with the number of records it appears in. The facet information always appears in the left hand panel in the OpenRefine interface.

Here we will use faceting to look for potential errors in data entry in the Branch Name column. We know for a fact that there are only 22 branches at EPL that have participated in this data set.

  1. Scroll over to the Branch Name column.
  2. Click the down arrow and choose Facet > Text facet.
  3. In the left panel, you’ll now see a box containing every unique value in the Branch Name column along with a number representing how many times that value occurs in the column.
  4. Try sorting this facet by name and by count.

Exercise

Do you notice any problems with the data? What are they?

Solution

  • There are 48 different values … but only 22 branches!
  • Calder Branch is likely the same branch as Calder.
  • Woodcroft Branch is likely the same branch as both Woodcroft (Westmount) and Woodcroft (Westmount) Branch.
  • MacEwan Lending Machine refers to the same place as MacEwan University and MacEwan University Lending Machine.
  • There are 1230 rows with an empty branch name.
  • and …

Hover the mouse over one of the names in the Facet list. You should see that you have an edit function available.

You could use this to fix an error immediately, and OpenRefine will ask whether you want to make the same correction to every value it finds like that one. But OpenRefine offers even better ways to find and fix these errors, which we’ll use instead. We’ll learn about these when we talk about clustering.

To Fix!

We have now identified some data to clean:

  • There are inconsistencies in the branch names.

We’ll explore some strategies to tackle this in a later section.

Using Facets to filter data

Facets give an excellent summary of the data in a data field, but they are also useful for choosing what data we want to look at.

Try it!

Use include / exclude to select only entries that match selected branches.

  1. In the facet (left margin), click on one of the names, such as Abbotsfield - Penny McKee. Notice that when you click on the name, or hover over it, there are entries to the right for edit and include.
  2. Click include. This will explicitly include this branch, and exclude others that are not explicitly included. Notice that the option now changes to exclude.
  3. Click include and exclude on the other branches and notice how the entries appear and disappear from the table.
  4. Invert: You can look at data at every branch except for the currently selected facets. The invert button to the right of Branch Name makes this easy. Pressing it twice returns the previously selected facets.
  5. Reset: After playing around with some of the facets, remove all selected facets so that we have a full dataset to work with. The reset link to the right of Branch Name in the facet panel will quickly do this.

Dates and timelines

Lets look at our date column As of Date. We would like to facet this column, and there is a facet type called ‘Timeline facet’ that would probably be a good fit, but when try to add this type of facet, we get bad results (and we notice that the facet panel says there are 33321 rows of type Non-time).

The issue here is that the As of Date column is considered to hold text data, and not time/date data. We can add a Text facet instead, and doing so we discover how many unique choices for dates there are in the dataset. (You may get a ‘too many to display error’, and a Set choice count limit link. Use the link to increase the limit to 1000).

Look at the ordering of these facets. Are these in chronological order? The day-month-year formatting of the dates ensures that dates on the first of the month are first, ordered by the month the date is in, then the year.

We can fix this by converting the type of the data in the column to ‘date’ type.

  1. Remove all facets on the As of Date column.
  2. Convert the type of the data in the column to date using Edit cells > Common transforms > To date from the As of Date column menu. Notice the data in the column is now green.
  3. Use faceting to produce a timeline display for As of Date. Play with the handles to filter the data by date
  4. Create a second facet on the date column using the Custom Text Facet option with the expression value.toString(). This produces a new text facet for the dates, but the values are in chronological order! Play with the handles of the timeline facet to watch these new facet values change.

Questions about our data

  1. Using the custom text facet created above, how many different As of Date values there are in the data set?
  2. What are the first and last weeks that data was collected?
  3. Are there any periods where data was not collected?

Solution

There are 164 unique dates in the dataset.

The first data collected was from February 2015, and the most recent data is from October 2018.

Notice there are holes in the data for late 2016 and the early half of 2017.

More on Facets

OpenRefine Wiki: Faceting

As well as Text and Timeline facets, OpenRefine also supports a range of other types of facet. These include:

  • Numeric facets
  • Custom facets
  • Scatterplot facets

Numeric and Scatterplot facets display graphs instead of lists of values. The numeric facet graph includes ‘drag and drop’ controls you can use to set a start and end range to filter the data displayed. These facets are explored further in Examining Numbers in OpenRefine

Custom facets are a range of different types of facets. Some of the default custom facets are:

  • Word facet - this breaks down text into words and counts the number of records each word appears in
  • Duplicates facet - this results in a binary facet of ‘true’ or ‘false’. Rows appear in the ‘true’ facet if the value in the selected column is an exact match for a value in the same column in another row
  • Text length facet - creates a numeric facet based on the length (number of characters) of the text in each row for the selected column. This can be useful for spotting incorrect or unusual data in a field where specific lengths are expected (e.g. if the values are expected to be years, any row with a text length more than 4 for that column is likely to be incorrect)
  • Facet by blank - a binary facet of ‘true’ or ‘false’. Rows appear in the ‘true’ facet if they have no data present in that column. This is useful when looking for rows missing key data.

Exercise

Add a text facet on the ‘Branch Name’ column on top of the ‘As of Date’ timeline facet construced above. Do you see any patterns on the timeline graph when looking at branches with duplicate names?

Solution

Quite often the coding of branch name will change over time. For example, Calder Branch is used in earlier data records in the set, whereas Calder is used in later data samples.

Key Points

  • Faceting is a useful technique to explore data in a column

  • A column with dates can be converted to date type for faceting.