Filtering with OpenRefine
Overview
Teaching: 5 min
Exercises: 5 minQuestions
How can we filter our data?
Objectives
Filter in a column.
Filtering
With so many records in the dataset, it’s sometimes useful to only look at a subset of the data.
Faceting was introduced to do this, but this technique does not work well with data that has
a lot of different, free-form data. The Title
column is an example of such data: you
can add a text facet, but there are 3248 different values to facet through. Luckily
OpenRefine has a more general-purpose text filter for these kinds of fields.
Let’s look at a subset of the data, filtered by title:
- Click the down arrow next to
Title
>Text filter
. ATitle
filter panel will appear on the left margin. - Type in
tidying
and press return. There are 1270 matching rows of the original 33321 rows. - At the top, change the view to
Show
50rows
. This way you will see most of the matching rows.
Note that the Title
column is pretty squished in the view in OpenRefine. We can make more room
for it by collapsing other columns, in particular the Web URL
column that we won’t use. Click
on the Web URL
menu in the column header and select View
> Collapse this column
.
Exercise
- How many distinct titles match this ‘tidying’ pattern? (Hint: add a facet after adding your filter)
- Do you see any potential problems with this data?
Solution
- Do
Facet
>Text facet
on theTitle
column after filtering. This will show that six titles match your filter criteria.- Some of these titles represent the same item, except that the author information has some variation.
To Fix!
We have now identified some more data to clean:
- There are inconsistencies in the way titles are entered. A lot of the problems are due to the combination of title and author both being encoded into the title.
We’ll work on this later when we look at splitting data in fields.
Key Points
OpenRefine provides a way to filter data without affecting the raw data.