Cleaning by Clustering
Overview
Teaching: 10 min
Exercises: 10 minQuestions
How can we find and correct errors in our raw data?
Objectives
Use clustering to detect possible typing errors.
Understand that there are different clustering algorithms which might give different results.
Using clustering to detect possible data entry problems
In OpenRefine, clustering means “finding groups of different values that might be alternative representations of the same thing”.
For example, the two strings New York
and new york
are very likely to refer to the
same concept and just have capitalization differences. Likewise, Gödel
and Godel
probably refer to the same person.
Clustering is a very powerful tool for cleaning datasets which contain misspelled or mistyped entries. OpenRefine has several clustering algorithms built in. We will experiment with them, and learn more about these algorithms and how they work.
We will use clustering to fix the branch names in our data. To ensure consistency, we should think about how we want to standardize the naming of branches. To this end, let’s adopt the following convention:
We would like to avoid the use of the word Branch
and Library
in a branch name whenever an alternative name is. This convention matches the trends noticed in the data (older records have the word Branch
in them, newer ones don’t).
- In the
Branch Name
Text Facet we created previously, click theCluster
button. - In the resulting pop-up window, you can change the
Method
and theKeying Function
. Try different combinations to see what different mergers of values are suggested. - Select the
key collision
method andmetaphone3
keying function. It should identify several clusters. - Click the name that you feel should be set for each cluster found
(for example, click on
Woodcroft (Westmount)
). Notice that the checkboxMerge?
is selected now. - If there are clusters that you don’t think should be merged together, don’t click anything –
and make sure that the
Merge?
checkbox is NOT selected. - When you are finished reviewing the clusters, click
Merge Selected and Recluster
to apply the corrections to the dataset. - Try selecting different
Methods
andKeying Functions
again, to see what new merges are suggested. For example, somce reasonable clusters are found by selecting methodNearest neighbor
, distance functionPPM
, and radius2.0
(with defaultBlock Chars
set to6
).Merge Selected and Recluster
as needed. Try again with a higher radius: you should find some good matches are found and some bad ones. - If you know of other duplicates that you find hard to detect with the clustering options,
you can still merge these similar values. You need to hover over them in the
Branch Name
text facet, select edit, and manually change the names. TheMacEwan University
/MacEwan Lending Machine
should be a bit tricky.
But what about those blank branch names?
The last item in the Branch Name
facet panel has the name (blank)
. The branch name
has not been included.
We have a couple of options:
- Remove these records;
- See if we can figure out the branch ourselves some other way.
We can remove the records by selecting the (blank)
facet, the going to the menu at the
top of the All
column and selecting Edit rows
> Remove all matching rows
.
But luckily we have another field to fall back on to do some detective work: the Branch ID
column.
If we add a text facet to the Branch ID
column and select the (blank)
value in
the Branch Name
panel, we see that the blank branch names only have two different
branch IDs: EPLMCN
and EPLWHP
.
We can reverse this information: if we now select EPLMCN
in the branch ID panel,
we see that there are only two branch names that have this as a branch ID:
McConachie
and (blank)
. From this we can conclude that EPLMCN
is the
branch ID for McConachie
.
Let’s fill in the missing data:
- Make sure that the
(blank)
branch name facet and theEPLMCN
facet are selected. We are told that there are450
rows that match these two filtering conditions. - Go to the first row in the data displayed and allow your mouse pointer to hover over
the blank
Branch Name
cell. As you hover, the wordEdit
appears. Click it. - A dialog appears. Type ‘McConachie’ in the dialog, and press the button
Apply to all Identical Cells
- You will see that the
(blank)
facet now says there are0
matching rows, and theMcConachie
facet count has increased by450
.
Exercise
Using the same technique as above:
- How many rows with blank values for the branch name have the branch ID set to
EPLWHP
?- Which branch does
EPLWHP
correspond to?- Fix these blank branch names.
Solution
- Selecting the
EPLWHP
branch ID facet and the(blank)
branch name facet should give780
rows.EPLWHP
corresponds toWest Henday Promenade (Lewis Estates)
.- This is done in the same way as was done with the blank values for the McConachie branch.
Different clustering algorithms
The technical details of how the different clustering algorithm work can be found at the link below.
Key Points
Clustering in OpenRefine can help to identify different values that might mean the same thing.