Cleaning by Clustering

Overview

Teaching: 10 min
Exercises: 10 min

Questions

How can we find and correct errors in our raw data?

Objectives

Use clustering to detect possible typing errors.

Understand that there are different clustering algorithms which might give different results.

Using clustering to detect possible data entry problems

In OpenRefine, clustering means “finding groups of different values that might be alternative representations of the same thing”.

For example, the two strings New York and new york are very likely to refer to the same concept and just have capitalization differences. Likewise, Gödel and Godel probably refer to the same person.

Clustering is a very powerful tool for cleaning datasets which contain misspelled or mistyped entries. OpenRefine has several clustering algorithms built in. We will experiment with them, and learn more about these algorithms and how they work.

We will use clustering to fix the branch names in our data. To ensure consistency, we should think about how we want to standardize the naming of branches. To this end, let’s adopt the following convention:

We would like to avoid the use of the word Branch and Library in a branch name whenever an alternative name is. This convention matches the trends noticed in the data (older records have the word Branch in them, newer ones don’t).

In the Branch Name Text Facet we created previously, click the Cluster button.
In the resulting pop-up window, you can change the Method and the Keying Function. Try different combinations to see what different mergers of values are suggested.
Select the key collision method and metaphone3 keying function. It should identify several clusters.
Click the name that you feel should be set for each cluster found (for example, click on Woodcroft (Westmount)). Notice that the checkbox Merge? is selected now.
If there are clusters that you don’t think should be merged together, don’t click anything – and make sure that the Merge? checkbox is NOT selected.
When you are finished reviewing the clusters, click Merge Selected and Recluster to apply the corrections to the dataset.
Try selecting different Methods and Keying Functions again, to see what new merges are suggested. For example, somce reasonable clusters are found by selecting method Nearest neighbor, distance function PPM, and radius 2.0 (with default Block Chars set to 6). Merge Selected and Recluster as needed. Try again with a higher radius: you should find some good matches are found and some bad ones.
If you know of other duplicates that you find hard to detect with the clustering options, you can still merge these similar values. You need to hover over them in the Branch Name text facet, select edit, and manually change the names. The MacEwan University / MacEwan Lending Machine should be a bit tricky.

But what about those blank branch names?

The last item in the Branch Name facet panel has the name (blank). The branch name has not been included.

We have a couple of options:

Remove these records;
See if we can figure out the branch ourselves some other way.

We can remove the records by selecting the (blank) facet, the going to the menu at the top of the All column and selecting Edit rows > Remove all matching rows.

But luckily we have another field to fall back on to do some detective work: the Branch ID column.

If we add a text facet to the Branch ID column and select the (blank) value in the Branch Name panel, we see that the blank branch names only have two different branch IDs: EPLMCN and EPLWHP.

We can reverse this information: if we now select EPLMCN in the branch ID panel, we see that there are only two branch names that have this as a branch ID: McConachie and (blank). From this we can conclude that EPLMCN is the branch ID for McConachie.

Let’s fill in the missing data:

Make sure that the (blank) branch name facet and the EPLMCN facet are selected. We are told that there are 450 rows that match these two filtering conditions.
Go to the first row in the data displayed and allow your mouse pointer to hover over the blank Branch Name cell. As you hover, the word Edit appears. Click it.
A dialog appears. Type ‘McConachie’ in the dialog, and press the button Apply to all Identical Cells
You will see that the (blank) facet now says there are 0 matching rows, and the McConachie facet count has increased by 450.

Exercise

Using the same technique as above:

How many rows with blank values for the branch name have the branch ID set to EPLWHP?

Which branch does EPLWHP correspond to?

Fix these blank branch names.

Solution

Selecting the EPLWHP branch ID facet and the (blank) branch name facet should give 780 rows.

EPLWHP corresponds to West Henday Promenade (Lewis Estates).

This is done in the same way as was done with the blank values for the McConachie branch.

Different clustering algorithms

The technical details of how the different clustering algorithm work can be found at the link below.

previous episode

Exploring and Cleaning Data with OpenRefine

next episode

Cleaning by Clustering

Overview

Using clustering to detect possible data entry problems

But what about those blank branch names?

Exercise

Solution

Different clustering algorithms

Key Points

previous episode

next episode