Cleaning by Clustering

Overview

Teaching: 10 min
Exercises: 10 min
Questions
  • How can we find and correct errors in our raw data?

Objectives
  • Use clustering to detect possible typing errors.

  • Understand that there are different clustering algorithms which might give different results.

Using clustering to detect possible data entry problems

In OpenRefine, clustering means “finding groups of different values that might be alternative representations of the same thing”.

For example, the two strings New York and new york are very likely to refer to the same concept and just have capitalization differences. Likewise, Gödel and Godel probably refer to the same person.

Clustering is a very powerful tool for cleaning datasets which contain misspelled or mistyped entries. OpenRefine has several clustering algorithms built in. We will experiment with them, and learn more about these algorithms and how they work.

We will use clustering to fix the branch names in our data. To ensure consistency, we should think about how we want to standardize the naming of branches. To this end, let’s adopt the following convention:

We would like to avoid the use of the word Branch and Library in a branch name whenever an alternative name is. This convention matches the trends noticed in the data (older records have the word Branch in them, newer ones don’t).

  1. In the Branch Name Text Facet we created previously, click the Cluster button.
  2. In the resulting pop-up window, you can change the Method and the Keying Function. Try different combinations to see what different mergers of values are suggested.
  3. Select the key collision method and metaphone3 keying function. It should identify several clusters.
  4. Click the name that you feel should be set for each cluster found (for example, click on Woodcroft (Westmount)). Notice that the checkbox Merge? is selected now.
  5. If there are clusters that you don’t think should be merged together, don’t click anything – and make sure that the Merge? checkbox is NOT selected.
  6. When you are finished reviewing the clusters, click Merge Selected and Recluster to apply the corrections to the dataset.
  7. Try selecting different Methods and Keying Functions again, to see what new merges are suggested. For example, somce reasonable clusters are found by selecting method Nearest neighbor, distance function PPM, and radius 2.0 (with default Block Chars set to 6). Merge Selected and Recluster as needed. Try again with a higher radius: you should find some good matches are found and some bad ones.
  8. If you know of other duplicates that you find hard to detect with the clustering options, you can still merge these similar values. You need to hover over them in the Branch Name text facet, select edit, and manually change the names. The MacEwan University / MacEwan Lending Machine should be a bit tricky.

But what about those blank branch names?

The last item in the Branch Name facet panel has the name (blank). The branch name has not been included.

We have a couple of options:

We can remove the records by selecting the (blank) facet, the going to the menu at the top of the All column and selecting Edit rows > Remove all matching rows.

But luckily we have another field to fall back on to do some detective work: the Branch ID column.

If we add a text facet to the Branch ID column and select the (blank) value in the Branch Name panel, we see that the blank branch names only have two different branch IDs: EPLMCN and EPLWHP.

We can reverse this information: if we now select EPLMCN in the branch ID panel, we see that there are only two branch names that have this as a branch ID: McConachie and (blank). From this we can conclude that EPLMCN is the branch ID for McConachie.

Let’s fill in the missing data:

Exercise

Using the same technique as above:

  1. How many rows with blank values for the branch name have the branch ID set to EPLWHP?
  2. Which branch does EPLWHP correspond to?
  3. Fix these blank branch names.

Solution

  1. Selecting the EPLWHP branch ID facet and the (blank) branch name facet should give 780 rows.
  2. EPLWHP corresponds to West Henday Promenade (Lewis Estates).
  3. This is done in the same way as was done with the blank values for the McConachie branch.

Different clustering algorithms

The technical details of how the different clustering algorithm work can be found at the link below.

More on clustering

Key Points

  • Clustering in OpenRefine can help to identify different values that might mean the same thing.