Splitting data

Overview

Teaching: 5 min
Exercises: 5 min
Questions
  • How do we deal with multiple data in a single field?

Objectives
  • Learn to split data

  • Recover from problems when modifying data

Splitting data

As we’ve noticed, the Title column holds more than just a title. After the title of an item, there is a slash character (surrounded by spaces). We can tell OpenRefine to look for patterns to use as delimitters of separata data items.

The Title column is now gone and replaced by three columns: Title 1, Title 2, and Title 3. The Title 1 column does appear to hold the title, Title 2 appears to hold author information, but Title 3 appears to be blank.

We can facet on Title 3 to find out what data it holds (create a text facet). We discover that every row is blank except for one: Joshua Williamson, Tom King ....

If we click on that row, we discover that the Title 1 is Batman, and Title 2 is The Flash : the button, deluxe edition.

Let’s undo the column split to see what’s going on.

If we now do a text filter on the Title column and search for batman, we find a number of matches, in particular Batman / The Flash : the button, deluxe edition / Joshua Williamson, Tom King, writers ; Jason Fabok, Howard Porter, artists ; Brad Anderson, Hi-Fi, colorists ; Deron Bennett, Steve Wands, letterers.

The problem here is that the same set of characters that is used to split the title and author information is also used inside of a title!

We can fix this:

Remove the filter and split the column using ‘ / ‘. We now just get two columns Title 1 and Title 2.

We can now rename the columns: in the Title 1 header menu, go to Edit column > Rename this column, and change the name to Title.

The Title 2 column can be named to something like Title Extra.

Key Points

  • OpenRefine can break apart columns that hold multiple data