Splitting data
Overview
Teaching: 5 min
Exercises: 5 minQuestions
How do we deal with multiple data in a single field?
Objectives
Learn to split data
Recover from problems when modifying data
Splitting data
As we’ve noticed, the Title
column holds more than just a title.
After the title of an item, there is a slash character (surrounded by spaces). We can tell OpenRefine to look for patterns to use as delimitters of separata data items.
- Remove all facets and filters on the data.
- Go to the menu at the top of the
Title
column. - Select
Edit column
>Split into several columns ...
- Make sure that
By Separator
is selected - For the separator, type in ‘ / ‘ (a space, a slash, and another space)
- Press ‘Ok’
The Title
column is now gone and replaced by three columns:
Title 1
, Title 2
, and Title 3
. The Title 1
column does appear to hold the title, Title 2
appears to hold author information, but Title 3
appears to be blank.
We can facet on Title 3
to find out what data it holds (create a text facet). We discover that every row is blank except for one: Joshua Williamson, Tom King ...
.
If we click on that row, we discover that the Title 1
is Batman
, and Title 2
is The Flash : the button, deluxe edition
.
Let’s undo the column split to see what’s going on.
If we now do a text filter on the Title
column and search for batman
,
we find a number of matches, in particular Batman / The Flash : the button, deluxe edition / Joshua Williamson, Tom King, writers ; Jason Fabok, Howard Porter, artists ; Brad Anderson, Hi-Fi, colorists ; Deron Bennett, Steve Wands, letterers
.
The problem here is that the same set of characters that is used to split the title and author information is also used inside of a title!
We can fix this:
- Go to the header menu for the
Title
column and selectEdit Cells
>Replace
- Replace
Batman / The Flash
withBatman/The Flash
in the title.
Remove the filter and split the column using ‘ / ‘. We now just
get two columns Title 1
and Title 2
.
We can now rename the columns: in the Title 1
header menu,
go to Edit column
> Rename this column
, and change the name to
Title
.
The Title 2
column can be named to something like Title Extra
.
Key Points
OpenRefine can break apart columns that hold multiple data