Exploring and Cleaning Data with OpenRefine

Attribution

This workshop uses the Data Cleaning with OpenRefine for Social Scientists course from Data Carpentry as its basis.

A part of the data workflow is preparing the data for analysis. Some of this involves data cleaning, where errors in the data are identified and corrected or formatting is made consistent. This step must be taken with the same care and attention to reproducibility as the analysis.

OpenRefine (formerly Google Refine) is a powerful free and open source tool for working with messy data: cleaning it and transforming it from one format into another.

This lesson will teach you to use OpenRefine to effectively clean and format data and automatically track any changes that you make. Many people comment that this tool saves them months of work they would otherwise spend making these edits by hand.

Getting Started

This workshop is modeled after the methodologies of Software/Data Carpentry: the teaching is hands-on, so participants are encouraged to use their own computers to ensure the proper setup of tools for an efficient workflow.

These lessons assume no prior knowledge of the skills or tools.

To get started, follow the directions in the “Setup” tab to download data to your computer and follow any installation instructions.

Prerequisites

This lesson requires a working copy of OpenRefine (formerly called Google Refine).

To use these materials most effectively, please make sure to install everything before working through this lesson.

For Instructors

If you are teaching this lesson in a workshop, please see the Instructor notes.

Schedule

Setup Download files required for the lesson
00:00 1. Introduction What is OpenRefine useful for?
00:05 2. Getting data into OpenRefine How can we bring our data into OpenRefine?
00:15 3. Know your data What information is in our data?
What kinds of questions can our data answer?
What kinds of questions cannot be answered by our data?
00:25 4. Undo/Redo How can we undo operations on our data?
00:35 5. Facets How can we use facets to explore our data?
How can we find problems in our raw data?
00:55 6. Filtering with OpenRefine How can we filter our data?
01:05 7. Examining Numbers in OpenRefine How can we convert a column from one data type to another?
01:15 8. Sorting data with OpenRefine How can we sort our data?
01:25 9. Cleaning by Clustering How can we find and correct errors in our raw data?
01:45 10. Splitting data How do we deal with multiple data in a single field?
01:55 11. Exporting and Saving Data from OpenRefine How can we save and export our cleaned data from OpenRefine?
02:05 12. Using scripts How can we document the data-cleaning steps we’ve applied to our data?
How can we apply these steps to additional data sets?
02:15 13. Other Resources in OpenRefine What other resources are available for working with OpenRefine?
02:20 Finish

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.