Exploring and Cleaning Data with OpenRefine: Glossary

Key Points

Introduction
  • OpenRefine is a powerful, free and open source tool that can be used for data cleaning.

  • OpenRefine will automatically track any steps allowing you to backtrack as needed and providing a record of all work done

Getting data into OpenRefine
  • OpenRefine can import a variety of file types.

Know your data
  • It’s important to think about what questions your data can answer.

Undo/Redo
  • Undo and redo are useful ways to rewind operations we have done to our data

Facets
  • Faceting is a useful technique to explore data in a column

  • A column with dates can be converted to date type for faceting.

Filtering with OpenRefine
  • OpenRefine provides a way to filter data without affecting the raw data.

Examining Numbers in OpenRefine
  • OpenRefine also provides ways to get overviews of numerical data.

Sorting data with OpenRefine
  • OpenRefine provides a way to sort data without affecting the raw data.

  • Multiple sort criteria can be used

Cleaning by Clustering
  • Clustering in OpenRefine can help to identify different values that might mean the same thing.

Splitting data
  • OpenRefine can break apart columns that hold multiple data

Exporting and Saving Data from OpenRefine
  • Cleaned data or entire projects can be exported from OpenRefine.

  • Projects can be shared with collaborators, enabling them to see, reproduce and check all data cleaning steps you performed.

Using scripts
  • All changes are being tracked in OpenRefine, and this information can be used for scripts for future analyses or reproducing an analysis.

Other Resources in OpenRefine
  • Other examples and resources online are good for learning more about OpenRefine

Glossary

including tab separated (tsv), comma separated (csv), Excel (xls, xlsx), JSON, XML, RDF as XML, Google Spreadsheets

csv
A file extension indicating that a text file that has values separated by commas (comma-separated-values).
Clustering
A method for finding different groups of values that may actually be representing the same thing.
Faceting
A method for exploring the values in a variable. In this episode it is used to explore the values in order to identify errors in data entry.
Filter
To select a subset of data from a dataframe.
JSON
A file extension indicating that the values in a text file are structured using JavaScript Object Notation (JSON).
RDF
A file that extension indicating that the values in a file are structured using Resource Description Framework (RDF).
Regular expressions (regex)
A text string for describing a search pattern. They usually incorporate the use of wildcards to match letters, numbers, punctuation, spacing, or some combination.
tsv
A file extension indicating that a text file that has values separated by tabs (tab-separated-values).
xls
A file extension indicating that a file is a spreadsheet created by Microsoft Excel.
xlsx
A file extension indicating that a file is a spreadsheet created by Microsoft Excel using XML.
XML
A file extension indicating that the values in a file are structured using Extensible Markup Language (XML).