Importing data into OpenRefine

Overview

Teaching: 10 min
Exercises: 5 min
Questions
  • How do I get data into OpenRefine?

Objectives
  • Successfully import data into OpenRefine

Importing data

What kinds of data files can I import?

There are several options for getting your dataset into OpenRefine. You can upload or import files in a variety of formats, including:

  • TSV (tab-separated values)
  • CSV (comma-separated values)
  • TXT
  • Excel
  • JSON (JavaScript Object Notation)
  • XML (Extensible Markup Language)
  • Google Spreadsheet

Create your first OpenRefine project (using provided data)

To import the data for the exercise below, follow the instructions in Setup to download the data and run OpenRefine. NOTE: If OpenRefine does not open in a browser window, open your browser and type the address http://127.0.0.1:3333/ to take you to the OpenRefine interface.
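If the browser window does not open, it can help to confirm that something is answering at the local address before typing it in by hand. The following is a minimal sketch using only the Python standard library; the function name `is_reachable` and the two-second timeout are illustrative assumptions, not part of OpenRefine.

```python
import urllib.error
import urllib.request

# Default local address given in the lesson text.
OPENREFINE_URL = "http://127.0.0.1:3333/"


def is_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at `url`, False otherwise."""
    # Bypass any system proxy settings, since the target is the local machine.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered with an error status, so it is running.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timed out, or otherwise unreachable.
        return False


if __name__ == "__main__":
    if is_reachable(OPENREFINE_URL):
        print("A server is answering at", OPENREFINE_URL)
    else:
        print("Nothing answered at", OPENREFINE_URL, "- is OpenRefine started?")
```

A positive result only shows that some HTTP server is listening on that port; it does not prove the server is OpenRefine.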

  1. Once OpenRefine is launched in your browser, click Create Project from the left-hand menu and select Get data from This Computer.
  2. Click Choose Files (or ‘Browse’, depending on your setup) and locate the file which you have downloaded called doaj-article-sample.csv.
  3. Click Next >>. The next screen (see below) gives you options to ensure the data is imported into OpenRefine correctly. The options vary depending on the type of data you are importing.
  4. Click in the Character encoding box and set it to UTF-8. This ensures that OpenRefine correctly interprets the imported data as UTF-8 encoded. If you don’t select this you may find that some special characters (e.g. smart quotation marks) are not displayed correctly.
  5. Ensure the first row is used to create the column headings by checking the box Parse next 1 line(s) as column headers.
  6. OpenRefine will automatically select the option Use character " to enclose cells containing column separators. This tells OpenRefine that a cell enclosed in quotation marks may contain the column separator (such as a comma) as part of its data, so it doesn’t misinterpret commas (or other characters) within the column data as delimiters. Keep this option selected.
  7. From OpenRefine 3.4 onwards there is an option to Trim leading & trailing whitespace from strings when importing separator-based files. Keeping this checked will ensure that values like "English" and "English ", which differ by a single trailing space, are not treated as different values after the import.
  8. Make sure the Parse cell text into numbers, dates, ... box is not checked, so that OpenRefine doesn’t try to detect numbers and dates automatically. Automatic detection can cause errors such as confusion between date formats (e.g. DD/MM/YYYY vs MM/DD/YYYY).
  9. The Project Name box in the upper right corner will default to the title of your imported file. Click in the Project Name box to give your project a different name, if desired.
  10. Once you have selected the appropriate options for your project, click the Create Project >> button at the top right of the screen. This will create the project and open it for you. Projects are saved as you work on them, so there is no need to save copies as you go along.

Screenshot of OpenRefine Create Project screen
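To see why the import options in the steps above matter, here is a rough Python sketch that mimics them outside OpenRefine. It is not how OpenRefine itself works internally. The sample text, the `read_rows` helper, and its `trim` parameter are hypothetical and stand in for a small CSV file rather than the lesson's data.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample standing in for a small CSV file; not the lesson's data.
SAMPLE = (
    "Title,Language,Date\n"
    '"Rivers, lakes and seas",English ,03/04/2015\n'
    "Café culture,English,11/12/2016\n"
)


def read_rows(text: str, trim: bool = True) -> list:
    """Parse CSV text: first line as headers, '"' as the quote character,
    optional whitespace trimming, and every cell kept as a string."""
    reader = csv.reader(io.StringIO(text), delimiter=",", quotechar='"')
    headers = next(reader)
    rows = []
    for record in reader:
        cells = [cell.strip() if trim else cell for cell in record]
        rows.append(dict(zip(headers, cells)))
    return rows


# Encode and decode explicitly to mirror choosing UTF-8 as the character encoding.
raw_bytes = SAMPLE.encode("utf-8")
rows = read_rows(raw_bytes.decode("utf-8"))
untrimmed = read_rows(raw_bytes.decode("utf-8"), trim=False)

print(rows[0]["Title"])                         # the quoted comma stays inside one cell
print({row["Language"] for row in rows})        # one distinct value after trimming
print({row["Language"] for row in untrimmed})   # two distinct values without trimming

# The same text is a different date depending on the assumed format,
# which is why the cells are left as strings here.
text_date = rows[0]["Date"]
print(datetime.strptime(text_date, "%d/%m/%Y").date())  # read as day first
print(datetime.strptime(text_date, "%m/%d/%Y").date())  # read as month first
```

Keeping every cell as text at import time leaves the decision about date formats to a later, deliberate step.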

To open an existing project in OpenRefine, click Open Project from the main OpenRefine screen (in the left-hand menu). You will see a list of the existing projects and can click on a project’s name to open it.

Going Further

Data Moment

Quality Assessment

When considering working with a dataset, it is critical to evaluate whether its attributes meet the quality standards required for the project or task at hand. This includes checking for missing data, inspecting errors in data input, and verifying that the data is in the desired unit of analysis and that the sample size is large enough to support its intended use. We typically suggest you inspect the data with these criteria in mind:

  1. Accuracy: for whatever data is described, it needs to be accurate.
  2. Relevancy: the data should meet the requirements for the intended use.
  3. Completeness: the data should be as complete as possible and missing values should be explicitly noted.
  4. Consistency: all values should be recorded following the defined rules for the specific variables.
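Two of these criteria, completeness and consistency, lend themselves to simple automated checks, while accuracy and relevancy usually need human judgement. The sketch below shows one possible way to count missing values and to spot values that differ only by case or surrounding spaces. The records, column names, and helper functions are all hypothetical.

```python
from collections import defaultdict

# Hypothetical records for illustration only; None or "" marks a missing value.
records = [
    {"title": "Open data in libraries", "language": "English", "license": "CC BY"},
    {"title": "Metadata at scale", "language": "english ", "license": ""},
    {"title": "Linked data basics", "language": "EN", "license": None},
]


def count_missing(rows, column):
    """Completeness: number of rows where `column` is empty or absent."""
    return sum(1 for row in rows if not (row.get(column) or "").strip())


def near_duplicates(rows, column):
    """Consistency: group values that differ only by case or surrounding spaces."""
    groups = defaultdict(set)
    for row in rows:
        value = row.get(column)
        if value:
            groups[value.strip().lower()].add(value)
    # Keep only the groups that contain more than one spelling.
    return {key: variants for key, variants in groups.items() if len(variants) > 1}


print(count_missing(records, "license"))
print(near_duplicates(records, "language"))
```

Note that this simple normalisation does not recognise "EN" as another way of writing "English"; spotting that kind of variation takes a closer look at the data.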

For this sample dataset, we will observe many issues that should be addressed and reconciled before the dataset can be considered fit for reuse. After all, this workshop was designed to introduce skills to resolve inconsistencies in messy datasets and make them more reusable. Can you think of any examples that have prevented you from moving forward with a dataset because these criteria weren’t met?

Key Points

  • Use the Create Project option to import data

  • You can control how data imports using options on the import screen

  • Several file types may be imported into OpenRefine.