Data Cleaning

Definition

Data cleaning is the process of identifying and correcting data that are inaccurate, missing, or incomplete. Data cleaning tasks can include removing duplicate records, investigating extreme values (e.g., outliers), converting dates from one format to another, removing unwanted text, splitting multiple data points in a cell into separate cells, or coding missing or NA values.
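As a rough illustration, a few of the tasks listed above (removing duplicate records, converting dates, and coding missing values) might look like the following in pandas. This is a minimal sketch, not a recommended workflow: the table, its column names (`id`, `visit_date`, `weight_kg`), and the `clean` helper are all hypothetical.

```python
import pandas as pd

# Hypothetical example data; the column names are illustrative only.
raw = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "visit_date": ["01/15/2023", "02/20/2023", "02/20/2023", "03/05/2023"],
    "weight_kg": ["70.5", "NA", "NA", "82.0"],
})

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply three common cleaning steps to a copy of df."""
    # 1. Remove rows that are exact duplicates of an earlier row.
    out = df.drop_duplicates().copy()
    # 2. Convert month/day/year strings into real datetime values.
    out["visit_date"] = pd.to_datetime(out["visit_date"], format="%m/%d/%Y")
    # 3. Convert weights to numbers. errors="coerce" turns anything that is
    #    not a number (here the text "NA") into a missing value (NaN).
    out["weight_kg"] = pd.to_numeric(out["weight_kg"], errors="coerce")
    return out.reset_index(drop=True)

cleaned = clean(raw)
print(cleaned)
```

Note that `errors="coerce"` silently converts every non-numeric entry to a missing value, so it is worth inspecting which cells were affected before relying on the result.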

Examples

Using a regular expression (regex) in OpenRefine to identify a pattern or sequence within text and then remove or replace it (e.g., finding all “NA” values and replacing them with -999).
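The same idea can be sketched outside OpenRefine. The example below uses Python's built-in `re` module rather than OpenRefine's own expression syntax, and the pattern, the `recode_missing` helper, and the sample cells are assumptions made for illustration.

```python
import re

# Match a cell that contains only "NA" or "N/A" (any letter case),
# optionally surrounded by whitespace. Anchoring with ^ and $ keeps
# words that merely contain "NA" (such as "NAPA") untouched.
NA_PATTERN = re.compile(r"^\s*(?:NA|N/A)\s*$", re.IGNORECASE)

def recode_missing(values, sentinel="-999"):
    """Return a new list with NA-like cells replaced by the sentinel."""
    return [NA_PATTERN.sub(sentinel, v) for v in values]

cells = ["12", "NA", " n/a ", "NAPA", "7"]
recoded = recode_missing(cells)
print(recoded)
```

A sentinel such as -999 only works if everyone using the data knows it stands for "missing"; otherwise it can be mistaken for a real measurement.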

Similar Terms

Deduplication

Data Standardization

Data Cleansing

Data Scrubbing

Tools

Tidyverse is a collection of open source R packages, several of which can be used for data wrangling and cleaning.

Pandas is an open source Python library for data manipulation and analysis.

OpenRefine is a user-friendly, point-and-click tool for working with messy data.

Relevant Literature

Hadley Wickham’s well-known article “Tidy Data” (2014) explains the tidy approach, in which every variable is a column, every observation is a row, and each type of observational unit is a table. This structure is the basis for “long” rather than “wide” data.
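To make the “wide” versus “long” distinction concrete, the sketch below reshapes a small wide table into long form with pandas. The data and the column names (`site`, `year`, `cases`) are hypothetical; the article itself works in R, so this is an analogous illustration rather than Wickham's own code.

```python
import pandas as pd

# Hypothetical wide table: one row per site, one column per year.
# The year is a variable, but here it is stored in the column headers.
wide = pd.DataFrame({
    "site": ["A", "B"],
    "2021": [10, 20],
    "2022": [15, 25],
})

# Reshape to long form: one row per (site, year) observation, so that
# every variable (site, year, cases) has its own column.
long = wide.melt(id_vars="site", var_name="year", value_name="cases")
long["year"] = long["year"].astype(int)

print(long)
```

In the long table each row is a single observation, which is the layout the tidy approach describes.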

Critical Perspective: In “Against Cleaning”, Katie Rawson and Trevor Muñoz discuss the impact of data cleaning and the need to scrutinize the bias and assumptions involved in the data cleaning process and the implications of words like “cleaning” and “messy.”
