Data cleaning is the process of identifying and correcting data that are inaccurate, missing, or incomplete. Common data cleaning tasks include removing duplicate records, investigating extreme values (outliers), converting dates from one format to another, removing unwanted text, splitting multiple data points in one cell into separate cells, and coding missing (NA) values consistently.
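Several of these tasks can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the sample records, the field names (`site`, `date`, `coords`), and the `clean` helper are all hypothetical, and the dates are assumed to arrive as MM/DD/YYYY. It removes exact duplicate records, converts dates to ISO 8601, and splits a combined cell into two values.

```python
from datetime import datetime

# Hypothetical sample records; the field names are illustrative only.
rows = [
    {"site": "A", "date": "03/14/2021", "coords": "41.5, -72.7"},
    {"site": "A", "date": "03/14/2021", "coords": "41.5, -72.7"},  # exact duplicate
    {"site": "B", "date": "11/02/2021", "coords": "40.7, -74.0"},
]


def clean(records):
    """Drop exact duplicates, normalize dates, and split the coords cell."""
    seen = set()
    cleaned = []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key in seen:  # skip records identical to one already kept
            continue
        seen.add(key)
        # Assumption: input dates are MM/DD/YYYY; output is ISO 8601 (YYYY-MM-DD).
        iso_date = datetime.strptime(rec["date"], "%m/%d/%Y").date().isoformat()
        # Split the combined "lat, lon" cell into two separate numeric values.
        lat, lon = (float(part) for part in rec["coords"].split(","))
        cleaned.append({"site": rec["site"], "date": iso_date, "lat": lat, "lon": lon})
    return cleaned


cleaned_rows = clean(rows)
```

Only records that match on every field are treated as duplicates here; near-duplicates, such as the same site spelled two ways, would need a separate and more deliberate rule.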
For example, a regular expression (regex) in OpenRefine can identify a pattern or sequence within text and remove or replace it (e.g., finding all “NA” values and replacing them with -999).
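The same idea can be sketched outside OpenRefine. The example below uses Python's standard `re` module rather than OpenRefine's own syntax; the `code_missing` helper and the sample cells are hypothetical. It assumes that only cells consisting entirely of "NA" (optionally padded with whitespace) should be recoded, so that text such as "NASA" is left alone.

```python
import re

# Matches a cell whose entire content is "NA", allowing surrounding whitespace.
NA_PATTERN = re.compile(r"^\s*NA\s*$")


def code_missing(value, sentinel="-999"):
    """Replace a whole-cell "NA" with the sentinel; leave other text unchanged."""
    return NA_PATTERN.sub(sentinel, value)


# Hypothetical column of raw cell values.
cells = ["12.5", "NA", " NA ", "NASA", "7"]
recoded = [code_missing(cell) for cell in cells]
```

Anchoring the pattern with `^` and `$` is a design choice: an unanchored search for "NA" would also rewrite any longer string that happens to contain those two letters.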
Hadley Wickham’s well-known article “Tidy Data” (2014) explains the tidy approach, in which each variable is a column, each observation is a row, and each type of observational unit is a table. This structure is the basis for “long” rather than “wide” data: repeated measurements are stacked in rows instead of being spread across columns.
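To make the wide versus long distinction concrete, here is a minimal sketch in plain Python. The table contents, the column names, and the `to_long` helper are hypothetical, chosen only to illustrate the reshaping; this is not code from the article.

```python
# Hypothetical "wide" table: one row per site, one column per year.
wide = [
    {"site": "A", "2019": 10, "2020": 12},
    {"site": "B", "2019": 7, "2020": 9},
]


def to_long(records, id_field, var_name, value_name):
    """Turn every non-identifier column into its own row (one observation each)."""
    long_rows = []
    for rec in records:
        for column, value in rec.items():
            if column == id_field:
                continue
            long_rows.append(
                {id_field: rec[id_field], var_name: column, value_name: value}
            )
    return long_rows


long_table = to_long(wide, id_field="site", var_name="year", value_name="count")
```

In the long form, "year" becomes a variable with its own column, and each site-year count occupies one row, which matches the tidy layout described above.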
Critical Perspective: In “Against Cleaning”, Katie Rawson and Trevor Muñoz discuss the impact of data cleaning and the need to scrutinize the bias and assumptions involved in the data cleaning process and the implications of words like “cleaning” and “messy.”