Best Practices for Working with Research Data
IN THIS CHAPTER
- Understanding the need for best practices
- Choosing best practices to follow
- Differing best practices for data management and data curation
- Adjusting best practices to fit the research
IN ART, MUSIC, WRITING, EVEN IN LIFE, THERE IS A SAYING, “You have to know the rules before you can break them.” This applies to data management as well. It is important to be aware of best practices for data management, and the discipline-specific variations, before trying to adjust a data management plan to fit with a particular workflow or reflect the time constraints of the researcher work cycle discussed in chapter 2. It is also a good idea to consider the reasons for data management when deciding on best practices. Researchers who want to find and understand the data collected by their students will have different needs from researchers who must contribute their data to a government repository. Researchers who must share their data to comply with a funding agency requirement will want to be sure it can be found and cited properly, and will want good documentation so the data is not misinterpreted.
Reasons for Best Practices
Most researchers probably consider best practices, especially guidelines from a subject repository they must use, to be just another set of hoops they must jump through to conduct their research. But adhering to best practices can help ensure consistent data that is easier for researchers to process and use for analysis and visualization, and later sharing. Best practices for long-term storage mean that data can be found and used by the researcher or anyone who wishes to conduct a meta-analysis or reanalyze the data. There are many reasons to adhere to some best practices while collecting and using data.
Good data documentation practices make it easier for researchers and everyone they work with to collect, find, understand, and analyze the data needed for their work. Standard practices also make a researcher less dependent on “lab members who may have graduated or moved on to other endeavors” (White et al., 2013: 13). Well-documented data saves time by being easier to locate and easier to clean up for analysis. It is also easier to share data with collaborators if it is properly documented. Time and money are spared when experiments do not need to be repeated due to lost or messy data. Having well-documented data makes it easy to prove results if there are questions about findings in publications. Also, data sharing increases reputation and reuse increases impact (see chapter 7), and good documentation (chapter 6) is necessary for reuse.
Following best practices can also avoid problems that might result in misconduct investigations. Several problems related to data have been noted in responsible conduct of research training offered through the Office of Research Integrity (ORI), of the U.S. Department of Health & Human Services (HHS):
- technical data not recorded properly
- technical data management not supervised by the principal investigator (PI)
- data not maintained at the institution
- financial or administrative data not maintained properly
- data not stored properly
- data not held in accordance with retention requirements
- data not retained by the institution (Boston College, 2015)
In the United States, most data generated by research that is funded by the federal government belongs to the institution where the funded researcher works (Blum, 2012). Some institutions have developed data policies that require the researcher to be a data steward or custodian, which means the researcher must know where data is stored and be able to present data if needed for potential intellectual property or misconduct questions, or freedom of information requests. Some institutional responsible conduct of research (RCR) policies or intellectual property (IP) policies also address data management issues, and state institutions might have added state policies covering the retention of data and other products of research.
For example, Johns Hopkins University Data Management Services recommends that researchers check the Data Ownership and Retention, Institutional Review Board, Intellectual Property, and Responsible Conduct of Research Policies, and includes highlights of these policies to consider when writing a data management plan (https://dmp.data.jhu.edu/resources/jhu-policies/). The provost at the University of Pittsburgh provides research data management guidelines (http://www.provost.pitt.edu/documents/RDM_Guidelines.pdf). And Columbia University has a section on data acquisition.
Funders around the world are realizing that requiring data sharing increases the value of the research they support. The U.S. Office of Science and Technology Policy (OSTP) memo (Stebbins, 2013) requires public access to data from research supported by about twenty federal agencies, not only so the data can be reused to produce new insights but also because such access “allows companies to focus resources and efforts on understanding and exploiting discoveries.”
Other federal governments have similar requirements, and private funders such as the Gates Foundation and Wellcome Trust now require data be shared publicly. Most of these funders are also asking for data management plans, and progress reports need to address adherence to the plan. Data sharing is expected by these funders, so data needs to be usable and clear to minimize misinterpretation.
REASONS FOR SHARING DATA
- reinforcing open scientific inquiry
- encouraging diversity of analysis and opinion
- promoting new research, testing of new or alternative hypotheses and methods of analysis
- supporting studies on data collection methods and measurement
- facilitating education of new researchers
- enabling the exploration of topics not envisioned by the initial investigators
- permitting the creation of new datasets by combining data from multiple sources. (NIH, 2003)
Many publishers are supporting efforts of various groups, societies, and government agencies to promote transparency and reproducibility of research by requiring authors to make data available and register clinical trials, for example, Nature (http://www.nature.com/sdata/data-policies) and PLOS (https://www.plos.org/plos-data-policy-faq/). Recent papers that attempt to reproduce research results or reanalyze shared data have shown that conclusions from original data were wrong, confirming the need for data sharing requirements. For example, Ben Goldacre (2015) describes how the reassessment of deworming trials data found analysis problems that resulted in an outcome that suggests there should be different recommendations for the use of deworming medicines. Force11 (https://www.force11.org/) has developed Transparency and Openness Promotion (TOP) Guidelines (Alter et al., 2015) for use by publishers to help facilitate more widespread adoption of standards by journals. The TOP Guidelines cover eight standards at three levels, allowing adoption of standards based on the discipline covered by the journal, and data transparency is one of the standards.
Subject repositories are used to make data available for sharing and provide long-term preservation, so they require well-documented, clean, consistent data and have specific guidelines for data that will be deposited. ICPSR’s Guide to Social Science Data Preparation and Archiving (https://www.icpsr.umich.edu/icpsrweb/content/deposit/guide/) has a section on preparing data for sharing and another on depositing data. GenBank (NCBI) has a couple of ways to deposit nucleotide sequences, BankIt (http://www.ncbi.nlm.nih.gov/WebSub/html/requirements.html) and Sequin (http://www.ncbi.nlm.nih.gov/Sequin/QuickGuide/sequin.htm#BeforeYouBegin), and each has specific documentation requirements for any accepted sequence deposits.
Overview of Basic Best Practices
There are some basic suggestions that are common to most best practice lists. These are repository independent and work for most subject areas. The researcher work cycle (shown in chapter 2, figure 2.1) is complex, so it is easier to look at best practices focusing just on the data life cycle. It is helpful to look at data management through the whole data cycle before starting to identify when actions or interventions might need to be taken (Corti et al., 2014). The sections below list these basic best practices arranged in the order they appear in a simplified data life cycle (see figure 3.1) based on the ICPSR (shown in chapter 2, figure 2.3) and DataONE (https://www.dataone.org/best-practices) data life cycles.
Even if it is not required for a grant, a data management plan can ensure consistent practice throughout a project and among all people involved in the project. Chapter 8 has more complete information on writing data management plans. The plan should include:
- Data backup policy for all data, not just digital data
- Assignment of responsibilities for data collection and data management upkeep through the life cycle
- Available storage, and the like, based on data sensitivity and local security classification
- Potential repositories or journals, so their policies can be considered
- Length of time the data must be stored after the project, based on funder requirements, institutional requirements, or applicable government requirements (e.g., state records management policy for state-funded institutions)
- Estimated budget for data collection, processing, storage, and so forth
Data collection needs to be easy and intuitive enough to fit the workflow of the research being done, but also thorough and accurate enough to be used for calculations, visualizations, and conclusions later.
- Define and list the types of data and file formats for the research, and use them consistently.
- Choose a naming convention for files and folders, and ensure that the rules are followed systematically by always including the same information in the same order.
- Avoid using:
  - Generic data file names that may conflict when moved from one location to another
  - Special characters
  - Periods or spaces
- Use a standardized data format. Check with subject repository or find disciplinary standards if data will be shared or reused.
- Reserve the three-letter file extensions for the codes the system assigns to the file type (e.g., WRL, CSV, TIF).
- Don’t rely on file names as your sole source of documentation.
- Ensure that raw data is preserved with no processing. Document steps used to clean data for analysis.
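The naming guidelines above can be checked mechanically. The following is a minimal sketch in Python, not a standard: the convention shown (project, site, collection date, and version, joined by underscores) and the list of generic names are assumptions chosen for illustration, and a real project would substitute the fields its own plan defines.

```python
import re
from datetime import date

# Hypothetical convention: project_site_YYYYMMDD_vNN.ext
# The fields and their order are illustrative assumptions, not a standard.
SAFE_PART = re.compile(r"^[A-Za-z0-9-]+$")

# Assumed examples of generic names that could conflict when files are moved.
GENERIC_NAMES = {"data", "results", "final", "untitled"}


def build_file_name(project: str, site: str, collected: date,
                    version: int, extension: str) -> str:
    """Assemble a name with the same information in the same order every time."""
    parts = [project, site, collected.strftime("%Y%m%d"), f"v{version:02d}"]
    for part in parts:
        if not SAFE_PART.match(part):
            raise ValueError(f"unsafe characters in name part: {part!r}")
    return "_".join(parts) + "." + extension.lower()


def find_name_problems(file_name: str) -> list[str]:
    """List the ways a file name departs from the guidelines above."""
    problems = []
    stem, dot, _extension = file_name.rpartition(".")
    if not dot:  # no extension at all, so the whole name is the stem
        stem = file_name
    if " " in file_name:
        problems.append("contains spaces")
    if "." in stem:
        problems.append("contains periods outside the extension")
    if re.search(r"[^A-Za-z0-9_. -]", file_name):
        problems.append("contains special characters")
    if stem.lower() in GENERIC_NAMES:
        problems.append("generic name")
    return problems
```

For instance, `build_file_name("soilcarbon", "plot12", date(2015, 8, 1), 3, "CSV")` produces a name that passes `find_name_problems`, while a name such as `data.csv` is flagged as generic. A check like this complements, but does not replace, separate documentation of what each file contains.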
Good data descriptions make it easier for researchers to reuse their own data in the future, as well as allowing others to replicate or repurpose the data.
- Readme files
  - General project files describing overall organization and models, responsible parties, instruments, and so forth
  - Specific files for the contents of data files that define parameters, contents, date and time formats, measurements, and so forth; anything that will help facilitate the use of the data
  - For analyzed or processed data, include descriptions and references to software or code used
- Choose a meaningful directory hierarchy/naming convention.
- Create a data dictionary that explains terms used in files and folders (e.g., units of measurement and how they are collected, standards, calibrations).
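As one possible illustration of the data dictionary suggested above, the sketch below stores the dictionary as a plain CSV file and checks a data file's column headers against it. The variable names, units, and descriptions are invented for an imaginary field study and are assumptions, as is the choice of columns for the dictionary itself.

```python
import csv
import io

# Columns of the dictionary file; this particular set is an assumption.
FIELDS = ["variable", "description", "units", "format", "missing_code"]

# Hypothetical entries for an imaginary field study.
DATA_DICTIONARY = [
    {"variable": "site_id", "description": "Sampling site code",
     "units": "", "format": "text", "missing_code": "NA"},
    {"variable": "collected_on", "description": "Date the sample was taken",
     "units": "", "format": "YYYY-MM-DD", "missing_code": "NA"},
    {"variable": "soil_temp", "description": "Soil temperature at 10 cm depth",
     "units": "degrees Celsius", "format": "decimal", "missing_code": "NA"},
]


def dictionary_to_csv(entries: list[dict]) -> str:
    """Serialize the data dictionary to CSV text in an open, readable format."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


def undocumented_columns(data_header: list[str], entries: list[dict]) -> list[str]:
    """Return the data columns that have no entry in the data dictionary."""
    documented = {entry["variable"] for entry in entries}
    return [column for column in data_header if column not in documented]
```

Running `undocumented_columns` on the header row of each new data file is one way to notice when a variable has been added without being described.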
Process and Analyze
Keeping track of all the steps needed to convert raw data to the figures and tables used in publication is important to ensure reproducibility and back up the results presented.
- Develop a quality assurance and quality control plan.
- Double-check data entry.
- Use consistent missing value coding.
- Document all analyses; include steps and complete software information where applicable.
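Consistent missing value coding is one of the easier steps to automate. The sketch below assumes that a project has agreed on a single code, `NA`, and that older files used a handful of other conventions; both the agreed code and the list of legacy codes are assumptions and would need to match what a given project actually used. It should be run on a working copy so that the raw data stays unprocessed.

```python
import csv
import io

MISSING_CODE = "NA"  # assumed: the single code the project has agreed on

# Assumed legacy conventions found in older files, compared in lower case.
LEGACY_CODES = {"", "n/a", "na", "null", "none", "-999", "-9999", "."}


def normalize_missing(value: str) -> str:
    """Map any legacy missing-value code to the agreed code."""
    cleaned = value.strip()
    if cleaned.lower() in LEGACY_CODES:
        return MISSING_CODE
    return cleaned


def normalize_rows(csv_text: str) -> str:
    """Return CSV text with the header kept as-is and data cells normalized."""
    reader = csv.reader(io.StringIO(csv_text))
    output = io.StringIO()
    writer = csv.writer(output, lineterminator="\n")
    header = next(reader, None)
    if header is not None:
        writer.writerow(header)
    for row in reader:
        writer.writerow([normalize_missing(cell) for cell in row])
    return output.getvalue()
```

Note that a sentinel such as `-999` is only safe to replace when it cannot be a real measurement, which is one reason the chosen codes belong in the data dictionary and the cleaning step belongs in the documented analysis record.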
Publish and Share
The need to deposit data in a recommended repository for journal publication, or make data publicly available in a repository to comply with a funder or institutional mandate, means that researchers will need to clean data up and add metadata as required by the repository they choose (see more in chapter 7 on sharing).
- Document all processing needed to ready data for sharing.
- Save data in formats that have lossless data compression and are supported by multiple software platforms (e.g., CSV, TXT).
- Add documentation that includes the basic information needed to make the data reusable. Formal metadata is not always necessary, as long as all the needed information is included in a readme file, data dictionary, or other open format documentation. Basic information fields include:
  - Persistent identifier
  - Access information (restrictions, embargo)
  - File names
  - File format
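When formal metadata is not required, the basic information fields listed above can still be checked before deposit. The following sketch writes them to a plain-text readme and refuses to do so if any are empty. The field keys, the sample record, and the placeholder identifier are all hypothetical; a repository's own required fields take precedence.

```python
# Keys mirror the basic information fields listed above; the key spelling
# is an assumption made for this sketch.
REQUIRED_FIELDS = [
    "persistent_identifier",
    "access_information",
    "file_names",
    "file_format",
]


def missing_fields(record: dict) -> list[str]:
    """Return the required fields that are absent or empty in the record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


def render_readme(record: dict) -> str:
    """Render the record as plain 'Label: value' lines for a readme file."""
    gaps = missing_fields(record)
    if gaps:
        raise ValueError("readme is incomplete, missing: " + ", ".join(gaps))
    lines = []
    for field in REQUIRED_FIELDS:
        label = field.replace("_", " ").capitalize()
        value = record[field]
        if isinstance(value, (list, tuple)):
            value = ", ".join(value)
        lines.append(f"{label}: {value}")
    return "\n".join(lines) + "\n"


# Hypothetical record; the identifier is a placeholder, not a real DOI.
EXAMPLE_RECORD = {
    "persistent_identifier": "doi:10.0000/example",
    "access_information": "Embargoed until publication, then open",
    "file_names": ["soilcarbon_plot12_20150801_v03.csv"],
    "file_format": "CSV",
}
```

Calling `render_readme(EXAMPLE_RECORD)` returns four labeled lines that can be saved alongside the data files.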
Data needs ongoing storage and backups while experiments are being conducted. And once the research project is finished, data needs to be stored according to funder and institutional policies (see more on long-term storage in chapter 5).
- Keep three copies of your data: original; another location on-site; an off-site/remote location.
- Back up data, and copies, at regular intervals. Hard-to-replicate data should be backed up daily.
- Try to keep data uncompressed, and unencrypted unless encryption is necessary.
- Use a reliable device for backups. Managed network drives or cloud storage will also be backed up, providing more insurance against loss.
- Ensure backup copies are identical to the original copy.
- Store final data using stable file formats.
- Refer to funder or publisher policies as well as institutional data and intellectual property policies and federal and state laws for duration of storage for data retention.
- Document policy for data destruction. This can be based on the institutional records management policy procedures and forms for disposal of materials.
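One common way to confirm that backup copies are identical to the original is to compare checksums. The sketch below uses SHA-256 from the Python standard library; the choice of algorithm and the function names are assumptions, and the directory paths passed in would be whatever locations a project's plan names for its original and backup copies.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_copies(original_dir: Path, backup_dir: Path) -> dict[str, str]:
    """Report files that are missing from the backup or differ from the original.

    Returns a mapping of relative path to a short description of the problem;
    an empty mapping means every original file has an identical backup copy.
    """
    problems = {}
    for source in sorted(original_dir.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(original_dir)
        copy = backup_dir / relative
        if not copy.is_file():
            problems[relative.as_posix()] = "missing from backup"
        elif sha256_of(source) != sha256_of(copy):
            problems[relative.as_posix()] = "contents differ"
    return problems
```

The check runs in one direction only: it does not report extra files that exist in the backup but not in the original. Running it after each scheduled backup, once for the on-site copy and once for the off-site copy, gives a simple record that all three copies agree.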
Even these basic best practices are not set in stone. Researchers will need more, and sometimes less, than is listed here, depending on their project, their funder, and the institution. Librarians working with researchers can help by being aware of institutional, subject, and funder requirements pertaining to data documentation, and providing templates and guidelines to help researchers comply. The section that follows, “Available Best Practices,” provides a listing of best practices suggested by some of the major repositories and data management groups.
Available Best Practices
Suggestions for best practices have come from many groups. Articles by groups of scientists encourage their peers to use some basic rules to make their data more usable. RDM services based in libraries offer lists of suggestions for documenting data. Usually repositories have more formal lists of requirements for deposit, be they subject or government repositories.
More and more libraries are developing data management services, and providing best practices guidance is one of the RDM services, other than DMP support, that many libraries provide (Fearon et al., 2013). Some examples of best practices for data management services can be found at the following locations:
- Data Management Services at Stanford University Libraries http://library.stanford.edu/research/data-management-services/data-best-practices
- Research Data Management Services Group at Cornell University http://data.research.cornell.edu/content/best-practices
- Research Data Management Services at the University of Oregon https://library.uoregon.edu/datamanagement/guidelines.html
Institutional repositories may or may not allow data deposit, and some, like Purdue University, have a separate data repository. Requirements vary, although in general, open formats are encouraged. Size limits may be in place for data sets.
- The Purdue University Research Repository (PURR) provides a collaborative working space as well as a data sharing platform. File formats for preservation are recommended: https://purr.purdue.edu/legal/file-format-recommendations.
- eCommons at Cornell University has size and format requirements in their Data Deposit Policy: https://ecommons.cornell.edu/policy.html#data.
- ResearchWorks at University Libraries, University of Washington, will take most digital formats but has a list of suggested formats that will retain functionality over time: http://digital.lib.washington.edu/preferred-formats.html.
Subject repositories encourage and advertise reuse, so there are usually more requirements for properly formatted and documented data. In some cases, the forms that must be filled out during the deposit process are converted into the required metadata. Chapter 7 will discuss sharing in depth and include a list of repositories, but a couple of examples of repository requirements include the following:
- The Inter-university Consortium for Political and Social Research (ICPSR) supports data deposit and reuse, and provides educational materials for instructors who wish to teach with the poll, census, or research data sets in the collection. ICPSR has recommended elements listed in their “Guide to Social Science Data Preparation and Archiving” (ICPSR, 2012).
- Crystallography Open Database provides information on the page “Advice to Potential CIF Donators: Fair Practices” (http://www.crystallography.net/cod/donators/advices.html).
General repositories usually have fewer requirements or guidelines for data deposit, but learning about any discipline-specific standards will help make the data more usable in the future. Dryad and Figshare are well-known repositories that accept data in all subject areas, and they are used by many journals as the repository for supporting data, or at least one of the accepted repositories.
- Dryad does not have file format restrictions but encourages submission in open formats such as ASCII and HTML so preservation and reuse are easier. There is, however, a list of suggestions to help optimize the data that is submitted (https://datadryad.org/pages/faq).
- Figshare does have a list of file types supported, but the list is extensive (https://figshare.zendesk.com/hc/en-us/articles/203993533-Supported-file-types), and when data is deposited, there is a form that guides the researcher to include categories, tags, and descriptions, to help findability.
Best practices can also be required by funding agencies when deposit into an agency repository is required. U.S. federal initiatives for open government data have mandated that agencies make their data available to the public, including both research data and data about their activities (The White House, 2009).
- The Oak Ridge National Laboratory Distributed Active Archive Center (ORNL DAAC) curates biogeochemical data collected with NASA-funded research and has prepared a best practices guide (https://daac.ornl.gov/PI/BestPractices-2010.pdf) for those who must deposit data.
- The National Centers for Environmental Information, part of the National Oceanic and Atmospheric Administration (NOAA), collects, processes, preserves, and shares oceanographic data from around the world. This data is very important for many agencies and researchers, so there are many guidelines that must be followed (http://www.ncddc.noaa.gov/activities/science-technology/data-management/). In addition to a best practices guide (http://service.ncddc.noaa.gov/rdn/www/media/documents/activities/community-best-practices.pdf), there are lists of policies, plans, and examples, and the Data Management Resources section has links to resources outside of NOAA that are also helpful for researchers depositing their data.
The sensitive nature of health data that includes personal information requires another layer of best practices to ensure confidentiality. For example, there have been recent cases of DNA data that hackers have been able to use to identify the patients in a study (Hayden, 2013), so restrictions on data publication will need to be considered, or more secure protocols for anonymization need to be developed before sharing data. The Office for Civil Rights, in the Department of Health & Human Services (HHS), provides guidance on the deidentification of patient data (http://www.hhs.gov/ocr/privacy/hipaa/understanding/coveredentities/De-identification/guidance.html); further information on working with human subjects data can be found in chapter 7.
The Slow Creep of Best Practice Usage
A panel session at the 10th International Digital Curation Conference (IDCC15) on the topic of “Why is it taking so long?” echoed the question that comes up at many meetings of research data managers. The international panel debated whether the RDM culture change really is taking a long time, or whether good progress actually is being made (Cope, 2015). The issues include the current practice of giving funds to individual projects for their data management needs, as opposed to contributing to institutions for the development of infrastructures that support the whole research community with short- and long-term data management and storage needs. Amy Hodge suggested that data managers and librarians should ask researchers what they need, rather than telling them what the institution has for them (Digital Curation Center, 2015).
In the end, it doesn’t matter how well best practices for data management support necessary curation and long-term preservation if researchers find them too cumbersome to follow. Data managers should recognize, as records professionals have, that they may be part of the problem when they make unrealistic or constraining demands (McLeod, Childs, and Hardiman, 2011). It is better to create a series of steps that lead researchers to best practices gently, based on what they must do to get funded. As they recognize the benefits of having clean, organized, shared data, it will become easier to make those practices more robust.
Following a few guidelines will make it easier for researchers to find, share, and use the data they collect.
- Planning through the data life cycle before research starts makes the whole process easier.
- Check for specialized deposit requirements before starting data collection.
- Work with researchers to develop realistic plans for data documentation.
The next chapter will cover the interview process, based on the reference interview, that can be used to learn about research workflows and data management needs.
Alter, George, George C. Banks, Denny Borsboom, et al. 2015. “Transparency and Openness Promotion (TOP) Guidelines.” Open Science Framework. June 27. https://osf.io/9f6gx.
Blum, Carol. 2012. Access to, Sharing and Retention of Research Data: Rights and Responsibilities. Washington, D.C.: Council on Governmental Relations. http://www.cogr.edu/COGR/files/ccLibraryFiles/Filename/000000000024/access_to_sharing_and_retention_of_research_data-_rights_&_responsibilities.pdf.
Boston College. 2015. “Examples of Problems.” Administrators and the Responsible Conduct of Research. Office for Research Integrity and Compliance, via Office of Research Integrity, U.S. Department of Health & Human Services. Accessed August 1, 2015. https://ori.hhs.gov/education/products/rcradmin/topics/data/tutorial_12.shtml.
Cope, Jez. 2015. “International Digital Curation Conference 2015.” Open Access and Digital Scholarship Blog. March 12. http://www.imperial.ac.uk/blog/openaccess/2015/03/12/international-digital-curation-conference-2015/.
Corti, Louise, Veerle Van den Eynden, Libby Bishop, and Matthew Wollard. 2014. Managing and Sharing Research Data: A Guide to Good Practice. London: Sage.
Digital Curation Center. 2015. “IDCC15: Why Is It Taking So Long?” YouTube video. 1:02:12. February 13. https://www.youtube.com/watch?v=2M6v7d2VdYo.
Fearon, David, Betsy Gunia, Sherry Lake, et al. 2013. SPEC Kit 334: Research Data Management Services. Washington, D.C.: Association of Research Libraries, Office of Management Services.
Goldacre, Ben. 2015. “Scientists Are Hoarding Data and It’s Ruining Medical Research.” BuzzFeed, August 3. http://www.buzzfeed.com/bengoldacre/deworming-trials.
Hayden, Erika C. 2013. “Privacy Protections: The Genome Hacker.” Nature 497 (7448): 172–74. doi:10.1038/497172a.
ICPSR (Inter-university Consortium for Political and Social Research). 2012. Guide to Social Science Data Preparation and Archiving: Best Practice throughout the Data Life Cycle. 5th ed. Ann Arbor, Mich.: ICPSR. http://www.icpsr.umich.edu/files/ICPSR/access/dataprep.pdf.
McLeod, Julie, Sue Childs, and Rachel Hardiman. 2011. “Accelerating Positive Change in Electronic Records Management—Headline Findings from a Major Research Project.” Archives & Manuscripts 39, no. 2: 66–94.
NIH (National Institutes of Health). 2003. “Grants and Funding. Frequently Asked Questions. Data Sharing.” National Institutes of Health. http://grants1.nih.gov/grants/policy/data_sharing/data_sharing_faqs.htm.
Stebbins, Michael. 2013. “Expanding Public Access to the Results of Federally Funded Research.” Office of Science and Technology Policy Blog. February 22. http://www.whitehouse.gov/blog/2013/02/22/expanding-public-access-results-federally-funded-research.
White, Ethan P., Elita Baldridge, Zachary T. Brym, et al. 2013. “Nine Simple Ways to Make It Easier to (Re)Use Your Data.” PeerJ PrePrints 1: e7v2. https://dx.doi.org/10.7287/peerj.preprints.7v2.
The White House. 2009. “Transparency and Open Government Memorandum (Memorandum to the Heads of Executive Departments and Agencies).” 74 Fed. Reg. 4,685. January 21. http://www.gpo.gov/fdsys/pkg/FR-2009-01-26/pdf/E9-1777.pdf.
Copyright © 2016. Rowman & Littlefield Publishers. All rights reserved.
Henderson, Margaret E.. Data Management : A Practical Guide for Librarians, Rowman & Littlefield Publishers, 2016. ProQuest Ebook Central, http://ebookcentral.proquest.com/lib/utah/detail.action?docID=4717216. Created from utah on 2018-09-14 14:17:01.