Data Management

What constitutes research data?

"Material or information on which an argument, theory, test or hypothesis, or another research output is based."

"What constitutes such data will be determined by the community of interest through the process of peer review and program management. This may include, but is not limited to: data, publications, samples, physical collections, software and models"

"Units of information created in the course of research"

"(i) Research data is defined as the recorded factual material commonly accepted in the scientific community as necessary to validate research findings, but not any of the following: preliminary analyses, drafts of scientific papers, plans for future research, peer reviews, or communications with colleagues."

Queensland University of Technology. Manual of Procedures and Policies. Section 2.8.3. http://www.mopp.qut.edu.au/D/D_02_08.jsp
Marieke Guy. http://www.slideshare.net/MariekeGuy/bridging-the-gap-between-researchers-and-research-data-management , #2
https://www.nsf.gov/bfa/dias/policy/dmpfaqs.jsp
OMB Circular A-110, Subpart C, section 36, (d) (i), http://www.whitehouse.gov/omb/circulars_a110/

Material or information necessary to come to your conclusion

Getting data

Collection
Experiment
Observation
Simulation
Compilation/Derivation
Reference data
Methods
What was done
How it was done
Instrumentation
Limitations

Being data

Stage
Raw
Cleaned
Processed
Analyzed
Visualized
Form
Textual
Numeric
Audio
Video
Image
Code

Forms of data

Non-digital text (lab books, field notebooks)

Digital texts or digital copies of text

Spreadsheets

Audio, video

Computer Aided Design/CAD

Statistical analysis (SPSS, SAS)

Databases

Geographic Information Systems (GIS) and spatial data

Digital copies of images

Web files

Scientific sample collections

Matlab files & 3D Models

Metadata & Paradata

Data visualizations

Computer code

Standard operating procedures and protocols

Protein or genetic sequences

Artistic products

Curriculum materials

Collection of digital objects acquired and generated during research

Adapted from: Georgia Tech, http://libguides.gatech.edu/content.php?pid=123776&sid=3067221

Why share data?

  • Ensure reproducibility
  • Promote discovery
  • Synthesis for data mining
  • Citation
  • Correct credit
  • Required!
http://dx.doi.org/10.1890/1051-0761(1997)007%5B0330:NMFTES%5D2.0.CO;2
http://www.nature.com/news/scientists-losing-data-at-a-rapid-rate-1.14416

What is data management?

Considerations

What data do you expect to have?

How will you describe/document your data?

How will you store it?

What are your obligations for these data (storage/security)?

How will you expose these data?

How will you preserve these data?

What data do you expect to have?

What tools are you using for your data?

How much data?

How will you describe/document your data?

Write a data abstract

Provide a data dictionary

How will you store it?

Multiple locations: Here, near, and far

How will it be maintained (during and after)?

Version Control
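
One way to confirm that the "here, near, and far" copies still hold the same content is to compare checksums. The sketch below is only an illustration using Python's standard library; the file names and the helper functions sha256_of and copies_match are hypothetical, and the three "locations" are simulated inside one temporary directory.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(paths):
    """True if every file in paths has identical content (same digest)."""
    return len({sha256_of(p) for p in paths}) == 1


# Demo: three copies stand in for local, on-site, and off-site storage.
with tempfile.TemporaryDirectory() as tmp:
    here = os.path.join(tmp, "here.csv")
    near = os.path.join(tmp, "near.csv")
    far = os.path.join(tmp, "far.csv")
    for p in (here, near, far):
        with open(p, "w", encoding="utf-8") as f:
            f.write("site,temp_c\nA,12.5\n")
    all_match = copies_match([here, near, far])

    # Simulate one copy drifting out of sync.
    with open(far, "a", encoding="utf-8") as f:
        f.write("B,99\n")
    after_change = copies_match([here, near, far])
```

A mismatch only says that the copies differ; deciding which copy is correct still depends on version control or a recorded reference checksum.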

What are your obligations for these data?

Security
Physical
Network
Computer system and file
Privacy
Personally identifying info (PII)
Personal health info (PHI)
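
As a rough illustration of handling PII before sharing, the sketch below drops direct identifiers and replaces a participant ID with a keyed hash. The column names, the key, and the pseudonymize helper are all hypothetical. This is pseudonymization only; it does not by itself make a dataset anonymous or satisfy any particular privacy regulation.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be stored apart from the data.
SECRET_KEY = b"replace-with-a-secret-kept-off-the-shared-drive"

# Hypothetical names of columns that identify a person directly.
DIRECT_IDENTIFIERS = {"name", "email"}


def pseudonymize(record, id_field="participant_id"):
    """Drop direct identifiers and replace the ID with a keyed hash token."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    token = hmac.new(
        SECRET_KEY, str(record[id_field]).encode("utf-8"), hashlib.sha256
    ).hexdigest()
    cleaned[id_field] = token[:16]  # shortened token, still stable per ID
    return cleaned


raw = {
    "participant_id": "P001",
    "name": "Jane Doe",
    "email": "jane@example.org",
    "score": "42",
}
shared = pseudonymize(raw)
```

The same ID always maps to the same token, so records can still be linked across tables without exposing the original identifier.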

How will you expose these data?

What will be shared?

Who is the audience?

Is the data citable?

Persistent access

Quality assurance

How will you preserve it?

Don't archive in a proprietary format!

Sustainable formats:
Unencrypted
Uncompressed
Open standard

ASCII, PDF, .csv, FLAC, TIFF, JPEG2000, MPEG-4, XML, RDF, .txt, .r
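
A quick check of a data package against the formats above might look like the following sketch. The extension set is an assumed mapping from the listed formats to common file extensions (for example .jp2 for JPEG2000 and .mp4 for MPEG-4), and the file names are made up for illustration.

```python
import os

# Assumed mapping of the listed formats to common file extensions.
OPEN_EXTENSIONS = {
    ".txt", ".csv", ".pdf", ".flac", ".tif", ".tiff",
    ".jp2", ".mp4", ".xml", ".rdf", ".r",
}


def needs_conversion(filenames):
    """Return the files whose extension is not in the open-format set."""
    flagged = []
    for name in filenames:
        ext = os.path.splitext(name)[1].lower()
        if ext not in OPEN_EXTENSIONS:
            flagged.append(name)
    return flagged


# Hypothetical package contents.
package = ["readme.txt", "survey.csv", "survey.sav", "analysis.R", "figures.psd"]
flagged = needs_conversion(package)
```

An extension check is only a first pass: it cannot tell whether a file is encrypted or compressed internally.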

Domain specific repositories

Institutional repositories

Best practices

What are the data structure standards for your discipline?

Standard ways to label fields

Specific variables and coding guidelines

Accepted hierarchies and directory structures

Structure

  1. Use one variable per column
  2. Make one observation per row
  3. Use human-readable column names
  4. Include one table per tab
  5. If using multiple related tables, use an ID or key to indicate how the tables are related
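
The structure rules above can be sketched with two small tables linked by a key. The table contents and the column names (site_id, temp_c, and so on) are invented for illustration; the point is only that each column holds one variable, each row one observation, and site_id relates the two tables.

```python
import csv
import io

# One table per file (or tab): sites, one row per site.
SITES_CSV = """site_id,site_name,elevation_m
S1,North Ridge,1200
S2,River Bend,300
"""

# Observations, one row per observation; site_id is the key into sites.
OBSERVATIONS_CSV = """obs_id,site_id,date,temp_c
1,S1,2014-06-01,11.5
2,S2,2014-06-01,19.0
3,S1,2014-06-02,12.1
"""

# Index the sites table by its key.
sites = {row["site_id"]: row for row in csv.DictReader(io.StringIO(SITES_CSV))}
observations = list(csv.DictReader(io.StringIO(OBSERVATIONS_CSV)))

# Join each observation to its site through the shared key.
joined = [
    {**obs, "site_name": sites[obs["site_id"]]["site_name"]}
    for obs in observations
]
```

Keeping site details in their own table means a correction to a site name is made once rather than in every observation row.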

Context

Include a readme text file with the following:

Abstract
Describe why the data has been collected
Content
List and describe the files in your data package
Basic Data Dictionary
List and describe the variables in the file
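
A readme with these three parts can be assembled by hand or generated. The sketch below is one possible layout, not a required format; build_readme, the file names, and the variables are hypothetical.

```python
def build_readme(abstract, files, variables):
    """Assemble a plain-text readme: abstract, content list, data dictionary.

    files is a list of (filename, description) pairs.
    variables is a list of (name, unit, description) triples.
    """
    lines = ["Abstract", abstract, "", "Content"]
    for filename, description in files:
        lines.append(f"{filename}: {description}")
    lines += ["", "Basic Data Dictionary"]
    for name, unit, description in variables:
        lines.append(f"{name} ({unit}): {description}")
    return "\n".join(lines) + "\n"


readme_text = build_readme(
    abstract="Air temperature collected to study elevation effects.",
    files=[
        ("sites.csv", "One row per sampling site"),
        ("observations.csv", "One row per temperature reading"),
    ],
    variables=[
        ("site_id", "no unit", "Key linking observations to sites"),
        ("temp_c", "degrees Celsius", "Air temperature at time of reading"),
    ],
)
```

The resulting text can be saved as readme.txt alongside the data files.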

Other best practices

Do:
Consider how to represent your NULL values
Consider whether a more robust data dictionary is required (e.g. with in-depth description of methods, instruments, models, etc.)
Do not:
Use formatting to convey information
Place comments in cells
Use special characters in field names
educopia.org ETD guidance
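
Two of the points above, a consistent NULL representation and field names free of special characters, can be sketched as small cleaning helpers. The list of tokens treated as missing and the choice of "NA" as the marker are assumptions for illustration; the right choices depend on the discipline and the tools in use.

```python
import re

# Hypothetical tokens that different collaborators used for "missing".
NULL_TOKENS = {"", "NA", "N/A", "null", "-999"}


def clean_field_name(name):
    """Lowercase a field name and replace runs of special characters with _."""
    name = name.strip().lower()
    name = re.sub(r"[^a-z0-9]+", "_", name)
    return name.strip("_")


def normalize_null(value, null_marker="NA"):
    """Map any recognized missing-value token to a single marker."""
    return null_marker if value.strip() in NULL_TOKENS else value


header = [clean_field_name(h) for h in ["Site Name #", "Temp (°C)"]]
row = [normalize_null(v) for v in ["River Bend", "-999"]]
```

Whatever marker is chosen, it should be documented in the data dictionary so that a missing value is never mistaken for a real measurement.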

Data management plans

"Beginning January 18, 2011, proposals submitted to NSF must include a supplementary document of no more than two pages labeled "Data Management Plan" (DMP). This supplementary document should describe how the proposal will conform to NSF policy on the dissemination and sharing of research results. Proposals that do not include a DMP will not be able to be submitted."
https://www.nsf.gov/eng/general/dmp.jsp

THE END

Thanks to Amy Nurnberger, Columbia University