Every dataset in openESM passes through a structured curation pipeline before it becomes available. This page explains each stage (what happens, where it happens, and what it produces).


Stage 1: Submission

Datasets enter the pipeline in one of two ways: we identify them through literature review, or researchers submit them by opening an issue at openesm-project/openesm-metadata using one of the provided issue templates (new dataset submission or metadata update). The dataset is then registered in our curation tracking sheet with basic provenance information and assigned a unique four-digit ID (e.g. 0042). Registration is currently performed by an internal team member; we plan to automate it in the future.


Stage 2: Metadata collection

We populate dataset-level metadata in our curation sheet: citation, sample size, country, study design, assessment frequency, assessment duration, number of items, language, and license. We also locate the codebook and any cleaning or scoring code associated with the original study. Each dataset gets its own tab in the curation sheet where variable-level metadata is annotated in Stage 4.
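
For illustration, a dataset-level record with these fields might look as follows. This is a minimal sketch: the field names here are assumptions, and the curation sheet and the openESM metadata schema are authoritative.

# Illustrative dataset-level metadata record (field names are assumptions;
# the curation sheet and metadata schema define the real ones).
library(jsonlite)

dataset_meta <- list(
  id            = "0042",
  citation      = "Author et al. (2024). An example ESM study.",  # placeholder
  sample_size   = 150,
  country       = "NL",
  design        = "experience sampling",
  n_beeps_day   = 10,    # assessment frequency
  duration_days = 14,    # assessment duration
  n_items       = 25,
  language      = "en",
  license       = "CC BY 4.0"
)

toJSON(dataset_meta, auto_unbox = TRUE, pretty = TRUE)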


Stage 3: Data cleaning & harmonization

This is the core harmonization step. A dedicated R cleaning script is written for each dataset and stored in openesm-cleaning. The script reads variable-level metadata directly from the curation sheet at runtime via googlesheets4, applies all harmonization steps, and produces two outputs: a cleaned data file in TSV format, and a structured metadata file in JSON format stored in data/metadata/.

All harmonization is applied within the cleaning script; the raw data themselves are never modified. The script is therefore the complete, reproducible record of every transformation applied to a dataset.

After the script runs, the output JSON can be validated against the openESM metadata schema via validate_metadata_json().
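
A cleaning script might follow the skeleton below. This is a sketch only: the sheet ID, file paths, and the harmonize() and build_metadata() helpers are placeholders, while validate_metadata_json() is the validation function mentioned above.

# Skeleton of a per-dataset cleaning script (illustrative; the real
# scripts in openesm-cleaning differ in their details).
library(googlesheets4)
library(readr)
library(jsonlite)

sheet_id <- "<curation-sheet-id>"  # placeholder

# Read variable-level metadata from the dataset's tab at runtime.
var_meta <- read_sheet(sheet_id, sheet = "0042")

# Read the raw data (left untouched on disk) and apply all
# harmonization steps; harmonize() stands in for those steps.
raw   <- read_tsv("raw/0042.tsv")
clean <- harmonize(raw, var_meta)

# Output 1: the cleaned data file in TSV format.
write_tsv(clean, "data/0042.tsv")

# Output 2: the structured metadata file in JSON format;
# build_metadata() stands in for assembling the schema-conformant list.
write_json(build_metadata(var_meta), "data/metadata/0042.json",
           auto_unbox = TRUE, pretty = TRUE)

# Validate the JSON against the openESM metadata schema.
validate_metadata_json("data/metadata/0042.json")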


Stage 4: Variable-level metadata annotation

For each variable, we annotate structured metadata in the curation sheet (see our Data Documentation for more information). Because the cleaning script reads this annotation from the sheet at runtime, annotation must be complete before the script's final run. The annotation is embedded in the output JSON, so the variable browser on each dataset page is generated entirely from the metadata file.
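
As a concrete example, the annotation for a single variable could be embedded in the output JSON along these lines. The field names are hypothetical; the Data Documentation defines the real ones.

# One variable's annotation as it might appear in the output JSON
# (field names are hypothetical; see the Data Documentation).
library(jsonlite)

variable <- list(
  name      = "pa_happy",
  label     = "Right now, I feel happy",
  construct = "positive affect",  # entry in the construct hierarchy
  scale_min = 1,                  # scale metadata
  scale_max = 7
)

toJSON(variable, auto_unbox = TRUE, pretty = TRUE)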


Stage 5: Push to openesm-cleaning main

The finalized JSON in data/metadata/ is committed and pushed to the main branch of openesm-cleaning. This is a required step: Stage 7 fetches metadata directly from this branch via the GitHub API, so local-only files are invisible to it.
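
From R, the commit and push can be scripted with the gert package; the path and commit message below are illustrative, and plain git works just as well.

# Commit the finalized metadata JSON and push it to main
# (sketch using gert; not the actual workflow script).
library(gert)

git_add("data/metadata/0042.json")
git_commit("Add validated metadata for dataset 0042")
git_push()  # must land on main: Stage 7 reads from there via the GitHub API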


Stage 6: Zenodo upload

The cleaned TSV and codebook are uploaded to the openESM Zenodo collection, where they receive a permanent DOI. The DOI is then added back to the metadata JSON, which is regenerated and pushed to openesm-cleaning main again.
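
The upload can also be scripted against the Zenodo REST API. The sketch below uses httr2 with Zenodo's documented deposition endpoints; the token and file paths are placeholders, and the codebook upload and publish step would follow the same pattern.

# Create a Zenodo deposition and upload the cleaned TSV to its file
# bucket (sketch against the documented Zenodo REST API).
library(httr2)

token <- Sys.getenv("ZENODO_TOKEN")  # personal access token (placeholder)

# Create an empty deposition; the response pre-reserves a DOI.
dep <- request("https://zenodo.org/api/deposit/depositions") |>
  req_auth_bearer_token(token) |>
  req_body_raw("{}", type = "application/json") |>
  req_perform() |>
  resp_body_json()

# Upload the cleaned TSV into the deposition's file bucket.
request(paste0(dep$links$bucket, "/0042.tsv")) |>
  req_method("PUT") |>
  req_auth_bearer_token(token) |>
  req_body_file("data/0042.tsv") |>
  req_perform()

# The pre-reserved DOI that is written back into the metadata JSON.
doi <- dep$metadata$prereserve_doi$doi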


Stage 7: Metadata aggregation in openesm-metadata

With the JSON live on openesm-cleaning main, we run two scripts in openesm-metadata. The first fetches all per-dataset JSON files from openesm-cleaning via the GitHub API and places them into versioned dataset folders. The second aggregates the individual JSONs into a single datasets.json file that the website consumes. The output of both steps is committed to openesm-metadata. The process is initiated by opening an issue in openesm-metadata using the provided templates, which include a maintainer checklist for each case.
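
A minimal version of the fetch-and-aggregate logic, using the gh package: this assumes openesm-cleaning lives under the openesm-project organization, and compresses the two scripts into one sketch.

# Fetch all per-dataset JSON files from openesm-cleaning via the
# GitHub API and bundle them into a single datasets.json (sketch).
library(gh)
library(jsonlite)

files <- gh("GET /repos/{owner}/{repo}/contents/{path}",
            owner = "openesm-project",
            repo  = "openesm-cleaning",
            path  = "data/metadata")

datasets <- lapply(files, function(f) fromJSON(f$download_url))
write_json(datasets, "datasets.json", auto_unbox = TRUE, pretty = TRUE)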


Stage 8: Release & automated website update

A new versioned release is created in openesm-metadata following semantic versioning (v1.X.0 for new datasets, v1.X.Y for corrections). This release triggers two automated processes:

  1. The openesm-metadata repository is synced to its linked Zenodo record, creating a citable, versioned snapshot of the full metadata database.
  2. A GitHub Action copies the updated JSON files to the openesm website repository and runs three Node.js scripts that regenerate the dataset pages, the dataset table, and the search index. The result is committed and GitHub Pages deploys automatically.

The dataset is then live at openesmdata.org and accessible via the openESM R and Python packages.
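
For programmatic access without the packages, the aggregated file can also be read directly. The URL below is an assumed location of datasets.json in openesm-metadata, shown for illustration only; the R and Python packages wrap this kind of access.

# Read the aggregated metadata file directly (the URL is an assumed
# location of datasets.json, not a documented endpoint).
library(jsonlite)

datasets <- fromJSON(
  "https://raw.githubusercontent.com/openesm-project/openesm-metadata/main/datasets.json",
  simplifyVector = FALSE
)

length(datasets)        # number of curated datasets
datasets[[1]]$citation  # dataset-level fields from Stage 2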


Pipeline overview

flowchart TD
    A(["Submission via\nGitHub issue (new dataset or metadata update)\nor literature review"]) --> B[Registration & ID assignment\ncuration sheet]
    B --> C[Dataset-level metadata\ncuration sheet]
    C --> D[Variable-level annotation\ncuration sheet · construct hierarchy · scale metadata]
    D --> E[Cleaning script\nopenesm-cleaning · reads sheet via googlesheets4\noutputs TSV + validated JSON]
    E --> F[Push JSON to\nopenesm-cleaning main]
    F --> G[Zenodo upload\nTSV + codebook · DOI minted\nDOI written back to JSON]
    G --> F
    F --> H[Metadata aggregation\nopenesm-metadata · fetches JSON via GitHub API\nbundles into datasets.json]
    H --> I[Versioned release\nopenesm-metadata]
    I --> J[Zenodo snapshot\nmetadata database]
    I --> K[GitHub Action\ncopies JSON · regenerates pages,\ntable & search index]
    K --> L([Live on openesmdata.org\naccessible via R & Python])

    style A fill:#085AB3,color:#fff,stroke:none
    style L fill:#085AB3,color:#fff,stroke:none
    style E fill:#E78A00,color:#fff,stroke:none
    style D fill:#E78A00,color:#fff,stroke:none