data release conventions #4

@bwalsh

data releases

use case

As an engineer, I need to know the sources, provenance, and locations of all data in a predictable manner. I need to store all of the above in a cold storage archive. The archive should be discoverable: I should be able to identify all related files and then know how to parse and load them into an active database.

MUSTS

  • all data is stored as ndjson, with a homogeneous record type per file
  • all files are named with the pattern *.OBJECT_LABEL.ndjson.gz
  • there will be a manifest file, manifest.yaml, in the same directory
    • File listing including:
      • MD5
  • stored on a file system, in a web directory, or in S3
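A minimal sketch of the first two MUSTS, using only the Python standard library. The helper names (`object_label`, `write_ndjson_gz`, `read_ndjson_gz`) are hypothetical. The regular expression is also an assumption: it accepts either `.` (as in the stated pattern) or `-` (as in the example file names) as the separator before the label.

```python
import gzip
import json
import re
from pathlib import Path

# Assumption: the label is the last name component before ".ndjson.gz",
# separated by "." (stated pattern) or "-" (example file names).
_LABEL_RE = re.compile(r"[.-](?P<label>[A-Za-z0-9_]+)\.ndjson\.gz$")


def object_label(path):
    """Return the OBJECT_LABEL encoded in a file name, or None if it does not match."""
    match = _LABEL_RE.search(Path(path).name)
    return match.group("label") if match else None


def write_ndjson_gz(path, records):
    """Write JSON-serializable records, one per line, gzip-compressed."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


def read_ndjson_gz(path):
    """Yield one parsed record per non-empty line."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)
```

A loader could call `object_label` on each file name to choose the target table or collection, then stream records from `read_ndjson_gz` without holding the whole file in memory.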

SHOULDS

  • File listing including:
    • provenance metadata, see https://github.com/DLR-SC/gitlab2prov

EXAMPLE

├── README.md
├── file-Patient.ndjson.gz
├── file-Specimen.ndjson.gz
├── file-Task.ndjson.gz
└── sub-dir
    ├── file-DocumentReference.ndjson.gz
    ├── file-Observation.ndjson.gz
    └── file-Compound.ndjson.gz

"An iceberg's calf"

Would have a manifest.yaml


id: unique
name: 
author: email
version: semantic
related-to:

tags: []
schema:
    - url: http://some-publically-readable-url
      # embedded copy
      data: {}
source:
    # all files extracted from this source
    - url:
    # with this provenance
    - provenance: {}
code:
    # all files created with this provenance
    - provenance: {}
files:
    - name: file-Patient.ndjson.gz
      md5: XXXX
      # except this one, which overrides the provenance above
      code_provenance: {}
      source_provenance: {}
    - name: file-Specimen.ndjson.gz
      md5: XXXX
      tags: []
    - name: file-Task.ndjson.gz
      md5: XXXX
      tags: []
    - name: sub-dir/file-DocumentReference.ndjson.gz
      md5: XXXX
      tags: []
    - name: sub-dir/file-Observation.ndjson.gz
      md5: XXXX
      tags: []
    - name: sub-dir/file-Compound.ndjson.gz
      md5: XXXX
      tags: []
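One way a consumer might check a release against its manifest is sketched below. It assumes the manifest has already been parsed into a dict (for example with PyYAML's `yaml.safe_load`) and that the release sits in a local directory; web directories and S3 are not covered. The function names `md5_of` and `verify_manifest` are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest, root):
    """Return a list of (name, problem) tuples; empty when every entry checks out."""
    problems = []
    seen = set()
    for entry in manifest.get("files", []):
        name = entry["name"]
        if name in seen:
            # The same file listed twice is treated as an error.
            problems.append((name, "duplicate entry"))
            continue
        seen.add(name)
        path = Path(root) / name
        if not path.is_file():
            problems.append((name, "missing"))
        elif md5_of(path) != str(entry["md5"]).lower():
            problems.append((name, "md5 mismatch"))
    return problems
```

Returning a list of problems rather than raising on the first failure lets a release tool report every bad entry in one pass.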
