ceda-di (CEDA Data Index)

Introduction

The ceda_di module is a Python backend to interface with a wide range of historical scientific data formats.

Within this project, you will find:

  • Library interfaces to common scientific data formats (focusing mostly on data formats within the BADC archive), including:
    • NetCDF
    • HDF4
    • GeoTIFF
    • ENVI BIL/BSQ
  • Interfaces to the file format modules, to enable extraction of domain-specific metadata (including geospatial/temporal metadata)

  • A command-line interface to a suite of tools designed to be run on a multi-core system with filesystem access to scientific data. This includes:
    • A tool to extract metadata from common scientific data formats
    • A tool to index metadata into an Elasticsearch installation
    • A suite of tools to query and visualise the data stored within Elasticsearch

Project Goals

The metadata extraction toolkit was developed with the following goals:

  • Extract geospatial, temporal and parameter metadata from a historical dataset with homogeneous data stored in various file formats
  • Generate JSON documents containing geospatial and temporal metadata from the readable files
  • Ingest the created JSON documents into an enterprise full-text search system (Elasticsearch)
  • Build a query system around the Elasticsearch module
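
As an illustration of the second goal, the sketch below builds one such JSON document in Python. The field names (file, spatial, temporal, parameters), the helper build_metadata_document and the sample values are hypothetical and chosen for readability; the documents ceda-di actually generates may be laid out differently.

```python
import json
from datetime import datetime, timezone


def build_metadata_document(path, lats, lons, times, parameters):
    """Summarise one data file as a JSON-serialisable dict (hypothetical layout)."""
    return {
        "file": {"path": path},
        "spatial": {
            # Bounding box covering every sampled coordinate
            "bbox": {
                "min_lat": min(lats), "max_lat": max(lats),
                "min_lon": min(lons), "max_lon": max(lons),
            }
        },
        "temporal": {
            # ISO 8601 strings are plain JSON and sort chronologically
            "start_time": min(times).isoformat(),
            "end_time": max(times).isoformat(),
        },
        "parameters": sorted(parameters),
    }


# Made-up sample values, for illustration only
doc = build_metadata_document(
    path="/path/to/input/example.nc",
    lats=[51.2, 51.9, 52.4],
    lons=[-1.3, -0.8, 0.1],
    times=[
        datetime(2014, 6, 1, 9, 0, tzinfo=timezone.utc),
        datetime(2014, 6, 1, 11, 30, tzinfo=timezone.utc),
    ],
    parameters=["air_temperature", "air_pressure"],
)
print(json.dumps(doc, indent=2))
```

Reducing each file to a bounding box and a time range keeps the document small while still supporting searches by spatial and temporal extent.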

Command-line Usage

Usage string

The usage string for the command-line application is as follows:

"""
Usage:
    di.py (--help | --version)
    di.py index [options] <path-to-json-docs>
    di.py search <extents> [options] [--file-paths | --json]
    di.py extract [options] [--send-to-index]
                  [<input-path> (<output-path> | --no-create-files)]
    di.py test

Options:
    --help                     Show this screen.
    --version                  Show version.
    --config=<path>            Config file. [default: ../config/ceda_di.json]
    --host=<host>              Specify ElasticSearch host.
    --port=<port>              Specify ElasticSearch port.
    --index=<name>             Specify ElasticSearch index name.
    --send-to-index            Index metadata with ElasticSearch.
    --no-create-files          Don't create JSON metadata files.
    --max-results=<num>        Max number of results to return when searching
    --file-paths               Print out search results as file paths.
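
To make the grammar of the extract line concrete, here is a rough approximation using Python's standard argparse module. It is not necessarily how di.py parses its arguments (the documentation does not say); the functions build_parser and validate_extract are hypothetical, and only the option names and the --config default are taken from the usage string above.

```python
import argparse


def build_parser():
    """Approximate the `extract` subcommand with argparse (illustration only)."""
    parser = argparse.ArgumentParser(prog="di.py")
    sub = parser.add_subparsers(dest="command", required=True)

    extract = sub.add_parser("extract")
    extract.add_argument("--config", default="../config/ceda_di.json")
    extract.add_argument("--send-to-index", action="store_true")
    extract.add_argument("--no-create-files", action="store_true")
    # Both paths are optional: the usage string wraps them in [...]
    extract.add_argument("input_path", nargs="?")
    extract.add_argument("output_path", nargs="?")
    return parser


def validate_extract(args):
    """Enforce `<input-path> (<output-path> | --no-create-files)`."""
    if args.input_path is None:
        return  # the whole path group was omitted, which the usage string allows
    has_output = args.output_path is not None
    # Exactly one of <output-path> and --no-create-files must accompany <input-path>
    if has_output == args.no_create_files:
        raise ValueError("give either <output-path> or --no-create-files")


args = build_parser().parse_args(
    ["extract", "--no-create-files", "--send-to-index", "/path/to/input"]
)
validate_extract(args)
print(args.input_path, args.output_path, args.no_create_files, args.send_to_index)
```

The separate validation step is needed because argparse cannot express "a positional argument or a flag, but not both" declaratively.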

Sample usage

Note: This is not a comprehensive guide! Please refer to the Command-Line Interface documentation for a more thorough explanation.

# Extracts metadata from files in /path/to/input
# Outputs data to Elasticsearch (--send-to-index)
# Outputs JSON files to /path/to/output
di.py extract --send-to-index /path/to/input /path/to/output

# Extracts metadata from files in /path/to/input
# Outputs data to Elasticsearch (--send-to-index)
# Does not create JSON documents (--no-create-files)
di.py extract --no-create-files --send-to-index /path/to/input

# Finds all JSON documents in /path/to/json and sends to Elasticsearch
di.py index /path/to/json

# Run unit tests
di.py test
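
The index command above sends a directory of JSON documents to Elasticsearch. One way such a step could work is through the Elasticsearch bulk API, which accepts newline-delimited JSON: an action line followed by the document itself. The sketch below only builds that request body and does not contact a server. Whether di.py uses the bulk API is an assumption, as are the helper name bulk_body, the index name example-index and the use of the file stem as the document ID.

```python
import json
import tempfile
from pathlib import Path


def bulk_body(json_dir, index_name):
    """Build a bulk-API request body (newline-delimited JSON) from *.json files."""
    lines = []
    for path in sorted(Path(json_dir).glob("*.json")):
        document = json.loads(path.read_text(encoding="utf-8"))
        # Action line: index the document under a hypothetical ID (the file stem)
        lines.append(json.dumps({"index": {"_index": index_name, "_id": path.stem}}))
        # Source line: the metadata document itself, on a single line
        lines.append(json.dumps(document))
    # The bulk API requires the body to end with a newline
    return "\n".join(lines) + "\n" if lines else ""


# Demonstration with one throwaway document in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.json").write_text(
        json.dumps({"parameters": ["air_temperature"]}), encoding="utf-8"
    )
    body = bulk_body(tmp, "example-index")
print(body, end="")
```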

More information

For more information, see the section on the Command-Line Interface.

Configuration File

Format

The di.py command-line interface is configured through a JSON file, in which the options are organised into nested elements.
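
As a minimal sketch of what working with such a nested JSON file could look like, the example below parses a small configuration and layers a few overrides on top of it. The keys shown (elasticsearch, host, port, index, output, create-files) are hypothetical and not taken from the real ceda_di.json. That command-line options such as --host and --port take precedence over the file is likewise an assumption.

```python
import json

# Hypothetical configuration content; the real ceda_di.json keys may differ
EXAMPLE_CONFIG_TEXT = """
{
  "elasticsearch": {"host": "localhost", "port": 9200, "index": "example-index"},
  "output": {"create-files": true}
}
"""


def deep_merge(base, overrides):
    """Return a copy of `base` with `overrides` merged in, recursing into nested dicts."""
    merged = dict(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


config = json.loads(EXAMPLE_CONFIG_TEXT)
# Values that might arrive via --host and --port (assumed precedence)
effective = deep_merge(
    config, {"elasticsearch": {"host": "es.example.org", "port": 9201}}
)
print(effective["elasticsearch"])
```

Merging recursively means an override of one nested value, such as the host, leaves its siblings (here the index name) untouched.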

More information

For a full summary of all configuration options, read the section on the Configuration File.
