Dataset Configuration
Every dataset should be configured with a JSON file. A template exists at config.
This module aims to be generic enough that adding a new IMOS dataset is driven only by a JSON configuration file.
Examples of dataset configuration can be found at:
The main choice left when creating a cloud optimised dataset with this module is whether to use the Apache Parquet format or the Zarr format.
As a rule of thumb, for:
tabular dataset (NetCDF, CSV) -> Parquet
gridded dataset (NetCDF) -> Zarr
All dataset configuration files are placed under: https://github.com/aodn/aodn_cloud_optimised/blob/main/aodn_cloud_optimised/config/dataset/
Create Dataset Configuration (semi-automatic)
See the section Create Dataset Configuration (semi-automatic) for help creating a dataset configuration.
Parquet Configuration from NetCDF file
Note
Important Note
Some dataset formats are more complicated and can’t use the GenericParquetHandler. This is the case with Argo data, for example. In this case, it is possible to create a specific handler which inherits, with super(), all of the methods from the aodn_cloud_optimised.lib.GenericParquetHandler.GenericHandler class.
As an example, we’ll explain the
aodn_cloud_optimised.config.slocum_glider_delayed_qc.json config
file.
The Basics
The first sections to add are
{
"dataset_name": "slocum_glider_delayed_qc",
"logger_name": "slocum_glider_delayed_qc",
"cloud_optimised_format": "parquet",
"metadata_uuid": "a681fdba-c6d9-44ab-90b9-113b0ed03536",
...
}
dataset_name: the dataset name as it will appear on AWS S3 storage, minus its format suffix (i.e. slocum_glider_delayed_qc.parquet)
cloud_optimised_format: determines whether the code uses the Parquet handler or the Zarr handler
metadata_uuid: the UUID of the GeoNetwork metadata record. This value will be written in the Parquet sidecar file
Creating the Schema
For both zarr and parquet format, consistency of the dataset is essential.
In this section, we demonstrate how to create the full schema definition from an input NetCDF file. Each variable will be defined with its variable attributes and its type.
Note
Important Note
Currently, when processing a new NetCDF file whose attribute values differ from the schema (for example “Degrees” vs “degree”), a warning/error message is displayed but processing continues.
In a previous version, such files were not processed, but this led to too many files failing.
The following snippet creates the required schema from an arbitrary NetCDF file.
generate_json_schema_from_s3_netcdf writes the schema to a temporary JSON file.
import os
from aodn_cloud_optimised.lib.config import load_variable_from_config
from aodn_cloud_optimised.lib.schema import generate_json_schema_from_s3_netcdf
BUCKET_RAW_DEFAULT = load_variable_from_config('BUCKET_RAW_DEFAULT')
obj_key = 'IMOS/ANFOG/slocum_glider/AIMS20151021/IMOS_ANFOG_BCEOPSTUVN_20151021T035731Z_SL416_FV01_timeseries_END-20151027T015319Z.nc'
nc_file = os.path.join('s3://', BUCKET_RAW_DEFAULT, obj_key)
generate_json_schema_from_s3_netcdf(nc_file)
The output will look like:
{
"PLATFORM": {
"type": "string",
"trans_system_id": "Irridium",
"positioning_system": "GPS",
"platform_type": "Slocum G2",
"platform_maker": "Teledyne Webb Research",
"firmware_version_navigation": 7.1,
"firmware_version_science": 7.1,
"glider_serial_no": "416",
"battery_type": "Alkaline",
"glider_owner": "CSIRO",
"operating_institution": "ANFOG",
"long_name": "platform informations"
},
"DEPLOYMENT": {
"type": "string",
"deployment_start_date": "2015-10-21-T05:00:02Z",
"deployment_start_latitude": -18.9373,
"deployment_start_longitude": 146.881,
"deployment_start_technician": "Gregor, Rob",
"deployment_end_date": "2015-10-27-T01:56:23Z",
"deployment_end_latitude": -19.2358,
"deployment_end_longitude": 147.5188,
"deployment_end_status": "recovered",
"deployment_pilot": "pilot, CSIRO",
"long_name": "deployment informations"
},
"SENSOR1": {
"type": "string",
"sensor_type": "CTD",
"sensor_maker": "Seabird",
"sensor_model": "GPCTD",
"sensor_serial_no": "9117",
"sensor_calibration_date": "2013-09-17",
"sensor_parameters": "TEMP, CNDC, PRES, PSAL",
"long_name": "sensor1 informations"
},
Simply copy this into the schema key of the dataset config, so that we have:
{
"dataset_name": "slocum_glider_delayed_qc",
"logger_name": "slocum_glider_delayed_qc",
"cloud_optimised_format": "parquet",
"metadata_uuid": "a681fdba-c6d9-44ab-90b9-113b0ed03536",
"schema": {
"PLATFORM": {
"type": "string",
"trans_system_id": "Irridium",
"positioning_system": "GPS",
"platform_type": "Slocum G2",
"platform_maker": "Teledyne Webb Research",
"firmware_version_navigation": 7.1,
"firmware_version_science": 7.1,
"glider_serial_no": "416",
"battery_type": "Alkaline",
"glider_owner": "CSIRO",
"operating_institution": "ANFOG",
"long_name": "platform informations"
},
"DEPLOYMENT": {
"type": "string",
"deployment_start_date": "2015-10-21-T05:00:02Z",
"deployment_start_latitude": -18.9373,
"deployment_start_longitude": 146.881,
"deployment_start_technician": "Gregor, Rob",
"deployment_end_date": "2015-10-27-T01:56:23Z",
"deployment_end_latitude": -19.2358,
"deployment_end_longitude": 147.5188,
"deployment_end_status": "recovered",
"deployment_pilot": "pilot, CSIRO",
"long_name": "deployment informations"
},
"SENSOR1": {
"type": "string",
"sensor_type": "CTD",
"sensor_maker": "Seabird",
"sensor_model": "GPCTD",
"sensor_serial_no": "9117",
"sensor_calibration_date": "2013-09-17",
"sensor_parameters": "TEMP, CNDC, PRES, PSAL",
"long_name": "sensor1 informations"
},
...
Note
Important Note
The chosen NetCDF may not contain all of the variables that will exist in the dataset.
To add them, it is advised to run a first pass of the dataset creation. The module will log the JSON to add to the config for each missing variable, which can simply be pasted in.
Parquet Schema Transformation
Adding Variables Dynamically
You can define new variables to add to the dataset using the following source types:
@filename (required)
Adds the original file name as a variable. This is required for traceability and to safely overwrite old data.
The IMOS/AODN processing is very file-oriented. To reprocess data and delete previously generated output, we need to keep track of the original filename. That’s why it must be included in the schema definition:
"filename": {
"source": "@filename",
"schema": {
"type": "string",
"units": "1",
"long_name": "Filename of the source file"
}
}
@partitioning (required)
Generates time and space partitioning variables (timestamp, polygon) for optimised cloud access.
"timestamp": {
"source": "@partitioning:time_extent",
"schema": {
"type": "int64",
"units": "1",
"long_name": "Partition timestamp"
}
},
"polygon": {
"source": "@partitioning:spatial_extent",
"schema": {
"type": "string",
"units": "1",
"long_name": "Spatial partition polygon"
}
}
The above requires a corresponding partitioning section:
"partitioning": [
{
"source_variable": "timestamp",
"type": "time_extent",
"time_extent": {
"time_varname": "TIME",
"partition_period": "M"
}
},
{
"source_variable": "polygon",
"type": "spatial_extent",
"spatial_extent": {
"lat_varname": "LATITUDE",
"lon_varname": "LONGITUDE",
"spatial_resolution": 5
}
}
]
@global_attribute:<name>
Copies the value from a NetCDF global attribute into a new variable.
"vessel_name": {
"source": "@global_attribute:vessel_name",
"schema": {
"type": "string",
"units": "1",
"_FillValue": "",
"long_name": "vessel name"
}
}
@variable_attribute:<varname>.<varatt>
Extracts a specific variable attribute and promotes it to a variable.
Example:
"instrument_identifier": {
"source": "@variable_attribute:TEMP.instrument_id",
"schema": {
"type": "string",
"units": "1",
"_FillValue": "",
"long_name": "my instrument id"
}
}
@function:<function_name> to Extract from File Paths
Custom logic can be applied to the object key (e.g. S3 path) using the @function:<name> syntax in add_variables. These require a corresponding function definition in the functions block.
This is helpful when useful information is missing from the NetCDF files but available from the file path.
Example:
"campaign_name": {
"source": "@function:campaign_name_extract",
"schema": {
"type": "string",
"units": "1",
"_FillValue": "",
"long_name": "voyage identifier"
}
}
And the function definition:
"functions": {
"campaign_name_extract": {
"extract_method": "object_key",
"method": {
"key_pattern": ".*/IMOS/AUV/{campaign_name}/{dive_name}/hydro_netcdf/{filename}",
"extraction_code": "def extract_info_from_key(key):\n parts = key.split('/')\n return {'campaign_name': parts[-4]}"
}
}
}
You may define multiple functions this way. They are applied to every input path at runtime.
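Because extraction_code is stored as a single JSON string, it can be easier to write and test the function as ordinary Python first and then serialise it. The sketch below does this for the function shown above. The sample object key is hypothetical, and the way the module executes the string internally is not shown here; exec is only used to test the source locally.

```python
import json

# Source of the extraction function, written as normal Python text.
EXTRACTION_CODE = (
    "def extract_info_from_key(key):\n"
    "    parts = key.split('/')\n"
    "    return {'campaign_name': parts[-4]}"
)

# Execute the source in an isolated namespace to test it locally.
namespace = {}
exec(EXTRACTION_CODE, namespace)
extract_info_from_key = namespace["extract_info_from_key"]

# Hypothetical object key following the key_pattern layout.
sample_key = "s3://some-bucket/IMOS/AUV/campaign_A/dive_1/hydro_netcdf/file.nc"
info = extract_info_from_key(sample_key)
print(info)

# json.dumps escapes newlines and quotes, giving a value that can be
# pasted into the "extraction_code" field of the config.
escaped = json.dumps(EXTRACTION_CODE)
print(escaped)
```

Testing the function against a few real object keys before adding it to the config helps catch an off-by-one index in the path split.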
@function:<function_name> to create a new variable from input variables
Custom logic can be applied to derive new variables from existing dataframe columns using the @function:<name> syntax in add_variables. These require a corresponding function definition in the functions block.
This is useful when the desired variable is not directly present in the dataset but can be computed from one or more columns.
Example:
"TIME": {
"source": "@function:time_creation",
"schema": {
"type": "timestamp[ns]",
"units": "days since 1970-01-01T00:00:00Z",
"_FillValue": "",
"long_name": "Derived timestamp"
}
}
And the function definition:
"functions": {
"time_creation": {
"extract_method": "from_variables",
"method": {
"creation_code": "def time_creation_from_variables(df):\n    import pandas as pd\n    date_col = df.get('survey_date')\n    hour_col = df.get('hour')\n\n    # Fill missing hour with 00:00:00\n    if hour_col is None:\n        hour_col = pd.Series(['00:00:00']*len(df), index=df.index)\n    else:\n        hour_col = hour_col.fillna('00:00:00')\n\n    # Combine date and hour strings\n    dt_str = date_col.astype(str) + ' ' + hour_col.astype(str)\n    result = pd.to_datetime(dt_str, errors='coerce', format='%Y-%m-%d %H:%M:%S')\n\n    # Fallback to just date if conversion failed\n    mask = result.isna()\n    if mask.any():\n        result.loc[mask] = pd.to_datetime(date_col[mask], errors='coerce')\n\n    return result"
}
}
}
You may define multiple functions this way. They are applied to the dataframe at runtime, deriving new columns as specified in the configuration.
Global Attributes
The global_attributes section allows you to delete or override global attributes on the output dataset.
Delete global attributes
"global_attributes": {
"delete": [
"geospatial_lat_max",
"geospatial_lat_min",
"geospatial_lon_max",
"geospatial_lon_min",
"date_created"
]
}
Set or override global attributes
"global_attributes": {
"set": {
"title": "IMOS Underway CO2 dataset measured",
"featureType": "trajectory",
"principal_investigator": "",
"principal_investigator_email": ""
}
}
Choosing the Partition keys
Any variable available in the schema definition could be used as a partition.
Partition keys are defined through the schema_transformation.partitioning section.
Each partitioning variable must exist, either in the original schema section, or added in add_variables.
Timestamp partition
To enable time-based partitioning, a variable like timestamp is added using:
"timestamp": {
"source": "@partitioning:time_extent",
"schema": {
"type": "int64",
"units": "1",
"long_name": "Partition timestamp"
}
}
This must be matched with a partitioning definition like:
{
"source_variable": "timestamp",
"type": "time_extent",
"time_extent": {
"time_varname": "TIME",
"partition_period": "Q"
}
}
partition_period controls how time is grouped: M (monthly), Y (yearly), Q (quarterly), etc. See the full list of supported values at:
https://pandas.pydata.org/docs/user_guide/timeseries.html#timeseries-period-aliases
Note
Important Note
Choose the period wisely through testing! A finer period, such as day, will create many more objects or chunks and will considerably slow data queries.
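To get a feel for how partition_period affects the number of partitions, the sketch below buckets a few timestamps by day, month, quarter and year using only the standard library. It illustrates the grouping idea only: the helper functions and sample dates are assumptions, not the module's implementation, which relies on the pandas period aliases linked above.

```python
from datetime import datetime


def period_bucket(ts: datetime, partition_period: str) -> tuple:
    """Return a hashable label for the period containing ts."""
    if partition_period == "D":
        return (ts.year, ts.month, ts.day)
    if partition_period == "M":
        return (ts.year, ts.month)
    if partition_period == "Q":
        return (ts.year, (ts.month - 1) // 3 + 1)
    if partition_period == "Y":
        return (ts.year,)
    raise ValueError(f"unsupported period in this sketch: {partition_period}")


def count_partitions(timestamps, partition_period: str) -> int:
    """Count the distinct period buckets covered by the timestamps."""
    return len({period_bucket(ts, partition_period) for ts in timestamps})


# Hypothetical observation times spread over one year.
samples = [
    datetime(2015, 1, 15),
    datetime(2015, 1, 16),
    datetime(2015, 2, 20),
    datetime(2015, 3, 5),
    datetime(2015, 7, 1),
]

for period in ("D", "M", "Q", "Y"):
    print(period, count_partitions(samples, period))
```

The finer the period, the more distinct buckets the same data produces, which is why a daily period multiplies the number of objects.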
Geospatial Partition
To add spatial filtering, define a variable like polygon:
"polygon": {
"source": "@partitioning:spatial_extent",
"schema": {
"type": "string",
"units": "1",
"long_name": "Spatial partition polygon"
}
}
Then define how it’s calculated from the coordinates:
{
"source_variable": "polygon",
"type": "spatial_extent",
"spatial_extent": {
"lat_varname": "LATITUDE",
"lon_varname": "LONGITUDE",
"spatial_resolution": 5
}
}
spatial_resolution controls the grid size in degrees. Smaller values give finer granularity.
Note
Important Note
Choose the spatial_resolution wisely through testing! Similarly to the partition_period above, a smaller value will lead to more objects.
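The effect of spatial_resolution can be sketched with a simple regular grid. The functions below are hypothetical and only illustrate how a point falls into a cell of a given size and how the number of possible cells grows as the resolution gets finer; the module's actual polygon computation may differ. The sample coordinates are the deployment start position from the schema example earlier on this page.

```python
import math


def grid_cell(lat: float, lon: float, spatial_resolution: float) -> tuple:
    """Return (lat_min, lon_min, lat_max, lon_max) of the cell containing the point."""
    lat_min = math.floor(lat / spatial_resolution) * spatial_resolution
    lon_min = math.floor(lon / spatial_resolution) * spatial_resolution
    return (lat_min, lon_min, lat_min + spatial_resolution, lon_min + spatial_resolution)


def global_cell_count(spatial_resolution: float) -> int:
    """Number of cells needed to cover the whole globe at this resolution."""
    return math.ceil(180 / spatial_resolution) * math.ceil(360 / spatial_resolution)


cell = grid_cell(-18.9373, 146.881, 5)
print(cell)
print(global_cell_count(5), global_cell_count(1))
```

Going from a 5 degree grid to a 1 degree grid multiplies the number of possible cells by 25, so the object count can grow quickly for datasets with wide spatial coverage.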
Partition Key Summary
All partition keys must also be listed under the partitioning config, and their order matters. For example:
"partitioning": [
{
"source_variable": "timestamp",
"type": "time_extent",
"time_extent": {
"time_varname": "TIME",
"partition_period": "M"
}
},
{
"source_variable": "polygon",
"type": "spatial_extent",
"spatial_extent": {
"lat_varname": "LATITUDE",
"lon_varname": "LONGITUDE",
"spatial_resolution": 5
}
},
{
"source_variable": "platform_code"
}
],
In this example, the order of partitions will be timestamp -> polygon -> platform_code
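Since every partitioning variable must exist either in the schema section or in add_variables, a quick check before running the full processing can save a failed run. The helper below is a hypothetical sketch, not part of the module. It assumes that add_variables and partitioning both sit under schema_transformation, as described on this page, and the sample config is trimmed to the keys the check needs.

```python
def partition_order(config: dict) -> list:
    """Return the partition keys in the order they are declared."""
    partitioning = config.get("schema_transformation", {}).get("partitioning", [])
    return [entry["source_variable"] for entry in partitioning]


def missing_partition_variables(config: dict) -> list:
    """Return partition keys defined neither in schema nor in add_variables."""
    schema_vars = set(config.get("schema", {}))
    added_vars = set(config.get("schema_transformation", {}).get("add_variables", {}))
    known = schema_vars | added_vars
    return [name for name in partition_order(config) if name not in known]


# Trimmed, hypothetical config: platform_code is deliberately left undefined.
config = {
    "schema": {"TIME": {}, "LATITUDE": {}, "LONGITUDE": {}},
    "schema_transformation": {
        "add_variables": {
            "timestamp": {"source": "@partitioning:time_extent"},
            "polygon": {"source": "@partitioning:spatial_extent"},
        },
        "partitioning": [
            {"source_variable": "timestamp", "type": "time_extent"},
            {"source_variable": "polygon", "type": "spatial_extent"},
            {"source_variable": "platform_code"},
        ],
    },
}

print(partition_order(config))
print(missing_partition_variables(config))
```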
Parquet Configuration from CSV file
To create a Parquet dataset from CSV files, all of the previous configuration options still apply. However, there are some special configuration options to deal with various CSV formats.
We will use the example configuration file:
aodn_cloud_optimised.config.aatams_acoustic_tagging.json
The CSV-specific options are grouped under the nested key csv_config and validated by CSVConfigModel. For CSV ingestion, configure exactly one reader:
pandas_read_csv_config for pandas.read_csv()
polars_read_csv_config for polars.read_csv()
Validation rules:
pandas_read_csv_config and polars_read_csv_config are mutually exclusive (you cannot provide both).
Keys provided under each config are validated against the corresponding reader function signature.
If neither config is provided for CSV input, the CSV handler raises an error at runtime.
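The rules above can be illustrated with the standard library. This is a minimal sketch of the idea, not the code of CSVConfigModel: the function names are hypothetical, fake_read_csv is a stand-in with an invented signature so the example runs without pandas or polars installed, and the two checks are combined here even though the module raises the "no reader" error at runtime rather than at validation time.

```python
import inspect


def unknown_reader_options(reader, options: dict) -> list:
    """Return option names that are not parameters of the reader function."""
    params = inspect.signature(reader).parameters
    # A reader accepting **kwargs would accept any key.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []
    return sorted(name for name in options if name not in params)


def check_csv_config(csv_config: dict) -> None:
    """Enforce that exactly one reader config is present."""
    has_pandas = "pandas_read_csv_config" in csv_config
    has_polars = "polars_read_csv_config" in csv_config
    if has_pandas and has_polars:
        raise ValueError("provide only one of pandas_read_csv_config or polars_read_csv_config")
    if not (has_pandas or has_polars):
        raise ValueError("no CSV reader configured")


# Stand-in for a real reader such as polars.read_csv (hypothetical signature).
def fake_read_csv(source, separator=",", has_header=True, null_values=None):
    return None


options = {"separator": ";", "parse_dates": ["detection_timestamp"]}
print(unknown_reader_options(fake_read_csv, options))
```

With a real reader, passing pandas.read_csv or polars.read_csv as the first argument would apply the same check against the installed library version.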
Example using Pandas:
"csv_config": {
"pandas_read_csv_config": {
"delimiter": ";",
"header": 0,
"index_col": "detection_timestamp",
"parse_dates": [
"detection_timestamp"
],
"na_values": [
"N/A",
"NaN"
],
"encoding": "utf-8"
}
}
Example using Polars:
"csv_config": {
"polars_read_csv_config": {
"separator": ";",
"has_header": true,
"try_parse_dates": true,
"null_values": ["N/A", "NaN"],
"encoding": "utf-8"
}
}
pandas_read_csv_config: Options are passed directly to pandas.read_csv().
polars_read_csv_config: Options are passed directly to polars.read_csv().
You can use any valid arguments for the corresponding CSV reader. See the official documentation for reference:
Parquet Configuration with a Custom NetCDF Handler
Some NetCDF collections have an unusual structure that the generic handler cannot flatten automatically (e.g. CF indexed ragged array files where coordinates live on different dimensions). In those cases you can:
Set "handler_class" to the name of a custom Python class that subclasses GenericHandler and overrides preprocess_data().
Pass handler-specific options via the top-level "netcdf_read_config" key.
netcdf_read_config is a free-form JSON object; its keys and meaning are entirely
defined by the custom handler. The DatasetConfig model accepts any dictionary here
without further validation, so the handler itself is responsible for documenting and
checking the values it reads.
Built-in option: spectral_flatten
The BODBAWHandler (used for the IMOS Bio-Optical Database of Australian Waters
datasets) reads one boolean flag from netcdf_read_config:
| Key | Type | Description |
|---|---|---|
| spectral_flatten | boolean | When … |
Example for a non-spectral dataset (pigment, suspended matter):
"netcdf_read_config": {}
Example for a spectral dataset (absorption, backscattering):
"netcdf_read_config": {
"spectral_flatten": true
}
Note
If you are writing a new custom handler, read netcdf_read_config from
self.dataset_config.get("netcdf_read_config") or {} inside your handler so
a missing key or a JSON null value is treated as an empty mapping. If your
handler requires a dictionary, also validate the type before using it, and
document each key you introduce in the handler’s class docstring.
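The advice in this note can be sketched as two small helpers. They are hypothetical and operate on a plain dictionary standing in for self.dataset_config; treating a missing spectral_flatten key as False is an assumption made for this example.

```python
def read_netcdf_read_config(dataset_config: dict) -> dict:
    """Return netcdf_read_config as a dict, treating a missing key or null as empty."""
    options = dataset_config.get("netcdf_read_config") or {}
    if not isinstance(options, dict):
        raise TypeError(
            f"netcdf_read_config must be a JSON object, got {type(options).__name__}"
        )
    return options


def spectral_flatten_enabled(dataset_config: dict) -> bool:
    """Read the boolean flag, defaulting to False when absent (assumed default)."""
    value = read_netcdf_read_config(dataset_config).get("spectral_flatten", False)
    if not isinstance(value, bool):
        raise TypeError("spectral_flatten must be true or false")
    return value


print(spectral_flatten_enabled({"netcdf_read_config": {"spectral_flatten": True}}))
print(spectral_flatten_enabled({"netcdf_read_config": None}))
print(spectral_flatten_enabled({}))
```

Validating the type of each value at the point where it is read keeps the error close to the offending config key, which matters because DatasetConfig does not validate this section.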
Parquet Configuration from Parquet file
In some instances we already have a parquet file, but still need to update it to the cloud optimised format and apply AODN conventions.
No additional configuration is currently required to create a Parquet dataset from a Parquet file. Follow the dataset configuration as explained above for NetCDF.
Zarr Configuration from NetCDF
As an example, we’ll explain the
aodn_cloud_optimised.config.radar_velocity_hourly_averaged_delayed_qc_main.json
config file.
https://github.com/aodn/aodn_cloud_optimised/blob/main/aodn_cloud_optimised/config/dataset/radar_velocity_hourly_averaged_delayed_qc_main.json
Note
Important Note
It is possible to have a main config and a child config to avoid duplication. This is especially useful for IMOS Zarr datasets such as the ones from the Radar and GHRSST datasets, which are similar in terms of metadata.
See for example the two related configuration files: radar_velocity_hourly_averaged_delayed_qc_main.json and radar_TurquoiseCoast_velocity_hourly_averaged_delayed_qc.json
The Basics
The first section to add is
{
"dataset_name": "radar_velocity_hourly_averaged_delayed_qc_main",
"logger_name": "radar_velocity_hourly_averaged_delayed_qc_main",
"cloud_optimised_format": "zarr",
"metadata_uuid": "a681fdba-c6d9-44ab-90b9-113b0ed03536",
...
}
dataset_name: the name as it will appear on AWS S3 storage
cloud_optimised_format key is important as this will allow the code to either choose the Parquet handler or the zarr handler
metadata_uuid: the GeoNetwork uuid metadata record. This value will be written in the parquet sidecar file
The chunks
Add the following to the “schema_transformation” section
"dimensions": {
"time": {"name": "TIME",
"chunk": 1500,
"rechunk": true,
"append_dim": true},
"latitude": {"name": "J",
"chunk": 60},
"longitude": {"name": "I",
"chunk": 59}
},
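When choosing chunk values, it helps to estimate the uncompressed size of one chunk. The sketch below multiplies the chunk lengths from the dimensions block above; the 4-byte item size assumes a float32 variable spanning all three dimensions, which is an assumption for this example rather than something read from the config.

```python
from math import prod

# Same values as the "dimensions" block above, written as a Python dict.
dimensions = {
    "time": {"name": "TIME", "chunk": 1500, "rechunk": True, "append_dim": True},
    "latitude": {"name": "J", "chunk": 60},
    "longitude": {"name": "I", "chunk": 59},
}


def chunk_nbytes(dimensions: dict, itemsize: int = 4) -> int:
    """Uncompressed size in bytes of one chunk spanning every dimension."""
    n_elements = prod(dim["chunk"] for dim in dimensions.values())
    return n_elements * itemsize


nbytes = chunk_nbytes(dimensions)
print(f"{nbytes} bytes, about {nbytes / 2**20:.1f} MiB per chunk")
```

Comparing this figure against the worker memory limit gives a first indication of whether a chunking scheme is reasonable before testing it on a cluster.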
Variable Template
The name of a variable used as a template to create variables that are missing from an input file, so that they have the same shape:
"var_template_shape": "UCUR",
Variables to drop
When setting region explicitly in the to_zarr() method, all variables in the dataset to write must have at least one dimension in common with the region’s dimensions [‘TIME’]. Variables that do not share a dimension with the region therefore need to be removed from the dataset before writing:
"vars_incompatible_with_region": ["I", "J", "LATITUDE", "LONGITUDE", "GDOP"],
Also, when a dataset to be converted to Zarr has some variables which aren’t always present in the input files, it is at the moment (July 2024) good practice to drop them:
"wind_speed_dtime_from_sst": {
"type": "float32",
"drop_var": true
},
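The region rule described earlier in this section (every variable written must share at least one dimension with the region dimensions) can be checked ahead of time to build the vars_incompatible_with_region list. The sketch below uses a hypothetical mapping of variable names to dimensions; in practice this mapping would come from inspecting a sample input file.

```python
def vars_without_region_dim(var_dims: dict, region_dims: set) -> list:
    """Return variables that share no dimension with the region dimensions."""
    return [
        name
        for name, dims in var_dims.items()
        if not region_dims.intersection(dims)
    ]


# Hypothetical dimension layout for a radar file (assumed for illustration).
var_dims = {
    "TIME": ("TIME",),
    "I": ("I",),
    "J": ("J",),
    "LATITUDE": ("J", "I"),
    "LONGITUDE": ("J", "I"),
    "GDOP": ("J", "I"),
    "UCUR": ("TIME", "J", "I"),
}

incompatible = vars_without_region_dim(var_dims, {"TIME"})
print(incompatible)
```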
Note
Important Note
Ideally, when a variable is missing from an input NetCDF file but exists in the schema definition, an empty variable should be created. This functionality is implemented in the Zarr handler through the preprocess function. The intended usage is for xarray.open_mfdataset to utilise this preprocess function. However, due to issues encountered when running on a remote cluster, this function is currently not in use.
After extensive testing of the preprocess function, which is challenging to serialise, it was found necessary to move it outside the handler class. But even with an empty preprocess function simply returning the input dataset, and defined outside the class, calling it through xarray.open_mfdataset caused data to be sent back to the machine that created the Coiled cluster for some processing.
This behavior has been managed by simplifying the preprocess function and calling it after the open_mfdataset call.
However, we’re currently dropping ALL variables which aren’t present across all the NetCDF files. This is done by adding "drop_var": true in the schema definition for any variable to drop. In the future, this should be fixed!
Creating the Schema
See Creating the Schema section above. As for Parquet…
Global Attributes to drop and set
Similar to Parquet. Add the following under the “schema_transformation” section
"global_attributes": {
"set": {
"title": ""
},
"delete": [
"Voyage_number",
"platform_code",
"geospatial_lat_max",
"geospatial_lat_min",
"geospatial_lon_max",
"geospatial_lon_min",
"date_created"
]
}
Global Attributes to variables
Similar to Parquet. Add the following to the “schema_transformation” section.
Note
Important Note
Due to some zarr/xarray bugs, only string variables are supported
"add_variables": {
"quality_control_version": {
"source": "@global_attribute:file_version",
"schema": {
"type": "<U49",
"units": "1",
"dimensions": "TIME"
}
},
"platform_code": {
"source": "@global_attribute:platform_code",
"schema": {
"type": "<U7",
"units": "1",
"dimensions": "TIME"
}
},
"voyage_number": {
"source": "@global_attribute:Voyage_number",
"schema": {
"type": "<U10",
"units": "1",
"dimensions": "TIME"
}
}
}
Run Settings Options
Example
"run_settings": {
"coiled_cluster_options": {
"n_workers": [
40,
50
],
"scheduler_vm_types": "m7i.large",
"worker_vm_types": "m7i.large",
"allow_ingress_from": "me",
"compute_purchase_option": "spot_with_fallback",
"worker_options": {
"nthreads": 8,
"memory_limit": "16GB"
}
},
"batch_size": 60,
"cluster": {
"mode": "coiled",
"restart_every_path": false
},
"paths": [
{
"s3_uri": "s3://imos-data/IMOS/AATAMS/satellite_tagging/MEOP_QC_CTD/",
"filter": [
".*\\.nc$"
],
"year_range": []
}
],
"clear_existing_data": true,
"raise_error": false,
"force_previous_parquet_deletion": true
}
Note
Important Note
If cluster.mode is set to “coiled”, the coiled_cluster_options need to be set.
If cluster.mode is set to “ec2”, the ec2_cluster_options and ec2_adapt_options need to be set.
cluster.mode can also be set to “local” or null.
force_previous_parquet_deletion forces a search for existing Parquet files matching the new one to ingest, so that they can be deleted. This can end up being really slow if there are a lot of objects (for example Argo).
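The dependency between cluster.mode and the option blocks can be expressed as a small pre-flight check. This is a hypothetical helper written for illustration; the module's own validation may behave differently.

```python
# Option blocks that must be present for each cluster mode, per the note above.
REQUIRED_OPTIONS = {
    "coiled": ["coiled_cluster_options"],
    "ec2": ["ec2_cluster_options", "ec2_adapt_options"],
    "local": [],
    None: [],
}


def missing_cluster_options(run_settings: dict) -> list:
    """Return the option blocks required by cluster.mode but absent from run_settings."""
    mode = run_settings.get("cluster", {}).get("mode")
    if mode not in REQUIRED_OPTIONS:
        raise ValueError(f"unknown cluster mode: {mode!r}")
    return [key for key in REQUIRED_OPTIONS[mode] if key not in run_settings]


run_settings = {
    "cluster": {"mode": "ec2", "restart_every_path": False},
    "ec2_cluster_options": {"n_workers": 1},
}
print(missing_cluster_options(run_settings))
```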
In order to create the dataset on a remote cluster (ec2/coiled), the following configuration needs to be added within the run_settings:
Coiled Cluster configuration
For a coiled cluster, simply put this in the run_settings config
"coiled_cluster_options" : {
"n_workers": [2, 20],
"scheduler_vm_types": "m7i-flex.large",
"worker_vm_types": "m7i-flex.large",
"allow_ingress_from": "me",
"compute_purchase_option": "spot_with_fallback",
"worker_options": {
"nthreads": 8,
"memory_limit": "16GB" }
},
"cluster": {
"mode": "coiled",
"restart_every_path": false
},
Note
Important Note
Every dataset is different, and so will be the configuration above. The values of batch_size, n_workers, scheduler_vm_types and worker_vm_types are all intertwined.
It is necessary to understand the dataset and how big the input files are.
It is advised to run some tests on the Coiled cluster and look at the graph outputs to find the best cluster configuration to process input files as quickly and cheaply as possible.
A batch_size that is too big combined with a worker_vm_types that is too small will lead to out-of-memory issues and higher Global Interpreter Lock (GIL) contention.
EC2 Cluster configuration
As above, if an EC2 cluster is to be used, simply put this in the run_settings config:
"ec2_cluster_options": {
"n_workers": 1,
"scheduler_instance_type": "m7i-flex.xlarge",
"worker_instance_type": "m7i-flex.2xlarge",
"security": false,
"docker_image": "ghcr.io/aodn/aodn_cloud_optimised:latest"
},
"ec2_adapt_options": {
"minimum": 1,
"maximum": 120
},
"cluster": {
"mode": "ec2",
"restart_every_path": false
},
Local Development with MinIO (S3-Compatible Bucket)
If you want to develop locally without connecting to AWS S3, you can use MinIO, an S3-compatible object store that can run in Docker.
This allows you to test features such as bucket creation, file uploads, and s3fs integration without needing real S3 credentials.
Running MinIO with Docker Compose
Save the following as docker-compose.yml in your project root:
services:
  minio:
    image: quay.io/minio/minio:latest
    container_name: minio
    ports:
      - "9000:9000"  # S3 API
      - "9001:9001"  # Web console
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin
    command: server /data --console-address ":9001"
    volumes:
      - minio_data:/data
volumes:
  minio_data:
Start MinIO with:
docker compose up -d
Access the MinIO web console at: http://localhost:9001
(Default login: minioadmin / minioadmin).
Creating a Bucket
Once MinIO is running, create a bucket (for example test-bucket) either via
the web console or with the mc (MinIO client) CLI:
# The alias is stored inside the container, so set it and create the bucket in the same run.
docker run --rm -it \
  --network host \
  --entrypoint /bin/sh \
  quay.io/minio/mc -c "mc alias set local http://localhost:9000 minioadmin minioadmin && mc mb local/test-bucket"
S3FS bucket endpoint patching for MinIO
Two optional settings configure access to S3 (or S3-compatible) storage for input and output datasets.
These settings control authentication and client configuration used by s3fs / boto3.
"s3_fs_common_opts": {
"key": "minioadmin",
"secret": "minioadmin",
"client_kwargs": {
"endpoint_url": "http://localhost:9000"
}
},
"s3_bucket_opts": {
"input_data": {
"bucket": "imos-data",
"s3_fs_opts": {
"key": "minioadmin",
"secret": "minioadmin",
"client_kwargs": {
"endpoint_url": "http://localhost:9000"
}
}
},
"output_data": {
"bucket": "aodn-cloud-optimised",
"s3_fs_opts": {
"key": "minioadmin",
"secret": "minioadmin",
"client_kwargs": {
"endpoint_url": "http://localhost:9000"
}
}
}
}
Explanation
s3_fs_common_opts: Defines the default connection options shared by both input and output S3 clients (e.g. access keys, endpoint). If set, these options are used unless explicitly overridden by per-bucket configuration.
s3_bucket_opts: Allows configuring per-bucket overrides for input and output datasets. Each section may define:
bucket → the bucket name to use for reading/writing data
s3_fs_opts → optional overrides to connection options defined in s3_fs_common_opts
Both input_data and output_data are optional. If not specified, the system falls back to default bucket names (bucket_raw_default_name and optimised_bucket_name) or environment variables.
Precedence Rules
If s3_bucket_opts.<input_data|output_data>.s3_fs_opts is defined → it takes priority.
Otherwise, s3_fs_common_opts is used.
If neither is defined → the default global configuration is used.
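The precedence rules can be summarised in a few lines of Python. The function below is a hypothetical sketch, not the module's code: it takes a dictionary holding the two settings, returns the options that would win for a given role, and returns None when the default global configuration applies. It assumes per-bucket options replace the common options as a whole rather than being merged with them.

```python
def resolve_s3_fs_opts(settings: dict, role: str):
    """Pick connection options for role ("input_data" or "output_data")."""
    if role not in ("input_data", "output_data"):
        raise ValueError(f"unknown role: {role!r}")
    bucket_opts = settings.get("s3_bucket_opts", {}).get(role, {})
    # 1. Per-bucket options take priority.
    if "s3_fs_opts" in bucket_opts:
        return bucket_opts["s3_fs_opts"]
    # 2. Otherwise fall back to the shared options.
    if "s3_fs_common_opts" in settings:
        return settings["s3_fs_common_opts"]
    # 3. Neither is defined: the default global configuration is used.
    return None


common = {"key": "minioadmin", "secret": "minioadmin"}
per_bucket = {"client_kwargs": {"endpoint_url": "http://localhost:9000"}}
settings = {
    "s3_fs_common_opts": common,
    "s3_bucket_opts": {"input_data": {"bucket": "imos-data", "s3_fs_opts": per_bucket}},
}

print(resolve_s3_fs_opts(settings, "input_data"))
print(resolve_s3_fs_opts(settings, "output_data"))
```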
Note
Important Note
s3_fs_common_opts and s3_bucket_opts are optional. They are mainly useful when pointing to non-AWS endpoints (e.g. MinIO, localstack) or when input and output buckets require different credentials.
If s3_fs_common_opts is provided, both a valid s3fs.S3FileSystem session and corresponding boto3 client will be created automatically.
If you provide only one of (s3_fs_common_opts or its corresponding boto client options), a validation error will be raised.
AWS OpenData registry
In order to publicise the dataset on the OpenData Registry, the following needs to be added to every dataset configuration file.
Once populated, the registry files needed by AWS can be created by the script below, and then added to the AWS OpenData Github repository: AWS Open Data Registry.
"aws_opendata_registry": {
"Name": "",
"Description": "",
"Documentation": "",
"Contact": "",
"ManagedBy": "",
"UpdateFrequency": "",
"Tags": [],
"License": "",
"Resources": [
{
"Description": "",
"ARN": "",
"Region": "",
"Type": "",
"Explore": []
},
{
"Description": "",
"ARN": "",
"Region": "",
"Type": ""
},
{
"Description": "",
"ARN": "",
"Region": "",
"Type": ""
},
{
"Description": "",
"ARN": "",
"Region": "",
"Type": ""
}
],
"DataAtWork": {
"Tutorials": [
{
"Title": "",
"URL": "",
"Services": "",
"AuthorName": "",
"AuthorURL": ""
},
{
"Title": "",
"URL": "",
"AuthorName": "",
"AuthorURL": ""
},
{
"Title": "",
"URL": "",
"AuthorName": "",
"AuthorURL": ""
}
],
"Tools & Applications": [
{
"Title": "",
"URL": "",
"AuthorName": "",
"AuthorURL": ""
},
{
"Title": "",
"URL": "",
"AuthorName": "",
"AuthorURL": ""
}
],
"Publications": [
{
"Title": "",
"URL": "",
"AuthorName": ""
},
{
"Title": "",
"URL": "",
"AuthorName": ""
}
]
}
}
}
A script, automatically installed with the module, exists to facilitate the creation of all registry entries.
cloud_optimised_create_aws_registry_dataset -h
usage: cloud_optimised_create_aws_registry_dataset [-h] [-f FILE] [-d DIRECTORY] [-a]
Create AWS OpenData Registry YAML files from the dataset configuration, ready to be added to the OpenData Github
repository.
The script can be run in three ways:
1. Convert a specific JSON file to YAML using '-f' or '--file'.
2. Convert all JSON files in the directory using '-a' or '--all'.
3. Run interactively to list all available JSON files and prompt
the user to choose one to convert.
options:
-h, --help show this help message and exit
-f FILE, --file FILE Name of a specific JSON file to convert.
-d DIRECTORY, --directory DIRECTORY
Output directory to save converted YAML files.
-a, --all Convert all JSON files in the directory.
This script can be run in a few different ways:
cloud_optimised_create_aws_registry_dataset -> will trigger an interactive mode
cloud_optimised_create_aws_registry_dataset -a -> will output all dataset metadata
cloud_optimised_create_aws_registry_dataset -f slocum_glider_delayed_qc.json -> for a specific dataset
Adding the dataset to pyproject.toml
TODO: - Explain pyproject.toml - individual scripts for full reprocessing
Configuration Validation
When you run generic_cloud_optimised_creation --config <config_file>, the configuration is automatically validated against the Pydantic data model.
Understanding Validation
All dataset configurations are validated using Python’s Pydantic library, which ensures:
Required fields are present
Field types are correct (string, number, boolean, etc.)
Nested structures follow the expected schema
Enum values match allowed options
If validation fails, you’ll see a detailed error message showing exactly which field is problematic.
Common Validation Errors
Error: Field required
ValidationError: 1 validation error for DatasetConfig
cloud_optimised_format: Field required
Fix: Ensure you’ve set the cloud_optimised_format field to either "parquet" or "zarr":
{
"dataset_name": "my_dataset",
"cloud_optimised_format": "parquet",
...
}
Error: Input should be a valid string
ValidationError: 1 validation error for DatasetConfig
metadata_uuid: Input should be a valid string
Fix: Ensure the field value is quoted as a string (not a number or boolean):
{
"metadata_uuid": "a681fdba-c6d9-44ab-90b9-113b0ed03536"
}
Error: Input should be one of
ValidationError: 1 validation error for DatasetConfig
cloud_optimised_format: Input should be 'zarr' or 'parquet'
Fix: Ensure the format is one of the allowed values, "parquet" or "zarr":
{
"cloud_optimised_format": "parquet"
}
Required vs Optional Fields
Always Required:
dataset_name: Identifier for the dataset
cloud_optimised_format: Either "parquet" or "zarr"
metadata_uuid: Geonetwork metadata record UUID
schema: Variable definitions (varies by format)
Often Required (depending on format):
For Parquet:
schema_transformation.partitions: How to partition the data
For Zarr:
schema.chunks: Chunking strategy for performance
Optional:
run_settings: Cluster configuration, S3 buckets (uses defaults if omitted)
schema_transformation: Data processing steps (uses defaults)
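Before invoking the Pydantic model, a lightweight pre-check of the always-required fields can give a quick first answer. The sketch below is a hypothetical helper using only the standard library. It mirrors the list above but does not replace the full DatasetConfig validation, and its messages are only modelled loosely on the errors shown earlier.

```python
ALWAYS_REQUIRED = ("dataset_name", "cloud_optimised_format", "metadata_uuid", "schema")
ALLOWED_FORMATS = ("parquet", "zarr")


def precheck(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means the pre-check passed."""
    problems = [
        f"{name}: field required" for name in ALWAYS_REQUIRED if name not in config
    ]
    fmt = config.get("cloud_optimised_format")
    if fmt is not None and fmt not in ALLOWED_FORMATS:
        problems.append("cloud_optimised_format: should be 'zarr' or 'parquet'")
    return problems


# Deliberately incomplete config: metadata_uuid and schema are missing.
config = {"dataset_name": "my_dataset", "cloud_optimised_format": "netcdf"}
for problem in precheck(config):
    print(problem)
```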
Debugging Validation
To debug validation issues before running the full processing:
1. Check syntax with Python
import json
with open('my_dataset.json') as f:
    config = json.load(f)  # Catches JSON syntax errors
2. Validate against the model
import json

from pydantic import ValidationError

from aodn_cloud_optimised.bin.config.model import DatasetConfig

# Load the JSON first, then validate it against the Pydantic model.
with open('my_dataset.json') as f:
    config_dict = json.load(f)
try:
    config = DatasetConfig.model_validate(config_dict)
    print("✓ Config is valid!")
except ValidationError as e:
    print(f"✗ Validation error:\n{e}")
3. Use the automatic config creation tool
See Create Dataset Configuration (semi-automatic) for a semi-automatic tool that generates a valid configuration template.
Note
Important Note
In order to test the new configuration, the newly created script needs to be installed in the environment.
Activate the virtual env
Run `poetry install --with dev`
Re-activate the virtual env
The new script will be available