Usage
The library provides multiple ways to create cloud-optimised datasets:
Recommended: Use ``generic_cloud_optimised_creation`` with dataset config
Alternative: Write Python code that calls the library directly (see the API reference)
For development: Use integration tests and custom handlers
Generic Cloud Optimised Creation Script
The primary way to process datasets is via the generic_cloud_optimised_creation command with a dataset configuration file.
Basic usage:
generic_cloud_optimised_creation --config <dataset_config.json>
Getting help:
generic_cloud_optimised_creation -h
Key arguments
-c, --config
    Path or name of the dataset configuration JSON file. This is the main argument. Examples: mooring_hourly_timeseries_delayed_qc.json or satellite_ghrsst_l3s_1day_daynighttime_single_sensor_australia.json
-o, --json-overwrite
    Optional JSON string to override config fields at runtime. Useful for testing without modifying the config file. Example: '{"run_settings": {"cluster": {"mode": null}, "raise_error": true}}'
-t, --test
    Use integration testing buckets instead of default buckets (for development and testing).
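To make the effect of a --json-overwrite string more concrete, here is a minimal sketch of a recursive merge in which only the keys named in the override change. The deep_merge helper and the sample config are hypothetical and written for illustration only; the source does not state how the library applies the override internally, so treat the recursive behaviour as an assumption.

```python
import json
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a copy of base with override applied recursively.

    Hypothetical helper: the library's own merge logic may differ.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: descend instead of replacing wholesale
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical config fragment; the real schema may contain other fields.
config = {
    "run_settings": {
        "cluster": {"mode": "coiled", "restart_every_path": False},
        "raise_error": False,
    }
}

# The override string from the example above; JSON null becomes Python None
override = json.loads(
    '{"run_settings": {"cluster": {"mode": null}, "raise_error": true}}'
)

result = deep_merge(config, override)
print(json.dumps(result, indent=2))
```

Under this assumption, sibling keys that the override does not mention (restart_every_path in the sample) are left untouched, and the original config object is not modified.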
Advanced options for data retrieval
When processing input data, you can control where input files are read from and where the output is written:
--bucket-raw
    S3 bucket containing input files. Default: imos-data
--optimised-bucket-name
    S3 bucket where cloud-optimised output will be written. Default: imos-data-lab-optimised
--root-prefix-cloud-optimised-path
    S3 path prefix for output location. Example: cloud_optimised/example_testing
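As an illustration of how the output bucket and prefix options could combine into a destination, the sketch below composes an S3 URI from its parts. The output_uri helper is hypothetical, and the layout it produces (bucket, then prefix, then dataset name plus a format suffix) is an assumption made for this example, not something the source confirms; check the written output location before relying on it.

```python
def output_uri(bucket: str, root_prefix: str, dataset_name: str, fmt: str) -> str:
    """Compose an S3 URI from its parts.

    Hypothetical helper; the <bucket>/<prefix>/<dataset>.<format>/ layout
    is an assumption for illustration only.
    """
    prefix = root_prefix.strip("/")
    parts = [bucket.strip("/")]
    if prefix:
        parts.append(prefix)
    parts.append(f"{dataset_name}.{fmt}")
    return "s3://" + "/".join(parts) + "/"


uri = output_uri(
    bucket="imos-data-lab-optimised",               # --optimised-bucket-name
    root_prefix="cloud_optimised/example_testing",  # --root-prefix-cloud-optimised-path
    dataset_name="mooring_hourly_timeseries_delayed_qc",
    fmt="parquet",
)
print(uri)
```

Pointing the prefix at a scratch location such as cloud_optimised/example_testing keeps test runs away from production output.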
Examples
Process a Zarr dataset (gridded data):
generic_cloud_optimised_creation --config satellite_ghrsst_l3s_1day_daynighttime_single_sensor_australia
Process a Parquet dataset with testing bucket:
generic_cloud_optimised_creation --config mooring_hourly_timeseries_delayed_qc --test
Override cluster configuration at runtime:
generic_cloud_optimised_creation --config my_dataset \
--json-overwrite '{"run_settings": {"cluster": {"mode": null}}}'
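Quoting a JSON string by hand on the command line is easy to get wrong. A minimal sketch of building the same invocation from Python follows; build_command is a hypothetical helper, and it uses only the flags documented above (--config, --json-overwrite, --test).

```python
import json
import shlex


def build_command(config_name, overrides=None, test=False):
    """Build the argument list for generic_cloud_optimised_creation.

    Hypothetical helper for illustration; it only assembles arguments.
    """
    argv = ["generic_cloud_optimised_creation", "--config", config_name]
    if overrides is not None:
        # json.dumps produces valid JSON (Python None becomes null)
        argv += ["--json-overwrite", json.dumps(overrides)]
    if test:
        argv.append("--test")
    return argv


argv = build_command("my_dataset", {"run_settings": {"cluster": {"mode": None}}})

# shlex.join quotes each argument so the line can be pasted into a POSIX shell
command_line = shlex.join(argv)
print(command_line)
```

The list form can be passed directly to subprocess.run(argv, check=True), which bypasses shell quoting altogether.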
Note
Dataset-Specific Commands
Many pre-configured dataset scripts are available in the library. These call generic_cloud_optimised_creation with pre-set parameters. See aodn_cloud_optimised/bin/datasets/ in the GitHub repository for examples.
As a Python module
Zarr example
from importlib.resources import files

from aodn_cloud_optimised.lib.CommonHandler import cloud_optimised_creation
from aodn_cloud_optimised.lib.config import (
    load_variable_from_config,
    load_dataset_config,
)
from aodn_cloud_optimised.lib.s3Tools import s3_ls


def main():
    # List the input objects under this prefix in the raw bucket
    BUCKET_RAW_DEFAULT = load_variable_from_config("BUCKET_RAW_DEFAULT")
    nc_obj_ls = s3_ls(BUCKET_RAW_DEFAULT, "IMOS/SRS/SST/ghrsst/L3S-1d/dn/2024")

    # Load the dataset configuration shipped with the package
    dataset_config = load_dataset_config(
        str(
            files("aodn_cloud_optimised")
            .joinpath("config")
            .joinpath("dataset")
            .joinpath("satellite_ghrsst_l3s_1day_daynighttime_single_sensor_australia.json")
        )
    )

    cloud_optimised_creation(
        nc_obj_ls,
        dataset_config=dataset_config,
        clear_existing_data=True,  # this will delete existing data, be cautious! If testing, change the paths below
        optimised_bucket_name="imos-data-lab-optimised",  # optional, default value in config/common.json
        root_prefix_cloud_optimised_path="cloud_optimised/example_testing",  # optional, default value in config/common.json
        cluster_mode="local",
    )


if __name__ == "__main__":
    main()
Parquet example
from importlib.resources import files

from aodn_cloud_optimised.lib.CommonHandler import cloud_optimised_creation
from aodn_cloud_optimised.lib.config import (
    load_variable_from_config,
    load_dataset_config,
)
from aodn_cloud_optimised.lib.s3Tools import s3_ls


def main():
    # List the input objects under this prefix in the raw bucket
    BUCKET_RAW_DEFAULT = load_variable_from_config("BUCKET_RAW_DEFAULT")
    nc_obj_ls = s3_ls(BUCKET_RAW_DEFAULT, "IMOS/ANMN/NSW")

    # Apply filters: keep only keys that contain every substring
    filters = ["_hourly-timeseries_", "FV02"]
    for filter_str in filters:
        nc_obj_ls = [s for s in nc_obj_ls if filter_str in s]

    # Load the dataset configuration shipped with the package
    dataset_config = load_dataset_config(
        str(
            files("aodn_cloud_optimised")
            .joinpath("config")
            .joinpath("dataset")
            .joinpath("mooring_hourly_timeseries_delayed_qc.json")
        )
    )

    cloud_optimised_creation(
        nc_obj_ls,
        dataset_config=dataset_config,
        clear_existing_data=True,  # this will delete existing data, be cautious! If testing, change the paths below
        optimised_bucket_name="imos-data-lab-optimised",  # optional, default value in config/common.json
        root_prefix_cloud_optimised_path="cloud_optimised/example_testing",  # optional, default value in config/common.json
        cluster_mode="local",
    )


if __name__ == "__main__":
    main()
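The filtering step in the Parquet example keeps an object key only if it contains every listed substring. A minimal standalone sketch of that logic is shown below so it can be tried without S3 access; filter_keys is a hypothetical helper and the sample keys are made up for illustration, not real object names.

```python
def filter_keys(keys, substrings):
    """Keep only keys that contain every substring (logical AND).

    Hypothetical helper, equivalent to applying the filters one after another.
    """
    return [key for key in keys if all(sub in key for sub in substrings)]


# Made-up keys for illustration; real listings come from s3_ls
sample_keys = [
    "IMOS/ANMN/NSW/SITE_A/hourly_timeseries/IMOS_ANMN_FV02_SITE_A_hourly-timeseries_END.nc",
    "IMOS/ANMN/NSW/SITE_A/Temperature/IMOS_ANMN_FV01_SITE_A_TEMP.nc",
    "IMOS/ANMN/NSW/SITE_B/hourly_timeseries/IMOS_ANMN_FV01_SITE_B_hourly-timeseries_END.nc",
]

selected = filter_keys(sample_keys, ["_hourly-timeseries_", "FV02"])
print(selected)
```

Only the first sample key survives, because it is the only one that matches both "_hourly-timeseries_" and "FV02". Checking the filtered list this way before calling cloud_optimised_creation with clear_existing_data=True can help avoid processing an unintended set of files.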