Cluster Configuration & Distributed Processing

All large-scale AODN processing uses Coiled — a managed Dask cluster service that auto-scales AWS workers. This guide explains how to choose the right settings for batch_size, memory_limit, nthreads, n_workers, and worker_vm_types.

Getting these parameters wrong is the most common cause of slow or failing processing jobs.


The batch_size Paradox

Important

Bigger ``batch_size`` does NOT create more parallelism — it creates less.

Each batch is one Dask task. One task runs on one worker. Coiled auto-scales the number of active workers based on the number of tasks waiting in the scheduler queue. The relationship is:

n_tasks  = total_files / batch_size
n_active_workers ≈ min(n_tasks, n_workers_max)

If you have 1 000 files and batch_size=100, you get 10 tasks. With 20 workers, at most 10 are ever active at the same time — the other 10 stay idle for the whole job.

If you use batch_size=10, you get 100 tasks → all 20 workers stay busy for 5 rounds.

Rule of thumb:

batch_size ≈ total_files / (n_workers_max × 5)

Targeting 5 rounds of work per worker means all workers stay busy and the job finishes efficiently. Targeting more rounds (10–20) is better for jobs with uneven file sizes.

Real example (AC-S hourly, 42 000 files):

  • batch_size=50, n_workers_max=20 → 840 tasks → good utilisation ✓

  • batch_size=5000, n_workers_max=20 → 8 tasks → 12 workers idle at all times ✗

Note

Each batch runs sequentially inside a single worker: all files in the batch are opened, processed, and written to Zarr one after another. A larger batch does not speed up a single worker; it just makes the scheduler give that worker more work before it can share.


The memory_limit Pitfall

memory_limit in the worker_options block tells Dask how much RAM the worker has available. When usage approaches this value, Dask spills data to disk to avoid OOM crashes.

If ``memory_limit`` is set higher than the actual VM RAM, Dask never spills, because it thinks there is still headroom. This leads to:

  • Workers running out of real RAM silently

  • OS swap activity → major slowdowns (100× slower than RAM)

  • Unexpected OOM kills from the OS (no warning, no retry)

The correct value is roughly 87.5 % of actual VM RAM (leaving ~12.5 % for the OS, Python interpreter, and Dask overhead).

VM RAM reference table

VM type

Actual RAM

Recommended memory_limit

m7i-flex.large m7i.large

8 GB

7 GB

m7i-flex.xlarge m7i.xlarge

16 GB

14 GB

m7i-flex.2xlarge m7i.2xlarge

32 GB

28 GB

m7i-flex.4xlarge m7i.4xlarge

64 GB

56 GB

m7i-flex.8xlarge m7i.8xlarge

128 GB

112 GB

Parquet vs. Zarr memory behaviour:

  • Parquet (timeseries): all files in a batch are fully loaded into a Pandas DataFrame. Peak memory ≈ batch_size × avg_file_size × 3. Keep batch_size small enough that the whole batch fits in memory_limit.

  • Zarr / xarray (gridded, spectral): files are opened lazily with open_mfdataset. Only the chunks currently being written are in memory. Peak memory is typically 1–3 chunks, regardless of batch_size. batch_size affects parallelism, not peak memory for Zarr.


nthreads — Thread Count per Worker

Each Dask worker can run multiple Python threads. Threads share memory but are subject to Python’s Global Interpreter Lock (GIL) for CPU-bound code.

General guidance:

Workload type

Recommended nthreads

Reason

Parquet, small NetCDF files (mooring, vessel, Argo)

4–8

I/O-bound; threads overlap download + decode

Standard Zarr (gridded satellite, mooring timeseries)

4–8

Moderate I/O and compute

Spectral Zarr with regridding (ACS, HyperOCR, DALEC)

2

Heavy SciPy interpolation; more threads cause GIL contention

Ocean colour (large tiled arrays, satellite_ocean_colour_*)

16–32

Dask chunks processed independently; each thread works on one tile

Warning

Setting nthreads too high on a small VM multiplies memory pressure. A worker with nthreads=32 and memory_limit=14GB allocates only 437 MB per thread — easily exceeded by any scientific interpolation step.


n_workers — Worker Count

Set as a list [min, max]. Coiled auto-scales between these limits:

  • min controls the number of workers that start immediately (warm-up cost). For short jobs set min=1. For jobs with thousands of tasks, set min=5–20 so the cluster doesn’t start cold.

  • max is the ceiling. Coiled only runs workers when there are tasks to process. Setting a high max costs nothing if the task queue is small.

Tip

Do not set max below n_tasks / 5. If there are fewer workers than tasks ÷ 5, workers will be busy for too many sequential rounds and the job will run slowly.


worker_vm_types — VM Selection

Choose the instance type based on the data format and spectral complexity:

VM type

Best for

Notes

m7i-flex.large (8 GB RAM)

Scalar parquet timeseries (WQM, EcoTriplet, small vessel data)

8 GB RAM; light I/O-only workloads

m7i-flex.xlarge (16 GB RAM)

Standard Zarr, moderate NetCDF (mooring, glider, radar, GHRSST)

16 GB RAM; most GHRSST datasets, Argo, most mooring collections

m7i.2xlarge (32 GB RAM)

Spectral Zarr, large gridded NetCDF (LJCO instruments, satellite L4)

32 GB RAM; ACS, HyperOCR, DALEC, large GHRSST composites

m7i.4xlarge (64 GB RAM)

High-resolution ocean colour, multi-band satellite products

64 GB RAM; multi-variable tiled products with large arrays

Use the m7i-flex.* variants where possible — they are Flex instances that can burst above their baseline CPU for short periods, which helps with decompression and interpolation spikes.


Putting It All Together — Configuration Reference

The full Coiled configuration block in a dataset JSON looks like this:

{
  "run_settings": {
    "batch_size": 50,
    "coiled_cluster_options": {
      "n_workers": [5, 40],
      "scheduler_options": {"idle_timeout": "2 hours"},
      "worker_vm_types": "m7i-flex.xlarge",
      "allow_ingress_from": "everyone",
      "compute_purchase_option": "spot_with_fallback",
      "worker_options": {
        "nthreads": 4,
        "memory_limit": "14GB"
      }
    }
  }
}

Worked Examples

Argo float profiles (Parquet, ~500 000 files across 11 DACs)

Each profile NetCDF is ~1.2 MB. Loaded fully into Pandas (parquet format).

{
  "batch_size": 300,
  "coiled_cluster_options": {
    "n_workers": [2, 80],
    "worker_vm_types": "m7i-flex.xlarge",
    "worker_options": {"nthreads": 8, "memory_limit": "12GB"}
  }
}
  • batch_size=300: 300 × 1.2 MB × 3× expansion ≈ 1.1 GB per batch → well within 12 GB

  • n_workers_max=80: 500 000 / 300 ≈ 1 667 tasks → 80 workers → ~21 rounds

  • nthreads=8: parquet loading is I/O-bound; threads overlap S3 downloads

Note

Argo processing is inherently long due to the sheer file count. With 1 667 tasks and 80 workers at ~3 min/task, expect ~60 minutes. This is expected behaviour, not a bug.

AC-S hourly (Spectral Zarr, 42 000 files, 707-wavelength regridding)

Each hourly file is ~4 MB. Heavy SciPy interpolation during preprocessing.

{
  "batch_size": 5,
  "coiled_cluster_options": {
    "n_workers": [1, 40],
    "worker_vm_types": "m7i.2xlarge",
    "worker_options": {"nthreads": 2, "memory_limit": "28GB"}
  }
}
  • batch_size=5: 42 000 / 5 = 8 400 tasks → 40 workers → 210 rounds, each tiny

  • nthreads=2: SciPy interp is CPU-bound; more threads cause GIL contention

  • memory_limit=28GB: 87.5 % of actual 32 GB RAM → safe Dask spill threshold

Warning

A previous misconfiguration had target_wavelength_grids.step=0.1, creating 3 531 wavelength points instead of 707. This caused each batch to load ~65 GB of data and the job ran for 12+ hours without completing. Always verify spectral grid sizes match the dimensions.chunk and dimensions.size values in the config.

GHRSST L3S 1-day (Gridded Zarr, ~15 000 files, 139 MB each)

Large gridded files opened lazily with xarray — peak memory per worker ≈ 1–2 tiles.

{
  "batch_size": 60,
  "coiled_cluster_options": {
    "n_workers": [35, 120],
    "worker_vm_types": "m7i-flex.xlarge",
    "worker_options": {"nthreads": 2, "memory_limit": "14GB"}
  }
}
  • batch_size=60: 15 000 / 60 = 250 tasks → 120 workers → 2 rounds (fast!)

  • nthreads=2: xarray zarr writes benefit from 2 threads for I/O overlap

  • Lazy loading: only active chunks are in memory, so 60 × 139 MB is not loaded at once


Troubleshooting

Processing is very slow despite many workers

Check that n_tasks n_workers_max. If total_files / batch_size < n_workers_max, most workers are idle. Reduce batch_size.

Workers keep crashing with OOM

  1. Check memory_limit does not exceed actual VM RAM (see VM RAM reference table).

  2. For Parquet: reduce batch_size (each batch is fully loaded).

  3. For spectral Zarr with target_wavelength_grids: verify the step produces the expected number of points — a small step creates massive arrays.

Very high CPU but low throughput (GIL contention)

Reduce nthreads. For spectral interpolation workloads, nthreads=2 is usually optimal. Additional threads queue behind the GIL and add overhead without adding throughput.

Job completes quickly but output Zarr store is missing data

The n_workers_max might have been hit before all tasks were queued. Check Coiled dashboard for task failures. If workers were killed by OOM (no graceful error), reduce batch_size and/or upgrade worker_vm_types.

Additional Resources