Cluster Configuration & Distributed Processing

All large-scale AODN processing uses Coiled — a managed Dask cluster service that auto-scales AWS workers. This guide explains how to choose the right settings for batch_size, memory_limit, nthreads, n_workers, and worker_vm_types.

Getting these parameters wrong is the most common cause of slow or failing processing jobs.

The `batch_size` Paradox 

Important

Bigger ``batch_size`` does NOT create more parallelism — it creates less.

Each batch is one Dask task. One task runs on one worker. Coiled auto-scales the number of active workers based on the number of tasks waiting in the scheduler queue. The relationship is:

n_tasks  = total_files / batch_size
n_active_workers ≈ min(n_tasks, n_workers_max)

If you have 1 000 files and batch_size=100, you get 10 tasks. With 20 workers, at most 10 are ever active at the same time — the other 10 stay idle for the whole job.

If you use batch_size=10, you get 100 tasks → all 20 workers stay busy for 5 rounds.

Rule of thumb:

batch_size ≈ total_files / (n_workers_max × 5)

Targeting 5 rounds of work per worker means all workers stay busy and the job finishes efficiently. Targeting more rounds (10–20) is better for jobs with uneven file sizes.

Real example (AC-S hourly, 42 000 files):

batch_size=50, n_workers_max=20 → 840 tasks → good utilisation ✓
batch_size=5000, n_workers_max=20 → 8 tasks → 12 workers idle at all times ✗

Note

Each batch runs sequentially inside a single worker: all files in the batch are opened, processed, and written to Zarr one after another. A larger batch does not speed up a single worker; it just makes the scheduler give that worker more work before it can share.

The `memory_limit` Pitfall 

memory_limit in the worker_options block tells Dask how much RAM the worker has available. When usage approaches this value, Dask spills data to disk to avoid OOM crashes.

If ``memory_limit`` is set higher than the actual VM RAM, Dask never spills, because it thinks there is still headroom. This leads to:

Workers running out of real RAM silently
OS swap activity → major slowdowns (100× slower than RAM)
Unexpected OOM kills from the OS (no warning, no retry)

The correct value is roughly 87.5 % of actual VM RAM (leaving ~12.5 % for the OS, Python interpreter, and Dask overhead).

VM RAM reference table 

VM type	Actual RAM	Recommended `memory_limit`
`m7i-flex.large` `m7i.large`	8 GB	7 GB
`m7i-flex.xlarge` `m7i.xlarge`	16 GB	14 GB
`m7i-flex.2xlarge` `m7i.2xlarge`	32 GB	28 GB
`m7i-flex.4xlarge` `m7i.4xlarge`	64 GB	56 GB
`m7i-flex.8xlarge` `m7i.8xlarge`	128 GB	112 GB

Parquet vs. Zarr memory behaviour:

Parquet (timeseries): all files in a batch are fully loaded into a Pandas DataFrame. Peak memory ≈ batch_size × avg_file_size × 3. Keep batch_size small enough that the whole batch fits in memory_limit.
Zarr / xarray (gridded, spectral): files are opened lazily with open_mfdataset. Only the chunks currently being written are in memory. Peak memory is typically 1–3 chunks, regardless of batch_size. batch_size affects parallelism, not peak memory for Zarr.

`nthreads` — Thread Count per Worker 

Each Dask worker can run multiple Python threads. Threads share memory but are subject to Python’s Global Interpreter Lock (GIL) for CPU-bound code.

General guidance:

Workload type	Recommended `nthreads`	Reason
Parquet, small NetCDF files (mooring, vessel, Argo)	4–8	I/O-bound; threads overlap download + decode
Standard Zarr (gridded satellite, mooring timeseries)	4–8	Moderate I/O and compute
Spectral Zarr with regridding (ACS, HyperOCR, DALEC)	2	Heavy SciPy interpolation; more threads cause GIL contention
Ocean colour (large tiled arrays, `satellite_ocean_colour_*`)	16–32	Dask chunks processed independently; each thread works on one tile

Warning

Setting nthreads too high on a small VM multiplies memory pressure. A worker with nthreads=32 and memory_limit=14GB allocates only 437 MB per thread — easily exceeded by any scientific interpolation step.

`n_workers` — Worker Count 

Set as a list [min, max]. Coiled auto-scales between these limits:

min controls the number of workers that start immediately (warm-up cost). For short jobs set min=1. For jobs with thousands of tasks, set min=5–20 so the cluster doesn’t start cold.
max is the ceiling. Coiled only runs workers when there are tasks to process. Setting a high max costs nothing if the task queue is small.

Tip

Do not set max below n_tasks / 5. If there are fewer workers than tasks ÷ 5, workers will be busy for too many sequential rounds and the job will run slowly.

`worker_vm_types` — VM Selection 

Choose the instance type based on the data format and spectral complexity:

VM type	Best for	Notes
`m7i-flex.large` (8 GB RAM)	Scalar parquet timeseries (WQM, EcoTriplet, small vessel data)	8 GB RAM; light I/O-only workloads
`m7i-flex.xlarge` (16 GB RAM)	Standard Zarr, moderate NetCDF (mooring, glider, radar, GHRSST)	16 GB RAM; most GHRSST datasets, Argo, most mooring collections
`m7i.2xlarge` (32 GB RAM)	Spectral Zarr, large gridded NetCDF (LJCO instruments, satellite L4)	32 GB RAM; ACS, HyperOCR, DALEC, large GHRSST composites
`m7i.4xlarge` (64 GB RAM)	High-resolution ocean colour, multi-band satellite products	64 GB RAM; multi-variable tiled products with large arrays

Use the m7i-flex.* variants where possible — they are Flex instances that can burst above their baseline CPU for short periods, which helps with decompression and interpolation spikes.

Putting It All Together — Configuration Reference 

The full Coiled configuration block in a dataset JSON looks like this:

{
  "run_settings": {
    "batch_size": 50,
    "coiled_cluster_options": {
      "n_workers": [5, 40],
      "scheduler_options": {"idle_timeout": "2 hours"},
      "worker_vm_types": "m7i-flex.xlarge",
      "allow_ingress_from": "everyone",
      "compute_purchase_option": "spot_with_fallback",
      "worker_options": {
        "nthreads": 4,
        "memory_limit": "14GB"
      }
    }
  }
}

Worked Examples 

Argo float profiles (Parquet, ~500 000 files across 11 DACs)

Each profile NetCDF is ~1.2 MB. Loaded fully into Pandas (parquet format).

{
  "batch_size": 300,
  "coiled_cluster_options": {
    "n_workers": [2, 80],
    "worker_vm_types": "m7i-flex.xlarge",
    "worker_options": {"nthreads": 8, "memory_limit": "12GB"}
  }
}

batch_size=300: 300 × 1.2 MB × 3× expansion ≈ 1.1 GB per batch → well within 12 GB
n_workers_max=80: 500 000 / 300 ≈ 1 667 tasks → 80 workers → ~21 rounds
nthreads=8: parquet loading is I/O-bound; threads overlap S3 downloads

Note

Argo processing is inherently long due to the sheer file count. With 1 667 tasks and 80 workers at ~3 min/task, expect ~60 minutes. This is expected behaviour, not a bug.

AC-S hourly (Spectral Zarr, 42 000 files, 707-wavelength regridding)

Each hourly file is ~4 MB. Heavy SciPy interpolation during preprocessing.

{
  "batch_size": 5,
  "coiled_cluster_options": {
    "n_workers": [1, 40],
    "worker_vm_types": "m7i.2xlarge",
    "worker_options": {"nthreads": 2, "memory_limit": "28GB"}
  }
}

batch_size=5: 42 000 / 5 = 8 400 tasks → 40 workers → 210 rounds, each tiny
nthreads=2: SciPy interp is CPU-bound; more threads cause GIL contention
memory_limit=28GB: 87.5 % of actual 32 GB RAM → safe Dask spill threshold

Warning

A previous misconfiguration had target_wavelength_grids.step=0.1, creating 3 531 wavelength points instead of 707. This caused each batch to load ~65 GB of data and the job ran for 12+ hours without completing. Always verify spectral grid sizes match the dimensions.chunk and dimensions.size values in the config.

GHRSST L3S 1-day (Gridded Zarr, ~15 000 files, 139 MB each)

Large gridded files opened lazily with xarray — peak memory per worker ≈ 1–2 tiles.

{
  "batch_size": 60,
  "coiled_cluster_options": {
    "n_workers": [35, 120],
    "worker_vm_types": "m7i-flex.xlarge",
    "worker_options": {"nthreads": 2, "memory_limit": "14GB"}
  }
}

batch_size=60: 15 000 / 60 = 250 tasks → 120 workers → 2 rounds (fast!)
nthreads=2: xarray zarr writes benefit from 2 threads for I/O overlap
Lazy loading: only active chunks are in memory, so 60 × 139 MB is not loaded at once

Troubleshooting 

Processing is very slow despite many workers

Check that n_tasks ≫ n_workers_max. If total_files / batch_size < n_workers_max, most workers are idle. Reduce batch_size.

Workers keep crashing with OOM

Check memory_limit does not exceed actual VM RAM (see VM RAM reference table).
For Parquet: reduce batch_size (each batch is fully loaded).
For spectral Zarr with target_wavelength_grids: verify the step produces the expected number of points — a small step creates massive arrays.

Very high CPU but low throughput (GIL contention)

Reduce nthreads. For spectral interpolation workloads, nthreads=2 is usually optimal. Additional threads queue behind the GIL and add overhead without adding throughput.

Job completes quickly but output Zarr store is missing data

The n_workers_max might have been hit before all tasks were queued. Check Coiled dashboard for task failures. If workers were killed by OOM (no graceful error), reduce batch_size and/or upgrade worker_vm_types.

Additional Resources 

Dataset Configuration — Full dataset configuration reference including coiled_cluster_options schema
Coiled Documentation
Dask Distributed Documentation