Cluster Configuration & Distributed Processing
All large-scale AODN processing uses Coiled — a managed Dask cluster
service that auto-scales AWS workers. This guide explains how to choose the right settings for
batch_size, memory_limit, nthreads, n_workers, and worker_vm_types.
Getting these parameters wrong is the most common cause of slow or failing processing jobs.
The batch_size Paradox
Important
Bigger ``batch_size`` does NOT create more parallelism — it creates less.
Each batch is one Dask task. One task runs on one worker. Coiled auto-scales the number of active workers based on the number of tasks waiting in the scheduler queue. The relationship is:
n_tasks = total_files / batch_size
n_active_workers ≈ min(n_tasks, n_workers_max)
If you have 1 000 files and batch_size=100, you get 10 tasks. With 20 workers, at most 10 are
ever active at the same time — the other 10 stay idle for the whole job.
If you use batch_size=10, you get 100 tasks → all 20 workers stay busy for 5 rounds.
Rule of thumb:
batch_size ≈ total_files / (n_workers_max × 5)
Targeting 5 rounds of work per worker means all workers stay busy and the job finishes efficiently. Targeting more rounds (10–20) is better for jobs with uneven file sizes.
Real example (AC-S hourly, 42 000 files):
batch_size=50,n_workers_max=20→ 840 tasks → good utilisation ✓batch_size=5000,n_workers_max=20→ 8 tasks → 12 workers idle at all times ✗
Note
Each batch runs sequentially inside a single worker: all files in the batch are opened, processed, and written to Zarr one after another. A larger batch does not speed up a single worker; it just makes the scheduler give that worker more work before it can share.
The memory_limit Pitfall
memory_limit in the worker_options block tells Dask how much RAM the worker has
available. When usage approaches this value, Dask spills data to disk to avoid OOM crashes.
If ``memory_limit`` is set higher than the actual VM RAM, Dask never spills, because it thinks there is still headroom. This leads to:
Workers running out of real RAM silently
OS swap activity → major slowdowns (100× slower than RAM)
Unexpected OOM kills from the OS (no warning, no retry)
The correct value is roughly 87.5 % of actual VM RAM (leaving ~12.5 % for the OS, Python interpreter, and Dask overhead).
VM RAM reference table
VM type |
Actual RAM |
Recommended
|
|---|---|---|
|
8 GB |
7 GB |
|
16 GB |
14 GB |
|
32 GB |
28 GB |
|
64 GB |
56 GB |
|
128 GB |
112 GB |
Parquet vs. Zarr memory behaviour:
Parquet (timeseries): all files in a batch are fully loaded into a Pandas DataFrame. Peak memory ≈
batch_size × avg_file_size × 3. Keepbatch_sizesmall enough that the whole batch fits inmemory_limit.Zarr / xarray (gridded, spectral): files are opened lazily with
open_mfdataset. Only the chunks currently being written are in memory. Peak memory is typically 1–3 chunks, regardless ofbatch_size.batch_sizeaffects parallelism, not peak memory for Zarr.
nthreads — Thread Count per Worker
Each Dask worker can run multiple Python threads. Threads share memory but are subject to Python’s Global Interpreter Lock (GIL) for CPU-bound code.
General guidance:
Workload type |
Recommended
|
Reason |
|---|---|---|
Parquet, small NetCDF files (mooring, vessel, Argo) |
4–8 |
I/O-bound; threads overlap download + decode |
Standard Zarr (gridded satellite, mooring timeseries) |
4–8 |
Moderate I/O and compute |
Spectral Zarr with regridding (ACS, HyperOCR, DALEC) |
2 |
Heavy SciPy interpolation; more threads cause GIL contention |
Ocean colour (large tiled arrays,
|
16–32 |
Dask chunks processed independently; each thread works on one tile |
Warning
Setting nthreads too high on a small VM multiplies memory pressure. A worker with
nthreads=32 and memory_limit=14GB allocates only 437 MB per thread — easily
exceeded by any scientific interpolation step.
n_workers — Worker Count
Set as a list [min, max]. Coiled auto-scales between these limits:
mincontrols the number of workers that start immediately (warm-up cost). For short jobs setmin=1. For jobs with thousands of tasks, setmin=5–20so the cluster doesn’t start cold.maxis the ceiling. Coiled only runs workers when there are tasks to process. Setting a highmaxcosts nothing if the task queue is small.
Tip
Do not set max below n_tasks / 5. If there are fewer workers than tasks ÷ 5,
workers will be busy for too many sequential rounds and the job will run slowly.
worker_vm_types — VM Selection
Choose the instance type based on the data format and spectral complexity:
VM type |
Best for |
Notes |
|---|---|---|
|
Scalar parquet timeseries (WQM, EcoTriplet, small vessel data) |
8 GB RAM; light I/O-only workloads |
|
Standard Zarr, moderate NetCDF (mooring, glider, radar, GHRSST) |
16 GB RAM; most GHRSST datasets, Argo, most mooring collections |
|
Spectral Zarr, large gridded NetCDF (LJCO instruments, satellite L4) |
32 GB RAM; ACS, HyperOCR, DALEC, large GHRSST composites |
|
High-resolution ocean colour, multi-band satellite products |
64 GB RAM; multi-variable tiled products with large arrays |
Use the m7i-flex.* variants where possible — they are Flex instances that can burst
above their baseline CPU for short periods, which helps with decompression and interpolation spikes.
Putting It All Together — Configuration Reference
The full Coiled configuration block in a dataset JSON looks like this:
{
"run_settings": {
"batch_size": 50,
"coiled_cluster_options": {
"n_workers": [5, 40],
"scheduler_options": {"idle_timeout": "2 hours"},
"worker_vm_types": "m7i-flex.xlarge",
"allow_ingress_from": "everyone",
"compute_purchase_option": "spot_with_fallback",
"worker_options": {
"nthreads": 4,
"memory_limit": "14GB"
}
}
}
}
Worked Examples
Argo float profiles (Parquet, ~500 000 files across 11 DACs)
Each profile NetCDF is ~1.2 MB. Loaded fully into Pandas (parquet format).
{
"batch_size": 300,
"coiled_cluster_options": {
"n_workers": [2, 80],
"worker_vm_types": "m7i-flex.xlarge",
"worker_options": {"nthreads": 8, "memory_limit": "12GB"}
}
}
batch_size=300: 300 × 1.2 MB × 3× expansion ≈ 1.1 GB per batch → well within 12 GBn_workers_max=80: 500 000 / 300 ≈ 1 667 tasks → 80 workers → ~21 roundsnthreads=8: parquet loading is I/O-bound; threads overlap S3 downloads
Note
Argo processing is inherently long due to the sheer file count. With 1 667 tasks and 80 workers at ~3 min/task, expect ~60 minutes. This is expected behaviour, not a bug.
AC-S hourly (Spectral Zarr, 42 000 files, 707-wavelength regridding)
Each hourly file is ~4 MB. Heavy SciPy interpolation during preprocessing.
{
"batch_size": 5,
"coiled_cluster_options": {
"n_workers": [1, 40],
"worker_vm_types": "m7i.2xlarge",
"worker_options": {"nthreads": 2, "memory_limit": "28GB"}
}
}
batch_size=5: 42 000 / 5 = 8 400 tasks → 40 workers → 210 rounds, each tinynthreads=2: SciPyinterpis CPU-bound; more threads cause GIL contentionmemory_limit=28GB: 87.5 % of actual 32 GB RAM → safe Dask spill threshold
Warning
A previous misconfiguration had target_wavelength_grids.step=0.1, creating 3 531
wavelength points instead of 707. This caused each batch to load ~65 GB of data and the
job ran for 12+ hours without completing. Always verify spectral grid sizes match the
dimensions.chunk and dimensions.size values in the config.
GHRSST L3S 1-day (Gridded Zarr, ~15 000 files, 139 MB each)
Large gridded files opened lazily with xarray — peak memory per worker ≈ 1–2 tiles.
{
"batch_size": 60,
"coiled_cluster_options": {
"n_workers": [35, 120],
"worker_vm_types": "m7i-flex.xlarge",
"worker_options": {"nthreads": 2, "memory_limit": "14GB"}
}
}
batch_size=60: 15 000 / 60 = 250 tasks → 120 workers → 2 rounds (fast!)nthreads=2: xarray zarr writes benefit from 2 threads for I/O overlapLazy loading: only active chunks are in memory, so 60 × 139 MB is not loaded at once
Troubleshooting
Processing is very slow despite many workers
Check that n_tasks ≫ n_workers_max. If total_files / batch_size < n_workers_max,
most workers are idle. Reduce batch_size.
Workers keep crashing with OOM
Check
memory_limitdoes not exceed actual VM RAM (see VM RAM reference table).For Parquet: reduce
batch_size(each batch is fully loaded).For spectral Zarr with
target_wavelength_grids: verify thestepproduces the expected number of points — a small step creates massive arrays.
Very high CPU but low throughput (GIL contention)
Reduce nthreads. For spectral interpolation workloads, nthreads=2 is usually optimal.
Additional threads queue behind the GIL and add overhead without adding throughput.
Job completes quickly but output Zarr store is missing data
The n_workers_max might have been hit before all tasks were queued. Check Coiled
dashboard for task failures. If workers were killed by OOM (no graceful error), reduce
batch_size and/or upgrade worker_vm_types.
Additional Resources
Dataset Configuration — Full dataset configuration reference including
coiled_cluster_optionsschema