Reading OME-Zarr in a PyTorch DataLoader¶
iohub defaults to the zarrs-python implementation, which uses the
zarrs (Rust) codec pipeline. It is
fast, but its parallel codec runs on a process-global thread pool that is
not fork-safe. PyTorch's DataLoader forks worker processes on Linux when
num_workers > 0, and the first chunk decode inside a forked worker deadlocks.
A tiny dataset¶
import torch
from torch.utils.data import Dataset
from iohub import open_ome_zarr
class FOVDataset(Dataset):
def __init__(self, store_path: str):
self.store_path = store_path
def __len__(self) -> int:
return 8
def __getitem__(self, idx: int) -> torch.Tensor:
# Open lazily and read a single index from the "0" array.
with open_ome_zarr(self.store_path, layout="fov", mode="r") as fov:
return torch.from_numpy(fov["0"][idx])
What breaks¶
Deadlocks on Linux
from torch.utils.data import DataLoader
sample = FOVDataset(store_path)[0] # (1)!
loader = DataLoader(FOVDataset(store_path), num_workers=4) # (2)!
for batch in loader: # hangs forever on the first batch
...
- Any decode in the main process before the workers start (a sanity read like this, normalization stats, a previous epoch) initializes the zarrs codec's thread pool here. This is what arms the deadlock.
num_workers > 0forks worker processes. Each child inherits the pool's locks but none of its threads, so the first decode blocks on a latch no worker will ever release.
Why it's intermittent
The hang needs both a fork start method and a main-process decode
before iterating. A pipeline that reads only inside workers may never see it,
while the same code that first reads an array in the main process deadlocks
every time. That is exactly why this bug is easy to miss in testing.
How to do it correctly¶
The simplest fix is to read in the main process, with no workers and no fork():
Load in the main process
from torch.utils.data import DataLoader
loader = DataLoader(FOVDataset(store_path), num_workers=0) # (1)!
for batch in loader:
... # decodes on the calling process, safe
num_workers=0does all reading in the calling process. zarrs keeps its fast Rust codec and there is nofork()to corrupt the pool.
If you need parallel workers
Force spawned (not forked) workers, so each re-initializes zarrs in a fresh process:
loader = DataLoader(
FOVDataset(store_path),
num_workers=4,
multiprocessing_context="spawn",
)
Spawn re-imports your modules and pickles the Dataset for each worker,
so startup is slower and throughput drops by roughly 10-20%.
Or opt out of the Rust pipeline
Pass implementation="zarr-python" to open_ome_zarr to use the pure-Python
codec pipeline, which is fork-safe and works with forked DataLoader
workers at the cost of slower decoding:
Why this happens
This is not specific to OME-Zarr or iohub. Any zarr array read through the
zarrs-python (Rust) codec pipeline inside a torch DataLoader hits it: the
parallel codec uses a process-global thread pool that cannot survive fork().
annbatch documents the same
failure and recommends the same spawn workaround.