
Core

Wrapper

class ddpw.Wrapper(platform: Platform)

Bases: object

This class is the highest level of abstraction: it accepts the platform-related configurations and initialises the setup accordingly. When given a task, it then runs the task according to the specified configurations.

Example

from ddpw import Platform, Wrapper

wrapper = Wrapper(Platform(...))

wrapper.start(some_callable)
Parameters:

platform (Platform) – Platform-related configurations.

start(target: Callable[[Tuple, dict], Any], *args, **kwargs)

This method performs the necessary setup according to the specified configurations and then invokes the given task.

Parameters:
  • target (Callable[[Tuple, dict], Any]) – The task, a callable.

  • args (Optional[Tuple]) – Arguments to be passed to target. Default: None.

  • kwargs (Optional[Dict]) – Keyword arguments to be passed to target. Default: None.
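The forwarding contract can be sketched without the distributed machinery (a simplified stand-in for illustration, not ddpw's implementation; `some_task` is a hypothetical callable):

```python
from typing import Any, Callable

def start(target: Callable[..., Any], *args, **kwargs) -> Any:
    # Simplified stand-in for Wrapper.start: after the platform setup
    # (elided here), the given arguments are forwarded to the target
    # on each spawned process.
    return target(*args, **kwargs)

def some_task(dataset_path: str, epochs: int = 1) -> str:
    # a hypothetical task; any callable works
    return f"{dataset_path}:{epochs}"

result = start(some_task, 'data/train', epochs=10)  # → 'data/train:10'
```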

ddpw.wrapper(platform: Platform)

A decorator that can be applied to callables.

Parameters:

platform (Platform) – Platform details.

Example

from ddpw import Platform, wrapper

platform = Platform(device='gpu', n_gpus_per_node=2, n_cpus_per_node=2)

@wrapper(platform)
def run(*args, **kwargs):
    # some task
    pass

Calling the decorated run() then starts it on the configured platform.

Platform

final class ddpw.Platform(name: str = 'ddpw', device: Device | str = Device.GPU, partition: str = 'general', n_nodes: int = 1, n_gpus_per_node: int = 1, n_cpus_per_node: int = 1, mem_per_node: int = 32, spawn_method: str | None = 'spawn', ipc_protocol: str = 'tcp', master_addr: str = 'localhost', master_port: str | None = <factory>, ipc_groups: List[List[int]] | None = <factory>, backend: dist.Backend | None = None, seed: int = 1889, timeout_min: int = 2880, slurm_additional_parameters: dict | None = None, console_logs: str = './.output/logs', verbose: bool | None = True, upon_finish: Callable | None = None)

Bases: object

Platform-related configurations such as the device, environment, communication IP address and port, world size, etc.

Examples

from ddpw import Platform

# a setup with 4 GPUs
platform = Platform(device='gpu', n_gpus_per_node=4)

# a setup to request SLURM for 2 nodes, each with 3 GPUs in the "example" partition
platform = Platform(device='slurm', n_nodes=2, n_gpus_per_node=3, partition='example')
name: str = 'ddpw'

Name of the platform job.

Used by SLURM. Default: ddpw.

device: Device | str = 'gpu'

The type of device.

Default: Device.GPU.

partition: str = 'general'

Name of the SLURM partition (used only by SLURM).

Default: general.

n_nodes: int = 1

The total number of nodes (used only by SLURM).

Default: 1.

n_gpus_per_node: int = 1

The number of GPUs per node.

Default: 1.

n_cpus_per_node: int = 1

The number of CPUs per node (used only by SLURM). Must be divisible by n_gpus_per_node so that CPUs can be split evenly across tasks (one task per GPU).
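The divisibility constraint amounts to a simple check (an illustration of the documented rule, not ddpw's code):

```python
def cpus_per_task(n_cpus_per_node: int, n_gpus_per_node: int) -> int:
    # CPUs on a node are split evenly across tasks, one task per GPU
    if n_cpus_per_node % n_gpus_per_node != 0:
        raise ValueError('n_cpus_per_node must be divisible by n_gpus_per_node')
    return n_cpus_per_node // n_gpus_per_node

cpus_per_task(8, 4)  # → 2: each of the 4 GPU tasks gets 2 CPUs
```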

Default: 1.

mem_per_node: int = 32

Memory per node in GB (used only by SLURM). Maps to SLURM’s --mem.

Default: 32.

spawn_method: str | None = 'spawn'

This string is passed to multiprocessing's mp.set_start_method().

Default: spawn.

ipc_protocol: str = 'tcp'

IPC protocol.

Accepted values: tcp and file. Default: tcp.
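How the protocol, address, and port might combine into a PyTorch rendezvous URL (an assumed layout for illustration; ddpw's internals may differ):

```python
def init_method(ipc_protocol: str, master_addr: str, master_port: str) -> str:
    # tcp:// uses the address and port; file:// treats the address as a path
    if ipc_protocol == 'tcp':
        return f'tcp://{master_addr}:{master_port}'
    if ipc_protocol == 'file':
        return f'file://{master_addr}'
    raise ValueError(f'unsupported IPC protocol: {ipc_protocol}')

init_method('tcp', 'localhost', '29500')  # → 'tcp://localhost:29500'
```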

master_addr: str = 'localhost'

IPC address.

Default: localhost.

master_port: str | None

The port at which IPC happens.

Default: a random port between 1024 and 49151.
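The documented default is equivalent to drawing a port from the registered range (a sketch, not ddpw's exact code):

```python
import random

def default_master_port() -> str:
    # a random port in the registered (non-privileged, non-ephemeral) range
    return str(random.randint(1024, 49151))
```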

ipc_groups: List[List[int]] | None

Process-group layout across global ranks. Behaviour by value:

  • None — no process group is initialised; group handed to the task is None and collective ops cannot be called. Use this to opt out of distributed coordination entirely.

  • [] (default) — all ranks share a single process group (GroupMember.WORLD). Initialised even when world_size == 1 so task code can call collective ops uniformly.

  • A list of lists of non-overlapping global ranks — each inner list forms its own process group. Every rank must belong to exactly one group.

Examples

# no IPC; each device runs independently
platform = Platform(device='gpu', n_gpus_per_node=4, ipc_groups=None)

# all devices under one group: default behaviour
platform = Platform(device='gpu', n_gpus_per_node=4)
platform = Platform(device='gpu', n_gpus_per_node=4, ipc_groups=[])

# custom groups
platform = Platform(device='gpu', n_gpus_per_node=4, ipc_groups=[[0, 2], [1], [3]])
platform = Platform(device='gpu', n_gpus_per_node=4, ipc_groups=[[0, 2], [1, 3]])

Variable groups unstable

PyTorch's behaviour appears to be inconsistent when using variable process groups; a bug report tracking this is open on GitHub.

backend: dist.Backend | None = None

The PyTorch-supported backend to use for distributed data parallel.

None (default) resolves to torch.distributed.Backend.GLOO when the process group is initialised. Set explicitly to pick a different backend.

seed: int = 1889

Seed with which to initialise the various [pseudo]random number generators.

Default: 1889.

timeout_min: int = 2880

Minimum timeout (in minutes) for jobs (used only by SLURM).

Default: 2880 (two days).

slurm_additional_parameters: dict | None = None

Additional SLURM parameters; this dictionary corresponds to the one passed to submitit’s slurm_additional_parameters argument. Default: None.

console_logs: str = './.output/logs'

Location of console logs (used mainly by SLURM to log the errors and output to files).

Default: ./.output/logs.

verbose: bool | None = True

Whether or not to print updates to the standard output during setup.

Default: True.

upon_finish: Callable | None = None

An optional callable to be invoked upon completion of the given task.

Default: None.

property world_size

Specifies the world size.

This is the total number of GPUs across all nodes. Default: 1.
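For GPU and SLURM setups this reduces to simple arithmetic (a sketch of the documented behaviour, not ddpw's code):

```python
def world_size(n_nodes: int, n_gpus_per_node: int) -> int:
    # total number of GPUs (hence processes) across all nodes
    return n_nodes * n_gpus_per_node

world_size(2, 3)  # → 6
```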

property requires_ipc

Specifies whether the processes need inter-process communication.

This property determines whether or not the setup requires IPC. IPC is not required for a single device.
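A plausible reading of this rule (an assumption, combining the single-device rule with the ipc_groups=None opt-out documented above; ddpw's actual logic may differ):

```python
def requires_ipc(world_size: int, ipc_groups) -> bool:
    # IPC only matters with more than one process, and only when process
    # groups have not been disabled via ipc_groups=None
    return world_size > 1 and ipc_groups is not None
```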

print()

This method serialises this object in a human-readable format and prints it.

Device

final class ddpw.Device(*values)

Bases: Enum

The device on which to run the task.

CPU = 'cpu'

The device to run on is a CPU.

GPU = 'gpu'

The device to run on is one or more GPUs.

SLURM = 'slurm'

The device to run on is a cluster of GPU nodes managed by SLURM.

MPS = 'mps'

The device to run on is an Apple SoC (Apple silicon, via Metal Performance Shaders).

static from_str(device: str) → Device

This method returns a Device object given a valid device string.

Parameters:

device (str) – The type of the device. Supported values: cpu, gpu, slurm, and mps (case insensitive).

Returns:

Device – The Device corresponding to the given device-type string.

Raises:

ValueError – Raised if the device string is invalid.
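
The case-insensitive mapping can be sketched with a standalone enum (a re-creation for illustration, not ddpw's source):

```python
from enum import Enum

class Device(Enum):
    # standalone re-creation of ddpw.Device for illustration
    CPU = 'cpu'
    GPU = 'gpu'
    SLURM = 'slurm'
    MPS = 'mps'

    @staticmethod
    def from_str(device: str) -> 'Device':
        try:
            return Device(device.lower())  # case-insensitive value lookup
        except ValueError:
            raise ValueError(f'invalid device string: {device!r}')

Device.from_str('GPU')  # → Device.GPU
```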