Core¶
Wrapper¶
- class ddpw.Wrapper(platform: Platform)¶
Bases: object
This class is the highest level of abstraction: it accepts the platform-related configurations and initialises the setup accordingly. When given a task, it then runs the task according to the specified configurations.
Example

from ddpw import Platform, Wrapper

wrapper = Wrapper(Platform(...))
wrapper.start(some_callable)
- Parameters:
platform (Platform) – Platform-related configurations.
- start(target: Callable[[int, int, ProcessGroup, Tuple | None], Any], args: Tuple | None = None)¶
This method performs the necessary setup according to the specified configurations and then invokes the given task.
- Parameters:
target (Callable[[int, int, dist.ProcessGroup, Optional[Tuple]], Any]) – The task: a callable that accepts two integers (the global and local ranks of the device), the process group, and an optional tuple containing the callable’s arguments.
args (Optional[Tuple]) – Arguments to be passed to target. Default: None.
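For illustration, here is a minimal sketch of a compatible target (the function name task and its body are hypothetical):

import torch.distributed as dist

from ddpw import Platform, Wrapper

def task(global_rank: int, local_rank: int, group: dist.ProcessGroup, args):
    # args is the tuple passed to start(); None if no arguments were given
    print(f'Device {global_rank} (local rank {local_rank}) received {args}')

wrapper = Wrapper(Platform(device='gpu', n_gpus=2))
wrapper.start(task, args=('some', 'arguments'))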
Platform¶
- final class ddpw.Platform(name: str = 'ddpw', device: Device | str = Device.GPU, partition: str = 'general', n_nodes: int = 1, n_gpus: int = 1, n_cpus: int = 1, ram: int = 32, spawn_method: str | None = 'fork', ipc_protocol: str = 'tcp', master_addr: str = 'localhost', master_port: str | None = '11195', ipc_groups: List[List[int]] | None = <factory>, backend: Backend | None = 'gloo', seed: int = 1889, timeout_min: int = 2880, slurm_additional_parameters: dict | None = None, console_logs: str = './logs', verbose: bool | None = True, upon_finish: Callable | None = None)¶
Bases: object
Platform-related configurations such as the device, environment, communication IP address and port, world size, etc.
Examples

from ddpw import Platform

# a setup with 4 GPUs
platform = Platform(device='gpu', n_gpus=4)

# a setup to request SLURM for 2 nodes, each with 3 GPUs, in the "example" partition
platform = Platform(device='slurm', n_nodes=2, n_gpus=3, partition='example')
- name: str = 'ddpw'¶
Name of the platform job. Used by SLURM. Default: ddpw.
- partition: str = 'general'¶
Name of the SLURM partition (used only by SLURM). Default: general.
- n_nodes: int = 1¶
The total number of nodes (used only by SLURM). Default: 1.
- n_gpus: int = 1¶
The number of GPUs (per node). Default: 1.
- n_cpus: int = 1¶
The total number of CPUs (used only by SLURM). Default: 1.
- ram: int = 32¶
Total RAM (in GB) (used only by SLURM). Default: 32.
- spawn_method: str | None = 'fork'¶
The process start method; this string is passed to mp.set_start_method(). Default: fork.
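For instance, since CUDA contexts generally do not survive a fork, a sketch of opting for spawn when the parent process initialises CUDA first (whether this is needed depends on the setup):

# use 'spawn' if CUDA has already been initialised in the parent process
platform = Platform(device='gpu', n_gpus=4, spawn_method='spawn')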
- ipc_protocol: str = 'tcp'¶
IPC protocol. Accepted values: tcp and file. Default: tcp.
- master_addr: str = 'localhost'¶
IPC address. Default: localhost.
- master_port: str | None = '11195'¶
The port at which IPC happens. Default: a random port between 1024 and 49151.
- ipc_groups: List[List[int]] | None¶
A list of lists of non-overlapping global ranks of devices. If None, every device will be its own group, and no IPC will take place. If an empty list is passed, all devices are grouped into one process group. Default: [].

Examples

# no IPC between devices; each device is its own group
platform = Platform(device='gpu', n_gpus=4, ipc_groups=None)

# all devices under one group: default behaviour
platform = Platform(device='gpu', n_gpus=4)
platform = Platform(device='gpu', n_gpus=4, ipc_groups=[])

# custom groups
platform = Platform(device='gpu', n_gpus=4, ipc_groups=[[0, 2], [1], [3]])
platform = Platform(device='gpu', n_gpus=4, ipc_groups=[[0, 2], [1, 3]])
Variable groups unstable
PyTorch behaviour seems to be inconsistent when using variable process groups; an open bug report on GitHub tracks this.
- backend: Backend | None = 'gloo'¶
The PyTorch-supported backend to use for distributed data parallel. Default: torch.distributed.Backend.GLOO.
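As an illustration, NCCL, which PyTorch also supports, is often preferred for GPU-to-GPU communication (an assumption about the hardware; Gloo remains the safe default):

# NCCL is typically faster than Gloo for CUDA tensors
platform = Platform(device='gpu', n_gpus=4, backend='nccl')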
- seed: int = 1889¶
Seed with which to initialise the various [pseudo]random number generators. Default: 1889.
- timeout_min: int = 2880¶
Minimum timeout (in minutes) for jobs (used only by SLURM). Default: 2880 (two days).
- slurm_additional_parameters: dict | None = None¶
Additional SLURM parameters; this dictionary corresponds to the one passed to submitit’s slurm_additional_parameters argument. Default: None.
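For example, SLURM options that Platform does not expose directly can be forwarded through this dictionary (the account and constraint values below are placeholders):

platform = Platform(
    device='slurm',
    n_nodes=2,
    n_gpus=3,
    slurm_additional_parameters={
        'account': 'my-account',    # placeholder SLURM account
        'constraint': 'volta32gb',  # placeholder node constraint
    },
)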
- console_logs: str = './logs'¶
Location of console logs (used mainly by SLURM to log errors and output to files). Default: ./logs.
- verbose: bool | None = True¶
Whether or not to print updates to the standard output during setup. Default: True.
- upon_finish: Callable | None = None¶
An optional callable to be invoked upon completion of the given task. Default: None.
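A minimal sketch, assuming the callable takes no arguments (the notify function is hypothetical):

def notify():
    print('Task finished')

platform = Platform(device='gpu', n_gpus=2, upon_finish=notify)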
- property world_size¶
Specifies the world size: the total number of GPUs across all nodes. Default: 1.
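A quick illustration, assuming the world size is simply n_nodes × n_gpus:

platform = Platform(device='slurm', n_nodes=2, n_gpus=3)
print(platform.world_size)  # 6: two nodes with three GPUs each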
- property requires_ipc¶
Specifies whether the setup requires inter-process communication. IPC is not required for a single device.
- print()¶
This method serialises this object in a human-readable format and prints it.
Device¶
- final class ddpw.Device(value)¶
Bases:
Enum
The device on which to run the task.
- CPU = 'cpu'¶
The device to run on is a CPU.
- GPU = 'gpu'¶
The device to run on is one or more GPUs.
- SLURM = 'slurm'¶
The device to run on is a cluster of GPU nodes managed by SLURM.
- MPS = 'mps'¶
The device to run on is an Apple SoC.
- static from_str(device: str) → Device¶
This method returns a Device object given a valid device string.
- Parameters:
device (str) – The type of the device. Supported values: cpu, gpu, slurm, and mps (case insensitive).
- Returns Device:
The Device corresponding to the device type string.
- Raises:
ValueError – If the device string is invalid.
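A short usage sketch:

from ddpw import Device

assert Device.from_str('GPU') is Device.GPU  # case insensitive
assert Device.from_str('slurm') is Device.SLURM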