
Core

Wrapper

class ddpw.Wrapper(platform: Platform)

Bases: object

This class is the highest level of abstraction: it accepts platform-related configurations, initialises the setup accordingly, and, when given a task, runs it according to those configurations.

Example

from ddpw import Platform, Wrapper

wrapper = Wrapper(Platform(...))

wrapper.start(some_callable)
Parameters:

platform (Platform) – Platform-related configurations.

start(target: Callable[[int, int, ProcessGroup, Tuple | None], Any], args: Tuple | None = None)

This method performs the necessary setup according to the specified configurations and then invokes the given task.

Parameters:
  • target (Callable[[int, int, dist.ProcessGroup, Optional[Tuple]], Any]) – The task: a callable that accepts two integers (the global and local ranks of the device), the process group, and an optional tuple containing the callable’s arguments (see the sketch after this list).

  • args (Optional[Tuple]) – Arguments to be passed to target. Default: None.
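A minimal sketch of a target callable matching this signature; the function name train and the (epochs, lr) arguments are purely illustrative:

from ddpw import Platform, Wrapper

def train(global_rank, local_rank, group, args):
    # `args` is the tuple passed to `start`; here assumed to be (epochs, lr)
    epochs, lr = args
    print(f'Device {global_rank} (local rank {local_rank}): training for {epochs} epochs')

wrapper = Wrapper(Platform(device='gpu', n_gpus=2))
wrapper.start(train, args=(10, 0.01))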

ddpw.wrapper(platform: Platform)

A decorator that can be applied to callables.

Parameters:

platform (Platform) – Platform details.

Example

from ddpw import Platform, wrapper

@wrapper(Platform(device='gpu', n_gpus=2, n_cpus=2))
def run(a, b):
    # some task
    pass

Platform

final class ddpw.Platform(name: str = 'ddpw', device: Device | str = Device.GPU, partition: str = 'general', n_nodes: int = 1, n_gpus: int = 1, n_cpus: int = 1, ram: int = 32, spawn_method: str | None = 'fork', ipc_protocol: str = 'tcp', master_addr: str = 'localhost', master_port: str | None = '11195', ipc_groups: List[List[int]] | None = <factory>, backend: Backend | None = 'gloo', seed: int = 1889, timeout_min: int = 2880, slurm_additional_parameters: dict | None = None, console_logs: str = './logs', verbose: bool | None = True, upon_finish: Callable | None = None)

Bases: object

Platform-related configurations such as the device, environment, communication IP address and port, world size, etc.

Examples

from ddpw import Platform

# a setup with 4 GPUs
platform = Platform(device='gpu', n_gpus=4)

# a setup to request SLURM for 2 nodes, each with 3 GPUs in the "example" partition
platform = Platform(device='slurm', n_nodes=2, n_gpus=3, partition='example')

name: str = 'ddpw'

Name of the platform job. Used by SLURM. Default: ddpw.

device: Device | str = 'gpu'

The type of device. Default: Device.GPU.

partition: str = 'general'

Name of the SLURM partition (used only by SLURM). Default: general.

n_nodes: int = 1

The total number of nodes (used only by SLURM). Default: 1.

n_gpus: int = 1

The number of GPUs (per node). Default: 1.

n_cpus: int = 1

The total number of CPUs (used only by SLURM). Default: 1.

ram: int = 32

Total RAM (in GB) (used only by SLURM). Default: 32.

spawn_method: str | None = 'fork'

This string is passed to mp.set_start_method(). Default: fork.
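A short sketch; 'spawn' is one of the start methods accepted by mp.set_start_method() and may be preferable on platforms where forking is unsafe:

# use the 'spawn' start method instead of the default 'fork'
platform = Platform(device='gpu', n_gpus=2, spawn_method='spawn')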

ipc_protocol: str = 'tcp'

IPC protocol. Accepted values: tcp and file. Default: tcp.

master_addr: str = 'localhost'

IPC address. Default: localhost.

master_port: str | None = '11195'

The port at which IPC happens. Default: a random port between 1024 and 49151.
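An illustrative sketch combining the address and port options; the values shown are placeholders:

# communicate over TCP at an explicit address and port
platform = Platform(device='slurm', n_nodes=2, n_gpus=4, master_addr='10.0.0.1', master_port='29500')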

ipc_groups: List[List[int]] | None

A list of lists of non-overlapping global ranks of devices. If None, every device will be its own group, and no IPC will take place. If an empty list is passed, all devices are grouped into one process group. Default: [].

Examples

# no IPC between devices; each device is its own group
platform = Platform(device='gpu', n_gpus=4, ipc_groups=None)

# all devices under one group: default behaviour
platform = Platform(device='gpu', n_gpus=4)
platform = Platform(device='gpu', n_gpus=4, ipc_groups=[])

# custom groups
platform = Platform(device='gpu', n_gpus=4, ipc_groups=[[0, 2], [1], [3]])
platform = Platform(device='gpu', n_gpus=4, ipc_groups=[[0, 2], [1, 3]])

Warning: variable groups unstable

PyTorch’s behaviour appears to be inconsistent when using variable process groups; there is an open issue about this on GitHub.

backend: Backend | None = 'gloo'

The PyTorch-supported backend to use for distributed data parallel. Default: torch.distributed.Backend.GLOO.
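For example, assuming the NCCL backend is available in the local PyTorch build:

# use NCCL for GPU-to-GPU communication instead of the default Gloo
platform = Platform(device='gpu', n_gpus=4, backend='nccl')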

seed: int = 1889

Seed with which to initialise the various pseudorandom number generators. Default: 1889.

timeout_min: int = 2880

Minimum timeout (in minutes) for jobs (used only by SLURM). Default: 2880 (two days).

slurm_additional_parameters: dict | None = None

Additional SLURM parameters; this dictionary corresponds to the one passed to submitit’s slurm_additional_parameters argument. Default: None.
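A sketch of a SLURM setup with extra parameters; the account and constraint values are placeholders, and the dictionary keys are simply forwarded to submitit:

# request SLURM options not covered by the named arguments
platform = Platform(
    device='slurm',
    n_nodes=2,
    n_gpus=4,
    partition='example',
    slurm_additional_parameters={'account': 'my-account', 'constraint': 'a100'},
)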

console_logs: str = './logs'

Location of console logs (used mainly by SLURM to write output and error logs to files). Default: ./logs.

verbose: bool | None = True

Whether or not to print updates to the standard output during setup. Default: True.

upon_finish: Callable | None = None

An optional callable to be invoked upon completion of the given task. Default: None.
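A minimal sketch, assuming the callable is invoked with no arguments:

# print a message once the task has finished
platform = Platform(device='gpu', n_gpus=2, upon_finish=lambda: print('Task finished'))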

property world_size

Specifies the world size: the total number of GPUs across all nodes. Default: 1.
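For example, following the description above:

platform = Platform(device='slurm', n_nodes=2, n_gpus=3)
print(platform.world_size)  # 6: three GPUs on each of two nodes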

property requires_ipc

Specifies whether the processes need to communicate with one another; IPC is not required for a single device.

print()

This method serialises this object in a human-readable format and prints it.

Device

final class ddpw.Device(value, names=_not_given, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

The device on which to run the task.

CPU = 'cpu'

The device to run on is a CPU.

GPU = 'gpu'

The device to run on is one or more GPUs.

SLURM = 'slurm'

The device to run on is a cluster of GPU nodes managed by SLURM.

MPS = 'mps'

The device to run on is an Apple SoC (MPS).

static from_str(device: str) → Device

This method returns a Device object given a valid device string.

Parameters:

device (str) – The type of the device. Supported values: cpu, gpu, slurm, and mps (case insensitive).

Returns Device:

The Device corresponding to the given device string.

Raises:

ValueError – Raised if the device string is invalid.
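A small usage sketch based on the accepted values listed above:

from ddpw import Device

device = Device.from_str('GPU')  # case-insensitive; returns Device.GPU
Device.from_str('tpu')           # raises ValueError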