Core¶
Wrapper¶
- class ddpw.Wrapper(platform: Platform)¶
Bases: object
This class is the highest level of abstraction: it accepts the platform-related configurations and initialises the setup accordingly. When given a task, it then runs the task according to the specified configurations.
Example

from ddpw import Platform, Wrapper

wrapper = Wrapper(Platform(...))
wrapper.start(some_callable)
- Parameters:
platform (Platform) – Platform-related configurations.
- start(target: Callable[[int, int, ProcessGroup, Tuple | None], Any], args: Tuple | None = None)¶
This method performs the necessary setup according to the specified configurations and then invokes the given task.
- Parameters:
target (Callable[[int, int, dist.ProcessGroup, Optional[Tuple]], Any]) – The task: a callable that accepts two integers (the global and local ranks of the device), the process group, and an optional tuple containing the callable’s arguments.
args (Optional[Tuple]) – Arguments to be passed to target. Default: None.
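For illustration, here is a minimal sketch of a compatible target (the function name task and its body are hypothetical):

import torch.distributed as dist

from ddpw import Platform, Wrapper

def task(global_rank: int, local_rank: int, group: dist.ProcessGroup, args):
    # args is the tuple passed to start(); None if no arguments were given
    print(f'Device {global_rank} (local rank {local_rank}) received {args}')

wrapper = Wrapper(Platform(device='gpu', n_gpus=2))
wrapper.start(task, args=('some', 'arguments'))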
Platform¶
- final class ddpw.Platform(name: str = 'ddpw', device: Device | str = Device.GPU, partition: str = 'general', n_nodes: int = 1, n_gpus: int = 1, n_cpus: int = 1, ram: int = 32, spawn_method: str | None = 'fork', ipc_protocol: str = 'tcp', master_addr: str = 'localhost', master_port: str | None = '11195', ipc_groups: List[List[int]] | None = <factory>, backend: Backend | None = 'gloo', seed: int = 1889, timeout_min: int = 2880, slurm_additional_parameters: dict | None = None, console_logs: str = './logs', verbose: bool | None = True, upon_finish: Callable | None = None)¶
Bases: object
Platform-related configurations such as the device, environment, communication IP address and port, world size, etc.
Examples

from ddpw import Platform

# a setup with 4 GPUs
platform = Platform(device='gpu', n_gpus=4)

# a setup to request SLURM for 2 nodes, each with 3 GPUs, in the "example" partition
platform = Platform(device='slurm', n_nodes=2, n_gpus=3, partition='example')
- name: str = 'ddpw'¶
Name of the platform job. Used by SLURM. Default: ddpw.
- partition: str = 'general'¶
Name of the SLURM partition (used only by SLURM). Default: general.
- n_nodes: int = 1¶
The total number of nodes (used only by SLURM). Default: 1.
- n_gpus: int = 1¶
The number of GPUs (per node). Default: 1.
- n_cpus: int = 1¶
The total number of CPUs (used only by SLURM). Default: 1.
- ram: int = 32¶
Total RAM (in GB) (used only by SLURM). Default: 32.
- spawn_method: str | None = 'fork'¶
The process start method; this string is passed to mp.set_start_method(). Default: fork.
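For instance, since CUDA contexts generally do not survive a fork, a sketch of opting for spawn when the parent process initialises CUDA first (whether this is needed depends on the setup):

# use 'spawn' if CUDA has already been initialised in the parent process
platform = Platform(device='gpu', n_gpus=4, spawn_method='spawn')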
- ipc_protocol: str = 'tcp'¶
IPC protocol. Accepted values: tcp and file. Default: tcp.
- master_addr: str = 'localhost'¶
IPC address. Default: localhost.
- master_port: str | None = '11195'¶
The port at which IPC happens. Default: a random port between 1024 and 49151.
- ipc_groups: List[List[int]] | None¶
A list of lists of non-overlapping global ranks of devices. If None, every device will be its own group, and no IPC will take place. If an empty list is passed, all devices are grouped into one process group. Default: [].

Examples

# no IPC between devices; each device is its own group
platform = Platform(device='gpu', n_gpus=4, ipc_groups=None)

# all devices under one group: default behaviour
platform = Platform(device='gpu', n_gpus=4)
platform = Platform(device='gpu', n_gpus=4, ipc_groups=[])

# custom groups
platform = Platform(device='gpu', n_gpus=4, ipc_groups=[[0, 2], [1], [3]])
platform = Platform(device='gpu', n_gpus=4, ipc_groups=[[0, 2], [1, 3]])
Variable groups unstable
PyTorch behaviour seems to be inconsistent when using variable process groups; an open bug report on GitHub tracks this.
- backend: Backend | None = 'gloo'¶
The PyTorch-supported backend to use for distributed data parallel. Default: torch.distributed.Backend.GLOO.
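As an illustration, NCCL, which PyTorch also supports, is often preferred for GPU-to-GPU communication (an assumption about the hardware; Gloo remains the safe default):

# NCCL is typically faster than Gloo for CUDA tensors
platform = Platform(device='gpu', n_gpus=4, backend='nccl')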
- seed: int = 1889¶
Seed with which to initialise the various [pseudo]random number generators. Default: 1889.
- timeout_min: int = 2880¶
Minimum timeout (in minutes) for jobs (used only by SLURM). Default: 2880 (two days).
- slurm_additional_parameters: dict | None = None¶
Additional SLURM parameters; this dictionary corresponds to the one passed to submitit’s slurm_additional_parameters argument. Default: None.
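For example, SLURM options that Platform does not expose directly can be forwarded through this dictionary (the account and constraint values below are placeholders):

platform = Platform(
    device='slurm',
    n_nodes=2,
    n_gpus=3,
    slurm_additional_parameters={
        'account': 'my-account',    # placeholder SLURM account
        'constraint': 'volta32gb',  # placeholder node constraint
    },
)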
- console_logs: str = './logs'¶
Location of console logs (used mainly by SLURM to log errors and output to files). Default: ./logs.
- verbose: bool | None = True¶
Whether or not to print updates to the standard output during setup. Default: True.
- upon_finish: Callable | None = None¶
An optional callable to be invoked upon completion of the given task. Default: None.
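A minimal sketch, assuming the callable takes no arguments (the notify function is hypothetical):

def notify():
    print('Task finished')

platform = Platform(device='gpu', n_gpus=2, upon_finish=notify)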
- property world_size¶
Specifies the world size: the total number of GPUs across all nodes. Default: 1.
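A quick illustration, assuming the world size is simply n_nodes × n_gpus:

platform = Platform(device='slurm', n_nodes=2, n_gpus=3)
print(platform.world_size)  # 6: two nodes with three GPUs each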
- property requires_ipc¶
Specifies whether the setup requires inter-process communication. IPC is not required for a single device.
- print()¶
This method serialises this object in a human-readable format and prints it.
Device¶
- final class ddpw.Device(value)¶
Bases:
Enum
The device on which to run the task.
- CPU = 'cpu'¶
The device to run on is a CPU.
- GPU = 'gpu'¶
The device to run on is one or more GPUs.
- SLURM = 'slurm'¶
The device to run on is a cluster of GPU nodes managed by SLURM.
- MPS = 'mps'¶
The device to run on is an Apple SoC.
- static from_str(device: str) → Device¶
This method returns a Device object given a valid device string.
- Parameters:
device (str) – The type of the device. Supported values: cpu, gpu, slurm, and mps (case insensitive).
- Returns Device:
The Device corresponding to the device type string.
- Raises:
ValueError – If the device string is invalid.
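A short usage sketch:

from ddpw import Device

assert Device.from_str('GPU') is Device.GPU  # case insensitive
assert Device.from_str('slurm') is Device.SLURM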