Core¶
Wrapper¶
- class ddpw.Wrapper(platform: Platform)¶
Bases: object
This class is the highest level of abstraction: it accepts the platform-related configurations and initialises the setup accordingly. When given a task, it then runs the task according to the specified configurations.
Example
from ddpw import Platform, Wrapper

wrapper = Wrapper(Platform(...))
wrapper.start(some_callable)
- Parameters:
platform (Platform) – Platform-related configurations.
- start(target: Callable[[Tuple, dict], Any], *args, **kwargs)¶
This method performs the necessary setup according to the specified configurations and then invokes the given task.
- Parameters:
target (Callable[[Tuple, dict], Any]) – The task, a callable.
args (Optional[Tuple]) – Arguments to be passed to target. Default: None.
kwargs (Optional[Dict]) – Keyword arguments to be passed to target. Default: None.
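For illustration, a minimal sketch of how positional and keyword arguments reach the task. The process-spawning machinery is elided, and `start_sketch` and `task` are hypothetical names, not part of ddpw:

```python
from typing import Any, Callable

def start_sketch(target: Callable[..., Any], *args, **kwargs) -> Any:
    # Hypothetical simplification: the real start() spawns one process per
    # device and invokes `target` in each; here it is called directly.
    return target(*args, **kwargs)

def task(dataset: str, epochs: int = 1) -> str:
    return f"training on {dataset} for {epochs} epoch(s)"

result = start_sketch(task, "cifar10", epochs=3)
```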
- ddpw.wrapper(platform: Platform)¶
A decorator that can be applied to callables.
- Parameters:
platform (Platform) – Platform details.
Example
from ddpw import Platform, wrapper

platform = Platform(device='gpu', n_gpus_per_node=2, n_cpus_per_node=2)

@wrapper(platform)
def run(*args, **kwargs):
    # some task
    pass
Platform¶
- final class ddpw.Platform(name: str = 'ddpw', device: Device | str = Device.GPU, partition: str = 'general', n_nodes: int = 1, n_gpus_per_node: int = 1, n_cpus_per_node: int = 1, mem_per_node: int = 32, spawn_method: str | None = 'spawn', ipc_protocol: str = 'tcp', master_addr: str = 'localhost', master_port: str | None = <factory>, ipc_groups: List[List[int]] | None = <factory>, backend: dist.Backend | None = None, seed: int = 1889, timeout_min: int = 2880, slurm_additional_parameters: dict | None = None, console_logs: str = './.output/logs', verbose: bool | None = True, upon_finish: Callable | None = None)¶
Bases: object
Platform-related configurations such as the device, environment, communication IP address and port, world size, etc.
Examples
from ddpw import Platform

# a setup with 4 GPUs
platform = Platform(device='gpu', n_gpus_per_node=4)

# a setup to request SLURM for 2 nodes, each with 3 GPUs in the "example" partition
platform = Platform(device='slurm', n_nodes=2, n_gpus_per_node=3, partition='example')
- name: str = 'ddpw'¶
Name of the platform job.
Used by SLURM. Default: ddpw.
- partition: str = 'general'¶
Name of the SLURM partition (used only by SLURM).
Default: general.
- n_nodes: int = 1¶
The total number of nodes (used only by SLURM).
Default: 1.
- n_gpus_per_node: int = 1¶
The number of GPUs per node.
Default: 1.
- n_cpus_per_node: int = 1¶
The number of CPUs per node (used only by SLURM). Must be divisible by n_gpus_per_node so that CPUs can be split evenly across tasks (one task per GPU). Default: 1.
- mem_per_node: int = 32¶
Memory per node in GB (used only by SLURM). Maps to SLURM’s --mem. Default: 32.
- spawn_method: str | None = 'spawn'¶
This string corresponds to that passed to mp.set_start_method(). Default: spawn.
- ipc_protocol: str = 'tcp'¶
IPC protocol.
Accepted values: tcp and file. Default: tcp.
- master_addr: str = 'localhost'¶
IPC address.
Default: localhost.
- master_port: str | None¶
The port at which IPC happens.
Default: a random port between 1024 and 49151.
- ipc_groups: List[List[int]] | None¶
Process-group layout across global ranks. Behaviour by value:
- None — no process group is initialised; the group handed to the task is None and collective ops cannot be called. Use this to opt out of distributed coordination entirely.
- [] (default) — all ranks share a single process group (GroupMember.WORLD). Initialised even when world_size == 1 so task code can call collective ops uniformly.
- A list of lists of non-overlapping global ranks — each inner list forms its own process group. Every rank must belong to exactly one group.
Examples
# no IPC; each device runs independently
platform = Platform(device='gpu', n_gpus_per_node=4, ipc_groups=None)

# all devices under one group: default behaviour
platform = Platform(device='gpu', n_gpus_per_node=4)
platform = Platform(device='gpu', n_gpus_per_node=4, ipc_groups=[])

# custom groups
platform = Platform(device='gpu', n_gpus_per_node=4, ipc_groups=[[0, 2], [1], [3]])
platform = Platform(device='gpu', n_gpus_per_node=4, ipc_groups=[[0, 2], [1, 3]])
Variable groups unstable
PyTorch behaviour seems to be inconsistent when using variable process groups; an open issue on GitHub tracks this.
- backend: dist.Backend | None = None¶
The PyTorch-supported backend to use for distributed data parallel.
None (default) resolves to torch.distributed.Backend.GLOO when the process group is initialised. Set explicitly to pick a different backend.
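For example, a hedged sketch of picking NCCL (the usual backend for CUDA tensors) instead of the GLOO default. This assumes a CUDA-enabled build of PyTorch; it is a configuration fragment, not a verified recipe:

```python
from ddpw import Platform
import torch.distributed as dist

# Assumption: NCCL is available (CUDA builds of PyTorch); GLOO remains
# the resolved default when `backend` is left as None.
platform = Platform(device='gpu', n_gpus_per_node=2, backend=dist.Backend.NCCL)
```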
- seed: int = 1889¶
Seed with which to initialise the various [pseudo]random number generators.
Default: 1889.
- timeout_min: int = 2880¶
Minimum timeout (in minutes) for jobs (used only by SLURM).
Default: 2880 (two days).
- slurm_additional_parameters: dict | None = None¶
Additional SLURM parameters; this dictionary corresponds to the one passed to submitit’s slurm_additional_parameters argument. Default: None.
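A hedged configuration sketch of passing extra SLURM flags through; the `account` and `constraint` values below are hypothetical placeholders, not documented defaults:

```python
from ddpw import Platform

platform = Platform(
    device='slurm',
    partition='example',
    n_nodes=2,
    n_gpus_per_node=3,
    # hypothetical flags forwarded to submitit's slurm_additional_parameters
    slurm_additional_parameters={'account': 'my-account', 'constraint': 'a100'},
)
```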
- console_logs: str = './.output/logs'¶
Location of console logs (used mainly by SLURM to log the errors and output to files).
Default: ./.output/logs.
- verbose: bool | None = True¶
Whether or not to print updates to the standard output during setup.
Default: True.
- upon_finish: Callable | None = None¶
An optional callable to be invoked upon completion of the given task.
Default: None.
- property world_size¶
Specifies the world size: the total number of GPUs across all nodes. Default: 1.
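Assuming this property is the product of the node count and the GPUs per node (a reasonable reading of the description above, not confirmed by the source), it can be sketched as:

```python
def world_size_sketch(n_nodes: int = 1, n_gpus_per_node: int = 1) -> int:
    # total number of GPU processes across all nodes (assumed formula)
    return n_nodes * n_gpus_per_node

# 2 SLURM nodes with 3 GPUs each
print(world_size_sketch(n_nodes=2, n_gpus_per_node=3))  # prints 6
```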
- property requires_ipc¶
Specifies whether the processes need inter-process communication.
This property determines whether or not the setup requires IPC. IPC is not required for a single device.
- print()¶
This method serialises this object into a human-readable format and prints it.
Device¶
- final class ddpw.Device(*values)¶
Bases: Enum
The device on which to run the task.
- CPU = 'cpu'¶
The device to run on is a CPU.
- GPU = 'gpu'¶
The device to run on is one or more GPUs.
- SLURM = 'slurm'¶
The device to run on is a cluster of GPU nodes managed by SLURM.
- MPS = 'mps'¶
The device to run on is an Apple SoC (via Metal Performance Shaders).
- static from_str(device: str) → Device¶
This method returns a Device object given a valid device string.
- Parameters:
device (str) – The type of the device. Supported values: cpu, gpu, slurm, and mps (case insensitive).
- Returns Device:
The Device corresponding to the device type string.
- Raises:
ValueError – Raised if the string is invalid.
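The lookup can be mimicked with a plain Enum. This is a self-contained sketch, not ddpw's implementation; `DeviceSketch` and `from_str` are stand-in names:

```python
from enum import Enum

class DeviceSketch(Enum):
    # mirrors the ddpw.Device members documented above
    CPU = 'cpu'
    GPU = 'gpu'
    SLURM = 'slurm'
    MPS = 'mps'

def from_str(device: str) -> DeviceSketch:
    try:
        # Enum lookup by value; lower() makes the match case-insensitive
        return DeviceSketch(device.lower())
    except ValueError:
        raise ValueError(f'invalid device string: {device!r}')
```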