Core¶
Wrapper¶
- class ddpw.Wrapper(platform: Platform)¶
Bases: object
This class is the highest level of abstraction: it accepts the platform-related configurations and initialises the setup accordingly. When given a task, it runs the task according to the specified configurations.
Example
from ddpw import Platform, Wrapper

wrapper = Wrapper(Platform(...))
wrapper.start(some_callable)
- Parameters:
platform (Platform) – Platform-related configurations.
- start(target: Callable[[Tuple, dict], Any], *args, **kwargs)¶
This method performs the necessary setup according to the specified configurations and then invokes the given task.
- Parameters:
target (Callable[[Tuple, dict], Any]) – The task, a callable.
args (Optional[Tuple]) – Arguments to be passed to target. Default: None.
kwargs (Optional[Dict]) – Keyword arguments to be passed to target. Default: None.
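The type hint Callable[[Tuple, dict], Any] does not pin down exactly how the target is invoked, so the sketch below keeps the task's signature generic; the task name and argument values are illustrative only.

from ddpw import Platform, Wrapper

# a hypothetical task; the arguments given to `start` are passed through to it
def task(*args, **kwargs):
    print(args, kwargs)

wrapper = Wrapper(Platform(device='gpu', n_gpus=2))
wrapper.start(task, 5, lr=5e-4)  # forwards 5 and lr=5e-4 to the task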
Platform¶
- final class ddpw.Platform(name: str = 'ddpw', device: Device | str = Device.GPU, partition: str = 'general', n_nodes: int = 1, n_gpus: int = 1, n_cpus: int = 1, ram: int = 32, spawn_method: str | None = 'spawn', ipc_protocol: str = 'tcp', master_addr: str = 'localhost', master_port: str | None = '42298', ipc_groups: List[List[int]] | None = <factory>, backend: Backend | None = 'gloo', seed: int = 1889, timeout_min: int = 2880, slurm_additional_parameters: dict | None = None, console_logs: str = './logs', verbose: bool | None = True, upon_finish: Callable | None = None)¶
Bases: object
Platform-related configurations such as the device, environment, communication IP address and port, world size, etc.
Examples
from ddpw import Platform

# a setup with 4 GPUs
platform = Platform(device='gpu', n_gpus=4)

# a setup to request SLURM for 2 nodes, each with 3 GPUs, in the "example" partition
platform = Platform(device='slurm', n_nodes=2, n_gpus=3, partition='example')
- name: str = 'ddpw'¶
Name of the platform job. Used by SLURM. Default: ddpw.
- partition: str = 'general'¶
Name of the partition (used only by SLURM). Default: general.
- n_nodes: int = 1¶
The total number of nodes (used only by SLURM). Default: 1.
- n_gpus: int = 1¶
The number of GPUs (per node). Default: 1.
- n_cpus: int = 1¶
The total number of CPUs (used only by SLURM). Default: 1.
- ram: int = 32¶
Total RAM in GB (used only by SLURM). Default: 32.
- spawn_method: str | None = 'spawn'¶
The start method, passed as-is to mp.set_start_method(). Default: spawn.
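The accepted values are therefore those of Python's multiprocessing module ('spawn', 'fork', and 'forkserver'; availability varies by operating system). For instance:

from ddpw import Platform

# explicitly select the default start method; 'fork' and 'forkserver'
# are the other values multiprocessing.set_start_method() accepts
platform = Platform(device='gpu', n_gpus=2, spawn_method='spawn')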
- ipc_protocol: str = 'tcp'¶
IPC protocol. Accepted values: tcp and file. Default: tcp.
- master_addr: str = 'localhost'¶
IPC address. Default: localhost.
- master_port: str | None = '42298'¶
The port at which IPC happens. Default: a random port between 1024 and 49151.
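Together, ipc_protocol, master_addr, and master_port describe the IPC endpoint. A sketch of a multi-node setup; the address and port below are illustrative placeholders, not recommendations:

from ddpw import Platform

# point all workers at a common TCP rendezvous endpoint
platform = Platform(
    device='slurm',
    n_nodes=2,
    n_gpus=4,
    ipc_protocol='tcp',
    master_addr='10.0.0.1',
    master_port='29500',
)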
- ipc_groups: List[List[int]] | None¶
A list of lists of non-overlapping global ranks of devices. If None, every device will be its own group, and no IPC will take place. If an empty list is passed, all devices are grouped into one process group. Default: [].
Examples
# no IPC between devices; each device is its own group
platform = Platform(device='gpu', n_gpus=4, ipc_groups=None)

# all devices under one group: the default behaviour
platform = Platform(device='gpu', n_gpus=4)
platform = Platform(device='gpu', n_gpus=4, ipc_groups=[])

# custom groups
platform = Platform(device='gpu', n_gpus=4, ipc_groups=[[0, 2], [1], [3]])
platform = Platform(device='gpu', n_gpus=4, ipc_groups=[[0, 2], [1, 3]])
Warning: variable groups are unstable. PyTorch behaviour seems to be inconsistent when using variable process groups; an open bug report on GitHub tracks this.
- backend: Backend | None = 'gloo'¶
The PyTorch-supported backend to use for distributed data parallel.
Default: torch.distributed.Backend.GLOO.
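Any PyTorch-supported backend string is accepted here; for CUDA GPUs, nccl is PyTorch's usual recommendation over the default gloo. A sketch:

from ddpw import Platform

# NCCL is generally preferred for CUDA devices; gloo remains the default
platform = Platform(device='gpu', n_gpus=4, backend='nccl')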
- seed: int = 1889¶
Seed with which to initialise the various [pseudo]random number generators. Default: 1889.
- timeout_min: int = 2880¶
Minimum timeout (in minutes) for jobs (used only by SLURM). Default: 2880 (two days).
- slurm_additional_parameters: dict | None = None¶
Additional SLURM parameters; this dictionary corresponds to the one passed to submitit’s slurm_additional_parameters argument. Default: None.
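For instance, extra SLURM directives can be forwarded verbatim; the keys and values below are illustrative placeholders, not required fields:

from ddpw import Platform

# extra sbatch directives forwarded to submitit as-is;
# 'account' and 'constraint' are example keys, not requirements
platform = Platform(
    device='slurm',
    partition='example',
    slurm_additional_parameters={
        'account': 'my-account',
        'constraint': 'a100',
    },
)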
- console_logs: str = './logs'¶
Location of console logs (used mainly by SLURM to log the errors and output to files). Default: ./logs.
- verbose: bool | None = True¶
Whether or not to print updates to the standard output during setup. Default: True.
- upon_finish: Callable | None = None¶
An optional callable to be invoked upon completion of the given task. Default: None.
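A sketch of a completion hook; this page does not specify the arguments (if any) with which the callable is invoked, so the hook below takes none:

from ddpw import Platform

def notify():
    # e.g., log completion or send a notification
    print('Task finished.')

platform = Platform(device='gpu', n_gpus=2, upon_finish=notify)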
- property world_size¶
Specifies the world size: the total number of GPUs across all nodes. Default: 1.
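For example, under this definition (nodes × GPUs per node):

from ddpw import Platform

platform = Platform(device='slurm', n_nodes=2, n_gpus=3)
assert platform.world_size == 6  # 2 nodes × 3 GPUs per node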
- property requires_ipc¶
Specifies whether the processes need inter-communication, i.e., whether the setup requires IPC. IPC is not required for a single device.
- print()¶
This method serialises this object in a human-readable format and prints it.
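For instance:

from ddpw import Platform

Platform(device='gpu', n_gpus=4).print()  # prints the configuration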
Device¶
- final class ddpw.Device(*values)¶
Bases: Enum
The device on which to run the task.
- CPU = 'cpu'¶
The device to run on is a CPU.
- GPU = 'gpu'¶
The device to run on is one or more GPUs.
- SLURM = 'slurm'¶
The device to run on is a cluster of GPU nodes managed by SLURM.
- MPS = 'mps'¶
The device to run on is an Apple SoC (via Metal Performance Shaders).
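Since Platform accepts either a Device member or its string form (see the Platform signature above), the two spellings below are equivalent:

from ddpw import Device, Platform

platform_a = Platform(device=Device.GPU, n_gpus=4)
platform_b = Platform(device='gpu', n_gpus=4)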
- static from_str(device: str) → Device¶
This method returns a Device object given a valid device string.
- Parameters:
device (str) – The type of the device. Supported values: cpu, gpu, slurm, and mps (case insensitive).
- Returns:
Device – the Device corresponding to the device type string.
- Raises:
ValueError – if the device string is invalid.
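For example:

from ddpw import Device

assert Device.from_str('GPU') is Device.GPU  # case insensitive

try:
    Device.from_str('tpu')  # not a supported device string
except ValueError as error:
    print(error)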