I have a SLURM cluster with 50 nodes, each with 96 CPU cores. I want to run a job on the cluster that is split into 192 subtasks. In theory, I should be able to reserve two nodes and run all 192 tasks simultaneously. However, when I specified node41 and node42 in the nodelist, the tasks did not run in parallel across the two nodes: the logs show that each node executed tasks with the same rank. Below is my code:
import argparse

from datatrove.executor.slurm import SlurmPipelineExecutor

# MINTReader and MINTWriter are custom pipeline blocks from this project
# (their imports are omitted here, as in the original snippet).

parser = argparse.ArgumentParser(description="Read and Write example")
parser.add_argument("--input_folder", default="**", help="Input folder path")
parser.add_argument("--base_output_folder", default="**", help="Base output folder path")
parser.add_argument('--tasks', default=192, type=int,
                    help='total number of tasks to run the pipeline on (default: 192)')
parser.add_argument('--workers', default=-1, type=int,
                    help='how many tasks to run simultaneously (default is -1 for no limit, i.e. tasks)')
parser.add_argument('--limit', default=-1, type=int,
                    help='Number of files to process')
parser.add_argument('--logging_dir', default="**", type=str,
                    help='Path to the logging directory')
# parser.add_argument('--local_tasks', default=-1, type=int,
#                     help='how many of the total tasks should be run on this node/machine. -1 for all')
# parser.add_argument('--local_rank_offset', default=0, type=int,
#                     help='the rank of the first task to run on this machine.')
parser.add_argument('--job_name', default='**', type=str,
                    help='Name of the job')
parser.add_argument('--condaenv', default='vldata', type=str,
                    help='Name of the conda environment')
parser.add_argument('--slurm_logs_folder', default='**', type=str,
                    help='Path to the slurm logs folder')
parser.add_argument(
    '--nodelist',
    type=str,
    default='node41,node42',
    help='Comma-separated list of nodes (default: node41,node42)'
)
parser.add_argument('--nodes', default=2, type=int,
                    help='Number of nodes to use')
parser.add_argument('--time', default='01:00:00', type=str,
                    help='Time limit for the job')
parser.add_argument(
    '--exclude',
    type=str,
    help='List of nodes to exclude'
)

if __name__ == '__main__':
    args = parser.parse_args()

    # Extra #SBATCH directives forwarded to the generated sbatch script
    sbatch_args = {}
    if args.nodelist:
        sbatch_args["nodelist"] = args.nodelist
    if args.exclude:
        sbatch_args["exclude"] = args.exclude
    if args.nodes:
        sbatch_args["nodes"] = args.nodes

    pipeline = [
        MINTReader(data_folder=args.input_folder, glob_pattern="*.tar", limit=args.limit),
        MINTWriter(output_folder=args.base_output_folder)
    ]

    executor = SlurmPipelineExecutor(
        pipeline=pipeline,
        tasks=args.tasks,
        workers=args.workers,
        logging_dir=args.logging_dir,
        partition='cpu',
        sbatch_args=sbatch_args,
        condaenv=args.condaenv,
        time=args.time,
        job_name=args.job_name,
        slurm_logs_folder=args.slurm_logs_folder,
    )
    print(executor.run())
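A minimal diagnostic sketch, under the assumption (not confirmed from this snippet) that the SLURM executor derives each task's rank from the job-array index: if that is the case, passing nodes=2 via sbatch_args allocates two nodes to every array element, so the launched step can end up running the same rank on both nodes, which would match the symptom. The script below only prints the SLURM environment variables that typically distinguish tasks, so you can check from inside a running task which identifiers actually differ between node41 and node42; the script name and its use are hypothetical.

import os
import socket

# Hypothetical diagnostic: print the SLURM variables that usually distinguish
# one task from another. If SLURM_ARRAY_TASK_ID is identical on both nodes
# while SLURM_NODEID differs, the two nodes were allocated to the *same*
# array task (same rank) rather than to different ranks.
for var in (
    "SLURM_JOB_ID",
    "SLURM_ARRAY_JOB_ID",
    "SLURM_ARRAY_TASK_ID",  # job-array index, often used as the task rank
    "SLURM_PROCID",         # rank within an srun step
    "SLURM_NODEID",         # index of this node within the allocation
    "SLURM_JOB_NODELIST",
):
    print(f"{socket.gethostname()}: {var}={os.environ.get(var, '<unset>')}")

If the rank really does come from the array index, then dropping nodes from sbatch_args and letting the scheduler place single-node array tasks onto the nodelist may be worth trying; this is a hedged suggestion, not a confirmed fix.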
Below is my sbatch script: