Category: Particle
Subject: Particle Physics Linux Condor Batch Farm
For support please email itsupport@physics.ox.ac.uk, which will create a ticket to be seen by the PP Linux admin team.

The Basics

The batch farm runs Enterprise Linux 9 (EL9), has approximately 1.8 PB of network-attached storage, and is accessible through the "interactive" login nodes pplxint12 and pplxint13. Please follow the log-in instructions as described here. The EL9 cluster consists of two interactive nodes and a number of EL9 worker nodes. HTCondor is the job scheduler for the batch farm. All worker nodes are configured with 4 GB of RAM and approximately 50 GB of local scratch disk per logical CPU core. Please note that interactive login to the worker nodes is disabled for all users.

HTCondor Quick Start Guide

To submit jobs to the HTCondor batch system, log in to either of the Particle Physics "interactive" nodes, pplxint12 or pplxint13, and create a submit file containing commands that tell HTCondor how to run your jobs. The batch system will locate a worker node within the pool that can run each job, and the output is returned to the interactive nodes.

Submitting a Job

A submit file is required that sets environment variables for the HTCondor batch queue and calls an executable. For example, the submit file myjob.submit below runs hello.py in the batch queue.

hello.py example file:

#!/usr/bin/python
import platform
host=platform.node()
print ("Hello World - ", host)
print ("finished")
Make the script executable first by running:

$ chmod +x hello.py
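You can check that the script works before submitting it by running it directly on the interactive node (provided a Python interpreter is available at the path given in the shebang line); it will simply print the hostname of pplxint12/13 rather than a worker node:

$ ./hello.py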
Submit File

An example myjob.submit file:

executable = hello.py
getenv = true
output = output/results.output.$(ClusterId)
error = error/results.error.$(ClusterId)
log = log/results.log.$(ClusterId)
notification = never
queue 1
Where:

executable - the script or program to run on the worker node
getenv = true - copies the environment of your login session into the job
output / error - files that receive the job's standard output and standard error; $(ClusterId) expands to the cluster number assigned at submission
log - the file where HTCondor records job events such as submission, execution and completion
notification = never - suppresses email notifications
queue 1 - submits one instance of the job
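Note that HTCondor does not create the output, error and log directories referenced in the submit file for you; make sure they exist before submitting, for example:

$ mkdir -p output error log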
Submitting the job

A job is added to the HTCondor queue for execution using condor_submit. On pplxint12, simply run the command:

$ condor_submit myjob.submit
Submitting job(s). 1 job(s) submitted to cluster 70.
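Once the job has completed, its standard output can be inspected on the interactive node. For the cluster number in the example above, the output file defined in the submit file expands to:

$ cat output/results.output.70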
Memory and CPU estimates

The Condor batch system allocates one CPU core and 4 GB of memory per job. If your job requires more CPU cores or more memory, it can request them in the job submit file. For example, if your job needs 3 CPU cores and 8 GB of memory, add the following to the submit file:

request_cpus = 3
request_memory = 8 GB
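For instance, a variant of the earlier myjob.submit requesting these resources might look like the sketch below (the executable and file names are just the placeholders used earlier):

executable = hello.py
getenv = true
request_cpus = 3
request_memory = 8 GB
output = output/results.output.$(ClusterId)
error = error/results.error.$(ClusterId)
log = log/results.log.$(ClusterId)
notification = never
queue 1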
Monitoring the jobThe condor_q command prints a listing of all the jobs currently in the queue. For example, a short time after submitting “myjob.submit” job from pplxint12, output appears as $ condor_q ID
OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
70.0 davda 2/13 10:49 0+00:00:03 R 0 97.7 myjob.submit
1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
The queue might contain many jobs. To see only your own jobs, pass the "-submitter" option with your Unix login to the condor_q command. For example, to show only davda's jobs:

$ condor_q -submitter davda
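If a job sits idle and you want to know why it has not matched a worker node, condor_q can also try to explain, for example (using the cluster.process ID from above):

$ condor_q -better-analyze 70.0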
Rather than monitoring the job by repeatedly running the condor_q command, use the condor_wait command:

$ condor_wait -status log/results.log.70
70.0.0 submitted
70.0.0 executing on host <163.1.136.221:9618?addrs=163.1.136.221-9618+[--1]-9618&noUDP&sock=1232_a0c4_3>
70.0.0 completed
All jobs done.
Removing a job

Successfully submitted jobs will occasionally need to be removed from the queue. Use the condor_rm command, specifying the job identifier as a command-line argument. For example, remove job number 70.0 from the queue with:

$ condor_rm 70.0
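condor_rm also accepts a bare cluster number to remove every job in that cluster, or your username to remove all of your own jobs, for example:

$ condor_rm 70
$ condor_rm davda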
Data transfer jobs

Data transfer jobs are submitted to the normal queue. See the example below, saved as, for example, my_grid_transfer_job.submit:

executable = my_grid_transfer_job.sh
output = my_grid_transfer_job.output.$(ClusterId)
error = my_grid_transfer_job.error.$(ClusterId)
log = my_grid_transfer_job.log.$(ClusterId)
getenv = true
notification = never
queue 1
The commands to transfer files should be placed in a script, and the script must be executable. For example, the script could contain:

For an internal transfer:
cp <some directory path>/Jan-Dec_2020.dat /data/atlas/<login>/datasets

For an external transfer:
scp -i /home/<LOGIN>/data_transfer_key <CERN LOGIN>@lxplus.cern.ch:bigfile /data/myexperiment/bigfile
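As a sketch of how such a script might be laid out (the paths and filenames are the placeholders from above, so adjust them to your own data):

#!/bin/bash
# Copy a dataset to the local group area (internal transfer).
cp <some directory path>/Jan-Dec_2020.dat /data/atlas/<login>/datasets
# Fetch a large file from CERN lxplus (external transfer).
scp -i /home/<LOGIN>/data_transfer_key <CERN LOGIN>@lxplus.cern.ch:bigfile /data/myexperiment/bigfile

Remember to make the script executable with chmod +x, as with hello.py above.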
More often, data transfer jobs are used to transfer data to/from the Grid. To submit such a job, prepare a script which is capable of transferring your data correctly, and add the following line after the "#!/bin/bash" in your script:

export X509_USER_PROXY=${HOME}/.gridProxy
The above command instructs the Grid tools to look for your Grid proxy credentials. The location of your Grid proxy credentials must be accessible to both the interactive machines and the worker nodes. Before you submit the job, you need to initialize your Grid proxy into the file indicated by the X509_USER_PROXY environment variable. The proxy initialization command varies from experiment to experiment. To submit the job script, you should therefore execute the following commands on pplxint12/13:

$ export X509_USER_PROXY=~/.gridProxy
$ voms-proxy-init -voms vo.southgrid.ac.uk (or lhcb-proxy-init, or atlas-proxy-init, or otherwise)
$ condor_submit my_grid_transfer_job.submit
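Putting these pieces together, a minimal my_grid_transfer_job.sh for a Grid transfer might look like the sketch below. The gfal-copy command and the storage URL are only illustrative placeholders; the actual transfer command and endpoints depend on your experiment's tools.

#!/bin/bash
# Tell the Grid tools where to find the proxy created on the interactive node.
export X509_USER_PROXY=${HOME}/.gridProxy
# Illustrative transfer command; replace with your experiment's own tool and real URLs.
gfal-copy srm://<grid-storage-endpoint>/<path>/bigfile /data/myexperiment/bigfile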
For very long jobs, you may need to refresh your Grid proxy periodically; the proxy normally lasts about 12 hours. To refresh it, run:

$ export X509_USER_PROXY=~/.gridProxy
$ voms-proxy-init -voms vo.southgrid.ac.uk (or lhcb-proxy-init, or atlas-proxy-init, or otherwise)
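You can check how long your current proxy remains valid with, for example:

$ voms-proxy-info -all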
Submitting GPU Jobs

The HTCondor batch farm contains a small number of worker nodes equipped with GPUs. Currently, only NVIDIA-based GPUs are available; these are best accessed using NVIDIA's CUDA toolkit. "Normal" jobs submitted to the batch queue will not land on these GPU worker nodes. To access them you must add configuration to your submit file. If you don't mind which GPU worker node your job runs on, you just need to add the following line to your submit file:

request_gpus = 1

If you need a specific GPU resource, for example a GPU with more memory, you can request it explicitly. To find out what GPU resources are available in the queue, use the following command:

$ condor_status -gpus -compact

If you want to target a GPU worker node with a specific resource, you could add something like this to your submit file (remember you also need the request_gpus line):

require_gpus = (Capability >= 9.0) && (GlobalMemoryMb >= 90000)

GPU Job Etiquette

The GPU resources are both scarce and expensive, so please use them with care. If your jobs can run on the cards with 40 GB of memory, please don't run them on the 96 GB cards just because your jobs might complete faster. Use the least resources that will get your job done, to keep resources available for those who might need more.
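Putting the GPU options together, a complete GPU submit file might look like the sketch below, assuming a hypothetical run_training.sh script as the executable; add a require_gpus line only if you genuinely need a specific card:

executable = run_training.sh
getenv = true
request_cpus = 1
request_memory = 8 GB
request_gpus = 1
output = output/results.output.$(ClusterId)
error = error/results.error.$(ClusterId)
log = log/results.log.$(ClusterId)
notification = never
queue 1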
Useful links
Writer: Michael Leech
Created on 09-10-2017 02:10, last updated on 20-02-2025 14:52
This item is part of the Physics IT knowledgebase.