Available Resources
The NCG cluster is a national infrastructure with heterogeneous hardware, shared among a wide range of internal and external communities. The cluster provides several types of resources organized by queue, HPC, HTC, GPU, InfiniBand, environment or group ownership.
In most cases users do not need to request a particular resource or queue; the system automatically chooses the best matching hardware. A user who needs special hardware or a special environment should request resources as shown below. In general, queues are not homogeneous in the resources they contain, and user authorization varies from queue to queue and sometimes from host to host within the same queue.
Users should not request a specific queue, because they may not be allowed to use it or the best combination of resources may be on a different queue. In some cases, mixing a request for a queue with a request for resources leads to a combination that cannot be satisfied, and the job stays on the waiting list forever, as seen in the example under List host queues using mixed criteria below.
Queue Configuration
Queues represent an aggregation of hosts; each host contributes a certain number of running jobs, or slots. The queues available to LIP users are organized by group ownership, as listed in the table below; for that reason the resources in a queue may be heterogeneous.
Queue name | Owner | Usage |
calolip | CALO | shared |
complip | COMPASS | shared |
cosmolip | Auger | private |
csyslip | LIP | public |
lipq | LIP | public |
solip | LIP | public |
Jobs submitted to the cluster are automatically allocated to the proper host queue; users do not need to request a specific queue. Some of the group queues are shared with other LIP groups, so users benefit from leaving the queue attribute blank: more nodes will be available to run their jobs.
Each node may run a maximum number of jobs (normally one job corresponds to one slot), and this maximum is shared between one or more queues. As a result, the sum of the TOTAL values across queues is in most cases greater than the actual maximum number of allowed jobs.
Suppose a worker node can serve 2 slots and we configure two queues, A and B, each able to use 2 slots on that node. Any combination of jobs can run on those queues as long as the sum of jobs on the node is never greater than 2:
A | B | Running | Note |
0 | 0 | 0 | |
0 | 1 | 1 | |
0 | 2 | 2 | |
1 | 0 | 1 | |
1 | 1 | 2 | |
1 | 2 | 3 | never happens |
2 | 0 | 2 | |
2 | 1 | 3 | never happens |
2 | 2 | 4 | never happens |
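The rule behind this table can be checked with a short script. This is only a sketch of the example above: the variable names NODE_SLOTS and QUEUE_SLOTS are illustrative and are not Grid Engine settings.

```shell
#!/bin/sh
# Illustrative values mirroring the two-queue example above.
NODE_SLOTS=2    # slots the worker node can serve
QUEUE_SLOTS=2   # slots each of queues A and B may use on that node

allowed=0
a=0
while [ "$a" -le "$QUEUE_SLOTS" ]; do
  b=0
  while [ "$b" -le "$QUEUE_SLOTS" ]; do
    running=$((a + b))
    # A combination is possible only if it fits in the node's slots.
    if [ "$running" -le "$NODE_SLOTS" ]; then
      echo "A=$a B=$b running=$running"
      allowed=$((allowed + 1))
    fi
    b=$((b + 1))
  done
  a=$((a + 1))
done
echo "allowed combinations: $allowed"
```

The script prints only the combinations that fit on the node, which are the table rows not marked as never happening.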
Each queue enforces a limit of 2 GB of resident memory and 4 GB of virtual memory per job, in order to protect other tasks running on the same node.
Complex/Resources Configuration
Complex attributes provide a way of defining cluster resources, which are requested through the option -l <resource> of the submission command qsub, or of the qhost and qselect commands described below. Check the manual page, man complex, for more details.
The pertinent complex attributes, per host, defined for LIP users are the following:
Complex (resource) | Type | Values | Description |
slots | integer | integer | number of jobs |
mem_total | memory | integer | total RAM memory |
virtual_total | memory | integer | total virtual memory |
proc | string | intel,amd | CPU family |
gpu | boolean | 0,1 | GPU present |
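As a minimal sketch of how these attributes combine in a request, the script below joins attribute=value pairs into the comma-separated form taken by the -l option and prints the resulting commands as a dry run. The helper build_request and the script name my_job.sh are hypothetical, and the attribute values are chosen only for illustration.

```shell
#!/bin/sh
# Join attribute=value pairs with commas, the form taken by the -l option.
# The function body runs in a subshell so the IFS change does not leak out.
build_request() (
  IFS=,
  echo "$*"
)

# Hypothetical request: a host with an Intel CPU and a GPU.
request=$(build_request proc=intel gpu=1)

# Dry run: print the commands instead of executing them.
echo "qhost -l $request"
echo "qsub -l $request my_job.sh"
```

Running qhost with the same request first is a cheap way to see whether any host matches before a job is submitted.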
Parallel Environments
Parallel environments support the execution of shared-memory and distributed-memory applications. Examples are OpenMP on shared-memory multiprocessor systems and the Message Passing Interface (MPI) on a distributed cluster. Both are available to LIP users, but the MPI environment only works within a single node. Parallel environments make it possible to allocate more than one slot, and the corresponding memory, to a single job. The table below shows the parallel environments presently available to LIP users.
Parallel Environment Name | Multi Node | Max. Slots |
mcore | NO | Node dependent |
Users who run parallel applications or have larger memory requirements may request more than one slot. Each slot guarantees 2 GB of RAM and 4 GB of virtual memory. For example, if a user requests 4 slots, the job will have a RAM limit of (2x4) GB = 8 GB and a virtual memory limit of (4x4) GB = 16 GB.
It is advisable to check that resources are available to serve the requested number of slots; use the qhost and qselect commands below to do so. We also recommend a conservative choice of slot values, in order to leave resources for other users.
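The arithmetic can be sketched as follows, assuming the per-job limits quoted in the Queue Configuration section (2 GB of resident memory and 4 GB of virtual memory) apply to each slot. The variable names are illustrative only.

```shell
#!/bin/sh
# Assumed per-slot limits, taken from the per-job limits quoted earlier.
RAM_PER_SLOT_GB=2
VIRT_PER_SLOT_GB=4

slots=4   # number of slots requested by the job

ram_gb=$((slots * RAM_PER_SLOT_GB))
virt_gb=$((slots * VIRT_PER_SLOT_GB))

echo "slots=$slots RAM=${ram_gb}GB virtual=${virt_gb}GB"
```

Changing slots shows how the limits scale before a request is made.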
Check https://wiki-lip.lip.pt/Computing/LIP_Lisboa_Farm/6.2_Job_Submissions_Advanced section to learn how to request a parallel environment.
Operating Systems
The LIP FARM supports the execution of multiple operating systems on the same hardware through container technology. We use an implementation called udocker, which makes it possible to virtualize complex environments in user space.
The cluster presently supports the following operating systems, or distributions:
- CentOS 6.9
- CentOS 7.4
- Ubuntu 16.04
and more could be added as needed.
For each type of operating system there is a dedicated login server; check the Login Servers section for more details. The login servers are configured like the worker nodes running the target operating system, which makes it easier for users to adapt their submission scripts. See the Job Submission Basic section for more information and examples.
A local command, qinfo, developed in-house, lists the login servers by operating system as well as the available container names and their corresponding operating systems; see the QINFO: print available operating systems section below.
Query Commands
QSTAT: show the status of Grid Engine jobs and queues
List user jobs
Running the command qstat without options returns the user's jobs, for example:
[diogo@fermi03 ~]$ qstat
job-ID  prior   name       user   state submit/start at     queue                     slots ja-task-ID
-----------------------------------------------------------------------------------------------------
5032642 1.10000 farm_conex diogo  r     11/13/2017 03:42:23 lipq@wn054.ncg.ingrid.pt      1
5032892 1.10000 farm_conex diogo  r     11/13/2017 04:17:35 lipq@wn073.ncg.ingrid.pt      1
5032893 1.10000 farm_conex diogo  Eqw   11/13/2017 04:17:30                               1
5032903 1.10000 farm_conex diogo  qw    11/13/2017 03:18:30                               1
In this example the first two jobs are running on nodes wn054 and wn073, and the last job is waiting for resources to become available. The state Eqw means that the job is in an error state; users should contact the IT team in these cases.
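When many jobs are listed, counting them by state gives a quicker overview. The sketch below works on the job lines of the example above, stored in a variable so that it runs without a cluster; in real use the output of qstat would be piped into awk instead. The helper count_state is hypothetical, and it assumes job names contain no spaces, so that the state is always the fifth field.

```shell
#!/bin/sh
# Job lines copied from the qstat example above (header lines omitted).
sample='5032642 1.10000 farm_conex diogo r 11/13/2017 03:42:23 lipq@wn054.ncg.ingrid.pt 1
5032892 1.10000 farm_conex diogo r 11/13/2017 04:17:35 lipq@wn073.ncg.ingrid.pt 1
5032893 1.10000 farm_conex diogo Eqw 11/13/2017 04:17:30 1
5032903 1.10000 farm_conex diogo qw 11/13/2017 03:18:30 1'

# Count the job lines whose state (field 5) equals the first argument.
count_state() {
  printf '%s\n' "$sample" |
    awk -v s="$1" '$1 ~ /^[0-9]+$/ && $5 == s { n++ } END { print n + 0 }'
}

running=$(count_state r)
waiting=$(count_state qw)
errored=$(count_state Eqw)

echo "running=$running waiting=$waiting error=$errored"
```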
List other users jobs
Use the option -u to query other users' jobs. The option accepts a user name as argument, or a wildcard to query the jobs of all users, for example:
[diogo@fermi03 ~]$ qstat -u fcruz
job-ID  prior   name       user   state submit/start at     queue                     slots ja-task-ID
-----------------------------------------------------------------------------------------------------
3852655 0.40279 Hexadecame fcruz  r     11/13/2017 04:22:46 lipq@wn103.ncg.ingrid.pt      1
3852656 0.00000 Hexadecame fcruz  hqw   10/03/2017 12:16:42
In this example the state hqw means that the job is on hold; this is an advanced way of running jobs, see the Job Submissions Advanced section.
Use the following command to list the jobs of all users:
[diogo@fermi03 ~]$ qstat -u '*'
job-ID  prior   name       user       state submit/start at     queue                           slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
5025385 0.18066 cream_8234 cmsplt013  r     11/12/2017 05:12:10 cmsgrid_mcore@wn022.ncg.ingrid      8
5025386 0.18066 cream_1267 cmsplt013  r     11/12/2017 05:12:33 cmsgrid_mcore@wn054.ncg.ingrid      8
5025851 0.33147 cream_3727 snop005    r     11/12/2017 05:50:14 gridq@wn017.ncg.ingrid.pt           1
5025852 0.33147 cream_9907 snop005    r     11/12/2017 05:50:16 gridq@wn110.ncg.ingrid.pt           1
... endless list ....
Show job details
The option -j <job_id> of the qstat command prints all the details of a job; the list is long:
[mpinto@fermi01 ~]$ qstat
job-ID  prior   name       user    state submit/start at     queue                      slots ja-task-ID
------------------------------------------------------------------------------------------------------
5048766 0.22507 lipfarm_DD mpinto  r     11/15/2017 15:00:39 solip@wn202.ncg.ingrid.pt      1
[mpinto@fermi01 ~]$ qstat -j 5048766 | less
==============================================================
job_number:       5048766
exec_file:        job_scripts/5048766
submission_time:  Wed Nov 15 15:00:39 2017
owner:            mpinto
uid:              5040009
group:            cosmo
gid:              5040000
sge_o_home:       /home/cosmo/mpinto
sge_o_log_name:   mpinto
...
Check man qstat for more details on command usage.
List all QUEUES
Use the command below to list all available queues on the FARM. The queues of interest to LIP users are those whose names end in lip, plus the lipq queue, and gridc for LIP Coimbra users.
[jpina@fermi03 ~]$ qstat -g c
CLUSTER QUEUE     CQLOAD   USED    RES  AVAIL  TOTAL aoACDS  cdsuE
--------------------------------------------------------------------------------
atlasgrid           0.70     32      0    281    375      0     62
atlasgrid_mcore     0.77    456      0    416   1016      0    144
calolip             0.00      0      0     24     24      0      0
cmsgrid             0.74      0      0    305    359      0     54
cmsgrid_mcore       0.76    312      0    564   1032      0    172
cmst3grid           -NA-      0      0      0      1      0      1
complip             0.00      0      0     11     11      0      0
cosmolip            0.09      0      0     77     77      0      0
csyslip             0.00     20      0     28     48      0      0
dteamgrid           0.69      0      0    314    373      0     59
dteamgrid_mcore     0.73      0      0    704    832      0    128
fast_medusa         0.00      0      0     44     44      0      0
gridc               0.57      0      0    286    286      0      0
gridq               0.73      1      0    310    362      0     51
gridq_mcore         0.72      0      0    720    848      0    128
hpcgrid             0.63    216      0     48    280      0     16
hpcib               0.42     24      0     60    100      0     16
hpclong             0.62     12      0    102    352      0    240
incd                0.77      0      0    270    318      0     48
lipq                0.66    176      0    102    338      0     60
medusa              0.25     65      0    199    268      0      4
opsgrid             0.70      0      0    325    387      0     62
opsgrid_mcore       0.72      0      0    720    848      0    128
qao_2356            0.74      0      0     16    192      0    176
qao_2356_ib         0.39      0      0     16    112      0     96
qdav                0.00      0      0      8      8      0      0
qix_e5472_nv        0.00      0      0     24     32      0      8
qix_es2680          0.72      0      0     48     48      0      0
solip               0.00      0      0     66     74      0      8
The table columns have the following meaning:
- CQLOAD: present load of the queue;
- USED: slots presently in use on the queue;
- RES: slots presently reserved on the queue, normally none;
- AVAIL: slots presently available, or free, on the queue;
- TOTAL: slots presently configured on the queue;
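- aoACDS: slots in at least one of the queue states a, o, A, C, D or S, and in none of the states c, d, s, u or E (see man qstat);
- cdsuE: slots in one or more of the queue states c, d, s, u or E, that is, slots that cannot be used at present (see man qstat).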
In the previous example, the maximum number of slots configured on queue lipq is 338, the value in the TOTAL column, but 60 of those slots must be subtracted because they are unavailable due to some problem (the cdsuE column). The number of slots actually usable is therefore 338 - 60 = 278, that is, 176 used plus 102 available. Although those 102 slots are tagged as AVAIL, they may not be free, because the nodes belonging to queue lipq also serve other queues, such as csyslip with its 20 used slots.
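The same subtraction can be done for any queue with awk. The sketch below uses two lines copied from the example output so that it runs without a cluster; in real use the output of qstat -g c would be piped into awk instead. The helper usable_slots is hypothetical.

```shell
#!/bin/sh
# Two lines copied from the `qstat -g c` example above.
# Fields: queue CQLOAD USED RES AVAIL TOTAL aoACDS cdsuE
sample='csyslip 0.00 20 0 28 48 0 0
lipq 0.66 176 0 102 338 0 60'

# Usable slots of a queue: TOTAL (field 6) minus cdsuE (field 8).
usable_slots() {
  printf '%s\n' "$sample" | awk -v q="$1" '$1 == q { print $6 - $8 }'
}

lipq_usable=$(usable_slots lipq)
echo "lipq usable slots: $lipq_usable"
```

The result is only an upper bound, since AVAIL slots may already be taken by other queues served by the same nodes.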
List LIP QUEUES
Use the command below to list only the LIP queues on the FARM. The command also accepts a comma-separated list of queue names, e.g. qstat -g c -q lipq,gridc,cosmolip.
[jpina@fermi03 ~]$ qstat -g c -q '*lip*'
CLUSTER QUEUE     CQLOAD   USED    RES  AVAIL  TOTAL aoACDS  cdsuE
--------------------------------------------------------------------------------
calolip             0.00      0      0     24     24      0      0
complip             0.00      0      0     11     11      0      0
cosmolip            0.08      0      0     77     77      0      0
csyslip             0.00     20      0     28     48      0      0
lipq                0.68    176      0    102    338      0     60
solip               0.00      0      0     66     74      0      8
QHOST: show the status of Grid Engine hosts, queues and jobs
List all cluster nodes
The qhost command queries the scheduler and provides useful information about the system. For example, running the command without arguments returns the list of all nodes in the cluster, including some details about the hardware: