Computing/LIP_Lisbon_Farm/5_SGE-deprecated/5.1_Cluster_Status

Available Resources

The NCG cluster is a national infrastructure with heterogeneous hardware, shared between a wide range of internal and external communities. The cluster provides several types of resources, organized into queues by HPC, HTC, GPU, InfiniBand, environment or group ownership.

In most cases users do not need to request a particular resource or queue; the system automatically chooses the best matching hardware. A user who needs special hardware or a special environment should request resources as shown below. In general the queues are not homogeneous in the resources they offer, and user authorization varies from queue to queue and sometimes from host to host within the same queue.

Users should not request a specific queue, because they may not be allowed to use it or the best combination of resources may be on a different queue. In some cases, mixing a request for a queue with a request for resources leads to a combination that cannot be satisfied, and the job stays on the waiting list forever, as seen in the example List host queues using mixed criteria below.

Queue Configuration

Queues represent an aggregation of hosts; each host contributes a certain number of running jobs, or slots. The queues available to LIP users are organized by group ownership, as listed in the table below; for that reason the resources in a queue may be heterogeneous.

Queue name   Owner     Usage
calolip      CALO      shared
complip      COMPASS   shared
cosmolip     Auger     private
csyslip      LIP       public
lipq         LIP       public
solip        LIP       public

Jobs submitted to the cluster are automatically allocated to the proper host queue; users do not need to request a specific queue. Some of the group queues are shared with other LIP groups, so users benefit from leaving the queue attribute blank: more nodes will be available to run their jobs.

Each node may run a maximum number of jobs, where normally one job corresponds to one slot. These slots are shared between one or more queues, so the sum of the TOTAL slots over all queues is in most cases greater than the actual maximum number of allowed jobs.

Let's say that a worker node can serve 2 slots and we configure two queues, A and B, each capable of using 2 slots on that node. Any combination of jobs can run on those queues as long as the sum of jobs on the node is never greater than 2:

A   B   Running
0   0   0
0   1   1
0   2   2
1   0   1
1   1   2
1   2   3         never happens
2   0   2
2   1   3         never happens
2   2   4         never happens
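The combinations in the table can be reproduced with a short shell sketch. It assumes the same illustrative numbers as the example (a node with 2 slots and two queues allowed 2 slots each); the names node_slots, queue_slots and slot_table are hypothetical and not part of Grid Engine.

```shell
#!/bin/sh
# Illustrative values taken from the example: one node, two queues (A and B).
node_slots=2     # slots the worker node can actually serve
queue_slots=2    # slots each queue is configured to use on that node

# Print every (A, B) combination and whether the scheduler could ever reach it.
slot_table() {
    a=0
    while [ "$a" -le "$queue_slots" ]; do
        b=0
        while [ "$b" -le "$queue_slots" ]; do
            total=$((a + b))
            if [ "$total" -le "$node_slots" ]; then
                echo "$a $b $total allowed"
            else
                echo "$a $b $total never-happens"
            fi
            b=$((b + 1))
        done
        a=$((a + 1))
    done
}

slot_table
```

Changing node_slots or queue_slots shows how the number of reachable combinations depends on the node limit rather than on the per-queue configuration.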

Each queue enforces a limit of 2GB of resident memory and 4GB of virtual memory per job, in order to protect other tasks running on the same node.

Complex/Resources Configuration

Complex attributes provide a way of defining cluster resources, which are requested through the option -l <resource> of the submission command qsub or of the qhost and qselect commands described below. Check the manual page, man complex, for more details.

The pertinent complex attributes, per host, defined for LIP users are the following:

Complex (resource)   Type      Values      Description
slots                integer   integer     number of jobs
mem_total            memory    integer     total RAM memory
virtual_total        memory    integer     total virtual memory
proc                 string    intel,amd   CPU family
gpu                  boolean   0,1         GPU present
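As a rough illustration of how these attributes map to the -l option, the sketch below builds the argument list and prints the resulting commands without running them. The helper resource_args and the script name job.sh are hypothetical; the resource names and values come from the table above, and the actual availability of a given combination should be checked on the cluster.

```shell
#!/bin/sh
# Turn name=value pairs into repeated "-l name=value" options.
# resource_args is a hypothetical helper, not a Grid Engine command.
resource_args() {
    args=""
    for pair in "$@"; do
        args="$args -l $pair"
    done
    echo "${args# }"    # drop the leading space
}

# Dry run: print the commands instead of executing them.
# job.sh stands for the user's own submission script (assumed name).
echo "qsub $(resource_args proc=intel gpu=1) job.sh"
echo "qhost $(resource_args proc=amd)"
```

Printing the command first is a cheap way to review a request before submitting it, since an unsatisfiable combination leaves the job waiting indefinitely.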

Parallel Environments

Parallel environments support the execution of distributed and shared memory applications. Examples are OpenMP on shared memory multiprocessor systems and the Message Passing Interface (MPI) on a distributed cluster. Both are available to LIP users, but the MPI environment only works within a single node. A parallel environment makes it possible to allocate more than one slot, and the corresponding memory, to a single job. The table below shows the parallel environments presently available to LIP users.

Parallel Environment Name   Multi Node   Max. Slots
mcore                       NO           Node dependent

Users who run parallel applications or have larger memory requirements may request more than one slot. Each slot guarantees 2GB of RAM and 4GB of virtual memory. For example, if a user requests 4 slots, the job will have a RAM limit of (2x4)GB=8GB and a virtual memory limit of (4x4)GB=16GB.
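The arithmetic can be sketched in shell as follows, assuming the per-slot limits of 2GB of resident memory and 4GB of virtual memory that the queues enforce. The function job_limits and the variable names are hypothetical.

```shell
#!/bin/sh
# Per-slot limits as stated for the queues (assumed constant across nodes).
ram_per_slot_gb=2
virt_per_slot_gb=4

# Print the memory limits a job gets when it requests the given number of slots.
job_limits() {
    slots=$1
    echo "slots=$slots ram=$((slots * ram_per_slot_gb))GB virtual=$((slots * virt_per_slot_gb))GB"
}

job_limits 1
job_limits 4    # prints: slots=4 ram=8GB virtual=16GB
```

A job that needs about 6GB of RAM would therefore have to request at least 3 slots, even if it runs a single thread.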

It is advisable to check that resources are available to serve the requested number of slots; use the qhost and qselect commands below to do so. We also recommend a conservative choice of slot values, in order to leave resources for other users.

Check https://wiki-lip.lip.pt/Computing/LIP_Lisboa_Farm/6.2_Job_Submissions_Advanced section to learn how to request a parallel environment.

Operating Systems

The LIP FARM supports the execution of multiple operating systems on the same hardware through container technology. We use an implementation called udocker, which makes it possible to virtualize complex environments in user space.

Presently the cluster supports the following operating systems, or distributions:

  • CentOS 6.9
  • CentOS 7.4
  • Ubuntu 16.04

and more could be added as needed.

For each type of operating system there is a dedicated login server; check the Login Servers section for more details. The login servers are configured like the worker nodes running the same operating system, to help users adapt their submission scripts; see the Job Submission Basic section for more information and examples.

A local command, qinfo, developed in-house, lists the login servers by operating system, as well as the available container names and their corresponding operating systems; see the section QINFO: print available operating systems below.

Query Commands

QSTAT: show the status of Grid Engine jobs and queues

List user jobs

Running the qstat command without options returns the user's own jobs, for example:

[diogo@fermi03 ~]$ qstat
job-ID  prior   name       user   state submit/start at     queue                    slots ja-task-ID
-----------------------------------------------------------------------------------------------------
5032642 1.10000 farm_conex diogo  r     11/13/2017 03:42:23 lipq@wn054.ncg.ingrid.pt 1
5032892 1.10000 farm_conex diogo  r     11/13/2017 04:17:35 lipq@wn073.ncg.ingrid.pt 1
5032893 1.10000 farm_conex diogo  Eqw   11/13/2017 04:17:30                          1
5032903 1.10000 farm_conex diogo  qw    11/13/2017 03:18:30                          1

In this example the first two jobs are running on nodes wn054 and wn073, and the last job is waiting for resources to become available. The state Eqw means that the job is still queued but has been put into an error state by a cluster error; users should contact the IT team in these cases.
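The state column can be read programmatically with a short shell sketch such as the one below. It assumes the default qstat layout shown above, with the state in the fifth column and job names without spaces; the function describe_jobs is hypothetical, and the sample input is copied from the listing.

```shell
#!/bin/sh
# Read qstat output on stdin and print a one-line explanation per job.
describe_jobs() {
    while read -r job_id prior name user state rest; do
        case "$job_id" in
            ''|*[!0-9]*) continue ;;    # skip header, separator and blank lines
        esac
        case "$state" in
            r)   meaning="running" ;;
            qw)  meaning="waiting for resources" ;;
            hqw) meaning="on hold" ;;
            Eqw) meaning="error state, contact the IT team" ;;
            *)   meaning="other state, see man qstat" ;;
        esac
        echo "$job_id $state: $meaning"
    done
}

# On the cluster the real output would be piped in: qstat | describe_jobs
describe_jobs <<'EOF'
job-ID  prior   name       user   state submit/start at     queue                    slots ja-task-ID
-----------------------------------------------------------------------------------------------------
5032642 1.10000 farm_conex diogo  r     11/13/2017 03:42:23 lipq@wn054.ncg.ingrid.pt 1
5032892 1.10000 farm_conex diogo  r     11/13/2017 04:17:35 lipq@wn073.ncg.ingrid.pt 1
5032893 1.10000 farm_conex diogo  Eqw   11/13/2017 04:17:30                          1
5032903 1.10000 farm_conex diogo  qw    11/13/2017 03:18:30                          1
EOF
```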

List other users' jobs

Use the option -u to query other users' jobs. The option accepts a user name as argument, or a wildcard to query the jobs of all users, for example:

[diogo@fermi03 ~]$ qstat -u fcruz
job-ID  prior   name       user   state submit/start at     queue                    slots ja-task-ID 
-----------------------------------------------------------------------------------------------------
3852655 0.40279 Hexadecame fcruz  r     11/13/2017 04:22:46 lipq@wn103.ncg.ingrid.pt 1        
3852656 0.00000 Hexadecame fcruz  hqw   10/03/2017 12:16:42

In this example the state hqw means that the job is on hold. This is an advanced way of running jobs; see the Job Submissions Advanced section.

Use the following command to list the jobs of all users:

[diogo@fermi03 ~]$ qstat -u '*'
job-ID  prior   name       user   state submit/start at     queue                                slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
5025385 0.18066 cream_8234 cmsplt013    r     11/12/2017 05:12:10 cmsgrid_mcore@wn022.ncg.ingrid 8        
5025386 0.18066 cream_1267 cmsplt013    r     11/12/2017 05:12:33 cmsgrid_mcore@wn054.ncg.ingrid 8        
5025851 0.33147 cream_3727 snop005      r     11/12/2017 05:50:14 gridq@wn017.ncg.ingrid.pt      1        
5025852 0.33147 cream_9907 snop005      r     11/12/2017 05:50:16 gridq@wn110.ncg.ingrid.pt      1
... endless list ....

Show job details

The option -j <job_id> of the qstat command provides all the details of a job; the output is long:

[mpinto@fermi01 ~]$ qstat
job-ID  prior   name       user   state submit/start at     queue                     slots ja-task-ID 
------------------------------------------------------------------------------------------------------
5048766 0.22507 lipfarm_DD mpinto r     11/15/2017 15:00:39 solip@wn202.ncg.ingrid.pt 1

[mpinto@fermi01 ~]$ qstat -j 5048766 | less
==============================================================
job_number:                 5048766
exec_file:                  job_scripts/5048766
submission_time:            Wed Nov 15 15:00:39 2017
owner:                      mpinto
uid:                        5040009
group:                      cosmo
gid:                        5040000
sge_o_home:                 /home/cosmo/mpinto
sge_o_log_name:             mpinto
...

Check man qstat for more details on command usage.

List all QUEUES

Use the command below to list all the queues available on the FARM. The queues of interest to LIP users are those whose names end in lip, plus the lipq queue, and gridc for LIP Coimbra users.

[jpina@fermi03 ~]$ qstat -g c
CLUSTER QUEUE                   CQLOAD   USED    RES  AVAIL  TOTAL aoACDS  cdsuE
--------------------------------------------------------------------------------
atlasgrid                         0.70     32      0    281    375      0     62
atlasgrid_mcore                   0.77    456      0    416   1016      0    144
calolip                           0.00      0      0     24     24      0      0
cmsgrid                           0.74      0      0    305    359      0     54
cmsgrid_mcore                     0.76    312      0    564   1032      0    172
cmst3grid                         -NA-      0      0      0      1      0      1
complip                           0.00      0      0     11     11      0      0
cosmolip                          0.09      0      0     77     77      0      0
csyslip                           0.00     20      0     28     48      0      0
dteamgrid                         0.69      0      0    314    373      0     59
dteamgrid_mcore                   0.73      0      0    704    832      0    128
fast_medusa                       0.00      0      0     44     44      0      0
gridc                             0.57      0      0    286    286      0      0
gridq                             0.73      1      0    310    362      0     51
gridq_mcore                       0.72      0      0    720    848      0    128
hpcgrid                           0.63    216      0     48    280      0     16
hpcib                             0.42     24      0     60    100      0     16
hpclong                           0.62     12      0    102    352      0    240
incd                              0.77      0      0    270    318      0     48
lipq                              0.66    176      0    102    338      0     60
medusa                            0.25     65      0    199    268      0      4
opsgrid                           0.70      0      0    325    387      0     62
opsgrid_mcore                     0.72      0      0    720    848      0    128
qao_2356                          0.74      0      0     16    192      0    176
qao_2356_ib                       0.39      0      0     16    112      0     96
qdav                              0.00      0      0      8      8      0      0
qix_e5472_nv                      0.00      0      0     24     32      0      8
qix_es2680                        0.72      0      0     48     48      0      0
solip                             0.00      0      0     66     74      0      8

The meaning of the table columns is the following:

  • CQLOAD: total queue present load;

  • USED: total queue present used slots;

  • RES: total queue present reserved slots, normally none;

  • AVAIL: total queue present available, or free, slots;

  • TOTAL: total queue present configured slots;

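  • aoACDS: slots in at least one of the queue states a (load alarm), o (orphaned), A (suspend alarm), C (suspended by calendar), D (disabled by calendar) or S (suspended by subordination);

  • cdsuE: slots in at least one of the queue states c (configuration ambiguous), d (disabled), s (suspended), u (unknown) or E (error), and therefore not usable.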

In the previous example the maximum number of slots on queue lipq is 338, the value of the TOTAL column, but we have to subtract the 60 slots in the cdsuE column, which are unusable due to some problem. So the number of usable slots is 338 - 60 = 278, that is, 176 used plus 102 available. Although the 102 slots are tagged as AVAIL they may not actually be free, because the nodes belonging to queue lipq also serve other queues, as with the 20 slots used on the csyslip queue.
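The same subtraction can be applied to every queue with a small shell sketch. It assumes the column order of the listing above (TOTAL in the sixth column, cdsuE in the eighth); the function usable_slots is hypothetical, and the sample rows are copied from the example.

```shell
#!/bin/sh
# Read "qstat -g c" output on stdin and print TOTAL minus cdsuE for each queue.
# Columns: 1 queue, 2 CQLOAD, 3 USED, 4 RES, 5 AVAIL, 6 TOTAL, 7 aoACDS, 8 cdsuE
usable_slots() {
    while read -r queue cqload used res avail total aoacds cdsue; do
        case "$total:$cdsue" in
            *[!0-9:]*|:*|*:) continue ;;   # skip header, separator and malformed lines
        esac
        echo "$queue usable=$((total - cdsue)) used=$used avail=$avail"
    done
}

# On the cluster the real output would be piped in: qstat -g c | usable_slots
usable_slots <<'EOF'
CLUSTER QUEUE                   CQLOAD   USED    RES  AVAIL  TOTAL aoACDS  cdsuE
--------------------------------------------------------------------------------
lipq                              0.66    176      0    102    338      0     60
solip                             0.00      0      0     66     74      0      8
EOF
```

The usable figure is only an upper bound, because slots on nodes shared with other queues may already be taken by jobs from those queues.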

List LIP QUEUES

Use the command below to list only the LIP queues on the FARM. The command also accepts a comma-separated list of queue names, e.g. qstat -g c -q lipq,gridc,cosmolip.

[jpina@fermi03 ~]$ qstat -g c -q '*lip*'
CLUSTER QUEUE                   CQLOAD   USED    RES  AVAIL  TOTAL aoACDS  cdsuE
--------------------------------------------------------------------------------
calolip                           0.00      0      0     24     24      0      0
complip                           0.00      0      0     11     11      0      0
cosmolip                          0.08      0      0     77     77      0      0
csyslip                           0.00     20      0     28     48      0      0
lipq                              0.68    176      0    102    338      0     60
solip                             0.00      0      0     66     74      0      8

QHOST: show the status of Grid Engine hosts, queues and jobs

List all cluster nodes

The qhost command queries the scheduler and provides useful information about the system. For example, running the command without arguments returns the list of all nodes in the cluster, including some details about the hardware: