<<TableOfContents()>>

= Available Resources =
The NCG cluster is a national infrastructure with heterogeneous hardware, shared among a wide range of internal and external communities. The cluster provides several types of resources organized into '''queues''' by purpose: HPC, HTC, GPU, InfiniBand, environment or group ownership.

In most cases users do not need to request a particular '''resource''' or '''queue'''; the system automatically chooses the best matching hardware. A user who needs special hardware or a special environment should request '''resources''' as shown below. In general, the queues are not homogeneous in the resources they contain, and user authorization varies from queue to queue and sometimes from host to host within the same queue.
||<tablewidth="80%" #cccccc style="text-align:center">'''Users should not request a specific ''queue'', because they may not be allowed to use it, or the best combination of resources may be on a different ''queue''. In some cases, mixing a request for a ''queue'' with a request for ''resources'' leads to a combination that cannot be satisfied, and the job stays on the waiting list forever, as seen in the example [[https://wiki-lip.lip.pt/Computing/LIP_Coimbra_Farm2/5_Cluster_Status#List_host_queues_using_mixed_criteria|List host queues using mixed criteria]] below.''' ||

== Queue Configuration ==

A '''queue''' is an aggregation of hosts, each of which contributes a certain number of running jobs, or ''slots''. The '''queues''' available to LIP users are organized by group ownership, as listed in the table below; for that reason the resources in a '''queue''' may be heterogeneous.

||<tablewidth="200px" #66ffff>'''Queue name''' ||<#66ffff>'''Owner''' ||<#66ffff>'''Usage''' ||
||'''calolip''' ||CALO ||shared||
||'''complip''' ||COMPASS || shared||
||'''cosmolip''' ||Auger || private||
||'''csyslip''' ||LIP || public||
||'''lipq''' || LIP || public||
||'''solip''' || LIP || public||

Jobs submitted to the cluster are allocated automatically to the proper host '''queue'''; users do not need to request a specific '''queue'''. Some of the group '''queues''' are shared with other LIP groups, so users benefit from leaving the '''queue''' attribute blank: more nodes will be available to run their jobs.

Each node can run a maximum number of jobs (normally one job corresponds to one '''slot'''), and its slots are shared between one or more '''queues'''. As a result, the sum of the TOTAL slots over all queues is in most cases greater than the actual maximum number of allowed jobs.

Suppose a worker node can serve 2 slots and we configure two queues, '''A''' and '''B''', each capable of using 2 slots on that node. Any combination of jobs can run on those queues as long as the sum of jobs on the node is never greater than 2:

||<tablewidth="200px" #999999 style="text-align:center">'''A''' ||<#999999 style="text-align:center">'''B''' ||<#999999 style="text-align:center">'''Running''' ||<#999999 style="text-align:center"> ||
||<#00ff33 style="text-align:center">0 ||<#00ff33 style="text-align:center">0 ||<#00ff33 style="text-align:center">0 ||<#00ff33 style="text-align:center"> ||
||<#00ff33 style="text-align:center">0 ||<#00ff33 style="text-align:center">1 ||<#00ff33 style="text-align:center">1 ||<#00ff33 style="text-align:center"> ||
||<#00ff33 style="text-align:center">0 ||<#00ff33 style="text-align:center">2 ||<#00ff33 style="text-align:center">2 ||<#00ff33 style="text-align:center"> ||
||<#00ff33 style="text-align:center">1 ||<#00ff33 style="text-align:center">0 ||<#00ff33 style="text-align:center">1 ||<#00ff33 style="text-align:center"> ||
||<#00ff33 style="text-align:center">1 ||<#00ff33 style="text-align:center">1 ||<#00ff33 style="text-align:center">2 ||<#00ff33 style="text-align:center"> ||
||<#ff3333 style="text-align:center">1 ||<#ff3333 style="text-align:center">2 ||<#ff3333 style="text-align:center">3 ||<#ff3333 style="text-align:center">never happens ||
||<#00ff33 style="text-align:center">2 ||<#00ff33 style="text-align:center">0 ||<#00ff33 style="text-align:center">2 ||<#00ff33 style="text-align:center"> ||
||<#ff3333 style="text-align:center">2 ||<#ff3333 style="text-align:center">1 ||<#ff3333 style="text-align:center">3 ||<#ff3333 style="text-align:center">never happens ||
||<#ff3333 style="text-align:center">2 ||<#ff3333 style="text-align:center">2 ||<#ff3333 style="text-align:center">4 ||<#ff3333 style="text-align:center">never happens ||
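
The rule behind the table can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not part of the scheduler; the names '''NODE_SLOTS''', '''QUEUE_SLOTS''' and '''allowed_combinations''' are hypothetical and the values are the ones from the example above.

{{{#!highlight python
from itertools import product

NODE_SLOTS = 2    # slots the worker node can serve (value from the example)
QUEUE_SLOTS = 2   # slots each queue, A and B, may use on that node


def allowed_combinations(node_slots=NODE_SLOTS, queue_slots=QUEUE_SLOTS):
    """Return the (A, B) job counts that can run on the node at the same time."""
    counts = range(queue_slots + 1)
    # A combination is allowed only if the node is not oversubscribed.
    return [(a, b) for a, b in product(counts, counts) if a + b <= node_slots]


for a, b in allowed_combinations():
    print(f"A={a} B={b} running={a + b}")
}}}

The combinations printed are the green rows of the table; the red rows are the ones filtered out by the '''a + b <= node_slots''' condition.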

Each '''queue''' enforces a limit of '''2GB''' of resident memory and '''4GB''' of virtual memory per job in order to protect other tasks running on the same node.

== Complex/Resources Configuration ==
Complex attributes provide a way of defining cluster resources, which are requested through the option '''-l <resource>''' of the submission command '''qsub''', or of the '''qhost''' and '''qselect''' commands described below. Check the manual page, '''man complex''', for more details.

The pertinent complex attributes, per host, defined for LIP users are the following:
||<tablewidth="200px" #66ffff>'''Complex (resource)''' ||<#66ffff>'''Type''' ||<#66ffff>'''Values''' ||<#66ffff>'''Description''' ||
||'''slots''' ||integer ||integer||number of jobs ||
||'''mem_total''' ||memory ||integer||total RAM memory ||
||'''virtual_total''' ||memory ||integer||total virtual memory ||
||'''proc''' ||string ||intel,amd ||CPU family ||
||'''gpu''' ||boolean ||0,1 ||GPU present ||
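
As an illustration of how these attributes end up on a '''qsub''' command line, the sketch below builds a '''-l''' request from name/value pairs. It assumes the comma-separated '''-l name=value,...''' form of Grid Engine; the helper '''build_qsub_command''' and the script name '''job.sh''' are hypothetical, and the sketch does not check whether the user is authorized to use the requested resources.

{{{#!highlight python
import shlex


def build_qsub_command(script, **resources):
    """Build a qsub argument list requesting the given complex attributes."""
    command = ["qsub"]
    if resources:
        # Resources are passed as one comma-separated list: -l name=value,...
        request = ",".join(f"{name}={value}" for name, value in resources.items())
        command += ["-l", request]
    command.append(script)
    return command


# Hypothetical request: a node with a GPU and an Intel CPU.
print(shlex.join(build_qsub_command("job.sh", gpu=1, proc="intel")))
}}}

The resulting list can be passed to '''subprocess.run''' on a login server; check '''man qsub''' for the exact syntax accepted by the installed version.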




== Parallel Environments ==

Parallel environments support the execution of distributed and shared-memory applications. Examples of parallel environments are OpenMP on shared-memory multiprocessor systems, or Message Passing Interface (MPI) on a distributed cluster. These are available to LIP users, but the MPI environment only works within a single node. This makes it possible to allocate more than one slot, and the corresponding memory, to a single job. The table below shows the parallel environments presently available to LIP users.

||<tablewidth="200px" #66ffff>'''Parallel Environment Name''' ||<#66ffff>'''Multi Node''' ||<#66ffff>'''Max. Slots''' ||
||'''mcore''' ||NO || Node dependent||

Users who run parallel applications or have larger memory requirements may request more than one '''slot'''. Each '''slot''' guarantees '''2GB''' of RAM and '''4GB''' of virtual memory. For example, if a user requests 4 '''slots''', the job will have a RAM limit of '''(2x4)GB=8GB''' and a virtual memory limit of '''(4x4)GB=16GB'''.
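
The arithmetic above can be written as a small helper. This is a sketch that assumes 2GB of RAM and 4GB of virtual memory per slot, the values that reproduce the 8GB and 16GB of the 4-slot example; the names '''RAM_PER_SLOT_GB''', '''VIRTUAL_PER_SLOT_GB''' and '''memory_limits''' are hypothetical.

{{{#!highlight python
RAM_PER_SLOT_GB = 2       # resident memory limit per slot (assumed)
VIRTUAL_PER_SLOT_GB = 4   # virtual memory limit per slot (assumed)


def memory_limits(slots):
    """Return (RAM, virtual memory) limits in GB for a job using `slots` slots."""
    if slots < 1:
        raise ValueError("a job needs at least one slot")
    return slots * RAM_PER_SLOT_GB, slots * VIRTUAL_PER_SLOT_GB


ram_gb, virtual_gb = memory_limits(4)
print(f"4 slots: {ram_gb}GB RAM, {virtual_gb}GB virtual memory")
}}}

Working backwards from the memory a job needs gives the smallest number of slots worth requesting, which helps keep the request conservative.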

It is advisable to check that there are resources able to serve the requested number of '''slots'''; use the '''qhost''' and '''qselect''' commands below to do so. We also recommend a conservative choice of '''slots''' values, in order to leave resources for other users.

Check the [[https://wiki-lip.lip.pt/Computing/LIP_Lisboa_Farm/6.2_Job_Submissions_Advanced|Job Submissions Advanced]] section to learn how to request a parallel environment.

== Operating Systems ==

The LIP FARM supports the execution of multiple operating systems on the same hardware through container technology. We use an implementation called '''udocker''', which makes it possible to run complex environments in user space.

Presently the cluster supports the following operating systems (distributions):
 * CentOS 6.9
 * CentOS 7.4
 * Ubuntu 16.04
and more could be added as needed.

For each type of operating system there is a dedicated login server; check the [[https://wiki-lip.lip.pt/Computing/LIP_Lisboa_Farm/2_How_To_Access#Login_Servers|Login Servers]] section for more details.
Each login server is configured like the worker nodes running the same operating system, which makes it easier for users to adapt their submission scripts; see the [[https://wiki-lip.lip.pt/Computing/LIP_Lisboa_Farm/6.1_Job_Submission_basic|Job Submission Basic]] section for more information and examples.

A local command, '''qinfo''', developed in-house, lists the login servers by operating system, together with the available container names and their corresponding operating systems; see the [[https://wiki-lip.lip.pt/Computing/LIP_Lisboa_Farm/5_Cluster_Status#QINFO:_print_available_operating_systems|QINFO: print available operating systems]] section below.

= Query Commands =
== QSTAT: show the status of Grid Engine jobs and queues ==
=== List user jobs ===
Running the '''qstat''' command without options lists the current user's jobs, for example:

{{{
[diogo@fermi03 ~]$ qstat
job-ID  prior   name       user   state submit/start at     queue                    slots ja-task-ID
-----------------------------------------------------------------------------------------------------
5032642 1.10000 farm_conex diogo  r     11/13/2017 03:42:23 lipq@wn054.ncg.ingrid.pt 1
5032892 1.10000 farm_conex diogo  r     11/13/2017 04:17:35 lipq@wn073.ncg.ingrid.pt 1
5032893 1.10000 farm_conex diogo  Eqw   11/13/2017 04:17:30                          1
5032903 1.10000 farm_conex diogo  qw    11/13/2017 03:18:30                          1
}}}
In this example the first two jobs are running on nodes ''wn054'' and ''wn073'', and the last job is waiting for resources to become available. A job in state '''Eqw''' is waiting in the queue in an error state: the scheduler tried to run it but a cluster error occurred. '''Users should contact the IT team in these cases'''.
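
When many jobs are listed, it can help to tally them by state. The sketch below does this for the output shown above; it assumes the default '''qstat''' column layout, with the state in the fifth column and job names without spaces. The names '''SAMPLE''' and '''count_states''' are hypothetical.

{{{#!highlight python
from collections import Counter

# Output copied from the qstat example above.
SAMPLE = """\
job-ID  prior   name       user   state submit/start at     queue                    slots ja-task-ID
-----------------------------------------------------------------------------------------------------
5032642 1.10000 farm_conex diogo  r     11/13/2017 03:42:23 lipq@wn054.ncg.ingrid.pt 1
5032892 1.10000 farm_conex diogo  r     11/13/2017 04:17:35 lipq@wn073.ncg.ingrid.pt 1
5032893 1.10000 farm_conex diogo  Eqw   11/13/2017 04:17:30                          1
5032903 1.10000 farm_conex diogo  qw    11/13/2017 03:18:30                          1
"""


def count_states(qstat_output):
    """Count jobs per state in plain qstat output."""
    counts = Counter()
    for line in qstat_output.splitlines():
        fields = line.split()
        # Job lines start with a numeric job ID; header and ruler lines do not.
        if len(fields) >= 5 and fields[0].isdigit():
            counts[fields[4]] += 1
    return counts


for state, number in count_states(SAMPLE).items():
    print(f"{state}: {number}")
}}}

A non-zero count for '''Eqw''' is the signal to contact the IT team.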

=== List other users jobs ===

Use the option '''-u''' to query other users' jobs. The option accepts a user name as argument, or a wildcard to query the usage of all users, for example:

{{{
[diogo@fermi03 ~]$ qstat -u fcruz
job-ID  prior   name       user   state submit/start at     queue                    slots ja-task-ID 
-----------------------------------------------------------------------------------------------------
3852655 0.40279 Hexadecame fcruz  r     11/13/2017 04:22:46 lipq@wn103.ncg.ingrid.pt 1        
3852656 0.00000 Hexadecame fcruz  hqw   10/03/2017 12:16:42
}}}
In this example the state '''hqw''' means that the job is on hold. This is an advanced way of running jobs; see the [[https://wiki-lip.lip.pt/Computing/LIP_Coimbra_Farm/6.2_Job_Submissions_Advanced|Job Submissions Advanced]] section.

Use the following command to list all users jobs:

{{{
[diogo@fermi03 ~]$ qstat -u '*'
job-ID  prior   name       user   state submit/start at     queue                                slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
5025385 0.18066 cream_8234 cmsplt013    r     11/12/2017 05:12:10 cmsgrid_mcore@wn022.ncg.ingrid 8        
5025386 0.18066 cream_1267 cmsplt013    r     11/12/2017 05:12:33 cmsgrid_mcore@wn054.ncg.ingrid 8        
5025851 0.33147 cream_3727 snop005      r     11/12/2017 05:50:14 gridq@wn017.ncg.ingrid.pt      1        
5025852 0.33147 cream_9907 snop005      r     11/12/2017 05:50:16 gridq@wn110.ncg.ingrid.pt      1
... endless list ....
}}}

=== Show job details ===

The option '''-j <job_id>''' of the '''qstat''' command shows all the details of a job; the output is long:
{{{
[mpinto@fermi01 ~]$ qstat
job-ID  prior   name       user   state submit/start at     queue                     slots ja-task-ID 
------------------------------------------------------------------------------------------------------
5048766 0.22507 lipfarm_DD mpinto r     11/15/2017 15:00:39 solip@wn202.ncg.ingrid.pt 1

[mpinto@fermi01 ~]$ qstat -j 5048766 | less
==============================================================
job_number:                 5048766
exec_file:                  job_scripts/5048766
submission_time:            Wed Nov 15 15:00:39 2017
owner:                      mpinto
uid:                        5040009
group:                      cosmo
gid:                        5040000
sge_o_home:                 /home/cosmo/mpinto
sge_o_log_name:             mpinto
...
}}}
Check '''man qstat''' for more details on command usage.

=== List all QUEUES ===
Use the command below to list all ''queues'' available on the FARM. The ''queues'' of interest to LIP users are those whose names end with '''lip''', plus the '''lipq''' ''queue'' and, for LIP Coimbra users, '''gridc'''.

{{{
[jpina@fermi03 ~]$ qstat -g c
CLUSTER QUEUE                   CQLOAD   USED    RES  AVAIL  TOTAL aoACDS  cdsuE
--------------------------------------------------------------------------------
atlasgrid                         0.70     32      0    281    375      0     62
atlasgrid_mcore                   0.77    456      0    416   1016      0    144
calolip                           0.00      0      0     24     24      0      0
cmsgrid                           0.74      0      0    305    359      0     54
cmsgrid_mcore                     0.76    312      0    564   1032      0    172
cmst3grid                         -NA-      0      0      0      1      0      1
complip                           0.00      0      0     11     11      0      0
cosmolip                          0.09      0      0     77     77      0      0
csyslip                           0.00     20      0     28     48      0      0
dteamgrid                         0.69      0      0    314    373      0     59
dteamgrid_mcore                   0.73      0      0    704    832      0    128
fast_medusa                       0.00      0      0     44     44      0      0
gridc                             0.57      0      0    286    286      0      0
gridq                             0.73      1      0    310    362      0     51
gridq_mcore                       0.72      0      0    720    848      0    128
hpcgrid                           0.63    216      0     48    280      0     16
hpcib                             0.42     24      0     60    100      0     16
hpclong                           0.62     12      0    102    352      0    240
incd                              0.77      0      0    270    318      0     48
lipq                              0.66    176      0    102    338      0     60
medusa                            0.25     65      0    199    268      0      4
opsgrid                           0.70      0      0    325    387      0     62
opsgrid_mcore                     0.72      0      0    720    848      0    128
qao_2356                          0.74      0      0     16    192      0    176
qao_2356_ib                       0.39      0      0     16    112      0     96
qdav                              0.00      0      0      8      8      0      0
qix_e5472_nv                      0.00      0      0     24     32      0      8
qix_es2680                        0.72      0      0     48     48      0      0
solip                             0.00      0      0     66     74      0      8
}}}




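The table columns have the following meaning:

 * CQLOAD: present load of the '''queue''';
 * USED:   slots of the '''queue''' presently in use;
 * RES:    slots of the '''queue''' presently reserved, normally none;
 * AVAIL:  slots of the '''queue''' presently available, or free;
 * TOTAL:  slots configured for the '''queue''';
 * aoACDS: slots in queue instances that are in at least one of the states a, o, A, C, D or S (load alarm, orphaned, suspend alarm, suspended by calendar, disabled by calendar, suspended by subordination);
 * cdsuE:  slots in queue instances that are in at least one of the states c, d, s, u or E (configuration ambiguous, disabled, suspended, unknown, error).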

Looking at the previous example, the maximum number of slots on '''queue''' '''lipq''' is '''338''' (the TOTAL column), but '''60''' slots (the cdsuE column) must be subtracted because they are unavailable due to some problem. The number of slots that can actually be used is therefore '''278''' = '''176 + 102'''. Although '''102''' slots are tagged as '''AVAIL''', they may not all be free, because the nodes belonging to '''queue''' '''lipq''' also serve other '''queues'''; for example, '''20''' slots are in use on the '''csyslip''' '''queue'''.
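
The same calculation can be done on a line of the '''qstat -g c''' output. This is a minimal sketch assuming the column order shown above; the names '''LIPQ_LINE''' and '''usable_slots''' are hypothetical. The identity TOTAL - cdsuE = USED + AVAIL holds for the '''lipq''' row but need not hold for every queue in the listing.

{{{#!highlight python
# The lipq line copied from the "qstat -g c" example above.
LIPQ_LINE = "lipq                              0.66    176      0    102    338      0     60"


def usable_slots(summary_line):
    """Extract the slot counters from one 'qstat -g c' line."""
    fields = summary_line.split()
    name = fields[0]
    # fields[1] is CQLOAD, which may be '-NA-', so it is not converted.
    used, res, avail, total, aoacds, cdsue = (int(value) for value in fields[2:8])
    return {
        "queue": name,
        "used": used,
        "avail": avail,
        "usable": total - cdsue,  # configured slots minus the unavailable ones
    }


info = usable_slots(LIPQ_LINE)
print(f"{info['queue']}: {info['usable']} usable slots "
      f"({info['used']} used + {info['avail']} available)")
}}}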

=== List LIP QUEUES ===
Use the command below to list only the LIP ''queues'' on the FARM. The command also accepts a comma-separated list of ''queue'' names, e.g. '''qstat -g c -q lipq,gridc,cosmolip'''.

{{{
[jpina@fermi03 ~]$ qstat -g c -q '*lip*'
CLUSTER QUEUE                   CQLOAD   USED    RES  AVAIL  TOTAL aoACDS  cdsuE
--------------------------------------------------------------------------------
calolip                           0.00      0      0     24     24      0      0
complip                           0.00      0      0     11     11      0      0
cosmolip                          0.08      0      0     77     77      0      0
csyslip                           0.00     20      0     28     48      0      0
lipq                              0.68    176      0    102    338      0     60
solip                             0.00      0      0     66     74      0      8
}}}


== QHOST: show the status of Grid Engine hosts, queues and jobs ==



=== List all cluster nodes ===
The '''qhost''' command queries the scheduler and provides some useful information about the system. For example, running the command without arguments returns the list of all nodes in the cluster, including some details about the hardware:

{{{
[j