= Available Resources =

The NCG cluster is a national infrastructure with heterogeneous hardware, shared between a wide range of internal and external communities. The cluster provides several types of resources organized by '''queues''': HPC, HTC, GPU, InfiniBand, environment or group ownership.

In most cases users do not need to request a particular '''resource''' or '''queue'''; the system automatically chooses the best matching hardware. A user who needs special hardware or a special environment should request '''resources''' as shown below. In general the queues are not homogeneous with respect to the resources they contain, and user authorization varies from queue to queue and sometimes from host to host within the same queue.

||'''Users should not request a specific ''queue'', because they may not be allowed to use it or the best combination of resources may be on a different ''queue''. In some cases mixing a request for a ''queue'' with a request for ''resources'' leads to a combination that cannot be satisfied, and the job will stay on the waiting list forever, as seen in the example below [[https://wiki-lip.lip.pt/Computing/LIP_Coimbra_Farm2/5_Cluster_Status#List_host_queues_using_mixed_criteria|List host queues using mixed criteria]].''' ||

== Queue Configuration ==

'''Queues''' represent an aggregation of hosts; each host contributes a certain number of running jobs, or ''slots''. The '''queues''' available to LIP users are organized by group ownership, as listed in the table below; for that reason the resources in a '''queue''' may be heterogeneous.

||'''Queue name''' ||<#66ffff>'''owner''' ||<#66ffff>'''usage''' ||
||'''calolip''' ||CALO ||shared ||
||'''complip''' ||COMPASS ||shared ||
||'''cosmolip''' ||Auger ||private ||
||'''csyslip''' ||LIP ||public ||
||'''lipq''' ||LIP ||public ||
||'''solip''' ||LIP ||public ||

Jobs submitted to the cluster are allocated automatically to the proper host '''queue'''; users do not need to request a specific '''queue'''. Some of the group '''queues''' are shared with other LIP groups, so users benefit from leaving the '''queue''' attribute blank: more nodes will be available to run their jobs.

Each node may run a maximum number of jobs (normally one job corresponds to one '''slot'''), and that maximum is shared between one or more '''queues''', so the sum of the TOTAL slots over all queues is in most cases greater than the actual maximum number of allowed jobs. Suppose a worker node can serve 2 slots and we configure two queues, '''A''' and '''B''', each capable of using 2 slots on that node. Any combination of jobs can run on those queues as long as the sum of jobs on the node never exceeds 2:

||'''A''' ||<#999999 style="text-align:center">'''B''' ||<#999999 style="text-align:center">'''Running''' ||<#999999 style="text-align:center"> ||
||<#00ff33 style="text-align:center">0 ||<#00ff33 style="text-align:center">0 ||<#00ff33 style="text-align:center">0 ||<#00ff33 style="text-align:center"> ||
||<#00ff33 style="text-align:center">0 ||<#00ff33 style="text-align:center">1 ||<#00ff33 style="text-align:center">1 ||<#00ff33 style="text-align:center"> ||
||<#00ff33 style="text-align:center">0 ||<#00ff33 style="text-align:center">2 ||<#00ff33 style="text-align:center">2 ||<#00ff33 style="text-align:center"> ||
||<#00ff33 style="text-align:center">1 ||<#00ff33 style="text-align:center">0 ||<#00ff33 style="text-align:center">1 ||<#00ff33 style="text-align:center"> ||
||<#00ff33 style="text-align:center">1 ||<#00ff33 style="text-align:center">1 ||<#00ff33 style="text-align:center">2 ||<#00ff33 style="text-align:center"> ||
||<#ff3333 style="text-align:center">1 ||<#ff3333 style="text-align:center">2 ||<#ff3333 style="text-align:center">3 ||<#ff3333 style="text-align:center">never happens ||
||<#00ff33 style="text-align:center">2 ||<#00ff33 style="text-align:center">0 ||<#00ff33 style="text-align:center">2 ||<#00ff33 style="text-align:center"> ||
||<#ff3333 style="text-align:center">2 ||<#ff3333 style="text-align:center">1 ||<#ff3333 style="text-align:center">3 ||<#ff3333 style="text-align:center">never happens ||
||<#ff3333 style="text-align:center">2 ||<#ff3333 style="text-align:center">2 ||<#ff3333 style="text-align:center">4 ||<#ff3333 style="text-align:center">never happens ||

Each '''queue''' enforces a limit of '''2GB''' of resident memory and '''4GB''' of virtual memory per job in order to protect other tasks running on the same node.

== Complex/Resources Configuration ==

Complex attributes provide a way of defining cluster resources, which are requested through the option '''-l''' of the submission command '''qsub''' or of the '''qhost''' and '''qselect''' commands shown below; check the manual page, '''man complex''', for more details.

The relevant complex attributes, per host, defined for LIP users are the following:

||'''Complex (resource)''' ||<#66ffff>'''type''' ||<#66ffff>'''values''' ||<#66ffff>'''Description''' ||
||'''slots''' ||integer ||integer ||number of jobs ||
||'''mem_total''' ||memory ||integer ||total RAM ||
||'''virtual_total''' ||memory ||integer ||total virtual memory ||
||'''proc''' ||string ||intel,amd ||CPU family ||
||'''gpu''' ||boolean ||0,1 ||GPU present ||

== Parallel Environments ==

Parallel environments support the execution of distributed and shared memory applications. Examples of parallel environments are OpenMP on shared memory multiprocessor systems, or the Message Passing Interface (MPI) on a distributed cluster. Both are available to LIP users, but the MPI environment only works within a single node. A parallel environment makes it possible to allocate more than one slot, and the corresponding memory, to a single job. The table below shows the parallel environment presently available to LIP users.

||'''Parallel Environment Name''' ||<#66ffff>'''Multi Node''' ||<#66ffff>'''Max. Slots''' ||
||'''mcore''' ||NO ||Node dependent ||

Users who run parallel applications or have larger memory requirements may request more than one '''slot'''. Each '''slot''' guarantees '''2GB''' of RAM and '''4GB''' of virtual memory; for example, if a user requests 4 '''slots''' the job will have a RAM limit of '''(2x4)GB=8GB''' and '''(4x4)GB=16GB''' of virtual memory. It is advisable to check that there are resources able to serve the requested number of '''slots'''; use the '''qhost''' and '''qselect''' commands below to do so. We also recommend a conservative choice of '''slots''' in order to leave resources for other users.

Check the [[https://wiki-lip.lip.pt/Computing/LIP_Lisboa_Farm/6.2_Job_Submissions_Advanced|Job Submissions Advanced]] section to learn how to request a parallel environment.

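The per-slot arithmetic above can be sketched in a few lines of Python. This is only an illustration, assuming the limits of 2GB of RAM and 4GB of virtual memory per slot used in the worked example; the function name `job_memory_limits` is hypothetical and is not part of any cluster tool.

```python
# Illustrative only: per-slot limits assumed from the worked example above.
RAM_PER_SLOT_GB = 2
VIRTUAL_PER_SLOT_GB = 4


def job_memory_limits(slots):
    """Return (ram_gb, virtual_gb) for a job requesting `slots` slots."""
    if slots < 1:
        raise ValueError("a job needs at least one slot")
    return slots * RAM_PER_SLOT_GB, slots * VIRTUAL_PER_SLOT_GB


if __name__ == "__main__":
    for slots in (1, 2, 4):
        ram, virt = job_memory_limits(slots)
        print(f"{slots} slot(s): {ram}GB RAM, {virt}GB virtual memory")
```

A quick calculation like this may help when choosing a conservative number of slots before submitting a job.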
== Operating Systems ==

The LIP FARM supports the execution of multiple operating systems on the same hardware through ''container'' technology; we use an implementation called '''udocker''', which allows complex environments to be virtualized in user space. Presently the cluster supports the following operating systems, or distributions:

 * CentOS 6.9
 * CentOS 7.4
 * Ubuntu 16.04

More could be added as needed.

For each type of operating system there is a dedicated login server; check the [[https://wiki-lip.lip.pt/Computing/LIP_Lisboa_Farm/2_How_To_Access#Login_Servers|Login Servers]] section for more details. The login servers are configured like the worker nodes running the target operating system, which makes it easier for users to adapt their submission scripts; see the [[https://wiki-lip.lip.pt/Computing/LIP_Lisboa_Farm/6.1_Job_Submission_basic|Job Submission Basic]] section for more information and examples.

There is a local command, '''qinfo''', developed in-house, that lists the login servers by operating system as well as the available container names and their operating systems; see the [[https://wiki-lip.lip.pt/Computing/LIP_Lisboa_Farm/5_Cluster_Status#QINFO:_print_available_operating_systems|QINFO: print available operating systems]] section below.

= Query Commands =

== QSTAT: show the status of Grid Engine jobs and queues ==

=== List user jobs ===

Running the '''qstat''' command without options returns the user's own jobs, for example:

{{{
[diogo@fermi03 ~]$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------
5032642 1.10000 farm_conex diogo        r     11/13/2017 03:42:23 lipq@wn054.ncg.ingrid.pt           1
5032892 1.10000 farm_conex diogo        r     11/13/2017 04:17:35 lipq@wn073.ncg.ingrid.pt           1
5032893 1.10000 farm_conex diogo        Eqw   11/13/2017 04:17:30                                    1
5032903 1.10000 farm_conex diogo        qw    11/13/2017 03:18:30                                    1
}}}

In this example the first two jobs are running on nodes ''wn054'' and ''wn073'' and the last job is waiting for resources to become available. The state '''Eqw''' means that the job ran but there was a cluster error; '''users should contact the IT team in these cases'''.

=== List other users jobs ===

Use the option '''-u''' to query the jobs of other users; the option accepts a user name as argument, or a wildcard to query the jobs of all users, for example:

{{{
[diogo@fermi03 ~]$ qstat -u fcruz
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------
3852655 0.40279 Hexadecame fcruz        r     11/13/2017 04:22:46 lipq@wn103.ncg.ingrid.pt           1
3852656 0.00000 Hexadecame fcruz        hqw   10/03/2017 12:16:42
}}}

In this example the state '''hqw''' means that the job is on hold; this is an advanced way of running jobs, see the [[https://wiki-lip.lip.pt/Computing/LIP_Coimbra_Farm/6.2_Job_Submissions_Advanced|Job Submissions Advanced]] section.

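As a rough illustration of how the state column can be read, the sketch below counts jobs per state in the plain-text output of '''qstat'''. It assumes the column layout shown in the examples above (state in the fifth column, job names without spaces); the names `STATE_MEANING` and `count_states` are hypothetical, and only the states described on this page are covered.

```python
from collections import Counter

# Meanings as described on this page; other Grid Engine states are not covered.
STATE_MEANING = {
    "r": "running",
    "qw": "waiting for resources",
    "hqw": "on hold",
    "Eqw": "error, contact the IT team",
}


def count_states(qstat_output):
    """Count jobs per state in the plain-text output of `qstat`."""
    counts = Counter()
    for line in qstat_output.splitlines():
        fields = line.split()
        # Job rows start with a numeric job-ID; header and separator lines do not.
        if len(fields) >= 5 and fields[0].isdigit():
            counts[fields[4]] += 1
    return counts


# Sample copied from the first qstat example on this page (whitespace collapsed).
SAMPLE = """\
job-ID prior name user state submit/start at queue slots ja-task-ID
-------------------------------------------------------------------
5032642 1.10000 farm_conex diogo r 11/13/2017 03:42:23 lipq@wn054.ncg.ingrid.pt 1
5032892 1.10000 farm_conex diogo r 11/13/2017 04:17:35 lipq@wn073.ncg.ingrid.pt 1
5032893 1.10000 farm_conex diogo Eqw 11/13/2017 04:17:30 1
5032903 1.10000 farm_conex diogo qw 11/13/2017 03:18:30 1
"""

if __name__ == "__main__":
    for state, n in count_states(SAMPLE).items():
        print(f"{n} job(s) {STATE_MEANING.get(state, 'state ' + state)}")
```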
Use the following command to list the jobs of all users:

{{{
[diogo@fermi03 ~]$ qstat -u '*'
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
5025385 0.18066 cream_8234 cmsplt013    r     11/12/2017 05:12:10 cmsgrid_mcore@wn022.ncg.ingrid     8
5025386 0.18066 cream_1267 cmsplt013    r     11/12/2017 05:12:33 cmsgrid_mcore@wn054.ncg.ingrid     8
5025851 0.33147 cream_3727 snop005      r     11/12/2017 05:50:14 gridq@wn017.ncg.ingrid.pt          1
5025852 0.33147 cream_9907 snop005      r     11/12/2017 05:50:16 gridq@wn110.ncg.ingrid.pt          1
... endless list ....
}}}

=== Show job details ===

The option '''-j''' of the '''qstat''' command shows all the details of a job; the output is long:

{{{
[mpinto@fermi01 ~]$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
------------------------------------------------------------------------------------------------------
5048766 0.22507 lipfarm_DD mpinto       r     11/15/2017 15:00:39 solip@wn202.ncg.ingrid.pt          1
[mpinto@fermi01 ~]$ qstat -j 5048766 | less
==============================================================
job_number:                 5048766
exec_file:                  job_scripts/5048766
submission_time:            Wed Nov 15 15:00:39 2017
owner:                      mpinto
uid:                        5040009
group:                      cosmo
gid:                        5040000
sge_o_home:                 /home/cosmo/mpinto
sge_o_log_name:             mpinto
...
}}}

Check '''man qstat''' for more details on command usage.

=== List all QUEUES ===

Use the command below to list all the ''queues'' available on the FARM. The ''queues'' of interest to LIP users are those whose names end in '''lip''', plus the '''lipq''' ''queue'' and, for LIP Coimbra users, '''gridc'''.

{{{
[jpina@fermi03 ~]$ qstat -g c
CLUSTER QUEUE                   CQLOAD   USED    RES  AVAIL  TOTAL aoACDS  cdsuE
--------------------------------------------------------------------------------
atlasgrid                         0.70     32      0    281    375      0     62
atlasgrid_mcore                   0.77    456      0    416   1016      0    144
calolip                           0.00      0      0     24     24      0      0
cmsgrid                           0.74      0      0    305    359      0     54
cmsgrid_mcore                     0.76    312      0    564   1032      0    172
cmst3grid                         -NA-      0      0      0      1      0      1
complip                           0.00      0      0     11     11      0      0
cosmolip                          0.09      0      0     77     77      0      0
csyslip                           0.00     20      0     28     48      0      0
dteamgrid                         0.69      0      0    314    373      0     59
dteamgrid_mcore                   0.73      0      0    704    832      0    128
fast_medusa                       0.00      0      0     44     44      0      0
gridc                             0.57      0      0    286    286      0      0
gridq                             0.73      1      0    310    362      0     51
gridq_mcore                       0.72      0      0    720    848      0    128
hpcgrid                           0.63    216      0     48    280      0     16
hpcib                             0.42     24      0     60    100      0     16
hpclong                           0.62     12      0    102    352      0    240
incd                              0.77      0      0    270    318      0     48
lipq                              0.66    176      0    102    338      0     60
medusa                            0.25     65      0    199    268      0      4
opsgrid                           0.70      0      0    325    387      0     62
opsgrid_mcore                     0.72      0      0    720    848      0    128
qao_2356                          0.74      0      0     16    192      0    176
qao_2356_ib                       0.39      0      0     16    112      0     96
qdav                              0.00      0      0      8      8      0      0
qix_e5472_nv                      0.00      0      0     24     32      0      8
qix_es2680                        0.72      0      0     48     48      0      0
solip                             0.00      0      0     66     74      0      8
}}}

The meaning of the table columns is the following:

 * CQLOAD: present load of the '''queue''';
 * USED: slots of the '''queue''' presently in use;
 * RES: slots of the '''queue''' presently reserved, normally none;
 * AVAIL: slots of the '''queue''' presently available, or free;
 * TOTAL: slots configured for the '''queue''';
 * aoACDS: slots on queue instances in at least one of the states a (load alarm), o (orphaned), A (suspend alarm), C (calendar suspended), D (calendar disabled) or S (subordinated); see '''man qstat''';
 * cdsuE: slots on queue instances in at least one of the states c (configuration ambiguous), d (disabled), s (suspended), u (unknown) or E (error); see '''man qstat'''.

Looking at the previous example, the maximum number of slots configured for '''queue''' '''lipq''' is '''338''', as shown in the TOTAL column, but we have to subtract the '''60''' slots in the '''cdsuE''' column, which are unusable due to some problem. So the actual number of allowed slots is '''278''', that is '''176 + 102'''. Although the '''102''' slots are tagged as '''AVAIL''', they may not all be free, because the nodes belonging to '''queue''' '''lipq''' also serve other '''queues'''; the '''20''' slots used on the '''csyslip''' '''queue''' are an example.

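The slot arithmetic above can be reproduced from a single row of the '''qstat -g c''' output. The sketch below is a hypothetical helper, not part of Grid Engine: the names `parse_queue_row` and `usable_slots` are assumptions, and the column handling is based only on the header shown in the example.

```python
# Column names as they appear in the `qstat -g c` header shown above.
COLUMNS = ("CQLOAD", "USED", "RES", "AVAIL", "TOTAL", "aoACDS", "cdsuE")


def parse_queue_row(row):
    """Split one data row of `qstat -g c` into (queue_name, {column: value})."""
    name, *fields = row.split()
    if len(fields) != len(COLUMNS):
        raise ValueError(f"unexpected number of columns in: {row!r}")
    # CQLOAD may be '-NA-', so it is kept as text; the slot counts are integers.
    values = {"CQLOAD": fields[0]}
    values.update({col: int(val) for col, val in zip(COLUMNS[1:], fields[1:])})
    return name, values


def usable_slots(values):
    """Configured slots minus those reported in the cdsuE column."""
    return values["TOTAL"] - values["cdsuE"]


if __name__ == "__main__":
    name, values = parse_queue_row("lipq 0.66 176 0 102 338 0 60")
    print(name, usable_slots(values), values["USED"] + values["AVAIL"])
```

For the '''lipq''' row both expressions give 278, matching the figure worked out in the text; as noted there, the AVAIL slots may still be taken by jobs from other queues sharing the same nodes.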
=== List LIP QUEUES ===

Use the command below to list only the LIP ''queues'' on the FARM; the command also accepts a list of ''queue'' names, e.g. '''qstat -g c -q lipq gridc cosmolip'''.

{{{
[jpina@fermi03 ~]$ qstat -g c -q '*lip*'
CLUSTER QUEUE                   CQLOAD   USED    RES  AVAIL  TOTAL aoACDS  cdsuE
--------------------------------------------------------------------------------
calolip                           0.00      0      0     24     24      0      0
complip                           0.00      0      0     11     11      0      0
cosmolip                          0.08      0      0     77     77      0      0
csyslip                           0.00     20      0     28     48      0      0
lipq                              0.68    176      0    102    338      0     60
solip                             0.00      0      0     66     74      0      8
}}}

== QHOST: show the status of Grid Engine hosts, queues and jobs ==

=== List all cluster nodes ===

The '''qhost''' command queries the scheduler and provides some useful information about the system. For example, running the command without arguments returns the list of all nodes in the cluster, including some details about the hardware:

{{{ [j