Install TORQUE Resource Manager on CentOS 7

TORQUE is an open-source resource manager providing control over batch jobs and distributed compute nodes. It is a community effort based on the original *PBS project and has incorporated significant advances in the areas of scalability, fault tolerance, and feature extensions contributed by NCSA, OSC, USC, the U.S. Dept. of Energy, Sandia, PNNL, U. of Buffalo, TeraGrid, and many other leading-edge HPC organizations.[1]


1. Install TORQUE packages

$sudo yum install torque torque-server torque-mom torque-libs torque-client

2. Configure TORQUE on the server (on a small workstation, the server and the compute node usually run on the same machine)

$cd /usr/share/doc/torque-4.2.10/
$sudo vi torque.setup

Change these lines:

qmgr -c 'create queue batch'
qmgr -c 'set queue batch queue_type = execution'
qmgr -c 'set queue batch started = true'
qmgr -c 'set queue batch enabled = true'
qmgr -c 'set queue batch resources_default.walltime = 1:00:00'
qmgr -c 'set queue batch resources_default.nodes = 1'

qmgr -c 'set server default_queue = batch'

to:

qmgr -c 'create queue batch'
qmgr -c 'set queue batch queue_type = execution'
qmgr -c 'set queue batch started = true'
qmgr -c 'set queue batch enabled = true'

qmgr -c 'set queue batch resources_default.walltime = 720:00:00'
# walltime = 720:00:00 gives every job a default limit of 720 hours


qmgr -c 'set queue batch resources_default.nodes = 1'
qmgr -c 'set queue batch max_running = 5'
# max_running = 5 means at most 5 jobs in this queue run at any time; later submissions wait in the queue

qmgr -c 'set queue batch max_user_run = 5'
# max_user_run = 5 means each user may have at most 5 jobs running at once

qmgr -c 'set server default_queue = batch'
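As an aside, TORQUE walltime strings are hours:minutes:seconds, so the 720-hour default can be sanity-checked with a small standalone Python sketch (the function name is my own, not part of TORQUE):

```python
def walltime_to_seconds(walltime: str) -> int:
    """Convert a TORQUE walltime string like '720:00:00' (h:m:s) to seconds."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return hours * 3600 + minutes * 60 + seconds

print(walltime_to_seconds("720:00:00"))  # 720 hours = 2,592,000 seconds
```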

Then execute it:

$sudo ./torque.setup root

with root as the administrator account.

Here, you may run into this problem:

munge: Error: Unable to access "/var/run/munge/munge.socket.2": No such file or directory
......

This happens when the munge daemon is not running, typically because no MUNGE key has been created yet. Let's fix it:

[~]$ su # switch to the root user
[~]# cd /etc/munge
[munge]# dd if=/dev/random bs=1 count=1024 >/etc/munge/munge.key # create a secret key for munge (/dev/urandom also works and will not block)
[munge]# chown munge munge.key
[munge]# chmod 400 munge.key

Check that munge.service is running correctly (a quick end-to-end test is "munge -n | unmunge"):

[~]$ sudo systemctl enable munge.service
ln -s '/usr/lib/systemd/system/munge.service' '/etc/systemd/system/multi-user.target.wants/munge.service'
[~]$ sudo systemctl start munge.service
[~]$ sudo systemctl status munge.service
munge.service - MUNGE authentication service
   Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled)
   Active: active (running) since Tue 2016-02-16 17:01:56 CST; 16h ago
     Docs: man:munged(8)
 Main PID: 29402 (munged)
   CGroup: /system.slice/munge.service
           └─29402 /usr/sbin/munged

Feb 16 17:01:56  systemd[1]: Started MUNGE authentication service.

3. Set up the server's node list
The default TORQUE configuration folder on CentOS 7 is /var/lib/torque.
Create the file server_priv/nodes like this:

node01 np=2
# node01 is your hostname; np=2 means the node has 2 processors (threads)
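Purely for illustration, a small Python sketch (the parser is hypothetical, not part of TORQUE) of how a server_priv/nodes line decomposes into a hostname and attributes:

```python
def parse_nodes_line(line: str):
    """Split a server_priv/nodes line into (hostname, attribute dict).

    Example line: 'node01 np=2' -> hostname 'node01', np = 2 processors.
    """
    hostname, *attrs = line.split()
    attributes = {}
    for attr in attrs:
        key, _, value = attr.partition("=")
        attributes[key] = int(value) if value.isdigit() else value
    return hostname, attributes

print(parse_nodes_line("node01 np=2"))  # ('node01', {'np': 2})
```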

4. Initialize/Configure TORQUE on each compute node
Create the MOM configuration file mom_priv/config like this:

$pbsserver localhost # note: hostname running pbs_server
$logevent 255 # bitmap of which events to log

5. Enable and start the daemon services

$sudo chkconfig pbs_mom on
$sudo chkconfig pbs_sched on
$sudo chkconfig pbs_server on

On CentOS 7, chkconfig is forwarded to systemd and enabling only takes effect at boot; the daemons can be started immediately with "sudo systemctl start pbs_mom pbs_sched pbs_server" (assuming the packages installed service units under those names).

6. Test the service configuration
Verify that all nodes are reporting correctly:

[~]$ pbsnodes -a
node01-el7
     state = free
     np = 24
     ntype = cluster
     status = ......
     mom_service_port = 15002
     mom_manager_port = 15003

View the rest of the server configuration:

[~]$ qmgr -c 'p s'

Finally, the setup is complete and you can start using it. Jobs are submitted to the queue with the qsub command:

$qsub batchjob

Here batchjob is a script file containing job settings and the command lines to run.
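For illustration, a minimal batchjob script might look like the following (the contents are an assumed example, not from the original post); the #PBS lines are directives read by qsub but ordinary comments to the shell:

```shell
#!/bin/sh
#PBS -N testjob               # job name shown by qstat
#PBS -l nodes=1:ppn=2         # request 1 node with 2 processors
#PBS -l walltime=01:00:00     # override the queue's default walltime
#PBS -q batch                 # submit to the batch queue defined above

# qsub sets PBS_O_WORKDIR to the directory the job was submitted from
cd "${PBS_O_WORKDIR:-.}"
echo "Job running on $(hostname)"
```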

However, this is only a simple configuration for using TORQUE on CentOS 7. A more detailed configuration guide is available on the site clusterresources.com.

References
[1] ClusterResources. TORQUE Administrator’s Guide. v2.3

[2] MUNGE Installation Guide. http://mcs.une.edu.au/doc/munge/QUICKSTART, retrieved on 2016/02/16

[3] Installing TORQUE. http://docs.adaptivecomputing.com/torque/5-1-1/Content/topics/hpcSuiteInstall/manual/1-installing/installingTorque.htm, retrieved on 2016/02/16


3 Responses

  1. RUN DU (September 12, 2016):

    The server_priv/nodes file content is:

    node01 np=2 num_node_boards=1

  2. RUN DU (September 12, 2016):

    The pbs_server requires awareness of how the MOM is reporting nodes since there is only one MOM daemon and multiple MOM nodes. So, configure the server_priv/nodes file with the num_node_boards and numa_board_str attributes. The attribute num_node_boards tells pbs_server how many numa nodes are reported by the MOM. Following is an example of how to configure the nodes file with num_node_boards:

    numa-10 np=72 num_node_boards=12

    This line in the nodes file tells pbs_server there is a host named numa-10 and that it has 72 processors and 12 nodes. The pbs_server divides the value of np (72) by the value for num_node_boards (12) and determines there are 6 CPUs per NUMA node.

    [1] http://docs.adaptivecomputing.com/torque/4-1-4/Content/topics/1-installConfig/buildingWithNUMA.htm
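The arithmetic described above can be checked directly (a trivial sketch of the reported calculation):

```python
np = 72               # processors reported for host numa-10
num_node_boards = 12  # NUMA node boards on that host
cpus_per_numa_node = np // num_node_boards
print(cpus_per_numa_node)  # 6 CPUs per NUMA node
```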

  3. RUN DU (October 19, 2016):

    Sometimes, if qsub cannot submit jobs, "pbsnodes -a" prints:
    pbsnodes: End of file

    One possible solution is to delete the configuration folder and then reinstall the torque packages.
