TORQUE is an open source resource manager providing control over batch jobs and distributed compute nodes. It is a community effort based on the original *PBS project and has incorporated significant advances in scalability, fault tolerance, and feature extensions contributed by NCSA, OSC, USC, the U.S. Dept. of Energy, Sandia, PNNL, U of Buffalo, TeraGrid, and many other leading-edge HPC organizations.[1]
1. Install TORQUE packages
$ sudo yum install torque torque-server torque-mom torque-libs torque-client
2. Configure TORQUE on the server. On a small workstation, the server and the compute node are usually the same machine.
$ cd /usr/share/doc/torque-4.2.10/
$ sudo vi torque.setup
Change the lines

qmgr -c 'create queue batch'
qmgr -c 'set queue batch queue_type = execution'
qmgr -c 'set queue batch started = true'
qmgr -c 'set queue batch enabled = true'
qmgr -c 'set queue batch resources_default.walltime = 1:00:00'
qmgr -c 'set queue batch resources_default.nodes = 1'
qmgr -c 'set server default_queue = batch'

to

qmgr -c 'create queue batch'
qmgr -c 'set queue batch queue_type = execution'
qmgr -c 'set queue batch started = true'
qmgr -c 'set queue batch enabled = true'
qmgr -c 'set queue batch resources_default.walltime = 720:00:00'
# walltime = 720:00:00 means that every job gets 720 hours to run by default
qmgr -c 'set queue batch resources_default.nodes = 1'
qmgr -c 'set queue batch max_running = 5'
# max_running = 5 means that at most 5 jobs run at any time; jobs submitted after that wait in the queue
qmgr -c 'set queue batch max_user_run = 5'
# max_user_run = 5 means that a single user can have at most 5 jobs running at once in this queue
qmgr -c 'set server default_queue = batch'
Then run it, passing root as the TORQUE administrator:
$ sudo ./torque.setup root
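As a quick check on the walltime value used above, the HH:MM:SS string can be converted to seconds and days. This is only a sketch; walltime_to_seconds is a hypothetical helper, not a TORQUE command.

```shell
#!/bin/sh
# Convert a TORQUE walltime string (HH:MM:SS) to seconds.
# walltime_to_seconds is a hypothetical helper, not part of TORQUE.
walltime_to_seconds() {
    # awk reads each field as a decimal number, so "08" is not taken as octal
    printf '%s\n' "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

secs=$(walltime_to_seconds 720:00:00)
echo "720:00:00 = $secs seconds = $((secs / 86400)) days"
```

With the value from the queue definition this reports 2592000 seconds, i.e. 30 days, which is a generous default for a shared machine.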
At this point you may run into the following error:
munge: Error: Unable to access "/var/run/munge/munge.socket.2": No such file or directory ......
This error means the munged daemon is not running, typically because no MUNGE key has been created yet. To fix it, create a key as root:
[~]$ su   # switch to the root user
[~]# cd /etc/munge
[munge]# dd if=/dev/random bs=1 count=1024 >/etc/munge/munge.key   # create a key for munge
[munge]# chown munge munge.key
[munge]# chmod 400 munge.key
Then enable and start munge.service and check that it runs correctly:
[~]$ sudo systemctl enable munge.service
ln -s '/usr/lib/systemd/system/munge.service' '/etc/systemd/system/multi-user.target.wants/munge.service'
[~]$ sudo systemctl start munge.service
[~]$ sudo systemctl status munge.service
munge.service - MUNGE authentication service
   Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled)
   Active: active (running) since Tue 2016-02-16 17:01:56 CST; 16h ago
     Docs: man:munged(8)
 Main PID: 29402 (munged)
   CGroup: /system.slice/munge.service
           └─29402 /usr/sbin/munged

Feb 16 17:01:56 systemd[1]: Started MUNGE authentication service.
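If munged still refuses to start, the key file is worth a second look. The following is a minimal sketch of such a check, assuming GNU stat and the 1024-byte key size used above. check_munge_key is a hypothetical helper, and it does not verify the file owner.

```shell
#!/bin/sh
# check_munge_key is a hypothetical helper: it returns 0 only if the key file
# exists, has the expected size in bytes, and has mode 400.
check_munge_key() {
    key=$1
    expected_size=${2:-1024}   # 1024 matches the dd command used for the key
    [ -f "$key" ] || { echo "missing: $key"; return 1; }
    size=$(wc -c < "$key" | tr -d ' ')
    mode=$(stat -c %a "$key")  # GNU stat syntax (assumption: a Linux system)
    [ "$size" -eq "$expected_size" ] || { echo "unexpected size: $size"; return 1; }
    [ "$mode" = "400" ] || { echo "unexpected mode: $mode"; return 1; }
    echo "ok: $key"
}

# Demonstration on a throwaway key, so nothing under /etc/munge is touched.
demo_key=$(mktemp)
dd if=/dev/urandom of="$demo_key" bs=1 count=1024 2>/dev/null
chmod 400 "$demo_key"
check_munge_key "$demo_key"
rm -f "$demo_key"
```

On a real system the function would be called as root with /etc/munge/munge.key as its argument.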
3. Set up the server nodes
The default TORQUE configuration folder on CentOS 7 is /var/lib/torque.
In that folder, create the file server_priv/nodes with contents like this:
node01 np=20 num_node_boards=1 #numa_board_str=8
# updated on 2019-06-09
# node01 is your hostname; np=20 means the node has 20 processors (threads)
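An entry of this form can also be generated from the local machine instead of typed by hand. The sketch below assumes that the name printed by uname -n is the name pbs_server should use for the node, which may not hold on every network setup. make_nodes_line is a hypothetical helper.

```shell
#!/bin/sh
# make_nodes_line is a hypothetical helper that formats one
# server_priv/nodes entry from a host name, a processor count,
# and an optional num_node_boards value (default 1).
make_nodes_line() {
    host=$1
    np=$2
    boards=${3:-1}
    printf '%s np=%s num_node_boards=%s\n' "$host" "$np" "$boards"
}

# Print an entry for the local machine; nothing is written to disk.
make_nodes_line "$(uname -n)" "$(getconf _NPROCESSORS_ONLN)"
```

The printed line can be reviewed and then copied into server_priv/nodes as root.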
4. Initialize/Configure TORQUE on Each Compute Node
Create the file mom_priv/config with contents like this:
$pbsserver localhost   # note: hostname running pbs_server
$logevent 255          # bitmap of which events to log
5. Start the daemon service
$ sudo chkconfig pbs_mom on
$ sudo chkconfig pbs_sched on
$ sudo chkconfig pbs_server on
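Note that chkconfig ... on only enables a service at boot; it does not start it in the current session. A hedged sketch of starting the daemons right away follows. It assumes the systemd unit names match the chkconfig names above, and start_torque_services is a hypothetical helper that only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# start_torque_services is a hypothetical helper. With DRY_RUN=1 (the default
# here) it only prints the commands instead of running them.
start_torque_services() {
    # Unit names are assumed to match the chkconfig service names.
    for svc in pbs_server pbs_sched pbs_mom; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "sudo systemctl start $svc"
        else
            sudo systemctl start "$svc"
        fi
    done
}

start_torque_services
```

Running it with DRY_RUN=0 executes the three systemctl commands for real.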
6. Test service configuration
Verify that all nodes are reporting correctly:
[~]$ pbsnodes -a
node01-el7
     state = free
     np = 24
     ntype = cluster
     status = ......
     mom_service_port = 15002
     mom_manager_port = 15003
View additional server configuration:
[~]$ qmgr -c 'p s'
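On a machine with several nodes, the pbsnodes listing gets long, and a one-line-per-node summary is easier to scan. The sketch below assumes the layout shown above, with node names unindented and attributes indented. node_states is a hypothetical helper.

```shell
#!/bin/sh
# node_states is a hypothetical helper: it reads `pbsnodes -a` style output
# on stdin and prints "<node> <state>" for every node.
node_states() {
    awk '
        /^[^[:space:]]/ { node = $1 }                  # unindented line: node name
        $1 == "state" && $2 == "=" { print node, $3 }  # indented state line
    '
}

# Sample input modelled on the pbsnodes output for node01-el7.
node_states <<'EOF'
node01-el7
     state = free
     np = 24
     ntype = cluster
EOF
```

With TORQUE installed, the same function can be fed directly with pbsnodes -a | node_states.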
The setup is now complete and the system is ready for work. To submit a job to the queue, use the qsub command:
$ qsub batchjob
Here batchjob is a file containing job settings and the command lines to run.
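For illustration, a minimal job file could look like the one written by the sketch below. The job name, resource request, and commands are placeholders; only the queue name batch comes from the configuration above.

```shell
#!/bin/sh
# Write a minimal example job script. The job name, resource request, and
# commands are placeholders, not values from a real workload.
job_file=${JOB_FILE:-batchjob}
cat > "$job_file" <<'EOF'
#!/bin/sh
#PBS -N example_job
#PBS -q batch
#PBS -l nodes=1:ppn=1
#PBS -l walltime=01:00:00
#PBS -j oe

# Go back to the directory qsub was run from.
cd "$PBS_O_WORKDIR" || exit 1
echo "Running on $(hostname)"
EOF

# Check the script for syntax errors without running it.
sh -n "$job_file" && echo "wrote $job_file"
```

The #PBS lines are read by qsub as options: -N sets the job name, -q the queue, -l the resource requests, and -j oe merges standard error into standard output.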
Note that this is only a simple configuration for using TORQUE on Fedora 12. A more detailed configuration guide is available on clusterresources.com.
References
[1] ClusterResources. TORQUE Administrator's Guide. v2.3
[2] MUNGE Installation Guide. http://mcs.une.edu.au/doc/munge/QUICKSTART, retrieved on 2016/02/16
[3] Installing TORQUE. http://docs.adaptivecomputing.com/torque/5-1-1/Content/topics/hpcSuiteInstall/manual/1-installing/installingTorque.htm, retrieved on 2016/02/16
Contents of the server_priv/nodes file:
node01 np=2 num_node_boards=1
The pbs_server requires awareness of how the MOM is reporting nodes, since there is only one MOM daemon and multiple MOM nodes. So, configure the server_priv/nodes file with the num_node_boards and numa_board_str attributes. The attribute num_node_boards tells pbs_server how many NUMA nodes are reported by the MOM. The following is an example of how to configure the nodes file with num_node_boards:
numa-10 np=72 num_node_boards=12
This line in the nodes file tells pbs_server there is a host named numa-10 and that it has 72 processors and 12 nodes. The pbs_server divides the value of np (72) by the value for num_node_boards (12) and determines there are 6 CPUs per NUMA node.
Source: http://docs.adaptivecomputing.com/torque/4-1-4/Content/topics/1-installConfig/buildingWithNUMA.htm
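The division described above can be reproduced for any nodes-file entry. The following is a small sketch; cpus_per_numa_node is a hypothetical helper and only mirrors the arithmetic, not the actual pbs_server code.

```shell
#!/bin/sh
# cpus_per_numa_node is a hypothetical helper: it reads one nodes-file entry
# and prints np divided by num_node_boards.
cpus_per_numa_node() {
    printf '%s\n' "$1" | awk '{
        for (i = 2; i <= NF; i++) {
            split($i, kv, "=")
            if (kv[1] == "np") np = kv[2] + 0
            if (kv[1] == "num_node_boards") boards = kv[2] + 0
        }
        # Print nothing if num_node_boards is missing or zero.
        if (boards > 0) print int(np / boards)
    }'
}

cpus_per_numa_node "numa-10 np=72 num_node_boards=12"   # prints 6
```

For the single-board entry node01 np=2 num_node_boards=1, the same function prints 2.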
Sometimes qsub cannot submit jobs, and “pbsnodes -a” reports:
pbsnodes: End of file
One possible solution is to delete the configuration folder and then reinstall the TORQUE packages.
On compute nodes, install the MOM (torque-mom), then edit the /etc/services file and set the pbs entries in the form port_num/tcp.
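To see which TORQUE entries a services file already contains, a short filter is enough. In the sketch below, pbs_service_entries is a hypothetical helper, and the port numbers in the demonstration file are the usual TORQUE defaults, given here as an assumption. Ports 15002 and 15003 agree with the mom_service_port and mom_manager_port values in the pbsnodes output.

```shell
#!/bin/sh
# pbs_service_entries is a hypothetical helper: it prints the name and
# port/protocol of every entry whose name starts with "pbs".
pbs_service_entries() {
    awk '$1 ~ /^pbs/ { print $1, $2 }' "${1:-/etc/services}"
}

# Demonstration on a temporary file. The port numbers are assumed defaults;
# use the values that apply to your own site.
demo=$(mktemp)
cat > "$demo" <<'EOF'
ssh         22/tcp
pbs         15001/tcp    # pbs_server
pbs_mom     15002/tcp    # MOM to/from server
pbs_resmom  15003/tcp    # MOM resource management requests
pbs_sched   15004/tcp    # scheduler
EOF
pbs_service_entries "$demo"
rm -f "$demo"
```

Called without an argument, the function reads /etc/services on the current machine.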