LSA Mac OS X Cluster 2. Information technology resources and support for LSA faculty, staff, and department system administrators.




LSA Mac OS X Cluster 2

About the Cluster

lsa-cluster2 is a Mac OS X cluster made up of 17 Apple Xserves. It may be used to run parallel MPI jobs that are already known to work in an MPI environment. The hardware consists of the following:
Nodes             Processors       Memory (ECC)  Local Disk (SATA)  Shared Disk                              Front-side Bus
1 Head Node       dual 2.3 GHz G5  8 GB          80 GB              1.4 TB (RAID5; NFS shared to all nodes)  1.15 GHz
16 Compute Nodes  dual 2.3 GHz G5  4 GB          80 GB              -                                        1.15 GHz

Performance

This cluster has been benchmarked using the High Performance Linpack (HPL) benchmark, the same benchmark used to rank the Top 500 supercomputers in the world. This cluster achieved 135.6 Gflops.

NetBoot

This cluster uses a very convenient feature of Mac OS X known as NetBoot. Each cluster node's operating system is loaded over the network when it starts up. This allows us to reconfigure the entire cluster to function completely differently if desired, simply by telling each node which configuration to use and rebooting it. In about a minute, each cluster node will be ready to use its new configuration.

Environmentals

The entire cluster is housed in an APC NetShelter VX 42U cabinet. The back door was replaced with an APC air removal door, which keeps each computer in the cluster from overheating and so avoids heat-related errors and crashes. The head node and switch are protected by an APC SmartUPS-3000 UPS to protect your data in the event of power problems, although any jobs running at the time will need to be restarted after power returns. Everything in the cluster is plugged into metered power distribution units, which let us monitor the power consumption of each circuit and help us avoid current overdraws that might otherwise trip a circuit breaker. Additionally, each Xserve in the cluster contains over 44 sensors (24 power, 11 temperature, 8 fans, and more), which are checked every five minutes by multiple automated processes.

Setup

For those interested in how this cluster was set up, we are making our setup document available in PDF format.


Communications

Each node is connected by gigabit ethernet on a dedicated network using an Asante IntraCore 36480 switch.

LAM/MPI is the preferred communications environment. PVM is also available, but should only be used if MPI is not an option for your application. Xgrid is now available for submitting serial jobs from your desktop Mac.

The head node has connections to both the public internet (with a firewall in place) and to the private cluster network. If your application requires it, a VPN is available that will place your workstation on the cluster's private network, but this generally isn't necessary.



Getting Started

Currently, this cluster does not use a scheduling system; it is used by one person at a time. In the meantime, if you'd like to use this cluster, send email to lsa-cluster-requests@umich.edu with the following information:

Once we create your account, you can connect to the cluster with any SSH client, using your uniqname and UMICH.EDU Kerberos password.

After you connect, we recommend that you utilize the screen utility by simply typing screen (or screen -x if you're returning from a disconnected session). This provides numerous benefits:

Storage
Everyone has access to an NFS share that is mounted on every node in the cluster. You should keep all of your files in a subdirectory named by your uniqname. This UFS formatted directory is located at:
/NFSshare/uniqname/

on every node. Note that the compute jobs themselves will not be running as yourself. They will run either as the "cluster" user (see below) or as "nobody" (if using Xgrid). In either case, you'll need to make sure that the permissions are appropriate on /NFSshare/uniqname/. The simplest approach is to just make your directory writable by all users:
chmod -R g+w,o+w /NFSshare/uniqname

You may or may not have a home directory at /Users/uniqname/. If you require a home directory and don't have one, send us an email. Otherwise, you can just use scp to copy directly into your NFSshare location. AFS is not available on the cluster at this time.
NOTE: Your data on the cluster is not backed up. Please be sure this is not the only place you've saved your data.
Communication
All of the inter-node communication on the cluster requires passwordless SSH. You can use this by switching to a special user account named cluster by typing su cluster. This account already has SSH key pairs distributed to every node on the cluster, which permits you to SSH anywhere on the cluster without a password and, more importantly, allows you to initialize the LAM/MPI environment mentioned next.

MPI / Parallel Jobs
If the LAM/MPI communications environment is not yet running, you can start it now by typing:
lamboot -v /NFSshare/lamhosts
if you'd like to use every processor in the cluster. If you will only be using a subset of the nodes, make a copy of that file and modify it as necessary.
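As a rough sketch of the copy-and-modify step, the helper below writes the first few host entries of a boot schema to a new file. The function name make_subset, the sample host names, and the assumption that the file lists one host per line (with optional # comment lines) are ours for illustration; check the actual contents of /NFSshare/lamhosts before relying on it.

```shell
#!/bin/sh
# Copy the first COUNT host entries of a LAM boot schema into a new file.
# Assumption: one host per line; blank lines and '#' comment lines are skipped.
make_subset() {
    src=$1
    count=$2
    dest=$3
    grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$src" | head -n "$count" > "$dest"
}

# Demonstration on a throwaway file with made-up contents, so the sketch
# runs anywhere. On the cluster, the source would be /NFSshare/lamhosts.
tmpdir=$(mktemp -d)
printf '%s\n' '# sample boot schema' 'hostA cpu=2' 'hostB cpu=2' 'hostC cpu=2' \
    > "$tmpdir/lamhosts"
make_subset "$tmpdir/lamhosts" 2 "$tmpdir/lamhosts.subset"
cat "$tmpdir/lamhosts.subset"
```

On the cluster, this would presumably be called as make_subset /NFSshare/lamhosts 4 /NFSshare/uniqname/lamhosts.subset, followed by lamboot -v on the new file.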

Your job is now ready to run. You can start your job by simply typing:
mpirun C /NFSshare/uniqname/YourApp [arguments]
This will start a job on the maximum number of CPUs available to you. We assume that you're familiar with the mpirun command; see the mpirun manpage if you'd like to use any of its other features.

Serial Jobs
There are two ways to submit serial jobs to the cluster: by using SSH or by using Xgrid.
SSH: Simply copy all of your files and all of your data into your /NFSshare/uniqname/ directory. Be sure the output directories in that same location are writable by the "cluster" user. Switch over to the cluster user by typing su cluster if you haven't already done so. You can then SSH to each of the compute nodes to start your job. The compute nodes are named "nodeN.lsa-cluster2.lsa.umich.edu" where N is an integer between 1 and 16 (inclusive). You may also wish to run your job on the head node itself.
Xgrid: A far simpler way to submit serial jobs to the cluster is by using Xgrid. Its primary advantage is that you need not SSH to the cluster at all: you can submit jobs and retrieve results entirely from your desktop Mac. See the Xgrid page for more information and suggestions.
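For the SSH method, a loop along the following lines could start one copy of a program on each compute node. It is only a sketch: UNIQNAME, APP, and the out.<host> output file names are placeholders of ours, and the loop prints the ssh commands rather than running them.

```shell
#!/bin/sh
# Placeholder values: replace with your own uniqname and program.
UNIQNAME="uniqname"
APP="/NFSshare/$UNIQNAME/YourApp"

# Print the fully qualified name of each compute node (node1 .. node16).
compute_nodes() {
    n=1
    while [ "$n" -le 16 ]; do
        echo "node$n.lsa-cluster2.lsa.umich.edu"
        n=$((n + 1))
    done
}

# Dry run: print one ssh command per node instead of executing it.
# To launch for real, remove the leading "echo" and run as the cluster user.
for host in $(compute_nodes); do
    echo ssh "$host" "nohup $APP > /NFSshare/$UNIQNAME/out.$host 2>&1 < /dev/null &"
done
```

Printing the commands first makes it easy to check the host names and output paths before anything is started on the nodes.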

Once your jobs are complete, be sure to copy any data or results you'd like to save to a location off the cluster, and send an email to lsa-cluster-requests@umich.edu to let us know that you are finished using the cluster.


Software

At this time the following additional software is available on this cluster:

Other software may be on the cluster and just missing from this list. If there is software that isn't present that you'd like, send us a request via email.

Note on binary data

Please be aware that the endianness, or byte order, on PowerPC (PPC) Macintosh systems, such as this cluster, is different from that of standard PCs (Mac G5 is big endian, PCs are little endian). This has no effect on text files or ASCII data, but can cause problems when transferring binary data to or from the cluster.
Some programs will allow you to set the endianness of binary data when opening a file (e.g. the machineformat switch in Matlab); alternatively you can use bitshifting to correct the byte order (an example in C is given here).
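
Before moving binary files, it can help to confirm the byte order of each machine involved. The sketch below is one way to check from a shell, assuming a POSIX od is available; the function name host_byte_order is ours. It writes the bytes 01 00 00 00 and asks od to read them back as a single native 4-byte integer, which comes out as 1 only on a little-endian machine.

```shell
#!/bin/sh
# Report how this machine orders the bytes of a 32-bit integer.
host_byte_order() {
    # Emit bytes 01 00 00 00 and let od interpret them in native byte order.
    value=$(printf '\001\000\000\000' | od -An -td4 | tr -d '[:space:]')
    if [ "$value" = "1" ]; then
        echo little
    else
        echo big
    fi
}

host_byte_order
```

On a PowerPC G5 node this should report big, and on a typical Intel-based PC little; if the two ends of a transfer disagree, binary data needs to be converted as described above.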


Resources

Apple Xserve Page
LAM/MPI (includes Tutorials)
HPC on Mac OS X
Dauger Research Services - Lots of good parallelization information and tutorials
U of Michigan Center For Advanced Computing (CAC)

Getting Help

To request assistance in using this cluster, please send an email to lsa-cluster-requests@umich.edu. We are currently not able to help with coding issues. Please see the resources above.
