LSA Mac OS X Cluster 2
About the Cluster
lsa-cluster2 is a Mac OS X cluster running 17 Apple Xserves. This cluster may be used for running parallel MPI jobs that are already known to work in an MPI environment. The hardware consists of the following:

| Nodes | Processors | Memory (ECC) | Local Disk (SATA) | Shared Disk | Front-side Bus |
|---|---|---|---|---|---|
| 1 Head Node | dual 2.3 GHz G5 | 8 GB | 80 GB | 1.4 TB (RAID5; NFS shared to all nodes) | 1.15 GHz |
| 16 Compute Nodes | dual 2.3 GHz G5 | 4 GB | 80 GB | | 1.15 GHz |
Performance
This cluster has been benchmarked using the High Performance Linpack (HPL) benchmark, the same benchmark used to rank the Top 500 supercomputers in the world. This cluster achieved 135.6 Gflops.
NetBoot
This cluster utilizes a very convenient feature of Mac OS X known as NetBoot. Each cluster node's operating system is loaded over the network when it starts up. This allows us to reconfigure the entire cluster to function completely differently if desired, simply by telling each node which configuration to use and rebooting it. In about a minute, each cluster node will be ready to use its new configuration.
Environmentals
The entire cluster is housed in an APC NetShelter VX 42U cabinet. The back door was replaced with an APC air removal door, which keeps the computers in the cluster from overheating and so avoids heat-related errors and crashes. The head node and switch are protected by an APC SmartUPS-3000 UPS to protect your data in the event of power problems, although any currently running jobs will need to be restarted after power returns. Everything in the cluster is plugged into metered power distribution units, which allow us to monitor the power consumption of each circuit and avoid current overdraws that might otherwise trip a circuit breaker. Additionally, each Xserve in the cluster contains over 44 sensors (24 power, 11 temperature, 8 fans, and more) which are checked every five minutes by multiple automated processes.
Setup
For those interested in how this cluster was set up, we are making our setup document available in PDF format.
Communications
Each node is connected by gigabit ethernet on a dedicated network using an Asante IntraCore 36480 switch.
The preferred communications protocol utilizes LAM/MPI. PVM is also available, but should only be used if MPI is not available. Xgrid is now available for submitting serial jobs from your desktop Mac.
The head node has connections to both the public internet (with a firewall in
place) and to the private cluster network. If your application requires it,
a VPN is available that will place your workstation on the cluster's
private network, but this generally isn't necessary.
Getting Started
Currently, this cluster does not utilize a scheduling system. Instead, the cluster is used by one person at a time. In the meantime, if you'd like to use this cluster, send email to lsa-cluster-requests@umich.edu with the following information:
- Which cluster (this page describes "lsa-cluster2")
- Your department
- A brief description of your project
- An estimate of how long you'll need to run your job
- How many nodes you'd like to use
After you connect, we recommend that you utilize the screen utility by simply typing screen (or screen -x if you're returning from a disconnected session). This provides numerous benefits:
- It will preserve your session if you should get disconnected, so you can just reconnect and resume where you left off
- It will create a convenience log of your session for your later review if desired
- It will permit screen sharing so other members of your team can simultaneously monitor the progress of your job or type commands as if everyone were sitting in front of the same computer.
Storage
Everyone has access to an NFS share that is mounted on every node in the cluster. You should keep all of your files in a subdirectory named by your uniqname. This UFS-formatted directory is located at:
/NFSshare/uniqname/
on every node. Note that the compute jobs themselves will not be running as yourself. They will run either as the "cluster" user (see below) or as "nobody" (if using Xgrid). In either case, you'll need to make sure that the permissions are appropriate on /NFSshare/uniqname/. The simplest approach is to just make your directory writable by all users:
chmod -R g+w,o+w /NFSshare/uniqname
You may or may not have had a home directory created at /Users/uniqname/. If you require a home directory and don't have one, send us an email. Otherwise, you can just use scp to copy directly into your NFSshare location. AFS is not available on the cluster at this time.
NOTE: Your data on the cluster is not backed up. Please be sure this is not the only place you've saved your data.
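The permission setup described above could be scripted roughly as follows. This is only a sketch: the SHARE_DIR variable, the open_share_dir function, and the "output" subdirectory are placeholder names of ours, not something that already exists on the cluster. Substitute your own uniqname in the path.

```shell
#!/bin/sh
# Sketch: make a share directory (and everything under it) writable by the
# "cluster" and "nobody" accounts that compute jobs run as.
# SHARE_DIR is a placeholder; replace "uniqname" with your own.
SHARE_DIR="${SHARE_DIR:-/NFSshare/uniqname}"

open_share_dir() {
    dir="$1"
    # Hypothetical "output" subdirectory for job results.
    mkdir -p "$dir/output" || return 1
    # The same permission change recommended in the text.
    chmod -R g+w,o+w "$dir"
}

# Only act if the directory is present (i.e. we are on the cluster).
if [ -d "$SHARE_DIR" ]; then
    open_share_dir "$SHARE_DIR"
    ls -ld "$SHARE_DIR" "$SHARE_DIR/output"
fi
```

Because the directory ends up writable by all users, anyone on the cluster can modify its contents, which is one more reason to keep a copy of your data elsewhere.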
Communication
All of the inter-node communication on the cluster requires the use of passwordless SSH. You can use this by switching to a special user account named cluster by typing
su cluster
This account already has SSH key pairs distributed
to every node on the cluster, which permits you to SSH anywhere on the
cluster without a password, and more importantly allows you to initialize
the LAM/MPI environment mentioned next.
MPI / Parallel Jobs
If the LAM/MPI communications environment is not yet running, you can start it now by typing:
lamboot -v /NFSshare/lamhosts
if you'd like to use every processor in the cluster. If you will only be using a subset of the
nodes, make a copy of that file and modify it as necessary.
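One possible way to make such a copy is sketched below. It assumes the hosts file uses the usual LAM boot schema layout of one host entry per line with "#" comments; we have not reproduced the actual contents of /NFSshare/lamhosts here. The subset_hosts function, the LAM_SRC and LAM_DEST variables, and the lamhosts.subset file name are hypothetical.

```shell
#!/bin/sh
# Sketch: copy the first COUNT host entries of a LAM hosts file to a new file.
# Assumes one host entry per line, with '#' starting a comment line.
# usage: subset_hosts SRC DEST COUNT
subset_hosts() {
    src="$1"; dest="$2"; count="$3"
    # Drop comment lines and blank lines, then keep the first COUNT entries.
    grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$src" | head -n "$count" > "$dest"
}

LAM_SRC="${LAM_SRC:-/NFSshare/lamhosts}"
LAM_DEST="${LAM_DEST:-/NFSshare/uniqname/lamhosts.subset}"   # placeholder path

# Only act if the cluster-wide hosts file is readable (i.e. we are on the cluster).
if [ -r "$LAM_SRC" ]; then
    subset_hosts "$LAM_SRC" "$LAM_DEST" 4 && cat "$LAM_DEST"
    # Then boot LAM on just those hosts:
    #   lamboot -v "$LAM_DEST"
fi
```

Taking the first few entries is only one way to choose nodes; edit the copy by hand if you need specific ones.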
Your job is now ready to run. You can start your job by simply typing:
mpirun C /NFSshare/uniqname/YourApp [arguments]
This will start a job running on the maximum number of CPUs available
to you. It is assumed that you're familiar with the mpirun command. Please see
the manpage for mpirun if you'd like to utilize any of its other features.
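If you run jobs often, the boot-then-run sequence above could be wrapped in a small helper like the sketch below. The run_parallel function is a hypothetical name of ours. It also assumes that the lamnodes command exits with an error when no LAM environment is running, so please verify that behavior on the cluster before relying on it.

```shell
#!/bin/sh
# Sketch: start LAM/MPI only if it is not already running, then launch a job.
# Run this as the "cluster" user.
# usage: run_parallel HOSTFILE APP [arguments]
run_parallel() {
    hosts="$1"; shift
    if ! command -v lamboot >/dev/null 2>&1; then
        echo "LAM/MPI tools not found; run this on the cluster" >&2
        return 127
    fi
    # Assumption: lamnodes fails when no LAM environment is running.
    if ! lamnodes >/dev/null 2>&1; then
        lamboot -v "$hosts" || return 1
    fi
    # "C" asks mpirun for one process on every available CPU.
    mpirun C "$@"
}

# Example (not executed here):
#   run_parallel /NFSshare/lamhosts /NFSshare/uniqname/YourApp [arguments]
```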
Serial Jobs
There are two ways to submit serial jobs to the cluster: by using SSH or by using Xgrid.
SSH: Simply copy all of your files and all of your data into your /NFSshare/uniqname/ directory. Be sure the output directories in that same location are writable by the username "cluster". Switch over to the cluster user by typing
su cluster
if you haven't already done so.
You can then SSH to each of the compute nodes to start your job. The
compute nodes are named "nodeN.lsa-cluster2.lsa.umich.edu" where
N is an integer between 1 and 16 (inclusive). You may also wish
to run your job on the head node itself.
Xgrid: A far simpler way to submit serial jobs to the cluster is by using Xgrid. Xgrid has many advantages, with the primary advantage being that you need not SSH to the cluster at all. Both job submission and retrieval of results can be done from your desktop Mac. See the Xgrid page for more information and suggestions.
Once your jobs are complete, be sure to copy any data or results you'd like to save to a location off the cluster, and send an email to lsa-cluster-requests@umich.edu notifying them that you are finished using the cluster.
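For the SSH approach described above, starting one serial job per compute node could look roughly like this sketch. The node names follow the pattern given in the text. The APP path, the output log location, and passing the node number as the job's argument are all placeholders of ours. By default the script only prints the commands it would run; set DRY_RUN=0 to actually run them.

```shell
#!/bin/sh
# Sketch: start one serial job on each of the 16 compute nodes over SSH.
# Run this as the "cluster" user so that passwordless SSH works.
APP="${APP:-/NFSshare/uniqname/YourApp}"        # placeholder application path
OUT="${OUT:-/NFSshare/uniqname/output}"         # hypothetical output directory
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands; 0 = run them

node_name() {
    # Hostname pattern from the text; N is between 1 and 16.
    echo "node$1.lsa-cluster2.lsa.umich.edu"
}

launch_all() {
    n=1
    while [ "$n" -le 16 ]; do
        host=$(node_name "$n")
        # nohup plus backgrounding lets the job outlive the SSH session.
        remote="nohup $APP $n > $OUT/run$n.log 2>&1 &"
        if [ "$DRY_RUN" = "1" ]; then
            echo "ssh -n $host \"$remote\""
        else
            # -n keeps ssh from reading stdin, so it returns promptly.
            ssh -n "$host" "$remote"
        fi
        n=$((n + 1))
    done
}

launch_all
```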
Software
At this time the following additional software is available on this cluster:
- g77
- gfortran
- Xcode, including gcc 4.0.1 and its related files.
- Matlab
- Mathematica
- Maple
Note on binary data
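Please be aware that the PowerPC G5 processors in this cluster are big-endian, so binary data written on a little-endian machine (such as an Intel-based PC) will have its bytes in the opposite order. Some programs will allow you to set the endianness of binary data when opening a file (e.g. the machineformat switch in Matlab); alternatively you can use bitshifting to correct the byte order (an example in C is given here).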
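To illustrate the bitshifting idea, here is a minimal sketch that reverses the byte order of a single 32-bit value using shell arithmetic. The swap32 function is a name of ours, and this is not the C example referred to above, although the same shift and mask operators apply in C.

```shell
#!/bin/sh
# Sketch: reverse the four bytes of a 32-bit unsigned value.
# usage: swap32 VALUE   (VALUE may be decimal or 0x-prefixed hex)
swap32() {
    echo $(( (($1 & 0xFF) << 24) | (($1 & 0xFF00) << 8) | (($1 >> 8) & 0xFF00) | (($1 >> 24) & 0xFF) ))
}

# Bytes 12 34 56 78 become 78 56 34 12.
printf '%08x\n' "$(swap32 0x12345678)"   # prints 78563412
```

Applying swap32 twice returns the original value, which makes for a quick sanity check.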
Resources
Apple Xserve Page
LAM/MPI (includes Tutorials)
HPC on Mac OS X
Dauger Research Services - Lots of good parallelization information and tutorials
U of Michigan Center For Advanced Computing (CAC)