CalHPC: A dedicated cluster support service for campus researchers

Publication Date: September 30, 2008
Expiration Date: September 30, 2011
Lucia Tsai, IST-Technical Account Management

Information Services and Technology (IST) is collaborating with the Lawrence Berkeley National Laboratory (LBNL) to offer campus researchers a comprehensive service for supporting Linux-based computing clusters hosted in the campus data center.

Within the last decade, the developments in clustering technology, the advances in high-speed networking, and the growing acceptance of open-source software have enabled scientists to deploy high-performance computing clusters at a fraction of the cost of traditional high-end supercomputers. Clusters built using inexpensive, commodity hardware and open-source Linux software have proven to be affordable, reliable, and scalable.

Using cluster-based systems, researchers are able to take advantage of parallel computing applications to perform time- or data-intensive simulations not previously possible. High-performance computing (HPC) is enabling researchers to address complex problems in nearly all disciplines — from science and engineering to humanities and education — that have a direct bearing on society and the quality of life.

For many researchers, the practicalities of obtaining and managing a cluster system can be challenging. Doing so requires expertise in computer architecture, operating systems, security, cluster technologies, and integration techniques. With myriad technologies and peripheral components available, determining a cost-effective and well-performing configuration can be difficult. Ongoing maintenance can prove time-consuming, and researchers may find themselves (or their graduate students) spending a disproportionate amount of time on system administration and user support. Having an appropriate site for hosting the cluster is also a consideration: large, multinode clusters need sufficient network bandwidth and security, along with significant cooling and electrical infrastructure, in order to operate.

Recognizing the need for an affordable cluster support service, LBNL developed a Scientific Cluster Support (SCS) program that has been running successfully since 2003. The service has proven effective and popular with researchers; the SCS group has implemented, and currently manages, 34 production clusters comprising more than 5,300 processors.

Through the CalHPC service, Berkeley faculty and researchers can now use the campus data center to host their clusters and draw upon the professional expertise of the LBNL SCS technical team for their cluster management needs. CalHPC provides best-practice services throughout the cluster life cycle to ensure that application needs are met and valuable research time is not squandered on operational issues.

Service features

The standard implementation model requires a minimum of 10 compute nodes, using either Intel or AMD x86-64 processors, configured in a standard Beowulf architecture, and running the Red Hat Linux or CentOS operating system (OS). Supported clusters are hosted in the campus data center.
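As an informal illustration, the requirements of the standard implementation model can be written down as a small checklist. The Python sketch below is not a CalHPC tool; the function name `model_issues` and its parameters are hypothetical, and it checks only the node count, processor type, and operating system named above.

```python
# Requirements taken from the standard implementation model described above.
MIN_COMPUTE_NODES = 10
SUPPORTED_CPU_VENDORS = {"Intel", "AMD"}
SUPPORTED_ARCHITECTURE = "x86-64"
SUPPORTED_OPERATING_SYSTEMS = {"Red Hat Linux", "CentOS"}


def model_issues(compute_nodes, cpu_vendor, architecture, os_name):
    """Return a list of ways a planned cluster departs from the standard model.

    Hypothetical helper for illustration only; an empty list means the
    cluster matches the requirements checked here.
    """
    issues = []
    if compute_nodes < MIN_COMPUTE_NODES:
        issues.append(f"needs at least {MIN_COMPUTE_NODES} compute nodes")
    if cpu_vendor not in SUPPORTED_CPU_VENDORS or architecture != SUPPORTED_ARCHITECTURE:
        issues.append("processors must be Intel or AMD x86-64")
    if os_name not in SUPPORTED_OPERATING_SYSTEMS:
        issues.append("operating system must be Red Hat Linux or CentOS")
    return issues


# Example: an 8-node AMD cluster running CentOS falls short on node count only.
print(model_issues(8, "AMD", "x86-64", "CentOS"))
```

A cluster that fails one of these checks is not necessarily excluded, since existing clusters may be adapted to the model in consultation with IST and SCS.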

Researchers using the CalHPC service are expected to purchase the hardware for their clusters. To facilitate this, IST and SCS will work with researchers to ascertain their application and processing requirements in order to determine the hardware architecture and configuration best suited to their applications. IST and SCS will also work with researchers to adapt their existing clusters to meet our implementation model.

Listed below are some key features of the CalHPC service.

  • Pre-purchase consulting — identifying the hardware architecture, interconnects, and software components.
  • Procurement assistance — developing a budget and creating the equipment RFP.
  • Cluster installation and integration — installing and configuring cluster hardware; setting up the network; installing the cluster software, scheduler, and applications software; assisting with running, debugging, and testing user application code on the cluster.
  • Ongoing systems administration and cyber security — including OS and cluster software maintenance and upgrades, security updates, resource management and job scheduler support, monitoring, account setup, two-factor authentication, compliance with campus Minimum Security Standards, physical replacement of faulty hardware, and crash recovery.
  • Data center colocation and network infrastructure — hosting in the campus data center, a secure, state-of-the-art facility equipped with the infrastructure necessary to support the networking, power, and cooling requirements of HPC clusters.

Optional IST Storage and Backup services are also available.

Eligibility

The service is open to all UC Berkeley researchers planning to purchase new high-performance computing clusters, as well as those with existing clusters. Clusters must be used for campus-administered research.

Cost

Cost to campus researchers will vary depending on the size and complexity of the cluster. Below are the rates for standard components.

Feature | Component | Cost
Data center colocation | Full rack (40 rack units) | $3,648 per year (per rack)
Data center colocation | Network connection (1 GB) | $301 one-time fee; $160 per year (per connection)
Cluster management | Master node | $5,583 per year
Cluster management | Compute node | $186 per year (per node)

Additional ongoing costs apply for managing high-speed interconnects (e.g., Infiniband, Myrinet), RAID storage arrays, and other peripheral equipment. One-time setup costs apply for requirements definition, installation, and configuration.

In an effort to keep costs commensurate with pricing for LBNL researchers, campus customers may benefit from a subsidy to offset some portion of the cluster management charges. The subsidy is made possible by funds administered by the Office of the CIO and is subject to availability.
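As a rough illustration of how the standard rates listed above add up, the following Python sketch totals the one-time and annual charges for a given configuration. The function name `estimate_standard_costs` and its default of one rack, one network connection, and one master node are assumptions made for the example. It covers only the standard components; interconnect management, storage, setup fees, and any subsidy are not modeled, so actual charges will differ.

```python
# Standard component rates from the table above, in US dollars.
RACK_PER_YEAR = 3648
NETWORK_ONE_TIME_FEE = 301
NETWORK_PER_YEAR = 160
MASTER_NODE_PER_YEAR = 5583
COMPUTE_NODE_PER_YEAR = 186


def estimate_standard_costs(compute_nodes, racks=1, connections=1, master_nodes=1):
    """Return (one_time, annual) cost for the standard components only.

    Hypothetical helper for illustration; the defaults are assumptions,
    not part of the published rate schedule.
    """
    one_time = connections * NETWORK_ONE_TIME_FEE
    annual = (
        racks * RACK_PER_YEAR
        + connections * NETWORK_PER_YEAR
        + master_nodes * MASTER_NODE_PER_YEAR
        + compute_nodes * COMPUTE_NODE_PER_YEAR
    )
    return one_time, annual


# Example: the minimum supported cluster of 10 compute nodes.
one_time, annual = estimate_standard_costs(compute_nodes=10)
print(f"One-time: ${one_time:,}  Annual: ${annual:,}")
# prints "One-time: $301  Annual: $11,251"
```

Under these assumptions, a minimum-size cluster in a single rack comes to $301 in one-time fees and $11,251 per year before any additional charges or subsidy.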

Who to contact for more information

If you are interested in learning more about the CalHPC service, or would like to set up a consultation, contact the Technical Account Management (TAM) group,