v What Is a Cluster?

Before we talk about cluster computing, we need to define our terms. A cluster is a parallel computer that is constructed of commodity components and runs (as its system software) commodity software. A cluster is made up of nodes, each containing one or more processors, memory that is shared by all of the processors in (and only in) the node, and additional peripheral devices (such as disks), connected by a network that allows data to move between the nodes.

Nodes come in many flavors but are usually built from processors designed for the PC or desktop market. If a node contains more than one processor, it is called an SMP (symmetric multiprocessor) node.

Networks also come in many flavors. These range from very simple (and relatively low-performance) networks based on Ethernet to high-performance networks designed for clusters.

Clusters can also be divided into two types: do-it-yourself and prepackaged. A do-it-yourself cluster is assembled by the user out of commodity parts that are purchased separately. A prepackaged cluster (sometimes called a turnkey system) is assembled by a cluster vendor, either before or after shipping it to the customer's location. Which you choose depends on your budget, need for outside help, and facility with computer hardware. 

v Why Use a Cluster?

Why use a cluster instead of a single computer? There are really two reasons: performance and fault tolerance. The original reason for the development of Beowulf clusters was to provide cost-effective computing power for scientific applications, that is, to address the needs of applications that required greater performance than was available from single (commodity) processors or affordable multiprocessors. An application may desire more computational power for many reasons, but the following three are the most common:

·         Real-time constraints, that is, a requirement that the computation finish within a certain period of time. Weather forecasting is an example. Another is processing data produced by an experiment; the data must be processed (or stored) at least as fast as it is produced.

·         Throughput. A scientific or engineering simulation may require many computations. A cluster can provide the resources to process many related simulations. On the other hand, some single simulations require so much computing power that a single processor would require days or even years to complete the calculation. An example of using a Linux Beowulf cluster for throughput is Google , which uses over 15,000 commodity PCs with fault-tolerant software to provide a high-performance Web search service.

·         Memory. Some of the most challenging applications require huge amounts of data as part of the simulation. A cluster provides an effective way to provide even terabytes (1012 bytes) of program memory for an application.

Clusters provide the computational power through the use of parallel programming, a technique for coordinating the use of many processors for a single problem. What clusters are not good for is accelerating calculations that are neither memory intensive nor processing-power intensive or (in a way that will be made precise below) that require frequent communication between the processors in the cluster.

Another reason for using clusters is to provide fault tolerance, that is, to ensure that computational power is always available. Because clusters are assembled from many copies of the same or similar components, the failure of a single part only reduces the cluster's power. Thus, clusters are particularly good choices for environments that require guarantees of available processing power, such as Web servers and systems used for data collection.

We note that fault tolerance can be interpreted in several ways. For a Web server or data handling, the cluster can be considered up as long as enough processors and network capacity are available to meet the demand. A well-designed cluster can provide a virtual guarantee of availabilty, short of a disaster such as a fire that strikes the whole cluster. Such a cluster will have virtually 100% uptime. For scientific applications, the interpretation of uptime is often different. For clusters used for scientific applications, however, particularly ones used to provide adequate memory, uptime is measured relative to the minimum size of cluster (e.g., number of nodes) that allows the applications to run. In many cases, all or nearly all of the nodes in the cluster must be available to run these applications.

v Understanding Application Requirements

In order to know what applications are suitable for cluster computing and what tradeoffs are involved in designing a cluster, one needs to understand the requirements of applications.

·        Computational Requirements

The most obvious requirement (at least in scientific and technical applications) is the number of floating-point operations needed to perform the calculation. For simple calculations, estimating this number is relatively easy; even in more complex cases, a rough estimate is usually possible. Most communities have a large body of literature on the floating-point requirements of applications, and these results should be consulted first. Most textbooks on numerical analysis will give formulas for the number of floating-point operations required for many common operations. For example, the solution of a system of n linear equations; solved with the most common algorithms, takes 2n3/3 floating-point operations. Similar formulas hold for many common problems.

You might expect that by comparing the number of floating-point operations with the performance of the processor (in terms of peak operations per second), you can make a good estimate of the time to perform a computation. For example, on a 2 GHz processor, capable of 2 × 109 floating-point operations per second (2 GFLOPS), a computation that required 1 billion floating-point operations would take only half a second. However, this estimate ignores the large role that the performance of the memory system plays in the performance of the overall system. In many cases, the rate at which data can be delivered to the processor is a better measure of the achievable performance of application .

Thus, when considering the computational requirements, it is imperative to know what the expected achievable performance will be. In some cases this may be estimated by using standard benchmarks such as LINPACK and STREAM , but it is often best to run a representative sample of the application (or application mix) on a candidate processor. After all, one of the advantages of cluster computing is that the individual components, such as the processor nodes, are relatively inexpensive.

·        Memory

The memory needs of an application strongly affect both the performance of the application and the cost of the cluster. the memory on a compute node is divided into several major types. Main memory holds the entire problem and should be chosen to be large enough to contain all of the data needed by an application (distributed, of course, across all the nodes in the cluster). Cache memory is smaller but faster memory that is used to improve the performance of applications. Some applications will benefit more from cache memory than others; in some cases, application performance can be very sensitive to the size of cache memory. Virtual memory is memory that appears to be available to the application but is actually mapped so that some of it can be stored on disk; this greatly enlarges the available memory for an application for low monetary cost (disk space is cheap). Because disks are electromechanical devices, access to memory that is stored on disk is very slow. Hence, some high-performance clusters do not use virtual memory.

·        I/O

Results of computations must be placed into nonvolatile storage, such as a disk file. Parallel computing makes it possible to perform computations very quickly, leading to commensurate demands on the I/O system. Other applications, such as Web servers or data analysis clusters, need to serve up data previously stored on a file system.

The use of the network file system (NFS) to allow any node in a cluster to access any file. However, NFS provides neither high performance nor correct semantics for concurrent access to the same file . Fortunately, a number of high-performance parallel file systems exist for Linux.

v Other Requirements

A cluster may need other resources. For example, a cluster used as a highly-available and scalable Web server requires good external networking. A cluster used for visualization on a tiled display requires graphics cards and connections to the projectors. A cluster that is used as the primary computing resource requires access to an archival storage system to support backups and user-directed data archiving.

·  Parallelism

Parallel applications can be categorized in two major classes. One class is called embarassingly (or sometimes pleasingly) parallel. These applications are easily divided into smaller tasks that can be executed independently. One common example of this kind of parallel application is a parameter study, where a single program is presented with different initial inputs. Another example is a Web server, where each request is an independent request for information stored on the web server. These applications are easily ported to a cluster; a cluster provides an easily administered and fault-tolerant platform for executing such codes.The other major class of parallel applications comprise those that cannot be broken down into independent subtasks. Such applications must usually be written with explicit (programmer-specified) parallelism; in addition, their performance depends both on the performance of the individual compute nodes and on the network that allows those nodes to communicate. To understand whether an application can be run effectively on a cluster (or on any parallel machine), we must first quantify the node and communication performance of typical cluster components. The key terms are as follows:

·         latency: The minimum time to send a message from one process to another.

·         overhead: The time that the CPU must spend to perform the communication. (Often included as part of the latency.)

·         bandwidth: The rate at which data can be moved between processes

·         contention: The performance consequence of communication between different processes sharing some resource, such as network wires.

One way to think about the time to communicate data is to make the latency and bandwidth terms nondimensional in terms of the floating-point rate. For example, if we take a 2 GHz processor and typical choices of network for a Beowulf cluster, the ratio of latency to floating-point rate ranges from 10,000 to 200,000! What this tells us is that parallel programs for clusters must involve a significant amount of work between communication operatoins. Fortunately, many applications have this property.

The simple model is adequate for many uses. A slightly more sophisticated model, called logP , separates the overhead from the latency.Additional examples appear throughout this book. For example, the performance of a master/worker example that uses the Message-Passing Interface (MPI) as the programming model.

v Building and Using a Cluster

In this section we review the issues in configuring, building, and using a cluster and provide references to the corresponding organized around particular tasks, such as programming or managing a cluster.

·        Choosing a Cluster

When choosing the components for a cluster, or selecting from a prebuilt cluster, you must focus on the applications that will be run on the cluster. The following list covers some of the issues to consider.

1.     Understanding the needs of your application. Some of this has been covered above; you will find more on understanding the performance of applications.

2.     Decide the number and type of nodes. Based on the application needs, select a node type (e.g., uni-processor or SMP), processor type, and memory system. As described above, raw CPU clock rate is not always a good guide to performance, so make sure that you have a good understanding of the applications. Other issues to consider when choosing the processor type include whether you will run prebuilt applications that require a particular type of processor, whether you need 64-bit or 32-bit addressing, or whether your codes are integer or floating-point intensive.

3.     Decide on the network. Determine whether your applications require low latency and/or high bandwidth in the network. If not, for example, running a throughput cluster with embarassingly parallel applications, then simple fast Ethernet with low-cost switches may be adequate. Otherwise, you may need to invest in a high-performance cluster network. Note that the cost of a fast Ethernet network is very low while a high performance network can double the cost of a cluster.

4.     Determine the physical infrastructure needs. How much floor space, power, and cooling will you need. Is noise a factor?

5.     Determine the operating system (OS) that you will use. Since you bought this book, you have probably selected Linux The choice of cluster setup software may also influence which distribution of Linux you can use; In choosing the operating system, consider the following:

o    Do your applications run on the chosen system? Many applications and programming models run under many operating systems, including Windows, Linux, and other forms of Unix.

o    Do you have expertise with a particular operating system?

o    Are there license issues (cost of acquiring or leasing) software, including the operating system and compilers?

6.     Cost tradeoffs. The cost of a node is not linearly related to the performance of that node. The fastest nodes are more expensive per flop (and usually per MByte/sec of memory bandwidth) than are lower-cost nodes. The question is then: Should a cluster use the fastest available nodes regardless of cost, or should it use mid-range or even low-range nodes? The answer depends, as always, on your needs:

o    If price is no object, go with the fastest nodes. This approach will reduce the number of nodes needed for any given amount of computing power, and thus the amount of parallel overhead.

o    If total computing power over time is the goal, then go with mid- or low-end nodes, but replace them frequently (say, every 18 months to two years) with newer nodes. This strategy exploits the rapid advances in node performance; buying two low-end nodes every two years will often give you a greater amount of computing power (integrated over time) than spending the same amount every four years on a high-end node.

o    If a targeted amount of computing power (e.g., for a specific application) is the goal, then analyze the tradeoffs between a greater number of slower (but possibly much cheaper) nodes and a smaller number of faster but individually less cost-efficient nodes.

·        Setting Up a Cluster

Once you have specified your cluster, you need to assemble the components and set up the initial software. In the past few years, great strides have been made in simplifying the process of initializing the software environment on a cluster. At this point, you may wish to benchmark your cluster. Since such benchmarking will require running a parallel program, information on this topic is provided .Alternatively, you may prefer to run a prepackaged performance suite, such as the Beowulf Performance Suite (BPS), available at BPS contains both single node and parallel performance tests, including the following:

·         bonnie++: I/O (disk) performance;

·         Stream: Memory system performance;

·         netperf: General network performance;

·         netpipe: A more detailed network performance benchmark;

·         unixbench: General Unix benchmarks;

·         LMbench: Low-level benchmarks; NAS parallel benchmarks: A suite of parallel benchmarks derived from some important application.

v What Is Cluster Computing?

Defination :-In computers, clustering is the use of multiple computers, typically PCs or UNIX workstations, multiple storage devices, and redundant interconnections, to form what appears to users as a single highly available system. Cluster computing can be used for load balancing as well as for high availability. Advocates of clustering suggest that the approach can help an enterprise achieve 99.999 availability in some cases. One of the main ideas of cluster computing is that, to the outside world, the cluster appears to be a single system.

A common use of cluster computing is to load balance traffic on high-traffic Web sites. A Web page request is sent to a "manager" server, which then determines which of several identical or very similar Web servers to forward the request to for handling. Having a Web farm (as such a configuration is sometimes called) allows traffic to be handled more quickly.

Clustering has been available since the 1980s when it was used in DEC's VMS systems. IBM's sysplex is a cluster approach for a mainframe system. Microsoft, Sun Microsystems, and other leading hardware and software companies offer clustering packages that are said to offer scalability as well as availability. As traffic or availability assurance increases, all or some parts of the cluster can be increased in size or number.

Cluster computing can also be used as a relatively low-cost form of parallel processing for scientific and other applications that lend themselves to parallel operations. An early and well-known example was the Beowulf project in which a number of off-the-shelf PCs were used to form a cluster for scientific applications.

v What is a Beowulf Cluster?


Beowulf is a multi-computer architecture which can be used for parallel computations, server consolidation or computer room management. A Beowulf cluster is a computer system conforming to the Beowulf architecture, which consists of one master node and multiple compute nodes. The nodes,typically commodity off-the-shelf servers or server blades, are connected together via a switched Ethernet network or a dedicated high-speed interconnect network such as Infiniband. The nodes usually do not contain any custom hardware components and are trivially reproducible. The master node controls the entire cluster and serves parallel jobs and their required files to the compute nodes. The master node is the cluster’s administration console and its gateway to the outside world. It assumes full control of the compute nodes from software provisioning to monitoring to job execution. Compute nodes do not need keyboards, monitors or even a hard disk. Simply put, Beowulf is a technology of clustering Linux computers together to form a parallel, virtua supercomputer,

a beowulf cluster. While Linux-based beowulf clusters provide a cost-effective hardware alternative to the traditional supercomputers for high performance computing applications, the original software implementations for Linux beowulfs were not without their problems.


v How does Beowulf work?

At the simplest level, Beowulf is a collection of extensions to the standard Linux kernel. Thus, at its heart a Beowulf system is only slightly different from a network of UNIX machines that you may already have. A Network Of Workstations (NOW) using system wide logins, such as NIS, with users able to execute remote commands on different machines could in some sense be called a Beowulf system. Beowulf uses several distributed application programming environments to pass messages between the different computation nodes. The most commonly used are PVM (Parallel Virtual Machine) and MPI (Message Passing Interface) which take the form of library function calls that must be added into the code. These environments, especially PVM, are designed to be used on a heterogeneous system of workstations with apparently little in common and could be used in NOWs. There are however several subtle, yet significant differences between these systems and Beowulf systems. In the usual Beowulf scheme, as is common in NOWs, every node is responsible for

running its own copy of the operating system kernel, with nodes generally sovereign and autonomous at this level. However in order to present a more uniform image to both the applications and the users, the Beowulf extensions allow a loose ensemble of nodes to participate in a more coordinated way through a number of global namespaces. For example, normal UNIX processes ‘belong’ to the kernel running them and have a unique identifier within that context. In the Beowulf system it is possible to have a process ID which is unique across an entire cluster. A Beowulf cluster will typically have only one connection with the outside world, and will communicate via its own private intranet. This means that there is only a single point for users to log into. In the initial Beowulf machine, 100 Mbit Fast Ethernet was considered too expensive for the system intranet, so a technique was devised, named channel bonding, which joined multiple low cost networks into a single logical network with a higher bandwidth. Thus in the original Beowulf machine each node had two network cards. The current low cost of switched 100Mbit fast Ethernet and 1000Mbit Gigabit Ethernet has reduced the need for this channel bonding solution, but it can still be used for higher network traffic applications. NOW systems have historically only been successfully used in this manner at night when there is little other demand on the individual machines and the local network. A Beowulf machine is perceived to be a single user machine, running only one user application at a time. This gives uninterrupted processing time, spread evenly across all the nodes, to a computationally intensive job.


1 comment:

aswathi said...

I want the full details about the seminar topic beowulf clusters