Cluster Server FAQ: Overview of Microsoft Cluster Server
Last updated on May 4, 1999
What is a server "cluster"?
A server cluster is a group of independent servers managed as a single system for higher availability, easier manageability, and greater scalability.
What does it take to create a server cluster?
The minimum requirements for a server cluster are (a) two servers connected by a network, (b) a method for each server to access the other's disk data, and (c) special cluster software like Microsoft® Cluster Server (MSCS). The special software provides services such as failure detection, recovery, and the ability to manage the servers as a single system.
What are the benefits of server clustering?
There are three primary benefits to server clustering: improved availability, easier manageability, and more cost-effective scalability. Using Microsoft Cluster Server as an example:
What are clusters used for?
Customer surveys indicate that MSCS clusters will be used as highly available multipurpose platforms, mirroring the current uses of the Microsoft Windows NT® Server operating system. Surveyed customers suggested that the most common uses of MSCS clusters will be mission-critical database management, file/intranet data sharing, messaging, and general business applications.
When a cluster is recovering from a server failure, how does the surviving server get access to the failed server's disk data?
There are basically three techniques that clusters use to make disk data available to more than one server:
What is "Wolfpack"?
"Wolfpack" was the code name for Microsoft Cluster Server.
What is Microsoft Cluster Server (MSCS)?
MSCS is a built-in feature of Windows NT Server, Enterprise Edition. It is software that supports the connection of two servers into a "cluster" for higher availability and easier manageability of data and applications. MSCS can automatically detect and recover from server or application failures. It can be used to move server workload to balance utilization and to provide for planned maintenance without downtime. And, over time, MSCS will also become a platform for highly scalable, cluster-aware applications.
How many servers can be in an MSCS cluster?
The initial release of MSCS supports clusters with two servers. A future
version referred to as MSCS "Phase 2" will support larger clusters, and will
include enhanced services to simplify the creation of highly scalable,
cluster-aware applications.
When will MSCS be available?
MSCS is shipping as part of Windows NT Server 4.0, Enterprise Edition.
Significant enhancements to Windows NT Server, Enterprise Edition are planned for Windows 2000 Server Enterprise Edition including the following key improvements:
What other companies were involved in the development of MSCS?
Microsoft worked closely with leading hardware vendors, software vendors, and customers in the specification and development of MSCS and its API. These other companies participated through five different programs:
In what languages will MSCS be available?
Microsoft Windows NT Server, Enterprise Edition 4.0, which includes MSCS
1.0, is available in English, French, German, Japanese, and Spanish.
Through what channels is Windows NT Server, Enterprise Edition available?
Microsoft Windows NT Server, Enterprise Edition is available to
customers through all standard channels: reseller, retail, OEM, and the
Microsoft Select licensing program.
What versions of Windows NT Server does MSCS support?
MSCS software is only available as a built-in feature of Windows NT
Server 4.0, Enterprise Edition.
Will MSCS be extended beyond Windows NT Server to Windows NT Workstation?
There is currently no plan to extend cluster support to Windows NT
Workstation. MSCS software has been designed and written to closely
integrate with the architecture and features of Windows NT Server, including
its server-oriented networking and directory services capabilities.
What clients can connect to an MSCS cluster?
Any client that can connect to Windows NT Server through TCP/IP will work
with MSCS. This includes Microsoft MS-DOS®, Microsoft Windows® 3.x, Windows
95, Windows NT, Apple Macintosh, and UNIX. MSCS does not require any special
software on the client for transparent recovery of services that connect to
clients through standard IP protocols.
How does MSCS provide high availability?
MSCS uses software "heartbeats" to detect failed applications or servers. In the event of a server failure, it employs a "shared nothing" clustering architecture that automatically transfers ownership of resources (such as disk drives and IP addresses) from a failed server to a surviving server. It then restarts the failed server's workload on the surviving server. All of this, from detection to restart, typically takes under a minute. If an individual application fails (but the server does not), MSCS will typically try to restart the application on the same server; if that fails, it moves the application's resources and restarts it on the other server. The cluster administrator can use a graphical console to set various recovery policies, such as dependencies between applications, whether or not to restart an application on the same server, and whether or not to automatically "failback" (rebalance) workloads when a failed server comes back online.
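The detection and recovery sequence described above can be illustrated with a small sketch. It is only a toy model under stated assumptions: the `Node` class, the 5-second timeout, and the function names are hypothetical and are not part of MSCS or its API.

```python
from dataclasses import dataclass, field

HEARTBEAT_TIMEOUT = 5.0  # seconds; hypothetical value, not an MSCS default


@dataclass
class Node:
    name: str
    last_heartbeat: float = 0.0                  # time the last heartbeat arrived
    resources: set = field(default_factory=set)  # e.g. disks, IP addresses


def is_alive(node, now, timeout=HEARTBEAT_TIMEOUT):
    """A node counts as alive if its last heartbeat is recent enough."""
    return now - node.last_heartbeat <= timeout


def fail_over(failed, survivor):
    """Transfer ownership of every resource from the failed node."""
    moved = sorted(failed.resources)
    survivor.resources |= failed.resources
    failed.resources.clear()
    return moved


def monitor(nodes, now):
    """Move the resources of any silent node to the first live node."""
    alive = [n for n in nodes if is_alive(n, now)]
    dead = [n for n in nodes if not is_alive(n, now)]
    if alive:
        for node in dead:
            fail_over(node, alive[0])
    return [n.name for n in alive]


def recover_application(restart_locally, max_local_restarts=1):
    """Try a restart on the same server first; otherwise move the application."""
    for _ in range(max_local_restarts):
        if restart_locally():
            return "restarted on same server"
    return "moved to other server"
```

In this sketch each resource has exactly one owner at any time, which is the essence of the "shared nothing" model: failover changes the owner rather than sharing the resource.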
Can MSCS provide "zero downtime"?
No. MSCS can dramatically reduce planned and unplanned downtime. However, even with MSCS, a server could still experience downtime from the following events:
Microsoft recommends that clusters be used as one element in customers' overall programs to provide high integrity and high availability for their mission-critical server-based data and applications.
Is MSCS failover transparent to users?
MSCS does not require any special software on client computers, so the user experience during failover depends on the nature of the client side of their client-server application. Client reconnection is often transparent, because MSCS has restarted the applications, file shares, and so on, at exactly the same IP address.
If a client is using "state-less" connections such as a standard browser connection, then it would be unaware of a failover if it occurred between server requests. If a failure occurs while a client is connected to the failed resource, then the client will receive whatever standard notification is provided by the client side of the application in use when the server side becomes unavailable. This might be, for example, the standard "Abort, Retry, or Cancel?" prompt you get when using Windows Explorer to download a file at the time a server or network goes down. In this case, client reconnection is not automatic (the user must choose "Retry"), but the user is fully informed of what's happening and has a simple, well-understood method of reestablishing contact with the server. Of course, in the meantime, MSCS is busily restarting the service or application so that, when the user chooses "Retry," it reappears as if it never went away.
For client-side applications that have "state-full" connections to the server, a new logon is typically required following a server failure. In many cases, this approach is required for security purposes. For example, this is how SAP R/3 works: if the server connection is lost, the user is prompted to log on again to make sure it's the same user accessing the application.
Even with state-full connections, it's possible for an application to automatically reconnect following a failover. For example, when Microsoft demonstrated SAP R/3 failover at Microsoft Scalability Day in New York City on May 20, it was accessed through an Active browser application that had automatically (and securely) cached the user's ID and password from the initial logon. Thus, when the server connection was momentarily lost during the failover demo, the client application automatically logged on again using the cached ID and password. This was done using standard IP connections, running a simple Microsoft Visual Basic® development system program within an HTML document through the Microsoft ActiveX® technology.
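The automatic-reconnect behavior described above amounts to a retry loop that logs on again with cached credentials. The following is a minimal sketch of that idea, not the code used in the demonstration; `call_with_relogon`, `request`, and `logon` are hypothetical names supplied by the caller.

```python
import time


def call_with_relogon(request, logon, credentials, retries=3, delay=1.0):
    """Run request(session); on a lost connection, log on again and retry.

    `logon` and `request` are caller-supplied callables (hypothetical).
    `credentials` stands for the cached ID and password.
    """
    session = logon(credentials)
    for attempt in range(retries + 1):
        try:
            return request(session)
        except ConnectionError:
            if attempt == retries:
                raise                      # give up after the last retry
            time.sleep(delay)              # give the cluster time to fail over
            session = logon(credentials)   # new logon with the cached credentials
```

Because the service comes back at the same IP address, the client only needs to wait and log on again; it does not need to know which physical server now hosts the application.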
When a server comes back online following a failure, is there any human intervention required to get it back "up and running," or is the heartbeat enough for the other server to include it once again?
No manual intervention is required. When a server running Microsoft Cluster Server, say "Server A," boots, it starts the MSCS service automatically. MSCS in turn checks the interconnect (and network if necessary) to find the other server in its cluster, say "Server B." If Server A finds Server B, then Server A rejoins the cluster and Server B updates it with current cluster status info. Server A then initiates "failback," moving back failed-over workload from Server B to Server A at an appropriate time.
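The rejoin-and-failback decision can be sketched as follows. This is an illustrative model only: the `Group` fields (a preferred server, a failback flag, and an allowed window of hours standing in for the "appropriate time") are assumptions made for the example, not MSCS data structures.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Group:
    name: str
    owner: str                                        # server currently hosting the group
    preferred_owner: Optional[str] = None             # server the group "prefers"
    allow_failback: bool = False
    failback_hours: Optional[Tuple[int, int]] = None  # (start, end) hour; None = any time


def in_window(hour, window):
    """True if `hour` (0-23) lies in the window; windows may wrap past midnight."""
    if window is None:
        return True
    start, end = window
    if start <= end:
        return start <= hour < end
    return hour >= start or hour < end


def failback(groups, rejoined, hour):
    """Move eligible groups back to the server that just rejoined the cluster."""
    moved = []
    for group in groups:
        if (group.allow_failback
                and group.preferred_owner == rejoined
                and group.owner != rejoined
                and in_window(hour, group.failback_hours)):
            group.owner = rejoined
            moved.append(group.name)
    return moved
```

A group with failback disabled, or one whose window is not yet open, simply stays on the surviving server until the administrator or a later check moves it.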
What is "failback," and how does it work in MSCS?
"Failback" is the ability to automatically rebalance the workload in a cluster when a failed server comes back online. This is a standard feature of MSCS. For example, say "Server A" has crashed and its workload failed over to "Server B." When Server A reboots, it automatically finds Server B and rejoins the cluster. It then checks to see if any of the cluster groups running on Server B would "prefer" to be running on Server A. If so, it automatically moves those groups from Server B to Server A as soon as the time is right. Failback properties (that is, which groups can fail back, which is their preferred server, and during what hours the time is "right" for failback) are all set from the cluster administration console.
Can the servers in an MSCS cluster be located at separate locations for recovery from site disasters?
Not at this time. All of the cluster configurations currently being considered for validation use SCSI connections to storage resources, which limits the distance between clustered servers to the distance supported by standard SCSI. This is typically no more than 25 meters, though there are SCSI extender technologies that can potentially stretch the connection up to 1,000 meters.
Note that Windows NT Server customers already have several choices for software that can mirror data to remote disaster recovery sites, including solutions from N.S.I., Octopus, Veritas, and Vinca. Most of these vendors have already announced that their disaster site mirroring solutions will also work with MSCS clusters.
Can MSCS restore registry keys for an application from one server to the other when doing failover?
Yes. Recovery of an application's registry information is a configurable feature that is available to the Generic Application and Generic Service resource types. Basically, you tell it what registry keys to log and recover, and that's all there is to it. This capability should be used if the application or service stores volatile information in specific registry keys. If this is done, when the resource comes online on another node, it will have the same registry information as the previously online resource.
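As a rough illustration of the idea, the sketch below models each server's registry as a plain dictionary and copies only the configured keys. The key names and functions are hypothetical; this is not how MSCS stores or transfers its checkpoints.

```python
def checkpoint(registry, tracked_keys):
    """Save the current values of the keys configured for recovery."""
    return {key: registry[key] for key in tracked_keys if key in registry}


def restore(registry, snapshot):
    """Apply a saved snapshot to the registry of the server taking over."""
    registry.update(snapshot)


# Hypothetical example: only the tracked key follows the resource.
server_a = {r"Software\MyApp\Port": 8080, r"Software\Other\Cache": "tmp"}
server_b = {}
restore(server_b, checkpoint(server_a, [r"Software\MyApp\Port"]))
```

Keys that were not configured for recovery stay local to the original server, which is why the administrator must list every key in which the application stores volatile information.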
When an application restarts on another server following a failure, does it re-start from a copy of the application?
No. The new server (say, "Server 2") would start the application from the same physical disks as Server 1, since ownership of the application's disks on the shared SCSI bus had been moved from Server 1 to Server 2 as one of the first steps in the failover process. This approach assures that the application always restarts from its last known state, as recorded on its disk drives (and, if you use the available option, as recorded in its registry keys).
Can MSCS restore an application's "state" at the time of its failure rather than requiring a complete restart?
MSCS can restore the state of an application's registry keys, but any other state information must be managed and restored by the application. Applications need to provide some model for persistence to ensure that state can be recaptured. For example, Microsoft SQL Server uses transaction logs to provide this assurance. If a server running Microsoft SQL Server crashes, upon restart the application uses its transaction logs to bring the database back to a known state. With a cluster, just as with a single server, good application design and the use of ACID (Atomic, Consistent, Isolated, and Durable) transaction properties are important.
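To make the role of a transaction log concrete, here is a deliberately simplified sketch of log replay after a restart. It is not SQL Server's recovery algorithm; the record format (`"write"` and `"commit"` tuples) is an assumption made for the example.

```python
def recover(log):
    """Rebuild state from a log, keeping only writes of committed transactions.

    Each record is ("write", txid, key, value) or ("commit", txid).
    """
    committed = {record[1] for record in log if record[0] == "commit"}
    state = {}
    for record in log:
        if record[0] == "write" and record[1] in committed:
            _, _txid, key, value = record
            state[key] = value
    return state


# Transaction 2 never committed before the crash, so its writes are ignored.
crash_log = [
    ("write", 1, "x", 10),
    ("commit", 1),
    ("write", 2, "x", 99),
    ("write", 2, "y", 5),
]
```

Because the log lives on disk that moves with the application during failover, the surviving server can run the same recovery step that a single server would run after a reboot.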
What is the granularity of resource failover?
MSCS supports failover of "virtual servers," which usually correspond to applications, Web sites, print queues, or file shares (including their disk spindles, files, IP addresses, and so on). MSCS also provides cluster-wide services that are simultaneously available on all servers in the cluster, including cluster administration, performance monitoring, event viewing, a cluster name, and cluster time synchronization.
What is a "quorum disk" and how does it help MSCS provide high availability?
It's a disk spindle that MSCS uses to determine whether or not another server is up or down. Technically, it's a resource that can only be owned by one server at a time, and for which servers can negotiate for ownership. Negotiating for the quorum drive allows MSCS to avoid "split brain" situations where both servers are active and think the other server is down. (This can happen when, for example, the cluster interconnect is lost and network response time is problematic.) The use of a quorum resource is one of the sophisticated algorithms that Microsoft got by working with pioneers in clustering such as Digital and Tandem.
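The arbitration idea can be sketched in a few lines. This toy model captures only the rule that the quorum resource has a single owner at a time; the class and function names are hypothetical, and the real negotiation over the shared bus is more involved.

```python
import threading


class QuorumResource:
    """A resource that at most one server can own at any moment."""

    def __init__(self):
        self._lock = threading.Lock()
        self.owner = None

    def try_acquire(self, node):
        """Return True if `node` owns the resource after the call."""
        with self._lock:
            if self.owner is None or self.owner == node:
                self.owner = node
                return True
            return False

    def release(self, node):
        """Give up ownership, but only if `node` is the current owner."""
        with self._lock:
            if self.owner == node:
                self.owner = None


def on_interconnect_lost(node, quorum):
    """Only the server that wins the quorum resource keeps running the workload."""
    return "stay active" if quorum.try_acquire(node) else "stand down"
```

If both servers lose contact with each other at once, exactly one of them wins the resource, so the two never run the same workload simultaneously.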
How does MSCS improve the manageability of servers?
MSCS gives administrators a graphical console from which they can monitor and manage all of the resources in a cluster as if it were a single system. Using the familiar standards of a Microsoft Windows graphical user interface, an administrator can use the cluster console to:
The ability to graphically move workload from one server to another with only a momentary pause in service (typically less than a minute) means administrators can easily unload servers for planned maintenance without taking important data and applications offline for long periods of time.
Does MSCS provide administrators with a "single system image"?
Yes. MSCS provides administrators a single graphical console to manage all of the applications and resources in a cluster. The MSCS console presents cluster resources by physical server, and by "virtual server" (or "cluster group"). This allows administrators to centrally manage the cluster as a collection of virtual application-oriented servers, or as a collection of physical resources when appropriate.
Can MSCS be remotely managed?
Yes. An authorized user can run the MSCS administration console from any Windows NT Workstation or Windows NT Server on the network. In the version of MSCS accompanying Windows 2000 Server, Enterprise Edition, the cluster administration console will be a "snap-in" to the Microsoft Management Console, providing scriptable, remoteable access, including access through Internet protocols from a browser.
How does MSCS help administrators do "rolling upgrades" of their servers?
With MSCS, server administrators no longer have to do all their maintenance within those rare windows of opportunity when no users are online. Instead, they can simply wait until a convenient off-peak time when one of the servers in the cluster has enough horsepower for all of the cluster workload. They then point-and-click to move all the workload onto one server, and they're ready to perform maintenance on the unloaded server. Once the maintenance is complete and tested, they bring that server back online and it automatically rejoins the cluster, ready for work. When convenient, the administrator repeats the process to perform maintenance on the other server in the cluster. This ability to keep applications and data online while performing server maintenance is often referred to as doing "rolling upgrades" to your servers.
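The procedure above can be written down as a short loop. This is a sketch for a two-server cluster under simplifying assumptions: the cluster is modeled as a dictionary of server name to workload list, and `upgrade` is a hypothetical callback that performs the maintenance.

```python
def rolling_upgrade(cluster, upgrade):
    """Upgrade each server in turn while the other one carries all workload.

    `cluster` maps a server name to the list of groups it hosts.
    `upgrade` is called with the name of the unloaded server.
    """
    original = {node: list(groups) for node, groups in cluster.items()}
    nodes = list(cluster)
    for node in nodes:
        partner = next(n for n in nodes if n != node)
        cluster[partner].extend(cluster[node])  # move all workload to the partner
        cluster[node] = []                      # this server is now unloaded
        upgrade(node)                           # perform and test the maintenance
    for node in nodes:
        cluster[node] = list(original[node])    # rebalance to the original layout
```

At every step all groups are hosted by some server, which is what keeps applications and data online during the maintenance.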
Will Microsoft support "rolling upgrades" of future server products using MSCS clusters?
It is Microsoft's goal to support "rolling upgrades" between releases of Microsoft server software using MSCS clusters. However, we cannot commit to this for all releases of all products. Persistent storage formats must occasionally change to accommodate new capabilities, and changes in persistent storage occasionally require applications to be taken offline while storage or indices are restructured. Microsoft will commit to always providing smooth upgrades between releases of all our products, and we'll use MSCS to provide seamless rolling upgrades whenever possible.
How will MSCS enhance server scalability?
The manageability benefits of Windows NT Server, Enterprise Edition 4.0 simplify many of the processes currently used to improve scalability, such as upgrading server hardware and installing new versions of applications. A post-Windows 2000 Server, Enterprise Edition version of MSCS will support clusters containing larger numbers of servers, and will provide enhanced abilities that simplify the creation of highly scalable, cluster-aware applications.
Today, however, there are significant scalability advantages to clustering. For example:
The Microsoft cluster strategy White Paper said MSCS is already architected for multiple nodes. Has MSCS been tested on multinode clusters? If so, why is Microsoft waiting to deliver multinode support?
Yes, Microsoft and other vendors have tested MSCS clusters with more than two servers. These clusters "work" in that they are stable and the administrator's console provides basic management for the multiserver environment. However, the algorithms and features in the current software must be extended and thoroughly tested on larger clusters before customers can reliably use a multinode MSCS cluster for production work, or gain enhanced cluster benefits. In addition, Microsoft will have to extend the cluster hardware validation procedures to accommodate the additional requirements of multinode clusters.
Microsoft has architected MSCS for multinode support in preparation for the coming "Phase 2" version. Today's multinode tests have proven the architecture is correct. However, there are two key reasons Microsoft is limiting the initial release to two-server clusters:
Is it possible to add hard drives to an MSCS cluster without rebooting?
It depends on whether the drive cabinet supports this, since Windows NT will not do so until the Windows 2000 release. There are examples of RAID cabinets validated for Windows NT that support changing volumes on the fly (with RAID parity).
How will MSCS help do load balancing?
"Load balancing" is the ability to move work from a very busy server to a less-busy server. MSCS will support load balancing in four ways over time:
Should cluster-aware applications developed for MSCS use a shared-disk or shared-nothing architecture for greatest scalability?
Microsoft recommends a shared-nothing architecture for cluster-aware applications because of its greater scalability potential. With shared-disk applications, copies of the application running on two or more servers in the cluster share concurrent read/write access to a single set of disk files, mediating ownership of the files using a "distributed lock manager" (DLM). A shared-nothing application, on the other hand, avoids the potential bottleneck of shared resources and a DLM by partitioning or replicating the data so that each server in the cluster works primarily with its own data and disk resources. In theory, MSCS can support either type of application. However, Microsoft has no plans at this time to include a DLM in the MSCS cluster services, so vendors would have to develop or license a DLM to implement a shared-disk application on MSCS. Microsoft has chosen to use the shared-nothing architecture for future versions of Microsoft BackOffice® family applications because of that architecture's greater potential for cluster-enabled scalability.
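One common way to build a shared-nothing application is to partition the data so that every key has exactly one owning server. The sketch below shows hash partitioning as an example of that approach; it is an illustration chosen for this FAQ, not the design of any Microsoft product, and the function names are hypothetical.

```python
import zlib


def owner_of(key, servers):
    """Map a key to exactly one server using a stable hash (CRC32)."""
    return servers[zlib.crc32(key.encode("utf-8")) % len(servers)]


def partition(keys, servers):
    """Group keys by the server that owns them."""
    placement = {server: [] for server in servers}
    for key in keys:
        placement[owner_of(key, servers)].append(key)
    return placement
```

Since each key belongs to one server only, no two servers ever write the same data concurrently, so no distributed lock manager is needed to mediate access.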
Will MSCS ever have a Distributed Lock Manager (DLM)?
Microsoft will not include a distributed lock manager in the first release of MSCS. Enhancements in future releases will be determined based on customer requirements.
When will Microsoft offer a parallel version of Microsoft SQL Server that runs on multiple servers at the same time for automatic load balancing and scalability?
The next major release after Microsoft SQL Server 7.0 is planned to offer cluster-enabled scalability on MSCS clusters. It will use a scalable "shared nothing" architecture to spread a single database across multiple servers. A White Paper on the strategy for Microsoft SQL Server on clusters can be downloaded from http://www.microsoft.com/sql.
Although this is an important direction for Microsoft SQL Server, it must be kept in perspective: it will be needed by only a small percentage of customers. Cluster-enabled scalability will be needed only by extremely large enterprise applications that (a) are too large to run on a single high-end SMP server (for example, eight-processor SMP with 4 GB of RAM), and (b) cannot be partitioned to run on a distributed network using MTS.
What are Microsoft's plans for supporting Distributed Message Passing (DMP)?
Distributed Message Passing is one of the intracluster communications techniques that are planned for Phase 2 of MSCS. (Another is I/O shipping.) Applications will be able to access MSCS DMP services through extensions to the Cluster API. MSCS in turn will host the DMP services over a variety of interconnect technologies including new low-latency drivers based on the Virtual Interface (VI) architecture. The result will be a standard infrastructure for supporting a new generation of scalable, cluster-aware applications.
What types of applications and services will benefit from MSCS clustering?
There are three types of server applications that will benefit from MSCS clusters:
What software vendors will offer cluster-aware applications for MSCS?
Software vendors that have already announced plans to offer products for MSCS clusters include Baan, Cheyenne, Computer Associates (CA/Unicenter TNG), HP (ClusterView), IBM (DB2), NetIQ, Octopus, Oracle (Oracle 7 Failsafe), SAP, Vinca, and, of course, Microsoft (Microsoft SQL Server, Enterprise Edition, and Exchange Server, Enterprise Edition). For an up-to-date list of announced products that support MSCS, refer to the Microsoft Windows NT Server, Enterprise Edition Solutions Directory.
Will Microsoft validate or logo software products that work with MSCS?
Microsoft will not have a validation program for MSCS-based software products at first. It is expected that once MSCS clusters are deployed in volume and there are sufficient examples of cluster-aware application products to evaluate, Microsoft will extend its Microsoft BackOffice logo program to include, at a minimum, validation of support for basic failover operation on an MSCS cluster.
What are Microsoft's plans for supporting Microsoft SQL Server on MSCS clusters?
Microsoft SQL Server, Enterprise Edition version 6.5 is available now and
provides "active/active" cluster support (that is, both servers can be
running SQL Server, with each server supporting its own databases).
Microsoft SQL Server 7.0, currently in beta test, will include additional
cluster-aware enhancements that provide for faster recovery in the event of
a server or application failure. The version of Microsoft SQL Server that
follows release 7.0 will include new features for shared-nothing scalability
on MSCS clusters (for example, a single database will be able to span
multiple servers).
What are Microsoft's plans for supporting Microsoft Exchange Server on MSCS clusters?
Microsoft Exchange Server Enterprise Edition 5.5 supports cluster failover and is shipping today.
Can the standard versions of Microsoft SQL Server 6.5 or Exchange Server 5.0 be set up for failover on a cluster using the "generic application" capability of MSCS?
Technically proficient customers who want to test Microsoft SQL Server 6.5 or Exchange Server 5.5 on a cluster may do so using the generic application capability of MSCS. However, the setup can be complex, and will not be supported by Microsoft support services. Therefore, customers should only do so for testing purposes, not for production deployments. Microsoft SQL Server, Enterprise Edition version 6.5, and Exchange Server, Enterprise Edition 5.5 feature a simplified cluster setup procedure, and are fully supported for failover on MSCS clusters.
Will Microsoft SNA Server benefit from MSCS?
No, because Microsoft SNA Server already provides a hot failover capability independent of MSCS.
Will Microsoft Proxy Server benefit from MSCS?
No, because the current version of Microsoft Proxy Server has its own capability for chaining together multiple servers for high availability and scalability.
Will Microsoft Systems Management Server benefit from MSCS?
No, MSCS will not provide high availability for the current release of Microsoft Systems Management Server. Microsoft intends to provide cluster-enabled high availability for Systems Management Server in a future release.
Can MSCS failover a Windows NT Server Directory (Domain) Controller?
No, because it is already possible to have backup directory service controllers for high availability. Servers in an MSCS cluster may be either primary or backup directory controllers for Windows NT Directory Services.
Can MSCS failover a WINS (Windows Internet Name Service) server?
No, because it is already possible to have backup WINS servers for high availability.
Can MSCS failover Remote Access Services (RAS)?
Remote Access Services cannot benefit from MSCS at this time since there is no standard method for doing software failover of modem connections. For higher reliability of dial-up connections, you can use the RAS Multi-Link capability first introduced in Windows NT Server 4.0.
Can MSCS failover Microsoft Distributed File System (Dfs) directories?
Not in Windows NT Server, Enterprise Edition 4.0. The version of Dfs in Windows 2000 Server will provide directory replication for fault tolerance. When used on the Enterprise Edition of Windows 2000 Server, Dfs will also work with MSCS failover for fast recovery from server crashes.
What versions of Oracle will benefit from MSCS clusters?
Oracle has announced that Oracle Failsafe 2.0 is available for Oracle7 customers at no extra cost. It provides "active/active" database failover on MSCS clusters (that is, the database can run on both servers at the same time, and either can fail over to the other server in the event of an application or server failure). For more information, refer to http://ntsolutions.oracle.com/index.htm.
Does Tandem NonStop SQL/MX use MSCS?
Tandem NonStop SQL/MX uses MSCS clustering services when running on a two-server cluster. NonStop SQL/MX uses its own single-application clustering services when running on a cluster with more than two servers. Customers who want high availability plus database scalability up to the performance provided by two high-end SMP servers will benefit by running NonStop SQL/MX on MSCS to gain the additional benefits of high availability for other services and applications on the cluster. Customers who require additional scalability would use the built-in single-application cluster services of NonStop SQL/MX, trading off general availability services for the ability to scale on more than two servers.
How does Microsoft Cluster Server work with Windows NT Load Balancing Service?
Windows NT Load Balancing Service is fully complementary to Microsoft
Cluster Server. Microsoft Cluster Server provides a nonstop, reliable
platform for database, messaging, and related application services through
failover clustering for two nodes. Windows NT Load Balancing Service
balances and distributes client connections (TCP/IP connections) over
multiple servers. In a three-tier model, MSCS handles the application layer
and the data layer, while Windows NT Load Balancing Service (also referred
to as Convoy) is focused on handling the front-end connections. When used
together, Microsoft Cluster Server and Windows NT Load Balancing Service
provide customers with a highly scalable, reliable, and available system.
This is an industry-leading way to combine transactional systems with a
Web-based front end, and to deliver the scale, availability, and robustness
demanded by enterprise-class customers.