|Oracle® Database High Availability Architecture and Best Practices
10g Release 1 (10.1)
Part Number B10726-02
This chapter provides recommendations for configuring the subcomponents that make up the database server tier and the network. It includes the following sections:
The goal of configuring a highly available environment is to create a redundant, reliable system and database without sacrificing simplicity and performance. This chapter includes recommendations for configuring the subcomponents of the database server tier.
Redundancy: Choose components and capabilities that are redundant to reduce the risk from failures
Consolidation: Consolidate components where appropriate to reduce the number of components that can fail
Expansion: Plan for future growth
Manageability: Choose components that are reliable and can be serviced online
See Also:Your vendors' HA architecture and best practices documentation and recommendations for hardware, operating system, and cluster configurations
Electronic data is one of the most important assets of any business. Storage arrays that house this data must protect it and keep it accessible to ensure the success of the company. This section describes characteristics of a fault-tolerant storage subsystem that protects data while providing manageability and performance. The following storage recommendations for all architectures are discussed in this section:
The following section pertains specifically to RAC environments:
The storage array should contain one or more spare disks (often called hot spares). When a physical disk starts to report errors to the monitoring infrastructure, or fails suddenly, the firmware should immediately restore fault tolerance by mirroring the contents of the failed disk onto a spare disk.
Connectivity to the storage array must be fully redundant (referred to as multipathing) so that the failure of any single component in the data path from any node to the shared disk array (such as controllers, interface cards, cables, and switches) is transparent and keeps the array fully accessible. This achieves addressing the same logical device through multiple physical paths. A host-based device driver reissues the I/O to one of the surviving physical interfaces.
If the storage array includes a write cache, then it must be protected to guard against memory board failures. The write cache should be protected by multiple battery backups to guard against a failure of all external power supplies. The cache must be able to survive either by using the battery backup long enough for the external power to return or until all the dirty blocks in cache are guaranteed to be fully flushed to a physical disk.
If a physical component fails, then the array must allow the failed device to be repaired or replaced without requiring the array to be shut down or taken offline. Also, the storage array must allow the firmware to be upgraded and patched without shutting down the storage array.
Data should be mirrored to protect against disk and other component failures and should be striped over a large number of disks to achieve optimal performance. This method of storage configuration is known as Stripe and Mirror Everything (SAME). The SAME methodology provides a simple, efficient, and highly available storage configuration.
Oracle's automatic storage management (ASM) feature always evenly stripes data across all drives within a disk troup with the added benefit of automatically rebalancing files across new disks when disks are added, or across existing disks if disks are removed. In addition, ASM can provide redundancy protection to protect against component failures or enable mirroring to be provided by the underlying storage array.
Load balancing of I/O operations across physical interfaces is usually provided by a software package that is installed on the host. The purpose of this load balancing software is to redirect I/O operations to a less busy physical interface if a single host bus adapter (HBA) is overloaded by the current workload.
Create separate, independent storage areas for software, active database files, and recovery-related files.
The following storage areas are needed:
Flash recovery area: A location where recovery-related files are created, such as multiplexed copies of the current control file and online redo logs, archived redo logs, backup sets, and flashback log files
The storage containing the database area should be as physically distinct as possible from the flash recovery area. At a minimum, the database area and flash recovery area should not share the same physical disk drives or controllers. This practice ensures that the failure of a component that provides access to a datafile in the database area does not also cause the loss of the backups or redo logs in the flash recovery area needed to recover that datafile.
Storage options are defined as follows:
Regular file system: A local file system that is not shared by any other system
The rest of this section includes these topics:
Software area should be installed on a regular file system local to each node in the RAC cluster. This configuration permits Oracle patch upgrades and eliminates the software as a single point of failure.
Both the database area and the flash recovery area must be accessible to all nodes in the RAC cluster.
If a cluster file system (CFS) is available on the target platform, then both the database area and flash recovery area can be created on either CFS or ASM.
If a cluster file system (CFS) is unavailable on the target platform, then the database area can be created either on ASM or on raw devices (with the required volume manager), and the flash recovery area must be created on ASM.
Table 6-1 summarizes the independent storage recommendations by architecture.
Table 6-1 Independent Storage Recommendations for HA Architectures
|Storage Area||Non-RAC Database||RAC Database (CFS Available)||RAC Database (CFS Not Available)|
|Software area||Regular file system||Regular file system||Regular file system|
|Database area||Regular file system or ASM||CFS or ASM||Raw or ASM|
|Flash recovery area||Regular file system or ASM||CFS or ASM||ASM|
ASM uses a concept called failure groups to protect data against disk or component failures. When using failure groups, ASM optimizes file layout to reduce the unavailability of data due to the failure of a shared resource. Failure groups define disks that share components, so that if one disk fails, then other disks sharing the component might also fail. An example of what might be defined as a failure group is a string of SCSI disks on the same SCSI controller. Failure groups are used to determine which ASM disks to use for storing redundant data. For example, if 2-way mirroring is specified for a file, then redundant copies of file extents will be stored in separate failure groups.
Failure groups are used with storage that does not provide its own redundancy capability, such as disks that have not been configured according to RAID. The manner in which failure groups are defined for an ASM disk group is site-specific because the definition depends on the configuration of the storage and how it is connected to the database systems. ASM attempts to maintain three copies of its metadata, requiring a minimum of three failure groups for proper protection.
When using ASM with intelligent storage arrays, the storage array's protection features (such as hardware RAID-1 mirroring) are typically used instead of ASM's redundancy capabilities, which are implemented by using the
EXTERNAL REDUNDANCY clause of the
CREATE DISKGROUP statement. When using external redundancy, ASM failure groups are not used, because disks in an external redundancy disk group are presumed to be highly available.
Disks, as presented by the storage array to the operating system, should not be aggregated or subdivided by the array because it can hide the physical disk boundaries needed for proper data placement and protection. If, however, physical disk boundaries are always hidden by the array, and if each logical device, as presented to the operating system, has the same size and performance characteristics, then simply place each logical device in the same ASM disk group (with redundancy defined as
EXTERNAL REDUNDANCY),and place the database and flash recovery areas in that one ASM disk group. An alternative approach is to create two disk groups, each concsisting of a single logical device, and place the database area in one disk group and the flash recovery in the other disk group. This method provides additional protection against disk group metadata failure and corruption.
For example, suppose a storage array takes eight physical disks of size 36GB and configures them in a RAID 0+1 manner for performance and protection, giving a total of 144GB of mirrored storage. Futhermore, this 144GB of storage is presented to the operating system as two 72GB logical devices with each logical device being striped and mirrored across all eight drives. When configuring ASM, place each 72GB logical device in the same ASM disk group, and place the database area and flash recovery areas on that disk group.
If the two logical devices have different performance characteristics,( for example, one corresponds to the inner half and the other to the outer half of the underlying physical drives) then the logical devices should be placed in separate disk groups. Because the outer portion of a disk has a higher transfer rate, the outer half disk group should be used for the database area; the inner half disk group should be used for the flash recovery area.
If multiple, distinct storage arrays are used with a database under ASM control, then multiple options are available. One option is to create multiple ASM disk groups that do not share storage across multiple arrays and place the database and flash recovery areas in separate disk groups, thus physically separating the database area from the flash recovery area. Another option is to create a single disk group across arrays that consists of failure groups, where each failure group contains disks from just one array.These options provide protection against the failure of an entire storage array.
See Also:Oracle Database Administrator's Guide
The Hardware Assisted Resilient Data (HARD) initiative is a program designed to prevent data corruptions before they happen. Data corruptions are very rare, but when they do occur, they can have a catastrophic effect on business. Under the HARD initiative, Oracle's storage partners implement Oracle's data validation algorithms inside storage devices. This makes it possible to prevent corrupted data from being written to permanent storage. The goal of HARD is to eliminate a class of failures that the computer industry has so far been powerless to prevent. RAID has gained a wide following in the storage industry by ensuring the physical protection of data; HARD takes data protection to the next level by going beyond protecting physical data to protecting business data.
In order to prevent corruptions before they happen, Oracle tightly integrates with advanced storage devices to create a system that detects and eliminates corruptions before they happen. Oracle has worked with leading storage vendors to implement Oracle's data validation and checking algorithms in the storage devices themselves. The classes of data corruptions that Oracle addresses with HARD include:
Writes that physically and logically corrupt data file, control file and log file blocks
Writes of Oracle blocks to incorrect locations
Erroneous writes to Oracle data by programs other than Oracle
Writes of partial or incomplete blocks
End-to-end block validation is the technology employed by the operating system or storage subsystem to validate the Oracle data block contents. By validating Oracle data in the storage devices, corruptions will be detected and eliminated before they can be written to permanent storage. This goes beyond the current Oracle block validation features that do not detect a stray, lost, or corrupted write until the next physical read.
Oracle vendors are given the opportunity to implement validation checks based on a specification. A vendors' implementation may offer features specific to their storage technology. Oracle maintains a Web site that will show a comparison of each vendor's solution by product and Oracle version. For the most recent information, see
The shared volumes created for the OCR and the voting disk should be configured using RAID to protect against media failure. This requires the use of an external cluster volume manager, cluster file system, or storage hardware that provides RAID protection.
The main server hardware components are the nodes for the database and application server farm and the components within each node such as CPU, memory, interface boards (such as I/O and network), storage, and the cluster interconnect in a RAC environment.
This section includes the following topics:
Use fewer, faster CPUs instead of more, slower CPUs. Use fewer higher-density memory modules instead of more lower-density memory modules. This reduces the number of components that can fail while providing the same service level. This needs to be balanced with cost and redundancy.
Using redundant hardware components enables the system to fail over to the working component while taking the failed component offline. Choose components (such as power supplies, cooling fans, and interface boards) that can be repaired or replaced while the system is running to prevent unscheduled outages caused by the repair of hardware failures.
Use systems that can automatically detect failures and provide alternate paths around subsystems that have failed or isolate the subsystem. Choose a system that continues to run despite a component failure and automatically works around the failed component without incurring a full outage. For example, find a system that can use an alternate path for I/O requests if an adapter fails or can avoid a bad physical memory location if a memory board fails.
Because mirroring does not protect against accidental removal of a file or most corruptions of the boot disk, an online copy of the boot disk should be maintained so that the system can be quickly rebooted using the same operating system image if a critical file is removed or corruption occurs on the primary boot image. Operating system vendors often provide a mechanism to easily maintain multiple boot images.
RAC provides fast, automatic recovery from node and instance failures. Using a properly supported configuration is necessary for a successful RAC implementation. A supported hardware configuration may encompass server hardware, storage arrays, interconnect technology, and other hardware components.
Select an interconnect that has redundancy, high speed, low latency, low host resource consumption, and the ability to balance loads across multiple available paths. In two-node configurations, it may be possible to use a direct-connect interconnect between the two nodes, but a switch should be used instead to provide a degree of isolation between the network interface cards of the two nodes in the cluster. If you plan to have more than two nodes in the future, then choose a switch-based interconnect solution instead of direct-connect to reduce the complexity of adding additional nodes in the future. If you have more than two nodes, then a switch-based interconnect is highly recommended and, in many cases, a requirement of the cluster solution being used.
Using identical hardware for machines at both sites provides a symmetric environment that is easier to administer and maintain. Such a symmetric configuration mitigates failures or performance inconsistencies following a switchover or failover because of dissimilar hardware.
The recommendations for serversoftware apply to all nodes in a RAC environment and to both the primary and secondary sites in a Data Guard and MAA environment because they contain the identical configuration.
This section includes the following topics:
Use the same operating system version, patch level, single patches, and driver versions on all machines. Consistency with operating system versions and patch levels reduces the likelihood of encountering incompatibilities or small inconsistencies between the software on all machines. In an open standards environment, it is impractical for any vendor to test each and every piece of new software with every combination of software and hardware that has been previously released. Temporary differences can be tolerated when using RAC or Data Guard as individual systems or groups of systems are upgraded or patched one at a time to minimize scheduled outages, if the goal is that all machines will be upgraded to the same versions and patch levels.
Use an operating system that, when coupled with the proper hardware, supports the ability to automatically detect failures and provide alternate paths around subsystems that have failed or isolate subsystems. Choose a system that can continue running if a component, such as a CPU or memory board, fails and automatically provides paths around failed components, such as an alternate path for I/O requests in the event of an adapter failure.
Use disk-based swap instead of RAM-based swap. It is always good practice to make all system memory available for database and application use unless the amount available is sufficiently oversized to accommodate swap.
Do not use TMPFS (a Solaris implementation that stores files in virtual memory rather than on disk) or other file systems that exclusively use virtual memory instead of disks for storage for the
/tmp file system or other scratch file systems. This prevents runaway programs that write to
/tmp from having the potential to hang the system by exhausting all virtual memory.
An example on UNIX is setting shared memory and semaphore kernel parameters high enough to enable future growth, thus preventing an outage for reconfiguring the kernel and rebooting the system. Verify that using settings higher than required either presents no overhead or incurs such small overhead on large systems that the effect is negligible.
As with server hardware and software, the recommendations for Oracle software apply to both the primary and secondary sites because they contain identical configurations. Mirror disks containing Oracle and application software to prevent a disk failure from causing an unscheduled outage.
The following recommendations apply to a "RAC only" environment:
RAC provides fast, automatic recovery from node and instance failures. Using a properly supported configuration is a key component of success. A supported software configuration encompasses operating system versions and patch levels, clustering software versions, which possibly includes Oracle-supplied cluster software. On some platforms (such as Solaris, Linux, and Windows), Oracle supplies cluster software required for use with RAC.
This section includes the following topics:
The following recommendations apply to all architectures:
Table 6-2 Redundant Network Components
|Exterior connectivity (including remote IT staff)||Multiple internet service providers (ISP) ensure that an ISP failure does not make your site unavailable.
Note: The cables should not be housed within the same trunk into your facility. Otherwise, a saw or shovel can cut connections to multiple ISPs simultaneously.
|WAN traffic manager||Use a primary and a backup WAN traffic manager. (Not used with "Database only" or "RAC only" environments) Environments using Data Guard contain a primary and a backup WAN traffic manager at each site.|
|Application load balancer||Implement redundant application load balancers for directing client requests to the application server tier. The secondary load balancer should be identically configured with a heartbeat to the primary load balancer. If the primary load balancer fails, then the secondary load balancer initiates a takeover function.|
|Firewall||Implement redundant firewalls and load-balance across the pair. Because the connections through a firewall tend to be longer (socket connections versus HTTP), the load-balanceing decision needs to be made when the session is initiated and maintained during the session. Check with your firewall vendor for available functionality.|
|Middle-tier application server||Implement application server farms. This provides fault tolerance and scalability.|
Figure 6-1 depicts a single site in an MAA environment, emphasizing the redundant network components.
Figure 6-1 Network Configuration
Application layer load balancers sit logically in front of the application server farm and publish to the outside world a single IP address for the service running on a group of application servers. All requests are initially handled by the load balancer, which then distributes them to a server within the application server farm. End users only need to address their requests to the published IP address; the load balancer determines which server should handle the request.
If one of the middle-tier application servers cannot process a request, the load balancers route all subsequent requests to functioning servers. The load balancers should be redundant to avoid being a single point of failure because all client requests pass through a single hardware-based load balancer. Because failure of that piece of hardware is detrimental to the entire site, a backup load balancer is configured that has a heartbeat with the primary load balancer. With two load balancers, one is configured as standby and becomes active only if the primary load balancer becomes unavailable.
Use the Oracle Interface Configuration (OIFCFG) tool to classify network interfaces as public, cluster interconnect, or storage so that RAC properly selects a network interface for internode network traffic.
The following recommendations apply to Data Guard:
Configure system TCP parameters that control the sending and receiving buffer sizes so that the bandwidth between sites can be fully utilized for log transport services. The proper buffer size is often governed by the bandwidth delay product (BDP) formula, particularly when using a high-speed, high-latency network.
WAN traffic managers provide the initial access to the services located at the primary site. These managers are implemented at the primary and secondary sites to provide site failover capabilities when the primary site becomes completely unavailable. Geographically separating the WAN traffic managers on separate networks reduces the impact of network failures or a natural disaster that causes the primary site to become unavailable.
See Also:"Complete or Partial Site Failover"