|Oracle® Database High Availability Architecture and Best Practices
10g Release 1 (10.1)
Part Number B10726-02
This chapter includes the following topics:
The interconnected nature of today's global businesses demands continuous availability for more of the business components. However, a business that is designing and implementing an HA strategy must perform a thorough analysis and have a complete understanding of the business drivers that require high availability, because implementing high availability is costly. It may involve critical tasks such as:
Retiring legacy systems
Investment in more sophisticated and robust systems and facilities
Redesign of the overall IT architecture to adapt to this high availability model
Redesign of business processes
Hiring and training of personnel
Higher degrees of availability reduce downtime significantly, as shown in the following table:
|Availability Percentage||Approximate Downtime Per Year|
Businesses with higher availability requirements must deploy more fault-tolerant, redundant systems for their business components and have a larger investment in IT staff, processes, and services to ensure that the risk of business downtime is minimized.
An analysis of the business requirements for high availability and an understanding of the accompanying costs enables an optimal solution that meets the needs of the business managers to be available as much as possible within financial and resource limitations of the business. This chapter provides a simple framework that can be used effectively to evaluate the high availability requirements of a business.
The elements of this analysis framework are:
A rigorous business impact analysis identifies the critical business processes within an organization, calculates the quantifiable loss risk for unplanned and planned IT outages affecting each of these business processes, and outlines the less tangible impacts of these outages. It takes into consideration essential business functions, people and system resources, government regulations, and internal and external business dependencies. This analysis is done using objective and subjective data gathered from interviews with knowledgeable and experienced personnel, reviewing business practice histories, financial reports, IT systems logs, and so on.
The business impact analysis categorizes the business processes based on the severity of the impact of IT-related outages. For example, consider a semiconductor manufacturer, with chip design centers located worldwide. An internal corporate system providing access to human resources, business expenses and internal procurement is not likely to be considered as mission-critical as the internal e-mail system. Any downtime of the e-mail system is likely to severely affect the collaboration and communication capabilities among the global R&D centers, causing unexpected delays in chip manufacturing, which in turn will have a material financial impact on the company.
In a similar fashion, an internal knowledge management system is likely to be considered mission-critical for a management consulting firm because the business of a client-focused company is based on internal research accessibility for its consultants and knowledge workers. The cost of downtime of such a system is extremely high for this business. This leads us to the next element in the high availability requirements framework: cost of downtime.
A well-implemented business impact analysis provides insights into the costs that result from unplanned and planned downtimes of the IT systems supporting the various business processes. Understanding this cost is essential because this has a direct influence on the high availability technology chosen to minimize the downtime risk.
Various reports have been published, documenting the costs of downtime across industry verticals. These costs range from millions of dollars per hour for brokerage operations and credit card sales, to tens of thousands of dollars per hour for package shipping services.
While these numbers are staggering, the reasons are quite obvious. The Internet has brought millions of customers directly to the businesses' electronic storefronts. Critical and interdependent business issues such as customer relationships, competitive advantages, legal obligations, industry reputation, and shareholder confidence are even more critical now because of their increased vulnerability to business disruptions.
A business impact analysis, as well as the calculated cost of downtime, provides insights into the recovery time objective (RTO), an important statistic in business continuity planning. It is defined as the maximum amount of time that an IT-based business process can be down before the organization starts suffering significant material losses. RTO indicates the downtime tolerance of a business process or an organization in general.
The RTO requirements are proportional to the mission-critical nature of the business. Thus, for a system running a stock exchange, the RTO is zero or very near to zero.
An organization is likely to have varying RTO requirements across its various business processes. Thus, for a high volume e-commerce Web site, for which there is an expectation of rapid response times and for which customer switching costs are very low, the web-based customer interaction system that drives e-commerce sales is likely to have an RTO close to zero. However, the RTO of the systems that support the backend operations such as shipping and billing can be higher. If these backend systems are down, then the business may resort to manual operations temporarily without a significantly visible impact.
A systems statistic related to RTO is the network recovery objective (NRO), which indicates the maximum time that network operations can be down for a business. Components of network operations include communication links, routers, name servers, load balancers, and traffic managers. NRO impacts the RTO of the whole organization because individual servers are useless if they cannot be accessed when the network is down.
Recovery point objective (RPO) is another important statistic for business continuity planning and is calculated through an effective business impact analysis. It is defined as the maximum amount of data an IT-based business process may lose before causing detrimental harm to the organization. RPO indicates the data-loss tolerance of a business process or an organization in general. This data loss is often measured in terms of time, for example, 5 hours or 2 days worth of data loss.
A stock exchange where millions of dollars worth of transactions occur every minute cannot afford to lose any data. Thus, its RPO must be zero. Referring to the e-commerce example, the web-based sales system does not strictly require an RPO of zero, although a low RPO is essential for customer satisfaction. However, its backend merchandising and inventory update system may have a higher RPO; lost data in this case can be re-entered.
Using the high availability analysis framework, a business can complete a business impact analysis, identify and categorize the critical business processes that have the high availability requirements, formulate the cost of downtime, and establish RTO and RPO goals for these various business processes.
This enables the business to define service level agreements (SLAs) in terms of high availability for critical aspects of its business. For example, it can categorize its businesses into several HA tiers:
Tier 1 business processes have maximum business impact. They have the most stringent HA requirements, with RTO and RPO close to zero, and the systems supporting it need to be available on a continuous basis. For a business with a high-volume e-commerce presence, this may be the web-based customer interaction system.
Tier 2 business processes can have slightly relaxed HA and RTO/RPO requirements. The second tier of an e-commerce business may be their supply chain / merchandising systems. For example, these systems do not need to maintain 99.999% availability and may have nonzero RTO/RPO values. Thus, the HA systems and technologies chosen to support these two tiers of businesses are likely to be different.
Tier 3 business processes may be related to internal development and quality assurance processes. Systems support these processes need not have the rigorous HA requirements of the other tiers.
The next step for the business is to evaluate the capabilities of the various HA systems and technologies and choose the ones that meet its SLA requirements, within the guidelines as dictated by business performance issues, budgetary constraints, and anticipated business growth.
Figure 2-1 illustrates this process.
Figure 2-1 Planning and Implementing a Highly Available Enterprise
The following sections provide further details about this methodology:
A broad range of high availability and business continuity solutions exists today. As the sophistication and scope of these systems increase, they make more of the IT infrastructure, such as the data storage, server, network, applications, and facilities, highly available. They also reduce RTO and RPO from days to hours, or even to minutes. But increased availability comes with an increased cost, and on some occasions, with an increased impact on systems performance.
Organizations need to carefully analyze the capabilities of these HA systems and map their capabilities to the business requirements to make sure they have an optimal combination of HA solutions to keep their business running. Consider the business with a significant e-commerce presence as an example.
For this business, the IT infrastructure that supports the system that customers encounter, the core e-commerce engine, needs to be highly available and disaster-proof. The business may consider clustering for the web servers, application servers and the database servers serving this e-commerce engine. With built-in redundancy, clustered solutions eliminate single points of failure. Also, modern clustering solutions are application-transparent, provide scalability to accommodate future business growth, and provide load-balancing to handle heavy traffic. Thus, such clustering solutions are ideally suited for mission-critical high-transaction applications.
The data that supports the high volume e-commerce transactions must be protected adequately and be available with minimal downtime if unplanned and planned outages occur. This data should not only be backed up at regular intervals at the local data centers, but should also be replicated to databases at a remote data center connected over a high-speed, redundant network. This remote data center should be equipped with secondary servers and databases readily available, and be synchronized with the primary servers and databases. This gives the business the capability to switch to these servers at a moment's notice with minimal downtime if there is an outage, instead of waiting for hours and days to rebuild servers and recover data from backed-up tapes.
Maintaining synchronized remote data centers is an example where redundancy is built along the entire system's infrastructure. This may be expensive. However, the mission-critical nature of the systems and the data it protects may warrant this expense. Considering another aspect of the business: for example, the high availability requirements are less stringent for systems that gather clickstream data and perform data mining. The cost of downtime is low, and the RTO and RPO requirements for this system could be a few days, because even if this system is down and some data is lost, that will not have a detrimental effect on the business. While the business may need powerful machines to perform data mining, it does not need to mirror this data on a real-time basis. Data protection may be obtained by simply performing regularly scheduled backups, and archiving the tapes for offsite storage.
For this e-commerce business, the back-end merchandising and inventory systems are expected to have higher HA requirements than the data mining systems, and thus they may employ technologies such as local mirroring or local snapshots, in addition to scheduled backups and offsite archiving.
The business should employ a management infrastructure that performs overall systems management, administration and monitoring, and provides an executive dashboard. This management infrastructure should be highly available and fault-tolerant.
Finally, the overall IT infrastructure for this e-commerce business should be extremely secure, to protect against malicious external and internal electronic attacks.
High availability solutions must also be chosen keeping in mind business performance issues. For example, a business may use a zero-data-loss solution that synchronously mirrors every transaction on the primary database to a remote database. However, considering the speed-of-light limitations and the physical limitations associated with a network, there will be round-trip-delays in the network transmission. This delay increases with distance, and varies based on network bandwidth, traffic congestion, router latencies, and so on. Thus, this synchronous mirroring, if performed over large WAN distances, may impact the primary site performance. Online buyers may notice these system latencies and be frustrated with long system response times; they may go somewhere else for their purchases. This is an example where the business must make a trade-off between having a zero data loss solution and maximizing system performance.
High availability solutions must also be chosen keeping in mind financial considerations and future growth estimates. It is tempting to build redundancies throughout the IT infrastructure and claim that the infrastructure is completely failure-proof. However, going overboard with such solutions may not only lead to budget overruns, it may lead to an unmanageable and unscalable combination of solutions that are extremely complex and expensive to integrate and maintain.
An HA solution that has very impressive performance benchmark results may look good on paper. However, if an investment is made in such a solution without a careful analysis of how the technology capabilities match the business drivers, then a business may end up with a solution that does not integrate well with the rest of the system infrastructure, has annual integration and maintenance costs that easily exceed the upfront license costs, and forces a vendor lock-in. Cost-conscious and business-savvy CIOs must invest only in solutions that are well-integrated, standards-based, easy to implement, maintain and manage, and have a scalable architecture for accommodating future business growth.
Choosing and implementing the architecture that best fits the availability requirements of a business can be a daunting task. This architecture must encompass appropriate redundancy, provide adequate protection from all types of outages, ensure consistent high performance and robust security, while being easy to deploy, manage, and scale. Needless to mention, this architecture should be driven by well-understood business requirements.
To build, implement and maintain such an architecture, a business needs high availability best practices that involve both technical and operational aspects of its IT systems and business processes. Such a set of best practices removes the complexity of designing an HA architecture, maximizes availability while using minimum system resources, reduces the implementation and maintenance costs of the HA systems in place, and makes it easy to duplicate the high availability architecture in other areas of the business.
An enterprise with a well-articulated set of high availability best practices that encompass HA analysis frameworks, business drivers and system capabilities, will enjoy an improved operational resilience and enhanced business agility. The remaining chapters in this book will provide technical details on the various high availability technologies offered by Oracle, along with best practice recommendations on configuring and using such technologies.