Backups Strategic Vision
Written by Aamir Chaudry and Scotty Logan
This document outlines a strategic vision for backups in general, and for the backup service(s) provided by IT Services at Stanford University in particular.
Principles
The goal of a backup service is to provide a consistent and reliable copy of required data (all data, including work-related data, administrative data, personal data, and application and system configuration information). This copy can be used as a means of recovery should the data become lost, corrupted, or compromised. The service should make recovery possible by providing fast restores, which require local backups, and disaster protection, which requires off-site backups.
Although the definition of a backup may seem straightforward, implementing a storage backup strategy that successfully protects data is not a trivial task. It goes hand in hand with the deployed storage technologies, so it is critical to understand the underlying storage strategy in order to fully exploit the power of a backup strategy. See the Storage Vision at http://www.stanford.edu/dept/itss/organization/planning/vision/storage.html for storage details.
ITS’ backup services should provide a centrally-managed backup environment for use by both IT Services and client systems, applications and services.
Some guiding principles that apply to all backup services are:
- Simplicity and Transparency: The backup service should be simple and well defined, and should provide a transparent interface to the end user.
- Definition: The data sets to be backed up should be clearly defined and agreed upon by the data owners and the service provider.
- Scalability: A central backup service must handle hundreds of simultaneous client connections and reliably manage backups for thousands of clients.
- Validation: Formally tested and documented policies should exist to validate integrity, consistency and usability of data that has been backed up.
- Authentication and Access: Authenticated access to backup resources must be provided. A backup service must have flexible authentication mechanisms to meet the needs of private data and shared data confidential to a particular group.
- Cross-platform support: The backup service should be able to back up all platforms supported by IT Services, including but not limited to Windows, Mac OS X, and multiple versions of UNIX and Linux.
- Retention Policy: The service should allow multiple versions of the data to be retained as required by the data owner. It should also allow a particular version to be retained for a specified interval, which may be dictated by law.
- Recoverability: A well-documented disaster recovery plan should exist for rebuilding the backup service after a catastrophic failure.
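The retention principle above combines two rules: keep a number of versions, and keep any version for a required interval. A minimal sketch of how the two rules might interact follows. The `RetentionPolicy` fields and the `versions_to_keep` helper are hypothetical names chosen for illustration; they are not TSM parameters, and the sample values are not proposed policy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RetentionPolicy:
    # Hypothetical policy fields for illustration, not TSM settings.
    max_versions: int          # the newest N versions are always kept
    min_retention: timedelta   # versions younger than this are kept regardless


def versions_to_keep(backup_times, policy, now):
    """Return the backup timestamps the policy retains, newest first."""
    ordered = sorted(backup_times, reverse=True)
    keep = []
    for index, taken_at in enumerate(ordered):
        within_count = index < policy.max_versions
        within_hold = (now - taken_at) <= policy.min_retention
        if within_count or within_hold:
            keep.append(taken_at)
    return keep


now = datetime(2005, 7, 1)
backups = [now - timedelta(days=d) for d in (1, 2, 3, 10, 40, 400)]
policy = RetentionPolicy(max_versions=2, min_retention=timedelta(days=30))
kept = versions_to_keep(backups, policy, now)
print([(now - t).days for t in kept])
```

With this sample data, the backups taken 1, 2, 3, and 10 days ago are kept: the first two by the version count, the other two by the 30-day hold. The 40-day and 400-day versions are eligible for expiry.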
Technologies
Stable core technologies:
- EMC CLARiiON, with ATA disks, for on-site backup.
- IBM's Tivoli Storage Manager for server backup and archiving.
- LTO Generation 1 and 3590 tape technology for backups.
- StorageTek LTO L700 Tape Libraries.
- IBM 3494 Tape Libraries.
- IBM AIX 5.x.
- IBM Series.
- Solaris 9.x.
- RHEL 3.0.
- Qlogic SANBox2-16 and SANBox2-8 SAN switches.
- Business Continuance Volumes (BCVs) on EMC Symmetrix/DMX arrays for creating database clones for backups and reporting.
Technologies new to IT Services:
- IBM Tivoli Storage Manager 5.3.x for backups and archives.
- NDMP backups for NAS devices: Currently IT Services does not have a NAS offering that will require a central backup. Backups for NAS devices are often best done using NDMP.
- Snapshots and cloning: As data continues to grow and our backup windows continue to shrink, new technologies are needed to augment the current backup methods. As the name implies, a snapshot is a frozen image of data at a given instant in time. Snapshots allow a consistent copy of an application and its data to be created with minimal impact on the application. A clone is an exact replica, or mirror copy, of the existing data; it can be disassociated from its source and made available to another server for backups or for other uses such as reporting. Although both technologies have been available for a while, they have not been widely deployed by IT Services on the CLARiiON platform and are still new to this organization. We MUST look into creating point-in-time copies of data for backup quickly and reliably, with minimal impact on production.
- Desktop backup. While some users and departments still use the centrally managed TSM server (aka BaRS), the service was sunsetted in fall 2004; a new project was recently started to investigate desktop backup options.
Emerging technologies: (see Projects)
- LAN Free backups.
- Server Free backups.
Deprecated technologies: (see Projects and Research)
- All servers which have reached their end of life cycle are deprecated at this point and need to be retired.
- IBM Tivoli Storage Manager v4 for archives and backups.
- IBM 3590 J tape media. IT Services currently has a significant number of such media and should plan migration of all data from this media type.
- IBM SAN Data Gateway 2108-G07, a Fibre Channel bridge used to connect SCSI 3590 tape drives to the SAN.
Other technologies in use: (These technologies are currently deployed and useful in specific circumstances, but either are not attractive for broader use or have a limited scope of applicability -- we have no recommendations on their use at this time.)
- Dantz Retrospect for Windows and Mac backup. Retrospect is used by many departments who either have their own IT staff, or who contract with IT Services for support.
- Connected.com for Windows desktop and laptop backup. IT Services has an existing contract with Connected, but so far fewer than 150 systems are being backed up.
Projects
First:
- Upgrade TSM servers to version 5.3.x. Several of the TSM servers are running older, unsupported versions of TSM; upgrading them will require help from the DBAs, who will need to upgrade BMC SQL-Backtrack (DataTools).
- Restructure AFS backups to optimize current backup methods. This will involve upgrading the AFS backup servers to TSM 5.3.x and deploying new scripts to back up AFS.
- Deploy a TSM server for SUL/AIR. Most likely, this will be on a Solaris system for use as part of the Stanford Digital Repository.
- Centralize documentation. Detailed documentation of the current backup environment is either non-existent, scattered among individual groups, or in some cases incoherent. Gather and link all documentation currently maintained by the different groups, and make it generally available in one place.
- Migrate the storage currently used by TSM servers from RAID-5 protected ATA disks to RAID 1+0 protected ATA disks immediately. RAID 1+0 is more resilient to double disk failures than RAID-5. IT Services currently uses only RAID-5 on its disk arrays for data protection, but RAID-5 has proven to be more susceptible to failures, especially on ATA disks.
- Deploy latest TSM clients for all supported platforms.
- Desktop backup service. The product selection phase of the project is expected to finish in early August. The duration of the implementation will depend on the selected product.
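The resilience claim in the RAID migration item above can be made concrete with simple counting. The sketch below assumes a single RAID-5 group, a RAID 1+0 set built from two-way mirrored pairs, one disk already failed, and a second failure equally likely on any surviving disk. The function names are illustrative, and the model ignores rebuild time and unrecoverable read errors.

```python
from fractions import Fraction


def raid5_loss_on_second_failure(disks: int) -> Fraction:
    """Single RAID-5 group: any second failure before rebuild loses the group."""
    if disks < 3:
        raise ValueError("RAID-5 needs at least 3 disks")
    return Fraction(1)


def raid10_loss_on_second_failure(disks: int) -> Fraction:
    """Two-way mirrored pairs: only the failed disk's partner is fatal."""
    if disks < 4 or disks % 2:
        raise ValueError("RAID 1+0 needs an even number of disks, at least 4")
    return Fraction(1, disks - 1)


# Probability that a second disk failure causes data loss, per array size.
for n in (4, 8, 14):
    print(n, raid5_loss_on_second_failure(n), raid10_loss_on_second_failure(n))
```

Under these assumptions, a second failure in an 8-disk RAID 1+0 set is fatal only 1 time in 7, while in an 8-disk RAID-5 group it is always fatal. The trade-off is usable capacity: two-way mirroring yields half the raw capacity.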
Next:
- Deploy IBM Tivoli Storage Manager Enterprise features for centralized management and policy based administration of all TSM servers.
- Upgrade core network being used by the backup service to gigabit. This will require placing all TSM servers behind firewalls and using the 171.67.3.x address space for backups.
- Upgrade the firewall and DNS infrastructure used to support backups.
- Explore methods for backing up CIFS. Document, test and deploy in production.
- Determine requirements for data encryption. Formulate policies to enforce required encryption. Test and deploy the formulated policies in production.
- Acquire or develop capacity planning tools to make accurate projections of backup needs.
- Test, document and publish data recovery / system recovery procedures.
- Replace all AIX TSM servers with either Linux or Windows TSM servers. This should happen only after publishing a detailed feasibility study of migrating from a known and stable AIX environment to a comparatively new and untested Linux/Windows platform. Solaris TSM servers should also be migrated to Linux/Windows, provided there are no large Solaris database servers that require backups directly from cloned SAN volumes to tape. All AIX and Solaris TSM servers will need to be upgraded to a recent version of TSM before migrating to Linux/Windows.
Later:
- Completely isolate the backup infrastructure from the central Forsythe SAN and network.
- Subfile backups: Faster backups for mobile users over slower links are becoming increasingly important for any organization. We need to research areas that can exploit our currently deployed technology without re-inventing the wheel. IBM Tivoli Storage Manager provides adaptive subfile differencing, which can make regular backups faster by transferring only the parts of data files that have changed since their last backup. Although this functionality has been available since 2004, IT Services has not done any work in production with subfile backups, and has therefore not realized the cost benefits of backing up less and in turn charging end users less.
- Deploy a central management tool (e.g. Servergraph) that can be used for monitoring and administration of our current and future TSM environments.
- Integrate backup infrastructure monitoring into SMARTS to enable Production Control Group (PCG) to monitor some of the lower end functions of TSM (where possible).
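To illustrate the idea behind the subfile backups item above, here is a minimal sketch of block-level differencing. It is not TSM's adaptive subfile algorithm: it assumes fixed-size blocks compared by SHA-256 digest, and `BLOCK_SIZE`, `block_digests`, and `changed_blocks` are hypothetical names. It also does not handle files that shrink between backups.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative block size, chosen arbitrarily


def block_digests(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Hash each fixed-size block of the data."""
    return [
        hashlib.sha256(data[offset:offset + block_size]).digest()
        for offset in range(0, len(data), block_size)
    ]


def changed_blocks(previous_digests, current: bytes,
                   block_size: int = BLOCK_SIZE) -> dict:
    """Return {block_index: block_bytes} for blocks that differ from the last backup."""
    delta = {}
    for index, digest in enumerate(block_digests(current, block_size)):
        if index >= len(previous_digests) or previous_digests[index] != digest:
            offset = index * block_size
            delta[index] = current[offset:offset + block_size]
    return delta


old = bytes(16384)            # four zero-filled blocks
new = bytearray(old)
new[5000] = 1                 # modify one byte inside block 1
delta = changed_blocks(block_digests(old), bytes(new))
print(sorted(delta), sum(len(b) for b in delta.values()), "of", len(new), "bytes sent")
```

In this example a one-byte change causes only one 4096-byte block out of 16384 bytes to be transferred, which is the kind of saving that matters on slow links.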
Research
The following areas should be explored with an eye to their long-term inclusion in our backup strategy. Without more information, it is premature to specifically identify any of these areas for projects, but if the research pans out, they may move from this section into the project section for a full production implementation.
- LAN Free Backups: LAN-free backups allow a server to back up over a Fibre Channel interface. The data moves from the SAN-attached disk, through the application server, and directly to a SAN-attached storage device. LAN-free backups provide substantial performance gains for large-volume backup requirements. Metadata packets containing information about the backed-up data still travel across the LAN in this configuration, with minimal impact on LAN bandwidth.
- Server Free Backups: Server-free backups lower the involvement of the application server and reduce its CPU, memory, and I/O consumption during the backup process. Conceptually, the data moves directly from the server's disk to a 'data mover' router, then through the SAN to a SAN-attached storage device. The key advantage of server-free backup is the reduced workload on the application server. For mission-critical (24x7 availability) application servers, this technology would be a significant advance in data management methods.
- Disk-only based backups: Due to the high recurring cost of a tape backup solution, there is a strong industry shift towards disk-only based backups. We have already deployed a disk-only based solution (limited to on-site storage at this point) for desktop backups under the old BaRS service. We need to explore this avenue further for server-class systems by considering moving the on-site copies of their data to cheap disk technology (although not at the cost of jeopardizing data availability and performance).
- Disaster Recovery and Replication: A separate project is underway to create a disaster recovery plan; the central backup service will be a major component of that plan.
- Data classification: Currently all data is treated equally which is wasteful in terms of cost and resources. We need to identify a) What data needs to be backed up; b) How important is the data; c) How long do we need to keep it; d) What are the availability requirements; e) Are there any other requirements that we need to address? Data should then be categorized and bound to specific retention and versioning policies thereby potentially reducing the total cost.
- Advancements in tape technology that can provide an environment as stable as the current IBM 3494 tape libraries at a fraction of the cost.
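The data classification item above proposes binding categories of data to specific retention and versioning policies. One way this binding might be expressed is sketched below. The class names, fields, and values are hypothetical placeholders for illustration only, not proposed policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupPolicy:
    versions: int        # how many versions to retain
    retain_days: int     # how many days to keep a version
    offsite_copy: bool   # whether a copy is sent off-site


# Hypothetical classes and values, for illustration only.
POLICIES = {
    "regulated": BackupPolicy(versions=10, retain_days=2555, offsite_copy=True),
    "business": BackupPolicy(versions=5, retain_days=90, offsite_copy=True),
    "scratch": BackupPolicy(versions=1, retain_days=7, offsite_copy=False),
}


def policy_for(data_class: str) -> BackupPolicy:
    """Look up the policy for a class; unclassified data fails loudly."""
    try:
        return POLICIES[data_class]
    except KeyError:
        raise ValueError(f"unclassified data: {data_class!r}") from None


print(policy_for("scratch"))
```

Rejecting unclassified data, rather than falling back to a default, reflects the point that treating all data equally is what drives up cost.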




