This article first appeared in Data Science Briefings, the DataMiningApps newsletter.
This article is based upon our upcoming book Principles of Database Management: The Practical Guide to Storing, Managing and Analyzing Big and Small Data.
In the context of database management, business continuity can be defined as an organization’s ability to guarantee its uninterrupted functioning, despite planned or unplanned downtime of the hardware and software supporting its database functionality. Planned downtime can be due to backups, maintenance, upgrades, etc. Unplanned downtime can be due to malfunctioning of the server hardware, the storage devices, the operating system, the database software or business applications. A very specific, and extreme, aspect of business continuity is an organization’s resilience against human- or nature-induced disasters. In that context, we speak of disaster tolerance.
Contingency Planning, Recovery Point and Recovery Time
An organization’s measures with respect to business continuity and recovery from any calamities are formalized in a contingency plan. Without going into too much detail, a primary element of such a plan is the quantification of recovery objectives, considering an organization’s strategic priorities:
- The Recovery Time Objective (RTO) specifies the amount of downtime that is acceptable after a calamity has occurred. The estimated cost of this downtime provides guidance as to the investments an organization is prepared to make to keep it as short as possible. The closer the RTO is to zero, the less downtime is tolerated, but also the higher the required investments in measures to restore database systems to a functioning state after planned or unplanned downtime. The RTO differs from organization to organization. For example, a worldwide online shop will push for zero downtime, as even the slightest downtime costs vast amounts in lost sales. On the other hand, a secondary school may be able to cope with a few hours of downtime, so there is no need to invest in recovering from a calamity in a matter of minutes.
- The Recovery Point Objective (RPO) specifies the degree to which data loss is acceptable after a calamity. Or to put it differently, it specifies which point in time the system should be restored to, once the system is up and running again. The closer the RPO is to the time of the calamity, the less data will be lost, but also the higher the required investments in state-of-the-art backup facilities, data redundancy, etc. The RPO, too, differs from organization to organization. For example, although a higher RTO is acceptable for a secondary school, its RPO is probably closer to zero, as loss of data with respect to e.g. the pupils’ exam results is quite unacceptable. On the other hand, a weather observatory is better off with a low RTO than with a low RPO, as it is probably important to be able to resume observations as soon as possible, whereas a certain loss of past data from just before the calamity may be less dramatic.
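To make the two objectives concrete, the following minimal Python sketch compares the downtime and data loss of a single incident against an RTO and an RPO. The class and function names, as well as all timestamps and thresholds, are hypothetical and chosen purely for illustration; they loosely mirror the secondary school example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # maximum acceptable window of lost data


def evaluate_incident(objectives, last_recoverable, calamity, restored):
    """Compare an incident's actual downtime and data loss with the objectives."""
    downtime = restored - calamity           # how long the system was unavailable
    data_loss = calamity - last_recoverable  # work done after the last recoverable state
    return {
        "downtime": downtime,
        "data_loss": data_loss,
        "rto_met": downtime <= objectives.rto,
        "rpo_met": data_loss <= objectives.rpo,
    }


# Hypothetical figures: a few hours of downtime are tolerable,
# but hardly any data may be lost.
school = RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(minutes=5))
result = evaluate_incident(
    school,
    last_recoverable=datetime(2024, 1, 15, 9, 58),
    calamity=datetime(2024, 1, 15, 10, 0),
    restored=datetime(2024, 1, 15, 12, 30),
)
print(result["rto_met"], result["rpo_met"])
```

Note that the two checks are independent: an incident can meet the RTO while violating the RPO, and vice versa.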
The aim of a contingency plan is to minimize the RPO and/or RTO, or to at least guarantee a level that is appropriate to the organization, department or process at hand. A crucial aspect in this context is to avoid single points of failure, as these represent the Achilles heels of the organization’s information systems. With respect to database management, the following ‘points of failure’ can be identified: availability and accessibility of storage devices, availability of database functionality and availability of the data itself. In each of these domains, some form of redundancy is called for to mitigate single points of failure. We discuss the respective domains in the next sections.
Availability and Accessibility of Storage Devices
The availability and accessibility of storage devices are typically ensured by means of RAID configurations and enterprise storage subsystems. For example, networked storage, in addition to other considerations, avoids single points of failure with respect to the connectivity between servers and storage devices. In addition, different RAID levels not only impact the RPO by avoiding data loss through redundancy, but also the RTO. For example, the mirror set-up in RAID 1 allows for uninterrupted storage device access, as all processes can be instantaneously redirected to the mirror drive if the primary drive fails. In contrast, the redundancy in the form of parity bits in other RAID levels requires some time to reconstruct the data if the content of one drive in the RAID configuration is damaged.
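The difference between mirroring and parity can be illustrated with a small Python sketch. It is a simplified model, not an actual RAID implementation: the block contents and the helper `xor_blocks` are hypothetical, and real controllers work on stripes spread across physical drives. The point is only that a mirrored copy can be read as is, whereas a parity-protected block has to be recomputed from all surviving blocks.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data blocks of one stripe, plus their parity block.
data = [b"ACCT", b"0042", b"EUR!"]
parity = xor_blocks(data)

lost_index = 1  # pretend the drive holding the second block fails

# Mirroring: the copy is immediately readable, nothing has to be computed.
mirror = list(data)
from_mirror = mirror[lost_index]

# Parity: the lost block is recomputed from every surviving block and the parity.
survivors = [block for i, block in enumerate(data) if i != lost_index]
rebuilt = xor_blocks(survivors + [parity])

print(from_mirror == rebuilt == data[lost_index])
```

The reconstruction has to read every surviving block, which is why parity-based redundancy tends to cost recovery time where mirroring does not.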
Availability of Database Functionality
Safeguarding access to storage devices is useless if the organization cannot guarantee a DBMS that is permanently up and running as well. A first, simple approach here is to provide for manual failover of DBMS functionality. This means that a spare server with DBMS software is on standby, possibly with shared access to the same storage devices as the primary server. However, in case of a calamity, manual intervention is needed (initiating startup scripts, etc.) to transfer the workload from the primary database server over to the backup server. This inevitably takes some time, hence lengthening the recovery time that can be achieved.
A more complex and expensive solution, but with a much better impact on the RTO, is the use of clustering. In general, clustering refers to multiple interconnected computer systems working together to be perceived, in certain aspects, as a unity. The individual computer systems are denoted as the nodes in the cluster. The purpose of cluster computing is to improve performance by means of parallelism and/or availability through redundancy in hardware, software and data. Availability is guaranteed by automated failover, in that other nodes in the cluster are configured to take over the workload of a failing node without halting the system. In the same way, planned downtime can be avoided. A typical example here is a rolling upgrade, where software upgrades are applied one node at a time, with the other nodes temporarily taking over the workload. The coordination of DBMS nodes in a cluster can be organized at different levels. It can be the responsibility of the operating system, which then provides specific facilities for exploiting a cluster environment. Several DBMS vendors also offer tailored DBMS implementations, with the DBMS software itself taking on the responsibility of coordinating and synchronizing different DBMS instances in a distributed setting.
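As a rough illustration of automated failover and rolling upgrades, consider the following toy Python model. The classes `Node` and `Cluster`, the node names and the version strings are all hypothetical; an actual cluster would rely on the facilities of the operating system or the DBMS vendor, which are not modeled here.

```python
class Node:
    def __init__(self, name, version):
        self.name = name
        self.version = version
        self.online = True


class Cluster:
    def __init__(self, nodes):
        self.nodes = nodes

    def route(self, query):
        """Send the query to the first online node (automated failover)."""
        for node in self.nodes:
            if node.online:
                return f"{node.name} (v{node.version}) handled: {query}"
        raise RuntimeError("no node available")

    def rolling_upgrade(self, new_version):
        """Upgrade one node at a time; the other nodes keep serving queries."""
        served_by = []
        for node in self.nodes:
            node.online = False                       # take only this node out
            served_by.append(self.route("SELECT 1"))  # the cluster still answers
            node.version = new_version
            node.online = True
        return served_by


cluster = Cluster([Node("db1", "1.0"), Node("db2", "1.0")])

cluster.nodes[0].online = False     # simulate an unplanned failure of db1
answer = cluster.route("SELECT 1")  # db2 takes over without manual intervention
cluster.nodes[0].online = True

log = cluster.rolling_upgrade("1.1")
print(answer)
print(log)
```

With a single node, `rolling_upgrade` would raise an error, since there would be no other node to take over the workload; this is exactly the single point of failure that clustering is meant to avoid.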
Availability of the Data
A last concern in the context of business continuity is the availability of the data itself. Many techniques exist to safeguard data by means of backup and/or replication. These techniques all have a different impact on the RPO and RTO, resulting in a different answer to the respective questions ‘how much data will be lost since the last backup, in case of a calamity?’ and ‘how long does it take to restore the backup copy?’. Of course, here as well, a tighter RPO and/or RTO often comes at a higher cost. Some typical approaches include:
- Tape backup
- Hard disk backup
- Electronic vaulting
- Replication and mirroring
- Disaster tolerance
- Transaction recovery
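The two questions posed before the list can be turned into a back-of-the-envelope calculation. The Python sketch below assumes a simple periodic backup scheme without transaction recovery; the function names and all figures (database size, restore rates, overhead for retrieving tapes) are hypothetical and serve only to show how the choice of technique shifts both numbers.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval):
    """Everything written since the last backup is lost, so the worst case
    is a calamity just before the next backup would have run."""
    return backup_interval


def restore_time(backup_size_gb, restore_rate_gb_per_hour, fixed_overhead=timedelta(0)):
    """Time to copy the backup back, plus any fixed overhead (e.g. fetching tapes)."""
    return fixed_overhead + timedelta(hours=backup_size_gb / restore_rate_gb_per_hour)


# Hypothetical comparison for a 500 GB database.
nightly_tape = {
    "data_loss": worst_case_data_loss(timedelta(hours=24)),
    "restore": restore_time(500, 100, fixed_overhead=timedelta(hours=2)),
}
hourly_disk = {
    "data_loss": worst_case_data_loss(timedelta(hours=1)),
    "restore": restore_time(500, 250),
}

for name, estimate in [("nightly tape", nightly_tape), ("hourly disk", hourly_disk)]:
    print(name, estimate["data_loss"], estimate["restore"])
```

Estimates like these can then be set against the RPO and RTO from the contingency plan to judge whether a given approach is sufficient.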
In this article, we provided a database perspective on business continuity. Starting from a contingency plan, we elaborated on various single points of failure: availability and accessibility of storage devices, availability of database functionality and availability of the data itself.
For more information, we are happy to refer to our upcoming book, Principles of Database Management, see www.pdbmbook.com.