Disaster Recovery - Strategies for IP Communications
November 18, 2009
Avoiding or ignoring expenses related to backup and disaster recovery planning is one of the invisible ways carriers and service providers cut costs. The downward pressure on profitability, the proliferation of smaller-revenue service operators and the inherent uncertainty in the global telecommunications industry all perpetuate this emerging trend of a less protected service offering. Consequently, these factors contribute to lower customer expectations of quality of service, increased consumer skepticism about using ‘off-brand’ service providers and a cumulative lag in the overall adoption rate of next-generation communications technology.

While voice (VoIP) and video applications that run over the Internet are often referred to as the ‘wild west’ of communications for their unregulated nature and untapped potential, enterprises and consumers demand a Wyatt Earp-style rule of law in the form of Service Level Agreements (SLAs) and 99.999% (“5-9s”) uptime guarantees. This demand has led to success for IP telephony service providers that invest in disaster recovery planning and, in turn, experience less downtime and fewer outages. Once an operator has confidence in the stability of IP telephony, actually using VoIP to mitigate disasters (through find-me services and portable SIP IDs) becomes possible as a value-added application of the technology.

The crux of the problem is that, between disaster recovery planning and the investment in the required redundancies, the price tag is inevitably too steep for small operators. The more than 130 carriers and service providers that have contracted with VoIP Logic for VoIP infrastructure-as-a-service and engineering support for their respective services have colored my perspective on this issue. Among this group there is a small yet growing appetite for more expensive forms of backup and disaster recovery that translate into an ability to effectively serve customers with higher expectations.
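To make the “5-9s” figure concrete, the downtime that a given uptime guarantee allows can be worked out directly. The following is a minimal sketch, not a standard SLA formula: the function name `allowed_downtime_minutes` is hypothetical, and a 365-day year with no excluded maintenance windows is an assumption.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year, no maintenance exclusions


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an uptime target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    # Compare common uptime targets side by side.
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions a 99.999% guarantee leaves a little over five minutes of total downtime per year, which is why a single unprotected outage can consume the entire budget.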
The general feeling is that investment in quality of service and uptime is directly related to lower customer churn. This article will detail factors to consider when aligning backup and disaster recovery requirements with budget. My goal is to provide a working knowledge of these topics so that carriers and service providers have a transparent view of the risks and rewards of committing varying quantities of time and capital to preparedness.

Here are some of the critical questions to answer in assessing investment in backup and disaster recovery:

1) What happens under various scenarios of network or device outage?

There are many excellent books and treatises on high availability, backup and disaster recovery, such as High Availability and Disaster Recovery: Concepts, Design, Implementation by Klaus Schmidt (2006) and IT Disaster Recovery Planning For Dummies by Peter Gregory, which is informative and well reviewed. Consider this brief article a short primer on the fairly scientific process of determining the value of an investment in protection.

Backup and disaster recovery are contingent on different types of simple and geographic duplication in each of the production hardware/software platforms, as well as in the network and resources that protect those platforms. An ongoing series of operational milestones is required to confirm that a disaster recovery plan works as advertised.

A service provider can choose among three layers of protection, all of which provide a form of insurance against outages: internal redundancy (often called “Backup”), LAN duplication (“Redundancy”) and WAN redundancy (“Disaster Recovery”). I like to think in terms of these layers because they also represent the geographic footprint of protection, from a hard-disk failure, a problem protected against within the machine, to a full power-grid outage in a metro area, a problem protected against via geographic duplication.
Because managed services operators (like VoIP Logic) have geographic footprints in so many hub cities, this type of infrastructure redundancy is a service we provide to the limits of the underlying technology. Generally speaking, carriers will opt for more backup and redundancy capabilities in a managed service offering, both because it is packaged more economically there and to offset some of their outsourcing concerns.

Hardware backup comes in many forms. Many industrial-grade servers and systems come equipped with varying types of internal redundancy such as dual power supplies, multiple hard drives, RAID protection of hard-drive data, abundant RAM and four or eight CPUs. Even on the slimmest budgets, extra expense on some type of data protection (RAID or other) and multiple power feeds can be a wise investment. (Usually for a few hundred dollars per server you can upgrade with additional drives, RAM, CPUs or power supplies.) When providing hardware as part of a solution, we have found success and reliability in standardizing on a robust platform with which we are very familiar; it requires less system administration time and gives us a built-in network of identical server deployments for disaster recovery. I recommend an investment in a powerful SNMP monitoring and alarming tool to track all processes running on all servers. This is an inexpensive way to keep track of how many problems you are having before deciding to spend further.

For the most part, LAN redundancies include a second hardware component, a second network feed or a second power feed in the same geographic location, providing hot, warm or cold ‘standby’. The additional server, or custom hardware, is set up with full software licensing and active scripts comparable to the primary server. Sometimes a load balancer is used to provide a shared primary server. Bandwidth is fed from a BGP interconnection or from two distinct paths to the Internet managed by a flexible dual-WAN router.
Power is redundantly fed from, at minimum, a different UPS and, at most, a different power source.

The most important part of LAN redundancy is confirming that it works. Testing and failover planning on a scheduled basis (at least annually) is a viable way to confirm the redundancy is there when it’s needed. It is also important to understand and test the logistics of data loss and the restore process when systems switch over: are live calls lost, are recent DB updates lost, and so on. LAN redundancy is the mid-tier of investment in backup and disaster recovery; it is where high-uptime guarantors must spend more effort and treasure and where consumer-focused, low-cost leaders spend less. In my experience these investments are a great resource if ensuring uptime is worth it.

The single most valuable investment in redundancy, in my experience at VoIP Logic, is in two or more power feeds from different sources. In the past decade or so one recurring cause of outages has been electricity, often due to increasingly high power demands in a small footprint. Surprisingly, because of this demand on power, frequent system testing and upgrades, along with access to redundant power feeds, have risen to the top of our list as we scout out new colocation facilities.

When it comes to spending on WAN redundancies, service providers are usually protecting a large subscriber base, serving a disaster-prone area, or providing an enforced SLA that requires true 99.999% uptime. With WAN redundancies, disaster recovery is provided in the event of a large power grid outage, Internet outage, full bandwidth loss or other catastrophic failure. This approach involves using two or more geographic locations with mirrored systems and well-defined primary-secondary and/or load-balancing procedures. More and more vendors support software that allows WAN duplication, particularly if their service provider customers install managed bandwidth between the remote locations (for easier DB duplication).
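The benefit of a second system or a second site can be estimated with the standard parallel-availability calculation. The sketch below assumes that the redundant units fail independently and that failover is instantaneous and always succeeds. Neither assumption holds exactly in practice: two servers on the same power feed or in the same building are an obvious case where independence breaks down, which is the argument for separate power sources and geographic separation. The function name and the sample figure are hypothetical.

```python
def combined_availability(unit_availability: float, units: int) -> float:
    """Availability of `units` redundant copies, any one of which can carry service.

    Assumes independent failures and perfect failover (both are simplifications).
    """
    if not 0 <= unit_availability <= 1:
        raise ValueError("unit_availability must be a fraction between 0 and 1")
    if units < 1:
        raise ValueError("units must be at least 1")
    # The service is down only when every unit is down at the same time.
    return 1 - (1 - unit_availability) ** units


if __name__ == "__main__":
    single = 0.999  # hypothetical availability of one server or site
    for n in (1, 2, 3):
        print(f"{n} unit(s): {combined_availability(single, n):.9f}")
```

On paper, two independent units at 99.9% combine to about 99.9999%. Shared power, shared bandwidth or an untested failover procedure will pull the real figure below that, which is why the scheduled failover tests described above matter as much as the duplicate hardware.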
One of the promises of cloud computing is virtualization, which means resources can be dispersed across geographic locations. VoIP Logic uses WAN redundancy precautions only on its most uptime-sensitive managed services, most notably enterprise Class 5, and to meet the requirements of individual customers. Many smaller operators choose to forgo the additional capital and operational expense of multiple locations.

A more common type of WAN redundancy that can be a useful investment is offsite data protection. This has been one of the few cloud computing businesses with an easy value proposition. Even operators that cannot afford to protect their operations with full redundancy can, at minimum, protect their data so that outages and mishaps are easier to resolve.

There are many places where one can spend money to shore up IT infrastructure. It always comes down to budget. So while I would say that using a managed services provider for infrastructure is a smart way to get more value for the cost, there are also smart decisions on specific redundancies that can save money and deliver better uptime. Often a detailed analysis from your head of Operations can uncover weaknesses. I recommend an annual audit of your VoIP and other IT infrastructure for weaknesses and/or single points of failure, along with steps to remedy these flaws. If nothing else, you will know where you stand on these issues, and you might just find you can afford some intermediate steps that go a long way towards increasing uptime.

http://blog.telephonyonline.com/briefingroom/2009/11/18/disaster-recovery-strategies-for-ip-communications/