Failure is not an option in business, and redundancy is the solution. You can offer customers systems that keep running when the hardware fails.
Server failure used to mean widespread panic and a scramble to find backup tapes and a spare server to restore onto. Inevitably this meant business interruption, user frustration and even data loss. Recovering from a backup can take hours; even planned maintenance can take a server down during business hours, leaving users unable to work. Today there is no excuse for this situation.
Redundancy used to mean using expensive clustering or other technologies and a plethora of additional hardware; now there are solutions available for even the smallest of companies. You can scale them to the size of the company, choose the appropriate price points and manage them remotely as part of a service that you offer to the business.
Disaster Recovery or Business Continuity
Let’s get this one out of the way right from the start. Disaster recovery and business continuity are not the same thing. No matter how many times you hear people use them interchangeably, they address different situations and require different processes.
Disaster recovery plans assume a complete failure of all equipment and even loss of the site. The sole focus of disaster recovery is to recreate the entire environment in the same state it was in before everything broke. Business continuity is different. Rather than trying to recreate everything, it looks at a more limited remedy: what is required to get key systems and key personnel working. Instead of supporting email for 500 people, this might mean providing access for just 20. Rather than a response time of under 2 seconds, you might have to live with 20. In short, business continuity is a stage on the way to disaster recovery: you get the essentials working immediately, with full restoration following later.
When a small business customer relies on their IT so much that they can’t do business without it, backing up data isn’t enough any more. They need a service that covers applications as well, with seamless recovery to get them back up to speed before business is disrupted.
The traditional way of providing failover is either a fault-tolerant machine with duplicate hardware – which pushes up the price of the server by adding a second power supply, network card, CPU or even motherboard – or High Availability (HA). HA links several machines together so that when one computer fails, another comes online and users see no difference. You can link multiple computers together in this way. The applications run on the servers but the data is stored on the network.
To make high availability work, the servers use a heartbeat to monitor each other. The biggest risk is that the servers all lose network connectivity at the same time, causing every server to assume control (a "split-brain") with a risk of data corruption on the shared storage. The servers can be on different sites or at the same location, providing support for a business-critical application.
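The heartbeat logic can be sketched in a few lines of Python. This is an illustrative sketch only – the class and its quorum rule are invented for the example, not taken from any HA product. The point to notice is that a standby only takes over when the primary has gone quiet and the standby can still see a majority of the cluster, which guards against the split-brain scenario above.

```python
import time

class HeartbeatMonitor:
    """Watches peer heartbeats; promotes a standby only when it can
    still see a majority of the cluster (a simple split-brain guard)."""

    def __init__(self, peers, timeout=3.0):
        self.timeout = timeout
        # Assume every peer was alive when monitoring started.
        self.last_seen = {p: time.monotonic() for p in peers}

    def beat(self, peer, now=None):
        """Record a heartbeat received from a peer."""
        self.last_seen[peer] = time.monotonic() if now is None else now

    def alive(self, now=None):
        """Peers heard from within the timeout window."""
        now = time.monotonic() if now is None else now
        return {p for p, t in self.last_seen.items() if now - t < self.timeout}

    def should_take_over(self, primary, now=None):
        """Take over only if the primary is silent AND we, plus the peers
        we can still see, form a majority -- otherwise it is probably our
        own network link that has failed, not the primary."""
        up = self.alive(now)
        total = len(self.last_seen) + 1        # peers plus ourselves
        quorum = total // 2 + 1
        return primary not in up and len(up) + 1 >= quorum
```

If the standby can see nobody at all, it stays passive rather than mounting the shared storage – the conservative choice, since grabbing control while partitioned is exactly how the corruption described above happens.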
Load balancing is often used in conjunction with high availability. Rather than having machines that do nothing but wait for a failure, you use them to provide services to users. This is ideal for several offices spread over a wide area and gives users the response they need from a local machine. Each server works in read-only mode, but they all pull data from shared network storage, ensuring that only one copy of the data is stored.
When any single server fails, users are redirected to the other servers while you repair or replace the failed server and bring it back online. As with high availability, this can be used to support multiple branch offices, or the servers can be part of a pool supporting a business-critical application at a single site. However, for every server you bring into a high availability or load balancing group, you need to license the operating system and any applications on it. This can make it an extremely expensive option.
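To make the redirection concrete, here is a minimal Python sketch (the names are invented for illustration, not a real product's API) of a round-robin pool that steers users away from failed servers until they are repaired:

```python
import itertools

class ServerPool:
    """Round-robin load balancing that redirects users away from
    servers marked as failed, and back again once repaired."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.failed = set()
        self._cycle = itertools.cycle(self.servers)

    def mark_failed(self, server):
        self.failed.add(server)

    def mark_repaired(self, server):
        self.failed.discard(server)

    def pick(self):
        """Return the next healthy server, skipping failed ones."""
        for _ in range(len(self.servers)):
            server = next(self._cycle)
            if server not in self.failed:
                return server
        raise RuntimeError("no servers available")
```

Users never see the failed server; its traffic is simply absorbed by the rest of the pool until it is marked repaired and rejoins the rotation.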
Local or remote?
This is a big question and one that goes to the heart of the final solution – and the price. Is this about providing a solution on the business premises that copes with failure of a single piece of hardware or is this about using external resources in case of a major failure?
The advantage of a local solution is the ease of setup and management, and the lower cost. The downside is that a single power failure will take out all the hardware, leaving all your plans pretty much useless. Even the failure of an inexpensive component such as an ADSL router will leave remote and home workers unable to connect and work. An off-site solution is still relatively easy to set up but is more expensive. No matter what happens to the primary site, you should be able to provide a service to users. The downside is that you have to size bandwidth, purchase rack space and install multi-site management tools.
The ideal solution is to have a multi-site policy where you use spare capacity locally to protect against a single piece of hardware failing with another layer of redundancy off site, in case of catastrophic failure.
Virtual or physical?
However existing servers are deployed, failover servers can be physical or virtual. Virtual servers are the cheapest option and give you the flexibility to get the customer running on any hardware. They can also be moved back from virtual to physical machines at any point, allowing you to reinstate a server.
This is where you will need to consider the disaster recovery and business continuity plans. Unless there is a significant reason to do otherwise, business continuity would use virtual servers temporarily to provide core services, while the disaster recovery plan covers replacing hardware and then reinstating the full service level.
It’s not a rigid choice between physical and virtual. Over the last decade the hardware appliance market has grown significantly and there’s now a range of server appliances designed to act as a consolidation point for failover services. Deployed as a single physical device, appliances like the PlateSpin Forge use physical-to-virtual (P2V) technologies to replicate the data from existing servers onto a device that can run those servers in an emergency and restore them to their original state once new hardware is obtained. Another alternative is to use remote services such as Neverfail (www.neverfail.co.uk) or SOS Standby Server (www.sosstandbyserver.com/replication.htm). These services replicate the customer’s data to the provider’s site; when something happens, the company’s applications run on the provider’s servers until they can be restored to the customer’s own hardware. Some of them work much like an insurance policy: the customer pays a fixed fee whether they need to use the service or not.
As well as replicating the server environment, Neverfail has a series of Continuous Availability plug-ins for mission-critical applications including Exchange Server, BlackBerry, SQL Server and SharePoint, as well as anti-virus and anti-spam add-ons.
While all of these applications have their own protection mechanisms, none of them are real-time solutions. Instead they use a combination of a master copy and log shipping, with the logs applied out-of-band to the copy. This means there is always a risk of log files being lost. However, as all of these applications are database-driven, they have recovery mechanisms that allow them to recreate or request the data again. Servers are replicated to machines hosted by Neverfail at its data centre; when there is a failure, the replicas are brought online while the original machines are replaced. Once the servers have been replaced on the client site, the Neverfail copies are rebuilt onto those machines.
All of this takes time, and depending on the size of the client it may be better to suggest that critical services such as email are hosted from the start. This means they are accessible from anywhere, constantly protected against any disaster or problem, and can be quickly and easily replaced or restored.
Most managed service providers have reseller programmes that allow you to sell their services to your clients and retain control of the accounts. In effect, you become the named support person for your customer.
In this scenario you use spare capacity on existing servers to keep virtual copies of applications installed on other servers. Here, four applications are installed across three servers; to provide redundancy, each server keeps a virtual machine copy of another server and its application.
In this example a number of servers are protected by the PlateSpin Forge appliance. These could be physical servers or virtual servers – each is replicated as its own VM inside the Forge appliance. If any of the servers fail, the copy on the Forge appliance takes over and keeps the enterprise running.
In this example servers 1, 2 and 3 are all clustered. Servers 1 and 2 are on the same site; should server 1 fail, server 2 takes over running the application. Servers 2 and 3 continue the cluster, so that if server 2 is also lost, server 3 carries on running the application. As the other servers come back online they resync and re-establish the cluster.
In this example office 1 has multiple servers and the PlateSpin Forge appliance. Office 2 and office 3 have one server each and these are replicated over the WAN to the Forge appliance at office 1. If any server goes down the application can be run from the Forge appliance while the server is replaced and then resynchronised.
Choosing the architecture
There are many ways to do server replication, but not all of them give you perfect redundancy. Log shipping keeps the servers in sync, but with a slight delay: instead of data being written to all servers at the same time, synchronisation is done through activity logs, which are shipped to the secondary server on a regular basis and then applied. This is how the continuous replication features in Microsoft Exchange Server 2007 work. Log shipping is sometimes also called transactional replication.
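In Python terms, log shipping looks something like this sketch (in-memory stand-ins for what are really transaction log files; all names are invented for illustration):

```python
class LogShippedPair:
    """Primary applies writes immediately and appends them to a log;
    an out-of-band job ships the log to the secondary and replays it,
    so the secondary always lags by the unshipped entries."""

    def __init__(self):
        self.primary = {}
        self.secondary = {}
        self.pending_log = []    # entries written but not yet shipped

    def write(self, key, value):
        self.primary[key] = value
        self.pending_log.append((key, value))

    def ship_logs(self):
        """Run periodically, e.g. every few minutes by a scheduler.
        Anything still in pending_log is lost if the primary dies."""
        for key, value in self.pending_log:
            self.secondary[key] = value
        self.pending_log.clear()
```

The window between ship_logs runs is exactly the data you stand to lose, which is why services built on log shipping lean on the application's own database recovery mechanisms to recreate or re-request missing entries.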
Merge replication is similar to log shipping and is generally used with database applications: a copy of the database is taken, but instead of applying the changes via log files, triggers are used to copy the changes. A snapshot is a point-in-time copy: it stores a complete image of the server and/or data, and each new copy overwrites the previous one. Because it takes time to create and copy a snapshot, it is useful as a regular, timed backup technique but not ideal for server failover.
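The snapshot behaviour can be sketched as follows (illustrative Python only, not any product's API). Note that each snapshot is a full copy, which is what makes it slow for large data sets, and that only the most recent snapshot survives:

```python
import copy

class SnapshotStore:
    """Point-in-time copies where each new snapshot overwrites the last."""

    def __init__(self, data):
        self.data = data
        self.snapshot = None

    def take_snapshot(self):
        # A complete image every time -- cheap here, slow for real servers.
        self.snapshot = copy.deepcopy(self.data)

    def restore(self):
        """Roll the live data back to the last snapshot taken."""
        if self.snapshot is None:
            raise RuntimeError("no snapshot to restore")
        self.data = copy.deepcopy(self.snapshot)
        return self.data
```

Anything written between snapshots is unrecoverable once the next snapshot runs, which is the gap that rules snapshots out for failover.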
Real-time replication copies any change to a secondary location as it happens. Depending on the type of files, this is generally done in one of two ways: file or block. File replication copies the entire file as soon as it is written; this works well for unstructured data where file sizes are small, such as standard office documents. Block replication watches the blocks on the device and, whenever one changes, replicates that block to the remote device; this is ideal for large files with frequent small changes.
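A toy Python sketch of block replication follows (a tiny block size for readability; real systems use 4 KB or larger blocks and also handle truncation, which this sketch does not):

```python
def split_blocks(data, size=4):
    """Divide a byte string into fixed-size blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def changed_blocks(old, new, size=4):
    """Return (index, block) pairs for blocks that differ -- only these
    need to cross the wire to the remote device."""
    old_blocks, new_blocks = split_blocks(old, size), split_blocks(new, size)
    return [(i, block) for i, block in enumerate(new_blocks)
            if i >= len(old_blocks) or old_blocks[i] != block]

def apply_blocks(old, changes, size=4):
    """Rebuild the remote copy by patching only the changed blocks."""
    blocks = split_blocks(old, size)
    for i, block in changes:
        if i < len(blocks):
            blocks[i] = block
        else:
            blocks.append(block)
    return b"".join(blocks)
```

Only one four-byte block crosses the link in the example below, even though the file is twelve bytes long; that ratio is what makes block replication attractive for large files with small, frequent changes.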
If you are designing a multi-site solution, be aware of the needs of the replication service. The first time you set up any replication you will need to copy the primary to the secondary in full. If possible, do this by bringing the secondary to the same site as the primary server; after that, you can take it to the remote site and re-establish the links. This first sync consumes a substantial amount of bandwidth and is an ideal overnight or weekend job. After that, the updates should be small but regular.
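A rough back-of-the-envelope calculation shows why seeding on-site matters. The figures here are assumptions for illustration (500 GB of data, a 10 Mbit/s link, 80% effective throughput), not recommendations:

```python
def transfer_hours(gigabytes, mbps, efficiency=0.8):
    """Rough time to copy a data set over a WAN link.
    efficiency approximates protocol overhead and link contention."""
    bits = gigabytes * 8 * 1000 ** 3              # decimal GB -> bits
    seconds = bits / (mbps * 1000 ** 2 * efficiency)
    return seconds / 3600
```

At these assumed figures, a full 500 GB seed takes roughly 139 hours over the WAN – far beyond a weekend window – while a 2 GB nightly delta on the same link takes well under an hour. Hence: seed locally, then let the small regular updates flow over the WAN.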
One way to make better use of the bandwidth is a WAN accelerator such as the Bluecoat PacketShaper (www.bluecoat.com). These devices improve WAN connectivity by prioritising and compressing traffic. Bluecoat is also adding security features to the appliance, offering end-point security, which may make it an easier sell to the customer.
If the client has multiple sites, you may want to consider building them an architecture that replicates back to their headquarters and then out to a data centre. Alternatively, go direct to the data centre and reduce the amount of traffic. There are pros and cons to both. In the first case, using the head office means they have a single consolidated copy of what is happening at the branches, which can then be backed up with the rest of their systems. With an off-site data centre, you can offer backup services to ensure that everything is backed up regularly. If you don’t want to build the solution either at your site or on machines that you are hosting, you can again use a managed provider such as Neverfail.
Your solution needs to be a combination of multiple technologies for maximum protection. It must work with all the operating systems the clients have. For example, SOS only supports Windows servers at the moment so Linux servers would need an alternative solution. You need the right kind of replication for the data. And the system must match the business continuity and disaster recovery needs of the business.
The business case
When customers hear the phrases high availability, clustering or redundancy, they often get hung up on how much it’s going to cost. The problem is that the cost to their business of not doing this can be far higher. In addition, if the business is subject to any compliance regulations, providing these features is no longer optional; it’s a legal necessity.
The cost doesn’t have to be high. Start by using the features built into the customer’s operating system. Moving the customer to a virtual environment using existing servers and on-site virtual machines is a big step forward, especially as the extra hardware cost can be minimal: hard disks and memory are relatively cheap, so the initial investment is modest. Extending this with offsite replication means they can recover from any failure: fire, flood, natural disaster, power or hardware failure, even theft. It also gives you the opportunity to upsell additional hardware capacity and services.
With hardware such as blade systems, you can build a dense computing solution for clients at a reasonable, easily recouped cost. Coupled with services such as backup and continuous data protection for the SAN/NAS, you can build a replicated solution that gives clients the best possible outcome in the case of a disaster, without pricing the system so high that they decide to rely on luck instead. Alternatively, steer the customer towards a managed service, which saves the hardware cost of on-site servers and the power cost of running them. The level of protection it offers is also likely to be well received by their insurance company, potentially lowering their premiums.