Failure at Every Connection

I’ve been working on direct response marketing systems for nine years. We started mainly with on-premises media tracking software and call center applications for a small staff of 25 sales agents; within a year we had complex commission generation engines, continuity plan templates, and dynamic product catalogs spread across multiple servers in several locations, serving 2,000 at-home agents. The school of hard knocks taught me that a failover plan needs to be in place for every piece of your operation. I’ve taken that experience and embedded it into every level of Blue Vase Marketing, LLC. As the CIO, I am responsible not just for the software applications, but for the hardware and network as well.

Whenever you implement a component into your architecture, it is imperative to step back and ask, “What happens when it breaks?” Any good software engineer puts exception handling into their classes and methods to log errors, send alerts, or retry actions. Agent consoles and websites should not dump “MDB2 Error: constraint violation Query: _doQuery: [Error message” onto the user’s screen; they should politely say, “I’m sorry, an error has occurred. Please try again later. Click here to return.” In the direct response industry every show counts; every show needs to be accounted for and hit the ratios required to make that airing profitable. If your phone system goes down for 10 minutes, your company could lose tens of thousands of dollars. If your agents can’t input an order when the call comes in and someone writes the information down outside your secure system, you lose not only the source of the order, but you also put your PCI compliance at risk.
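As a rough illustration of that pattern, here is a minimal Python sketch, not our actual code: the table, column names, send_alert helper, and the DB-API-style connection are all assumptions for the example. The point is that the full technical error goes to the log and the on-call alert, while the agent only ever sees the polite, retryable message.

    import logging

    logger = logging.getLogger("order_console")
    FRIENDLY_MESSAGE = "I'm sorry, an error has occurred. Please try again later."

    def send_alert(message):
        # Hypothetical hook: page the on-call engineer (email, SMS, etc.).
        pass

    def save_order(db_connection, order):
        """Attempt to insert an order; never leak raw errors to the agent."""
        try:
            cursor = db_connection.cursor()
            cursor.execute(
                "INSERT INTO orders (source_code, sku, amount) VALUES (%s, %s, %s)",
                (order["source_code"], order["sku"], order["amount"]),
            )
            db_connection.commit()
            return {"ok": True}
        except Exception as exc:
            # Log the full technical detail and alert engineers...
            logger.exception("Order insert failed: %s", exc)
            send_alert("Order insert failed on agent console: %s" % exc)
            # ...but show the agent only the polite, retryable message.
            return {"ok": False, "message": FRIENDLY_MESSAGE}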

At Blue Vase Marketing, LLC we attack every scenario with the expectation that something will fail. We have overflow call centers, and we can redirect telephone calls at the carrier level, at the ACD platform level, and at the physical phone location level. We have implemented two separate LANs, completely separate all the way out to the cloud, and put half the company on each network. One telco’s PoP (point of presence) is copper-based and located on the north side of the building; the other PoP is fiber and located on the south side. If either LAN collapses, we can bring the entire operation up on the one that still works within minutes. This also makes it easy to identify whether a connection problem is on premises or at a third party.
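A simple way to automate the “which side is down?” question is a watchdog that probes a known host through each network path. The sketch below is illustrative only; the probe addresses are placeholder examples, and the actual failover action (repointing routes, redirecting the ACD) would be whatever your environment supports.

    import socket

    # Hypothetical probe targets reachable only through each LAN's uplink.
    PROBES = {
        "north-copper": ("203.0.113.10", 443),
        "south-fiber": ("198.51.100.10", 443),
    }

    def link_is_up(host, port, timeout=3):
        """Return True if a TCP connection through this path succeeds."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def check_links():
        status = {name: link_is_up(*addr) for name, addr in PROBES.items()}
        if all(status.values()):
            print("Both uplinks healthy:", status)
        elif any(status.values()):
            healthy = [name for name, up in status.items() if up]
            print("One uplink down; fail over to:", healthy[0])
        else:
            print("Both uplinks down -- escalate to the carriers:", status)
        return status

    if __name__ == "__main__":
        check_links()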

Ask your third-party companies about their disaster recovery plan. Ask yourself about your own disaster recovery plan. Three questions that should fall out of your mouth when implementing or shopping for a system are: What happens if you lose power at your facility? How often do you back up your database, and where do you store the records? Who do I call when it breaks, and how many rings before the phone is answered? Most companies guarantee 99.9% uptime in their SLA. Depending on how the measurement window in the contract is worded, that still allows your system to be down about a minute and a half a day, roughly 43 minutes a month, or almost nine hours a year; to a 24/7 facility that can cause some serious bleeding.
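The arithmetic behind those downtime figures is worth keeping handy when you read an SLA. A quick sketch:

    # Allowed downtime (in minutes) implied by an uptime SLA, for common windows.
    MINUTES = {"per day": 24 * 60, "per month": 30 * 24 * 60, "per year": 365 * 24 * 60}

    def allowed_downtime(uptime_percent):
        downtime_fraction = 1 - uptime_percent / 100.0
        return {window: minutes * downtime_fraction for window, minutes in MINUTES.items()}

    for sla in (99.0, 99.9, 99.99):
        print(sla, {w: round(m, 1) for w, m in allowed_downtime(sla).items()})
    # 99.9% works out to about 1.4 minutes a day, 43.2 minutes a month,
    # and 525.6 minutes (roughly 8.8 hours) a year.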

All systems break. Computers, routers, switches, and ports have very small parts that require specific voltage parameters and temperature conditions; even with climate control and surge protection, switches reset and servers fry. The code in a single software system module may have millions of lines, thousands of variables, hundreds of database connections, and dozens of programmers. Even with source control and versioning, accidents happen, code gets overwritten, and database files become corrupted. The internet itself is in need of an infrastructure revival as more and more people stream music, upload pictures to Facebook, and play with their smartphone apps. The important thing for you and your business’s systems and interconnectivity is to be prepared for something to break, because it will. Then… simply fail over.