02.28.07
When things go wrong…
Last week we had a processor failure on a multi-processor box, and found ourselves in a downward spiral, one that lasted about 5 hours.
Before I get into the details, I want to offer an apology to those customers who were impacted (due to the distributed nature of our service, no single fault affects everyone). We take great pride in our operation, and we do know that our customers – and their partners, who may or may not be our customers – depend on our service.
This interruption, honestly, was not supposed to happen. One of the virtues of a multi-processor box is that – in theory – you can lose one processor without losing the server (I’m not blaming the hardware; I just don’t want you to think we lack redundancy). But of course, that is why we have the ability to roll over to additional servers – which we did.
This is when the “spiral” became very visible…
