Fault tolerance is the ability of a system to continue operating properly even when one or more of its components fail. It's a critical property for any system that needs to be reliable, because let's face it, shit happens (especially in software).
I told the PM that we couldn't just deploy the app on a single EC2 instance and call it a day - we needed to build in some fault tolerance so it wouldn't fall over like a drunken frat boy every time there was a hiccup in the cloud.
Management keeps harping on about five nines of uptime, but without investing in some serious fault tolerance measures, they'll be lucky to get five minutes of uptime before the whole thing goes tits up.
Catastrophic Failover: A cautionary tale about how fault tolerance measures like failover can sometimes bite you in the ass if not implemented carefully.
Eradicating Non-Determinism in Tests: Tips for eliminating sources of non-determinism (like relying on the system clock) that can cause intermittent test failures and make fault tolerance harder to achieve.
Developing Patterns in Enterprise Software: An overview of various pattern catalogs for enterprise software development, which often touch on fault tolerance concepts.
Note: the Developer Dictionary is in Beta. Please direct feedback to skye@statsig.com.