Airlines Are Fail Operational–NASA Is Only Fail Safe

As I pulled into Titusville last week to the news that the launch had been scrubbed due to a sensor failure, I had similar thoughts to the following from George William Herbert, posted at sci.space.policy today, but he wrote them down, and I didn’t:

“Something has been nagging me since the current round of hydrogen depletion sensor problems started on Discovery’s launch attempt, and I haven’t seen any good comments come up on the newsgroups or other commentary, so I’m going to launch it out there.

The Shuttle design was intended to be highly reliable and to have multiple redundant sensors and systems in most key areas. By and large, other than structural items where it’s hard to have another whole heatshield under the first one, they have had good success with redundancy covering flight faults and avoiding nasty aborts and the like.

There is a key difference to be seen between the behaviour last week trying to launch Discovery, though, and what typically happens with say a large 747 jetliner and its typical operational cycle.

Airliners have what’s called a Minimum Equipment List. This covers a set of systems that have to be operational in order for the vehicle to safely depart on a flight. The MEL is usually designed so that a number of minor faults are tolerated, and in areas where a fault would cause the aircraft to have to stay and be repaired, where possible an extra set of redundancy is applied so that if four units are needed for safe suitably redundant flight operation, five are installed, and the MEL is four. One sensor or navigation system or whatever can be completely broken, and the required flight safety level is still met with the remaining units.

Airliners are designed that way because it costs serious money when they can’t depart on time… either they have to be repaired in a hurry, which means lots of technicians at each airport and lots of expensive spare parts stocked everywhere (plus, a long enough operating cycle to accomplish the repairs in), or you have to scramble to find another plane to shift to the flight whose aircraft is down with a gripe, and then shift another plane to cover for the one you grabbed, and so on.

Shuttle was designed with an adequate level of systems redundancy for safety considerations, in most systems. It was not designed with an adequate level of systems redundancy for operational considerations. The cost per day of a Shuttle sitting on the pad, the ops crews and the control room crews and the costs of a rollback and destacking are all very significant. The opportunity cost of not being able to fly on time is also not at all a minor issue, with Shuttle’s life span limited by a currently hard deadline and too many ISS flights remaining to get done between now and then.

Redundancy is often described in “N+1” or “N+2” or “2N” terms; shorthand for one or two more units than are required for safe operation, or twice as many as are required. MEL logic really goes to a different level. We should really be looking to “(N+1)+1”, or both safety redundancy and an operational redundancy margin. Defining the safety redundancy factor as the N plus or multiplied by whatever, we can then define an operational redundancy factor, consisting of some margin on top of the minimum safety requirements. In shorthand, let’s say O for Operational Factor = (required safety factor including margins), or for example O = N+1. The operability factor would then be, for example, O+1 or O+2, with the additional operability margin depending on the maintainability of the parts.

Future reusable spacecraft and their operators generally already have a clue about these issues, but it bears repeating in public to make the point. The capsules I am working on should not have to be destacked and disassembled if one out of a set of four units fails while we’re on the pad; either there should be a fifth, or three should be adequate for safe flight including safety margins, and listed in the MEL. The same should go for any other manned orbital project.

Not every system can be made this redundant, but as Discovery is showing, there are many systems for which safety dictated enough redundancy that adding an operability margin on top of that would have not been that difficult. Two wires in the shuttle/tank interface, one more sensor unit, a few pounds of payload capacity lost… and how many millions of dollars lost destacking Discovery the first time, and in this launch delay now?

Thin margins kill costs.”

[Copyright 2005, by George William Herbert]
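
For readers who prefer to see the bookkeeping spelled out, here is a minimal sketch of the dispatch logic George describes. Everything in it is illustrative: the `RedundantSystem` class, its field names, and the `scrub_causes` helper are hypothetical names of my own, and the unit counts are just the four-versus-five example from his post, not actual Shuttle or airliner figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedundantSystem:
    """One group of identical units, e.g. a set of sensors (hypothetical model)."""
    name: str
    installed: int       # units physically fitted to the vehicle
    safety_minimum: int  # units needed for safe flight, margins included (the MEL figure)

    def operability_margin(self) -> int:
        """Units that may fail on the ground without forcing a scrub."""
        return self.installed - self.safety_minimum

    def can_dispatch(self, failed: int) -> bool:
        """True if enough working units remain to meet the safety minimum."""
        return self.installed - failed >= self.safety_minimum


def scrub_causes(systems, failures):
    """Names of the systems that block departure, given failed-unit counts by name."""
    return [s.name for s in systems if not s.can_dispatch(failures.get(s.name, 0))]


# Fail safe only: four installed, all four required. No operability margin.
fail_safe = RedundantSystem("fail-safe sensors", installed=4, safety_minimum=4)

# Fail operational: same safety requirement, one extra unit installed (O+1).
fail_operational = RedundantSystem("fail-operational sensors", installed=5, safety_minimum=4)

# One unit fails in each system while on the pad (assumed scenario).
one_failure_each = {fail_safe.name: 1, fail_operational.name: 1}
blocked = scrub_causes([fail_safe, fail_operational], one_failure_each)
print(blocked)
```

With a single failed unit in each group, only the four-of-four system blocks departure. The five-for-four system still meets its safety minimum, which is the whole point of carrying the extra unit.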

[Update a few minutes later]

Via Clark Lindsey, here’s a good description of the sensor that failed from Bill Harwood.

I should also mention that there’s a good discussion of the difficulties of troubleshooting this failure over at sci.space.policy. Some of the posters there are theorizing that it’s a separation of an electrical conductor that only occurs at cryo temperatures (if so, it would likely be due to differential thermal expansion). They also point out the high costs of figuring out just where it’s happening to the degree necessary to have confidence in flying again. And as always, it points out the fragility of the system, and the danger of relying on a single hardware concept for all of NASA’s human exploration goals. Because this is an element of the external tank, which would be common to all Shuttle-derived heavy lifters, our ability to get to the Moon would be shut down until this issue was resolved.