Single-Event Upsets

Marcia Smith has a report on the anomaly analysis for the SpaceX station resupply mission:

Several other problems also arose during the mission. While berthed to the ISS, one of the three computers on the Dragon spacecraft failed. Dragon can operate with only two computers, and SpaceX chose to proceed with the two functioning units rather than trying to fix the faulty unit while on orbit. According to Suffredini’s charts, Flight Computer-B “de-synched” from the other two “due to a suspected radiation hit” and although it was rebooted successfully, it was “not resynched.” Dragon experienced other anomalies because of radiation as well. One of three GPS units, the Propulsion and Trunk computers and Ethernet switch all experienced “suspected radiation hits,” but all were recovered after a power cycle. Suffredini said that SpaceX is considering whether it needs to use radiation-hardened parts instead, but noted that “rad-hardened” computers, for example, not only are more expensive, but slower. He speculated that the company would ultimately decide to use rad-hardened components in the future unless it is cost-prohibitive.

I had heard that there were also SEUs on the first ISS flight. It’s a young system, with very few actual flights, which is how you learn about things like this. But clearly it has enough redundancy for mission success (including in its ascent propulsion system). There’s a trade between using rad-hard components and utilizing more shielding. I assume that SpaceX is doing that trade right now (and perhaps has been doing so for months).

24 thoughts on “Single-Event Upsets”

  1. Couldn’t a software fix be possible for this? If it could filter for the type of error that radiation causes, it could recognize and route around the errors or even chips on the board.

    1. No. Read the basics http://en.wikipedia.org/wiki/Radiation_hardening

      And I don’t think SpaceX has an easy route to just “switch to rad-hardened components” if they didn’t plan for this as a backup from the outset. It’s mostly not possible to just substitute equivalent rad-hardened components into any random circuit; you need to design for this from the get-go.

    2. The problem with single event upsets (SEUs) is that any bit in the computer can randomly be flipped. It could be in data or it could be in executable code. That makes it difficult to impossible to filter out the error because it could happen anywhere. The best you can do (besides shielding and radiation-hardened components) is to have multiple redundant systems and majority-voting logic. That sounds like what SpaceX did with their design.

      1. Worse than that – I don’t know if they are using them in the onboard computers, but SpaceX uses SRAM-based FPGAs in a lot of the avionics boxes they designed and built for themselves. SEU effects in SRAM FPGA devices not only flip register data or executable code bits, they can also flip configuration bits – in a reprogrammable FPGA device, that means the SEU induced charge in the silicon substrate suddenly changes the actual electrical circuitry the device is configured with into a random new configuration – connections change, or an and-gate changes into an or-gate, etc. Really bad juju.

        It can be less of an issue when an avionics box is on a booster and as such only needs to work for the duration of a stage burn, where you can use redundant circuit design to make sure a neutron or charged particle hit doesn’t impact the mission, but when it’s on a vehicle that you need to stay operational while parked on orbit for a couple weeks the probabilities can catch up with you.

        There are mitigation techniques for SRAM FPGA devices (which I’m sure they used), or they could choose another underlying technology (flash-based FPGAs for example, hold up really really well under SEU).

        It sounds like on this flight they ended up with a hard failure that was not fixed with a power-off reset, so they probably took a hit in something like an I/O and it got nuked for good.
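As a minimal sketch of the "multiple redundant systems and majority logic" idea discussed in this thread (not SpaceX's actual design, which the post does not describe), the voter below compares the outputs of redundant units, returns the value a strict majority agrees on, and reports which units disagreed. The names `majority_vote` and `flip_bit` and the sample values are hypothetical, chosen only for illustration.

```python
from collections import Counter


def majority_vote(outputs):
    """Return (agreed_value, dissenting_indices) for a list of redundant outputs.

    Raises ValueError when no value is reported by a strict majority.
    """
    value, votes = Counter(outputs).most_common(1)[0]
    if votes * 2 <= len(outputs):
        raise ValueError("no majority among redundant units")
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters


def flip_bit(word, bit):
    """Simulate a single-event upset by flipping one bit of an integer."""
    return word ^ (1 << bit)


# Three units compute the same word; unit 1 suffers a simulated bit flip.
good = 0b10110010
outputs = [good, flip_bit(good, 5), good]
value, dissenters = majority_vote(outputs)
print(value == good, dissenters)  # prints: True [1]
```

The dissenting index is what a real system would use to decide which unit to reboot or resynchronize; with only two healthy units left, a disagreement can be detected but no longer outvoted.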

  2. I have no idea how the DragonRider’s fuel tanks are configured, but they’re obviously internal to the capsule, and fuel makes excellent radiation shielding if they can take advantage of it.

  3. Is this really something you have to learn after a few flights, these days? I would think this trade was done ages ago. Though maybe they have to re-think.

    What do/did other LEO spacecraft use with regard to on board computers, Ethernet switches, and GPS units? Rad-hard?

    My only direct experience is with high-apogee satellites, and all our stuff is rad hard cuz we get nailed all the time.

    1. Gregg,

      AIUI, the traditional way to solve the problem kind of sucks. You use very expensive, hard to get rad-hard components, live with computers that are state of the 1990s, and don’t try to do anything very complex. SpaceX (and several other groups) have been pushing to see what they can do with more commercially available computer hardware. I’m no expert in the area, but it seems like some groups like Surrey Satellite have had pretty good results (they’ve had a bird flying in MEO using mostly COTS electronics for something like 7 or 8 years), as have some of the microsat folks. Not sure though if they’re just designing systems that can accept faults and just recover quickly, or if they’re actually trying to make systems that can flat out not go into a fault when hit by radiation. It’s definitely an area I’m reading up on though.

      ~Jon

      1. SpaceX (and several other groups) have been pushing to see what they can do with more commercially available computer hardware.

        You can also up-screen mil and commercial material, even if it wasn’t designed for space; it is one of the least-expensive routes to go. I’m positive SpaceX is doing a lot of this.

  4. The astronauts on shuttle used standard off-the-shelf laptops and they always knew when they passed over the South Atlantic Anomaly because their laptops would crash and reboot.

    An experiment was conducted on shuttle with a highly redundant computer system, meaning something like 8 processors and it was found to be a feasible solution for some things. Just playing the odds that they wouldn’t all get hit at once. I’m certainly not a computer guy but I suppose that it’s only a useful solution for sufficiently slow processes due to the switching time. Seems like they ought to consider something like that.

    1. Ah, so that explains it! My Windows 7 laptop crashes whenever the ISS passes over the South Atlantic Anomaly. It makes sense now. 🙂

  5. Shielding doesn’t really help against cosmic radiation; the best you can hope for is not to make it worse. But shielding can help against solar flares and trapped particles.

  6. If they have problems with Ethernet couldn’t they just use one of the fiber optic versions of it? It is more expensive, but still off the shelf, and the cables are immune to EM interference. I assume they are using ECC memory and the ilk. A lot of the time the “rad hard” chips are simply chips manufactured at older manufacturing nodes with larger cell sizes and less susceptibility to the stray high velocity particle and the like.
    If they are using FPGAs like someone here mentioned that could be a problem… However given Musk’s background they probably use software solutions whenever possible.

  7. The rad-hard trade is a nuanced one, but it basically comes down to a hardware/software spectrum.

    At the hardware end you can use SOI, core memory, BJT-based circuits and other forms that are less susceptible to ionizing radiation in the first place. At the software end you have programs to prevent effects from propagating (memory scrubbers), and programs to recover (lockstep processors, redundant cores). In general the hardware side will reduce the frequency, but you’ll always need some kind of software solution because eventually you’ll get hit.

    I think most of the mammals in the industry go heavy on the software side, and the math is fairly compelling. With the pace of memory and processor development and mass production, a rad-hard architecture might be 100+ times slower for a given cost than a COTS part. Even if you devote 90% of your processor load to memory scrubbing, running checksums, watchdogging your co-processors, and coordinating between, say, 3 redundant cores in order to get comparable reliability to a rad-hard architecture, you come out ahead. The tradeoff is software complexity, but in return you’re getting a non-bottlenecked supply chain with lots of well-understood parts.

    If you make a few fairly painless choices (careful about the type of RAM and ROM you use, upscreening from a handful of candidates) you can get your mean time between failures to a level that software can manage, certainly for LEO and probably for GEO/MEO/deep space missions that don’t go to Jupiter or suffer direct hits from solar flares.

    It’s also instructive to see how nature does it. Deinococcus radiodurans has some hardware protection (manganese antioxidants) but mainly relies on redundancy and software (multiple copies of genome and advanced repair mechanisms).
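The memory scrubbing mentioned in this comment can be sketched roughly as follows. This is an illustration only, and it assumes that critical data is kept in three copies (the comment talks about three redundant cores, not about triplicated memory, so that layout is an assumption). The scrubber takes a bitwise majority across the copies and rewrites any byte that disagrees, so a flipped bit is repaired before a second upset can accumulate in the same place. The function names and the sample payload are hypothetical.

```python
def bitwise_majority(a: int, b: int, c: int) -> int:
    """Each result bit takes the value held by at least two of the three inputs."""
    return (a & b) | (a & c) | (b & c)


def scrub(copies) -> int:
    """Repair three equal-length bytearrays in place; return bytes rewritten."""
    a, b, c = copies
    corrected = 0
    for i in range(len(a)):
        voted = bitwise_majority(a[i], b[i], c[i])
        for copy in copies:
            if copy[i] != voted:
                copy[i] = voted  # overwrite the corrupted byte with the voted value
                corrected += 1
    return corrected


payload = b"attitude=nominal"
copies = [bytearray(payload) for _ in range(3)]
copies[0][3] ^= 0x10  # simulated upset in copy 0
copies[2][9] ^= 0x01  # simulated upset in copy 2, different byte
fixed = scrub(copies)
print(fixed, all(bytes(c) == payload for c in copies))  # prints: 2 True
```

A scheme like this only works if the scrubber runs often enough that two copies are unlikely to be hit in the same bit position between passes, which is the "mean time between failures that software can manage" condition the comment describes.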

  8. Some googling reveals that rad hard processors sell for ~$40k, and half that if you buy them in quantity. That’s a rounding error compared to the cost of a Dragon, even if you need a bunch of them for raw computing power.

    1. If a capsule’s cost of goods is $50M and you need, say, 40 processors, that’s $1.6M or over 3% of the total cost. That is not a rounding error. If you consider, say, a 30% profit, you just ate over 10% of your margin. 40 is probably very low. Your rad-hardness is only as good as the weakest link, so you also need rad hard RAM, non-volatile storage, buffers, power supplies, etc. Realistically adding a rad hard computing architecture to a low-cost space capsule could rival the cost of the space capsule itself.

      However, the true cost of rad hard components is not the sticker price. It is the single-source, long-lead supply chain; the specialized software and firmware skills which may be 10 years obsolete; and the relatively low computing power. Where they start making sense is when you have a several-hundred-million-dollar GEO or deep space mission; but I suspect with computing power so ubiquitous and the easy reusability of modular software the software-heavy approach will start to make more sense there as well. Keep in mind that permanent radiation damage is relatively rare; the much more common scenario is temporary latchup which is cleared by a power cycle or reboot of the affected chip. As long as your long-tail time between failures is longer than the time your software needs to fix those failures, re-establish system state, and resume normal operations, you are okay. As long as you have enough computing overhead devoted to healing thyself, and are somewhat careful about choosing relatively rad-resistant technologies and screening your designs, software should be able to keep up.

      Things get a little tougher for long-duration missions and missions to Jupiter or the inner solar system where you can expect higher and more energetic total exposures. At some point the likelihood of permanent damage to memory locations or processor switches gets large and then software is not nearly as well-developed to deal with it. At that point you need dynamic memory allocation which is usually considered a no-no on critical applications, and you may need on-the-fly re-architecting of FPGA firmware, which to my knowledge does not exist outside maybe some graduate theses. I’m not sure how you could even handle it in a microprocessor or a PROM aside from massive redundancy. So it probably makes sense for some components to remain rad-hard for those mission types, at least until flexible architectures and reactive software are sophisticated enough to compensate for “must not fail” failure points.
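The cost arithmetic at the top of this reply can be checked with a few lines. The inputs are the figures quoted in the thread (about $40k per rad hard processor, 40 processors, a $50M cost of goods, a 30% profit); treating the 30% profit as a fraction of the cost of goods is an assumption, since the reply does not say what base it is measured against. The variable names are illustrative.

```python
unit_price = 40_000          # quoted price per rad hard processor, in dollars
processor_count = 40         # the reply's "say, 40 processors"
capsule_cost = 50_000_000    # the reply's cost of goods for one capsule
profit_rate = 0.30           # assumed to be a fraction of cost of goods

processor_cost = unit_price * processor_count
share_of_cost = processor_cost / capsule_cost
profit = capsule_cost * profit_rate
share_of_margin = processor_cost / profit

print(f"${processor_cost:,} is {share_of_cost:.1%} of cost "
      f"and {share_of_margin:.1%} of margin")
# prints: $1,600,000 is 3.2% of cost and 10.7% of margin
```

Under these assumptions the numbers match the reply's "over 3%" and "over 10% of your margin", and they cover processors only, before any rad hard RAM, storage, or power supplies.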

  9. I wonder if it’s possible to create a laminate that reflects radiation? Or perhaps a magnetic bubble just large enough to deflect radiation from your electronics?

    1. Shielding works for alpha and beta radiation, but it is pretty much useless for high energy cosmic rays or neutrons. Using rad-tolerant devices and the design and programming strategies described here is basically the state of the art.
