14 thoughts on “Computer Problems”

  1. Have you run memtest86? It’s rare that a memory/CPU/bus error gets past that.

    I’ve seen bad filesystem corruption issues from a screwy drive connection (combined with a lousy hard drive); does smartctl report any logged screams from the hardware?

    1. I’m going to run memtest86 overnight tonight to see. If there’s a problem, I’ll be unhappy, because the memory is only a few months old. 16G of 2133 DDR3. The “hard drive” that root is on is an Intel SSD, also a few months old. No obvious issues from smartctl on sda2.

    2. OK, I didn’t run it all night, but I ran memtest this morning for a couple hours, with no errors. I’ll try it again tonight, but that doesn’t seem to be the problem at this point.

  2. S.M.A.R.T. doesn’t necessarily translate well to SSDs and their failure syndromes, since it was primarily developed to spot issues with rotational storage. However, you can try commanding the drive to run the firmware-based ‘short’ self-test to see if it reports any issues. If you have several hours to burn you can also run the ‘extended’ self-test. Once you’ve started a self-test, be sure to wait the requisite amount of time for it to complete before you query the disk’s health status or error log. It helps to keep the disk as idle as possible during this time.

    I’m hoping you are using a modern file system for your root disk such as ext3 or ext4. A journaling file system is more robust to disk issues than the older ext2, which I now consider obsolete.

    So I would recommend the following, although you may not like what you hear. Back up your root (/) file system to another SSD, run from that copy, and see if the problem persists. I’m almost willing to wager it won’t. THEN I would recommend a complete reformat of your failing SSD. There is probably a bad block slipping by that another mke2fs (run with its -c bad-block check) will pick up. I know having a 2nd SSD on hand isn’t necessarily a cheap or viable option, but it’s the least disruptive one IMO.

    Another option, but a risky one if you are relying on this SSD as your only root disk, is to use Intel’s SSD Toolbox to see if there is a firmware upgrade available for your disk. The SSD Toolbox can be downloaded from Intel but runs only on a Windows PC with an Internet connection. So you will have to either boot up Windows on your PC or move the drive over to a Windows PC. You will also need to create an MS-DOS partition on the drive so that Windows can “see” it and assign a drive letter to it, which makes it visible to the SSD Toolbox. There doesn’t actually have to be a file system present in that partition. If your Linux file system partition isn’t occupying the whole disk you can safely slip in a DOS partition beyond it. Most disks, including SSDs, typically come with a DOS partition already installed. When I re-partition for Linux I try to resize the DOS partition to some small footprint and then partition the rest for Linux for exactly this reason.

    There *is* a way to upgrade disk firmware from within Linux using “hdparm --fwdownload” but it requires you to have the proper “.bin” file handy and doesn’t provide any of the manufacturer’s safeguards. It also requires you to use some very *funky* switches to actually get it to work. 🙂

    In theory it is possible to upgrade a disk firmware w/o damaging content and I’ve done it, but I’ve always had a 2nd disk on hand with a backup copy just in case.

    A disk firmware upgrade runs a small risk of “bricking” the drive, which means you’d need a JTAG setup to bring it back to life, and few manufacturers provide field support for that.
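
As a rough sketch of the S.M.A.R.T. self-test sequence described in the comment above, the following Python snippet only prints the smartctl commands in the order you would run them; it does not execute anything. The device name /dev/sda and the helper name selftest_commands are assumptions for illustration (smartctl is normally pointed at the whole disk rather than a partition), and the commands need root when run for real.

```python
import shlex


def selftest_commands(device, kind="short"):
    """Return the smartctl invocations for one self-test cycle, in order.

    `device` is a placeholder such as /dev/sda (an assumption, adjust it).
    `kind` is "short" or "long"; smartctl calls the extended test "long".
    """
    if kind not in ("short", "long"):
        raise ValueError("kind must be 'short' or 'long'")
    return [
        # Capabilities, including the recommended polling time for each test.
        ["smartctl", "-c", device],
        # Ask the drive firmware to start the self-test in the background.
        ["smartctl", "-t", kind, device],
        # After waiting out the polling time: read the self-test log.
        ["smartctl", "-l", "selftest", device],
        # Overall health assessment reported by the drive.
        ["smartctl", "-H", device],
    ]


if __name__ == "__main__":
    # Print only; nothing is executed by this sketch.
    for cmd in selftest_commands("/dev/sda", "short"):
        print(shlex.join(cmd))
```

Run the first two commands, wait at least the polling time that `-c` reports while keeping the disk idle, then run the last two. Passing "long" instead of "short" selects the ‘extended’ test.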

  3. One comment on running memtest86. I agree it’s a great tool, but it’s most effective when directly booted. That allows it to operate across your entire memory, not just the sub-blocks that your kernel allows you to access. So use the standalone version that’s bootable via your BIOS or grub or lilo for your overnight run. A bootable USB stick comes in handy for this.

    1. I did boot it from a stick. I ran it again last night, for eight hours or so. It once again showed no errors, but it seemed to hang on the hammer test, and never got further than it did the night before, which is making me go hmmmmm…

  4. There appear to be known issues with Test 13: Hammer Test, esp. around DDR3 RAM. Which version of memtest86 are you using?

    You can also use the older & forked GPL’d memtest86+ at: http://www.memtest.org/

    Last piece of advice: if you have the time and gumption, test your DIMM sticks individually. This will not only dramatically reduce your test times but give you isolation to a single stick in one go, *IF* your problem is confined to a single DIMM! You will need to use DIMM slot 0 for each. Hopefully you’re at some small number of sticks like 4…

    Also I trust you are not over-clocking?

    memtest86 issues with the Hammer Test are discussed at various sites on the Internet. A quick Google search will yield lots of links. They are also mentioned in the user forum of http://www.passmark.com. I tried to post the links here but got called out for trying to spam your blog.
    Here’s *one* link that will hopefully pass the filter and shed some light:

    http://www.overclock.net/t/1558462/memtest86-hammer-test

    1. Which version of memtest86 are you using?

      Whatever was on their official site to download yesterday. I’m at two sticks, but unfortunately, I can’t easily play with them, because I broke one of the levers on the last installation. And no, I’m not overclocking.

  5. Oh yeah, one more piece of advice: try to group sticks according to manufacturer so that you are not attempting to interleave sticks from different makers. I know, I know, the SPD data is supposed to provide your controller with enough information to deal with it, but mfgs. are also known to sometimes not live up to all that’s promised in the SPD. It just makes life easier not to interleave between mfgs., and if possible between lots as well…

  6. I’m at two sticks, but unfortunately, I can’t easily play with them, because I broke one of the levers on the last installation.

    Erm. A thought has occurred to me: if you have a motherboard with 4 DIMM slots, try moving your sticks to the other pair of slots that have intact levers, assuming your BIOS is OK with having controller slot 0 unoccupied, which may not be the case. Also, you can try re-seating the DIMM w/o levers just to make sure it’s in properly.

    More details concerning memtest86’s Hammer Test can be found here:

    http://www.memtest86.com/troubleshooting.htm#hammer

    Also, there are some remarks at the link below about the Hammer Test and the specific brand of memory you are using. There’s a suggestion there that Kingston HyperX Savage yields better results on the Hammer Test, but as always YMMV.

    http://www.passmark.com/forum/showthread.php?5077-How-to-relate-to-errors-in-Hammer-Test-13

    If your CPU supports ECC (and it may not) you can pony up some more $$$ for ECC DIMMs and go that route if this gets too annoying.

  7. Another suggestion that has been floated and could be a cheap fix: see if your BIOS allows you to step up the voltage and/or cut tRef by 1/4 to 1/2 its default value, esp. if the Hammer Test is logging bit failures. The latter will lower performance a bit but might cure your Hammer Test & system crash issues AND not cost you any $$$.
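
To put a rough number on “lower performance a bit”, here is a small Python sketch of the refresh overhead, meaning the fraction of time the DRAM is busy refreshing, approximated as tRFC / tREFI. It assumes the BIOS setting called tRef above corresponds to the refresh interval tREFI. The default values used (a tREFI of 7.8 µs and a tRFC of 260 ns, typical of 4 Gbit DDR3 chips) are assumptions that may not match your modules, and the function name is made up for illustration.

```python
def refresh_overhead(trfc_ns: float, trefi_ns: float) -> float:
    """Approximate fraction of time spent refreshing: tRFC / tREFI."""
    if trfc_ns <= 0 or trefi_ns <= 0:
        raise ValueError("timings must be positive")
    return trfc_ns / trefi_ns


# Assumed defaults (not taken from the thread): typical DDR3 figures.
DEFAULT_TREFI_NS = 7800.0  # refresh interval, 7.8 microseconds
DEFAULT_TRFC_NS = 260.0    # refresh cycle time for a 4 Gbit chip

if __name__ == "__main__":
    # Scale the interval to its default, 3/4, 1/2 and 1/4 of the default.
    for scale in (1.0, 0.75, 0.5, 0.25):
        trefi = DEFAULT_TREFI_NS * scale
        pct = 100.0 * refresh_overhead(DEFAULT_TRFC_NS, trefi)
        print(f"tREFI = {trefi:7.1f} ns -> about {pct:4.1f}% of time refreshing")
```

Under these assumptions, halving the interval roughly doubles the refresh overhead, from about 3% to about 7% of the time. In exchange, rows are refreshed more often, which leaves less time for hammering to disturb neighbouring rows.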
