Strange Computer Problem

When I started using the machine this morning, it seemed to be running like molasses in January. I tried rebooting, and it took forever to boot, and then wouldn’t let me log in. I fired up a clean Fedora from a stick, and fscked my drives. The /home hard drive had a lot of errors on it, which got fixed, but there was no problem with the SSD where my OS resides. Then I rebooted. It took a long time, but finally came up. Everything continues to load and run slowly. Nothing seems to be bogging down the CPU, and there is plenty of free memory. Any ideas what the problem could be?

[Update a while later]

OK, I rebooted without mounting the hard drive. It seems to be running fine now. So I guess I need a new drive.

[Update early afternoon]

Well, this is fun. It won’t boot with the drive mounted, so I’m back to the Fedora on a stick, but I can’t find the logical volume where my fstab is to tell it not to mount the drive.
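
A rough sketch of how this might be approached from a live session, under some loud assumptions: that the volume group is named fedora_localhost-live with a root00 logical volume (inferred from the lsblk listing later in this post, not confirmed), and that the failing drive is the one mounted at /home. The helper name disable_mount is made up for illustration. The LVM steps are left as comments because they need root and the real disks; the function only reads the file it is handed and prints the result.

```shell
# Assumed steps from the live session (need root; VG/LV names are guesses):
#   vgchange -ay                                   # activate any LVM volume groups found
#   lvs                                            # list the logical volumes
#   mount /dev/fedora_localhost-live/root00 /mnt   # assumed VG/LV names
#   the installed system's fstab would then be /mnt/etc/fstab

# disable_mount FSTAB MOUNTPOINT
# Prints FSTAB with any active entry for MOUNTPOINT commented out.
# Hypothetical helper; it writes to stdout and leaves the file untouched.
disable_mount() {
    awk -v mp="$2" '
        $1 !~ /^#/ && $2 == mp { print "#" $0; next }
        { print }
    ' "$1"
}

# Example, writing to a scratch file rather than over the real fstab:
#   disable_mount /mnt/etc/fstab /home > /tmp/fstab.new
```

Printing to stdout instead of editing in place keeps the original fstab intact until the result has been looked over.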

[Evening update]

I got the new drive, and started to dd the data from the old drive to it. The process died after about 2.7G, with an “i/o error.” How screwed am I?

[Update a couple minutes later]

I’m trying again, with a conv=noerror flag. I may not get everything, but hopefully most of it.
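
For reference, a minimal sketch of what a dd copy that continues past read errors can look like. It runs on scratch files rather than real disks, and the block size and file names are illustrative assumptions, not the values used here. conv=noerror tells dd to keep going after a read error; adding sync pads each short or failed read with zeros so that later data stays at the same offset in the copy.

```shell
# Demonstration on scratch files; for real disks the if=/of= arguments would be
# device nodes (hypothetical example: if=/dev/sdX of=/dev/sdY) and need root.
src=$(mktemp)
dst=$(mktemp)

# 1 MiB of random data, an exact multiple of the 64 KiB block size used below
head -c 1048576 /dev/urandom > "$src"

# noerror: keep going after a read error instead of aborting
# sync:    pad short or failed input blocks with zeros to the full block size
dd if="$src" of="$dst" bs=65536 conv=noerror,sync 2>/dev/null

# With a healthy source the copy is byte-for-byte identical
cmp "$src" "$dst" && echo "copy matches source"
```

Without sync, a skipped block would shift everything after it, which tends to make the filesystem on the copy much harder to repair.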

[Friday-morning update]

Well, it’s copying at 11MB/s. At that rate, it’s about a third of the way through, and won’t be done until tomorrow. I’m glad it wasn’t bigger…
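
A back-of-the-envelope check on that estimate, as a sketch. The 2000 GB size is taken from the ddrescue output later in this post, and the function name copy_hours is invented for illustration.

```shell
# copy_hours SIZE_MB RATE_MB_PER_S
# Whole hours needed to copy SIZE_MB at a steady RATE_MB_PER_S (integer math,
# rounded down). Hypothetical helper for a rough estimate only.
copy_hours() {
    echo $(( $1 / $2 / 3600 ))
}

# 2000 GB = 2000000 MB at 11 MB/s
copy_hours 2000000 11
```

At a steady 11 MB/s the whole drive works out to roughly 50 hours, so with a third done there is still well over a day to go.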

[Friday-afternoon update]

So, after moving about 833GB, the process ground to a slow crawl, so I gave up on it, and am going to try ddrescue. I bought another drive to write the image to, but when I try to partition it, I get this message:

(parted) mkpart
Partition name? []? gpt
File system type? [ext2]? ext4
Start? 1.048
End? 1800000
Warning: You requested a partition from 1048kB to 1800GB (sectors
2046..3515625000).
The closest location we can manage is 1048kB to 1048kB (sectors 2047..2047).
Is this still acceptable to you?

It’s a 2-terabyte drive. What’s going on? (And yes, I do have the correct drive selected, /dev/sde).

[Update a few minutes later]

Never mind, I found the problem.

[Update]

OK, WTF now?

[root@localhost-live /]# lsblk
NAME                              MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
loop0                               7:0    0   1.8G  1 loop /run/media/liveuser/disk
loop1                               7:1    0   7.5G  1 loop
├─live-rw                         253:0    0   7.5G  0 dm   /
└─live-base                       253:1    0   7.5G  1 dm
loop2                               7:2    0    32G  0 loop
└─live-rw                         253:0    0   7.5G  0 dm   /
sda                                 8:0    0 232.9G  0 disk
├─sda1                              8:1    0   600M  0 part
├─sda2                              8:2    0     1G  0 part
└─sda3                              8:3    0   230G  0 part
  ├─fedora_localhost--live-home00 253:2    0    10G  0 lvm
  └─fedora_localhost--live-root00 253:3    0   220G  0 lvm
sdb                                 8:16   0   1.8T  0 disk
└─sdb1                              8:17   0   1.8T  0 part
sdc                                 8:32   0  55.9G  0 disk
sdd                                 8:48   0   1.8T  0 disk
└─sdd1                              8:49   0   1.8T  0 part
sde                                 8:64   0   1.8T  0 disk
└─sde1                              8:65   0   1.6T  0 part /mnt
sdf                                 8:80   1  14.9G  0 disk
├─sdf1                              8:81   1   1.9G  0 part /run/initramfs/live
├─sdf2                              8:82   1  10.9M  0 part
└─sdf3                              8:83   1  22.9M  0 part
zram0                             252:0    0     4G  0 disk [SWAP]

[root@localhost-live /]# umount /mnt

[root@localhost-live /]# lsblk
NAME                              MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
loop0                               7:0    0   1.8G  1 loop /run/media/liveuser/disk
loop1                               7:1    0   7.5G  1 loop
├─live-rw                         253:0    0   7.5G  0 dm   /
└─live-base                       253:1    0   7.5G  1 dm
loop2                               7:2    0    32G  0 loop
└─live-rw                         253:0    0   7.5G  0 dm   /
sda                                 8:0    0 232.9G  0 disk
├─sda1                              8:1    0   600M  0 part
├─sda2                              8:2    0     1G  0 part
└─sda3                              8:3    0   230G  0 part
  ├─fedora_localhost--live-home00 253:2    0    10G  0 lvm
  └─fedora_localhost--live-root00 253:3    0   220G  0 lvm
sdb                                 8:16   0   1.8T  0 disk
└─sdb1                              8:17   0   1.8T  0 part
sdc                                 8:32   0  55.9G  0 disk
sdd                                 8:48   0   1.8T  0 disk
└─sdd1                              8:49   0   1.8T  0 part
sde                                 8:64   0   1.8T  0 disk
└─sde1                              8:65   0   1.6T  0 part
sdf                                 8:80   1  14.9G  0 disk
├─sdf1                              8:81   1   1.9G  0 part /run/initramfs/live
├─sdf2                              8:82   1  10.9M  0 part
└─sdf3                              8:83   1  22.9M  0 part
zram0                             252:0    0     4G  0 disk [SWAP]

[root@localhost-live /]# mount /dev/sde1 /mnt

[root@localhost-live /]# ddrescue -d /dev/sdb1 /mnt/test.img /mnt/test.logfile
GNU ddrescue 1.25
Press Ctrl-C to interrupt
Initial status (read from mapfile)
rescued: 0 B, tried: 0 B, bad-sector: 0 B, bad areas: 0

Current status
ipos: 0 B, non-trimmed: 0 B, current rate: 0 B/s
opos: 0 B, non-scraped: 0 B, average rate: 0 B/s
non-tried: 2000 GB, bad-sector: 0 B, error rate: 0 B/s
rescued: 0 B, bad areas: 0, run time: 0s
pct rescued: 0.00%, read errors: 0, remaining time: n/a
time since last successful read: n/a
Copying non-tried blocks… Pass 1 (forwards)

ddrescue: Error writing mapfile ‘/mnt/test.logfile’: No space left on device
Fix the problem and press ENTER to retry,
or E+ENTER for an emergency save and exit,
or Q+ENTER to abort.

************************************************

lsblk says it’s got 1.6 Terabytes. I just partitioned it. How can there be no space left on the device?

[Update a while later]

Yes, I forgot to format after petitioning…

[Saturday-morning update]

OK, so what does this mean?

[root@localhost-live /]# ddrescue -d /dev/sdb1 /mnt/test.img /mnt/test.logfile
GNU ddrescue 1.25
Press Ctrl-C to interrupt
ipos: 1784 GB, non-trimmed: 43778 kB, current rate: 14680 kB/s
opos: 1784 GB, non-scraped: 0 B, average rate: 46776 kB/s
non-tried: 229778 MB, bad-sector: 0 B, error rate: 0 B/s
rescued: 1770 GB, bad areas: 0, run time: 10h 30m 51s
pct rescued: 88.51%, read errors: 668, remaining time: 1h 10m
time since last successful read: n/a
Copying non-tried blocks… Pass 1 (forwards)
ddrescue: Write error: No space left on device

**************************************************

So it rescued 88.51%. What does that mean, in terms of actual data recovery? It says it rescued 1770 GB, but I’m sure I didn’t actually have that much data (it was probably less than a terabyte). And why is there “no space left on device”?
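
A commenter below gives the likely explanation: ddrescue copies the whole source partition sector by sector, used or not, so the image grows to the full size of the source regardless of how much data the filesystem holds. Here is a minimal sketch of a pre-flight check along those lines. The helper name image_fits is made up, and the byte counts in the example are round illustrative figures, not measurements from these drives.

```shell
# image_fits SOURCE_BYTES FREE_BYTES
# Succeeds (exit 0) only if an image of SOURCE_BYTES fits in FREE_BYTES.
# Hypothetical helper for illustration.
image_fits() {
    [ "$1" -le "$2" ]
}

# On a real system the two numbers could come from (blockdev needs root):
#   blockdev --getsize64 /dev/sdb1              # size of the source partition
#   df --output=avail -B1 /mnt | tail -n 1      # free bytes on the destination

# Illustrative figures: a 2000 GB source and about 1770 GB of free space
if image_fits 2000000000000 1770000000000; then
    echo "image fits"
else
    echo "destination too small"
fi
```

On the same reasoning, 1770 GB out of 2000 GB is 88.5%, which lines up with the reported figure: the percentage appears to be of the raw partition, not of the files stored on it.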

[Monday-morning update]

Here is the final result of copying it to one of the new hard drives:

ipos: 1772 GB, non-trimmed: 57184 kB, current rate: 180 kB/s
ipos: 1986 GB, non-trimmed: 0 B, current rate: 40448 B/s
opos: 1986 GB, non-scraped: 16249 kB, average rate: 18850 kB/s
non-tried: 0 B, bad-sector: 1029 kB, error rate: 0 B/s
rescued: 2000 GB, bad areas: 2010, run time: 1d 5h 28m
pct rescued: 99.99%, read errors: 3885, remaining time: 35m

Not sure what that means in terms of data integrity, but I’m now backing up the drive to the other new drive, after which I’ll e2fsck it, then try mounting it. It’s moving the data pretty briskly, and says it will be done in about three hours.
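
For a rough sense of scale, a sketch of the arithmetic on those figures. It assumes 512-byte sectors and uses a made-up helper name, unread_percent. It says nothing about which files the unreadable sectors fall in; that only becomes clear once the filesystem on the copy has been checked.

```shell
# unread_percent UNREAD_KB TOTAL_KB
# Percentage of the source that has not been read successfully.
# Hypothetical helper; awk is used because the shell has no floating point.
unread_percent() {
    awk -v u="$1" -v t="$2" 'BEGIN { printf "%.5f\n", 100 * u / t }'
}

# bad-sector (1029 kB) plus non-scraped (16249 kB), out of 2000 GB
unread_percent $(( 1029 + 16249 )) 2000000000

# 2010 bad areas of one assumed 512-byte sector each, in bytes
echo $(( 2010 * 512 ))
```

The second figure comes to 1,029,120 bytes, which matches the 1029 kB bad-sector total, so each bad area looks to be a single sector scattered somewhere on the disk.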

57 thoughts on “Strange Computer Problem”

      1. You might also have a marginal hard drive. Recommend an immediate backup just in case. That could take forever unless you have it shadowed somewhere. If the power cycle doesn’t help, or makes it worse, it’s time to consider swapping out the /dev/sdX that hosts /home….

        The reason I focus on /home is that if you had done a complete reboot the modern ext3 & ext4 file systems do journaling. You should not be experiencing fsck issues across a reboot under those file systems unless there is some underlying hardware issue. Assuming a clean reboot that you didn’t interrupt…

        1. In Linux lingo a shutdown merely means the kernel has entered a halt loop. On older timesharing systems a shutdown implied dropping into single user mode as root with the kernel still running. A shutdown with restart means some piece of software in the kernel causes an unmaskable CPU interrupt to trigger a reboot. None of these scenarios cycle power to either the CPU or motherboard leaving asynchronous state machines and / or bus driven microcontrollers to their own mystical ways and affectations.

      1. smartctl is far from perfect; its biggest usefulness is in watching the reported errors to see whether the count is incrementing frequently. Sometimes you can get a result by running the long test and then checking whether the error log shows bigger numbers than it had before. Usually the self-test won’t fail; you have to watch the behavior of the error log. It’s far from foolproof, but it is one way of conducting a non-destructive test. Regrettably, with Linux you have to look for symptoms. It won’t tell you outright.

        1. Well, now the disk management tool in Gnome is telling me that the disk is self reporting as failing. I have another drive I want to back it up to, but it has two partitions on it: a boot and the OS. I deleted the boot partition, but the rest is still calling itself /dev/sd2, when I want it to be /dev/sd1. Any way to fix this, short of blowing away the other partition with all its existing data (it’s an old backup)? If I can’t renumber it, the boot will fail when it tries to mount it.

          1. See:
            https://superuser.com/questions/393613/how-to-renumber-a-partition#915412

            You can renumber the partition table by entering the ‘x’ command to enter ‘expert’ mode.

            Before you do the ‘w’ make sure you’ve written down or copied all the info from the first -l so you can put the partition table back to something useful if you mess it up.

            I haven’t done this in a few years, so I’m reluctant to give you too much advice here, lest I get it wrong.

        2. The smartctl self test is driven by the disk vendor’s firmware. And frankly it appears most vendors can’t be bothered with a decent self test if they bother at all. Something even guides like Tom’s Hardware ought to make an issue of but hey $/bit is Uber alles!

          1. I found a vintage IBM PC 70 MB full-height drive on Ebay for $8.50. Proven, reliable, and offering plenty of storage space for documents, recipes, and games. But they want $14+ for shipping, and I’m not sure your PC can mount a full-height drive. So you’re better off with the 2 Terabyte one.

  1. I got the new drive, and started to dd the data from the old drive to it. The process died after about 2.7G, with an “i/o error.” How screwed am I?

    You’re “this is why we make backups” screwed, at least potentially.

    But it’s a useful lesson – if it ain’t backed up, it’s already gone is the one true way to approach data.

      1. Or invest in another WD disk and set the two up in a RAID array with one being the shadow disk. Either way.

  2. Well, it’s copying at 11MB/s. At that rate, it’s about a third of the way through, and won’t be done until tomorrow. I’m glad it wasn’t bigger…

    Assuming you are copying from rotational media this sustained slow rate is also telling. Rather than having a bad block or sector sounds like a sense amplifier is having great difficulty. Which makes multiple accesses necessary to pull the data off the recalcitrant disk which means you can only pull data off at a fractional rate of your I/O bus whether SATA or something else. Good luck but most of all be patient!

  3. After a painful experience with a failed drive, I set up an old refurbished computer to continually back up important parts of my home directory. I set up network drives and use fwbackup (a front for rsync). Works great.

    I also like to always have a relatively up-to-date PartedMagic drive (with Clonezilla and Ghost 4 Linux) for cloning drives. Clonezilla makes more compact images, but it’s fussier. Given your disk errors, the more brute-force G4L would likely be more appropriate. It’s like a front end for dd, but it’s easy to use.

    Finally, it makes sense to have a copy of Boot Repair on hand, but you know that.

  4. Try the ddrescue tool if the dd copy doesn’t work out. It has a more aggressive retry policy and also logs the status of each copied region.

  5. Well, generally a drive is partitioned and then formatted before it is useable, the old “FORMAT C:” solution to many a software issue. ^_^

    1. It’s events like these that make me happy I’m on the East Coast and fast asleep…. *_*

  6. “Yes, I forgot to format after petitioning…”

    You have to “petition” your computer?

    Um, I think you would be much better off with Windows 10, or even with some incarnation of an Apple OS – Jaguar, Leopard, or even Ferrari, and, especially, Blue Steel.

    1. When you can pry my Linux from my cold, dead, hands…. 🙂

      FYI Apple OS *is* essentially Linux, under the hood….
      Drop into the command line interpreter and poke around if you need proof… like % cat /proc/cpuinfo for example…

      1. Been that way since System X (System 10) I believe…
        I once had the privilege of programming directly on System 6. It was fun and educational to see the lineage of Andy Hertzfeld’s work. Nicely done Andy…

  7. Um save the drive that had the 833GB transfer just in case. Your source disk sounds like it is really really on its way out. Did you get enough via the first transfer that all key files were recovered? Just sayin’ this might be the best you can do. Highly recommend a backup strategy. And of course my Space Cadet training says that I have to ask the following question: If you have one, why not just restore from it? Not trying for the salt->wound, just trying to save you time…. There’s a little bit of the Doc McCoy in me that says a little pain is good for the soul….

    1. cp -r to selectively restore what is most important might be your friend here, just sayin….

  8. And why is there “no space left on device”?
    Did you mount the output device? The path name you gave for the output file is suspicious. Usually when you mount a drive you have to give it a mount point in /mnt. Like /mnt/newdrive/mydirectory/test.img

    But why back up to an image file? Don’t you just want to go disk to disk? If you are going direct, can’t you just give it /dev/sda1 (or wherever your new drive is configured) as the output file instead?

      1. That’s right, I forgot, you flipped on the benefit of coffee again.
        Well maybe you are caffeine immune? In that case, you live in CA, go outside (everyday is warm and sunny in CA right?) run around the house three times (in both directions for a total of six) THEN proceed with this…. ^_^

        I was following directions here.

        Jump ahead in those instructions to the section called “Cloning directly to a new disk”. This is what you need to do. Creating a dot img file is fine for when you are just doing backups, but you need a backup AND restore. To paraphrase DEVO: “backup is what you’ll get but cloning is what you need. The disk you want”. In fact playing DEVO in the background while doing this seems quite appropriate. (I know, I know you despise DEVO, ’cause of its Kent State genesis and all that…).

  9. ddrescue copies the entire device, so the destination device must be equal to or larger than the source. You mention you’re using less than 1 TB, but that’s a filesystem thing. ddrescue wants to copy the entire 1.8 TB verbatim. Like dd, it is copying sectors, not files.

  10. A suggestion: In future posts about computer problems (and there WILL be such posts) it would make your blog easier to follow if you put the output of troublesome console sessions down here in the comments and just stick a one paragraph summary in the OP with a reference to comments for details.

    1. I meant to add that bumped short one paragraph summaries in the OP are fine until they get to multiple pages when you can then more tag them. It’s nice to be able to follow your progress. Good Luck!

    1. It’s still working on it. About 77% recovered, with estimates to completion ranging from a couple hours to a month, depending on current speed… Once I get it off the bad drive, I’ll probably back it up, and then do a file-system repair.

      1. I have a pair of 4 or 5 TB external USB drives which I should really make more periodic use of for full system backups. They’re cheap and if I traveled much they’d come in really handy for keeping all my data at hand.

        1. Since he now has two drives of the same size he can shadow one in an internal RAID array to leverage the speed of his I/O bus. Switchover ought to be painless and no restore needed although both drives age at the same rate. Nice to occasionally swap them each out with a newer drive to stagger their age. Like drives like tires. heh. Or go all SSD. The aging characteristics / data look good so far. If you can afford them. Age staging still a good idea even with SSDs that have no moving parts.
