Computer Disaster

Well, maybe that’s too strong a word, but my main workstation (running Fedora 10) froze up today. I could move the cursor around on the screen, but nothing else. I ended up doing a hard reboot, after which it refused to boot, hanging after dragging the bar across the bottom of the screen. I’ve booted into rescue mode, but I have no idea how to fix it, because I don’t know what the problem is. I may try a reinstall with the net install disk that I used for the rescue, but I’m wondering if anyone has any ideas on how to diagnose it?

For those wondering, I’m posting from my laptop (which has its own problems, because the wireless isn’t working, and I have to find a place for it in my office where I can plug it in to ethernet).

[Update early afternoon]

I’m downloading Knoppix, because I’m having trouble with fsck from the rescue disk. It boots into level 3, though, so it may be an XWindows problem.

[Update a few minutes later]

No, it doesn’t go into level 3, it only boots single user.

[Early evening update]

Well, I backed up my data, and upgraded from Fedora 10 to Fedora 11. Whatever the problem was, the upgrade didn’t fix it. It acts just the same, except it says “Fedora 11” in the lower-right corner of the screen, instead of 10. It doesn’t complete the boot.

I guess I have to just do a clean installation, and recover from backup.

[Update a few minutes later]

D’oh!!!!!!!!!

I noticed when it shut down that it reported a bad line in /etc/fstab. I rescued, vi’d in, and noticed that I’d added a line to attempt to mount a remote drive, but had apparently screwed it up. This is the first time that I’d attempted a boot since I did that, weeks ago. So the screen freeze had nothing to do with the booting problem, other than that it caused me to reboot. I deleted the line, and it looks like it’s coming up now. So I wasted a day on this. OTOH, I did do the upgrade, which I’d been meaning to do anyway.
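
A malformed /etc/fstab line like that one can be caught before the next reboot. What follows is only a minimal sketch, not what I actually ran: the function name `check_fstab` is made up for illustration, and it checks nothing more than that each non-comment line has between four and six whitespace-separated fields, which is what fstab(5) expects. It will not catch a wrong device or server name.

```shell
# check_fstab FILE: print "LINE_NUMBER: text" for every entry whose field
# count is outside the 4-6 range allowed by fstab(5). Hypothetical helper.
check_fstab() {
    awk '
        /^[ \t]*#/       { next }      # skip comment lines
        NF == 0          { next }      # skip blank lines
        NF < 4 || NF > 6 { printf "%d: %s\n", NR, $0; bad = 1 }
        END              { exit bad + 0 }   # non-zero status if anything was flagged
    ' "$1"
}

# Example: check_fstab /etc/fstab && echo "field counts look sane"
```

Running it after every hand edit of fstab, rather than weeks later at the next boot, is the whole point.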

[Update a few minutes later]

Spoke too soon. It acted different, but it still won’t boot. At the end of the day, I’m still stuck with a white bar across the bottom of the screen. I’ll make one more attempt to look at the logs, and then just do a clean install of 11.

[Mid-evening update]

OK, I am officially infuriated.

I did a clean install of Fedora 11 over my previous Fedora 10. Everything went fine, except the friggin’ thing will not let me log in. I assigned a password for myself (it didn’t ask for one for root), and when I use it, it just sits there. Forever. When I pull up a text console (Ctrl-Alt-F1) it asks for a password, then delays for a long time, then returns to a login prompt.

There are no words for how angry and frustrated I am…

27 thoughts on “Computer Disaster”

  1. Download and run a memory diagnostic. A real memory test, not the cheap assed one. Since you live in Florida there is a good possibility that your memory timing has gone to hell due to corrosion on the pins.

    If that pinpoints the problem, then go buy good new memory, not cheap crap that barely meets spec.

  2. Sounds like an X problem, all right. Have a look in /var/log; there should be all manner of log files there for various subsystems that may tell you if X hung, etc.

    The other thing you can try is switching back to a text console in runlevel 5. Linux has multiple “virtual consoles”, and you can usually switch to VC[n] with the key combo Ctrl-Alt-F[n] (plain Alt-F[n] works when you are already on a text console). Fedora used to have X on VC7 and consoles on 1-6, but that may have changed lately to X on VC1 and consoles on 2-7. If you get a working console that way in rl5 it’s almost certainly an X issue.
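
    As a rough sketch of the log check suggested above: the X server marks error lines in its log with `(EE)` and warnings with `(WW)`. The default path `/var/log/Xorg.0.log` is an assumption (the usual location for display :0), and `x_log_errors` is a hypothetical helper name, not a standard tool.

    ```shell
    # x_log_errors [FILE]: print only the error lines from an X server log.
    x_log_errors() {
        # Match "(EE)" at the start of a line, optionally after a "[ 12.345] "
        # timestamp, so the marker legend near the top of the log is not matched.
        grep -E '^(\[[ 0-9.]+\] )?\(EE\)' "${1:-/var/log/Xorg.0.log}"
    }
    ```

    If this prints nothing, X may never have started at all, in which case /var/log/messages is the next place to look.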

  3. Not to be trite, but this is why I buy a cheap Dell and pay for the support contract. I simply never have these problems – they run their diagnostics, or send someone to my house, and parts are either fixed or replaced the once or twice a year something goes wrong.

    Mac OS and Windows 7 are great OSes. I really don’t know why anyone who isn’t a professional in need of Linux for some reason, or a masochist, fools around with Linux in any of its varieties.

  4. I can’t boot into run level 5. It will only boot level 1. Also, I pulled and replaced the memory, and am testing it, with no errors after ten minutes or so.

  5. Well, that’s odd. The difference between 1 and 3 is mounting some filesystems, turning on networking, starting some server processes… have a look in /var/log/messages from the attempt to boot to runlevel 5 and see how far it got and whether there were any exciting errors, I guess. You probably should be able to do that from singleuser… although if you got a cursor in 5 it was at least trying, and partially succeeding, in getting X up. Maybe a corrupted filesystem?

  6. I never had a cursor in 5, except when it initially crashed, and that’s all I had — no response from the mouse buttons or keyboard. I had to do a hard reboot, and when I did, it refused to boot at any level above 1. I’m running a memory test on it right now, so I can’t look at the logs, but I’ll do so in a minute, because I’m about to burn a Knoppix disk.

  7. Odds are, it’s a corrupted file system that’s keeping you from getting past runlevel 1. Run fsck from rescue mode, or from Knoppix.

  8. Same problem in Knoppix as from the rescue disk. When I try to fsck sda2, it says I have a bad magic number in super-block while trying to open /dev/sda2. Now what?

  9. Oh, it’s using the logical volume manager, so your raw partitions aren’t actually the filesystem partitions. Gah… I’m not familiar enough with LVM2 to comment. You might have a look at:

    http://forums.fedoraforum.org/showthread.php?t=203741

    I’m inclined to say that LVM(2) is overkill for a workstation; you’re probably better off using a second disk for snapshot backups rather than mirroring or the like…
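
    To expand on the LVM point: /dev/sda2 holds an LVM physical volume, so there is no filesystem superblock there for fsck to find. The filesystems live on logical volumes, which have to be activated first. The sketch below uses the standard LVM2 tools (`vgscan`, `vgchange`, `lvs`), but the names `VolGroup00` and `LogVol00` are only assumed defaults, and the `run` and `fsck_lvm_root` helpers are hypothetical. With `DRY_RUN=1` it just prints the commands instead of running them.

    ```shell
    # run CMD...: execute a command, or only print it when DRY_RUN=1.
    run() {
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "$*"
        else
            "$@"
        fi
    }

    # fsck_lvm_root [VG] [LV]: activate a volume group and check one logical
    # volume. Run as root from rescue media, with the volume unmounted.
    fsck_lvm_root() {
        vg=${1:-VolGroup00}   # assumed volume group name; confirm with "lvs"
        lv=${2:-LogVol00}     # assumed logical volume name
        # Scan the attached disks for volume groups.
        run vgscan || return 1
        # Activate the group so the /dev/VG/LV device nodes appear.
        run vgchange -ay "$vg" || return 1
        # List the logical volumes in the group.
        run lvs "$vg" || return 1
        # Check the filesystem on the logical volume, not on the raw partition.
        run fsck -f "/dev/$vg/$lv"
    }
    ```

    Trying `DRY_RUN=1 fsck_lvm_root` first shows the four commands without touching the disk.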

  10. Could be a video card problem. The kernel is working fine if you can boot to single user mode. You could have file corruption of one sort or another, or your video hardware could be gronked sufficiently to prevent X from starting. I second the perusal of X logs in /var/log to see what happened when X tried to start. You can do that in single-user mode. You can probably also modify the arguments to X when it tries to start next time so you get even more copious debugging output.

    Dunno what happened with fsck. Keep in mind you can’t fsck a mounted partition, so for example you can’t fsck the partition you have mounted at / if you have booted from it. I believe you can force a fsck at boot-time before the root partition is mounted, possibly by shutting down the system with “shutdown -F” or by creating a file called “forcefsck” in the root directory, and if fsck fails it will drop you into a special shell that lets you run fsck by hand and correct errors by answering Y to the appropriate questions.

    Personally, I doubt memory problems if the ugliness is consistent. Memory problems usually (although not always of course) turn up as weird flaky segfaults, particularly in reliable programs like gcc. A consistent bugger-up like this says corrupt files or bad hardware (in this case video hardware) to me, FWIW.

  11. “Cursor and nothing from mouse/KB” sounds like “X with no window manager” (especially if you didn’t try to switch to a console).

    If you can boot single-user but not multi, check the logs? Sounds to me like something’s just freezing in the multi-user startup.

  12. I’m looking through the logs, but not seeing anything obvious. I’m backing up my data to another drive in case I have to do an upgrade or clean install.

  13. Wait, how are you backing up the data if you can’t mount the filesystems…? Is the rescue disk mounting them, and if so, what does “df” show as the filesystem devices?

  14. I’d be surprised if an upgrade did any good. Most upgrades will carefully preserve any customization you’ve done to configuration files, to your regret in this case. Note also that you can go into /etc/rc.d/rc3.d or whatever runlevel you want and change the links so that fewer things attempt to start, in an effort to isolate the problem.
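
    As a small illustration of the runlevel directories mentioned above: each `S` link in /etc/rc.d/rc3.d starts a service on entering runlevel 3, in the order of its two-digit prefix. The helper name `list_start_links` is hypothetical. On Fedora, `chkconfig --level 3 <service> off` is the usual way to drop a link rather than renaming it by hand.

    ```shell
    # list_start_links [DIR]: print the start links of a runlevel directory
    # in the order init runs them.
    list_start_links() {
        dir=${1:-/etc/rc.d/rc3.d}
        for link in "$dir"/S[0-9][0-9]*; do
            # Skip the literal pattern left behind when nothing matches.
            [ -e "$link" ] || [ -L "$link" ] || continue
            basename "$link"
        done
    }
    ```

    Disabling services from the bottom of that list upward, one reboot at a time, narrows down which one hangs.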

  15. You have something corrupted somewhere. Maybe an application is trying to do a DNS lookup, and is hanging, or something like that. Start a boot process…. go to bed… see if it comes up.

    Else:

    Boot into single user mode.

    enable the ethernet interface
    ifup eth0 #will usually do the trick, if not
    ifconfig eth0 some.good.ip.address # will work, certainly
    /etc/init.d/sshd start # enable ssh (the init script is sshd on Fedora; ssh on Debian)

    ssh into the box from your laptop
    become root, then tail -f /var/log/messages /var/log/syslog &
    telinit 2 # you can actually do this from the main prompt

    That should kick you into multi-user mode, without graphics enabled, and you can see where or if it hangs. telinit 5 kicks off graphics.

    Same procedure would also let you get help from an expert via a ssh tunnel. (I’d volunteer but I can’t help til tuesday)

  16. I re-read all the comments and thought about your problem some more, as it shifted after you installed a new OS.

    If you have a separate partition for /home, and it is out of space, you will see problems like the one you just described (inability to log in, except as root).

    runlevel 1 is single user; runlevel 2 is multi-user without NFS filesharing
    runlevel 3 adds filesharing and full networking – if that hung, you had some sort of networking problem….

    I am not familiar with recent versions of Fedora, but it sounds like they are going the Ubuntu route of disabling root by default. That’s annoying.

    Your current problem could well be as simple as being out of disk, try two things:

    One – boot single user, give root a password, boot into graphics mode, login as root.

    Or get to runlevel 2 and see how much disk space you have. Note that a filesystem showing 99% full may already refuse to let you create a file. Still, root running graphics is an independent test.

    And

    I have had corrupt lockfiles in my home dir that would result in behavior as you describe, as well.

    Two – login as root (or in single user) create an entirely new user, with password (adduser is the command), and see if that user works.

    Three – try to login using a different desktop (this is essentially the same test as two, two is more robust)
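
    The out-of-space theory is quick to check. A minimal sketch, with a hypothetical helper name `usage_percent`, using the POSIX output format of `df` so the columns are predictable. On ext2/ext3 a share of the blocks is normally reserved for root, which is one reason a nearly full filesystem can refuse writes from an ordinary user. `df -i` shows inode usage, which can also run out.

    ```shell
    # usage_percent [DIR]: print how full the filesystem holding DIR is (0-100).
    usage_percent() {
        # -P forces one line per filesystem; column 5 is the capacity, e.g. "42%".
        df -P "${1:-/home}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
    }

    # Example: [ "$(usage_percent /home)" -ge 99 ] && echo "/home is nearly full"
    ```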

  17. @Brock:

    “Not to be trite, but this is why I buy a cheap Dell and pay for the support contract. I simply never have these problems – they run their diagnostics, or send someone to my house, and parts are either fixed or replaced the once or twice a year something goes wrong.”

    I just can’t help but notice that once or twice a year thing that you consider acceptable…

    “Mac OS and Windows 7 are great OSes. I really don’t know why anyone who isn’t a professional in need of Linux for some reason, or a masochist, fools around with Linux in any of its varieties.”

    I have a mac with broken DNS services sitting on my desk right now that has thus far proved unfixable.

    This is definitely the wrong forum for this little rant, but have you seen any Apple Geniuses floating around in orbit, lately? Or service calls from Dell on the moon, for Windows 7?

    Linux has a set of well defined failure and recovery modes that can be managed remotely that make it well suited for use in complex space applications, and by extension, using it on your desktop extends your personal knowledge base to better handle problems “up there”, as well as “down here”.

    Secondly, excellent support is available for Linux, both from official channels and informal ones. Notably, the IRC network irc.freenode.org has multiple realtime channels full of experts who can help with nearly any level of problem, and it would be a far better forum than a blog for pursuing a solution to Rand’s current problem.

    It’s not masochism to try to use as many common tools as are available. It may well be masochism to run the absolute latest versions of a “test” version of the OS, such as Fedora.

  18. Couple quick comments (not 100% familiar with Fedora; I use Debian and Solaris more):

    Probably Fedora 11 is set up to have no root password; you’re expected to log in as a normal user and use “sudo” to run commands requiring root privileges. This is simpler in some respects and arguably more secure for various tedious reasons.

    It sounds like you were getting logged in and then right back out again. Do the rescue disk (or single-user mode?) and have a look at the entry for your ID in /etc/passwd. Most likely there’s a problem with your shell, or there’s a problem with your home directory. I suspect it’s not mounting your home directory properly…
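
    A sketch of that /etc/passwd check, for what it’s worth. Each line has seven colon-separated fields; the sixth is the home directory and the seventh is the login shell. `check_login_entry` is a hypothetical helper that prints both and notes whether they exist. From rescue media you would point it at the passwd file under wherever the installed system is mounted, and the paths it tests are then relative to the rescue environment, so treat a MISSING there with some suspicion.

    ```shell
    # check_login_entry USER [FILE]: report the home directory and shell for USER.
    check_login_entry() {
        entry=$(awk -F: -v u="$1" '$1 == u { print $6 ":" $7 }' "${2:-/etc/passwd}")
        [ -n "$entry" ] || { echo "no entry for $1"; return 1; }
        home=${entry%%:*}    # text before the first colon
        shell=${entry#*:}    # text after the first colon
        [ -d "$home" ]  && echo "home $home exists" || echo "home $home MISSING"
        [ -x "$shell" ] && echo "shell $shell is executable" || echo "shell $shell MISSING"
    }
    ```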
