2003-02-06 20:11:21

by Thomas Molina

Subject: possible partition corruption

I have run into an apparent anomaly while compiling/testing 2.5.59-bk. My
normal mode of operation is to do a daily bk pull to get the latest csets
and do a compile/boot run. After yesterday's pull I started seeing problems
on reboot. During the reboot I would get the "OK booting the kernel"
message followed by a system freeze. After a forced reboot into a stock
RedHat 8.0 2.4 kernel I would see the system misidentify my boot partition
as an ext2 partition and the following messages would appear:

EXT2-fs: ide0(3,8): couldn't mount because of unsupported optional feature (4).
Kernel panic: VFS: Unable to mount root fs on 08:08

Following this I rebooted using my rescue CD and did an fsck on the
partition. It reported recovering the journal, but found no other
apparent corruption. Upon rebooting, the fsck noted an unclean shutdown,
but found the partition clean after checking it. No other anomalies were
noted.

However, the same thing happened this morning, so I did some investigating
and found that now, whenever I compile and boot from any 2.5.59 kernel I
get the above-described behaviour. I've even tried downloading a stock
tarball and using a default configuration. I've made the configuration
almost as simple as I could and still get the above behaviour.

At this point, I'm looking at my description and seeing there is very
little data I can provide that might be useful. My stuff is backed up,
and I can reformat and try again if necessary. I'm including some data
which seems relevant. The configuration file I used is attached. Nothing
interesting gets logged to /var/log/messages.

If someone wants to look into this, I'll keep things as they are.
Otherwise, I'll wipe and restart.

[root@dad scripts]# cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 4
model name : AMD Athlon(tm) Processor
stepping : 2
cpu MHz : 1343.085
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow
bogomips : 2680.42


[root@dad scripts]# /sbin/lspci
00:00.0 Host bridge: VIA Technologies, Inc. VT8363/8365 [KT133/KM133] (rev 03)
00:01.0 PCI bridge: VIA Technologies, Inc. VT8363/8365 [KT133/KM133 AGP]
00:04.0 ISA bridge: VIA Technologies, Inc. VT82C686 [Apollo Super South] (rev 40)
00:04.1 IDE interface: VIA Technologies, Inc. VT82C586B PIPC Bus Master IDE (rev 06)
00:04.2 USB Controller: VIA Technologies, Inc. USB (rev 16)
00:04.3 USB Controller: VIA Technologies, Inc. USB (rev 16)
00:04.4 Bridge: VIA Technologies, Inc. VT82C686 [Apollo Super ACPI] (rev 40)
00:0a.0 Multimedia audio controller: Yamaha Corporation YMF-754 [DS-1E Audio Controller]
00:0d.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100] (rev 08)
00:11.0 Unknown mass storage controller: Promise Technology, Inc. 20265 (rev 02)
01:00.0 VGA compatible controller: ATI Technologies Inc 3D Rage Pro AGP 1X/2X (rev 5c)

hda: WDC WD300BB-00AUA1, ATA DISK drive
hdb: MDT MD100EB-00BHF0, ATA DISK drive
hdc: LG CD-RW CED-8080B, ATAPI CD/DVD-ROM drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
ide1 at 0x170-0x177,0x376 on irq 15
hda: 58633344 sectors (30020 MB) w/2048KiB Cache, CHS=3649/255/63, UDMA(100)
hdb: 19541088 sectors (10005 MB) w/2048KiB Cache, CHS=1216/255/63, UDMA(100)
ide-floppy driver 0.99.newide
Partition check:
hda: hda1 hda2 < hda5 hda6 hda7 hda8 hda9 >
hdb: hdb1 hdb2


Attachments:
.config (11.24 kB)

2003-02-06 20:27:11

by Andrew Morton

Subject: Re: possible partition corruption

Thomas Molina <[email protected]> wrote:
>
> I have run into an apparent anomaly while compiling/testing 2.5.59-bk. My
> normal mode of operation is to do a daily bk pull to get the latest csets
> and do a compile/boot run. After yesterday's pull I started seeing problems
> on reboot. During the reboot I would get the "OK booting the kernel"
> message followed by a system freeze. After a forced reboot into a stock
> RedHat 8.0 2.4 kernel I would see the system misidentify my boot partition
> as an ext2 partition and the following messages would appear:
>

Everything you describe is consistent with a kernel which does not have ext3
compiled into it.

> EXT2-fs: ide0(3,8): couldn't mount because of unsupported optional feature (4).
> Kernel panic: VFS: Unable to mount root fs on 08:08
>

That is an ext3 filesystem in the "needs journal recovery" state. ext2
cannot mount that until either fsck or the ext3 kernel driver has run
recovery.

grep EXT3 .config ??
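
To make the "(4)" in the EXT2-fs message concrete, here is a minimal Python sketch of how an unsupported incompat feature mask can be decoded. The bit values follow the ext2/ext3 on-disk format; the function name and the assumption that the ext2 driver of this era supports only the FILETYPE bit are illustrative, not taken from the thread.

```python
# Incompat feature bits from the ext2/ext3 on-disk format.
INCOMPAT_FLAGS = {
    0x0001: "COMPRESSION",
    0x0002: "FILETYPE",
    0x0004: "RECOVER",      # journal needs recovery
    0x0008: "JOURNAL_DEV",
    0x0010: "META_BG",
}

# Assumption: the ext2 driver in question supports only FILETYPE.
EXT2_SUPPORTED = 0x0002


def unsupported_incompat(s_feature_incompat, supported=EXT2_SUPPORTED):
    """Return the names of incompat bits the driver cannot handle."""
    unknown = s_feature_incompat & ~supported
    names = []
    bit = 1
    while bit <= unknown:
        if unknown & bit:
            # Fall back to the hex value for bits not in the table.
            names.append(INCOMPAT_FLAGS.get(bit, hex(bit)))
        bit <<= 1
    return names


# An ext3 filesystem after an unclean shutdown: FILETYPE | RECOVER.
print(unsupported_incompat(0x0002 | 0x0004))
```

Under these assumptions the only bit ext2 objects to is RECOVER (value 4), which matches the explanation above: the journal has not been replayed yet.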

2003-02-06 20:55:29

by Thomas Molina

Subject: Re: possible partition corruption

On Thu, 6 Feb 2003, Andrew Morton wrote:

> Thomas Molina <[email protected]> wrote:
> >
> > I have run into an apparent anomaly while compiling/testing 2.5.59-bk. My
> > normal mode of operation is to do a daily bk pull to get the latest csets
> > and do a compile/boot run. After yesterday's pull I started seeing problems
> > on reboot. During the reboot I would get the "OK booting the kernel"
> > message followed by a system freeze. After a forced reboot into a stock
> > RedHat 8.0 2.4 kernel I would see the system misidentify my boot partition
> > as an ext2 partition and the following messages would appear:
> >
>
> Everything you describe is consistent with a kernel which does not have ext3
> compiled into it.
>
> > EXT2-fs: ide0(3,8): couldn't mount because of unsupported optional feature (4).
> > Kernel panic: VFS: Unable to mount root fs on 08:08
> >
>
> That is an ext3 filesystem in the "needs journal recovery" state. ext2
> cannot mount that until either fsck or the ext3 kernel driver has run
> recovery.
>
> grep EXT3 .config ??
>

I'm aware of that. I attached the config file showing ext3 was compiled
in. I went through several iterations to make sure the proper filesystem
was compiled in.

2003-02-06 21:14:29

by Andrew Morton

Subject: Re: possible partition corruption

Thomas Molina <[email protected]> wrote:
>
> > Everything you describe is consistent with a kernel which does not have ext3
> > compiled into it.
> ...
> I'm aware of that.

In that case you may be experiencing the mysterious vanishing
ext3_read_super-doesn't-work bug. Usually a recompile/relink makes it go
away. I haven't seen it in months.

Could you please drop this additional debugging in there and see
what happens?


fs/ext3/super.c | 34 ++++++++++++++++++++++------------
1 files changed, 22 insertions(+), 12 deletions(-)

diff -puN fs/ext3/super.c~efs fs/ext3/super.c
--- 25/fs/ext3/super.c~efs Thu Feb 6 13:17:55 2003
+++ 25-akpm/fs/ext3/super.c Thu Feb 6 13:21:16 2003
@@ -1017,12 +1017,16 @@ static int ext3_fill_super (struct super
 	int i;
 	int needs_recovery;
 
+	printk("%s: enter\n", __FUNCTION__);
+
 #ifdef CONFIG_JBD_DEBUG
 	ext3_ro_after = 0;
 #endif
 	sbi = kmalloc(sizeof(*sbi), GFP_KERNEL);
-	if (!sbi)
+	if (!sbi) {
+		printk("no sbi\n");
 		return -ENOMEM;
+	}
 	sb->s_fs_info = sbi;
 	memset(sbi, 0, sizeof(*sbi));
 	sbi->s_mount_opt = 0;
@@ -1057,10 +1061,10 @@ static int ext3_fill_super (struct super
 	sbi->s_es = es;
 	sb->s_magic = le16_to_cpu(es->s_magic);
 	if (sb->s_magic != EXT3_SUPER_MAGIC) {
-		if (!silent)
-			printk(KERN_ERR
-			       "VFS: Can't find ext3 filesystem on dev %s.\n",
-			       sb->s_id);
+		printk(KERN_ERR
+		       "VFS: Can't find ext3 filesystem on dev %s.\n",
+		       sb->s_id);
+		printk("magic=0x%x\n", sb->s_magic);
 		goto failed_mount;
 	}
 
@@ -1091,8 +1095,10 @@ static int ext3_fill_super (struct super
 	sbi->s_resuid = le16_to_cpu(es->s_def_resuid);
 	sbi->s_resgid = le16_to_cpu(es->s_def_resgid);
 
-	if (!parse_options ((char *) data, sbi, &journal_inum, 0))
+	if (!parse_options ((char *) data, sbi, &journal_inum, 0)) {
+		printk("option parsing failed\n");
 		goto failed_mount;
+	}
 
 	sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
 		((sbi->s_mount_opt & EXT3_MOUNT_POSIX_ACL) ? MS_POSIXACL : 0);
@@ -1276,16 +1282,19 @@ static int ext3_fill_super (struct super
 	 */
 	if (!test_opt(sb, NOLOAD) &&
 	    EXT3_HAS_COMPAT_FEATURE(sb, EXT3_FEATURE_COMPAT_HAS_JOURNAL)) {
-		if (ext3_load_journal(sb, es))
+		if (ext3_load_journal(sb, es)) {
+			printk("journal loading failed\n");
 			goto failed_mount2;
+		}
 	} else if (journal_inum) {
-		if (ext3_create_journal(sb, es, journal_inum))
+		if (ext3_create_journal(sb, es, journal_inum)) {
+			printk("journal creation failed\n");
 			goto failed_mount2;
+		}
 	} else {
-		if (!silent)
-			printk (KERN_ERR
-				"ext3: No journal on filesystem on %s\n",
-				sb->s_id);
+		printk (KERN_ERR
+			"ext3: No journal on filesystem on %s\n",
+			sb->s_id);
 		goto failed_mount2;
 	}
 
@@ -1371,6 +1380,7 @@ failed_mount:
 out_fail:
 	sb->s_fs_info = NULL;
 	kfree(sbi);
+	printk("%s: failing\n", __FUNCTION__);
 	return -EINVAL;
 }
 

_

2003-02-06 21:23:37

by Thomas Molina

Subject: Re: possible partition corruption

On Thu, 6 Feb 2003, Andrew Morton wrote:

> Thomas Molina <[email protected]> wrote:
> >
> > > Everything you describe is consistent with a kernel which does not have ext3
> > > compiled into it.
> > ...
> > I'm aware of that.
>
> In that case you may be experiencing the mysterious vanishing
> ext3_read_super-doesn't-work bug. Usually a recompile/relink makes it go
> away. I haven't seen it in months.
>
> Could you please drop this additional debugging in there and see
> what happens?

I'll try it, but a question did occur to me. I got the hang while booting
a freshly-compiled 2.5.59, but the error message was received after
supposedly cleaning and recovering the journal. That was using the stock
RedHat 8.0 kernel, 2.4.18-24.8.0, which most certainly does have ext3
support. Would the bug you described affect a following boot into a
totally different kernel?

2003-02-06 21:31:06

by Andrew Morton

Subject: Re: possible partition corruption

Thomas Molina <[email protected]> wrote:
>
> On Thu, 6 Feb 2003, Andrew Morton wrote:
>
> > Thomas Molina <[email protected]> wrote:
> > >
> > > > Everything you describe is consistent with a kernel which does not have ext3
> > > > compiled into it.
> > > ...
> > > I'm aware of that.
> >
> > In that case you may be experiencing the mysterious vanishing
> > ext3_read_super-doesn't-work bug. Usually a recompile/relink makes it go
> > away. I haven't seen it in months.
> >
> > Could you please drop this additional debugging in there and see
> > what happens?
>
> I'll try it, but a question did occur to me. I got the hang while booting
> a freshly-compiled 2.5.59, but the error message was received after
> supposedly cleaning and recovering the journal. That was using the stock
> RedHat 8.0 kernel, 2.4.18-24.8.0, which most certainly does have ext3
> support. Would the bug you described affect a following boot into a
> totally different kernel?

The error message you saw was ext2 saying "I don't know how to deal with
uncleanly shut-down ext3 filesystems".

The kernel will try to mount the fs as ext3 first, and then as ext2.

For some reason the ext3 attempt is not successful, so it falls through to
ext2 and that is the first message we see.
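
The fall-through described above can be modelled with a small Python sketch. Everything here is hypothetical (the function names, the probe callables, and the idea that the ext3 attempt fails without printing anything); it only illustrates why the first visible message can come from ext2 even though ext3 was tried first.

```python
def mount_root(probes):
    """Try each (name, probe) pair in order.

    Each probe returns (ok, message); message is None when the
    attempt is silent. Returns (mounted_fs_or_None, visible_messages).
    """
    messages = []
    for name, probe in probes:
        ok, message = probe()
        if message:
            messages.append(f"{name}: {message}")
        if ok:
            return name, messages
    return None, messages


def ext3_probe():
    # Hypothetical stand-in: the ext3 attempt fails without logging.
    return False, None


def ext2_probe():
    # ext2 refuses a filesystem whose journal still needs recovery.
    return False, "couldn't mount because of unsupported optional feature (4)"


fs, log = mount_root([("ext3", ext3_probe), ("ext2", ext2_probe)])
print(fs, log)
```

In this model nothing is mounted and the only message recorded is the ext2 one, which is the pattern reported in the original post.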

2003-02-06 21:49:16

by Andreas Dilger

Subject: Re: possible partition corruption

On Feb 06, 2003 15:05 -0600, Thomas Molina wrote:
> On Thu, 6 Feb 2003, Andrew Morton wrote:
> > Everything you describe is consistent with a kernel which does not have ext3
> > compiled into it.
> >
> > That is an ext3 filesystem in the "needs journal recovery" state. ext2
> > cannot mount that until either fsck or the ext3 kernel driver has run
> > recovery.
>
> I'm aware of that. I attached the config file showing ext3 was compiled
> in. I went through several iterations to make sure the proper filesystem
> was compiled in.

Maybe some config/linking breakage puts ext2 in front of ext3 in the probe
order? Try compiling with ext2 as a module.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

2003-02-06 22:26:01

by Thomas Molina

Subject: Re: possible partition corruption

On Thu, 6 Feb 2003, Andreas Dilger wrote:

> On Feb 06, 2003 15:05 -0600, Thomas Molina wrote:
> > On Thu, 6 Feb 2003, Andrew Morton wrote:
> > > Everything you describe is consistent with a kernel which does not have ext3
> > > compiled into it.
> > >
> > > That is an ext3 filesystem in the "needs journal recovery" state. ext2
> > > cannot mount that until either fsck or the ext3 kernel driver has run
> > > recovery.
> >
> > I'm aware of that. I attached the config file showing ext3 was compiled
> > in. I went through several iterations to make sure the proper filesystem
> > was compiled in.
>
> Maybe some config/linking breakage puts ext2 in front of ext3 in the probe
> order? Try compiling with ext2 as a module.

Nope. I still got the same symptoms.

2003-02-07 19:50:29

by Thomas Molina

Subject: Re: possible partition corruption

> On Thu, 6 Feb 2003, Andrew Morton wrote:
>
> > > I'm still not seeing your messages.
> > >
> > > Part of the problem is I'm only seeing part of the boot sequence. The
> > > last thing I see is the infamous:
> > >
> > > Uncompressing Linux... OK booting the kernel
> > >
> > > Something is going on behind the scenes, though. The drive light seems to
> > > be doing its usual thing. The system responds to alt-sysrq keys, but not
> > > to ctrl-alt-del.
> >
> > eh? But that sounds like something totally different from
> > your original report?
> >
> > You're not getting any console output at all?
> >
> > Make sure that you're not booting with the `quiet' option.
> >
> > If you're using any fancy video options, framebuffers, etc then disable them.

Further on this problem. I did a system restore to a disk on /dev/hdb,
fixed up fstab and other files so I could boot from /dev/hdb1. I got
results similar to the original. However, this time I did get log
messages.

I'm also attaching a copy of the configuration used to build this kernel.

What might be a good next step in debugging this?

When I booted up I got the following on the console:

Booting 'tom's test'
root (hd1,0)
Filesystem type is ext2fs, partition type 0x83
kernel /boot/vmlinuz-2.5.59-0207 ro root=/dev/hdb1
[Linux-bzImage, setup=0x1400, size= 0xec89c]

Uncompressing Linux... OK booting the kernel


at which time I ceased seeing messages on the console. However, I
continued to see drive light activity. When that ceased, I booted back
into a stock RedHat 8.0 kernel and did an fsck on the affected drive. I
got the following:

[root@dad root]# e2fsck -d -f -v /dev/hdb1
e2fsck 1.27 (8-Mar-2002)
/dev/hdb1: recovering journal
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

106211 inodes used (9%)
83 non-contiguous inodes (0.1%)
# of inodes with ind/dind/tind blocks: 4994/29/0
777861 blocks used (33%)
0 bad blocks
0 large files

80812 regular files
4661 directories
2523 character device files
15884 block device files
1 fifo
2265 links
2318 symbolic links (2315 fast symbolic links)
3 sockets
--------
108467 files

In /var/log/messages I got the following interesting stuff:

Feb 7 13:20:36 dad kernel: Journalled Block Device driver loaded

(how can that be, considering the initial message on the console?)

then I started seeing some of these:

Feb 7 13:20:36 dad keytable: /etc/rc3.d/S17keytable: line 26: /dev/tty0: No such device

Feb 7 13:20:36 dad keytable: Couldnt get a file descriptor referring to the console

Feb 7 13:20:36 dad kernel: Warning: unable to open an initial console.

Feb 7 13:20:43 dad gpm[724]: oops() invoked from gpn.c(125)
Feb 7 13:20:43 dad gpm[724]: /dev/console: No such device
Feb 7 13:20:43 dad gpm: gpm: oops() invoked from gpn.c(125)
Feb 7 13:20:43 dad gpm: /dev/console: No such device
Feb 7 13:20:43 dad gpm: gpm startup failed
Feb 7 13:20:43 dad crond: crond startup succeeded
Feb 7 13:20:44 dad xfs: listening on port 7100
Feb 7 13:20:44 dad xfs: xfs startup succeeded
Feb 7 13:20:44 dad anacron: anacron startup succeeded
Feb 7 13:20:44 dad atd: atd startup succeeded
Feb 7 13:20:44 dad xfs: ignoring font path element /usr/X11R6/lib/X11/fonts/cyrillic (unreadable)
Feb 7 13:20:44 dad kernel: warning: process `update' used the obsolete bdflush system call
Feb 7 13:20:44 dad kernel: Fix your initscripts?
Feb 7 13:20:44 dad /sbin/mingetty[799]: /dev/tty5: cannot open tty: No such device
Feb 7 13:20:44 dad /sbin/mingetty[795]: /dev/tty1: cannot open tty: No such device
Feb 7 13:20:44 dad /sbin/mingetty[796]: /dev/tty2: cannot open tty: No such device
Feb 7 13:20:44 dad /sbin/mingetty[797]: /dev/tty3: cannot open tty: No such device
Feb 7 13:20:44 dad /sbin/mingetty[800]: /dev/tty6: cannot open tty: No such device
Feb 7 13:20:44 dad /sbin/mingetty[798]: /dev/tty4: cannot open tty: No such device
Feb 7 13:20:44 dad /sbin/mingetty[809]: /dev/tty1: cannot open tty: No such device
Feb 7 13:20:44 dad /sbin/mingetty[810]: /dev/tty2: cannot open tty: No such device
Feb 7 13:20:44 dad /sbin/mingetty[811]: /dev/tty3: cannot open tty: No such device
Feb 7 13:20:44 dad /sbin/mingetty[812]: /dev/tty4: cannot open tty: No such device
Feb 7 13:20:44 dad /sbin/mingetty[813]: /dev/tty5: cannot open tty: No such device
Feb 7 13:20:44 dad /sbin/mingetty[814]: /dev/tty6: cannot open tty: No such device


etc., etc., etc.

followed finally by:

Feb 7 13:20:45 dad init: Id "1" respawning too fast: disabled for 5 minutes
Feb 7 13:20:45 dad /sbin/mingetty[907]: /dev/tty6: cannot open tty: No such device
Feb 7 13:20:45 dad /sbin/mingetty[906]: /dev/tty4: cannot open tty: No such device
Feb 7 13:20:45 dad init: Id "5" respawning too fast: disabled for 5 minutes
Feb 7 13:20:45 dad init: Id "4" respawning too fast: disabled for 5 minutes
Feb 7 13:20:45 dad /sbin/mingetty[911]: /dev/tty3: cannot open tty: No such device
Feb 7 13:20:45 dad /sbin/mingetty[913]: /dev/tty6: cannot open tty: No such device
Feb 7 13:20:45 dad /sbin/mingetty[905]: /dev/tty2: cannot open tty: No such device
Feb 7 13:20:45 dad init: Id "6" respawning too fast: disabled for 5 minutes
Feb 7 13:20:45 dad init: Id "2" respawning too fast: disabled for 5 minutes
Feb 7 13:20:45 dad /sbin/mingetty[915]: /dev/tty3: cannot open tty: No such device
Feb 7 13:20:45 dad init: Id "3" respawning too fast: disabled for 5 minutes
Feb 7 13:20:45 dad init: no more processes left in this runlevel

You get the idea.



Attachments:
.config (21.29 kB)

2003-02-07 22:06:40

by Andrew Morton

Subject: Re: possible partition corruption


gack, your whole tty layer didn't come up, or the device nodes aren't there.
Are you using devfs?
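
One way to tell the two possibilities apart is to look at which error the daemons logged. As a rough rule, "No such file or directory" (ENOENT) points at a missing node in /dev, while "No such device" (ENODEV) usually means the node is present but nothing in the kernel claims it. The Python sketch below tallies such lines; the regular expression, the helper name, and the hint texts are assumptions made for illustration, and the sample lines are the log entries from this thread with their wrapped halves joined.

```python
import errno
import re
from collections import Counter

# Assumption: the daemons log the standard C-library error strings.
TEXT_TO_ERRNO = {
    "No such file or directory": errno.ENOENT,
    "No such device": errno.ENODEV,
}

# What each errno usually implies for a failed open() of a /dev node.
HINTS = {
    errno.ENOENT: "device node is missing from /dev",
    errno.ENODEV: "node exists, but no driver claims it",
}

LINE_RE = re.compile(r"(/dev/\w+): (?:cannot open tty: )?(No such [\w ]+)$")


def summarize(lines):
    """Count (device, errno) pairs found in syslog-style lines."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line.strip())
        if m and m.group(2) in TEXT_TO_ERRNO:
            counts[(m.group(1), TEXT_TO_ERRNO[m.group(2)])] += 1
    return counts


sample = [
    "Feb 7 13:20:43 dad gpm[724]: /dev/console: No such device",
    "Feb 7 13:20:44 dad /sbin/mingetty[795]: /dev/tty1: cannot open tty: No such device",
    "Feb 7 13:20:44 dad /sbin/mingetty[809]: /dev/tty1: cannot open tty: No such device",
]

for (dev, err), n in sorted(summarize(sample).items()):
    print(f"{dev}: {n}x {errno.errorcode[err]} ({HINTS[err]})")
```

Every failure in the sample is ENODEV, which would be more consistent with the tty driver not being present than with missing device nodes.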

2003-02-07 23:31:36

by Andrew Morton

Subject: Re: possible partition corruption

Thomas Molina <[email protected]> wrote:
>
> Further on this problem. I did a system restore to a disk on /dev/hdb,
> fixed up fstab and other files so I could boot from /dev/hdb1. I got
> results similar to the original. However, this time I did get log
> messages.

OK, I tried your .config and the same happened here - no console output and
huge amounts of disk I/O as the system was booting. No filesystem problems
on reboot, however.

I couldn't immediately see the reason for this. You have your whole input
layer configured as a module, perhaps that has upset things.

I suggest that you work on the config settings and find out what it is that
is causing the tty layer to not come up.

2003-02-07 23:44:52

by Thomas Molina

Subject: Re: possible partition corruption

On Fri, 7 Feb 2003, Andrew Morton wrote:

>
> gack, your whole tty layer didn't come up, or the device nodes aren't there.
> Are you using devfs?
>

No

2003-02-08 01:38:47

by Thomas Molina

Subject: Re: possible partition corruption

On Fri, 7 Feb 2003, Andrew Morton wrote:

> I couldn't immediately see the reason for this. You have your whole input
> layer configured as a module, perhaps that has upset things.
>
> I suggest that you work on the config settings and find out what it is that
> is causing the tty layer to not come up.

OK, Hold the phone Susan, and stamp IDIOT on my forehead. The reason for
not getting any output on the console was that I had configured the kernel
without support for virtual terminals or console on virtual terminals.
Once I configured that correctly, things worked. Duh!

The thing I don't understand is why would not having that configured in
give me the lost journal and an inability to boot and mount the root
partition when I booted back into a "normal" kernel.
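
A quick way to catch this kind of omission before booting is to report the state of a few options straight from the .config. The Python sketch below is only an illustration: the helper name is made up, and the sample text is a hypothetical fragment modelled on what this thread describes (virtual terminals off, VGA console and ext3 built in), not the attached file.

```python
import re


def option_states(config_text, options):
    """Map each option to its value ('y', 'm', ...) or 'not set'."""
    states = {opt: "not set" for opt in options}
    for line in config_text.splitlines():
        # Lines such as "# CONFIG_VT is not set" are comments and do not match.
        m = re.match(r"(CONFIG_\w+)=(.*)$", line.strip())
        if m and m.group(1) in states:
            states[m.group(1)] = m.group(2)
    return states


# Hypothetical .config fragment, for illustration only.
SAMPLE = """\
# CONFIG_VT is not set
CONFIG_VGA_CONSOLE=y
CONFIG_EXT3_FS=y
"""

WANTED = ["CONFIG_VT", "CONFIG_VT_CONSOLE", "CONFIG_VGA_CONSOLE", "CONFIG_EXT3_FS"]
states = option_states(SAMPLE, WANTED)
for opt in WANTED:
    print(f"{opt}: {states[opt]}")
```

To check a real build, read the kernel tree's .config into a string and pass it in place of SAMPLE. Anything reported as "m" for a root filesystem driver, or "not set" for the console options, deserves a second look.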

2003-02-08 01:57:04

by Andrew Morton

Subject: Re: possible partition corruption

Thomas Molina <[email protected]> wrote:
>
> On Fri, 7 Feb 2003, Andrew Morton wrote:
>
> > I couldn't immediately see the reason for this. You have your whole input
> > layer configured as a module, perhaps that has upset things.
> >
> > I suggest that you work on the config settings and find out what it is that
> > is causing the tty layer to not come up.
>
> OK, Hold the phone Susan, and stamp IDIOT on my forehead. The reason for
> not getting any output on the console was that I had configured the kernel
> without support for virtual terminals or console on virtual terminals.
> Once I configured that correctly, things worked. Duh!

Yes, but you had CONFIG_VGA_CONSOLE=y.

> The thing I don't understand is why would not having that configured in
> give me the lost journal and an inability to boot and mount the root
> partition when I booted back into a "normal" kernel.

Don't know. It didn't do that when I tested it.