2003-07-11 04:28:30

by Ivan Gyurdiev

[permalink] [raw]
Subject: 2.5.75 does not boot - TCQ oops

See,

http://www.ussg.iu.edu/hypermail/linux/kernel/0307.0/0515.html

where the bug is described for 2.5.74.
I got no replies, and the bug persists in 2.5.75 (+bk patches).

Note:
The machine boots with TASKFILE on, TCQ is causing the problem.




2003-07-11 07:48:56

by Jens Axboe

[permalink] [raw]
Subject: Re: 2.5.75 does not boot - TCQ oops

On Thu, Jul 10 2003, Ivan Gyurdiev wrote:
> See,
>
> http://www.ussg.iu.edu/hypermail/linux/kernel/0307.0/0515.html
>
> where the bug is described for 2.5.74.
> I got no replies, and the bug persists in 2.5.75 (+bk patches).
>
> Note:
> The machine boots with TASKFILE on, TCQ is causing the problem.

Looks like IDE using the queue before it has been setup, probably Bart
broke it when he moved the TCQ init around. I'll take a look.

--
Jens Axboe

2003-07-11 08:13:56

by Jens Axboe

[permalink] [raw]
Subject: Re: 2.5.75 does not boot - TCQ oops

On Fri, Jul 11 2003, Jens Axboe wrote:
> On Thu, Jul 10 2003, Ivan Gyurdiev wrote:
> > See,
> >
> > http://www.ussg.iu.edu/hypermail/linux/kernel/0307.0/0515.html
> >
> > where the bug is described for 2.5.74.
> > I got no replies, and the bug persists in 2.5.75 (+bk patches).
> >
> > Note:
> > The machine boots with TASKFILE on, TCQ is causing the problem.
>
> Looks like IDE using the queue before it has been setup, probably Bart
> broke it when he moved the TCQ init around. I'll take a look.

Here's the fix. Bart, you moved the tcq init to a much earlier fase
(saying ide_init_drive() was too early, well ide_dma_on is much earlier
in the init fase). ide_init_drive() _is_ the correct location in my
oppinion, drive and queue has been set up at this point.

--- drivers/ide/ide-dma.c~ 2003-07-11 10:21:04.492561920 +0200
+++ drivers/ide/ide-dma.c 2003-07-11 10:25:28.183474808 +0200
@@ -572,10 +572,6 @@
if (HWIF(drive)->ide_dma_host_on(drive))
return 1;

-#ifdef CONFIG_BLK_DEV_IDE_TCQ_DEFAULT
- HWIF(drive)->ide_dma_queued_on(drive);
-#endif
-
return 0;
}

--- drivers/ide/ide-probe.c~ 2003-07-11 10:06:54.036850840 +0200
+++ drivers/ide/ide-probe.c 2003-07-11 10:25:30.884064256 +0200
@@ -945,6 +945,10 @@
static void ide_init_drive(ide_drive_t *drive)
{
ide_toggle_bounce(drive, 1);
+
+#ifdef CONFIG_BLK_DEV_IDE_TCQ_DEFAULT
+ HWIF(drive)->ide_dma_queued_on(drive);
+#endif
}

/*

--
Jens Axboe

2003-07-11 08:23:12

by Jens Axboe

[permalink] [raw]
Subject: Re: 2.5.75 does not boot - TCQ oops

On Fri, Jul 11 2003, Jens Axboe wrote:
> On Fri, Jul 11 2003, Jens Axboe wrote:
> > On Thu, Jul 10 2003, Ivan Gyurdiev wrote:
> > > See,
> > >
> > > http://www.ussg.iu.edu/hypermail/linux/kernel/0307.0/0515.html
> > >
> > > where the bug is described for 2.5.74.
> > > I got no replies, and the bug persists in 2.5.75 (+bk patches).
> > >
> > > Note:
> > > The machine boots with TASKFILE on, TCQ is causing the problem.
> >
> > Looks like IDE using the queue before it has been setup, probably Bart
> > broke it when he moved the TCQ init around. I'll take a look.
>
> Here's the fix. Bart, you moved the tcq init to a much earlier fase
> (saying ide_init_drive() was too early, well ide_dma_on is much earlier
> in the init fase). ide_init_drive() _is_ the correct location in my
> oppinion, drive and queue has been set up at this point.

Better still (and later in the probe) is this version. This is in my
oppinion the correct place to init tcq, and also contains it to ide-disk
where it should be.

--- drivers/ide/ide-dma.c~ 2003-07-11 10:21:04.492561920 +0200
+++ drivers/ide/ide-dma.c 2003-07-11 10:25:28.183474808 +0200
@@ -572,10 +572,6 @@
if (HWIF(drive)->ide_dma_host_on(drive))
return 1;

-#ifdef CONFIG_BLK_DEV_IDE_TCQ_DEFAULT
- HWIF(drive)->ide_dma_queued_on(drive);
-#endif
-
return 0;
}

--- drivers/ide/ide-disk.c~ 2003-07-11 10:30:51.783280160 +0200
+++ drivers/ide/ide-disk.c 2003-07-11 10:31:09.873530024 +0200
@@ -1665,6 +1665,10 @@
drive->no_io_32bit = id->dword_io ? 1 : 0;
if (drive->id->cfs_enable_2 & 0x3000)
write_cache(drive, (id->cfs_enable_2 & 0x3000));
+
+#ifdef CONFIG_BLK_DEV_IDE_TCQ_DEFAULT
+ HWIF(drive)->ide_dma_queued_on(drive);
+#endif
}

static int idedisk_cleanup (ide_drive_t *drive)

--
Jens Axboe

2003-07-11 08:20:15

by Voluspa

[permalink] [raw]
Subject: Re: 2.5.75 does not boot - TCQ oops


On 2003-07-11 2:51:42 Ivan Gyurdiev wrote:

> http://www.ussg.iu.edu/hypermail/linux/kernel/0307.0/0515.html

Reading the handcrafted log, yes, that's 'exactly' what I get ;-)
If prodded, I can do a transcription as well.

> where the bug is described for 2.5.74.
> I got no replies, and the bug persists in 2.5.75 (+bk patches).

Haven't tried the 2.5.74, but plain 2.5.75 is where I crash.

> Note:
> The machine boots with TASKFILE on, TCQ is causing the problem.

Yes, writing this on a machine with CONFIG_IDE_TASK_IOCTL is not set,
CONFIG_IDE_TASKFILE_IO=y and CONFIG_BLK_DEV_IDE_TCQ is not set.

Speaking of TASKFILE... I had some hope that it would fix at least a bit
of the regression in disk speed since 2.4.19-ac1+preempt (my yardstick,
excellent kernel). Doing a hdparm -tT /dev/hda on that kernel I get ca
123 MB/sec and 27 MB/sec. On this 2.5.75 I see 119 MB/sec and 22 MB/sec.

Here's hoping it can be cranked up before 2.6!

Mvh
Mats Johannesson

Subject: Re: 2.5.75 does not boot - TCQ oops


On Fri, 11 Jul 2003, Jens Axboe wrote:

> On Fri, Jul 11 2003, Jens Axboe wrote:
> > On Fri, Jul 11 2003, Jens Axboe wrote:
> > > On Thu, Jul 10 2003, Ivan Gyurdiev wrote:
> > > > See,
> > > >
> > > > http://www.ussg.iu.edu/hypermail/linux/kernel/0307.0/0515.html
> > > >
> > > > where the bug is described for 2.5.74.
> > > > I got no replies, and the bug persists in 2.5.75 (+bk patches).
> > > >
> > > > Note:
> > > > The machine boots with TASKFILE on, TCQ is causing the problem.
> > >
> > > Looks like IDE using the queue before it has been setup, probably Bart
> > > broke it when he moved the TCQ init around. I'll take a look.
> >
> > Here's the fix. Bart, you moved the tcq init to a much earlier fase
> > (saying ide_init_drive() was too early, well ide_dma_on is much earlier
> > in the init fase). ide_init_drive() _is_ the correct location in my
> > oppinion, drive and queue has been set up at this point.

Yes, but hey you've acked this change ;-).

> Better still (and later in the probe) is this version. This is in my
> oppinion the correct place to init tcq, and also contains it to ide-disk
> where it should be.

Looks okay, thanks.
--
Bartlomiej

> --- drivers/ide/ide-dma.c~ 2003-07-11 10:21:04.492561920 +0200
> +++ drivers/ide/ide-dma.c 2003-07-11 10:25:28.183474808 +0200
> @@ -572,10 +572,6 @@
> if (HWIF(drive)->ide_dma_host_on(drive))
> return 1;
>
> -#ifdef CONFIG_BLK_DEV_IDE_TCQ_DEFAULT
> - HWIF(drive)->ide_dma_queued_on(drive);
> -#endif
> -
> return 0;
> }
>
> --- drivers/ide/ide-disk.c~ 2003-07-11 10:30:51.783280160 +0200
> +++ drivers/ide/ide-disk.c 2003-07-11 10:31:09.873530024 +0200
> @@ -1665,6 +1665,10 @@
> drive->no_io_32bit = id->dword_io ? 1 : 0;
> if (drive->id->cfs_enable_2 & 0x3000)
> write_cache(drive, (id->cfs_enable_2 & 0x3000));
> +
> +#ifdef CONFIG_BLK_DEV_IDE_TCQ_DEFAULT
> + HWIF(drive)->ide_dma_queued_on(drive);
> +#endif
> }
>
> static int idedisk_cleanup (ide_drive_t *drive)
>
> --
> Jens Axboe

2003-07-11 10:41:10

by Jens Axboe

[permalink] [raw]
Subject: Re: 2.5.75 does not boot - TCQ oops

On Fri, Jul 11 2003, Bartlomiej Zolnierkiewicz wrote:
>
> On Fri, 11 Jul 2003, Jens Axboe wrote:
>
> > On Fri, Jul 11 2003, Jens Axboe wrote:
> > > On Fri, Jul 11 2003, Jens Axboe wrote:
> > > > On Thu, Jul 10 2003, Ivan Gyurdiev wrote:
> > > > > See,
> > > > >
> > > > > http://www.ussg.iu.edu/hypermail/linux/kernel/0307.0/0515.html
> > > > >
> > > > > where the bug is described for 2.5.74.
> > > > > I got no replies, and the bug persists in 2.5.75 (+bk patches).
> > > > >
> > > > > Note:
> > > > > The machine boots with TASKFILE on, TCQ is causing the problem.
> > > >
> > > > Looks like IDE using the queue before it has been setup, probably Bart
> > > > broke it when he moved the TCQ init around. I'll take a look.
> > >
> > > Here's the fix. Bart, you moved the tcq init to a much earlier fase
> > > (saying ide_init_drive() was too early, well ide_dma_on is much earlier
> > > in the init fase). ide_init_drive() _is_ the correct location in my
> > > oppinion, drive and queue has been set up at this point.
>
> Yes, but hey you've acked this change ;-).

I know, I'll carry (let me see) 30% of the blame :)

> > Better still (and later in the probe) is this version. This is in my
> > oppinion the correct place to init tcq, and also contains it to ide-disk
> > where it should be.
>
> Looks okay, thanks.

Thanks, I'll forward it then.

--
Jens Axboe

2003-07-11 22:34:57

by Ivan Gyurdiev

[permalink] [raw]
Subject: Re: 2.5.75 does not boot - TCQ oops


Patch confirmed to work - the machine boots.
Unfortunately the long term result looked something like:

Objects Without Names: 2580
Empty Lost Dirs Removed: 12294
Dirs Linked to lost+found: 192
Dirs Without Stats Data: 141
....

Most massive fs corruption I've ever had.
Twice. (I tried it again to get a log, but I don't know if it will be
helpful).

I blamed the reiserfs bk work at first (which I applied along with your tcq
patch), but I noted that the fs only gets corrupted with a tcq-enabled kernel
(every time it seems...).

The machine boots and the filesystem (reiserfs) check says:
Zero bit found in on-disk bitmap after the last valid bit. Switching to
fix-fixable pass. After which it detects various types of corruption.
(eventually it breaks with fork: Cannot allocate memory - is this a
fsck.reiserfs problem?)

Here's parts of a log with fastboot on (skip the fs check).
The reiser filesystem had just been repaired, and here it breaks again.
I don't know if I should be sending this to you, but I think tcq causes the
corruption, since non-tcq 2.5.75 and 2.5.74 appear to run fine.

Kernel command line: auto BOOT_IMAGE=2.5.75-2 ro root=306 single fastboot
...

Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
VP_IDE: IDE controller at PCI slot 0000:00:11.1
ACPI: No IRQ known for interrupt pin A of device 0000:00:11.1 - using IRQ 255
VP_IDE: chipset revision 6
VP_IDE: not 100% native mode: will probe irqs later
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
VP_IDE: VIA vt8235 (rev 00) IDE UDMA133 controller on pci0000:00:11.1
ide0: BM-DMA at 0xe000-0xe007, BIOS settings: hda:DMA, hdb:pio
ide1: BM-DMA at 0xe008-0xe00f, BIOS settings: hdc:DMA, hdd:pio
hda: IC35L080AVVA07-0, ATA DISK drive
anticipatory scheduling elevator
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hdc: TOSHIBA DVD-ROM SD-R1102, ATAPI CD/DVD-ROM drive
ide1 at 0x170-0x177,0x376 on irq 15
hda: max request size: 128KiB
hda: host protected area => 1
hda: 160836480 sectors (82348 MB) w/1863KiB Cache, CHS=159560/16/63, UDMA(100)
hda: tagged command queueing enabled, command queue depth 8
/dev/ide/host0/bus0/target0/lun0: p1 p2 < p5 p6 p7 >
.....
BIOS EDD facility v0.09 2003-Jan-22, 1 devices found
....

Reiserfs journal params: device hda6, size 8192, journal first block 18, max
trans len 1024, max batch 900, max commit age 30, max trans age 30
reiserfs: checking transaction log (hda6) for (hda6)
Using r5 hash to sort names
VFS: Mounted root (reiserfs filesystem) readonly.
Mounted devfs on /dev
Freeing unused kernel memory: 152k freed
Real Time Clock Driver v1.11
Adding 512024k swap on /dev/hda5. Priority:-1 extents:1
vs-4080: reiserfs_free_block: free_block (hda6:895404)[dev:blocknr]: bit
already cleared
vs-4080: reiserfs_free_block: free_block (hda6:895260)[dev:blocknr]: bit
already cleared
vs-4080: reiserfs_free_block: free_block (hda6:895259)[dev:blocknr]: bit
already cleared
vs-4080: reiserfs_free_block: free_block (hda6:895258)[dev:blocknr]: bit
already cleared
vs-4080: reiserfs_free_block: free_block (hda6:895257)[dev:blocknr]: bit
already cleared
vs-4080: reiserfs_free_block: free_block (hda6:895256)[dev:blocknr]: bit
already cleared
vs-4080: reiserfs_free_block: free_block (hda6:895255)[dev:blocknr]: bit
already cleared
vs-4080: reiserfs_free_block: free_block (hda6:895254)[dev:blocknr]: bit
already cleared
vs-4080: reiserfs_free_block: free_block (hda6:895253)[dev:blocknr]: bit
already cleared
vs-4080: reiserfs_free_block: free_block (hda6:895249)[dev:blocknr]: bit
already cleared
vs-4080: reiserfs_free_block: free_block (hda6:895248)[dev:blocknr]: bit
already cleared
vs-4080: reiserfs_free_block: free_block (hda6:895247)[dev:blocknr]: bit
already cleared
vs-4080: reiserfs_free_block: free_block (hda6:895246)[dev:blocknr]: bit
already cleared
vs-4080: reiserfs_free_block: free_block (hda6:895245)[dev:blocknr]: bit
already cleared
vs-4080: reiserfs_free_block: free_block (hda6:895244)[dev:blocknr]: bit
already cleared
vs-4080: reiserfs_free_block: free_block (hda6:895243)[dev:blocknr]: bit
already cleared
vs-4080: reiserfs_free_block: free_block (hda6:895242)[dev:blocknr]: bit
already cleared
vs-4080: reiserfs_free_block: free_block (hda6:895241)[dev:blocknr]: bit
already cleared
vs-4080: reiserfs_free_block: free_block (hda6:895240)[dev:blocknr]: bit
already cleared
vs-4080: reiserfs_free_block: free_block (hda6:895187)[dev:blocknr]: bit
already cleared

> --- drivers/ide/ide-dma.c~ 2003-07-11 10:21:04.492561920 +0200
> +++ drivers/ide/ide-dma.c 2003-07-11 10:25:28.183474808 +0200
> @@ -572,10 +572,6 @@
> if (HWIF(drive)->ide_dma_host_on(drive))
> return 1;
>
> -#ifdef CONFIG_BLK_DEV_IDE_TCQ_DEFAULT
> - HWIF(drive)->ide_dma_queued_on(drive);
> -#endif
> -
> return 0;
> }
>
> --- drivers/ide/ide-disk.c~ 2003-07-11 10:30:51.783280160 +0200
> +++ drivers/ide/ide-disk.c 2003-07-11 10:31:09.873530024 +0200
> @@ -1665,6 +1665,10 @@
> drive->no_io_32bit = id->dword_io ? 1 : 0;
> if (drive->id->cfs_enable_2 & 0x3000)
> write_cache(drive, (id->cfs_enable_2 & 0x3000));
> +
> +#ifdef CONFIG_BLK_DEV_IDE_TCQ_DEFAULT
> + HWIF(drive)->ide_dma_queued_on(drive);
> +#endif
> }
>
> static int idedisk_cleanup (ide_drive_t *drive)

2003-07-12 03:49:42

by Voluspa

[permalink] [raw]
Subject: Re: 2.5.75 does not boot - TCQ oops


On 2003-07-11 20:58:09 Ivan Gyurdiev wrote:

> Patch confirmed to work - the machine boots.
[...]
> Most massive fs corruption I've ever had.
[...]
> I blamed the reiserfs bk work at first (which I applied along with
> [Axboe's] tcq patch), but I noted that the fs only gets corrupted
> with a tcq-enabled kernel

I took home 2.5.75-bk1, applied the tcq patch and then used the computer
for five hours in the TCQ+TASKFILE environment. Filesystem is ext2.

Untarred a kernel. Copied it to a couple of destinations. Compiled.
Listened to music. Watched part of a movie. Did a nfs move of a file
(which by the way was a pure horror... 600k in ca 3 minutes) from a
machine with a 2.2.16 kernel. Then read about your woes.

Checked the md5sum of a large file that I keep for... corruption checks.
Was ok. Did a read massage by "cd /usr ; find . -type f -exec md5sum {}
\;". No hickups. Except...

Found 1 error in /var/log/kernel that I _never_ get with the 2.4.19:
Jul 12 02:03:39 loke kernel: hda: status error: status=0x48 { DriveReady
DataRequest }
Jul 12 02:03:39 loke kernel:
Jul 12 02:03:39 loke kernel: hda: drive not ready for command

So I shut down X in preparation for a reboot and full fs check, waiting
for the distributed project foldingathome to checkpoint its work, and
there was another never experienced error (time is UTC):

[01:10:00] [SPG] 100.0 %
[01:10:00] [SPG] Writing current.xyz
[01:10:01] [SPG] Sequence 15 completed:
[01:10:01] SNEYSGTFSFKTKQSKDEMLDALQIKNSYISQMRQITPKMAIEYPKGTPT . . .
[01:10:01] - Error: Checksums don't match (work/wudata_06.arc)
[01:10:01] [SPG] Error: checksum error
[01:10:02] CoreStatus = 0 (0)
[01:10:02] Client-core communications error: ERROR 0x0
[01:10:02] Deleting current work unit & continuing...

The reboot didn't reveal any fs corruption. Still, I've returned to a
safe kernel :-) Disk where TCQ was enabled (using depth 8)
is a IBM-DTLA-307015. Unfortunately, or luckily, my
IC35L080AVVA07-0 shares its life with a CD, so no TCQ there.

Mvh
Mats Johannesson

2003-07-12 09:27:24

by Ivan Gyurdiev

[permalink] [raw]
Subject: Re: 2.5.75 does not boot - TCQ oops

> [ Voluspa wrote: ]
> I took home 2.5.75-bk1, applied the tcq patch and then used the computer
> for five hours in the TCQ+TASKFILE environment. Filesystem is ext2.
....
>
> No hickups. Except...
>======================================================

What would be the best approach to track down the problem?
I've recovered 99% of my system.
(rpm -V is a wonderful thing together with wget).

I intend to convert my root filesystem from reiserfs to xfs tomorrow
for the purposes of testing (and because I've had too many problems with
reiser over time). Should I do this, and re-test TCQ to avoid any reiser
problems, or should I stick with the current setup and do more testing as
needed. I'll have to figure out some precautions if so :)
How can I help?

On the good side, fs corruption dug out 10 gigs of free space
I never would have thought were available....cleanup for free.




2003-07-12 12:22:08

by Ivan Gyurdiev

[permalink] [raw]
Subject: Re: 2.5.75 does not boot - TCQ oops

Okay, I figured out some more things.

I split my root fs in half and made two identical copies - one on reiser and
one on xfs. I compiled a bunch of kernels, and tested some more.

A TCQ enabled kernel with AS, queue depth of 32 works fine on both reiser and
xfs. A TCQ enabled kernel with deadline, queue depth of 32 works fine on both
reiser and xfs. Note the depth of 32. I had TCQ enabled on previous <74
kernels and it worked fine, everywhere with depth 32.

However, the kernel that crashed was using the default
(The default is 8, even though the comment says 32).
I tested that kernel again with reiserfs, TCQ, the two elevators, and queue
depth 8 - on-boot fsck detects corruption every time, marks the system
unclean, and requires --rebuild-tree or --fix-fixable on any further mounts.
I now get "Wrong amount of used blocks." message.

Have not tested depth 8 TCQ kernel with xfs, since that's my surviving root
fs, and I'd like to avoid corruption there.

Have not tested queue depths other than 8 and 32.
I could test some more on reiser now that I have a backup root fs.




2003-07-12 12:52:16

by Voluspa

[permalink] [raw]
Subject: Re: 2.5.75 does not boot - TCQ oops


On Sat, 12 Jul 2003 06:46:01 -0400
Ivan Gyurdiev <[email protected]> wrote:

[...]
> However, the kernel that crashed was using the default
> (The default is 8, even though the comment says 32).

Some nice testing you've done there. Yeah, the 'map vs reality' is
confusing when configuring. I previously changed the 8 to 32, following
the help, but for 2.5.75 changed it to 8 based on an offhand comment on
lkml saying: '8 has been tested as optimal'.

Hope LinuxTag isn't drawing all the developers attention away from your
problem!

Mvh
Mats Johannesson