2002-02-26 18:41:51

by Florian Lohoff

[permalink] [raw]
Subject: [CRASH] gdth / __block_prepare_write: zeroing uptodate buffer! / NMI Watchdog detected LOCKUP


Hi,
i have been looking for deadlocks we are experiencing on a couple of
SMP machines (Dual Celeron and Dual PIII). After a night stressing a
spare machine with dbench/bonnie++/tcpspray the machine locked up 30
minutes after i killed the test. The last messages on the console were:

__block_prepare_write: zeroing uptodate buffer!

15 times - Machine was answering ping first but stopped after a couple
of minutes. Another couple of minutes later the nmi_watchdog stepped in
and produced an oops:


ksymoops 2.3.4 on i686 2.4.18-pre4. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.18-pre4/ (default)
-m /boot/System.map-2.4.18-pre4 (default)

Warning: You did not tell me where to find symbol information. I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc. ksymoops -h explains the options.

NMI Watchdog detected LOCKUP on CPU0, registers:
CPU: 0
EIP: 0010:[<c01cc34d>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00000086
eax: c0094000 ebx: c0096d2c ecx: 00000002 edx: 00000000
esi: f880a000 edi: c0094084 ebp: c0094084 esp: f7ddbe80
ds: 0018 es: 0018 ss: 0018
Process scsi_eh_2 (pid: 8, stackpage=f7ddb000)
Stack: 0000001e f880a000 c0094084 00000000 00000001 00000009 c0096c16 00000086
f880a000 c0094084 00000000 c01cb2b2 00000000 000186a0 c0094084 00000000
00000002 00000000 00000018 c01c8e74 00000012 c0094084 00000000 c0094084
Call Trace: [<c01cb2b2>] [<c01c8e74>] [<c01c9036>] [<c01cbde4>] [<c01b7e25>]
[<c01b85a7>] [<c01b8b1b>] [<c01056f4>]
Code: 7e f9 e9 4f d0 ff ff 80 3f 00 f3 90 7e f9 e9 53 da ff ff 80

>>EIP; c01cc34d <_text_lock_gdth+e1/154> <=====
Trace; c01cb2b2 <gdth_interrupt+6ce/6d8>
Trace; c01c8e74 <gdth_wait+5c/ac>
Trace; c01c9036 <gdth_internal_cmd+172/1b8>
Trace; c01cbde4 <gdth_eh_bus_reset+1b4/36c>
Trace; c01b7e25 <scsi_try_bus_reset+89/118>
Trace; c01b85a7 <scsi_unjam_host+357/72c>
Trace; c01b8b1b <scsi_error_handler+19f/1f9>
Trace; c01056f4 <kernel_thread+28/38>
Code; c01cc34d <_text_lock_gdth+e1/154>
00000000 <_EIP>:
Code; c01cc34d <_text_lock_gdth+e1/154> <=====
0: 7e f9 jle fffffffb <_EIP+0xfffffffb> c01cc348 <_text_lock_gdth+dc/154> <=====
Code; c01cc34f <_text_lock_gdth+e3/154>
2: e9 4f d0 ff ff jmp ffffd056 <_EIP+0xffffd056> c01c93a3 <gdth_next+53/b5c>
Code; c01cc354 <_text_lock_gdth+e8/154>
7: 80 3f 00 cmpb $0x0,(%edi)
Code; c01cc357 <_text_lock_gdth+eb/154>
a: f3 90 repz nop
Code; c01cc359 <_text_lock_gdth+ed/154>
c: 7e f9 jle 7 <_EIP+0x7> c01cc354 <_text_lock_gdth+e8/154>
Code; c01cc35b <_text_lock_gdth+ef/154>
e: e9 53 da ff ff jmp ffffda66 <_EIP+0xffffda66> c01c9db3 <gdth_next+a63/b5c>
Code; c01cc360 <_text_lock_gdth+f4/154>
13: 80 00 00 addb $0x0,(%eax)


1 warning issued. Results may not be reliable.


Software: 2.4.18-pre4, ext2 filesystems, no highmem
Hardware: Dual PIII 1GHz, 1GB Ram,
ICP Vortex GDT6523RS ( 6 Disks with hardware raid 1 )
GDTH Firmware is latest 2.28.06-R049

I am now booting with 2.4.19-pre1 and will try to crash it again
tonight.

Flo
--
Florian Lohoff [email protected] +49-5201-669912
Nine nineth on september the 9th Welcome to the new billenium


Attachments:
(No filename) (3.59 kB)
(No filename) (232.00 B)
Download all attachments

2002-02-26 19:06:51

by Andrew Morton

[permalink] [raw]
Subject: Re: [CRASH] gdth / __block_prepare_write: zeroing uptodate buffer! / NMI Watchdog detected LOCKUP

Florian Lohoff wrote:
>
> Hi,
> i have been looking for deadlocks we are experiencing on a couple of
> SMP machines (Dual Celeron and Dual PIII). After a night stressing a
> spare machine with dbench/bonnie++/tcpspray the machine locked up 30
> minutes after i killed the test. The last messages on the console were:
>
> __block_prepare_write: zeroing uptodate buffer!

Yup. This happens when the disk fills up. Andrea and I were
discussing it over the weekend. There's a new patch in the -aa
kernels which doesn't quite fix it :(

We'll fix it in 2.4.19-pre somehow. It's possible that this problem
causes a chnuk of zeroes to be written into the file when you hit
ENOSPC, which is rather rude. But your file was truncated anyway.

> 15 times - Machine was answering ping first but stopped after a couple
> of minutes. Another couple of minutes later the nmi_watchdog stepped in
> and produced an oops:

SCSI error recovery deadlocked.

Now it's *just* conceivable that the __block_prepare_write() problem
caused a junk request to be sent down to the driver, which caused
the driver to enter recovery, which it then screwed up. But I
doubt it.

It's also conceivable that the NMI watchdog code itself caused
problems also. Back in the days when it was permanently enabled,
some machines kept going silly until nmi watchdog was enabled.

Bottom line: I don't know why you got a SCSI error, and the lockup
is possibly a bug in the scsi layer.

-

2002-02-26 19:17:11

by Florian Lohoff

[permalink] [raw]
Subject: Re: [CRASH] gdth / __block_prepare_write: zeroing uptodate buffer! / NMI Watchdog detected LOCKUP

On Tue, Feb 26, 2002 at 11:04:55AM -0800, Andrew Morton wrote:
> > __block_prepare_write: zeroing uptodate buffer!
>
> Yup. This happens when the disk fills up. Andrea and I were
> discussing it over the weekend. There's a new patch in the -aa
> kernels which doesn't quite fix it :(
>
> We'll fix it in 2.4.19-pre somehow. It's possible that this problem
> causes a chnuk of zeroes to be written into the file when you hit
> ENOSPC, which is rather rude. But your file was truncated anyway.

I dont think the machine had full filesystems at all.

> SCSI error recovery deadlocked.
>
> Now it's *just* conceivable that the __block_prepare_write() problem
> caused a junk request to be sent down to the driver, which caused
> the driver to enter recovery, which it then screwed up. But I
> doubt it.
>
> It's also conceivable that the NMI watchdog code itself caused
> problems also. Back in the days when it was permanently enabled,
> some machines kept going silly until nmi watchdog was enabled.

I have seen a lot of deadlocks on these machines for the last months - I
never got ANY output - Nothing in the syslog, no oops. I am running
now with the nmi_watchdog for 4 weeks and have seen 10-15 crashes on 2
machines with no output. This was the first deadlock with some output
at all which is very frustrating.

Flo
--
Florian Lohoff [email protected] +49-5201-669912
Nine nineth on september the 9th Welcome to the new billenium


Attachments:
(No filename) (1.46 kB)
(No filename) (232.00 B)
Download all attachments

2002-02-26 19:22:21

by Andrew Morton

[permalink] [raw]
Subject: Re: [CRASH] gdth / __block_prepare_write: zeroing uptodate buffer! / NMI Watchdog detected LOCKUP

Florian Lohoff wrote:
>
> On Tue, Feb 26, 2002 at 11:04:55AM -0800, Andrew Morton wrote:
> > > __block_prepare_write: zeroing uptodate buffer!
> >
> > Yup. This happens when the disk fills up. Andrea and I were
> > discussing it over the weekend. There's a new patch in the -aa
> > kernels which doesn't quite fix it :(
> >
> > We'll fix it in 2.4.19-pre somehow. It's possible that this problem
> > causes a chnuk of zeroes to be written into the file when you hit
> > ENOSPC, which is rather rude. But your file was truncated anyway.
>
> I dont think the machine had full filesystems at all.

I/O error when reading filesystem metadata would also cause it.

2002-02-26 19:29:11

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [CRASH] gdth / __block_prepare_write: zeroing uptodate buffer! / NMI Watchdog detected LOCKUP

On Tue, Feb 26, 2002 at 07:40:43PM +0100, Florian Lohoff wrote:
>
> Hi,
> i have been looking for deadlocks we are experiencing on a couple of
> SMP machines (Dual Celeron and Dual PIII). After a night stressing a
> spare machine with dbench/bonnie++/tcpspray the machine locked up 30
> minutes after i killed the test. The last messages on the console were:
>
> __block_prepare_write: zeroing uptodate buffer!
>
> 15 times - Machine was answering ping first but stopped after a couple
> of minutes. Another couple of minutes later the nmi_watchdog stepped in
> and produced an oops:


the lockup should not be related to the above messages (looks more like
a driver bug).

But for the message thanks to Andrew's great testing effort, I found
some fs data (not metadata, so not that severe) corruption bug that
could generate the above message, here it is:

--- 2.4.18rc4aa1/fs/buffer.c.~1~ Sat Feb 23 08:21:00 2002
+++ 2.4.18rc4aa1/fs/buffer.c Sun Feb 24 04:48:03 2002
@@ -1721,6 +1721,13 @@
if (buffer_new(bh)) {
unmap_underlying_metadata(bh);
if (Page_Uptodate(page)) {
+ /*
+ * Avoid new and uptodate to be set at the same time
+ * so we can retain the buffer_uptodate() bugcheck in
+ * the undo/fixup pass below (goto out path).
+ */
+ clear_bit(BH_New, &bh->b_state);
+
set_bit(BH_Uptodate, &bh->b_state);
continue;
}
@@ -1757,8 +1764,16 @@
* Zero out any newly allocated blocks to avoid exposing stale
* data. If BH_New is set, we know that the block was newly
* allocated in the above loop.
+ *
+ * Details:
+ * 1) hole in uptodate page, get_block(create) allocate the block, so the buffer is
+ * new and additionally we also mark it uptodate (so remeber to clear BH_New
+ * above to avoid triggering the "zeroing" printk below).
+ * 2) we must stop the "undo/clear" fixup pass not at the caller "to" but at the last
+ * block that we successfully arrived in the main loop.
*/
bh = head;
+ to = block_start; /* stop at the last successfully handled block */
block_start = 0;
do {
block_end = block_start+blocksize;



Andrew said it doesn't make the message go away, and infact I now
noticed that after the above fix the buffer_uptodate will generate false
positives, think an uptodate but unmapped buffer, uptodate because it
was an hole.

So in short I did now this new untested patch that should remove the
false positive too. The bugcheck was helpful to find the other bugs, but
now that the other bugs are fixed, I think we cannot keep it because
we must skip over those valid uptodate and newly mapped buffers and I
documented why in the patch.

diff -urN 2.4.18/fs/buffer.c 2.4.18aa1/fs/buffer.c
--- 2.4.18/fs/buffer.c Tue Feb 26 17:51:28 2002
+++ 2.4.18aa1/fs/buffer.c Tue Feb 26 20:23:33 2002
@@ -1680,8 +1680,21 @@
* Zero out any newly allocated blocks to avoid exposing stale
* data. If BH_New is set, we know that the block was newly
* allocated in the above loop.
+ *
+ * Details the buffer can be new and uptodate because:
+ * 1) hole in uptodate page, get_block(create) allocate the block, so the buffer is
+ * new and additionally we also mark it uptodate
+ * 2) The buffer is not mapped and uptodate due a previous partial read.
+ *
+ * We can always ignore uptodate buffers here, if you mark a buffer uptodate
+ * you must make sure it contains the right data first.
*/
bh = head;
+ /*
+ * We must stop the "undo/clear" fixup pass not at the caller "to" but at the last
+ * block that we successfully arrived in the main loop.
+ */
+ to = block_start; /* stop at the last successfully handled block */
block_start = 0;
do {
block_end = block_start+blocksize;
@@ -1689,9 +1702,7 @@
goto next_bh;
if (block_start >= to)
break;
- if (buffer_new(bh)) {
- if (buffer_uptodate(bh))
- printk(KERN_ERR "%s: zeroing uptodate buffer!\n", __FUNCTION__);
+ if (buffer_new(bh) && !buffer_uptodate(bh)) {
memset(kaddr+block_start, 0, bh->b_size);
set_bit(BH_Uptodate, &bh->b_state);
mark_buffer_dirty(bh);


comments Andrew?

Andrea

2002-02-26 19:33:21

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [CRASH] gdth / __block_prepare_write: zeroing uptodate buffer! / NMI Watchdog detected LOCKUP

On Tue, Feb 26, 2002 at 11:04:55AM -0800, Andrew Morton wrote:
> Florian Lohoff wrote:
> >
> > Hi,
> > i have been looking for deadlocks we are experiencing on a couple of
> > SMP machines (Dual Celeron and Dual PIII). After a night stressing a
> > spare machine with dbench/bonnie++/tcpspray the machine locked up 30
> > minutes after i killed the test. The last messages on the console were:
> >
> > __block_prepare_write: zeroing uptodate buffer!
>
> Yup. This happens when the disk fills up. Andrea and I were
> discussing it over the weekend. There's a new patch in the -aa
> kernels which doesn't quite fix it :(

I think it basically does, see the other email for why I think so.

> Bottom line: I don't know why you got a SCSI error, and the lockup
> is possibly a bug in the scsi layer.

agreed (just like said in the other email :).

Andrea

2002-02-26 19:34:52

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [CRASH] gdth / __block_prepare_write: zeroing uptodate buffer! / NMI Watchdog detected LOCKUP

On Tue, Feb 26, 2002 at 11:20:06AM -0800, Andrew Morton wrote:
> Florian Lohoff wrote:
> >
> > On Tue, Feb 26, 2002 at 11:04:55AM -0800, Andrew Morton wrote:
> > > > __block_prepare_write: zeroing uptodate buffer!
> > >
> > > Yup. This happens when the disk fills up. Andrea and I were
> > > discussing it over the weekend. There's a new patch in the -aa
> > > kernels which doesn't quite fix it :(
> > >
> > > We'll fix it in 2.4.19-pre somehow. It's possible that this problem
> > > causes a chnuk of zeroes to be written into the file when you hit
> > > ENOSPC, which is rather rude. But your file was truncated anyway.
> >
> > I dont think the machine had full filesystems at all.
>
> I/O error when reading filesystem metadata would also cause it.

Indeed. Looks like gdth gone wild and that triggered the cleanup slowpath.

Andrea

2002-02-27 03:27:58

by T. A.

[permalink] [raw]
Subject: Re: [CRASH] gdth / __block_prepare_write: zeroing uptodate buffer! / NMI Watchdog detected LOCKUP

What motherboard are you using? I recently installed a ICP RAID card
into a VP6 with dual processors and had similar problems. I've also had
problems with the gdth driver under linux in that drives were disappearing
now and then destroying the integrity of the RAID drive, though in a
different setup.


----- Original Message -----
From: "Florian Lohoff" <[email protected]>
To: "Andrew Morton" <[email protected]>
Cc: <[email protected]>
Sent: Tuesday, February 26, 2002 2:16 PM
Subject: Re: [CRASH] gdth / __block_prepare_write: zeroing uptodate buffer!
/ NMI Watchdog detected LOCKUP


2002-02-27 13:02:02

by Florian Lohoff

[permalink] [raw]
Subject: Re: [CRASH] gdth / __block_prepare_write: zeroing uptodate buffer! / NMI Watchdog detected LOCKUP

On Tue, Feb 26, 2002 at 10:27:50PM -0500, T. A. wrote:
> What motherboard are you using? I recently installed a ICP RAID card
> into a VP6 with dual processors and had similar problems. I've also had
> problems with the gdth driver under linux in that drives were disappearing
> now and then destroying the integrity of the RAID drive, though in a
> different setup.

Serverworks Chipsset - We dont have any other problems beside the
real deadlocks.

Flo
--
Florian Lohoff [email protected] +49-5201-669912
Nine nineth on september the 9th Welcome to the new billenium


Attachments:
(No filename) (616.00 B)
(No filename) (232.00 B)
Download all attachments