2000-12-07 23:13:24

by Joseph Cheek

[permalink] [raw]
Subject: kernel BUG at buffer.c:827 in test12-pre6 and 7

copying files off a loopback-mounted vfat filesystem exposes this bug.
test11 worked fine.

loop.o built as module. this hard crashes the machine, every time
[PIII-450]. i don't know how to debug this, is there a FAQ?

[transcribed by hand]:

# mount -o loop /tmp/cdboot.288 /mnt/cd
# cd /mnt/cd
# cp menu.lst /tmp
kernel BUG at buffer.c:827!
invalid operand: 0000
CPU: 0
EIP: 0010:[<c013660c>]
EFLAGS: 00010082
eax: 0000001c ebx: c1d8fc60 ecx: 00000000 edx: 00000001
esi: c10658e4 edi: 00000002 ebp: c1d8fca8 esp: c1793dc0
ds: 0018 es: 0018 ss: 0018
Process cp (pid 762, stackpage=c1793000)
Stack: c01fe484 c01fe95a 0000033b c1d8fc60 c1cef420 00000001 00000001
c01610e1
c1d8fc60 00000001 c1cef420 00000000 c1cef420 c02c8ed8 c88df91c
c1cef420
00000001 c88e0986 00000007 00000000 00000001 c02c8ed8 c02c8ee8
c4f18800
Call Trace: [<c01fe484>] [<c01fe95a>] [<c0130703>] [<c8895de3>]
[<c88df91c>] [<c8894494>] [<c0160d2f>] [<c0160ead>]
[<c0161011>] [<c0137a49>] [<c0130703>] [<c8895de3>] [<c8894494>]
[<c01284d3>] [<c012887b>] [c0128720>]
[<c889448d>] [<c01349a7>] [c010b56b>]
Code: 0f 0b 83 c4 0c 8d 5e 28 8d 46 2c 39 46 2c 74 24 b9 01 00 00

as soon as i reboot i will look what's at buffer.c:827



--
thanks!

joe

--
Joseph Cheek, Sr Linux Consultant, Linuxcare | http://www.linuxcare.com/
Linuxcare. Support for the Revolution. | [email protected]
CTO / Acting PM, Redmond Linux Project | [email protected]
425 990-1072 vox [1074 fax] 206 679-6838 pcs | [email protected]




2000-12-07 23:45:15

by Keith Owens

[permalink] [raw]
Subject: Re: kernel BUG at buffer.c:827 in test12-pre6 and 7

On Thu, 07 Dec 2000 14:42:38 -0800,
Joseph Cheek <[email protected]> wrote:
>loop.o built as module. this hard crashes the machine, every time
>[PIII-450]. i don't know how to debug this, is there a FAQ?

linux/Documentation/oops-tracing.txt

2000-12-08 00:53:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: kernel BUG at buffer.c:827 in test12-pre6 and 7

In article <[email protected]>,
Joseph Cheek <[email protected]> wrote:
>copying files off a loopback-mounted vfat filesystem exposes this bug.
>test11 worked fine.

It's not a new bug - it's an old bug that apparently is uncovered by a
new stricter test.

Apparently loopback unlocks an already unlocked page - which has always
been a serious offense, but has never been detected before.

test12-pre6+ detects it, and thus the BUG().

Your stack trace isn't symbolic (see Documentation/oops-tracing.txt), so
it's impossible to debug, but it's already interesting information to
see that it seems to be either loopback of vfat.

Can you test some more? In particular, I'd love to hear if this happens
with vfat even without loopback, or with loopback even without vfat
(make an ext2 filesystem or similar instead). That woul dnarrow down the
bug further.

Thanks,
Linus

2000-12-08 01:55:15

by Joseph Cheek

[permalink] [raw]
Subject: Re: kernel BUG at buffer.c:827 in test12-pre6 and 7

i'll check it out. i'm compiling ksymoops now, is there a way to get it to
work without a static libbfd? all i've got is a libbfd.so, and i'm going to
need to recompile binutils if i must have a libbfd.a.

Linus Torvalds wrote:

> Your stack trace isn't symbolic (see Documentation/oops-tracing.txt), so
> it's impossible to debug, but it's already interesting information to
> see that it seems to be either loopback of vfat.
>
> Can you test some more? In particular, I'd love to hear if this happens
> with vfat even without loopback, or with loopback even without vfat
> (make an ext2 filesystem or similar instead). That woul dnarrow down the
> bug further.

thanks!

joe

--
Joseph Cheek, Sr Linux Consultant, Linuxcare | http://www.linuxcare.com/
Linuxcare. Support for the Revolution. | [email protected]
CTO / Acting PM, Redmond Linux Project | [email protected]
425 990-1072 vox [1074 fax] 206 679-6838 pcs | [email protected]



2000-12-08 02:47:25

by Joseph Cheek

[permalink] [raw]
Subject: Re: kernel BUG at buffer.c:827 in test12-pre6 and 7

hi,

comments below.

Linus Torvalds wrote:

> In article <[email protected]>,
> Joseph Cheek <[email protected]> wrote:
> >copying files off a loopback-mounted vfat filesystem exposes this bug.
> >test11 worked fine.
>
> It's not a new bug - it's an old bug that apparently is uncovered by a
> new stricter test.
>
> Apparently loopback unlocks an already unlocked page - which has always
> been a serious offense, but has never been detected before.
>
> test12-pre6+ detects it, and thus the BUG().
>
> Your stack trace isn't symbolic (see Documentation/oops-tracing.txt), so
> it's impossible to debug, but it's already interesting information to
> see that it seems to be either loopback of vfat.
>
> Can you test some more? In particular, I'd love to hear if this happens
> with vfat even without loopback, or with loopback even without vfat
> (make an ext2 filesystem or similar instead). That woul dnarrow down the
> bug further.

this happens on loopbacked ext2, and not on regular vfat. so it appears that
the culprit is loopback. i got ksymoops working, here are the traces from
the vfat-over-loopback [first] and the ext2-over-loopback [second].

again, these are copied by hand, so i give it a 1% chance of
mistranscription.

ksymoops 2.3.5 on i686 2.4.0. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.0/ (default)
-m /usr/src/linux/System.map (default)

Warning: You did not tell me where to find symbol information. I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc. ksymoops -h explains the options.

kernel BUG at buffer.c:827!
invalid operand: 0000
CPU: 0
EIP: 0010:[<c013660c>]
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010082
eax: 0000001c ebx: c1d8fc60 ecx: 00000000 edx: 00000001
esi: c10658e4 edi: 00000002 ebp: c1d8fca8 esp: c1793dc0
ds: 0018 es: 0018 ss: 0018
Process cp (pid 762, stackpage=c1793000)
Stack: c01fe484 c01fe95a 0000033b c1d8fc60 c1cef420 00000001 00000001
c01610e1
c1d8fc60 00000001 c1cef420 00000000 c1cef420 c02c8ed8 c88df91c
c1cef420
00000001 c88e0986 00000007 00000000 00000001 c02c8ed8 c02c8ee8
c4f18800
Call Trace: [<c01fe484>] [<c01fe95a>] [<c0130703>] [<c8895de3>]
[<c88df91c>] [<c8894494>] [<c0160d2f>] [<c0160ead>]
[<c0161011>] [<c0137a49>] [<c0130703>] [<c8895de3>] [<c8894494>]
[<c01284d3>] [<c012887b>] [c0128720>]
[<c889448d>] [<c01349a7>] [c010b56b>]
Code: 0f 0b 83 c4 0c 8d 5e 28 8d 46 2c 39 46 2c 74 24 b9 01 00 00

>>EIP; c013660c <end_buffer_io_async+e0/11c> <=====
Trace; c01fe484 <tvecs+4fc0/13968>
Trace; c01fe95a <tvecs+5496/13968>
Trace; c0130703 <__alloc_pages+df/2d4>
Trace; c8895de3 <[fat]fat_readpage+f/14>
Trace; c88df91c <[cdrom]cdrom_read_mech_status+c/4c>
Trace; c8894494 <[fat]fat_get_block+0/e4>
Trace; c0160d2f <__make_request+5cb/648>
Trace; c0160ead <generic_make_request+101/110>
Trace; c0161011 <ll_rw_block+155/1c4>
Trace; c0137a49 <block_read_full_page+1fd/2a8>
Trace; c0130703 <__alloc_pages+df/2d4>
Trace; c8895de3 <[fat]fat_readpage+f/14>
Trace; c8894494 <[fat]fat_get_block+0/e4>
Trace; c01284d3 <do_generic_file_read+337/584>
Trace; c012887b <generic_file_read+5b/74>
Trace; c889448d <[fat]fat_file_read+2d/34>
Trace; c01349a7 <sys_read+8f/c4>
Code; c013660c <end_buffer_io_async+e0/11c>
0000000000000000 <_EIP>:
Code; c013660c <end_buffer_io_async+e0/11c> <=====
0: 0f 0b ud2a <=====
Code; c013660e <end_buffer_io_async+e2/11c>
2: 83 c4 0c add $0xc,%esp
Code; c0136611 <end_buffer_io_async+e5/11c>
5: 8d 5e 28 lea 0x28(%esi),%ebx
Code; c0136614 <end_buffer_io_async+e8/11c>
8: 8d 46 2c lea 0x2c(%esi),%eax
Code; c0136617 <end_buffer_io_async+eb/11c>
b: 39 46 2c cmp %eax,0x2c(%esi)
Code; c013661a <end_buffer_io_async+ee/11c>
e: 74 24 je 34 <_EIP+0x34> c0136640
<end_buffer_io_async+114/11c>
Code; c013661c <end_buffer_io_async+f0/11c>
10: b9 01 00 00 00 mov $0x1,%ecx


1 warning issued. Results may not be reliable.

ksymoops 2.3.5 on i686 2.4.0. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.0/ (default)
-m /usr/src/linux/System.map (default)

Warning: You did not tell me where to find symbol information. I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc. ksymoops -h explains the options.

kernel BUG at buffer.c:827!
invalid operand: 0000
CPU: 0
EIP: 0010:[<c013660c>]
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010086
eax: 0000001c ebx: c1d212a0 ecx: 00000000 edx: 00000001
esi: c11274bc edi: 00000002 ebp: c1d212e8 esp: c4f1bddc
ds: 0018 es: 0018 ss: 0018
Process cp (pid 772, stackpage=c4f1b000)
Stack: c01fe484 c01fe95a 0000033b c1d21290 c1983420 00000002 00000001
c01610e1
c1d212a0 00000001 c1983420 00000000 c1983420 c02c8ed8 c88df91c
c1983420
00000001 c88e0986 00000007 00000000 00000002 c02c8ed8 c02c8ee8
c4cdf998
Call Trace: [<c01fe484>] [<c01fe95a>] [<c01610e1>] [<c88df91c>] [<c88e0986>]
[<c0160d2f>] [<c0160ead>]
[<c0161011>] [<c0137a49>] [<c0130703>] [<c0156223>] [<c0155b38>]
[<c012887b>] [c0128720>]
[<c01349a7>] [c010b56b>]
Code: 0f 0b 83 c4 0c 8d 5e 28 8d 46 2c 39 46 2c 74 24 b9 01 00 00

>>EIP; c013660c <end_buffer_io_async+e0/11c> <=====
Trace; c01fe484 <tvecs+4fc0/13968>
Trace; c01fe95a <tvecs+5496/13968>
Trace; c01610e1 <end_that_request_first+61/b8>
Trace; c88df91c <[cdrom]cdrom_read_mech_status+c/4c>
Trace; c88e0986 <[cdrom]cdrom_read_block+3a/f8>
Trace; c0160d2f <__make_request+5cb/648>
Trace; c0160ead <generic_make_request+101/110>
Trace; c0161011 <ll_rw_block+155/1c4>
Trace; c0137a49 <block_read_full_page+1fd/2a8>
Trace; c0130703 <__alloc_pages+df/2d4>
Trace; c0156223 <ext2_readpage+f/14>
Trace; c0155b38 <ext2_get_block+0/4e0>
Trace; c012887b <generic_file_read+5b/74>
Trace; c01349a7 <sys_read+8f/c4>
Code; c013660c <end_buffer_io_async+e0/11c>
0000000000000000 <_EIP>:
Code; c013660c <end_buffer_io_async+e0/11c> <=====
0: 0f 0b ud2a <=====
Code; c013660e <end_buffer_io_async+e2/11c>
2: 83 c4 0c add $0xc,%esp
Code; c0136611 <end_buffer_io_async+e5/11c>
5: 8d 5e 28 lea 0x28(%esi),%ebx
Code; c0136614 <end_buffer_io_async+e8/11c>
8: 8d 46 2c lea 0x2c(%esi),%eax
Code; c0136617 <end_buffer_io_async+eb/11c>
b: 39 46 2c cmp %eax,0x2c(%esi)
Code; c013661a <end_buffer_io_async+ee/11c>
e: 74 24 je 34 <_EIP+0x34> c0136640
<end_buffer_io_async+114/11c>
Code; c013661c <end_buffer_io_async+f0/11c>
10: b9 01 00 00 00 mov $0x1,%ecx


1 warning issued. Results may not be reliable.

--
thanks!

joe

--
Joseph Cheek, Sr Linux Consultant, Linuxcare | http://www.linuxcare.com/
Linuxcare. Support for the Revolution. | [email protected]
CTO / Acting PM, Redmond Linux Project | [email protected]
425 990-1072 vox [1074 fax] 206 679-6838 pcs | [email protected]



2000-12-08 03:34:21

by Keith Owens

[permalink] [raw]
Subject: Re: kernel BUG at buffer.c:827 in test12-pre6 and 7

On Thu, 07 Dec 2000 17:23:51 -0800,
Joseph Cheek <[email protected]> wrote:
>i'll check it out. i'm compiling ksymoops now, is there a way to get it to
>work without a static libbfd? all i've got is a libbfd.so, and i'm going to
>need to recompile binutils if i must have a libbfd.a.

ksymoops/Makefile, change
$(CC) $(OBJECTS) $(CFLAGS) -Wl,-Bstatic -lbfd -liberty -Wl,-Bdynamic -o $@
to
$(CC) $(OBJECTS) $(CFLAGS) -lbfd -liberty -o $@

But you should have libbfd.a. Debian splits binutils into binutils and
binutils-dev, libbfd.a might be in the latter.

2000-12-08 07:58:30

by Tom Leete

[permalink] [raw]
Subject: Re: kernel BUG at buffer.c:827 in test12-pre6 and 7

Linus Torvalds wrote:
>
> It's not a new bug - it's an old bug that apparently is uncovered by a
> new stricter test.
>
> Apparently loopback unlocks an already unlocked page - which has always
> been a serious offense, but has never been detected before.
>
> test12-pre6+ detects it, and thus the BUG().
>
> Your stack trace isn't symbolic (see Documentation/oops-tracing.txt), so
> it's impossible to debug, but it's already interesting information to
> see that it seems to be either loopback of vfat.
>
> Can you test some more? In particular, I'd love to hear if this happens
> with vfat even without loopback, or with loopback even without vfat
> (make an ext2 filesystem or similar instead). That woul dnarrow down the
> bug further.
>
> Thanks,
> Linus

Hi,

Here is a rather different datapoint. I hope it's different enough to help
nail this.

test12-pre5 + kdb + Serial Console. Sorry, I didn't get the contents of
pointer args.

It's probably worth saying that this kernel was compiled with gcc 2.95.2. I
have the blessed egcs also, will compile pre7 with that and see what
happens.

Cheers,
Tom

from nfs mounted ext2 (2.4.0-test12-pre5):
kernel BUG at buffer.c:827!

Entering kdb (current=0xc2014000, pid 466) Panic: invalid operand
due to panic @ 0xc0130c73
eax = 0x0000001c ebx = 0xc109ebf8 ecx = 0x00000000 edx = 0x00000006
esi = 0xc2739af0 edi = 0xc2739b38 esp = 0xc2015dd4 eip = 0xc0130c73
ebp = 0xc2015df4 xss = 0x00000018 xcs = 0xc2010010 eflags = 0x00010046
xds = 0x00000018 xes = 0x00000018 origeax = 0xffffffff &regs = 0xc2015da0
kdb> bt
EBP EIP Function(args)
0xc2015df4 0xc0130c73 end_buffer_io_async+0xd3 (0xc2739af0, 0x1)
kernel .text 0xc0100000 0xc0130ba0 0xc0130cb0
0xc2015e10 0xc0164756 end_that_request_first+0x66 (0xc11c2c20, 0x1,
0xc031cf04)
kernel .text 0xc0100000 0xc01646f0 0xc01647b0
0xc2015e30 0xc018a128 ide_end_request+0x28 (0x1, 0xc11c5060)
kernel .text 0xc0100000 0xc018a100 0xc018a180
0xc2015e64 0xc018e614 read_intr+0x104 (0xc031ce20)
kernel .text 0xc0100000 0xc018e510 0xc018e650
0xc2015e88 0xc018bad6 ide_intr+0x106 (0xe, 0xc11c5060, 0xc2015ed4, 0x1c0)
kernel .text 0xc0100000 0xc018b9d0 0xc018bb30
0xc2015ea8 0xc010ab50 handle_IRQ_event+0x30 (0xe, 0xc2015ed4, 0xc11de2e0)
kernel .text 0xc0100000 0xc010ab20 0xc010ab80
0xc2015ecc 0xc010ace2 do_IRQ+0x72 (0xc02f4520, 0xc02a92ac, 0xc201c000,
0xc201c000, 0xfffffc18)
kernel .text 0xc0100000 0xc010ac70 0xc010ad30
0xc01093f0 ret_from_intr
kernel .text 0xc0100000 0xc01093f0 0xc0109410
Interrupt registers:
eax = 0x00000019 ebx = 0xc02f4520 ecx = 0xc02a92ac edx = 0xc201c000
esi = 0xc201c000 edi = 0xfffffc18 esp = 0xc2015f08 eip = 0xc0115816
ebp = 0xc2015f4c xss = 0x00000018 xcs = 0xc0000010 eflags = 0x00000287
xds = 0xc2070018 xes = 0xc2070018 origeax = 0xffffff0e &regs = 0xc2015ed4
0xc0115816 schedule+0x1b6
kernel .text 0xc0100000 0xc0115660 0xc0115b00
0xc2015f70 0xc01155c7 schedule_timeout+0x17 (0xc2014000, 0x1785222)
kernel .text 0xc0100000 0xc01155b0 0xc0115650
0xc2015fac 0xc4055753 [sunrpc]svc_recv+0x1a3 (0xc24b2470, 0xc207be00,
0x7fffffff)
sunrpc .text 0xc404e060 0xc40555b0 0xc4055940
0xc2015fec 0xc40713f3 [nfsd]nfsd+0x253
nfsd .text 0xc4071060 0xc40711a0 0xc40714a0
0xc0107843 kernel_thread+0x23
kernel .text 0xc0100000 0xc0107820 0xc0107860

2000-12-08 09:38:21

by David Woodhouse

[permalink] [raw]
Subject: Re: kernel BUG at buffer.c:827 in test12-pre6 and 7


[email protected] said:
> Can you test some more? In particular, I'd love to hear if this
> happens with vfat even without loopback, or with loopback even without
> vfat (make an ext2 filesystem or similar instead). That woul dnarrow
> down the bug further.

Loopback-mounted iso9660 does it too. This was #3 on my list of test12-pre5
oospen to investigate this weekend :)

--
dwmw2


2000-12-08 10:30:28

by Alexander Viro

[permalink] [raw]
Subject: [found?] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7

I'm doing some massive grepping (basically, audit of page locking),
but nothing relevant so far. There was some catch (aside of documenting
the thing and finding several completely unrelated buglets):
* ramfs_writepage() doesn't UnlockPage(). Deadlock.
* udf_adinicb_writepage() does extra UnlockPage().
I don't see the fsckup in loop.c, though. On the read path it uses
do_generic_read_file() and on the write it's essentially the simplified
variant of generic_file_write(). Hell knows... It looks like we are
getting dirty buffer inheriting end_buffer_io_async from the previous
life.

Oh, damn it. Indeed. Look:

generic_file_write() or lo_send():
lock_page()
->prepare_write() sets sync ->b_end_io
->commit_write() puts them on the dirty list
UnlockPage() releases the page lock
... requests are sent to driver

page_launder():
TryLockPage() succeeds
block_write_full_page()
... goes through the bh'es and sets ->b_end_io to end_buffer_io_async()
at that point the last remaining request completes. It calls
->b_end_io(), aka. end_buffer_io_async(). And does UnlockPage().

In the meanwhile, we call ll_rw_block(). Requests are sent again.
When _they_ complete we get the second UnlockPage()

Now, I might miss some obvious reason why it could never happen. Moreover,
the real problem may be completely different - the race above is not too wide.

However, I'ld really like to hear why the scenario above is impossible. BTW,
the race isn't even that narrow - if ->prepare_write() didn't cover the
whole page we've got a get_block() to call and there's a plenty of time when
shit can hit the fan - it's a blocking operation, after all.

Comments?
Cheers,
Al

2000-12-08 18:42:43

by Alexander Viro

[permalink] [raw]
Subject: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7

Folks, see if the following patch helps. AFAICS it closes a pretty real
race - we could call block_write_full_page() for a page that has sync
IO in progress and blindly change ->b_end_io callbacks on the bh with
pending requests. With a little bit of bad luck they would complete before
we got to ll_rw_block(), thus leading to extra UnlockPage(). All it takes
is a pageout on a page that got partial write recently - if some fragments
were still unmapped we get to call get_block() on them and it can easily
block, providing a decent window for that race.

Fix: postpone changing ->b_end_io until the call of ll_rw_block(); if by
the time of ll_rw_block() some fragments will still have IO in progress -
wait on them.

Comments?
Cheers,
Al

--- buffer.c Fri Dec 8 16:19:53 2000
+++ buffer.c.new Fri Dec 8 16:26:44 2000
@@ -1577,6 +1577,26 @@
* "Dirty" is valid only with the last case (mapped+uptodate).
*/

+static void write_array_async(struct page *page, struct buffer_head **p, int n)
+{
+ int i;
+ if (!n) {
+ UnlockPage(page);
+ return;
+ }
+ /*
+ * If there are pending requests on these guys - wait before changing
+ * ->b_end_io.
+ */
+ for (i=0; i<n; i++) {
+ wait_on_buffer(p[i]);
+ set_bit(BH_Uptodate, &p[i]->b_state);
+ set_bit(BH_Dirty, &p[i]->b_state);
+ p[i]->b_end_io = end_buffer_io_async;
+ }
+ ll_rw_block(WRITE, n, p);
+}
+
/*
* block_write_full_page() is SMP-safe - currently it's still
* being called with the kernel lock held, but the code is ready.
@@ -1616,28 +1636,17 @@
if (buffer_new(bh))
unmap_underlying_metadata(bh);
}
- set_bit(BH_Uptodate, &bh->b_state);
- set_bit(BH_Dirty, &bh->b_state);
- bh->b_end_io = end_buffer_io_async;
atomic_inc(&bh->b_count);
arr[nr++] = bh;
bh = bh->b_this_page;
block++;
} while (bh != head);

- if (nr) {
- ll_rw_block(WRITE, nr, arr);
- } else {
- UnlockPage(page);
- }
+ write_array_async(page, arr, nr);
SetPageUptodate(page);
return 0;
out:
- if (nr) {
- ll_rw_block(WRITE, nr, arr);
- } else {
- UnlockPage(page);
- }
+ write_array_async(page, arr, nr);
ClearPageUptodate(page);
return err;
}

2000-12-08 19:19:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7



On Fri, 8 Dec 2000, Alexander Viro wrote:
>
> Fix: postpone changing ->b_end_io until the call of ll_rw_block(); if by
> the time of ll_rw_block() some fragments will still have IO in progress -
> wait on them.
>
> Comments?

Yes.

On the other hand, I have this suspicion that there is an even simpler
solution: stop using the end_buffer_io_sync version for writes
altogether.

It's really only __block_prepare_write() that can mark the buffers for
sync writes, and even that case is fairly bogus: it really only wants to
mark the non-upto-date buffers that it wants to _read_ for sync IO, it
just "knows" that it is ok to change every buffer it sees. It isn't.

Moving the

bh->b_end_io = end_buffer_io_sync;

into the read-path should be sufficient (hell, we should move the
"ll_rw_block(READ, 1, &bh)" away, and make it one single read with

ll_rw_block(READ, wait_bh-wait, wait);

at the end of the loop.

If we do it that way, there is no way the two can clash, because a
non-up-to-date buffer head won't ever be touched by the write path, so we
can't get this kind of confusion.

Al? I'd really prefer this alternative instead. Do you see any problems
with it?

(New rule that makes 100% sense: NOBODY sets "bh->b_end_io" on any buffer
that he isn't actually going to do IO on. And if there is pending IO on
the buffer, it had better only be of the same type as the one you're going
to do).

Linus

2000-12-08 19:43:57

by Alexander Viro

[permalink] [raw]
Subject: Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7



On Fri, 8 Dec 2000, Linus Torvalds wrote:

> It's really only __block_prepare_write() that can mark the buffers for
> sync writes, and even that case is fairly bogus: it really only wants to
> mark the non-upto-date buffers that it wants to _read_ for sync IO, it
> just "knows" that it is ok to change every buffer it sees. It isn't.
>
> Moving the
>
> bh->b_end_io = end_buffer_io_sync;
>
> into the read-path should be sufficient (hell, we should move the
> "ll_rw_block(READ, 1, &bh)" away, and make it one single read with
>
> ll_rw_block(READ, wait_bh-wait, wait);
>
> at the end of the loop.

Erm... So you want to make ->commit_write() page-unlocking? Fine with me,
but that will make for somewhat bigger patch. Hey, _you_ are in position
to change the locking rules, freeze or not, so if it's OK with you...

> If we do it that way, there is no way the two can clash, because a
> non-up-to-date buffer head won't ever be touched by the write path, so we
> can't get this kind of confusion.
>
> Al? I'd really prefer this alternative instead. Do you see any problems
> with it?

If I understood you right - no problems. I can do such patch, just give
me a couple of hours to recheck the locking rules. BTW, what do you
think about the following:
* add_to_page_cache() is not exported and never used. Kill?
* __lock_page() is called only in lock_page(). Make the latter
inlined wrapper or kill the former?
* rw_swap_page_base() would be simpler if it didn't leave unlocking the
page to caller - right now all callers do
if (!rw_swap_page_base(...))
UnlockPage(page);
* ramfs_writepage() needs UnlockPage()
* udf_adinicb_writepage() doesn't
* brw_page() needs liposuction. _Badly_. Just read it.
* Documentations/filesystems/Locking needs an update.
I can toss any subset of the above into patch - just tell...

> (New rule that makes 100% sense: NOBODY sets "bh->b_end_io" on any buffer
> that he isn't actually going to do IO on. And if there is pending IO on
> the buffer, it had better only be of the same type as the one you're going
> to do).

Hrm. Why not move setting ->b_end_io to ll_rw_block() while we are at it?
Or into ll_rw_block() wrapper...
Cheers,
Al

2000-12-08 20:11:06

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7



On Fri, 8 Dec 2000, Alexander Viro wrote:
>
> Erm... So you want to make ->commit_write() page-unlocking? Fine with me,
> but that will make for somewhat bigger patch. Hey, _you_ are in position
> to change the locking rules, freeze or not, so if it's OK with you...

No.

Read the code a bit more.

commit_write() doesn't start any IO at all. It only marks the buffer
dirty.

The dirty flush-out works the way it has always worked.

(You're right in that the dirty flusher should make sure to change
b_end_io when they do their write-outs - that code used to just depend on
the buffer always having b_end_io magically set)

> Hrm. Why not move setting ->b_end_io to ll_rw_block() while we are at it?
> Or into ll_rw_block() wrapper...

That's too big a change right now, but yes, in theory. That would make
sure that nobody could ever forget to set the "completion" handler.

In the meantime, let's just enforce it for ll_rw_block: make sure that
there is a 1:1 mapping between setting b_end_io and doing a ll_rw_block
(and let's make sure that there are no more of these non-local rules any
more).

Btw, I also think that the dirty buffer flushing should get the page lock.
Right now it touches the buffer list without holding the lock on the page
that the buffer is on, which means that there is really nothign that
prevents it from racing with the page-based writeout. I don't like that:
it does hold the LRU list lock, so the list state itself is ok, but it
does actually touch part of the buffer state that is not really protected
by the lock.

I think it ends up being ok because ll_rw_block will ignore buffers that
get output twice, and the rest of the state is handled with atomic
operations (b_count and b_flags), but it's not really proper. If
flush_dirty_buffers() got the page lock, everything would be protected
properly. Thoughts?

Linus

2000-12-08 21:31:04

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7

On Fri, 08 Dec 2000, Linus Torvalds wrote:
> Btw, I also think that the dirty buffer flushing should get the page lock.
> Right now it touches the buffer list without holding the lock on the page
> that the buffer is on, which means that there is really nothign that
> prevents it from racing with the page-based writeout. I don't like that:
> it does hold the LRU list lock, so the list state itself is ok, but it
> does actually touch part of the buffer state that is not really protected
> by the lock.
>
> I think it ends up being ok because ll_rw_block will ignore buffers that
> get output twice, and the rest of the state is handled with atomic
> operations (b_count and b_flags), but it's not really proper. If
> flush_dirty_buffers() got the page lock, everything would be protected
> properly. Thoughts?

This is great when you have buffersize==pagesize. When there are
multiple buffers per page it means that some of the buffers might have
to wait for flushing just because bdflush started IO on some other
buffer on the same page. Oh well. The common case improves in terms
being proveably correct and the uncommon case gets worse a tiny bit.
It sounds like a win.

We are moving steadily away from buffer oriented I/O anyway, and I
think we can optimize the case of buffersize!=pagesize in 2.5 by being a
little more careful about how getblk hands out buffers. Getblk
could force all buffers on the same page to be on the same
major/minor, or even better, to be physically close together. Or
more simply, flush_dirty_buffers could check for other dirty buffers on
the same page and initiate I/O at the same time. Or both strategies.

This is by way of buttressing an argument in favor of simplicity
and reliabilty, i.e., being religious about taking the page lock, even
when there's a slight cost in the short term.

--
Daniel

2000-12-08 21:49:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7



On Fri, 8 Dec 2000, Daniel Phillips wrote:
>
> [ flush-buffers taking the page lock ]
>
> This is great when you have buffersize==pagesize. When there are
> multiple buffers per page it means that some of the buffers might have
> to wait for flushing just because bdflush started IO on some other
> buffer on the same page. Oh well. The common case improves in terms
> being proveably correct and the uncommon case gets worse a tiny bit.
> It sounds like a win.

Also, I think that we should strive for a setup where most of the dirty
buffer flushing is done through "page_launder()" instead of using
sync_buffers all that much at all.

I'm convinced that the page LRU list is as least as good as, if not better
than, the dirty buffer timestamp stuff. And as we need to have the page
LRU for other reasons anyway, I'd like the long-range plan to be to get
rid of the buffer LRU completely. It wastes memory and increases
complexity for very little gain, I think.

Linus

Subject: Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7

Linus Torvalds wrote:
>
> On Fri, 8 Dec 2000, Daniel Phillips wrote:
> >
> > [ flush-buffers taking the page lock ]
> >
> > This is great when you have buffersize==pagesize. When there are
> > multiple buffers per page it means that some of the buffers might have
> > to wait for flushing just because bdflush started IO on some other
> > buffer on the same page. Oh well. The common case improves in terms
> > being proveably correct and the uncommon case gets worse a tiny bit.
> > It sounds like a win.
>
> Also, I think that we should strive for a setup where most of the dirty
> buffer flushing is done through "page_launder()" instead of using
> sync_buffers all that much at all.
>
> I'm convinced that the page LRU list is as least as good as, if not better
> than, the dirty buffer timestamp stuff. And as we need to have the page
> LRU for other reasons anyway, I'd like the long-range plan to be to get
> rid of the buffer LRU completely. It wastes memory and increases
> complexity for very little gain, I think.
>

I think flushing pages instead of buffers is a good direction to take.
Two things:

1. currently bdflush is setup to use page_launder only
under memory pressure (if (free_shortage() ... )
Do you think that it should call page_launder regardless?

2. There are two operations here:
a. starting a write-back, periodically.
b. freeing a page, which may involve taking the page
out of a inode mapping, etc. IOW, what page_launder does.
bdflush primarily does (a). If we want to move to page-oriented
flushing, we atleast need extra information in the _page_ structure
as to whether it is time to flush the page back.


--------------------------------------------------------------------------
Rajagopal Ananthanarayanan ("ananth")
Member Technical Staff, SGI.
--------------------------------------------------------------------------

2000-12-08 23:01:10

by Alexander Viro

[permalink] [raw]
Subject: Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7



On Fri, 8 Dec 2000, Linus Torvalds wrote:

>
>
> On Fri, 8 Dec 2000, Alexander Viro wrote:
> >
> > Erm... So you want to make ->commit_write() page-unlocking? Fine with me,
> > but that will make for somewhat bigger patch. Hey, _you_ are in position
> > to change the locking rules, freeze or not, so if it's OK with you...
>
> No.
>
> Read the code a bit more.
>
> commit_write() doesn't start any IO at all. It only marks the buffer
> dirty.

I'm quite aware of that fact ;-) However, you said

On the other hand, I have this suspicion that there is an even simpler
solution: stop using the end_buffer_io_sync version for writes
altogether.

If that happens (i.e. if write requests resulting from prepare_write()/
commit_write()/bdflush sequence become async) we must stop unlocking pages
after commit_write(). Essentially it would become unlocker of the same
kind as readpage() and writepage() - callers must assume that page submitted
to commit_write() will eventually be unlocked.

> The dirty flush-out works the way it has always worked.
>
> (You're right in that the dirty flusher should make sure to change
> b_end_io when they do their write-outs - that code used to just depend on
> the buffer always having b_end_io magically set)

Hmm... IDGI. Do you want the request resulting from commit_write() ("resulting"
!= "issued") to stay sync (in that case we still need to change
block_write_full_page() since the analysis stands - it dirties blocks
unconditionally) or you want them to go async (in that case we need to change
commit_write() callers)? Could you clarify?

> Btw, I also think that the dirty buffer flushing should get the page lock.
> Right now it touches the buffer list without holding the lock on the page
> that the buffer is on, which means that there is really nothign that
> prevents it from racing with the page-based writeout. I don't like that:
> it does hold the LRU list lock, so the list state itself is ok, but it
> does actually touch part of the buffer state that is not really protected
> by the lock.
>
> I think it ends up being ok because ll_rw_block will ignore buffers that
> get output twice, and the rest of the state is handled with atomic
> operations (b_count and b_flags), but it's not really proper. If
> flush_dirty_buffers() got the page lock, everything would be protected
> properly. Thoughts?

Umm... I don't think that we need to do it on per-page basis. We need some
exclusion, all right, but I'ld rather provide it from block_write_full_page()
and its ilk. Hell knows...
Cheers,
Al

2000-12-08 23:14:00

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7



On Fri, 8 Dec 2000, Alexander Viro wrote:
>
> I'm quite aware of that fact ;-) However, you said
>
> On the other hand, I have this suspicion that there is an even simpler
> solution: stop using the end_buffer_io_sync version for writes
> altogether.
>
> If that happens (i.e. if write requests resulting from prepare_write()/
> commit_write()/bdflush sequence become async) we must stop unlocking pages
> after commit_write(). Essentially it would become unlocker of the same
> kind as readpage() and writepage() - callers must assume that page submitted
> to commit_write() will eventually be unlocked.

You're right, we can't do that for anonymous buffers right now. Mea culpa.

Looking more at this issue, I suspect that the easiest pretty solution
that everybody can probably agree is reasonable is to either pass down the
end-of-io callback to ll_rw_block as you suggested, or, preferably by just
forcing the _caller_ to do the buffer locking, and just do the b_end_io
stuff inside the buffer lock and get rid of all the races that way
instead (and make ll_rw_block() verify that the buffers it is passed are
always locked).

Linus

2000-12-09 05:29:54

by Alexander Viro

[permalink] [raw]
Subject: Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7



On Fri, 8 Dec 2000, Linus Torvalds wrote:

> Looking more at this issue, I suspect that the easiest pretty solution
> that everybody can probably agree is reasonable is to either pass down the
> end-of-io callback to ll_rw_block as you suggested, or, preferably by just
> forcing the _caller_ to do the buffer locking, and just do the b_end_io
> stuff inside the buffer lock and get rid of all the races that way
> instead (and make ll_rw_block() verify that the buffers it is passed are
> always locked).

Hmm... I've looked through the ll_rw_block() callers and yes, it seems
to be doable. BTW, we probably want a helper function that would do
lock+ll_rw_block+wait - it will simplify the life in filesystems.
I'll do a patch tonight.
Cheers,
Al
PS: alpha-testers needed for sysvfs patches - right now the thing is racey
as hell and unlike minixfs I can't test it myself. It's more or less
straightforward port of minixfs patch that went into the tree sometime
ago, so it shouldn't be too bad, but... IOW, if somebody has boxen to
test the thing on - yell.

2000-12-09 09:17:14

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7


On Fri, 8 Dec 2000, Alexander Viro wrote:
> On Fri, 8 Dec 2000, Linus Torvalds wrote:
>
> > Looking more at this issue, I suspect that the easiest pretty solution
> > that everybody can probably agree is reasonable is to either pass down the
> > end-of-io callback to ll_rw_block as you suggested, or, preferably by just
> > forcing the _caller_ to do the buffer locking, and just do the b_end_io
> > stuff inside the buffer lock and get rid of all the races that way
> > instead (and make ll_rw_block() verify that the buffers it is passed are
> > always locked).
>
> Hmm... I've looked through the ll_rw_block() callers and yes, it seems
> to be doable.

Looking at this, there's a _lot_ of them.

I've taken a test-approach that is fairly simple:

- get rid of the old "ll_rw_block()", because it was inherently racey wrt
bh->b_end_io, I think we all agree that changing bh->b_end_io without
holding any locks at all is fairly dangerous.

- instead, a simple "submit_bh()" thing, that takes only one locked
buffer head at a time, and queues it for IO. The usage would basically
be

lock_buffer(bh);
bh->b_end_io = completion_callback;
submit_bh(READ|WRITE, bh);

which is a pretty clean and simple interface that has no obvious
races - submit_bh() will set bh->b_end_io to the completion callback.

- BUT BUT BUT: Because of tons of old users of ll_rw_block(), we
introduce a totally new ll_rw_block() that has the same old interface,
but basically does

void ll_rw_block(int op, int nr, struct buffer_head **array)
{
int i;

for (i = 0; i < nr; i++) {
struct buffer_head * bh = array[i];
lock_buffer(bh);
bh->b_end_io = end_buffer_io_sync;
submit_bh(op, bh);
}
}

Again, the above avoids the race (we never touch b_end_io except with
the buffer lock held), and allows all old uses of "ll_rw_block()"
_except_ the ones that wanted to do the fancy async callbacks.

The advantage? All the regular old code that isn't fancy (ie the low-level
filesystems, bread(), breada() etc) will get a working ll_rw_block() with
the semantics they want, and we can pretty much prove that they can never
get an async handler even by mistake.

And the (few) clever routines in fs/buffer.c that currently use
ll_rw_block() with async handlers can just be converted to use the
submit_bh() interface.

This is a preliminary patch that I have not compiled and probably breaks,
but you get the idea. I'm going to sleep, to survive another night with
three small kids.

If somebody wants to run with this, please try it out, but realize that
it's a work-in-progress. And realize that bugs in this area tend to
corrupt filesystems very quickly indeed. I would strongly advice against
trying this patch out without really grokking what it does and feeling
confident that it actually works.

NOTE NOTE NOTE! I've tried to keep the patch small and simple. I've tried
to make all the changes truly straightforward and obvious. I want this bug
squashed, and I don't want to see it again. But I _still_ think this is a
dangerous patch until somebody like Al has given it a green light. Caveat
Emptor.

Linus

-----
diff -u --recursive t12p7/linux/drivers/block/ll_rw_blk.c linux/drivers/block/ll_rw_blk.c
--- t12p7/linux/drivers/block/ll_rw_blk.c Thu Dec 7 15:56:25 2000
+++ linux/drivers/block/ll_rw_blk.c Sat Dec 9 00:40:35 2000
@@ -885,6 +885,36 @@
while (q->make_request_fn(q, rw, bh));
}

+
+/*
+ * Submit a buffer head for IO.
+ */
+void submit_bh(int rw, struct buffer_head * bh)
+{
+ if (!test_bit(BH_Lock, &bh->b_state))
+ BUG();
+
+ set_bit(BH_Req, &bh->b_state);
+
+ /*
+ * First step, 'identity mapping' - RAID or LVM might
+ * further remap this.
+ */
+ bh->b_rdev = bh->b_dev;
+ bh->b_rsector = bh->b_blocknr * (bh->b_size>>9);
+
+ generic_make_request(rw, bh);
+}
+
+/*
+ * Default IO end handler, used by "ll_rw_block()".
+ */
+static void end_buffer_io_sync(struct buffer_head *bh, int uptodate)
+{
+ mark_buffer_uptodate(bh, uptodate);
+ unlock_buffer(bh);
+}
+
/* This function can be used to request a number of buffers from a block
device. Currently the only restriction is that all buffers must belong to
the same device */
@@ -931,7 +961,8 @@
if (test_and_set_bit(BH_Lock, &bh->b_state))
continue;

- set_bit(BH_Req, &bh->b_state);
+ /* We have the buffer lock */
+ bh->b_end_io = end_buffer_io_sync;

switch(rw) {
case WRITE:
@@ -954,17 +985,9 @@
end_io:
bh->b_end_io(bh, test_bit(BH_Uptodate, &bh->b_state));
continue;
-
}

- /*
- * First step, 'identity mapping' - RAID or LVM might
- * further remap this.
- */
- bh->b_rdev = bh->b_dev;
- bh->b_rsector = bh->b_blocknr * (bh->b_size>>9);
-
- generic_make_request(rw, bh);
+ submit_bh(rw, bh);
}
return;

@@ -972,7 +995,6 @@
for (i = 0; i < nr; i++)
buffer_IO_error(bhs[i]);
}
-

#ifdef CONFIG_STRAM_SWAP
extern int stram_device_init (void);
diff -u --recursive t12p7/linux/fs/buffer.c linux/fs/buffer.c
--- t12p7/linux/fs/buffer.c Thu Dec 7 15:56:26 2000
+++ linux/fs/buffer.c Sat Dec 9 00:38:20 2000
@@ -758,12 +758,6 @@
bh->b_private = private;
}

-static void end_buffer_io_sync(struct buffer_head *bh, int uptodate)
-{
- mark_buffer_uptodate(bh, uptodate);
- unlock_buffer(bh);
-}
-
static void end_buffer_io_bad(struct buffer_head *bh, int uptodate)
{
mark_buffer_uptodate(bh, uptodate);
@@ -1001,7 +995,7 @@
* and it is clean.
*/
if (bh) {
- init_buffer(bh, end_buffer_io_sync, NULL);
+ init_buffer(bh, end_buffer_io_bad, NULL);
bh->b_dev = dev;
bh->b_blocknr = block;
bh->b_state = 1 << BH_Mapped;
@@ -1210,7 +1204,6 @@

if (buffer_uptodate(bh))
return(bh);
- else ll_rw_block(READ, 1, &bh);

blocks = (filesize - pos) >> (9+index);

@@ -1225,12 +1218,11 @@
brelse(bh);
break;
}
- else bhlist[j++] = bh;
+ bhlist[j++] = bh;
}

/* Request the read for these buffers, and then release them. */
- if (j>1)
- ll_rw_block(READA, (j-1), bhlist+1);
+ ll_rw_block(READ, j, bhlist);
for(i=1; i<j; i++)
brelse(bhlist[i]);

@@ -1439,7 +1431,7 @@
block = *(b++);

tail = bh;
- init_buffer(bh, end_buffer_io_async, NULL);
+ init_buffer(bh, end_buffer_io_bad, NULL);
bh->b_dev = dev;
bh->b_blocknr = block;

@@ -1586,8 +1578,6 @@
int err, i;
unsigned long block;
struct buffer_head *bh, *head;
- struct buffer_head *arr[MAX_BUF_PER_PAGE];
- int nr = 0;

if (!PageLocked(page))
BUG();
@@ -1600,6 +1590,8 @@

bh = head;
i = 0;
+
+ /* Stage 1: make sure we have all the buffers mapped! */
do {
/*
* If the buffer isn't up-to-date, we can't be sure
@@ -1616,28 +1608,32 @@
if (buffer_new(bh))
unmap_underlying_metadata(bh);
}
- set_bit(BH_Uptodate, &bh->b_state);
- set_bit(BH_Dirty, &bh->b_state);
+ bh = bh->b_this_page;
+ block++;
+ } while (bh != head);
+
+ /* Stage 2: lock the buffers, mark them dirty */
+ do {
+ lock_buffer(bh);
bh->b_end_io = end_buffer_io_async;
atomic_inc(&bh->b_count);
- arr[nr++] = bh;
+ set_bit(BH_Uptodate, &bh->b_state);
+ set_bit(BH_Dirty, &bh->b_state);
bh = bh->b_this_page;
- block++;
} while (bh != head);

- if (nr) {
- ll_rw_block(WRITE, nr, arr);
- } else {
- UnlockPage(page);
- }
+ /* Stage 3: submit the IO */
+ do {
+ submit_bh(WRITE, bh);
+ bh = bh->b_this_page;
+ } while (bh != head);
+
+ /* Done - end_buffer_io_async will unlock */
SetPageUptodate(page);
return 0;
+
out:
- if (nr) {
- ll_rw_block(WRITE, nr, arr);
- } else {
- UnlockPage(page);
- }
+ UnlockPage(page);
ClearPageUptodate(page);
return err;
}
@@ -1669,7 +1665,6 @@
continue;
if (block_start >= to)
break;
- bh->b_end_io = end_buffer_io_sync;
if (!buffer_mapped(bh)) {
err = get_block(inode, block, bh, 1);
if (err)
@@ -1766,7 +1761,6 @@
unsigned long iblock, lblock;
struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
unsigned int blocksize, blocks;
- char *kaddr = NULL;
int nr, i;

if (!PageLocked(page))
@@ -1793,35 +1787,40 @@
continue;
}
if (!buffer_mapped(bh)) {
- if (!kaddr)
- kaddr = kmap(page);
- memset(kaddr + i*blocksize, 0, blocksize);
+ memset(kmap(page) + i*blocksize, 0, blocksize);
flush_dcache_page(page);
+ kunmap(page);
set_bit(BH_Uptodate, &bh->b_state);
continue;
}
}

- init_buffer(bh, end_buffer_io_async, NULL);
- atomic_inc(&bh->b_count);
arr[nr] = bh;
nr++;
} while (i++, iblock++, (bh = bh->b_this_page) != head);

- if (nr) {
- if (Page_Uptodate(page))
- BUG();
- ll_rw_block(READ, nr, arr);
- } else {
+ if (!nr) {
/*
* all buffers are uptodate - we can set the page
* uptodate as well.
*/
SetPageUptodate(page);
UnlockPage(page);
+ return 0;
}
- if (kaddr)
- kunmap(page);
+
+ /* Stage two: lock the buffers */
+ for (i = 0; i < nr; i++) {
+ struct buffer_head * bh = arr[i];
+ lock_buffer(bh);
+ bh->b_end_io = end_buffer_io_async;
+ atomic_inc(&bh->b_count);
+ }
+
+ /* Stage 3: start the IO */
+ for (i = 0; i < nr; i++)
+ submit_bh(READ, arr[i]);
+
return 0;
}

@@ -1989,7 +1988,6 @@
if (Page_Uptodate(page))
set_bit(BH_Uptodate, &bh->b_state);

- bh->b_end_io = end_buffer_io_sync;
if (!buffer_uptodate(bh)) {
err = -EIO;
ll_rw_block(READ, 1, &bh);
@@ -2263,67 +2261,31 @@
*/
int brw_page(int rw, struct page *page, kdev_t dev, int b[], int size)
{
- struct buffer_head *head, *bh, *arr[MAX_BUF_PER_PAGE];
- int nr, fresh /* temporary debugging flag */, block;
+ struct buffer_head *head, *bh;

if (!PageLocked(page))
panic("brw_page: page not locked for I/O");
-// ClearPageError(page);
- /*
- * We pretty much rely on the page lock for this, because
- * create_page_buffers() might sleep.
- */
- fresh = 0;
- if (!page->buffers) {
- create_page_buffers(rw, page, dev, b, size);
- fresh = 1;
- }
+
if (!page->buffers)
+ create_page_buffers(rw, page, dev, b, size);
+
+ head = bh = page->buffers;
+ if (!head)
BUG();

- head = page->buffers;
- bh = head;
- nr = 0;
+ /* Stage 1: lock all the buffers */
do {
- block = *(b++);
+ lock_buffer(bh);
+ bh->b_end_io = end_buffer_io_async;
+ atomic_inc(&bh->b_count);
+ bh = bh->b_this_page;
+ } while (bh != head);

- if (fresh && (atomic_read(&bh->b_count) != 0))
- BUG();
- if (rw == READ) {
- if (!fresh)
- BUG();
- if (!buffer_uptodate(bh)) {
- arr[nr++] = bh;
- atomic_inc(&bh->b_count);
- }
- } else { /* WRITE */
- if (!bh->b_blocknr) {
- if (!block)
- BUG();
- bh->b_blocknr = block;
- } else {
- if (!block)
- BUG();
- }
- set_bit(BH_Uptodate, &bh->b_state);
- set_bit(BH_Dirty, &bh->b_state);
- arr[nr++] = bh;
- atomic_inc(&bh->b_count);
- }
+ /* Stage 2: start the IO */
+ do {
+ submit_bh(rw, bh);
bh = bh->b_this_page;
} while (bh != head);
- if ((rw == READ) && nr) {
- if (Page_Uptodate(page))
- BUG();
- ll_rw_block(rw, nr, arr);
- } else {
- if (!nr && rw == READ) {
- SetPageUptodate(page);
- UnlockPage(page);
- }
- if (nr && (rw == WRITE))
- ll_rw_block(rw, nr, arr);
- }
return 0;
}

diff -u --recursive t12p7/linux/include/linux/fs.h linux/include/linux/fs.h
--- t12p7/linux/include/linux/fs.h Thu Dec 7 15:56:26 2000
+++ linux/include/linux/fs.h Sat Dec 9 00:27:41 2000
@@ -1193,6 +1193,7 @@
extern struct buffer_head * get_hash_table(kdev_t, int, int);
extern struct buffer_head * getblk(kdev_t, int, int);
extern void ll_rw_block(int, int, struct buffer_head * bh[]);
+extern void submit_bh(int, struct buffer_head *);
extern int is_read_only(kdev_t);
extern void __brelse(struct buffer_head *);
static inline void brelse(struct buffer_head *buf)

2000-12-09 09:28:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7



On Sat, 9 Dec 2000, Linus Torvalds wrote:
>
> This is a preliminary patch that I have not compiled and probably breaks,
> but you get the idea. I'm going to sleep, to survive another night with
> three small kids.

Call me stupid [ Chorus: "You're stupid, Linus" ], but I actually compiled
and booted this remotely. And it came up. Which is a pretty good sign that
it doesn't have anything really broken in it.

Still, it would be good to have a few other sharp eyes looking it over.
Look sclean and obviously correct to me, but then _everything_ I write
always looks obviously correct yo me.

Linus

2000-12-09 11:13:01

by Alexander Viro

[permalink] [raw]
Subject: Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7



On Sat, 9 Dec 2000, Linus Torvalds wrote:

> This is a preliminary patch that I have not compiled and probably breaks,
> but you get the idea. I'm going to sleep, to survive another night with
> three small kids.
>
> If somebody wants to run with this, please try it out, but realize that
> it's a work-in-progress. And realize that bugs in this area tend to
> corrupt filesystems very quickly indeed. I would strongly advice against
> trying this patch out without really grokking what it does and feeling
> confident that it actually works.
>
> NOTE NOTE NOTE! I've tried to keep the patch small and simple. I've tried
> to make all the changes truly straightforward and obvious. I want this bug
> squashed, and I don't want to see it again. But I _still_ think this is a
> dangerous patch until somebody like Al has given it a green light. Caveat
> Emptor.

> @@ -1001,7 +995,7 @@
> * and it is clean.
> */
> if (bh) {
> - init_buffer(bh, end_buffer_io_sync, NULL);
> + init_buffer(bh, end_buffer_io_bad, NULL);

How about NULL?

> @@ -1210,7 +1204,6 @@
[breada()]
Umm... why do we keep it, in the first place? AFAICS the only
in-tree user is hpfs_map_sector() and it doesn't look like we really
need it there. OTOH, trimming the buffer.c down is definitely nice.
Mikulas?

> @@ -1439,7 +1431,7 @@
[create_page_buffers()]
Linus, compare the result with create_empty_buffers(). Then look
at the only caller and notice that it will merrily loop over these
buffer_heads. IOW, I propose to mark them mapped and set the ->b_blocknr
in brw_page(), replace create_page_buffers() call with create_empty_buffers()
and let create_page_buffers() go to hell. Oh, and let create_empty_buffers()
take device as an argument, not inode as it does now.

> @@ -1600,6 +1590,8 @@
[__block_write_full_page()]
>
> bh = head;
> i = 0;
> +
> + /* Stage 1: make sure we have all the buffers mapped! */
> do {
> /*
> * If the buffer isn't up-to-date, we can't be sure

Ahem. Think what will happen if we are sitting over the hole in file,
map some buffers and fail in the middle of the fun. Notice that we
do not submit any IO requests in that case, i.e. we just had created
a random crap in file.

More recovery needed here. I would just break out of the mapping loop,
then proceed as above, but limited the activity to mapped blocks only.
And did if (err) ClearPageUptodate(page) in the end.

> @@ -1669,7 +1665,6 @@
[__block_prepare_write()]

Same problem with recovery - see the patch I've sent recently (handling
get_block() failures).

> @@ -1793,35 +1787,40 @@
[block_read_full_page()]
> - init_buffer(bh, end_buffer_io_async, NULL);
Fine
> - atomic_inc(&bh->b_count);

Why? It's cleaner the old way - why bother postponing that until we
lock the thing?

> + /* Stage two: lock the buffers */
> + for (i = 0; i < nr; i++) {
> + struct buffer_head * bh = arr[i];
> + lock_buffer(bh);
> + bh->b_end_io = end_buffer_io_async;
> + atomic_inc(&bh->b_count);

See above.

> @@ -2263,67 +2261,31 @@
> */
> int brw_page(int rw, struct page *page, kdev_t dev, int b[], int size)

> if (!page->buffers)
> + create_page_buffers(rw, page, dev, b, size);

create_empty_buffers(page, dev, size);

> +
> + head = bh = page->buffers;
> + if (!head)
> BUG();
>
> - head = page->buffers;
> - bh = head;
> - nr = 0;
> + /* Stage 1: lock all the buffers */
> do {
> - block = *(b++);
> + lock_buffer(bh);
> + bh->b_end_io = end_buffer_io_async;

bh->b_blocknr = *(b++);
set_bit(BH_Mapped, &bh->b_state);

Modulo the comments above - fine with me. However, there is stuff in
drivers/md that I don't understand. Ingo, could you comment on the use of
->b_end_io there?

Another bad thing is in mm/filemap.c::writeout_one_page() - it doesn't
even check for buffers being mapped, let alone attempt to map them.
Fortunately, ext2 doesn't use it these days, but the rest of block
filesystems... <doubletake> WTF? fsync() merrily ignores mmap()'ed stuff?
I mean, we can mmap() an area, dirty the bloody thing, call fsync() and
get zero traffic. Hmm... OTOH, it might be correct behaviour...
Anyway, fsync() on block filesystems other than ext2 needs fixing.
Badly. I'll port Stephen's patch to them.

IMO patch is mostly safe (the worst part is error recovery on
block_write_full_page()), except the writeout_one_page() part and possibly
the RAID stuff.
Linus, Ingo, Mikulas - comments?
Cheers,
Al

2000-12-09 13:27:52

by Andries Brouwer

[permalink] [raw]
Subject: Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7

On Sat, Dec 09, 2000 at 05:40:47AM -0500, Alexander Viro wrote:

> > @@ -1210,7 +1204,6 @@
> [breada()]
> Umm... why do we keep it, in the first place? AFAICS the only
> in-tree user is hpfs_map_sector() and it doesn't look like we really
> need it there. OTOH, trimming the buffer.c down is definitely nice.
> Mikulas?

Throw it out. The number of users has diminished over time.
Recently isofs stopped using breada.
The hpfs use was broken, I fixed it a bit some time ago, but
there is nothing against throwing it out altogether, I think.

Andries

2000-12-09 13:42:07

by Alexander Viro

[permalink] [raw]
Subject: Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7



On Sat, 9 Dec 2000, Andries Brouwer wrote:

> On Sat, Dec 09, 2000 at 05:40:47AM -0500, Alexander Viro wrote:
>
> > > @@ -1210,7 +1204,6 @@
> > [breada()]
> > Umm... why do we keep it, in the first place? AFAICS the only
> > in-tree user is hpfs_map_sector() and it doesn't look like we really
> > need it there. OTOH, trimming the buffer.c down is definitely nice.
> > Mikulas?
>
> Throw it out. The number of users has diminished over time.
> Recently isofs stopped using breada.
> The hpfs use was broken, I fixed it a bit some time ago, but
> there is nothing against throwing it out altogether, I think.

I've looked at the use of hpfs_map_sector() (and hpfs_map_4sectors() - sorry)
and it looks like we would be better off doing getblk() on affected sectors
and ll_rw_block() on the whole bunch - we end up calling breada() for
increasing block numbers with decreasing readahead window anyway.

So it probably should go - it gives no real win. Mikulas has the final
word here - he is the HPFS maintainer, so...

2000-12-09 14:47:55

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7

Date: Sat, 9 Dec 2000 00:45:51 -0800 (PST)
From: Linus Torvalds <[email protected]>

out:
- if (nr) {
- ll_rw_block(WRITE, nr, arr);
- } else {
- UnlockPage(page);
- }
+ UnlockPage(page);
ClearPageUptodate(page);
return err;
}
@@ -1669,7 +1665,6 @@

It would seem that you would want to unlock the page _after_ clearing
the uptodate bit to make sure people sleeping on the page do not see
it set by accident.

Later,
David S. Miller
[email protected]

2000-12-09 16:08:36

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7

On Fri, 8 Dec 2000, Alexander Viro wrote:

> BTW, what do you think about the following:
> * add_to_page_cache() is not exported and never used. Kill?

I have my eye on that for execute-in-place of romfs from real ROM chips -
making up struct pages for parts of the ROM chips and inserting them into
the page cache. I'd rather you didn't remove it :)

--
dwmw2


2000-12-09 17:59:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7



On Sat, 9 Dec 2000, Alexander Viro wrote:
> Fine
> > - atomic_inc(&bh->b_count);
>
> Why? It's cleaner the old way - why bother postponing that until we
> lock the thing?

I had a rule: we always do the "lock_buffer()" and "b_count increment"
together with setting "b_end_io = end_buffer_io_async". Why? Because that
way we pair up the actions, and I could easily verify that every single
user of "end_buffer_io_async" will increment the count (that is
decremented in "end_buffer_io_async").

We never used to have any rules in this area, and it was sometimes hard to
match up the actions with each other.

> > int brw_page(int rw, struct page *page, kdev_t dev, int b[], int size)
>
> > if (!page->buffers)
> > + create_page_buffers(rw, page, dev, b, size);
>
> create_empty_buffers(page, dev, size);

Agreed.

> Modulo the comments above - fine with me. However, there is stuff in
> drivers/md that I don't understand. Ingo, could you comment on the use of
> ->b_end_io there?

I already sent him mail about it for the same reason.

> Another bad thing is in mm/filemap.c::writeout_one_page() - it doesn't
> even check for buffers being mapped, let alone attempt to map them.
> Fortunately, ext2 doesn't use it these days, but the rest of block
> filesystems... <doubletake> WTF? fsync() merrily ignores mmap()'ed stuff?

fsync() has _always_ ignored mmap'ed stuff.

If you want to get your mappings synchronized, you absolutely positively
have to call "msync()".

Linus

2000-12-09 19:14:21

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7

Linus Torvalds <[email protected]> writes:


> > Modulo the comments above - fine with me. However, there is stuff in
> > drivers/md that I don't understand. Ingo, could you comment on the use of
> > ->b_end_io there?
>
> I already sent him mail about it for the same reason.

How about sending mail to all the journaling FS groups too? -- the change is surely
to break all the JFS which use usually b_end_io to submit the data after the journal
has been flushed.


-Andi

2000-12-09 21:57:02

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7

On Sat, 9 Dec 2000, Alexander Viro wrote:

> On Sat, 9 Dec 2000, Andries Brouwer wrote:
>
> > On Sat, Dec 09, 2000 at 05:40:47AM -0500, Alexander Viro wrote:
> >
> > > > @@ -1210,7 +1204,6 @@
> > > [breada()]
> > > Umm... why do we keep it, in the first place? AFAICS the only
> > > in-tree user is hpfs_map_sector() and it doesn't look like we really
> > > need it there. OTOH, trimming the buffer.c down is definitely nice.
> > > Mikulas?
> >
> > Throw it out. The number of users has diminished over time.
> > Recently isofs stopped using breada.
> > The hpfs use was broken, I fixed it a bit some time ago, but
> > there is nothing against throwing it out altogether, I think.
>
> I've looked at the use of hpfs_map_sector() (and hpfs_map_4sectors() - sorry)
> and it looks like we would be better off doing getblk() on affected sectors
> and ll_rw_block() on the whole bunch - we end up calling breada() for
> increasing block numbers with decreasing readahead window anyway.
>
> So it probably should go - it gives no real win. Mikulas has the final
> word here - he is the HPFS maintainer, so...

I did a test. I disabled readahead except for reading all 4 buffers in
map_4sectors.

I observed 14% slowdown on walking directories with find and 4% slowdown
on grepping one my working directory (10M, 281 files).

If you can't make it otherwise you can rip readahead out. If there is a
possibility to keep it, it would be better.

Mikulas


2000-12-10 01:43:42

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Re: kernel BUG at buffer.c:827 in test12-pre6 and 7



On Sat, 9 Dec 2000, Mikulas Patocka wrote:
>
> I did a test. I disabled readahead except for reading all 4 buffers in
> map_4sectors.
>
> I observed 14% slowdown on walking directories with find and 4% slowdown
> on grepping one my working directory (10M, 281 files).
>
> If you can't make it otherwise you can rip readahead out. If there is a
> possibility to keep it, it would be better.

The absolutely best thing would be to keep the directories in the page
cache. At which point the whole issue becomes pretty moot and we could use
the page-cache readahead code. Which gets the end right, and can handle
stuff that isn't physically contiguous on disk.

Linus

2000-12-12 01:25:38

by Joseph Cheek

[permalink] [raw]
Subject: Re: [FIXED!] kernel BUG at buffer.c:827 in test12-pre6 and 7

this works fine in pre8. thanks all!

Joseph Cheek wrote:

> copying files off a loopback-mounted vfat filesystem exposes this bug.
> test11 worked fine.
>
> loop.o built as module. this hard crashes the machine, every time
> [PIII-450]. i don't know how to debug this, is there a FAQ?
>
> [transcribed by hand]:
>
> # mount -o loop /tmp/cdboot.288 /mnt/cd
> # cd /mnt/cd
> # cp menu.lst /tmp
> kernel BUG at buffer.c:827!
> invalid operand: 0000
> CPU: 0
> EIP: 0010:[<c013660c>]
> EFLAGS: 00010082
> eax: 0000001c ebx: c1d8fc60 ecx: 00000000 edx: 00000001
> esi: c10658e4 edi: 00000002 ebp: c1d8fca8 esp: c1793dc0
> ds: 0018 es: 0018 ss: 0018
> Process cp (pid 762, stackpage=c1793000)
> Stack: c01fe484 c01fe95a 0000033b c1d8fc60 c1cef420 00000001 00000001
> c01610e1
> c1d8fc60 00000001 c1cef420 00000000 c1cef420 c02c8ed8 c88df91c
> c1cef420
> 00000001 c88e0986 00000007 00000000 00000001 c02c8ed8 c02c8ee8
> c4f18800
> Call Trace: [<c01fe484>] [<c01fe95a>] [<c0130703>] [<c8895de3>]
> [<c88df91c>] [<c8894494>] [<c0160d2f>] [<c0160ead>]
> [<c0161011>] [<c0137a49>] [<c0130703>] [<c8895de3>] [<c8894494>]
> [<c01284d3>] [<c012887b>] [c0128720>]
> [<c889448d>] [<c01349a7>] [c010b56b>]
> Code: 0f 0b 83 c4 0c 8d 5e 28 8d 46 2c 39 46 2c 74 24 b9 01 00 00
>
> as soon as i reboot i will look what's at buffer.c:827

thanks!

joe

--
Joseph Cheek, Sr Linux Consultant, Linuxcare | http://www.linuxcare.com/
Linuxcare. Support for the Revolution. | [email protected]
CTO / Acting PM, Redmond Linux Project | [email protected]
425 990-1072 vox [1074 fax] 206 679-6838 pcs | [email protected]