2003-08-04 14:22:51

by Daniel Jacobowitz

Subject: ext3 badness in 2.6.0-test2

I came back this morning and found:
EXT3-fs error (device md0) in start_transaction: Journal has aborted
EXT3-fs error (device md0) in start_transaction: Journal has aborted
EXT3-fs error (device md0) in start_transaction: Journal has aborted

Unfortunately, from the very first one, all writes failed; including all
writes to syslog. So I don't know what happened at the beginning. Is this
more likely to be something internal to ext3, or a problem with the RAID
layer?

The RAID was able to shut down cleanly and came back up with no errors, and
the ext3 filesystem was tagged as having (just a few) errors on next boot,
so I'm guessing an ext3 problem.

--
Daniel Jacobowitz
MontaVista Software Debian GNU/Linux Developer


2003-08-04 20:22:18

by Andrew Morton

Subject: Re: ext3 badness in 2.6.0-test2

Daniel Jacobowitz <[email protected]> wrote:
>
> I came back this morning and found:
> EXT3-fs error (device md0) in start_transaction: Journal has aborted
> EXT3-fs error (device md0) in start_transaction: Journal has aborted
> EXT3-fs error (device md0) in start_transaction: Journal has aborted
>
> Unfortunately, from the very first one, all writes failed; including all
> writes to syslog. So I don't know what happened at the beginning. Is this
> more likely to be something internal to ext3, or a problem with the RAID
> layer?

Could have been an IO error, or the block/MD/device layer returned
incorrect data. ext3 used to go BUG a lot in the latter case, but nowadays
we try to abort the journal and go read-only.

Without the initial message we do not know.

2003-08-06 06:37:57

by Randy Hron

Subject: Re: ext3 badness in 2.6.0-test2

>> EXT3-fs error (device md0) in start_transaction: Journal has aborted

> Without the initial message we do not know.

During a dbench 64 run with 2.6.0-test2-mm4 on ext3 /var/log/messages said:

kernel: attempt to access beyond end of device
kernel: hdc1: rw=0, want=1212696656, limit=4096449

fdisk /dev/hdc using sectors for units shows:
Disk /dev/hdc: 16 heads, 63 sectors, 39703 cylinders
Units = sectors of 1 * 512 bytes

   Device Boot     Start       End    Blocks   Id  System
/dev/hdc1             63   4096511   2048224+  83  Linux
/dev/hdc2        4096512  23634575   9769032   83  Linux
/dev/hdc3   *   23634576  40020623   8193024   83  Linux


The console displayed:

Buffer I/O error on device hdc1, logical block 298266
lost page write due to I/O error on hdc1
Buffer I/O error on device hdc1, logical block 298112
lost page write due to I/O error on hdc1
Buffer I/O error on device hdc1, logical block 296626
lost page write due to I/O error on hdc1
Buffer I/O error on device hdc1, logical block 294743
lost page write due to I/O error on hdc1
EXT3-fs error (device hdc1): ext3_free_blocks: Freeing blocks not in datazone - block = 151587081, count = 1
Aborting journal on device hdc1.
ext3_abort called.
EXT3-fs abort (device hdc1): ext3_journal_start: Detected aborted journal
Remounting filesystem read-only
ext3_reserve_inode_write: aborting transaction: Journal has aborted in __ext3_journal_get_write_access<2>EXT3-fs error (device hdc1) in ext3_reserve_inode_write: Journal has aborted
EXT3-fs error (device hdc1) in ext3_truncate: Journal has aborted
ext3_reserve_inode_write: aborting transaction: Journal has aborted in __ext3_journal_get_write_access<2>EXT3-fs error (device hdc1) in ext3_reserve_inode_write: Journal has aborted
EXT3-fs error (device hdc1) in ext3_orphan_del: Journal has aborted
ext3_reserve_inode_write: aborting transaction: Journal has aborted in __ext3_journal_get_write_access<2>EXT3-fs error (device hdc1) in ext3_reserve_inode_write: Journal has aborted
EXT3-fs error (device hdc1) in ext3_delete_inode: Journal has aborted
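
As a side note on the numbers in this report, here is a minimal sketch of the arithmetic. It assumes 512-byte sectors (as fdisk reports) and a 4 KiB ext3 block size; the block size is an assumption, since it is not stated above. The helper names are made up for illustration. Under those assumptions the "limit" is simply the length of hdc1 in sectors, and the "want" value is the sector just past block 151587081, the same block number that ext3_free_blocks complained about. That block number is 0x09090909, which looks more like a repeated data byte than a real block pointer.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration; this is not kernel code. */

/* Number of sectors in a partition, given fdisk's inclusive start/end. */
uint64_t partition_sectors(uint64_t start, uint64_t end)
{
	return end - start + 1;
}

/*
 * First sector past filesystem block `block`. This is the figure that
 * would be reported as "want" if a read of that block ran off the end
 * of the device.
 */
uint64_t end_sector_of_block(uint64_t block, uint64_t sectors_per_block)
{
	return (block + 1) * sectors_per_block;
}
```

With the values from the log, partition_sectors(63, 4096511) gives 4096449, and end_sector_of_block(151587081, 8) gives 1212696656.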

The console did not respond to <Enter>.
The machine was pingable, but would not give an ssh prompt.

Additional /var/log/messages:

Aug 5 20:29:24 mountain kernel: Buffer I/O error on device hdc1, logical block 298266
Aug 5 20:29:24 mountain kernel: lost page write due to I/O error on hdc1
Aug 5 20:29:24 mountain kernel: Buffer I/O error on device hdc1, logical block 298112
Aug 5 20:29:24 mountain kernel: lost page write due to I/O error on hdc1
Aug 5 20:29:24 mountain kernel: Buffer I/O error on device hdc1, logical block 296626
Aug 5 20:29:24 mountain kernel: lost page write due to I/O error on hdc1
Aug 5 20:29:24 mountain kernel: Buffer I/O error on device hdc1, logical block 294743
Aug 5 20:29:24 mountain kernel: lost page write due to I/O error on hdc1
Aug 5 20:29:24 mountain kernel: attempt to access beyond end of device
Aug 5 20:29:24 mountain kernel: hdc1: rw=0, want=1212696656, limit=4096449
Aug 5 20:29:24 mountain kernel: attempt to access beyond end of device
Aug 5 20:29:24 mountain kernel: hdc1: rw=0, want=1212696656, limit=4096449
Aug 5 20:29:24 mountain kernel: attempt to access beyond end of device
Aug 5 20:29:36 mountain kernel: hdc1: rw=0, want=1212696656, limit=4096449
..

Uniprocessor K6/2 with IDE disks.
It did not have a problem with dbench 32 on ext3.
dbench on ext2 ran fine too.
e2fsprogs-1.33. After e2fsck, filesystem seems okay.


--
Randy Hron
http://home.earthlink.net/~rwhron/kernel/bigbox.html

2003-08-06 06:36:28

by NeilBrown

Subject: Re: ext3 badness in 2.6.0-test2

On Monday August 4, [email protected] wrote:
> Daniel Jacobowitz <[email protected]> wrote:
> >
> > I came back this morning and found:
> > EXT3-fs error (device md0) in start_transaction: Journal has aborted
> > EXT3-fs error (device md0) in start_transaction: Journal has aborted
> > EXT3-fs error (device md0) in start_transaction: Journal has aborted
> >
> > Unfortunately, from the very first one, all writes failed; including all
> > writes to syslog. So I don't know what happened at the beginning. Is this
> > more likely to be something internal to ext3, or a problem with the RAID
> > layer?
>
> Could have been an IO error, or the block/MD/device layer returned
> incorrect data. ext3 used to go BUG a lot in the latter case, but nowadays
> we try to abort the journal and go read-only.
>
> Without the initial message we do not know.

Can I add a "me too".....

First, I'm using data=journal - is that supposed to work in 2.6 yet?


I have a raid5 array across a bunch of SCSI drives and a separate scsi
drive with boot, swap, and a journal partition.
I have an ext3 filesystem on the raid5 array with an external journal
on the journal partition.

The raid5 was rebuilding a spare and I was pounding the filesystem
over NFS using the SPEC SFS benchmark program (of course the raid5
rebuild killed the performance reported by SFS, but I expected that).

Shortly after the rebuild finished, I got an ext3 error (see log
below) and the journal aborted, and then nfsd Oopsed inside ext3.

I rebooted and fscked the filesystem and it found nothing interesting
- see output below.

So I suspect ext3 has a problem somewhere.
I'll see if I can break it again :-)

NeilBrown



Aug 6 15:22:05 adams kernel: EXT3-fs error (device md1): ext3_add_entry: bad entry in directory #41009295: rec_len is smaller than minimal - offset=0, inode=3265411686, rec_len=0, name_len=0
Aug 6 15:22:05 adams kernel: Aborting journal on device sda4.
Aug 6 15:22:05 adams kernel: ext3_abort called.
Aug 6 15:22:05 adams kernel: EXT3-fs abort (device md1): ext3_journal_start: Detected aborted journal
Aug 6 15:22:05 adams kernel: Remounting filesystem read-only
Aug 6 15:22:05 adams kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000000
Aug 6 15:22:05 adams kernel: printing eip:
Aug 6 15:22:05 adams kernel: c01b1e61
Aug 6 15:22:05 adams kernel: *pde = 00000000
Aug 6 15:22:05 adams kernel: Oops: 0000 [#1]
Aug 6 15:22:05 adams kernel: CPU: 1
Aug 6 15:22:05 adams kernel: EIP: 0060:[<c01b1e61>] Not tainted
Aug 6 15:22:05 adams kernel: EFLAGS: 00010286
Aug 6 15:22:05 adams kernel: do_journal_get_write_access: aborting transaction: Journal has aborted in __ext3_journal_get_write_access<2>EXT3-fs error (device md1) in ext3_prepare_write: Journal has aborted
Aug 6 15:22:05 adams kernel: EXT3-fs error (device md1) in start_transaction: Journal has aborted
Aug 6 15:22:05 adams kernel: EIP is at do_get_write_access+0x11/0x770
Aug 6 15:22:05 adams kernel: eax: e8888a64 ebx: f066f8a4 ecx: 00000004 edx: 00000dab
Aug 6 15:22:05 adams kernel: esi: f2eae000 edi: 00000000 ebp: c46208a4 esp: f19378d4
Aug 6 15:22:05 adams kernel: ds: 007b es: 007b ss: 0068
Aug 6 15:22:05 adams kernel: Process nfsd (pid: 732, threadinfo=f1936000 task=f1969000)
Aug 6 15:22:05 adams kernel: Stack: e1f155e4 e1f15d64 e1f15d24 e1f15624 e1f159a4 c95d71a4 e171a364 f066f8a4
Aug 6 15:22:05 adams kernel: 00000008 c371d780 f066f8a4 c01634e3 f066f8a4 0000001b 00000000 00001000
Aug 6 15:22:05 adams kernel: 00000000 0000001b 00000000 0000001b 00000000 f066f8a4 f2eae000 f066f8a4
Aug 6 15:22:05 adams kernel: Call Trace:
Aug 6 15:22:05 adams kernel: [<c01634e3>] __find_get_block+0x73/0x100
Aug 6 15:22:05 adams kernel: [<c01b290d>] journal_get_undo_access+0x3d/0x170
Aug 6 15:22:05 adams kernel: [<c01a2a34>] ext3_try_to_allocate+0xc4/0x240
Aug 6 15:22:05 adams kernel: [<c01a2db4>] ext3_new_block+0x204/0x740
Aug 6 15:22:05 adams kernel: [<c0163439>] bh_lru_install+0xb9/0xf0
Aug 6 15:22:05 adams kernel: [<c01a5a57>] ext3_alloc_block+0x37/0x40
Aug 6 15:22:05 adams kernel: [<c01a5dfa>] ext3_alloc_branch+0x4a/0x2c0
Aug 6 15:22:05 adams kernel: [<c0119eb5>] __change_page_attr+0x25/0x1e0
Aug 6 15:22:05 adams kernel: [<c01a63fc>] ext3_get_block_handle+0x18c/0x340
Aug 6 15:22:05 adams kernel: [<c0165d3c>] alloc_buffer_head+0x1c/0x50
Aug 6 15:22:05 adams kernel: [<c0165d61>] alloc_buffer_head+0x41/0x50
Aug 6 15:22:05 adams kernel: [<c0162e0a>] create_buffers+0x6a/0xc0
Aug 6 15:22:05 adams kernel: [<c01a6614>] ext3_get_block+0x64/0xb0
Aug 6 15:22:05 adams kernel: [<c016404b>] __block_prepare_write+0x20b/0x490
Aug 6 15:22:05 adams kernel: [<c011bc70>] default_wake_function+0x0/0x30
Aug 6 15:22:05 adams kernel: [<c0164bb4>] block_prepare_write+0x34/0x50
Aug 6 15:22:05 adams kernel: [<c01a65b0>] ext3_get_block+0x0/0xb0
Aug 6 15:22:05 adams kernel: [<c01a6bcf>] ext3_prepare_write+0x5f/0x110
Aug 6 15:22:05 adams kernel: [<c01a65b0>] ext3_get_block+0x0/0xb0
Aug 6 15:22:05 adams kernel: [<c013f0e2>] generic_file_aio_write_nolock+0x412/0xbd0
Aug 6 15:22:05 adams kernel: [<c017c0ed>] d_alloc_anon+0x2d/0x240
Aug 6 15:22:05 adams kernel: [<c034bf0f>] sock_alloc_send_skb+0x2f/0x40
Aug 6 15:22:05 adams kernel: [<c0366ceb>] ip_append_data+0x6db/0x780
Aug 6 15:22:05 adams kernel: [<c013f91e>] generic_file_write_nolock+0x7e/0xa0
Aug 6 15:22:05 adams kernel: [<c038867a>] udp_sendmsg+0x41a/0xb40
Aug 6 15:22:05 adams kernel: [<c02a4136>] e1000_xmit_frame+0x516/0x680
Aug 6 15:22:05 adams kernel: [<c01df6b0>] exp_find_key+0x60/0x70
Aug 6 15:22:05 adams kernel: [<c013fb7c>] generic_file_writev+0x5c/0x80
Aug 6 15:22:05 adams kernel: [<c016068f>] do_readv_writev+0x23f/0x2d0
Aug 6 15:22:05 adams kernel: [<c0160040>] do_sync_write+0x0/0xc0
Aug 6 15:22:05 adams kernel: [<c016108d>] open_private_file+0x9d/0xa0
Aug 6 15:22:05 adams kernel: [<c01607e8>] vfs_writev+0x58/0x70
Aug 6 15:22:05 adams kernel: [<c01dbdcf>] nfsd_write+0x11f/0x380
Aug 6 15:22:05 adams kernel: [<c011bc9a>] default_wake_function+0x2a/0x30
Aug 6 15:22:05 adams kernel: [<c011bcda>] __wake_up_common+0x3a/0x70
Aug 6 15:22:05 adams kernel: [<c01d8808>] nfsd_proc_write+0xa8/0x130
Aug 6 15:22:05 adams kernel: [<c01d7818>] nfsd_dispatch+0xe8/0x1f5
Aug 6 15:22:05 adams kernel: [<c01d7730>] nfsd_dispatch+0x0/0x1f5
Aug 6 15:22:05 adams kernel: [<c03b8120>] svc_process+0x480/0x64c
Aug 6 15:22:05 adams kernel: [<c01d747b>] nfsd+0x26b/0x520
Aug 6 15:22:05 adams kernel: [<c010b356>] work_resched+0x5/0x16
Aug 6 15:22:05 adams kernel: [<c01d7210>] nfsd+0x0/0x520
Aug 6 15:22:05 adams kernel: [<c01d7210>] nfsd+0x0/0x520
Aug 6 15:22:05 adams kernel: [<c0108e35>] kernel_thread_helper+0x5/0x10
Aug 6 15:22:05 adams kernel:
Aug 6 15:22:05 adams kernel: Code: 8b 37 c7 44 24 20 00 00 00 00 c7 44 24 1c 00 00 00 00 8d 96
Aug 6 15:22:05 adams kernel: <1>Unable to handle kernel NULL pointer dereference at virtual address 00000000
Aug 6 15:22:05 adams kernel: printing eip:
Aug 6 15:22:05 adams kernel: journal commit I/O error
Aug 6 15:22:05 adams kernel: c01b1e61
Aug 6 15:22:05 adams kernel: journal commit I/O error


-----------------------------------------------------
adams # fsck -n /dev/md1
fsck 1.34-WIP (21-May-2003)
e2fsck 1.34-WIP (21-May-2003)
Warning: skipping journal recovery because doing a read-only filesystem check.
/dev/md1 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Unattached zero-length inode 47710329. Clear? no

Unattached inode 47710329
Connect to /lost+found? no

Pass 5: Checking group summary information

/dev/md1: ********** WARNING: Filesystem still has errors **********

/dev/md1: 235617/53362688 files (11.7% non-contiguous), 3139703/106699200 blocks
adams # fsck /dev/md1
fsck 1.34-WIP (21-May-2003)
e2fsck 1.34-WIP (21-May-2003)
/dev/md1: recovering journal
/dev/md1: clean, 235617/53362688 files, 3139703/106699200 blocks
adams # fsck /dev/md1
fsck 1.34-WIP (21-May-2003)
e2fsck 1.34-WIP (21-May-2003)
/dev/md1: clean, 235617/53362688 files, 3139703/106699200 blocks
adams # fsck -f /dev/md1
fsck 1.34-WIP (21-May-2003)
e2fsck 1.34-WIP (21-May-2003)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong for group #2532 (8073, counted=8075).
Fix<y>? yes

Free blocks count wrong (103559497, counted=103559499).
Fix<y>? yes

Free inodes count wrong for group #2912 (16263, counted=16264).
Fix<y>? yes

Free inodes count wrong (53127071, counted=53127072).
Fix<y>? yes


/dev/md1: ***** FILE SYSTEM WAS MODIFIED *****
/dev/md1: 235616/53362688 files (11.7% non-contiguous), 3139701/106699200 blocks

2003-08-06 06:55:59

by Andrew Morton

Subject: Re: ext3 badness in 2.6.0-test2

Neil Brown <[email protected]> wrote:
>
> > Could have been an IO error, or the block/MD/device layer returned
> > incorrect data. ext3 used to go BUG a lot in the latter case, but nowadays
> > we try to abort the journal and go read-only.
> >
> > Without the initial message we do not know.
>
> Can I add a "me too".....

No. Go away.

> First, I'm using data=journal - is that supposed to work in 2.6 yet?
>

I think so. It's much less tested than ordered mode, but some people have
beat upon it.

> I have a raid5 array across a bunch of SCSI drives and a separate scsi
> drive with boot, swap, and a journal partition.
> I have an ext3 filesystem on the raid5 array with an external journal
> on the journal partition.

oh. Good to hear that external journals still work.

> The raid5 was rebuilding a spare and I was pounding the filesystem
> over NFS using the SPEC SFS benchmark program (of course the raid5
> rebuild killed the performance reported by SFS, but I expected that).
>
> Shortly after the rebuild finished, I got an ext3 error (see log
> below) and the journal aborted, and then nfsd Oopsed inside ext3.

> ...
> Aug 6 15:22:05 adams kernel: EXT3-fs error (device md1): ext3_add_entry: bad entry in directory #41009295: rec_len is smaller than minimal - offset=0, inode=3265411686, rec_len=0, name_len=0

It looks like we had a block full of zeroes come back from the device
driver. I find it distinctly fishy how this happens so much with
ext3-on-md, and so little with ext3-on-just-a-disk.
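
For reference, a minimal sketch of the check behind the "rec_len is smaller than minimal" message, assuming the usual ext2/ext3 on-disk directory entry layout (an 8-byte header followed by the name, padded to a multiple of 4). The struct and function names below are illustrative, not the kernel's. An entry whose rec_len is 0 can never pass this check, whatever else the block contains; note that the inode field in the quoted log line is not zero.

```c
#include <stdint.h>

/* Illustrative copy of the fixed 8-byte directory entry header. */
struct dirent_header {
	uint32_t inode;
	uint16_t rec_len;
	uint8_t  name_len;
	uint8_t  file_type;
};

/*
 * Smallest record that can hold a name of name_len bytes:
 * 8 header bytes plus the name, rounded up to a multiple of 4.
 */
unsigned int dir_rec_len(unsigned int name_len)
{
	return (name_len + 8 + 3) & ~3u;
}

/* Returns 1 if rec_len is at least the minimum for a one-character name. */
int rec_len_is_plausible(const struct dirent_header *de)
{
	return de->rec_len >= dir_rec_len(1);
}
```

With the fields from the log (inode=3265411686, rec_len=0, name_len=0), rec_len_is_plausible() returns 0, because the minimum record length for a one-character name is 12 bytes.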


> Aug 6 15:22:05 adams kernel: Remounting filesystem read-only
> Aug 6 15:22:05 adams kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000000

Now that's an ext3 bug. Something like this...

fs/jbd/transaction.c | 10 ++++++++--
1 files changed, 8 insertions(+), 2 deletions(-)

diff -puN fs/jbd/transaction.c~ext3-aborted-journal-fix fs/jbd/transaction.c
--- 25/fs/jbd/transaction.c~ext3-aborted-journal-fix 2003-08-05 23:53:16.000000000 -0700
+++ 25-akpm/fs/jbd/transaction.c 2003-08-05 23:56:47.000000000 -0700
@@ -525,12 +525,18 @@ do_get_write_access(handle_t *handle, st
int force_copy, int *credits)
{
struct buffer_head *bh;
- transaction_t *transaction = handle->h_transaction;
- journal_t *journal = transaction->t_journal;
+ transaction_t *transaction;
+ journal_t *journal;
int error;
char *frozen_buffer = NULL;
int need_copy = 0;

+ if (is_handle_aborted(handle))
+ return -EROFS;
+
+ transaction = handle->h_transaction;
+ journal = transaction->t_journal;
+
jbd_debug(5, "buffer_head %p, force_copy %d\n", jh, force_copy);

JBUFFER_TRACE(jh, "entry");

_
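
The ordering that the patch enforces can be sketched outside the kernel as follows. This uses mock types, not the real jbd structures, and it assumes (as the oops is consistent with, since edi is 0 at the faulting instruction) that the handle's transaction pointer can be NULL once the journal has aborted. All names here are hypothetical.

```c
#include <errno.h>
#include <stddef.h>

/* Mock types for illustration only; the real ones live in jbd. */
struct mock_journal { int id; };
struct mock_transaction { struct mock_journal *journal; };
struct mock_handle {
	struct mock_transaction *transaction; /* assumed NULL once aborted */
	int aborted;
};

/* Check the abort state before following handle->transaction. */
int mock_get_write_access(struct mock_handle *handle)
{
	struct mock_journal *journal;

	if (handle->aborted)
		return -EROFS;

	/* Only now is it safe to follow the pointers. */
	journal = handle->transaction->journal;
	return journal ? 0 : -EINVAL;
}
```

Calling mock_get_write_access() on an aborted handle with a NULL transaction returns -EROFS instead of dereferencing the pointer, which is the behaviour the patch aims for.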

2003-08-06 20:18:15

by Frank van de Pol

Subject: Re: md+ext3 badness in 2.6.0-test2


On Tue, Aug 05, 2003 at 11:57:35PM -0700, Andrew Morton wrote:
>
> > ...
> > Aug 6 15:22:05 adams kernel: EXT3-fs error (device md1): ext3_add_entry: bad entry in directory #41009295: rec_len is smaller than minimal - offset=0, inode=3265411686, rec_len=0, name_len=0
>
> It looks like we had a block full of zeroes come back from the device
> driver. I find it distinctly fishy how this happens so much with
> ext3-on-md, and so little with ext3-on-just-a-disk.
>

I'm also seeing this kind of error on my box when running late 2.5.x /
2.6.0-preX kernels. 2.4 is stable on this box. The affected filesystems are
also ext3 on md (using a raid5 volume).

I suspected that it had something to do with memory corruption (memtest
does not find any problems) triggered by the hashes introduced during 2.5, or
some locking issue, since my box is a dual P-II.

Interesting to see that others are experiencing this kind of problem as
well, and even more interesting is the relation to md.

Frank.

--
+---- --- -- - - - -
| Frank van de Pol -o) A-L-S-A
| [email protected] /\\ Sounds good!
| http://www.alsa-project.org _\_v
| Linux - Why use Windows if we have doors available?

2003-08-08 01:01:15

by NeilBrown

Subject: Re: ext3 badness in 2.6.0-test2

On Tuesday August 5, [email protected] wrote:
> Neil Brown <[email protected]> wrote:
> > ...
> > Aug 6 15:22:05 adams kernel: EXT3-fs error (device md1): ext3_add_entry: bad entry in directory #41009295: rec_len is smaller than minimal - offset=0, inode=3265411686, rec_len=0, name_len=0
>
> It looks like we had a block full of zeroes come back from the device
> driver. I find it distinctly fishy how this happens so much with
> ext3-on-md, and so little with ext3-on-just-a-disk.

Well, they're not *all* zero.....

I can reproduce this easily with various configurations of ext3 over
raid5, and get a similar problem with ext2 over raid5 (corrupt inodes
rather than directory entries) but ext3 over raid0 is rock-solid.

So I guess the finger points generally in the direction of raid5.
Now I've just got to figure if it is a bug in r5, or some assumption
that it makes that is no longer valid (I was briefly suspicious of
PF_READAHEAD which could have made a real mess of raid5, but that
wouldn't have this symptom)

NeilBrown

2003-08-08 01:14:36

by Andrew Morton

Subject: Re: ext3 badness in 2.6.0-test2

Neil Brown <[email protected]> wrote:
>
> On Tuesday August 5, [email protected] wrote:
> > Neil Brown <[email protected]> wrote:
> > > ...
> > > Aug 6 15:22:05 adams kernel: EXT3-fs error (device md1): ext3_add_entry: bad entry in directory #41009295: rec_len is smaller than minimal - offset=0, inode=3265411686, rec_len=0, name_len=0
> >
> > It looks like we had a block full of zeroes come back from the device
> > driver. I find it distinctly fishy how this happens so much with
> > ext3-on-md, and so little with ext3-on-just-a-disk.
>
> Well, they're not *all* zero.....
>
> I can reproduce this easily with various configurations of ext3 over
> raid5, and get a similar problem with ext2 over raid5 (corrupt inodes
> rather than directory entries) but ext3 over raid0 is rock-solid.

Good news that it is reproducible.

Have you tried running fsx-linux? It is good at picking up data loss.

> So I guess the finger points generally in the direction of raid5.
> Now I've just got to figure if it is a bug in r5, or some assumption
> that it makes that is no longer valid (I was briefly suspicious of
> PF_READAHEAD which could have made a real mess of raid5, but that
> wouldn't have this symptom)

The PF_READAHEAD thing was a huge bug. Make sure that it is fixed before
proceeding. Linus's tree has the fix. This is the relevant patch:

drivers/block/ll_rw_blk.c | 2 +-
fs/buffer.c | 3 +--
include/linux/sched.h | 1 -
mm/readahead.c | 11 +++--------
4 files changed, 5 insertions(+), 12 deletions(-)

diff -puN mm/readahead.c~remove-PF_READAHEAD mm/readahead.c
--- 25/mm/readahead.c~remove-PF_READAHEAD 2003-08-06 19:53:14.000000000 -0700
+++ 25-akpm/mm/readahead.c 2003-08-06 19:53:14.000000000 -0700
@@ -298,15 +298,10 @@ int force_page_cache_readahead(struct ad
int do_page_cache_readahead(struct address_space *mapping, struct file *filp,
unsigned long offset, unsigned long nr_to_read)
{
- int ret = 0;
-
- if (!bdi_read_congested(mapping->backing_dev_info)) {
- current->flags |= PF_READAHEAD;
- ret = __do_page_cache_readahead(mapping, filp,
+ if (!bdi_read_congested(mapping->backing_dev_info))
+ return __do_page_cache_readahead(mapping, filp,
offset, nr_to_read);
- current->flags &= ~PF_READAHEAD;
- }
- return ret;
+ return 0;
}

/*
diff -puN fs/buffer.c~remove-PF_READAHEAD fs/buffer.c
--- 25/fs/buffer.c~remove-PF_READAHEAD 2003-08-06 19:53:14.000000000 -0700
+++ 25-akpm/fs/buffer.c 2003-08-06 19:53:14.000000000 -0700
@@ -506,8 +506,7 @@ static void end_buffer_async_read(struct
set_buffer_uptodate(bh);
} else {
clear_buffer_uptodate(bh);
- if (!(current->flags & PF_READAHEAD))
- buffer_io_error(bh);
+ buffer_io_error(bh);
SetPageError(page);
}

diff -puN drivers/block/ll_rw_blk.c~remove-PF_READAHEAD drivers/block/ll_rw_blk.c
--- 25/drivers/block/ll_rw_blk.c~remove-PF_READAHEAD 2003-08-06 19:53:14.000000000 -0700
+++ 25-akpm/drivers/block/ll_rw_blk.c 2003-08-06 19:53:14.000000000 -0700
@@ -1833,7 +1833,7 @@ static int __make_request(request_queue_

barrier = test_bit(BIO_RW_BARRIER, &bio->bi_rw);

- ra = bio_flagged(bio, BIO_RW_AHEAD) || current->flags & PF_READAHEAD;
+ ra = bio_flagged(bio, BIO_RW_AHEAD);

again:
insert_here = NULL;
diff -puN include/linux/sched.h~remove-PF_READAHEAD include/linux/sched.h
--- 25/include/linux/sched.h~remove-PF_READAHEAD 2003-08-06 19:53:14.000000000 -0700
+++ 25-akpm/include/linux/sched.h 2003-08-06 19:53:14.000000000 -0700
@@ -487,7 +487,6 @@ do { if (atomic_dec_and_test(&(tsk)->usa
#define PF_SWAPOFF 0x00080000 /* I am in swapoff */
#define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */
#define PF_SYNCWRITE 0x00200000 /* I am doing a sync write */
-#define PF_READAHEAD 0x00400000 /* I am doing read-ahead */

#ifdef CONFIG_SMP
extern int set_cpus_allowed(task_t *p, unsigned long new_mask);

_
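
To make the ll_rw_blk.c hunk concrete, here is a small sketch of the two predicates using mock request and task structures. The value of PF_READAHEAD is taken from the sched.h hunk above; MOCK_RW_AHEAD and the struct names are made up for illustration. The point is only that the old test classified every request submitted by a flagged task as read-ahead, whether or not the request itself was marked.

```c
#define PF_READAHEAD  0x00400000UL /* value from the sched.h hunk */
#define MOCK_RW_AHEAD 0x2UL        /* hypothetical per-request flag bit */

struct mock_task { unsigned long flags; };
struct mock_bio  { unsigned long rw; };

/* Old predicate: per-request flag OR per-task flag. */
int is_readahead_old(const struct mock_bio *bio, const struct mock_task *task)
{
	return (bio->rw & MOCK_RW_AHEAD) != 0 ||
	       (task->flags & PF_READAHEAD) != 0;
}

/* New predicate: only the request's own flag counts. */
int is_readahead_new(const struct mock_bio *bio)
{
	return (bio->rw & MOCK_RW_AHEAD) != 0;
}
```

A request with no read-ahead bit, submitted by a task that has PF_READAHEAD set, is read-ahead under the old predicate and an ordinary request under the new one.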

2003-08-09 00:43:11

by NeilBrown

Subject: Re: ext3 badness in 2.6.0-test2

On Thursday August 7, [email protected] wrote:
> Neil Brown <[email protected]> wrote:
>
> > So I guess the finger points generally in the direction of raid5.
> > Now I've just got to figure if it is a bug in r5, or some assumption
> > that it makes that is no longer valid (I was briefly suspicious of
> > PF_READAHEAD which could have made a real mess of raid5, but that
> > wouldn't have this symptom)
>
> The PF_READAHEAD thing was a huge bug. Make sure that it is fixed before
> proceeding. Linus's tree has the fix.

I found it. It was read-ahead related, but nothing to do with
PF_READAHEAD.

With this patch, my test ran to completion instead of dying at about
the 20% mark.

NeilBrown

=================================================================
Disable raid5 handling of read-ahead

raid5 tries to honour RWA_MASK, but messes it up and can return bad data.
Just ignore RWA_MASK for now.


----------- Diffstat output ------------
./drivers/md/raid5.c | 2 +-
1 files changed, 1 insertion(+), 1 deletion(-)

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~ 2003-08-08 14:37:00.000000000 +1000
+++ ./drivers/md/raid5.c 2003-08-08 14:37:19.000000000 +1000
@@ -1326,7 +1326,7 @@ static int make_request (request_queue_t
(unsigned long long)new_sector,
(unsigned long long)logical_sector);

- sh = get_active_stripe(conf, new_sector, pd_idx, (bi->bi_rw&RWA_MASK));
+ sh = get_active_stripe(conf, new_sector, pd_idx, 0/*(bi->bi_rw&RWA_MASK)*/);
if (sh) {

add_stripe_bio(sh, bi, dd_idx, (bi->bi_rw&RW_MASK));
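
A minimal sketch of what the one-line change does to the fourth argument. It assumes the READ/WRITE/READA and RW_MASK/RWA_MASK values below (recalled from linux/fs.h of that era, so treat them as assumptions) and assumes that the argument tells get_active_stripe not to block. The function names are invented for illustration.

```c
/* Assumed values; check linux/fs.h for the tree in question. */
#define READ     0
#define WRITE    1
#define READA    2 /* read-ahead */
#define RW_MASK  1
#define RWA_MASK 2

/* Fourth argument to get_active_stripe before the patch. */
int noblock_before(unsigned long bi_rw)
{
	return (int)(bi_rw & RWA_MASK);
}

/* After the patch the read-ahead bit is ignored: always 0. */
int noblock_after(unsigned long bi_rw)
{
	(void)bi_rw;
	return 0;
}

/* Direction bit passed to add_stripe_bio; unchanged by the patch. */
int is_write(unsigned long bi_rw)
{
	return (int)(bi_rw & RW_MASK);
}
```

Under these assumptions, a read-ahead request is now treated like an ordinary read when the stripe is looked up, while the read/write direction handed to add_stripe_bio stays the same.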


2003-08-09 01:09:23

by Mike Fedyk

Subject: Re: ext3 badness in 2.6.0-test2

On Sat, Aug 09, 2003 at 10:39:43AM +1000, Neil Brown wrote:
> - sh = get_active_stripe(conf, new_sector, pd_idx, (bi->bi_rw&RWA_MASK));
> + sh = get_active_stripe(conf, new_sector, pd_idx, 0/*(bi->bi_rw&RWA_MASK)*/);

Wouldn't it be better to remove instead of just commenting out that part?

At first glance it looked like a divide by zero error... :-/

2003-08-10 23:44:39

by NeilBrown

Subject: Re: ext3 badness in 2.6.0-test2

On Friday August 8, [email protected] wrote:
> On Sat, Aug 09, 2003 at 10:39:43AM +1000, Neil Brown wrote:
> > - sh = get_active_stripe(conf, new_sector, pd_idx, (bi->bi_rw&RWA_MASK));
> > + sh = get_active_stripe(conf, new_sector, pd_idx, 0/*(bi->bi_rw&RWA_MASK)*/);
>
> Wouldn't it be better to remove instead of just commenting out that
> part?

The ugliness (hopefully) reminds me to fix it properly.
I think I can come up with a sensible use for the read-ahead flag, but
I would want to think carefully about it first, and test it somewhat.

NeilBrown