2000-12-06 02:38:18

by Jan Niehusmann

[permalink] [raw]
Subject: fs corruption with invalidate_buffers()

Some days ago I saw filesystem corruptions while testing the ext2fs online
resize patches by Andreas Dilger. First I thought that the online resizing
caused the problems, but further investigations didn't support this.

The latest observation shows that the problem is probably neither ext2 nor
lvm related:

While resizing the filesystem, invalidate_buffers() is called from the
lvm code. (lvm.c, line 2251, in lvm_do_lv_extend_reduce())
If I remove this call, the corruption goes away. But this is probably not
the correct fix, as it can cause problems when reducing the lv size.


For reference, some details of the corruption:
- I reproduced it with kernels between 2.4.0-test9
and 2.4.0-test12-pre5
- It is easily reproducible immediately after rebooting, but goes
away after some uptime (perhaps simply related to the amount of
unused memory)
- example script follows (attention: absolute device names
like /dev/vg1/test3 hardcoded!)

---------------------------------------------------
#!/bin/bash

umount /dev/vg1/test3
lvremove -f /dev/vg1/test3
lvcreate -n test3 -L 100M vg1
mke2fs -b 1024 /dev/vg1/test3
ext2prepare -v /dev/vg1/test3 50G
mount /dev/vg1/test3 /mnt/test3

( sleep 20; echo resize1; e2fsadm -L+90M /dev/vg1/test3; echo resize1 done ;
sleep 10; echo resize2; e2fsadm -L+90M /dev/vg1/test3; echo resize2 done ) &
echo copy1
cp -a /mnt/test/linux /mnt/test3/linux
echo copy1 done
echo copy2
cp -a /mnt/test3/linux /mnt/test3/linux2
echo copy2 done
---------------------------------------------------

/mnt/test/linux contains (surprise) a linux source, but I don't think
the contents are too important :-). The sleep values are tuned in a way that
leads to the following sequence:

copy1, resize1, resize1 done, copy1 done,
copy2, resize2, resize2 done, copy2 done

After that, the first copy is corrupted in memory only (and is ok after
rebooting), and the second copy is corrupted in memory and on disk. The
corrupted files contain parts of other files or binary stuff that may come
from directory entries.

I guess that invalidate_buffers somehow marks the buffers that contain
the first copy as free, but the second cp still uses them to copy the
files again. I don't understand the source well enough to find out
how it happens.

Jan


2000-12-07 19:37:02

by Jan Niehusmann

[permalink] [raw]
Subject: Re: fs corruption with invalidate_buffers()

On Wed, Dec 06, 2000 at 03:07:23AM +0100, Jan Niehusmann wrote:
> While resizing the filesystem, invalidate_buffers() is called from the
> lvm code. (lvm.c, line 2251, in lvm_do_lv_extend_reduce())
> If I remove this call, the corruption goes away. But this is probably not
> the correct fix, as it can cause problems when reducing the lv size.

Some more details:

I added the following code to put_last_free(bh) in buffer.c:

--- buffer.c.orig Wed Dec 6 17:19:57 2000
+++ buffer.c Thu Dec 7 19:55:39 2000
@@ -500,6 +500,11 @@
struct bh_free_head *head = &free_list[BUFSIZE_INDEX(bh->b_size)];
struct buffer_head **bhp = &head->list;

+ if(bh->b_page && Page_Uptodate(bh->b_page)
+ && bh->b_page->mapping) { // XXX ???
+ BUG();
+ }
+
bh->b_state = 0;

spin_lock(&head->lock);


That is, if I want to put a buffer to the free list, I check if it is
mapped and uptodate. If I understand the memory management correctly, this
is a Bad Thing and should not happen. But guess what? It does, in
invalidate_buffers.

I think invalidate_buffers should check if the buffer belongs to a
mapped page, and if it does, invalidate this mapping.

Jan


2000-12-07 22:02:37

by Jan Niehusmann

[permalink] [raw]
Subject: [PATCH] Re: fs corruption with invalidate_buffers()

The following patch actually prevents the corruption I described.

I'd like to hear from the people having problems with hdparm, if it helps
them, too.

Please note that the patch circumvents the problem more than it fixes it.
The true fix would invalidate the mappings, but I don't know how to do it.

Jan

--- linux-2.4.0-test11/fs/buffer.c Mon Nov 20 08:55:05 2000
+++ test/fs/buffer.c Thu Dec 7 22:28:24 2000
@@ -589,7 +589,7 @@
then an invalidate_buffers call that doesn't trash dirty buffers. */
void __invalidate_buffers(kdev_t dev, int destroy_dirty_buffers)
{
- int i, nlist, slept;
+ int i, nlist, slept, db_message=0;
struct buffer_head * bh, * bh_next;

retry:
@@ -615,8 +615,13 @@
write_lock(&hash_table_lock);
if (!atomic_read(&bh->b_count) &&
(destroy_dirty_buffers || !buffer_dirty(bh))) {
- __remove_from_queues(bh);
- put_last_free(bh);
+ if(bh->b_page
+ && bh->b_page->mapping) {
+ db_message=1;
+ } else {
+ __remove_from_queues(bh);
+ put_last_free(bh);
+ }
}
/* else complain loudly? */

@@ -629,6 +634,8 @@
spin_unlock(&lru_list_lock);
if (slept)
goto retry;
+ if(db_message)
+ printk("invalidate_buffer with mapped page\n");
}

void set_blocksize(kdev_t dev, int size)

2000-12-07 22:34:24

by Udo A. Steinberg

[permalink] [raw]
Subject: Re: [PATCH] Re: fs corruption with invalidate_buffers()

Jan Niehusmann wrote:
>
> The following patch actually prevents the corruption I described.
>
> I'd like to hear from the people having problems with hdparm, if it helps
> them, too.

Yes, it prevents the issue.

> Please note that the patch circumvents the problem more than it fixes it.
> The true fix would invalidate the mappings, but I don't know how to do it.

I don't know either. What does Alexander Viro say to all of this?

-Udo.



Same debug patch adapted to test12-pre7 follows:

--- linux/fs/buffer.c Thu Dec 7 22:55:54 2000
+++ /usr/src/linux/fs/buffer.c Thu Dec 7 22:49:02 2000
@@ -627,7 +627,7 @@
then an invalidate_buffers call that doesn't trash dirty buffers. */
void __invalidate_buffers(kdev_t dev, int destroy_dirty_buffers)
{
- int i, nlist, slept;
+ int i, nlist, slept, db_message = 0;
struct buffer_head * bh, * bh_next;

retry:
@@ -653,9 +653,13 @@
write_lock(&hash_table_lock);
if (!atomic_read(&bh->b_count) &&
(destroy_dirty_buffers || !buffer_dirty(bh))) {
- remove_inode_queue(bh);
- __remove_from_queues(bh);
- put_last_free(bh);
+ if (bh->b_page && bh->b_page->mapping)
+ db_message = 1;
+ else {
+ remove_inode_queue(bh);
+ __remove_from_queues(bh);
+ put_last_free(bh);
+ }
}
/* else complain loudly? */

@@ -668,6 +672,8 @@
spin_unlock(&lru_list_lock);
if (slept)
goto retry;
+ if (db_message)
+ printk("invalidate_buffers with mapped page!\n");
}

void set_blocksize(kdev_t dev, int size)

2000-12-07 22:57:48

by Alexander Viro

[permalink] [raw]
Subject: Re: [PATCH] Re: fs corruption with invalidate_buffers()



On Thu, 7 Dec 2000, Udo A. Steinberg wrote:

> Jan Niehusmann wrote:
> >
> > The following patch actually prevents the corruption I described.
> >
> > I'd like to hear from the people having problems with hdparm, if it helps
> > them, too.
>
> Yes, it prevents the issue.
>
> > Please note that the patch circumvents the problem more than it fixes it.
> > The true fix would invalidate the mappings, but I don't know how to do it.
>
> I don't know either. What does Alexander Viro say to all of this?

That invalidate_buffers() should leave the unhashed ones alone. If it can't
be found via getblk() - just leave it as is.

IOW, let it skip bh if (bh->b_next == NULL && !destroy_dirty_buffers).
No warnings needed - it's a normal situation.

2000-12-08 00:08:41

by Jan Niehusmann

[permalink] [raw]
Subject: Re: [PATCH] Re: fs corruption with invalidate_buffers()

On Thu, Dec 07, 2000 at 05:26:46PM -0500, Alexander Viro wrote:
> That invalidate_buffers() should leave the unhashed ones alone. If it can't
> be found via getblk() - just leave it as is.
>
> IOW, let it skip bh if (bh->b_next == NULL && !destroy_dirty_buffers).
> No warnings needed - it's a normal situation.

Yes, if(bh->b_next == NULL && !destroy_dirty_buffers) seems to work, too.
I wonder why, because, if I analysed the problem correctly, it was caused
by the page mapping. Or is it a general rule that bh->b_next==NULL if
bh->b_page->mapping != NULL, ie. a buffer is never directly hased and belongs
to a mapped page?

Is there a place one can look for documentation about these things?

Another question is what should happen with the mapped pages? Whoever calls
invalidate_buffers(), probably does it because the underlying device disappered
or changed, so the page mappings should be invalidated, too.
OTOH, pages are (normally) mapped through inodes, and if the inode stays valid
after the invalidate_buffers() (ie. if it's called by an lvm resize event),
the page mapping stays valid, too.

Jan

2000-12-22 00:36:08

by Jan Niehusmann

[permalink] [raw]
Subject: [PATCH] Re: fs corruption with invalidate_buffers()

The file corruption I reported on Dec 6 is still there in test13-pre3.
(I can only reproduce it easily with the ext2 online resizing patches,
but I really don't think it is caused by them)

The corruption happens if invalidate_buffers calls put_last_free() on
buffers that belong to mapped pages. These pages remain valid and may
get used later, while the buffers are marked free and may be reused
by something completely different, immediately causing corruption.

I changed my patch for the problem according to the following advice by
Al Viro:

On Thu, Dec 07, 2000 at 05:26:46PM -0500, Alexander Viro wrote:
> That invalidate_buffers() should leave the unhashed ones alone. If it can't
> be found via getblk() - just leave it as is.
>
> IOW, let it skip bh if (bh->b_next == NULL && !destroy_dirty_buffers).
> No warnings needed - it's a normal situation.

This is the result - against test12-pre7, but works well with
test13-pre3:


--- linux-2.4.0-test12-pre7-jn/fs/buffer.c.orig Fri Dec 8 14:59:28 2000
+++ linux-2.4.0-test12-pre7-jn/fs/buffer.c Fri Dec 8 15:05:11 2000
@@ -502,6 +502,10 @@
struct bh_free_head *head = &free_list[BUFSIZE_INDEX(bh->b_size)];
struct buffer_head **bhp = &head->list;

+ if(bh->b_page && bh->b_page->mapping) {
+ panic("put_last_free() on mapped buffer!");
+ }
+
bh->b_state = 0;

spin_lock(&head->lock);
@@ -652,7 +656,8 @@

write_lock(&hash_table_lock);
if (!atomic_read(&bh->b_count) &&
- (destroy_dirty_buffers || !buffer_dirty(bh))) {
+ (destroy_dirty_buffers ||
+ (!buffer_dirty(bh) && bh->b_next!=0) )) {
remove_inode_queue(bh);
__remove_from_queues(bh);
put_last_free(bh);

2000-12-22 01:09:04

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Re: fs corruption with invalidate_buffers()



On Fri, 22 Dec 2000, Jan Niehusmann wrote:
>
> This is the result - against test12-pre7, but works well with
> test13-pre3:

This looks bogus.

You can't test "bh->b_next!=0", because that is entirely meaningless.

b_next can be NULL either because the buffer isn't hashed, or because the
buffer _is_ hashed, but just happens to be last on the hash chain.

So testing "bh->b_next" doesn't actually tell you anything at all.

Linus

2000-12-22 01:18:54

by Jan Niehusmann

[permalink] [raw]
Subject: Re: [PATCH] Re: fs corruption with invalidate_buffers()

On Thu, Dec 21, 2000 at 04:37:30PM -0800, Linus Torvalds wrote:
> This looks bogus.

It may be - I just did what Al told me without really understanding it ;-)

The test I did initially was the following:

if(!atomic_read(&bh->b_count) &&
(destroy_dirty_buffers || !buffer_dirty(bh))
&& ! (bh->b_page && bh->b_page->mapping)
)

That is, I was explicitely checking for a mapped page. It worked well, too.
Is this more reasonable?

Jan

2000-12-22 01:32:12

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Re: fs corruption with invalidate_buffers()



On Fri, 22 Dec 2000, Jan Niehusmann wrote:
>
> The test I did initially was the following:
>
> if(!atomic_read(&bh->b_count) &&
> (destroy_dirty_buffers || !buffer_dirty(bh))
> && ! (bh->b_page && bh->b_page->mapping)
> )
>
> That is, I was explicitely checking for a mapped page. It worked well, too.
> Is this more reasonable?

I'd suggest just doing this instead (warning: cut-and-paste in xterm, so
white-space damage):

--- linux/fs/buffer.c.old Wed Dec 20 17:50:56 2000
+++ linux/fs/buffer.c Thu Dec 21 16:42:11 2000
@@ -639,8 +639,13 @@
continue;
for (i = nr_buffers_type[nlist]; i > 0 ; bh = bh_next, i--) {
bh_next = bh->b_next_free;
+
+ /* Another device? */
if (bh->b_dev != dev)
continue;
+ /* Part of a mapping? */
+ if (bh->b_page->mapping)
+ continue;
if (buffer_locked(bh)) {
atomic_inc(&bh->b_count);
spin_unlock(&lru_list_lock);

which just ignores mapped buffers entirely (and doesn't test for
bh->b_page being non-NULL, because that shouldn't be allowed anyway).

Linus

2000-12-22 02:25:40

by Jan Niehusmann

[permalink] [raw]
Subject: Re: [PATCH] Re: fs corruption with invalidate_buffers()

On Thu, Dec 21, 2000 at 05:01:00PM -0800, Linus Torvalds wrote:
>
>
> On Fri, 22 Dec 2000, Jan Niehusmann wrote:
> >
> > The test I did initially was the following:
> >
> > if(!atomic_read(&bh->b_count) &&
> > (destroy_dirty_buffers || !buffer_dirty(bh))
> > && ! (bh->b_page && bh->b_page->mapping)
> > )
> >
> > That is, I was explicitely checking for a mapped page. It worked well, too.
> > Is this more reasonable?
>
> I'd suggest just doing this instead (warning: cut-and-paste in xterm, so
> white-space damage):

> which just ignores mapped buffers entirely (and doesn't test for
> bh->b_page being non-NULL, because that shouldn't be allowed anyway).

Yes, looks good to me, and passes some tests. Here is a patch that has not
been cut and pasted:

--- linux/fs/buffer.c.orig Thu Dec 21 20:30:03 2000
+++ linux/fs/buffer.c Fri Dec 22 02:11:29 2000
@@ -643,7 +643,12 @@
continue;
for (i = nr_buffers_type[nlist]; i > 0 ; bh = bh_next, i--) {
bh_next = bh->b_next_free;
+
+ /* Another device? */
if (bh->b_dev != dev)
+ continue;
+ /* Part of a mapping? */
+ if (bh->b_page->mapping)
continue;
if (buffer_locked(bh)) {
atomic_inc(&bh->b_count);


I have one additional question: invalidate_buffers normally gets called if
someone wants to make sure that, after the call, read accesses to a device
really go to the device and don't get served by a cache. Is there some
mechanismn that does the same thing to mapped pages?

Jan


2000-12-22 02:27:20

by Alexander Viro

[permalink] [raw]
Subject: Re: [PATCH] Re: fs corruption with invalidate_buffers()



On Thu, 21 Dec 2000, Linus Torvalds wrote:

>
>
> On Fri, 22 Dec 2000, Jan Niehusmann wrote:
> >
> > This is the result - against test12-pre7, but works well with
> > test13-pre3:
>
> This looks bogus.

It is bogus. My apologies.