2004-03-04 15:13:18

by Dave Kleikamp

Subject: Race in nobh_prepare_write

Andrew,
I discovered a race between nobh_prepare_write and end_buffer_read_sync.
end_buffer_read_sync calls unlock_buffer, waking the nobh_prepare_write
thread, which immediately frees the buffer_head. end_buffer_read_sync
then calls put_bh which decrements b_count for the already freed
structure. The SLAB_DEBUG code detects the slab corruption.

I was able to fix it with the following patch. I reversed the order of
unlock_buffer and put_bh in end_buffer_read_sync. I also set b_count to
1 and later called brelse in nobh_prepare_write, since unlock_buffer may
expect b_count to be non-zero. I didn't change the other end_io
functions, as I'm not sure what effect this may have on other callers.

Is my fix correct? Is it complete?

Here's the patch:

--- linux-2.6.4-rc1+/fs/buffer.c.orig 2004-03-03 13:50:10.000000000 -0600
+++ linux-2.6.4-rc1+/fs/buffer.c 2004-03-04 08:30:03.000000000 -0600
@@ -178,8 +178,8 @@ void end_buffer_read_sync(struct buffer_
/* This happens, due to failed READA attempts. */
clear_buffer_uptodate(bh);
}
- unlock_buffer(bh);
put_bh(bh);
+ unlock_buffer(bh);
}

void end_buffer_write_sync(struct buffer_head *bh, int uptodate)
@@ -2395,7 +2395,7 @@ int nobh_prepare_write(struct page *page
goto failed;
}
bh->b_state = map_bh.b_state;
- atomic_set(&bh->b_count, 0);
+ atomic_set(&bh->b_count, 1);
bh->b_this_page = 0;
bh->b_page = page;
bh->b_blocknr = map_bh.b_blocknr;
@@ -2413,6 +2413,7 @@ int nobh_prepare_write(struct page *page
wait_on_buffer(read_bh[i]);
if (!buffer_uptodate(read_bh[i]))
ret = -EIO;
+ brelse(read_bh[i]);
free_buffer_head(read_bh[i]);
read_bh[i] = NULL;
}
@@ -2438,8 +2439,10 @@ int nobh_prepare_write(struct page *page

failed:
for (i = 0; i < nr_reads; i++) {
- if (read_bh[i])
+ if (read_bh[i]) {
+ brelse(read_bh[i]);
free_buffer_head(read_bh[i]);
+ }
}

/*


Thanks,
Shaggy
--
David Kleikamp
IBM Linux Technology Center


2004-03-04 17:32:04

by Andrew Morton

Subject: Re: Race in nobh_prepare_write

Dave Kleikamp <[email protected]> wrote:
>
> Andrew,
> I discovered a race between nobh_prepare_write and end_buffer_read_sync.
> end_buffer_read_sync calls unlock_buffer, waking the nobh_prepare_write
> thread, which immediately frees the buffer_head. end_buffer_read_sync
> then calls put_bh which decrements b_count for the already freed
> structure. The SLAB_DEBUG code detects the slab corruption.

Indeed.

> I was able to fix it with the following patch. I reversed the order of
> unlock_buffer and put_bh in end_buffer_read_sync. I also set b_count to
> 1 and later called brelse in nobh_prepare_write, since unlock_buffer may
> expect b_count to be non-zero. I didn't change the other end_io
> functions, as I'm not sure what effect this may have on other callers.
>
> Is my fix correct? Is it complete?

There's still a race, but it is benign.

See unlock_buffer():

clear_buffer_locked(bh);
smp_mb__after_clear_bit();
wake_up_buffer(bh);

versus:

> @@ -2413,6 +2413,7 @@ int nobh_prepare_write(struct page *page
> wait_on_buffer(read_bh[i]);
> if (!buffer_uptodate(read_bh[i]))
> ret = -EIO;
> + brelse(read_bh[i]);
> free_buffer_head(read_bh[i]);
> read_bh[i] = NULL;

That free_buffer_head() could get in there before unlock_buffer() runs
wake_up_buffer().

But wake_up_buffer() purely uses the bh address as a hash key, so nothing
bad happens.

It's a bit grubby, but this would be hard to fix otherwise. I'll slap a big
comment on it.

Thanks.

2004-03-05 08:17:31

by Andrew Morton

Subject: Re: Race in nobh_prepare_write

Dave Kleikamp <[email protected]> wrote:
>
> I discovered a race between nobh_prepare_write and end_buffer_read_sync.

Having looked at this a little more, I don't really like the dropping of
the bh refcount in end_buffer_read_sync() prior to unlocking the buffer.

I don't see any problem with it per se because of the behaviour of
try_to_free_buffers(), and the fact that the page lock totally protects the
buffer ring. But it is, I think, cleaner and more future-safe to give
nobh_prepare_write() a custom end_io handler.

This passes decent testing on 1k blocksize ext2.


fs/buffer.c | 43 +++++++++++++++++++++++++++++++++++++++----
1 files changed, 39 insertions(+), 4 deletions(-)

diff -puN fs/buffer.c~nobh_prepare_write-race-fix-2 fs/buffer.c
--- 25/fs/buffer.c~nobh_prepare_write-race-fix-2 2004-03-04 23:11:33.000000000 -0800
+++ 25-akpm/fs/buffer.c 2004-03-04 23:12:14.000000000 -0800
@@ -2321,6 +2321,28 @@ int generic_commit_write(struct file *fi
return 0;
}

+
+/*
+ * nobh_prepare_write()'s prereads are special: the buffer_heads are freed
+ * immediately, while under the page lock. So it needs a special end_io
+ * handler which does not touch the bh after unlocking it.
+ *
+ * Note: unlock_buffer() sort-of does touch the bh after unlocking it, but
+ * a race there is benign: unlock_buffer() only uses the bh's address for
+ * hashing after unlocking the buffer, so it doesn't actually touch the bh
+ * itself.
+ */
+static void end_buffer_read_nobh(struct buffer_head *bh, int uptodate)
+{
+ if (uptodate) {
+ set_buffer_uptodate(bh);
+ } else {
+ /* This happens, due to failed READA attempts. */
+ clear_buffer_uptodate(bh);
+ }
+ unlock_buffer(bh);
+}
+
/*
* On entry, the page is fully not uptodate.
* On exit the page is fully uptodate in the areas outside (from,to)
@@ -2412,12 +2434,25 @@ int nobh_prepare_write(struct page *page
}

if (nr_reads) {
- ll_rw_block(READ, nr_reads, read_bh);
+ struct buffer_head *bh;
+
+ /*
+ * The page is locked, so these buffers are protected from
+ * any VM or truncate activity. Hence we don't need to care
+ * for the buffer_head refcounts.
+ */
for (i = 0; i < nr_reads; i++) {
- wait_on_buffer(read_bh[i]);
- if (!buffer_uptodate(read_bh[i]))
+ bh = read_bh[i];
+ lock_buffer(bh);
+ bh->b_end_io = end_buffer_read_nobh;
+ submit_bh(READ, bh);
+ }
+ for (i = 0; i < nr_reads; i++) {
+ bh = read_bh[i];
+ wait_on_buffer(bh);
+ if (!buffer_uptodate(bh))
ret = -EIO;
- free_buffer_head(read_bh[i]);
+ free_buffer_head(bh);
read_bh[i] = NULL;
}
if (ret)

_

2004-03-05 15:14:00

by Dave Kleikamp

Subject: Re: Race in nobh_prepare_write

On Fri, 2004-03-05 at 02:17, Andrew Morton wrote:
>
> This passes decent testing on 1k blocksize ext2.

So far, I'm seeing no problems with jfs either.
--
David Kleikamp
IBM Linux Technology Center