From: Badari Pulavarty Subject: Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers Date: Tue, 05 Sep 2006 09:11:42 -0700 Message-ID: <1157472702.23501.12.camel@dyn9047017100.beaverton.ibm.com> References: <1157125829.30578.6.camel@dyn9047017100.beaverton.ibm.com> <1157128342.30578.14.camel@dyn9047017100.beaverton.ibm.com> <20060901101801.7845bca2.akpm@osdl.org> Mime-Version: 1.0 Content-Type: text/plain Content-Transfer-Encoding: 7bit Cc: Anton Altaparmakov , sct@redhat.com, linux-fsdevel , lkml , ext4 , jack@suse.cz Return-path: Received: from e33.co.us.ibm.com ([32.97.110.151]:15309 "EHLO e33.co.us.ibm.com") by vger.kernel.org with ESMTP id S965018AbWIEQIY (ORCPT ); Tue, 5 Sep 2006 12:08:24 -0400 To: Andrew Morton In-Reply-To: <20060901101801.7845bca2.akpm@osdl.org> Sender: linux-ext4-owner@vger.kernel.org List-Id: linux-ext4.vger.kernel.org On Fri, 2006-09-01 at 10:18 -0700, Andrew Morton wrote: > On Fri, 01 Sep 2006 09:32:22 -0700 > Badari Pulavarty wrote: > > > > > Kernel BUG at fs/buffer.c:2791 > > > > invalid opcode: 0000 [1] SMP > > > > > > > > Its complaining about BUG_ON(!buffer_mapped(bh)). > > I need to have a little think about this, remember what _should_ be > happening in this situation. > > We (mainly I) used to do a huge amount of fsx-linux testing on 1k blocksize > filesystems. We've done something to make this start happening. Part of > resolving this bug will be working out what that was. > It took a while to track this down. 2.6.13 is the last kernel which runs my fsx tests fine (72+ hours). Here is the change that seems to cause the problem. Jana Kara introduced a new mode "SWRITE" for ll_rw_block() - where it waits for buffer to be unlocked (WRITE will skip locked buffers) + jbd journal_commit_transaction() has been changed to make use of SWRITE. http://marc.theaimsgroup.com/?l=linux-fsdevel&m=112109788702895&w=2 Theoritically same race (between journal_commit_transaction() and journal_unmap_buffer()+set_page_dirty()) could exist before his changes - which should trigger bug in submit_bh(). But I can't seem to hit it without his changes. My guess is ll_rw_block() is always skipping those cleaned up buffers, before page gets dirtied again .. Andrew, what should we do ? Do you suggest handling this in jbd itself (like this patch) ? Thanks, Badari Patch to fix: Kernel BUG at fs/buffer.c:2791 on 1k (2k) filesystems while running fsx. journal_commit_transaction collects lots of dirty buffer from and does a single ll_rw_block() to write them out. ll_rw_block() locks the buffer and checks to see if they are dirty and submits them for IO. In the mean while, journal_unmap_buffers() as part of truncate can unmap the buffer and throw it away. Since its a 1k (2k) filesystem - each page (4k) will have more than one buffer_head attached to the page and and we can't free up buffer_heads attached to the page (if we are not invalidating the whole page). Now, any call to set_page_dirty() (like msync_interval) could end up setting all the buffer heads attached to this page again dirty, including the ones those got cleaned up :( So, -not-so-elegant fix would be to have make jbd skip all the buffers that are not mapped. Signed-off-by: Badari Pulavarty --- fs/jbd/commit.c | 32 ++++++++++++++++++++++++++++++-- 1 file changed, 30 insertions(+), 2 deletions(-) Index: linux-2.6.18-rc5/fs/jbd/commit.c =================================================================== --- linux-2.6.18-rc5.orig/fs/jbd/commit.c 2006-08-27 20:41:48.000000000 -0700 +++ linux-2.6.18-rc5/fs/jbd/commit.c 2006-09-01 10:43:54.000000000 -0700 @@ -160,6 +160,34 @@ static int journal_write_commit_record(j return (ret == -EIO); } +static void jbd_write_buffers(int nr, struct buffer_head *bhs[]) +{ + int i; + + for (i = 0; i < nr; i++) { + struct buffer_head *bh = bhs[i]; + + lock_buffer(bh); + + /* + * case 1: Buffer could have been flushed by now, + * if so nothing to do for us. + * case 2: Buffer could have been unmapped up by + * journal_unmap_buffer - but still attached to the + * page. Any calls to set_page_dirty() would dirty + * the buffer even though its not mapped. If so, + * we need to skip them. + */ + if (buffer_mapped(bh) && test_clear_buffer_dirty(bh)) { + bh->b_end_io = end_buffer_write_sync; + get_bh(bh); + submit_bh(WRITE, bh); + continue; + } + unlock_buffer(bh); + } +} + /* * journal_commit_transaction * @@ -356,7 +384,7 @@ write_out_data: jbd_debug(2, "submit %d writes\n", bufs); spin_unlock(&journal->j_list_lock); - ll_rw_block(SWRITE, bufs, wbuf); + jbd_write_buffers(bufs, wbuf); journal_brelse_array(wbuf, bufs); bufs = 0; goto write_out_data; @@ -379,7 +407,7 @@ write_out_data: if (bufs) { spin_unlock(&journal->j_list_lock); - ll_rw_block(SWRITE, bufs, wbuf); + jbd_write_buffers(bufs, wbuf); journal_brelse_array(wbuf, bufs); spin_lock(&journal->j_list_lock); }