2005-01-19 15:33:54

by Alex Tomas

[permalink] [raw]
Subject: [PATCH] JBD: fix against journal overflow


Good day,

under some quite high load, jbd can hit J_ASSERT(journal->j_free > 1)
in journal_next_log_block(). The cause is the following:

journal_commit_transaction()
{
struct buffer_head *wbuf[64];

....

/* If there's no more to do, or if the descriptor is full,
let the IO rip! */

if (bufs == ARRAY_SIZE(wbuf) ||
commit_transaction->t_buffers == NULL ||
space_left < sizeof(journal_block_tag_t) + 16) {

so, the real limit isn't size of journal descriptor, but wbuf.

__log_space_left()
{
/*
* Be pessimistic here about the number of those free blocks which
* might be required for log descriptor control blocks.
*/

#define MIN_LOG_RESERVED_BLOCKS 32 /* Allow for rounding errors */

left -= MIN_LOG_RESERVED_BLOCKS;


so, jbd expects upto 32 blocks to be used for internal purpose. but
with 64 blocks a descriptor limit we easily can break the limit.

The fix allocates wbuf in journal_init_dev(). That's enough because
we have only commit thread per filesystem.

thank, Alex


The journal_commit_transaction() generates too many descriptor blocks
because static array wbuf can hold 64 blocks only. The fix is to have
persistent array big enough to hold max. possible blocks.

Signed-off-by: Alex Tomas <[email protected]>

Index: linux-2.6.7/include/linux/jbd.h
===================================================================
--- linux-2.6.7.orig/include/linux/jbd.h 2004-08-26 17:12:16.000000000 +0400
+++ linux-2.6.7/include/linux/jbd.h 2005-01-19 17:08:33.144512008 +0300
@@ -826,6 +826,12 @@
struct jbd_revoke_table_s *j_revoke_table[2];

/*
+ * array of bhs for journal_commit_transaction
+ */
+ struct buffer_head **j_wbuf;
+ int j_wbufsize;
+
+ /*
* An opaque pointer to fs-private information. ext3 puts its
* superblock pointer here
*/
Index: linux-2.6.7/fs/jbd/commit.c
===================================================================
--- linux-2.6.7.orig/fs/jbd/commit.c 2004-08-26 17:12:40.000000000 +0400
+++ linux-2.6.7/fs/jbd/commit.c 2005-01-19 17:28:32.965111552 +0300
@@ -103,7 +103,7 @@
{
transaction_t *commit_transaction;
struct journal_head *jh, *new_jh, *descriptor;
- struct buffer_head *wbuf[64];
+ struct buffer_head **wbuf = journal->j_wbuf;
int bufs;
int flags;
int err;
@@ -271,7 +283,7 @@
BUFFER_TRACE(bh, "start journal writeout");
get_bh(bh);
wbuf[bufs++] = bh;
- if (bufs == ARRAY_SIZE(wbuf)) {
+ if (bufs == journal->j_wbufsize) {
jbd_debug(2, "submit %d writes\n",
bufs);
spin_unlock(&journal->j_list_lock);
@@ -488,7 +500,7 @@
/* If there's no more to do, or if the descriptor is full,
let the IO rip! */

- if (bufs == ARRAY_SIZE(wbuf) ||
+ if (bufs == journal->j_wbufsize ||
commit_transaction->t_buffers == NULL ||
space_left < sizeof(journal_block_tag_t) + 16) {

Index: linux-2.6.7/fs/jbd/journal.c
===================================================================
--- linux-2.6.7.orig/fs/jbd/journal.c 2005-01-19 12:07:59.000000000 +0300
+++ linux-2.6.7/fs/jbd/journal.c 2005-01-19 17:11:08.589880720 +0300
@@ -687,6 +687,7 @@
{
journal_t *journal = journal_init_common();
struct buffer_head *bh;
+ int n;

if (!journal)
return NULL;
@@ -702,6 +703,17 @@
journal->j_sb_buffer = bh;
journal->j_superblock = (journal_superblock_t *)bh->b_data;

+ /* journal descriptor can store upto n blocks -bzzz */
+ n = journal->j_blocksize / sizeof(journal_block_tag_t);
+ journal->j_wbufsize = n;
+ journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+ if (!journal->j_wbuf) {
+ printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
+ __FUNCTION__);
+ kfree(journal);
+ journal = NULL;
+ }
+
return journal;
}

@@ -717,7 +729,7 @@
{
struct buffer_head *bh;
journal_t *journal = journal_init_common();
- int err;
+ int err, n;
unsigned long blocknr;

if (!journal)
@@ -734,6 +746,17 @@
journal->j_maxlen = inode->i_size >> inode->i_sb->s_blocksize_bits;
journal->j_blocksize = inode->i_sb->s_blocksize;

+ /* journal descriptor can store upto n blocks -bzzz */
+ n = journal->j_blocksize / sizeof(journal_block_tag_t);
+ journal->j_wbufsize = n;
+ journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+ if (!journal->j_wbuf) {
+ printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
+ __FUNCTION__);
+ kfree(journal);
+ return NULL;
+ }
+
err = journal_bmap(journal, 0, &blocknr);
/* If that failed, give up */
if (err) {
@@ -1107,6 +1130,10 @@
iput(journal->j_inode);
if (journal->j_revoke)
journal_destroy_revoke(journal);
+ if (journal->j_wbuf) {
+ kfree(journal->j_wbuf);
+ journal->j_wbuf = NULL;
+ }
kfree(journal);
}



2005-01-24 17:43:37

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: [Ext2-devel] [PATCH] JBD: fix against journal overflow

Hi,

On Wed, 2005-01-19 at 15:32, Alex Tomas wrote:

> under some quite high load, jbd can hit J_ASSERT(journal->j_free > 1)
> in journal_next_log_block(). The cause is the following:
>
> journal_commit_transaction()
> {
> struct buffer_head *wbuf[64];
> /* If there's no more to do, or if the descriptor is full,
> let the IO rip! */
> if (bufs == ARRAY_SIZE(wbuf) ||
> commit_transaction->t_buffers == NULL ||
> space_left < sizeof(journal_block_tag_t) + 16) {
>
> so, the real limit isn't size of journal descriptor, but wbuf.

I don't see how that "limit" is relevant here. wbuf is nothing but the
size of the IO batches we pass to ll_rw_block() during that commit
phase. j_free affects the total size of space the *entire* commit has
to run into, and (as akpm has commented with a big marker beside it)
start_this_handle() reserves a *lot* of headroom for the extra space
that may be needed for transaction metadata.

(The comment there about journal_extend() needing to match it looks
correct, though --- that will need fixing.)

The only case I've ever seen the j_free > 1 assert fail on was the xattr
test that tridge was triggering with AG's first-generation xattr sharing
fix last December, and that was a journal_release_buffer() credits
accounting problem.

So NAK --- the wbuf batch size just doesn't seem to be relevant to the
problem being claimed.

Have you really seen this patch make a difference in testing?

Cheers,
Stephen


2005-01-24 17:49:15

by Alex Tomas

[permalink] [raw]
Subject: Re: [Ext2-devel] [PATCH] JBD: fix against journal overflow

>>>>> Stephen C Tweedie (SCT) writes:
SCT> I don't see how that "limit" is relevant here. wbuf is nothing but the
SCT> size of the IO batches we pass to ll_rw_block() during that commit
SCT> phase. j_free affects the total size of space the *entire* commit has
SCT> to run into, and (as akpm has commented with a big marker beside it)
SCT> start_this_handle() reserves a *lot* of headroom for the extra space
SCT> that may be needed for transaction metadata.



/* If there's no more to do, or if the descriptor is full,
let the IO rip! */

if (bufs == ARRAY_SIZE(wbuf) ||
commit_transaction->t_buffers == NULL ||
space_left < sizeof(journal_block_tag_t) + 16) {

....

/* Force a new descriptor to be generated next
time round the loop. */
descriptor = NULL;
bufs = 0;

------------------------^^^^^^^^^^^^^^^^^^^


SCT> Have you really seen this patch make a difference in testing?

of course


2005-01-24 18:18:54

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: [Ext2-devel] [PATCH] JBD: fix against journal overflow

Hi,

On Mon, 2005-01-24 at 17:47, Alex Tomas wrote:

> SCT> I don't see how that "limit" is relevant here. ...

> if (bufs == ARRAY_SIZE(wbuf) ||
> commit_transaction->t_buffers == NULL ||
> space_left < sizeof(journal_block_tag_t) + 16) {
/* Force a new descriptor to be generated next
> time round the loop. */

Sure. So we know that for every 63 metadata blocks in the journal, we
can be writing one journal descriptor. That grows the size of the
transaction slightly, by a factor of 1/63.

But __log_space_left() is supposed to account for that:

/*
* Be pessimistic here about the number of those free blocks which
* might be required for log descriptor control blocks.
*/
...
left -= (left >> 3);

which reduces the estimate of "usable" space left by a full 1/8th, which
is way more than the upper bounds of the effect that the limited wbuf
size ought to have.

journal_extend() uses __log_space_left() too, so ought to have the same
protection.

Actually, I think I'll withdraw the NAK on this patch, because the
change has other benefits ---- it's going to be both a stack-usage and a
performance improvement to use a generous kmalloc()ed wbuf[] here. But
I still can't see why it would fix the problem you're claiming it
solves.

--Stephen

2005-01-24 19:35:58

by Alex Tomas

[permalink] [raw]
Subject: Re: [Ext2-devel] [PATCH] JBD: fix against journal overflow

>>>>> Stephen C Tweedie (SCT) writes:

SCT> /*
SCT> * Be pessimistic here about the number of those free blocks which
SCT> * might be required for log descriptor control blocks.
SCT> */
SCT> ...
SCT> left -= (left >> 3);

oops. i overlooked this line. so, the fix becomes minor improvement patch ;)
thanks!

do you think the following can be improved?

/*
* @@@ AKPM: This seems rather over-defensive. We're giving commit
* a _lot_ of headroom: 1/4 of the journal plus the size of
* the committing transaction. Really, we only need to give it
* committing_transaction->t_outstanding_credits plus "enough" for
* the log control blocks.
* Also, this test is inconsitent with the matching one in
* journal_extend().
*/
if (__log_space_left(journal) < jbd_space_needed(journal)) {


2005-01-24 20:45:06

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: [Ext2-devel] [PATCH] JBD: fix against journal overflow

Hi,

On Mon, 2005-01-24 at 19:27, Alex Tomas wrote:

> oops. i overlooked this line. so, the fix becomes minor improvement patch ;)

Agreed, but a worthwhile one anyway. I'm still worried if you've seen
tests where this patch definitely cured a journal overflow, though ---
if so, it may be masking some other bug somewhere else.

> do you think the following can be improved?
>
> /*
> * @@@ AKPM: This seems rather over-defensive. We're giving commit
> * a _lot_ of headroom: 1/4 of the journal plus the size of
> * the committing transaction.
...

Possibly. But what this bit is doing is effectively chunking the
checkpoint operations --- if we have run out of journal space, we wait
until we've got enough room left for a maximally-sized transaction
before we let the new transaction start. And by doing that, we make
sure that the new transaction can grow to its full size before a commit
is forced.

The trouble is, it is not possible to wait for more log space during a
transaction's lifetime. You can't clear old log entries without making
sure that the buffers in them have been written back elsewhere, either
to more recent transactions in the log, or to the final writeback
store. And if you've got a buffer touched both in the oldest
transaction and the current one, then it's being journaled so you can't
do writeback; so you can't flush the old transaction from the log until
you've finished running and committing the current one.

So if we want to allow a transaction to grow to its full size, we *must*
wait for the log to have enough space for a maximal transaction before
we let the transaction start in the first place. And obviously, at that
point we don't know how large the transaction is going to get, so we
can't tell in advance whether we would be able to get away with a
smaller amount of space. :)

We could in theory monitor the usage, and if all the transactions being
committed are much smaller than maximum, we could shrink the space
requirement here. But I'm not convinced it's really worth it. (It
might be a small improvement for things like mail servers, though.)

Cheers,
Stephen