2006-05-18 08:26:11

by Menyhart Zoltan

Subject: Re: [PATCH] Change ll_rw_block() calls in JBD

> We must be sure that the current data in the buffer are sent to disk.
> Hence we have to call ll_rw_block() with SWRITE.

Let's consider the following case:

while (commit_transaction->t_sync_datalist) {
...

// Assume a "bh" got locked before starting this loop

if (buffer_locked(bh)) {
...
__journal_temp_unlink_buffer(jh);
__journal_file_buffer(jh, commit_transaction, BJ_Locked);
} else ...
}
...
while (commit_transaction->t_locked_list) {
...

// Assume our "bh" is not locked any more
// Nothing has happened to this "bh", someone just wanted
// to look at it in a safe way

if (buffer_jbd(bh) && jh->b_jlist == BJ_Locked) {
__journal_unfile_buffer(jh);
jbd_unlock_bh_state(bh);
journal_remove_journal_head(bh);
put_bh(bh);
} else ...
}

I.e. a "bh" that was already locked is simply refiled and then unfiled; its dirty data never get written out for this transaction.

Regards,

Zoltan Menyhart


2006-05-18 13:45:20

by Jan Kara

Subject: Re: [PATCH] Change ll_rw_block() calls in JBD

> >We must be sure that the current data in the buffer are sent to disk.
> >Hence we have to call ll_rw_block() with SWRITE.
>
> Let's consider the following case:
>
> while (commit_transaction->t_sync_datalist) {
> ...
>
> // Assume a "bh" got locked before starting this loop
>
> if (buffer_locked(bh)) {
> ...
> __journal_temp_unlink_buffer(jh);
> __journal_file_buffer(jh, commit_transaction,
> BJ_Locked);
> } else ...
> }
> ...
> while (commit_transaction->t_locked_list) {
> ...
>
> // Assume our "bh" is not locked any more
> // Nothing has happened to this "bh", someone just wanted
> // to look at it in a safe way
>
> if (buffer_jbd(bh) && jh->b_jlist == BJ_Locked) {
> __journal_unfile_buffer(jh);
> jbd_unlock_bh_state(bh);
> journal_remove_journal_head(bh);
> put_bh(bh);
> } else ...
> }
>
> I.e. a "bh" that was already locked is simply refiled and then unfiled; its dirty data never get written out for this transaction.
Yes, I'm aware of this problem. I actually wrote a patch (attached) for it
some time ago, but checking current kernels, it seems to have got lost
somewhere along the way. I'll rediff it and submit it again. Thanks for
spotting the problem.

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs


Attachments:
jbd-2.6.14-rc5-1-orderedwrite.diff (4.37 kB)

2006-05-18 15:11:59

by Menyhart Zoltan

Subject: Re: [PATCH] Change ll_rw_block() calls in JBD

Getting better :-)

> + was_dirty = buffer_dirty(bh);

Why don't we use "buffer_jbddirty()"?

> + if (was_dirty && test_set_buffer_locked(bh)) {
> + BUFFER_TRACE(bh, "needs blocking lock");
> + get_bh(bh);

Why do you need a "get_bh(bh)"?
"bh" is attached to "jh".
Can it go away while waiting for the buffer lock?
("jh" in on "t_sync_datalist", it cannot go away.)

> + spin_unlock(&journal->j_list_lock);
> + lock_buffer(bh);
> + spin_lock(&journal->j_list_lock);
> + /* Someone already cleaned up the buffer? Restart. */
> + if (!buffer_jbd(bh) || jh->b_jlist != BJ_SyncData) {

Who (else) can take away the journal head and remove our "jh" from the
sync data list?

> + else {
> + BUFFER_TRACE(bh, "needs writeout, submitting");
> __journal_temp_unlink_buffer(jh);
> __journal_file_buffer(jh, commit_transaction,
> BJ_Locked);

A simple "__journal_file_buffer(...)" would be enough, as it already includes:

if (jh->b_transaction)
__journal_temp_unlink_buffer(jh);


Wouldn't it be easier to read like this?

if ((!was_dirty && buffer_locked(bh))
|| (was_dirty && test_clear_buffer_dirty(bh))) {
BUFFER_TRACE(bh, "needs writeout, submitting");
__journal_temp_unlink_buffer(jh);
__journal_file_buffer(jh, commit_transaction,
BJ_Locked);
jbd_unlock_bh_state(bh);
if (was_dirty) {
get_bh(bh);
bh->b_end_io = end_buffer_write_sync;
submit_bh(WRITE, bh);
}
}
else {

BUFFER_TRACE(bh, "writeout complete: unfile");
__journal_unfile_buffer(jh);
jbd_unlock_bh_state(bh);
journal_remove_journal_head(bh);
if (was_dirty)
unlock_buffer(bh);
put_bh(bh);
}


As the sync data handling is a self-contained piece of work, couldn't it be
moved out of "journal_commit_transaction()" into its own function, like
"journal_write_revoke_records()"? (Just for better readability...)

Thanks,

Zoltan


2006-05-18 22:25:32

by Stephen C. Tweedie

Subject: Re: [PATCH] Change ll_rw_block() calls in JBD

Hi,

On Thu, 2006-05-18 at 17:11 +0200, Zoltan Menyhart wrote:

> > + was_dirty = buffer_dirty(bh);
>
> Why don't we use "buffer_jbddirty()"?

jbddirty is what we use for metadata that we don't want to get written
to disk yet.

For journaling to work, we must ensure that the transaction's metadata
gets written to, and committed in, the journal before it is allowed to be
written back to the main backing store. If we don't, and writeback occurs
early, then on a crash we can't roll back to the previous transaction.

So when we mark metadata as dirty, we set the jbddirty flag but do *not*
set the normal dirty flag; until the buffer is journaled, it is not
eligible for normal writeback.

This whole mechanism is only useful for metadata (plus data if we're in
data=journal mode). For normal ordered data writes, we never use
jbddirty because there's just no problem if the data is written before
the transaction completes.
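
Roughly, the flag dance looks like this (a sketch of the idea only, not
the exact code in fs/jbd/transaction.c; set_buffer_jbddirty() and
friends come from the BUFFER_FNS(JBDDirty, jbddirty) declaration in
include/linux/jbd.h):

	/* Filing a dirty metadata buffer into the journal: move the
	 * dirty state to the jbd-private flag, hiding the buffer from
	 * normal writeback until it has been committed. */
	static void file_metadata_buffer(struct buffer_head *bh)
	{
		if (test_clear_buffer_dirty(bh))
			set_buffer_jbddirty(bh);
	}

	/* After commit: the buffer may now reach the backing store. */
	static void metadata_committed(struct buffer_head *bh)
	{
		if (test_clear_buffer_jbddirty(bh))
			mark_buffer_dirty(bh);
	}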

--Stephen


2006-05-19 01:30:08

by Jan Kara

Subject: Re: [PATCH] Change ll_rw_block() calls in JBD

> Getting better :-)
>
> >+ was_dirty = buffer_dirty(bh);
>
> Why don't we use "buffer_jbddirty()"?
I think Stephen has already explained it... We have a data buffer and
hence we use buffer_dirty().

>
> >+ if (was_dirty && test_set_buffer_locked(bh)) {
> >+ BUFFER_TRACE(bh, "needs blocking lock");
> >+ get_bh(bh);
>
> Why do you need a "get_bh(bh)"?
> "bh" is attached to "jh".
> Can it go away while waiting for the buffer lock?
> ("jh" in on "t_sync_datalist", it cannot go away.)
>
> >+ spin_unlock(&journal->j_list_lock);
> >+ lock_buffer(bh);
> >+ spin_lock(&journal->j_list_lock);
> >+ /* Someone already cleaned up the buffer? Restart. */
> >+ if (!buffer_jbd(bh) || jh->b_jlist != BJ_SyncData) {
>
> Who (else) can take away the journal head and remove our "jh" from the
> sync data list?
For two of the above comments: Under memory pressure data buffers can
be written out earlier and then released by __journal_try_to_free_buffer()
as they are not dirty any more. The above checks protect us against this.
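
In other words, the pattern is (a sketch only; the details in the patch
differ): pin the bh so it cannot be freed while we sleep, then
revalidate the jh once we have the locks back:

	get_bh(bh);			/* pin across the sleep */
	spin_unlock(&journal->j_list_lock);
	lock_buffer(bh);		/* may sleep */
	spin_lock(&journal->j_list_lock);
	/* Did someone write out and unfile the buffer meanwhile? */
	if (!buffer_jbd(bh) || jh->b_jlist != BJ_SyncData) {
		unlock_buffer(bh);
		put_bh(bh);
		continue;		/* restart the scan */
	}
	put_bh(bh);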

> >+ else {
> >+ BUFFER_TRACE(bh, "needs writeout, submitting");
> > __journal_temp_unlink_buffer(jh);
> > __journal_file_buffer(jh, commit_transaction,
> > BJ_Locked);
>
> A simple "__journal_file_buffer(...)" would be enough, as it already includes:
>
> if (jh->b_transaction)
> __journal_temp_unlink_buffer(jh);
Yes, you are right here.

> Wouldn't it be easier to read like this?
>
> if ((!was_dirty && buffer_locked(bh))
> || (was_dirty && test_clear_buffer_dirty(bh))) {
> BUFFER_TRACE(bh, "needs writeout, submitting");
> __journal_temp_unlink_buffer(jh);
> __journal_file_buffer(jh, commit_transaction,
> BJ_Locked);
> jbd_unlock_bh_state(bh);
> if (was_dirty) {
> get_bh(bh);
> bh->b_end_io = end_buffer_write_sync;
> submit_bh(WRITE, bh);
> }
> }
> else {
>
> BUFFER_TRACE(bh, "writeout complete: unfile");
> __journal_unfile_buffer(jh);
> jbd_unlock_bh_state(bh);
> journal_remove_journal_head(bh);
> if (was_dirty)
> unlock_buffer(bh);
> put_bh(bh);
> }
So you basically mean switching those two branches of the if... OK, maybe.

> As the sync data handling is a self-contained piece of work, couldn't it be
> moved out of "journal_commit_transaction()" into its own function, like
> "journal_write_revoke_records()"? (Just for better readability...)
Yes, moving it to a new function would probably improve the readability.
Thanks for the suggestions.

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2006-05-19 10:01:47

by Menyhart Zoltan

Subject: Re: [PATCH] Change ll_rw_block() calls in JBD

I'd like to take this opportunity to ask some questions
I always wanted to ask about committing a transaction(*)
but was afraid to ask :-)

We wait for several types of I/O:
- "Wait for all previously submitted IO to complete" (t_sync_datalist)
- wait_for_iobuf: (t_buffers)
- wait_for_ctlbuf: (t_log_list)
before "journal_write_commit_record()" gets invoked.

Why don't we wait for all of them in one batch at the very last moment? Is
it important that e.g. all the buffers from "t_buffers" be
completed before we start with the ones on "t_log_list"?

If "JFS_BARRIER" is set, why do we wait for these I/O-s at all?
(The "ordered" attribute is set => all the previous I/O-s must have hit
the permanent storage before the commit record can do.)

Why do we let the ext3 layer decide? Why don't we ask the "bio" layer
whether the "ordered" attribute is supported and use it systematically?

There is a comment in "journal_write_commit_record()":

/* is it possible for another commit to fail at roughly
* the same time as this one?...

Another commit for the same journal?
(If not the same journal, why is it a problem?)

If a barrier-based sync has failed on a device, does the actual
I/O get started anyway (not caring about the ordering issue)?

Thanks,

Zoltan

2006-05-19 12:27:03

by Stephen C. Tweedie

Subject: Re: [PATCH] Change ll_rw_block() calls in JBD

Hi,

On Fri, 2006-05-19 at 12:01 +0200, Zoltan Menyhart wrote:
> I'd like to take this opportunity to ask some questions
> I always wanted to ask about committing a transaction(*)
> but was afraid to ask :-)
>
> We wait for several types of I/O:
> - "Wait for all previously submitted IO to complete" (t_sync_datalist)
> - wait_for_iobuf: (t_buffers)
> - wait_for_ctlbuf: (t_log_list)
> before "journal_write_commit_record()" gets invoked.
>
> Why don't we wait for all of them in one batch at the very last moment? Is
> it important that e.g. all the buffers from "t_buffers" be
> completed before we start with the ones on "t_log_list"?

We don't. We write the control and journaled metadata buffers first,
recording them on the t_buffers and t_log_list lists. We then wait on
both of those lists, but there's no IO being submitted at that stage and
the order in which we wait on them has no effect at all on the progress
of the IO.

The sync_datalist writes could be waited for separately at the end if we
wanted to. That code was written a _long_ time ago but I vaguely
recall some reasoning that we may have a lot more data than metadata, so
it would be reasonable to clean up the data writes as quickly as possible
and to prevent data and metadata writes from getting submitted in
parallel and potentially causing seeks between the two. The elevator
should prevent the latter effect, though, so it would be entirely
reasonable to move that wait to later on in commit if we wanted to.
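
For reference, the commit path in fs/jbd/commit.c runs roughly like this
(a simplified outline from memory, not verbatim):

	journal_commit_transaction():
		write out t_sync_datalist data buffers and wait for them
		journal_write_revoke_records()
		write descriptor + metadata blocks to the journal (t_buffers)
		wait_for_iobuf:  wait for the metadata writes
		wait_for_ctlbuf: wait for the control blocks (t_log_list)
		journal_write_commit_record()
		refile the committed buffers, checkpoint bookkeeping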

> If "JFS_BARRIER" is set, why do we wait for these I/O-s at all?
> (The "ordered" attribute is set => all the previous I/O-s must have hit
> the permanent storage before the commit record can do.)

> Why do we let the ext3 layer decide? Why don't we ask the "bio" layer
> whether the "ordered" attribute is supported and use it systematically?

It would be entirely reasonable to do that, sure. It just hasn't been
done yet.

> There is a comment in "journal_write_commit_record()":
>
> /* is it possible for another commit to fail at roughly
> * the same time as this one?...
>
> Another commit for the same journal?

Not likely. :-)

> If a barrier-based sync has failed on a device, does the actual
> I/O get started anyway (not caring about the ordering issue)?

It fails with EOPNOTSUPP and must be retried, iirc.
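
The fallback in journal_write_commit_record() looks roughly like this
(a sketch; set_buffer_ordered() is the BH_Ordered helper from
buffer_head.h):

	set_buffer_ordered(bh);
	ret = sync_dirty_buffer(bh);		/* barrier write */
	if (ret == -EOPNOTSUPP) {
		/* The device cannot do ordered writes: retry as a
		 * plain write instead. */
		clear_buffer_ordered(bh);
		set_buffer_uptodate(bh);
		set_buffer_dirty(bh);
		ret = sync_dirty_buffer(bh);
	}

The real code also clears JFS_BARRIER in the journal so that later
commits do not attempt the barrier again.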

--Stephen


2006-05-19 12:35:51

by Menyhart Zoltan

Subject: Re: [PATCH] Change ll_rw_block() calls in JBD

Jan Kara wrote:

>>>+ if (!buffer_jbd(bh) || jh->b_jlist != BJ_SyncData) {
>>
> >Who (else) can take away the journal head and remove our "jh" from the
> >sync data list?
>
> For two of the above comments: Under memory pressure data buffers can
> be written out earlier and then released by __journal_try_to_free_buffer()
> as they are not dirty any more. The above checks protect us against this.

Assume "bh" has been set free in the mean time.
Assume it is now used for another transaction (maybe for another file system).

The first part of the test should verify not only that "bh" is attached to
_any_ journal head, but that it is attached to exactly our current one:

if (bh2jh(bh) != jh || ...

Thanks,

Zoltan


2006-05-19 15:05:49

by Stephen C. Tweedie

Subject: Re: [PATCH] Change ll_rw_block() calls in JBD

Hi,

On Fri, 2006-05-19 at 14:33 +0200, Zoltan Menyhart wrote:

> > For two of the above comments: Under memory pressure data buffers can
> > be written out earlier and then released by __journal_try_to_free_buffer()
> > as they are not dirty any more. The above checks protect us against this.
>
> Assume "bh" has been set free in the mean time.
> Assume it is now used for another transaction (maybe for another file system).

I don't follow --- we already have a refcount to the bh, how can it be
reused? We took the j_list_lock first, then looked up the jh on this
transaction's sync data list, then did the get_bh() without dropping
j_list_lock; so by the time we take the refcount, the j_list_lock
ensures it cannot have come off this transaction. And subsequently, the
refcount prevents reuse.

--Stephen


2006-05-19 15:07:04

by Stephen C. Tweedie

Subject: Re: [PATCH] Change ll_rw_block() calls in JBD

Hi,

On Thu, 2006-05-18 at 15:45 +0200, Jan Kara wrote:

> Yes, I'm aware of this problem. I actually wrote a patch (attached) for it
> some time ago, but checking current kernels, it seems to have got lost
> somewhere along the way. I'll rediff it and submit it again. Thanks for
> spotting the problem.

...
>
> + was_dirty = buffer_dirty(bh);
> + if (was_dirty && test_set_buffer_locked(bh)) {

The way was_dirty is used here seems confusing and hard to read; there
are completely separate code paths for dirty and non-dirty, lockable and
already-locked buffers, with all the paths interleaved to share a few
common bits of locking. It would be far more readable with any sharable
locking code simply removed to a separate function (such as we already
have for inverted_lock(), for example), and the different dirty/clean
logic laid out separately. Otherwise the code is littered with

> + if (was_dirty)
> + unlock_buffer(bh);

and it's not obvious at any point just what locks are held.

Having said that, it looks like it should work --- it just took more
effort than it should have to check!

--Stephen


2006-05-23 16:01:33

by Menyhart Zoltan

Subject: Re: [PATCH] Change ll_rw_block() calls in JBD

Please have a look at my version of the "write_out_data:" loop.

Thanks,

Zoltan

--- linux-2.6.16.16-save/fs/jbd/commit.c 2006-05-19 15:00:50.000000000 +0200
+++ linux-2.6.16.16/fs/jbd/commit.c 2006-05-23 17:44:46.000000000 +0200
@@ -161,6 +161,104 @@
}

/*
+ * Flush data from the "->t_sync_datalist" of the committing transaction.
+ */
+static void write_out_sync_data(journal_t *journal,
+ transaction_t *commit_transaction)
+{
+ struct journal_head *jh;
+ struct buffer_head *bh;
+ int err = 0;
+
+ /*
+ * Whenever we unlock the journal and sleep, things can get removed
+ * from "->t_sync_datalist" by "journal_dirty_data()", so we have
+ * to keep looping back to write_out_data until we *know* that the
+ * list is empty.
+ *
+ * Cleanup any flushed data buffers from the data list. Even in
+ * abort mode, we want to flush this out as soon as possible.
+ */
+write_out_data:
+ cond_resched();
+ spin_lock(&journal->j_list_lock);
+ while ((jh = commit_transaction->t_sync_datalist) != NULL){
+ bh = jh2bh(jh);
+ if (buffer_locked(bh)){ /* Unsafe */
+ BUFFER_TRACE(bh, "locked");
+ if (!inverted_lock(journal, bh)) /* jbd_lock_bh_state */
+ goto write_out_data;
+ /* "bh" may have been unlocked in the mean time */
+ __journal_file_buffer(jh, commit_transaction, BJ_Locked);
+ jbd_unlock_bh_state(bh);
+ } else {
+ /* "bh" may have become locked in the mean time */
+ if (buffer_dirty(bh)){ /* Unsafe */
+ if (test_set_buffer_locked(bh)){
+ BUFFER_TRACE(bh, "locked");
+ continue; /* Put on the BJ_Locked list */
+ }
+ if (test_clear_buffer_dirty(bh)){
+ BUFFER_TRACE(bh, "start journal wr");
+ get_bh(bh);
+ bh->b_end_io = end_buffer_write_sync;
+ submit_bh(WRITE, bh);
+ continue; /* Put on the BJ_Locked list */
+ }
+ unlock_buffer(bh);
+ } else {
+ /* "bh" may have become dirty in the mean time */
+ /* Just do nothing for this transaction */
+ }
+ BUFFER_TRACE(bh, "writeout complete: unfile");
+ if (!inverted_lock(journal, bh)) /* jbd_lock_bh_state */
+ goto write_out_data;
+ __journal_unfile_buffer(jh);
+ jbd_unlock_bh_state(bh);
+ journal_remove_journal_head(bh);
+ put_bh(bh);
+ }
+ if (lock_need_resched(&journal->j_list_lock)){
+ spin_unlock(&journal->j_list_lock);
+ goto write_out_data;
+ }
+ }
+ /*
+ * Wait for all previously submitted IO to complete.
+ */
+ while (commit_transaction->t_locked_list) {
+ jh = commit_transaction->t_locked_list->b_tprev;
+ bh = jh2bh(jh);
+ get_bh(bh);
+ if (buffer_locked(bh)) {
+ spin_unlock(&journal->j_list_lock);
+ wait_on_buffer(bh);
+ if (unlikely(!buffer_uptodate(bh)))
+ err = -EIO;
+ spin_lock(&journal->j_list_lock);
+ }
+ if (!inverted_lock(journal, bh)) {
+ put_bh(bh);
+ spin_lock(&journal->j_list_lock);
+ continue;
+ }
+ if (buffer_jbd(bh) && jh->b_jlist == BJ_Locked) {
+ __journal_unfile_buffer(jh);
+ jbd_unlock_bh_state(bh);
+ journal_remove_journal_head(bh);
+ put_bh(bh);
+ } else {
+ jbd_unlock_bh_state(bh);
+ }
+ put_bh(bh);
+ cond_resched_lock(&journal->j_list_lock);
+ }
+ spin_unlock(&journal->j_list_lock);
+ if (err)
+ __journal_abort_hard(journal);
+}
+
+/*
* journal_commit_transaction
*
* The primary function for committing a transaction to the log. This
@@ -173,7 +271,7 @@
struct buffer_head **wbuf = journal->j_wbuf;
int bufs;
int flags;
- int err;
+ int err = 0;
unsigned long blocknr;
char *tagp = NULL;
journal_header_t *header;
@@ -313,113 +411,7 @@
* Now start flushing things to disk, in the order they appear
* on the transaction lists. Data blocks go first.
*/
-
- err = 0;
- /*
- * Whenever we unlock the journal and sleep, things can get added
- * onto ->t_sync_datalist, so we have to keep looping back to
- * write_out_data until we *know* that the list is empty.
- */
- bufs = 0;
- /*
- * Cleanup any flushed data buffers from the data list. Even in
- * abort mode, we want to flush this out as soon as possible.
- */
-write_out_data:
- cond_resched();
- spin_lock(&journal->j_list_lock);
-
- while (commit_transaction->t_sync_datalist) {
- struct buffer_head *bh;
-
- jh = commit_transaction->t_sync_datalist;
- commit_transaction->t_sync_datalist = jh->b_tnext;
- bh = jh2bh(jh);
- if (buffer_locked(bh)) {
- BUFFER_TRACE(bh, "locked");
- if (!inverted_lock(journal, bh))
- goto write_out_data;
- __journal_temp_unlink_buffer(jh);
- __journal_file_buffer(jh, commit_transaction,
- BJ_Locked);
- jbd_unlock_bh_state(bh);
- if (lock_need_resched(&journal->j_list_lock)) {
- spin_unlock(&journal->j_list_lock);
- goto write_out_data;
- }
- } else {
- if (buffer_dirty(bh)) {
- BUFFER_TRACE(bh, "start journal writeout");
- get_bh(bh);
- wbuf[bufs++] = bh;
- if (bufs == journal->j_wbufsize) {
- jbd_debug(2, "submit %d writes\n",
- bufs);
- spin_unlock(&journal->j_list_lock);
- ll_rw_block(SWRITE, bufs, wbuf);
- journal_brelse_array(wbuf, bufs);
- bufs = 0;
- goto write_out_data;
- }
- } else {
- BUFFER_TRACE(bh, "writeout complete: unfile");
- if (!inverted_lock(journal, bh))
- goto write_out_data;
- __journal_unfile_buffer(jh);
- jbd_unlock_bh_state(bh);
- journal_remove_journal_head(bh);
- put_bh(bh);
- if (lock_need_resched(&journal->j_list_lock)) {
- spin_unlock(&journal->j_list_lock);
- goto write_out_data;
- }
- }
- }
- }
-
- if (bufs) {
- spin_unlock(&journal->j_list_lock);
- ll_rw_block(SWRITE, bufs, wbuf);
- journal_brelse_array(wbuf, bufs);
- spin_lock(&journal->j_list_lock);
- }
-
- /*
- * Wait for all previously submitted IO to complete.
- */
- while (commit_transaction->t_locked_list) {
- struct buffer_head *bh;
-
- jh = commit_transaction->t_locked_list->b_tprev;
- bh = jh2bh(jh);
- get_bh(bh);
- if (buffer_locked(bh)) {
- spin_unlock(&journal->j_list_lock);
- wait_on_buffer(bh);
- if (unlikely(!buffer_uptodate(bh)))
- err = -EIO;
- spin_lock(&journal->j_list_lock);
- }
- if (!inverted_lock(journal, bh)) {
- put_bh(bh);
- spin_lock(&journal->j_list_lock);
- continue;
- }
- if (buffer_jbd(bh) && jh->b_jlist == BJ_Locked) {
- __journal_unfile_buffer(jh);
- jbd_unlock_bh_state(bh);
- journal_remove_journal_head(bh);
- put_bh(bh);
- } else {
- jbd_unlock_bh_state(bh);
- }
- put_bh(bh);
- cond_resched_lock(&journal->j_list_lock);
- }
- spin_unlock(&journal->j_list_lock);
-
- if (err)
- __journal_abort_hard(journal);
+ write_out_sync_data(journal, commit_transaction);

journal_write_revoke_records(journal, commit_transaction);

2006-05-24 09:14:45

by Menyhart Zoltan

Subject: Re: [PATCH] Change ll_rw_block() calls in JBD

Yesterday I forgot a small "spin_unlock(&journal->j_list_lock);".

Thanks,

Zoltan

--- linux-2.6.16.16-save/fs/jbd/commit.c 2006-05-19 15:00:50.000000000 +0200
+++ linux-2.6.16.16/fs/jbd/commit.c 2006-05-24 10:43:32.000000000 +0200
@@ -161,6 +161,107 @@
}

/*
+ * Flush data from the "->t_sync_datalist" of the committing transaction.
+ */
+static void write_out_sync_data(journal_t *journal,
+ transaction_t *commit_transaction)
+{
+ struct journal_head *jh;
+ struct buffer_head *bh;
+ int err = 0;
+
+ /*
+ * Whenever we unlock the journal and sleep, things can get removed
+ * from "->t_sync_datalist" by "journal_dirty_data()", so we have
+ * to keep looping back to write_out_data until we *know* that the
+ * list is empty.
+ *
+ * Cleanup any flushed data buffers from the data list. Even in
+ * abort mode, we want to flush this out as soon as possible.
+ */
+write_out_data:
+ cond_resched();
+ spin_lock(&journal->j_list_lock);
+ while ((jh = commit_transaction->t_sync_datalist) != NULL){
+ bh = jh2bh(jh);
+ if (buffer_locked(bh)){ /* Unsafe */
+ BUFFER_TRACE(bh, "locked");
+ if (!inverted_lock(journal, bh)) /* jbd_lock_bh_state */
+ goto write_out_data;
+ /* "bh" may have been unlocked in the mean time */
+ __journal_file_buffer(jh, commit_transaction, BJ_Locked);
+ jbd_unlock_bh_state(bh);
+ } else {
+ /* "bh" may have become locked in the mean time */
+ if (buffer_dirty(bh)){ /* Unsafe */
+ if (test_set_buffer_locked(bh)){
+ BUFFER_TRACE(bh, "locked");
+ /* Put it on the BJ_Locked list */
+ continue;
+ }
+ if (test_clear_buffer_dirty(bh)){
+ BUFFER_TRACE(bh, "start journal wr");
+ spin_unlock(&journal->j_list_lock);
+ get_bh(bh);
+ bh->b_end_io = end_buffer_write_sync;
+ submit_bh(WRITE, bh);
+ /* Put it on the BJ_Locked list */
+ goto write_out_data;
+ }
+ unlock_buffer(bh);
+ } else {
+ /* "bh" may have become dirty in the mean time */
+ /* Just do nothing for this transaction */
+ }
+ BUFFER_TRACE(bh, "writeout complete: unfile");
+ if (!inverted_lock(journal, bh)) /* jbd_lock_bh_state */
+ goto write_out_data;
+ __journal_unfile_buffer(jh);
+ jbd_unlock_bh_state(bh);
+ journal_remove_journal_head(bh);
+ put_bh(bh);
+ }
+ if (lock_need_resched(&journal->j_list_lock)){
+ spin_unlock(&journal->j_list_lock);
+ goto write_out_data;
+ }
+ }
+ /*
+ * Wait for all previously submitted IO to complete.
+ */
+ while (commit_transaction->t_locked_list) {
+ jh = commit_transaction->t_locked_list->b_tprev;
+ bh = jh2bh(jh);
+ get_bh(bh);
+ if (buffer_locked(bh)) {
+ spin_unlock(&journal->j_list_lock);
+ wait_on_buffer(bh);
+ if (unlikely(!buffer_uptodate(bh)))
+ err = -EIO;
+ spin_lock(&journal->j_list_lock);
+ }
+ if (!inverted_lock(journal, bh)) {
+ put_bh(bh);
+ spin_lock(&journal->j_list_lock);
+ continue;
+ }
+ if (buffer_jbd(bh) && jh->b_jlist == BJ_Locked) {
+ __journal_unfile_buffer(jh);
+ jbd_unlock_bh_state(bh);
+ journal_remove_journal_head(bh);
+ put_bh(bh);
+ } else {
+ jbd_unlock_bh_state(bh);
+ }
+ put_bh(bh);
+ cond_resched_lock(&journal->j_list_lock);
+ }
+ spin_unlock(&journal->j_list_lock);
+ if (err)
+ __journal_abort_hard(journal);
+}
+
+/*
* journal_commit_transaction
*
* The primary function for committing a transaction to the log. This
@@ -173,7 +274,7 @@
struct buffer_head **wbuf = journal->j_wbuf;
int bufs;
int flags;
- int err;
+ int err = 0;
unsigned long blocknr;
char *tagp = NULL;
journal_header_t *header;
@@ -313,113 +414,7 @@
* Now start flushing things to disk, in the order they appear
* on the transaction lists. Data blocks go first.
*/
-
- err = 0;
- /*
- * Whenever we unlock the journal and sleep, things can get added
- * onto ->t_sync_datalist, so we have to keep looping back to
- * write_out_data until we *know* that the list is empty.
- */
- bufs = 0;
- /*
- * Cleanup any flushed data buffers from the data list. Even in
- * abort mode, we want to flush this out as soon as possible.
- */
-write_out_data:
- cond_resched();
- spin_lock(&journal->j_list_lock);
-
- while (commit_transaction->t_sync_datalist) {
- struct buffer_head *bh;
-
- jh = commit_transaction->t_sync_datalist;
- commit_transaction->t_sync_datalist = jh->b_tnext;
- bh = jh2bh(jh);
- if (buffer_locked(bh)) {
- BUFFER_TRACE(bh, "locked");
- if (!inverted_lock(journal, bh))
- goto write_out_data;
- __journal_temp_unlink_buffer(jh);
- __journal_file_buffer(jh, commit_transaction,
- BJ_Locked);
- jbd_unlock_bh_state(bh);
- if (lock_need_resched(&journal->j_list_lock)) {
- spin_unlock(&journal->j_list_lock);
- goto write_out_data;
- }
- } else {
- if (buffer_dirty(bh)) {
- BUFFER_TRACE(bh, "start journal writeout");
- get_bh(bh);
- wbuf[bufs++] = bh;
- if (bufs == journal->j_wbufsize) {
- jbd_debug(2, "submit %d writes\n",
- bufs);
- spin_unlock(&journal->j_list_lock);
- ll_rw_block(SWRITE, bufs, wbuf);
- journal_brelse_array(wbuf, bufs);
- bufs = 0;
- goto write_out_data;
- }
- } else {
- BUFFER_TRACE(bh, "writeout complete: unfile");
- if (!inverted_lock(journal, bh))
- goto write_out_data;
- __journal_unfile_buffer(jh);
- jbd_unlock_bh_state(bh);
- journal_remove_journal_head(bh);
- put_bh(bh);
- if (lock_need_resched(&journal->j_list_lock)) {
- spin_unlock(&journal->j_list_lock);
- goto write_out_data;
- }
- }
- }
- }
-
- if (bufs) {
- spin_unlock(&journal->j_list_lock);
- ll_rw_block(SWRITE, bufs, wbuf);
- journal_brelse_array(wbuf, bufs);
- spin_lock(&journal->j_list_lock);
- }
-
- /*
- * Wait for all previously submitted IO to complete.
- */
- while (commit_transaction->t_locked_list) {
- struct buffer_head *bh;
-
- jh = commit_transaction->t_locked_list->b_tprev;
- bh = jh2bh(jh);
- get_bh(bh);
- if (buffer_locked(bh)) {
- spin_unlock(&journal->j_list_lock);
- wait_on_buffer(bh);
- if (unlikely(!buffer_uptodate(bh)))
- err = -EIO;
- spin_lock(&journal->j_list_lock);
- }
- if (!inverted_lock(journal, bh)) {
- put_bh(bh);
- spin_lock(&journal->j_list_lock);
- continue;
- }
- if (buffer_jbd(bh) && jh->b_jlist == BJ_Locked) {
- __journal_unfile_buffer(jh);
- jbd_unlock_bh_state(bh);
- journal_remove_journal_head(bh);
- put_bh(bh);
- } else {
- jbd_unlock_bh_state(bh);
- }
- put_bh(bh);
- cond_resched_lock(&journal->j_list_lock);
- }
- spin_unlock(&journal->j_list_lock);
-
- if (err)
- __journal_abort_hard(journal);
+ write_out_sync_data(journal, commit_transaction);

journal_write_revoke_records(journal, commit_transaction);


2006-05-24 17:18:08

by Jan Kara

Subject: Re: [PATCH] Change ll_rw_block() calls in JBD

Hello,

I tested my version of the patch yesterday, so I'll post it in
another mail... Anyway, I have two objections to your version:
1) It does not solve the original problem, namely that a 'locked'
buffer does not mean a 'buffer being written'. You cannot just refile
such buffers; you really have to check the dirty bit...
2) You drop j_list_lock after writing each single buffer. That may
impact performance significantly (originally we dropped it only once
per batch of buffers).

Honza

<snip>

--
Jan Kara <[email protected]>
SuSE CR Labs

2006-05-24 17:32:29

by Jan Kara

Subject: Re: [PATCH] Change ll_rw_block() calls in JBD

Hello,

> On Thu, 2006-05-18 at 15:45 +0200, Jan Kara wrote:
<snip>

> The way was_dirty is used here seems confusing and hard to read; there
> are completely separate code paths for dirty and non-dirty, lockable and
> already-locked buffers, with all the paths interleaved to share a few
> common bits of locking. It would be far more readable with any sharable
> locking code simply removed to a separate function (such as we already
> have for inverted_lock(), for example), and the different dirty/clean
> logic laid out separately. Otherwise the code is littered with
>
> > + if (was_dirty)
> > + unlock_buffer(bh);
>
> and it's not obvious at any point just what locks are held.
>
> Having said that, it looks like it should work --- it just took more
> effort than it should have to check!
Thanks for the comments. I have written a new patch that tries to address
both your and Zoltan's comments. I've also fixed a bug caused by calling
submit_bh() under j_list_lock (submit_bh() can sleep when allocating the
bio). That actually required some more changes, because I don't want to
drop j_list_lock for writing each buffer, but I also want to keep the
buffer locked so that I can immediately refile it to the BJ_Locked list.
If I just released the buffer lock, I would have to keep the buffer on
the BJ_SyncData list, and I'm afraid that could cause live-locks if
somebody kept redirtying the buffer all the time.
So I'm not totally convinced this is the right way of doing things.
Anyway, please have a look at the code and tell me what you think about
it. Thanks.

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs


Attachments:
jbd-2.6.17-rc3-1-orderedwrite.diff (6.48 kB)

2006-05-30 15:36:23

by Menyhart Zoltan

Subject: Re: [PATCH] Change ll_rw_block() calls in JBD

I have got a concern about this code fragment:

> + if (buffer_dirty(bh)) {
> + if (test_set_buffer_locked(bh)) {
> + ...
> + get_bh(bh);
> + spin_unlock(&journal->j_list_lock);
> + /* Submit all accumulated buffers to prevent
> + * possible deadlocks */
> + submit_buffers(bufs, wbuf);
> + bufs = 0;
> + lock_buffer(bh);
> + spin_lock(&journal->j_list_lock);
> + /* Someone already cleaned up the buffer? */
> + if (!buffer_jbd(bh)
> + || jh->b_jlist != BJ_SyncData) {
> + unlock_buffer(bh);
> + BUFFER_TRACE(bh, "already cleaned up");
> + put_bh(bh);
> + continue;
> + }
> + put_bh(bh);
> + }
> + if (test_clear_buffer_dirty(bh)) {
> + BUFFER_TRACE(bh, "needs writeout, submitting");
> + get_bh(bh);
> + wbuf[bufs++] = bh;
> + if (bufs == journal->j_wbufsize) {
> + spin_unlock(&journal->j_list_lock);
> + /* Writeout will be done at the
> + * beginning of the loop */
> + goto write_out_data;
> + }
> + }

We lock up to "->j_wbufsize" buffers, one after the other.

Originally, we took a buffer, did something to it, and released it.
When we wanted two of them and the second one was not available, we
released the first one, too, in order to avoid deadlocks.

Here we keep a number of buffers locked for an unpredictably long time
(up to N-1 buffers locked while waiting for the Nth one), and we neither
impose nor respect any locking order...

The original code did not lock the buffers while it was composing the
list of buffers. "ll_rw_block()" locked the buffers one by one and
submitted them to the BIO layer. These buffers eventually got unlocked,
possibly before some of the last buffers got locked by "ll_rw_block()".

This change implies an important difference in locking behavior.

Regards,

Zoltan

2006-05-30 16:39:34

by Jan Kara

Subject: Re: [PATCH] Change ll_rw_block() calls in JBD

Hello,

> I have got a concern about this code fragment:
>
> >+ if (buffer_dirty(bh)) {
> >+ if (test_set_buffer_locked(bh)) {
> >+ ...
> >+ get_bh(bh);
> >+ spin_unlock(&journal->j_list_lock);
> >+ /* Submit all accumulated buffers to prevent
> >+ * possible deadlocks */
> >+ submit_buffers(bufs, wbuf);
> >+ bufs = 0;
> >+ lock_buffer(bh);
> >+ spin_lock(&journal->j_list_lock);
> >+ /* Someone already cleaned up the buffer? */
> >+ if (!buffer_jbd(bh)
> >+ || jh->b_jlist != BJ_SyncData) {
> >+ unlock_buffer(bh);
> >+ BUFFER_TRACE(bh, "already cleaned up");
> >+ put_bh(bh);
> >+ continue;
> >+ }
> >+ put_bh(bh);
> >+ }
> >+ if (test_clear_buffer_dirty(bh)) {
> >+ BUFFER_TRACE(bh, "needs writeout, submitting");
> >+ get_bh(bh);
> >+ wbuf[bufs++] = bh;
> >+ if (bufs == journal->j_wbufsize) {
> >+ spin_unlock(&journal->j_list_lock);
> >+ /* Writeout will be done at the
> >+ * beginning of the loop */
> >+ goto write_out_data;
> >+ }
> >+ }
>
> We lock up to "->j_wbufsize" buffers, one after the other.
>
> Originally, we took a buffer, did something to it, and released it.
> When we wanted two of them and the second one was not available, we
> released the first one, too, in order to avoid deadlocks.
>
> Here we keep a number of buffers locked for an unpredictably long time
> (up to N-1 buffers locked while waiting for the Nth one), and we neither
> impose nor respect any locking order...
You are not right here. This is exactly why we do submit_buffers()
before we do lock_buffer(). So either we get the lock without waiting via
test_set_buffer_locked(), or we submit all the accumulated buffers for IO
(and hence they will get unlocked eventually).
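
Schematically (a sketch of the logic, not the patch verbatim):

	if (test_set_buffer_locked(bh)) {
		/* Trylock failed: first submit everything we already
		 * locked, so that IO completion unlocks it, and only
		 * then block on this buffer ourselves. */
		get_bh(bh);
		spin_unlock(&journal->j_list_lock);
		submit_buffers(bufs, wbuf);
		bufs = 0;
		lock_buffer(bh);	/* safe to sleep now */
		spin_lock(&journal->j_list_lock);
	}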

> The original code did not lock the buffers while it was composing the
> list of buffers. "ll_rw_block()" locked the buffers one by one and
> submitted them to the BIO layer. These buffers eventually got unlocked,
> possibly before some of the last buffers got locked by "ll_rw_block()".
>
> This change implies an important difference in locking behavior.
I agree the locking behaviour has changed a bit. But not as much as you
thought.

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs