Hello,
the attached patch should fix the handling of ordered data buffers in
journal_commit_transaction(). Currently the code simply moves a buffer
from the t_sync_data list to the t_locked list when it finds the buffer
locked. But in theory (I'm not sure this currently really happens for
ext3 ordered data buffers) a buffer can be locked just for "examination"
purposes without being sent to disk. Hence it could happen that we fail
to write some ordered buffer when we should.
What the patch does is write the buffer whenever it is dirty, even if it
is locked. Hence we can call ll_rw_block() on a buffer that is currently
under write-out, but in that case ll_rw_block() just waits until the IO
completes and returns (the buffer won't be dirty anymore), and this race
with pdflush should be rare enough not to affect performance. The patch
also moves the buffer to the t_locked list immediately after queuing it
for write-out (otherwise we would have to detect which buffers are
already queued, and that's not nice). This changes the logic of the
t_locked list a bit: for users other than journal_commit_transaction(),
an unlocked buffer on the list no longer means the buffer has already
been written. But only journal_unmap_buffer() cares about this change,
and I changed it to check that the buffer is not dirty.
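In outline, the patched write_out_data loop does something like this (a
simplified sketch, not the actual patch - locking and the flushing of a
full wbuf[] batch are elided):

        while (commit_transaction->t_sync_datalist) {
                jh = commit_transaction->t_sync_datalist;
                bh = jh2bh(jh);
                if (buffer_dirty(bh)) {
                        /* dirty - queue it even if it is locked; SWRITE
                         * below makes ll_rw_block() wait for the lock */
                        wbuf[bufs++] = bh;
                        /* move to t_locked_list right away so we never
                         * pick the same buffer up twice */
                        __journal_temp_unlink_buffer(jh);
                        __journal_file_buffer(jh, commit_transaction,
                                              BJ_Locked);
                } else {
                        /* clean - just drop it from the transaction */
                        __journal_unfile_buffer(jh);
                }
        }
        ll_rw_block(SWRITE, bufs, wbuf);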
Andrew, if you think the change is fine, please put it into -mm so
that it gets some testing.
Honza
--
Jan Kara <[email protected]>
SuSE CR Labs
Jan Kara <[email protected]> wrote:
>
> When a buffer is locked it does not mean that write-out is in progress. We
> have to check whether the buffer is dirty, and if it is, we have to submit
> it for write-out. We unconditionally move the buffer to t_locked_list so
> that we cannot mistake an unprocessed buffer for one already given to
> ll_rw_block(). This subtly changes the meaning of buffer states in
> t_locked_list - an unlocked buffer (for list users other than
> journal_commit_transaction()) no longer means it has been written. But
> only journal_unmap_buffer() cares, and it now checks that the buffer
> is not dirty.
Seems complex. It means that t_locked_list takes on an additional (and
undocumented!) meaning.
Also, I don't think it works. See ll_rw_block()'s handling of
already-locked buffers..
An alternative is to just lock the buffer in journal_commit_transaction(),
if it was locked-and-dirty. And remove the call to ll_rw_block() and
submit the locked buffers by hand.
That would mean that if someone had redirtied a buffer which was on
t_sync_datalist *while* it was under writeout, we'd end up waiting on that
writeout to complete before submitting more I/O. But I suspect that's
pretty rare.
One thing which concerns me with your approach is livelocks: if some process
sits in a tight loop writing to the same part of the same file, will it
cause kjournald to get stuck?
The problem we have here is "was the buffer dirtied before this commit
started, or after?". In the former case we are obliged to write it. In
the latter case we are not, and in trying to do so we risk livelocking.
> Jan Kara <[email protected]> wrote:
> >
> > When a buffer is locked it does not mean that write-out is in progress. We
> > have to check whether the buffer is dirty, and if it is, we have to submit
> > it for write-out. We unconditionally move the buffer to t_locked_list so
> > that we cannot mistake an unprocessed buffer for one already given to
> > ll_rw_block(). This subtly changes the meaning of buffer states in
> > t_locked_list - an unlocked buffer (for list users other than
> > journal_commit_transaction()) no longer means it has been written. But
> > only journal_unmap_buffer() cares, and it now checks that the buffer
> > is not dirty.
>
> Seems complex. It means that t_locked_list takes on an additional (and
> undocumented!) meaning.
Sorry, if we agree on some final form I'll add the appropriate
comment to the header file.
> Also, I don't think it works. See ll_rw_block()'s handling of
> already-locked buffers..
We send it to disk with SWRITE - hence ll_rw_block() waits for the buffer
lock for us. Or do you have something else in mind?
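For reference, the write-out path of ll_rw_block() looks roughly like
this (a simplified paraphrase of fs/buffer.c, with the read side left
out):

        for (i = 0; i < nr; i++) {
                struct buffer_head *bh = bhs[i];

                if (rw == SWRITE)
                        /* SWRITE: sleep until we hold the buffer lock */
                        lock_buffer(bh);
                else if (test_set_buffer_locked(bh))
                        /* plain WRITE skips already-locked buffers */
                        continue;

                if (test_clear_buffer_dirty(bh)) {
                        bh->b_end_io = end_buffer_write_sync;
                        get_bh(bh);
                        submit_bh(WRITE, bh);
                        continue;
                }
                /* no longer dirty - it was written out while we
                 * waited for the lock */
                unlock_buffer(bh);
        }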
> An alternative is to just lock the buffer in journal_commit_transaction(),
> if it was locked-and-dirty. And remove the call to ll_rw_block() and
> submit the locked buffers by hand.
Yes, this has the advantage that we can move the buffer to t_locked_list
at the right time, and so we don't change the semantics of t_locked_list.
OTOH the locking will be a bit more complicated (we'd need to acquire and
drop j_list_lock for almost every bh, while currently we do it only once
per batch) - the code would have to look something like:
        spin_lock(&journal->j_list_lock);
        while (commit_transaction->t_sync_datalist) {
                jh = commit_transaction->t_sync_datalist;
                bh = jh2bh(jh);
                journal_grab_journal_head(bh);
                if (buffer_dirty(bh)) {
                        get_bh(bh);
                        spin_unlock(&journal->j_list_lock);
                        lock_buffer(bh);
                        if (buffer_dirty(bh)) {
                                /* submit the buffer */
                        }
                        jbd_lock_bh_state(bh);
                        spin_lock(&journal->j_list_lock);
                        /* Check that somebody did not move the jh elsewhere */
                } else {
                        /* inverted_lock() drops j_list_lock on failure */
                        if (!inverted_lock(journal, bh))
                                goto write_out_data;
                }
                __journal_temp_unlink_buffer(jh);
                __journal_file_buffer(jh, commit_transaction, BJ_Locked);
                journal_put_journal_head(jh);
                jbd_unlock_bh_state(bh);
        }
If you prefer something like this I can code it up...
> That would mean that if someone had redirtied a buffer which was on
> t_sync_datalist *while* it was under writeout, we'd end up waiting on that
> writeout to complete before submitting more I/O. But I suspect that's
> pretty rare.
>
> One thing which concerns me with your approach is livelocks: if some process
> sits in a tight loop writing to the same part of the same file, will it
> cause kjournald to get stuck?
No, because as soon as we find the buffer in t_sync_datalist we move
it to t_locked_list and submit it for IO - this case is one reason why I
introduced the new meaning of t_locked_list.
> The problem we have here is "was the buffer dirtied before this commit
> started, or after?". In the former case we are obliged to write it. In
> the latter case we are not, and in trying to do so we risk livelocking.
Honza
--
Jan Kara <[email protected]>
SuSE CR Labs
Hi,
On Wed, 2005-09-14 at 02:43, Andrew Morton wrote:
> An alternative is to just lock the buffer in journal_commit_transaction(),
> if it was locked-and-dirty.
That was my first reaction too. We're using SWRITE so we'll end up
locking them anyway; and if synchronising against other lockers is an
issue, then locking them ourselves early is the obvious protection.
--Stephen
Jan Kara <[email protected]> wrote:
>
> > Also, I don't think it works. See ll_rw_block()'s handling of
> > already-locked buffers..
>
> We send it to disk with SWRITE - hence ll_rw_block() waits for the buffer
> lock for us. Or do you have something else in mind?
>
OK.
> > An alternative is to just lock the buffer in journal_commit_transaction(),
> > if it was locked-and-dirty. And remove the call to ll_rw_block() and
> > submit the locked buffers by hand.
>
> Yes, this has the advantage that we can move the buffer to t_locked_list
> at the right time, and so we don't change the semantics of t_locked_list.
> OTOH the locking will be a bit more complicated (we'd need to acquire and
> drop j_list_lock for almost every bh, while currently we do it only once
> per batch)
Only need to drop the spinlock if test_set_buffer_locked() fails.
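I.e., something like (an untested sketch):

        if (test_set_buffer_locked(bh)) {
                /* trylock failed - only now do we have to drop the
                 * spinlock and sleep for the buffer lock */
                get_bh(bh);
                spin_unlock(&journal->j_list_lock);
                lock_buffer(bh);
                spin_lock(&journal->j_list_lock);
                /* recheck the jh - it may have been refiled while
                 * we slept */
        }
        /* common case: buffer locked without dropping j_list_lock */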
>
>         spin_lock(&journal->j_list_lock);
>         while (commit_transaction->t_sync_datalist) {
>                 jh = commit_transaction->t_sync_datalist;
>                 bh = jh2bh(jh);
>                 journal_grab_journal_head(bh);
>                 if (buffer_dirty(bh)) {
>                         get_bh(bh);
>                         spin_unlock(&journal->j_list_lock);
>                         lock_buffer(bh);
>                         if (buffer_dirty(bh)) {
>                                 /* submit the buffer */
>                         }
>                         jbd_lock_bh_state(bh);
>                         spin_lock(&journal->j_list_lock);
>                         /* Check that somebody did not move the jh elsewhere */
>                 } else {
>                         /* inverted_lock() drops j_list_lock on failure */
>                         if (!inverted_lock(journal, bh))
>                                 goto write_out_data;
>                 }
>                 __journal_temp_unlink_buffer(jh);
>                 __journal_file_buffer(jh, commit_transaction, BJ_Locked);
>                 journal_put_journal_head(jh);
>                 jbd_unlock_bh_state(bh);
>         }
>
> If you prefer something like this I can code it up...
If the code is conceptually simpler then I think it's worth doing, even if
the actual implementation is similarly or even more complex.
So yes please, let's see how it looks.
> > That would mean that if someone had redirtied a buffer which was on
> > t_sync_datalist *while* it was under writeout, we'd end up waiting on that
> > writeout to complete before submitting more I/O. But I suspect that's
> > pretty rare.
> >
> > One thing which concerns me with your approach is livelocks: if some process
> > sits in a tight loop writing to the same part of the same file, will it
> > cause kjournald to get stuck?
>
> No, because as soon as we find the buffer in t_sync_datalist we move
> it to t_locked_list and submit it for IO - this case is one reason why I
> introduced the new meaning of t_locked_list.
Right. But the buffer can be redirtied while it's on t_locked_list, even
while the I/O is in flight. What happens then? Will kjournald try to
rewrite it?
> > > An alternative is to just lock the buffer in journal_commit_transaction(),
> > > if it was locked-and-dirty. And remove the call to ll_rw_block() and
> > > submit the locked buffers by hand.
> >
> > Yes, this has the advantage that we can move the buffer to t_locked_list
> > at the right time, and so we don't change the semantics of t_locked_list.
> > OTOH the locking will be a bit more complicated (we'd need to acquire and
> > drop j_list_lock for almost every bh, while currently we do it only once
> > per batch)
>
> Only need to drop the spinlock if test_set_buffer_locked() fails.
Ahh, good point.
> >         spin_lock(&journal->j_list_lock);
> >         while (commit_transaction->t_sync_datalist) {
> >                 jh = commit_transaction->t_sync_datalist;
> >                 bh = jh2bh(jh);
> >                 journal_grab_journal_head(bh);
> >                 if (buffer_dirty(bh)) {
> >                         get_bh(bh);
> >                         spin_unlock(&journal->j_list_lock);
> >                         lock_buffer(bh);
> >                         if (buffer_dirty(bh)) {
> >                                 /* submit the buffer */
> >                         }
> >                         jbd_lock_bh_state(bh);
> >                         spin_lock(&journal->j_list_lock);
> >                         /* Check that somebody did not move the jh elsewhere */
> >                 } else {
> >                         /* inverted_lock() drops j_list_lock on failure */
> >                         if (!inverted_lock(journal, bh))
> >                                 goto write_out_data;
> >                 }
> >                 __journal_temp_unlink_buffer(jh);
> >                 __journal_file_buffer(jh, commit_transaction, BJ_Locked);
> >                 journal_put_journal_head(jh);
> >                 jbd_unlock_bh_state(bh);
> >         }
> >
> > If you prefer something like this I can code it up...
>
> If the code is conceptually simpler then I think it's worth doing, even if
> the actual implementation is similarly or even more complex.
>
> So yes please, let's see how it looks.
OK, will do.
> > > That would mean that if someone had redirtied a buffer which was on
> > > t_sync_datalist *while* it was under writeout, we'd end up waiting on that
> > > writeout to complete before submitting more I/O. But I suspect that's
> > > pretty rare.
> > >
> > > One thing which concerns me with your approach is livelocks: if some process
> > > sits in a tight loop writing to the same part of the same file, will it
> > > cause kjournald to get stuck?
> >
> > No, because as soon as we find the buffer in t_sync_datalist we move
> > it to t_locked_list and submit it for IO - this case is one reason why I
> > introduced the new meaning of t_locked_list.
>
> Right. But the buffer can be redirtied while it's on t_locked_list, even
> while the I/O is in flight. What happens then? Will kjournald try to
> rewrite it?
No. With my patch the journaling code writes only the data buffers in
t_sync_datalist, and it moves them to t_locked_list even before the
actual submit, so we really do write each buffer exactly once in
journal_commit_transaction().
Originally it worked as follows: a buffer was first submitted for IO;
then, if we eventually came across it a second time and found it locked,
we moved it to t_locked_list. If we found a clean buffer in
t_sync_datalist we just removed it from the transaction.
The livelock you were talking about was, and still is, prevented by the
clever code in journal_dirty_data(), which checks whether the buffer is
part of the committing transaction and, if so, sends it to disk and
refiles it to the new transaction.
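Schematically, that code in journal_dirty_data() does the following (a
much simplified sketch - the real code rechecks the buffer's state after
every sleep):

        if (jh->b_transaction != NULL &&
            jh->b_transaction != handle->h_transaction) {
                /* buffer still belongs to the committing transaction */
                if (buffer_dirty(bh)) {
                        /* write it out ourselves, so a process that
                         * redirties the buffer in a tight loop cannot
                         * livelock the commit */
                        get_bh(bh);
                        spin_unlock(&journal->j_list_lock);
                        jbd_unlock_bh_state(bh);
                        sync_dirty_buffer(bh);
                        jbd_lock_bh_state(bh);
                        spin_lock(&journal->j_list_lock);
                        put_bh(bh);
                }
                /* refile from the committing to the running transaction */
                __journal_temp_unlink_buffer(jh);
                __journal_file_buffer(jh, handle->h_transaction,
                                      BJ_SyncData);
        }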
Honza
--
Jan Kara <[email protected]>
SuSE CR Labs