Hi
Dio write returns EIO when try_to_release_page fails because bh is
still referenced.
The patch
"commit 3f31fddfa26b7594b44ff2b34f9a04ba409e0f91
Author: Mingming Cao <[email protected]>
Date: Fri Jul 25 01:46:22 2008 -0700
jbd: fix race between free buffer and commit transaction
"
was merged into 2.6.27-rc1, but I noticed that this patch is not enough
to fix the race.
I did fsstress test heavily to 2.6.27-rc1, and found that dio write still
sometimes got EIO through this test.
The patch above fixed race between freeing buffer(dio) and committing
transaction(jbd) but I discovered that there is another race,
freeing buffer(dio) and ext3/4_ordered_writepage.
: background_writeout()
->write_cache_pages()
->ext3_ordered_writepage()
walk_page_buffers() <- take a bh ref
block_write_full_page() <- unlock_page
: <- end_page_writeback
: <- race! (dio write->try_to_release_page fails)
walk_page_buffers() <-release a bh ref
ext3_ordered_writepage holds bh ref and does unlock_page remaining
taking a bh ref, so this causes the race and failure of
try_to_release_page.
Following patch fixes this race.
Thanks.
Signed-off-by :Hisashi Hifumi <[email protected]>
diff -Nrup linux-2.6.27-rc1.org/fs/jbd/transaction.c linux-2.6.27-rc1/fs/jbd/transaction.c
--- linux-2.6.27-rc1.org/fs/jbd/transaction.c 2008-07-29 19:28:47.000000000 +0900
+++ linux-2.6.27-rc1/fs/jbd/transaction.c 2008-07-29 20:40:12.000000000 +0900
@@ -1764,6 +1764,12 @@ int journal_try_to_free_buffers(journal_
*/
if (ret == 0 && (gfp_mask & __GFP_WAIT) && (gfp_mask & __GFP_FS)) {
journal_wait_for_transaction_sync_data(journal);
+
+ bh = head;
+ do {
+ while (atomic_read(&bh->b_count))
+ schedule();
+ } while ((bh = bh->b_this_page) != head);
ret = try_to_free_buffers(page);
}
diff -Nrup linux-2.6.27-rc1.org/fs/jbd2/transaction.c linux-2.6.27-rc1/fs/jbd2/transaction.c
--- linux-2.6.27-rc1.org/fs/jbd2/transaction.c 2008-07-29 19:28:47.000000000 +0900
+++ linux-2.6.27-rc1/fs/jbd2/transaction.c 2008-07-29 20:56:42.000000000 +0900
@@ -1583,6 +1583,12 @@ int jbd2_journal_try_to_free_buffers(jou
*/
if (ret == 0 && (gfp_mask & __GFP_WAIT) && (gfp_mask & __GFP_FS)) {
jbd2_journal_wait_for_transaction_sync_data(journal);
+
+ bh = head;
+ do {
+ while (atomic_read(&bh->b_count))
+ schedule();
+ } while ((bh = bh->b_this_page) != head);
ret = try_to_free_buffers(page);
}
On Mon, 04 Aug 2008 20:10:33 +0900
Hisashi Hifumi <[email protected]> wrote:
> Hi
>
> Dio write returns EIO when try_to_release_page fails because bh is
> still referenced.
> The patch
> "commit 3f31fddfa26b7594b44ff2b34f9a04ba409e0f91
> Author: Mingming Cao <[email protected]>
> Date: Fri Jul 25 01:46:22 2008 -0700
>
> jbd: fix race between free buffer and commit transaction
> "
> was merged into 2.6.27-rc1, but I noticed that this patch is not enough
> to fix the race.
> I did fsstress test heavily to 2.6.27-rc1, and found that dio write still
> sometimes got EIO through this test.
> The patch above fixed race between freeing buffer(dio) and committing
> transaction(jbd) but I discovered that there is another race,
> freeing buffer(dio) and ext3/4_ordered_writepage.
> : background_writeout()
> ->write_cache_pages()
> ->ext3_ordered_writepage()
> walk_page_buffers() <- take a bh ref
> block_write_full_page() <- unlock_page
> : <- end_page_writeback
> : <- race! (dio write->try_to_release_page fails)
> walk_page_buffers() <-release a bh ref
>
> ext3_ordered_writepage holds bh ref and does unlock_page remaining
> taking a bh ref, so this causes the race and failure of
> try_to_release_page.
>
> Following patch fixes this race.
Please don't patch both filesystems in a single patch - they go into
the tree via different routes.
>
> Signed-off-by :Hisashi Hifumi <[email protected]>
"Signed-off-by: ", please.
>
> diff -Nrup linux-2.6.27-rc1.org/fs/jbd/transaction.c linux-2.6.27-rc1/fs/jbd/transaction.c
> --- linux-2.6.27-rc1.org/fs/jbd/transaction.c 2008-07-29 19:28:47.000000000 +0900
> +++ linux-2.6.27-rc1/fs/jbd/transaction.c 2008-07-29 20:40:12.000000000 +0900
> @@ -1764,6 +1764,12 @@ int journal_try_to_free_buffers(journal_
> */
> if (ret == 0 && (gfp_mask & __GFP_WAIT) && (gfp_mask & __GFP_FS)) {
> journal_wait_for_transaction_sync_data(journal);
> +
> + bh = head;
> + do {
> + while (atomic_read(&bh->b_count))
> + schedule();
> + } while ((bh = bh->b_this_page) != head);
> ret = try_to_free_buffers(page);
> }
The loop is problematic. If the scheduler decides to keep running this
task then we have a busy loop. If this task has realtime policy then
it might even lock up the kernel.
Perhaps we can use wait_on_page_writeback()?
>>
>> diff -Nrup linux-2.6.27-rc1.org/fs/jbd/transaction.c
>linux-2.6.27-rc1/fs/jbd/transaction.c
>> --- linux-2.6.27-rc1.org/fs/jbd/transaction.c 2008-07-29
>19:28:47.000000000 +0900
>> +++ linux-2.6.27-rc1/fs/jbd/transaction.c 2008-07-29 20:40:12.000000000 +0900
>> @@ -1764,6 +1764,12 @@ int journal_try_to_free_buffers(journal_
>> */
>> if (ret == 0 && (gfp_mask & __GFP_WAIT) && (gfp_mask & __GFP_FS)) {
>> journal_wait_for_transaction_sync_data(journal);
>> +
>> + bh = head;
>> + do {
>> + while (atomic_read(&bh->b_count))
>> + schedule();
>> + } while ((bh = bh->b_this_page) != head);
>> ret = try_to_free_buffers(page);
>> }
>
>The loop is problematic. If the scheduler decides to keep running this
>task then we have a busy loop. If this task has realtime policy then
>it might even lock up the kernel.
>
>Perhaps we can use wait_on_page_writeback()?
>
We cannot use wait_on_page_writeback() to wait for releasing bh ref because
in ext3_ordered_writepage() bh ref is grabbed and released through walk_page_buffers
so between both walk_page_buffers, it remains taking a bh ref even if end_page_writeback
is performed.
->ext3_ordered_writepage()
walk_page_buffers() <- take a bh ref
block_write_full_page() <- unlock_page
: <- end_page_writeback
: <- race! (dio write->try_to_release_page fails): ---> remains taking a bh ref
walk_page_buffers() <-release a bh ref
On Mon, 2008-08-04 at 14:50 -0700, Andrew Morton wrote:
> On Mon, 04 Aug 2008 20:10:33 +0900
> Hisashi Hifumi <[email protected]> wrote:
>
> > Hi
> >
> > Dio write returns EIO when try_to_release_page fails because bh is
> > still referenced.
> >
> > diff -Nrup linux-2.6.27-rc1.org/fs/jbd/transaction.c linux-2.6.27-rc1/fs/jbd/transaction.c
> > --- linux-2.6.27-rc1.org/fs/jbd/transaction.c 2008-07-29 19:28:47.000000000 +0900
> > +++ linux-2.6.27-rc1/fs/jbd/transaction.c 2008-07-29 20:40:12.000000000 +0900
> > @@ -1764,6 +1764,12 @@ int journal_try_to_free_buffers(journal_
> > */
> > if (ret == 0 && (gfp_mask & __GFP_WAIT) && (gfp_mask & __GFP_FS)) {
> > journal_wait_for_transaction_sync_data(journal);
> > +
> > + bh = head;
> > + do {
> > + while (atomic_read(&bh->b_count))
> > + schedule();
> > + } while ((bh = bh->b_this_page) != head);
> > ret = try_to_free_buffers(page);
> > }
>
> The loop is problematic. If the scheduler decides to keep running this
> task then we have a busy loop. If this task has realtime policy then
> it might even lock up the kernel.
>
ocfs2 calls journal_try_to_free_buffers too, looping on b_count might
not be the best idea there either.
This code gets called from releasepage, which is used other places than
the O_DIRECT invalidation paths, I'd be worried about performance
problems here.
-chris
>> >
>> > diff -Nrup linux-2.6.27-rc1.org/fs/jbd/transaction.c
>linux-2.6.27-rc1/fs/jbd/transaction.c
>> > --- linux-2.6.27-rc1.org/fs/jbd/transaction.c 2008-07-29
>19:28:47.000000000 +0900
>> > +++ linux-2.6.27-rc1/fs/jbd/transaction.c 2008-07-29 20:40:12.000000000 +0900
>> > @@ -1764,6 +1764,12 @@ int journal_try_to_free_buffers(journal_
>> > */
>> > if (ret == 0 && (gfp_mask & __GFP_WAIT) && (gfp_mask & __GFP_FS)) {
>> > journal_wait_for_transaction_sync_data(journal);
>> > +
>> > + bh = head;
>> > + do {
>> > + while (atomic_read(&bh->b_count))
>> > + schedule();
>> > + } while ((bh = bh->b_this_page) != head);
>> > ret = try_to_free_buffers(page);
>> > }
>>
>> The loop is problematic. If the scheduler decides to keep running this
>> task then we have a busy loop. If this task has realtime policy then
>> it might even lock up the kernel.
>>
>
>ocfs2 calls journal_try_to_free_buffers too, looping on b_count might
>not be the best idea there either.
>
>This code gets called from releasepage, which is used other places than
>the O_DIRECT invalidation paths, I'd be worried about performance
>problems here.
>
try_to_release_page has gfp_mask parameter. So when try_to_releasepage
is called from performance sensitive part, gfp_mask should not be set.
b_count check loop is inside of (gfp_mask & __GFP_WAIT) && (gfp_mask & __GFP_FS) check.
On Tue, 2008-08-05 at 13:51 +0900, Hisashi Hifumi wrote:
> >> >
> >> > diff -Nrup linux-2.6.27-rc1.org/fs/jbd/transaction.c
> >linux-2.6.27-rc1/fs/jbd/transaction.c
> >> > --- linux-2.6.27-rc1.org/fs/jbd/transaction.c 2008-07-29
> >19:28:47.000000000 +0900
> >> > +++ linux-2.6.27-rc1/fs/jbd/transaction.c 2008-07-29 20:40:12.000000000 +0900
> >> > @@ -1764,6 +1764,12 @@ int journal_try_to_free_buffers(journal_
> >> > */
> >> > if (ret == 0 && (gfp_mask & __GFP_WAIT) && (gfp_mask & __GFP_FS)) {
> >> > journal_wait_for_transaction_sync_data(journal);
> >> > +
> >> > + bh = head;
> >> > + do {
> >> > + while (atomic_read(&bh->b_count))
> >> > + schedule();
> >> > + } while ((bh = bh->b_this_page) != head);
> >> > ret = try_to_free_buffers(page);
> >> > }
> >>
> >> The loop is problematic. If the scheduler decides to keep running this
> >> task then we have a busy loop. If this task has realtime policy then
> >> it might even lock up the kernel.
> >>
> >
> >ocfs2 calls journal_try_to_free_buffers too, looping on b_count might
> >not be the best idea there either.
> >
> >This code gets called from releasepage, which is used other places than
> >the O_DIRECT invalidation paths, I'd be worried about performance
> >problems here.
> >
>
> try_to_release_page has gfp_mask parameter. So when try_to_releasepage
> is called from performance sensitive part, gfp_mask should not be set.
> b_count check loop is inside of (gfp_mask & __GFP_WAIT) && (gfp_mask & __GFP_FS) check.
Looks like try_to_free_pages will go into releasepage with wait & fs
both set. This kind of change would make me very nervous.
-chris
在 2008-08-04一的 20:10 +0900,Hisashi Hifumi写道:
> Hi
>
> Dio write returns EIO when try_to_release_page fails because bh is
> still referenced.
> The patch
> "commit 3f31fddfa26b7594b44ff2b34f9a04ba409e0f91
> Author: Mingming Cao <[email protected]>
> Date: Fri Jul 25 01:46:22 2008 -0700
>
> jbd: fix race between free buffer and commit transaction
> "
> was merged into 2.6.27-rc1, but I noticed that this patch is not enough
> to fix the race.
> I did fsstress test heavily to 2.6.27-rc1, and found that dio write still
> sometimes got EIO through this test.
:( thought we beat that race pretty hard already.T
Could you send me the fsstree command to reproduce the race?
> The patch above fixed race between freeing buffer(dio) and committing
> transaction(jbd) but I discovered that there is another race,
> freeing buffer(dio) and ext3/4_ordered_writepage.
> : background_writeout()
> ->write_cache_pages()
> ->ext3_ordered_writepage()
> walk_page_buffers() <- take a bh ref
> block_write_full_page() <- unlock_page
> : <- end_page_writeback
> : <- race! (dio write->try_to_release_page fails)
> walk_page_buffers() <-release a bh ref
>
> ext3_ordered_writepage holds bh ref and does unlock_page remaining
> taking a bh ref, so this causes the race and failure of
> try_to_release_page.
>
I thought about this before, the race seems unlikely to me. Perhaps I
missed something, but DIO code already waiting for all the pending IO to
finish before calling try_to_release_page which eventually called
journal_try_to_free_buffers(). During this call, the inode mutx is hold
to prevent the new writer (buffered/DIO) to re-dirty the pages. If there
is background writeout happens when DIO is kicked in, DIO will wait for
all the pages writeback bit clear first. here is the stack
generic_file_aio_write()
-> mutex_lock(&inode->i_mutex);
-> __generic_file_aio_write_nolock()
-> generic_file_direct_IO()
->filemap_write_and_wait()
-> filemap_fdatawait()
-> wait_on_page_writeback_range()
(==== waiting for
pending IO to finish ====)
->invalidate_inode_pages2_range()
->invalidate_inode_pages2()
->try_to_releasepage()
->ext3_releasepage()
->journal_try_to_free_buffers()
> Following patch fixes this race.
> Thanks.
>
> Signed-off-by :Hisashi Hifumi <[email protected]>
>
> diff -Nrup linux-2.6.27-rc1.org/fs/jbd/transaction.c linux-2.6.27-rc1/fs/jbd/transaction.c
> --- linux-2.6.27-rc1.org/fs/jbd/transaction.c 2008-07-29 19:28:47.000000000 +0900
> +++ linux-2.6.27-rc1/fs/jbd/transaction.c 2008-07-29 20:40:12.000000000 +0900
> @@ -1764,6 +1764,12 @@ int journal_try_to_free_buffers(journal_
> */
> if (ret == 0 && (gfp_mask & __GFP_WAIT) && (gfp_mask & __GFP_FS)) {
> journal_wait_for_transaction_sync_data(journal);
> +
> + bh = head;
> + do {
> + while (atomic_read(&bh->b_count))
> + schedule();
> + } while ((bh = bh->b_this_page) != head);
> ret = try_to_free_buffers(page);
> }
>
> diff -Nrup linux-2.6.27-rc1.org/fs/jbd2/transaction.c linux-2.6.27-rc1/fs/jbd2/transaction.c
> --- linux-2.6.27-rc1.org/fs/jbd2/transaction.c 2008-07-29 19:28:47.000000000 +0900
> +++ linux-2.6.27-rc1/fs/jbd2/transaction.c 2008-07-29 20:56:42.000000000 +0900
> @@ -1583,6 +1583,12 @@ int jbd2_journal_try_to_free_buffers(jou
> */
> if (ret == 0 && (gfp_mask & __GFP_WAIT) && (gfp_mask & __GFP_FS)) {
> jbd2_journal_wait_for_transaction_sync_data(journal);
> +
> + bh = head;
> + do {
> + while (atomic_read(&bh->b_count))
> + schedule();
> + } while ((bh = bh->b_this_page) != head);
> ret = try_to_free_buffers(page);
> }
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
在 2008-08-05二的 12:17 -0400,Chris Mason写道:
> On Tue, 2008-08-05 at 13:51 +0900, Hisashi Hifumi wrote:
> > >> >
> > >> > diff -Nrup linux-2.6.27-rc1.org/fs/jbd/transaction.c
> > >linux-2.6.27-rc1/fs/jbd/transaction.c
> > >> > --- linux-2.6.27-rc1.org/fs/jbd/transaction.c 2008-07-29
> > >19:28:47.000000000 +0900
> > >> > +++ linux-2.6.27-rc1/fs/jbd/transaction.c 2008-07-29 20:40:12.000000000 +0900
> > >> > @@ -1764,6 +1764,12 @@ int journal_try_to_free_buffers(journal_
> > >> > */
> > >> > if (ret == 0 && (gfp_mask & __GFP_WAIT) && (gfp_mask & __GFP_FS)) {
> > >> > journal_wait_for_transaction_sync_data(journal);
> > >> > +
> > >> > + bh = head;
> > >> > + do {
> > >> > + while (atomic_read(&bh->b_count))
> > >> > + schedule();
> > >> > + } while ((bh = bh->b_this_page) != head);
> > >> > ret = try_to_free_buffers(page);
> > >> > }
> > >>
> > >> The loop is problematic. If the scheduler decides to keep running this
> > >> task then we have a busy loop. If this task has realtime policy then
> > >> it might even lock up the kernel.
> > >>
> > >
> > >ocfs2 calls journal_try_to_free_buffers too, looping on b_count might
> > >not be the best idea there either.
> > >
> > >This code gets called from releasepage, which is used other places than
> > >the O_DIRECT invalidation paths, I'd be worried about performance
> > >problems here.
> > >
> >
> > try_to_release_page has gfp_mask parameter. So when try_to_releasepage
> > is called from performance sensitive part, gfp_mask should not be set.
> > b_count check loop is inside of (gfp_mask & __GFP_WAIT) && (gfp_mask & __GFP_FS) check.
>
> Looks like try_to_free_pages will go into releasepage with wait & fs
> both set. This kind of change would make me very nervous.
>
Hi Chris,
The gfp_mask try_to_free_pages() takes from it's caller will past it
down to try_to_release_page(). Based on the meaning of __GFP_WAIT and
GFP_FS, if the upper level caller set these two flags, I assume the
upper level caller expect delay and wait for fs to finish?
But I agree that using a loop in journal_try_to_free_buffers() to wait
for the busy bh release the counter is expensive...
Mingming
> -chris
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
在 2008-08-05二的 11:36 +0900,Hisashi Hifumi写道:
> >>
> >> diff -Nrup linux-2.6.27-rc1.org/fs/jbd/transaction.c
> >linux-2.6.27-rc1/fs/jbd/transaction.c
> >> --- linux-2.6.27-rc1.org/fs/jbd/transaction.c 2008-07-29
> >19:28:47.000000000 +0900
> >> +++ linux-2.6.27-rc1/fs/jbd/transaction.c 2008-07-29 20:40:12.000000000 +0900
> >> @@ -1764,6 +1764,12 @@ int journal_try_to_free_buffers(journal_
> >> */
> >> if (ret == 0 && (gfp_mask & __GFP_WAIT) && (gfp_mask & __GFP_FS)) {
> >> journal_wait_for_transaction_sync_data(journal);
> >> +
> >> + bh = head;
> >> + do {
> >> + while (atomic_read(&bh->b_count))
> >> + schedule();
> >> + } while ((bh = bh->b_this_page) != head);
> >> ret = try_to_free_buffers(page);
> >> }
> >
> >The loop is problematic. If the scheduler decides to keep running this
> >task then we have a busy loop. If this task has realtime policy then
> >it might even lock up the kernel.
> >
> >Perhaps we can use wait_on_page_writeback()?
> >
>
> We cannot use wait_on_page_writeback() to wait for releasing bh ref because
> in ext3_ordered_writepage() bh ref is grabbed and released through walk_page_buffers
> so between both walk_page_buffers, it remains taking a bh ref even if end_page_writeback
> is performed.
> ->ext3_ordered_writepage()
> walk_page_buffers() <- take a bh ref
> block_write_full_page() <- unlock_page
> : <- end_page_writeback
> : <- race! (dio write->try_to_release_page fails): ---> remains taking a bh ref
> walk_page_buffers() <-release a bh ref
>
Okay, I see the race window, DIO could come in before
walk_page_buffers() release the bh reference. So far I don't see a nicer
way to sync between background writeout with DIO path yet...
Mingming
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
At 06:35 08/08/06, Mingming Cao wrote:
>
>$Bi|%#(B 2008-08-05$Bh<8iSd(B 11:36 +0900$B~>7)(Bisashi Hifumi$BifRk!s~>S(B
>> >>
>> >> diff -Nrup linux-2.6.27-rc1.org/fs/jbd/transaction.c
>> >linux-2.6.27-rc1/fs/jbd/transaction.c
>> >> --- linux-2.6.27-rc1.org/fs/jbd/transaction.c 2008-07-29
>> >19:28:47.000000000 +0900
>> >> +++ linux-2.6.27-rc1/fs/jbd/transaction.c 2008-07-29
>20:40:12.000000000 +0900
>> >> @@ -1764,6 +1764,12 @@ int journal_try_to_free_buffers(journal_
>> >> */
>> >> if (ret == 0 && (gfp_mask & __GFP_WAIT) && (gfp_mask & __GFP_FS)) {
>> >> journal_wait_for_transaction_sync_data(journal);
>> >> +
>> >> + bh = head;
>> >> + do {
>> >> + while (atomic_read(&bh->b_count))
>> >> + schedule();
>> >> + } while ((bh = bh->b_this_page) != head);
>> >> ret = try_to_free_buffers(page);
>> >> }
>> >
>> >The loop is problematic. If the scheduler decides to keep running this
>> >task then we have a busy loop. If this task has realtime policy then
>> >it might even lock up the kernel.
>> >
>> >Perhaps we can use wait_on_page_writeback()?
>> >
>>
>> We cannot use wait_on_page_writeback() to wait for releasing bh ref because
>> in ext3_ordered_writepage() bh ref is grabbed and released through
>walk_page_buffers
>> so between both walk_page_buffers, it remains taking a bh ref even if
>end_page_writeback
>> is performed.
>> ->ext3_ordered_writepage()
>> walk_page_buffers() <- take a bh ref
>> block_write_full_page() <- unlock_page
>> : <- end_page_writeback
>> : <- race! (dio write->try_to_release_page fails): --->
>remains taking a bh ref
>> walk_page_buffers() <-release a bh ref
>>
>
>Okay, I see the race window, DIO could come in before
>walk_page_buffers() release the bh reference. So far I don't see a nicer
>way to sync between background writeout with DIO path yet...
>
I know that b_count check on loop is not good, but I do not have better
idea to fix this yet too.
The race window is very short and rare, so I think the impact of introducing
the loop is small even if this loop can be busy loop due to scheduler circumstances.
>> > >> > diff -Nrup linux-2.6.27-rc1.org/fs/jbd/transaction.c
>> > >linux-2.6.27-rc1/fs/jbd/transaction.c
>> > >> > --- linux-2.6.27-rc1.org/fs/jbd/transaction.c 2008-07-29
>> > >19:28:47.000000000 +0900
>> > >> > +++ linux-2.6.27-rc1/fs/jbd/transaction.c 2008-07-29
>20:40:12.000000000 +0900
>> > >> > @@ -1764,6 +1764,12 @@ int journal_try_to_free_buffers(journal_
>> > >> > */
>> > >> > if (ret == 0 && (gfp_mask & __GFP_WAIT) && (gfp_mask & __GFP_FS)) {
>> > >> > journal_wait_for_transaction_sync_data(journal);
>> > >> > +
>> > >> > + bh = head;
>> > >> > + do {
>> > >> > + while (atomic_read(&bh->b_count))
>> > >> > + schedule();
>> > >> > + } while ((bh = bh->b_this_page) != head);
>> > >> > ret = try_to_free_buffers(page);
>> > >> > }
>> > >>
>> > >> The loop is problematic. If the scheduler decides to keep running this
>> > >> task then we have a busy loop. If this task has realtime policy then
>> > >> it might even lock up the kernel.
>> > >>
>> > >
>> > >ocfs2 calls journal_try_to_free_buffers too, looping on b_count might
>> > >not be the best idea there either.
>> > >
>> > >This code gets called from releasepage, which is used other places than
>> > >the O_DIRECT invalidation paths, I'd be worried about performance
>> > >problems here.
>> > >
>> >
>> > try_to_release_page has gfp_mask parameter. So when try_to_releasepage
>> > is called from performance sensitive part, gfp_mask should not be set.
>> > b_count check loop is inside of (gfp_mask & __GFP_WAIT) && (gfp_mask &
>__GFP_FS) check.
>>
>> Looks like try_to_free_pages will go into releasepage with wait & fs
>> both set. This kind of change would make me very nervous.
>>
>
>Hi Chris,
>
>The gfp_mask try_to_free_pages() takes from it's caller will past it
>down to try_to_release_page(). Based on the meaning of __GFP_WAIT and
>GFP_FS, if the upper level caller set these two flags, I assume the
>upper level caller expect delay and wait for fs to finish?
>
>
>But I agree that using a loop in journal_try_to_free_buffers() to wait
>for the busy bh release the counter is expensive...
I modified my patch.
I do not change Checking b_count in a loop, but introduce
set_current_state(TASK_UNINTERRUPTIBLE) to mitigate the loop. I think this can
lead to avoid busy loop.
I used the same approach of do_sync_read()->wait_on_retry_sync_kiocb or some drivers(qla2xxx).
Signed-off-by: Hisashi Hifumi <[email protected]>
diff -Nrup linux-2.6.27-rc1.org/fs/jbd/transaction.c linux-2.6.27-rc1.jbdfix/fs/jbd/transaction.c
--- linux-2.6.27-rc1.org/fs/jbd/transaction.c 2008-07-29 19:28:47.000000000 +0900
+++ linux-2.6.27-rc1.jbdfix/fs/jbd/transaction.c 2008-08-06 13:35:37.000000000 +0900
@@ -1764,6 +1764,15 @@ int journal_try_to_free_buffers(journal_
*/
if (ret == 0 && (gfp_mask & __GFP_WAIT) && (gfp_mask & __GFP_FS)) {
journal_wait_for_transaction_sync_data(journal);
+
+ bh = head;
+ do {
+ while (atomic_read(&bh->b_count)) {
+ set_current_state(TASK_UNINTERRUPTIBLE);
+ schedule();
+ __set_current_state(TASK_RUNNING);
+ }
+ } while ((bh = bh->b_this_page) != head);
ret = try_to_free_buffers(page);
}
>I modified my patch.
>I do not change Checking b_count in a loop, but introduce
>set_current_state(TASK_UNINTERRUPTIBLE) to mitigate the loop. I think this can
>lead to avoid busy loop.
>I used the same approach of do_sync_read()->wait_on_retry_sync_kiocb or
>some drivers(qla2xxx).
>
>Signed-off-by: Hisashi Hifumi <[email protected]>
>
>diff -Nrup linux-2.6.27-rc1.org/fs/jbd/transaction.c
>linux-2.6.27-rc1.jbdfix/fs/jbd/transaction.c
>--- linux-2.6.27-rc1.org/fs/jbd/transaction.c 2008-07-29 19:28:47.000000000 +0900
>+++ linux-2.6.27-rc1.jbdfix/fs/jbd/transaction.c 2008-08-06
>13:35:37.000000000 +0900
>@@ -1764,6 +1764,15 @@ int journal_try_to_free_buffers(journal_
> */
> if (ret == 0 && (gfp_mask & __GFP_WAIT) && (gfp_mask & __GFP_FS)) {
> journal_wait_for_transaction_sync_data(journal);
>+
>+ bh = head;
>+ do {
>+ while (atomic_read(&bh->b_count)) {
>+ set_current_state(TASK_UNINTERRUPTIBLE);
>+ schedule();
>+ __set_current_state(TASK_RUNNING);
>+ }
>+ } while ((bh = bh->b_this_page) != head);
> ret = try_to_free_buffers(page);
> }
Sorry, the thread entering this loop sleeps infinitely.
We should add some wait queue and implement wait/wakeup code.
Hi,
On Tue 05-08-08 14:03:14, Mingming Cao wrote:
> 在 2008-08-04一的 20:10 +0900,Hisashi Hifumi写道:
> > Hi
> >
> > Dio write returns EIO when try_to_release_page fails because bh is
> > still referenced.
> > The patch
> > "commit 3f31fddfa26b7594b44ff2b34f9a04ba409e0f91
> > Author: Mingming Cao <[email protected]>
> > Date: Fri Jul 25 01:46:22 2008 -0700
> >
> > jbd: fix race between free buffer and commit transaction
> > "
> > was merged into 2.6.27-rc1, but I noticed that this patch is not enough
> > to fix the race.
> > I did fsstress test heavily to 2.6.27-rc1, and found that dio write still
> > sometimes got EIO through this test.
>
> :( thought we beat that race pretty hard already.T
>
> Could you send me the fsstree command to reproduce the race?
It is a part of ext3-tools package Andrew has somewhere and also LTP has
it I think.
> > The patch above fixed race between freeing buffer(dio) and committing
> > transaction(jbd) but I discovered that there is another race,
> > freeing buffer(dio) and ext3/4_ordered_writepage.
> > : background_writeout()
> > ->write_cache_pages()
> > ->ext3_ordered_writepage()
> > walk_page_buffers() <- take a bh ref
> > block_write_full_page() <- unlock_page
> > : <- end_page_writeback
> > : <- race! (dio write->try_to_release_page fails)
> > walk_page_buffers() <-release a bh ref
> >
> > ext3_ordered_writepage holds bh ref and does unlock_page remaining
> > taking a bh ref, so this causes the race and failure of
> > try_to_release_page.
> >
>
> I thought about this before, the race seems unlikely to me. Perhaps I
> missed something, but DIO code already waiting for all the pending IO to
> finish before calling try_to_release_page which eventually called
> journal_try_to_free_buffers(). During this call, the inode mutx is hold
> to prevent the new writer (buffered/DIO) to re-dirty the pages. If there
> is background writeout happens when DIO is kicked in, DIO will wait for
> all the pages writeback bit clear first. here is the stack
Yes, but in principle, nothing assures that writeback of buffers does not
finish even before block_write_full_page() returns. So there is possibly a
window after PageWriteback is cleared (and thus filemap_fdatawait()
finishes) and before buffer references are dropped.
Now what is more likely in practice is that all buffers of the page are
written during transaction commit. So they are clean but the page remains
dirty. Now background writeback happens, sees dirty page, calls:
ext3_ordered_writepage()
block_write_full_page()
- finds all buffers are clean -> end_page_writeback()
- and at this point direct IO happens which happily proceeds upto a point
where try_to_release_page() fails because ext3_ordered_writepage() has
not yet dropped its references to buffers. Nasty.
So we really need some nice and effective way in which ->writepage()
calls (and possibly others) could synchronize with try_to_release_page()
(which has __GFP_WAIT and __GFP_FS and is thus willing to wait a bit). But
I don't have a good candidate...
Honza
> generic_file_aio_write()
> -> mutex_lock(&inode->i_mutex);
> -> __generic_file_aio_write_nolock()
> -> generic_file_direct_IO()
> ->filemap_write_and_wait()
> -> filemap_fdatawait()
> -> wait_on_page_writeback_range()
> (==== waiting for
> pending IO to finish ====)
> ->invalidate_inode_pages2_range()
> ->invalidate_inode_pages2()
> ->try_to_releasepage()
> ->ext3_releasepage()
> ->journal_try_to_free_buffers()
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Tue, 2008-08-05 at 14:17 -0700, Mingming Cao wrote:
> 在 2008-08-05二的 12:17 -0400,Chris Mason写道:
> > On Tue, 2008-08-05 at 13:51 +0900, Hisashi Hifumi wrote:
> > > >> >
> > > >> > diff -Nrup linux-2.6.27-rc1.org/fs/jbd/transaction.c
> > > >linux-2.6.27-rc1/fs/jbd/transaction.c
> > > >> > --- linux-2.6.27-rc1.org/fs/jbd/transaction.c 2008-07-29
> > > >19:28:47.000000000 +0900
> > > >> > +++ linux-2.6.27-rc1/fs/jbd/transaction.c 2008-07-29 20:40:12.000000000 +0900
> > > >> > @@ -1764,6 +1764,12 @@ int journal_try_to_free_buffers(journal_
> > > >> > */
> > > >> > if (ret == 0 && (gfp_mask & __GFP_WAIT) && (gfp_mask & __GFP_FS)) {
> > > >> > journal_wait_for_transaction_sync_data(journal);
> > > >> > +
> > > >> > + bh = head;
> > > >> > + do {
> > > >> > + while (atomic_read(&bh->b_count))
> > > >> > + schedule();
> > > >> > + } while ((bh = bh->b_this_page) != head);
> > > >> > ret = try_to_free_buffers(page);
> > > >> > }
> > > >>
> > > >> The loop is problematic. If the scheduler decides to keep running this
> > > >> task then we have a busy loop. If this task has realtime policy then
> > > >> it might even lock up the kernel.
> > > >>
> > > >
> > > >ocfs2 calls journal_try_to_free_buffers too, looping on b_count might
> > > >not be the best idea there either.
> > > >
> > > >This code gets called from releasepage, which is used other places than
> > > >the O_DIRECT invalidation paths, I'd be worried about performance
> > > >problems here.
> > > >
> > >
> > > try_to_release_page has gfp_mask parameter. So when try_to_releasepage
> > > is called from performance sensitive part, gfp_mask should not be set.
> > > b_count check loop is inside of (gfp_mask & __GFP_WAIT) && (gfp_mask & __GFP_FS) check.
> >
> > Looks like try_to_free_pages will go into releasepage with wait & fs
> > both set. This kind of change would make me very nervous.
> >
>
> Hi Chris,
>
> The gfp_mask try_to_free_pages() takes from it's caller will past it
> down to try_to_release_page(). Based on the meaning of __GFP_WAIT and
> GFP_FS, if the upper level caller set these two flags, I assume the
> upper level caller expect delay and wait for fs to finish?
>
>
> But I agree that using a loop in journal_try_to_free_buffers() to wait
> for the busy bh release the counter is expensive...
I rediscovered your old thread about trying to do this in a launder_page
call ;)
Does it make more sense to fix do_launder_page to call into the FS on
every page, and let the FS check for PageDirty on its own? That way
invalidate_inode_pages2_range basically gets its own private call into
the FS that says wait around until this page is really free.
-chris
On Wed 06-08-08 09:25:13, Chris Mason wrote:
> On Tue, 2008-08-05 at 14:17 -0700, Mingming Cao wrote:
> > 在 2008-08-05二的 12:17 -0400,Chris Mason写道:
> > > On Tue, 2008-08-05 at 13:51 +0900, Hisashi Hifumi wrote:
> > > > >> >
> > > > >> > diff -Nrup linux-2.6.27-rc1.org/fs/jbd/transaction.c
> > > > >linux-2.6.27-rc1/fs/jbd/transaction.c
> > > > >> > --- linux-2.6.27-rc1.org/fs/jbd/transaction.c 2008-07-29
> > > > >19:28:47.000000000 +0900
> > > > >> > +++ linux-2.6.27-rc1/fs/jbd/transaction.c 2008-07-29 20:40:12.000000000 +0900
> > > > >> > @@ -1764,6 +1764,12 @@ int journal_try_to_free_buffers(journal_
> > > > >> > */
> > > > >> > if (ret == 0 && (gfp_mask & __GFP_WAIT) && (gfp_mask & __GFP_FS)) {
> > > > >> > journal_wait_for_transaction_sync_data(journal);
> > > > >> > +
> > > > >> > + bh = head;
> > > > >> > + do {
> > > > >> > + while (atomic_read(&bh->b_count))
> > > > >> > + schedule();
> > > > >> > + } while ((bh = bh->b_this_page) != head);
> > > > >> > ret = try_to_free_buffers(page);
> > > > >> > }
> > > > >>
> > > > >> The loop is problematic. If the scheduler decides to keep running this
> > > > >> task then we have a busy loop. If this task has realtime policy then
> > > > >> it might even lock up the kernel.
> > > > >>
> > > > >
> > > > >ocfs2 calls journal_try_to_free_buffers too, looping on b_count might
> > > > >not be the best idea there either.
> > > > >
> > > > >This code gets called from releasepage, which is used other places than
> > > > >the O_DIRECT invalidation paths, I'd be worried about performance
> > > > >problems here.
> > > > >
> > > >
> > > > try_to_release_page has gfp_mask parameter. So when try_to_releasepage
> > > > is called from performance sensitive part, gfp_mask should not be set.
> > > > b_count check loop is inside of (gfp_mask & __GFP_WAIT) && (gfp_mask & __GFP_FS) check.
> > >
> > > Looks like try_to_free_pages will go into releasepage with wait & fs
> > > both set. This kind of change would make me very nervous.
> > >
> >
> > Hi Chris,
> >
> > The gfp_mask try_to_free_pages() takes from it's caller will past it
> > down to try_to_release_page(). Based on the meaning of __GFP_WAIT and
> > GFP_FS, if the upper level caller set these two flags, I assume the
> > upper level caller expect delay and wait for fs to finish?
> >
> >
> > But I agree that using a loop in journal_try_to_free_buffers() to wait
> > for the busy bh release the counter is expensive...
>
> I rediscovered your old thread about trying to do this in a launder_page
> call ;)
Yes, we thought about using launder_page() before :).
> Does it make more sense to fix do_launder_page to call into the FS on
> every page, and let the FS check for PageDirty on its own? That way
> invalidate_inode_pages2_range basically gets its own private call into
> the FS that says wait around until this page is really free.
That would certainly work as well. But IMHO waiting for ->writepage()
call to finish isn't really a big deal even in try_to_release_page() if
__GFP_FS (and __GFP_WAIT) is set. The only problem is that there is no
effective way to do so and so Hisashi used that "wait for b_count to drop"
which looks really scary and I don't like it as well.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
在 2008-08-06三的 15:53 +0200,Jan Kara写道:
> On Wed 06-08-08 09:25:13, Chris Mason wrote:
> > On Tue, 2008-08-05 at 14:17 -0700, Mingming Cao wrote:
> > > 在 2008-08-05二的 12:17 -0400,Chris Mason写道:
> > > > On Tue, 2008-08-05 at 13:51 +0900, Hisashi Hifumi wrote:
> > > > > >> >
> > > > > >> > diff -Nrup linux-2.6.27-rc1.org/fs/jbd/transaction.c
> > > > > >linux-2.6.27-rc1/fs/jbd/transaction.c
> > > > > >> > --- linux-2.6.27-rc1.org/fs/jbd/transaction.c 2008-07-29
> > > > > >19:28:47.000000000 +0900
> > > > > >> > +++ linux-2.6.27-rc1/fs/jbd/transaction.c 2008-07-29 20:40:12.000000000 +0900
> > > > > >> > @@ -1764,6 +1764,12 @@ int journal_try_to_free_buffers(journal_
> > > > > >> > */
> > > > > >> > if (ret == 0 && (gfp_mask & __GFP_WAIT) && (gfp_mask & __GFP_FS)) {
> > > > > >> > journal_wait_for_transaction_sync_data(journal);
> > > > > >> > +
> > > > > >> > + bh = head;
> > > > > >> > + do {
> > > > > >> > + while (atomic_read(&bh->b_count))
> > > > > >> > + schedule();
> > > > > >> > + } while ((bh = bh->b_this_page) != head);
> > > > > >> > ret = try_to_free_buffers(page);
> > > > > >> > }
> > > > > >>
> > > > > >> The loop is problematic. If the scheduler decides to keep running this
> > > > > >> task then we have a busy loop. If this task has realtime policy then
> > > > > >> it might even lock up the kernel.
> > > > > >>
> > > > > >
> > > > > >ocfs2 calls journal_try_to_free_buffers too, looping on b_count might
> > > > > >not be the best idea there either.
> > > > > >
> > > > > >This code gets called from releasepage, which is used other places than
> > > > > >the O_DIRECT invalidation paths, I'd be worried about performance
> > > > > >problems here.
> > > > > >
> > > > >
> > > > > try_to_release_page has gfp_mask parameter. So when try_to_releasepage
> > > > > is called from performance sensitive part, gfp_mask should not be set.
> > > > > b_count check loop is inside of (gfp_mask & __GFP_WAIT) && (gfp_mask & __GFP_FS) check.
> > > >
> > > > Looks like try_to_free_pages will go into releasepage with wait & fs
> > > > both set. This kind of change would make me very nervous.
> > > >
> > >
> > > Hi Chris,
> > >
> > > The gfp_mask try_to_free_pages() takes from it's caller will past it
> > > down to try_to_release_page(). Based on the meaning of __GFP_WAIT and
> > > GFP_FS, if the upper level caller set these two flags, I assume the
> > > upper level caller expect delay and wait for fs to finish?
> > >
> > >
> > > But I agree that using a loop in journal_try_to_free_buffers() to wait
> > > for the busy bh release the counter is expensive...
> >
> > I rediscovered your old thread about trying to do this in a launder_page
> > call ;)
> Yes, we thought about using launder_page() before :).
>
> > Does it make more sense to fix do_launder_page to call into the FS on
> > every page, and let the FS check for PageDirty on its own? That way
> > invalidate_inode_pages2_range basically gets its own private call into
> > the FS that says wait around until this page is really free.
> That would certainly work as well. But IMHO waiting for ->writepage()
> call to finish isn't really a big deal even in try_to_release_page() if
> __GFP_FS (and __GFP_WAIT) is set. The only problem is that there is no
> effective way to do so and so Hisashi used that "wait for b_count to drop"
> which looks really scary and I don't like it as well.
>
I was looking at the comment in invalidate_complete_page2(), which is
now only called from DIO path, it saids
/*
* This is like invalidate_complete_page(), except it ignores the page's
* refcount. We do this because invalidate_inode_pages2() needs
stronger
* invalidation guarantees, and cannot afford to leave pages behind
because
* shrink_page_list() has a temp ref on them, or because they're
transiently
* sitting in the lru_cache_add() pagevecs.
*/
I am wondering why we need stronger invalidate hurantees for DIO->
invalidate_inode_pages_range(),which force the page being removed from
page cache? In case of bh is busy due to ext3 writeout,
journal_try_to_free_buffers() could return different error number(EBUSY)
to try_to_releasepage() (instead of EIO). In that case, could we just
leave the page in the cache, clean pageuptodate() (to force later buffer
read to read from disk) and then invalidate_complete_page2() return
successfully? Any issue with this way?
Mingming
> Honza
On Wed, 2008-08-06 at 15:57 -0700, Mingming Cao wrote:
> >
> > > Does it make more sense to fix do_launder_page to call into the FS on
> > > every page, and let the FS check for PageDirty on its own? That way
> > > invalidate_inode_pages2_range basically gets its own private call into
> > > the FS that says wait around until this page is really free.
> > That would certainly work as well. But IMHO waiting for ->writepage()
> > call to finish isn't really a big deal even in try_to_release_page() if
> > __GFP_FS (and __GFP_WAIT) is set. The only problem is that there is no
> > effective way to do so and so Hisashi used that "wait for b_count to drop"
> > which looks really scary and I don't like it as well.
> >
>
> I was looking at the comment in invalidate_complete_page2(), which is
> now only called from DIO path, it saids
>
> /*
> * This is like invalidate_complete_page(), except it ignores the page's
> * refcount. We do this because invalidate_inode_pages2() needs
> stronger
> * invalidation guarantees, and cannot afford to leave pages behind
> because
> * shrink_page_list() has a temp ref on them, or because they're
> transiently
> * sitting in the lru_cache_add() pagevecs.
> */
>
>
> I am wondering why we need stronger invalidate hurantees for DIO->
> invalidate_inode_pages_range(),which force the page being removed from
> page cache? In case of bh is busy due to ext3 writeout,
> journal_try_to_free_buffers() could return different error number(EBUSY)
> to try_to_releasepage() (instead of EIO). In that case, could we just
> leave the page in the cache, clean pageuptodate() (to force later buffer
> read to read from disk) and then invalidate_complete_page2() return
> successfully? Any issue with this way?
Isn't this similar to the recent splice thread? It seems like there are
places in the kernel that expect an up to date page they have a
reference against to stay up to date.
-chris
At 07:57 08/08/07, Mingming Cao wrote:
>
>$Bi|%#(B 2008-08-06$Bh:2iSd(B 15:53 +0200$B~>7+(Ban Kara$BifRk!s~>S(B
>> On Wed 06-08-08
>09:25:13, Chris Mason wrote:
>> > On Tue, 2008-08-05 at 14:17 -0700, Mingming Cao wrote:
>> > > $Bi|%#(B 2008-08-05$Bh<8iSd(B 12:17 -0400$B~>7$(Bhris Mason$BifRk!s~>S(B
>> > > > On
>Tue, 2008-08-05 at 13:51 +0900, Hisashi Hifumi wrote:
>> > > > > >> >
>> > > > > >> > diff -Nrup linux-2.6.27-rc1.org/fs/jbd/transaction.c
>> > > > > >linux-2.6.27-rc1/fs/jbd/transaction.c
>> > > > > >> > --- linux-2.6.27-rc1.org/fs/jbd/transaction.c 2008-07-29
>> > > > > >19:28:47.000000000 +0900
>> > > > > >> > +++ linux-2.6.27-rc1/fs/jbd/transaction.c 2008-07-29
>20:40:12.000000000 +0900
>> > > > > >> > @@ -1764,6 +1764,12 @@ int journal_try_to_free_buffers(journal_
>> > > > > >> > */
>> > > > > >> > if (ret == 0 && (gfp_mask & __GFP_WAIT) && (gfp_mask &
>__GFP_FS)) {
>> > > > > >> > journal_wait_for_transaction_sync_data(journal);
>> > > > > >> > +
>> > > > > >> > + bh = head;
>> > > > > >> > + do {
>> > > > > >> > + while (atomic_read(&bh->b_count))
>> > > > > >> > + schedule();
>> > > > > >> > + } while ((bh = bh->b_this_page) != head);
>> > > > > >> > ret = try_to_free_buffers(page);
>> > > > > >> > }
>> > > > > >>
>> > > > > >> The loop is problematic. If the scheduler decides to keep
>running this
>> > > > > >> task then we have a busy loop. If this task has realtime
>policy then
>> > > > > >> it might even lock up the kernel.
>> > > > > >>
>> > > > > >
>> > > > > >ocfs2 calls journal_try_to_free_buffers too, looping on b_count might
>> > > > > >not be the best idea there either.
>> > > > > >
>> > > > > >This code gets called from releasepage, which is used other
>places than
>> > > > > >the O_DIRECT invalidation paths, I'd be worried about performance
>> > > > > >problems here.
>> > > > > >
>> > > > >
>> > > > > try_to_release_page has gfp_mask parameter. So when try_to_releasepage
>> > > > > is called from performance sensitive part, gfp_mask should not be set.
>> > > > > b_count check loop is inside of (gfp_mask & __GFP_WAIT) &&
>(gfp_mask & __GFP_FS) check.
>> > > >
>> > > > Looks like try_to_free_pages will go into releasepage with wait & fs
>> > > > both set. This kind of change would make me very nervous.
>> > > >
>> > >
>> > > Hi Chris,
>> > >
>> > > The gfp_mask try_to_free_pages() takes from it's caller will past it
>> > > down to try_to_release_page(). Based on the meaning of __GFP_WAIT and
>> > > GFP_FS, if the upper level caller set these two flags, I assume the
>> > > upper level caller expect delay and wait for fs to finish?
>> > >
>> > >
>> > > But I agree that using a loop in journal_try_to_free_buffers() to wait
>> > > for the busy bh release the counter is expensive...
>> >
>> > I rediscovered your old thread about trying to do this in a launder_page
>> > call ;)
>> Yes, we thought about using launder_page() before :).
>>
>> > Does it make more sense to fix do_launder_page to call into the FS on
>> > every page, and let the FS check for PageDirty on its own? That way
>> > invalidate_inode_pages2_range basically gets its own private call into
>> > the FS that says wait around until this page is really free.
>> That would certainly work as well. But IMHO waiting for ->writepage()
>> call to finish isn't really a big deal even in try_to_release_page() if
>> __GFP_FS (and __GFP_WAIT) is set. The only problem is that there is no
>> effective way to do so and so Hisashi used that "wait for b_count to drop"
>> which looks really scary and I don't like it as well.
>>
>
>I was looking at the comment in invalidate_complete_page2(), which is
>now only called from DIO path, it saids
>
>/*
> * This is like invalidate_complete_page(), except it ignores the page's
> * refcount. We do this because invalidate_inode_pages2() needs
>stronger
> * invalidation guarantees, and cannot afford to leave pages behind
>because
> * shrink_page_list() has a temp ref on them, or because they're
>transiently
> * sitting in the lru_cache_add() pagevecs.
> */
>
>
>I am wondering why we need stronger invalidate hurantees for DIO->
>invalidate_inode_pages_range(),which force the page being removed from
>page cache? In case of bh is busy due to ext3 writeout,
>journal_try_to_free_buffers() could return different error number(EBUSY)
>to try_to_releasepage() (instead of EIO). In that case, could we just
>leave the page in the cache, clean pageuptodate() (to force later buffer
>read to read from disk) and then invalidate_complete_page2() return
>successfully? Any issue with this way?
My idea is that journal_try_to_free_buffers returns EBUSY if it fails due to
bh busy, and dio write falls back to buffered write. This is easy to fix.
On Thu, 2008-08-07 at 12:15 +0900, Hisashi Hifumi wrote:
> >/*
> > * This is like invalidate_complete_page(), except it ignores the page's
> > * refcount. We do this because invalidate_inode_pages2() needs
> >stronger
> > * invalidation guarantees, and cannot afford to leave pages behind
> >because
> > * shrink_page_list() has a temp ref on them, or because they're
> >transiently
> > * sitting in the lru_cache_add() pagevecs.
> > */
> >
> >
> >I am wondering why we need stronger invalidate hurantees for DIO->
> >invalidate_inode_pages_range(),which force the page being removed from
> >page cache? In case of bh is busy due to ext3 writeout,
> >journal_try_to_free_buffers() could return different error number(EBUSY)
> >to try_to_releasepage() (instead of EIO). In that case, could we just
> >leave the page in the cache, clean pageuptodate() (to force later buffer
> >read to read from disk) and then invalidate_complete_page2() return
> >successfully? Any issue with this way?
>
> My idea is that journal_try_to_free_buffers returns EBUSY if it fails due to
> bh busy, and dio write falls back to buffered write. This is easy to fix.
>
>
What about the invalidates done after the DIO has already run
non-buffered?
-chris
At 19:21 08/08/07, Chris Mason wrote:
>On Thu, 2008-08-07 at 12:15 +0900, Hisashi Hifumi wrote:
>> >/*
>> > * This is like invalidate_complete_page(), except it ignores the page's
>> > * refcount. We do this because invalidate_inode_pages2() needs
>> >stronger
>> > * invalidation guarantees, and cannot afford to leave pages behind
>> >because
>> > * shrink_page_list() has a temp ref on them, or because they're
>> >transiently
>> > * sitting in the lru_cache_add() pagevecs.
>> > */
>> >
>> >
>> >I am wondering why we need stronger invalidate hurantees for DIO->
>> >invalidate_inode_pages_range(),which force the page being removed from
>> >page cache? In case of bh is busy due to ext3 writeout,
>> >journal_try_to_free_buffers() could return different error number(EBUSY)
>> >to try_to_releasepage() (instead of EIO). In that case, could we just
>> >leave the page in the cache, clean pageuptodate() (to force later buffer
>> >read to read from disk) and then invalidate_complete_page2() return
>> >successfully? Any issue with this way?
>>
>> My idea is that journal_try_to_free_buffers returns EBUSY if it fails due to
>> bh busy, and dio write falls back to buffered write. This is easy to fix.
>>
>>
>
>What about the invalidates done after the DIO has already run
>non-buffered?
Dio write falls back to buffered IO when writing to a hole on ext3, I think. I want to
apply this mechanism to fix this issue. When try_to_release_page fails on a page
due to bh busy, dio write does buffered write, sync_page_range, and
wait_on_page_writeback, imvalidates page cache to preserve dio semantics.
Even if page invalidation that is carried out after wait_on_page_writeback fails,
there is no inconsistency between HDD and page cache.
On Fri, 2008-08-08 at 12:28 +0900, Hisashi Hifumi wrote:
> At 19:21 08/08/07, Chris Mason wrote:
> >On Thu, 2008-08-07 at 12:15 +0900, Hisashi Hifumi wrote:
> >> >/*
> >> > * This is like invalidate_complete_page(), except it ignores the page's
> >> > * refcount. We do this because invalidate_inode_pages2() needs
> >> >stronger
> >> > * invalidation guarantees, and cannot afford to leave pages behind
> >> >because
> >> > * shrink_page_list() has a temp ref on them, or because they're
> >> >transiently
> >> > * sitting in the lru_cache_add() pagevecs.
> >> > */
> >> >
> >> >
> >> >I am wondering why we need stronger invalidate hurantees for DIO->
> >> >invalidate_inode_pages_range(),which force the page being removed from
> >> >page cache? In case of bh is busy due to ext3 writeout,
> >> >journal_try_to_free_buffers() could return different error number(EBUSY)
> >> >to try_to_releasepage() (instead of EIO). In that case, could we just
> >> >leave the page in the cache, clean pageuptodate() (to force later buffer
> >> >read to read from disk) and then invalidate_complete_page2() return
> >> >successfully? Any issue with this way?
> >>
> >> My idea is that journal_try_to_free_buffers returns EBUSY if it fails due to
> >> bh busy, and dio write falls back to buffered write. This is easy to fix.
> >>
> >>
> >
> >What about the invalidates done after the DIO has already run
> >non-buffered?
>
> Dio write falls back to buffered IO when writing to a hole on ext3, I think. I want to
> apply this mechanism to fix this issue. When try_to_release_page fails on a page
> due to bh busy, dio write does buffered write, sync_page_range, and
> wait_on_page_writeback, imvalidates page cache to preserve dio semantics.
> Even if page invalidation that is carried out after wait_on_page_writeback fails,
> there is no inconsistency between HDD and page cache.
>
Sorry, I'm sure I wasn't very clear, I was referencing this code from
mm/filemap.c:
written = mapping->a_ops->direct_IO(WRITE, iocb, iov, pos, *nr_segs);
/*
* Finally, try again to invalidate clean pages which might have been
* cached by non-direct readahead, or faulted in by get_user_pages()
* if the source of the write was an mmap'ed region of the file
* we're writing. Either one is a pretty crazy thing to do,
* so we don't support it 100%. If this invalidation
* fails, tough, the write still worked...
*/
if (mapping->nrpages) {
invalidate_inode_pages2_range(mapping,
pos >> PAGE_CACHE_SHIFT, end);
}
If this second invalidate fails during a DIO write, we'll have up to
date pages in cache that don't match the data on disk. It is unlikely
to fail because the conditions that make jbd unable to free a buffer are
rare, but it can still happen with the write combination of mmap usage.
The good news is the second invalidate doesn't make O_DIRECT return
-EIO. But, it sounds like fixing do_launder_page to always call into
the FS can fix all of these problems. Am I missing something?
-chris
>> >> >I am wondering why we need stronger invalidate hurantees for DIO->
>> >> >invalidate_inode_pages_range(),which force the page being removed from
>> >> >page cache? In case of bh is busy due to ext3 writeout,
>> >> >journal_try_to_free_buffers() could return different error number(EBUSY)
>> >> >to try_to_releasepage() (instead of EIO). In that case, could we just
>> >> >leave the page in the cache, clean pageuptodate() (to force later buffer
>> >> >read to read from disk) and then invalidate_complete_page2() return
>> >> >successfully? Any issue with this way?
>> >>
>> >> My idea is that journal_try_to_free_buffers returns EBUSY if it fails due to
>> >> bh busy, and dio write falls back to buffered write. This is easy to fix.
>> >>
>> >>
>> >
>> >What about the invalidates done after the DIO has already run
>> >non-buffered?
>>
>> Dio write falls back to buffered IO when writing to a hole on ext3, I
>think. I want to
>> apply this mechanism to fix this issue. When try_to_release_page fails on
>a page
>> due to bh busy, dio write does buffered write, sync_page_range, and
>> wait_on_page_writeback, imvalidates page cache to preserve dio semantics.
>> Even if page invalidation that is carried out after
>wait_on_page_writeback fails,
>> there is no inconsistency between HDD and page cache.
>>
>
>Sorry, I'm sure I wasn't very clear, I was referencing this code from
>mm/filemap.c:
>
> written = mapping->a_ops->direct_IO(WRITE, iocb, iov, pos, *nr_segs);
>
> /*
> * Finally, try again to invalidate clean pages which might have been
> * cached by non-direct readahead, or faulted in by get_user_pages()
> * if the source of the write was an mmap'ed region of the file
> * we're writing. Either one is a pretty crazy thing to do,
> * so we don't support it 100%. If this invalidation
> * fails, tough, the write still worked...
> */
> if (mapping->nrpages) {
> invalidate_inode_pages2_range(mapping,
> pos >> PAGE_CACHE_SHIFT, end);
> }
>
>If this second invalidate fails during a DIO write, we'll have up to
>date pages in cache that don't match the data on disk. It is unlikely
>to fail because the conditions that make jbd unable to free a buffer are
>rare, but it can still happen with the write combination of mmap usage.
>
>The good news is the second invalidate doesn't make O_DIRECT return
>-EIO. But, it sounds like fixing do_launder_page to always call into
>the FS can fix all of these problems. Am I missing something?
>
My approach is not implementing do_launder_page for ext3.
It is needed to modify VFS.
My patch is as follows:
Comments?
diff -Nrup linux-2.6.27-rc2.org/mm/filemap.c linux-2.6.27-rc2/mm/filemap.c
--- linux-2.6.27-rc2.org/mm/filemap.c 2008-08-11 14:33:23.000000000 +0900
+++ linux-2.6.27-rc2/mm/filemap.c 2008-08-11 14:57:29.000000000 +0900
@@ -2129,13 +2129,16 @@ generic_file_direct_write(struct kiocb *
* After a write we want buffered reads to be sure to go to disk to get
* the new data. We invalidate clean cached page from the region we're
* about to write. We do this *before* the write so that we can return
- * -EIO without clobbering -EIOCBQUEUED from ->direct_IO().
+ * -EBUSY without clobbering -EIOCBQUEUED from ->direct_IO().
*/
if (mapping->nrpages) {
written = invalidate_inode_pages2_range(mapping,
pos >> PAGE_CACHE_SHIFT, end);
- if (written)
+ if (written) {
+ if (written == -EBUSY)
+ written = 0;
goto out;
+ }
}
written = mapping->a_ops->direct_IO(WRITE, iocb, iov, pos, *nr_segs);
diff -Nrup linux-2.6.27-rc2.org/mm/truncate.c linux-2.6.27-rc2/mm/truncate.c
--- linux-2.6.27-rc2.org/mm/truncate.c 2008-08-11 14:33:24.000000000 +0900
+++ linux-2.6.27-rc2/mm/truncate.c 2008-08-11 14:52:03.000000000 +0900
@@ -380,7 +380,7 @@ static int do_launder_page(struct addres
* Any pages which are found to be mapped into pagetables are unmapped prior to
* invalidation.
*
- * Returns -EIO if any pages could not be invalidated.
+ * Returns -EBUSY if any pages could not be invalidated.
*/
int invalidate_inode_pages2_range(struct address_space *mapping,
pgoff_t start, pgoff_t end)
@@ -440,7 +440,7 @@ int invalidate_inode_pages2_range(struct
ret2 = do_launder_page(mapping, page);
if (ret2 == 0) {
if (!invalidate_complete_page2(mapping, page))
- ret2 = -EIO;
+ ret2 = -EBUSY;
}
if (ret2 < 0)
ret = ret2;
On Mon, 2008-08-11 at 15:25 +0900, Hisashi Hifumi wrote:
> >> >> >I am wondering why we need stronger invalidate hurantees for DIO->
> >> >> >invalidate_inode_pages_range(),which force the page being removed from
> >> >> >page cache? In case of bh is busy due to ext3 writeout,
> >> >> >journal_try_to_free_buffers() could return different error number(EBUSY)
> >> >> >to try_to_releasepage() (instead of EIO). In that case, could we just
> >> >> >leave the page in the cache, clean pageuptodate() (to force later buffer
> >> >> >read to read from disk) and then invalidate_complete_page2() return
> >> >> >successfully? Any issue with this way?
> >> >>
> >> >> My idea is that journal_try_to_free_buffers returns EBUSY if it fails due to
> >> >> bh busy, and dio write falls back to buffered write. This is easy to fix.
> >> >>
> >> >>
> >> >
> >> >What about the invalidates done after the DIO has already run
> >> >non-buffered?
> >>
> >> Dio write falls back to buffered IO when writing to a hole on ext3, I
> >think. I want to
> >> apply this mechanism to fix this issue. When try_to_release_page fails on
> >a page
> >> due to bh busy, dio write does buffered write, sync_page_range, and
> >> wait_on_page_writeback, imvalidates page cache to preserve dio semantics.
> >> Even if page invalidation that is carried out after
> >wait_on_page_writeback fails,
> >> there is no inconsistency between HDD and page cache.
> >>
> >
> >Sorry, I'm sure I wasn't very clear, I was referencing this code from
> >mm/filemap.c:
> >
> > written = mapping->a_ops->direct_IO(WRITE, iocb, iov, pos, *nr_segs);
> >
> > /*
> > * Finally, try again to invalidate clean pages which might have been
> > * cached by non-direct readahead, or faulted in by get_user_pages()
> > * if the source of the write was an mmap'ed region of the file
> > * we're writing. Either one is a pretty crazy thing to do,
> > * so we don't support it 100%. If this invalidation
> > * fails, tough, the write still worked...
> > */
> > if (mapping->nrpages) {
> > invalidate_inode_pages2_range(mapping,
> > pos >> PAGE_CACHE_SHIFT, end);
> > }
> >
> >If this second invalidate fails during a DIO write, we'll have up to
> >date pages in cache that don't match the data on disk. It is unlikely
> >to fail because the conditions that make jbd unable to free a buffer are
> >rare, but it can still happen with the write combination of mmap usage.
> >
> >The good news is the second invalidate doesn't make O_DIRECT return
> >-EIO. But, it sounds like fixing do_launder_page to always call into
> >the FS can fix all of these problems. Am I missing something?
> >
>
> My approach is not implementing do_launder_page for ext3.
> It is needed to modify VFS.
>
> My patch is as follows:
Sorry, I'm still not sure why the do_launder_page implementation is a
bad idea. Clearly Mingming spent quite some time on it in the past, but
given that it could provide a hook for the FS to do expensive operations
to make the page really go away, why not do it?
As far as I can tell, the only current users afs, nfs and fuse. Pushing
down the PageDirty check to those filesystems should be trivial.
With that said, I don't have strong feelings against falling back to
buffered IO when the invalidate fails. Maybe Zach remembers something I
don't?
-chris
>
>
> diff -Nrup linux-2.6.27-rc2.org/mm/filemap.c linux-2.6.27-rc2/mm/filemap.c
> --- linux-2.6.27-rc2.org/mm/filemap.c 2008-08-11 14:33:23.000000000 +0900
> +++ linux-2.6.27-rc2/mm/filemap.c 2008-08-11 14:57:29.000000000 +0900
> @@ -2129,13 +2129,16 @@ generic_file_direct_write(struct kiocb *
> * After a write we want buffered reads to be sure to go to disk to get
> * the new data. We invalidate clean cached page from the region we're
> * about to write. We do this *before* the write so that we can return
> - * -EIO without clobbering -EIOCBQUEUED from ->direct_IO().
> + * -EBUSY without clobbering -EIOCBQUEUED from ->direct_IO().
> */
> if (mapping->nrpages) {
> written = invalidate_inode_pages2_range(mapping,
> pos >> PAGE_CACHE_SHIFT, end);
> - if (written)
> + if (written) {
> + if (written == -EBUSY)
> + written = 0;
> goto out;
> + }
> }
>
> written = mapping->a_ops->direct_IO(WRITE, iocb, iov, pos, *nr_segs);
> diff -Nrup linux-2.6.27-rc2.org/mm/truncate.c linux-2.6.27-rc2/mm/truncate.c
> --- linux-2.6.27-rc2.org/mm/truncate.c 2008-08-11 14:33:24.000000000 +0900
> +++ linux-2.6.27-rc2/mm/truncate.c 2008-08-11 14:52:03.000000000 +0900
> @@ -380,7 +380,7 @@ static int do_launder_page(struct addres
> * Any pages which are found to be mapped into pagetables are unmapped prior to
> * invalidation.
> *
> - * Returns -EIO if any pages could not be invalidated.
> + * Returns -EBUSY if any pages could not be invalidated.
> */
> int invalidate_inode_pages2_range(struct address_space *mapping,
> pgoff_t start, pgoff_t end)
> @@ -440,7 +440,7 @@ int invalidate_inode_pages2_range(struct
> ret2 = do_launder_page(mapping, page);
> if (ret2 == 0) {
> if (!invalidate_complete_page2(mapping, page))
> - ret2 = -EIO;
> + ret2 = -EBUSY;
> }
> if (ret2 < 0)
> ret = ret2;
>
>
>
>
> With that said, I don't have strong feelings against falling back to
> buffered IO when the invalidate fails. Maybe Zach remembers something I
> don't?
Nothing surprising is immediately coming to mind, no. I don't have
strong feelings against adding this case to the list of criteria that
causes it to fall back to buffered.
- z
在 2008-08-12二的 09:28 -0400,Chris Mason写道:
> On Mon, 2008-08-11 at 15:25 +0900, Hisashi Hifumi wrote:
> > >> >> >I am wondering why we need stronger invalidate hurantees for DIO->
> > >> >> >invalidate_inode_pages_range(),which force the page being removed from
> > >> >> >page cache? In case of bh is busy due to ext3 writeout,
> > >> >> >journal_try_to_free_buffers() could return different error number(EBUSY)
> > >> >> >to try_to_releasepage() (instead of EIO). In that case, could we just
> > >> >> >leave the page in the cache, clean pageuptodate() (to force later buffer
> > >> >> >read to read from disk) and then invalidate_complete_page2() return
> > >> >> >successfully? Any issue with this way?
> > >> >>
> > >> >> My idea is that journal_try_to_free_buffers returns EBUSY if it fails due to
> > >> >> bh busy, and dio write falls back to buffered write. This is easy to fix.
> > >> >>
> > >> >>
> > >> >
> > >> >What about the invalidates done after the DIO has already run
> > >> >non-buffered?
> > >>
> > >> Dio write falls back to buffered IO when writing to a hole on ext3, I
> > >think. I want to
> > >> apply this mechanism to fix this issue. When try_to_release_page fails on
> > >a page
> > >> due to bh busy, dio write does buffered write, sync_page_range, and
> > >> wait_on_page_writeback, imvalidates page cache to preserve dio semantics.
> > >> Even if page invalidation that is carried out after
> > >wait_on_page_writeback fails,
> > >> there is no inconsistency between HDD and page cache.
> > >>
> > >
> > >Sorry, I'm sure I wasn't very clear, I was referencing this code from
> > >mm/filemap.c:
> > >
> > > written = mapping->a_ops->direct_IO(WRITE, iocb, iov, pos, *nr_segs);
> > >
> > > /*
> > > * Finally, try again to invalidate clean pages which might have been
> > > * cached by non-direct readahead, or faulted in by get_user_pages()
> > > * if the source of the write was an mmap'ed region of the file
> > > * we're writing. Either one is a pretty crazy thing to do,
> > > * so we don't support it 100%. If this invalidation
> > > * fails, tough, the write still worked...
> > > */
> > > if (mapping->nrpages) {
> > > invalidate_inode_pages2_range(mapping,
> > > pos >> PAGE_CACHE_SHIFT, end);
> > > }
> > >
> > >If this second invalidate fails during a DIO write, we'll have up to
> > >date pages in cache that don't match the data on disk. It is unlikely
> > >to fail because the conditions that make jbd unable to free a buffer are
> > >rare, but it can still happen with the write combination of mmap usage.
> > >
> > >The good news is the second invalidate doesn't make O_DIRECT return
> > >-EIO. But, it sounds like fixing do_launder_page to always call into
> > >the FS can fix all of these problems. Am I missing something?
> > >
> >
> > My approach is not implementing do_launder_page for ext3.
> > It is needed to modify VFS.
> >
> > My patch is as follows:
>
> Sorry, I'm still not sure why the do_launder_page implementation is a
> bad idea. Clearly Mingming spent quite some time on it in the past, but
> given that it could provide a hook for the FS to do expensive operations
> to make the page really go away, why not do it?
>
> As far as I can tell, the only current users afs, nfs and fuse. Pushing
> down the PageDirty check to those filesystems should be trivial.
>
>
I thought about your suggestion before, there should be no problem to
push down the pagedirty check to underlying fs.
My concern is even if we wait for page writeback cleared (from
ext3_ordered_writepage() )in the launder_page() , (which the wait
actually already done in previous DIO ->filemap_write_wait()),
ext3_ordered_writepage() still possibly hold the ref to the bh and
later journal_try_to_free_buffers() could still fail due to that.
> ->ext3_ordered_writepage()
> walk_page_buffers() <- take a bh ref
> block_write_full_page() <- unlock_page
> : <- end_page_writeback
> : <- race! (dio write->try_to_release_page fails)
here is the window.
> walk_page_buffers() <-release a bh ref
And we need someway to notify DIO code from ext3_ordered_writepage() to
indicating they are done with those buffers. That's the hard way, as Jan
mentioned.
> With that said, I don't have strong feelings against falling back to
> buffered IO when the invalidate fails.
>
>
It seems a little odd that we have to back to buffered IO in this case.
The pages are all flushed, DIO just want to make sure the
journaltransactions who still keep those buffers are removed from their
list. It did that, the only reason to keep DIO fail is someone else
hasn't release the bh.
Current code enforce all the buffers have to be freed and pages are
removed from page cache, in order to force later read are from disk. I
am not sure why can't we just leave the page in the cache, just clear it
uptodate flag, without reduce the page ref count? I think DIO should
proceed it's IO in this case...
At 05:06 08/08/13, Mingming Cao wrote:
>> With that said, I don't have strong feelings against falling back to
>> buffered IO when the invalidate fails.
>>
>>
>It seems a little odd that we have to back to buffered IO in this case.
>The pages are all flushed, DIO just want to make sure the
>journaltransactions who still keep those buffers are removed from their
>list. It did that, the only reason to keep DIO fail is someone else
>hasn't release the bh.
>
>Current code enforce all the buffers have to be freed and pages are
>removed from page cache, in order to force later read are from disk. I
>am not sure why can't we just leave the page in the cache, just clear it
>uptodate flag, without reduce the page ref count? I think DIO should
>proceed it's IO in this case...
Like this?
Following patch just clears PG_uptodate when try_to_release_page fails.
diff -Nrup linux-2.6.27-rc3.org/mm/truncate.c linux-2.6.27-rc3/mm/truncate.c
--- linux-2.6.27-rc3.org/mm/truncate.c 2008-08-13 13:48:48.000000000 +0900
+++ linux-2.6.27-rc3/mm/truncate.c 2008-08-13 14:24:47.000000000 +0900
@@ -345,8 +345,10 @@ invalidate_complete_page2(struct address
if (page->mapping != mapping)
return 0;
- if (PagePrivate(page) && !try_to_release_page(page, GFP_KERNEL))
- return 0;
+ if (PagePrivate(page) && !try_to_release_page(page, GFP_KERNEL)) {
+ ClearPageUptodate(page);
+ return 1;
+ }
spin_lock_irq(&mapping->tree_lock);
if (PageDirty(page))
But this patch may cause some issue.
See,
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=84209e02de48d72289650cc5a7ae8dd18223620f;hp=2b12a4c524812fb3f6ee590a02e65b95c8c32229
ClearPageUptodate in truncate.c was removed at 2.6.27-rc2 because this triggers some
problems around NFS,FUSE,madvice.
On Tue 12-08-08 09:28:26, Chris Mason wrote:
> On Mon, 2008-08-11 at 15:25 +0900, Hisashi Hifumi wrote:
> > >> >> >I am wondering why we need stronger invalidate hurantees for DIO->
> > >> >> >invalidate_inode_pages_range(),which force the page being removed from
> > >> >> >page cache? In case of bh is busy due to ext3 writeout,
> > >> >> >journal_try_to_free_buffers() could return different error number(EBUSY)
> > >> >> >to try_to_releasepage() (instead of EIO). In that case, could we just
> > >> >> >leave the page in the cache, clean pageuptodate() (to force later buffer
> > >> >> >read to read from disk) and then invalidate_complete_page2() return
> > >> >> >successfully? Any issue with this way?
> > >> >>
> > >> >> My idea is that journal_try_to_free_buffers returns EBUSY if it fails due to
> > >> >> bh busy, and dio write falls back to buffered write. This is easy to fix.
> > >> >>
> > >> >>
> > >> >
> > >> >What about the invalidates done after the DIO has already run
> > >> >non-buffered?
> > >>
> > >> Dio write falls back to buffered IO when writing to a hole on ext3, I
> > >think. I want to
> > >> apply this mechanism to fix this issue. When try_to_release_page fails on
> > >a page
> > >> due to bh busy, dio write does buffered write, sync_page_range, and
> > >> wait_on_page_writeback, imvalidates page cache to preserve dio semantics.
> > >> Even if page invalidation that is carried out after
> > >wait_on_page_writeback fails,
> > >> there is no inconsistency between HDD and page cache.
> > >>
> > >
> > >Sorry, I'm sure I wasn't very clear, I was referencing this code from
> > >mm/filemap.c:
> > >
> > > written = mapping->a_ops->direct_IO(WRITE, iocb, iov, pos, *nr_segs);
> > >
> > > /*
> > > * Finally, try again to invalidate clean pages which might have been
> > > * cached by non-direct readahead, or faulted in by get_user_pages()
> > > * if the source of the write was an mmap'ed region of the file
> > > * we're writing. Either one is a pretty crazy thing to do,
> > > * so we don't support it 100%. If this invalidation
> > > * fails, tough, the write still worked...
> > > */
> > > if (mapping->nrpages) {
> > > invalidate_inode_pages2_range(mapping,
> > > pos >> PAGE_CACHE_SHIFT, end);
> > > }
> > >
> > >If this second invalidate fails during a DIO write, we'll have up to
> > >date pages in cache that don't match the data on disk. It is unlikely
> > >to fail because the conditions that make jbd unable to free a buffer are
> > >rare, but it can still happen with the write combination of mmap usage.
> > >
> > >The good news is the second invalidate doesn't make O_DIRECT return
> > >-EIO. But, it sounds like fixing do_launder_page to always call into
> > >the FS can fix all of these problems. Am I missing something?
> > >
> >
> > My approach is not implementing do_launder_page for ext3.
> > It is needed to modify VFS.
> >
> > My patch is as follows:
>
> Sorry, I'm still not sure why the do_launder_page implementation is a
> bad idea. Clearly Mingming spent quite some time on it in the past, but
> given that it could provide a hook for the FS to do expensive operations
> to make the page really go away, why not do it?
I don't see any harm in doing this either. Only that launder_page() name
stops being appropriate after such change (because laundering means making
a clean page from a dirty one) hence the function was originally intended
for a different purpose AFAICT. Another thing is that using launder_page()
does not solve the problem with second invalidate_inode_pages2() failing
(mmap can still instantiate the page again before DIO write starts, modify
the page and commit code / writepage makes it busy during the invalidate) -
but the comment above that call seems to suggest that this is a known
problem and I agree that if somebody mixes mmapped and DIO writes,
he deserves unexpected results.
> As far as I can tell, the only current users afs, nfs and fuse. Pushing
> down the PageDirty check to those filesystems should be trivial.
Yes, not a big deal.
> With that said, I don't have strong feelings against falling back to
> buffered IO when the invalidate fails. Maybe Zach remembers something I
> don't?
I don't have a strong opinion either. Falling back to buffered writes is
simpler at least for ext3/ext4 because properly synchronizing against
writepage() call does not seem to have a nice solution either in
do_launder_page() or in releasepage(). OTOH is hides the fact the invalidate
is failing and so if we screw up something in future and it fails often, it
might be hard to notice / track down the performance penalty.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Tue 12-08-08 13:06:39, Mingming Cao wrote:
> 在 2008-08-12二的 09:28 -0400,Chris Mason写道:
> > On Mon, 2008-08-11 at 15:25 +0900, Hisashi Hifumi wrote:
> > > >> >> >I am wondering why we need stronger invalidate hurantees for DIO->
> > > >> >> >invalidate_inode_pages_range(),which force the page being removed from
> > > >> >> >page cache? In case of bh is busy due to ext3 writeout,
> > > >> >> >journal_try_to_free_buffers() could return different error number(EBUSY)
> > > >> >> >to try_to_releasepage() (instead of EIO). In that case, could we just
> > > >> >> >leave the page in the cache, clean pageuptodate() (to force later buffer
> > > >> >> >read to read from disk) and then invalidate_complete_page2() return
> > > >> >> >successfully? Any issue with this way?
> > > >> >>
> > > >> >> My idea is that journal_try_to_free_buffers returns EBUSY if it fails due to
> > > >> >> bh busy, and dio write falls back to buffered write. This is easy to fix.
> > > >> >>
> > > >> >>
> > > >> >
> > > >> >What about the invalidates done after the DIO has already run
> > > >> >non-buffered?
> > > >>
> > > >> Dio write falls back to buffered IO when writing to a hole on ext3, I
> > > >think. I want to
> > > >> apply this mechanism to fix this issue. When try_to_release_page fails on
> > > >a page
> > > >> due to bh busy, dio write does buffered write, sync_page_range, and
> > > >> wait_on_page_writeback, imvalidates page cache to preserve dio semantics.
> > > >> Even if page invalidation that is carried out after
> > > >wait_on_page_writeback fails,
> > > >> there is no inconsistency between HDD and page cache.
> > > >>
> > > >
> > > >Sorry, I'm sure I wasn't very clear, I was referencing this code from
> > > >mm/filemap.c:
> > > >
> > > > written = mapping->a_ops->direct_IO(WRITE, iocb, iov, pos, *nr_segs);
> > > >
> > > > /*
> > > > * Finally, try again to invalidate clean pages which might have been
> > > > * cached by non-direct readahead, or faulted in by get_user_pages()
> > > > * if the source of the write was an mmap'ed region of the file
> > > > * we're writing. Either one is a pretty crazy thing to do,
> > > > * so we don't support it 100%. If this invalidation
> > > > * fails, tough, the write still worked...
> > > > */
> > > > if (mapping->nrpages) {
> > > > invalidate_inode_pages2_range(mapping,
> > > > pos >> PAGE_CACHE_SHIFT, end);
> > > > }
> > > >
> > > >If this second invalidate fails during a DIO write, we'll have up to
> > > >date pages in cache that don't match the data on disk. It is unlikely
> > > >to fail because the conditions that make jbd unable to free a buffer are
> > > >rare, but it can still happen with the write combination of mmap usage.
> > > >
> > > >The good news is the second invalidate doesn't make O_DIRECT return
> > > >-EIO. But, it sounds like fixing do_launder_page to always call into
> > > >the FS can fix all of these problems. Am I missing something?
> > > >
> > >
> > > My approach is not implementing do_launder_page for ext3.
> > > It is needed to modify VFS.
> > >
> > > My patch is as follows:
> >
> > Sorry, I'm still not sure why the do_launder_page implementation is a
> > bad idea. Clearly Mingming spent quite some time on it in the past, but
> > given that it could provide a hook for the FS to do expensive operations
> > to make the page really go away, why not do it?
> >
>
> > As far as I can tell, the only current users afs, nfs and fuse. Pushing
> > down the PageDirty check to those filesystems should be trivial.
> >
> >
>
> I thought about your suggestion before, there should be no problem to
> push down the pagedirty check to underlying fs.
>
> My concern is even if we wait for page writeback cleared (from
> ext3_ordered_writepage() ) in the launder_page(), (which the wait
> actually already done in previous DIO ->filemap_write_wait()),
> ext3_ordered_writepage() still possibly hold the ref to the bh and
> later journal_try_to_free_buffers() could still fail due to that.
Yes, how to properly wait for writepage() to finish is a different matter
and doing it launder_page() does not help. The only thing is that in
launder_page() we can do more expensive things because it is going to be
called only before DIO, not for ordinary page freeing on memory pressure.
> > ->ext3_ordered_writepage()
> > walk_page_buffers() <- take a bh ref
> > block_write_full_page() <- unlock_page
> > : <- end_page_writeback
> > : <- race! (dio write->try_to_release_page fails)
>
> here is the window.
> > walk_page_buffers() <-release a bh ref
>
> And we need someway to notify DIO code from ext3_ordered_writepage() to
> indicating they are done with those buffers. That's the hard way, as Jan
> mentioned.
Well, we can always introduce something like a per-sb waitqueue where
processes waiting for references to some buffer to be released would dwell.
We would wakeup processes in this queue after writepage drops all it's
references, we could even use the same mechanism for waiting till commit
code releases those references... But returning EBUSY and falling back to
buffered writes is definitely easier to do (modulo what I wrote to Chris
about hiding possible problems).
> > With that said, I don't have strong feelings against falling back to
> > buffered IO when the invalidate fails.
>
> It seems a little odd that we have to back to buffered IO in this case.
> The pages are all flushed, DIO just want to make sure the
> journaltransactions who still keep those buffers are removed from their
> list. It did that, the only reason to keep DIO fail is someone else
> hasn't release the bh.
>
> Current code enforce all the buffers have to be freed and pages are
> removed from page cache, in order to force later read are from disk. I
> am not sure why can't we just leave the page in the cache, just clear it
> uptodate flag, without reduce the page ref count? I think DIO should
> proceed it's IO in this case...
The problem with clearing page uptodate is described in commit
84209e02de48d72289650cc5a7ae8dd18223620f. The page may be currently in the
pipe and clearing the uptodate bit under it makes them unhappy (returning
errors or so). So either one has to change the pipe handling or we have to
cope without clearing page uptodate bit.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed, 2008-08-13 at 12:16 +0200, Jan Kara wrote:
> > With that said, I don't have strong feelings against falling back to
> > buffered IO when the invalidate fails. Maybe Zach remembers something I
> > don't?
> I don't have a strong opinion either. Falling back to buffered writes is
> simpler at least for ext3/ext4 because properly synchronizing against
> writepage() call does not seem to have a nice solution either in
> do_launder_page() or in releasepage(). OTOH is hides the fact the invalidate
> is failing and so if we screw up something in future and it fails often, it
> might be hard to notice / track down the performance penalty.
In general, these races don't happen often, and when they do it is
because someone is mixing page cache and O_DIRECT io to the same file.
That is explicitly outside the main use case of O_DIRECT.
So, I'd rather see us slow down O_DIRECT in the mixed use case than have
big impacts in complexity or speed to other parts of the kernel. If
falling back avoids problems in some filesystems or avoids clearing the
uptodate bit unexpectedly, I'd much rather take the fallback patch.
-chris
At 21:59 08/08/13, Chris Mason wrote:
>On Wed, 2008-08-13 at 12:16 +0200, Jan Kara wrote:
>
>> > With that said, I don't have strong feelings against falling back to
>> > buffered IO when the invalidate fails. Maybe Zach remembers something I
>> > don't?
>> I don't have a strong opinion either. Falling back to buffered writes is
>> simpler at least for ext3/ext4 because properly synchronizing against
>> writepage() call does not seem to have a nice solution either in
>> do_launder_page() or in releasepage(). OTOH is hides the fact the invalidate
>> is failing and so if we screw up something in future and it fails often, it
>> might be hard to notice / track down the performance penalty.
>
>In general, these races don't happen often, and when they do it is
>because someone is mixing page cache and O_DIRECT io to the same file.
>That is explicitly outside the main use case of O_DIRECT.
>
>So, I'd rather see us slow down O_DIRECT in the mixed use case than have
>big impacts in complexity or speed to other parts of the kernel. If
>falling back avoids problems in some filesystems or avoids clearing the
>uptodate bit unexpectedly, I'd much rather take the fallback patch.
>
>-chris
Hi Andrew.
I think we don't have strong feelings against falling back to buffered writes to
fix the direct-io -EIO problem.
Please review my patch.
Signed-off-by: Hisashi Hifumi <[email protected]>
diff -Nrup linux-2.6.27-rc3.org/mm/filemap.c linux-2.6.27-rc3/mm/filemap.c
--- linux-2.6.27-rc3.org/mm/filemap.c 2008-08-13 13:48:47.000000000 +0900
+++ linux-2.6.27-rc3/mm/filemap.c 2008-08-19 15:45:31.000000000 +0900
@@ -2129,13 +2129,20 @@ generic_file_direct_write(struct kiocb *
* After a write we want buffered reads to be sure to go to disk to get
* the new data. We invalidate clean cached page from the region we're
* about to write. We do this *before* the write so that we can return
- * -EIO without clobbering -EIOCBQUEUED from ->direct_IO().
+ * without clobbering -EIOCBQUEUED from ->direct_IO().
*/
if (mapping->nrpages) {
written = invalidate_inode_pages2_range(mapping,
pos >> PAGE_CACHE_SHIFT, end);
- if (written)
+ /*
+ * If a page can not be invalidated, return 0 to fall back
+ * to buffered write.
+ */
+ if (written) {
+ if (written == -EBUSY)
+ return 0;
goto out;
+ }
}
written = mapping->a_ops->direct_IO(WRITE, iocb, iov, pos, *nr_segs);
diff -Nrup linux-2.6.27-rc3.org/mm/truncate.c linux-2.6.27-rc3/mm/truncate.c
--- linux-2.6.27-rc3.org/mm/truncate.c 2008-08-13 13:48:48.000000000 +0900
+++ linux-2.6.27-rc3/mm/truncate.c 2008-08-19 12:10:46.000000000 +0900
@@ -380,7 +380,7 @@ static int do_launder_page(struct addres
* Any pages which are found to be mapped into pagetables are unmapped prior to
* invalidation.
*
- * Returns -EIO if any pages could not be invalidated.
+ * Returns -EBUSY if any pages could not be invalidated.
*/
int invalidate_inode_pages2_range(struct address_space *mapping,
pgoff_t start, pgoff_t end)
@@ -440,7 +440,7 @@ int invalidate_inode_pages2_range(struct
ret2 = do_launder_page(mapping, page);
if (ret2 == 0) {
if (!invalidate_complete_page2(mapping, page))
- ret2 = -EIO;
+ ret2 = -EBUSY;
}
if (ret2 < 0)
ret = ret2;
On Tue, 19 Aug 2008 16:03:45 +0900 Hisashi Hifumi <[email protected]> wrote:
>
> At 21:59 08/08/13, Chris Mason wrote:
> >On Wed, 2008-08-13 at 12:16 +0200, Jan Kara wrote:
> >
> >> > With that said, I don't have strong feelings against falling back to
> >> > buffered IO when the invalidate fails. Maybe Zach remembers something I
> >> > don't?
> >> I don't have a strong opinion either. Falling back to buffered writes is
> >> simpler at least for ext3/ext4 because properly synchronizing against
> >> writepage() call does not seem to have a nice solution either in
> >> do_launder_page() or in releasepage(). OTOH is hides the fact the invalidate
> >> is failing and so if we screw up something in future and it fails often, it
> >> might be hard to notice / track down the performance penalty.
> >
> >In general, these races don't happen often, and when they do it is
> >because someone is mixing page cache and O_DIRECT io to the same file.
> >That is explicitly outside the main use case of O_DIRECT.
> >
> >So, I'd rather see us slow down O_DIRECT in the mixed use case than have
> >big impacts in complexity or speed to other parts of the kernel. If
> >falling back avoids problems in some filesystems or avoids clearing the
> >uptodate bit unexpectedly, I'd much rather take the fallback patch.
> >
> >-chris
>
> Hi Andrew.
> I think we don't have strong feelings against falling back to buffered writes to
> fix the direct-io -EIO problem.
>
> Please review my patch.
>
umm, what problem does it solve?
>
> diff -Nrup linux-2.6.27-rc3.org/mm/filemap.c linux-2.6.27-rc3/mm/filemap.c
> --- linux-2.6.27-rc3.org/mm/filemap.c 2008-08-13 13:48:47.000000000 +0900
> +++ linux-2.6.27-rc3/mm/filemap.c 2008-08-19 15:45:31.000000000 +0900
> @@ -2129,13 +2129,20 @@ generic_file_direct_write(struct kiocb *
> * After a write we want buffered reads to be sure to go to disk to get
> * the new data. We invalidate clean cached page from the region we're
> * about to write. We do this *before* the write so that we can return
> - * -EIO without clobbering -EIOCBQUEUED from ->direct_IO().
> + * without clobbering -EIOCBQUEUED from ->direct_IO().
> */
> if (mapping->nrpages) {
> written = invalidate_inode_pages2_range(mapping,
> pos >> PAGE_CACHE_SHIFT, end);
> - if (written)
> + /*
> + * If a page can not be invalidated, return 0 to fall back
> + * to buffered write.
> + */
> + if (written) {
> + if (written == -EBUSY)
> + return 0;
> goto out;
> + }
> }
>
> written = mapping->a_ops->direct_IO(WRITE, iocb, iov, pos, *nr_segs);
> diff -Nrup linux-2.6.27-rc3.org/mm/truncate.c linux-2.6.27-rc3/mm/truncate.c
> --- linux-2.6.27-rc3.org/mm/truncate.c 2008-08-13 13:48:48.000000000 +0900
> +++ linux-2.6.27-rc3/mm/truncate.c 2008-08-19 12:10:46.000000000 +0900
> @@ -380,7 +380,7 @@ static int do_launder_page(struct addres
> * Any pages which are found to be mapped into pagetables are unmapped prior to
> * invalidation.
> *
> - * Returns -EIO if any pages could not be invalidated.
> + * Returns -EBUSY if any pages could not be invalidated.
> */
> int invalidate_inode_pages2_range(struct address_space *mapping,
> pgoff_t start, pgoff_t end)
> @@ -440,7 +440,7 @@ int invalidate_inode_pages2_range(struct
> ret2 = do_launder_page(mapping, page);
> if (ret2 == 0) {
> if (!invalidate_complete_page2(mapping, page))
> - ret2 = -EIO;
> + ret2 = -EBUSY;
> }
> if (ret2 < 0)
> ret = ret2;
If I recall correctly, we had a problem with pages which are pinned by
an ext3 transaction, and those pages weren't releaseable for direct-io,
and this caused some problem?
I think falling back to buffered writes is always a safe course, but
it'd be nice to have a full description of the change, please.
At 16:16 08/08/19, Andrew Morton wrote:
>On Tue, 19 Aug 2008 16:03:45 +0900 Hisashi Hifumi
><[email protected]> wrote:
>
>>
>> At 21:59 08/08/13, Chris Mason wrote:
>> >On Wed, 2008-08-13 at 12:16 +0200, Jan Kara wrote:
>> >
>> >> > With that said, I don't have strong feelings against falling back to
>> >> > buffered IO when the invalidate fails. Maybe Zach remembers something I
>> >> > don't?
>> >> I don't have a strong opinion either. Falling back to buffered writes is
>> >> simpler at least for ext3/ext4 because properly synchronizing against
>> >> writepage() call does not seem to have a nice solution either in
>> >> do_launder_page() or in releasepage(). OTOH is hides the fact the invalidate
>> >> is failing and so if we screw up something in future and it fails often, it
>> >> might be hard to notice / track down the performance penalty.
>> >
>> >In general, these races don't happen often, and when they do it is
>> >because someone is mixing page cache and O_DIRECT io to the same file.
>> >That is explicitly outside the main use case of O_DIRECT.
>> >
>> >So, I'd rather see us slow down O_DIRECT in the mixed use case than have
>> >big impacts in complexity or speed to other parts of the kernel. If
>> >falling back avoids problems in some filesystems or avoids clearing the
>> >uptodate bit unexpectedly, I'd much rather take the fallback patch.
>> >
>> >-chris
>>
>> Hi Andrew.
>> I think we don't have strong feelings against falling back to buffered
>writes to
>> fix the direct-io -EIO problem.
>>
>> Please review my patch.
>>
>
>umm, what problem does it solve?
>
>If I recall correctly, we had a problem with pages which are pinned by
>an ext3 transaction, and those pages weren't releaseable for direct-io,
>and this caused some problem?
Sorry, I should describe about this problem.
Yes, Dio write returns EIO when try_to_release_page fails because sometimes
bh is still referenced by jbd or other place.
The race between freeing buffer and committing transaction(jbd) was fixed
but I found another race. We have been discussing about this issue, and
I proposed that falling back to buffered writes to fix this issue.
I think we don't have strong feelings against falling back to buffered
writes to fix the direct-io -EIO problem.
>
>I think falling back to buffered writes is always a safe course, but
>it'd be nice to have a full description of the change, please.
[PATCH] VFS: fix dio write returning EIO when try_to_release_page fails
Dio write returns EIO when try_to_release_page fails because bh is
still referenced.
The patch
"commit 3f31fddfa26b7594b44ff2b34f9a04ba409e0f91
Author: Mingming Cao <[email protected]>
Date: Fri Jul 25 01:46:22 2008 -0700
jbd: fix race between free buffer and commit transaction
"
was merged into 2.6.27-rc1, but I noticed that this patch is not enough
to fix the race.
I did fsstress test heavily to 2.6.27-rc1, and found that dio write still
sometimes got EIO through this test.
The patch above fixed race between freeing buffer(dio) and committing
transaction(jbd) but I discovered that there is another race,
freeing buffer(dio) and ext3/4_ordered_writepage.
: background_writeout()
->write_cache_pages()
->ext3_ordered_writepage()
walk_page_buffers() -> take a bh ref
block_write_full_page() -> unlock_page
: <- end_page_writeback
: <- race! (dio write->try_to_release_page fails)
walk_page_buffers() ->release a bh ref
ext3_ordered_writepage holds bh ref and does unlock_page remaining
taking a bh ref, so this causes the race and failure of
try_to_release_page.
To fix this race, I used the approach of falling back to buffered writes
if try_to_release_page fails on a page.
Signed-off-by: Hisashi Hifumi <[email protected]>
diff -Nrup linux-2.6.27-rc3.org/mm/filemap.c linux-2.6.27-rc3/mm/filemap.c
--- linux-2.6.27-rc3.org/mm/filemap.c 2008-08-13 13:48:47.000000000 +0900
+++ linux-2.6.27-rc3/mm/filemap.c 2008-08-19 15:45:31.000000000 +0900
@@ -2129,13 +2129,20 @@ generic_file_direct_write(struct kiocb *
* After a write we want buffered reads to be sure to go to disk to get
* the new data. We invalidate clean cached page from the region we're
* about to write. We do this *before* the write so that we can return
- * -EIO without clobbering -EIOCBQUEUED from ->direct_IO().
+ * without clobbering -EIOCBQUEUED from ->direct_IO().
*/
if (mapping->nrpages) {
written = invalidate_inode_pages2_range(mapping,
pos >> PAGE_CACHE_SHIFT, end);
- if (written)
+ /*
+ * If a page can not be invalidated, return 0 to fall back
+ * to buffered write.
+ */
+ if (written) {
+ if (written == -EBUSY)
+ return 0;
goto out;
+ }
}
written = mapping->a_ops->direct_IO(WRITE, iocb, iov, pos, *nr_segs);
diff -Nrup linux-2.6.27-rc3.org/mm/truncate.c linux-2.6.27-rc3/mm/truncate.c
--- linux-2.6.27-rc3.org/mm/truncate.c 2008-08-13 13:48:48.000000000 +0900
+++ linux-2.6.27-rc3/mm/truncate.c 2008-08-19 12:10:46.000000000 +0900
@@ -380,7 +380,7 @@ static int do_launder_page(struct addres
* Any pages which are found to be mapped into pagetables are unmapped prior to
* invalidation.
*
- * Returns -EIO if any pages could not be invalidated.
+ * Returns -EBUSY if any pages could not be invalidated.
*/
int invalidate_inode_pages2_range(struct address_space *mapping,
pgoff_t start, pgoff_t end)
@@ -440,7 +440,7 @@ int invalidate_inode_pages2_range(struct
ret2 = do_launder_page(mapping, page);
if (ret2 == 0) {
if (!invalidate_complete_page2(mapping, page))
- ret2 = -EIO;
+ ret2 = -EBUSY;
}
if (ret2 < 0)
ret = ret2;
At 16:16 08/08/19, Andrew Morton wrote:
>On Tue, 19 Aug 2008 16:03:45 +0900 Hisashi Hifumi
><[email protected]> wrote:
>
>>
>> At 21:59 08/08/13, Chris Mason wrote:
>> >On Wed, 2008-08-13 at 12:16 +0200, Jan Kara wrote:
>> >
>> >> > With that said, I don't have strong feelings against falling back to
>> >> > buffered IO when the invalidate fails. Maybe Zach remembers something I
>> >> > don't?
>> >> I don't have a strong opinion either. Falling back to buffered writes is
>> >> simpler at least for ext3/ext4 because properly synchronizing against
>> >> writepage() call does not seem to have a nice solution either in
>> >> do_launder_page() or in releasepage(). OTOH is hides the fact the invalidate
>> >> is failing and so if we screw up something in future and it fails often, it
>> >> might be hard to notice / track down the performance penalty.
>> >
>> >In general, these races don't happen often, and when they do it is
>> >because someone is mixing page cache and O_DIRECT io to the same file.
>> >That is explicitly outside the main use case of O_DIRECT.
>> >
>> >So, I'd rather see us slow down O_DIRECT in the mixed use case than have
>> >big impacts in complexity or speed to other parts of the kernel. If
>> >falling back avoids problems in some filesystems or avoids clearing the
>> >uptodate bit unexpectedly, I'd much rather take the fallback patch.
>> >
>> >-chris
>>
>> Hi Andrew.
>> I think we don't have strong feelings against falling back to buffered
>writes to
>> fix the direct-io -EIO problem.
>>
>> Please review my patch.
>>
>
>umm, what problem does it solve?
>
>If I recall correctly, we had a problem with pages which are pinned by
>an ext3 transaction, and those pages weren't releaseable for direct-io,
>and this caused some problem?
Hi Andrew.
Sorry, I should describe about this problem.
Yes, Dio write returns EIO when try_to_release_page fails because sometimes
bh is still referenced by jbd or other place.
The race between freeing buffer and committing transaction(jbd) was fixed
but I found another race. We have been discussing about this issue, and
I proposed that falling back to buffered writes to fix this issue.
I think we don't have strong feelings against falling back to buffered
writes to fix the direct-io -EIO problem.
>
>I think falling back to buffered writes is always a safe course, but
>it'd be nice to have a full description of the change, please.
[PATCH] VFS: fix dio write returning EIO when try_to_release_page fails
Dio write returns EIO when try_to_release_page fails because bh is
still referenced.
The patch
"commit 3f31fddfa26b7594b44ff2b34f9a04ba409e0f91
Author: Mingming Cao <[email protected]>
Date: Fri Jul 25 01:46:22 2008 -0700
jbd: fix race between free buffer and commit transaction
"
was merged into 2.6.27-rc1, but I noticed that this patch is not enough
to fix the race.
I did fsstress test heavily to 2.6.27-rc1, and found that dio write still
sometimes got EIO through this test.
The patch above fixed race between freeing buffer(dio) and committing
transaction(jbd) but I discovered that there is another race,
freeing buffer(dio) and ext3/4_ordered_writepage.
: background_writeout()
->write_cache_pages()
->ext3_ordered_writepage()
walk_page_buffers() -> take a bh ref
block_write_full_page() -> unlock_page
: <- end_page_writeback
: <- race! (dio write->try_to_release_page fails)
walk_page_buffers() ->release a bh ref
ext3_ordered_writepage holds bh ref and does unlock_page remaining
taking a bh ref, so this causes the race and failure of
try_to_release_page.
To fix this race, I used the approach of falling back to buffered writes
if try_to_release_page fails on a page.
Signed-off-by: Hisashi Hifumi <[email protected]>
diff -Nrup linux-2.6.27-rc3.org/mm/filemap.c linux-2.6.27-rc3/mm/filemap.c
--- linux-2.6.27-rc3.org/mm/filemap.c 2008-08-13 13:48:47.000000000 +0900
+++ linux-2.6.27-rc3/mm/filemap.c 2008-08-19 15:45:31.000000000 +0900
@@ -2129,13 +2129,20 @@ generic_file_direct_write(struct kiocb *
* After a write we want buffered reads to be sure to go to disk to get
* the new data. We invalidate clean cached page from the region we're
* about to write. We do this *before* the write so that we can return
- * -EIO without clobbering -EIOCBQUEUED from ->direct_IO().
+ * without clobbering -EIOCBQUEUED from ->direct_IO().
*/
if (mapping->nrpages) {
written = invalidate_inode_pages2_range(mapping,
pos >> PAGE_CACHE_SHIFT, end);
- if (written)
+ /*
+ * If a page can not be invalidated, return 0 to fall back
+ * to buffered write.
+ */
+ if (written) {
+ if (written == -EBUSY)
+ return 0;
goto out;
+ }
}
written = mapping->a_ops->direct_IO(WRITE, iocb, iov, pos, *nr_segs);
diff -Nrup linux-2.6.27-rc3.org/mm/truncate.c linux-2.6.27-rc3/mm/truncate.c
--- linux-2.6.27-rc3.org/mm/truncate.c 2008-08-13 13:48:48.000000000 +0900
+++ linux-2.6.27-rc3/mm/truncate.c 2008-08-19 12:10:46.000000000 +0900
@@ -380,7 +380,7 @@ static int do_launder_page(struct addres
* Any pages which are found to be mapped into pagetables are unmapped prior to
* invalidation.
*
- * Returns -EIO if any pages could not be invalidated.
+ * Returns -EBUSY if any pages could not be invalidated.
*/
int invalidate_inode_pages2_range(struct address_space *mapping,
pgoff_t start, pgoff_t end)
@@ -440,7 +440,7 @@ int invalidate_inode_pages2_range(struct
ret2 = do_launder_page(mapping, page);
if (ret2 == 0) {
if (!invalidate_complete_page2(mapping, page))
- ret2 = -EIO;
+ ret2 = -EBUSY;
}
if (ret2 < 0)
ret = ret2;