2010-12-30 23:30:51

by Joel Becker

[permalink] [raw]
Subject: Confused by commit 56a76f [fs: fix page_mkwrite error cases in core code and btrfs]

Nick,
While visiting some issues in ocfs2_page_mkwrite(), I realized
that we're returning 0 to mean "Please try again Mr VM", but the caller
of page_mkwrite() now expects VM_FAULT_NOPAGE. This is all due to
56a76f [fs: fix page_mkwrite error cases in core code and btrfs].
In the comment for 56a76f, you state that btrfs is the example
but all other filesystems need to be fixed...yet ocfs2, ext4, and gfs2
continue to return 0 or VM_FAULT_SIGBUS.
I guess this continues to work because we return the page
unlocked (thus triggering the "if (unlikely(!(tmp & VM_FAULT_LOCKED)))"
logic). Was this expected to continue to work, or do you consider these
filesystems broken until they are updated with the new return codes?
Back to the ocfs2 issue. Am I correctly reading the current
code that we can safely throw away the page passed in to page_mkwrite()
if a pagecache flush has dropped it? Currently, we presume that the
page passed in must be the one we make writable. We make a quick check
of page->mapping == inode->i_mapping, returning 0 (for "try again")
immediately if that's false. But once we get into the meat of our
locking and finally lock the page for good, we assume mapping==i_mapping
still holds. That obviously breaks when the pagecache gets truncated.
At this late stage, we -EINVAL (clearly wrong).
It looks hard to lock the page for good high up at our first
check of the mapping. Wengang has proposed to simply ignore the page
passed into page_mkwrite() and just find_or_create_page() the sucker at
the place we're ready to consider it stable. As I see it, the two
places that call page_mkwrite() are going to revalidate the pte anyway,
and they'll find the newly created page if find_or_create_page() gets a
different. Is that safe behavior?

Joel

--

Life's Little Instruction Book #3

"Watch a sunrise at least once a year."

Joel Becker
Senior Development Manager
Oracle
E-mail: [email protected]
Phone: (650) 506-8127


2010-12-31 09:31:43

by Nicholas Piggin

[permalink] [raw]
Subject: Re: Confused by commit 56a76f [fs: fix page_mkwrite error cases in core code and btrfs]

Hi Joel,

On Fri, Dec 31, 2010 at 10:29 AM, Joel Becker <[email protected]> wrote:
> Nick,
> ? ? ? ?While visiting some issues in ocfs2_page_mkwrite(), I realized
> that we're returning 0 to mean "Please try again Mr VM", but the caller

0 has always meant minor fault, which means the filesystem satisfied
the request without doing IO. In the change to bit mask return values,
I kept it compatible by having major fault be presence of a bit, and minor
fault indicate absence.

> of page_mkwrite() now expects VM_FAULT_NOPAGE. ?This is all due to
> 56a76f [fs: fix page_mkwrite error cases in core code and btrfs].

NOPAGE basically means no further action on part of the VM is required.
It was used to consolidate pfn based fault handler with page based fault
handler. And then, later, it was further used in this case to allow the
filesystem to have the VM try again in rare / difficult to handle error cases
exactly like truncate races.

> ? ? ? ?In the comment for 56a76f, you state that btrfs is the example
> but all other filesystems need to be fixed...yet ocfs2, ext4, and gfs2
> continue to return 0 or VM_FAULT_SIGBUS.
> ? ? ? ?I guess this continues to work because we return the page
> unlocked (thus triggering the "if (unlikely(!(tmp & VM_FAULT_LOCKED)))"
> logic). ?Was this expected to continue to work, or do you consider these
> filesystems broken until they are updated with the new return codes?

No it was all back compatible. That is to say, the ones that return SIGBUS
in these cases were already broken, but the patch didn't break them further.
Actaully it closed up races a bit, if I recall. But yes they should have all
been converted to the new calling convention and returning with page
locked.

If the filesystem returns 0, it means minor fault, and the VM will actually
install the page (unless the hack to check page->mapping catches it).


> ? ? ? ?Back to the ocfs2 issue. ?Am I correctly reading the current
> code that we can safely throw away the page passed in to page_mkwrite()
> if a pagecache flush has dropped it?

Well you just return NOPAGE and the VM throws the page away.


> ?Currently, we presume that the
> page passed in must be the one we make writable. ?We make a quick check
> of page->mapping == inode->i_mapping, returning 0 (for "try again")
> immediately if that's false. ?But once we get into the meat of our
> locking and finally lock the page for good, we assume mapping==i_mapping
> still holds. ?That obviously breaks when the pagecache gets truncated.
> At this late stage, we -EINVAL (clearly wrong).

The better way to do this would be to just return VM_FAULT_NOPAGE
in any case you need the VM to retry the fault. When you reach the
business end of your handler, you want to hold the page locked, after
you verify it is correct, and return that to the fault handler.


> ? ? ? ?It looks hard to lock the page for good high up at our first
> check of the mapping. ?Wengang has proposed to simply ignore the page
> passed into page_mkwrite() and just find_or_create_page() the sucker at
> the place we're ready to consider it stable. ?As I see it, the two
> places that call page_mkwrite() are going to revalidate the pte anyway,
> and they'll find the newly created page if find_or_create_page() gets a
> different. ?Is that safe behavior?

I can't see the point. If you can find_or_create_page, then you can
lock_page, by definition. Then check the mapping and
VM_FAULT_SIGBUS if it is wrong.

Messing with the wrong page will only see the result ignored by the VM,
and push an incorrect page through parts of your fault handler, which
is potentially confusing at best, I would have thought.

Thanks,
Nick

2010-12-31 11:00:59

by Joel Becker

[permalink] [raw]
Subject: Re: Confused by commit 56a76f [fs: fix page_mkwrite error cases in core code and btrfs]

On Fri, Dec 31, 2010 at 08:31:41PM +1100, Nick Piggin wrote:
> On Fri, Dec 31, 2010 at 10:29 AM, Joel Becker <[email protected]> wrote:
> > Nick,
> > ? ? ? ?While visiting some issues in ocfs2_page_mkwrite(), I realized
> > that we're returning 0 to mean "Please try again Mr VM", but the caller
>
> 0 has always meant minor fault, which means the filesystem satisfied
> the request without doing IO. In the change to bit mask return values,
> I kept it compatible by having major fault be presence of a bit, and minor
> fault indicate absence.

Ok, good, the 0 is intentionally somewhat-working.

> NOPAGE basically means no further action on part of the VM is required.
> It was used to consolidate pfn based fault handler with page based fault
> handler. And then, later, it was further used in this case to allow the
> filesystem to have the VM try again in rare / difficult to handle error cases
> exactly like truncate races.

Ok, so it's not "no further action" anymore, it is "I don't have
a page, check again to see if I'm supposed to give you one."
That's how I read the code. Cool.

> No it was all back compatible. That is to say, the ones that return SIGBUS
> in these cases were already broken, but the patch didn't break them further.
> Actaully it closed up races a bit, if I recall. But yes they should have all
> been converted to the new calling convention and returning with page
> locked.
>
> If the filesystem returns 0, it means minor fault, and the VM will actually
> install the page (unless the hack to check page->mapping catches it).

Right, but does it install the page passed into page_mkwrite()?
The way I read the code, it actually rechecks the pte and installs the
page it now finds.

> > ? ? ? ?Back to the ocfs2 issue. ?Am I correctly reading the current
> > code that we can safely throw away the page passed in to page_mkwrite()
> > if a pagecache flush has dropped it?
>
> Well you just return NOPAGE and the VM throws the page away.

I mean, as we discuss below, that I ignore the page passed
to page_mkwrite() and rediscover the mapping ourselves.

> > ?Currently, we presume that the
> > page passed in must be the one we make writable. ?We make a quick check
> > of page->mapping == inode->i_mapping, returning 0 (for "try again")
> > immediately if that's false. ?But once we get into the meat of our
> > locking and finally lock the page for good, we assume mapping==i_mapping
> > still holds. ?That obviously breaks when the pagecache gets truncated.
> > At this late stage, we -EINVAL (clearly wrong).
>
> The better way to do this would be to just return VM_FAULT_NOPAGE
> in any case you need the VM to retry the fault. When you reach the
> business end of your handler, you want to hold the page locked, after
> you verify it is correct, and return that to the fault handler.

This is going to be hard. Our write_end() assumes it must
unlock the pages (which is normal behavior for write(2)), but in the
page_mkwrite() case we need to avoid the unlock to follow your
recommendation (we use our write_begin/write_end pair to trigger any
allocation or zeroing needed before the page is writable).

> > ? ? ? ?It looks hard to lock the page for good high up at our first
> > check of the mapping. ?Wengang has proposed to simply ignore the page
> > passed into page_mkwrite() and just find_or_create_page() the sucker at
> > the place we're ready to consider it stable. ?As I see it, the two
> > places that call page_mkwrite() are going to revalidate the pte anyway,
> > and they'll find the newly created page if find_or_create_page() gets a
> > different. ?Is that safe behavior?
>
> I can't see the point. If you can find_or_create_page, then you can
> lock_page, by definition. Then check the mapping and
> VM_FAULT_SIGBUS if it is wrong.

The find_or_create_page() is deep at the meat of the function,
not the cursory check at the top. The idea is that at this point,
find_or_create_page() will return a locked page that must, by
definition, be part of the correct mapping.

> Messing with the wrong page will only see the result ignored by the VM,
> and push an incorrect page through parts of your fault handler, which
> is potentially confusing at best, I would have thought.

If the VM is rechecking the pte after we return from
page_mkwrite(), won't it see any new page created?

Joel

--

"Egotist: a person more interested in himself than in me."
- Ambrose Bierce

http://www.jlbec.org/
[email protected]

2010-12-31 11:16:37

by Nicholas Piggin

[permalink] [raw]
Subject: Re: Confused by commit 56a76f [fs: fix page_mkwrite error cases in core code and btrfs]

On Fri, Dec 31, 2010 at 10:00 PM, Joel Becker <[email protected]> wrote:
> On Fri, Dec 31, 2010 at 08:31:41PM +1100, Nick Piggin wrote:
>> On Fri, Dec 31, 2010 at 10:29 AM, Joel Becker <[email protected]> wrote:

>> If the filesystem returns 0, it means minor fault, and the VM will actually
>> install the page (unless the hack to check page->mapping catches it).
>
> ? ? ? ?Right, but does it install the page passed into page_mkwrite()?
> The way I read the code, it actually rechecks the pte and installs the
> page it now finds.

No, it doesn't install anything, it just bails out and the fault will be
restarted if required.

>> The better way to do this would be to just return VM_FAULT_NOPAGE
>> in any case you need the VM to retry the fault. When you reach the
>> business end of your handler, you want to hold the page locked, after
>> you verify it is correct, and return that to the fault handler.
>
> ? ? ? ?This is going to be hard. ?Our write_end() assumes it must
> unlock the pages (which is normal behavior for write(2)), but in the
> page_mkwrite() case we need to avoid the unlock to follow your
> recommendation (we use our write_begin/write_end pair to trigger any
> allocation or zeroing needed before the page is writable).

Yes that would be the best option. It is possible to use the unlocked
return, but that still gives possibility of races in some cases. Consider
if the filesystem has to allocate some persistent metadata while the
page is dirty, then it must do so in page_mkwrite (set_page_dirty may
not sleep), but if the page is returned with no locks, it might be possible
that metadata is freed before the page can be installed in the dirty pte.

Basically it would be nice for all filesystems to move to this convention
so we can remove the old cruft.


>> > ? ? ? ?It looks hard to lock the page for good high up at our first
>> > check of the mapping. ?Wengang has proposed to simply ignore the page
>> > passed into page_mkwrite() and just find_or_create_page() the sucker at
>> > the place we're ready to consider it stable. ?As I see it, the two
>> > places that call page_mkwrite() are going to revalidate the pte anyway,
>> > and they'll find the newly created page if find_or_create_page() gets a
>> > different. ?Is that safe behavior?
>>
>> I can't see the point. If you can find_or_create_page, then you can
>> lock_page, by definition. Then check the mapping and
>> VM_FAULT_SIGBUS if it is wrong.
>
> ? ? ? ?The find_or_create_page() is deep at the meat of the function,
> not the cursory check at the top. ?The idea is that at this point,
> find_or_create_page() will return a locked page that must, by
> definition, be part of the correct mapping.

But you must still handle failures there, because find_or_create_page
may return -ENOMEM. So just lock the page, recheck the mapping
there, and then do exactly the same error handling.

>> Messing with the wrong page will only see the result ignored by the VM,
>> and push an incorrect page through parts of your fault handler, which
>> is potentially confusing at best, I would have thought.
>
> ? ? ? ?If the VM is rechecking the pte after we return from
> page_mkwrite(), won't it see any new page created?

But the point of page_mkwrite is a dirty notifier for the fs. If this new
different page was installed due to say truncate then a read-only
fault, the filesystem would miss the notification. So it just does the
simple thing and retries the whole sequence (if needed).

2010-12-31 11:41:54

by Joel Becker

[permalink] [raw]
Subject: Re: Confused by commit 56a76f [fs: fix page_mkwrite error cases in core code and btrfs]

On Fri, Dec 31, 2010 at 10:16:35PM +1100, Nick Piggin wrote:
> On Fri, Dec 31, 2010 at 10:00 PM, Joel Becker <[email protected]> wrote:
> > On Fri, Dec 31, 2010 at 08:31:41PM +1100, Nick Piggin wrote:
> >> On Fri, Dec 31, 2010 at 10:29 AM, Joel Becker <[email protected]> wrote:
>
> >> The better way to do this would be to just return VM_FAULT_NOPAGE
> >> in any case you need the VM to retry the fault. When you reach the
> >> business end of your handler, you want to hold the page locked, after
> >> you verify it is correct, and return that to the fault handler.
> >
> > ? ? ? ?This is going to be hard. ?Our write_end() assumes it must
> > unlock the pages (which is normal behavior for write(2)), but in the
> > page_mkwrite() case we need to avoid the unlock to follow your
> > recommendation (we use our write_begin/write_end pair to trigger any
> > allocation or zeroing needed before the page is writable).
>
> Yes that would be the best option. It is possible to use the unlocked
> return, but that still gives possibility of races in some cases. Consider

Yes, I am aware of that.

> Basically it would be nice for all filesystems to move to this convention
> so we can remove the old cruft.

I can appreciate that. And you've just answered the "Do you
want us to get there, or are minor faults in the old-0-style OK?"
question.

> > ? ? ? ?The find_or_create_page() is deep at the meat of the function,
> > not the cursory check at the top. ?The idea is that at this point,
> > find_or_create_page() will return a locked page that must, by
> > definition, be part of the correct mapping.
>
> But you must still handle failures there, because find_or_create_page
> may return -ENOMEM. So just lock the page, recheck the mapping
> there, and then do exactly the same error handling.

Of course it can error, but error is different than clean
restart. Though cleanup should be the same; I'm looking at our code
trying to convince myself that this is so ;-) Btw, -ENOMEM is that OOM
fault error, right? ;-)

> > ? ? ? ?If the VM is rechecking the pte after we return from
> > page_mkwrite(), won't it see any new page created?
>
> But the point of page_mkwrite is a dirty notifier for the fs. If this new
> different page was installed due to say truncate then a read-only
> fault, the filesystem would miss the notification. So it just does the
> simple thing and retries the whole sequence (if needed).

Fair enough, even if we caused the new page and thus are already
notified ;-)
Essentially, you would really like us (and ext4, and gfs2, etc)
to start returning VM_FAULT_NOPAGE for any place we aren't sure what
happened to our page, VM_FAULT_OOM when we get an -ENOMEM, and
VM_FAULT_LOCKED for all successful operations. That the long and short
of it? Thank you for helping me understand your plan for this code.
Btw, what not use VM_FAULT_RETRY for "I'd like you to retry"?
Especially since the comment for NOPAGE doesn't read anything like its
actual behavior.

Joel

--

"Friends may come and go, but enemies accumulate."
- Thomas Jones

Joel Becker
Senior Development Manager
Oracle
E-mail: [email protected]
Phone: (650) 506-8127

2010-12-31 11:51:29

by Nicholas Piggin

[permalink] [raw]
Subject: Re: Confused by commit 56a76f [fs: fix page_mkwrite error cases in core code and btrfs]

On Fri, Dec 31, 2010 at 10:39 PM, Joel Becker <[email protected]> wrote:
> On Fri, Dec 31, 2010 at 10:16:35PM +1100, Nick Piggin wrote:
>> On Fri, Dec 31, 2010 at 10:00 PM, Joel Becker <[email protected]> wrote:
>> > On Fri, Dec 31, 2010 at 08:31:41PM +1100, Nick Piggin wrote:
>> >> On Fri, Dec 31, 2010 at 10:29 AM, Joel Becker <[email protected]> wrote:
>>
>> >> The better way to do this would be to just return VM_FAULT_NOPAGE
>> >> in any case you need the VM to retry the fault. When you reach the
>> >> business end of your handler, you want to hold the page locked, after
>> >> you verify it is correct, and return that to the fault handler.
>> >
>> > ? ? ? ?This is going to be hard. ?Our write_end() assumes it must
>> > unlock the pages (which is normal behavior for write(2)), but in the
>> > page_mkwrite() case we need to avoid the unlock to follow your
>> > recommendation (we use our write_begin/write_end pair to trigger any
>> > allocation or zeroing needed before the page is writable).
>>
>> Yes that would be the best option. It is possible to use the unlocked
>> return, but that still gives possibility of races in some cases. Consider
>
> ? ? ? ?Yes, I am aware of that.
>
>> Basically it would be nice for all filesystems to move to this convention
>> so we can remove the old cruft.
>
> ? ? ? ?I can appreciate that. ?And you've just answered the "Do you
> want us to get there, or are minor faults in the old-0-style OK?"
> question.

It is OK in that it will work. It is not preferred, if you are reworking pleaase
use the new style.


>> > ? ? ? ?The find_or_create_page() is deep at the meat of the function,
>> > not the cursory check at the top. ?The idea is that at this point,
>> > find_or_create_page() will return a locked page that must, by
>> > definition, be part of the correct mapping.
>>
>> But you must still handle failures there, because find_or_create_page
>> may return -ENOMEM. So just lock the page, recheck the mapping
>> there, and then do exactly the same error handling.
>
> ? ? ? ?Of course it can error, but error is different than clean
> restart. ?Though cleanup should be the same; I'm looking at our code
> trying to convince myself that this is so ;-)

Cleanup is exactly the same, yes, because the only difference in
different error codes is what the VM does with them.


> ?Btw, -ENOMEM is that OOM
> fault error, right? ;-)

Right.


>> > ? ? ? ?If the VM is rechecking the pte after we return from
>> > page_mkwrite(), won't it see any new page created?
>>
>> But the point of page_mkwrite is a dirty notifier for the fs. If this new
>> different page was installed due to say truncate then a read-only
>> fault, the filesystem would miss the notification. So it just does the
>> simple thing and retries the whole sequence (if needed).
>
> ? ? ? ?Fair enough, even if we caused the new page and thus are already
> notified ;-)

Well, you would be notified via ->fault, but not not writable fault and no
page_mkwrite.


> ? ? ? ?Essentially, you would really like us (and ext4, and gfs2, etc)
> to start returning VM_FAULT_NOPAGE for any place we aren't sure what
> happened to our page, VM_FAULT_OOM when we get an -ENOMEM, and
> VM_FAULT_LOCKED for all successful operations. ?That the long and short
> of it? ?Thank you for helping me understand your plan for this code.

Exactly.


> ? ? ? ?Btw, what not use VM_FAULT_RETRY for "I'd like you to retry"?
> Especially since the comment for NOPAGE doesn't read anything like its
> actual behavior.

It was designed first to replace that old ->nopfn handler, so it does need
comments updating. RETRY doesn't cover the ->nopfn usage, but NOPAGE
does cover the retry usage, I think, so names are OK.