LinuxLists.cc - Re: Why doesn't zap_pte_range() call page

2009-04-25 05:10:34

by Nick Piggin

Subject: Re: Why doesn't zap_pte_range() call page_mkwrite()

On Fri, Apr 24, 2009 at 01:00:48PM -0400, Trond Myklebust wrote:
> On Fri, 2009-04-24 at 16:52 +0200, Miklos Szeredi wrote:
> > On Fri, 24 Apr 2009, Robin Holt wrote:
> > > I am not sure how you came to this conclusion. The address_space has
> > > the vma's chained together and protected by the i_mmap_lock. That is
> > > acquired prior to the cleaning operation. Additionally, the cleaning
> > > operation walks the process's page tables and will remove/write-protect
> > > the page before releasing the i_mmap_lock.
> > >
> > > Maybe I misunderstand. I hope I have not added confusion.
> >
> > Looking more closely, I think you're right.
> >
> > I thought that detach_vmas_to_be_unmapped() also removed them from
> > mapping->i_mmap, but that is not the case, it only removes them from
> > the process's mm_struct. The vma is only removed from ->i_mmap in
> > unmap_region() _after_ zapping the pte's.
> >
> > This means that while the pte zapping is going on, any page faults
> > will fail but page_mkclean() (and all of rmap) will continue to work.
> >
> > But then I don't see how we get a dirty pte without also first getting
> > a page fault. Weird...
>
> You don't, but unless you unmap the page when you write it out, you will
> not get any further page faults. The VM will just redirty the page
> without calling page_mkwrite().

Why? It should call page_mkwrite...

> As I said, I think I can fix the NFS problem by simply unmapping the
> page inside ->writepage() whenever we know the write request was
> originally set up by a page fault.

The biggest outstanding problem we have remaining is get_user_pages.
Callers are only required to hold a ref on the page and then they
can call set_page_dirty at any point after that.

I have a half-done patch somewhere to add a put_user_pages, and then
we could probably go from there to pinning the fs metadata (whether
by using the page lock or something else, I don't quite know).

2009-09-08 15:30:07

by Chris Mason

Subject: Re: Why doesn't zap_pte_range() call page_mkwrite()

On Sat, Apr 25, 2009 at 07:10:28AM +0200, Nick Piggin wrote:
> On Fri, Apr 24, 2009 at 01:00:48PM -0400, Trond Myklebust wrote:
> > On Fri, 2009-04-24 at 16:52 +0200, Miklos Szeredi wrote:
> > > On Fri, 24 Apr 2009, Robin Holt wrote:
> > > > I am not sure how you came to this conclusion. The address_space has
> > > > the vma's chained together and protected by the i_mmap_lock. That is
> > > > acquired prior to the cleaning operation. Additionally, the cleaning
> > > > operation walks the process's page tables and will remove/write-protect
> > > > the page before releasing the i_mmap_lock.
> > > >
> > > > Maybe I misunderstand. I hope I have not added confusion.
> > >
> > > Looking more closely, I think you're right.
> > >
> > > I thought that detach_vmas_to_be_unmapped() also removed them from
> > > mapping->i_mmap, but that is not the case, it only removes them from
> > > the process's mm_struct. The vma is only removed from ->i_mmap in
> > > unmap_region() _after_ zapping the pte's.
> > >
> > > This means that while the pte zapping is going on, any page faults
> > > will fail but page_mkclean() (and all of rmap) will continue to work.
> > >
> > > But then I don't see how we get a dirty pte without also first getting
> > > a page fault. Weird...
> >
> > You don't, but unless you unmap the page when you write it out, you will
> > not get any further page faults. The VM will just redirty the page
> > without calling page_mkwrite().
>
> Why? It should call page_mkwrite...
>
>
> > As I said, I think I can fix the NFS problem by simply unmapping the
> > page inside ->writepage() whenever we know the write request was
> > originally set up by a page fault.
>
> The biggest outstanding problem we have remaining is get_user_pages.
> Callers are only required to hold a ref on the page and then they
> can call set_page_dirty at any point after that.
>
> I have a half-done patch somewhere to add a put_user_pages, and then
> we could probably go from there to pinning the fs metadata (whether
> by using the page lock or something else, I don't quite know).

Hi everyone,

Sorry for digging up an old thread, but is there any reason we can't
just use page_mkwrite here? I'd love to get rid of the btrfs code to
detect places that use set_page_dirty without a page_mkwrite.

-chris

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .

2009-09-08 15:41:31

by Nick Piggin

Subject: Re: Why doesn't zap_pte_range() call page_mkwrite()

On Tue, Sep 08, 2009 at 11:30:07AM -0400, Chris Mason wrote:
> > > As I said, I think I can fix the NFS problem by simply unmapping the
> > > page inside ->writepage() whenever we know the write request was
> > > originally set up by a page fault.
> >
> > The biggest outstanding problem we have remaining is get_user_pages.
> > Callers are only required to hold a ref on the page and then they
> > can call set_page_dirty at any point after that.
> >
> > I have a half-done patch somewhere to add a put_user_pages, and then
> > we could probably go from there to pinning the fs metadata (whether
> > by using the page lock or something else, I don't quite know).
>
> Hi everyone,
>
> Sorry for digging up an old thread, but is there any reason we can't
> just use page_mkwrite here? I'd love to get rid of the btrfs code to
> detect places that use set_page_dirty without a page_mkwrite.

It is because page_mkwrite must be called before the page is dirtied
(it may fail, it theoretically may do something crazy with the previous
clean page data). And in several places I think it gets called from a
nasty context.

It hasn't fallen completely off my radar. fsblock has the same issue
(although I've just been ignoring gup writes into fsblock fs for the
time being).

I have a basic idea of what to do... It would be nice to change calling
convention of get_user_pages and take the page lock. Database people might
scream, in which case we could only take the page lock for filesystems that
define ->page_mkwrite (so shared mem segments avoid the overhead). Lock
ordering might get a bit interesting, but if we can have callers ensure they
always submit and release partially fulfilled requests, then we can always
trylock them.

2009-09-08 16:31:49

by Chris Mason

Subject: Re: Why doesn't zap_pte_range() call page_mkwrite()

On Tue, Sep 08, 2009 at 05:41:32PM +0200, Nick Piggin wrote:
> On Tue, Sep 08, 2009 at 11:30:07AM -0400, Chris Mason wrote:
> > > > As I said, I think I can fix the NFS problem by simply unmapping the
> > > > page inside ->writepage() whenever we know the write request was
> > > > originally set up by a page fault.
> > >
> > > The biggest outstanding problem we have remaining is get_user_pages.
> > > Callers are only required to hold a ref on the page and then they
> > > can call set_page_dirty at any point after that.
> > >
> > > I have a half-done patch somewhere to add a put_user_pages, and then
> > > we could probably go from there to pinning the fs metadata (whether
> > > by using the page lock or something else, I don't quite know).
> >
> > Hi everyone,
> >
> > Sorry for digging up an old thread, but is there any reason we can't
> > just use page_mkwrite here? I'd love to get rid of the btrfs code to
> > detect places that use set_page_dirty without a page_mkwrite.
>
> It is because page_mkwrite must be called before the page is dirtied
> (it may fail, it theoretically may do something crazy with the previous
> clean page data). And in several places I think it gets called from a
> nasty context.
>
> It hasn't fallen completely off my radar. fsblock has the same issue
> (although I've just been ignoring gup writes into fsblock fs for the
> time being).

Ok, I'll change my detection code a bit then.

>
> I have a basic idea of what to do... It would be nice to change calling
> convention of get_user_pages and take the page lock. Database people might
> scream, in which case we could only take the page lock for filesystems that
> define ->page_mkwrite (so shared mem segments avoid the overhead). Lock
> ordering might get a bit interesting, but if we can have callers ensure they
> always submit and release partially fulfilled requests, then we can always
> trylock them.

I think everyone will have page_mkwrite eventually, at least everyone
who the databases will care about ;)

-chris

2009-09-08 17:00:00

by Nick Piggin

Subject: Re: Why doesn't zap_pte_range() call page_mkwrite()

On Tue, Sep 08, 2009 at 12:31:49PM -0400, Chris Mason wrote:
> On Tue, Sep 08, 2009 at 05:41:32PM +0200, Nick Piggin wrote:
> > It hasn't fallen completely off my radar. fsblock has the same issue
> > (although I've just been ignoring gup writes into fsblock fs for the
> > time being).
>
> Ok, I'll change my detection code a bit then.

OK.

> > I have a basic idea of what to do... It would be nice to change calling
> > convention of get_user_pages and take the page lock. Database people might
> > scream, in which case we could only take the page lock for filesystems that
> > define ->page_mkwrite (so shared mem segments avoid the overhead). Lock
> > ordering might get a bit interesting, but if we can have callers ensure they
> > always submit and release partially fulfilled requests, then we can always
> > trylock them.
>
> I think everyone will have page_mkwrite eventually, at least everyone
> who the databases will care about ;)

Ah, the problem is not where the DIO write goes, it's where the read
goes :) (ie. the read writes into get_user_pages pages).

So for databases this should typically be shared memory segments I'd
say (tmpfs), or maybe anonymous memory.

2009-09-09 02:21:10

by Christoph Hellwig

Subject: Re: Why doesn't zap_pte_range() call page_mkwrite()

On Tue, Sep 08, 2009 at 11:30:07AM -0400, Chris Mason wrote:
> Sorry for digging up an old thread, but is there any reason we can't
> just use page_mkwrite here? I'd love to get rid of the btrfs code to
> detect places that use set_page_dirty without a page_mkwrite.

It's not just btrfs, it's also a complete pain in the a** for XFS and
probably every filesystem using ->page_mkwrite for dirty page tracking.

2009-09-09 05:39:24

by Nick Piggin

Subject: Re: Why doesn't zap_pte_range() call page_mkwrite()

On Tue, Sep 08, 2009 at 10:21:02PM -0400, Christoph Hellwig wrote:
> On Tue, Sep 08, 2009 at 11:30:07AM -0400, Chris Mason wrote:
> > Sorry for digging up an old thread, but is there any reason we can't
> > just use page_mkwrite here? I'd love to get rid of the btrfs code to
> > detect places that use set_page_dirty without a page_mkwrite.
>
> It's not just btrfs, it's also a complete pain in the a** for XFS and
> probably every filesystem using ->page_mkwrite for dirty page tracking.

Well I guess I should really get out my put_user_pages patches and
propose doing page locking or something. One problem is just going
through and converting all callers... another problem is that
nobody seemed to care much last time but hopefully there is more
interest now.