2005-11-01 19:30:37

by Petr Vandrovec

[permalink] [raw]
Subject: Nick's core remove PageReserved broke vmware...

Hello Nick,
what's the reason behind disallowing get_user_pages() on VM_RESERVED regions?
vmmon uses VM_RESERVED on its 'vma' as otherwise some kernels used by SUSE
complained loudly about mismatch between PageReserved() and VM_RESERVED flags.

I'll remove it from vmmon for >= 2.6.14 kernels as that bogus test never made
to Linux kernel, but I cannot find any reason why get_user_pages() should not
work on VM_RESERVED (or VM_IO for that matter) user pages. Can you show me
reasoning behind that decision ?
Thanks,
Petr Vandrovec


b5810039a54e5babf428e9a1e89fc1940fabff11


tree 835836cb527ec9bd525f93eb7e016f3dfb8c8ae2


parent f9c98d0287de42221c624482fd4f8d485c98ab22


author Nick Piggin <[email protected]> Sat, 29 Oct 2005 18:16:12 -0700


committer Linus Torvalds <[email protected]> Sat, 29 Oct 2005 21:40:39 -0700





[PATCH] core remove PageReserved





Remove PageReserved() calls from core code by tightening VM_RESERVED


handling in mm/ to cover PageReserved functionality.







2005-11-02 00:36:03

by Nick Piggin

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

Petr Vandrovec wrote:
> Hello Nick,
> what's the reason behind disallowing get_user_pages() on VM_RESERVED
> regions? vmmon uses VM_RESERVED on its 'vma' as otherwise some kernels
> used by SUSE complained loudly about mismatch between PageReserved() and
> VM_RESERVED flags.
>

Hi Petr,

The reason is that VM_RESERVED indicates that the core vm is not allowed
to touch any 'struct page' through this mapping, which get_user_pages
would do.

> I'll remove it from vmmon for >= 2.6.14 kernels as that bogus test
> never made to Linux kernel, but I cannot find any reason why
> get_user_pages() should not work on VM_RESERVED (or VM_IO for that
> matter) user pages. Can you show me reasoning behind that decision ?

The reasoning behind the decision was so VM_RESERVED is usable for a
complete replacement to PageReserved. For example mappings through
/dev/mem should not touch the page count.

You may be able to go a step further and clear PageReserved from your
pages as well, and thus have a working driver without special casing
for both kernels.

Thanks,
Nick

--
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com

2005-11-02 01:17:17

by Petr Vandrovec

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

Nick Piggin wrote:
> Petr Vandrovec wrote:
>
>> Hello Nick,
>> what's the reason behind disallowing get_user_pages() on VM_RESERVED
>> regions? vmmon uses VM_RESERVED on its 'vma' as otherwise some
>> kernels used by SUSE complained loudly about mismatch between
>> PageReserved() and VM_RESERVED flags.
>>
>
> The reason is that VM_RESERVED indicates that the core vm is not allowed
> to touch any 'struct page' through this mapping, which get_user_pages
> would do.

But get_user_pages() was not invoked by 'the core vm'. I invoked it, from my
module... same one which populated this VMA before.

>> I'll remove it from vmmon for >= 2.6.14 kernels as that bogus test
>> never made to Linux kernel, but I cannot find any reason why
>> get_user_pages() should not work on VM_RESERVED (or VM_IO for that
>> matter) user pages. Can you show me reasoning behind that decision ?
>
> The reasoning behind the decision was so VM_RESERVED is usable for a
> complete replacement to PageReserved. For example mappings through
> /dev/mem should not touch the page count.
>
> You may be able to go a step further and clear PageReserved from your
> pages as well, and thus have a working driver without special casing
> for both kernels.

Nope. We are not having PageReserved() set on our pages since we want them
refcounted. But old SuSE kernels contained this code which was rather unhappy
if page did not have ->mapping set. So we just marked vma VM_RESERVED, as it
did not hurt, and all pages in this vma have refcount > 1 anyway so there is no
point in trying to cleanup these page tables. Now rmap catches this by
page_count() != page_mapcount(), so VM_RESERVED is not needed anymore, but there
did not seem to be any reason to remove it.

pageable = !PageReserved(new_page); << pageable = 1
as = !!new_page->mapping; << as = 0

BUG_ON(!pageable && as);

pageable &= as; << pageable = 0

...

/*
* This is the entry point for memory under VM_RESERVED vmas.
* That memory will not be tracked by the vm. These aren't
* real anonymous pages, they're "device" reserved pages instead.
*/
reserved = !!(vma->vm_flags & VM_RESERVED); << reserved = 0
if (unlikely(reserved == pageable)) << fires...
printk("Badness in %s at %s:%d\n",
__FUNCTION__, __FILE__, __LINE__);

So I've made this change... Test probably could be for 2.6.4 <= x <= 2.6.5 to
rule out all buggy kernels, but I'll probably leave it this way unless there is
some good reason to not set VM_RESERVED on these older kernels.

Thanks for explanation.
Petr

--- vmmon-only/linux/driver.c.orig 2005-11-02 02:00:46.000000000 +0100
+++ vmmon-only/linux/driver.c 2005-11-01 20:12:13.000000000 +0100
@@ -1283,9 +1283,13 @@
/*
* It seems that SuSE's 2.6.4-52 needs this. Hopefully
* it will not break anything else.
+ *
+ * It breaks on post 2.6.14 kernels, so get rid of it on them.
*/
#ifdef VM_RESERVED
+# if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 14)
vma->vm_flags |= VM_RESERVED;
+# endif
#endif
return 0;
}

2005-11-02 02:09:27

by Nick Piggin

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

Petr Vandrovec wrote:

>
> But get_user_pages() was not invoked by 'the core vm'. I invoked it,
> from my
> module... same one which populated this VMA before.
>

OK, sure in this case it should be fine because you know what you're
doing. In the case where someone submits a /dev/mem mapping for direct
IO, things would be different.


> So I've made this change... Test probably could be for 2.6.4 <= x <=
> 2.6.5 to rule out all buggy kernels, but I'll probably leave it this way
> unless there is some good reason to not set VM_RESERVED on these older
> kernels.
>

OK, if that is all that is required to fix it for you, then that's
a good result! Thanks for the mail and be sure to let us know if you
run into any other problems.

Nick

--
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com

2005-11-02 12:27:37

by Hugh Dickins

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

On Wed, 2 Nov 2005, Petr Vandrovec wrote:
>
> Nope. We are not having PageReserved() set on our pages since we want them
> refcounted. But old SuSE kernels contained this code which was rather unhappy
> if page did not have ->mapping set.

Yes, Andrea's do_no_page enforced some tests which I found confusing and
unhelpful, so I didn't propagate them to mainline when porting his rmap.

> So we just marked vma VM_RESERVED, as it
> did not hurt, and all pages in this vma have refcount > 1 anyway so there is
> no point in trying to cleanup these page tables. Now rmap catches this by
> page_count() != page_mapcount(), so VM_RESERVED is not needed anymore, but
> there did not seem to be any reason to remove it.

Well, beware. That "page_count() != page_mapcount() + 2" check in rmap.c
went away in 2.6.13: the problem it was there to solve being solved
instead by a can_share_swap_page based on page_mapcount instead of
page_count (partly to fix a page migration progress problem).

So, you may still be in trouble? But how do the pages you're concerned
with come to be on the LRU in the first place? If they're not on the
LRU, vmscanning will never try to take them away. Most drivers with
special pages, and ->mapping unset, don't put the pages on the LRU.

> --- vmmon-only/linux/driver.c.orig 2005-11-02 02:00:46.000000000 +0100
> +++ vmmon-only/linux/driver.c 2005-11-01 20:12:13.000000000 +0100
> @@ -1283,9 +1283,13 @@
> /*
> * It seems that SuSE's 2.6.4-52 needs this. Hopefully
> * it will not break anything else.
> + *
> + * It breaks on post 2.6.14 kernels, so get rid of it on them.
> */
> #ifdef VM_RESERVED
> +# if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 14)
> vma->vm_flags |= VM_RESERVED;
> +# endif
> #endif
> return 0;
> }

Nick's PageReserved/VM_RESERVED changes are not in 2.6.14 so I'd expect
2.6.15 there. Ah, you're trying to handle this awkward interval before
2.6.15-rc1 brings the numbering up to 2.6.15, okay.

Hugh

2005-11-02 18:06:58

by Petr Vandrovec

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

Hugh Dickins wrote:
> On Wed, 2 Nov 2005, Petr Vandrovec wrote:
>
>>So we just marked vma VM_RESERVED, as it
>>did not hurt, and all pages in this vma have refcount > 1 anyway so there is
>>no point in trying to cleanup these page tables. Now rmap catches this by
>>page_count() != page_mapcount(), so VM_RESERVED is not needed anymore, but
>>there did not seem to be any reason to remove it.
>
>
> Well, beware. That "page_count() != page_mapcount() + 2" check in rmap.c
> went away in 2.6.13: the problem it was there to solve being solved
> instead by a can_share_swap_page based on page_mapcount instead of
> page_count (partly to fix a page migration progress problem).
>
> So, you may still be in trouble? But how do the pages you're concerned
> with come to be on the LRU in the first place? If they're not on the
> LRU, vmscanning will never try to take them away. Most drivers with
> special pages, and ->mapping unset, don't put the pages on the LRU.

No, we do not put pages on LRU. Older kernels were removing once
instantiated page entries from pagetables. It did not cause any
problems, but as VM cannot gain anything from this removal, we were
happy to find that VM did not clear page tables for VM_RESERVED vmas.
But primary motivation for VM_RESERVED was to get rid warning on SuSE
2.6.4. As one VMware instance uses 4 to 6 pages allocated this way,
it is really not big problem even if VM will try to remove them from
pagetables...

>>--- vmmon-only/linux/driver.c.orig 2005-11-02 02:00:46.000000000 +0100
>>+++ vmmon-only/linux/driver.c 2005-11-01 20:12:13.000000000 +0100
>>@@ -1283,9 +1283,13 @@
>>/*
>>* It seems that SuSE's 2.6.4-52 needs this. Hopefully
>>* it will not break anything else.
>>+ *
>>+ * It breaks on post 2.6.14 kernels, so get rid of it on them.
>> */
>> #ifdef VM_RESERVED
>>+# if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 14)
>> vma->vm_flags |= VM_RESERVED;
>>+# endif
>>#endif
>> return 0;
>>}
>
>
> Nick's PageReserved/VM_RESERVED changes are not in 2.6.14 so I'd expect
> 2.6.15 there. Ah, you're trying to handle this awkward interval before
> 2.6.15-rc1 brings the numbering up to 2.6.15, okay.

For our purpose anything between 2.6.6 and 2.6.14 is fine as for us.
Allocated pages are normal memory pages, they just were allocated to
meet specific criteria - being allocated either from low 4GB (accessible in
32bit mode without paging), or being physically contiguous, so they all
have 'struct page' and can be happily refcounted.

Only really obscure thing we do is that we allocate order 1 or 2 (8/16KB)
regions, bump refcount by 1 on all pages except first one (if CONFIG_MMU
is set), and then pass independent pages from this region through nopage
handler to the VM to map them to the user level. It means that region
is allocated by alloc_pages(GFP_USER, 2), while it is released by 4 calls
to put_page(), for page+0...page+3.

It works on all Linux kernels with MMU I've tested, and it would be nice
if this could continue to work. Current kernels seems to provide PageCompound,
which would fit our needs as well after some changes, but as long as
PageCompound is build time option it is not possible to rely on it.

Best regards,
Petr Vandrovec

2005-11-02 21:07:31

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

On Wed, 2005-11-02 at 12:26 +0000, Hugh Dickins wrote:

> Well, beware. That "page_count() != page_mapcount() + 2" check in rmap.c
> went away in 2.6.13: the problem it was there to solve being solved
> instead by a can_share_swap_page based on page_mapcount instead of
> page_count (partly to fix a page migration progress problem).

Not completely related to this thread, but ... I have been working on
the radeon DRI driver to add some non-AGP DMA functionality. That needs
to pin some userpages for DMA. It's currently doing get_user_pages(),
and later on, page_cache_release() when the DMA is done. In between
however, it returns to userland.

I haven't been able to find a firm grasp on what is needed to make sure
those pages won't be mucked with by rmap or others during that proc ess.
Should I set PG_locked ? if yes why and if not why not ? (you may figure
out at this point that I have a poor understanding of this part of the
VM subsystem). Will get_user_pages() increase page_mapcount or only
page_count (relative to your quote above) ?

Also, I'm not too sure how to handle dirtyness. It _seems_ to be that
for a DMA transfer from device to memory, I will have to call
get_user_pages() for write, thus setting dirty at that moment. However,
this is not very optimal. I want X to be able to "prepare" pixmaps for
DMA (keeping the user pages locked and the DMA sglists ready) (up to a
given threshold of memory of course) and later on, kick DMA operations.
In that context, X doesn't know in advance what direction the DMA will
take. Pixmaps can be migrated to/from vram at any time depending on the
type of rendering operation.

But I'm not sure I have a proper way to set those pages dirty after the
call to get_user_pages(), do I have a guarantee that they haven't been
unmapped from the user process for example ?

Ben.


2005-11-02 21:42:50

by Hugh Dickins

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

On Thu, 3 Nov 2005, Benjamin Herrenschmidt wrote:
>
> Not completely related to this thread, but ... I have been working on
> the radeon DRI driver to add some non-AGP DMA functionality. That needs
> to pin some userpages for DMA. It's currently doing get_user_pages(),
> and later on, page_cache_release() when the DMA is done. In between
> however, it returns to userland.

That's the right way to do it.

> I haven't been able to find a firm grasp on what is needed to make sure
> those pages won't be mucked with by rmap or others during that proc ess.
> Should I set PG_locked ? if yes why and if not why not ? (you may figure
> out at this point that I have a poor understanding of this part of the
> VM subsystem). Will get_user_pages() increase page_mapcount or only
> page_count (relative to your quote above) ?

get_user_pages() raises page_count, not page_mapcount. You shouldn't set
PG_locked (and if you did, ought only to do so via lock_page): that was
done at one time in early 2.4, but it's irrelevant (and a problem when
the same page appears more than once in the list).

It remains unlikely but possible that rmap will come along and remove
the page from its place in the user address space before the DMAing
has finished; but that does not matter, so long as any user access
to that address faults the right pinned page back in.

The only extant problem here is if the pages are private, and you
fork while this is going on, and the parent user process writes to the
area before completion: then COW leaves the child with the page being
DMAed into, giving the parent a copied page which may be incomplete.

> Also, I'm not too sure how to handle dirtyness. It _seems_ to be that
> for a DMA transfer from device to memory, I will have to call
> get_user_pages() for write, thus setting dirty at that moment. However,
> this is not very optimal. I want X to be able to "prepare" pixmaps for
> DMA (keeping the user pages locked and the DMA sglists ready) (up to a
> given threshold of memory of course) and later on, kick DMA operations.
> In that context, X doesn't know in advance what direction the DMA will
> take. Pixmaps can be migrated to/from vram at any time depending on the
> type of rendering operation.

It's important that any necessary COW be done at get_user_pages time,
if there's any possibility that you'll be DMAing into them. So when
in doubt, call it for write access.

> But I'm not sure I have a proper way to set those pages dirty after the
> call to get_user_pages(), do I have a guarantee that they haven't been
> unmapped from the user process for example ?

You don't have that guarantee, but you shouldn't need it: when in doubt,
let it set them dirty beforehand. As to afterwards, if I remember
rightly, there's a race by which the pages might be written out and
dirty cleared before your DMA completed, so you actually do need to
mark them dirty after - searching fs/ for get_user_pages() use suggests
so. Take a look at Andrew's educational comment on set_page_dirty_lock
in mm/page-writeback.c. You do have the list of pages you need to
page_cache_release, don't you? So it should be easy to dirty them.

Hugh

2005-11-02 21:48:46

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

On Wed, 2005-11-02 at 21:41 +0000, Hugh Dickins wrote:

> The only extant problem here is if the pages are private, and you
> fork while this is going on, and the parent user process writes to the
> area before completion: then COW leaves the child with the page being
> DMAed into, giving the parent a copied page which may be incomplete.

Won't happen, and if it does, it's a user error to rely on that working,
so it doesn't matter.

> It's important that any necessary COW be done at get_user_pages time,
> if there's any possibility that you'll be DMAing into them. So when
> in doubt, call it for write access.

True, I forgot about COW... it still annoys me to mark dirty pages that
may not be in the end... but hell, the main usefulness of this DMA stuff
is DMA to memory anyway (AGP is good enough for the other direction most
of the time).

> Take a look at Andrew's educational comment on set_page_dirty_lock
> in mm/page-writeback.c. You do have the list of pages you need to
> page_cache_release, don't you? So it should be easy to dirty them.

Ok, so just passing 'write' to get_user_pages() is good enough; right ?

Thanks,
Ben.


2005-11-02 22:03:56

by Hugh Dickins

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

On Thu, 3 Nov 2005, Benjamin Herrenschmidt wrote:
> On Wed, 2005-11-02 at 21:41 +0000, Hugh Dickins wrote:
>
> > The only extant problem here is if the pages are private, and you
> > fork while this is going on, and the parent user process writes to the
> > area before completion: then COW leaves the child with the page being
> > DMAed into, giving the parent a copied page which may be incomplete.
>
> Won't happen, and if it does, it's a user error to rely on that working,
> so it doesn't matter.

I wish everyone else would see it that way! (But some people do
have valid scenarios where it can't just be ruled out completely.)

> > Take a look at Andrew's educational comment on set_page_dirty_lock
> > in mm/page-writeback.c. You do have the list of pages you need to
> > page_cache_release, don't you? So it should be easy to dirty them.
>
> Ok, so just passing 'write' to get_user_pages() is good enough; right ?

Not quite, I think: you need to pass 'write' to get_user_pages()
initially; but at the end, if it was indeed writing into user space,
you need to do the set_page_dirty_lock thing on each of the pages
before page_cache_release, just in case a race cleaned them before
the DMA completed. I think (I've never used it myself).

Hugh

2005-11-02 22:25:21

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

On Wed, 2005-11-02 at 22:02 +0000, Hugh Dickins wrote:

> I wish everyone else would see it that way! (But some people do
> have valid scenarios where it can't just be ruled out completely.)

Hehe, well, in my case it's not one, at least not yet :)

> > > Take a look at Andrew's educational comment on set_page_dirty_lock
> > > in mm/page-writeback.c. You do have the list of pages you need to
> > > page_cache_release, don't you? So it should be easy to dirty them.
> >
> > Ok, so just passing 'write' to get_user_pages() is good enough; right ?
>
> Not quite, I think: you need to pass 'write' to get_user_pages()
> initially; but at the end, if it was indeed writing into user space,
> you need to do the set_page_dirty_lock thing on each of the pages
> before page_cache_release, just in case a race cleaned them before
> the DMA completed. I think (I've never used it myself).

Oh, I see... I can't prevent them from being cleaned during the DMA
then... Ok, will do that.

Also, what do you suggest as a good threshold to use on the max amount
of memory I can let the X server "pin" that way ? I was thinking it as
equivalent to mlock, thus I could maybe hijack mm->locked_vm & use
RLIMIT_MEMLOCK or is that too gross ?

Ben.



2005-11-02 22:39:31

by Petr Vandrovec

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

Hugh Dickins wrote:
> On Thu, 3 Nov 2005, Benjamin Herrenschmidt wrote:
>
>>>Take a look at Andrew's educational comment on set_page_dirty_lock
>>>in mm/page-writeback.c. You do have the list of pages you need to
>>>page_cache_release, don't you? So it should be easy to dirty them.
>>
>>Ok, so just passing 'write' to get_user_pages() is good enough; right ?
>
>
> Not quite, I think: you need to pass 'write' to get_user_pages()
> initially; but at the end, if it was indeed writing into user space,
> you need to do the set_page_dirty_lock thing on each of the pages
> before page_cache_release, just in case a race cleaned them before
> the DMA completed. I think (I've never used it myself).

Unfortunately at least for our use set_page_dirty{_lock} has an
unfortunate feature that it schedules writeback immediately.

get_user_pages() through __follow_page() calls set_page_dirty() only
if pte_dirty() bit is not set (I have no idea why it just does not
set pte dirty bit instead of doing set_page_dirty(), but there must
be some reason, yes?), so under normal condition page is marked
dirty in the page's structure only if it was not marked dirty
on pte level before. This way page itself is marked dirty only after
somebody copies dirty bit from page tables to the page structure, which
can take a lot of time.

On other side when you do set_page_dirty(), page is in few seconds
written back to the disk - causing quite visible I/O load if you
perform get_user_pages()/set_page_dirty() when compared with
situation where you just mark PTE dirty after you are done with
page.

(for those interested, in the situation described above we are
doing get_user_pages() on the file mapped by MAP_SHARED to the
user's address space to get physical address of these pages,
then virtual machine monitor uses this physical address to fill
guest's pagetables, and later (once guest is done with page) we
mark page dirty and release page; performance difference between
set_page_dirty() and home grown ptep_set_dirty() is more than
visible... but AGP memory in question is probably not
backed up by some writeable file, so it does not make difference
here)

Petr

2005-11-03 08:04:24

by Gleb Natapov

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

On Thu, Nov 03, 2005 at 09:22:10AM +1100, Benjamin Herrenschmidt wrote:
> Also, what do you suggest as a good threshold to use on the max amount
> of memory I can let the X server "pin" that way ? I was thinking it as
> equivalent to mlock, thus I could maybe hijack mm->locked_vm & use
> RLIMIT_MEMLOCK or is that too gross ?
>
This is what infiniband does, so it should be good for you too.

--
Gleb.

2005-11-03 08:12:56

by Gleb Natapov

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

On Wed, Nov 02, 2005 at 10:02:49PM +0000, Hugh Dickins wrote:
> On Thu, 3 Nov 2005, Benjamin Herrenschmidt wrote:
> > On Wed, 2005-11-02 at 21:41 +0000, Hugh Dickins wrote:
> >
> > > The only extant problem here is if the pages are private, and you
> > > fork while this is going on, and the parent user process writes to the
> > > area before completion: then COW leaves the child with the page being
> > > DMAed into, giving the parent a copied page which may be incomplete.
> >
> > Won't happen, and if it does, it's a user error to rely on that working,
> > so it doesn't matter.
>
> I wish everyone else would see it that way! (But some people do
> have valid scenarios where it can't just be ruled out completely.)
>
I am one of those people :)

Last discussion about this issue ended without resolution, but I remember
you mentioned the possibility to leave ptes writable in parent during fork
for private pages mapped for DMA. Is this approach acceptable?

--
Gleb.

2005-11-03 13:33:22

by Hugh Dickins

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

On Thu, 3 Nov 2005, Gleb Natapov wrote:
> On Thu, Nov 03, 2005 at 09:22:10AM +1100, Benjamin Herrenschmidt wrote:
> > Also, what do you suggest as a good threshold to use on the max amount
> > of memory I can let the X server "pin" that way ? I was thinking it as
> > equivalent to mlock, thus I could maybe hijack mm->locked_vm & use
> > RLIMIT_MEMLOCK or is that too gross ?
> >
> This is what infiniband does, so it should be good for you too.

Yes, I noticed that code a couple of days ago, I think you're setting a
good example there, for long-term or large-area uses of get_user_pages():
Ben, take a look at drivers/infiniband/core/uverbs_mem.c

Hugh

2005-11-03 13:56:25

by Gleb Natapov

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

On Thu, Nov 03, 2005 at 01:32:15PM +0000, Hugh Dickins wrote:
> On Thu, 3 Nov 2005, Gleb Natapov wrote:
> > On Thu, Nov 03, 2005 at 09:22:10AM +1100, Benjamin Herrenschmidt wrote:
> > > Also, what do you suggest as a good threshold to use on the max amount
> > > of memory I can let the X server "pin" that way ? I was thinking it as
> > > equivalent to mlock, thus I could maybe hijack mm->locked_vm & use
> > > RLIMIT_MEMLOCK or is that too gross ?
> > >
> > This is what infiniband does, so it should be good for you too.
>
> Yes, I noticed that code a couple of days ago, I think you're setting a
> good example there, for long-term or large-area uses of get_user_pages():
> Ben, take a look at drivers/infiniband/core/uverbs_mem.c
>
Credit goes to Roland Dreier :)

--
Gleb.

2005-11-03 14:12:57

by Hugh Dickins

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

On Thu, 3 Nov 2005, Gleb Natapov wrote:
> On Wed, Nov 02, 2005 at 10:02:49PM +0000, Hugh Dickins wrote:
> > On Thu, 3 Nov 2005, Benjamin Herrenschmidt wrote:
> > > On Wed, 2005-11-02 at 21:41 +0000, Hugh Dickins wrote:
> > >
> > > > The only extant problem here is if the pages are private, and you
> > > > fork while this is going on, and the parent user process writes to the
> > > > area before completion: then COW leaves the child with the page being
> > > > DMAed into, giving the parent a copied page which may be incomplete.
> > >
> > > Won't happen, and if it does, it's a user error to rely on that working,
> > > so it doesn't matter.
> >
> > I wish everyone else would see it that way! (But some people do
> > have valid scenarios where it can't just be ruled out completely.)
> >
> I am one of those people :)
>
> Last discussion about this issue ended without resolution, but I remember
> you mentioned the possibility to leave ptes writable in parent during fork
> for private pages mapped for DMA. Is this approach acceptable?

I was toying with that idea back then, but it leaves the pages in a
peculiar limbo between being shared and private, such that it's hard
to think through the consequences. We do already have a case rather
like that (ptrace writing to a write-protected area), but some of us
are a bit worried by that one, so I'd be foolish now to recommend
another such subversion of the rules.

In the time since we discussed before, I've rather come full circle
round to my original position: abandoning such ideas of trying to
handle it from get_user_pages itself, appreciating the simplicity
of the original PROT_DONTCOPY idea from you guys; but sticking to my
initial reaction that this is better done by madvise(MADV_DONTCOPY),
not by the mmap/mprotect route in Michael's patch. (I never bought
the "racy" argument advanced in favour of the mmap flag.)

One of the factors which has swayed me to the DONTCOPY approach, is
Nick's 2.6.14 optimization in fork's copy_page_range, where areas
which can be safely faulted later are not copied pte by pte. But
that doesn't apply to all areas, and in particular cannot apply to
VM_NONLINEAR shared areas. It should be of benefit to apps which
use large such areas, and also do a lot of forking children who don't
need those areas, to be able to mark them VM_DONTCOPY. Or any other
vmas the children won't need. (But there's one big distinction between
the optimization and VM_DONTCOPY: the optimization copies vma but
doesn't fill in its ptes, VM_DONTCOPY doesn't even copy the vma.)

Two warnings if someone would like to post a MADV_DONTCOPY patch.
It should include a matching MADV_DOCOPY to clear the condition, but
that must not be allowed to clear VM_DONTCOPY set originally by driver:
perhaps you'll end up with a VM_UDONTCOPY or something like that.

And Badari has a MADV_REMOVE patch in the works, taking the next
slot (just after MADV_DONTNEED in most of the arches): probably
best for you to base yours on top of his (though yours is simpler
and might jump ahead).

Hugh

2005-11-03 14:23:33

by Gleb Natapov

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

On Thu, Nov 03, 2005 at 02:11:46PM +0000, Hugh Dickins wrote:
> In the time since we discussed before, I've rather come full circle
> round to my original position: abandoning such ideas of trying to
> handle it from get_user_pages itself, appreciating the simplicity
> of the original PROT_DONTCOPY idea from you guys; but sticking to my
> initial reaction that this is better done by madvise(MADV_DONTCOPY),
> not by the mmap/mprotect route in Michael's patch. (I never bought
> the "racy" argument advanced in favour of the mmap flag.)
>
Excellent! Then DONTCOPY it will be.

--
Gleb.

2005-11-03 14:34:19

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

Quoting Hugh Dickins <[email protected]>:
> Two warnings if someone would like to post a MADV_DONTCOPY patch.
> It should include a matching MADV_DOCOPY to clear the condition, but
> that must not be allowed to clear VM_DONTCOPY set originally by driver:
> perhaps you'll end up with a VM_UDONTCOPY or something like that.

Thanks!
All existing drivers that set VM_DONTCOPY also set VM_IO.
So lets just disable playing with these flags from madvise if VM_IO is set.
There's no reason I can see that the driver should have a say
on what the process does with its own (non-IO) memory.
Sounds good?

By the way, as a separate issue, we still have a problem with DMA to pages
which are *needed* by the child process. What do you think about VM_COPY
(to do the old unix thing of actually copying the page instead of
setting the COW flag) and a matching madvise call to set/clear it?

> And Badari has a MADV_REMOVE patch in the works, taking the next
> slot (just after MADV_DONTNEED in most of the arches): probably
> best for you to base yours on top of his (though yours is simpler
> and might jump ahead).
>
> Hugh

Yep, I saw that posted for inclusion in -mm.

--
MST

2005-11-03 15:00:52

by Hugh Dickins

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

On Thu, 3 Nov 2005, Michael S. Tsirkin wrote:
> All existing drivers that set VM_DONTCOPY also set VM_IO.
> So lets just disable playing with these flags from madvise if VM_IO is set.
> There's no reason I can see that the driver should have a say
> on what the process does with its own (non-IO) memory.
> Sounds good?

You're then saying that a process cannot set VM_DONTCOPY on a VM_IO
area to prevent the first child getting the area, but clear it after
so the next child does get a copy of the area. I think it'd be wrong
(surprising) to limit the functionality in that way.

> By the way, as a separate issue, we still have a problem with DMA to pages
> which are *needed* by the child process. What do you think about VM_COPY
> (to do the old unix thing of actually copying the page instead of
> setting the COW flag) and a matching madvise call to set/clear it?

I don't much want to add another path into copy_pte_range, actually
copying pages. If the process really wants DMA into such areas,
then it should contain the code for the child to COW them itself?

Hugh

2005-11-03 15:10:23

by Gleb Natapov

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

On Thu, Nov 03, 2005 at 02:59:44PM +0000, Hugh Dickins wrote:
> > By the way, as a separate issue, we still have a problem with DMA to pages
> > which are *needed* by the child process. What do you think about VM_COPY
> > (to do the old unix thing of actually copying the page instead of
> > setting the COW flag) and a matching madvise call to set/clear it?
>
> I don't much want to add another path into copy_pte_range, actually
> copying pages. If the process really wants DMA into such areas,
> then it should contain the code for the child to COW them itself?
>
I think MPI is the main consumer of Infiniband currently. Unfortunately
it's only a library and can't control what users do before or after
fork in their applications. It is good to have possibility to copy partially
registered pages, there shouldn't be many after all.

--
Gleb.

2005-11-03 15:11:24

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

Quoting Hugh Dickins <[email protected]>:
> > All existing drivers that set VM_DONTCOPY also set VM_IO.
> > So lets just disable playing with these flags from madvise if VM_IO is set.
> > There's no reason I can see that the driver should have a say
> > on what the process does with its own (non-IO) memory.
> > Sounds good?
>
> You're then saying that a process cannot set VM_DONTCOPY on a VM_IO
> area to prevent the first child getting the area, but clear it after
> so the next child does get a copy of the area. I think it'd be wrong
> (surprising) to limit the functionality in that way.

Okay, I guess. I am just trying to avoid more VM_ flags.
Cant we get rid of the last requirement, then?
I dont see why the driver should have a say on what the process does with its
own memory.

> > By the way, as a separate issue, we still have a problem with DMA to pages
> > which are *needed* by the child process. What do you think about VM_COPY
> > (to do the old unix thing of actually copying the page instead of
> > setting the COW flag) and a matching madvise call to set/clear it?
>
> I don't much want to add another path into copy_pte_range, actually
> copying pages. If the process really wants DMA into such areas,
> then it should contain the code for the child to COW them itself?
>
> Hugh

How do you do that, say, for a stack page, or global data section?

--
MST

2005-11-03 15:38:52

by Hugh Dickins

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

On Thu, 3 Nov 2005, Michael S. Tsirkin wrote:
> Quoting Hugh Dickins <[email protected]>:
> > > All existing drivers that set VM_DONTCOPY also set VM_IO.
> > > So lets just disable playing with these flags from madvise if VM_IO is set.
> > > There's no reason I can see that the driver should have a say
> > > on what the process does with its own (non-IO) memory.
> > > Sounds good?
> >
> > You're then saying that a process cannot set VM_DONTCOPY on a VM_IO
> > area to prevent the first child getting the area, but clear it after
> > so the next child does get a copy of the area. I think it'd be wrong
> > (surprising) to limit the functionality in that way.
>
> Okay, I guess. I am just trying to avoid more VM_ flags.
> Cant we get rid of the last requirement, then?

What last requirement?

> I dont see why the driver should have a say on what the process does with its
> own memory.

If a driver sets VM_DONTCOPY, it's likely to be because the driver knows
it'll cause some nastiness (memory corruption, memory leak, lockup...) if
it were copied. The memory belongs to the driver, it's letting the process
have a window on it. I don't think we should now let the process overrule it.

> > > By the way, as a separate issue, we still have a problem with DMA to pages
> > > which are *needed* by the child process. What do you think about VM_COPY
> > > (to do the old unix thing of actually copying the page instead of
> > > setting the COW flag) and a matching madvise call to set/clear it?
> >
> > I don't much want to add another path into copy_pte_range, actually
> > copying pages. If the process really wants DMA into such areas,
> > then it should contain the code for the child to COW them itself?
>
> How do you do that, say, for a stack page, or global data section?

And why do you need to?

You seem to be saying, actually DONTCOPY isn't enough of a solution,
we need something else instead.

Hugh

2005-11-03 15:53:13

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

Quoting Hugh Dickins <[email protected]>:
> Subject: Re: Nick's core remove PageReserved broke vmware...
>
> On Thu, 3 Nov 2005, Michael S. Tsirkin wrote:
> > Quoting Hugh Dickins <[email protected]>:
> > > > All existing drivers that set VM_DONTCOPY also set VM_IO.
> > > > So lets just disable playing with these flags from madvise if
> VM_IO is set.
> > > > There's no reason I can see that the driver should have a say
> > > > on what the process does with its own (non-IO) memory.
> > > > Sounds good?
> > >
> > > You're then saying that a process cannot set VM_DONTCOPY on a VM_IO
> > > area to prevent the first child getting the area, but clear it after
> > > so the next child does get a copy of the area. I think it'd be wrong
> > > (surprising) to limit the functionality in that way.
> >
> > Okay, I guess. I am just trying to avoid more VM_ flags.
> > Cant we get rid of the last requirement, then?
>
> What last requirement?
>
> > I dont see why the driver should have a say on what the process does with its
> > own memory.
>
> If a driver sets VM_DONTCOPY, it's likely to be because the driver knows
> it'll cause some nastiness (memory corruption, memory leak, lockup...) if
> it were copied. The memory belongs to the driver, it's letting the process
> have a window on it. I don't think we should now let the process
> overrule it.

Okay, if you say so.

> > > > By the way, as a separate issue, we still have a problem with DMA to pages
> > > > which are *needed* by the child process. What do you think about VM_COPY
> > > > (to do the old unix thing of actually copying the page instead of
> > > > setting the COW flag) and a matching madvise call to set/clear it?
> > >
> > > I don't much want to add another path into copy_pte_range, actually
> > > copying pages. If the process really wants DMA into such areas,
> > > then it should contain the code for the child to COW them itself?
> >
> > How do you do that, say, for a stack page, or global data section?
>
> And why do you need to?
>
> You seem to be saying, actually DONTCOPY isn't enough of a solution,
> we need something else instead.
>
> Hugh

I am saying, ideally all pages marked for DMA would be copied e.g. on fork
rather than COW-ed since hardware may be accessing them.

We dont want the overhead, so instead we let the application
avoid COW in another way with DONTCOPY, and assume the child is careful
not to access any data in this VMA (even part of the page that is
not DMA'ed from/to).
Fine, but if the number of such pages is small, it seems we can
afford a full copy, so lets let the user decide?

But whether we want to, or not, is IMO a separate question from DONTCOPY,
which we seem to need anyway.

--
MST

2005-11-03 15:54:27

by Gleb Natapov

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

On Thu, Nov 03, 2005 at 03:37:44PM +0000, Hugh Dickins wrote:
> > > I don't much want to add another path into copy_pte_range, actually
> > > copying pages. If the process really wants DMA into such areas,
> > > then it should contain the code for the child to COW them itself?
> >
> > How do you do that, say, for a stack page, or global data section?
>
> And why do you need to?
>
> You seem to be saying, actually DONTCOPY isn't enough of a solution,
> we need something else instead.
>
DONTCOPY is a good solution for the problem it solves, but there are
other problems :). One of them is what should we do if only part of the
page is used for DMA and other part contains information needed by the
forked child? We can copy such pages on fork or declare such state to be
user error. I can live with both.

--
Gleb.

2005-11-03 21:22:29

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

On Thu, 2005-11-03 at 15:55 +0200, Gleb Natapov wrote:
> On Thu, Nov 03, 2005 at 01:32:15PM +0000, Hugh Dickins wrote:
> > On Thu, 3 Nov 2005, Gleb Natapov wrote:
> > > On Thu, Nov 03, 2005 at 09:22:10AM +1100, Benjamin Herrenschmidt wrote:
> > > > Also, what do you suggest as a good threshold to use on the max amount
> > > > of memory I can let the X server "pin" that way ? I was thinking it as
> > > > equivalent to mlock, thus I could maybe hijack mm->locked_vm & use
> > > > RLIMIT_MEMLOCK or is that too gross ?
> > > >
> > > This is what infiniband does, so it should be good for you too.
> >
> > Yes, I noticed that code a couple of days ago, I think you're setting a
> > good example there, for long-term or large-area uses of get_user_pages():
> > Ben, take a look at drivers/infiniband/core/uverbs_mem.c

Ok, thanks guys, will have a look.

Ben.

2005-11-08 21:30:59

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

Quoting Hugh Dickins <[email protected]>:
> In the time since we discussed before, I've rather come full circle
> round to my original position: abandoning such ideas of trying to
> handle it from get_user_pages itself, appreciating the simplicity
> of the original PROT_DONTCOPY idea from you guys; but sticking to my
> initial reaction that this is better done by madvise(MADV_DONTCOPY),
> not by the mmap/mprotect route in Michael's patch. (I never bought
> the "racy" argument advanced in favour of the mmap flag.)
>
> One of the factors which has swayed me to the DONTCOPY approach, is
> Nick's 2.6.14 optimization in fork's copy_page_range, where areas
> which can be safely faulted later are not copied pte by pte. But
> that doesn't apply to all areas, and in particular cannot apply to
> VM_NONLINEAR shared areas. It should be of benefit to apps which
> use large such areas, and also do a lot of forking children who don't
> need those areas, to be able to mark them VM_DONTCOPY. Or any other
> vmas the children won't need. (But there's one big distinction between
> the optimization and VM_DONTCOPY: the optimization copies vma but
> doesn't fill in its ptes, VM_DONTCOPY doesn't even copy the vma.)
>
> Two warnings if someone would like to post a MADV_DONTCOPY patch.
> It should include a matching MADV_DOCOPY to clear the condition, but
> that must not be allowed to clear VM_DONTCOPY set originally by driver:
> perhaps you'll end up with a VM_UDONTCOPY or something like that.
>
> And Badari has a MADV_REMOVE patch in the works, taking the next
> slot (just after MADV_DONTNEED in most of the arches): probably
> best for you to base yours on top of his (though yours is simpler
> and might jump ahead).
>
> Hugh
>

Hugh, did you have something like the following in mind
(this is only boot-tested and only on x86-64)?
Hmm, maybe MADV_INHERIT and MADV_DONT_INHERIT would be better names,
since the copy is only dont one write ...

Comments?

----

Signed-off-by: Michael S. Tsirkin <[email protected]>

Index: linux-2.6.14-dontcopy/kernel/fork.c
===================================================================
--- linux-2.6.14-dontcopy.orig/kernel/fork.c 2005-11-08 23:41:30.000000000 +0200
+++ linux-2.6.14-dontcopy/kernel/fork.c 2005-11-08 23:41:08.000000000 +0200
@@ -209,7 +209,7 @@ static inline int dup_mmap(struct mm_str
for (mpnt = current->mm->mmap ; mpnt ; mpnt = mpnt->vm_next) {
struct file *file;

- if (mpnt->vm_flags & VM_DONTCOPY) {
+ if (mpnt->vm_flags & (VM_DONTCOPY | VM_UDONTCOPY)) {
long pages = vma_pages(mpnt);
mm->total_vm -= pages;
__vm_stat_account(mm, mpnt->vm_flags, mpnt->vm_file,
Index: linux-2.6.14-dontcopy/mm/mmap.c
===================================================================
--- linux-2.6.14-dontcopy.orig/mm/mmap.c 2005-11-08 23:42:01.000000000 +0200
+++ linux-2.6.14-dontcopy/mm/mmap.c 2005-11-08 23:41:48.000000000 +0200
@@ -840,7 +840,7 @@ void __vm_stat_account(struct mm_struct

#ifdef CONFIG_HUGETLB
if (flags & VM_HUGETLB) {
- if (!(flags & VM_DONTCOPY))
+ if (!(flags & (VM_DONTCOPY|VM_UDONTCOPY)))
mm->shared_vm += pages;
return;
}
Index: linux-2.6.14-dontcopy/mm/madvise.c
===================================================================
--- linux-2.6.14-dontcopy.orig/mm/madvise.c 2005-10-28 02:02:08.000000000 +0200
+++ linux-2.6.14-dontcopy/mm/madvise.c 2005-11-08 23:28:56.000000000 +0200
@@ -31,6 +31,12 @@ static long madvise_behavior(struct vm_a
case MADV_RANDOM:
new_flags |= VM_RAND_READ;
break;
+ case MADV_DONTCOPY:
+ new_flags |= VM_UDONTCOPY;
+ break;
+ case MADV_DOCOPY:
+ new_flags &= ~VM_UDONTCOPY;
+ break;
default:
break;
}
@@ -150,6 +156,8 @@ madvise_vma(struct vm_area_struct *vma,
case MADV_NORMAL:
case MADV_SEQUENTIAL:
case MADV_RANDOM:
+ case MADV_DONTCOPY:
+ case MADV_DOCOPY:
error = madvise_behavior(vma, prev, start, end, behavior);
break;

Index: linux-2.6.14-dontcopy/include/linux/mm.h
===================================================================
--- linux-2.6.14-dontcopy.orig/include/linux/mm.h 2005-11-08 23:24:58.000000000 +0200
+++ linux-2.6.14-dontcopy/include/linux/mm.h 2005-11-08 23:25:09.000000000 +0200
@@ -154,6 +154,7 @@ extern unsigned int kobjsize(const void
/* Used by sys_madvise() */
#define VM_SEQ_READ 0x00008000 /* App will access data sequentially */
#define VM_RAND_READ 0x00010000 /* App will not benefit from clustered reads */
+#define VM_UDONTCOPY 0x02000000 /* App wants to set VM_DONTCOPY */

#define VM_DONTCOPY 0x00020000 /* Do not copy this vma on fork */
#define VM_DONTEXPAND 0x00040000 /* Cannot expand with mremap() */
Index: linux-2.6.14-dontcopy/include/asm-x86_64/mman.h
===================================================================
--- linux-2.6.14-dontcopy.orig/include/asm-x86_64/mman.h 2005-11-08 23:19:35.000000000 +0200
+++ linux-2.6.14-dontcopy/include/asm-x86_64/mman.h 2005-11-08 23:19:46.000000000 +0200
@@ -36,6 +36,8 @@
#define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
#define MADV_WILLNEED 0x3 /* pre-fault pages */
#define MADV_DONTNEED 0x4 /* discard these pages */
+#define MADV_DONTCOPY 0x30 /* dont inherit across fork */
+#define MADV_DOCOPY 0x31 /* do inherit across fork */

/* compatibility flags */
#define MAP_ANON MAP_ANONYMOUS

--
MST

2005-11-10 12:36:24

by Gleb Natapov

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

On Tue, Nov 08, 2005 at 11:34:07PM +0200, Michael S. Tsirkin wrote:
> Index: linux-2.6.14-dontcopy/mm/madvise.c
> ===================================================================
> --- linux-2.6.14-dontcopy.orig/mm/madvise.c 2005-10-28 02:02:08.000000000 +0200
> +++ linux-2.6.14-dontcopy/mm/madvise.c 2005-11-08 23:28:56.000000000 +0200
> @@ -31,6 +31,12 @@ static long madvise_behavior(struct vm_a
> case MADV_RANDOM:
> new_flags |= VM_RAND_READ;
> break;
> + case MADV_DONTCOPY:
> + new_flags |= VM_UDONTCOPY;
> + break;
> + case MADV_DOCOPY:
> + new_flags &= ~VM_UDONTCOPY;
> + break;
> default:
> break;
> }
I think you are removing VM_RAND_READ/VM_SEQ_READ here.
Also perhapse we should skip VM_SHARED VMAs?

--
Gleb.

2005-11-10 12:45:42

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

Quoting Gleb Natapov <[email protected]>:
> Subject: Re: Nick's core remove PageReserved broke vmware...
>
> On Tue, Nov 08, 2005 at 11:34:07PM +0200, Michael S. Tsirkin wrote:
> > Index: linux-2.6.14-dontcopy/mm/madvise.c
> > ===================================================================
> > --- linux-2.6.14-dontcopy.orig/mm/madvise.c 2005-10-28
> 02:02:08.000000000 +0200
> > +++ linux-2.6.14-dontcopy/mm/madvise.c 2005-11-08
> 23:28:56.000000000 +0200
> > @@ -31,6 +31,12 @@ static long madvise_behavior(struct vm_a
> > case MADV_RANDOM:
> > new_flags |= VM_RAND_READ;
> > break;
> > + case MADV_DONTCOPY:
> > + new_flags |= VM_UDONTCOPY;
> > + break;
> > + case MADV_DOCOPY:
> > + new_flags &= ~VM_UDONTCOPY;
> > + break;
> > default:
> > break;
> > }
> I think you are removing VM_RAND_READ/VM_SEQ_READ here.

True, we need something like

switch (behavior) {
case MADV_SEQUENTIAL:
new_flags = (vma->vm_flags & ~VM_READHINTMASK) | VM_SEQ_READ;
break;
case MADV_RANDOM:
new_flags = (vma->vm_flags & ~VM_READHINTMASK) | VM_RAND_READ;
break;
case MADV_DONTCOPY:
new_flags |= VM_UDONTCOPY;
break;
case MADV_DOCOPY:
new_flags &= ~VM_UDONTCOPY;
break;
default:
break;
}

Thanks!

> Also perhapse we should skip VM_SHARED VMAs?

Why?

> --
> Gleb.
>

--
MST

2005-11-10 12:50:34

by Gleb Natapov

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

On Thu, Nov 10, 2005 at 02:48:53PM +0200, Michael S. Tsirkin wrote:
> > Also perhapse we should skip VM_SHARED VMAs?
>
> Why?
>
They will work correctly across fork().

--
Gleb.

2005-11-10 13:11:22

by Hugh Dickins

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

On Tue, 8 Nov 2005, Michael S. Tsirkin wrote:
>
> Hugh, did you have something like the following in mind
> (this is only boot-tested and only on x86-64)?

Yes, that looks pretty good to me, a few comments below.
Only another twenty or so architectures to go ;)

(I had been imagining VM_DONTCOPY plus another flag to say set by the
user: better your way, we would like to merge these vmas, and VM_DONTCOPY
is in that peculiar list of special flags that prevent merging.)

> Hmm, maybe MADV_INHERIT and MADV_DONT_INHERIT would be better names,

You're right, and it would be a good choice, except that MAP_INHERIT on
some OSes has a particular meaning (about inheriting across an exec),
so I think avoid confusion with that. MADV_DONTFORK and MADV_DOFORK?
Accompanied by VM_DONTFORK?

> since the copy is only dont one write ...

"only done on write"?

> Index: linux-2.6.14-dontcopy/include/linux/mm.h
> ===================================================================
> --- linux-2.6.14-dontcopy.orig/include/linux/mm.h 2005-11-08 23:24:58.000000000 +0200
> +++ linux-2.6.14-dontcopy/include/linux/mm.h 2005-11-08 23:25:09.000000000 +0200
> @@ -154,6 +154,7 @@ extern unsigned int kobjsize(const void
> /* Used by sys_madvise() */
> #define VM_SEQ_READ 0x00008000 /* App will access data sequentially */
> #define VM_RAND_READ 0x00010000 /* App will not benefit from clustered reads */
> +#define VM_UDONTCOPY 0x02000000 /* App wants to set VM_DONTCOPY */

Please place it where you'd expect to find it by value.

> #define VM_DONTCOPY 0x00020000 /* Do not copy this vma on fork */
> #define VM_DONTEXPAND 0x00040000 /* Cannot expand with mremap() */
> Index: linux-2.6.14-dontcopy/include/asm-x86_64/mman.h
> ===================================================================
> --- linux-2.6.14-dontcopy.orig/include/asm-x86_64/mman.h 2005-11-08 23:19:35.000000000 +0200
> +++ linux-2.6.14-dontcopy/include/asm-x86_64/mman.h 2005-11-08 23:19:46.000000000 +0200
> @@ -36,6 +36,8 @@
> #define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
> #define MADV_WILLNEED 0x3 /* pre-fault pages */
> #define MADV_DONTNEED 0x4 /* discard these pages */
> +#define MADV_DONTCOPY 0x30 /* dont inherit across fork */
> +#define MADV_DOCOPY 0x31 /* do inherit across fork */
>
> /* compatibility flags */
> #define MAP_ANON MAP_ANONYMOUS

I think that's probably a good idea, to choose a range away from the rest.
But I'm not quite sure: anyone familiar with adding APIs listening?

Hugh

2005-11-10 13:13:15

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

Quoting Gleb Natapov <[email protected]>:
> Subject: Re: Nick's core remove PageReserved broke vmware...
>
> On Thu, Nov 10, 2005 at 02:48:53PM +0200, Michael S. Tsirkin wrote:
> > > Also perhapse we should skip VM_SHARED VMAs?
> >
> > Why?
> >
> They will work correctly across fork().

So why would I call madvise on such a VMA?

--
MST

2005-11-10 13:17:17

by Gleb Natapov

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

On Thu, Nov 10, 2005 at 03:16:30PM +0200, Michael S. Tsirkin wrote:
> Quoting Gleb Natapov <[email protected]>:
> > Subject: Re: Nick's core remove PageReserved broke vmware...
> >
> > On Thu, Nov 10, 2005 at 02:48:53PM +0200, Michael S. Tsirkin wrote:
> > > > Also perhapse we should skip VM_SHARED VMAs?
> > >
> > > Why?
> > >
> > They will work correctly across fork().
>
> So why would I call madvise on such a VMA?
>
Because libraries such as MPI can't figure out if memory is shared or not.
And madvise is just advise, so if kernel knows better it may ignore that
advise.

--
Gleb.

2005-11-10 13:16:46

by Hugh Dickins

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

On Thu, 10 Nov 2005, Michael S. Tsirkin wrote:
> Quoting Gleb Natapov <[email protected]>:
> > I think you are removing VM_RAND_READ/VM_SEQ_READ here.

Good catch from Gleb, I certainly didn't notice that.
>
> True, we need something like
>
> switch (behavior) {
> case MADV_SEQUENTIAL:
> new_flags = (vma->vm_flags & ~VM_READHINTMASK) | VM_SEQ_READ;
> break;
> case MADV_RANDOM:
> new_flags = (vma->vm_flags & ~VM_READHINTMASK) | VM_RAND_READ;
> break;

Yes, that helps avoid such errors in future. Though, no strong feelings,
but I'd find it clearer to avoid the obscure VM_READHINTMASK completely,
and just say (vma->vm_flags & ~VM_RAND_READ) | VM_SEQ_READ etc.

Hugh

2005-11-10 13:22:40

by Hugh Dickins

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

On Thu, 10 Nov 2005, Michael S. Tsirkin wrote:
> Quoting Gleb Natapov <[email protected]>:
> > On Thu, Nov 10, 2005 at 02:48:53PM +0200, Michael S. Tsirkin wrote:
> > > > Also perhapse we should skip VM_SHARED VMAs?
> > >
> > > Why?
> > >
> > They will work correctly across fork().
>
> So why would I call madvise on such a VMA?

To avoid the overhead of forking it e.g. if it's a large nonlinear vma,
a lot of time may be wasted on copying its ptes for fork. That's one
of the reasons I came to like your DONTCOPY.

So, it may not be useful for your particular RDMA issue, but I see no
reason to exclude VM_SHARED vmas from the madvise, and good reason to
include them.

Hugh

2005-11-10 13:27:43

by Gleb Natapov

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

On Thu, Nov 10, 2005 at 01:21:25PM +0000, Hugh Dickins wrote:
> On Thu, 10 Nov 2005, Michael S. Tsirkin wrote:
> > Quoting Gleb Natapov <[email protected]>:
> > > On Thu, Nov 10, 2005 at 02:48:53PM +0200, Michael S. Tsirkin wrote:
> > > > > Also perhapse we should skip VM_SHARED VMAs?
> > > >
> > > > Why?
> > > >
> > > They will work correctly across fork().
> >
> > So why would I call madvise on such a VMA?
>
> To avoid the overhead of forking it e.g. if it's a large nonlinear vma,
> a lot of time may be wasted on copying its ptes for fork. That's one
> of the reasons I came to like your DONTCOPY.
>
> So, it may not be useful for your particular RDMA issue, but I see no
> reason to exclude VM_SHARED vmas from the madvise, and good reason to
> include them.
>
If the scope of DONTCOPY is more broad that just RDMA then I agree.

--
Gleb.

2005-11-10 13:33:51

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

Quoting Hugh Dickins <[email protected]>:
> Subject: Re: Nick's core remove PageReserved broke vmware...
>
> On Tue, 8 Nov 2005, Michael S. Tsirkin wrote:
> >
> > Hugh, did you have something like the following in mind
> > (this is only boot-tested and only on x86-64)?
>
> Yes, that looks pretty good to me, a few comments below.
> Only another twenty or so architectures to go ;)

Yea, thats easy :).

> (I had been imagining VM_DONTCOPY plus another flag to say set by the
> user: better your way, we would like to merge these vmas, and
> VM_DONTCOPY is in that peculiar list of special flags that prevent merging.)
>
> > Hmm, maybe MADV_INHERIT and MADV_DONT_INHERIT would be better names,
>
> You're right, and it would be a good choice, except that MAP_INHERIT on
> some OSes has a particular meaning (about inheriting across an exec),
> so I think avoid confusion with that. MADV_DONTFORK and MADV_DOFORK?
> Accompanied by VM_DONTFORK?

Its actually similiar.
Maybe MADV_FORK_INHERIT/MADV_FORK_DONT_INHERIT then?
I find using "COPY" there confusing, since the copy is only done on write ...

> > Index: linux-2.6.14-dontcopy/include/asm-x86_64/mman.h
> > ===================================================================
> > --- linux-2.6.14-dontcopy.orig/include/asm-x86_64/mman.h
> 2005-11-08 23:19:35.000000000 +0200
> > +++ linux-2.6.14-dontcopy/include/asm-x86_64/mman.h 2005-11-08
> 23:19:46.000000000 +0200
> > @@ -36,6 +36,8 @@
> > #define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
> > #define MADV_WILLNEED 0x3 /* pre-fault pages */
> > #define MADV_DONTNEED 0x4 /* discard these pages
> */
> > +#define MADV_DONTCOPY 0x30 /* dont inherit across fork */
> > +#define MADV_DOCOPY 0x31 /* do inherit across fork */
> >
> > /* compatibility flags */
> > #define MAP_ANON MAP_ANONYMOUS
>
> I think that's probably a good idea, to choose a range away from the rest.
> But I'm not quite sure: anyone familiar with adding APIs listening?
>
> Hugh
>

My reason was to make it possible to have identical values for all
architectures, making userspace portability easier.

--
MST

2005-11-10 13:56:46

by Hugh Dickins

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

On Thu, 10 Nov 2005, Michael S. Tsirkin wrote:
> Quoting Hugh Dickins <[email protected]>:
> >
> > You're right, and it would be a good choice, except that MAP_INHERIT on
> > some OSes has a particular meaning (about inheriting across an exec),
> > so I think avoid confusion with that. MADV_DONTFORK and MADV_DOFORK?
> > Accompanied by VM_DONTFORK?
>
> Its actually similiar.

Indeed, similar, but different.

> Maybe MADV_FORK_INHERIT/MADV_FORK_DONT_INHERIT then?

Those names are a lot longer than the others in that file.
I still prefer MADV_DONTFORK etc. myself. Or the original MADV_DONTCOPY,
though I think you were quite right to point out it's all about forking.

> I find using "COPY" there confusing, since the copy is only done on write ...

There are lots of levels on which there's copying or not. The name
VM_DONTCOPY means "don't copy this vma when forking", it's not thinking
of copying pages or even of copying ptes: just don't copy this vma into
the child.

Hugh

2005-11-10 14:09:43

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

Quoting Hugh Dickins <[email protected]>:
> > Maybe MADV_FORK_INHERIT/MADV_FORK_DONT_INHERIT then?
>
> Those names are a lot longer than the others in that file.
> I still prefer MADV_DONTFORK etc. myself.

We'll do MADV_DONTFORK then.

--
MST

2005-11-14 12:23:53

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

Quoting Hugh Dickins <[email protected]>:
> Subject: Re: Nick's core remove PageReserved broke vmware...
>
> On Tue, 8 Nov 2005, Michael S. Tsirkin wrote:
> >
> > Hugh, did you have something like the following in mind
> > (this is only boot-tested and only on x86-64)?
>
> Yes, that looks pretty good to me, a few comments below.
> Only another twenty or so architectures to go ;)

There's one thing that I have thought about: what happens
if I set DONTFORK on a page which already has COW set
(e.g. after fork)?

It seems that the right thing would be to force a page copy -
otherwise the page can get copied on write.

Makes sense?

--
MST

2005-11-14 12:28:44

by Gleb Natapov

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

On Mon, Nov 14, 2005 at 02:25:35PM +0200, Michael S. Tsirkin wrote:
> Quoting Hugh Dickins <[email protected]>:
> > Subject: Re: Nick's core remove PageReserved broke vmware...
> >
> > On Tue, 8 Nov 2005, Michael S. Tsirkin wrote:
> > >
> > > Hugh, did you have something like the following in mind
> > > (this is only boot-tested and only on x86-64)?
> >
> > Yes, that looks pretty good to me, a few comments below.
> > Only another twenty or so architectures to go ;)
>
> There's one thing that I have thought about: what happens
> if I set DONTFORK on a page which already has COW set
> (e.g. after fork)?
>
> It seems that the right thing would be to force a page copy -
> otherwise the page can get copied on write.
>
I thought about it. It should not happen for OpenIB since get_user_pages
will break COW for us and I don't think we should complicate DONTFORK
implementation by doing break during madvise().

--
Gleb.

2005-11-14 12:32:41

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

Quoting Gleb Natapov <[email protected]>:
> Subject: Re: Nick's core remove PageReserved broke vmware...
>
> On Mon, Nov 14, 2005 at 02:25:35PM +0200, Michael S. Tsirkin wrote:
> > Quoting Hugh Dickins <[email protected]>:
> > > Subject: Re: Nick's core remove PageReserved broke vmware...
> > >
> > > On Tue, 8 Nov 2005, Michael S. Tsirkin wrote:
> > > >
> > > > Hugh, did you have something like the following in mind
> > > > (this is only boot-tested and only on x86-64)?
> > >
> > > Yes, that looks pretty good to me, a few comments below.
> > > Only another twenty or so architectures to go ;)
> >
> > There's one thing that I have thought about: what happens
> > if I set DONTFORK on a page which already has COW set
> > (e.g. after fork)?
> >
> > It seems that the right thing would be to force a page copy -
> > otherwise the page can get copied on write.
>
> I thought about it. It should not happen for OpenIB since get_user_pages
> will break COW for us and I don't think we should complicate DONTFORK
> implementation by doing break during madvise().

Hmm, I assumed we call madvise before driver does get_user_pages,
otherwise an application could fork in between.
Should we worry about this?

--
MST

2005-11-14 12:41:36

by Hugh Dickins

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

On Mon, 14 Nov 2005, Michael S. Tsirkin wrote:
> Quoting Gleb Natapov <[email protected]>:
> > On Mon, Nov 14, 2005 at 02:25:35PM +0200, Michael S. Tsirkin wrote:
> > >
> > > There's one thing that I have thought about: what happens
> > > if I set DONTFORK on a page which already has COW set
> > > (e.g. after fork)?
> > >
> > > It seems that the right thing would be to force a page copy -
> > > otherwise the page can get copied on write.

No, keep it simple, DONTFORK simply marks the area as not to be included
in a fork from that time onwards (until perhaps a DOFORK follows).

> > I thought about it. It should not happen for OpenIB since get_user_pages
> > will break COW for us and I don't think we should complicate DONTFORK
> > implementation by doing break during madvise().

Exactly.

> Hmm, I assumed we call madvise before driver does get_user_pages,
> otherwise an application could fork in between.

I think we're all of us assuming that.

> Should we worry about this?

About what?

Hugh

2005-11-14 12:41:57

by Gleb Natapov

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

On Mon, Nov 14, 2005 at 02:34:32PM +0200, Michael S. Tsirkin wrote:
> Quoting Gleb Natapov <[email protected]>:
> > Subject: Re: Nick's core remove PageReserved broke vmware...
> >
> > On Mon, Nov 14, 2005 at 02:25:35PM +0200, Michael S. Tsirkin wrote:
> > > Quoting Hugh Dickins <[email protected]>:
> > > > Subject: Re: Nick's core remove PageReserved broke vmware...
> > > >
> > > > On Tue, 8 Nov 2005, Michael S. Tsirkin wrote:
> > > > >
> > > > > Hugh, did you have something like the following in mind
> > > > > (this is only boot-tested and only on x86-64)?
> > > >
> > > > Yes, that looks pretty good to me, a few comments below.
> > > > Only another twenty or so architectures to go ;)
> > >
> > > There's one thing that I have thought about: what happens
> > > if I set DONTFORK on a page which already has COW set
> > > (e.g. after fork)?
> > >
> > > It seems that the right thing would be to force a page copy -
> > > otherwise the page can get copied on write.
> >
> > I thought about it. It should not happen for OpenIB since get_user_pages
> > will break COW for us and I don't think we should complicate DONTFORK
> > implementation by doing break during madvise().
>
> Hmm, I assumed we call madvise before driver does get_user_pages,
> otherwise an application could fork in between.
> Should we worry about this?
>
OK, we set VM_DONTFORK on vma and than driver calls get_user_pages()
that actually do page table walking and COW breaking. Or do I miss
something here?

--
Gleb.

2005-11-14 14:55:18

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

Quoting r. Hugh Dickins <[email protected]>:
> Subject: Re: Nick's core remove PageReserved broke vmware...
>
> On Mon, 14 Nov 2005, Michael S. Tsirkin wrote:
> > Quoting Gleb Natapov <[email protected]>:
> > > On Mon, Nov 14, 2005 at 02:25:35PM +0200, Michael S. Tsirkin wrote:
> > > >
> > > > There's one thing that I have thought about: what happens
> > > > if I set DONTFORK on a page which already has COW set
> > > > (e.g. after fork)?
> > > >
> > > > It seems that the right thing would be to force a page copy -
> > > > otherwise the page can get copied on write.
>
> > Should we worry about this?
>
> About what?

For pages which hardware will only read, not write,
hardware driver does get_user_pages with write cleared.

This means that COW may remain set.

--
MST

2005-11-14 14:51:04

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

Quoting r. Hugh Dickins <[email protected]>:
> Subject: Re: Nick's core remove PageReserved broke vmware...
>
> On Tue, 8 Nov 2005, Michael S. Tsirkin wrote:
> >
> > Hugh, did you have something like the following in mind
> > (this is only boot-tested and only on x86-64)?
>
> Yes, that looks pretty good to me, a few comments below.
> Only another twenty or so architectures to go ;)

Okay, here's an updated version.
More comments, before I go ahead and update the rest of the architectures :)?

---

Add madvise options to control whether memory range is inherited across fork.
Useful e.g. for when hardware is doing DMA from/into these pages.

Signed-off-by: Michael S. Tsirkin <[email protected]>

Index: linux-2.6.14-dontcopy/kernel/fork.c
===================================================================
--- linux-2.6.14-dontcopy.orig/kernel/fork.c 2005-11-08 23:41:30.000000000 +0200
+++ linux-2.6.14-dontcopy/kernel/fork.c 2005-11-14 16:59:05.000000000 +0200
@@ -209,7 +209,7 @@ static inline int dup_mmap(struct mm_str
for (mpnt = current->mm->mmap ; mpnt ; mpnt = mpnt->vm_next) {
struct file *file;

- if (mpnt->vm_flags & VM_DONTCOPY) {
+ if (mpnt->vm_flags & (VM_DONTCOPY | VM_DONTFORK)) {
long pages = vma_pages(mpnt);
mm->total_vm -= pages;
__vm_stat_account(mm, mpnt->vm_flags, mpnt->vm_file,
Index: linux-2.6.14-dontcopy/mm/mmap.c
===================================================================
--- linux-2.6.14-dontcopy.orig/mm/mmap.c 2005-11-08 23:42:01.000000000 +0200
+++ linux-2.6.14-dontcopy/mm/mmap.c 2005-11-14 17:02:37.000000000 +0200
@@ -840,7 +840,7 @@ void __vm_stat_account(struct mm_struct

#ifdef CONFIG_HUGETLB
if (flags & VM_HUGETLB) {
- if (!(flags & VM_DONTCOPY))
+ if (!(flags & (VM_DONTCOPY|VM_DONTFORK)))
mm->shared_vm += pages;
return;
}
Index: linux-2.6.14-dontcopy/mm/madvise.c
===================================================================
--- linux-2.6.14-dontcopy.orig/mm/madvise.c 2005-10-28 02:02:08.000000000 +0200
+++ linux-2.6.14-dontcopy/mm/madvise.c 2005-11-14 17:04:51.000000000 +0200
@@ -22,14 +22,20 @@ static long madvise_behavior(struct vm_a
struct mm_struct * mm = vma->vm_mm;
int error = 0;
pgoff_t pgoff;
- int new_flags = vma->vm_flags & ~VM_READHINTMASK;
+ int new_flags = vma->vm_flags;

switch (behavior) {
case MADV_SEQUENTIAL:
- new_flags |= VM_SEQ_READ;
+ new_flags = (new_flags & ~VM_RAND_READ) | VM_SEQ_READ;
break;
case MADV_RANDOM:
- new_flags |= VM_RAND_READ;
+ new_flags = (new_flags & ~VM_SEQ_READ) | VM_RAND_READ;
+ break;
+ case MADV_DONTFORK:
+ new_flags |= VM_DONTFORK;
+ break;
+ case MADV_DOFORK:
+ new_flags &= ~VM_DONTFORK;
break;
default:
break;
@@ -150,6 +156,8 @@ madvise_vma(struct vm_area_struct *vma,
case MADV_NORMAL:
case MADV_SEQUENTIAL:
case MADV_RANDOM:
+ case MADV_DONTFORK:
+ case MADV_DOFORK:
error = madvise_behavior(vma, prev, start, end, behavior);
break;

Index: linux-2.6.14-dontcopy/include/linux/mm.h
===================================================================
--- linux-2.6.14-dontcopy.orig/include/linux/mm.h 2005-11-08 23:24:58.000000000 +0200
+++ linux-2.6.14-dontcopy/include/linux/mm.h 2005-11-14 16:58:51.000000000 +0200
@@ -162,6 +162,7 @@ extern unsigned int kobjsize(const void
#define VM_HUGETLB 0x00400000 /* Huge TLB Page VM */
#define VM_NONLINEAR 0x00800000 /* Is non-linear (remap_file_pages) */
#define VM_MAPPED_COPY 0x01000000 /* T if mapped copy of data (nommu mmap) */
+#define VM_DONTFORK 0x02000000 /* App wants to set VM_DONTCOPY */

#ifndef VM_STACK_DEFAULT_FLAGS /* arch can override this */
#define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
Index: linux-2.6.14-dontcopy/include/asm-x86_64/mman.h
===================================================================
--- linux-2.6.14-dontcopy.orig/include/asm-x86_64/mman.h 2005-11-08 23:19:35.000000000 +0200
+++ linux-2.6.14-dontcopy/include/asm-x86_64/mman.h 2005-11-14 16:57:29.000000000 +0200
@@ -36,6 +36,8 @@
#define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
#define MADV_WILLNEED 0x3 /* pre-fault pages */
#define MADV_DONTNEED 0x4 /* discard these pages */
+#define MADV_DONTFORK 0x30 /* dont inherit across fork */
+#define MADV_DOFORK 0x31 /* do inherit across fork */

/* compatibility flags */
#define MAP_ANON MAP_ANONYMOUS

--
MST

2005-11-14 15:01:08

by Gleb Natapov

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

On Mon, Nov 14, 2005 at 04:52:52PM +0200, Michael S. Tsirkin wrote:
> Index: linux-2.6.14-dontcopy/mm/madvise.c
> ===================================================================
> --- linux-2.6.14-dontcopy.orig/mm/madvise.c 2005-10-28 02:02:08.000000000 +0200
> +++ linux-2.6.14-dontcopy/mm/madvise.c 2005-11-14 17:04:51.000000000 +0200
> @@ -22,14 +22,20 @@ static long madvise_behavior(struct vm_a
> struct mm_struct * mm = vma->vm_mm;
> int error = 0;
> pgoff_t pgoff;
> - int new_flags = vma->vm_flags & ~VM_READHINTMASK;
> + int new_flags = vma->vm_flags;
>
> switch (behavior) {
> case MADV_SEQUENTIAL:
> - new_flags |= VM_SEQ_READ;
> + new_flags = (new_flags & ~VM_RAND_READ) | VM_SEQ_READ;
> break;
> case MADV_RANDOM:
> - new_flags |= VM_RAND_READ;
> + new_flags = (new_flags & ~VM_SEQ_READ) | VM_RAND_READ;
> + break;
> + case MADV_DONTFORK:
> + new_flags |= VM_DONTFORK;
> + break;
> + case MADV_DOFORK:
> + new_flags &= ~VM_DONTFORK;
> break;
> default:
> break;
It seams now you broke MADV_NORMAL.

--
Gleb.

2005-11-14 15:07:53

by Gleb Natapov

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

On Mon, Nov 14, 2005 at 04:57:08PM +0200, Michael S. Tsirkin wrote:
> Quoting r. Hugh Dickins <[email protected]>:
> > Subject: Re: Nick's core remove PageReserved broke vmware...
> >
> > On Mon, 14 Nov 2005, Michael S. Tsirkin wrote:
> > > Quoting Gleb Natapov <[email protected]>:
> > > > On Mon, Nov 14, 2005 at 02:25:35PM +0200, Michael S. Tsirkin wrote:
> > > > >
> > > > > There's one thing that I have thought about: what happens
> > > > > if I set DONTFORK on a page which already has COW set
> > > > > (e.g. after fork)?
> > > > >
> > > > > It seems that the right thing would be to force a page copy -
> > > > > otherwise the page can get copied on write.
> >
> > > Should we worry about this?
> >
> > About what?
>
> For pages which hardware will only read, not write,
> hardware driver does get_user_pages with write cleared.
>
It looks like openib always pass 1 as write. I think this is
exactly for this reason.

> This means that COW may remain set.
>
> --
> MST

--
Gleb.

2005-11-14 20:22:16

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

Quoting Gleb Natapov <[email protected]>:
> It seams now you broke MADV_NORMAL.

Right. How's this?

---

Add madvise options to control whether memory range is inherited across fork.
Useful e.g. for when hardware is doing DMA from/into these pages.

Signed-off-by: Michael S. Tsirkin <[email protected]>

Index: linux-2.6.14-dontcopy/kernel/fork.c
===================================================================
--- linux-2.6.14-dontcopy.orig/kernel/fork.c 2005-11-08 23:41:30.000000000 +0200
+++ linux-2.6.14-dontcopy/kernel/fork.c 2005-11-14 16:59:05.000000000 +0200
@@ -209,7 +209,7 @@ static inline int dup_mmap(struct mm_str
for (mpnt = current->mm->mmap ; mpnt ; mpnt = mpnt->vm_next) {
struct file *file;

- if (mpnt->vm_flags & VM_DONTCOPY) {
+ if (mpnt->vm_flags & (VM_DONTCOPY | VM_DONTFORK)) {
long pages = vma_pages(mpnt);
mm->total_vm -= pages;
__vm_stat_account(mm, mpnt->vm_flags, mpnt->vm_file,
Index: linux-2.6.14-dontcopy/mm/mmap.c
===================================================================
--- linux-2.6.14-dontcopy.orig/mm/mmap.c 2005-11-08 23:42:01.000000000 +0200
+++ linux-2.6.14-dontcopy/mm/mmap.c 2005-11-14 17:02:37.000000000 +0200
@@ -840,7 +840,7 @@ void __vm_stat_account(struct mm_struct

#ifdef CONFIG_HUGETLB
if (flags & VM_HUGETLB) {
- if (!(flags & VM_DONTCOPY))
+ if (!(flags & (VM_DONTCOPY|VM_DONTFORK)))
mm->shared_vm += pages;
return;
}
Index: linux-2.6.14-dontcopy/mm/madvise.c
===================================================================
--- linux-2.6.14-dontcopy.orig/mm/madvise.c 2005-10-28 02:02:08.000000000 +0200
+++ linux-2.6.14-dontcopy/mm/madvise.c 2005-11-14 22:40:43.000000000 +0200
@@ -22,16 +22,23 @@ static long madvise_behavior(struct vm_a
struct mm_struct * mm = vma->vm_mm;
int error = 0;
pgoff_t pgoff;
- int new_flags = vma->vm_flags & ~VM_READHINTMASK;
+ int new_flags = vma->vm_flags;

switch (behavior) {
+ case MADV_NORMAL:
+ new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ;
+ break;
case MADV_SEQUENTIAL:
- new_flags |= VM_SEQ_READ;
+ new_flags = (new_flags & ~VM_RAND_READ) | VM_SEQ_READ;
break;
case MADV_RANDOM:
- new_flags |= VM_RAND_READ;
+ new_flags = (new_flags & ~VM_SEQ_READ) | VM_RAND_READ;
break;
- default:
+ case MADV_DONTFORK:
+ new_flags |= VM_DONTFORK;
+ break;
+ case MADV_DOFORK:
+ new_flags &= ~VM_DONTFORK;
break;
}

@@ -150,6 +157,8 @@ madvise_vma(struct vm_area_struct *vma,
case MADV_NORMAL:
case MADV_SEQUENTIAL:
case MADV_RANDOM:
+ case MADV_DONTFORK:
+ case MADV_DOFORK:
error = madvise_behavior(vma, prev, start, end, behavior);
break;

Index: linux-2.6.14-dontcopy/include/linux/mm.h
===================================================================
--- linux-2.6.14-dontcopy.orig/include/linux/mm.h 2005-11-08 23:24:58.000000000 +0200
+++ linux-2.6.14-dontcopy/include/linux/mm.h 2005-11-14 16:58:51.000000000 +0200
@@ -162,6 +162,7 @@ extern unsigned int kobjsize(const void
#define VM_HUGETLB 0x00400000 /* Huge TLB Page VM */
#define VM_NONLINEAR 0x00800000 /* Is non-linear (remap_file_pages) */
#define VM_MAPPED_COPY 0x01000000 /* T if mapped copy of data (nommu mmap) */
+#define VM_DONTFORK 0x02000000 /* App wants to set VM_DONTCOPY */

#ifndef VM_STACK_DEFAULT_FLAGS /* arch can override this */
#define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
Index: linux-2.6.14-dontcopy/include/asm-x86_64/mman.h
===================================================================
--- linux-2.6.14-dontcopy.orig/include/asm-x86_64/mman.h 2005-11-08 23:19:35.000000000 +0200
+++ linux-2.6.14-dontcopy/include/asm-x86_64/mman.h 2005-11-14 16:57:29.000000000 +0200
@@ -36,6 +36,8 @@
#define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */
#define MADV_WILLNEED 0x3 /* pre-fault pages */
#define MADV_DONTNEED 0x4 /* discard these pages */
+#define MADV_DONTFORK 0x30 /* dont inherit across fork */
+#define MADV_DOFORK 0x31 /* do inherit across fork */

/* compatibility flags */
#define MAP_ANON MAP_ANONYMOUS
--
MST

2005-11-14 21:11:19

by Hugh Dickins

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

On Mon, 14 Nov 2005, Michael S. Tsirkin wrote:
>
> Okay, here's an updated version.

Looked good to me, but as usual Gleb noticed what I missed. And you
should be working against 2.6.15-rc1 or 2.6.15-rc1-git, not 2.6.14.

Hugh

2005-11-14 21:16:07

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

Quoting r. Hugh Dickins <[email protected]>:
> Subject: Re: Nick's core remove PageReserved broke vmware...
>
> On Mon, 14 Nov 2005, Michael S. Tsirkin wrote:
> >
> > Okay, here's an updated version.
>
> Looked good to me, but as usual Gleb noticed what I missed.

Yep, I fixed that in an updated patch. Thanks very much Gleb!

> And you
> should be working against 2.6.15-rc1 or 2.6.15-rc1-git, not 2.6.14.
>
> Hugh
>

So - its submission time?
I shall go over these 20 missing architectures then :)

--
MST

2005-11-15 09:26:47

by Gleb Natapov

[permalink] [raw]
Subject: Re: Nick's core remove PageReserved broke vmware...

On Mon, Nov 14, 2005 at 10:23:49PM +0200, Michael S. Tsirkin wrote:
> Right. How's this?
>
Looks good to me now.

> +#define VM_DONTFORK 0x02000000 /* App wants to set VM_DONTCOPY */
Minor nitpicking. I think comment should be changed. Something like:
/* App don't want its child inherit this VMA */ since we don't really
set VM_DONTCOPY.

--
Gleb.