2013-03-21 06:19:43

by Michael S. Tsirkin

[permalink] [raw]
Subject: [PATCH] rdma: don't make pages writeable if not requiested

core/umem.c seems to get the arguments to get_user_pages
in the reverse order: it sets writeable flag and
breaks COW for MAP_SHARED if and only if hardware needs to
write the page.

This breaks memory overcommit for users such as KVM:
each time we try to register a page to send it to remote, this
breaks COW. It seems that for applications that only have
REMOTE_READ permission, there is no reason to break COW at all.

If the page that is COW has lots of copies, this makes the user process
quickly exceed the cgroups memory limit. This makes RDMA mostly useless
for virtualization, thus the stable tag.

Reported-by: "Michael R. Hines" <[email protected]>
Cc: [email protected]
Signed-off-by: Michael S. Tsirkin <[email protected]>
---

Note: compile-tested only, I don't have RDMA hardware at the moment.
Michael, could you please try this patch (also fixing your
usespace code not to request write access) and report?

Note2: grep for get_user_pages in infiniband drivers turns up
lots of users who set write to 1 unconditionally.
These might be bugs too, should be checked.

drivers/infiniband/core/umem.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index a841123..5929598 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -152,7 +152,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
ret = get_user_pages(current, current->mm, cur_base,
min_t(unsigned long, npages,
PAGE_SIZE / sizeof (struct page *)),
- 1, !umem->writable, page_list, vma_list);
+ !umem->writable, 1, page_list, vma_list);

if (ret < 0)
goto out;
--
MST


2013-03-21 06:56:20

by Roland Dreier

[permalink] [raw]
Subject: Re: [PATCH] rdma: don't make pages writeable if not requiested

On Wed, Mar 20, 2013 at 11:18 PM, Michael S. Tsirkin <[email protected]> wrote:
> core/umem.c seems to get the arguments to get_user_pages
> in the reverse order: it sets writeable flag and
> breaks COW for MAP_SHARED if and only if hardware needs to
> write the page.
>
> This breaks memory overcommit for users such as KVM:
> each time we try to register a page to send it to remote, this
> breaks COW. It seems that for applications that only have
> REMOTE_READ permission, there is no reason to break COW at all.

I proposed a similar (but not exactly the same, see below) patch a
while ago: https://lkml.org/lkml/2012/1/26/7 but read the thread,
especially https://lkml.org/lkml/2012/2/6/265

I think this change will break the case where userspace tries to
register an MR with read-only permission, but intends locally through
the CPU to write to the memory. If the memory registration is done
while the memory is mapped read-only but has VM_MAYWRITE, then
userspace gets into trouble when COW happens. In the case you're
describing (although I'm not sure where in KVM we're talking about
using RDMA), what happens if you register memory with only REMOTE_READ
and then COW is triggered because of a local write? (I'm assuming you
don't want remote access to continue to get the old contents of the
page)

I have to confess that I still haven't had a chance to implement the
proposed FOLL_FOLLOW solution to all of this.

> If the page that is COW has lots of copies, this makes the user process
> quickly exceed the cgroups memory limit. This makes RDMA mostly useless
> for virtualization, thus the stable tag.

The actual problem description here is a bit too terse for me to
understand. How do we end up with lots of copies of a COW page? Why
is RDMA registering the memory any more special than having everyone
who maps this page actually writing to it and triggering COW?

> ret = get_user_pages(current, current->mm, cur_base,
> min_t(unsigned long, npages,
> PAGE_SIZE / sizeof (struct page *)),
> - 1, !umem->writable, page_list, vma_list);
> + !umem->writable, 1, page_list, vma_list);

The first two parameters in this line being changed are "write" and "force".

I think if we do change this, then we need to pass umem->writable (as
opposed to !umem->writable) for the "write" parameter. Not sure
whether "force" makes sense or not.

- R.

2013-03-21 07:03:30

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH] rdma: don't make pages writeable if not requiested

On Wed, Mar 20, 2013 at 11:55:54PM -0700, Roland Dreier wrote:
> On Wed, Mar 20, 2013 at 11:18 PM, Michael S. Tsirkin <[email protected]> wrote:
> > core/umem.c seems to get the arguments to get_user_pages
> > in the reverse order: it sets writeable flag and
> > breaks COW for MAP_SHARED if and only if hardware needs to
> > write the page.
> >
> > This breaks memory overcommit for users such as KVM:
> > each time we try to register a page to send it to remote, this
> > breaks COW. It seems that for applications that only have
> > REMOTE_READ permission, there is no reason to break COW at all.
>
> I proposed a similar (but not exactly the same, see below) patch a
> while ago: https://lkml.org/lkml/2012/1/26/7 but read the thread,
> especially https://lkml.org/lkml/2012/2/6/265
>
> I think this change will break the case where userspace tries to
> register an MR with read-only permission, but intends locally through
> the CPU to write to the memory.

Shouldn't it set LOCAL_WRITE then?

2013-03-21 07:15:57

by Roland Dreier

[permalink] [raw]
Subject: Re: [PATCH] rdma: don't make pages writeable if not requiested

>> I think this change will break the case where userspace tries to
>> register an MR with read-only permission, but intends locally through
>> the CPU to write to the memory.

> Shouldn't it set LOCAL_WRITE then?

We're talking about the permissions for the register MR operation,
right? (That's what the kernel RDMA driver code that does
get_user_pages() sees)

In that case, no, I don't see any reason for LOCAL_WRITE, since the
only RDMA operations that will access this memory are remote reads.
The writing (that triggers COW) is coming from normal process access
triggering a page fault, etc. This is a pretty standard way of using
RDMA... For example, I allocate some memory and register it for RDMA
read (and pass the R_Key to the remote system) with only REMOTE_READ
permission. Then I fill in the memory with the results of some
computation and the remote system does an RDMA read to get those
results.

- R.

2013-03-21 08:50:50

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH] rdma: don't make pages writeable if not requiested

On Thu, Mar 21, 2013 at 12:15:33AM -0700, Roland Dreier wrote:
> >> I think this change will break the case where userspace tries to
> >> register an MR with read-only permission, but intends locally through
> >> the CPU to write to the memory.
>
> > Shouldn't it set LOCAL_WRITE then?
>
> We're talking about the permissions for the register MR operation,
> right? (That's what the kernel RDMA driver code that does
> get_user_pages() sees)
>
> In that case, no, I don't see any reason for LOCAL_WRITE, since the
> only RDMA operations that will access this memory are remote reads.

What is the meaning of LOCAL_WRITE then? There are no local
RDMA writes as far as I can see.

> The writing (that triggers COW) is coming from normal process access
> triggering a page fault, etc. This is a pretty standard way of using
> RDMA... For example, I allocate some memory and register it for RDMA
> read (and pass the R_Key to the remote system) with only REMOTE_READ
> permission. Then I fill in the memory with the results of some
> computation and the remote system does an RDMA read to get those
> results.
>
> - R.

OK then what we need is a new flag saying "I really do not
intend to write into this memory please do not break
COW or do anything else just in case I do".

--
MST

2013-03-21 09:14:04

by Roland Dreier

[permalink] [raw]
Subject: Re: [PATCH] rdma: don't make pages writeable if not requiested

On Thu, Mar 21, 2013 at 1:51 AM, Michael S. Tsirkin <[email protected]> wrote:
>> In that case, no, I don't see any reason for LOCAL_WRITE, since the
>> only RDMA operations that will access this memory are remote reads.
>
> What is the meaning of LOCAL_WRITE then? There are no local
> RDMA writes as far as I can see.

Umm, it means you're giving the local adapter permission to write to
that memory. So you can use it as a receive buffer or as the target
for remote data from an RDMA read operation.

> OK then what we need is a new flag saying "I really do not
> intend to write into this memory please do not break
> COW or do anything else just in case I do".

Isn't that a shared read-only mapping?

- R.

2013-03-21 09:33:43

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH] rdma: don't make pages writeable if not requiested

On Wed, Mar 20, 2013 at 11:55:54PM -0700, Roland Dreier wrote:
> On Wed, Mar 20, 2013 at 11:18 PM, Michael S. Tsirkin <[email protected]> wrote:
> > core/umem.c seems to get the arguments to get_user_pages
> > in the reverse order: it sets writeable flag and
> > breaks COW for MAP_SHARED if and only if hardware needs to
> > write the page.
> >
> > This breaks memory overcommit for users such as KVM:
> > each time we try to register a page to send it to remote, this
> > breaks COW. It seems that for applications that only have
> > REMOTE_READ permission, there is no reason to break COW at all.
>
> I proposed a similar (but not exactly the same, see below) patch a
> while ago: https://lkml.org/lkml/2012/1/26/7 but read the thread,
> especially https://lkml.org/lkml/2012/2/6/265
>
> I think this change will break the case where userspace tries to
> register an MR with read-only permission, but intends locally through
> the CPU to write to the memory. If the memory registration is done
> while the memory is mapped read-only but has VM_MAYWRITE, then
> userspace gets into trouble when COW happens. In the case you're
> describing (although I'm not sure where in KVM we're talking about
> using RDMA), what happens if you register memory with only REMOTE_READ
> and then COW is triggered because of a local write? (I'm assuming you
> don't want remote access to continue to get the old contents of the
> page)

I read that, and the above. It looks like everyone was doing tricks
like "register page, then modify it, then let remote read it"
and for some reason assumed it's ok to write into page locally from CPU
even if LOCAL_WRITE is not set. I don't see why don't applications set
LOCAL_WRITE if they are going to write to memory locally, but assuming
they don't, we can't just break them.

So what we need is a new "no I really do not intend to write into this
memory" flag that avoids doing tricks in the kernel and treats the
page normally, just pins it so hardware can read it.


> I have to confess that I still haven't had a chance to implement the
> proposed FOLL_FOLLOW solution to all of this.

See a much easier to implement proposal at the bottom.

> > If the page that is COW has lots of copies, this makes the user process
> > quickly exceed the cgroups memory limit. This makes RDMA mostly useless
> > for virtualization, thus the stable tag.
>
> The actual problem description here is a bit too terse for me to
> understand. How do we end up with lots of copies of a COW page?

Reading the links above, rdma breaks COW intentionally.

Imagine a page with lots of instances in the process page map.
For example a zero page, but not only that: we rely on KSM heavily
to deduplicate pages for multiple VMs.
There are gigabytes of these in each of the multiple VMs
running on a host.

What we are using RDMA for is VM migration so we careful not to change
this memory: when we do allow memory to change we are careful
to track what was changed, reregister and resend the data.

But at the moment, each time we register a virtual address referencing
this page, infiniband assumes we might want to change the page so it
does get_user_pages with writeable flag, forcing a copy.
Amount of used RAM explodes.

> Why
> is RDMA registering the memory any more special than having everyone
> who maps this page actually writing to it and triggering COW?
>
> > ret = get_user_pages(current, current->mm, cur_base,
> > min_t(unsigned long, npages,
> > PAGE_SIZE / sizeof (struct page *)),
> > - 1, !umem->writable, page_list, vma_list);
> > + !umem->writable, 1, page_list, vma_list);
>
> The first two parameters in this line being changed are "write" and "force".
>
> I think if we do change this, then we need to pass umem->writable (as
> opposed to !umem->writable) for the "write" parameter.

Ugh. Sure enough. Let's agree on the direction before I respin the
patch though.

> Not sure
> whether "force" makes sense or not.
>
> - R.

If you don't force write on read-only mappings you don't, but
it seems harmless for read-only gup. Still, no need to change
what's not broken.

Please comment on the below (completely untested, and needs userspace
patch too, but just to give you the idea)

--->

rdma: add IB_ACCESS_APP_READONLY

At the moment any attempt to register memory for RDMA breaks
COW, which hurts hosts overcomitted for memory.
But if the application knows it won't write into the MR after
registration, we can save (sometimes a lot of) memory
by telling the kernel not to bother breaking COW for us.

If the application does change memory registered with this flag, it can
re-register afterwards, and resend the data on the wire.

Signed-off-by: Michael S. Tsirkin <[email protected]>

---

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 5929598..635b57a 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -152,7 +152,9 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
ret = get_user_pages(current, current->mm, cur_base,
min_t(unsigned long, npages,
PAGE_SIZE / sizeof (struct page *)),
- !umem->writable, 1, page_list, vma_list);
+ umem->writable ||
+ !(access & IB_ACCESS_APP_READONLY),
+ !umem->writable, page_list, vma_list);

if (ret < 0)
goto out;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 98cc4b2..3a3ba1b 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -871,7 +871,8 @@ enum ib_access_flags {
IB_ACCESS_REMOTE_READ = (1<<2),
IB_ACCESS_REMOTE_ATOMIC = (1<<3),
IB_ACCESS_MW_BIND = (1<<4),
- IB_ZERO_BASED = (1<<5)
+ IB_ZERO_BASED = (1<<5),
+ IB_ACCESS_APP_READONLY = (1<<6) /* User promises not to change the data */
};

struct ib_phys_buf {

--
MST

2013-03-21 09:39:19

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH] rdma: don't make pages writeable if not requiested

On Thu, Mar 21, 2013 at 02:13:38AM -0700, Roland Dreier wrote:
> On Thu, Mar 21, 2013 at 1:51 AM, Michael S. Tsirkin <[email protected]> wrote:
> >> In that case, no, I don't see any reason for LOCAL_WRITE, since the
> >> only RDMA operations that will access this memory are remote reads.
> >
> > What is the meaning of LOCAL_WRITE then? There are no local
> > RDMA writes as far as I can see.
>
> Umm, it means you're giving the local adapter permission to write to
> that memory. So you can use it as a receive buffer or as the target
> for remote data from an RDMA read operation.

Well RDMA read has it's own flag, IB_ACCESS_REMOTE_READ.
I don't see why do you need to give adapter permission
to use memory as a receive buffer given that
you must pass the address in the post receive verb,
but maybe that's what the IB spec says so that's what
the applications assumed?

> > OK then what we need is a new flag saying "I really do not
> > intend to write into this memory please do not break
> > COW or do anything else just in case I do".
>
> Isn't that a shared read-only mapping?
>
> - R.

Nothing to do with how the page is mapped. We can and do write the page
before registration. BTW umem.c passes in force so it breaks COW
even for read-only mappings, right?

--
MST

2013-03-21 11:30:41

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH] rdma: don't make pages writeable if not requiested

On Thu, Mar 21, 2013 at 11:32:30AM +0200, Michael S. Tsirkin wrote:
> On Wed, Mar 20, 2013 at 11:55:54PM -0700, Roland Dreier wrote:
> > On Wed, Mar 20, 2013 at 11:18 PM, Michael S. Tsirkin <[email protected]> wrote:
> > > core/umem.c seems to get the arguments to get_user_pages
> > > in the reverse order: it sets writeable flag and
> > > breaks COW for MAP_SHARED if and only if hardware needs to
> > > write the page.
> > >
> > > This breaks memory overcommit for users such as KVM:
> > > each time we try to register a page to send it to remote, this
> > > breaks COW. It seems that for applications that only have
> > > REMOTE_READ permission, there is no reason to break COW at all.
> >
> > I proposed a similar (but not exactly the same, see below) patch a
> > while ago: https://lkml.org/lkml/2012/1/26/7 but read the thread,
> > especially https://lkml.org/lkml/2012/2/6/265
> >
> > I think this change will break the case where userspace tries to
> > register an MR with read-only permission, but intends locally through
> > the CPU to write to the memory. If the memory registration is done
> > while the memory is mapped read-only but has VM_MAYWRITE, then
> > userspace gets into trouble when COW happens. In the case you're
> > describing (although I'm not sure where in KVM we're talking about
> > using RDMA), what happens if you register memory with only REMOTE_READ
> > and then COW is triggered because of a local write? (I'm assuming you
> > don't want remote access to continue to get the old contents of the
> > page)
>
> I read that, and the above. It looks like everyone was doing tricks
> like "register page, then modify it, then let remote read it"
> and for some reason assumed it's ok to write into page locally from CPU
> even if LOCAL_WRITE is not set. I don't see why don't applications set
> LOCAL_WRITE if they are going to write to memory locally, but assuming
> they don't, we can't just break them.
>
> So what we need is a new "no I really do not intend to write into this
> memory" flag that avoids doing tricks in the kernel and treats the
> page normally, just pins it so hardware can read it.
>
>
> > I have to confess that I still haven't had a chance to implement the
> > proposed FOLL_FOLLOW solution to all of this.
>
> See a much easier to implement proposal at the bottom.
>
> > > If the page that is COW has lots of copies, this makes the user process
> > > quickly exceed the cgroups memory limit. This makes RDMA mostly useless
> > > for virtualization, thus the stable tag.
> >
> > The actual problem description here is a bit too terse for me to
> > understand. How do we end up with lots of copies of a COW page?
>
> Reading the links above, rdma breaks COW intentionally.
>
> Imagine a page with lots of instances in the process page map.
> For example a zero page, but not only that: we rely on KSM heavily
> to deduplicate pages for multiple VMs.
> There are gigabytes of these in each of the multiple VMs
> running on a host.
>
> What we are using RDMA for is VM migration so we careful not to change
> this memory: when we do allow memory to change we are careful
> to track what was changed, reregister and resend the data.
>
> But at the moment, each time we register a virtual address referencing
> this page, infiniband assumes we might want to change the page so it
> does get_user_pages with writeable flag, forcing a copy.
> Amount of used RAM explodes.
>
> > Why
> > is RDMA registering the memory any more special than having everyone
> > who maps this page actually writing to it and triggering COW?
> >
> > > ret = get_user_pages(current, current->mm, cur_base,
> > > min_t(unsigned long, npages,
> > > PAGE_SIZE / sizeof (struct page *)),
> > > - 1, !umem->writable, page_list, vma_list);
> > > + !umem->writable, 1, page_list, vma_list);
> >
> > The first two parameters in this line being changed are "write" and "force".
> >
> > I think if we do change this, then we need to pass umem->writable (as
> > opposed to !umem->writable) for the "write" parameter.
>
> Ugh. Sure enough. Let's agree on the direction before I respin the
> patch though.
>
> > Not sure
> > whether "force" makes sense or not.
> >
> > - R.
>
> If you don't force write on read-only mappings you don't, but
> it seems harmless for read-only gup. Still, no need to change
> what's not broken.
>
> Please comment on the below (completely untested, and needs userspace
> patch too, but just to give you the idea)
>
> --->
>
> rdma: add IB_ACCESS_APP_READONLY

Or we can call it IB_ACCESS_GIFT - this is a bit like SPLICE_F_GIFT
semantics.



> At the moment any attempt to register memory for RDMA breaks
> COW, which hurts hosts overcomitted for memory.
> But if the application knows it won't write into the MR after
> registration, we can save (sometimes a lot of) memory
> by telling the kernel not to bother breaking COW for us.
>
> If the application does change memory registered with this flag, it can
> re-register afterwards, and resend the data on the wire.
>
> Signed-off-by: Michael S. Tsirkin <[email protected]>
>
> ---
>
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index 5929598..635b57a 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -152,7 +152,9 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
> ret = get_user_pages(current, current->mm, cur_base,
> min_t(unsigned long, npages,
> PAGE_SIZE / sizeof (struct page *)),
> - !umem->writable, 1, page_list, vma_list);
> + umem->writable ||
> + !(access & IB_ACCESS_APP_READONLY),
> + !umem->writable, page_list, vma_list);
>
> if (ret < 0)
> goto out;
> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index 98cc4b2..3a3ba1b 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -871,7 +871,8 @@ enum ib_access_flags {
> IB_ACCESS_REMOTE_READ = (1<<2),
> IB_ACCESS_REMOTE_ATOMIC = (1<<3),
> IB_ACCESS_MW_BIND = (1<<4),
> - IB_ZERO_BASED = (1<<5)
> + IB_ZERO_BASED = (1<<5),
> + IB_ACCESS_APP_READONLY = (1<<6) /* User promises not to change the data */
> };
>
> struct ib_phys_buf {
>
> --
> MST

2013-03-21 12:23:59

by Michael R. Hines

[permalink] [raw]
Subject: Re: [PATCH] rdma: don't make pages writeable if not requiested

Yes, I'd be happy to try the patch.

Got meetings all day...... but will dive in soon.

On 03/21/2013 02:18 AM, Michael S. Tsirkin wrote:
> core/umem.c seems to get the arguments to get_user_pages
> in the reverse order: it sets writeable flag and
> breaks COW for MAP_SHARED if and only if hardware needs to
> write the page.
>
> This breaks memory overcommit for users such as KVM:
> each time we try to register a page to send it to remote, this
> breaks COW. It seems that for applications that only have
> REMOTE_READ permission, there is no reason to break COW at all.
>
> If the page that is COW has lots of copies, this makes the user process
> quickly exceed the cgroups memory limit. This makes RDMA mostly useless
> for virtualization, thus the stable tag.
>
> Reported-by: "Michael R. Hines" <[email protected]>
> Cc: [email protected]
> Signed-off-by: Michael S. Tsirkin <[email protected]>
> ---
>
> Note: compile-tested only, I don't have RDMA hardware at the moment.
> Michael, could you please try this patch (also fixing your
> usespace code not to request write access) and report?
>
> Note2: grep for get_user_pages in infiniband drivers turns up
> lots of users who set write to 1 unconditionally.
> These might be bugs too, should be checked.
>
> drivers/infiniband/core/umem.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index a841123..5929598 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -152,7 +152,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
> ret = get_user_pages(current, current->mm, cur_base,
> min_t(unsigned long, npages,
> PAGE_SIZE / sizeof (struct page *)),
> - 1, !umem->writable, page_list, vma_list);
> + !umem->writable, 1, page_list, vma_list);
>
> if (ret < 0)
> goto out;

2013-03-21 12:32:19

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH] rdma: don't make pages writeable if not requiested

On Thu, Mar 21, 2013 at 08:23:48AM -0400, Michael R. Hines wrote:
> Yes, I'd be happy to try the patch.
>
> Got meetings all day...... but will dive in soon.

The patch is unlikely to be the final version. In particular
you need to change !umem->writable to umem->writable.

> On 03/21/2013 02:18 AM, Michael S. Tsirkin wrote:
> >core/umem.c seems to get the arguments to get_user_pages
> >in the reverse order: it sets writeable flag and
> >breaks COW for MAP_SHARED if and only if hardware needs to
> >write the page.
> >
> >This breaks memory overcommit for users such as KVM:
> >each time we try to register a page to send it to remote, this
> >breaks COW. It seems that for applications that only have
> >REMOTE_READ permission, there is no reason to break COW at all.
> >
> >If the page that is COW has lots of copies, this makes the user process
> >quickly exceed the cgroups memory limit. This makes RDMA mostly useless
> >for virtualization, thus the stable tag.
> >
> >Reported-by: "Michael R. Hines" <[email protected]>
> >Cc: [email protected]
> >Signed-off-by: Michael S. Tsirkin <[email protected]>
> >---
> >
> >Note: compile-tested only, I don't have RDMA hardware at the moment.
> >Michael, could you please try this patch (also fixing your
> >usespace code not to request write access) and report?
> >
> >Note2: grep for get_user_pages in infiniband drivers turns up
> >lots of users who set write to 1 unconditionally.
> >These might be bugs too, should be checked.
> >
> > drivers/infiniband/core/umem.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> >diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> >index a841123..5929598 100644
> >--- a/drivers/infiniband/core/umem.c
> >+++ b/drivers/infiniband/core/umem.c
> >@@ -152,7 +152,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
> > ret = get_user_pages(current, current->mm, cur_base,
> > min_t(unsigned long, npages,
> > PAGE_SIZE / sizeof (struct page *)),
> >- 1, !umem->writable, page_list, vma_list);
> >+ !umem->writable, 1, page_list, vma_list);
> >
> > if (ret < 0)
> > goto out;

2013-03-21 17:11:41

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH] rdma: don't make pages writeable if not requiested

On Thu, Mar 21, 2013 at 11:39:47AM +0200, Michael S. Tsirkin wrote:
> On Thu, Mar 21, 2013 at 02:13:38AM -0700, Roland Dreier wrote:
> > On Thu, Mar 21, 2013 at 1:51 AM, Michael S. Tsirkin <[email protected]> wrote:
> > >> In that case, no, I don't see any reason for LOCAL_WRITE, since the
> > >> only RDMA operations that will access this memory are remote reads.
> > >
> > > What is the meaning of LOCAL_WRITE then? There are no local
> > > RDMA writes as far as I can see.
> >
> > Umm, it means you're giving the local adapter permission to write to
> > that memory. So you can use it as a receive buffer or as the target
> > for remote data from an RDMA read operation.
>
> Well RDMA read has it's own flag, IB_ACCESS_REMOTE_READ.
> I don't see why do you need to give adapter permission

The access flags have to do with what independent access remote nodes
get. There are four major cases:

access = IBV_ACCESS_REMOTE_READ says the adaptor will let remote nodes
read the memory.

access = 0 (ie IBV_ACCESS_LOCAL_READ) says that only the adaptor, under
the direct control of the application, can read this memory. Remote
nodes are barred.

access = IBV_ACCESS_REMOTE_WRITE|IBV_ACCESS_LOCAL_WRITE says the adaptor
will let remote nodes write the memory

access = IBV_ACCESS_LOCAL_WRITE bars remote nodes from writing to that
memory. Only the adaptor, under the direct control of the application,
can write the memory.

The fact LOCAL_READ/REMOTE_READ exists makes it possible to do what
you want - it guarentees the adaptor will never write to this memory
under any circumstances, so you can leave the page COW'd. If
LOCAL_WRITE was implied then you'd have to COW everything..

Would it be better to drive the COW break decision off the region's MM
flags? Ie if the memory is mapped read only into the process then you
can keep the COW at the RDMA layer, otherwise you should break
it. That seems more natural than a new flag?

Jason

2013-03-21 17:17:06

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH] rdma: don't make pages writeable if not requiested

On Thu, Mar 21, 2013 at 11:11:15AM -0600, Jason Gunthorpe wrote:
> On Thu, Mar 21, 2013 at 11:39:47AM +0200, Michael S. Tsirkin wrote:
> > On Thu, Mar 21, 2013 at 02:13:38AM -0700, Roland Dreier wrote:
> > > On Thu, Mar 21, 2013 at 1:51 AM, Michael S. Tsirkin <[email protected]> wrote:
> > > >> In that case, no, I don't see any reason for LOCAL_WRITE, since the
> > > >> only RDMA operations that will access this memory are remote reads.
> > > >
> > > > What is the meaning of LOCAL_WRITE then? There are no local
> > > > RDMA writes as far as I can see.
> > >
> > > Umm, it means you're giving the local adapter permission to write to
> > > that memory. So you can use it as a receive buffer or as the target
> > > for remote data from an RDMA read operation.
> >
> > Well RDMA read has it's own flag, IB_ACCESS_REMOTE_READ.
> > I don't see why do you need to give adapter permission
>
> The access flags have to do with what independent access remote nodes
> get. There are four major cases:
>
> access = IBV_ACCESS_REMOTE_READ says the adaptor will let remote nodes
> read the memory.
>
> access = 0 (ie IBV_ACCESS_LOCAL_READ) says that only the adaptor, under
> the direct control of the application, can read this memory. Remote
> nodes are barred.
>
> access = IBV_ACCESS_REMOTE_WRITE|IBV_ACCESS_LOCAL_WRITE says the adaptor
> will let remote nodes write the memory
>
> access = IBV_ACCESS_LOCAL_WRITE bars remote nodes from writing to that
> memory. Only the adaptor, under the direct control of the application,
> can write the memory.
>
> The fact LOCAL_READ/REMOTE_READ exists makes it possible to do what
> you want - it guarentees the adaptor will never write to this memory
> under any circumstances, so you can leave the page COW'd. If
> LOCAL_WRITE was implied then you'd have to COW everything..
>
> Would it be better to drive the COW break decision off the region's MM
> flags? Ie if the memory is mapped read only into the process then you
> can keep the COW at the RDMA layer, otherwise you should break
> it. That seems more natural than a new flag?
>
> Jason

No because application does this:
init page

...

after a lot of time

..

register
send
unregister

so it can not be read only.

--
MST

2013-03-21 17:22:07

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH] rdma: don't make pages writeable if not requiested

On Thu, Mar 21, 2013 at 07:15:25PM +0200, Michael S. Tsirkin wrote:

> No because application does this:
> init page
>
> ...
>
> after a lot of time
>
> ..
>
> register
> send
> unregister
>
> so it can not be read only.

mprotect(READONLY)
register
send
unregister
mprotect(WRITABLE)

?

With something like GIFT the app already has to give up writing to the
pages while they are moving, so changing the protection seems in line
with that?

Jason

2013-03-21 17:47:23

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH] rdma: don't make pages writeable if not requiested

On Thu, Mar 21, 2013 at 11:21:50AM -0600, Jason Gunthorpe wrote:
> On Thu, Mar 21, 2013 at 07:15:25PM +0200, Michael S. Tsirkin wrote:
>
> > No because application does this:
> > init page
> >
> > ...
> >
> > after a lot of time
> >
> > ..
> >
> > register
> > send
> > unregister
> >
> > so it can not be read only.
>
> mprotect(READONLY)
> register
> send
> unregister
> mprotect(WRITABLE)
>
> ?
> With something like GIFT the app already has to give up writing to the
> pages while they are moving, so changing the protection seems in line
> with that?
>
> Jason

It doesn't actually, and our app would sometimes write to these pages.
It simply does not care which version does the remote get in this case
since we track writes and resend later.

Also this is per-page, MRs have byte granularity so easier to use.

--
MST

2013-03-21 17:57:45

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH] rdma: don't make pages writeable if not requiested

On Thu, Mar 21, 2013 at 07:42:37PM +0200, Michael S. Tsirkin wrote:

> It doesn't actually, and our app would sometimes write to these pages.
> It simply does not care which version does the remote get in this case
> since we track writes and resend later.

Heh, somehow I thought you might say that :)

A new flag seems like the only way then - maybe:
IBV_ACCESS_NON_COHERENT - The adaptor and the CPU do not share
a coherent view of registered memory. Memory writes from the CPU
after ibv_reg_mr completes may not be reflected in the memory
viewed by the adaptor.

Can only be combined with read only access permissions.

Jason

2013-03-21 18:02:52

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH] rdma: don't make pages writeable if not requiested

On Thu, Mar 21, 2013 at 11:57:32AM -0600, Jason Gunthorpe wrote:
> On Thu, Mar 21, 2013 at 07:42:37PM +0200, Michael S. Tsirkin wrote:
>
> > It doesn't actually, and our app would sometimes write to these pages.
> > It simply does not care which version does the remote get in this case
> > since we track writes and resend later.
>
> Heh, somehow I thought you might say that :)
>
> A new flag seems like the only way then - maybe:
> IBV_ACCESS_NON_COHERENT - The adaptor and the CPU do not share
> a coherent view of registered memory. Memory writes from the CPU
> after ibv_reg_mr completes may not be reflected in the memory
> viewed by the adaptor.
>
> Can only be combined with read only access permissions.
>
> Jason

I kind of like _GIFT for name, gifts are nice :) But yes that's exactly
the semantics we need.

--
MST

2013-03-21 18:16:49

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH] rdma: don't make pages writeable if not requiested

On Thu, Mar 21, 2013 at 11:11:15AM -0600, Jason Gunthorpe wrote:
> On Thu, Mar 21, 2013 at 11:39:47AM +0200, Michael S. Tsirkin wrote:
> > On Thu, Mar 21, 2013 at 02:13:38AM -0700, Roland Dreier wrote:
> > > On Thu, Mar 21, 2013 at 1:51 AM, Michael S. Tsirkin <[email protected]> wrote:
> > > >> In that case, no, I don't see any reason for LOCAL_WRITE, since the
> > > >> only RDMA operations that will access this memory are remote reads.
> > > >
> > > > What is the meaning of LOCAL_WRITE then? There are no local
> > > > RDMA writes as far as I can see.
> > >
> > > Umm, it means you're giving the local adapter permission to write to
> > > that memory. So you can use it as a receive buffer or as the target
> > > for remote data from an RDMA read operation.
> >
> > Well RDMA read has it's own flag, IB_ACCESS_REMOTE_READ.
> > I don't see why do you need to give adapter permission
>
> The access flags have to do with what independent access remote nodes
> get. There are four major cases:
>
> access = IBV_ACCESS_REMOTE_READ says the adaptor will let remote nodes
> read the memory.
>
> access = 0 (ie IBV_ACCESS_LOCAL_READ) says that only the adaptor, under
> the direct control of the application, can read this memory. Remote
> nodes are barred.
>
> access = IBV_ACCESS_REMOTE_WRITE|IBV_ACCESS_LOCAL_WRITE says the adaptor
> will let remote nodes write the memory
>
> access = IBV_ACCESS_LOCAL_WRITE bars remote nodes from writing to that
> memory. Only the adaptor, under the direct control of the application,
> can write the memory.

This is the one I find redundant. Since the write will be done
by the adaptor under direct control by the application,
why does it make sense to declare this beforehand?
If you don't want to allow local write access to memory,
just do not post any receive WRs with this address.
If you posted and regret it, reset the QP to cancel.

But IB spec specified LOCAL_WRITE in this redundant way so I guess
applications expect it to have the semantics defined there, I just
didn't remember what they are.

No way around it then, need another flag.

2013-03-21 18:41:53

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH] rdma: don't make pages writeable if not requiested

On Thu, Mar 21, 2013 at 08:16:33PM +0200, Michael S. Tsirkin wrote:

> This is the one I find redundant. Since the write will be done by
> the adaptor under direct control by the application, why does it
> make sense to declare this beforehand? If you don't want to allow
> local write access to memory, just do not post any receive WRs with
> this address. If you posted and regret it, reset the QP to cancel.

This is to support your COW scenario - the app declares before hand to
the kernel that it will write to the memory and the kernel ensures
pages are dedicated to the app at registration time. Or the app says
it will only read and the kernel could leave them shared.

The adaptor enforces the access control to prevent a naughty app from
writing to shared memory - think about mmap'ing libc.so and then using
RDMA to write to the shared pages. It is necessary to ensure that is
impossible.

Jason

2013-03-21 19:15:25

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH] rdma: don't make pages writeable if not requiested

On Thu, Mar 21, 2013 at 12:41:35PM -0600, Jason Gunthorpe wrote:
> On Thu, Mar 21, 2013 at 08:16:33PM +0200, Michael S. Tsirkin wrote:
>
> > This is the one I find redundant. Since the write will be done by
> > the adaptor under direct control by the application, why does it
> > make sense to declare this beforehand? If you don't want to allow
> > local write access to memory, just do not post any receive WRs with
> > this address. If you posted and regret it, reset the QP to cancel.
>
> This is to support your COW scenario - the app declares before hand to
> the kernel that it will write to the memory and the kernel ensures
> pages are dedicated to the app at registration time. Or the app says
> it will only read and the kernel could leave them shared.

Someone here is confused. LOCAL_WRITE/absence of it does not address
COW, it breaks COW anyway. Are you now saying we should change rdma so
without LOCAL_WRITE it will not break COW?

> The adaptor enforces the access control to prevent a naughty app from
> writing to shared memory - think about mmap'ing libc.so and then using
> RDMA to write to the shared pages. It is necessary to ensure that is
> impossible.
>
> Jason

That's why it's redundant: we can't trust an application to tell us
'this page is writeable', we must get this info from kernel. And so
there's apparently no need for application to tell adaptor about
LOCAL_WRITE.

--
MST

2013-03-21 20:09:40

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH] rdma: don't make pages writeable if not requiested

On Thu, Mar 21, 2013 at 09:15:41PM +0200, Michael S. Tsirkin wrote:
> On Thu, Mar 21, 2013 at 12:41:35PM -0600, Jason Gunthorpe wrote:
> > On Thu, Mar 21, 2013 at 08:16:33PM +0200, Michael S. Tsirkin wrote:
> >
> > > This is the one I find redundant. Since the write will be done by
> > > the adaptor under direct control by the application, why does it
> > > make sense to declare this beforehand? If you don't want to allow
> > > local write access to memory, just do not post any receive WRs with
> > > this address. If you posted and regret it, reset the QP to cancel.
> >
> > This is to support your COW scenario - the app declares before hand to
> > the kernel that it will write to the memory and the kernel ensures
> > pages are dedicated to the app at registration time. Or the app says
> > it will only read and the kernel could leave them shared.
>
> Someone here is confused. LOCAL_WRITE/absence of it does not address
> COW, it breaks COW anyway. Are you now saying we should change rdma so
> without LOCAL_WRITE it will not break COW?

I am talking about 'from a spec' perspective - not what Linux does
today. The absence of LOCAL_WRITE is part of the specification to
support shared pages.

Pages can only be kept shared if all the ACCESS WRITE bits are clear -
today Linux always breaks the COW, but if you patch in the ability to
keep things shared then it must only happen when *all* the ACCESS
WRITE bits are clear.

> > The adaptor enforces the access control to prevent a naughty app from
> > writing to shared memory - think about mmap'ing libc.so and then using
> > RDMA to write to the shared pages. It is necessary to ensure that is
> > impossible.

> That's why it's redundant: we can't trust an application to tell us
> 'this page is writeable', we must get this info from kernel. And so
> there's apparently no need for application to tell adaptor about
> LOCAL_WRITE.

The API design gives user space maximum flexibility, if it wants to
create an enforced no-write MR in otherwise writable pages by skipping
LOCAL_WRITE then it can do so.

The kernel's role in this should be to deny ibv_reg_mr with WRITE bits
set if the pages are not writable by the app - I don't know if it does
this today, it isn't critically important as long as the pages are
unshared.

Jason