2013-06-06 12:44:08

by Peter Zijlstra

Subject: [PATCH] mm: Revert pinned_vm braindamage


Patch bc3e53f682 ("mm: distinguish between mlocked and pinned pages")
broke RLIMIT_MEMLOCK.

Before that patch: mm_struct::locked_vm < RLIMIT_MEMLOCK; after that
patch we have: mm_struct::locked_vm < RLIMIT_MEMLOCK &&
mm_struct::pinned_vm < RLIMIT_MEMLOCK.
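
To make the change concrete, here is the shape of the check in
ib_umem_get() (taken from the diff below; error paths trimmed):

    /* Before bc3e53f682: pinners and mlock() fed a single counter. */
    locked     = npages + current->mm->locked_vm;
    lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

    if ((locked > lock_limit) && !capable(CAP_IPC_LOCK))
        goto out;    /* over RLIMIT_MEMLOCK */

    /* After bc3e53f682 the same check runs against pinned_vm instead,
     * so locked_vm and pinned_vm can each grow up to the limit. */
    locked = npages + current->mm->pinned_vm;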

The patch doesn't mention RLIMIT_MEMLOCK and thus also doesn't discuss
this (user visible) change in semantics. We must therefore assume it was
unintentional.

Since RLIMIT_MEMLOCK is very clearly a limit on the amount of pages the
process can 'lock' into memory it should very much include pinned pages
as well as mlock()ed pages. Neither can be paged.

Since nobody had anything constructive to say about the VM_PINNED
approach and the IB code hurts my head too much to make it work I
propose we revert said patch.


Once again the rationale; MLOCK(2) is part of the POSIX Realtime Extension
(1003.1b-1993/1003.1i-1995). It states that the specified part of the
user address space should stay memory resident until either program exit
or a matching munlock() call.

This definition basically excludes major faults from happening on the
pages -- a major fault being one where IO needs to happen to obtain the
page content; the direct implication being that page content must remain
in memory.

Linux has taken this literally and made mlock()ed pages subject to page
migration (albeit only for the explicit move_pages() syscall; but it
would very much like to make them subject to implicit page migration for
the purpose of compaction etc.).
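
For example, userspace can relocate even an mlock()ed buffer between
NUMA nodes (a sketch; assumes buf points at a page-aligned, mlock()ed
allocation and the machine has a node 1):

    #include <numaif.h>    /* move_pages(), link with -lnuma */

    void *pages[1]  = { buf };
    int   nodes[1]  = { 1 };    /* destination node */
    int   status[1];

    /* The backing page frame moves; the mapping stays resident. */
    long rc = move_pages(0 /* this process */, 1, pages, nodes,
                         status, MPOL_MF_MOVE);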

This view disregards the intention of the spec; since mlock() is part of
the realtime spec the intention is very much that the user address range
generate no faults; neither minor nor major -- any delay is
unacceptable.

This leaves the RT people unhappy -- therefore _if_ we continue with
this Linux specific interpretation of mlock() we must introduce new
syscalls that implement the intended mlock() semantics.

It was found that there are useful purposes for this weaker mlock(), a
rationale to indeed have two sets of syscalls. The weaker mlock() can be
used in the context of security -- where we avoid sensitive data being
written to disk, and in the context of userspace daemons that are part
of the IO path -- which would otherwise form IO deadlocks.

The proposed second set of primitives would be mpin() and munpin() and
would implement the intended mlock() semantics.

Such pages would not be migratable in any way (a possible
implementation would be to 'pin' the pages using an extra refcount on
the page frame). From the above we can see that any mpin()ed page is
also an mlock()ed page, since mpin() will disallow any fault, and thus
will also disallow major faults.
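
A minimal sketch of such a refcount based pin on top of today's
interfaces -- mpin_range()/munpin_range() are hypothetical helpers, not
existing kernel code:

    /* Take an extra reference on every page in the range; reclaim and
     * migration must leave pages with raised refcounts alone. */
    static long mpin_range(unsigned long start, unsigned long nr_pages,
                           struct page **pages)
    {
        long ret;

        down_read(&current->mm->mmap_sem);
        ret = get_user_pages(current, current->mm, start, nr_pages,
                             1 /* write */, 0 /* force */, pages, NULL);
        up_read(&current->mm->mmap_sem);

        return ret;
    }

    /* Drop the references again. */
    static void munpin_range(struct page **pages, long nr_pages)
    {
        while (nr_pages--)
            put_page(pages[nr_pages]);
    }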

While we still lack the formal mpin() and munpin() syscalls there are a
number of sites that have similar 'side effects' and result in user
controlled 'pinning' of pages. Namely IB and perf.

For the purpose of RLIMIT_MEMLOCK we must go by intent only, as it is not
part of the formal spec. The only useful thing is to limit the amount of
pages a user can exempt from paging. This would therefore include all
pages either mlock()ed or mpin()ed.


Back to the patch; a resource limit must have a resource counter to
enact the limit upon. Before the patch this was mm_struct::locked_vm.
After the patch there is no such thing left.

The patch was proposed to 'fix' a double accounting problem where pages
are both pinned and mlock()ed. This was particularly visible when using
mlockall() on a process that uses either IB or perf.

I state that since mlockall() disables/invalidates RLIMIT_MEMLOCK the
actual resource counter value is irrelevant, and thus the reported
problem is a non-problem.

However, it would still be possible to observe weirdness in the very
unlikely event that a user would indeed call mlock() upon an address
range obtained from IB/perf. In this case he would be unduly constrained
and find his effective RLIMIT_MEMLOCK limit halved (at worst).

After the patch; that same user will find he has an effectively doubled
RLIMIT_MEMLOCK, since the IB/perf pages are not counted towards the same
limit as his mlock() pages are. It is far more likely a user will employ
mlock() on different ranges than those he received from IB/perf since he
already knows those aren't going anywhere.

Therefore the patch trades an unlikely weirdness for a much more likely
weirdness. So barring a proper solution I propose we revert.

I've yet to hear a coherent objection to the above. Christoph is always
quick to yell: 'but it fixes a double accounting issue' but is
completely deaf to the fact that he changed user visible semantics
without mention and regard.

Signed-off-by: Peter Zijlstra <[email protected]>
---
drivers/infiniband/core/umem.c | 8 ++++----
drivers/infiniband/hw/ipath/ipath_user_pages.c | 6 +++---
drivers/infiniband/hw/qib/qib_user_pages.c | 4 ++--
fs/proc/task_mmu.c | 2 --
include/linux/mm_types.h | 1 -
kernel/events/core.c | 6 +++---
6 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index a841123..cc92137b 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -137,7 +137,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,

down_write(&current->mm->mmap_sem);

- locked = npages + current->mm->pinned_vm;
+ locked = npages + current->mm->locked_vm;
lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
@@ -207,7 +207,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
__ib_umem_release(context->device, umem, 0);
kfree(umem);
} else
- current->mm->pinned_vm = locked;
+ current->mm->locked_vm = locked;

up_write(&current->mm->mmap_sem);
if (vma_list)
@@ -223,7 +223,7 @@ static void ib_umem_account(struct work_struct *work)
struct ib_umem *umem = container_of(work, struct ib_umem, work);

down_write(&umem->mm->mmap_sem);
- umem->mm->pinned_vm -= umem->diff;
+ umem->mm->locked_vm -= umem->diff;
up_write(&umem->mm->mmap_sem);
mmput(umem->mm);
kfree(umem);
@@ -269,7 +269,7 @@ void ib_umem_release(struct ib_umem *umem)
} else
down_write(&mm->mmap_sem);

- current->mm->pinned_vm -= diff;
+ current->mm->locked_vm -= diff;
up_write(&mm->mmap_sem);
mmput(mm);
kfree(umem);
diff --git a/drivers/infiniband/hw/ipath/ipath_user_pages.c b/drivers/infiniband/hw/ipath/ipath_user_pages.c
index dc66c45..cfed539 100644
--- a/drivers/infiniband/hw/ipath/ipath_user_pages.c
+++ b/drivers/infiniband/hw/ipath/ipath_user_pages.c
@@ -79,7 +79,7 @@ static int __ipath_get_user_pages(unsigned long start_page, size_t num_pages,
goto bail_release;
}

- current->mm->pinned_vm += num_pages;
+ current->mm->locked_vm += num_pages;

ret = 0;
goto bail;
@@ -178,7 +178,7 @@ void ipath_release_user_pages(struct page **p, size_t num_pages)

__ipath_release_user_pages(p, num_pages, 1);

- current->mm->pinned_vm -= num_pages;
+ current->mm->locked_vm -= num_pages;

up_write(&current->mm->mmap_sem);
}
@@ -195,7 +195,7 @@ static void user_pages_account(struct work_struct *_work)
container_of(_work, struct ipath_user_pages_work, work);

down_write(&work->mm->mmap_sem);
- work->mm->pinned_vm -= work->num_pages;
+ work->mm->locked_vm -= work->num_pages;
up_write(&work->mm->mmap_sem);
mmput(work->mm);
kfree(work);
diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c
index 2bc1d2b..7689e49 100644
--- a/drivers/infiniband/hw/qib/qib_user_pages.c
+++ b/drivers/infiniband/hw/qib/qib_user_pages.c
@@ -74,7 +74,7 @@ static int __qib_get_user_pages(unsigned long start_page, size_t num_pages,
goto bail_release;
}

- current->mm->pinned_vm += num_pages;
+ current->mm->locked_vm += num_pages;

ret = 0;
goto bail;
@@ -151,7 +151,7 @@ void qib_release_user_pages(struct page **p, size_t num_pages)
__qib_release_user_pages(p, num_pages, 1);

if (current->mm) {
- current->mm->pinned_vm -= num_pages;
+ current->mm->locked_vm -= num_pages;
up_write(&current->mm->mmap_sem);
}
}
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3e636d8..7d09d6a 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -44,7 +44,6 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
"VmPeak:\t%8lu kB\n"
"VmSize:\t%8lu kB\n"
"VmLck:\t%8lu kB\n"
- "VmPin:\t%8lu kB\n"
"VmHWM:\t%8lu kB\n"
"VmRSS:\t%8lu kB\n"
"VmData:\t%8lu kB\n"
@@ -56,7 +55,6 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
hiwater_vm << (PAGE_SHIFT-10),
total_vm << (PAGE_SHIFT-10),
mm->locked_vm << (PAGE_SHIFT-10),
- mm->pinned_vm << (PAGE_SHIFT-10),
hiwater_rss << (PAGE_SHIFT-10),
total_rss << (PAGE_SHIFT-10),
data << (PAGE_SHIFT-10),
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index ace9a5f..d185c13 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -356,7 +356,6 @@ struct mm_struct {

unsigned long total_vm; /* Total pages mapped */
unsigned long locked_vm; /* Pages that have PG_mlocked set */
- unsigned long pinned_vm; /* Refcount permanently increased */
unsigned long shared_vm; /* Shared pages (files) */
unsigned long exec_vm; /* VM_EXEC & ~VM_WRITE */
unsigned long stack_vm; /* VM_GROWSUP/DOWN */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 95edd5a..1e926b2 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3729,7 +3729,7 @@ static void perf_mmap_close(struct vm_area_struct *vma)

if (ring_buffer_put(rb)) {
atomic_long_sub((size >> PAGE_SHIFT) + 1, &mmap_user->locked_vm);
- vma->vm_mm->pinned_vm -= mmap_locked;
+ vma->vm_mm->locked_vm -= mmap_locked;
free_uid(mmap_user);
}
}
@@ -3805,7 +3805,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)

lock_limit = rlimit(RLIMIT_MEMLOCK);
lock_limit >>= PAGE_SHIFT;
- locked = vma->vm_mm->pinned_vm + extra;
+ locked = vma->vm_mm->locked_vm + extra;

if ((locked > lock_limit) && perf_paranoid_tracepoint_raw() &&
!capable(CAP_IPC_LOCK)) {
@@ -3831,7 +3831,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
rb->mmap_user = get_current_user();

atomic_long_add(user_extra, &user->locked_vm);
- vma->vm_mm->pinned_vm += extra;
+ vma->vm_mm->locked_vm += extra;

rcu_assign_pointer(event->rb, rb);


2013-06-06 18:46:50

by Christoph Lameter

Subject: Re: [PATCH] mm: Revert pinned_vm braindamage

On Thu, 6 Jun 2013, Peter Zijlstra wrote:

> Since RLIMIT_MEMLOCK is very clearly a limit on the amount of pages the
> process can 'lock' into memory it should very much include pinned pages
> as well as mlock()ed pages. Neither can be paged.

So we thought that this is the sum of the pages that a process has
mlocked. Initiated by the process and/or environment explicitly. A user
space initiated action.

> Since nobody had anything constructive to say about the VM_PINNED
> approach and the IB code hurts my head too much to make it work I
> propose we revert said patch.

I said that the use of a PIN page flag would allow correct accounting if
one wanted to interpret the limit the way you do.

> Once again the rationale; MLOCK(2) is part of the POSIX Realtime Extension
> (1003.1b-1993/1003.1i-1995). It states that the specified part of the
> user address space should stay memory resident until either program exit
> or a matching munlock() call.
>
> This definition basically excludes major faults from happening on the
> pages -- a major fault being one where IO needs to happen to obtain the
> page content; the direct implication being that page content must remain
> in memory.

Exactly that is the definition.

> Linux has taken this literally and made mlock()ed pages subject to page
> migration (albeit only for the explicit move_pages() syscall; but it
> would very much like to make them subject to implicit page migration for
> the purpose of compaction etc.).

Page migration is not a page fault? The ability to move a process
completely (including its mlocked segments) is important for the manual
migration of process memory. That is what page migration was made for. If
mlocked pages are treated as pinned pages then the complete process can
no longer be moved from node to node.

> This view disregards the intention of the spec; since mlock() is part of
> the realtime spec the intention is very much that the user address range
> generate no faults; neither minor nor major -- any delay is
> unacceptable.

Where does it say that no faults are generated? Don't we generate COW on
mlocked ranges?

> This leaves the RT people unhappy -- therefore _if_ we continue with
> this Linux specific interpretation of mlock() we must introduce new
> syscalls that implement the intended mlock() semantics.

Intended means Peter's semantics?

> It was found that there are useful purposes for this weaker mlock(), a
> rationale to indeed have two sets of syscalls. The weaker mlock() can be
> used in the context of security -- where we avoid sensitive data being
> written to disk, and in the context of userspace daemons that are part
> of the IO path -- which would otherwise form IO deadlocks.

Migratable mlocked pages enable complete process migration between nodes
of a NUMA system for HPC workloads.

> The proposed second set of primitives would be mpin() and munpin() and
> would implement the intended mlock() semantics.

I agree that we need mpin and munpin. But they should not be called mlock
semantics.

> Such pages would not be migratable in any way (a possible
> implementation would be to 'pin' the pages using an extra refcount on
> the page frame). From the above we can see that any mpin()ed page is
> also an mlock()ed page, since mpin() will disallow any fault, and thus
> will also disallow major faults.

That cannot be so since mlocked pages need to be migratable.

> While we still lack the formal mpin() and munpin() syscalls there are a
> number of sites that have similar 'side effects' and result in user
> controlled 'pinning' of pages. Namely IB and perf.

Right, that's why we need this.

> For the purpose of RLIMIT_MEMLOCK we must go by intent only, as it is not
> part of the formal spec. The only useful thing is to limit the amount of
> pages a user can exempt from paging. This would therefore include all
> pages either mlock()ed or mpin()ed.

RLIMIT_MEMLOCK is a limit on the pages that a process has mlocked into
memory. Pinning is not initiated by user space but by the kernel. Either
temporarily (page count increases are used all over the kernel for this)
or for a longer time frame (IB and Perf and likely more drivers that we have
not found yet).


> Back to the patch; a resource limit must have a resource counter to
> enact the limit upon. Before the patch this was mm_struct::locked_vm.
> After the patch there is no such thing left.

The limit was not checked correctly before the patch since pinned pages
were accounted as mlocked.

> I state that since mlockall() disables/invalidates RLIMIT_MEMLOCK the
> actual resource counter value is irrelevant, and thus the reported
> problem is a non-problem.

Where does it disable RLIMIT_MEMLOCK?

> However, it would still be possible to observe weirdness in the very
> unlikely event that a user would indeed call mlock() upon an address
> range obtained from IB/perf. In this case he would be unduly constrained
> and find his effective RLIMIT_MEMLOCK limit halved (at worst).

This is weird for other reasons as well since we are using two different
methods: MLOCK leads to the page being marked PG_mlocked and put on a
special LRU list. Pinning is simply an increase in the refcounts. It's
going to be difficult to keep the accounting straight because a page can
have both.

> I've yet to hear a coherent objection to the above. Christoph is always
> quick to yell: 'but it fixes a double accounting issue' but is
> completely deaf to the fact that he changed user visible semantics
> without mention and regard.

The semantics we agreed upon for mlock were that they are migratable.
Pinned pages cannot be migrated and therefore are not mlocked pages.

It would be best to have two different mechanisms for this. Note that the
pinning is always done by the kernel and/or device driver related to some
other activity whereas mlocking is initiated by userspace. We do not need
a new API.

If you want to avoid the moving of mlocked pages then disable the
mechanisms that need to move processes around. Having the ability in
general to move processes is a degree of control that we want. Pinning
could be restricted to where it is required by the hardware.

We could also add a process flag that exempts a certain process from page
migration which may be the result of NUMA scheduler actions, explicit
requests from the user to migrate a process, defrag or compaction.

Excessive amounts of pinned pages limit the ability of the kernel
to manage its memory severely which may cause instabilities. mlocked pages
are better because the kernel still has some options.

2013-06-07 11:04:01

by Peter Zijlstra

Subject: Re: [PATCH] mm: Revert pinned_vm braindamage

On Thu, Jun 06, 2013 at 06:46:50PM +0000, Christoph Lameter wrote:
> On Thu, 6 Jun 2013, Peter Zijlstra wrote:
>
> > Since RLIMIT_MEMLOCK is very clearly a limit on the amount of pages the
> > process can 'lock' into memory it should very much include pinned pages
> > as well as mlock()ed pages. Neither can be paged.
>
> So we thought that this is the sum of the pages that a process has
> mlocked. Initiated by the process and/or environment explicitly. A user
> space initiated action.

Which 'we'? Also it remains a fact that your changelog didn't mention this
change in semantics at all. Nor did you CC all affected parties.

However you twist this; your patch leaves an inconsistent mess. If you
really think they're two different things then you should have
introduced a second RLIMIT_MEMPIN to go along with your counter.

I'll argue against such a thing; for I think that limiting the total
amount of pages a user can exempt from paging is the far more
useful/natural thing to measure/limit.

> > Since nobody had anything constructive to say about the VM_PINNED
> > approach and the IB code hurts my head too much to make it work I
> > propose we revert said patch.
>
> I said that the use of a PIN page flag would allow correct accounting if
> one wanted to interpret the limit the way you do.

You failed to explain how that would help any. With a pin page flag you
still need to find the mm to unaccount crap from. Also, all user
controlled address space ops operate on vmas.

We had VM_LOCKED long before we had the lock page flag. And you
cannot replace all VM_LOCKED utility with the pageflag either.

> > Once again the rationale; MLOCK(2) is part of the POSIX Realtime Extension
> > (1003.1b-1993/1003.1i-1995). It states that the specified part of the
> > user address space should stay memory resident until either program exit
> > or a matching munlock() call.
> >
> > This definition basically excludes major faults from happening on the
> > pages -- a major fault being one where IO needs to happen to obtain the
> > page content; the direct implication being that page content must remain
> > in memory.
>
> Exactly that is the definition.
>
> > Linux has taken this literally and made mlock()ed pages subject to page
> > migration (albeit only for the explicit move_pages() syscall; but it
> > would very much like to make them subject to implicit page migration for
> > the purpose of compaction etc.).
>
> Page migration is not a page fault?

It introduces faults; what happens when a process hits the migration
pte? It gets a random delay and eventually services a minor fault to the
new page.

At which point the saw will have cut your finger off (going with the
most popular RT application ever -- that of a bandsaw and a laser beam).

> The ability to move a process
> completely (including its mlocked segments) is important for the manual
> migration of process memory. That is what page migration was made for. If
> mlocked pages are treated as pinned pages then the complete process can
> no longer be moved from node to node.
>
> > This view disregards the intention of the spec; since mlock() is part of
> > the realtime spec the intention is very much that the user address range
> > generate no faults; neither minor nor major -- any delay is
> > unacceptable.
>
> Where does it say that no faults are generated? Don't we generate COW on
> mlocked ranges?

That's under user control. If the user uses fork() the user can avoid
those faults by pre-faulting the pages.
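
Roughly like so, assuming buf/len describe the range the child cares
about:

    /* In the child, right after fork(): write every page once so the
     * COW breaks happen here, not in the time-critical path. */
    long psz = sysconf(_SC_PAGESIZE);
    volatile char *p;

    for (p = buf; p < buf + len; p += psz)
        *p = *p;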

> > This leaves the RT people unhappy -- therefore _if_ we continue with
> > this Linux specific interpretation of mlock() we must introduce new
> > syscalls that implement the intended mlock() semantics.
>
> Intended means Peter's semantics?

No, I don't actually write RT applications. But I've had plenty of
arguments with RT people when I explained to them what our mlock()
actually does vs what they expected it to do.

They're not happy. Aside from that; you HPC/HFT minimal latency lot
should very well appreciate the minimal interference stuff they do
actually expect.

> > It was found that there are useful purposes for this weaker mlock(), a
> > rationale to indeed have two sets of syscalls. The weaker mlock() can be
> > used in the context of security -- where we avoid sensitive data being
> > written to disk, and in the context of userspace daemons that are part
> > of the IO path -- which would otherwise form IO deadlocks.
>
> Migratable mlocked pages enable complete process migration between nodes
> of a NUMA system for HPC workloads.

This might well be; and I'm not arguing we remove this. I'm merely
stating that it doesn't make everybody happy. Also what purpose do HPC
type applications have for mlock()?

> > The proposed second set of primitives would be mpin() and munpin() and
> > would implement the intended mlock() semantics.
>
> I agree that we need mpin and munpin. But they should not be called mlock
> semantics.

Here we must disagree I fear; given that mlock() is of RT origin and RT
people very much want/expect mlock() to do what our proposed mpin() will
do.

> > Such pages would not be migratable in any way (a possible
> > implementation would be to 'pin' the pages using an extra refcount on
> > the page frame). From the above we can see that any mpin()ed page is
> > also an mlock()ed page, since mpin() will disallow any fault, and thus
> > will also disallow major faults.
>
> That cannot be so since mlocked pages need to be migratable.

I'm talking about the proposed mpin() stuff.

> > While we still lack the formal mpin() and munpin() syscalls there are a
> > number of sites that have similar 'side effects' and result in user
> > controlled 'pinning' of pages. Namely IB and perf.
>
> Right, that's why we need this.

So I proposed most of the machinery that would be required to actually
implement the syscalls. Except that the IB code stumped me. In
particular I cannot easily find the userspace address to unpin for
ipath/qib release paths.

Once we have that we can trivially implement the syscalls.

> > For the purpose of RLIMIT_MEMLOCK we must go by intent only, as it is not
> > part of the formal spec. The only useful thing is to limit the amount of
> > pages a user can exempt from paging. This would therefore include all
> > pages either mlock()ed or mpin()ed.
>
> RLIMIT_MEMLOCK is a limit on the pages that a process has mlocked into
> memory.

See my argument above; I don't think userspace is served well with two
RLIMITs.

> Pinning is not initiated by user space but by the kernel. Either
> temporarily (page count increases are used all over the kernel for this)
> or for a longer time frame (IB and Perf and likely more drivers that we have
> not found yet).

Here I disagree, I'll argue that all pins that require userspace action
to go away are userspace controlled pins. This makes IB/Perf user
controlled pins and should thus be treated the same as mpin()/munpin()
calls.

> > Back to the patch; a resource limit must have a resource counter to
> > enact the limit upon. Before the patch this was mm_struct::locked_vm.
> > After the patch there is no such thing left.
>
> The limit was not checked correctly before the patch since pinned pages
> were accounted as mlocked.

This goes back to what you want the limit to mean; I think a single
limit counting all pages exempt from paging is the far more useful
limit.

Again, your changelog was completely devoid of any rlimit discussion.

> > I state that since mlockall() disables/invalidates RLIMIT_MEMLOCK the
> > actual resource counter value is irrelevant, and thus the reported
> > problem is a non-problem.
>
> Where does it disable RLIMIT_MEMLOCK?

The only way to get an MCL_FUTURE is through CAP_IPC_LOCK. Practically
most MCL_CURRENT are also larger than RLIMIT_MEMLOCK and this also
requires CAP_IPC_LOCK.

Once you have CAP_IPC_LOCK, RLIMIT_MEMLOCK becomes irrelevant.
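
The pattern in mm/mlock.c is roughly (abbreviated from memory):

    locked     = len >> PAGE_SHIFT;
    locked    += current->mm->locked_vm;
    lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

    /* CAP_IPC_LOCK short-circuits the rlimit comparison entirely. */
    if ((locked <= lock_limit) || capable(CAP_IPC_LOCK))
        error = do_mlock(start, len, 1);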

> > However, it would still be possible to observe weirdness in the very
> > unlikely event that a user would indeed call mlock() upon an address
> > range obtained from IB/perf. In this case he would be unduly constrained
> > and find his effective RLIMIT_MEMLOCK limit halved (at worst).
>
> This is weird for other reasons as well since we are using two different
> methods: MLOCK leads to the page being marked PG_mlocked and put on a
> special LRU list. Pinning is simply an increase in the refcounts. It's
> going to be difficult to keep the accounting straight because a page can
> have both.

The way how the kernel implements these semantics is irrelevant.

> > I've yet to hear a coherent objection to the above. Christoph is always
> quick to yell: 'but it fixes a double accounting issue' but is
> > completely deaf to the fact that he changed user visible semantics
> > without mention and regard.
>
> The semantics we agreed upon for mlock were that they are migratable.
> Pinned pages cannot be migrated and therefore are not mlocked pages.
> It would be best to have two different mechanisms for this.

Best for whom? How is making two RLIMIT settings better than having
one?

For me the being able to migrate a page is a consequence of allowing
minor faults; not a fundamental property.

> Note that the
> pinning is always done by the kernel and/or device driver related to some
> other activity whereas mlocking is initiated by userspace.

See my earlier argument. If unpinning requires userspace action, it's a
userspace controlled pin. Also for perf the setup is very much also a
userspace action; an mmap() call, and I expect the same is true for IB;
although it might use different syscalls.

> We do not need a new API.

This is in direct contradiction with your earlier statement where you
say: "I agree that we need mpin and munpin.".

Please make up your mind.

> If you want to avoid the moving of mlocked pages then disable the
> mechanisms that need to move processes around.

This is not really a feasible option anymore; page migration is creeping
all over the place. Also who is to say people don't want to also run
something with huge pages alongside their RT proglet.

> Having the ability in
> general to move processes is a degree of control that we want. Pinning
> could be restricted to where it is required by the hardware.

Or software; you cannot blindly dismiss realtime guarantees the software
needs to provide.

> We could also add a process flag that exempts a certain process from page
> migration which may be the result of NUMA scheduler actions, explicit
> requests from the user to migrate a process, defrag or compaction.

This too is a new API; and one I think is much less precise than mpin().
You wouldn't want an entire process to be exempt from paging. Only the
code and data involved with the realtime part of the process need to stay
put.

Furthermore, mpin() could simply migrate the pages into as few unmovable
page blocks as possible to minimise the impact on the rest of the vm.
Allocating entire processes as unmovable and migrating any DSO it touches
into unmovable is a far more onerous endeavour.

With the pre-compaction of pinned memory, I don't see the VM impact of
pinned pages being much bigger than that of mlocked pages. The main
issue will be limiting the amount of memory exempt from paging, not
fragmentation per se.

Yes you can get some unmovable block fragmentation on unpin, but if it's
a common thing a new pin will be able to fill those blocks again.

> Excessive amounts of pinned pages limit the ability of the kernel
> to manage its memory severely which may cause instabilities. mlocked pages
> are better because the kernel still has some options.

I'm not arguing which are better; I've even conceded we need mpin() and
can keep the current mlock() semantics. I'm arguing for what we _need_.

2013-06-07 14:52:05

by Christoph Lameter

Subject: Re: [PATCH] mm: Revert pinned_vm braindamage

On Fri, 7 Jun 2013, Peter Zijlstra wrote:

> However you twist this; your patch leaves an inconsistent mess. If you
> really think they're two different things then you should have
> introduced a second RLIMIT_MEMPIN to go along with your counter.

Well continuing to repeat myself: I worked based on agreed upon
characteristics of mlocked pages. The patch was there to address a
brokenness in the mlock accounting because someone naively assumed that
pinning = mlock.

> I'll argue against such a thing; for I think that limiting the total
> amount of pages a user can exempt from paging is the far more
> useful/natural thing to measure/limit.

Pinned pages are exempted by the kernel. A device driver or some other
kernel process (reclaim, page migration, io etc) increases the page count.
There is currently no consistent accounting for pinned pages. The
pinned_vm counter was introduced to allow the largest pinners to track
what they did.

> > I said that the use of a PIN page flag would allow correct accounting if
> > one wanted to interpret the limit the way you do.
>
> You failed to explain how that would help any. With a pin page flag you
> still need to find the mm to unaccount crap from. Also, all user
> controlled address space ops operate on vmas.

Pinning is kernel controlled...

> > Page migration is not a page fault?
>
> It introduces faults; what happens when a process hits the migration
> pte? It gets a random delay and eventually services a minor fault to the
> new page.

Ok but this is similar to reclaim and other such things that are unmapping
pages.

> At which point the saw will have cut your finger off (going with the
> most popular RT application ever -- that of a bandsaw and a laser beam).

I am pretty confused by your newer notion of RT. RT was about high latency
deterministic behavior I thought. RT was basically an abused marketing
term and was referring to the bloating of the kernel with all sorts of
fair stuff that slows us down. What happened to make you work on low
latency stuff? There is some shift that you still need to go through to
make that transition. Yes, you would want to avoid reclaim and all sorts
of other stuff for low latency. So you disable auto NUMA, defrag etc to
avoid these things.

> > > This leaves the RT people unhappy -- therefore _if_ we continue with
> > > this Linux specific interpretation of mlock() we must introduce new
> > > syscalls that implement the intended mlock() semantics.
> >
> > Intended means Peter's semantics?
>
> No, I don't actually write RT applications. But I've had plenty of
> arguments with RT people when I explained to them what our mlock()
> actually does vs what they expected it to do.

Ok. Guess this is all new to you at this point. I am happy to see that you
are willing to abandon your evil ways (although under pressure from your
users) and are willing to put the low latency people now in the RT camp.

> They're not happy. Aside from that; you HPC/HFT minimal latency lot
> should very well appreciate the minimal interference stuff they do
> actually expect.

Sure we do and we know how to do things to work around the "fair
scheduler" and other stuff. But you are breaking the basics of how we do
things with your conflation of pinning and mlocking.

We do not migrate, do not allow defragmentation or reclaim when running
low latency applications. These are non issues.

> This might well be; and I'm not arguing we remove this. I'm merely
> stating that it doesn't make everybody happy. Also what purpose do HPC
> type applications have for mlock()?

HPC wants to keep them in memory to avoid eviction. HPC apps are not as
sensitive to faults as low latency apps are. Minor faults have
traditionally been tolerated there. The lower you get in terms of the
latencies required the more difficult the OS control becomes.

> Here we must disagree I fear; given that mlock() is of RT origin and RT
> people very much want/expect mlock() to do what our proposed mpin() will
> do.

RT is a dirty word for me given the fairness and bloat issue. Not sure
what you mean with that. mlock is a means to keep data in memory and not a
magical wand that avoids all OS handling of the page.

> > That cannot be so since mlocked pages need to be migratable.
>
> I'm talking about the proposed mpin() stuff.

Could you write that up in detail? I am not sure how this could work at
this point.

> So I proposed most of the machinery that would be required to actually
> implement the syscalls. Except that the IB code stumped me. In
> particular I cannot easily find the userspace address to unpin for
> ipath/qib release paths.
>
> Once we have that we can trivially implement the syscalls.

Why would you need syscalls? Pinning is driver/kernel subsystem initiated
and therefore the driver can do the pin/unpin calls.

> > Pinning is not initiated by user space but by the kernel. Either
> > temporarily (page count increases are used all over the kernel for this)
> > or for a longer time frame (IB and Perf and likely more drivers that we have
> > not found yet).
>
> Here I disagree, I'll argue that all pins that require userspace action
> to go away are userspace controlled pins. This makes IB/Perf user
> controlled pins and should thus be treated the same as mpin()/munpin()
> calls.

Both IB and Perf are kernel subsystems that pin as a side effect of
another syscall.

> This goes back to what you want the limit to mean; I think a single
> limit counting all pages exempt from paging is the far more useful
> limit.

Then you first need to show that the scheme accounts for *all* pages exempt
from paging. This includes dirty pages and various pages of other
subsystems that have an increased refcount in order to keep these pages
where they are. We need a couple of passes through the kernel to find these
locations and build up counters for these. Not all of these pages will be
associated with processes.

> The only way to get an MCL_FUTURE is through CAP_IPC_LOCK. Practically
> most MCL_CURRENT are also larger than RLIMIT_MEMLOCK and this also
> requires CAP_IPC_LOCK.
>
> Once you have CAP_IPC_LOCK, RLIMIT_MEMLOCK becomes irrelevant.

mlockall does not require CAP_IPC_LOCK. Never had an issue.

> > The semantics we agreed upon for mlock were that they are migratable.
> > Pinned pages cannot be migrated and therefore are not mlocked pages.
> > It would be best to have two different mechanisms for this.
>
> Best for whom? How is making two RLIMIT settings better than having
> one?

The pinned pages do not need to have a separate limit. They are kernel
resources and could be bounded in other ways together with other resources
that could cause danger to the kernel (such as too many dirty pages).
Pinned pages are often not associated with a process.

> > We do not need a new API.
>
> This is in direct contradiction with your earlier statement where you
> say: "I agree that we need mpin and munpin.".
>
> Please make up your mind.

Yes we need mpin and munpin as an in-kernel API. Glad to clue you in.

> > Having the ability in
> > general to move processes is a degree of control that we want. Pinning
> > could be restricted to where it is required by the hardware.
>
> Or software; you cannot blindly dismiss realtime guarantees the software
> needs to provide.

I cringe when I hear realtime guarantees intended to refer to low
latency.... But maybe they turned you finally? And I may just be feeling
my decade-old frustration with your old approach to "realtime".

What do you exactly mean by realtime guarantees? (Given that "realtime" is
such an overloaded marketing term).

> > Excessive amounts of pinned pages limit the ability of the kernel
> > to manage its memory severely which may cause instabilities. mlocked pages
> > are better because the kernel still has some options.
>
> I'm not arguing which are better; I've even conceded we need mpin() and
> can keep the current mlock() semantics. I'm arguing for what we _need_.

We need the kernel to track the pinned pages (all of them not just those
from IB and perf) and provide proper global boundaries to avoid kernel
failures. Yes.

2013-06-13 21:06:36

by Andrew Morton

Subject: Re: [PATCH] mm: Revert pinned_vm braindamage

Let's try to get this wrapped up?

On Thu, 6 Jun 2013 14:43:51 +0200 Peter Zijlstra <[email protected]> wrote:

>
> Patch bc3e53f682 ("mm: distinguish between mlocked and pinned pages")
> broke RLIMIT_MEMLOCK.

I rather like what bc3e53f682 did, actually. RLIMIT_MEMLOCK limits the
amount of memory you can mlock(). Nice and simple.

This pinning thing which infiniband/perf are doing is conceptually
different and if we care at all, perhaps we should be looking at adding
RLIMIT_PINNED.

> Before that patch: mm_struct::locked_vm < RLIMIT_MEMLOCK; after that
> patch we have: mm_struct::locked_vm < RLIMIT_MEMLOCK &&
> mm_struct::pinned_vm < RLIMIT_MEMLOCK.

But this is a policy decision which was implemented in perf_mmap() and
perf can alter that decision. How bad would it be if perf just ignored
RLIMIT_MEMLOCK?


drivers/infiniband/hw/qib/qib_user_pages.c has issues, btw. It
compares the amount-to-be-pinned with rlimit(RLIMIT_MEMLOCK), but
forgets to also look at current->mm->pinned_vm. Duh.
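
That is, the check is roughly:

    lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

    if (num_pages > lock_limit && !capable(CAP_IPC_LOCK)) {
        ret = -ENOMEM;
        goto bail;
    }

whereas to honour the limit it presumably needs to be something like:

    if (num_pages + current->mm->pinned_vm > lock_limit &&
        !capable(CAP_IPC_LOCK)) {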

It also does the pinned accounting in __qib_get_user_pages() but in
__qib_release_user_pages(), the caller is supposed to do it, which is
rather awkward.


Longer-term I don't think that infiniband or perf should be dinking
around with rlimit(RLIMIT_MEMLOCK) or ->pinned_vm. Those policy
decisions should be hoisted into a core mm helper where we can do it
uniformly (and more correctly than infiniband's attempt!).

2013-06-17 09:45:40

by Peter Zijlstra

Subject: Re: [PATCH] mm: Revert pinned_vm braindamage

On Thu, Jun 13, 2013 at 02:06:32PM -0700, Andrew Morton wrote:
> Let's try to get this wrapped up?
>
> On Thu, 6 Jun 2013 14:43:51 +0200 Peter Zijlstra <[email protected]> wrote:
>
> >
> > Patch bc3e53f682 ("mm: distinguish between mlocked and pinned pages")
> > broke RLIMIT_MEMLOCK.
>
> I rather like what bc3e53f682 did, actually. RLIMIT_MEMLOCK limits the
> amount of memory you can mlock(). Nice and simple.
>
> This pinning thing which infiniband/perf are doing is conceptually
> different and if we care at all, perhaps we should be looking at adding
> RLIMIT_PINNED.

We could do that; but I really don't like doing it for the reasons I
outlined previously. It gives the user another knob to twiddle which is
pretty much the same as one he already has, just slightly different.

Like I said, I see RLIMIT_MEMLOCK to mean the amount of pages the user can
exempt from paging; since that is what the VM cares about most.

> > Before that patch: mm_struct::locked_vm < RLIMIT_MEMLOCK; after that
> > patch we have: mm_struct::locked_vm < RLIMIT_MEMLOCK &&
> > mm_struct::pinned_vm < RLIMIT_MEMLOCK.
>
> But this is a policy decision which was implemented in perf_mmap() and
> perf can alter that decision. How bad would it be if perf just ignored
> RLIMIT_MEMLOCK?

Then it could pin all memory -- seems like something bad.

> drivers/infiniband/hw/qib/qib_user_pages.c has issues, btw. It
> compares the amount-to-be-pinned with rlimit(RLIMIT_MEMLOCK), but
> forgets to also look at current->mm->pinned_vm. Duh.
>
> It also does the pinned accounting in __qib_get_user_pages() but in
> __qib_release_user_pages(), the caller is supposed to do it, which is
> rather awkward.
>
>
> Longer-term I don't think that infiniband or perf should be dinking
> around with rlimit(RLIMIT_MEMLOCK) or ->pinned_vm. Those policy
> decisions should be hoisted into a core mm helper where we can do it
> uniformly (and more correctly than infiniband's attempt!).

Agreed, hence my VM_PINNED proposal that would lift most of that to the
core VM.

I just got really lost in the IB code :/

2013-06-17 11:08:45

by Peter Zijlstra

Subject: Re: [PATCH] mm: Revert pinned_vm braindamage

On Fri, Jun 07, 2013 at 02:52:05PM +0000, Christoph Lameter wrote:
> On Fri, 7 Jun 2013, Peter Zijlstra wrote:
>
> > However you twist this; your patch leaves an inconsistent mess. If you
> > really think they're two different things then you should have
> > introduced a second RLIMIT_MEMPIN to go along with your counter.
>
> Well continuing to repeat myself: I worked based on agreed upon
> characteristics of mlocked pages. The patch was there to address a
> brokenness in the mlock accounting because someone naively assumed that
> pinning = mlock.

They did no such thing; being one of those who wrote such code. I
expressly used RLIMIT_MEMLOCK for it's the one limit userspace has to
limit pages that are exempt from paging.

> > I'll argue against such a thing; for I think that limiting the total
> > amount of pages a user can exempt from paging is the far more
> > useful/natural thing to measure/limit.
>
> Pinned pages are exempted by the kernel. A device driver or some other
> kernel process (reclaim, page migration, io etc) increases the page count.
> There is currently no consistent accounting for pinned pages. The
> pinned_vm counter was introduced to allow the largest pinners to track
> what they did.

No, not the largest: user space controlled pinners. The thing that
makes all the difference is the _USER_ control.

> > > I said that the use of a PIN page flag would allow correct accounting if
> > > one wanted to interpret the limit the way you do.
> >
> > You failed to explain how that would help any. With a pin page flag you
> > still need to find the mm to unaccount crap from. Also, all user
> > controlled address space ops operate on vmas.
>
> Pinning is kernel controlled...

That's pure bull; perf and IB have user controlled pinning.

> > > Page migration is not a page fault?
> >
> > It introduces faults; what happens when a process hits the migration
> > pte? It gets a random delay and eventually services a minor fault to the
> > new page.
>
> Ok but this is similar to reclaim and other such things that are unmapping
> pages.

Which are avoided by mlock(); an mlock()ed page will (typically) not get
unmapped.

> > At which point the saw will have cut your finger off (going with the
> > most popular RT application ever -- that of a bandsaw and a laser beam).
>
> I am pretty confused by your newer notion of RT. RT was about high latency
> deterministic behavior I thought. RT was basically an abused marketing
> term and was referring to the bloating of the kernel with all sorts of
> fair stuff that slows us down. What happened to make you work on low
> latency stuff? There is some shift that you still need to go through to
> make that transition. Yes, you would want to avoid reclaim and all sorts
> of other stuff for low latency. So you disable auto NUMA, defrag etc to
> avoid these things.

It's about low latency deterministic stuff; it's just that worst case
latency is more important than avg latency and hence some things get
more expensive on avg.

But we don't want to disable defrag, defrag is good for the general
health of the system. Also not all applications are RT; we want a kernel
that's able to run general purpose stuff along with some RT apps.

> > > > This leaves the RT people unhappy -- therefore _if_ we continue with
> > > > this Linux specific interpretation of mlock() we must introduce new
> > > > syscalls that implement the intended mlock() semantics.
> > >
> > > Intended means Peter's semantics?
> >
> > No, I don't actually write RT applications. But I've had plenty of
> > arguments with RT people when I explained to them what our mlock()
> > actually does vs what they expected it to do.
>
> Ok. Guess this is all new to you at this point. I am happy to see that you
> are willing to abandon your evil ways (although under pressure from your
> users) and are willing to put the low latency people now in the RT camp.

*sigh*.. we have shared goals up to a point. RT bounds and if possible
lowers the worst case latency. The worst case is absolutely most
important for us.

By doing so we often also lower the avg latency, but it's not the primary
goal.

> > They're not happy. Aside from that; you HPC/HFT minimal latency lot
> > should very well appreciate the minimal interference stuff they do
> > actually expect.
>
> Sure we do and we know how to do things to work around the "fair
> scheduler" and other stuff. But you are breaking the basics of how we do
> things with your conflation of pinning and mlocking.

Baseless statement there; I've given a very clear and concise definition
of how I see mlock() and mpin() work together. You can't dismiss all
that and expect people to take you seriously.

And argue until your face is blue but IB and perf are very much user
controlled pins.

> We do not migrate, do not allow defragmentation or reclaim when running
> low latency applications. These are non issues.

Sounds like you lot are a bunch of work-around hacks. We very much want
to allow running such applications without having to rebuild the kernel.
In fact we want to allow running apps that need these CONFIG things both
enabled and disabled.

And while page migration and defrag have CONFIG knobs, reclaim does not;
file based reclaim is unconditional (except as already noted when using
mlock()).

I just want to be able to have all that enabled and still have
applications be able to express their needs and be able to run.

> > Here we must disagree I fear; given that mlock() is of RT origin and RT
> > people very much want/expect mlock() to do what our proposed mpin() will
> > do.
>
> RT is a dirty word for me given the fairness and bloat issue. Not sure
> what you mean with that. mlock is a means to keep data in memory and not a
> magical wand that avoids all OS handling of the page.

RT is about bounded worst case latency, and a preference for that bound
to be as low as possible. That is all RT is and wants to be -- its
definitely not about fairness in any way shape or form.

But yes, the intention was very much for the page (and mapping) to be
left alone by the OS -- it's unfortunate the spec's wording doesn't clarify
this.

> > > That cannot be so since mlocked pages need to be migratable.
> >
> > I'm talking about the proposed mpin() stuff.
>
> Could you write that up in detail? I am not sure how this could work at
> this point.

*omg*.. I did. mpin() (and we'll include IB and perf here) is a stronger
mlock() in that it must avoid all faults, not only the major faults.

After that there's just implementation details; the proposed
implementation used VM_PINNED to tag the VMAs the user called mpin() on.
It will use get_user_pages() to acquire a ref and effect the actual
pinning.
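
In rough outline -- may_pin() and mark_vmas_pinned() here are
placeholders for the accounting and VMA-flagging steps, nothing of this
is in mainline:

    /* mpin(start, len) -- sketch */
    down_write(&mm->mmap_sem);

    if (may_pin(mm, nr_pages)) {    /* account against RLIMIT_MEMLOCK */
        /* split VMAs as needed, set VM_PINNED on [start, start+len) */
        mark_vmas_pinned(mm, start, len);

        /* fault the pages in and take the pinning reference */
        get_user_pages(current, mm, start, nr_pages,
                       1 /* write */, 0 /* force */, pages, NULL);
    }

    up_write(&mm->mmap_sem);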

Below I also proposed we could pre-compact such regions into UNMOVABLE
page blocks.

But I don't much care about the implementation per-se. All I've been
really arguing for is semantics.

> > So I proposed most of the machinery that would be required to actually
> > implement the syscalls. Except that the IB code stumped me. In
> > particular I cannot easily find the userspace address to unpin for
> > ipath/qib release paths.
> >
> > Once we have that we can trivially implement the syscalls.
>
> Why would you need syscalls? Pinning is driver/kernel subsystem initiated
> and therefore the driver can do the pin/unpin calls.

mpin()/munpin() syscalls so userspace can effect the same on their
desired ranges.

> > > Pinning is not initiated by user space but by the kernel. Either
> > > temporarily (page count increases are used all over the kernel for this)
> > > or for a longer time frame (IB and Perf and likely more drivers that we have
> > > not found yet).
> >
> > Here I disagree, I'll argue that all pins that require userspace action
> > to go away are userspace controlled pins. This makes IB/Perf user
> > controlled pins and should thus be treated the same as mpin()/munpin()
> > calls.
>
> Both IB and Perf are kernel subsystems that pin as a side effect of
> another syscall.

Yeah so? You still have user controlled pinning; the kernel cannot rid
itself of these resources on its own.

> > This goes back to what you want the limit to mean; I think a single
> > limit counting all pages exempt from paging is the far more useful
> > limit.
>
> Then you first need to show that the scheme accounts for *all* pages exempt
> from paging. This includes dirty pages and various pages of other
> subsystems that have an increased refcount in order to keep these pages
> where they are. We need a couple of passes through the kernel to find these
> locations and build up counters for these. Not all of these pages will be
> associated with processes.

All _user_ controlled pages exempt from paging. This would be mlock() +
IB + perf. AFAIK there's nothing else that a user can do to exempt pages
from paging.

Our big disconnect seems to be user controlled vs kernel controlled.

Why do you care about kernel pins? Those are transient are they not?

> > The only way to get an MCL_FUTURE is through CAP_IPC_LOCK. Practically
> > most MCL_CURRENT are also larger than RLIMIT_MEMLOCK and this also
> > requires CAP_IPC_LOCK.
> >
> > Once you have CAP_IPC_LOCK, RLIMIT_MEMLOCK becomes irrelevant.
>
> mlockall does not require CAP_IPC_LOCK. Never had an issue.

MCL_FUTURE does absolutely require CAP_IPC_LOCK, MCL_CURRENT requires a
huge (as opposed to the default 64k) RLIMIT or CAP_IPC_LOCK.

There's no argument there, look at the code.

> > > The semantics we agreed upon for mlock were that they are migratable.
> > > Pinned pages cannot be migrated and therefore are not mlocked pages.
> > > It would be best to have two different mechanisms for this.
> >
> > Best for whom? How is making two RLIMIT settings better than having
> > one?
>
> The pinned pages do not need to have a separate limit. They are kernel
> resources and could be bounded in other ways together with other resources
> that could cause danger to the kernel (such as too many dirty pages).
> Pinned pages are often not associated with a process.

So why is grouping user space pinned pages with mlock()ed pages such a
bad idea? Both can cause vm deadlocks for not allowing to reclaim pages.

Note that throughout I've been talking about user space controlled pins;
I really don't see a problem with kernel pins, which are by and large
small and transient.

> > > We do not need a new API.
> >
> > This is in direct contradiction with your earlier statement where you
> > say: "I agree that we need mpin and munpin.".
> >
> > Please make up your mind.
>
> Yes we need mpin and munpin as an in-kernel API. Glad to clue you in.

Dude,.. my VM_PINNED patch had that; it's you who seems to be somewhat
slow of wit here.

> > > Having the ability in
> > > general to move processes is a degree of control that we want. Pinning
> > > could be restricted to where it is required by the hardware.
> >
> > Or software; you cannot blindly dismiss realtime guarantees the software
> > needs to provide.
>
> I cringe when I hear realtime guarantees intended to refer to low
> latency.... But maybe they turned you finally? And I may just be feeling
> my decade-old frustration with your old approach to "realtime".
>
> What do you exactly mean by realtime guarantees? (Given that "realtime" is
> such an overloaded marketing term).

Bounded, and preferably as low as possible, worst case latency. The bound
gives determinism; the second gives finer resolution.

Yes there's a whole host of various real world realtime software
constraints. There's the hard realtime case where if you miss a deadline
the world 'ends' -- tokamak EM control for nuclear fusion; the band saw
vs finger thing etc. And there's soft realtime in a hundred different
shades but it's generally understood to mean stuff that can recover from
sporadic lateness.

We want to cater to all those people and thus typically focus on the
hard-rt case.

Now a RT application will typically have a large !RT part; it will log
data, maybe update decision matrices, talk to other computers etc..

So while we want our self driving car to not collide with stuff, we also
want to map the road, download maps, track other moving objects etc. Not
all of that has the same constraints; sensor data must always precede
maps if at all available, mapping/updating maps isn't at all important
but nice to have etc.

Such systems would like as full a general purpose system as possible
that is still able to provide guarantees where required.

Therefore it's desired to have 'nonsense' like compaction etc. enabled.

> > > Excessive amounts of pinned pages limit the ability of the kernel
> > > to manage its memory severely which may cause instabilities. mlocked pages
> > > are better because the kernel still has some options.
> >
> > I'm not arguing which are better; I've even conceded we need mpin() and
> > can keep the current mlock() semantics. I'm arguing for what we _need_.
>
> We need the kernel to track the pinned pages (all of them not just those
> from IB and perf) and provide proper global boundaries to avoid kernel
> failures. Yes.

That's not what I said. Also you failed to provide rationale for your
desire to do so. Why would we need to track the transient kernel pins?
Those will go away by themselves.

The IB/perf/mpin() pins are special in that they're dependent on
userspace and will not go away on their own. Also they need to be
limited lest userspace consume too much of them and VM deadlocks
might ensue.

2013-06-17 12:29:03

by Thomas Gleixner

Subject: Re: [PATCH] mm: Revert pinned_vm braindamage

On Thu, 13 Jun 2013, Andrew Morton wrote:
> Let's try to get this wrapped up?
>
> On Thu, 6 Jun 2013 14:43:51 +0200 Peter Zijlstra <[email protected]> wrote:
>
> >
> > Patch bc3e53f682 ("mm: distinguish between mlocked and pinned pages")
> > broke RLIMIT_MEMLOCK.
>
> I rather like what bc3e53f682 did, actually. RLIMIT_MEMLOCK limits the
> amount of memory you can mlock(). Nice and simple.
>
> This pinning thing which infiniband/perf are doing is conceptually
> different and if we care at all, perhaps we should be looking at adding
> RLIMIT_PINNED.

Actually PINNED is just a stronger version of MEMLOCK. PINNED and
MEMLOCK are both preventing the page from being paged out. PINNED adds
the constraint of preventing minor faults as well.

So I think the really important tuning knob is the limitation of pages
which cannot be paged out. And this is what RLIMIT_MEMLOCK is about.

Now if you want to add RLIMIT_PINNED as well, then it only limits the
number of pages which cannot create minor faults, but that does not
affect the limitation of total pages which cannot be paged out.
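
In sketch form (RLIMIT_PINNED being hypothetical):

    /* Every pinned page is also an unpageable page, so both counters
     * feed the RLIMIT_MEMLOCK check ... */
    if (mm->locked_vm + mm->pinned_vm + npages >
        rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT)
        return -ENOMEM;

    /* ... while RLIMIT_PINNED would constrain only the subset that
     * must not even minor fault. */
    if (mm->pinned_vm + npages > rlimit(RLIMIT_PINNED) >> PAGE_SHIFT)
        return -ENOMEM;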

Thanks,

tglx

by Christoph Lameter

Subject: Re: [PATCH] mm: Revert pinned_vm braindamage

On Mon, 17 Jun 2013, Peter Zijlstra wrote:

> They did no such thing; being one of those who wrote such code. I
> expressly used RLIMIT_MEMLOCK for it's the one limit userspace has to
> limit pages that are exempt from paging.

Don't remember reviewing that. Assumptions were wrong in that patch then.

> > Pinned pages are exempted by the kernel. A device driver or some other
> > kernel process (reclaim, page migration, io etc) increases the page count.
> > There is currently no consistent accounting for pinned pages. The
> > pinned_vm counter was introduced to allow the largest pinners to track
> > what they did.
>
> No, not the largest: user space controlled pinners. The thing that
> makes all the difference is the _USER_ control.

The pinning *cannot* be done from user space. Here it is the IB subsystem
that is doing it.

> > Pinning is kernel controlled...
>
> That's pure bull; perf and IB have user controlled pinning.

Pinning is a side effect of memory registration in IB.

> > Ok but this is similar to reclaim and other such things that are unmapping
> > pages.
>
> Which are avoided by mlock(); an mlock()ed page will (typically) not get
> unmapped.

Oh it will get unmapped definitely by page migration and such. Otherwise
it could not be moved.

> But we don't want to disable defrag, defrag is good for the general
> health of the system. Also not all applications are RT; we want a kernel
> that's able to run general purpose stuff along with some RT apps.

Still have no idea what you mean by RT.

> *sigh*.. we have shared goals up to a point. RT bounds and if possible
> lowers the worst case latency. The worst case is absolutely most
> important for us.

Ok that differs then.

> By doing so we often also lower the avg latency, but its not the primary
> goal.

So far we have mostly seen these measures increase the average latency.

> And argue until your face is blue but IB and perf are very much user
> controlled pins.

Sure, the user controls the memory registered with IB, but the pinning is
done because a device maps the memory. Side effect.

> And while page migration and defrag have CONFIG knobs, reclaim does not;
> file based reclaim is unconditional (except as already noted when using
> mlock()).

Yes that is a big problem.

> I just want to be able to have all that enabled and still have
> applications be able to express their needs and be able to run.

That is not working because these goodies all cause additional latency,
by taking the processor away and disturbing the caches.
You cannot have your cake and eat it too.

> But yes, the intention was very much for the page (and mapping) to be
> left alone by the OS -- it's unfortunate the spec's wording doesn't
> clarify this.

All the talks about this that I have had with the mm developers left a
different impression, and we have implemented it according to the
understanding that mlocked pages are movable. It's not that I am that fond
of it, but I accepted it and worked with that notion.

> > > I'm talking about the proposed mpin() stuff.
> >
> > Could you write that up in detail? I am not sure how this could work at
> > this point.
>
> *omg*.. I did. mpin() (and we'll include IB and perf here) is a stronger
> mlock() in that it must avoid all faults, not only the major faults.

Looked like hand waving to me. How exactly does mpin() get implemented and
how does it interact with the VM? Do we have to create an additional LRU
for this? Does this do any good given that there is still reclaim and lots
of other stuff going on?
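
For concreteness, the rough shape the earlier "extra refcount" suggestion
would take -- entirely hypothetical code, nothing like it is merged:

/* hypothetical: mlock() the range (no major faults), then take an
 * extra reference on each page so neither reclaim nor migration can
 * touch it (no minor faults either) */
static long do_mpin(unsigned long start, size_t len)
{
        int nr = DIV_ROUND_UP(len, PAGE_SIZE);
        struct page **pages;
        long ret;

        ret = sys_mlock(start, len);
        if (ret)
                return ret;

        pages = kcalloc(nr, sizeof(*pages), GFP_KERNEL);
        if (!pages)
                return -ENOMEM;

        ret = get_user_pages_fast(start, nr, 1, pages);
        /* the refs (and the pages array) are kept until munpin() */
        return ret < 0 ? ret : 0;
}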

> But I don't much care about the implementation per-se. All I've been
> really arguing for is semantics.

Yes you are arguing to change the established semantics for mlocked pages.

> > Both IB and Perf are kernel subsystems that pin as a side effect of
> > another syscall.
>
> Yeah so? You still have user controlled pinning; the kernel cannot rid
> itself of these resources on its own.

Oh there is an mmu_notifier subsystem for this purpose. You can setup a
callback that can unpin pages. Again this is kernel specific. The device
driver needs to register the mmu notifier.
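
For reference, the shape of that interface (mmu_notifier_register() and the
ops structure are the real API; the my_* names are made up):

#include <linux/mmu_notifier.h>

/* sketch: the driver drops its page pins when the VM invalidates a range */
static void my_invalidate_range_start(struct mmu_notifier *mn,
                                      struct mm_struct *mm,
                                      unsigned long start, unsigned long end)
{
        my_unpin_range(mn, start, end); /* hypothetical: put_page() the
                                           get_user_pages() references */
}

static const struct mmu_notifier_ops my_mn_ops = {
        .invalidate_range_start = my_invalidate_range_start,
};

/* at context setup, with my_mn.ops = &my_mn_ops: */
/*      ret = mmu_notifier_register(&my_mn, current->mm);      */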

> All _user_ controlled pages exempt from paging. This would be mlock() +
> IB + perf. AFAIK there's nothing else that a user can do to exempt pages
> from paging.

What about dirtying pages, writeback?

> Our big disconnect seems to be user controlled vs kernel controlled.
>
> Why do you care about kernel pins? Those are transient are they not?

Nope, the pins for page migration, IB etc. can be quite long. And I suspect
that there are multiple subsystems that we are not aware of that also pin
for a longer time period but do not account for their pinning actions.

> > mlockall does not require CAP_IPC_LOCK. Never had an issue.
>
> MCL_FUTURE does absolutely require CAP_IPC_LOCK, MCL_CURRENT requires a
> huge (as opposed to the default 64k) RLIMIT or CAP_IPC_LOCK.
>
> There's no argument there, look at the code.

I am sorry but we have been using mlockall() for years now without the
issues that you are bringing up. AFAICT mlockall does not require
MCL_FUTURE.

> > The pinned pages do not need to have a separate limit. They are kernel
> > resources and could be bounded in other ways together with other resources
> > that could cause danger to the kernel (such as too many dirty pages).
> > Pinned pages are often not associated with a process.
>
> So why is grouping user-space pinned pages with mlock()ed pages such a
> bad idea? Both can cause VM deadlocks by not allowing pages to be
> reclaimed.

So can other things.

> > What do you exactly mean by realtime guarantees? (Given that "realtime" is
> > such an overloaded marketing term).
>
> Bounded, and preferably as low as possible, worst case latency. The
> bound gives determinism; the second gives finer resolution.

Well, and you still want to run reclaim on the processor etc.? What is the
point of all of that when the OS can come in and take the processor away
for a much longer time frame?

> So while we want our self driving car to not collide with stuff, we also
> want to map the road, download maps, track other moving objects etc. Not
> all of that has the same constraints; sensor data must always precede
> maps if at all available, mapping/updating maps isn't at all important
> but nice to have etc.

Well, if there is a big boulder on the road (reclaim) then I would rather
worry about that than put extensive effort into clearing the road of small
stones (like minor faults).

> That's not what I said. Also you failed to provide a rationale for your
> desire to do so. Why would we need to track the transient kernel pins?
> Those will go away by themselves.

I was not talking about transient pins but the longer term ones.

> The IB/perf/mpin() pins are special in that they're dependent on
> userspace and will not go away on their own. Also they need to be
> limited lest userspace consume too many of them and VM deadlocks
> ensue.

Yes that is what I said should be done.

2013-06-20 11:34:38

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] mm: Revert pinned_vm braindamage


* Thomas Gleixner <[email protected]> wrote:

> On Thu, 13 Jun 2013, Andrew Morton wrote:
> > Let's try to get this wrapped up?
> >
> > On Thu, 6 Jun 2013 14:43:51 +0200 Peter Zijlstra <[email protected]> wrote:
> >
> > >
> > > Patch bc3e53f682 ("mm: distinguish between mlocked and pinned pages")
> > > broke RLIMIT_MEMLOCK.
> >
> > I rather like what bc3e53f682 did, actually. RLIMIT_MEMLOCK limits the
> > amount of memory you can mlock(). Nice and simple.
> >
> > This pinning thing which infiniband/perf are doing is conceptually
> > different and if we care at all, perhaps we should be looking at adding
> > RLIMIT_PINNED.
>
> Actually PINNED is just a stronger version of MEMLOCK. PINNED and
> MEMLOCK are both preventing the page from being paged out. PINNED adds
> the constraint of preventing minor faults as well.
>
> So I think the really important tuning knob is the limitation of pages
> which cannot be paged out. And this is what RLIMIT_MEMLOCK is about.
>
> Now if you want to add RLIMIT_PINNED as well, then it only limits the
> number of pages which cannot create minor faults, but that does not
> affect the limitation of total pages which cannot be paged out.

Agreed.

( Furthermore, the RLIMIT_MEMLOCK semantics change actively broke code so
this is not academic and it would be nice to progress with it. )

Thanks,

Ingo

2013-06-20 11:49:49

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] mm: Revert pinned_vm braindamage


* Christoph Lameter <[email protected]> wrote:

> On Mon, 17 Jun 2013, Peter Zijlstra wrote:
>
> > They did no such thing; being one of those who wrote such code. I
> > expressly used RLIMIT_MEMLOCK for it's the one limit userspace has to
> > limit pages that are exempt from paging.
>
> Don't remember reviewing that. Assumptions were wrong in that patch then.
>
> > > Pinned pages are exempted by the kernel. A device driver or some other
> > > kernel process (reclaim, page migration, io etc) increases the page count.
> > > There is currently no consistent accounting for pinned pages. The
> > > vm_pinned counter was introduced to allow the largest pinners to track
> > > what they did.
> >
> > No, not the largest, user space controlled pinners. The thing that
> > makes all the difference is the _USER_ control.
>
> The pinning *cannot* be done from user space. Here it is the IB subsystem
> that is doing it.

Peter clearly pointed out that in the perf case it's user-space that
initiates the pinned memory mapping which is resource-controlled via
RLIMIT_MEMLOCK - and this was implemented that way before your commit
broke the code.

You seem to be hell bent on defining 'memory pinning' only as "the thing
done via the mlock*() system calls", but that is a nonsensical distinction
that actively and incorrectly ignores other system calls that can and do
pin memory legitimately.

If some other system call results in mapping pinned memory that is at
least as restrictively pinned as an mlock()-ed vma (the perf syscall is
such) then it's entirely proper design to be resource controlled under
RLIMIT_MEMLOCK as well. In fact this worked so before your commit broke
it.

> > > mlockall does not require CAP_IPC_LOCK. Never had an issue.
> >
> > MCL_FUTURE does absolutely require CAP_IPC_LOCK, MCL_CURRENT requires
> > a huge (as opposed to the default 64k) RLIMIT or CAP_IPC_LOCK.
> >
> > There's no argument there, look at the code.
>
> I am sorry but we have been mlockall() for years now without the issues
> that you are bringing up. AFAICT mlockall does not require MCL_FUTURE.

You only have to read the mlockall() code to see that Peter's claim is
correct:

mm/mlock.c:

SYSCALL_DEFINE1(mlockall, int, flags)
{
        unsigned long lock_limit;
        int ret = -EINVAL;

        if (!flags || (flags & ~(MCL_CURRENT | MCL_FUTURE)))
                goto out;

        ret = -EPERM;
        if (!can_do_mlock())
                goto out;
        ...


int can_do_mlock(void)
{
        if (capable(CAP_IPC_LOCK))
                return 1;
        if (rlimit(RLIMIT_MEMLOCK) != 0)
                return 1;
        return 0;
}
EXPORT_SYMBOL(can_do_mlock);

Q.E.D.

Thanks,

Ingo

Subject: Re: [PATCH] mm: Revert pinned_vm braindamage

On Thu, 20 Jun 2013, Ingo Molnar wrote:

> Peter clearly pointed it out that in the perf case it's user-space that
> initiates the pinned memory mapping which is resource-controlled via
> RLIMIT_MEMLOCK - and this was implemented that way before your commit
> broke the code.

There is no way that user space can initiate a page pin right now. Perf is
pinning the page from the kernel. Similarly the IB subsystem pins memory
needed for device I/O.

> You seem to be hell bent on defining 'memory pinning' only as "the thing
> done via the mlock*() system calls", but that is a nonsensical distinction
> that actively and incorrectly ignores other system calls that can and do
> pin memory legitimately.

Nope. I have said that memory pinning is done by increasing the refcount,
which is different from mlock(), which sets a page flag.
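
In kernel terms the two mechanisms look roughly like this (a sketch;
mlock_vma_page() is the real entry point for the mlock side):

/* pinning: an extra reference; the page can neither be freed,
   reclaimed nor migrated while the reference is held */
get_page(page);
/* ... device I/O ... */
put_page(page);

/* mlocking: a vma flag plus the PG_mlocked page flag; the page
   stays resident but remains movable */
vma->vm_flags |= VM_LOCKED;
mlock_vma_page(page);   /* sets PG_mlocked, moves the page to the
                           unevictable LRU */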

I have consistently argued that these are two different things. And I am a
bit surprised that this point has not been understood after all these
repetitions.

Memory pinning these days is done as a side effect of kernel / driver
needs. I.e. the memory registration done through the IB subsystem and
elsewhere.

> int can_do_mlock(void)
> {
>         if (capable(CAP_IPC_LOCK))
>                 return 1;
>         if (rlimit(RLIMIT_MEMLOCK) != 0)
>                 return 1;
>         return 0;
> }
> EXPORT_SYMBOL(can_do_mlock);
>
> Q.E.D.

Argh. Just checked the apps. True. They did set the rlimit to 0 at some
point in order to make this work. Then they monitor the number of locked
pages and create alerts so that action can be taken if a system uses too
many mlocked pages.

2013-06-21 06:25:54

by Roland Dreier

[permalink] [raw]
Subject: Re: [PATCH] mm: Revert pinned_vm braindamage

On Thu, Jun 20, 2013 at 7:48 AM, Christoph Lameter <[email protected]> wrote:
> There is no way that user space can initiate a page pin right now. Perf is
> pinning the page from the kernel. Similarly the IB subsystem pins memory
> needed for device I/O.

Christoph, your argument would be a lot more convincing if you stopped
repeating this nonsense. Sure, in a strict sense, it might be true
that the IB subsystem in the kernel is the code that actually pins
memory, but given that unprivileged userspace can tell the kernel to
pin arbitrary parts of its memory for any amount of time, is that
relevant? And in fact taking your "initiate" word choice above, I
don't even think your statement is true -- userspace initiates the
pinning by, for example, doing an IB memory registration (libibverbs
ibv_reg_mr() call), which turns into a system call, which leads to the
kernel trying to pin pages. The pages aren't unpinned until userspace
unregisters the memory (or causes a cleanup by closing the context
fd).
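
The userspace side is just the following, assuming pd is an existing
protection domain and buf/len a malloc()ed region:

#include <infiniband/verbs.h>

/* unprivileged userspace asking the kernel to pin len bytes */
struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                               IBV_ACCESS_LOCAL_WRITE |
                               IBV_ACCESS_REMOTE_WRITE);
/* the pages backing buf stay pinned until: */
ibv_dereg_mr(mr);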

Here's an argument by analogy. Would it make any sense for me to say
userspace can't mlock memory, because only the kernel can set
VM_LOCKED on a vma? Of course not. Userspace has the mlock() system
call, and although the actual work happens in the kernel, we clearly
want to be able to limit the amount of memory locked by the kernel ON
BEHALF OF USERSPACE.

- R.

Subject: Re: [PATCH] mm: Revert pinned_vm braindamage

On Thu, 20 Jun 2013, Roland Dreier wrote:

> Christoph, your argument would be a lot more convincing if you stopped
> repeating this nonsense. Sure, in a strict sense, it might be true

Well, this is regarding tracking of pages that need to stay resident, and
since the kernel does the pinning through the IB subsystem it is trackable
right there. No nonsense and no need for a separate pinning system call.

> that the IB subsystem in the kernel is the code that actually pins
> memory, but given that unprivileged userspace can tell the kernel to
> pin arbitrary parts of its memory for any amount of time, is that
> relevant? And in fact taking your "initiate" word choice above, I
> don't even think your statement is true -- userspace initiates the
> pinning by, for example, doing an IB memory registration (libibverbs
> ibv_reg_mr() call), which turns into a system call, which leads to the
> kernel trying to pin pages. The pages aren't unpinned until userspace
> unregisters the memory (or causes a cleanup by closing the context
> fd).

In some sense userspace initiates everything, since the kernel's purpose
is to run applications. So you can say that everything is user initiated if
you want.

However, the user visible mechanism here is a registration of memory with
the IB subsystem for RDMA. The primary intent is not to pin the pages but
to make memory available for remote I/O. The pages are pinned *because*
otherwise remote RDMA operations could corrupt memory due to the kernel
moving/evicting memory.

> Here's an argument by analogy. Would it make any sense for me to say
> userspace can't mlock memory, because only the kernel can set
> VM_LOCKED on a vma? Of course not. Userspace has the mlock() system
> call, and although the actual work happens in the kernel, we clearly
> want to be able to limit the amount of memory locked by the kernel ON
> BEHALF OF USERSPACE.

I would think that mlock is a memory management function and therefore the
app/user directly says that the memory is not to be evicted.

This is different for the IB subsystem, which is dealing with I/O and only
indirectly with memory. Had we a different mechanism to prevent reclaim
etc., we would not need to pin the pages.

Actually there is such a mechanism that could be used here. If you had a
reserved memory region that is not mapped by the kernel (boot-time alloc,
device memory) then you could use VM_PFNMAP to refer to that region and the
kernel would not be able to do reclaim on that memory. No pinning would be
necessary if the IB subsystem registered that type of memory.
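
A minimal sketch of such a mapping in a driver, with mydev_base_pfn
standing in for the hypothetical reserved region; remap_pfn_range() itself
marks the vma VM_PFNMAP so the VM leaves it alone:

static int mydev_mmap(struct file *file, struct vm_area_struct *vma)
{
        unsigned long size = vma->vm_end - vma->vm_start;

        /* map the boot-time reserved region: no struct pages here, so
           no reclaim and no migration -- and thus nothing to pin */
        return remap_pfn_range(vma, vma->vm_start, mydev_base_pfn,
                               size, vma->vm_page_prot);
}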