2013-05-22 19:31:19

by Vince Weaver

Subject: OOPS in perf_mmap_close()

Hello

I was fuzzing the kernel using my "perf_fuzzer" tool that builds on top of
the trinity fuzzer. (You can get it as part of my perf_event_tests
suite https://github.com/deater/perf_event_tests )

In any case while letting it run I got the following OOPS on
Linux 3.10-rc2

[142450.070877] IP: [<ffffffff810a9859>] perf_mmap_close+0x52/0xa8
[142450.070890] PGD 59a8067 PUD 2947067 PMD 0
[142450.070899] Oops: 0002 [#1] SMP
[142450.070906] Modules linked in: bluetooth msr cpufreq_stats dn_rtmsg can_raw nfnetlink can_bcm can xfrm_user xfrm_algo nfc rfkill ax25 scsi_transport_iscsi atm ipt_ULOG x_tables ipx p8023 p8022 irda crc_ccitt appletalk psnap llc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd dns_resolver fscache sunrpc loop fuse snd_hda_codec_hdmi snd_hda_codec_realtek coretemp kvm_intel kvm evdev nouveau mxm_wmi ttm drm_kms_helper microcode drm i2c_algo_bit video snd_hda_intel snd_hda_codec wmi snd_hwdep snd_pcm snd_page_alloc snd_seq snd_seq_device snd_timer snd psmouse serio_raw acpi_cpufreq mperf processor thermal_sys button pcspkr i2c_nforce2 shpchp soundcore i2c_core ext4 crc16 jbd2 mbcache sg sd_mod crc_t10dif ata_generic r8169 mii ahci libahci ehci_pci ohci_hcd ehci_hcd libata scsi_mod usbcore usb_common
[142450.071099] CPU: 0 PID: 2539 Comm: perf_fuzzer Tainted: G 3.10.0-rc2 #2
[142450.071106] Hardware name: AOpen DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, BIOS 080015 10/19/2012
[142450.071113] task: ffff880119e25830 ti: ffff880005aea000 task.ti: ffff880005aea000
[142450.071120] RIP: 0010:[<ffffffff810a9859>] [<ffffffff810a9859>] perf_mmap_close+0x52/0xa8
RSP: 0018:ffff880005aebea8 EFLAGS: 00010202
[142450.071136] RAX: 0000000000000001 RBX: ffff88000455e800 RCX: ffffffffffffffff
[142450.071143] RDX: ffff880004732ac0 RSI: ffff88000455ea30 RDI: ffff88000455ea30
[142450.071151] RBP: 0000000000000000 R08: 00007f5adc87b000 R09: 00007f5adc87a000
[142450.071157] R10: ffff880119684ce0 R11: 0000000000000206 R12: ffff880118697bc0
[142450.071164] R13: ffff88000455ea30 R14: ffff880119684cc0 R15: ffff880118796368
[142450.071171] FS: 00007f5adc881700(0000) GS:ffff88011fc00000(0000) knlGS:0000000000000000
[142450.071178] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[142450.071184] CR2: 0000000000000060 CR3: 00000000029f1000 CR4: 00000000000407f0
[142450.071190] DR0: 000000000b73c476 DR1: 000000000a837586 DR2: 00000000481f91c8
[142450.071197] DR3: 000000005519375a DR6: 00000000ffff0ff0 DR7: 000000000051060a
[142450.071203] Stack:
[142450.071206] ffff880119684cc0 0000000000000000 0000000000000000 00007f5adc87a000
[142450.071218] 00007f5adc87b000 ffffffff810d2ce0 ffff880117183740 ffff880119684cc0
[142450.071229] ffffffff810d4342 ffff880119684cc0 ffff8801196c3430 ffff880117183748
[142450.071241] Call Trace:
[142450.071248] [<ffffffff810d2ce0>] ? remove_vma+0x28/0x5f
[142450.071255] [<ffffffff810d4342>] ? do_munmap+0x2d5/0x306
[142450.071262] [<ffffffff810d43ab>] ? vm_munmap+0x38/0x4e
[142450.071268] [<ffffffff810d43db>] ? SyS_munmap+0x1a/0x1f
[142450.071276] [<ffffffff81369b12>] ? system_call_fastpath+0x16/0x1b
[142450.071281] Code: 4c 89 ee e8 3e 21 fa ff 85 c0 74 6c 4c 8b a3 68 02 00 00 48 8b ab 60 02 00 00 41 8b 44 24 18 c1 e0 0c 48 98 48 c1 e8 0c 48 ff c0 <f0> 48 29 45 60 49 8b 46 40 48 63 93 5c 02 00 00 48 29 90 b8 00
[142450.071382] RIP [<ffffffff810a9859>] perf_mmap_close+0x52/0xa8
[142450.071390] RSP <ffff880005aebea8>
[142450.071394] CR2: 0000000000000060
[142450.073009] ---[ end trace a49b6a0053924d8b ]---

Vince Weaver
[email protected]
http://www.eece.maine.edu/~vweaver/


2013-05-23 00:16:00

by Vince Weaver

Subject: Re: OOPS in perf_mmap_close()


> In any case while letting it run I got the following OOPS on
> Linux 3.10-rc2

Included below is test code (based on the fuzzer output) that reliably
OOPSes my core2 machine. It's a bit long, but I'm remote from the
machine now so I can't refine it (as the code locked up the machine the
last time I tested).


/* perf_mmap_close_bug.c */
/* By Vince Weaver <vincent.weaver at maine.edu> */
/* compile with "gcc -O2 -Wall -o perf_mmap_close_bug perf_mmap_close_bug.c" */
/* This will reliably OOPS my core2 Linux 3.10-rc2 machine */

#include <stdio.h>
#include <linux/perf_event.h>
#include <sys/mman.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdlib.h>
#include <errno.h>

int perf_event_open(struct perf_event_attr *hw_event_uptr,
		    pid_t pid, int cpu, int group_fd, unsigned long flags) {

	return syscall(__NR_perf_event_open,hw_event_uptr, pid, cpu,
		       group_fd, flags);
}


int main(int argc, char **argv) {

	struct perf_event_attr pe1,pe2;
	int fd1,fd2;

	memset(&pe1,0,sizeof(struct perf_event_attr));
	memset(&pe2,0,sizeof(struct perf_event_attr));

	pe1.type=0;
	pe1.size=0x60;
	pe1.config=3;
	pe1.sample_type=0x4f0;
	pe1.read_format=5;
	pe1.exclude_kernel=1;
	pe1.bp_type=1;
	pe1.config1=0x1d469257;
	pe1.config2=2;

	fd1=perf_event_open(&pe1,0,0,-1,3);
	if (fd1<0) {
		fprintf(stderr,"Error opening fd1 %s\n",strerror(errno));
		exit(1);
	}

	mmap(NULL, 69632, PROT_READ|PROT_WRITE, MAP_SHARED, fd1, 0);

	ioctl(fd1,PERF_EVENT_IOC_RESET,0);

	pe2.type=1;
	pe2.size=0x60;
	pe2.config=2;
	pe2.read_format=3;
	pe2.exclusive=1;
	pe2.exclude_user=1;
	pe2.mmap=1;
	pe2.inherit_stat=1;
	pe2.enable_on_exec=1;
	pe2.task=1;
	pe2.watermark=1;
	pe2.precise_ip=2;
	pe2.sample_id_all=1;
	pe2.exclude_guest=1;
	pe2.wakeup_events=1500395763;

	fd2=perf_event_open(&pe2,0,0,fd1,3);

	fd1=perf_event_open(&pe1,0,0,-1,3);
	if (fd2<0) {
		fprintf(stderr,"Error opening fd2 %s\n",strerror(errno));
		exit(1);
	}

	mmap(NULL, 69632, PROT_READ|PROT_WRITE, MAP_SHARED, fd2, 0);

	ioctl(fd1,PERF_EVENT_IOC_ENABLE,0);

	ioctl(fd2,PERF_EVENT_IOC_ENABLE,0);

	close(fd2);

	return 0;
}

2013-05-23 03:44:28

by Vince Weaver

Subject: Re: OOPS in perf_mmap_close()


In case anyone cares, the Oops is happening here:

1a56: 48 c1 e8 0c shr $0xc,%rax
1a5a: 48 ff c0 inc %rax
> 1a5d: f0 48 29 45 60 lock sub %rax,0x60(%rbp)
1a62: 49 8b 46 40 mov 0x40(%r14),%rax

Which maps to this in perf_mmap_close() in kernel/events/core.c:

atomic_long_sub((size >> PAGE_SHIFT) + 1, &user->locked_vm);

And "user" (%rbp) is RBP: 0000000000000000, hence the problem.

I'm having trouble tracking the problem back any further as the code is a
bit convoluted and is not commented at all.

Vince

2013-05-23 04:48:12

by Al Viro

Subject: Re: OOPS in perf_mmap_close()

On Wed, May 22, 2013 at 11:48:51PM -0400, Vince Weaver wrote:
>
> In case anyone cares, the Oops is happening here:
>
> 1a56: 48 c1 e8 0c shr $0xc,%rax
> 1a5a: 48 ff c0 inc %rax
> > 1a5d: f0 48 29 45 60 lock sub %rax,0x60(%rbp)
> 1a62: 49 8b 46 40 mov 0x40(%r14),%rax
>
> Which maps to this in perf_mmap_close() in kernel/events/core.c:
>
> atomic_long_sub((size >> PAGE_SHIFT) + 1, &user->locked_vm);
>
> And "user" (%rbp) is RBP: 0000000000000000, hence the problem.
>
> I'm having trouble tracking the problem back any further as the code is a
> bit convoluted and is not commented at all.

FWIW, at least part of perf_mmap_close() is obvious garbage - increment of
->pinned_vm happens in mmap(), decrement - on the ->close() of the last
VMA clonal to one we'd created in that mmap(), regardless of the address
space it's in. Not that handling of ->pinned_vm made any sense wrt fork()...

Actually... What happens if you mmap() the same opened file of that
kind several times, each time with the same size? AFAICS, on all
subsequent calls we'll get
mutex_lock(&event->mmap_mutex);
if (event->rb) {
if (event->rb->nr_pages == nr_pages)
atomic_inc(&event->rb->refcount);
else
...
goto unlock;
unlock:
if (!ret)
atomic_inc(&event->mmap_count);
mutex_unlock(&event->mmap_mutex);

i.e. we bump event->mmap_count *and* event->rb->refcount. munmap()
all of them and each will generate a call of perf_mmap_close(); ->mmap_count
will go down to zero and on all but the last call we'll have nothing else
done. On the last call we'll hit ring_buffer_put(), which will decrement
event->rb->refcount once. Note that by that point we simply don't know
how many times we'd incremented it in those mmap() calls - it's too late
to clean up. IOW, unless I'm misreading that code, we've got a leak in
there. Not the same bug, but...
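
A minimal user-space sketch of that sequence (illustrative only, not
something from the fuzzer): mmap() the same perf fd twice with the same
size, then munmap() both; on the reading above the second mmap() bumps
rb->refcount, but ring_buffer_put() only ever runs once.

/* double-mmap-sketch.c -- illustrative only */
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/perf_event.h>

int main(void)
{
	struct perf_event_attr attr;
	size_t len = (1 + 16) * 4096;	/* header page + 16 data pages */
	void *m1, *m2;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_SOFTWARE;
	attr.config = PERF_COUNT_SW_CPU_CLOCK;
	attr.exclude_kernel = 1;

	fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);

	/* first mmap() allocates the ring buffer (refcount 1) */
	m1 = mmap(NULL, len, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	/* same size again: takes the event->rb branch, refcount goes to 2 */
	m2 = mmap(NULL, len, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

	/* two perf_mmap_close() calls, but ring_buffer_put() only runs once
	 * when mmap_count hits zero, so one rb reference is never dropped */
	munmap(m1, len);
	munmap(m2, len);

	close(fd);
	return 0;
}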

2013-05-23 10:42:25

by Peter Zijlstra

Subject: Re: OOPS in perf_mmap_close()

On Thu, May 23, 2013 at 05:48:03AM +0100, Al Viro wrote:
> On Wed, May 22, 2013 at 11:48:51PM -0400, Vince Weaver wrote:
> >
> > In case anyone cares, the Oops is happening here:
> >
> > 1a56: 48 c1 e8 0c shr $0xc,%rax
> > 1a5a: 48 ff c0 inc %rax
> > > 1a5d: f0 48 29 45 60 lock sub %rax,0x60(%rbp)
> > 1a62: 49 8b 46 40 mov 0x40(%r14),%rax
> >
> > Which maps to this in perf_mmap_close() in kernel/events/core.c:
> >
> > atomic_long_sub((size >> PAGE_SHIFT) + 1, &user->locked_vm);
> >
> > And "user" (%rbp) is RBP: 0000000000000000, hence the problem.
> >
> > I'm having trouble tracking the problem back any further as the code is a
> > bit convoluted and is not commented at all.
>
> FWIW, at least part of perf_mmap_close() is obvious garbage - increment of
> ->pinned_vm happens in mmap(), decrement - on the ->close() of the last
> VMA clonal to one we'd created in that mmap(), regardless of the address
> space it's in. Not that handling of ->pinned_vm made any sense wrt fork()...

Right it doesn't. I think the easiest solution for now is to not copy the VMA
on fork().

But I totally missed patch bc3e53f682d that introduced pinned_vm, AFAICT that
also wrecked some accounting. We should still account both against
RLIMIT_MEMLOCK.

> Actually... What happens if you mmap() the same opened file of that
> kind several times, each time with the same size? AFAICS, on all
> subsequent calls we'll get
> mutex_lock(&event->mmap_mutex);
> if (event->rb) {
> if (event->rb->nr_pages == nr_pages)
> atomic_inc(&event->rb->refcount);
> else
> ...
> goto unlock;
> unlock:
> if (!ret)
> atomic_inc(&event->mmap_count);
> mutex_unlock(&event->mmap_mutex);
>
> i.e. we bump event->mmap_count *and* event->rb->refcount. munmap()
> all of them and each will generate a call of perf_mmap_close(); ->mmap_count
> will go down to zero and on all but the last call we'll have nothing else
> done. On the last call we'll hit ring_buffer_put(), which will decrement
> event->rb->refcount once. Note that by that point we simply don't know
> how many times we'd incremented it in those mmap() calls - it's too late
> to clean up. IOW, unless I'm misreading that code, we've got a leak in
> there. Not the same bug, but...

Quite so, let's remove that rb->refcount bump.

Now, I don't think any of this explains Vince's splat; I'll go stare at that
next.

---
kernel/events/core.c | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9dc297f..c75b9c6 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3676,9 +3676,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
WARN_ON_ONCE(event->ctx->parent_ctx);
mutex_lock(&event->mmap_mutex);
if (event->rb) {
- if (event->rb->nr_pages == nr_pages)
- atomic_inc(&event->rb->refcount);
- else
+ if (event->rb->nr_pages != nr_pages)
ret = -EINVAL;
goto unlock;
}
@@ -3699,7 +3697,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)

lock_limit = rlimit(RLIMIT_MEMLOCK);
lock_limit >>= PAGE_SHIFT;
- locked = vma->vm_mm->pinned_vm + extra;
+ locked = vma->vm_mm->locked_vm + vma->vm_mm->pinned_vm + extra;

if ((locked > lock_limit) && perf_paranoid_tracepoint_raw() &&
!capable(CAP_IPC_LOCK)) {
@@ -3734,7 +3732,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
atomic_inc(&event->mmap_count);
mutex_unlock(&event->mmap_mutex);

- vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
+ vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND | VM_DONTDUMP;
vma->vm_ops = &perf_mmap_vmops;

return ret;

2013-05-23 12:52:36

by Peter Zijlstra

Subject: Re: OOPS in perf_mmap_close()


The below fixes things for me.

---
Subject: perf: Fix perf mmap issues

Vince reported a problem found by his perf-specific trinity fuzzer. Al noticed
that perf's mmap() has issues against fork() (since we use vma->vm_mm) and
against double mmap() (we leak an rb refcount).

While staring at this code I also found the introduction of
mm_struct::pinned_vm wrecked accounting against RLIMIT_LOCKED.

Aside from the rb reference leak spotted by Al, Vince's example prog was indeed
doing a double mmap() through the use of perf_event_set_output(). When cloning
the buffer of another event we should also copy its mmap_{user,locked}
settings, because as it turns out, this second event might be the last to get
unmapped, meaning it will hit perf_mmap_close() and needs to undo whatever the
initial mmap() did.

This means perf_event_set_output() now needs to hold both events'
mmap_mutex.

Cc: Christoph Lameter <[email protected]>
Cc: Al Viro <[email protected]>
Reported-by: Vince Weaver <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
---
kernel/events/core.c | 42 +++++++++++++++++++++++++++++++++++-------
1 file changed, 35 insertions(+), 7 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9dc297f..b736ad2 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3676,9 +3676,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
WARN_ON_ONCE(event->ctx->parent_ctx);
mutex_lock(&event->mmap_mutex);
if (event->rb) {
- if (event->rb->nr_pages == nr_pages)
- atomic_inc(&event->rb->refcount);
- else
+ if (event->rb->nr_pages != nr_pages)
ret = -EINVAL;
goto unlock;
}
@@ -3699,7 +3697,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)

lock_limit = rlimit(RLIMIT_MEMLOCK);
lock_limit >>= PAGE_SHIFT;
- locked = vma->vm_mm->pinned_vm + extra;
+ locked = vma->vm_mm->locked_vm + vma->vm_mm->pinned_vm + extra;

if ((locked > lock_limit) && perf_paranoid_tracepoint_raw() &&
!capable(CAP_IPC_LOCK)) {
@@ -3734,7 +3732,7 @@ unlock:
atomic_inc(&event->mmap_count);
mutex_unlock(&event->mmap_mutex);

- vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
+ vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND | VM_DONTDUMP;
vma->vm_ops = &perf_mmap_vmops;

return ret;
@@ -6381,14 +6379,26 @@ err_size:
goto out;
}

+static void mutex_lock_double(struct mutex *m1, struct mutex *m2)
+{
+ if (m2 < m1) swap(m1, m2);
+
+ mutex_lock(m1);
+ mutex_lock_nested(m2, SINGLE_DEPTH_NESTING);
+}
+
static int
perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
{
struct ring_buffer *rb = NULL, *old_rb = NULL;
+ struct user_struct *mmap_user = NULL;
+ int mmap_locked = 0;
int ret = -EINVAL;

- if (!output_event)
+ if (!output_event) {
+ mutex_lock(&event->mmap_mutex);
goto set;
+ }

/* don't allow circular references */
if (event == output_event)
@@ -6406,8 +6416,9 @@ perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
if (output_event->cpu == -1 && output_event->ctx != event->ctx)
goto out;

+ mutex_lock_double(&event->mmap_mutex, &output_event->mmap_mutex);
set:
- mutex_lock(&event->mmap_mutex);
+
/* Can't redirect output if we've got an active mmap() */
if (atomic_read(&event->mmap_count))
goto unlock;
@@ -6417,15 +6428,32 @@ set:
rb = ring_buffer_get(output_event);
if (!rb)
goto unlock;
+
+ /*
+ * If we're going to copy the rb, we need also copy
+ * mmap_{user,locked} in case we'll hit perf_mmap_close().
+ */
+ mmap_user = output_event->mmap_user;
+ mmap_locked = output_event->mmap_locked;
}

old_rb = event->rb;
rcu_assign_pointer(event->rb, rb);
if (old_rb)
ring_buffer_detach(event, old_rb);
+
+ /*
+ * Per the above check we cannot have an active mmap on event,
+ * therefore mmap_{user,locked} must be clear.
+ */
+ event->mmap_user = mmap_user;
+ event->mmap_locked = mmap_locked;
+
ret = 0;
unlock:
mutex_unlock(&event->mmap_mutex);
+ if (output_event)
+ mutex_unlock(&output_event->mmap_mutex);

if (old_rb)
ring_buffer_put(old_rb);

2013-05-23 14:06:11

by Vince Weaver

Subject: Re: OOPS in perf_mmap_close()


I can confirm your patch avoids the oops on my machine.

It does lead to interesting behavior if I run the sample program
multiple times (with added printfs):

vince@core2:~$ ./perf_mmap_close_bug
mmap1=0x7f06a6e90000
mmap2=0x7f06a6e7f000
vince@core2:~$ ./perf_mmap_close_bug
mmap1=0x7f878a138000
mmap2=0x7f878a127000
vince@core2:~$ ./perf_mmap_close_bug
mmap1=0xffffffffffffffff
Error opening fd2 Invalid argument

and then it never successfully completes again. Is this unexpected
behavior? Or just a normal mmap/perf interaction that I don't understand?

Vince

by Christoph Lameter

Subject: Re: OOPS in perf_mmap_close()

On Thu, 23 May 2013, Peter Zijlstra wrote:

> Right it doesn't. I think the easiest solution for now is to not copy the VMA
> on fork().

Right. Pinned pages are not inherited. If a page is unpinned then that is
going to happen for all address spaces that reference the page.

> But I totally missed patch bc3e53f682d that introduced pinned_vm, AFAICT that
> also wrecked some accounting. We should still account both against
> RLIMIT_MEMLOCK.

The point of the patch was to unwreck accounting. Before the patch mlocked
pages were counted twice which resulted in strange VM scenarios where more
pages were mlocked than memory available. Note that a pinned page may also
be mlocked.

Simply adding the two will reintroduce the problems that were fixed by the
patch.

2013-05-23 15:25:14

by Peter Zijlstra

Subject: Re: OOPS in perf_mmap_close()

On Thu, May 23, 2013 at 02:09:10PM +0000, Christoph Lameter wrote:
> On Thu, 23 May 2013, Peter Zijlstra wrote:
>
> > Right it doesn't. I think the easiest solution for now is to not copy the VMA
> > on fork().
>
> Right. Pinned pages are not inherited. If a page is unpinned then that is
> going to happen for all address spaces that reference the page.
>
> > But I totally missed patch bc3e53f682d that introduced pinned_vm, AFAICT that
> > also wrecked some accounting. We should still account both against
> > RLIMIT_MEMLOCK.
>
> The point of the patch was to unwreck accounting. Before the patch mlocked
> pages were counted twice which resulted in strange VM scenarios where more
> pages were mlocked than memory available. Note that a pinned page may also
> be mlocked.
>
> Simply adding the two will reintroduce the problems that were fixed by the
> patch.

The patch completely fails to explain how RLIMIT_LOCKED is supposed to
deal with pinned vs locked. Perf used to account its pages against
RLIMIT_LOCKED, with the patch it compares pinned against RLIMIT_LOCKED
but completely discards any possible locked pages.

IMO that's broken.

2013-05-23 15:26:23

by Peter Zijlstra

Subject: Re: OOPS in perf_mmap_close()

On Thu, May 23, 2013 at 10:10:36AM -0400, Vince Weaver wrote:
>
> I can confirm your patch avoids the oops on my machine.
>
> It does lead to interesting behavior if I run the sample program
> multiple times (with added printfs):
>
> vince@core2:~$ ./perf_mmap_close_bug
> mmap1=0x7f06a6e90000
> mmap2=0x7f06a6e7f000
> vince@core2:~$ ./perf_mmap_close_bug
> mmap1=0x7f878a138000
> mmap2=0x7f878a127000
> vince@core2:~$ ./perf_mmap_close_bug
> mmap1=0xffffffffffffffff
> Error opening fd2 Invalid argument
>
> and then it never successfully completes again. Is this unexpected
> behavior?

Sounds weird to me, I'll see if I can reproduce/understand.

2013-05-23 15:42:49

by Vince Weaver

Subject: Re: OOPS in perf_mmap_close()

On Thu, 23 May 2013, Peter Zijlstra wrote:

> On Thu, May 23, 2013 at 10:10:36AM -0400, Vince Weaver wrote:
> >
> > I can confirm your patch avoids the oops on my machine.
> >
> > It does lead to interesting behavior if I run the sample program
> > multiple times (with added printfs):
> >
> > vince@core2:~$ ./perf_mmap_close_bug
> > mmap1=0x7f06a6e90000
> > mmap2=0x7f06a6e7f000
> > vince@core2:~$ ./perf_mmap_close_bug
> > mmap1=0x7f878a138000
> > mmap2=0x7f878a127000
> > vince@core2:~$ ./perf_mmap_close_bug
> > mmap1=0xffffffffffffffff
> > Error opening fd2 Invalid argument
> >
> > and then it never successfully completes again. Is this unexpected
> > behavior?
>
> Sounds weird to me, I'll see if I can reproduce/understand.
>

I don't know if it's related, but even with 3.10-rc2 with your patch
applied, after running the fuzzer a bit the system eventually becomes
unstable and starts oopsing like mad, though in non-perf-related ways. Hmmm.

I've set up a serial console and maybe I can get some better messages.

Vince

[ 1188.896010] kernel BUG at mm/slab.c:3005!
[ 1188.896010] invalid opcode: 0000 [#1] SMP
[ 1188.896010] Modules linked in: nfsd auth_rpcgss oid_registry nfs_acl nfs locn
[ 1188.896010] CPU: 1 PID: 3406 Comm: sudo Not tainted 3.10.0-rc2 #3
[ 1188.896010] Hardware name: AOpen DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, BI2
[ 1188.896010] task: ffff880116726770 ti: ffff880117564000 task.ti: ffff88011750
[ 1188.896010] RIP: 0010:[<ffffffff810e9c16>] [<ffffffff810e9c16>] ____cache_a0
[ 1188.896010] RSP: 0018:ffff880117565e88 EFLAGS: 00010096
[ 1188.896010] RAX: ffff880119dbf748 RBX: ffff88011974c200 RCX: 0000000000000007
[ 1188.896010] RDX: 000000000000001e RSI: 0000000000000000 RDI: 0000000000000000
[ 1188.896010] RBP: ffff880119dbf740 R08: ffff880117834c00 R09: ffff880117832000
[ 1188.896010] R10: 00007fffb1360440 R11: 0000000000000246 R12: ffff880119c92a40
[ 1188.896010] R13: 0000000000000000 R14: ffff880117834c00 R15: 0000000000000010
[ 1188.896010] FS: 00007f1339467800(0000) GS:ffff88011fc80000(0000) knlGS:00000
[ 1188.896010] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1188.896010] CR2: 00007f133898a040 CR3: 00000001176ab000 CR4: 00000000000407e0
[ 1188.896010] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1188.896010] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1188.896010] Stack:
[ 1188.896010] 0000000000000001 000080d000000000 ffff880117941c80 ffff880119c90
[ 1188.896010] 00000000000080d0 0000000000000246 00000000000080d0 00007fffb1360
[ 1188.896010] ffffffff810eace7 ffff880117941c80 00000000000003e8 0000000000000
[ 1188.896010] Call Trace:
[ 1188.896010] [<ffffffff810eace7>] ? kmem_cache_alloc+0x47/0xf8
[ 1188.896010] [<ffffffff8103abb6>] ? alloc_uid+0x5c/0xee
[ 1188.896010] [<ffffffff8103efda>] ? set_user+0xd/0x70
[ 1188.896010] [<ffffffff81040529>] ? SyS_setresuid+0xb6/0x113
[ 1188.896010] [<ffffffff81369b92>] ? system_call_fastpath+0x16/0x1b
[ 1188.896010] Code: 20 48 8b 5d 28 48 8d 55 28 c7 45 60 01 00 00 00 48 39 d3 7
[ 1188.896010] RIP [<ffffffff810e9c16>] ____cache_alloc+0x11c/0x290
[ 1188.896010] RSP <ffff880117565e88>
[ 1188.896010] ---[ end trace a14ae9e1a2282660 ]---
[ 1192.279580] ------------[ cut here ]------------
[ 1192.279580] WARNING: at kernel/watchdog.c:245 watchdog_overflow_callback+0x7)
[ 1192.279580] Watchdog detected hard LOCKUP on cpu 1
[ 1192.279580] Modules linked in: nfsd auth_rpcgss oid_registry nfs_acl nfs locn
[ 1192.279580] CPU: 1 PID: 24 Comm: kworker/1:1 Tainted: G D 3.10.0-r3
[ 1192.279580] Hardware name: AOpen DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, BI2
[ 1192.279580] Workqueue: events cache_reap
[ 1192.279580] 0000000000000000 ffffffff8102e205 ffff88011fc87d00 ffff88011a220
[ 1192.279580] ffff88011fc87d50 ffff88011fc87de0 ffff88011fc87ef8 ffffffff81020
[ 1192.279580] ffffffff814ecec5 0000000000000020 ffff88011fc87d60 ffff88011fc80
[ 1192.279580] Call Trace:
[ 1192.279580] <NMI> [<ffffffff8102e205>] ? warn_slowpath_common+0x5b/0x70
[ 1192.279580] [<ffffffff8102e2b0>] ? warn_slowpath_fmt+0x47/0x49
[ 1192.279580] [<ffffffff8108a920>] ? watchdog_overflow_callback+0x7e/0x9d
[ 1192.279580] [<ffffffff810ad1f8>] ? __perf_event_overflow+0x12c/0x1ae
[ 1192.279580] [<ffffffff810ab29b>] ? perf_event_update_userpage+0x12/0xbd
[ 1192.279580] [<ffffffff81011f41>] ? intel_pmu_handle_irq+0x242/0x2aa
[ 1192.279580] [<ffffffff81365a21>] ? nmi_handle.isra.0+0x3c/0x5a
[ 1192.279580] [<ffffffff81365adc>] ? do_nmi+0x9d/0x2ab
[ 1192.279580] [<ffffffff813652b7>] ? end_repeat_nmi+0x1e/0x2e
[ 1192.279580] [<ffffffff810704c6>] ? do_raw_spin_lock+0x15/0x1b
[ 1192.279580] [<ffffffff810704c6>] ? do_raw_spin_lock+0x15/0x1b
[ 1192.279580] [<ffffffff810704c6>] ? do_raw_spin_lock+0x15/0x1b
[ 1192.279580] <<EOE>> [<ffffffff810e944e>] ? drain_array+0x46/0xc1
[ 1192.279580] [<ffffffff810e967d>] ? cache_reap+0xba/0x1b5
[ 1192.279580] [<ffffffff81044e68>] ? process_one_work+0x18b/0x287
[ 1192.279580] [<ffffffff8104530d>] ? worker_thread+0x121/0x1e7
[ 1192.279580] [<ffffffff810451ec>] ? rescuer_thread+0x265/0x265
[ 1192.279580] [<ffffffff810496be>] ? kthread+0x7d/0x85
[ 1192.279580] [<ffffffff81049641>] ? __kthread_parkme+0x59/0x59
[ 1192.279580] [<ffffffff81369aec>] ? ret_from_fork+0x7c/0xb0
[ 1192.279580] [<ffffffff81049641>] ? __kthread_parkme+0x59/0x59
[ 1192.279580] ---[ end trace a14ae9e1a2282661 ]---

by Christoph Lameter

Subject: Re: OOPS in perf_mmap_close()

On Thu, 23 May 2013, Peter Zijlstra wrote:

> The patch completely fails to explain how RLIMIT_LOCKED is supposed to
> deal with pinned vs locked. Perf used to account its pages against
> RLIMIT_LOCKED, with the patch it compares pinned against RLIMIT_LOCKED
> but completely discards any possible locked pages.

Pinned pages are different from mlock. Mlock semantics means that the
pages are kept in memory but the pages are movable (subject to page
migration f.e.).

Pinned pages have to stay where they are since the physical addresses may
be used for device I/O or other stuff.

Both pinned and mlocked pages cannot be evicted from memory. If one wants
to account for unevictable pages then both are contributing. However,
since a pinned page may be mlocked simply adding up the counter may cause
problems. The sum could be used as a worst case estimate though.

We could mlock all pinned pages but then the issue arises on how to track
that properly in order to unpin when the I/O action is done since the app
may have also mlocked pages.

2013-05-23 16:39:14

by Peter Zijlstra

Subject: Re: OOPS in perf_mmap_close()

On Thu, May 23, 2013 at 04:12:14PM +0000, Christoph Lameter wrote:
> On Thu, 23 May 2013, Peter Zijlstra wrote:
>
> > The patch completely fails to explain how RLIMIT_LOCKED is supposed to
> > deal with pinned vs locked. Perf used to account its pages against
> > RLIMIT_LOCKED, with the patch it compares pinned against RLIMIT_LOCKED
> > but completely discards any possible locked pages.
>
> Pinned pages are different from mlock. Mlock semantics means that the
> pages are kept in memory but the pages are movable (subject to page
> migration f.e.).
>
> Pinned pages have to stay where they are since the physical addresses may
> be used for device I/O or other stuff.
>
> Both pinned and mlocked pages cannot be evicted from memory. If one wants
> to account for unevictable pages then both are contributing. However,
> since a pinned page may be mlocked simply adding up the counter may cause
> problems. The sum could be used as a worst case estimate though.
>
> We could mlock all pinned pages but then the issue arises on how to track
> that properly in order to unpin when the I/O action is done since the app
> may have also mlocked pages.

I know all that, and it's completely irrelevant to the discussion.

You cannot simply take away pinned pages from the RLIMIT_MEMLOCK
accounting without mention nor replacement limits.

You now have double the amount of memory you can lose, once to actual
mlock() and once through whatever generates pinned -- if it bothers with
checking limits at all.

Where we had the guarantee that x < y; you did x := x1 + x2; which then
should result in: x1 + x2 < y, instead you did: x1 < y && x2 < y, not
the same and completely wrong.
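
Toy numbers, if it helps (the names match the mm_struct counters but this
is only illustrative arithmetic, not kernel code): with RLIMIT_MEMLOCK at 64
pages each check passes on its own while the total unevictable memory
reaches twice the limit.

#include <stdio.h>

int main(void)
{
	unsigned long limit     = 64;	/* RLIMIT_MEMLOCK in pages */
	unsigned long locked_vm = 64;	/* mlock() checks locked_vm <= limit  -> ok */
	unsigned long pinned_vm = 64;	/* pinning checks pinned_vm <= limit  -> ok */

	/* the old single-counter invariant was locked_vm + pinned_vm <= limit */
	printf("limit=%lu unevictable=%lu\n", limit, locked_vm + pinned_vm);
	/* prints: limit=64 unevictable=128 */
	return 0;
}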

by Christoph Lameter

Subject: Re: OOPS in perf_mmap_close()

On Thu, 23 May 2013, Peter Zijlstra wrote:

> I know all that, and it's completely irrelevant to the discussion.

What you said in the rest of the email seems to indicate that you
still do not know that and I am repeating what I have said before here.

> You now have double the amount of memory you can lose, once to actual
> mlock() and once through whatever generates pinned -- if it bothers with
> checking limits at all.

It was already doubled which was the reason for the patch. The patch
avoided the doubling that we saw and it allowed us to distinguish between
mlocked and pinned pages.

> Where we had the guarantee that x < y; you did x := x1 + x2; which then
> should result in: x1 + x2 < y, instead you did: x1 < y && x2 < y, not
> the same and completely wrong.

We never had that guarantee. We were accounting many pages twice in
the same counter. Once for mlocking and once for pinning. Thus the problem
that the patch addresses. Read the changelog?

There are other sources that cause pages to be not evictable (like f.e.
dirtying). Mlock accounting is not accurate in any case. The mlocked page
limit is per thread which is another issue and so is the vm_pinned
counter. The pages actually may be shared between many processes and the
ownership of those pages is not clear. The accounting for mlock and
pinning also is a bit problematic as a result.

2013-05-23 19:24:43

by Peter Zijlstra

Subject: Re: OOPS in perf_mmap_close()

On Thu, May 23, 2013 at 05:59:10PM +0000, Christoph Lameter wrote:
> On Thu, 23 May 2013, Peter Zijlstra wrote:
>
> > I know all that, and it's completely irrelevant to the discussion.
>
> What you said in the rest of the email seems to indicate that you
> still do not know that and I am repeating what I have said before here.

I'm very much aware of the difference between locked and pinned pages;
and I don't object to accounting them separately per se. I do however
object to making a joke out of resource limits and silently changing
behaviour.

Your changelog (that you seem to be so proud of that you want me to read
it yet again) completely fails to mention what happens to resource
limits.

> > You now have double the amount of memory you can lose, once to actual
> > mlock() and once through whatever generates pinned -- if it bothers with
> > checking limits at all.
>
> It was already doubled which was the reason for the patch. The patch
> avoided the doubling that we saw and it allowed us to distinguish between
> mlocked and pinned pages.

So it was double and now it's still double, so your patch fixed what
exactly?

> > Where we had the guarantee that x < y; you did x := x1 + x2; which then
> > should result in: x1 + x2 < y, instead you did: x1 < y && x2 < y, not
> > the same and completely wrong.
>
> We never had that guarantee.

Rlimits says we have; if it was buggy we should've fixed it, not made it
worse. What's the point of pretending we have RLIMIT_MEMLOCK if we then
feel free to ignore it?

> We were accounting many pages twice in
> the same counter. Once for mlocking and once for pinning. Thus the problem
> that the patch addresses. Read the changelog?

Splitting the counter doesn't magically fix it. Who was accounting
double and why not fix those?

Splitting things doesn't fix anything; the sum is still counting double
and you've made resource limits more complex.

We should make mlock skip over the special pinned vmas instead of
pretending pinned pages shouldn't be part of resource limits.

> There are other sources that cause pages to be not evictable (like f.e.
> dirtying). Mlock accounting is not accurate in any case. The mlocked page
> limit is per thread

It's per process; see how locked_vm and pinned_vm are part of mm_struct
and modified under mmap_sem.

> which is another issue and so is the vm_pinned
> counter. The pages actually may be shared between many processes and the
> ownership of those pages is not clear. The accounting for mlock and
> pinning also is a bit problematic as a result.

I maintain that you cannot simply split a resource counter without
properly explaining what happens to resource limits.

2013-05-23 23:35:46

by Vince Weaver

Subject: Re: OOPS in perf_mmap_close()

On Thu, 23 May 2013, Peter Zijlstra wrote:

> On Thu, May 23, 2013 at 10:10:36AM -0400, Vince Weaver wrote:
> >
> > I can confirm your patch avoids the oops on my machine.
> >
> > It does lead to interesting behavior if I run the sample program
> > multiple times (with added printfs):
> >
> > vince@core2:~$ ./perf_mmap_close_bug
> > mmap1=0x7f06a6e90000
> > mmap2=0x7f06a6e7f000
> > vince@core2:~$ ./perf_mmap_close_bug
> > mmap1=0x7f878a138000
> > mmap2=0x7f878a127000
> > vince@core2:~$ ./perf_mmap_close_bug
> > mmap1=0xffffffffffffffff
> > Error opening fd2 Invalid argument
> >
> > and then it never successfully completes again. Is this unexpected
> > behavior?
>
> Sounds weird to me, I'll see if I can reproduce/understand.

I tracked this down in case you haven't already.

The problem is that with your patch applied, locked_vm is getting
decremented twice by the sample code and going negative. I'm not sure why
this isn't a problem until the third time through. Here are my crude debug
printk results from kernel/events/core.c:

[ 28.684862] user_extra: 17 user_lock_limit: 129
[ 28.698458] user_locked: 17 locked_vm: 0 user_extra 17
[ 28.713853] locked: 0 locked_vm: 0 pinned_vm: 0 extra: 0 lock_limit: 16
[ 28.733728] perf_mmap: locked_vm: 17
[ 28.744509] mmap_close: locked_vm=0
[ 28.754939] mmap_close: locked_vm=-17
[ 29.472741] user_extra: 17 user_lock_limit: 129
[ 29.486332] user_locked: 0 locked_vm: -17 user_extra 17
[ 29.501996] locked: 0 locked_vm: 0 pinned_vm: 0 extra: 0 lock_limit: 16
[ 29.521874] perf_mmap: locked_vm: 0
[ 29.532400] mmap_close: locked_vm=-17
[ 29.543352] mmap_close: locked_vm=-34
[ 30.028236] user_extra: 17 user_lock_limit: 129
[ 30.041835] user_locked: -17 locked_vm: -34 user_extra 17
[ 30.058018] extra: -275 user_locked: -17 user_lock_limit: 258
[ 30.075232] locked: -275 locked_vm: 0 pinned_vm: 0 extra: -275 lock_limit: 16
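
Roughly where those printks sit, reconstructed from the output above rather
than from the exact debug diff (so take the format specifiers as
approximate):

/* in perf_mmap(), around the accounting checks: */
printk("user_extra: %ld user_lock_limit: %ld\n",
       (long)user_extra, (long)user_lock_limit);
printk("user_locked: %ld locked_vm: %ld user_extra %ld\n",
       (long)user_locked, (long)atomic_long_read(&user->locked_vm),
       (long)user_extra);
printk("locked: %ld locked_vm: %ld pinned_vm: %ld extra: %ld lock_limit: %ld\n",
       (long)locked, (long)vma->vm_mm->locked_vm,
       (long)vma->vm_mm->pinned_vm, (long)extra, (long)lock_limit);
/* ...and after the atomic_long_add(user_extra, &user->locked_vm): */
printk("perf_mmap: locked_vm: %ld\n", (long)atomic_long_read(&user->locked_vm));

/* in perf_mmap_close(), after the atomic_long_sub(); it fires once per
 * event that hits perf_mmap_close() at exit, hence two lines per run: */
printk("mmap_close: locked_vm=%ld\n", (long)atomic_long_read(&user->locked_vm));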

Vince

2013-05-24 09:21:19

by Peter Zijlstra

Subject: Re: OOPS in perf_mmap_close()

On Thu, May 23, 2013 at 07:40:12PM -0400, Vince Weaver wrote:
> On Thu, 23 May 2013, Peter Zijlstra wrote:
>
> > On Thu, May 23, 2013 at 10:10:36AM -0400, Vince Weaver wrote:
> > >
> > > I can confirm your patch avoids the oops on my machine.
> > >
> > > It does lead to interesting behavior if I run the sample program
> > > multiple times (with added printfs):
> > >
> > > vince@core2:~$ ./perf_mmap_close_bug
> > > mmap1=0x7f06a6e90000
> > > mmap2=0x7f06a6e7f000
> > > vince@core2:~$ ./perf_mmap_close_bug
> > > mmap1=0x7f878a138000
> > > mmap2=0x7f878a127000
> > > vince@core2:~$ ./perf_mmap_close_bug
> > > mmap1=0xffffffffffffffff
> > > Error opening fd2 Invalid argument
> > >
> > > and then it never successfully completes again. Is this unexpected
> > > behavior?
> >
> > Sounds weird to me, I'll see if I can reproduce/understand.
>
> I tracked this down in case you haven't already.
>
> The problem is that with your patch applied, locked_vm is getting
> decremented twice by the sample code and going negative. I'm not sure why
> this isn't a problem until the third time through. Here are my crude debug
> printk results from kernel/events/core.c:

D'uh, I think I see what's happening... we have split the mmap state
between the ringbuffer and event objects, and since we have two events
and one ringbuffer we're hosed.

Let me try and straighten this out.

2013-05-24 14:01:58

by Peter Zijlstra

Subject: [RFC][PATCH] mm: Fix RLIMIT_MEMLOCK

Subject: [RFC][PATCH] mm: Fix RLIMIT_MEMLOCK

Patch bc3e53f682 ("mm: distinguish between mlocked and pinned pages")
broke RLIMIT_MEMLOCK.

The primary purpose of mm_struct::locked_vm was to act as a resource
counter against RLIMIT_MEMLOCK. By splitting it, the semantics got
greatly fuddled.

The 'bug' fixed was that when you would mlock() IB-pinned memory, the
pages would be counted twice in mm_struct::locked_vm. The only
reason someone would actually do this is with mlockall(), which
basically requires you've disabled RLIMIT_MEMLOCK.

However, by splitting the counter into two counters, RLIMIT_MEMLOCK has
no single resource counter; instead the patch checks locked and pinned
pages against RLIMIT_MEMLOCK separately, effectively doubling
RLIMIT_MEMLOCK.

I claim that the patch made a bad situation worse by shifting the bug
from an unlikely corner case to a far more likely scenario.

This patch proposes to properly fix the problem by introducing
VM_PINNED. This also provides the groundwork for a possible mpin()
syscall or MADV_PIN -- although these are not included.

It recognises that pinned page semantics are a strict super-set of
locked page semantics -- a pinned page will not generate major faults
(and thus satisfies mlock() requirements).

The patch has some rough edges; primarily, IB doesn't compile because I got
a little lost trying to find the process address for a number of
*_release_user_pages() calls, which now require it.

Furthermore, vma expansion (stack) vs VM_PINNED is 'broken' in that I
don't think it will behave properly if you PIN the bottom (or top if
your arch is so inclined) stack page.

If people find this approach unworkable, I request we revert the above
mentioned patch to at least restore RLIMIT_MEMLOCK to a usable state
again.

Not-signed-off-by: Peter Zijlstra <[email protected]>
---
drivers/infiniband/core/umem.c | 41 ++++++------
drivers/infiniband/hw/ipath/ipath_file_ops.c | 2 +-
drivers/infiniband/hw/ipath/ipath_kernel.h | 4 +-
drivers/infiniband/hw/ipath/ipath_user_pages.c | 29 ++++++---
drivers/infiniband/hw/qib/qib.h | 2 +-
drivers/infiniband/hw/qib/qib_file_ops.c | 2 +-
drivers/infiniband/hw/qib/qib_user_pages.c | 18 ++++--
include/linux/mm.h | 3 +
include/linux/mm_types.h | 5 ++
include/linux/perf_event.h | 1 -
include/rdma/ib_umem.h | 3 +-
kernel/events/core.c | 17 +++--
kernel/fork.c | 2 +
mm/mlock.c | 87 +++++++++++++++++++++-----
mm/mmap.c | 16 +++--
15 files changed, 167 insertions(+), 65 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index a841123..f8f47dc 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -137,17 +137,22 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,

down_write(&current->mm->mmap_sem);

- locked = npages + current->mm->pinned_vm;
+ locked = npages + mm_locked_pages(current->mm);
lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
ret = -ENOMEM;
- goto out;
+ goto err;
}

cur_base = addr & PAGE_MASK;
+ umem->start_addr = cur_base;
+ umem->end_addr = cur_base + npages;
+
+ ret = mm_mpin(umem->start_addr, umem->end_addr);
+ if (ret)
+ goto err;

- ret = 0;
while (npages) {
ret = get_user_pages(current, current->mm, cur_base,
min_t(unsigned long, npages,
@@ -155,7 +160,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
1, !umem->writable, page_list, vma_list);

if (ret < 0)
- goto out;
+ goto err_unpin;

cur_base += ret * PAGE_SIZE;
npages -= ret;
@@ -168,7 +173,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
GFP_KERNEL);
if (!chunk) {
ret = -ENOMEM;
- goto out;
+ goto err_unpin;
}

chunk->nents = min_t(int, ret, IB_UMEM_MAX_PAGE_CHUNK);
@@ -191,7 +196,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
kfree(chunk);

ret = -ENOMEM;
- goto out;
+ goto err_unpin;
}

ret -= chunk->nents;
@@ -202,19 +207,23 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
ret = 0;
}

-out:
- if (ret < 0) {
- __ib_umem_release(context->device, umem, 0);
- kfree(umem);
- } else
- current->mm->pinned_vm = locked;
+ if (ret < 0)
+ goto err_unpin;

+unlock:
up_write(&current->mm->mmap_sem);
if (vma_list)
free_page((unsigned long) vma_list);
free_page((unsigned long) page_list);

return ret < 0 ? ERR_PTR(ret) : umem;
+
+err_unpin:
+ mm_munpin(umem->start_addr, umem->end_addr);
+err:
+ __ib_umem_release(context->device, umem, 0);
+ kfree(umem);
+ goto unlock;
}
EXPORT_SYMBOL(ib_umem_get);

@@ -223,7 +232,7 @@ static void ib_umem_account(struct work_struct *work)
struct ib_umem *umem = container_of(work, struct ib_umem, work);

down_write(&umem->mm->mmap_sem);
- umem->mm->pinned_vm -= umem->diff;
+ mm_munpin(umem->start_addr, umem->end_addr);
up_write(&umem->mm->mmap_sem);
mmput(umem->mm);
kfree(umem);
@@ -237,7 +246,6 @@ void ib_umem_release(struct ib_umem *umem)
{
struct ib_ucontext *context = umem->context;
struct mm_struct *mm;
- unsigned long diff;

__ib_umem_release(umem->context->device, umem, 1);

@@ -247,8 +255,6 @@ void ib_umem_release(struct ib_umem *umem)
return;
}

- diff = PAGE_ALIGN(umem->length + umem->offset) >> PAGE_SHIFT;
-
/*
* We may be called with the mm's mmap_sem already held. This
* can happen when a userspace munmap() is the call that drops
@@ -261,7 +267,6 @@ void ib_umem_release(struct ib_umem *umem)
if (!down_write_trylock(&mm->mmap_sem)) {
INIT_WORK(&umem->work, ib_umem_account);
umem->mm = mm;
- umem->diff = diff;

queue_work(ib_wq, &umem->work);
return;
@@ -269,7 +274,7 @@ void ib_umem_release(struct ib_umem *umem)
} else
down_write(&mm->mmap_sem);

- current->mm->pinned_vm -= diff;
+ mm_munpin(umem->start_addr, umem->end_addr);
up_write(&mm->mmap_sem);
mmput(mm);
kfree(umem);
diff --git a/drivers/infiniband/hw/ipath/ipath_file_ops.c b/drivers/infiniband/hw/ipath/ipath_file_ops.c
index 6d7f453..16a0ad2 100644
--- a/drivers/infiniband/hw/ipath/ipath_file_ops.c
+++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c
@@ -456,7 +456,7 @@ static int ipath_tid_update(struct ipath_portdata *pd, struct file *fp,
ipath_stats.sps_pageunlocks++;
}
}
- ipath_release_user_pages(pagep, cnt);
+ ipath_release_user_pages(pagep, ti->tidvaddr, cnt);
} else {
/*
* Copy the updated array, with ipath_tid's filled in, back
diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h
index 6559af6..448bc97 100644
--- a/drivers/infiniband/hw/ipath/ipath_kernel.h
+++ b/drivers/infiniband/hw/ipath/ipath_kernel.h
@@ -1082,8 +1082,8 @@ static inline void ipath_sdma_desc_unreserve(struct ipath_devdata *dd, u16 cnt)
#define IPATH_DFLT_RCVHDRSIZE 9

int ipath_get_user_pages(unsigned long, size_t, struct page **);
-void ipath_release_user_pages(struct page **, size_t);
-void ipath_release_user_pages_on_close(struct page **, size_t);
+void ipath_release_user_pages(struct page **, unsigned long, size_t);
+void ipath_release_user_pages_on_close(struct page **, unsigned long, size_t);
int ipath_eeprom_read(struct ipath_devdata *, u8, void *, int);
int ipath_eeprom_write(struct ipath_devdata *, u8, const void *, int);
int ipath_tempsense_read(struct ipath_devdata *, u8 regnum);
diff --git a/drivers/infiniband/hw/ipath/ipath_user_pages.c b/drivers/infiniband/hw/ipath/ipath_user_pages.c
index dc66c45..e368614 100644
--- a/drivers/infiniband/hw/ipath/ipath_user_pages.c
+++ b/drivers/infiniband/hw/ipath/ipath_user_pages.c
@@ -39,7 +39,7 @@
#include "ipath_kernel.h"

static void __ipath_release_user_pages(struct page **p, size_t num_pages,
- int dirty)
+ int dirty)
{
size_t i;

@@ -57,12 +57,15 @@ static int __ipath_get_user_pages(unsigned long start_page, size_t num_pages,
struct page **p, struct vm_area_struct **vma)
{
unsigned long lock_limit;
+ unsigned long locked;
size_t got;
int ret;

+ locked = mm_locked_pages(current->mm);
+ locked += num_pages;
lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

- if (num_pages > lock_limit) {
+ if (locked > lock_limit) {
ret = -ENOMEM;
goto bail;
}
@@ -70,6 +73,10 @@ static int __ipath_get_user_pages(unsigned long start_page, size_t num_pages,
ipath_cdbg(VERBOSE, "pin %lx pages from vaddr %lx\n",
(unsigned long) num_pages, start_page);

+ ret = mm_mpin(start_page, start_page + num_pages);
+ if (ret)
+ goto bail;
+
for (got = 0; got < num_pages; got += ret) {
ret = get_user_pages(current, current->mm,
start_page + got * PAGE_SIZE,
@@ -78,14 +85,12 @@ static int __ipath_get_user_pages(unsigned long start_page, size_t num_pages,
if (ret < 0)
goto bail_release;
}
-
- current->mm->pinned_vm += num_pages;
-
ret = 0;
goto bail;

bail_release:
__ipath_release_user_pages(p, got, 0);
+ mm_munpin(start_page, start_page + num_pages);
bail:
return ret;
}
@@ -172,13 +177,13 @@ int ipath_get_user_pages(unsigned long start_page, size_t num_pages,
return ret;
}

-void ipath_release_user_pages(struct page **p, size_t num_pages)
+void ipath_release_user_pages(struct page **p, unsigned long start_page,
+ size_t num_pages)
{
down_write(&current->mm->mmap_sem);

__ipath_release_user_pages(p, num_pages, 1);
-
- current->mm->pinned_vm -= num_pages;
+ mm_munpin(start_page, start_page + num_pages);

up_write(&current->mm->mmap_sem);
}
@@ -186,6 +191,7 @@ void ipath_release_user_pages(struct page **p, size_t num_pages)
struct ipath_user_pages_work {
struct work_struct work;
struct mm_struct *mm;
+ unsigned long start_page;
unsigned long num_pages;
};

@@ -195,13 +201,15 @@ static void user_pages_account(struct work_struct *_work)
container_of(_work, struct ipath_user_pages_work, work);

down_write(&work->mm->mmap_sem);
- work->mm->pinned_vm -= work->num_pages;
+ mm_munpin(work->start_page, work->start_page + work->num_pages);
up_write(&work->mm->mmap_sem);
mmput(work->mm);
kfree(work);
}

-void ipath_release_user_pages_on_close(struct page **p, size_t num_pages)
+void ipath_release_user_pages_on_close(struct page **p,
+ unsigned long start_page,
+ size_t num_pages)
{
struct ipath_user_pages_work *work;
struct mm_struct *mm;
@@ -218,6 +226,7 @@ void ipath_release_user_pages_on_close(struct page **p, size_t num_pages)

INIT_WORK(&work->work, user_pages_account);
work->mm = mm;
+ work->start_page = start_page;
work->num_pages = num_pages;

queue_work(ib_wq, &work->work);
diff --git a/drivers/infiniband/hw/qib/qib.h b/drivers/infiniband/hw/qib/qib.h
index 4d11575..a6213e2 100644
--- a/drivers/infiniband/hw/qib/qib.h
+++ b/drivers/infiniband/hw/qib/qib.h
@@ -1343,7 +1343,7 @@ void qib_sdma_process_event(struct qib_pportdata *, enum qib_sdma_events);
#define QIB_RCVHDR_ENTSIZE 32

int qib_get_user_pages(unsigned long, size_t, struct page **);
-void qib_release_user_pages(struct page **, size_t);
+void qib_release_user_pages(struct page **, unsigned long, size_t);
int qib_eeprom_read(struct qib_devdata *, u8, void *, int);
int qib_eeprom_write(struct qib_devdata *, u8, const void *, int);
u32 __iomem *qib_getsendbuf_range(struct qib_devdata *, u32 *, u32, u32);
diff --git a/drivers/infiniband/hw/qib/qib_file_ops.c b/drivers/infiniband/hw/qib/qib_file_ops.c
index b56c942..ceff69f 100644
--- a/drivers/infiniband/hw/qib/qib_file_ops.c
+++ b/drivers/infiniband/hw/qib/qib_file_ops.c
@@ -423,7 +423,7 @@ static int qib_tid_update(struct qib_ctxtdata *rcd, struct file *fp,
dd->pageshadow[ctxttid + tid] = NULL;
}
}
- qib_release_user_pages(pagep, cnt);
+ qib_release_user_pages(pagep, ti->tidvaddr, cnt);
} else {
/*
* Copy the updated array, with qib_tid's filled in, back
diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c
index 2bc1d2b..e7b0e0d 100644
--- a/drivers/infiniband/hw/qib/qib_user_pages.c
+++ b/drivers/infiniband/hw/qib/qib_user_pages.c
@@ -55,16 +55,24 @@ static int __qib_get_user_pages(unsigned long start_page, size_t num_pages,
struct page **p, struct vm_area_struct **vma)
{
unsigned long lock_limit;
+ unsigned long locked;
size_t got;
int ret;

+ locked = mm_locked_pages(current->mm);
+ locked += num_pages;
+
lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

- if (num_pages > lock_limit && !capable(CAP_IPC_LOCK)) {
+ if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
ret = -ENOMEM;
goto bail;
}

+ ret = mm_mpin(start_page, start_page + num_pages);
+ if (ret)
+ goto bail;
+
for (got = 0; got < num_pages; got += ret) {
ret = get_user_pages(current, current->mm,
start_page + got * PAGE_SIZE,
@@ -74,13 +82,12 @@ static int __qib_get_user_pages(unsigned long start_page, size_t num_pages,
goto bail_release;
}

- current->mm->pinned_vm += num_pages;
-
ret = 0;
goto bail;

bail_release:
__qib_release_user_pages(p, got, 0);
+ mm_munpin(start_page, start_page + num_pages);
bail:
return ret;
}
@@ -143,15 +150,16 @@ int qib_get_user_pages(unsigned long start_page, size_t num_pages,
return ret;
}

-void qib_release_user_pages(struct page **p, size_t num_pages)
+void qib_release_user_pages(struct page **p, unsigned long start_page, size_t num_pages)
{
if (current->mm) /* during close after signal, mm can be NULL */
down_write(&current->mm->mmap_sem);

__qib_release_user_pages(p, num_pages, 1);

+
if (current->mm) {
- current->mm->pinned_vm -= num_pages;
+ mm_munpin(start_page, start_page + num_pages);
up_write(&current->mm->mmap_sem);
}
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e0c8528..de10e5b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -90,6 +90,7 @@ extern unsigned int kobjsize(const void *objp);
#define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */
#define VM_DENYWRITE 0x00000800 /* ETXTBSY on write attempts.. */

+#define VM_PINNED 0x00001000
#define VM_LOCKED 0x00002000
#define VM_IO 0x00004000 /* Memory mapped I/O or similar */

@@ -1526,6 +1527,8 @@ static inline void mm_populate(unsigned long addr, unsigned long len)
/* Ignore errors */
(void) __mm_populate(addr, len, 1);
}
+extern int mm_mpin(unsigned long start, unsigned long end);
+extern int mm_munpin(unsigned long start, unsigned long end);
#else
static inline void mm_populate(unsigned long addr, unsigned long len) {}
#endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index ace9a5f..3e3475b 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -456,4 +456,9 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
return mm->cpu_vm_mask_var;
}

+static inline unsigned long mm_locked_pages(struct mm_struct *mm)
+{
+ return mm->pinned_vm + mm->locked_vm;
+}
+
#endif /* _LINUX_MM_TYPES_H */
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index f463a46..1003c71 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -389,7 +389,6 @@ struct perf_event {
/* mmap bits */
struct mutex mmap_mutex;
atomic_t mmap_count;
- int mmap_locked;
struct user_struct *mmap_user;
struct ring_buffer *rb;
struct list_head rb_entry;
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index 9ee0d2e..4f3c802 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -41,6 +41,8 @@ struct ib_ucontext;

struct ib_umem {
struct ib_ucontext *context;
+ unsigned long start_addr;
+ unsigned long nr_pages;
size_t length;
int offset;
int page_size;
@@ -49,7 +51,6 @@ struct ib_umem {
struct list_head chunk_list;
struct work_struct work;
struct mm_struct *mm;
- unsigned long diff;
};

struct ib_umem_chunk {
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9dc297f..ceb8905 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3617,7 +3617,6 @@ static void perf_mmap_close(struct vm_area_struct *vma)
struct ring_buffer *rb = event->rb;

atomic_long_sub((size >> PAGE_SHIFT) + 1, &user->locked_vm);
- vma->vm_mm->pinned_vm -= event->mmap_locked;
rcu_assign_pointer(event->rb, NULL);
ring_buffer_detach(event, rb);
mutex_unlock(&event->mmap_mutex);
@@ -3699,7 +3698,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)

lock_limit = rlimit(RLIMIT_MEMLOCK);
lock_limit >>= PAGE_SHIFT;
- locked = vma->vm_mm->pinned_vm + extra;
+ locked = mm_locked_pages(vma->vm_mm) + extra;

if ((locked > lock_limit) && perf_paranoid_tracepoint_raw() &&
!capable(CAP_IPC_LOCK)) {
@@ -3723,9 +3722,8 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
rcu_assign_pointer(event->rb, rb);

atomic_long_add(user_extra, &user->locked_vm);
- event->mmap_locked = extra;
event->mmap_user = get_current_user();
- vma->vm_mm->pinned_vm += event->mmap_locked;
+ vma->vm_mm->pinned_vm += extra;

perf_event_update_userpage(event);

@@ -3734,7 +3732,16 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
atomic_inc(&event->mmap_count);
mutex_unlock(&event->mmap_mutex);

- vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
+ /*
+ * We're a vma of pinned pages; add special to avoid MADV_DODUMP and
+ * mlock(). mlock()+munlock() could accidentally put these pages onto
+ * the LRU, we wouldn't want that; also it implies VM_DONTEXPAND
+ * which we need as well.
+ *
+ * VM_PINNED will make munmap automagically subtract
+ * mm_struct::pinned_vm.
+ */
+ vma->vm_flags |= VM_PINNED | VM_SPECIAL | VM_DONTDUMP;
vma->vm_ops = &perf_mmap_vmops;

return ret;
diff --git a/kernel/fork.c b/kernel/fork.c
index 987b28a..815f196 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -411,6 +411,8 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
if (anon_vma_fork(tmp, mpnt))
goto fail_nomem_anon_vma_fork;
tmp->vm_flags &= ~VM_LOCKED;
+ if (tmp->vm_flags & VM_PINNED)
+ mm->pinned_vm += vma_pages(tmp);
tmp->vm_next = tmp->vm_prev = NULL;
file = tmp->vm_file;
if (file) {
diff --git a/mm/mlock.c b/mm/mlock.c
index 79b7cf7..335c705 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -277,9 +277,8 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
{
struct mm_struct *mm = vma->vm_mm;
pgoff_t pgoff;
- int nr_pages;
+ int nr_pages, nr_locked, nr_pinned;
int ret = 0;
- int lock = !!(newflags & VM_LOCKED);

if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm))
@@ -310,9 +309,49 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
* Keep track of amount of locked VM.
*/
nr_pages = (end - start) >> PAGE_SHIFT;
- if (!lock)
- nr_pages = -nr_pages;
- mm->locked_vm += nr_pages;
+
+ /*
+ * We should only account pages once, if VM_PINNED is set pages are
+ * accounted in mm_struct::pinned_vm, otherwise if VM_LOCKED is set,
+ * we'll account them in mm_struct::locked_vm.
+ *
+ * PL := vma->vm_flags
+ * PL' := newflags
+ * PLd := {pinned,locked}_vm delta
+ *
+ * PL->PL' PLd
+ * -----------
+ * 00 01 0+
+ * 00 10 +0
+ * 01 11 +-
+ * 01 00 0-
+ * 10 00 -0
+ * 10 11 00
+ * 11 01 -+
+ * 11 10 00
+ */
+
+ nr_pinned = nr_locked = 0;
+
+ if ((vma->vm_flags ^ newflags) & VM_PINNED) {
+ if (vma->vm_flags & VM_PINNED)
+ nr_pinned = -nr_pages;
+ else
+ nr_pinned = nr_pages;
+ }
+
+ if (vma->vm_flags & VM_PINNED) {
+ if ((newflags & (VM_PINNED|VM_LOCKED)) == VM_LOCKED)
+ nr_locked = nr_pages;
+ } else {
+ if (vma->vm_flags & VM_LOCKED)
+ nr_locked = -nr_pages;
+ else if (newflags & VM_LOCKED)
+ nr_locked = nr_pages;
+ }
+
+ mm->pinned_vm += nr_pinned;
+ mm->locked_vm += nr_locked;

/*
* vm_flags is protected by the mmap_sem held in write mode.
@@ -320,7 +359,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
* set VM_LOCKED, __mlock_vma_pages_range will bring it back.
*/

- if (lock)
+ if (((vma->vm_flags ^ newflags) & VM_PINNED) || (newflags & VM_LOCKED))
vma->vm_flags = newflags;
else
munlock_vma_pages_range(vma, start, end);
@@ -330,7 +369,10 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
return ret;
}

-static int do_mlock(unsigned long start, size_t len, int on)
+#define MLOCK_F_ON 0x01
+#define MLOCK_F_PIN 0x02
+
+static int do_mlock(unsigned long start, size_t len, unsigned int flags)
{
unsigned long nstart, end, tmp;
struct vm_area_struct * vma, * prev;
@@ -352,13 +394,18 @@ static int do_mlock(unsigned long start, size_t len, int on)
prev = vma;

for (nstart = start ; ; ) {
- vm_flags_t newflags;
+ vm_flags_t newflags = vma->vm_flags;
+ vm_flags_t flag = VM_LOCKED;

- /* Here we know that vma->vm_start <= nstart < vma->vm_end. */
+ if (flags & MLOCK_F_PIN)
+ flag = VM_PINNED;

- newflags = vma->vm_flags & ~VM_LOCKED;
- if (on)
- newflags |= VM_LOCKED;
+ if (flags & MLOCK_F_ON)
+ newflags |= flag;
+ else
+ newflags &= ~flag;
+
+ /* Here we know that vma->vm_start <= nstart < vma->vm_end. */

tmp = vma->vm_end;
if (tmp > end)
@@ -381,6 +428,18 @@ static int do_mlock(unsigned long start, size_t len, int on)
return error;
}

+int mm_mpin(unsigned long start, unsigned long end)
+{
+ return do_mlock(start, end, MLOCK_F_ON | MLOCK_F_PIN);
+}
+EXPORT_SYMBOL_GPL(mm_mpin);
+
+int mm_munpin(unsigned long start, unsigned long end)
+{
+ return do_mlock(start, end, MLOCK_F_PIN);
+}
+EXPORT_SYMBOL_GPL(mm_munpin);
+
/*
* __mm_populate - populate and/or mlock pages within a range of address space.
*
@@ -460,14 +519,14 @@ SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
start &= PAGE_MASK;

locked = len >> PAGE_SHIFT;
- locked += current->mm->locked_vm;
+ locked += mm_locked_pages(current->mm);

lock_limit = rlimit(RLIMIT_MEMLOCK);
lock_limit >>= PAGE_SHIFT;

/* check against resource limits */
if ((locked <= lock_limit) || capable(CAP_IPC_LOCK))
- error = do_mlock(start, len, 1);
+ error = do_mlock(start, len, MLOCK_F_ON);
up_write(&current->mm->mmap_sem);
if (!error)
error = __mm_populate(start, len, 0);
diff --git a/mm/mmap.c b/mm/mmap.c
index f681e18..5cd10c4 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1258,7 +1258,7 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
if (vm_flags & VM_LOCKED) {
unsigned long locked, lock_limit;
locked = len >> PAGE_SHIFT;
- locked += mm->locked_vm;
+ locked += mm_locked_pages(mm);
lock_limit = rlimit(RLIMIT_MEMLOCK);
lock_limit >>= PAGE_SHIFT;
if (locked > lock_limit && !capable(CAP_IPC_LOCK))
@@ -2070,7 +2070,7 @@ static int acct_stack_growth(struct vm_area_struct *vma, unsigned long size, uns
if (vma->vm_flags & VM_LOCKED) {
unsigned long locked;
unsigned long limit;
- locked = mm->locked_vm + grow;
+ locked = mm_locked_pages(mm) + grow;
limit = ACCESS_ONCE(rlim[RLIMIT_MEMLOCK].rlim_cur);
limit >>= PAGE_SHIFT;
if (locked > limit && !capable(CAP_IPC_LOCK))
@@ -2547,13 +2547,17 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
/*
* unlock any mlock()ed ranges before detaching vmas
*/
- if (mm->locked_vm) {
+ if (mm->locked_vm || mm->pinned_vm) {
struct vm_area_struct *tmp = vma;
while (tmp && tmp->vm_start < end) {
- if (tmp->vm_flags & VM_LOCKED) {
+ if (tmp->vm_flags & VM_PINNED)
+ mm->pinned_vm -= vma_pages(tmp);
+ else if (tmp->vm_flags & VM_LOCKED)
mm->locked_vm -= vma_pages(tmp);
+
+ if (tmp->vm_flags & VM_LOCKED)
munlock_vma_pages_all(tmp);
- }
+
tmp = tmp->vm_next;
}
}
@@ -2628,7 +2632,7 @@ static unsigned long do_brk(unsigned long addr, unsigned long len)
if (mm->def_flags & VM_LOCKED) {
unsigned long locked, lock_limit;
locked = len >> PAGE_SHIFT;
- locked += mm->locked_vm;
+ locked += mm_locked_pages(mm);
lock_limit = rlimit(RLIMIT_MEMLOCK);
lock_limit >>= PAGE_SHIFT;
if (locked > lock_limit && !capable(CAP_IPC_LOCK))

Subject: Re: [RFC][PATCH] mm: Fix RLIMIT_MEMLOCK

On Fri, 24 May 2013, Peter Zijlstra wrote:

> Patch bc3e53f682 ("mm: distinguish between mlocked and pinned pages")
> broke RLIMIT_MEMLOCK.

Nope the patch fixed a problem with double accounting.

The problem that we seem to have is to define what mlocked and pinned mean
and how this relates to RLIMIT_MEMLOCK.

mlocked pages are pages that are movable (not pinned!!!) and that are
marked in some way by user space actions as mlocked (POSIX semantics).
They are marked with a special page flag (PG_mlocked).

Pinned pages are pages that have an elevated refcount because the hardware
needs to use these pages for I/O. The elevated refcount may be temporary
(then we dont care about this) or for a longer time (such as the memory
registration of the IB subsystem). That is when we account the memory as
pinned. The elevated refcount stops page migration and other things from
trying to move that memory.

Pages can be both pinned and mlocked. Before my patch those two
issues were conflated since the same counter was used and therefore these
pages were counted twice. If an RDMA application was running using
mlockall() and was performing large scale I/O then the counters could show
extraordinary large numbers and the VM would start to behave erratically.
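
To make that scenario concrete, the sequence looks roughly like this (a
simplified sketch; the RDMA setup and all error handling are elided):

#include <sys/mman.h>
#include <infiniband/verbs.h>

/* Sketch of the double-accounting scenario described above. */
static struct ibv_mr *register_locked_buffer(struct ibv_pd *pd,
					     void *buf, size_t len)
{
	/* Charge every resident page to mm->locked_vm
	 * (PG_mlocked, still movable). */
	mlockall(MCL_CURRENT | MCL_FUTURE);

	/* Register the same buffer for RDMA: get_user_pages() pins the
	 * pages (elevated refcount, not movable).  Before bc3e53f682
	 * this charged the same counter a second time; afterwards it
	 * goes to mm->pinned_vm instead. */
	return ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
}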

It is important for the VM to know which pages cannot be evicted but that
involves many more pages due to dirty pages etc etc.

So far the assumption has been that RLIMIT_MEMLOCK is a limit on the pages
that userspace has mlocked.

You want the counter to mean something different it seems. What is it?

I think we need to be first clear on what we want to accomplish and what
these counters actually should count before changing things.

Certainly would appreciate improvements in this area but resurrecting the
conflation between mlocked and pinned pages is not the way to go.

> This patch proposes to properly fix the problem by introducing
> VM_PINNED. This also provides the groundwork for a possible mpin()
> syscall or MADV_PIN -- although these are not included.

Maybe add a new PIN page flag? Pages are not pinned per vma as the patch
seems to assume.

> It recognises that pinned page semantics are a strict super-set of
> locked page semantics -- a pinned page will not generate major faults
> (and thus satisfies mlock() requirements).

Not exactly true. Pinned pages may not have the mlocked flag set and they
are not managed on the unevictable LRU lists of the MM.

> If people find this approach unworkable, I request we revert the above
> mentioned patch to at least restore RLIMIT_MEMLOCK to a usable state
> again.

Cannot do that. This will cause the breakage that the patch was fixing to
resurface.

2013-05-26 01:11:46

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC][PATCH] mm: Fix RLIMIT_MEMLOCK

On Fri, May 24, 2013 at 11:40 AM, Christoph Lameter <[email protected]> wrote:
> On Fri, 24 May 2013, Peter Zijlstra wrote:
>
>> Patch bc3e53f682 ("mm: distinguish between mlocked and pinned pages")
>> broke RLIMIT_MEMLOCK.
>
> Nope the patch fixed a problem with double accounting.
>
> The problem that we seem to have is to define what mlocked and pinned mean
> and how this relates to RLIMIT_MEMLOCK.
>
> mlocked pages are pages that are movable (not pinned!!!) and that are
> marked in some way by user space actions as mlocked (POSIX semantics).
> They are marked with a special page flag (PG_mlocked).
>
> Pinned pages are pages that have an elevated refcount because the hardware
> needs to use these pages for I/O. The elevated refcount may be temporary
> (then we dont care about this) or for a longer time (such as the memory
> registration of the IB subsystem). That is when we account the memory as
> pinned. The elevated refcount stops page migration and other things from
> trying to move that memory.
>
> Pages can be both pinned and mlocked. Before my patch those two
> issues were conflated since the same counter was used and therefore these
> pages were counted twice. If an RDMA application was running using
> mlockall() and was performing large scale I/O then the counters could show
> extraordinary large numbers and the VM would start to behave erratically.
>
> It is important for the VM to know which pages cannot be evicted but that
> involves many more pages due to dirty pages etc etc.
>
> So far the assumption has been that RLIMIT_MEMLOCK is a limit on the pages
> that userspace has mlocked.
>
> You want the counter to mean something different it seems. What is it?
>
> I think we need to be first clear on what we want to accomplish and what
> these counters actually should count before changing things.

Hm.
If pinned and mlocked are intentionally completely different things, why does
IB use RLIMIT_MEMLOCK? Why doesn't IB use an IB-specific limit, and why does
only IB raise the number of pinned pages while other gup users don't?
I can't guess the IB folks' intent.

And now even the IB code has a duplicated RLIMIT_MEMLOCK check, and at least
__ipath_get_user_pages() forgets to check capable(CAP_IPC_LOCK).
That's bad.


> Certainly would appreciate improvements in this area but resurrecting the
> conflation between mlocked and pinned pages is not the way to go.
>
>> This patch proposes to properly fix the problem by introducing
>> VM_PINNED. This also provides the groundwork for a possible mpin()
>> syscall or MADV_PIN -- although these are not included.
>
> Maybe add a new PIN page flag? Pages are not pinned per vma as the patch
> seems to assume.

Generically, you are right. But if VM_PINNED is really only for IB, this is
an acceptable limitation. They can split the vma for their own purposes.

Anyway, I agree we should clearly understand the semantics of IB pinning and
the userland usage and assumptions.

2013-05-27 06:49:10

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH] mm: Fix RLIMIT_MEMLOCK

On Fri, May 24, 2013 at 03:40:26PM +0000, Christoph Lameter wrote:
> On Fri, 24 May 2013, Peter Zijlstra wrote:
>
> > Patch bc3e53f682 ("mm: distinguish between mlocked and pinned pages")
> > broke RLIMIT_MEMLOCK.
>
> Nope the patch fixed a problem with double accounting.

And introduces another problem by now under-counting in a much more
likely scenario.

Before your patch pinned was included in locked and thus RLIMIT_MEMLOCK
had a single resource counter. After your patch RLIMIT_MEMLOCK is
applied separately to both -- more or less.

Your patch changes RLIMIT_MEMLOCK semantics and doesn't mention this in the
Changelog, which makes one think this was unintentional. And you're
refusing to acknowledge that this might be a problem.

The fact that RLIMIT_MEMLOCK is the only consumer of the counter tells
me you simply didn't seem to care and did a sloppy patch 'solving' your
immediate issue.

The below discussion of what mlock vs pinned should mean for
RLIMIT_MEMLOCK should have been part of that patch, since it changed the
previously established behaviour.

> The problem that we seem to have is to define what mlocked and pinned mean
> and how this relates to RLIMIT_MEMLOCK.
>
> mlocked pages are pages that are movable (not pinned!!!) and that are
> marked in some way by user space actions as mlocked (POSIX semantics).
> They are marked with a special page flag (PG_mlocked).

NO, mlocked pages are pages that do not leave core memory; IOW do not
cause major faults. Pinning pages is a perfectly spec compliant mlock()
implementation.

See: http://pubs.opengroup.org/onlinepubs/7908799/xsh/mlock.html

The fact that we _want_ a weaker implementation because we like to move
them despite them being mlocked is us playing weasel words with the spec
:-) So much so that we have to introduce a means of actually pinning
them _before_ we start to actually migrate mlocked pages (or did
somebody actually change that too?).

RT people will be unhappy if they hit minor faults or have to wait
behind page-migration (esp. since the migrate code only removes the
migration PTE after all pages in the batch are migrated).

BTW, notice that mlock() is part of the REALTIME spec, and while its
wording is such that it allows us to make it unsuitable for actual
real-time usage, this goes against the intent of the spec.

Now in an earlier discussion on the issue 'we' (I can't remember if you
participated there, I remember Mel and Kosaki-San) agreed that for
'normal' (read not whacky real-time people) mlock can still be useful
and we should introduce a pinned user API for the RT people.

> Pinned pages are pages that have an elevated refcount because the hardware
> needs to use these pages for I/O. The elevated refcount may be temporary
> (then we dont care about this) or for a longer time (such as the memory
> registration of the IB subsystem). That is when we account the memory as
> pinned. The elevated refcount stops page migration and other things from
> trying to move that memory.

Again I _know_ that!!!

> Pages can be both pinned and mlocked.

Right, but apart from mlockall() this is a highly unlikely situation to
actually occur. And if you're using mlockall() you've effectively
disabled RLIMIT_MEMLOCK and thus nobody cares if the resource counter
goes funny.

> Before my patch those two
> issues were conflated since the same counter was used and therefore these
> pages were counted twice. If an RDMA application was running using
> mlockall() and was performing large scale I/O then the counters could show
> extraordinary large numbers and the VM would start to behave erratically.
>
> It is important for the VM to know which pages cannot be evicted but that
> involves many more pages due to dirty pages etc etc.

But the VM as such doesn't care; the VM doesn't use either locked_vm or
pinned_vm for anything! The only user is RLIMIT_MEMLOCK and you silently
changed semantics.

> So far the assumption has been that RLIMIT_MEMLOCK is a limit on the pages
> that userspace has mlocked.
>
> You want the counter to mean something different it seems. What is it?

It would've been so much better if you'd asked me that before you mucked
it up.

The intent of RLIMIT_MEMLOCK is clearly to limit the amount of memory
the user can lock into core memory as per the direct definition of
mlock. By this definition pinned should be included. This leaves us to
solve the issue of pinned mlocked memory.

> I think we need to be first clear on what we want to accomplish and what
> these counters actually should count before changing things.

Backward isn't it... _you_ changed it without consideration.

> Certainly would appreciate improvements in this area but resurrecting the
> conflation between mlocked and pinned pages is not the way to go.
>
> > This patch proposes to properly fix the problem by introducing
> > VM_PINNED. This also provides the groundwork for a possible mpin()
> > syscall or MADV_PIN -- although these are not included.
>
> Maybe add a new PIN page flag? Pages are not pinned per vma as the patch
> seems to assume.

The IB code does a big get_user_pages(), which last time I checked
pins a sequential range of pages. Therefore the VMA approach.

The perf thing is guaranteed to be virtually sequential and already is a
VMA -- it's the result of a user mmap() call.

> > It recognises that pinned page semantics are a strict super-set of
> > locked page semantics -- a pinned page will not generate major faults
> > (and thus satisfies mlock() requirements).
>
> Not exactly true. Pinned pages may not have the mlocked flag set and they
> are not managed on the unevictable LRU lists of the MM.

Re-read the spec; pinned pages stay in memory; therefore they qualify.
The fact that we've gone all 'smart' about dealing with such pages is
immaterial.

> > If people find this approach unworkable, I request we revert the above
> > mentioned patch to at least restore RLIMIT_MEMLOCK to a usable state
> > again.
>
> Cannot do that. This will cause the breakage that the patch was fixing to
> resurface.

But it fixes my borkage, which I argue is much more user-visible since it
doesn't require CAP_IPC_LOCK :-)

2013-05-28 08:56:14

by Peter Zijlstra

[permalink] [raw]
Subject: Re: OOPS in perf_mmap_close()


Take 2.. :-)

Leaving the locked vs pinned debacle for another time. The below patch
survives many non-priv executions of the problem prog.

---
Subject: perf: Fix perf mmap issues

Vince reported a problem found by his perf specific trinity fuzzer.

Al noticed 2 problems with perf's mmap():
- it has issues against fork() since we use vma->vm_mm for accounting.
- it has an rb refcount leak on double mmap().

We fix the issues against fork() by using VM_DONTCOPY; I don't think there's
code out there that uses this; we didn't hear about weird accounting
problems/crashes. If we do need this to work, the previously proposed VM_PINNED
could make this work.

Aside from the rb reference leak spotted by Al, Vince's example prog was indeed
doing a double mmap() through the use of perf_event_set_output().

This exposes another problem, since we now have 2 events with one buffer, the
accounting gets screwy because we account per event. Fix this by making the
buffer responsible for its own accounting.

Cc: Al Viro <[email protected]>
Reported-by: Vince Weaver <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
---
include/linux/perf_event.h | 3 +--
kernel/events/core.c | 37 ++++++++++++++++++++-----------------
kernel/events/internal.h | 3 +++
3 files changed, 24 insertions(+), 19 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index f463a46..c5b6dbf 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -389,8 +389,7 @@ struct perf_event {
/* mmap bits */
struct mutex mmap_mutex;
atomic_t mmap_count;
- int mmap_locked;
- struct user_struct *mmap_user;
+
struct ring_buffer *rb;
struct list_head rb_entry;

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9dc297f..ae752cd 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2917,7 +2917,7 @@ static void free_event_rcu(struct rcu_head *head)
kfree(event);
}

-static void ring_buffer_put(struct ring_buffer *rb);
+static bool ring_buffer_put(struct ring_buffer *rb);

static void free_event(struct perf_event *event)
{
@@ -3582,13 +3582,13 @@ static struct ring_buffer *ring_buffer_get(struct perf_event *event)
return rb;
}

-static void ring_buffer_put(struct ring_buffer *rb)
+static bool ring_buffer_put(struct ring_buffer *rb)
{
struct perf_event *event, *n;
unsigned long flags;

if (!atomic_dec_and_test(&rb->refcount))
- return;
+ return false;

spin_lock_irqsave(&rb->event_lock, flags);
list_for_each_entry_safe(event, n, &rb->event_list, rb_entry) {
@@ -3598,6 +3598,7 @@ static void ring_buffer_put(struct ring_buffer *rb)
spin_unlock_irqrestore(&rb->event_lock, flags);

call_rcu(&rb->rcu_head, rb_free_rcu);
+ return true;
}

static void perf_mmap_open(struct vm_area_struct *vma)
@@ -3612,18 +3613,20 @@ static void perf_mmap_close(struct vm_area_struct *vma)
struct perf_event *event = vma->vm_file->private_data;

if (atomic_dec_and_mutex_lock(&event->mmap_count, &event->mmap_mutex)) {
- unsigned long size = perf_data_size(event->rb);
- struct user_struct *user = event->mmap_user;
struct ring_buffer *rb = event->rb;
+ struct user_struct *mmap_user = rb->mmap_user;
+ int mmap_locked = rb->mmap_locked;
+ unsigned long size = perf_data_size(rb);

- atomic_long_sub((size >> PAGE_SHIFT) + 1, &user->locked_vm);
- vma->vm_mm->pinned_vm -= event->mmap_locked;
rcu_assign_pointer(event->rb, NULL);
ring_buffer_detach(event, rb);
mutex_unlock(&event->mmap_mutex);

- ring_buffer_put(rb);
- free_uid(user);
+ if (ring_buffer_put(rb)) {
+ atomic_long_sub((size >> PAGE_SHIFT) + 1, &mmap_user->locked_vm);
+ vma->vm_mm->pinned_vm -= mmap_locked;
+ free_uid(mmap_user);
+ }
}
}

@@ -3676,9 +3679,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
WARN_ON_ONCE(event->ctx->parent_ctx);
mutex_lock(&event->mmap_mutex);
if (event->rb) {
- if (event->rb->nr_pages == nr_pages)
- atomic_inc(&event->rb->refcount);
- else
+ if (event->rb->nr_pages != nr_pages)
ret = -EINVAL;
goto unlock;
}
@@ -3720,12 +3721,14 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
ret = -ENOMEM;
goto unlock;
}
- rcu_assign_pointer(event->rb, rb);
+
+ rb->mmap_locked = extra;
+ rb->mmap_user = get_current_user();

atomic_long_add(user_extra, &user->locked_vm);
- event->mmap_locked = extra;
- event->mmap_user = get_current_user();
- vma->vm_mm->pinned_vm += event->mmap_locked;
+ vma->vm_mm->pinned_vm += extra;
+
+ rcu_assign_pointer(event->rb, rb);

perf_event_update_userpage(event);

@@ -3734,7 +3737,7 @@ unlock:
atomic_inc(&event->mmap_count);
mutex_unlock(&event->mmap_mutex);

- vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
+ vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND | VM_DONTDUMP;
vma->vm_ops = &perf_mmap_vmops;

return ret;
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index eb675c4..5bc6c8e 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -31,6 +31,9 @@ struct ring_buffer {
spinlock_t event_lock;
struct list_head event_list;

+ int mmap_locked;
+ struct user_struct *mmap_user;
+
struct perf_event_mmap_page *user_page;
void *data_pages[0];
};

Subject: [tip:perf/urgent] perf: Fix perf mmap bugs

Commit-ID: 26cb63ad11e04047a64309362674bcbbd6a6f246
Gitweb: http://git.kernel.org/tip/26cb63ad11e04047a64309362674bcbbd6a6f246
Author: Peter Zijlstra <[email protected]>
AuthorDate: Tue, 28 May 2013 10:55:48 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 28 May 2013 11:05:08 +0200

perf: Fix perf mmap bugs

Vince reported a problem found by his perf specific trinity
fuzzer.

Al noticed 2 problems with perf's mmap():

- it has issues against fork() since we use vma->vm_mm for accounting.
- it has an rb refcount leak on double mmap().

We fix the issues against fork() by using VM_DONTCOPY; I don't
think there's code out there that uses this; we didn't hear
about weird accounting problems/crashes. If we do need this to
work, the previously proposed VM_PINNED could make this work.

Aside from the rb reference leak spotted by Al, Vince's example
prog was indeed doing a double mmap() through the use of
perf_event_set_output().

This exposes another problem, since we now have 2 events with
one buffer, the accounting gets screwy because we account per
event. Fix this by making the buffer responsible for its own
accounting.

Reported-by: Vince Weaver <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/perf_event.h | 3 +--
kernel/events/core.c | 37 ++++++++++++++++++++-----------------
kernel/events/internal.h | 3 +++
3 files changed, 24 insertions(+), 19 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index f463a46..c5b6dbf 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -389,8 +389,7 @@ struct perf_event {
/* mmap bits */
struct mutex mmap_mutex;
atomic_t mmap_count;
- int mmap_locked;
- struct user_struct *mmap_user;
+
struct ring_buffer *rb;
struct list_head rb_entry;

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9dc297f..ae752cd 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2917,7 +2917,7 @@ static void free_event_rcu(struct rcu_head *head)
kfree(event);
}

-static void ring_buffer_put(struct ring_buffer *rb);
+static bool ring_buffer_put(struct ring_buffer *rb);

static void free_event(struct perf_event *event)
{
@@ -3582,13 +3582,13 @@ static struct ring_buffer *ring_buffer_get(struct perf_event *event)
return rb;
}

-static void ring_buffer_put(struct ring_buffer *rb)
+static bool ring_buffer_put(struct ring_buffer *rb)
{
struct perf_event *event, *n;
unsigned long flags;

if (!atomic_dec_and_test(&rb->refcount))
- return;
+ return false;

spin_lock_irqsave(&rb->event_lock, flags);
list_for_each_entry_safe(event, n, &rb->event_list, rb_entry) {
@@ -3598,6 +3598,7 @@ static void ring_buffer_put(struct ring_buffer *rb)
spin_unlock_irqrestore(&rb->event_lock, flags);

call_rcu(&rb->rcu_head, rb_free_rcu);
+ return true;
}

static void perf_mmap_open(struct vm_area_struct *vma)
@@ -3612,18 +3613,20 @@ static void perf_mmap_close(struct vm_area_struct *vma)
struct perf_event *event = vma->vm_file->private_data;

if (atomic_dec_and_mutex_lock(&event->mmap_count, &event->mmap_mutex)) {
- unsigned long size = perf_data_size(event->rb);
- struct user_struct *user = event->mmap_user;
struct ring_buffer *rb = event->rb;
+ struct user_struct *mmap_user = rb->mmap_user;
+ int mmap_locked = rb->mmap_locked;
+ unsigned long size = perf_data_size(rb);

- atomic_long_sub((size >> PAGE_SHIFT) + 1, &user->locked_vm);
- vma->vm_mm->pinned_vm -= event->mmap_locked;
rcu_assign_pointer(event->rb, NULL);
ring_buffer_detach(event, rb);
mutex_unlock(&event->mmap_mutex);

- ring_buffer_put(rb);
- free_uid(user);
+ if (ring_buffer_put(rb)) {
+ atomic_long_sub((size >> PAGE_SHIFT) + 1, &mmap_user->locked_vm);
+ vma->vm_mm->pinned_vm -= mmap_locked;
+ free_uid(mmap_user);
+ }
}
}

@@ -3676,9 +3679,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
WARN_ON_ONCE(event->ctx->parent_ctx);
mutex_lock(&event->mmap_mutex);
if (event->rb) {
- if (event->rb->nr_pages == nr_pages)
- atomic_inc(&event->rb->refcount);
- else
+ if (event->rb->nr_pages != nr_pages)
ret = -EINVAL;
goto unlock;
}
@@ -3720,12 +3721,14 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
ret = -ENOMEM;
goto unlock;
}
- rcu_assign_pointer(event->rb, rb);
+
+ rb->mmap_locked = extra;
+ rb->mmap_user = get_current_user();

atomic_long_add(user_extra, &user->locked_vm);
- event->mmap_locked = extra;
- event->mmap_user = get_current_user();
- vma->vm_mm->pinned_vm += event->mmap_locked;
+ vma->vm_mm->pinned_vm += extra;
+
+ rcu_assign_pointer(event->rb, rb);

perf_event_update_userpage(event);

@@ -3734,7 +3737,7 @@ unlock:
atomic_inc(&event->mmap_count);
mutex_unlock(&event->mmap_mutex);

- vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
+ vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND | VM_DONTDUMP;
vma->vm_ops = &perf_mmap_vmops;

return ret;
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index eb675c4..5bc6c8e 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -31,6 +31,9 @@ struct ring_buffer {
spinlock_t event_lock;
struct list_head event_list;

+ int mmap_locked;
+ struct user_struct *mmap_user;
+
struct perf_event_mmap_page *user_page;
void *data_pages[0];
};

2013-05-28 16:14:40

by Vince Weaver

[permalink] [raw]
Subject: Re: OOPS in perf_mmap_close()

On Tue, 28 May 2013, Peter Zijlstra wrote:

> Take 2.. :-)
>
> Leaving the locked vs pinned debacle for another time. The below patch
> survives many non-priv executions of the problem prog.

It looks like this is already in tip, but I can confirm that this
patch seems to fix things on my machine and holds up against longer
fuzzing runs.

Thanks,

Vince

Subject: Re: [RFC][PATCH] mm: Fix RLIMIT_MEMLOCK

On Sat, 25 May 2013, KOSAKI Motohiro wrote:

> If pinned and mlocked are intentionally completely different things, why does
> IB use RLIMIT_MEMLOCK? Why doesn't IB use an IB-specific limit, and why does
> only IB raise the number of pinned pages while other gup users don't?
> I can't guess the IB folks' intent.

True, another limit would be better. The reason IB raises the
pinned-pages count is that IB permanently pins those pages. Other users of
gup only do so temporarily.

If there are other users that pin pages permanently, they should also
account for it.

Subject: Re: [RFC][PATCH] mm: Fix RLIMIT_MEMLOCK

On Mon, 27 May 2013, Peter Zijlstra wrote:

> Before your patch pinned was included in locked and thus RLIMIT_MEMLOCK
> had a single resource counter. After your patch RLIMIT_MEMLOCK is
> applied separately to both -- more or less.

Before the patch the count was doubled since a single page was counted
twice: Once because it was mlocked (marked with PG_mlock) and then again
because it was also pinned (the refcount was increased). Two different things.

We have agreed for a long time that mlocked pages are movable. That is not
true for pinned pages, and therefore pinned pages do not fall into that
category (Hugh? AFAICR you came up with that rule?)

> NO, mlocked pages are pages that do not leave core memory; IOW do not
> cause major faults. Pinning pages is a perfectly spec compliant mlock()
> implementation.

That is not the definition that we have used so far.

> Now in an earlier discussion on the issue 'we' (I can't remember if you
> participated there, I remember Mel and Kosaki-San) agreed that for
> 'normal' (read not whacky real-time people) mlock can still be useful
> and we should introduce a pinned user API for the RT people.

Right. I remember that.

> > Pinned pages are pages that have an elevated refcount because the hardware
> > needs to use these pages for I/O. The elevated refcount may be temporary
> > (then we dont care about this) or for a longer time (such as the memory
> > registration of the IB subsystem). That is when we account the memory as
> > pinned. The elevated refcount stops page migration and other things from
> > trying to move that memory.
>
> Again I _know_ that!!!

But then you refuse to acknowledge the difference and want to conflate
both.

> > Pages can be both pinned and mlocked.
>
> Right, but apart from mlockall() this is a highly unlikely situation to
> actually occur. And if you're using mlockall() you've effectively
> disabled RLIMIT_MEMLOCK and thus nobody cares if the resource counter
> goes funny.

mlockall() would never be used on all processes. You still need
RLIMIT_MEMLOCK to ensure that the box does not lock up.

> > I think we need to be first clear on what we want to accomplish and what
> > these counters actually should count before changing things.
>
> Backward isn't it... _you_ changed it without consideration.

I applied the categorization that we had agreed on before during the
development of page migration. Pinning is not compatible.

> The IB code does a big get_user_pages(), which last time I checked
> pins a sequential range of pages. Therefore the VMA approach.

The IB code (and other code) can require the pinning of pages in various
ways.

2013-05-28 18:17:32

by Vince Weaver

[permalink] [raw]
Subject: Re: OOPS in perf_mmap_close()

On Tue, 28 May 2013, Vince Weaver wrote:

> It looks like this is already in tip, but I can confirm that this
> patch seems to fix things on my machine and holds up against longer
> fuzzing runs.

OK, I take it back. Even with the new patch applied, my fuzzer can still
make the kernel leak user->locked_vm

I assume that the locked_vm value should go back to 0 once a process that
has a bunch of mmap'd perf_events opened exits?

I admit this is sort of an obscure corner case, but it does mean that a
user can leak user->locked_vm to the point that "perf record" no longer
works.

Vince

2013-05-29 07:44:53

by Peter Zijlstra

[permalink] [raw]
Subject: Re: OOPS in perf_mmap_close()

On Tue, May 28, 2013 at 02:22:11PM -0400, Vince Weaver wrote:
> On Tue, 28 May 2013, Vince Weaver wrote:
>
> > It looks like this is already in tip, but I can confirm that this
> > patch seems to fix things on my machine and holds up against longer
> > fuzzing runs.
>
> OK, I take it back. Even with the new patch applied, my fuzzer can still
> make the kernel leak user->locked_vm
>
> I assume that the locked_vm value should go back to 0 once a process that
> has a bunch of mmap'd perf_events opened exits?

Yep.

> I admit this is sort of an obscure corner case, but it does mean that a
> user can leak user->locked_vm to the point that "perf record" no longer
> works.

Hurm.. I don't suppose you have an easy reproducer handy eh? I'll go
stare at it. At least the current state is better than before, but
clearly we're not quite there yet.

2013-05-29 07:58:52

by Ingo Molnar

[permalink] [raw]
Subject: [regression] Re: [RFC][PATCH] mm: Fix RLIMIT_MEMLOCK


* Christoph Lameter <[email protected]> wrote:

> On Mon, 27 May 2013, Peter Zijlstra wrote:
>
> > Before your patch pinned was included in locked and thus RLIMIT_MEMLOCK
> > had a single resource counter. After your patch RLIMIT_MEMLOCK is
> > applied separately to both -- more or less.
>
> Before the patch the count was doubled since a single page was counted
> twice: Once because it was mlocked (marked with PG_mlock) and then again
> because it was also pinned (the refcount was increased). Two different
> things.

Christoph, why are you *STILL* arguing??

You caused a *regression* in a userspace ABI, plain and simple, and a
security-relevant one. Furthermore, you modified kernel/events/core.c yet
you never even Cc:-ed the parties involved ...

All your excuses, obfuscation and attempts to redefine the universe to
your liking won't change reality: it worked before, it does not now. Take
responsibility for your actions for christ's sake and move forward towards
a resolution, okay?

When can we expect a fix from you for the breakage you caused? Or at least
a word that acknowledges that you broke a user ABI carelessly?

Thanks,

Ingo

2013-05-29 08:07:46

by Ingo Molnar

[permalink] [raw]
Subject: Re: OOPS in perf_mmap_close()


* Vince Weaver <[email protected]> wrote:

> On Tue, 28 May 2013, Peter Zijlstra wrote:
>
> > Take 2.. :-)
> >
> > Leaving the locked vs pinned debacle for another time. The below patch
> > survives many non-priv executions of the problem prog.
>
> It looks like this is already in tip, [...]

Wanted to help out with testing it in parallel - even though it was not
yet clear whether it's the full fix.

Since you are still seeing a leak I will hold on to it until it's fully
fixed.

Thanks,

Ingo

2013-05-29 13:12:43

by Vince Weaver

[permalink] [raw]
Subject: Re: OOPS in perf_mmap_close()

On Wed, 29 May 2013, Peter Zijlstra wrote:

> Hurm.. I don't suppose you have an easy reproducer handy eh? I'll go
> stare at it. At least the current state is better than before, but
> clearly we're not quite there yet.

I've been working on a reproducible test case. Simple modifications to
the existing testcase don't seem to trigger things. I have a 100k-300k
syscall trace that triggers things; I'm working on getting that down
to something reasonable.

Vince

2013-05-29 19:13:57

by Vince Weaver

[permalink] [raw]
Subject: Re: OOPS in perf_mmap_close()

On Wed, 29 May 2013, Peter Zijlstra wrote:
>
> Hurm.. I don't suppose you have an easy reproducer handy eh? I'll go
> stare at it. At least the current state is better than before, but
> clearly we're not quite there yet.

OK, below is an easy reproducer. Just run it two or three times.
It leaks 129 pages in user->locked_vm each time you run it.

It took me a while to bisect this down from 10,000 syscalls to just 3.
I now have a tool that can generate valid perf test_cases from my fuzzer
traces, which should be useful.

Vince

/* log_to_code output from bisect19.log */
/* by Vince Weaver <vincent.weaver _at_ maine.edu> */

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <linux/hw_breakpoint.h>
#include <linux/perf_event.h>

int fd[1024];
struct perf_event_attr pe[1024];
char *mmap_result[1024];

int perf_event_open(struct perf_event_attr *hw_event_uptr,
		pid_t pid, int cpu, int group_fd, unsigned long flags) {

	return syscall(__NR_perf_event_open,hw_event_uptr, pid, cpu,
		group_fd, flags);
}

int main(int argc, char **argv) {

	memset(&pe[7],0,sizeof(struct perf_event_attr));
	pe[7].type=PERF_TYPE_BREAKPOINT;
	pe[7].size=96;
	pe[7].config=0x0;
	pe[7].sample_type=0; /* 0 */
	pe[7].read_format=PERF_FORMAT_ID; /* 4 */
	pe[7].inherit=1;
	pe[7].pinned=1;
	pe[7].exclusive=1;
	pe[7].exclude_idle=1;
	pe[7].mmap=1;
	pe[7].comm=1;
	pe[7].task=1;
	pe[7].precise_ip=3; /* must have zero skid */
	pe[7].exclude_host=1;
	pe[7].wakeup_events=1886953739;
	pe[7].bp_type=HW_BREAKPOINT_R|HW_BREAKPOINT_W; /*3*/
	pe[7].bp_addr=0x60ac86c7;
	pe[7].bp_len=0x1;

	fd[7]=perf_event_open(&pe[7],0,0,-1,0 /*0*/ );

	mmap_result[7]=mmap(NULL, 129*4096,PROT_READ|PROT_WRITE, MAP_SHARED,fd[7], 0);
	if (mmap_result[7]==MAP_FAILED) {
		printf("MMAP FAILED!\n");
		exit(1);
	}

	memset(&pe[8],0,sizeof(struct perf_event_attr));
	pe[8].type=PERF_TYPE_HARDWARE;
	pe[8].size=96;
	pe[8].config=PERF_COUNT_HW_BUS_CYCLES;
	pe[8].sample_type=0; /* 0 */
	pe[8].read_format=PERF_FORMAT_ID; /* 4 */
	pe[8].disabled=1;
	pe[8].inherit=1;
	pe[8].pinned=1;
	pe[8].exclude_kernel=1;
	pe[8].comm=1;
	pe[8].watermark=1;
	pe[8].precise_ip=0; /* arbitrary skid */
	pe[8].sample_id_all=1;
	pe[8].exclude_guest=1;
	pe[8].wakeup_watermark=1153443849;
	pe[8].bp_type=HW_BREAKPOINT_EMPTY;

	fd[8]=perf_event_open(&pe[8],0,0,fd[7],PERF_FLAG_FD_NO_GROUP|PERF_FLAG_FD_OUTPUT /*3*/ );

	/* Replayed 3 syscalls */
	return 0;
}

2013-05-29 19:53:44

by KOSAKI Motohiro

[permalink] [raw]
Subject: [RFC][PATCH] mm: Fix RLIMIT_MEMLOCK

Hi

I'm unhappy that you guys use such offensive words so much. Please cool down, all of you. :-/
In fact, _BOTH_ the behavior before and the behavior after Christoph's patch lack clean semantics.
And PeterZ proposed making a new, cleaner one rather than reverting. No need to hassle.

I'm 100% sure the -rt people need a stronger mlock API. Please join the discussion to make a better API.
In my humble opinion: we should add a new mlock3(addr, len, flags) syscall (*) and support
the -rt requirement directly. And the current strange IB RLIMIT_MEMLOCK usage should gradually
migrate to it.
(*) Or enhancing mbind() is an option, because I expect apps will often need to pin pages
near particular NUMA nodes.
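
For illustration, the kind of interface I have in mind might look like the sketch
below (the name, flags and values are hypothetical, not an existing API):

#include <stddef.h>

/* Hypothetical flags for a new mlock3() syscall -- illustration only. */
#define MLOCK_LOCKED	0x01	/* classic mlock(): resident, still movable */
#define MLOCK_PINNED	0x02	/* -rt / IB style: resident and not movable */

/* Hypothetical prototype; both kinds of pages would still be charged
 * against RLIMIT_MEMLOCK, so the limit keeps meaning "memory the user
 * can force to stay in RAM". */
int mlock3(void *addr, size_t len, int flags);

/* e.g. an -rt application pinning its working set:
 *	mlock3(rt_buf, rt_buf_len, MLOCK_PINNED);
 */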

As you know, the current IB pinning implementation doesn't guarantee the absence of minor
faults when fork() is used. That's OK for IB; they use madvise(MADV_DONTFORK) too. But I'm
not sure *all* rt applications would be satisfied with this. We might need to implement
copy-on-fork, or we might not. I'd like to hear other people's opinions.
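
(Roughly the following userspace pattern, for reference; the buffer size and
allocator here are arbitrary:)

#include <stdlib.h>
#include <sys/mman.h>

/* Rough illustration of the fork workaround mentioned above: mark the
 * to-be-registered buffer MADV_DONTFORK so a child created by fork()
 * never maps the pinned pages at all. */
static void *alloc_registered_buffer(size_t len)
{
	void *buf;

	if (posix_memalign(&buf, 4096, len))
		return NULL;

	madvise(buf, len, MADV_DONTFORK);
	return buf;
}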

Also, all developers should know that this pinning breaks when memory hot-plug happens,
just as CPU binding via sched_setaffinity() may break on CPU hot-remove.

2013-05-30 06:32:17

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC][PATCH] mm: Fix RLIMIT_MEMLOCK


* KOSAKI Motohiro <[email protected]> wrote:

> Hi
>
> I'm unhappy that you guys use such offensive words so much. Please cool down,
> all of you. :-/ In fact, _BOTH_ the behavior before and the behavior after
> Christoph's patch lack clean semantics.

Erm, this feature _regressed_ after the patch. All other concerns are
secondary. What's so difficult to understand about that?

> And PeterZ proposed make new cleaner one rather than revert. No need to
> hassle.

Sorry, that's not how we deal with regressions upstream...

I've been pretty polite in fact, compared to say Linus - someone should
put a politically correct summary of:

https://lkml.org/lkml/2013/2/22/456
https://lkml.org/lkml/2013/3/26/1113

into Documentation/, people keep forgetting about it.

I'm sorry if being direct and to the point offends you [no, not really],
but this discussion got _way_ off the real track it should be on: that this is
a _regression_.

Thanks,

Ingo

2013-05-30 07:25:59

by Peter Zijlstra

[permalink] [raw]
Subject: Re: OOPS in perf_mmap_close()

On Wed, May 29, 2013 at 03:18:32PM -0400, Vince Weaver wrote:
> On Wed, 29 May 2013, Peter Zijlstra wrote:
> >
> > Hurm.. I don't suppose you have an easy reproducer handy eh? I'll go
> > stare at it. At least the current state is better than before, but
> > clearly we're not quite there yet.
>
> OK, below is an easy reproducer. Just run it two or three times.
> It leaks 129 pages in user->locked_vm each time you run it.
>
> It took me a while to bisect this down from 10,000 syscalls to just 3.
> I now have a tool that can generate valid perf test_cases from my fuzzer
> traces, which should be useful.

Awesome! How specific is it to perf? I mean, would that tool work
equally well for other trinity reports?

I'll go prod, thanks again!

2013-05-30 12:46:32

by Vince Weaver

[permalink] [raw]
Subject: Re: OOPS in perf_mmap_close()

On Thu, 30 May 2013, Peter Zijlstra wrote:
>
> Awesome! How specific is it to perf? I mean, would that tool work
> equally well for other trinity reports?

No, it's fairly specific to my perf_fuzzer. I only use trinity code to
set up the perf_event_open() attr parameters; everything else I use my own
code for.

The logging output from perf_fuzzer is fairly straightforward. Just a
single char indicating which syscall and a space separated list of
the result (if relevant) and then all the syscall parameters.

The logging output can then be used by both the replayer and the
sample-code generator.

While in theory this could be extended to something generic, currently it
only works for perf_event_open/close/ioctl/read/write/mmap/munmap, and
even then it takes some shortcuts because I know it will only be run when
finding perf_event problems.

The code for all of this is here if anyone else wants to play with it.
https://github.com/deater/perf_event_tests/tree/master/fuzzer

> I'll go prod, thanks again!

The bug looks related to hw breakpoints, not sure if the bugs Oleg
Nesterov has been reporting in this area might be relevant or not.

Vince

2013-05-30 18:31:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH] mm: Fix RLIMIT_MEMLOCK

On Tue, May 28, 2013 at 04:37:06PM +0000, Christoph Lameter wrote:
> On Mon, 27 May 2013, Peter Zijlstra wrote:
>
> > Before your patch pinned was included in locked and thus RLIMIT_MEMLOCK
> > had a single resource counter. After your patch RLIMIT_MEMLOCK is
> > applied separately to both -- more or less.
>
> Before the patch the count was doubled since a single page was counted
> twice: Once because it was mlocked (marked with PG_mlock) and then again
> because it was also pinned (the refcount was increased). Two different things.

Before the patch, RLIMIT_MEMLOCK was a limit on the sum of both; after it,
that is no longer the case. This change was not described in the changelog.

This is a bug plain and simple and you refuse to acknowledge it.

> We have agreed for a long time that mlocked pages are movable. That is not
> true for pinned pages, and therefore pinned pages do not fall into that
> category (Hugh? AFAICR you came up with that rule?)

Into what category? RLIMIT_MEMLOCK should be a limit on the number of
pages that are constrained to memory (cannot be paged). How does a
pinned page not qualify for that?

> > NO, mlocked pages are pages that do not leave core memory; IOW do not
> > cause major faults. Pinning pages is a perfectly spec compliant mlock()
> > implementation.
>
> That is not the definition that we have used so far.

There's two statements there; which one do you disagree with?

On the first: that is exactly the mlock() definition you want; excluding
major faults only means faults cannot be subject to IO, IOW pages must
remain in memory. Not excluding minor faults allows you to unmap and
migrate pages.

On the second: excluding any faults is what the mlock() specs
intended -- because that is the thing real-time people really want and
mlock is part of the real-time POSIX spec.

[ Hence the need to provide mpin() and munpin() syscalls before we make
mlock() pages migratable, otherwise RT people are screwed. ]

Since excluding any fault per definition also excludes major faults,
the second is a clear superset of the first. Both ensure pages stay in
memory, thus both should count towards RLIMIT_MEMLOCK.

> But then you refuse to acknowledge the difference and want to conflate
> both.

You're the one that is confused. You're just repeating yourself without
providing argument or reason.

> > > Pages can be both pinned and mlocked.
> >
> > Right, but apart from mlockall() this is a highly unlikely situation to
> > actually occur. And if you're using mlockall() you've effectively
> > disabled RLIMIT_MEMLOCK and thus nobody cares if the resource counter
> > goes funny.
>
> mlockall() would never be used on all processes. You still need
> RLIMIT_MEMLOCK to ensure that the box does not lock up.

You're so skilled at missing the point it's not funny.

The resource counter (formerly mm_struct::locked_vm, currently broken)
associated with RLIMIT_MEMLOCK is per process. Therefore when you use
mlockall() you've already done away with the limit, and thus the
resource counter value is irrelevant.

> > > I think we need to be first clear on what we want to accomplish and what
> > > these counters actually should count before changing things.
> >
> > Backward isn't it... _you_ changed it without consideration.
>
> I applied the categorization that we had agreed on before during the
> development of page migration. Pinning is not compatible.

I've never agreed to any such thing; you changed my code without cc'ing
me and without describing the actual change.

Please explain how pinned pages are not stuck in memory and can be
paged. Once you've convinced me of that, I'll concede they should not be
counted towards RLIMIT_MEMLOCK.

> > The IB code does a big get_user_pages(), which last time I checked
> > pins a sequential range of pages. Therefore the VMA approach.
>
> The IB code (and other code) can require the pinning of pages in various
> ways.

You can also mlock()/madvise()/mprotect() in various ways; this is a
non-argument against using VMAs.

Is there anything besides IB and perf that has user controlled page
pinning? If so, it needs fixing.

2013-05-30 20:00:00

by Pekka Enberg

[permalink] [raw]
Subject: Re: [RFC][PATCH] mm: Fix RLIMIT_MEMLOCK

On Mon, 27 May 2013, Peter Zijlstra wrote:
>> Before your patch pinned was included in locked and thus RLIMIT_MEMLOCK
>> had a single resource counter. After your patch RLIMIT_MEMLOCK is
>> applied separately to both -- more or less.

On Tue, May 28, 2013 at 7:37 PM, Christoph Lameter <[email protected]> wrote:
> Before the patch the count was doubled since a single page was counted
> twice: Once because it was mlocked (marked with PG_mlock) and then again
> because it was also pinned (the refcount was increased). Two different things.

Pinned vs. mlocked counting isn't the problem here. You changed the
RLIMIT_MEMLOCK userspace ABI and that needs to be restored. So the
question is how we can preserve the old RLIMIT_MEMLOCK semantics while
avoiding the double-accounting issue.

Pekka

2013-05-30 20:42:44

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC][PATCH] mm: Fix RLIMIT_MEMLOCK

>> I'm unhappy that you guys use such offensive words so much. Please cool down,
>> all of you. :-/ In fact, _BOTH_ the behavior before and the behavior after
>> Christoph's patch lack clean semantics.
>
> Erm, this feature _regressed_ after the patch. All other concerns are
> secondary. What's so difficult to understand about that?

Because it is not a new commit at all. Christoph's patch was introduced
10 releases ago.

$ git describe bc3e53f682
v3.1-7235-gbc3e53f

If we just revert it, we may get another, opposite regression report.
I'm worried about that. Moreover, I don't think discussing a better fix is
too difficult for us.

2013-05-30 21:01:36

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC][PATCH] mm: Fix RLIMIT_MEMLOCK

> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index a841123..f8f47dc 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -137,17 +137,22 @@ struct ib_umem *ib_umem_get (struct ib_ucontext *context, unsigned long addr,
>
> down_write(&current->mm->mmap_sem);
>
> - locked = npages + current->mm->pinned_vm;
> + locked = npages + mm_locked_pages(current->mm);
> lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>
> if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
> ret = -ENOMEM;
> - goto out;
> + goto err;
> }
>
> cur_base = addr & PAGE_MASK;
> + umem->start_addr = cur_base;
> + umem->end_addr = cur_base + npages;
> +
> + ret = mm_mpin(umem->start_addr, umem->end_addr);
> + if (ret)
> + goto err;

I believe RLIMIT_MEMLOCK should be checked within mm_mpin().
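
Something along these lines, a rough and untested sketch on top of the RFC patch
(mm_locked_pages(), do_mlock() and the MLOCK_F_* flags are the ones introduced there):

/* Sketch: fold the RLIMIT_MEMLOCK check into mm_mpin() so callers such
 * as ib_umem_get() cannot forget it.  Assumes the caller holds mmap_sem
 * for writing, as in the RFC patch. */
int mm_mpin(unsigned long start, unsigned long end)
{
	unsigned long npages = (end - start) >> PAGE_SHIFT;
	unsigned long lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

	if (npages + mm_locked_pages(current->mm) > lock_limit &&
	    !capable(CAP_IPC_LOCK))
		return -ENOMEM;

	return do_mlock(start, end, MLOCK_F_ON | MLOCK_F_PIN);
}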


> +static inline unsigned long mm_locked_pages(struct mm_struct *mm)
> +{
> + return mm->pinned_vm + mm->locked_vm;
> +}

This is acceptable. But if we create mm_locked_pages(), /proc should
also use it. Otherwise a pinning operation magically decreases the VmLck
field in /proc/pid/status and people get confused.



> @@ -310,9 +309,49 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
> * Keep track of amount of locked VM.
> */
> nr_pages = (end - start) >> PAGE_SHIFT;
> - if (!lock)
> - nr_pages = -nr_pages;
> - mm->locked_vm += nr_pages;
> +
> + /*
> + * We should only account pages once, if VM_PINNED is set pages are
> + * accounted in mm_struct::pinned_vm, otherwise if VM_LOCKED is set,
> + * we'll account them in mm_struct::locked_vm.
> + *
> + * PL := vma->vm_flags
> + * PL' := newflags
> + * PLd := {pinned,locked}_vm delta
> + *
> + * PL->PL' PLd
> + * -----------
> + * 00 01 0+
> + * 00 10 +0
> + * 01 11 +-
> + * 01 00 0-
> + * 10 00 -0
> + * 10 11 00
> + * 11 01 -+
> + * 11 10 00
> + */

This comment is too cryptic. :-)
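
My reading of it, spelled out as a plain helper (illustration only, not part of
the patch; VM_PINNED is the new flag proposed there):

static void pl_delta(vm_flags_t old, vm_flags_t new, long nr_pages,
		     long *pinned_delta, long *locked_delta)
{
	int was_pinned = !!(old & VM_PINNED), is_pinned = !!(new & VM_PINNED);
	int was_locked = !!(old & VM_LOCKED), is_locked = !!(new & VM_LOCKED);

	/* pinned_vm simply follows the VM_PINNED bit */
	*pinned_delta = (is_pinned - was_pinned) * nr_pages;

	/* locked_vm only counts pages that are mlocked but not pinned,
	 * so it changes whenever that combined state changes */
	*locked_delta = ((is_locked && !is_pinned) -
			 (was_locked && !was_pinned)) * nr_pages;
}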

2013-05-31 09:27:14

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC][PATCH] mm: Fix RLIMIT_MEMLOCK


* KOSAKI Motohiro <[email protected]> wrote:

> >> I'm unhappy that you guys use such offensive words so much. Please cool down,
> >> all of you. :-/ In fact, _BOTH_ the behavior before and the behavior after
> >> Christoph's patch lack clean semantics.
> >
> > Erm, this feature _regressed_ after the patch. All other concerns are
> > secondary. What's so difficult to understand about that?
>
> Because it is not a new commit at all. Christoph's patch was introduced
> 10 releases ago.

That's indeed sad - resource limits are not typically tested by apps. The
lack of Cc:s to people affected also helped hide the effects of the
change.

> $ git describe bc3e53f682
> v3.1-7235-gbc3e53f
>
> If we just revert it, we may get another, opposite regression report.
> I'm worried about that. Moreover, I don't think discussing a better fix is
> too difficult for us.

Sure, I agree that a fix is generally better.

Thanks,

Ingo

2013-05-31 15:46:33

by Peter Zijlstra

[permalink] [raw]
Subject: Re: OOPS in perf_mmap_close()

On Thu, May 30, 2013 at 08:51:00AM -0400, Vince Weaver wrote:
> On Thu, 30 May 2013, Peter Zijlstra wrote:
> > I'll go prod, thanks again!
>
> The bug looks related to hw breakpoints, not sure if the bugs Oleg
> Nesterov has been reporting in this area might be relevant or not.

Nah, it's not. I figured out what happened, but I'm having a bit of a hard
time closing it without making a mess :-)

The 'new' issue (on top of all the previous things) is that when we redirect an
event's output, that event has a reference to the buffer and will keep the buffer
alive after the last munmap() of said buffer.

By the time we close() the last event and drop the last reference to the buffer
we don't have enough context left to properly unaccount things.

So what I've been trying to do is always attach events to the rb through the
ring_buffer::event_list -- which was previously only used for poll() wakeups --
and on the last munmap() explicitly detach all events and destroy the buffer.

While that sounds 'simple', we run into a lock inversion between
perf_event::mmap_mutex and ring_buffer::event_lock and things become nasty.

There's also the additional ring_buffer::mmap_count; which I use to keep track
of the total amount of mmap()s still out there, as opposed to the
perf_event::mmap_count which only counts the mmap()s for that particular event.
This leaves us with two atomics and a few 'nice' races.

The below is my current hackery which compiles but I know must have some issues
because I'm too tired to think straight atm. Completely untested.

I'm only posting this so that someone can tell me: "you're being an idiot! if
you do _foo_ it's all fixed much easier" :-)

If this better approach fails to materialize, I'll continue onwards and try and
get this working ;-)

---
include/linux/perf_event.h | 3 +-
kernel/events/core.c | 173 ++++++++++++++++++++++++++------------------
kernel/events/internal.h | 4 +
3 files changed, 109 insertions(+), 71 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index f463a46..c5b6dbf 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -389,8 +389,7 @@ struct perf_event {
/* mmap bits */
struct mutex mmap_mutex;
atomic_t mmap_count;
- int mmap_locked;
- struct user_struct *mmap_user;
+
struct ring_buffer *rb;
struct list_head rb_entry;

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9dc297f..2eb0d81 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -196,9 +196,6 @@ static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
static void update_context_time(struct perf_event_context *ctx);
static u64 perf_event_time(struct perf_event *event);

-static void ring_buffer_attach(struct perf_event *event,
- struct ring_buffer *rb);
-
void __weak perf_event_print_debug(void) { }

extern __weak const char *perf_pmu_name(void)
@@ -2917,7 +2914,8 @@ static void free_event_rcu(struct rcu_head *head)
kfree(event);
}

-static void ring_buffer_put(struct ring_buffer *rb);
+static bool ring_buffer_put(struct ring_buffer *rb);
+static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb);

static void free_event(struct perf_event *event)
{
@@ -2942,15 +2940,27 @@ static void free_event(struct perf_event *event)
if (has_branch_stack(event)) {
static_key_slow_dec_deferred(&perf_sched_events);
/* is system-wide event */
- if (!(event->attach_state & PERF_ATTACH_TASK))
+ if (!(event->attach_state & PERF_ATTACH_TASK)) {
atomic_dec(&per_cpu(perf_branch_stack_events,
event->cpu));
+ }
}
}

if (event->rb) {
- ring_buffer_put(event->rb);
- event->rb = NULL;
+ struct ring_buffer *rb;
+
+ /*
+ * XXX should not happen ?!
+ */
+ mutex_lock(&event->mmap_mutex);
+ rb = event->rb;
+ if (rb) {
+ rcu_assign_pointer(event->rb, NULL);
+ ring_buffer_detach(event, rb);
+ ring_buffer_put(rb);
+ }
+ mutex_unlock(&event->mmap_mutex);
}

if (is_cgroup_event(event))
@@ -3188,30 +3198,12 @@ static unsigned int perf_poll(struct file *file, poll_table *wait)
unsigned int events = POLL_HUP;

/*
- * Race between perf_event_set_output() and perf_poll(): perf_poll()
- * grabs the rb reference but perf_event_set_output() overrides it.
- * Here is the timeline for two threads T1, T2:
- * t0: T1, rb = rcu_dereference(event->rb)
- * t1: T2, old_rb = event->rb
- * t2: T2, event->rb = new rb
- * t3: T2, ring_buffer_detach(old_rb)
- * t4: T1, ring_buffer_attach(rb1)
- * t5: T1, poll_wait(event->waitq)
- *
- * To avoid this problem, we grab mmap_mutex in perf_poll()
- * thereby ensuring that the assignment of the new ring buffer
- * and the detachment of the old buffer appear atomic to perf_poll()
+ * Pin the event->rb by taking event->mmap_mutex.
*/
mutex_lock(&event->mmap_mutex);
-
- rcu_read_lock();
- rb = rcu_dereference(event->rb);
- if (rb) {
- ring_buffer_attach(event, rb);
+ rb = event->rb;
+ if (rb)
events = atomic_xchg(&rb->poll, 0);
- }
- rcu_read_unlock();
-
mutex_unlock(&event->mmap_mutex);

poll_wait(file, &event->waitq, wait);
@@ -3521,16 +3513,12 @@ static void ring_buffer_attach(struct perf_event *event,
return;

spin_lock_irqsave(&rb->event_lock, flags);
- if (!list_empty(&event->rb_entry))
- goto unlock;
-
- list_add(&event->rb_entry, &rb->event_list);
-unlock:
+ if (list_empty(&event->rb_entry))
+ list_add(&event->rb_entry, &rb->event_list);
spin_unlock_irqrestore(&rb->event_lock, flags);
}

-static void ring_buffer_detach(struct perf_event *event,
- struct ring_buffer *rb)
+static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb)
{
unsigned long flags;

@@ -3549,13 +3537,10 @@ static void ring_buffer_wakeup(struct perf_event *event)

rcu_read_lock();
rb = rcu_dereference(event->rb);
- if (!rb)
- goto unlock;
-
- list_for_each_entry_rcu(event, &rb->event_list, rb_entry)
- wake_up_all(&event->waitq);
-
-unlock:
+ if (rb) {
+ list_for_each_entry_rcu(event, &rb->event_list, rb_entry)
+ wake_up_all(&event->waitq);
+ }
rcu_read_unlock();
}

@@ -3582,22 +3567,15 @@ static struct ring_buffer *ring_buffer_get(struct perf_event *event)
return rb;
}

-static void ring_buffer_put(struct ring_buffer *rb)
+static bool ring_buffer_put(struct ring_buffer *rb)
{
- struct perf_event *event, *n;
- unsigned long flags;
-
if (!atomic_dec_and_test(&rb->refcount))
- return;
+ return false;

- spin_lock_irqsave(&rb->event_lock, flags);
- list_for_each_entry_safe(event, n, &rb->event_list, rb_entry) {
- list_del_init(&event->rb_entry);
- wake_up_all(&event->waitq);
- }
- spin_unlock_irqrestore(&rb->event_lock, flags);
+ WARN_ON_ONCE(!list_empty(&rb->event_list));

call_rcu(&rb->rcu_head, rb_free_rcu);
+ return true;
}

static void perf_mmap_open(struct vm_area_struct *vma)
@@ -3605,26 +3583,72 @@ static void perf_mmap_open(struct vm_area_struct *vma)
struct perf_event *event = vma->vm_file->private_data;

atomic_inc(&event->mmap_count);
+ atomic_inc(&event->rb->mmap_count);
}

static void perf_mmap_close(struct vm_area_struct *vma)
{
struct perf_event *event = vma->vm_file->private_data;

- if (atomic_dec_and_mutex_lock(&event->mmap_count, &event->mmap_mutex)) {
- unsigned long size = perf_data_size(event->rb);
- struct user_struct *user = event->mmap_user;
- struct ring_buffer *rb = event->rb;
+ struct ring_buffer *rb = event->rb;
+ struct user_struct *mmap_user = rb->mmap_user;
+ int mmap_locked = rb->mmap_locked;
+ unsigned long size = perf_data_size(rb);
+
+ atomic_dec(&rb->mmap_count);
+
+ if (!atomic_dec_and_mutex_lock(&event->mmap_count, &event->mmap_mutex))
+ return;
+
+ /* Detach current event from the buffer. */
+ rcu_assign_pointer(event->rb, NULL);
+ ring_buffer_detach(event, rb);
+ mutex_unlock(&event->mmap_mutex);
+
+ /* If there's still other mmap()s of this buffer, we're done. */
+ if (atomic_read(&rb->mmap_count)) {
+ ring_buffer_put(rb); /* can't be last */
+ return;
+ }
+
+ /*
+ * No other mmap()s, detach from all other events that might redirect
+ * into the now unreachable buffer.
+ */
+again:
+ rcu_read_lock();
+ list_for_each_entry_rcu(event, &rb->event_list, rb_entry) {
+ bool restart = false;
+
+ if (!atomic_long_inc_not_zero(&event->refcount))
+ continue;
+ rcu_read_unlock();
+
+ mutex_lock(&event->mmap_mutex);
+ if (event->rb != rb)
+ goto unlock_next;

- atomic_long_sub((size >> PAGE_SHIFT) + 1, &user->locked_vm);
- vma->vm_mm->pinned_vm -= event->mmap_locked;
rcu_assign_pointer(event->rb, NULL);
ring_buffer_detach(event, rb);
+ ring_buffer_put(rb); /* can't be last, we still have one */
+ restart = true;
+
+unlock_next:
mutex_unlock(&event->mmap_mutex);
+ put_event(event);
+ if (restart) /* iteration isn't safe against removal */
+ goto again;

- ring_buffer_put(rb);
- free_uid(user);
+ rcu_read_lock();
}
+ rcu_read_unlock();
+
+ /* OK, buffer is fully detached and unmapped, undo accounting */
+ atomic_long_sub((size >> PAGE_SHIFT) + 1, &mmap_user->locked_vm);
+ vma->vm_mm->pinned_vm -= mmap_locked;
+ free_uid(mmap_user);
+
+ ring_buffer_put(rb); /* must be last */
}

static const struct vm_operations_struct perf_mmap_vmops = {
@@ -3674,12 +3698,19 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
return -EINVAL;

WARN_ON_ONCE(event->ctx->parent_ctx);
+again:
mutex_lock(&event->mmap_mutex);
if (event->rb) {
- if (event->rb->nr_pages == nr_pages)
- atomic_inc(&event->rb->refcount);
- else
+ if (event->rb->nr_pages != nr_pages)
ret = -EINVAL;
+
+ /*
+ * XXX raced against perf_mmap_close() through perf_event_set_output().
+ */
+ if (!atomic_inc_not_zero(&event->rb->mmap_count)) {
+ mutex_unlock(&event->mmap_mutex);
+ goto again;
+ }
goto unlock;
}

@@ -3720,12 +3751,16 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
ret = -ENOMEM;
goto unlock;
}
- rcu_assign_pointer(event->rb, rb);
+
+ atomic_set(&rb->mmap_count, 1);
+ rb->mmap_locked = extra;
+ rb->mmap_user = get_current_user();

atomic_long_add(user_extra, &user->locked_vm);
- event->mmap_locked = extra;
- event->mmap_user = get_current_user();
- vma->vm_mm->pinned_vm += event->mmap_locked;
+ vma->vm_mm->pinned_vm += extra;
+
+ ring_buffer_attach(event, rb);
+ rcu_assign_pointer(event->rb, rb);

perf_event_update_userpage(event);

@@ -3734,7 +3769,7 @@ unlock:
atomic_inc(&event->mmap_count);
mutex_unlock(&event->mmap_mutex);

- vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
+ vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND | VM_DONTDUMP;
vma->vm_ops = &perf_mmap_vmops;

return ret;
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index eb675c4..41f5685 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -31,6 +31,10 @@ struct ring_buffer {
spinlock_t event_lock;
struct list_head event_list;

+ atomic_t mmap_count;
+ int mmap_locked;
+ struct user_struct *mmap_user;
+
struct perf_event_mmap_page *user_page;
void *data_pages[0];
};

2013-06-03 13:27:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: OOPS in perf_mmap_close()

On Thu, May 30, 2013 at 08:51:00AM -0400, Vince Weaver wrote:
> > I'll go prod, thanks again!


OK the below builds and seems to survive both test cases and about 10 minutes
of fuzzing -- fingers crossed.

---
include/linux/perf_event.h | 3 +-
kernel/events/core.c | 230 ++++++++++++++++++++++++++++++--------------
kernel/events/internal.h | 4 +
3 files changed, 163 insertions(+), 74 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 6fddac1..74a4e14 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -390,8 +390,7 @@ struct perf_event {
/* mmap bits */
struct mutex mmap_mutex;
atomic_t mmap_count;
- int mmap_locked;
- struct user_struct *mmap_user;
+
struct ring_buffer *rb;
struct list_head rb_entry;

diff --git a/kernel/events/core.c b/kernel/events/core.c
index a0780b3..b8dcbf6 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -198,9 +198,6 @@ static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
static void update_context_time(struct perf_event_context *ctx);
static u64 perf_event_time(struct perf_event *event);

-static void ring_buffer_attach(struct perf_event *event,
- struct ring_buffer *rb);
-
void __weak perf_event_print_debug(void) { }

extern __weak const char *perf_pmu_name(void)
@@ -3023,6 +3020,7 @@ static void free_event_rcu(struct rcu_head *head)
}

static void ring_buffer_put(struct ring_buffer *rb);
+static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb);

static void free_event(struct perf_event *event)
{
@@ -3047,15 +3045,30 @@ static void free_event(struct perf_event *event)
if (has_branch_stack(event)) {
static_key_slow_dec_deferred(&perf_sched_events);
/* is system-wide event */
- if (!(event->attach_state & PERF_ATTACH_TASK))
+ if (!(event->attach_state & PERF_ATTACH_TASK)) {
atomic_dec(&per_cpu(perf_branch_stack_events,
event->cpu));
+ }
}
}

if (event->rb) {
- ring_buffer_put(event->rb);
- event->rb = NULL;
+ struct ring_buffer *rb;
+
+ /*
+ * Can happen when we close an event with re-directed output.
+ *
+ * Since we have a 0 refcount, perf_mmap_close() will skip
+ * over us; possibly making our ring_buffer_put() the last.
+ */
+ mutex_lock(&event->mmap_mutex);
+ rb = event->rb;
+ if (rb) {
+ rcu_assign_pointer(event->rb, NULL);
+ ring_buffer_detach(event, rb);
+ ring_buffer_put(rb); /* could be last */
+ }
+ mutex_unlock(&event->mmap_mutex);
}

if (is_cgroup_event(event))
@@ -3293,30 +3306,13 @@ static unsigned int perf_poll(struct file *file, poll_table *wait)
unsigned int events = POLL_HUP;

/*
- * Race between perf_event_set_output() and perf_poll(): perf_poll()
- * grabs the rb reference but perf_event_set_output() overrides it.
- * Here is the timeline for two threads T1, T2:
- * t0: T1, rb = rcu_dereference(event->rb)
- * t1: T2, old_rb = event->rb
- * t2: T2, event->rb = new rb
- * t3: T2, ring_buffer_detach(old_rb)
- * t4: T1, ring_buffer_attach(rb1)
- * t5: T1, poll_wait(event->waitq)
- *
- * To avoid this problem, we grab mmap_mutex in perf_poll()
- * thereby ensuring that the assignment of the new ring buffer
- * and the detachment of the old buffer appear atomic to perf_poll()
+ * Pin the event->rb by taking event->mmap_mutex; otherwise
+ * perf_event_set_output() can swizzle our rb and make us miss wakeups.
*/
mutex_lock(&event->mmap_mutex);
-
- rcu_read_lock();
- rb = rcu_dereference(event->rb);
- if (rb) {
- ring_buffer_attach(event, rb);
+ rb = event->rb;
+ if (rb)
events = atomic_xchg(&rb->poll, 0);
- }
- rcu_read_unlock();
-
mutex_unlock(&event->mmap_mutex);

poll_wait(file, &event->waitq, wait);
@@ -3626,16 +3622,12 @@ static void ring_buffer_attach(struct perf_event *event,
return;

spin_lock_irqsave(&rb->event_lock, flags);
- if (!list_empty(&event->rb_entry))
- goto unlock;
-
- list_add(&event->rb_entry, &rb->event_list);
-unlock:
+ if (list_empty(&event->rb_entry))
+ list_add(&event->rb_entry, &rb->event_list);
spin_unlock_irqrestore(&rb->event_lock, flags);
}

-static void ring_buffer_detach(struct perf_event *event,
- struct ring_buffer *rb)
+static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb)
{
unsigned long flags;

@@ -3654,13 +3646,10 @@ static void ring_buffer_wakeup(struct perf_event *event)

rcu_read_lock();
rb = rcu_dereference(event->rb);
- if (!rb)
- goto unlock;
-
- list_for_each_entry_rcu(event, &rb->event_list, rb_entry)
- wake_up_all(&event->waitq);
-
-unlock:
+ if (rb) {
+ list_for_each_entry_rcu(event, &rb->event_list, rb_entry)
+ wake_up_all(&event->waitq);
+ }
rcu_read_unlock();
}

@@ -3689,18 +3678,10 @@ static struct ring_buffer *ring_buffer_get(struct perf_event *event)

static void ring_buffer_put(struct ring_buffer *rb)
{
- struct perf_event *event, *n;
- unsigned long flags;
-
if (!atomic_dec_and_test(&rb->refcount))
return;

- spin_lock_irqsave(&rb->event_lock, flags);
- list_for_each_entry_safe(event, n, &rb->event_list, rb_entry) {
- list_del_init(&event->rb_entry);
- wake_up_all(&event->waitq);
- }
- spin_unlock_irqrestore(&rb->event_lock, flags);
+ WARN_ON_ONCE(!list_empty(&rb->event_list));

call_rcu(&rb->rcu_head, rb_free_rcu);
}
@@ -3710,26 +3691,100 @@ static void perf_mmap_open(struct vm_area_struct *vma)
struct perf_event *event = vma->vm_file->private_data;

atomic_inc(&event->mmap_count);
+ atomic_inc(&event->rb->mmap_count);
}

+/*
+ * A buffer can be mmap()ed multiple times; either directly through the same
+ * event, or through other events by use of perf_event_set_output().
+ *
+ * In order to undo the VM accounting done by perf_mmap() we need to destroy
+ * the buffer here, where we still have a VM context. This means we need
+ * to detach all events redirecting to us.
+ */
static void perf_mmap_close(struct vm_area_struct *vma)
{
struct perf_event *event = vma->vm_file->private_data;

- if (atomic_dec_and_mutex_lock(&event->mmap_count, &event->mmap_mutex)) {
- unsigned long size = perf_data_size(event->rb);
- struct user_struct *user = event->mmap_user;
- struct ring_buffer *rb = event->rb;
+ struct ring_buffer *rb = event->rb;
+ struct user_struct *mmap_user = rb->mmap_user;
+ int mmap_locked = rb->mmap_locked;
+ unsigned long size = perf_data_size(rb);
+
+ atomic_dec(&rb->mmap_count);

- atomic_long_sub((size >> PAGE_SHIFT) + 1, &user->locked_vm);
- vma->vm_mm->pinned_vm -= event->mmap_locked;
- rcu_assign_pointer(event->rb, NULL);
- ring_buffer_detach(event, rb);
+ if (!atomic_dec_and_mutex_lock(&event->mmap_count, &event->mmap_mutex))
+ return;
+
+ /* Detach current event from the buffer. */
+ rcu_assign_pointer(event->rb, NULL);
+ ring_buffer_detach(event, rb);
+ mutex_unlock(&event->mmap_mutex);
+
+ /* If there's still other mmap()s of this buffer, we're done. */
+ if (atomic_read(&rb->mmap_count)) {
+ ring_buffer_put(rb); /* can't be last */
+ return;
+ }
+
+ /*
+ * No other mmap()s, detach from all other events that might redirect
+ * into the now unreachable buffer. Somewhat complicated by the
+ * fact that rb::event_lock otherwise nests inside mmap_mutex.
+ */
+again:
+ rcu_read_lock();
+ list_for_each_entry_rcu(event, &rb->event_list, rb_entry) {
+ if (!atomic_long_inc_not_zero(&event->refcount)) {
+ /*
+ * This event is en-route to free_event() which will
+ * detach it and remove it from the list.
+ */
+ continue;
+ }
+ rcu_read_unlock();
+
+ mutex_lock(&event->mmap_mutex);
+ /*
+ * Check we didn't race with perf_event_set_output() which can
+ * swizzle the rb from under us while we were waiting to
+ * acquire mmap_mutex.
+ *
+ * If we find a different rb; ignore this event, a next
+ * iteration will no longer find it on the list. We have to
+ * still restart the iteration to make sure we're not now
+ * iterating the wrong list.
+ */
+ if (event->rb == rb) {
+ rcu_assign_pointer(event->rb, NULL);
+ ring_buffer_detach(event, rb);
+ ring_buffer_put(rb); /* can't be last, we still have one */
+ }
mutex_unlock(&event->mmap_mutex);
+ put_event(event);

- ring_buffer_put(rb);
- free_uid(user);
+ /*
+ * Restart the iteration; either we're on the wrong list or
+ * destroyed its integrity by doing a deletion.
+ */
+ goto again;
}
+ rcu_read_unlock();
+
+ /*
+ * It could be there's still a few 0-ref events on the list; they'll
+ * get cleaned up by free_event() -- they'll also still have their
+ * ref on the rb and will free it whenever they are done with it.
+ *
+ * Aside from that, this buffer is 'fully' detached and unmapped,
+ * undo the VM accounting.
+ */
+
+ atomic_long_sub((size >> PAGE_SHIFT) + 1, &mmap_user->locked_vm);
+ vma->vm_mm->pinned_vm -= mmap_locked;
+ free_uid(mmap_user);
+
+ ring_buffer_put(rb); /* could be last */
}

static const struct vm_operations_struct perf_mmap_vmops = {
@@ -3779,12 +3834,21 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
return -EINVAL;

WARN_ON_ONCE(event->ctx->parent_ctx);
+again:
mutex_lock(&event->mmap_mutex);
if (event->rb) {
- if (event->rb->nr_pages == nr_pages)
- atomic_inc(&event->rb->refcount);
- else
+ if (event->rb->nr_pages != nr_pages)
ret = -EINVAL;
+
+ if (!atomic_inc_not_zero(&event->rb->mmap_count)) {
+ /*
+ * Raced against perf_mmap_close() through
+ * perf_event_set_output(). Try again, hope for better
+ * luck.
+ */
+ mutex_unlock(&event->mmap_mutex);
+ goto again;
+ }
goto unlock;
}

@@ -3825,12 +3889,16 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
ret = -ENOMEM;
goto unlock;
}
- rcu_assign_pointer(event->rb, rb);
+
+ atomic_set(&rb->mmap_count, 1);
+ rb->mmap_locked = extra;
+ rb->mmap_user = get_current_user();

atomic_long_add(user_extra, &user->locked_vm);
- event->mmap_locked = extra;
- event->mmap_user = get_current_user();
- vma->vm_mm->pinned_vm += event->mmap_locked;
+ vma->vm_mm->pinned_vm += extra;
+
+ ring_buffer_attach(event, rb);
+ rcu_assign_pointer(event->rb, rb);

perf_event_update_userpage(event);

@@ -3839,7 +3907,11 @@ unlock:
atomic_inc(&event->mmap_count);
mutex_unlock(&event->mmap_mutex);

- vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
+ /*
+ * Since pinned accounting is per vm we cannot allow fork() to copy our
+ * vma.
+ */
+ vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND | VM_DONTDUMP;
vma->vm_ops = &perf_mmap_vmops;

return ret;
@@ -6565,6 +6637,8 @@ set:
if (atomic_read(&event->mmap_count))
goto unlock;

+ old_rb = event->rb;
+
if (output_event) {
/* get the rb we want to redirect to */
rb = ring_buffer_get(output_event);
@@ -6572,16 +6646,28 @@ set:
goto unlock;
}

- old_rb = event->rb;
- rcu_assign_pointer(event->rb, rb);
if (old_rb)
ring_buffer_detach(event, old_rb);
+
+ if (rb)
+ ring_buffer_attach(event, rb);
+
+ rcu_assign_pointer(event->rb, rb);
+
+ if (old_rb) {
+ ring_buffer_put(old_rb);
+ /*
+ * Since we detached before setting the new rb, so that we
+ * could attach the new rb, we could have missed a wakeup.
+ * Provide it now.
+ */
+ wake_up_all(&event->waitq);
+ }
+
ret = 0;
unlock:
mutex_unlock(&event->mmap_mutex);

- if (old_rb)
- ring_buffer_put(old_rb);
out:
return ret;
}
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index eb675c4..41f5685 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -31,6 +31,10 @@ struct ring_buffer {
spinlock_t event_lock;
struct list_head event_list;

+ atomic_t mmap_count;
+ int mmap_locked;
+ struct user_struct *mmap_user;
+
struct perf_event_mmap_page *user_page;
void *data_pages[0];
};

2013-06-03 17:18:21

by Peter Zijlstra

[permalink] [raw]
Subject: Re: OOPS in perf_mmap_close()

On Mon, Jun 03, 2013 at 03:26:45PM +0200, Peter Zijlstra wrote:
> On Thu, May 30, 2013 at 08:51:00AM -0400, Vince Weaver wrote:
> > > I'll go prod, thanks again!
>
>
> OK the below builds and seems to survive both test cases and about 10 minutes
> of fuzzing -- fingers crossed.

OK, looks like I still drop something somewhere. mmap()s still goes to
0. Stupid thing!

2013-06-03 19:25:22

by Peter Zijlstra

[permalink] [raw]
Subject: Re: OOPS in perf_mmap_close()

On Mon, Jun 03, 2013 at 03:26:45PM +0200, Peter Zijlstra wrote:
> On Thu, May 30, 2013 at 08:51:00AM -0400, Vince Weaver wrote:
> > > I'll go prod, thanks again!
>
>
> OK the below builds and seems to survive both test cases and about 10 minutes
> of fuzzing -- fingers crossed.
>
> ---
> @@ -3779,12 +3834,21 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
> return -EINVAL;
>
> WARN_ON_ONCE(event->ctx->parent_ctx);
> +again:
> mutex_lock(&event->mmap_mutex);
> if (event->rb) {
> - if (event->rb->nr_pages == nr_pages)
> - atomic_inc(&event->rb->refcount);
> - else
> + if (event->rb->nr_pages != nr_pages)
{
> ret = -EINVAL;
goto unlock;
}

Seems to cure it... I'm going to let it run over night.

> +
> + if (!atomic_inc_not_zero(&event->rb->mmap_count)) {
> + /*
> + * Raced against perf_mmap_close() through
> + * perf_event_set_output(). Try again, hope for better
> + * luck.
> + */
> + mutex_unlock(&event->mmap_mutex);
> + goto again;
> + }
> goto unlock;
> }
>
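
Reading the quoted hunk together with the three unquoted lines added inline above (the braces and the goto), the start of perf_mmap() ends up roughly as follows. This is a reconstruction from the fragments in this thread, not a verbatim copy of the final file; ret and the unlock: label live elsewhere in the function:

again:
	mutex_lock(&event->mmap_mutex);
	if (event->rb) {
		if (event->rb->nr_pages != nr_pages) {
			ret = -EINVAL;
			goto unlock;
		}

		if (!atomic_inc_not_zero(&event->rb->mmap_count)) {
			/*
			 * Raced against perf_mmap_close() through
			 * perf_event_set_output(). Try again, hope for
			 * better luck.
			 */
			mutex_unlock(&event->mmap_mutex);
			goto again;
		}

		goto unlock;
	}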

2013-06-04 08:44:58

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [tip:perf/urgent] perf: Fix perf mmap bugs


Ingo asked for a delta against the patch already included in -tip.

Vince; perf_fuzzer crashed after about 5670000 iterations -- not sure
you're aware it will crash at times. Anyway at this time accounting was
still good.

---
Subject: perf: Fix mmap() accounting hole

Vince's fuzzer once again found holes. This time it spotted a leak in
the locked page accounting.

When an event had redirected output and its close() was the last
reference to the buffer, we didn't have a vm context to undo the accounting.

Change the code to destroy the buffer on the last munmap() and detach
all redirected events at that time. This provides us the right context
to undo the vm accounting.
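
(Illustration only -- an assumed minimal userspace sequence of the kind
described above, not one of Vince's actual test cases: event b redirects
its output into a's buffer, the buffer's only mapping and a are torn down
first, and the final close(b) then drops the last buffer reference with
no vma left to unaccount against.)

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_sw_event(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_SOFTWARE;
	attr.config = PERF_COUNT_SW_CPU_CLOCK;

	/* measure self on any cpu; error checking omitted for brevity */
	return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
	size_t size = (1 + 8) * sysconf(_SC_PAGESIZE);	/* 1 + 2^n pages */
	int a = open_sw_event();	/* will own the ring buffer */
	int b = open_sw_event();	/* will redirect into it */
	void *buf;

	buf = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, a, 0);
	ioctl(b, PERF_EVENT_IOC_SET_OUTPUT, a);	/* b now holds a ref on a's rb */

	munmap(buf, size);	/* last munmap() of the buffer */
	close(a);
	/*
	 * Without this patch the last rb reference is dropped by the
	 * close() below, where there is no vma, so the locked_vm charge
	 * taken in perf_mmap() is never given back.
	 */
	close(b);
	return 0;
}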

Reported-by: Vince Weaver <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
---
kernel/events/core.c | 228 +++++++++++++++++++++++++++++++---------------
kernel/events/internal.h | 3 +-
2 files changed, 159 insertions(+), 72 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 95edd5a..471e11b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -198,9 +198,6 @@ static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
static void update_context_time(struct perf_event_context *ctx);
static u64 perf_event_time(struct perf_event *event);

-static void ring_buffer_attach(struct perf_event *event,
- struct ring_buffer *rb);
-
void __weak perf_event_print_debug(void) { }

extern __weak const char *perf_pmu_name(void)
@@ -3022,7 +3019,8 @@ static void free_event_rcu(struct rcu_head *head)
kfree(event);
}

-static bool ring_buffer_put(struct ring_buffer *rb);
+static void ring_buffer_put(struct ring_buffer *rb);
+static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb);

static void free_event(struct perf_event *event)
{
@@ -3047,15 +3045,30 @@ static void free_event(struct perf_event *event)
if (has_branch_stack(event)) {
static_key_slow_dec_deferred(&perf_sched_events);
/* is system-wide event */
- if (!(event->attach_state & PERF_ATTACH_TASK))
+ if (!(event->attach_state & PERF_ATTACH_TASK)) {
atomic_dec(&per_cpu(perf_branch_stack_events,
event->cpu));
+ }
}
}

if (event->rb) {
- ring_buffer_put(event->rb);
- event->rb = NULL;
+ struct ring_buffer *rb;
+
+ /*
+ * Can happen when we close an event with re-directed output.
+ *
+ * Since we have a 0 refcount, perf_mmap_close() will skip
+ * over us; possibly making our ring_buffer_put() the last.
+ */
+ mutex_lock(&event->mmap_mutex);
+ rb = event->rb;
+ if (rb) {
+ rcu_assign_pointer(event->rb, NULL);
+ ring_buffer_detach(event, rb);
+ ring_buffer_put(rb); /* could be last */
+ }
+ mutex_unlock(&event->mmap_mutex);
}

if (is_cgroup_event(event))
@@ -3293,30 +3306,13 @@ static unsigned int perf_poll(struct file *file, poll_table *wait)
unsigned int events = POLL_HUP;

/*
- * Race between perf_event_set_output() and perf_poll(): perf_poll()
- * grabs the rb reference but perf_event_set_output() overrides it.
- * Here is the timeline for two threads T1, T2:
- * t0: T1, rb = rcu_dereference(event->rb)
- * t1: T2, old_rb = event->rb
- * t2: T2, event->rb = new rb
- * t3: T2, ring_buffer_detach(old_rb)
- * t4: T1, ring_buffer_attach(rb1)
- * t5: T1, poll_wait(event->waitq)
- *
- * To avoid this problem, we grab mmap_mutex in perf_poll()
- * thereby ensuring that the assignment of the new ring buffer
- * and the detachment of the old buffer appear atomic to perf_poll()
+ * Pin the event->rb by taking event->mmap_mutex; otherwise
+ * perf_event_set_output() can swizzle our rb and make us miss wakeups.
*/
mutex_lock(&event->mmap_mutex);
-
- rcu_read_lock();
- rb = rcu_dereference(event->rb);
- if (rb) {
- ring_buffer_attach(event, rb);
+ rb = event->rb;
+ if (rb)
events = atomic_xchg(&rb->poll, 0);
- }
- rcu_read_unlock();
-
mutex_unlock(&event->mmap_mutex);

poll_wait(file, &event->waitq, wait);
@@ -3626,16 +3622,12 @@ static void ring_buffer_attach(struct perf_event *event,
return;

spin_lock_irqsave(&rb->event_lock, flags);
- if (!list_empty(&event->rb_entry))
- goto unlock;
-
- list_add(&event->rb_entry, &rb->event_list);
-unlock:
+ if (list_empty(&event->rb_entry))
+ list_add(&event->rb_entry, &rb->event_list);
spin_unlock_irqrestore(&rb->event_lock, flags);
}

-static void ring_buffer_detach(struct perf_event *event,
- struct ring_buffer *rb)
+static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb)
{
unsigned long flags;

@@ -3654,13 +3646,10 @@ static void ring_buffer_wakeup(struct perf_event *event)

rcu_read_lock();
rb = rcu_dereference(event->rb);
- if (!rb)
- goto unlock;
-
- list_for_each_entry_rcu(event, &rb->event_list, rb_entry)
- wake_up_all(&event->waitq);
-
-unlock:
+ if (rb) {
+ list_for_each_entry_rcu(event, &rb->event_list, rb_entry)
+ wake_up_all(&event->waitq);
+ }
rcu_read_unlock();
}

@@ -3687,23 +3676,14 @@ static struct ring_buffer *ring_buffer_get(struct perf_event *event)
return rb;
}

-static bool ring_buffer_put(struct ring_buffer *rb)
+static void ring_buffer_put(struct ring_buffer *rb)
{
- struct perf_event *event, *n;
- unsigned long flags;
-
if (!atomic_dec_and_test(&rb->refcount))
- return false;
+ return;

- spin_lock_irqsave(&rb->event_lock, flags);
- list_for_each_entry_safe(event, n, &rb->event_list, rb_entry) {
- list_del_init(&event->rb_entry);
- wake_up_all(&event->waitq);
- }
- spin_unlock_irqrestore(&rb->event_lock, flags);
+ WARN_ON_ONCE(!list_empty(&rb->event_list));

call_rcu(&rb->rcu_head, rb_free_rcu);
- return true;
}

static void perf_mmap_open(struct vm_area_struct *vma)
@@ -3711,28 +3691,100 @@ static void perf_mmap_open(struct vm_area_struct *vma)
struct perf_event *event = vma->vm_file->private_data;

atomic_inc(&event->mmap_count);
+ atomic_inc(&event->rb->mmap_count);
}

+/*
+ * A buffer can be mmap()ed multiple times; either directly through the same
+ * event, or through other events by use of perf_event_set_output().
+ *
+ * In order to undo the VM accounting done by perf_mmap() we need to destroy
+ * the buffer here, where we still have a VM context. This means we need
+ * to detach all events redirecting to us.
+ */
static void perf_mmap_close(struct vm_area_struct *vma)
{
struct perf_event *event = vma->vm_file->private_data;

- if (atomic_dec_and_mutex_lock(&event->mmap_count, &event->mmap_mutex)) {
- struct ring_buffer *rb = event->rb;
- struct user_struct *mmap_user = rb->mmap_user;
- int mmap_locked = rb->mmap_locked;
- unsigned long size = perf_data_size(rb);
+ struct ring_buffer *rb = event->rb;
+ struct user_struct *mmap_user = rb->mmap_user;
+ int mmap_locked = rb->mmap_locked;
+ unsigned long size = perf_data_size(rb);

- rcu_assign_pointer(event->rb, NULL);
- ring_buffer_detach(event, rb);
- mutex_unlock(&event->mmap_mutex);
+ atomic_dec(&rb->mmap_count);
+
+ if (!atomic_dec_and_mutex_lock(&event->mmap_count, &event->mmap_mutex))
+ return;
+
+ /* Detach current event from the buffer. */
+ rcu_assign_pointer(event->rb, NULL);
+ ring_buffer_detach(event, rb);
+ mutex_unlock(&event->mmap_mutex);
+
+ /* If there's still other mmap()s of this buffer, we're done. */
+ if (atomic_read(&rb->mmap_count)) {
+ ring_buffer_put(rb); /* can't be last */
+ return;
+ }

- if (ring_buffer_put(rb)) {
- atomic_long_sub((size >> PAGE_SHIFT) + 1, &mmap_user->locked_vm);
- vma->vm_mm->pinned_vm -= mmap_locked;
- free_uid(mmap_user);
+ /*
+ * No other mmap()s, detach from all other events that might redirect
+ * into the now unreachable buffer. Somewhat complicated by the
+ * fact that rb::event_lock otherwise nests inside mmap_mutex.
+ */
+again:
+ rcu_read_lock();
+ list_for_each_entry_rcu(event, &rb->event_list, rb_entry) {
+ if (!atomic_long_inc_not_zero(&event->refcount)) {
+ /*
+ * This event is en-route to free_event() which will
+ * detach it and remove it from the list.
+ */
+ continue;
}
+ rcu_read_unlock();
+
+ mutex_lock(&event->mmap_mutex);
+ /*
+ * Check we didn't race with perf_event_set_output() which can
+ * swizzle the rb from under us while we were waiting to
+ * acquire mmap_mutex.
+ *
+ * If we find a different rb; ignore this event, a next
+ * iteration will no longer find it on the list. We have to
+ * still restart the iteration to make sure we're not now
+ * iterating the wrong list.
+ */
+ if (event->rb == rb) {
+ rcu_assign_pointer(event->rb, NULL);
+ ring_buffer_detach(event, rb);
+ ring_buffer_put(rb); /* can't be last, we still have one */
+ }
+ mutex_unlock(&event->mmap_mutex);
+ put_event(event);
+
+ /*
+ * Restart the iteration; either we're on the wrong list or
+ * destroyed its integrity by doing a deletion.
+ */
+ goto again;
}
+ rcu_read_unlock();
+
+ /*
+ * It could be there's still a few 0-ref events on the list; they'll
+ * get cleaned up by free_event() -- they'll also still have their
+ * ref on the rb and will free it whenever they are done with it.
+ *
+ * Aside from that, this buffer is 'fully' detached and unmapped,
+ * undo the VM accounting.
+ */
+
+ atomic_long_sub((size >> PAGE_SHIFT) + 1, &mmap_user->locked_vm);
+ vma->vm_mm->pinned_vm -= mmap_locked;
+ free_uid(mmap_user);
+
+ ring_buffer_put(rb); /* could be last */
}

static const struct vm_operations_struct perf_mmap_vmops = {
@@ -3782,10 +3834,24 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
return -EINVAL;

WARN_ON_ONCE(event->ctx->parent_ctx);
+again:
mutex_lock(&event->mmap_mutex);
if (event->rb) {
- if (event->rb->nr_pages != nr_pages)
+ if (event->rb->nr_pages != nr_pages) {
ret = -EINVAL;
+ goto unlock;
+ }
+
+ if (!atomic_inc_not_zero(&event->rb->mmap_count)) {
+ /*
+ * Raced against perf_mmap_close() through
+ * perf_event_set_output(). Try again, hope for better
+ * luck.
+ */
+ mutex_unlock(&event->mmap_mutex);
+ goto again;
+ }
+
goto unlock;
}

@@ -3827,12 +3893,14 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
goto unlock;
}

+ atomic_set(&rb->mmap_count, 1);
rb->mmap_locked = extra;
rb->mmap_user = get_current_user();

atomic_long_add(user_extra, &user->locked_vm);
vma->vm_mm->pinned_vm += extra;

+ ring_buffer_attach(event, rb);
rcu_assign_pointer(event->rb, rb);

perf_event_update_userpage(event);
@@ -3842,6 +3910,10 @@ unlock:
atomic_inc(&event->mmap_count);
mutex_unlock(&event->mmap_mutex);

+ /*
+ * Since pinned accounting is per vm we cannot allow fork() to copy our
+ * vma.
+ */
vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND | VM_DONTDUMP;
vma->vm_ops = &perf_mmap_vmops;

@@ -6568,6 +6640,8 @@ set:
if (atomic_read(&event->mmap_count))
goto unlock;

+ old_rb = event->rb;
+
if (output_event) {
/* get the rb we want to redirect to */
rb = ring_buffer_get(output_event);
@@ -6575,16 +6649,28 @@ set:
goto unlock;
}

- old_rb = event->rb;
- rcu_assign_pointer(event->rb, rb);
if (old_rb)
ring_buffer_detach(event, old_rb);
+
+ if (rb)
+ ring_buffer_attach(event, rb);
+
+ rcu_assign_pointer(event->rb, rb);
+
+ if (old_rb) {
+ ring_buffer_put(old_rb);
+ /*
+ * Since we detached before setting the new rb, so that we
+ * could attach the new rb, we could have missed a wakeup.
+ * Provide it now.
+ */
+ wake_up_all(&event->waitq);
+ }
+
ret = 0;
unlock:
mutex_unlock(&event->mmap_mutex);

- if (old_rb)
- ring_buffer_put(old_rb);
out:
return ret;
}
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 5bc6c8e..ca65997 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -31,7 +31,8 @@ struct ring_buffer {
spinlock_t event_lock;
struct list_head event_list;

- int mmap_locked;
+ atomic_t mmap_count;
+ unsigned long mmap_locked;
struct user_struct *mmap_user;

struct perf_event_mmap_page *user_page;

2013-06-05 11:55:53

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [tip:perf/urgent] perf: Fix perf mmap bugs

On Tue, Jun 04, 2013 at 10:44:21AM +0200, Peter Zijlstra wrote:
>
> Ingo asked for a delta against the patch already included in -tip.
>
> Vince; perf_fuzzer crashed after about 5670000 iterations -- not sure
> you're aware it will crash at times. Anyway at this time accounting was
> still good.
>
> ---
> Subject: perf: Fix mmap() accounting hole
>
> Vince's fuzzer once again found holes. This time it spotted a leak in
> the locked page accounting.
>
> When an event had redirected output and its close() was the last
> reference to the buffer, we didn't have a vm context to undo the accounting.
>
> Change the code to destroy the buffer on the last munmap() and detach
> all redirected events at that time. This provides us the right context
> to undo the vm accounting.
>
> Reported-by: Vince Weaver <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>

Vince, can you confirm?

2013-06-05 15:55:53

by Vince Weaver

[permalink] [raw]
Subject: Re: OOPS in perf_mmap_close()

On Mon, 3 Jun 2013, Peter Zijlstra wrote:

> Seems to cure it... I'm going to let it run over night.

[sorry for the delay in testing, was at the hospital waiting for
a new baby, and the free wi-fi had the ssh port blocked]

I'm running 3.10-rc4 with your patch plus the 3-line fixup applied and my
various tests are solid so far.

It would be nice to get this applied; without the patch it's trivial to
oops or lock up any kernel going back to 3.2, and probably earlier (I haven't
tested older than 3.2 yet).

Vince

2013-06-05 16:55:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: OOPS in perf_mmap_close()

On Wed, Jun 05, 2013 at 11:54:52AM -0400, Vince Weaver wrote:
> On Mon, 3 Jun 2013, Peter Zijlstra wrote:
>
> > Seems to cure it... I'm going to let it run over night.
>
> [sorry for the delay in testing, was at the hospital waiting for
> a new baby, and the free wi-fi had the ssh port blocked]

Congrats!!

Subject: [tip:perf/core] perf: Fix mmap() accounting hole

Commit-ID: 9bb5d40cd93c9dd4be74834b1dcb1ba03629716b
Gitweb: http://git.kernel.org/tip/9bb5d40cd93c9dd4be74834b1dcb1ba03629716b
Author: Peter Zijlstra <[email protected]>
AuthorDate: Tue, 4 Jun 2013 10:44:21 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 19 Jun 2013 12:44:13 +0200

perf: Fix mmap() accounting hole

Vince's fuzzer once again found holes. This time it spotted a leak in
the locked page accounting.

When an event had redirected output and its close() was the last
reference to the buffer, we didn't have a vm context to undo the accounting.

Change the code to destroy the buffer on the last munmap() and detach
all redirected events at that time. This provides us the right context
to undo the vm accounting.

Reported-and-tested-by: Vince Weaver <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Cc: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/events/core.c | 228 ++++++++++++++++++++++++++++++++---------------
kernel/events/internal.h | 3 +-
2 files changed, 159 insertions(+), 72 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index ae752cd..b391907 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -196,9 +196,6 @@ static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
static void update_context_time(struct perf_event_context *ctx);
static u64 perf_event_time(struct perf_event *event);

-static void ring_buffer_attach(struct perf_event *event,
- struct ring_buffer *rb);
-
void __weak perf_event_print_debug(void) { }

extern __weak const char *perf_pmu_name(void)
@@ -2917,7 +2914,8 @@ static void free_event_rcu(struct rcu_head *head)
kfree(event);
}

-static bool ring_buffer_put(struct ring_buffer *rb);
+static void ring_buffer_put(struct ring_buffer *rb);
+static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb);

static void free_event(struct perf_event *event)
{
@@ -2942,15 +2940,30 @@ static void free_event(struct perf_event *event)
if (has_branch_stack(event)) {
static_key_slow_dec_deferred(&perf_sched_events);
/* is system-wide event */
- if (!(event->attach_state & PERF_ATTACH_TASK))
+ if (!(event->attach_state & PERF_ATTACH_TASK)) {
atomic_dec(&per_cpu(perf_branch_stack_events,
event->cpu));
+ }
}
}

if (event->rb) {
- ring_buffer_put(event->rb);
- event->rb = NULL;
+ struct ring_buffer *rb;
+
+ /*
+ * Can happen when we close an event with re-directed output.
+ *
+ * Since we have a 0 refcount, perf_mmap_close() will skip
+ * over us; possibly making our ring_buffer_put() the last.
+ */
+ mutex_lock(&event->mmap_mutex);
+ rb = event->rb;
+ if (rb) {
+ rcu_assign_pointer(event->rb, NULL);
+ ring_buffer_detach(event, rb);
+ ring_buffer_put(rb); /* could be last */
+ }
+ mutex_unlock(&event->mmap_mutex);
}

if (is_cgroup_event(event))
@@ -3188,30 +3201,13 @@ static unsigned int perf_poll(struct file *file, poll_table *wait)
unsigned int events = POLL_HUP;

/*
- * Race between perf_event_set_output() and perf_poll(): perf_poll()
- * grabs the rb reference but perf_event_set_output() overrides it.
- * Here is the timeline for two threads T1, T2:
- * t0: T1, rb = rcu_dereference(event->rb)
- * t1: T2, old_rb = event->rb
- * t2: T2, event->rb = new rb
- * t3: T2, ring_buffer_detach(old_rb)
- * t4: T1, ring_buffer_attach(rb1)
- * t5: T1, poll_wait(event->waitq)
- *
- * To avoid this problem, we grab mmap_mutex in perf_poll()
- * thereby ensuring that the assignment of the new ring buffer
- * and the detachment of the old buffer appear atomic to perf_poll()
+ * Pin the event->rb by taking event->mmap_mutex; otherwise
+ * perf_event_set_output() can swizzle our rb and make us miss wakeups.
*/
mutex_lock(&event->mmap_mutex);
-
- rcu_read_lock();
- rb = rcu_dereference(event->rb);
- if (rb) {
- ring_buffer_attach(event, rb);
+ rb = event->rb;
+ if (rb)
events = atomic_xchg(&rb->poll, 0);
- }
- rcu_read_unlock();
-
mutex_unlock(&event->mmap_mutex);

poll_wait(file, &event->waitq, wait);
@@ -3521,16 +3517,12 @@ static void ring_buffer_attach(struct perf_event *event,
return;

spin_lock_irqsave(&rb->event_lock, flags);
- if (!list_empty(&event->rb_entry))
- goto unlock;
-
- list_add(&event->rb_entry, &rb->event_list);
-unlock:
+ if (list_empty(&event->rb_entry))
+ list_add(&event->rb_entry, &rb->event_list);
spin_unlock_irqrestore(&rb->event_lock, flags);
}

-static void ring_buffer_detach(struct perf_event *event,
- struct ring_buffer *rb)
+static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb)
{
unsigned long flags;

@@ -3549,13 +3541,10 @@ static void ring_buffer_wakeup(struct perf_event *event)

rcu_read_lock();
rb = rcu_dereference(event->rb);
- if (!rb)
- goto unlock;
-
- list_for_each_entry_rcu(event, &rb->event_list, rb_entry)
- wake_up_all(&event->waitq);
-
-unlock:
+ if (rb) {
+ list_for_each_entry_rcu(event, &rb->event_list, rb_entry)
+ wake_up_all(&event->waitq);
+ }
rcu_read_unlock();
}

@@ -3582,23 +3571,14 @@ static struct ring_buffer *ring_buffer_get(struct perf_event *event)
return rb;
}

-static bool ring_buffer_put(struct ring_buffer *rb)
+static void ring_buffer_put(struct ring_buffer *rb)
{
- struct perf_event *event, *n;
- unsigned long flags;
-
if (!atomic_dec_and_test(&rb->refcount))
- return false;
+ return;

- spin_lock_irqsave(&rb->event_lock, flags);
- list_for_each_entry_safe(event, n, &rb->event_list, rb_entry) {
- list_del_init(&event->rb_entry);
- wake_up_all(&event->waitq);
- }
- spin_unlock_irqrestore(&rb->event_lock, flags);
+ WARN_ON_ONCE(!list_empty(&rb->event_list));

call_rcu(&rb->rcu_head, rb_free_rcu);
- return true;
}

static void perf_mmap_open(struct vm_area_struct *vma)
@@ -3606,28 +3586,100 @@ static void perf_mmap_open(struct vm_area_struct *vma)
struct perf_event *event = vma->vm_file->private_data;

atomic_inc(&event->mmap_count);
+ atomic_inc(&event->rb->mmap_count);
}

+/*
+ * A buffer can be mmap()ed multiple times; either directly through the same
+ * event, or through other events by use of perf_event_set_output().
+ *
+ * In order to undo the VM accounting done by perf_mmap() we need to destroy
+ * the buffer here, where we still have a VM context. This means we need
+ * to detach all events redirecting to us.
+ */
static void perf_mmap_close(struct vm_area_struct *vma)
{
struct perf_event *event = vma->vm_file->private_data;

- if (atomic_dec_and_mutex_lock(&event->mmap_count, &event->mmap_mutex)) {
- struct ring_buffer *rb = event->rb;
- struct user_struct *mmap_user = rb->mmap_user;
- int mmap_locked = rb->mmap_locked;
- unsigned long size = perf_data_size(rb);
+ struct ring_buffer *rb = event->rb;
+ struct user_struct *mmap_user = rb->mmap_user;
+ int mmap_locked = rb->mmap_locked;
+ unsigned long size = perf_data_size(rb);

- rcu_assign_pointer(event->rb, NULL);
- ring_buffer_detach(event, rb);
- mutex_unlock(&event->mmap_mutex);
+ atomic_dec(&rb->mmap_count);
+
+ if (!atomic_dec_and_mutex_lock(&event->mmap_count, &event->mmap_mutex))
+ return;
+
+ /* Detach current event from the buffer. */
+ rcu_assign_pointer(event->rb, NULL);
+ ring_buffer_detach(event, rb);
+ mutex_unlock(&event->mmap_mutex);
+
+ /* If there's still other mmap()s of this buffer, we're done. */
+ if (atomic_read(&rb->mmap_count)) {
+ ring_buffer_put(rb); /* can't be last */
+ return;
+ }

- if (ring_buffer_put(rb)) {
- atomic_long_sub((size >> PAGE_SHIFT) + 1, &mmap_user->locked_vm);
- vma->vm_mm->pinned_vm -= mmap_locked;
- free_uid(mmap_user);
+ /*
+ * No other mmap()s, detach from all other events that might redirect
+ * into the now unreachable buffer. Somewhat complicated by the
+ * fact that rb::event_lock otherwise nests inside mmap_mutex.
+ */
+again:
+ rcu_read_lock();
+ list_for_each_entry_rcu(event, &rb->event_list, rb_entry) {
+ if (!atomic_long_inc_not_zero(&event->refcount)) {
+ /*
+ * This event is en-route to free_event() which will
+ * detach it and remove it from the list.
+ */
+ continue;
}
+ rcu_read_unlock();
+
+ mutex_lock(&event->mmap_mutex);
+ /*
+ * Check we didn't race with perf_event_set_output() which can
+ * swizzle the rb from under us while we were waiting to
+ * acquire mmap_mutex.
+ *
+ * If we find a different rb; ignore this event, a next
+ * iteration will no longer find it on the list. We have to
+ * still restart the iteration to make sure we're not now
+ * iterating the wrong list.
+ */
+ if (event->rb == rb) {
+ rcu_assign_pointer(event->rb, NULL);
+ ring_buffer_detach(event, rb);
+ ring_buffer_put(rb); /* can't be last, we still have one */
+ }
+ mutex_unlock(&event->mmap_mutex);
+ put_event(event);
+
+ /*
+ * Restart the iteration; either we're on the wrong list or
+ * destroyed its integrity by doing a deletion.
+ */
+ goto again;
}
+ rcu_read_unlock();
+
+ /*
+ * It could be there's still a few 0-ref events on the list; they'll
+ * get cleaned up by free_event() -- they'll also still have their
+ * ref on the rb and will free it whenever they are done with it.
+ *
+ * Aside from that, this buffer is 'fully' detached and unmapped,
+ * undo the VM accounting.
+ */
+
+ atomic_long_sub((size >> PAGE_SHIFT) + 1, &mmap_user->locked_vm);
+ vma->vm_mm->pinned_vm -= mmap_locked;
+ free_uid(mmap_user);
+
+ ring_buffer_put(rb); /* could be last */
}

static const struct vm_operations_struct perf_mmap_vmops = {
@@ -3677,10 +3729,24 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
return -EINVAL;

WARN_ON_ONCE(event->ctx->parent_ctx);
+again:
mutex_lock(&event->mmap_mutex);
if (event->rb) {
- if (event->rb->nr_pages != nr_pages)
+ if (event->rb->nr_pages != nr_pages) {
ret = -EINVAL;
+ goto unlock;
+ }
+
+ if (!atomic_inc_not_zero(&event->rb->mmap_count)) {
+ /*
+ * Raced against perf_mmap_close() through
+ * perf_event_set_output(). Try again, hope for better
+ * luck.
+ */
+ mutex_unlock(&event->mmap_mutex);
+ goto again;
+ }
+
goto unlock;
}

@@ -3722,12 +3788,14 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
goto unlock;
}

+ atomic_set(&rb->mmap_count, 1);
rb->mmap_locked = extra;
rb->mmap_user = get_current_user();

atomic_long_add(user_extra, &user->locked_vm);
vma->vm_mm->pinned_vm += extra;

+ ring_buffer_attach(event, rb);
rcu_assign_pointer(event->rb, rb);

perf_event_update_userpage(event);
@@ -3737,6 +3805,10 @@ unlock:
atomic_inc(&event->mmap_count);
mutex_unlock(&event->mmap_mutex);

+ /*
+ * Since pinned accounting is per vm we cannot allow fork() to copy our
+ * vma.
+ */
vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND | VM_DONTDUMP;
vma->vm_ops = &perf_mmap_vmops;

@@ -6415,6 +6487,8 @@ set:
if (atomic_read(&event->mmap_count))
goto unlock;

+ old_rb = event->rb;
+
if (output_event) {
/* get the rb we want to redirect to */
rb = ring_buffer_get(output_event);
@@ -6422,16 +6496,28 @@ set:
goto unlock;
}

- old_rb = event->rb;
- rcu_assign_pointer(event->rb, rb);
if (old_rb)
ring_buffer_detach(event, old_rb);
+
+ if (rb)
+ ring_buffer_attach(event, rb);
+
+ rcu_assign_pointer(event->rb, rb);
+
+ if (old_rb) {
+ ring_buffer_put(old_rb);
+ /*
+ * Since we detached before setting the new rb, so that we
+ * could attach the new rb, we could have missed a wakeup.
+ * Provide it now.
+ */
+ wake_up_all(&event->waitq);
+ }
+
ret = 0;
unlock:
mutex_unlock(&event->mmap_mutex);

- if (old_rb)
- ring_buffer_put(old_rb);
out:
return ret;
}
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 5bc6c8e..ca65997 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -31,7 +31,8 @@ struct ring_buffer {
spinlock_t event_lock;
struct list_head event_list;

- int mmap_locked;
+ atomic_t mmap_count;
+ unsigned long mmap_locked;
struct user_struct *mmap_user;

struct perf_event_mmap_page *user_page;