2009-01-29 20:03:42

by Lee Schermerhorn

[permalink] [raw]
Subject: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments

Against: 2.6.28.2

We see a:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
IP: [<ffffffff80336805>] __downgrade_write+0x43/0xb4
:
Call Trace:
[<ffffffff80284dec>] mlock_vma_pages_range+0x53/0xc3
[<ffffffff80287021>] mmap_region+0x389/0x471
[<ffffffff802876b5>] do_mmap_pgoff+0x308/0x36d
[<ffffffff802100a8>] sys_mmap+0x8b/0x110
[<ffffffff8020bc8b>] system_call_fastpath+0x16/0x1b

when mmap_region() merges two segments of a file mmap()ed with
MAP_LOCKED [VM_LOCKED].

This patch provides a simplistic fix, suitable, I hope, for -stable.

Pass the merged vma to mlock_vma_pages_range().

Tested with: http://free.linux.hp.com/~lts/Tests/mmap_lock.c

I'll attempt a rework of mlock_vma_pages_range(), et al, to
eliminate the 'vma' arg for 29-rc?.

Signed-off-by: Lee Schermerhorn <[email protected]>

mm/mmap.c | 23 ++++++++++++++---------
1 file changed, 14 insertions(+), 9 deletions(-)

Index: linux-2.6.28.2/mm/mmap.c
===================================================================
--- linux-2.6.28.2.orig/mm/mmap.c 2009-01-29 11:34:08.000000000 -0500
+++ linux-2.6.28.2/mm/mmap.c 2009-01-29 12:22:14.000000000 -0500
@@ -1094,7 +1094,7 @@ unsigned long mmap_region(struct file *f
int accountable)
{
struct mm_struct *mm = current->mm;
- struct vm_area_struct *vma, *prev;
+ struct vm_area_struct *vma, *prev, *merged = NULL;
int correct_wcount = 0;
int error;
struct rb_node **rb_link, *rb_parent;
@@ -1207,14 +1207,19 @@ munmap_back:
if (vma_wants_writenotify(vma))
vma->vm_page_prot = vm_get_page_prot(vm_flags & ~VM_SHARED);

- if (file && vma_merge(mm, prev, addr, vma->vm_end,
- vma->vm_flags, NULL, file, pgoff, vma_policy(vma))) {
- mpol_put(vma_policy(vma));
- kmem_cache_free(vm_area_cachep, vma);
- fput(file);
- if (vm_flags & VM_EXECUTABLE)
- removed_exe_file_vma(mm);
- } else {
+ if (file) {
+ merged = vma_merge(mm, prev, addr, vma->vm_end, vma->vm_flags,
+ NULL, file, pgoff, vma_policy(vma));
+ if (merged) {
+ mpol_put(vma_policy(vma));
+ kmem_cache_free(vm_area_cachep, vma);
+ fput(file);
+ if (vm_flags & VM_EXECUTABLE)
+ removed_exe_file_vma(mm);
+ vma = merged; /* for mlock_vma_pages_range() */
+ }
+ }
+ if (!merged) {
vma_link(mm, vma, prev, rb_link, rb_parent);
file = vma->vm_file;
}


2009-01-29 20:34:35

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments



On Thu, 29 Jan 2009, Lee Schermerhorn wrote:
>
> This patch provides a simplistic fix, suitable, I hope, for -stable.

So I've looked a bit at that function, and I wonder why we can't just do
the "vma_merge()" earlier - before we allocate that whole vma to begin
with.

We already do that for anonymous mappings _anyway_, using the exact same
"vma_merge()" function. So how about just expanding on that logic, and
simplifying everything at the same time?

THIS PATCH IS TOTALLY UNTESTED!

There may be some reason why we didn't do it this way to begin with, but
this woudl actually seem to be the more logical way to fix the problem: by
simplifying the whole function to the point where we never try to do that
"try to merge late and then fix up the fact that we have to free the vma
we allocated earlier" thing at all!

Hmm? This would remove a more lines than it adds, and actually makes
things more readable. If it works. Anybody see why it wouldn't?

Linus

---
mm/mmap.c | 30 ++++++++++--------------------
1 files changed, 10 insertions(+), 20 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 8d95902..3f78ead 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -769,6 +769,10 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
if (vm_flags & VM_SPECIAL)
return NULL;

+ /* Anonymous shared mappings are unsharable */
+ if ((vm_flags & VM_SHARED) && !file)
+ return NULL;
+
if (prev)
next = prev->vm_next;
else
@@ -1134,16 +1138,11 @@ munmap_back:
}

/*
- * Can we just expand an old private anonymous mapping?
- * The VM_SHARED test is necessary because shmem_zero_setup
- * will create the file object for a shared anonymous map below.
+ * Can we just expand an old mapping?
*/
- if (!file && !(vm_flags & VM_SHARED)) {
- vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
- NULL, NULL, pgoff, NULL);
- if (vma)
- goto out;
- }
+ vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff, NULL);
+ if (vma)
+ goto out;

/*
* Determine the object being mapped and call the appropriate
@@ -1206,17 +1205,8 @@ munmap_back:
if (vma_wants_writenotify(vma))
vma->vm_page_prot = vm_get_page_prot(vm_flags & ~VM_SHARED);

- if (file && vma_merge(mm, prev, addr, vma->vm_end,
- vma->vm_flags, NULL, file, pgoff, vma_policy(vma))) {
- mpol_put(vma_policy(vma));
- kmem_cache_free(vm_area_cachep, vma);
- fput(file);
- if (vm_flags & VM_EXECUTABLE)
- removed_exe_file_vma(mm);
- } else {
- vma_link(mm, vma, prev, rb_link, rb_parent);
- file = vma->vm_file;
- }
+ vma_link(mm, vma, prev, rb_link, rb_parent);
+ file = vma->vm_file;

/* Once vma denies write, undo our temporary denial count */
if (correct_wcount)

2009-01-29 20:49:52

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments



On Thu, 29 Jan 2009, Linus Torvalds wrote:
>
> THIS PATCH IS TOTALLY UNTESTED!

Well, it boots. FWIW. I've not really tested anything interesting with it,
but any potential breakage is at least not catastrophic and immediate.

> diff --git a/mm/mmap.c b/mm/mmap.c
> index 8d95902..3f78ead 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -769,6 +769,10 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
> if (vm_flags & VM_SPECIAL)
> return NULL;
>
> + /* Anonymous shared mappings are unsharable */
> + if ((vm_flags & VM_SHARED) && !file)
> + return NULL;
> +

.. and I think this part of it is actually unnecessary, because what
happens is that a shared anon mapping is turned into a shmem mapping when
it is inserted, and that actually ends up allocating a file for it. So the
vma->vm_file for anon mappings will not match a NULL file pointer
_anyway_, so there's no way it would end up merging.

So my patch can be further simplified, I think, to just the following.
Even more total lines removed.

I still want somebody else to look at and think about it, though.

Linus

---
mm/mmap.c | 26 ++++++--------------------
1 files changed, 6 insertions(+), 20 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 8d95902..d3fa10a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1134,16 +1134,11 @@ munmap_back:
}

/*
- * Can we just expand an old private anonymous mapping?
- * The VM_SHARED test is necessary because shmem_zero_setup
- * will create the file object for a shared anonymous map below.
+ * Can we just expand an old mapping?
*/
- if (!file && !(vm_flags & VM_SHARED)) {
- vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
- NULL, NULL, pgoff, NULL);
- if (vma)
- goto out;
- }
+ vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff, NULL);
+ if (vma)
+ goto out;

/*
* Determine the object being mapped and call the appropriate
@@ -1206,17 +1201,8 @@ munmap_back:
if (vma_wants_writenotify(vma))
vma->vm_page_prot = vm_get_page_prot(vm_flags & ~VM_SHARED);

- if (file && vma_merge(mm, prev, addr, vma->vm_end,
- vma->vm_flags, NULL, file, pgoff, vma_policy(vma))) {
- mpol_put(vma_policy(vma));
- kmem_cache_free(vm_area_cachep, vma);
- fput(file);
- if (vm_flags & VM_EXECUTABLE)
- removed_exe_file_vma(mm);
- } else {
- vma_link(mm, vma, prev, rb_link, rb_parent);
- file = vma->vm_file;
- }
+ vma_link(mm, vma, prev, rb_link, rb_parent);
+ file = vma->vm_file;

/* Once vma denies write, undo our temporary denial count */
if (correct_wcount)

2009-01-29 22:34:00

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments

On Thu, 29 Jan 2009, Linus Torvalds wrote:
> On Thu, 29 Jan 2009, Linus Torvalds wrote:
> >
> > THIS PATCH IS TOTALLY UNTESTED!

Sorry, I was away yesterday, and not yet caught up with things today.

I certainly agree that it would be much nicer not to allocated a vma
in one place, then later throw it away on finding it can be merged:
your patch looks a plausible improvement.

The problem has always been that file->f_op->mmap(file, vma) on a
special file can do various strange things, changing surprising
fields of the struct vma passed to it, changing its mergeability.

Now it may well be that every driver which does something strange
there already sets one of the VM_SPECIAL flags which prevent merging,
or can easily be fixed to do so, or otherwise clearly cannot pose a
problem.

But I need to mull it over and do some checking tomorrow:
anything I say tonight, I'd probably change my mind about.

Hugh

2009-01-29 22:47:23

by Maksim Yevmenkin

[permalink] [raw]
Subject: Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments

On Thu, Jan 29, 2009 at 12:48 PM, Linus Torvalds
<[email protected]> wrote:
>
> On Thu, 29 Jan 2009, Linus Torvalds wrote:
>>
>> THIS PATCH IS TOTALLY UNTESTED!
>
> Well, it boots. FWIW. I've not really tested anything interesting with it,
> but any potential breakage is at least not catastrophic and immediate.
>
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index 8d95902..3f78ead 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -769,6 +769,10 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
>> if (vm_flags & VM_SPECIAL)
>> return NULL;
>>
>> + /* Anonymous shared mappings are unsharable */
>> + if ((vm_flags & VM_SHARED) && !file)
>> + return NULL;
>> +
>
> .. and I think this part of it is actually unnecessary, because what
> happens is that a shared anon mapping is turned into a shmem mapping when
> it is inserted, and that actually ends up allocating a file for it. So the
> vma->vm_file for anon mappings will not match a NULL file pointer
> _anyway_, so there's no way it would end up merging.
>
> So my patch can be further simplified, I think, to just the following.
> Even more total lines removed.
>
> I still want somebody else to look at and think about it, though.

Just to confirm. This patch also appear to fix the immediate issue for us.

thanks,
max

2009-01-29 22:50:01

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments

Maksim Yevmenkin wrote:
> On Thu, Jan 29, 2009 at 12:48 PM, Linus Torvalds
> <[email protected]> wrote:
>> On Thu, 29 Jan 2009, Linus Torvalds wrote:
>>> THIS PATCH IS TOTALLY UNTESTED!
>> Well, it boots. FWIW. I've not really tested anything interesting with it,
>> but any potential breakage is at least not catastrophic and immediate.
>>
>>> diff --git a/mm/mmap.c b/mm/mmap.c
>>> index 8d95902..3f78ead 100644
>>> --- a/mm/mmap.c
>>> +++ b/mm/mmap.c
>>> @@ -769,6 +769,10 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
>>> if (vm_flags & VM_SPECIAL)
>>> return NULL;
>>>
>>> + /* Anonymous shared mappings are unsharable */
>>> + if ((vm_flags & VM_SHARED) && !file)
>>> + return NULL;
>>> +
>> .. and I think this part of it is actually unnecessary, because what
>> happens is that a shared anon mapping is turned into a shmem mapping when
>> it is inserted, and that actually ends up allocating a file for it. So the
>> vma->vm_file for anon mappings will not match a NULL file pointer
>> _anyway_, so there's no way it would end up merging.
>>
>> So my patch can be further simplified, I think, to just the following.
>> Even more total lines removed.
>>
>> I still want somebody else to look at and think about it, though.
>
> Just to confirm. This patch also appear to fix the immediate issue for us.

Is there a (small) test program available?

Thanks,
--
~Randy

2009-01-29 23:03:35

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments



On Thu, 29 Jan 2009, Hugh Dickins wrote:
>
> The problem has always been that file->f_op->mmap(file, vma) on a
> special file can do various strange things, changing surprising
> fields of the struct vma passed to it, changing its mergeability.

Ahh, you're right. That's somewhat bogus, but it does explain why we have
those two separate merge_vma() calls.

And as you also point out:

> Now it may well be that every driver which does something strange
> there already sets one of the VM_SPECIAL flags which prevent merging,
> or can easily be fixed to do so, or otherwise clearly cannot pose a
> problem.

.. and I think this is the right answer. If a device driver really does
something as odd as play games with the offset that would make merging
wrong, it had better set some of the VM_SPECIAL bits (presumably VM_IO) in
order to never merge at all.

I added DaveM to the cc, since he's the one who is credited with the
comment saying that several drivers change addr (vm_start). I only really
see one, namely bf54x-lq043fb.c, and that one really depends on the whole
nommu thing (ie it seems to simply require a particular virtual address,
broken as that may be).

Btw, changing vm_start is actually very very wrong - it cannot work in
general. Why? Because we earlier checked the "overlapping mmap" case with
the original vm_start, an a driver that changes its vm_start in its mmap()
routine would cause odd issues there. We also pass in "prev" to vma_link,
so it could end up in totally the wrong place.

So changing vm_start is a sign of a driver bug, nothing less.

But there are more drivers who do change vm_pgoff, which is at least valid
(if a bit dodgy), and is indeed a mergeability issue. But at least a
subset of those do already set VM_IO (eg fbmem.c - I only looked at one,
and was happy that it already did that).

As background for Davem, here's the patch once more.

Linus

---
mm/mmap.c | 26 ++++++--------------------
1 files changed, 6 insertions(+), 20 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 8d95902..d3fa10a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1134,16 +1134,11 @@ munmap_back:
}

/*
- * Can we just expand an old private anonymous mapping?
- * The VM_SHARED test is necessary because shmem_zero_setup
- * will create the file object for a shared anonymous map below.
+ * Can we just expand an old mapping?
*/
- if (!file && !(vm_flags & VM_SHARED)) {
- vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
- NULL, NULL, pgoff, NULL);
- if (vma)
- goto out;
- }
+ vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff, NULL);
+ if (vma)
+ goto out;

/*
* Determine the object being mapped and call the appropriate
@@ -1206,17 +1201,8 @@ munmap_back:
if (vma_wants_writenotify(vma))
vma->vm_page_prot = vm_get_page_prot(vm_flags & ~VM_SHARED);

- if (file && vma_merge(mm, prev, addr, vma->vm_end,
- vma->vm_flags, NULL, file, pgoff, vma_policy(vma))) {
- mpol_put(vma_policy(vma));
- kmem_cache_free(vm_area_cachep, vma);
- fput(file);
- if (vm_flags & VM_EXECUTABLE)
- removed_exe_file_vma(mm);
- } else {
- vma_link(mm, vma, prev, rb_link, rb_parent);
- file = vma->vm_file;
- }
+ vma_link(mm, vma, prev, rb_link, rb_parent);
+ file = vma->vm_file;

/* Once vma denies write, undo our temporary denial count */
if (correct_wcount)

2009-01-29 23:31:19

by Maksim Yevmenkin

[permalink] [raw]
Subject: Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments

On Thu, Jan 29, 2009 at 2:48 PM, Randy Dunlap <[email protected]> wrote:
> Maksim Yevmenkin wrote:
>> On Thu, Jan 29, 2009 at 12:48 PM, Linus Torvalds
>> <[email protected]> wrote:
>>> On Thu, 29 Jan 2009, Linus Torvalds wrote:
>>>> THIS PATCH IS TOTALLY UNTESTED!
>>> Well, it boots. FWIW. I've not really tested anything interesting with it,
>>> but any potential breakage is at least not catastrophic and immediate.
>>>
>>>> diff --git a/mm/mmap.c b/mm/mmap.c
>>>> index 8d95902..3f78ead 100644
>>>> --- a/mm/mmap.c
>>>> +++ b/mm/mmap.c
>>>> @@ -769,6 +769,10 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
>>>> if (vm_flags & VM_SPECIAL)
>>>> return NULL;
>>>>
>>>> + /* Anonymous shared mappings are unsharable */
>>>> + if ((vm_flags & VM_SHARED) && !file)
>>>> + return NULL;
>>>> +
>>> .. and I think this part of it is actually unnecessary, because what
>>> happens is that a shared anon mapping is turned into a shmem mapping when
>>> it is inserted, and that actually ends up allocating a file for it. So the
>>> vma->vm_file for anon mappings will not match a NULL file pointer
>>> _anyway_, so there's no way it would end up merging.
>>>
>>> So my patch can be further simplified, I think, to just the following.
>>> Even more total lines removed.
>>>
>>> I still want somebody else to look at and think about it, though.
>>
>> Just to confirm. This patch also appear to fix the immediate issue for us.
>
> Is there a (small) test program available?

Yes, it was in the original (first) email. Here it is again

/*
* Program to provoke kernel NULL pointer de-reference during
* mmap(...MAP_LOCKED...) in Linux 2.6.28.
*
* 1. Create a 32KB test file in /tmp (avoids mlock limit on all recent
* Linuxes).
* 2. mmap it with MAP_LOCKED from top to bottom. (Provokes the oops,
* since vmas can be merged in this case.)
* 3. Clean up.
*
* Compile:
*
* gcc maplock-bug.c -o maplog-bug
*/

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/mman.h>

#define SIZE (32*1024) /* Will get rounded down to page size if nec. */

static char tmp[] = "./maplock-bug.XXXXXX";
static char junkbuf[SIZE];

int
main(void)
{
int fd;
int ps = getpagesize();
size_t sz = (SIZE / ps) * ps;
void **addrs;
off_t off;
int i;

if ((addrs = malloc((sz / ps) * sizeof (*addrs))) == 0) {
perror("malloc");
exit(1);
}

if ((fd = mkstemp(tmp)) < 0) {
perror("mkstemp");
exit(1);
}

if (write(fd, junkbuf, sz) != sz) {
perror("write");
exit(1);
}

if (close(fd) < 0) {
perror("close");
exit(1);
}

if ((fd = open(tmp, O_RDONLY)) < 0) {
perror("open");
exit(1);
}

for (off = sz - ps, i = 0; off >= 0; off -= ps, i++) {
if ((addrs[i] =
mmap(0, ps, PROT_READ, MAP_SHARED|MAP_LOCKED,
fd, off)) == MAP_FAILED) {
perror("mmap");
exit(1);
}

printf("Mapped offset 0x%jx at %p\n",
(uintmax_t)off, addrs[i]);
}

if (close(fd) < 0) {
perror("close");
exit(1);
}

for (i = 0; i < sz / ps; i++) {
if (munmap(addrs[i], ps) < 0) {
perror("munmap");
exit(1);
}
printf("Unmapped %p\n", addrs[i]);
}

if (unlink(tmp) < 0) {
perror("unlink");
exit(1);
}

printf("Done\n");
}

Thanks,
max

2009-01-30 02:09:19

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments



On Thu, 29 Jan 2009, Maksim Yevmenkin wrote:
>
> Just to confirm. This patch also appear to fix the immediate issue for us.

Ok, I ended up committing it.

I didn't do the whole "Cc: [email protected]" thing, because maybe the
non-cleanup version is less controversial for stable, but I decided that
there's no way I don't want the cleanup done in mainline kernel, and if we
end up needing a few VM_IO's added, I think it's worth it. Even after -rc3
(since I suspend we won't actually need them).

Linus

2009-01-30 04:43:50

by Lee Schermerhorn

[permalink] [raw]
Subject: Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments

On Thu, 2009-01-29 at 15:02 -0800, Linus Torvalds wrote:
>
> On Thu, 29 Jan 2009, Hugh Dickins wrote:
> >
> > The problem has always been that file->f_op->mmap(file, vma) on a
> > special file can do various strange things, changing surprising
> > fields of the struct vma passed to it, changing its mergeability.
>
> Ahh, you're right. That's somewhat bogus, but it does explain why we have
> those two separate merge_vma() calls.
>
> And as you also point out:
>
> > Now it may well be that every driver which does something strange
> > there already sets one of the VM_SPECIAL flags which prevent merging,
> > or can easily be fixed to do so, or otherwise clearly cannot pose a
> > problem.
>
> .. and I think this is the right answer. If a device driver really does
> something as odd as play games with the offset that would make merging
> wrong, it had better set some of the VM_SPECIAL bits (presumably VM_IO) in
> order to never merge at all.

Just want to note that install_special_mappings()--used for, e.g., the
vdso--sets VM_DONTEXPAND [one of the VM_SPECIAL flags] which, if we want
to prevent merging, makes a lot of sense to me. It appears that
get_user_pages() will balk at addresses in vmas with VM_IO [and
VM_PFNMAP] which might not be what one wants. For example, you can't
pre-populate the ptes via make_pages_present() with this flag.

Lee

2009-01-30 04:53:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments



On Thu, 29 Jan 2009, Lee Schermerhorn wrote:
>
> Just want to note that install_special_mappings()--used for, e.g., the
> vdso--sets VM_DONTEXPAND [one of the VM_SPECIAL flags] which, if we want
> to prevent merging, makes a lot of sense to me.

Yes. VM_DONTEXPAND in many ways really fits the "don't merge" thing,
because the whole mremap() VM expansion thing is really equivalent to this
explicit merge.

> It appears that get_user_pages() will balk at addresses in vmas with
> VM_IO [and VM_PFNMAP] which might not be what one wants. For example,
> you can't pre-populate the ptes via make_pages_present() with this flag.

Well, anybody who plays games with vm_pgoff really _should_ be VM_IO. I
don't see any real reason for it except for having a very special IO
mapping. Of course, you quite possibly _also_ want VM_DONTEXPAND.

Linus

2009-01-30 05:59:14

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments

On Thu, Jan 29, 2009 at 06:08:30PM -0800, Linus Torvalds wrote:
>
>
> On Thu, 29 Jan 2009, Maksim Yevmenkin wrote:
> >
> > Just to confirm. This patch also appear to fix the immediate issue for us.
>
> Ok, I ended up committing it.
>
> I didn't do the whole "Cc: [email protected]" thing, because maybe the
> non-cleanup version is less controversial for stable, but I decided that
> there's no way I don't want the cleanup done in mainline kernel, and if we
> end up needing a few VM_IO's added, I think it's worth it. Even after -rc3
> (since I suspend we won't actually need them).

Which version was the "non-cleanup" version that should be added to the
stable trees?

The commit you did make, de33c8db5910cda599899dd431cc30d7c1018cbf,
seemed pretty "tiny" to me.

confused,

greg k-h

2009-01-30 08:34:36

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments

On Thu, 2009-01-29 at 12:48 -0800, Linus Torvalds wrote:
> I still want somebody else to look at and think about it, though.
>
> Linus
>
> ---
> mm/mmap.c | 26 ++++++--------------------
> 1 files changed, 6 insertions(+), 20 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 8d95902..d3fa10a 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1134,16 +1134,11 @@ munmap_back:
> }
>
> /*
> - * Can we just expand an old private anonymous mapping?
> - * The VM_SHARED test is necessary because shmem_zero_setup
> - * will create the file object for a shared anonymous map below.
> + * Can we just expand an old mapping?
> */
> - if (!file && !(vm_flags & VM_SHARED)) {
> - vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
> - NULL, NULL, pgoff, NULL);
> - if (vma)
> - goto out;
> - }
> + vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff, NULL);
> + if (vma)
> + goto out;

You've made checkpatch unhappy ;-)

So we don't bother with anonymous only, always attempt the merge.

> @@ -1206,17 +1201,8 @@ munmap_back:
> if (vma_wants_writenotify(vma))
> vma->vm_page_prot = vm_get_page_prot(vm_flags & ~VM_SHARED);
>
> - if (file && vma_merge(mm, prev, addr, vma->vm_end,
> - vma->vm_flags, NULL, file, pgoff, vma_policy(vma))) {
> - mpol_put(vma_policy(vma));
> - kmem_cache_free(vm_area_cachep, vma);
> - fput(file);
> - if (vm_flags & VM_EXECUTABLE)
> - removed_exe_file_vma(mm);
> - } else {
> - vma_link(mm, vma, prev, rb_link, rb_parent);
> - file = vma->vm_file;
> - }
> + vma_link(mm, vma, prev, rb_link, rb_parent);
> + file = vma->vm_file;

And here we don't bother merging because that would have been done
before. Assuming ->mmap() doesn't go wild, in which case it ought to
have set a VM_SPECIAL bit anyway to discourage merging.

[ And even if it didn't, failing to merge shouldn't be a problem, as
minimizing the vmas is an optimization, not a strict requirement
afaik. ]

The obvious glaring difference is the vma_policy() cruft. But staring at
the code a bit I can't see how the new vma can have acquired a vm_policy
here, so it ought to not matter.

Looks ok to my eyes, so I guess:

Acked-by: Peter Zijlstra <[email protected]>

2009-01-30 16:39:14

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments



On Thu, 29 Jan 2009, Greg KH wrote:
>
> Which version was the "non-cleanup" version that should be added to the
> stable trees?

There were two different versions:

From: Andrew Morton <[email protected]>
Subject: Re: possible bug in mmap_region() in linux-2.6.28 kernel
Message-Id: <[email protected]>

From: Lee Schermerhorn <[email protected]>
Subject: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
Message-Id: <1233259410.2315.75.camel@lts-notebook>

and I'm actually not at all sure which one should go into stable (or if we
should just pick the same one that went into mainline).

> The commit you did make, de33c8db5910cda599899dd431cc30d7c1018cbf,
> seemed pretty "tiny" to me.

Oh, yes, in m any ways it's smaller than the trivial patches, since it's
really just removing code (well, it adds "file" as a parameter to the
first vma_merge, it used to always be NULL since we only did that for the
"!file" case).

The downside with that commit is simply that there is a (pretty damn
small, imho) possibility that somebody will report a regression with it,
and we'll have to add some appropriate

vma->vm_flags |= VM_IO | VM_DONTEXPAND;

to some odd special device driver, to make sure that it cannot trigger the
"merge early" case because it needs its ->mmap() function called, rather
than just expand the mapping quietly.

The reason(s) I don't think it's very likely is that

(a) hopefully drivers already explicitly mark their mappings with one of
the VM_SPECIAL bits already.

Most such drivers would likely want to use "vm_insert_pfn()" (to
insert random IO pages into the mapping), and in order to do that
they have to add the VM_PFNMAP to vm_flags - we check. That would
disable any merging, because that flag won't be set in the new vma
before calling ->mmap.

(b) even if the drivers don't do that explicit VM_PFNMAP, a driver that
does special things at mmap() time - even if it just uses regular
pages - probably prepopulates its mmap area with vm_insert_page().

That in turn, will implicitly set VM_INSERTPAGE, and again: the
merge logic that verifies that the old vma->vm_flags == vm_flags (in
is_mergeable_vma) would disable merging.

(c) finally - even if a merge really would be possible (ie vm_flags
didn't really get changed by the special ->mmap() function, but the
driver still depended on ->mmap() being called for some other
reason), I suspect it is really really unlikely that any application
would actually request two magic consecutive mmap's to the same
special device, with the same permissions and with just the right
pgoff values etc. IOW, even if a merge could happen in theory with a
flaky driver that doesn't like it, I don't think we'd hit it in
practice.

So I feel I had pretty good reasons to say "f*ck it, I'm going to fix this
properly in mainline with a much prettier patch".

But none of the above really changes the fact that the patch I committed
to mainline was really quite fundamentally more invasive than either of
the "simple" patches. All three patches are small, with mine arguably the
smallest of the lot, but mine actually changed semantics, while Andrew's
and Lee's patch literally only fix the invalid pointer use.

I'll leave it to others to decide which one goes into -stable. I
personally don't really think it matters. I argue above that mine is
pretty safe and thus perfectly fine even for -stable, but reality has a
habit of sometimes disagreeing with me. Dang.

Linus

2009-01-30 16:46:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments



On Fri, 30 Jan 2009, Peter Zijlstra wrote:
>
> > + vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff, NULL);
> > + if (vma)
> > + goto out;
>
> You've made checkpatch unhappy ;-)

Yeah, we need to fix that piece of crud.

checkpatch, that is.

I think "git grep" is more important than the "ooh, I have a tiny d*ck,
everybody else should have one too" argument against larger terminals.

Linus

2009-01-30 16:50:39

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments

Linus Torvalds wrote:
>
> On Fri, 30 Jan 2009, Peter Zijlstra wrote:
>>> + vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff, NULL);
>>> + if (vma)
>>> + goto out;
>> You've made checkpatch unhappy ;-)
>
> Yeah, we need to fix that piece of crud.
>
> checkpatch, that is.

It's just advisory, fwiw.

> I think "git grep" is more important than the "ooh, I have a tiny d*ck,
> everybody else should have one too" argument against larger terminals.

and fix the Documentation/CodingStyle recommendations?

--
~Randy

2009-01-30 17:42:35

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments

On Fri, 30 Jan 2009, Linus Torvalds wrote:
> On Thu, 29 Jan 2009, Greg KH wrote:
> >
> > Which version was the "non-cleanup" version that should be added to the
> > stable trees?
>
> There were two different versions:
>
> From: Andrew Morton <[email protected]>
> Subject: Re: possible bug in mmap_region() in linux-2.6.28 kernel
> Message-Id: <[email protected]>
>
> From: Lee Schermerhorn <[email protected]>
> Subject: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
> Message-Id: <1233259410.2315.75.camel@lts-notebook>
>
> and I'm actually not at all sure which one should go into stable (or if we
> should just pick the same one that went into mainline).
>
...
>
> But none of the above really changes the fact that the patch I committed
> to mainline was really quite fundamentally more invasive than either of
> the "simple" patches. All three patches are small, with mine arguably the
> smallest of the lot, but mine actually changed semantics, while Andrew's
> and Lee's patch literally only fix the invalid pointer use.
>
> I'll leave it to others to decide which one goes into -stable. I
> personally don't really think it matters. I argue above that mine is
> pretty safe and thus perfectly fine even for -stable, but reality has a
> habit of sometimes disagreeing with me. Dang.

I'd say one of the non-cleanup versions for -stable
(but I've not compared them to see which one is better).

I'm still working my way through all the ->mmap methods to check
their safety with regard to yesterday's change (there are about
ten times as many as the last time I looked). So far there's only
one driver mmap I want to go back and recheck, the vast majority
are as good today as they were the day before, but ...

... what I think you have done is break the vma merging on
ordinary files: because of that irritating VM_CAN_NONLINEAR
flag which generic_file_mmap() and some others add in.

To break the merging won't cause anyone much trouble,
but is a slight regression we should fix.

I'd have been very upset not to find something ;)

Hugh

2009-01-30 18:15:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments



On Fri, 30 Jan 2009, Hugh Dickins wrote:
>
> ... what I think you have done is break the vma merging on
> ordinary files: because of that irritating VM_CAN_NONLINEAR
> flag which generic_file_mmap() and some others add in.

Ahh. Yes. VM_CAN_NONLINEAR is a "reverse flag", ie unlike the other flags
it's to some degree about being extra _normal_, not about being odd. Most
special flags tend to disable the VM from doing some clever thing, this
one enables it.

> To break the merging won't cause anyone much trouble,
> but is a slight regression we should fix.

Yeah. Just masking it off when comparing is probably the simplest option.
Make it a separate #define just for readability. There might be other
flags like this in the future.

> I'd have been very upset not to find something ;)

Yay for you ;)

And yes, this is the kind of thing that probably does mean that we're
better off with the no-semantic-changes patches in -stable.

Linus

---
mm/mmap.c | 5 ++++-
1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index d3fa10a..c581df1 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -658,6 +658,9 @@ again: remove_next = 1 + (end > next->vm_end);
validate_mm(mm);
}

+/* Flags that can be inherited from an existing mapping when merging */
+#define VM_MERGEABLE_FLAGS (VM_CAN_NONLINEAR)
+
/*
* If the vma has a ->close operation then the driver probably needs to release
* per-vma resources, so we don't attempt to merge those.
@@ -665,7 +668,7 @@ again: remove_next = 1 + (end > next->vm_end);
static inline int is_mergeable_vma(struct vm_area_struct *vma,
struct file *file, unsigned long vm_flags)
{
- if (vma->vm_flags != vm_flags)
+ if ((vma->vm_flags ^ vm_flags) & ~VM_MERGEABLE_FLAGS)
return 0;
if (vma->vm_file != file)
return 0;

2009-01-30 18:32:03

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments

On Fri, 30 Jan 2009, Linus Torvalds wrote:
> On Fri, 30 Jan 2009, Hugh Dickins wrote:
> >
> > ... what I think you have done is break the vma merging on
> > ordinary files: because of that irritating VM_CAN_NONLINEAR
> > flag which generic_file_mmap() and some others add in.
>
> Ahh. Yes. VM_CAN_NONLINEAR is a "reverse flag", ie unlike the other flags
> it's to some degree about being extra _normal_, not about being odd. Most
> special flags tend to disable the VM from doing some clever thing, this
> one enables it.
>
> > To break the merging won't cause anyone much trouble,
> > but is a slight regression we should fix.
>
> Yeah. Just masking it off when comparing is probably the simplest option.
> Make it a separate #define just for readability. There might be other
> flags like this in the future.
>
> > I'd have been very upset not to find something ;)
>
> Yay for you ;)
>
> And yes, this is the kind of thing that probably does mean that we're
> better off with the no-semantic-changes patches in -stable.
>
> Linus

Yes, your patch looks just right to me - or maybe add a comment that
it's only mmap_region()'s call to vma_merge() which needs that mask?

If you hear no more from me, that'll be all I found.

Hugh

>
> ---
> mm/mmap.c | 5 ++++-
> 1 files changed, 4 insertions(+), 1 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index d3fa10a..c581df1 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -658,6 +658,9 @@ again: remove_next = 1 + (end > next->vm_end);
> validate_mm(mm);
> }
>
> +/* Flags that can be inherited from an existing mapping when merging */
> +#define VM_MERGEABLE_FLAGS (VM_CAN_NONLINEAR)
> +
> /*
> * If the vma has a ->close operation then the driver probably needs to release
> * per-vma resources, so we don't attempt to merge those.
> @@ -665,7 +668,7 @@ again: remove_next = 1 + (end > next->vm_end);
> static inline int is_mergeable_vma(struct vm_area_struct *vma,
> struct file *file, unsigned long vm_flags)
> {
> - if (vma->vm_flags != vm_flags)
> + if ((vma->vm_flags ^ vm_flags) & ~VM_MERGEABLE_FLAGS)
> return 0;
> if (vma->vm_file != file)
> return 0;

2009-01-30 19:53:19

by Lee Schermerhorn

[permalink] [raw]
Subject: Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments

On Fri, 2009-01-30 at 10:14 -0800, Linus Torvalds wrote:
>
> On Fri, 30 Jan 2009, Hugh Dickins wrote:
> >
> > ... what I think you have done is break the vma merging on
> > ordinary files: because of that irritating VM_CAN_NONLINEAR
> > flag which generic_file_mmap() and some others add in.
>
> Ahh. Yes. VM_CAN_NONLINEAR is a "reverse flag", ie unlike the other flags
> it's to some degree about being extra _normal_, not about being odd. Most
> special flags tend to disable the VM from doing some clever thing, this
> one enables it.
>
> > To break the merging won't cause anyone much trouble,
> > but is a slight regression we should fix.
>
> Yeah. Just masking it off when comparing is probably the simplest option.
> Make it a separate #define just for readability. There might be other
> flags like this in the future.
>
> > I'd have been very upset not to find something ;)
>
> Yay for you ;)
>
> And yes, this is the kind of thing that probably does mean that we're
> better off with the no-semantic-changes patches in -stable.
>
> Linus
>
> ---
> mm/mmap.c | 5 ++++-
> 1 files changed, 4 insertions(+), 1 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index d3fa10a..c581df1 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -658,6 +658,9 @@ again: remove_next = 1 + (end > next->vm_end);
> validate_mm(mm);
> }
>
> +/* Flags that can be inherited from an existing mapping when merging */
> +#define VM_MERGEABLE_FLAGS (VM_CAN_NONLINEAR)
> +
> /*
> * If the vma has a ->close operation then the driver probably needs to release
> * per-vma resources, so we don't attempt to merge those.
> @@ -665,7 +668,7 @@ again: remove_next = 1 + (end > next->vm_end);
> static inline int is_mergeable_vma(struct vm_area_struct *vma,
> struct file *file, unsigned long vm_flags)
> {
> - if (vma->vm_flags != vm_flags)
> + if ((vma->vm_flags ^ vm_flags) & ~VM_MERGEABLE_FLAGS)
> return 0;
> if (vma->vm_file != file)
> return 0;

I tried this patch atop 29-rc3 + your patch from yesterday with my
simple test program at http://free.linux/hp.com/~lts/Tests/mmap_lock.c.

The test program shows the /proc/<pid>/maps before and after the mmap
and attempted merge. It's not merging:

7fd20a668000-7fd20a66a000 rw-p 7fd20a668000 00:00 0
7fd20a68a000-7fd20a68b000 r--s 00000000 68:23 6608852 /tmp/tmpfVx1SFL (deleted)
7fd20a68b000-7fd20a68c000 r--s 00001000 68:23 6608852 /tmp/tmpfVx1SFL (deleted)
7fd20a68c000-7fd20a690000 rw-p 7fd20a68c000 00:00 0

Ad hoc instrumentation shows that it's the VM_ACCOUNT flag that is
different between the existing file segment and the one attempting the
merge:

is_mergeable_vma: !mergable: vma flags: 0x80020f9:0x1020f9
| |-VM_ACCOUNT
+-----------VM_CAN_NONLINEAR

So happens, I'm mapping with MAP_SHARED, so the VM_ACCOUNT flag gets
cleared later in mmap_region(). Comments say that this is for checking
memory availability during shmem_file_setup(). Maybe we can move the
temporary setting of VM_ACCOUNT until just before the call to
shmem_zero_setup()?

Lee

2009-01-30 20:31:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments



On Fri, 30 Jan 2009, Lee Schermerhorn wrote:
>
> Ad hoc instrumentation shows that it's the VM_ACCOUNT flag that is
> different between the existing file segment and the one attempting the
> merge:

Gaah. I looked at VM_ACCOUNT, since it looked like a prime example of the
same VM_CAN_NONLINEAR thing and was thinking that I should just add it to
the "ok to merge" flags, but decided after a quick look that we do all the
VM_ACCOUNT handling before we call vma_merge, so I just left it at that.

But apparently we _do_ do some VM_ACCOUNT stuff afterwards.

> is_mergeable_vma: !mergable: vma flags: 0x80020f9:0x1020f9
> | |-VM_ACCOUNT
> +-----------VM_CAN_NONLINEAR
>
> So happens, I'm mapping with MAP_SHARED, so the VM_ACCOUNT flag gets
> cleared later in mmap_region(). Comments say that this is for checking
> memory availability during shmem_file_setup(). Maybe we can move the
> temporary setting of VM_ACCOUNT until just before the call to
> shmem_zero_setup()?

Yeah, that would probably fix it, and looks like the right thing to do.

It all looks pretty confused wrong to set the whole VM_ACCOUNT flag for a
file-backed file AT ALL in the first place, but the code knows that it
won't matter for a shared file, and will be cleared again later.

So it plays these temporary games with vm_flags, and it didn't matter
because of how we used to call "vma_merge()" either early only for the
anonymous memory case (that had VM_ACCOUNT stable and didn't have that
temporary case at all) or much later (after having undone the temporary
flag setting) for files.

Why do we pass in that "accountable" flag, btw? It's only ever set to 0 by
a MAP_PRIVATE mapping that hits is_file_hugepages() (see do_mmap_pgoff),
and we could just do that decision all inside mmap_region(). So the flag
doesn't really seem to have any real meaning, and is just passed around
for some odd historical reason?

Linus

2009-01-30 20:34:46

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments

On Fri, 30 Jan 2009, Lee Schermerhorn wrote:
>
> I tried this patch atop 29-rc3 + your patch from yesterday with my
> simple test program at http://free.linux/hp.com/~lts/Tests/mmap_lock.c.
>
> The test program shows the /proc/<pid>/maps before and after the mmap
> and attempted merge. It's not merging:
>
> 7fd20a668000-7fd20a66a000 rw-p 7fd20a668000 00:00 0
> 7fd20a68a000-7fd20a68b000 r--s 00000000 68:23 6608852 /tmp/tmpfVx1SFL (deleted)
> 7fd20a68b000-7fd20a68c000 r--s 00001000 68:23 6608852 /tmp/tmpfVx1SFL (deleted)
> 7fd20a68c000-7fd20a690000 rw-p 7fd20a68c000 00:00 0

Some testing - now that's a great idea! Good spotting.

>
> Ad hoc instrumentation shows that it's the VM_ACCOUNT flag that is
> different between the existing file segment and the one attempting the
> merge:
>
> is_mergeable_vma: !mergable: vma flags: 0x80020f9:0x1020f9
> | |-VM_ACCOUNT
> +-----------VM_CAN_NONLINEAR

Sorry, I should have noticed that: VM_ACCOUNT was certainly on my
mind as a flag that gives trouble when merging vmas, but I'd
misconvinced myself that it wasn't a problem in this case.

>
> So happens, I'm mapping with MAP_SHARED, so the VM_ACCOUNT flag gets
> cleared later in mmap_region(). Comments say that this is for checking
> memory availability during shmem_file_setup(). Maybe we can move the
> temporary setting of VM_ACCOUNT until just before the call to
> shmem_zero_setup()?

Please let me think on that - we can live with the status quo for
the moment. I need to remind myself why I used that peculiar way
of conveying info from mmap.c to shmem.c: the right answer may be
just to pass another arg to shmem_zero_setup and shmem_file_setup.

Hugh

2009-01-30 20:55:40

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments

Lee Schermerhorn wrote:

> I tried this patch atop 29-rc3 + your patch from yesterday with my
> simple test program at http://free.linux/hp.com/~lts/Tests/mmap_lock.c.

Is that URL correct? I can't find/access it.

> The test program shows the /proc/<pid>/maps before and after the mmap
> and attempted merge. It's not merging:

Thanks,
--
~Randy

2009-01-30 21:00:20

by Lee Schermerhorn

[permalink] [raw]
Subject: Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments

On Fri, 2009-01-30 at 12:53 -0800, Randy Dunlap wrote:
> Lee Schermerhorn wrote:
>
> > I tried this patch atop 29-rc3 + your patch from yesterday with my
> > simple test program at http://free.linux/hp.com/~lts/Tests/mmap_lock.c.

D'oh! that's ".hp.com"

> Is that URL correct? I can't find/access it.

Sorry, should have cut and pasted, like here:

http://free.linux.hp.com/~lts/Tests/mmap_lock.c


Lee

2009-01-30 21:14:22

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments

On Fri, 30 Jan 2009, Linus Torvalds wrote:
> On Fri, 30 Jan 2009, Lee Schermerhorn wrote:
> >
> > So happens, I'm mapping with MAP_SHARED, so the VM_ACCOUNT flag gets
> > cleared later in mmap_region(). Comments say that this is for checking
> > memory availability during shmem_file_setup(). Maybe we can move the
> > temporary setting of VM_ACCOUNT until just before the call to
> > shmem_zero_setup()?
>
> Yeah, that would probably fix it, and looks like the right thing to do.

I do need to refresh my memory on that in a moment...

>
> It all looks pretty confused wrong to set the whole VM_ACCOUNT flag for a
> file-backed file AT ALL in the first place, but the code knows that it
> won't matter for a shared file, and will be cleared again later.
>
> So it plays these temporary games with vm_flags, and it didn't matter
> because of how we used to call "vma_merge()" either early only for the
> anonymous memory case (that had VM_ACCOUNT stable and didn't have that
> temporary case at all) or much later (after having undone the temporary
> flag setting) for files.

I'm to blame for those games, and now they've given trouble,
the right thing may be to put an end to them.

>
> Why do we pass in that "accountable" flag, btw? It's only ever set to 0 by
> a MAP_PRIVATE mapping that hits is_file_hugepages() (see do_mmap_pgoff),
> and we could just do that decision all inside mmap_region(). So the flag
> doesn't really seem to have any real meaning, and is just passed around
> for some odd historical reason?

It looks like the "accountable" flag dates from before Miklos separated
mmap_region() out from do_mmap_pgoff(): so he just passed it on down to
mmap_region() as an additional argument, preferring to leave the more
complex MAP_PRIVATE/is_file_hugepages test behind in do_mmap_pgoff().

It seemed rather a random refactoring to me. Looking at it again,
I wonder if we should be getting do_brk() to use mmap_region() too;
but my appetite for cleanups is low at present, bugs we have enough.

By the way, there's an argument to say that you should add
VM_MIXEDMAP to VM_CAN_NONLINEAR in VM_MERGEABLE_FLAGS: I don't
really care whether we merge the odd filemap_xip vma or not,
but it used to do so and now won't.

By the same (used to merge, now won't) argument, one could say
VM_INSERTPAGE should be there too; but whereas VM_MIXEDMAP is used
in one place only, quite a lot of drivers use vm_insert_page(), so
I feel more comfortable with the idea that it's stopping merges -
though in that case, shouldn't we add it to VM_SPECIAL?

But I'm caring more about that VM_ACCOUNT...

Hugh

2009-01-30 21:26:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments



On Fri, 30 Jan 2009, Hugh Dickins wrote:
>
> But I'm caring more about that VM_ACCOUNT...

The bright side here is that I think the VM_ACCOUNT bit differences only
really happen for MAP_SHARED file mappings. For MAP_SHARED anon mappings
we already don't share (for other reasons), and for private mappings
VM_ACCOUNT looks stable (ie it's reliably either set or not, and stays
that way).

So the impact is reasonably low. But it would definitely be good to clean
up that whole VM_ACCOUNT logic, and incidentally then fix that merging
issue at the same time.

Linus

2009-01-30 21:37:03

by Lee Schermerhorn

[permalink] [raw]
Subject: Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments

On Fri, 2009-01-30 at 21:12 +0000, Hugh Dickins wrote:
> On Fri, 30 Jan 2009, Linus Torvalds wrote:
> > On Fri, 30 Jan 2009, Lee Schermerhorn wrote:
> > >
> > > So happens, I'm mapping with MAP_SHARED, so the VM_ACCOUNT flag gets
> > > cleared later in mmap_region(). Comments say that this is for checking
> > > memory availability during shmem_file_setup(). Maybe we can move the
> > > temporary setting of VM_ACCOUNT until just before the call to
> > > shmem_zero_setup()?
> >
> > Yeah, that would probably fix it, and looks like the right thing to do.
>
> I do need to refresh my memory on that in a moment...
>
> >
> > It all looks pretty confused wrong to set the whole VM_ACCOUNT flag for a
> > file-backed file AT ALL in the first place, but the code knows that it
> > won't matter for a shared file, and will be cleared again later.
> >
> > So it plays these temporary games with vm_flags, and it didn't matter
> > because of how we used to call "vma_merge()" either early only for the
> > anonymous memory case (that had VM_ACCOUNT stable and didn't have that
> > temporary case at all) or much later (after having undone the temporary
> > flag setting) for files.
>
> I'm to blame for those games, and now they've given trouble,
> the right thing may be to put an end to them.
>
> >
> > Why do we pass in that "accountable" flag, btw? It's only ever set to 0 by
> > a MAP_PRIVATE mapping that hits is_file_hugepages() (see do_mmap_pgoff),
> > and we could just do that decision all inside mmap_region(). So the flag
> > doesn't really seem to have any real meaning, and is just passed around
> > for some odd historical reason?
>
> It looks like the "accountable" flag dates from before Miklos separated
> mmap_region() out from do_mmap_pgoff(): so he just passed it on down to
> mmap_region() as an additional argument, preferring to leave the more
> complex MAP_PRIVATE/is_file_hugepages test behind in do_mmap_pgoff().
>
> It seemed rather a random refactoring to me. Looking at it again,
> I wonder if we should be getting do_brk() to use mmap_region() too;
> but my appetite for cleanups is low at present, bugs we have enough.
>
> By the way, there's an argument to say that you should add
> VM_MIXEDMAP to VM_CAN_NONLINEAR in VM_MERGEABLE_FLAGS: I don't
> really care whether we merge the odd filemap_xip vma or not,
> but it used to do so and now won't.
>
> By the same (used to merge, now won't) argument, one could say
> VM_INSERTPAGE should be there too; but whereas VM_MIXEDMAP is used
> in one place only, quite a lot of drivers use vm_insert_page(), so
> I feel more comfortable with the idea that it's stopping merges -
> though in that case, shouldn't we add it to VM_SPECIAL?
>
> But I'm caring more about that VM_ACCOUNT...

I just verified that adding VM_ACCOUNT to VM_MERGEABLE does allow the
merge to happen with the test program. And the system didn't come
crashing down around me. But, I wouldn't trust that simple test as the
last word. A short run of a stress load I use held up/still running,
but I can't tell whether it's merging as expected there.

I am running a slightly modified version of Maksim's test program under
the harness. I modified it to mmap the entire region to reserve space,
then MAP_FIXED at each page address in the range returned by the first
mmap. I saw that it was leaving holes between some of the pages w/o
this. I'm going to automate the check for merging [read map and verify
a single segment at expected range] and leave that running with the
load.

Lee

2009-01-30 21:38:51

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments



On Fri, 30 Jan 2009, Hugh Dickins wrote:
>
> By the way, there's an argument to say that you should add
> VM_MIXEDMAP to VM_CAN_NONLINEAR in VM_MERGEABLE_FLAGS: I don't
> really care whether we merge the odd filemap_xip vma or not,
> but it used to do so and now won't.
>
> By the same (used to merge, now won't) argument, one could say
> VM_INSERTPAGE should be there too; but whereas VM_MIXEDMAP is used
> in one place only, quite a lot of drivers use vm_insert_page(), so
> I feel more comfortable with the idea that it's stopping merges -
> though in that case, shouldn't we add it to VM_SPECIAL?

VM_MIXEDMAP, yes. It's a special case.

But VM_INSERTPAGE - no.

Why? Because VM_INSERTPAGE implies that mmap() itself quite possibly
actually _does_ something (it may have inserted _all_ the pages, and may
not even have any ->fault handler at all).

So we can't just expand a current mapping and forget about it, we need to
do the mmap(). So the "only do merges early" fundamentally cannot merge
such a vma, even if the old code did.

Of course, we could look at whether it has a ->fault handler as an
indication of whether it's possible to merge or not. We already do that
for ->close, and in many ways ->fault would likely be a better indicator
of whether something is mergeable or not.

But there's really no point. VM_INSERTPAGE implies a very special mapping:
it's used by things like the SCSI target user space ring map, or the magic
network packet mmap thing. If you use those things, you really don't care
about merging adjacent VM's anyway, and at least the two I looked at
really do the whole "pre-populate at mmap time" thing.

And at least the packet one really does have a ->close function, and lacks
a ->fault function, so it wouldn't have merged before either (and looking
at ->fault again seems to be as valid as looking at ->close).

Linus

2009-01-30 21:47:52

by Will Crowder

[permalink] [raw]
Subject: Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments

What, no ".linux" TLD? :-) Oh well...in time, time...

Will Crowder

On Fri, 2009-01-30 at 15:59 -0500, Lee Schermerhorn wrote:
> On Fri, 2009-01-30 at 12:53 -0800, Randy Dunlap wrote:
> > Lee Schermerhorn wrote:
> >
> > > I tried this patch atop 29-rc3 + your patch from yesterday with my
> > > simple test program at http://free.linux/hp.com/~lts/Tests/mmap_lock.c.
>
> D'oh! that's ".hp.com"
>
> > Is that URL correct? I can't find/access it.
>
> Sorry, should have cut and pasted, like here:
>
> http://free.linux.hp.com/~lts/Tests/mmap_lock.c
>
>
> Lee
>

2009-01-30 22:28:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments


On Fri, 30 Jan 2009, Lee Schermerhorn wrote:
>
> I just verified that adding VM_ACCOUNT to VM_MERGEABLE does allow the
> merge to happen with the test program. And the system didn't come
> crashing down around me. But, I wouldn't trust that simple test as the
> last word. A short run of a stress load I use held up/still running,
> but I can't tell whether it's merging as expected there.

Just ignoring VM_ACCOUNT for merging is not the right thing to do. It
probably works in practice for just about everything, but at least in
theory it does mean that mmap() can stop honoring MAP_NORESERVE.

Admittedly the circumstances where that happens are unlikely, and nobody
probably even really uses MAP_NORESERVE in the first place, so I doubt you
can ever see it as a real issue, but it's still technically wrong to merge
vma's that can differ in VM_ACCOUNT.

Now, the particular test you have, VM_ACCOUNT differs only during that
temporary window, and the vma's really do end up with the same VM_ACCOUNT
state in the end, so merging them is correct, but if you were to privately
map the same file (or private anonymous map) with the same permissions
next to each other so that they -could- merge, but use MAP_NORESERVE on
one and not on the other, then they shouldn't merge.

So VM_ACCOUNT does matter - just barely - for merging, and we just happen
to currently hit it too much due to a very odd internal reason.

Linus

2009-01-30 23:46:32

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments

On Fri, Jan 30, 2009 at 05:40:24PM +0000, Hugh Dickins wrote:
> On Fri, 30 Jan 2009, Linus Torvalds wrote:
> > On Thu, 29 Jan 2009, Greg KH wrote:
> > >
> > > Which version was the "non-cleanup" version that should be added to the
> > > stable trees?
> >
> > There were two different versions:
> >
> > From: Andrew Morton <[email protected]>
> > Subject: Re: possible bug in mmap_region() in linux-2.6.28 kernel
> > Message-Id: <[email protected]>
> >
> > From: Lee Schermerhorn <[email protected]>
> > Subject: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
> > Message-Id: <1233259410.2315.75.camel@lts-notebook>
> >
> > and I'm actually not at all sure which one should go into stable (or if we
> > should just pick the same one that went into mainline).
> >
> ...
> >
> > But none of the above really changes the fact that the patch I committed
> > to mainline was really quite fundamentally more invasive than either of
> > the "simple" patches. All three patches are small, with mine arguably the
> > smallest of the lot, but mine actually changed semantics, while Andrew's
> > and Lee's patch literally only fix the invalid pointer use.
> >
> > I'll leave it to others to decide which one goes into -stable. I
> > personally don't really think it matters. I argue above that mine is
> > pretty safe and thus perfectly fine even for -stable, but reality has a
> > habit of sometimes disagreeing with me. Dang.
>
> I'd say one of the non-cleanup versions for -stable
> (but I've not compared them to see which one is better).

Ok, based on both of your comments about this, and the fact that the
in-tree one did break something, I'll go look at Andrew and Lee's
versions and pick one of them for -stable.

thanks,

greg k-h

2009-01-31 12:19:31

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments

On Fri, 30 Jan 2009, Linus Torvalds wrote:
> On Fri, 30 Jan 2009, Hugh Dickins wrote:
> >
> > By the same (used to merge, now won't) argument, one could say
> > VM_INSERTPAGE should be there too; but whereas VM_MIXEDMAP is used
> > in one place only, quite a lot of drivers use vm_insert_page(), so
> > I feel more comfortable with the idea that it's stopping merges -
> > though in that case, shouldn't we add it to VM_SPECIAL?
>
> But VM_INSERTPAGE - no.
>
> Why? Because VM_INSERTPAGE implies that mmap() itself quite possibly
> actually _does_ something (it may have inserted _all_ the pages, and may
> not even have any ->fault handler at all).
>
> So we can't just expand a current mapping and forget about it, we need to
> do the mmap(). So the "only do merges early" fundamentally cannot merge
> such a vma, even if the old code did.

Good point indeed. Same as why we test for it in copy_page_range().
Should it be added to VM_SPECIAL? Not for this particular (need to
call ->mmap) reason, but we probably ought to add it anyway.

Hugh

>
> Of course, we could look at whether it has a ->fault handler as an
> indication of whether it's possible to merge or not. We already do that
> for ->close, and in many ways ->fault would likely be a better indicator
> of whether something is mergeable or not.
>
> But there's really no point. VM_INSERTPAGE implies a very special mapping:
> it's used by things like the SCSI target user space ring map, or the magic
> network packet mmap thing. If you use those things, you really don't care
> about merging adjacent VM's anyway, and at least the two I looked at
> really do the whole "pre-populate at mmap time" thing.
>
> And at least the packet one really does have a ->close function, and lacks
> a ->fault function, so it wouldn't have merged before either (and looking
> at ->fault again seems to be as valid as looking at ->close).
>
> Linus

2009-01-31 12:36:49

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments

On Fri, 30 Jan 2009, Linus Torvalds wrote:
> On Fri, 30 Jan 2009, Lee Schermerhorn wrote:
> >
> > I just verified that adding VM_ACCOUNT to VM_MERGEABLE does allow the
> > merge to happen with the test program. And the system didn't come
> > crashing down around me. But, I wouldn't trust that simple test as the
> > last word. A short run of a stress load I use held up/still running,
> > but I can't tell whether it's merging as expected there.
>
> Just ignoring VM_ACCOUNT for merging is not the right thing to do. It
> probably works in practice for just about everything, but at least in
> theory it does mean that mmap() can stop honoring MAP_NORESERVE.
>
> Admittedly the circumstances where that happens are unlikely, and nobody
> probably even really uses MAP_NORESERVE in the first place, so I doubt you
> can ever see it as a real issue, but it's still technically wrong to merge
> vma's that can differ in VM_ACCOUNT.
>
> Now, the particular test you have, VM_ACCOUNT differs only during that
> temporary window, and the vma's really do end up with the same VM_ACCOUNT
> state in the end, so merging them is correct, but if you were to privately
> map the same file (or private anonymous map) with the same permissions
> next to each other so that they -could- merge, but use MAP_NORESERVE on
> one and not on the other, then they shouldn't merge.
>
> So VM_ACCOUNT does matter - just barely - for merging, and we just happen
> to currently hit it too much due to a very odd internal reason.

It matters more than just barely - if you care about non-overcommit,
or if you care about non-wrapping Committed_AS in your /proc/meminfo.

Ignoring VM_ACCOUNT when merging is very much the wrong thing to do,
because it lets an unaccounted area be treated thereafter as accounted,
or vice versa - even forgetting the MAP_NORESERVE special case.

I have by now recalled why I chose to play those VM_ACCOUNT games:
/* We set VM_ACCOUNT in a shared mapping's vm_flags, to inform
* shmem_zero_setup (perhaps called through /dev/zero's ->mmap)
* that memory reservation must be checked; but that reservation
* belongs to shared memory object, not to vma: so now clear it.
We need a way to communicate not-MAP_NORESERVE to shmem.c, and we don't
just need it in the explicit shmem_zero_setup() case, we also need it
for the (probably rare nowadays) case when mmap() is working on file
/dev/zero (drivers/char/mem.c mmap_zero()), rather than using MAP_ANON.

Still haven't decided what's best to do about it (plenty of diversions):
perhaps we just say my error was to overload VM_ACCOUNT, and define a
new flag for the purpose, which can go into VM_MERGEABLE_FLAGS; but
I'd prefer a neater solution if it crosses my mind.

Hugh

2009-01-31 18:35:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments



On Sat, 31 Jan 2009, Hugh Dickins wrote:
>
> We need a way to communicate not-MAP_NORESERVE to shmem.c, and we don't
> just need it in the explicit shmem_zero_setup() case, we also need it
> for the (probably rare nowadays) case when mmap() is working on file
> /dev/zero (drivers/char/mem.c mmap_zero()), rather than using MAP_ANON.

Heh. We already have that. Maybe you didn't realize. Look at VM_NORESERVE.

So shmem.c can just look at "!(vma->vm_flags & VM_NORESERVE)" if it wants
to.

The only piece of information you don't have is that "accountable" flag,
but that was why I was pointing out how totally _useless_ that flag
actually is. We shouldn't pass it along, because it's always 1, except for
one special case that we can always calculate (private hugepages).

But in fact, even leaving that untouched, we can trivially just change
what 'VM_NORESERVE' means: we can just make it the end result of all that
'accountable' logic, instead of having it just mirror MAP_NORESERVE
blindly.

> Still haven't decided what's best to do about it (plenty of diversions):
> perhaps we just say my error was to overload VM_ACCOUNT, and define a
> new flag for the purpose, which can go into VM_MERGEABLE_FLAGS; but
> I'd prefer a neater solution if it crosses my mind.

How about this pretty trivial patch?

TOTALLY UNTESTED. As usual. But the concept is pretty simple, and it
actually removes a fair chunk of hacky code. The only reason the diffstat
output says that it adds more lines than it deletes is that I added more
comments and made that helper inline function rather than make a complex
conditional.

Whaddaya think?

Linus

---
mm/mmap.c | 48 +++++++++++++++++++++++++-----------------------
mm/shmem.c | 2 +-
2 files changed, 26 insertions(+), 24 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index c581df1..5fcaec3 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1090,6 +1090,15 @@ int vma_wants_writenotify(struct vm_area_struct *vma)
mapping_cap_account_dirty(vma->vm_file->f_mapping);
}

+/*
+ * We account for memory if it's a private writeable mapping,
+ * and VM_NORESERVE wasn't set.
+ */
+static inline int private_accountable_mapping(unsigned int vm_flags)
+{
+ return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
+}
+
unsigned long mmap_region(struct file *file, unsigned long addr,
unsigned long len, unsigned long flags,
unsigned int vm_flags, unsigned long pgoff,
@@ -1117,23 +1126,24 @@ munmap_back:
if (!may_expand_vm(mm, len >> PAGE_SHIFT))
return -ENOMEM;

- if (flags & MAP_NORESERVE)
+ /*
+ * Set 'VM_NORESERVE' if we should not account for the
+ * memory use of this mapping. We only honor MAP_NORESERVE
+ * if we're allowed to overcommit memory.
+ */
+ if ((flags & MAP_NORESERVE) && sysctl_overcommit_memory != OVERCOMMIT_NEVER)
+ vm_flags |= VM_NORESERVE;
+ if (!accountable)
vm_flags |= VM_NORESERVE;

- if (accountable && (!(flags & MAP_NORESERVE) ||
- sysctl_overcommit_memory == OVERCOMMIT_NEVER)) {
- if (vm_flags & VM_SHARED) {
- /* Check memory availability in shmem_file_setup? */
- vm_flags |= VM_ACCOUNT;
- } else if (vm_flags & VM_WRITE) {
- /*
- * Private writable mapping: check memory availability
- */
- charged = len >> PAGE_SHIFT;
- if (security_vm_enough_memory(charged))
- return -ENOMEM;
- vm_flags |= VM_ACCOUNT;
- }
+ /*
+ * Private writable mapping: check memory availability
+ */
+ if (private_accountable_mapping(vm_flags)) {
+ charged = len >> PAGE_SHIFT;
+ if (security_vm_enough_memory(charged))
+ return -ENOMEM;
+ vm_flags |= VM_ACCOUNT;
}

/*
@@ -1184,14 +1194,6 @@ munmap_back:
goto free_vma;
}

- /* We set VM_ACCOUNT in a shared mapping's vm_flags, to inform
- * shmem_zero_setup (perhaps called through /dev/zero's ->mmap)
- * that memory reservation must be checked; but that reservation
- * belongs to shared memory object, not to vma: so now clear it.
- */
- if ((vm_flags & (VM_SHARED|VM_ACCOUNT)) == (VM_SHARED|VM_ACCOUNT))
- vma->vm_flags &= ~VM_ACCOUNT;
-
/* Can addr have changed??
*
* Answer: Yes, several device drivers can do it in their
diff --git a/mm/shmem.c b/mm/shmem.c
index 5d0de96..19d566c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2628,7 +2628,7 @@ struct file *shmem_file_setup(char *name, loff_t size, unsigned long flags)
goto close_file;

#ifdef CONFIG_SHMEM
- SHMEM_I(inode)->flags = flags & VM_ACCOUNT;
+ SHMEM_I(inode)->flags = (flags & VM_NORESERVE) ? 0 : VM_ACCOUNT;
#endif
d_instantiate(dentry, inode);
inode->i_size = size;