2008-01-30 14:17:54

by Oleg Nesterov

[permalink] [raw]
Subject: [PATCH] sys_remap_file_pages: fix ->vm_file accounting

Fix ->vm_file accounting, mmap_region() may do do_munmap().

Signed-off-by: Oleg Nesterov <[email protected]>

--- MM/mm/fremap.c~RFP 2007-10-25 16:22:12.000000000 +0400
+++ MM/mm/fremap.c 2008-01-30 16:56:39.000000000 +0300
@@ -192,8 +192,10 @@ asmlinkage long sys_remap_file_pages(uns
unsigned long addr;

flags &= MAP_NONBLOCK;
+ get_file(vma->vm_file);
addr = mmap_region(vma->vm_file, start, size,
flags, vma->vm_flags, pgoff, 1);
+ fput(vma->vm_file);
if (IS_ERR_VALUE(addr)) {
err = addr;
} else {


2008-01-30 16:55:20

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] sys_remap_file_pages: fix ->vm_file accounting


On Wed, 2008-01-30 at 17:20 +0300, Oleg Nesterov wrote:
> Fix ->vm_file accounting, mmap_region() may do do_munmap().

Ouch! I didn't think of that case at all...

There's a small problem with the patch: the vma itself is freed at
unmap, so the fput(vma->vm_file) may crash. Here's an updated patch.

Thanks for spotting this!

Miklos
----


From: Oleg Nesterov <[email protected]>

Fix ->vm_file accounting, mmap_region() may do do_munmap().

Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Miklos Szeredi <[email protected]>
---

Index: linux/mm/fremap.c
===================================================================
--- linux.orig/mm/fremap.c 2008-01-17 19:00:17.000000000 +0100
+++ linux/mm/fremap.c 2008-01-30 17:47:27.000000000 +0100
@@ -190,10 +190,13 @@ asmlinkage long sys_remap_file_pages(uns
*/
if (mapping_cap_account_dirty(mapping)) {
unsigned long addr;
+ struct file *file = vma->vm_file;

flags &= MAP_NONBLOCK;
- addr = mmap_region(vma->vm_file, start, size,
+ get_file(file);
+ addr = mmap_region(file, start, size,
flags, vma->vm_flags, pgoff, 1);
+ fput(file);
if (IS_ERR_VALUE(addr)) {
err = addr;
} else {

2008-01-30 17:24:11

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] sys_remap_file_pages: fix ->vm_file accounting

On 01/30, Miklos Szeredi wrote:
>
> On Wed, 2008-01-30 at 17:20 +0300, Oleg Nesterov wrote:
> > Fix ->vm_file accounting, mmap_region() may do do_munmap().
>
> There's a small problem with the patch: the vma itself is freed at
> unmap, so the fput(vma->vm_file) may crash. Here's an updated patch.

Ah, indeed, thanks!


Offtopic. I noticed this problem while looking at this patch:

http://marc.info/?l=linux-mm-commits&m=120141116911711

So this (the old vma could be removed before we create the new mapping)
means that the patch above has another problem: if we are remapping the
whole VM_EXECUTABLE vma, removed_exe_file_vma() can clear ->exe_file
while it shouldn't (Matt Helsley cc'ed).

Oleg.

2008-02-02 21:18:41

by Matt Helsley

[permalink] [raw]
Subject: Re: [PATCH] sys_remap_file_pages: fix ->vm_file accounting


On Wed, 2008-01-30 at 20:26 +0300, Oleg Nesterov wrote:
> On 01/30, Miklos Szeredi wrote:
> >
> > On Wed, 2008-01-30 at 17:20 +0300, Oleg Nesterov wrote:
> > > Fix ->vm_file accounting, mmap_region() may do do_munmap().
> >
> > There's a small problem with the patch: the vma itself is freed at
> > unmap, so the fput(vma->vm_file) may crash. Here's an updated patch.
>
> Ah, indeed, thanks!
>
>
> Offtopic. I noticed this problem while looking at this patch:
>
> http://marc.info/?l=linux-mm-commits&m=120141116911711
>
> So this (the old vma could be removed before we create the new mapping)
> means that the patch above has another problem: if we are remapping the
> whole VM_EXECUTABLE vma, removed_exe_file_vma() can clear ->exe_file
> while it shouldn't (Matt Helsley cc'ed).
>
> Oleg.

Looking at sys_remap_file_pages() it appears that the shared flag must
be set in order to remap. Executable mappings are always MAP_PRIVATE and
hence lack the shared flag so that any modifications to those areas
don't get written back to the executable. I don't think userspace can
change this flag -- even using plain mremap. So, unless there's a way to
change that flag, I don't think there's anything related to
VM_EXECUTABLE vmas that needs to be done here.

Cc'ing linux-mm.

Cheers,
-Matt Helsley

2008-02-02 21:18:56

by Matt Helsley

[permalink] [raw]
Subject: Re: [PATCH] sys_remap_file_pages: fix ->vm_file accounting


On Wed, 2008-01-30 at 20:26 +0300, Oleg Nesterov wrote:
> On 01/30, Miklos Szeredi wrote:
> >
> > On Wed, 2008-01-30 at 17:20 +0300, Oleg Nesterov wrote:
> > > Fix ->vm_file accounting, mmap_region() may do do_munmap().
> >
> > There's a small problem with the patch: the vma itself is freed at
> > unmap, so the fput(vma->vm_file) may crash. Here's an updated patch.
>
> Ah, indeed, thanks!
>
>
> Offtopic. I noticed this problem while looking at this patch:
>
> http://marc.info/?l=linux-mm-commits&m=120141116911711
>
> So this (the old vma could be removed before we create the new mapping)
> means that the patch above has another problem: if we are remapping the
> whole VM_EXECUTABLE vma, removed_exe_file_vma() can clear ->exe_file
> while it shouldn't (Matt Helsley cc'ed).
>
> Oleg.

Only shared VMAs can be remapped by sys_remap_file_pages but all
executable mappings are MAP_PRIVATE and hence lack the shared flag. I
don't know of a way for userspace to change that flag so I think there's
nothing that needs to be done here.

Cc'ing linux-mm.

Cheers,
-Matt Helsley

2008-02-03 18:19:26

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] sys_remap_file_pages: fix ->vm_file accounting

(remove [email protected] from CC)

On 02/02, Matt Helsley wrote:
>
> On Wed, 2008-01-30 at 20:26 +0300, Oleg Nesterov wrote:
> >
> > Offtopic. I noticed this problem while looking at this patch:
> >
> > http://marc.info/?l=linux-mm-commits&m=120141116911711
> >
> > So this (the old vma could be removed before we create the new mapping)
> > means that the patch above has another problem: if we are remapping the
> > whole VM_EXECUTABLE vma, removed_exe_file_vma() can clear ->exe_file
> > while it shouldn't (Matt Helsley cc'ed).
> >
> > Oleg.
>
> Looking at sys_remap_file_pages() it appears that the shared flag must
> be set in order to remap. Executable mappings are always MAP_PRIVATE and
> hence lack the shared flag so that any modifications to those areas
> don't get written back to the executable. I don't think userspace can
> change this flag

Yes, userspace can't change it. But if MVFS changes ->vm_file it could also
change vm_flags... But I think you are right anyway, we shouldn't care.


So I have to try to find another bug ;) Suppose that ->load_binary() does
a series of do_mmap(MAP_EXECUTABLE). It is possible that mmap_region() can
merge 2 vmas. In that case we "leak" ->num_exe_file_vmas. Unless I missed
something, mmap_region() should do removed_exe_file_vma() when vma_merge()
succeds (near fput(file)).

Oleg.

2008-02-03 18:27:18

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] sys_remap_file_pages: fix ->vm_file accounting

Off-topic question to all. sys_remap_file_pages() doesn't work with
shared readonly mappings, why?

IOW, why it checks VM_SHARED but not VM_MAYSHARE?

Thanks in advance,

Oleg.

2008-02-06 20:14:38

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] sys_remap_file_pages: fix ->vm_file accounting

[ removed stable ]

On Sun, 3 Feb 2008, Oleg Nesterov wrote:
> Off-topic question to all. sys_remap_file_pages() doesn't work with
> shared readonly mappings, why?

Slight correction: it works with shared readonly mappings, doesn't
it, so long as the mmap'ed file was opened for reading and writing?

> IOW, why it checks VM_SHARED but not VM_MAYSHARE?

My guess has always been that it was just a misunderstanding of how
those VM_ flags end up working: assume so unless Ingo corrects me.

By the time I realized that oddity, we'd been driven into several
tiresome corners by the very existence of sys_remap_file_pages.
So whereas my usual instinct would have been to relax the restriction
and generalize, in its case I wanted to hold on to every restriction
we had.

sys_remap_file_pages does serve a useful purpose; but it subverts
vma principles, and has therefore caused us grief repeatedly.

Hugh

2008-02-06 20:35:26

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] sys_remap_file_pages: fix ->vm_file accounting

On Sun, 3 Feb 2008, Oleg Nesterov wrote:
>
> So I have to try to find another bug ;) Suppose that ->load_binary() does
> a series of do_mmap(MAP_EXECUTABLE). It is possible that mmap_region() can
> merge 2 vmas. In that case we "leak" ->num_exe_file_vmas. Unless I missed
> something, mmap_region() should do removed_exe_file_vma() when vma_merge()
> succeds (near fput(file)).

Or there's the complementary case of a VM_EXECUTABLE vma being
split in two, for example by an mprotect of a part of it.

Sorry, Matt, I don't like your patch at all. It seems to add a fair
amount of ugliness and unmaintainablity, all for a peculiar MVFS case
(you've tried to argue other advantages, but not always convinced!).

And I found it quite hard to see where the crucial difference comes.
I guess it's that MVFS changes vma->vm_file in its ->mmap? Well, if
MVFS does that, maybe something else does that too, but precisely to
rely on the present behaviour of /proc/pid/exe - so in fixing for
MVFS, we'd be breaking that hypothetical other?

I can understand patches to avoid mmap_sem for /proc/pid/exe, but
this one just seems too messy for too special an out-of-tree case.
(I've no last word on this, but that's my opinion.)

Hugh

2008-02-07 00:41:20

by Matt Helsley

[permalink] [raw]
Subject: Re: [PATCH] sys_remap_file_pages: fix ->vm_file accounting


On Wed, 2008-02-06 at 20:33 +0000, Hugh Dickins wrote:
> On Sun, 3 Feb 2008, Oleg Nesterov wrote:
> >
> > So I have to try to find another bug ;) Suppose that ->load_binary() does
> > a series of do_mmap(MAP_EXECUTABLE). It is possible that mmap_region() can
> > merge 2 vmas. In that case we "leak" ->num_exe_file_vmas. Unless I missed
> > something, mmap_region() should do removed_exe_file_vma() when vma_merge()
> > succeds (near fput(file)).
>
> Or there's the complementary case of a VM_EXECUTABLE vma being
> split in two, for example by an mprotect of a part of it.
>
> Sorry, Matt, I don't like your patch at all. It seems to add a fair
> amount of ugliness and unmaintainablity, all for a peculiar MVFS case

I thought that getting rid of the separate versions of proc_exe_link()
improved maintainability. Do you have any specific details on what you
think makes the code introduced by the patch unmaintainable?

> (you've tried to argue other advantages, but not always convinced!).

Yup -- looking at how the VM_EXECUTABLE flag affects the vma walk it's
clear one of my arguments was wrong. So I can't blame you for being
unconvinced by that. :)

I still think it would help any stacking filesystems that can't use the
solution adopted by unionfs.

> And I found it quite hard to see where the crucial difference comes.
> I guess it's that MVFS changes vma->vm_file in its ->mmap? Well, if

Yup.

> MVFS does that, maybe something else does that too, but precisely to
> rely on the present behaviour of /proc/pid/exe - so in fixing for
> MVFS, we'd be breaking that hypothetical other?

I'm not completely certain that I understand your point. Are you
suggesting that some hypothetical code would want to use this "quirk"
of /proc/pid/exe for a legitimate purpose?

Assuming that is your point, I thought my non-hypothetical java example
clearly demonstrated that at least one non-hypothetical program doesn't
expect the "quirk" and breaks because of it. Frankly,
given /proc/pid/exe's output in the non-stacking case, I can't see how
its output in the stacking case we're discussing could be considered
anything but buggy.

Cheers,
-Matt Helsley

2008-02-07 16:42:55

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] sys_remap_file_pages: fix ->vm_file accounting

On Wed, 6 Feb 2008, Matt Helsley wrote:
> On Wed, 2008-02-06 at 20:33 +0000, Hugh Dickins wrote:
> >
> > Sorry, Matt, I don't like your patch at all. It seems to add a fair
> > amount of ugliness and unmaintainablity, all for a peculiar MVFS case
>
> I thought that getting rid of the separate versions of proc_exe_link()
> improved maintainability.

That's a plus, but I don't see it balances the minus.

> Do you have any specific details on what you
> think makes the code introduced by the patch unmaintainable?

The fact that you now have to shadow operations on vm_file by
operations on your exe_file, easy to get out of step; and that
we'll tend not to notice when it goes wrong, because the common
case will have them both in the root mount (that will hide busy
errors, won't it? or am I mistaken?).

Though I should concede that your latest patch looks like it catches
all the places it needs to, and that they don't really stretch beyond
mm/mmap.c. If you provided functions to do the get_file/fput/VM_EXE-
CUTABLE stuff, most of the ugliness would be confined to fs/proc/base.c.

I much preferred Andrew's take-a-copy-of-the-pathname approach, but
admit I'm not one to appreciate the namespace arguments against it.
I wish I had a more constructive solution.

> I still think it would help any stacking filesystems that can't use the
> solution adopted by unionfs.

If you think it's going to help lots of others, please get them to
speak up in support. But we're rather short of stacking filesystems
in the kernel at present, so it can be hard to argue.

> > And I found it quite hard to see where the crucial difference comes.
> > I guess it's that MVFS changes vma->vm_file in its ->mmap? Well, if
>
> Yup.
>
> > MVFS does that, maybe something else does that too, but precisely to
> > rely on the present behaviour of /proc/pid/exe - so in fixing for
> > MVFS, we'd be breaking that hypothetical other?
>
> I'm not completely certain that I understand your point. Are you
> suggesting that some hypothetical code would want to use this "quirk"
> of /proc/pid/exe for a legitimate purpose?

Yes, but our different viewpoints give us a different idea of what's
a quirk. The quirk I see is that MVFS is fiddling with vma->vm_file
in its ->mmap, which is rather unusual (though not unique); and then
complaining when this doesn't work as it wants. But you see it as a
quirk that the VM_EXECUTABLE's vm_file gets used for /proc/pid/exe?

> Assuming that is your point, I thought my non-hypothetical java example
> clearly demonstrated that at least one non-hypothetical program doesn't
> expect the "quirk" and breaks because of it.

Yes, and I'm suggesting that if we change the behaviour to suit you,
we might be causing a regression in something else. No evidence for
that, but it's a possibility (though probably not one I'd be arguing,
if I actually liked the patch involved).

When and if the VFS provides integrated support for stacking filesystems,
it would make sense to do something about this. But as things stand, it's
for the stacking filesystem to manage its stack. Why uglify the kernel
code (which doesn't include any user for this), instead of managing that
stacking yourself i.e. why not leave vm_file as is (or if necessary
point it to a proxy), and use its private_data for your optimizations?

> Frankly,
> given /proc/pid/exe's output in the non-stacking case, I can't see how
> its output in the stacking case we're discussing could be considered
> anything but buggy.

But whose bug?

Hugh

2008-02-11 10:16:53

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] sys_remap_file_pages: fix ->vm_file accounting

Sorry for delay,

On 02/06, Hugh Dickins wrote:
>
> On Sun, 3 Feb 2008, Oleg Nesterov wrote:
> > Off-topic question to all. sys_remap_file_pages() doesn't work with
> > shared readonly mappings, why?
>
> Slight correction: it works with shared readonly mappings, doesn't
> it, so long as the mmap'ed file was opened for reading and writing?

Yes sure. I meant, if the file was opened without FMODE_WRITE, then
mmap(PROT_READ, MAP_SHARED) doesn't actually set VM_SHARED, it only
sets VM_MAYSHARE.

(this looks understandable, but means that !VM_SHARED doesn't necessary
imply the possible cow).

> > IOW, why it checks VM_SHARED but not VM_MAYSHARE?
>
> My guess has always been that it was just a misunderstanding of how
> those VM_ flags end up working: assume so unless Ingo corrects me.
>
> By the time I realized that oddity, we'd been driven into several
> tiresome corners by the very existence of sys_remap_file_pages.
> So whereas my usual instinct would have been to relax the restriction
> and generalize, in its case I wanted to hold on to every restriction
> we had.

OK, thanks a lot. I was afraid I missed some "obvious" reason why we
can't do this.

Oleg.