2015-07-16 23:15:51

by Oleg Nesterov

[permalink] [raw]
Subject: [PATCH] mm-move-mremap-from-file_operations-to-vm_operations_struct-fix

fs/aio.c can't be compiled if CONFIG_MMU=n, filemap_page_mkwrite()
is not defined in this case. Add yet another "must not be called"
helper into nommu.c to make the linker happy.

I still think this is pointless, afaics sys_io_setup() simply can't
succeed if CONFIG_MMU=n. Instead we should make CONFIG_AIO depend
on CONFIG_MMU.

Reported-by: Fengguang Wu <[email protected]>
Reported-by: Stephen Rothwell <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>
---
mm/nommu.c | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/mm/nommu.c b/mm/nommu.c
index e3026fd..979afad 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -2008,6 +2008,12 @@ void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
}
EXPORT_SYMBOL(filemap_map_pages);

+int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ BUG();
+ return 0;
+}
+
static int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
unsigned long addr, void *buf, int len, int write)
{
--
1.5.5.1


2015-07-16 23:22:52

by Stephen Rothwell

[permalink] [raw]
Subject: Re: [PATCH] mm-move-mremap-from-file_operations-to-vm_operations_struct-fix

Hi Oleg,

On Fri, 17 Jul 2015 01:14:05 +0200 Oleg Nesterov <[email protected]> wrote:
>
> fs/aio.c can't be compiled if CONFIG_MMU=n, filemap_page_mkwrite()
> is not defined in this case. Add yet another "must not be called"
> helper into nommu.c to make the linker happy.
>
> I still think this is pointless, afaics sys_io_setup() simply can't
> succeed if CONFIG_MMU=n. Instead we should make CONFIG_AIO depend
> on CONFIG_MMU.
>
> Reported-by: Fengguang Wu <[email protected]>
> Reported-by: Stephen Rothwell <[email protected]>
> Signed-off-by: Oleg Nesterov <[email protected]>

Added as a fix to linux-next today.

--
Cheers,
Stephen Rothwell [email protected]

2015-07-16 23:24:48

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] mm-move-mremap-from-file_operations-to-vm_operations_struct-fix

On Fri, 17 Jul 2015 01:14:05 +0200 Oleg Nesterov <[email protected]> wrote:

> fs/aio.c can't be compiled if CONFIG_MMU=n, filemap_page_mkwrite()
> is not defined in this case. Add yet another "must not be called"
> helper into nommu.c to make the linker happy.
>
> I still think this is pointless, afaics sys_io_setup() simply can't
> succeed if CONFIG_MMU=n. Instead we should make CONFIG_AIO depend
> on CONFIG_MMU.
>
> ..
>
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -2008,6 +2008,12 @@ void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
> }
> EXPORT_SYMBOL(filemap_map_pages);
>
> +int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> + BUG();
> + return 0;
> +}
> +
> static int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
> unsigned long addr, void *buf, int len, int write)
> {

So if anyone starts testing aio on NOMMU, this patch will make the
whole thing immediately go BUG. This isn't helpful :(

Yes, making AIO depend on MMU sounds better. Because if it wasn't
busted before, it sure is now!

2015-07-16 23:54:15

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] mm-move-mremap-from-file_operations-to-vm_operations_struct-fix

On 07/16, Andrew Morton wrote:
>
> On Fri, 17 Jul 2015 01:14:05 +0200 Oleg Nesterov <[email protected]> wrote:
>
> > fs/aio.c can't be compiled if CONFIG_MMU=n, filemap_page_mkwrite()
> > is not defined in this case. Add yet another "must not be called"
> > helper into nommu.c to make the linker happy.
> >
> > I still think this is pointless, afaics sys_io_setup() simply can't
> > succeed if CONFIG_MMU=n. Instead we should make CONFIG_AIO depend
> > on CONFIG_MMU.
> >
> > ..
> >
> > --- a/mm/nommu.c
> > +++ b/mm/nommu.c
> > @@ -2008,6 +2008,12 @@ void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
> > }
> > EXPORT_SYMBOL(filemap_map_pages);
> >
> > +int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > +{
> > + BUG();
> > + return 0;
> > +}
> > +
> > static int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
> > unsigned long addr, void *buf, int len, int write)
> > {
>
> So if anyone starts testing aio on NOMMU, this patch will make the
> whole thing immediately go BUG. This isn't helpful :(

Well, I'm afraid I could miss something, but _afaics_ this can not
happen. filemap_page_mkwrite() can't be called if NOMMU.

In particular, simply because sys_io_setup() is the only user (if
NOMMU) and it can't succeed. But even if I missed something and it
can succeed, ->page_mkwrite() must not be called anyway. But this,
again, unless I missed something ;)

> Yes, making AIO depend on MMU sounds better.

Perhaps Benjamin can change his mind or correct me.

> Because if it wasn't
> busted before, it sure is now!

I hope this change can't make any difference.

Oleg.

2015-07-17 14:06:18

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [PATCH] mm-move-mremap-from-file_operations-to-vm_operations_struct-fix

On Fri, Jul 17, 2015 at 01:52:28AM +0200, Oleg Nesterov wrote:
> On 07/16, Andrew Morton wrote:
> >
> > On Fri, 17 Jul 2015 01:14:05 +0200 Oleg Nesterov <[email protected]> wrote:
> >
> > > fs/aio.c can't be compiled if CONFIG_MMU=n, filemap_page_mkwrite()
> > > is not defined in this case. Add yet another "must not be called"
> > > helper into nommu.c to make the linker happy.
> > >
> > > I still think this is pointless, afaics sys_io_setup() simply can't
> > > succeed if CONFIG_MMU=n. Instead we should make CONFIG_AIO depend
> > > on CONFIG_MMU.
> > >
> > > ..
> > >
> > > --- a/mm/nommu.c
> > > +++ b/mm/nommu.c
> > > @@ -2008,6 +2008,12 @@ void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > }
> > > EXPORT_SYMBOL(filemap_map_pages);
> > >
> > > +int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > +{
> > > + BUG();
> > > + return 0;
> > > +}
> > > +
> > > static int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
> > > unsigned long addr, void *buf, int len, int write)
> > > {
> >
> > So if anyone starts testing aio on NOMMU, this patch will make the
> > whole thing immediately go BUG. This isn't helpful :(
>
> Well, I'm afraid I could miss something, but _afaics_ this can not
> happen. filemap_page_mkwrite() can't be called if NOMMU.
>
> In particular, simply because sys_io_setup() is the only user (if
> NOMMU) and it can't succeed. But even if I missed something and it
> can succeed, ->page_mkwrite() must not be called anyway. But this,
> again, unless I missed something ;)
>
> > Yes, making AIO depend on MMU sounds better.
>
> Perhaps Benjamin can change his mind or correct me.

Either try to fix it correctly, or disable the config. Making it just
compile but be knowingly broken is worse than either of those 2 options.
My point was that it is valid for someone to want to use the functionality
on a nommu system, and given that it should have worked before the page
migration code was added, It Would Be Nice(tm) to return it to that state.
Adding a BUG() like that to the code is just plain broken.

-ben

> > Because if it wasn't
> > busted before, it sure is now!
>
> I hope this change can't make any difference.
>
> Oleg.

--
"Thought is the essence of where you are now."

2015-07-17 17:29:14

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] mm-move-mremap-from-file_operations-to-vm_operations_struct-fix

Benjamin,

it seems that we do not understand each other,

On 07/17, Benjamin LaHaise wrote:
>
> On Fri, Jul 17, 2015 at 01:52:28AM +0200, Oleg Nesterov wrote:
> > > >
> > > > +int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > > +{
> > > > + BUG();
> > > > + return 0;
> > > > +}
> > > > +
> > > > static int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
> > > > unsigned long addr, void *buf, int len, int write)
> > > > {
> > >
> > > So if anyone starts testing aio on NOMMU, this patch will make the
> > > whole thing immediately go BUG. This isn't helpful :(
> >
> > Well, I'm afraid I could miss something, but _afaics_ this can not
> > happen. filemap_page_mkwrite() can't be called if NOMMU.
> >
> > In particular, simply because sys_io_setup() is the only user (if
> > NOMMU) and it can't succeed. But even if I missed something and it
> > can succeed, ->page_mkwrite() must not be called anyway. But this,
> > again, unless I missed something ;)
> >
> > > Yes, making AIO depend on MMU sounds better.
> >
> > Perhaps Benjamin can change his mind or correct me.
>
> Either try to fix it correctly,

And I think this fix is correct. In a sense that we only add
filemap_page_mkwrite() to make the linker happy, it can never be called
and thus we can never hit this BUG().

Please look at filemap_fault() in nommu.c,

int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
BUG();
return 0;
}

this is the same thing. If nothing else, mm/memory.c is not even compiled
if NOMMU.

> or disable the config.

Yes, I think this makes more sense. but see below...

> Making it just
> compile but be knowingly broken is worse than either of those 2 options.

Why? See above. I think this change makes no difference except it fixes
the build.

Again, of course I could miss something. Could you explain your point?

> My point was that it is valid for someone to want to use the functionality
> on a nommu system, and given that it should have worked before the page
> migration code was added, It Would Be Nice(tm) to return it to that state.

Perhaps it worked on NOMMU before, I have no idea. But currently, afaics,
it can not. Even sys_io_setup() can't suceed. So I do not understand why
do we allow NOMMU && CONFIG_AIO.

But this is another issue. Of course I won't insist, please forget.

> Adding a BUG() like that to the code is just plain broken.

Why? Could you explain what I have missed?

Oleg.

2015-07-17 17:37:59

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [PATCH] mm-move-mremap-from-file_operations-to-vm_operations_struct-fix

On Fri, Jul 17, 2015 at 07:27:26PM +0200, Oleg Nesterov wrote:
> Benjamin,
>
> it seems that we do not understand each other,
...
> >
> > Either try to fix it correctly,
>
> And I think this fix is correct. In a sense that we only add
> filemap_page_mkwrite() to make the linker happy, it can never be called
> and thus we can never hit this BUG().
>
> Please look at filemap_fault() in nommu.c,
>
> int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> {
> BUG();
> return 0;
> }
>
> this is the same thing. If nothing else, mm/memory.c is not even compiled
> if NOMMU.

Using BUG() is the wrong approach. If the code is not needed in NOMMU, then
#ifdef it out. Think about it: NOMMU systems are very low memory systems
and they should not have dead code compiled in if it is not needed.

> > or disable the config.
>
> Yes, I think this makes more sense. but see below...
>
> > Making it just
> > compile but be knowingly broken is worse than either of those 2 options.
>
> Why? See above. I think this change makes no difference except it fixes
> the build.
>
> Again, of course I could miss something. Could you explain your point?

Don't add BUG(). It's the equivalent approach of saying "I think this code
isn't needed, but I'm lazy and not going to remove it properly."

> > My point was that it is valid for someone to want to use the functionality
> > on a nommu system, and given that it should have worked before the page
> > migration code was added, It Would Be Nice(tm) to return it to that state.
>
> Perhaps it worked on NOMMU before, I have no idea. But currently, afaics,
> it can not. Even sys_io_setup() can't suceed. So I do not understand why
> do we allow NOMMU && CONFIG_AIO.
>
> But this is another issue. Of course I won't insist, please forget.

As I said in my earlier email: aio on NOMMU was broken by the page migration
code that was added in mid 3.1x. Fixing it should not be that hard.
>
> > Adding a BUG() like that to the code is just plain broken.
>
> Why? Could you explain what I have missed?

It's doing half the job. Either the code should be #if'd out or not.

-ben

> Oleg.

--
"Thought is the essence of where you are now."

2015-07-17 17:57:30

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] mm-move-mremap-from-file_operations-to-vm_operations_struct-fix

On 07/17, Benjamin LaHaise wrote:
>
> On Fri, Jul 17, 2015 at 07:27:26PM +0200, Oleg Nesterov wrote:
> > Benjamin,
> >
> > it seems that we do not understand each other,
> ...
> > >
> > > Either try to fix it correctly,
> >
> > And I think this fix is correct. In a sense that we only add
> > filemap_page_mkwrite() to make the linker happy, it can never be called
> > and thus we can never hit this BUG().
> >
> > Please look at filemap_fault() in nommu.c,
> >
> > int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> > {
> > BUG();
> > return 0;
> > }
> >
> > this is the same thing. If nothing else, mm/memory.c is not even compiled
> > if NOMMU.
>
> Using BUG() is the wrong approach. If the code is not needed in NOMMU, then
> #ifdef it out. Think about it: NOMMU systems are very low memory systems
> and they should not have dead code compiled in if it is not needed.

OK, at least I hope you no longer think that this patch makes this code
knowingly broken.

> Don't add BUG(). It's the equivalent approach of saying "I think this code
> isn't needed, but I'm lazy and not going to remove it properly."

There is another interpretation: I think this code must be never called,
if it is actually called we have a serious problem which should be loudly
reported.

> > Why? Could you explain what I have missed?
>
> It's doing half the job. Either the code should be #if'd out or not.

Again, filemap_page_mkwrite() added to nommu.c matches filemap_fault()
and filemap_map_pages() we already have.


But I won't argue, you are maintainer. What exactly do you want me to
ifdef? Will you agree with the patch which adds ifdef into
aio_ring_vm_ops,

static const struct vm_operations_struct aio_ring_vm_ops = {
.mremap = aio_ring_mremap,
#ifdef CONFIG_MMU
.fault = filemap_fault,
.map_pages = filemap_map_pages,
.page_mkwrite = filemap_page_mkwrite,
#endif
};

?

Oleg.

2015-07-17 18:12:29

by Austin S Hemmelgarn

[permalink] [raw]
Subject: Re: [PATCH] mm-move-mremap-from-file_operations-to-vm_operations_struct-fix

On 2015-07-17 13:55, Oleg Nesterov wrote:
> On 07/17, Benjamin LaHaise wrote:
>>
>> Don't add BUG(). It's the equivalent approach of saying "I think this code
>> isn't needed, but I'm lazy and not going to remove it properly."
>
> There is another interpretation: I think this code must be never called,
> if it is actually called we have a serious problem which should be loudly
> reported.
>
And not compiling it at all _will_ loudly report it, it'll just report
it during linking instead of at run-time, which is a much better time to
shout about it.


Attachments:
smime.p7s (2.90 kB)
S/MIME Cryptographic Signature

2015-07-17 18:21:28

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] mm-move-mremap-from-file_operations-to-vm_operations_struct-fix

On 07/17, Austin S Hemmelgarn wrote:
>
> On 2015-07-17 13:55, Oleg Nesterov wrote:
>> On 07/17, Benjamin LaHaise wrote:
>>>
>>> Don't add BUG(). It's the equivalent approach of saying "I think this code
>>> isn't needed, but I'm lazy and not going to remove it properly."
>>
>> There is another interpretation: I think this code must be never called,
>> if it is actually called we have a serious problem which should be loudly
>> reported.
>>
> And not compiling it at all _will_ loudly report it, it'll just report
> it during linking instead of at run-time, which is a much better time to
> shout about it.

And how can we do this?

Oleg.

2015-07-17 18:39:33

by Austin S Hemmelgarn

[permalink] [raw]
Subject: Re: [PATCH] mm-move-mremap-from-file_operations-to-vm_operations_struct-fix

On 2015-07-17 14:19, Oleg Nesterov wrote:
> On 07/17, Austin S Hemmelgarn wrote:
>>
>> On 2015-07-17 13:55, Oleg Nesterov wrote:
>>> On 07/17, Benjamin LaHaise wrote:
>>>>
>>>> Don't add BUG(). It's the equivalent approach of saying "I think this code
>>>> isn't needed, but I'm lazy and not going to remove it properly."
>>>
>>> There is another interpretation: I think this code must be never called,
>>> if it is actually called we have a serious problem which should be loudly
>>> reported.
>>>
>> And not compiling it at all _will_ loudly report it, it'll just report
>> it during linking instead of at run-time, which is a much better time to
>> shout about it.
>
> And how can we do this?
>
If a function that isn't defined (for example, you use a #if block to
comment it out under certain circumstances), then the link will fail
rather noisily something references it. We already know during the
compile that it's a NOMMU kernel, so anything that calls it on a MMU
enabled kernel can have a compile time check added instead of doing the
check at runtime (or even just calling it without checking), thus even
further reducing code size.



Attachments:
smime.p7s (2.90 kB)
S/MIME Cryptographic Signature

2015-07-17 18:56:23

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] mm-move-mremap-from-file_operations-to-vm_operations_struct-fix

On 07/17, Austin S Hemmelgarn wrote:
>
> On 2015-07-17 14:19, Oleg Nesterov wrote:
>> On 07/17, Austin S Hemmelgarn wrote:
>>>
>>> On 2015-07-17 13:55, Oleg Nesterov wrote:
>>>> On 07/17, Benjamin LaHaise wrote:
>>>>>
>>>>> Don't add BUG(). It's the equivalent approach of saying "I think this code
>>>>> isn't needed, but I'm lazy and not going to remove it properly."
>>>>
>>>> There is another interpretation: I think this code must be never called,
>>>> if it is actually called we have a serious problem which should be loudly
>>>> reported.
>>>>
>>> And not compiling it at all _will_ loudly report it, it'll just report
>>> it during linking instead of at run-time, which is a much better time to
>>> shout about it.
>>
>> And how can we do this?
>>
> If a function that isn't defined (for example, you use a #if block to
> comment it out under certain circumstances), then the link will fail
> rather noisily something references it.

This is what we are trying to fix.

> We already know during the
> compile that it's a NOMMU kernel, so anything that calls it on a MMU
> enabled kernel can have a compile time check added

It already has. memory.c is not compiled if NOMMU.

The problem is aio_ring_vm_ops which references this function. And btw
filemap_fault() too.

And just in case, I won't mind to add ifdef(CONFIG_MMU) there, I am
waiting for reply from Benjamin.

> instead of doing the
> check at runtime (or even just calling it without checking), thus even
> further reducing code size.

So what exactly do you suggest to fix the problem?

I agree with any solution which satisfies the maintainers.

Oleg.

2015-07-17 19:09:33

by Austin S Hemmelgarn

[permalink] [raw]
Subject: Re: [PATCH] mm-move-mremap-from-file_operations-to-vm_operations_struct-fix

On 2015-07-17 14:54, Oleg Nesterov wrote:
> On 07/17, Austin S Hemmelgarn wrote:
>>
>> On 2015-07-17 14:19, Oleg Nesterov wrote:
>>> On 07/17, Austin S Hemmelgarn wrote:
>>>>
>>>> On 2015-07-17 13:55, Oleg Nesterov wrote:
>>>>> On 07/17, Benjamin LaHaise wrote:
>>>>>>
>>>>>> Don't add BUG(). It's the equivalent approach of saying "I think this code
>>>>>> isn't needed, but I'm lazy and not going to remove it properly."
>>>>>
>>>>> There is another interpretation: I think this code must be never called,
>>>>> if it is actually called we have a serious problem which should be loudly
>>>>> reported.
>>>>>
>>>> And not compiling it at all _will_ loudly report it, it'll just report
>>>> it during linking instead of at run-time, which is a much better time to
>>>> shout about it.
>>>
>>> And how can we do this?
>>>
>> If a function that isn't defined (for example, you use a #if block to
>> comment it out under certain circumstances), then the link will fail
>> rather noisily something references it.
>
> This is what we are trying to fix.
Ah, I misunderstood the intent of the patch, looking at it again this
makes perfect sense.
>
>> We already know during the
>> compile that it's a NOMMU kernel, so anything that calls it on a MMU
>> enabled kernel can have a compile time check added
>
> It already has. memory.c is not compiled if NOMMU.
>
> The problem is aio_ring_vm_ops which references this function. And btw
> filemap_fault() too.
>
> And just in case, I won't mind to add ifdef(CONFIG_MMU) there, I am
> waiting for reply from Benjamin.
>
>> instead of doing the
>> check at runtime (or even just calling it without checking), thus even
>> further reducing code size.
>
> So what exactly do you suggest to fix the problem?
Some sort of COMPILE_BUG_ON usage (I think that's the name of the macro
that I'm thinking of, not certain though) possibly? My only point is
that it's more useful if the function should never be called to find a
way to enforce this at build time instead of runtime (even more so for
NOMMU systems, they tend (in my experience at least) to be even more of
a pain to debug when they crash). I do agree though that a helpful
message is much preferred to a link error.



Attachments:
smime.p7s (2.90 kB)
S/MIME Cryptographic Signature

2015-07-17 22:33:36

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] mm-move-mremap-from-file_operations-to-vm_operations_struct-fix

On 07/17, Benjamin LaHaise wrote:
>
> it should have worked before the page
> migration code was added,

This is off-topic, but the whole "vm" logic in aio_setup_ring()
looks sub-optimal. I do not mean the code, just it seems to me it
is pointless to pollute the page cache, and expose the pages we
can not swap/free to lru. Afaics we _only_ need this for migration.

Perhaps I missed something, doesn't matter. But this means that
this memory is not accounted, so if I increase aio-max-nr then
this test-case

#define __NR_io_setup 206

int main(void)
{
int nr;

for (nr = 0; ;++nr) {
void *ctx = NULL;
int ret = syscall(__NR_io_setup, 1, &ctx);
if (ret) {
printf("failed %d %m: ", nr);
getchar();
}
}

return 0;
}

triggers OOM-killer which kills sshd and other daemons on my machine.
These pages were not even faulted in (or the shrinker can unmap them),
the kernel can not know who should be blamed.

Shouldn't we account aio events/pages somehow, say per-user, or in
mm->pinned_vm ?

I do not think this is unkown, and probably this all is fine. IOW,
this is just a question, not a bug-report or something like this.

And of course, this is not exploitable because aio-max-nr limits
the number of pages you can steal.

But otoh, aio_max_nr is system-wide, so the unpriviliged user can
ddos (say) mysqld. And this leads to the same question: shouldn't
we account nr_events at least?

Oleg.

2015-07-17 22:58:46

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] mm-move-mremap-from-file_operations-to-vm_operations_struct-fix

Benjamin,

This discussion was a bit confusing, so let me try to summarize and
please correct me if I misunderstood.

And sorry for bothering you, I just want to fix and forget about the
problem which was introduced by me (build failure with NOMMU).

So _iiuc_ you are starting to agree that tehcnically this change is
correct and we can never hit this BUG(). Like we can never hit another
BUG() in nommu.c:filemap_fault(), also referenced by aio_ring_vm_ops.
And this change should not hurt even if you make aio work with NOMMU.

No?

However, you still dislike this change because you think it is sub-
optimal and/or not clean enough. I won't argue with maintainer.

So what do you suggest instead? Will you agree with ifdef(CONFIG_MMU)
in aio_ring_vm_ops?

I aggree in advance with any suggestion.

On 07/17, Oleg Nesterov wrote:
>
> On 07/17, Benjamin LaHaise wrote:
> >
> > On Fri, Jul 17, 2015 at 07:27:26PM +0200, Oleg Nesterov wrote:
> > > Benjamin,
> > >
> > > it seems that we do not understand each other,
> > ...
> > > >
> > > > Either try to fix it correctly,
> > >
> > > And I think this fix is correct. In a sense that we only add
> > > filemap_page_mkwrite() to make the linker happy, it can never be called
> > > and thus we can never hit this BUG().
> > >
> > > Please look at filemap_fault() in nommu.c,
> > >
> > > int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > {
> > > BUG();
> > > return 0;
> > > }
> > >
> > > this is the same thing. If nothing else, mm/memory.c is not even compiled
> > > if NOMMU.
> >
> > Using BUG() is the wrong approach. If the code is not needed in NOMMU, then
> > #ifdef it out. Think about it: NOMMU systems are very low memory systems
> > and they should not have dead code compiled in if it is not needed.
>
> OK, at least I hope you no longer think that this patch makes this code
> knowingly broken.
>
> > Don't add BUG(). It's the equivalent approach of saying "I think this code
> > isn't needed, but I'm lazy and not going to remove it properly."
>
> There is another interpretation: I think this code must be never called,
> if it is actually called we have a serious problem which should be loudly
> reported.
>
> > > Why? Could you explain what I have missed?
> >
> > It's doing half the job. Either the code should be #if'd out or not.
>
> Again, filemap_page_mkwrite() added to nommu.c matches filemap_fault()
> and filemap_map_pages() we already have.
>
>
> But I won't argue, you are maintainer. What exactly do you want me to
> ifdef? Will you agree with the patch which adds ifdef into
> aio_ring_vm_ops,
>
> static const struct vm_operations_struct aio_ring_vm_ops = {
> .mremap = aio_ring_mremap,
> #ifdef CONFIG_MMU
> .fault = filemap_fault,
> .map_pages = filemap_map_pages,
> .page_mkwrite = filemap_page_mkwrite,
> #endif
> };
>
> ?
>
> Oleg.

2015-07-20 14:22:16

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH] mm-move-mremap-from-file_operations-to-vm_operations_struct-fix

Hi, Oleg,

Oleg Nesterov <[email protected]> writes:

> Shouldn't we account aio events/pages somehow, say per-user, or in
> mm->pinned_vm ?

Ages ago I wrote a patch to account the completion ring to a process'
memlock limit:
"[patch] aio: remove aio-max-nr and instead use the memlock rlimit to
limit the number of pages pinned for the aio completion ring"
http://marc.info/?l=linux-aio&m=123661380807041&w=2

The problem with that patch is that it modifies the user/kernel
interface. It could be done over time, as Andrew outlined in that
thread, but I've been reluctant to take that on.

If you just mean we should account the memory so that the right process
can be killed, that sounds like a good idea to me.

Cheers,
Jeff

2015-07-20 17:35:04

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] mm-move-mremap-from-file_operations-to-vm_operations_struct-fix

Hi Jeff,

On 07/20, Jeff Moyer wrote:
>
> Hi, Oleg,
>
> Oleg Nesterov <[email protected]> writes:
>
> > Shouldn't we account aio events/pages somehow, say per-user, or in
> > mm->pinned_vm ?
>
> Ages ago I wrote a patch to account the completion ring to a process'
> memlock limit:
> "[patch] aio: remove aio-max-nr and instead use the memlock rlimit to
> limit the number of pages pinned for the aio completion ring"
> http://marc.info/?l=linux-aio&m=123661380807041&w=2
>
> The problem with that patch is that it modifies the user/kernel
> interface. It could be done over time, as Andrew outlined in that
> thread, but I've been reluctant to take that on.

See also the usage of mm->pinned_vm and user->locked_vm in perf_mmap(),
perhaps aio can do the same...

> If you just mean we should account the memory so that the right process
> can be killed, that sounds like a good idea to me.

Not sure we actually need this. I only meant that this looks confusing
because this memory is actually locked but the kernel doesn't know this.

And btw, I forgot to mention that I triggered OOM on the testing machine
with only 512mb ram, and aio-max-nr was huge. So, once again, while this
all doesn't look right to me, I do not think this is the real problem.

Except the fact that an unpriviliged user can steal all aio-max-nr events.
This probably worth fixing in any case.



And if we accept the fact this memory is locked and if we properly account
it, then may be we can just kill aio_migratepage(), aio_private_file(), and
change aio_setup_ring() to simply use install_special_mapping(). This will
greatly simplify the code. But let me remind that I know nothing about aio,
so please don't take my thoughts seriously.

Oleg.

2015-07-20 17:51:49

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [PATCH] mm-move-mremap-from-file_operations-to-vm_operations_struct-fix

On Mon, Jul 20, 2015 at 07:33:11PM +0200, Oleg Nesterov wrote:
> Hi Jeff,
>
> On 07/20, Jeff Moyer wrote:
> >
> > Hi, Oleg,
> >
> > Oleg Nesterov <[email protected]> writes:
> >
> > > Shouldn't we account aio events/pages somehow, say per-user, or in
> > > mm->pinned_vm ?
> >
> > Ages ago I wrote a patch to account the completion ring to a process'
> > memlock limit:
> > "[patch] aio: remove aio-max-nr and instead use the memlock rlimit to
> > limit the number of pages pinned for the aio completion ring"
> > http://marc.info/?l=linux-aio&m=123661380807041&w=2
> >
> > The problem with that patch is that it modifies the user/kernel
> > interface. It could be done over time, as Andrew outlined in that
> > thread, but I've been reluctant to take that on.
>
> See also the usage of mm->pinned_vm and user->locked_vm in perf_mmap(),
> perhaps aio can do the same...
>
> > If you just mean we should account the memory so that the right process
> > can be killed, that sounds like a good idea to me.
>
> Not sure we actually need this. I only meant that this looks confusing
> because this memory is actually locked but the kernel doesn't know this.
>
> And btw, I forgot to mention that I triggered OOM on the testing machine
> with only 512mb ram, and aio-max-nr was huge. So, once again, while this
> all doesn't look right to me, I do not think this is the real problem.
>
> Except the fact that an unpriviliged user can steal all aio-max-nr events.
> This probably worth fixing in any case.
>
>
>
> And if we accept the fact this memory is locked and if we properly account
> it, then may be we can just kill aio_migratepage(), aio_private_file(), and
> change aio_setup_ring() to simply use install_special_mapping(). This will
> greatly simplify the code. But let me remind that I know nothing about aio,
> so please don't take my thoughts seriously.

No, you can't get rid of that code. The page migration is required when
CPUs/memory is offlined and data needs to be moved to another node.
Similarly, support for mremap() is also required for container migration /
restoration.

As for accounting locked memory, we don't do that for memory pinned by
O_DIRECT either. Given how small the amount of memory aio can pin is
compared to O_DIRECT or mlock(), it is unlikely that the accounting of
how much aio has pinned will make any real difference in the big picture.
A single O_DIRECT i/o can pin megabytes of memory.

-ben

> Oleg.

--
"Thought is the essence of where you are now."

2015-07-20 18:30:15

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH] mm-move-mremap-from-file_operations-to-vm_operations_struct-fix

Benjamin LaHaise <[email protected]> writes:

> As for accounting locked memory, we don't do that for memory pinned by
> O_DIRECT either. Given how small the amount of memory aio can pin is
> compared to O_DIRECT or mlock(), it is unlikely that the accounting of
> how much aio has pinned will make any real difference in the big picture.
> A single O_DIRECT i/o can pin megabytes of memory.

Actually, you can pin a lot of memory with aio. Worst case, aio-max-nr
represents the maximum number of pages you can lock in memory (assuming
you pass 1 to io_setup in a loop). So, for the default of 65536 events,
that comes to 256MB.

My system has libvirt installed, which changes the default to 1 million:
$ grep aio-max-nr /usr/lib/sysctl.d/libvirtd.conf
fs.aio-max-nr = 1048576

So, that means up to 4GB of memory can be tied up by aio. Oracle's
installation guide recommends the same.

The difference between the aio ring and something like DIO is that the
ring is long-lived. I/O, ideally, doesn't take that long, so that's a
poor comparison to make.

I tend to agree with Oleg, a system-wide setting just isn't a good fit
for this. Changing it now that it's in place is difficult, though, and
obviously low priority.

-Jeff

2015-07-20 18:33:42

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] mm-move-mremap-from-file_operations-to-vm_operations_struct-fix

On 07/20, Benjamin LaHaise wrote:
>
> On Mon, Jul 20, 2015 at 07:33:11PM +0200, Oleg Nesterov wrote:
> >
> > And if we accept the fact this memory is locked and if we properly account
> > it, then may be we can just kill aio_migratepage(), aio_private_file(), and
> > change aio_setup_ring() to simply use install_special_mapping(). This will
> > greatly simplify the code. But let me remind that I know nothing about aio,
> > so please don't take my thoughts seriously.
>
> No, you can't get rid of that code. The page migration is required when
> CPUs/memory is offlined and data needs to be moved to another node.

Of course, if we remove aio_migratepage() then aio can't be moved,

> Similarly, support for mremap() is also required for container migration /
> restoration.

This is not the problem. And one of the reasons to move ->mremap() into
vm_operations_struct was that install_special_mapping() can use it.

> Given how small the amount of memory aio can pin

I agree, but why should we worry about migration then? let this memory be
unmovable, don't use GFP_RECLAIMABLE/MOVABLE, etc.

But again, again, please ignore. This all is off-topic and my understanding
is very limited.

> it is unlikely that the accounting of
> how much aio has pinned will make any real difference in the big picture.

Agreed, but this can help to remove the system-wide aio-max-nr. Again,
unpriviliged user can steal aio.

Oleg.

2015-07-20 19:26:34

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] mm-move-mremap-from-file_operations-to-vm_operations_struct-fix

On 07/20, Oleg Nesterov wrote:
>
> But again, again, please ignore. This all is off-topic and my understanding
> is very limited.

Yes, yes, but sorry for noise and let me repeat...

This memory lives in page-cache/lru, it is visible for shrinker which
will unmap these pages for no reason on memory shortage. IOW, aio fools
the kernel, this memory looks reclaimable but it is not. And we only do
this for migration.

Even if this is not a problem, this does not look right. So perhaps at
least mapping_set_unevictable() makes sense. But I simply do not know
if migration will work with this change.


And I should have changes the subject a long ago... So what do you think
we should do with the build failure?

Oleg.

2015-07-20 19:39:15

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [PATCH] mm-move-mremap-from-file_operations-to-vm_operations_struct-fix

On Mon, Jul 20, 2015 at 09:24:40PM +0200, Oleg Nesterov wrote:
> On 07/20, Oleg Nesterov wrote:
> >
> > But again, again, please ignore. This all is off-topic and my understanding
> > is very limited.
>
> Yes, yes, but sorry for noise and let me repeat...
>
> This memory lives in page-cache/lru, it is visible for shrinker which
> will unmap these pages for no reason on memory shortage. IOW, aio fools
> the kernel, this memory looks reclaimable but it is not. And we only do
> this for migration.

And we have the same problem with O_DIRECT. Given the size of the LRU in
a modern system, I highly doubt a handful of pages getting scanned is a
major problem. If you want to improve this, go ahead, but we need to
retain support for page migration as people have run into the need for it.

> Even if this is not a problem, this does not look right. So perhaps at
> least mapping_set_unevictable() makes sense. But I simply do not know
> if migration will work with this change.

Nor do I know if that will work.

> And I should have changes the subject a long ago... So what do you think
> we should do with the build failure?

I honestly don't care what of the options you do -- please just don't go
about adding BUG()s.

-ben

> Oleg.

--
"Thought is the essence of where you are now."

2015-07-20 20:05:51

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] mm-move-mremap-from-file_operations-to-vm_operations_struct-fix

On 07/20, Benjamin LaHaise wrote:
>
> On Mon, Jul 20, 2015 at 09:24:40PM +0200, Oleg Nesterov wrote:
> > On 07/20, Oleg Nesterov wrote:
> > >
> > > But again, again, please ignore. This all is off-topic and my understanding
> > > is very limited.
> >
> > Yes, yes, but sorry for noise and let me repeat...
> >
> > This memory lives in page-cache/lru, it is visible for shrinker which
> > will unmap these pages for no reason on memory shortage. IOW, aio fools
> > the kernel, this memory looks reclaimable but it is not. And we only do
> > this for migration.
>
> And we have the same problem with O_DIRECT.

I don't this so, or I misunderstood. Sure, dio_get_page() will pin the page.
But not forever and this is unavoidable. This is not the same. OK, nevermind,
lets forget this.

> > And I should have changes the subject a long ago... So what do you think
> > we should do with the build failure?
>
> I honestly don't care what of the options you do -- please just don't go
> about adding BUG()s.

OK, thanks.

Then I'll send v2 which adds ifdef(MMU) into aio_ring_vm_ops tomorrow.

Oleg.

2015-07-21 15:30:57

by Oleg Nesterov

[permalink] [raw]
Subject: [PATCH v2] mm-move-mremap-from-file_operations-to-vm_operations_struct-fix

Andrew,

this replaces

mm-move-mremap-from-file_operations-to-vm_operations_struct-fix.patch

in -mm tree nacked by Benjamin.

fs/aio.c can't be compiled if CONFIG_MMU=n, filemap_page_mkwrite() is
not defined in this case. Add ifdef(CONFIG_MMU) into aio_ring_vm_ops.

Note that this only fixes the build error, currently aio doesn't work
if !CONFIG_MMU anyway.

Reported-by: Fengguang Wu <[email protected]>
Reported-by: Stephen Rothwell <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>
---
fs/aio.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index c122624..822c199 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -338,9 +338,11 @@ static int aio_ring_mremap(struct vm_area_struct *vma)

static const struct vm_operations_struct aio_ring_vm_ops = {
.mremap = aio_ring_mremap,
+#ifdef CONFIG_MMU
.fault = filemap_fault,
.map_pages = filemap_map_pages,
.page_mkwrite = filemap_page_mkwrite,
+#endif
};

static int aio_ring_mmap(struct file *file, struct vm_area_struct *vma)
--
1.7.1

2015-07-21 15:38:53

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [PATCH v2] mm-move-mremap-from-file_operations-to-vm_operations_struct-fix

On Tue, Jul 21, 2015 at 05:29:03PM +0200, Oleg Nesterov wrote:
> Andrew,
>
> this replaces
>
> mm-move-mremap-from-file_operations-to-vm_operations_struct-fix.patch
>
> in -mm tree nacked by Benjamin.
>
> fs/aio.c can't be compiled if CONFIG_MMU=n, filemap_page_mkwrite() is
> not defined in this case. Add ifdef(CONFIG_MMU) into aio_ring_vm_ops.
>
> Note that this only fixes the build error, currently aio doesn't work
> if !CONFIG_MMU anyway.
>
> Reported-by: Fengguang Wu <[email protected]>
> Reported-by: Stephen Rothwell <[email protected]>
> Signed-off-by: Oleg Nesterov <[email protected]>
> ---
> fs/aio.c | 2 ++
> 1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/fs/aio.c b/fs/aio.c
> index c122624..822c199 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -338,9 +338,11 @@ static int aio_ring_mremap(struct vm_area_struct *vma)
>
> static const struct vm_operations_struct aio_ring_vm_ops = {
> .mremap = aio_ring_mremap,
> +#ifdef CONFIG_MMU
> .fault = filemap_fault,
> .map_pages = filemap_map_pages,
> .page_mkwrite = filemap_page_mkwrite,
> +#endif
> };

One nit: using #if IS_ENABLED(CONFIG_MMU) is recommended these days not
#ifdef CONFIG_MMU.

-ben

> static int aio_ring_mmap(struct file *file, struct vm_area_struct *vma)
> --
> 1.7.1
>

--
"Thought is the essence of where you are now."

2015-07-21 16:23:32

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v2] mm-move-mremap-from-file_operations-to-vm_operations_struct-fix

On 07/21, Benjamin LaHaise wrote:
>
> On Tue, Jul 21, 2015 at 05:29:03PM +0200, Oleg Nesterov wrote:
> >
> > static const struct vm_operations_struct aio_ring_vm_ops = {
> > .mremap = aio_ring_mremap,
> > +#ifdef CONFIG_MMU
> > .fault = filemap_fault,
> > .map_pages = filemap_map_pages,
> > .page_mkwrite = filemap_page_mkwrite,
> > +#endif
> > };
>
> One nit: using #if IS_ENABLED(CONFIG_MMU) is recommended these days not
> #ifdef CONFIG_MMU.

OK, this matches IS_ENABLED(CONFIG_MIGRATION). I am sending v3.

Oleg.

2015-07-21 16:22:10

by Oleg Nesterov

[permalink] [raw]
Subject: [PATCH v3] mm-move-mremap-from-file_operations-to-vm_operations_struct-fix

Andrew,

this replaces

mm-move-mremap-from-file_operations-to-vm_operations_struct-fix.patch

in -mm tree nacked by Benjamin.

fs/aio.c can't be compiled if CONFIG_MMU=n, filemap_page_mkwrite() is
not defined in this case. Add ifdef(CONFIG_MMU) into aio_ring_vm_ops.

Note that this only fixes the build error, currently aio doesn't work
if !CONFIG_MMU anyway.

Reported-by: Fengguang Wu <[email protected]>
Reported-by: Stephen Rothwell <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>
---
fs/aio.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index c122624..822c199 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -338,9 +338,11 @@ static int aio_ring_mremap(struct vm_area_struct *vma)

static const struct vm_operations_struct aio_ring_vm_ops = {
.mremap = aio_ring_mremap,
+#if IS_ENABLED(CONFIG_MMU)
.fault = filemap_fault,
.map_pages = filemap_map_pages,
.page_mkwrite = filemap_page_mkwrite,
+#endif
};

static int aio_ring_mmap(struct file *file, struct vm_area_struct *vma)
--
1.7.1