2009-06-16 18:05:34

by Izik Eidus

Subject: running get_user_pages() from kernel thread

Hey,

ksm calls get_user_pages() from its kernel thread context, but it looks
like it isn't safe to do get_user_pages() from a kernel thread:
get_user_pages() may lead into do_swap_page(), which may in turn call
grab_swap_token().

Now grab_swap_token() will do:
current_interval = global_faults - current->mm->... (here is the bug:
a kernel thread doesn't have an mm structure, so current->mm is NULL)
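
(To make the path concrete, here is roughly what a kernel thread like ksmd
does - an illustrative sketch, not the actual ksm code; "slot" is a made-up
name for whatever tracks a registered process:

	struct mm_struct *mm = slot->mm;	/* mm of some registered process */
	struct page *page;
	int ret;

	down_read(&mm->mmap_sem);
	/* current->mm is NULL here: we run in a kthread, working on another mm */
	ret = get_user_pages(current, mm, addr, 1, 1, 0, &page, NULL);
	up_read(&mm->mmap_sem);

so the fault path is handed a valid target mm, but current->mm stays NULL.)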

So the question is: is this by design (that a kernel thread can't call
get_user_pages())? Should I use something like switch_mm()?

Thanks.


2009-06-16 18:13:36

by Andrea Arcangeli

Subject: Re: running get_user_pages() from kernel thread

On Tue, Jun 16, 2009 at 09:05:27PM +0300, Izik Eidus wrote:
> So the question is: is this by design (that a kernel thread can't call
> get_user_pages())? Should I use something like switch_mm()?

I think switch_mm trick should be used for page faults, but gup
shouldn't require it because it gets the 'mm' as parameter and the
current->mm has to be irrelevant. current->mm is only relevant for
gup-fast (obviously :). So I think the only bit that needs fixing is
grab_swap_token to not run if current->mm is null.
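
For reference, gup in this era already takes the target mm explicitly,
something like this (a sketch from memory, argument names approximate):

	int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
			   unsigned long start, int nr_pages,
			   int write, int force, struct page **pages,
			   struct vm_area_struct **vmas);

so the only thing on the swapin path that still assumes current->mm is
grab_swap_token(), reached via do_swap_page().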

2009-06-16 18:39:42

by Hugh Dickins

Subject: Re: running get_user_pages() from kernel thread

On Tue, 16 Jun 2009, Andrea Arcangeli wrote:
> On Tue, Jun 16, 2009 at 09:05:27PM +0300, Izik Eidus wrote:
> > So the question is: is this by design (that a kernel thread can't call
> > get_user_pages())? Should I use something like switch_mm()?
>
> I think switch_mm trick should be used for page faults, but gup
> shouldn't require it because it gets the 'mm' as parameter and the
> current->mm has to be irrelevant. current->mm is only relevant for
> gup-fast (obviously :). So I think the only bit that needs fixing is
> grab_swap_token to not run if current->mm is null.

Looks like Izik and I hit the same problem (otherwise running well):
I too decided we needn't do more than avoid the issue in grab_swap_token.
(I've a feeling someone has hit this issue before with some other thread,
though I've no idea which - does RHEL include a patch like this perhaps?).

[PATCH] mm: grab_swap_token back out if no mm

If a kthread happens to use get_user_pages() on an mm (as KSM does),
there's a chance that it will end up trying to read in a swap page,
and oops in grab_swap_token() because the kthread has no mm:
nothing clever, just avoid that case.

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/thrash.c | 3 +++
1 file changed, 3 insertions(+)

--- 2.6.30-mm1/mm/thrash.c 2007-07-09 00:32:17.000000000 +0100
+++ linux/mm/thrash.c 2009-06-15 19:44:53.000000000 +0100
@@ -30,6 +30,9 @@ void grab_swap_token(void)
{
int current_interval;

+ if (!current->mm) /* kthread doing get_user_pages on an mm */
+ return;
+
global_faults++;

current_interval = global_faults - current->mm->faultstamp;

2009-06-16 18:55:29

by Rik van Riel

Subject: Re: running get_user_pages() from kernel thread

Hugh Dickins wrote:
> On Tue, 16 Jun 2009, Andrea Arcangeli wrote:
>> On Tue, Jun 16, 2009 at 09:05:27PM +0300, Izik Eidus wrote:
>>> So the question is: is this by design (that a kernel thread can't call
>>> get_user_pages())? Should I use something like switch_mm()?
>> I think switch_mm trick should be used for page faults, but gup
>> shouldn't require it because it gets the 'mm' as parameter and the
>> current->mm has to be irrelevant. current->mm is only relevant for
>> gup-fast (obviously :). So I think the only bit that needs fixing is
>> grab_swap_token to not run if current->mm is null.
>
> Looks like Izik and I hit the same problem (otherwise running well):
> I too decided we needn't do more than avoid the issue in grab_swap_token.
> (I've a feeling someone has hit this issue before with some other thread,
> though I've no idea which - does RHEL include a patch like this perhaps?).

It looks very familiar, indeed.

RHEL 4 and 5 both have code like this in the swap token
code. I seem to remember submitting such code upstream,
too.

I have no idea why it never got upstream, maybe I was
drowned in Xen work at the time, or maybe the bug simply
didn't happen upstream for whatever reason.

> Signed-off-by: Hugh Dickins <[email protected]>

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed.

2009-06-16 19:20:42

by Izik Eidus

Subject: Re: running get_user_pages() from kernel thread

Hugh Dickins wrote:
> On Tue, 16 Jun 2009, Andrea Arcangeli wrote:
>
>> On Tue, Jun 16, 2009 at 09:05:27PM +0300, Izik Eidus wrote:
>>
>>> So the question is: is this by design (that a kernel thread can't call
>>> get_user_pages())? Should I use something like switch_mm()?
>>>
>> I think switch_mm trick should be used for page faults, but gup
>> shouldn't require it because it gets the 'mm' as parameter and the
>> current->mm has to be irrelevant. current->mm is only relevant for
>> gup-fast (obviously :). So I think the only bit that needs fixing is
>> grab_swap_token to not run if current->mm is null.
>>
>
> Looks like Izik and I hit the same problem (otherwise running well):
> I too decided we needn't do more than avoid the issue in grab_swap_token.
> (I've a feeling someone has hit this issue before with some other thread,
> though I've no idea which - does RHEL include a patch like this perhaps?).
>
> [PATCH] mm: grab_swap_token back out if no mm
>
> If a kthread happens to use get_user_pages() on an mm (as KSM does),
> there's a chance that it will end up trying to read in a swap page,
> and oops in grab_swap_token() because the kthread has no mm:
> nothing clever, just avoid that case.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
>
> mm/thrash.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> --- 2.6.30-mm1/mm/thrash.c 2007-07-09 00:32:17.000000000 +0100
> +++ linux/mm/thrash.c 2009-06-15 19:44:53.000000000 +0100
> @@ -30,6 +30,9 @@ void grab_swap_token(void)
> {
> int current_interval;
>
> + if (!current->mm) /* kthread doing get_user_pages on an mm */
> + return;
> +
> global_faults++;
>
> current_interval = global_faults - current->mm->faultstamp;
>
Good. This solves another issue that you probably don't hit because you
work with the madvise version:
the .release callback of the file_operations structure will call
ksm_sma_release(), which calls remove_slot_from_hash_and_tree(), which
does the silly break_cow() call (even though the process has just died!)
Now, because exit_mm() sets tsk->mm = NULL before .release gets called,
we trigger this path even without the kernel thread context.
(I prepared a patch that just avoids the break_cow() from the .release
context, but it isn't needed with this patch)
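
(The ordering in question, roughly as in kernel/exit.c - an illustrative
simplification, details trimmed:

	/* do_exit(), simplified */
	exit_mm(tsk);		/* tsk->mm becomes NULL here */
	/* ... */
	exit_files(tsk);	/* drops file refs; the final fput() runs ->release() */

so by the time a .release callback like ksm_sma_release() runs, current->mm
is already NULL.)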

(You shouldn't have that specific problem in the madvise() version
because we don't have this .release callback anymore, but we still make
useless break_cow() calls there; considering that the process has just
died, these break_cow() calls should only be made when the user asks
specifically to stop merging pages, I guess...
I will send a patch that makes this work more logically on top of your
patches as soon as you send something)

Thanks.

2009-06-16 19:46:28

by Hugh Dickins

Subject: Re: running get_user_pages() from kernel thread

On Tue, 16 Jun 2009, Rik van Riel wrote:
> Hugh Dickins wrote:
> >
> > Looks like Izik and I hit the same problem (otherwise running well):
> > I too decided we needn't do more than avoid the issue in grab_swap_token.
> > (I've a feeling someone has hit this issue before with some other thread,
> > though I've no idea which - does RHEL include a patch like this perhaps?).
>
> It looks very familiar, indeed.
>
> RHEL 4 and 5 both have code like this in the swap token
> code. I seem to remember submitting such code upstream,
> too.
>
> I have no idea why it never got upstream, maybe I was
> drowned in Xen work at the time, or maybe the bug simply
> didn't happen upstream for whatever reason.

I've a suspicion it got triggered by some out-of-tree module,
and nobody saw the same problem with a vanilla kernel.

>
> > Signed-off-by: Hugh Dickins <[email protected]>
>
> Acked-by: Rik van Riel <[email protected]>

Thanks, Rik. I don't want to be a nuisance, but somehow, in the hour
since I sent that, I've come to feel it was the right patch to get our
testing back on track, but not really quite the right patch to go forward.
Silly, really, it does not matter much what we do in this case, but I'm
liking the below better - what do you think? If you approve, I'll send
it in to Andrew for mmotm: I can't quite justify it without KSM.
(checkpatch.pl didn't like "while(0)" so I cleaned up a little.)


[PATCH] mm: pass mm to grab_swap_token

If a kthread happens to use get_user_pages() on an mm (as KSM does),
there's a chance that it will end up trying to read in a swap page,
then oops in grab_swap_token() because the kthread has no mm: GUP
passes down the right mm, so grab_swap_token() ought to be using it.

Signed-off-by: Hugh Dickins <[email protected]>
---

include/linux/swap.h | 12 ++++++------
mm/memory.c | 2 +-
mm/thrash.c | 32 +++++++++++++++-----------------
3 files changed, 22 insertions(+), 24 deletions(-)

--- 2.6.30/include/linux/swap.h 2009-06-10 04:05:27.000000000 +0100
+++ linux/include/linux/swap.h 2009-06-16 20:29:13.000000000 +0100
@@ -314,8 +314,8 @@ extern int try_to_free_swap(struct page
struct backing_dev_info;

/* linux/mm/thrash.c */
-extern struct mm_struct * swap_token_mm;
-extern void grab_swap_token(void);
+extern struct mm_struct *swap_token_mm;
+extern void grab_swap_token(struct mm_struct *);
extern void __put_swap_token(struct mm_struct *);

static inline int has_swap_token(struct mm_struct *mm)
@@ -426,10 +426,10 @@ static inline swp_entry_t get_swap_page(
}

/* linux/mm/thrash.c */
-#define put_swap_token(x) do { } while(0)
-#define grab_swap_token() do { } while(0)
-#define has_swap_token(x) 0
-#define disable_swap_token() do { } while(0)
+#define put_swap_token(mm) do { } while (0)
+#define grab_swap_token(mm) do { } while (0)
+#define has_swap_token(mm) 0
+#define disable_swap_token() do { } while (0)

static inline int mem_cgroup_cache_charge_swapin(struct page *page,
struct mm_struct *mm, gfp_t mask, bool locked)
--- 2.6.30/mm/memory.c 2009-06-10 04:05:27.000000000 +0100
+++ linux/mm/memory.c 2009-06-16 20:29:13.000000000 +0100
@@ -2466,7 +2466,7 @@ static int do_swap_page(struct mm_struct
delayacct_set_flag(DELAYACCT_PF_SWAPIN);
page = lookup_swap_cache(entry);
if (!page) {
- grab_swap_token(); /* Contend for token _before_ read-in */
+ grab_swap_token(mm); /* Contend for token _before_ read-in */
page = swapin_readahead(entry,
GFP_HIGHUSER_MOVABLE, vma, address);
if (!page) {
--- 2.6.30/mm/thrash.c 2007-07-09 00:32:17.000000000 +0100
+++ linux/mm/thrash.c 2009-06-16 20:29:13.000000000 +0100
@@ -26,47 +26,45 @@ static DEFINE_SPINLOCK(swap_token_lock);
struct mm_struct *swap_token_mm;
static unsigned int global_faults;

-void grab_swap_token(void)
+void grab_swap_token(struct mm_struct *mm)
{
int current_interval;

global_faults++;

- current_interval = global_faults - current->mm->faultstamp;
+ current_interval = global_faults - mm->faultstamp;

if (!spin_trylock(&swap_token_lock))
return;

/* First come first served */
if (swap_token_mm == NULL) {
- current->mm->token_priority = current->mm->token_priority + 2;
- swap_token_mm = current->mm;
+ mm->token_priority = mm->token_priority + 2;
+ swap_token_mm = mm;
goto out;
}

- if (current->mm != swap_token_mm) {
- if (current_interval < current->mm->last_interval)
- current->mm->token_priority++;
+ if (mm != swap_token_mm) {
+ if (current_interval < mm->last_interval)
+ mm->token_priority++;
else {
- if (likely(current->mm->token_priority > 0))
- current->mm->token_priority--;
+ if (likely(mm->token_priority > 0))
+ mm->token_priority--;
}
/* Check if we deserve the token */
- if (current->mm->token_priority >
- swap_token_mm->token_priority) {
- current->mm->token_priority += 2;
- swap_token_mm = current->mm;
+ if (mm->token_priority > swap_token_mm->token_priority) {
+ mm->token_priority += 2;
+ swap_token_mm = mm;
}
} else {
/* Token holder came in again! */
- current->mm->token_priority += 2;
+ mm->token_priority += 2;
}

out:
- current->mm->faultstamp = global_faults;
- current->mm->last_interval = current_interval;
+ mm->faultstamp = global_faults;
+ mm->last_interval = current_interval;
spin_unlock(&swap_token_lock);
-return;
}

/* Called on process exit. */

2009-06-16 20:12:16

by Johannes Weiner

Subject: Re: running get_user_pages() from kernel thread

On Tue, Jun 16, 2009 at 07:38:39PM +0100, Hugh Dickins wrote:
> On Tue, 16 Jun 2009, Andrea Arcangeli wrote:
> > On Tue, Jun 16, 2009 at 09:05:27PM +0300, Izik Eidus wrote:
> > > So the question is: is this by design (that a kernel thread can't call
> > > get_user_pages())? Should I use something like switch_mm()?
> >
> > I think switch_mm trick should be used for page faults, but gup
> > shouldn't require it because it gets the 'mm' as parameter and the
> > current->mm has to be irrelevant. current->mm is only relevant for
> > gup-fast (obviously :). So I think the only bit that needs fixing is
> > grab_swap_token to not run if current->mm is null.
>
> Looks like Izik and I hit the same problem (otherwise running well):
> I too decided we needn't do more than avoid the issue in grab_swap_token.
> (I've a feeling someone has hit this issue before with some other thread,
> though I've no idea which - does RHEL include a patch like this perhaps?).
>
> [PATCH] mm: grab_swap_token back out if no mm
>
> If a kthread happens to use get_user_pages() on an mm (as KSM does),
> there's a chance that it will end up trying to read in a swap page,
> and oops in grab_swap_token() because the kthread has no mm:
> nothing clever, just avoid that case.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
>
> mm/thrash.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> --- 2.6.30-mm1/mm/thrash.c 2007-07-09 00:32:17.000000000 +0100
> +++ linux/mm/thrash.c 2009-06-15 19:44:53.000000000 +0100
> @@ -30,6 +30,9 @@ void grab_swap_token(void)
> {
> int current_interval;
>
> + if (!current->mm) /* kthread doing get_user_pages on an mm */
> + return;
> +

Did you have a particular reason not to pass in the faulting mm
instead?

As the swap token should only care about the faulting address space
leading to swap io and not about the running process anyway, we could
do it like below and remove all those pesky current->derefs in the
same go. What do you think?

Hannes

---
Subject: mm: remove task assumptions from swap token

grab_swap_token() should not make any assumptions about the running
process as the swap token is an attribute of the address space and the
faulting mm is not necessarily current->mm.

This fixes get_user_pages() from kernel threads, which would blow up
when encountering a swapped-out page because grab_swap_token()
dereferences current->mm, which is unset for kernel threads.

Signed-off-by: Johannes Weiner <[email protected]>
---

diff --git a/include/linux/swap.h b/include/linux/swap.h
index d476aad..9fe314b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -315,7 +315,7 @@ struct backing_dev_info;

/* linux/mm/thrash.c */
extern struct mm_struct * swap_token_mm;
-extern void grab_swap_token(void);
+extern void grab_swap_token(struct mm_struct *);
extern void __put_swap_token(struct mm_struct *);

static inline int has_swap_token(struct mm_struct *mm)
@@ -427,7 +427,7 @@ static inline swp_entry_t get_swap_page(void)

/* linux/mm/thrash.c */
#define put_swap_token(x) do { } while(0)
-#define grab_swap_token() do { } while(0)
+#define grab_swap_token(x) do { } while(0)
#define has_swap_token(x) 0
#define disable_swap_token() do { } while(0)

diff --git a/mm/memory.c b/mm/memory.c
index 4126dd1..862e120 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2466,7 +2466,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
delayacct_set_flag(DELAYACCT_PF_SWAPIN);
page = lookup_swap_cache(entry);
if (!page) {
- grab_swap_token(); /* Contend for token _before_ read-in */
+ grab_swap_token(mm); /* Contend for token _before_ read-in */
page = swapin_readahead(entry,
GFP_HIGHUSER_MOVABLE, vma, address);
if (!page) {
diff --git a/mm/thrash.c b/mm/thrash.c
index c4c5205..8b864ae 100644
--- a/mm/thrash.c
+++ b/mm/thrash.c
@@ -26,47 +26,46 @@ static DEFINE_SPINLOCK(swap_token_lock);
struct mm_struct *swap_token_mm;
static unsigned int global_faults;

-void grab_swap_token(void)
+void grab_swap_token(struct mm_struct *mm)
{
int current_interval;

global_faults++;

- current_interval = global_faults - current->mm->faultstamp;
+ current_interval = global_faults - mm->faultstamp;

if (!spin_trylock(&swap_token_lock))
return;

/* First come first served */
if (swap_token_mm == NULL) {
- current->mm->token_priority = current->mm->token_priority + 2;
- swap_token_mm = current->mm;
+ mm->token_priority = mm->token_priority + 2;
+ swap_token_mm = mm;
goto out;
}

- if (current->mm != swap_token_mm) {
- if (current_interval < current->mm->last_interval)
- current->mm->token_priority++;
+ if (mm != swap_token_mm) {
+ if (current_interval < mm->last_interval)
+ mm->token_priority++;
else {
- if (likely(current->mm->token_priority > 0))
- current->mm->token_priority--;
+ if (likely(mm->token_priority > 0))
+ mm->token_priority--;
}
/* Check if we deserve the token */
- if (current->mm->token_priority >
+ if (mm->token_priority >
swap_token_mm->token_priority) {
- current->mm->token_priority += 2;
- swap_token_mm = current->mm;
+ mm->token_priority += 2;
+ swap_token_mm = mm;
}
} else {
/* Token holder came in again! */
- current->mm->token_priority += 2;
+ mm->token_priority += 2;
}

out:
- current->mm->faultstamp = global_faults;
- current->mm->last_interval = current_interval;
+ mm->faultstamp = global_faults;
+ mm->last_interval = current_interval;
spin_unlock(&swap_token_lock);
-return;
}

/* Called on process exit. */

2009-06-16 20:38:55

by Andrea Arcangeli

Subject: Re: running get_user_pages() from kernel thread

On Tue, Jun 16, 2009 at 08:45:05PM +0100, Hugh Dickins wrote:
> liking the below better - what do you think? If you approve, I'll send
> it in to Andrew for mmotm: I can't quite justify it without KSM.

I think it's definitely better to fix gup to be safe from kernel
threads even without KSM, so this can go into mainline well before KSM.
Unless gup requires a swapin fault it won't ever trigger.

2009-06-16 20:47:19

by Hugh Dickins

Subject: Re: running get_user_pages() from kernel thread

On Tue, 16 Jun 2009, Izik Eidus wrote:
> Hugh Dickins wrote:
> >
> > Looks like Izik and I hit the same problem (otherwise running well):

Ran well enough overnight (with mm/mmap.c vm_flags hack to be merging
everything it could) that, to judge by your remarks below, I ought to
switch away from investigating and fixing and reviewing, to sending
you patches to bring us back in synch.

I don't think I have any problems with the madvise route, which weren't
already problems with the /dev/ksm route: there are oddities I'm eager
to look into (fork still raising questions for me), but nothing serious
enough to get in the way of resynching.

Andrea will be relieved to learn that I like mm_slot->touched very much:
I don't know (and now don't need to know) what was serving that purpose
in the /dev/ksm version.

> Good. This solves another issue that you probably don't hit because you work
> with the madvise version:
> the .release callback of the file_operations structure will call
> ksm_sma_release(), which calls remove_slot_from_hash_and_tree(), which does
> the silly break_cow() call (even though the process has just died!)

Yes, I'd noticed before that break_cow() can be silly more work than
necessary, may need care in ordering things. But, if it's still any
issue, it's something that can be optimized later: you have a technique
that goes about it safely, that's good.

> Now, because exit_mm() sets tsk->mm = NULL before .release gets called, we
> trigger this path even without the kernel thread context.
> (I prepared a patch that just avoids the break_cow() from the .release
> context, but it isn't needed with this patch)
>
> (You shouldn't have that specific problem in the madvise() version because
> we don't have this .release callback anymore, but we still make useless
> break_cow() calls there; considering that the process has just died, these
> break_cow() calls should only be made when the user asks specifically to stop
> merging pages, I guess...

Yes, and I'm thinking that although it's fine for madvise(,,MADV_MERGEABLE)
to be slow to get around to merging, probably madvise(,,MADV_UNMERGEABLE)
needs to have broken COW on any KSM pages before the call returns. I've
a suspicion that nobody will ever use MADV_UNMERGEABLE, outside of our
testing; and yet it's against my principles not to provide it, and if it
is used, then I think it needs to give that guarantee before returning.
But again, something we can fill in once we're in synch.
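
To put the intended interface in one place - a usage sketch only, since the
MADV_MERGEABLE/MADV_UNMERGEABLE flags are not in mainline at this point:

	/* opt a range in: ksm may merge identical pages lazily, later */
	madvise(addr, length, MADV_MERGEABLE);

	/* opt it out again: should have broken COW on any KSM pages in
	 * the range before this call returns */
	madvise(addr, length, MADV_UNMERGEABLE);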

> I will send a patch that makes this work more logically on top of your
> patches as soon as you send something)

Right, what would you like me to base against? What if you were to
send me a rollup patch against 2.6.30 of what your tree has now? Or
would you prefer to base against an mmotm? With or without your RFC
patches, or something close to them?

Once I have your base, whichever way you prefer it, then I can put
together a series of patches to adjust that to what I'm now working
with (mainly: the ksm.c end of it would be much as in your RFC though
with tidyups, and minus 4/4; whereas the madvise.c end of it I reworked).

The patches we send to Andrew for mmotm, later on, would be something
different, I believe (and something different from your git history):
there I think we'd be asking him to remove the KSM patches he already
has, and providing fresh equivalents based around the madvise interface
(so, for example, I think there would be no patch at all to mm/rmap.c,
all the changes made there earlier being later reverted).

Hugh

2009-06-16 20:58:38

by Hugh Dickins

Subject: Re: running get_user_pages() from kernel thread

On Tue, 16 Jun 2009, Johannes Weiner wrote:
>
> Did you have a particular reason not to pass in the faulting mm
> instead?

No good reason.

>
> As the swap token should only care about the faulting address space
> leading to swap io and not about the running process anyway, we could
> do it like below and remove all those pesky current->derefs in the
> same go. What do you think?

I think we'll have to claim joint authorship.
But I win by one checkpatch point, and a few minutes ;)

Hugh

2009-06-16 21:03:55

by Hugh Dickins

Subject: Re: running get_user_pages() from kernel thread

On Tue, 16 Jun 2009, Andrea Arcangeli wrote:
> On Tue, Jun 16, 2009 at 08:45:05PM +0100, Hugh Dickins wrote:
> > liking the below better - what do you think? If you approve, I'll send
> > it in to Andrew for mmotm: I can't quite justify it without KSM.
>
> I think it's definitely better to fix gup to be safe from kernel
> threads even without KSM, so this can go into mainline well before KSM.
> Unless gup requires a swapin fault it won't ever trigger.

Yes, you're right, it should go into 2.6.31, no reason not.

Hugh

2009-06-16 21:16:29

by Rik van Riel

Subject: Re: running get_user_pages() from kernel thread

Johannes Weiner wrote:
> On Tue, Jun 16, 2009 at 07:38:39PM +0100, Hugh Dickins wrote:

>> --- 2.6.30-mm1/mm/thrash.c 2007-07-09 00:32:17.000000000 +0100
>> +++ linux/mm/thrash.c 2009-06-15 19:44:53.000000000 +0100
>> @@ -30,6 +30,9 @@ void grab_swap_token(void)
>> {
>> int current_interval;
>>
>> + if (!current->mm) /* kthread doing get_user_pages on an mm */
>> + return;
>> +
>
> Did you have a particular reason not to pass in the faulting mm
> instead?

If the task itself is not swapping, should we really give
it the swap token?

I admit, this could be a pretty weak reason :)

--
All rights reversed.

2009-06-16 21:19:16

by Rik van Riel

Subject: Re: running get_user_pages() from kernel thread

Hugh Dickins wrote:

> [PATCH] mm: pass mm to grab_swap_token
>
> If a kthread happens to use get_user_pages() on an mm (as KSM does),
> there's a chance that it will end up trying to read in a swap page,
> then oops in grab_swap_token() because the kthread has no mm: GUP
> passes down the right mm, so grab_swap_token() ought to be using it.
>
> Signed-off-by: Hugh Dickins <[email protected]>

Provided nobody has a problem with giving the swap token
to a task that isn't actually swapping itself ...

Acked-by: Rik van Riel <[email protected]>

(I don't have any strong opinions either way)

--
All rights reversed.

2009-06-16 21:26:35

by Izik Eidus

Subject: Re: running get_user_pages() from kernel thread

Hugh Dickins wrote:
> On Tue, 16 Jun 2009, Izik Eidus wrote:
>
>> Hugh Dickins wrote:
>>
>>> Looks like Izik and I hit the same problem (otherwise running well):
>>>
>
> Ran well enough overnight (with mm/mmap.c vm_flags hack to be merging
> everything it could) that, to judge by your remarks below, I ought to
> switch away from investigating and fixing and reviewing, to sending
> you patches to bring us back in synch.
>

Take your time!
The only things I want to change are these:
this silly break_cow() usage,
and moving to mm_count instead of mm_users (with some more safety
checks) - so when the process dies we won't have to wait for ksm to
notice before the physical memory is released.
(I have a list of things I want to change in ksm, but really not for the
merging version)
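
(For the distinction being referred to - a rough sketch of the two reference
counts, not a ksm patch:

	atomic_inc(&mm->mm_users);	/* pins the whole address space (vmas,
					 * page tables) until the matching mmput() */
	atomic_inc(&mm->mm_count);	/* pins only the mm_struct itself until
					 * mmdrop(); the pages can go at exit */

so holding mm_count would let an exiting process free its pages immediately,
with ksm merely noticing afterwards.)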

> I don't think I have any problems with the madvise route, which weren't
> already problems with the /dev/ksm route: there are oddities I'm eager
> to look into (fork still raising questions for me), but nothing serious
> enough to get in the way of resynching.
>
> Andrea will be relieved to learn that I like mm_slot->touched very much:
> I don't know (and now don't need to know) what was serving that purpose
> in the /dev/ksm version.
>
>
>> Good. This solves another issue that you probably don't hit because you work
>> with the madvise version:
>> the .release callback of the file_operations structure will call
>> ksm_sma_release(), which calls remove_slot_from_hash_and_tree(), which does
>> the silly break_cow() call (even though the process has just died!)
>>
>
> Yes, I'd noticed before that break_cow() can be silly more work than
> necessary, may need care in ordering things. But, if it's still any
> issue, it's something that can be optimized later: you have a technique
> that goes about it safely, that's good.
>
>
>> Now, because exit_mm() sets tsk->mm = NULL before .release gets called, we
>> trigger this path even without the kernel thread context.
>> (I prepared a patch that just avoids the break_cow() from the .release
>> context, but it isn't needed with this patch)
>>
>> (You shouldn't have that specific problem in the madvise() version because
>> we don't have this .release callback anymore, but we still make useless
>> break_cow() calls there; considering that the process has just died, these
>> break_cow() calls should only be made when the user asks specifically to stop
>> merging pages, I guess...
>>
>
> Yes, and I'm thinking that although it's fine for madvise(,,MADV_MERGEABLE)
> to be slow to get around to merging, probably madvise(,,MADV_UNMERGEABLE)
> needs to have broken COW on any KSM pages before the call returns. I've
> a suspicion that nobody will ever use MADV_UNMERGEABLE, outside of our
> testing; and yet it's against my principles not to provide it, and if it
> is used, then I think it needs to give that guarantee before returning.
> But again, something we can fill in once we're in synch.
>
>
>> I will send a patch that makes this work more logically on top of your
>> patches as soon as you send something)
>>
>
> Right, what would you like me to base against? What if you were to
> send me a rollup patch against 2.6.30 of what your tree has now? Or
> would you prefer to base against an mmotm? With or without your RFC
> patches, or something close to them?
>

Whatever is easier for you.

> Once I have your base, whichever way you prefer it, then I can put
> together a series of patches to adjust that to what I'm now working
> with (mainly: the ksm.c end of it would be much as in your RFC though
> with tidyups, and minus 4/4; whereas the madvise.c end of it I reworked).
>
> The patches we send to Andrew for mmotm, later on, would be something
> different, I believe (and something different from your git history):
> there I think we'd be asking him to remove the KSM patches he already
> has, and providing fresh equivalents based around the madvise interface
> (so, for example, I think there would be no patch at all to mm/rmap.c,
> all the changes made there earlier being later reverted).
>

Yes, I agree it will be easier to drop KSM and resend to him once we
have everything ready.

> Hugh
>
Thanks Hugh!

2009-06-16 21:54:19

by Johannes Weiner

Subject: [patch 1/2] mm: make swap token dummies static inlines

Make use of the compiler's typechecking on !CONFIG_SWAP as well.

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/swap.h | 20 ++++++++++++++++----
1 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index d476aad..3c6e856 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -426,10 +426,22 @@ static inline swp_entry_t get_swap_page(void)
}

/* linux/mm/thrash.c */
-#define put_swap_token(x) do { } while(0)
-#define grab_swap_token() do { } while(0)
-#define has_swap_token(x) 0
-#define disable_swap_token() do { } while(0)
+static inline void put_swap_token(struct mm_struct *mm)
+{
+}
+
+static inline void grab_swap_token(void)
+{
+}
+
+static inline int has_swap_token(struct mm_struct *mm)
+{
+ return 0;
+}
+
+static inline void disable_swap_token(struct mm_struct *mm)
+{
+}

static inline int mem_cgroup_cache_charge_swapin(struct page *page,
struct mm_struct *mm, gfp_t mask, bool locked)
--
1.6.3
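
The practical difference, as an illustrative fragment (not part of the patch):
with the old no-op macros a wrong-typed argument builds cleanly when
CONFIG_SWAP is off, whereas the static inlines keep the typechecking:

	/* with "#define has_swap_token(x) 0", x is never looked at, so this
	 * slips through on !CONFIG_SWAP builds; with the static inline, gcc
	 * still warns about the incompatible pointer on both configurations */
	static int token_check_typo(struct vm_area_struct *vma)
	{
		return has_swap_token(vma);	/* should have been vma->vm_mm */
	}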

2009-06-16 21:54:33

by Johannes Weiner

Subject: [patch 2/2] mm: remove task assumptions from swap token

From: Hugh Dickins <[email protected]>

grab_swap_token() should not make any assumptions about the running
process as the swap token is an attribute of the address space and the
faulting mm is not necessarily current->mm.

This fixes get_user_pages() from kernel threads, which would blow up
when encountering a swapped-out page because grab_swap_token()
dereferences current->mm, which is unset for kernel threads.

Signed-off-by: Hugh Dickins <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/swap.h | 4 ++--
mm/memory.c | 2 +-
mm/thrash.c | 31 +++++++++++++++----------------
3 files changed, 18 insertions(+), 19 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 3c6e856..e73ea8a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -315,7 +315,7 @@ struct backing_dev_info;

/* linux/mm/thrash.c */
extern struct mm_struct * swap_token_mm;
-extern void grab_swap_token(void);
+extern void grab_swap_token(struct mm_struct *);
extern void __put_swap_token(struct mm_struct *);

static inline int has_swap_token(struct mm_struct *mm)
@@ -430,7 +430,7 @@ static inline void put_swap_token(struct mm_struct *mm)
{
}

-static inline void grab_swap_token(void)
+static inline void grab_swap_token(struct mm_struct *mm)
{
}

diff --git a/mm/memory.c b/mm/memory.c
index 4126dd1..862e120 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2466,7 +2466,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
delayacct_set_flag(DELAYACCT_PF_SWAPIN);
page = lookup_swap_cache(entry);
if (!page) {
- grab_swap_token(); /* Contend for token _before_ read-in */
+ grab_swap_token(mm); /* Contend for token _before_ read-in */
page = swapin_readahead(entry,
GFP_HIGHUSER_MOVABLE, vma, address);
if (!page) {
diff --git a/mm/thrash.c b/mm/thrash.c
index c4c5205..8b864ae 100644
--- a/mm/thrash.c
+++ b/mm/thrash.c
@@ -26,47 +26,46 @@ static DEFINE_SPINLOCK(swap_token_lock);
struct mm_struct *swap_token_mm;
static unsigned int global_faults;

-void grab_swap_token(void)
+void grab_swap_token(struct mm_struct *mm)
{
int current_interval;

global_faults++;

- current_interval = global_faults - current->mm->faultstamp;
+ current_interval = global_faults - mm->faultstamp;

if (!spin_trylock(&swap_token_lock))
return;

/* First come first served */
if (swap_token_mm == NULL) {
- current->mm->token_priority = current->mm->token_priority + 2;
- swap_token_mm = current->mm;
+ mm->token_priority = mm->token_priority + 2;
+ swap_token_mm = mm;
goto out;
}

- if (current->mm != swap_token_mm) {
- if (current_interval < current->mm->last_interval)
- current->mm->token_priority++;
+ if (mm != swap_token_mm) {
+ if (current_interval < mm->last_interval)
+ mm->token_priority++;
else {
- if (likely(current->mm->token_priority > 0))
- current->mm->token_priority--;
+ if (likely(mm->token_priority > 0))
+ mm->token_priority--;
}
/* Check if we deserve the token */
- if (current->mm->token_priority >
+ if (mm->token_priority >
swap_token_mm->token_priority) {
- current->mm->token_priority += 2;
- swap_token_mm = current->mm;
+ mm->token_priority += 2;
+ swap_token_mm = mm;
}
} else {
/* Token holder came in again! */
- current->mm->token_priority += 2;
+ mm->token_priority += 2;
}

out:
- current->mm->faultstamp = global_faults;
- current->mm->last_interval = current_interval;
+ mm->faultstamp = global_faults;
+ mm->last_interval = current_interval;
spin_unlock(&swap_token_lock);
-return;
}

/* Called on process exit. */
--
1.6.3

2009-06-16 21:55:23

by Johannes Weiner

Subject: Re: running get_user_pages() from kernel thread

[added Andrew on Cc]

On Tue, Jun 16, 2009 at 09:57:30PM +0100, Hugh Dickins wrote:
> On Tue, 16 Jun 2009, Johannes Weiner wrote:
> >
> > Did you have a particular reason not to pass in the faulting mm
> > instead?
>
> No good reason.

Okay.

> > As the swap token should only care about the faulting address space
> > leading to swap io and not about the running process anyway, we could
> > do it like below and remove all those pesky current->derefs in the
> > same go. What do you think?
>
> I think we'll have to claim joint authorship.

That's for sure, I just didn't want to plant it on you :)

> But I win by one checkpatch point, and a few minutes ;)

Checkpatch compatibility restored, now back to the time machine!

2009-06-16 21:56:30

by Rik van Riel

Subject: Re: [patch 1/2] mm: make swap token dummies static inlines

Johannes Weiner wrote:
> Make use of the compiler's typechecking on !CONFIG_SWAP as well.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

--
All rights reversed.

2009-06-16 21:57:09

by Rik van Riel

Subject: Re: [patch 2/2] mm: remove task assumptions from swap token

Johannes Weiner wrote:
> From: Hugh Dickins <[email protected]>
>
> grab_swap_token() should not make any assumptions about the running
> process as the swap token is an attribute of the address space and the
> faulting mm is not necessarily current->mm.
>
> This fixes get_user_pages() from kernel threads, which would blow up
> when encountering a swapped-out page because grab_swap_token()
> dereferences current->mm, which is unset for kernel threads.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

--
All rights reversed.

2009-06-17 02:00:58

by Minchan Kim

Subject: Re: [patch 2/2] mm: remove task assumptions from swap token

Hi, Hannes.

How about adding Hugh's comment?

I think that is more straightforward and easier to follow,
and it even explains a real example, like KSM.

So I suggest the following:

==
grab_swap_token() should not make any assumptions about the running
process as the swap token is an attribute of the address space and the
faulting mm is not necessarily current->mm.

If a kthread happens to use get_user_pages() on an mm (as KSM does),
there's a chance that it will end up trying to read in a swap page,
then oops in grab_swap_token() because the kthread has no mm: GUP
passes down the right mm, so grab_swap_token() ought to be using it.
==

Anyway, it looks good to me.
It might be just a nitpick :)
If you disagree, feel free to ignore me.
Either way I am OK.

On Tue, 16 Jun 2009 23:50:37 +0200
Johannes Weiner <[email protected]> wrote:

> From: Hugh Dickins <[email protected]>
>
> grab_swap_token() should not make any assumptions about the running
> process as the swap token is an attribute of the address space and the
> faulting mm is not necessarily current->mm.
>
> This fixes get_user_pages() from kernel threads, which would blow up
> when encountering a swapped-out page because grab_swap_token()
> dereferences current->mm, which is unset for kernel threads.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Minchan Kim <[email protected]>

--
Kinds Regards
Minchan Kim

2009-06-17 08:35:17

by Johannes Weiner

Subject: Re: [patch 2/2] mm: remove task assumptions from swap token

On Wed, Jun 17, 2009 at 11:00:34AM +0900, Minchan Kim wrote:
> Hi, Hannes.
>
> How about adding Hugh's comment ?

Sorry, I misinterpreted Hugh's reply that went to my inbox directly,
and only noticed much later that he had already done the exact same
thing in the lkml thread.

We should probably just use his version.

2009-06-21 13:02:57

by Hugh Dickins

Subject: [PATCH] mm: pass mm to grab_swap_token

If a kthread happens to use get_user_pages() on an mm (as KSM does),
there's a chance that it will end up trying to read in a swap page,
then oops in grab_swap_token() because the kthread has no mm: GUP
passes down the right mm, so grab_swap_token() ought to be using it.

We have not identified a stronger case than KSM's daemon (not yet
in mainline), but the issue must have come up before, since RHEL has
included a fix for this for years (though a different fix, they just
back out of grab_swap_token if current->mm is unset: which is what
we first proposed, but using the right mm here seems more correct).

Reported-by: Izik Eidus <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
Includes a small swap.h tidyup inspired by checkpatch.pl.
Due to crossed mails and crossed wires, there's a competing
patch-pair from Hannes:

http://lkml.org/lkml/2009/6/16/623
http://lkml.org/lkml/2009/6/16/624

We're all agreed that either this or those should go into 2.6.31:
personally, I don't rate arg typechecking of null functions when
!CONFIG_SWAP very highly, and prefer the one-line macros of Rik's
original; but you may well disagree, and prefer to take Hannes's.

include/linux/swap.h | 12 ++++++------
mm/memory.c | 2 +-
mm/thrash.c | 32 +++++++++++++++-----------------
3 files changed, 22 insertions(+), 24 deletions(-)

--- 2.6.30-git17/include/linux/swap.h 2009-06-21 07:48:18.000000000 +0100
+++ linux/include/linux/swap.h 2009-06-21 09:35:51.000000000 +0100
@@ -298,8 +298,8 @@ extern int try_to_free_swap(struct page
struct backing_dev_info;

/* linux/mm/thrash.c */
-extern struct mm_struct * swap_token_mm;
-extern void grab_swap_token(void);
+extern struct mm_struct *swap_token_mm;
+extern void grab_swap_token(struct mm_struct *);
extern void __put_swap_token(struct mm_struct *);

static inline int has_swap_token(struct mm_struct *mm)
@@ -419,10 +419,10 @@ static inline swp_entry_t get_swap_page(
}

/* linux/mm/thrash.c */
-#define put_swap_token(x) do { } while(0)
-#define grab_swap_token() do { } while(0)
-#define has_swap_token(x) 0
-#define disable_swap_token() do { } while(0)
+#define put_swap_token(mm) do { } while (0)
+#define grab_swap_token(mm) do { } while (0)
+#define has_swap_token(mm) 0
+#define disable_swap_token() do { } while (0)

static inline void
mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
--- 2.6.30-git17/mm/memory.c 2009-06-21 07:48:19.000000000 +0100
+++ linux/mm/memory.c 2009-06-21 09:35:51.000000000 +0100
@@ -2516,7 +2516,7 @@ static int do_swap_page(struct mm_struct
delayacct_set_flag(DELAYACCT_PF_SWAPIN);
page = lookup_swap_cache(entry);
if (!page) {
- grab_swap_token(); /* Contend for token _before_ read-in */
+ grab_swap_token(mm); /* Contend for token _before_ read-in */
page = swapin_readahead(entry,
GFP_HIGHUSER_MOVABLE, vma, address);
if (!page) {
--- 2.6.30-git17/mm/thrash.c 2007-07-09 00:32:17.000000000 +0100
+++ linux/mm/thrash.c 2009-06-21 09:35:51.000000000 +0100
@@ -26,47 +26,45 @@ static DEFINE_SPINLOCK(swap_token_lock);
struct mm_struct *swap_token_mm;
static unsigned int global_faults;

-void grab_swap_token(void)
+void grab_swap_token(struct mm_struct *mm)
{
int current_interval;

global_faults++;

- current_interval = global_faults - current->mm->faultstamp;
+ current_interval = global_faults - mm->faultstamp;

if (!spin_trylock(&swap_token_lock))
return;

/* First come first served */
if (swap_token_mm == NULL) {
- current->mm->token_priority = current->mm->token_priority + 2;
- swap_token_mm = current->mm;
+ mm->token_priority = mm->token_priority + 2;
+ swap_token_mm = mm;
goto out;
}

- if (current->mm != swap_token_mm) {
- if (current_interval < current->mm->last_interval)
- current->mm->token_priority++;
+ if (mm != swap_token_mm) {
+ if (current_interval < mm->last_interval)
+ mm->token_priority++;
else {
- if (likely(current->mm->token_priority > 0))
- current->mm->token_priority--;
+ if (likely(mm->token_priority > 0))
+ mm->token_priority--;
}
/* Check if we deserve the token */
- if (current->mm->token_priority >
- swap_token_mm->token_priority) {
- current->mm->token_priority += 2;
- swap_token_mm = current->mm;
+ if (mm->token_priority > swap_token_mm->token_priority) {
+ mm->token_priority += 2;
+ swap_token_mm = mm;
}
} else {
/* Token holder came in again! */
- current->mm->token_priority += 2;
+ mm->token_priority += 2;
}

out:
- current->mm->faultstamp = global_faults;
- current->mm->last_interval = current_interval;
+ mm->faultstamp = global_faults;
+ mm->last_interval = current_interval;
spin_unlock(&swap_token_lock);
-return;
}

/* Called on process exit. */

2009-06-22 00:40:11

by Minchan Kim

Subject: Re: [PATCH] mm: pass mm to grab_swap_token

On Sun, 21 Jun 2009 14:02:43 +0100 (BST)
Hugh Dickins <[email protected]> wrote:

> If a kthread happens to use get_user_pages() on an mm (as KSM does),
> there's a chance that it will end up trying to read in a swap page,
> then oops in grab_swap_token() because the kthread has no mm: GUP
> passes down the right mm, so grab_swap_token() ought to be using it.
>
> We have not identified a stronger case than KSM's daemon (not yet
> in mainline), but the issue must have come up before, since RHEL has
> included a fix for this for years (though a different fix, they just
> back out of grab_swap_token if current->mm is unset: which is what
> we first proposed, but using the right mm here seems more correct).
>
> Reported-by: Izik Eidus <[email protected]>
> Signed-off-by: Johannes Weiner <[email protected]>
> Signed-off-by: Hugh Dickins <[email protected]>
> Acked-by: Rik van Riel <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

I already added my Reviewed-by to Hannes's patch, but it is folded into this one.

--
Kinds Regards
Minchan Kim