2009-09-07 11:06:49

by Peter Zijlstra

Subject: [rfc] lru_add_drain_all() vs isolation

On Mon, 2009-09-07 at 10:17 +0200, Mike Galbraith wrote:

> [ 774.651779] SysRq : Show Blocked State
> [ 774.655770] task PC stack pid father
> [ 774.655770] evolution.bin D ffff8800bc1575f0 0 7349 6459 0x00000000
> [ 774.676008] ffff8800bc3c9d68 0000000000000086 ffff8800015d9340 ffff8800bb91b780
> [ 774.676008] 000000000000dd28 ffff8800bc3c9fd8 0000000000013340 0000000000013340
> [ 774.676008] 00000000000000fd ffff8800015d9340 ffff8800bc1575f0 ffff8800bc157888
> [ 774.676008] Call Trace:
> [ 774.676008] [<ffffffff812c4a11>] schedule_timeout+0x2d/0x20c
> [ 774.676008] [<ffffffff812c4891>] wait_for_common+0xde/0x155
> [ 774.676008] [<ffffffff8103f1cd>] ? default_wake_function+0x0/0x14
> [ 774.676008] [<ffffffff810c0e63>] ? lru_add_drain_per_cpu+0x0/0x10
> [ 774.676008] [<ffffffff810c0e63>] ? lru_add_drain_per_cpu+0x0/0x10
> [ 774.676008] [<ffffffff812c49ab>] wait_for_completion+0x1d/0x1f
> [ 774.676008] [<ffffffff8105fdf5>] flush_work+0x7f/0x93
> [ 774.676008] [<ffffffff8105f870>] ? wq_barrier_func+0x0/0x14
> [ 774.676008] [<ffffffff81060109>] schedule_on_each_cpu+0xb4/0xed
> [ 774.676008] [<ffffffff810c0c78>] lru_add_drain_all+0x15/0x17
> [ 774.676008] [<ffffffff810d1dbd>] sys_mlock+0x2e/0xde
> [ 774.676008] [<ffffffff8100bc1b>] system_call_fastpath+0x16/0x1b

FWIW, something like the below (prone to explode since it's utterly
untested) should (mostly) fix that one case. Something similar needs to
be done for pretty much all machine-wide workqueue thingies, possibly
also flush_workqueue().

---
 include/linux/workqueue.h |    1 +
 kernel/workqueue.c        |   52 +++++++++++++++++++++++++++++++++++---------
 mm/swap.c                 |   14 ++++++++---
 3 files changed, 52 insertions(+), 15 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 6273fa9..95b1df2 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -213,6 +213,7 @@ extern int schedule_work_on(int cpu, struct work_struct *work);
 extern int schedule_delayed_work(struct delayed_work *work, unsigned long delay);
 extern int schedule_delayed_work_on(int cpu, struct delayed_work *work,
 					unsigned long delay);
+extern int schedule_on_mask(const struct cpumask *mask, work_func_t func);
 extern int schedule_on_each_cpu(work_func_t func);
 extern int current_is_keventd(void);
 extern int keventd_up(void);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 3c44b56..81456fc 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -657,6 +657,23 @@ int schedule_delayed_work_on(int cpu,
}
EXPORT_SYMBOL(schedule_delayed_work_on);

+struct sched_work_struct {
+	struct work_struct work;
+	work_func_t func;
+	atomic_t *count;
+	struct completion *completion;
+};
+
+static void do_sched_work(struct work_struct *work)
+{
+	struct sched_work_struct *sws = container_of(work, struct sched_work_struct, work);
+
+	sws->func(NULL);
+
+	if (atomic_dec_and_test(sws->count))
+		complete(sws->completion);
+}
+
/**
* schedule_on_each_cpu - call a function on each online CPU from keventd
* @func: the function to call
@@ -666,29 +683,42 @@ EXPORT_SYMBOL(schedule_delayed_work_on);
*
* schedule_on_each_cpu() is very slow.
*/
-int schedule_on_each_cpu(work_func_t func)
+int schedule_on_mask(const struct cpumask *mask, work_func_t func)
 {
+	struct completion completion = COMPLETION_INITIALIZER_ONSTACK(completion);
+	atomic_t count = ATOMIC_INIT(cpumask_weight(mask));
+	struct sched_work_struct *works;
 	int cpu;
-	struct work_struct *works;
 
-	works = alloc_percpu(struct work_struct);
+	works = alloc_percpu(struct sched_work_struct);
 	if (!works)
 		return -ENOMEM;
 
-	get_online_cpus();
-	for_each_online_cpu(cpu) {
-		struct work_struct *work = per_cpu_ptr(works, cpu);
+	for_each_cpu(cpu, mask) {
+		struct sched_work_struct *work = per_cpu_ptr(works, cpu);
+		work->count = &count;
+		work->completion = &completion;
+		work->func = func;
 
-		INIT_WORK(work, func);
-		schedule_work_on(cpu, work);
+		INIT_WORK(&work->work, do_sched_work);
+		schedule_work_on(cpu, &work->work);
 	}
-	for_each_online_cpu(cpu)
-		flush_work(per_cpu_ptr(works, cpu));
-	put_online_cpus();
+	wait_for_completion(&completion);
 	free_percpu(works);
 	return 0;
 }

+int schedule_on_each_cpu(work_func_t func)
+{
+	int ret;
+
+	get_online_cpus();
+	ret = schedule_on_mask(cpu_online_mask, func);
+	put_online_cpus();
+
+	return ret;
+}
+
 void flush_scheduled_work(void)
 {
 	flush_workqueue(keventd_wq);
diff --git a/mm/swap.c b/mm/swap.c
index cb29ae5..11e4b1e 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -36,6 +36,7 @@
/* How many pages do we try to swap or page in/out together? */
int page_cluster;

+static cpumask_t lru_drain_mask;
static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);

@@ -216,12 +217,15 @@ EXPORT_SYMBOL(mark_page_accessed);

 void __lru_cache_add(struct page *page, enum lru_list lru)
 {
-	struct pagevec *pvec = &get_cpu_var(lru_add_pvecs)[lru];
+	int cpu = get_cpu();
+	struct pagevec *pvec = &per_cpu(lru_add_pvecs, cpu)[lru];
+
+	cpumask_set_cpu(cpu, &lru_drain_mask);
 
 	page_cache_get(page);
 	if (!pagevec_add(pvec, page))
 		____pagevec_lru_add(pvec, lru);
-	put_cpu_var(lru_add_pvecs);
+	put_cpu();
 }

/**
@@ -294,7 +298,9 @@ static void drain_cpu_pagevecs(int cpu)

 void lru_add_drain(void)
 {
-	drain_cpu_pagevecs(get_cpu());
+	int cpu = get_cpu();
+	cpumask_clear_cpu(cpu, &lru_drain_mask);
+	drain_cpu_pagevecs(cpu);
 	put_cpu();
 }

@@ -308,7 +314,7 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
*/
 int lru_add_drain_all(void)
 {
-	return schedule_on_each_cpu(lru_add_drain_per_cpu);
+	return schedule_on_mask(&lru_drain_mask, lru_add_drain_per_cpu);
 }

/*


2009-09-07 13:39:55

by Oleg Nesterov

Subject: Re: [rfc] lru_add_drain_all() vs isolation

On 09/07, Peter Zijlstra wrote:
>
> On Mon, 2009-09-07 at 10:17 +0200, Mike Galbraith wrote:
>
> > [ 774.651779] SysRq : Show Blocked State
> > [ 774.655770] task PC stack pid father
> > [ 774.655770] evolution.bin D ffff8800bc1575f0 0 7349 6459 0x00000000
> > [ 774.676008] ffff8800bc3c9d68 0000000000000086 ffff8800015d9340 ffff8800bb91b780
> > [ 774.676008] 000000000000dd28 ffff8800bc3c9fd8 0000000000013340 0000000000013340
> > [ 774.676008] 00000000000000fd ffff8800015d9340 ffff8800bc1575f0 ffff8800bc157888
> > [ 774.676008] Call Trace:
> > [ 774.676008] [<ffffffff812c4a11>] schedule_timeout+0x2d/0x20c
> > [ 774.676008] [<ffffffff812c4891>] wait_for_common+0xde/0x155
> > [ 774.676008] [<ffffffff8103f1cd>] ? default_wake_function+0x0/0x14
> > [ 774.676008] [<ffffffff810c0e63>] ? lru_add_drain_per_cpu+0x0/0x10
> > [ 774.676008] [<ffffffff810c0e63>] ? lru_add_drain_per_cpu+0x0/0x10
> > [ 774.676008] [<ffffffff812c49ab>] wait_for_completion+0x1d/0x1f
> > [ 774.676008] [<ffffffff8105fdf5>] flush_work+0x7f/0x93
> > [ 774.676008] [<ffffffff8105f870>] ? wq_barrier_func+0x0/0x14
> > [ 774.676008] [<ffffffff81060109>] schedule_on_each_cpu+0xb4/0xed
> > [ 774.676008] [<ffffffff810c0c78>] lru_add_drain_all+0x15/0x17
> > [ 774.676008] [<ffffffff810d1dbd>] sys_mlock+0x2e/0xde
> > [ 774.676008] [<ffffffff8100bc1b>] system_call_fastpath+0x16/0x1b
>
> FWIW, something like the below (prone to explode since its utterly
> untested) should (mostly) fix that one case. Something similar needs to
> be done for pretty much all machine wide workqueue thingies, possibly
> also flush_workqueue().

Failed to google the previous discussion. Could you please point me to it?
What is the problem?

> +struct sched_work_struct {
> + struct work_struct work;
> + work_func_t func;
> + atomic_t *count;
> + struct completion *completion;
> +};

(not that it matters, but perhaps sched_work_struct should have a single
pointer to the struct which contains func, count, completion).
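
Something like this minimal sketch of that aside (the sched_work_info name
and its layout are illustrative, not from the patch):

/*
 * Illustrative only: one shared control block per invocation instead of
 * three pointers in every per-cpu work item.
 */
struct sched_work_info {
	work_func_t		func;
	atomic_t		count;
	struct completion	done;
};

struct sched_work_struct {
	struct work_struct	work;
	struct sched_work_info	*info;
};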

> -int schedule_on_each_cpu(work_func_t func)
> +int schedule_on_mask(const struct cpumask *mask, work_func_t func)

Looks like a useful helper. But,

> + for_each_cpu(cpu, mask) {
> + struct sched_work_struct *work = per_cpu_ptr(works, cpu);
> + work->count = &count;
> + work->completion = &completion;
> + work->func = func;
>
> - INIT_WORK(work, func);
> - schedule_work_on(cpu, work);
> + INIT_WORK(&work->work, do_sched_work);
> + schedule_work_on(cpu, &work->work);

This means the caller must ensure the CPU is online and can't go away.
Otherwise we can hang forever.

schedule_on_each_cpu() is fine, it calls us under get_online_cpus().
But,

> int lru_add_drain_all(void)
> {
> - return schedule_on_each_cpu(lru_add_drain_per_cpu);
> + return schedule_on_mask(lru_drain_mask, lru_add_drain_per_cpu);
> }

This doesn't look safe.

Looks like, schedule_on_mask() should take get_online_cpus(), do
cpus_and(mask, mask, online_cpus), then schedule works.

If we don't care whether the work migrates to another CPU, schedule_on_mask()
can do put_online_cpus() before wait_for_completion().
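
A minimal sketch of that suggestion (untested; __schedule_on_mask() is a
hypothetical helper holding the queue-and-wait body of Peter's function above):

int schedule_on_mask(const struct cpumask *mask, work_func_t func)
{
	cpumask_var_t online;
	int ret;

	if (!alloc_cpumask_var(&online, GFP_KERNEL))
		return -ENOMEM;

	get_online_cpus();
	/* only queue work on cpus that are both requested and online */
	cpumask_and(online, mask, cpu_online_mask);
	/* __schedule_on_mask(): hypothetical helper that queues and waits */
	ret = __schedule_on_mask(online, func);
	put_online_cpus();

	free_cpumask_var(online);
	return ret;
}

If the work is allowed to migrate, put_online_cpus() could instead be dropped
inside the helper, before its wait_for_completion().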

Oleg.

2009-09-07 13:53:38

by Peter Zijlstra

Subject: Re: [rfc] lru_add_drain_all() vs isolation

On Mon, 2009-09-07 at 15:35 +0200, Oleg Nesterov wrote:
> On 09/07, Peter Zijlstra wrote:
> >
> > On Mon, 2009-09-07 at 10:17 +0200, Mike Galbraith wrote:
> >
> > > [ 774.651779] SysRq : Show Blocked State
> > > [ 774.655770] task PC stack pid father
> > > [ 774.655770] evolution.bin D ffff8800bc1575f0 0 7349 6459 0x00000000
> > > [ 774.676008] ffff8800bc3c9d68 0000000000000086 ffff8800015d9340 ffff8800bb91b780
> > > [ 774.676008] 000000000000dd28 ffff8800bc3c9fd8 0000000000013340 0000000000013340
> > > [ 774.676008] 00000000000000fd ffff8800015d9340 ffff8800bc1575f0 ffff8800bc157888
> > > [ 774.676008] Call Trace:
> > > [ 774.676008] [<ffffffff812c4a11>] schedule_timeout+0x2d/0x20c
> > > [ 774.676008] [<ffffffff812c4891>] wait_for_common+0xde/0x155
> > > [ 774.676008] [<ffffffff8103f1cd>] ? default_wake_function+0x0/0x14
> > > [ 774.676008] [<ffffffff810c0e63>] ? lru_add_drain_per_cpu+0x0/0x10
> > > [ 774.676008] [<ffffffff810c0e63>] ? lru_add_drain_per_cpu+0x0/0x10
> > > [ 774.676008] [<ffffffff812c49ab>] wait_for_completion+0x1d/0x1f
> > > [ 774.676008] [<ffffffff8105fdf5>] flush_work+0x7f/0x93
> > > [ 774.676008] [<ffffffff8105f870>] ? wq_barrier_func+0x0/0x14
> > > [ 774.676008] [<ffffffff81060109>] schedule_on_each_cpu+0xb4/0xed
> > > [ 774.676008] [<ffffffff810c0c78>] lru_add_drain_all+0x15/0x17
> > > [ 774.676008] [<ffffffff810d1dbd>] sys_mlock+0x2e/0xde
> > > [ 774.676008] [<ffffffff8100bc1b>] system_call_fastpath+0x16/0x1b
> >
> > FWIW, something like the below (prone to explode since its utterly
> > untested) should (mostly) fix that one case. Something similar needs to
> > be done for pretty much all machine wide workqueue thingies, possibly
> > also flush_workqueue().
>
> Failed to google the previous discussion. Could you please point me?
> What is the problem?

Ah, the general problem is that when we carve up the machine into
partitions using cpusets, we still get machine-wide tickles on all cpus
from workqueue stuff like schedule_on_each_cpu() and flush_workqueue(),
even if some cpus don't actually use their workqueue.

So the below limits lru_add_drain() activity to cpus that actually have
pages in their per-cpu lists.

flush_workqueue() could limit itself to cpus that had work queued since
the last flush_workqueue() invocation, etc.

This avoids un-needed disruption of these cpus.

Christoph wants this because he's running cpu-bound userspace and simply
doesn't care to donate a few cycles to the kernel maintenance when not
needed (every tiny bit helps in completing the HPC job sooner).

Mike ran into this because he's starving a partitioned cpu using an RT
task -- which currently starves the other cpus because the workqueues
don't get to run and everybody waits...

The lru_add_drain_all() thing is just one of the many cases, and the
below won't fully solve Mike's problem since the cpu could still have
pending work on the per-cpu list from starting the RT task... but it
shows the direction for how to improve things.

> > +struct sched_work_struct {
> > + struct work_struct work;
> > + work_func_t func;
> > + atomic_t *count;
> > + struct completion *completion;
> > +};
>
> (not that it matters, but perhaps sched_work_struct should have a single
> pointer to the struct which contains func,count,comletion).

Sure, it more-or-less grew while writing, I always forget completions
don't count.

> > -int schedule_on_each_cpu(work_func_t func)
> > +int schedule_on_mask(const struct cpumask *mask, work_func_t func)
>
> Looks like a usefule helper. But,
>
> > + for_each_cpu(cpu, mask) {
> > + struct sched_work_struct *work = per_cpu_ptr(works, cpu);
> > + work->count = &count;
> > + work->completion = &completion;
> > + work->func = func;
> >
> > - INIT_WORK(work, func);
> > - schedule_work_on(cpu, work);
> > + INIT_WORK(&work->work, do_sched_work);
> > + schedule_work_on(cpu, &work->work);
>
> This means the caller must ensure CPU online and can't go away. Otherwise
> we can hang forever.
>
> schedule_on_each_cpu() is fine, it calls us under get_online_cpus().
> But,
>
> > int lru_add_drain_all(void)
> > {
> > - return schedule_on_each_cpu(lru_add_drain_per_cpu);
> > + return schedule_on_mask(lru_drain_mask, lru_add_drain_per_cpu);
> > }
>
> This doesn't look safe.
>
> Looks like, schedule_on_mask() should take get_online_cpus(), do
> cpus_and(mask, mask, online_cpus), then schedule works.
>
> If we don't care the work can migrate to another CPU, schedule_on_mask()
> can do put_online_cpus() before wait_for_completion().

Ah, right. Like I said, I only quickly hacked this up as an example of how
to improve isolation between cpus and limit unneeded work, in the hope
someone would pick this up and maybe tackle other sites as well.

2009-09-07 14:22:27

by Oleg Nesterov

Subject: Re: [rfc] lru_add_drain_all() vs isolation

On 09/07, Peter Zijlstra wrote:
>
> On Mon, 2009-09-07 at 15:35 +0200, Oleg Nesterov wrote:
> >
> > Failed to google the previous discussion. Could you please point me?
> > What is the problem?
>
> Ah, the general problem is that when we carve up the machine into
> partitions using cpusets, we still get machine wide tickles on all cpus
> from workqueue stuff like schedule_on_each_cpu() and flush_workqueue(),
> even if some cpus don't actually used their workqueue.
>
> So the below limits lru_add_drain() activity to cpus that actually have
> pages in their per-cpu lists.

Thanks Peter!

> flush_workqueue() could limit itself to cpus that had work queued since
> the last flush_workqueue() invocation, etc.

But "work queued since the last flush_workqueue() invocation" just means
"has work queued". Please note that flush_cpu_workqueue() does nothing
if there are no works, except it does lock/unlock of cwq->lock.

IIRC, flush_cpu_workqueue() has to lock/unlock to avoid the races with
CPU hotplug, but _perhaps_ flush_workqueue() can do the check lockless.

Afaics, we can add the workqueue_struct->cpu_map_has_works to help
flush_workqueue(), but this means we should complicate insert_work()
and run_workqueue() which should set/clear the bit. But given that
flush_workqueue() should be avoided anyway, I am not sure.
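
For illustration, the flush side of that idea might look roughly like the
sketch below (cpu_map_has_works is the hypothetical per-workqueue mask; the
insert_work()/run_workqueue() bookkeeping that maintains it is omitted):

void flush_workqueue(struct workqueue_struct *wq)
{
	int cpu;

	might_sleep();
	/* cpu_map_has_works: hypothetical mask of cpus that had work queued */
	for_each_cpu(cpu, wq->cpu_map_has_works)
		flush_cpu_workqueue(per_cpu_ptr(wq->cpu_wq, cpu));
}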

Oleg.

2009-09-07 14:26:09

by Peter Zijlstra

Subject: Re: [rfc] lru_add_drain_all() vs isolation

On Mon, 2009-09-07 at 16:18 +0200, Oleg Nesterov wrote:

> > flush_workqueue() could limit itself to cpus that had work queued since
> > the last flush_workqueue() invocation, etc.
>
> But "work queued since the last flush_workqueue() invocation" just means
> "has work queued". Please note that flush_cpu_workqueue() does nothing
> if there are no works, except it does lock/unlock of cwq->lock.
>
> IIRC, flush_cpu_workqueue() has to lock/unlock to avoid the races with
> CPU hotplug, but _perhaps_ flush_workqueue() can do the check lockless.
>
> Afaics, we can add the workqueue_struct->cpu_map_has_works to help
> flush_workqueue(), but this means we should complicate insert_work()
> and run_workqueue() which should set/clear the bit. But given that
> flush_workqueue() should be avoided anyway, I am not sure.

Ah, indeed. Then nothing new would be needed here, since it will indeed
not interrupt processing on the remote cpus that never queued any work.


2009-09-07 23:56:55

by KOSAKI Motohiro

Subject: Re: [rfc] lru_add_drain_all() vs isolation

Hi Peter,

> On Mon, 2009-09-07 at 10:17 +0200, Mike Galbraith wrote:
>
> > [ 774.651779] SysRq : Show Blocked State
> > [ 774.655770] task PC stack pid father
> > [ 774.655770] evolution.bin D ffff8800bc1575f0 0 7349 6459 0x00000000
> > [ 774.676008] ffff8800bc3c9d68 0000000000000086 ffff8800015d9340 ffff8800bb91b780
> > [ 774.676008] 000000000000dd28 ffff8800bc3c9fd8 0000000000013340 0000000000013340
> > [ 774.676008] 00000000000000fd ffff8800015d9340 ffff8800bc1575f0 ffff8800bc157888
> > [ 774.676008] Call Trace:
> > [ 774.676008] [<ffffffff812c4a11>] schedule_timeout+0x2d/0x20c
> > [ 774.676008] [<ffffffff812c4891>] wait_for_common+0xde/0x155
> > [ 774.676008] [<ffffffff8103f1cd>] ? default_wake_function+0x0/0x14
> > [ 774.676008] [<ffffffff810c0e63>] ? lru_add_drain_per_cpu+0x0/0x10
> > [ 774.676008] [<ffffffff810c0e63>] ? lru_add_drain_per_cpu+0x0/0x10
> > [ 774.676008] [<ffffffff812c49ab>] wait_for_completion+0x1d/0x1f
> > [ 774.676008] [<ffffffff8105fdf5>] flush_work+0x7f/0x93
> > [ 774.676008] [<ffffffff8105f870>] ? wq_barrier_func+0x0/0x14
> > [ 774.676008] [<ffffffff81060109>] schedule_on_each_cpu+0xb4/0xed
> > [ 774.676008] [<ffffffff810c0c78>] lru_add_drain_all+0x15/0x17
> > [ 774.676008] [<ffffffff810d1dbd>] sys_mlock+0x2e/0xde
> > [ 774.676008] [<ffffffff8100bc1b>] system_call_fastpath+0x16/0x1b
>
> FWIW, something like the below (prone to explode since its utterly
> untested) should (mostly) fix that one case. Something similar needs to
> be done for pretty much all machine wide workqueue thingies, possibly
> also flush_workqueue().

Can you please explain how to reproduce it and the problem in detail?

AFAIK, mlock() calls lru_add_drain_all() _before_ grabbing the semaphore, so
it doesn't cause any deadlock.
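
(For reference, the relevant part of sys_mlock() looks roughly like this;
paraphrased sketch, not verbatim, with the rlimit and alignment handling
trimmed:)

SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
{
	int error;

	if (!can_do_mlock())
		return -EPERM;

	/* this is where the trace above blocks, before mmap_sem is taken */
	lru_add_drain_all();

	down_write(&current->mm->mmap_sem);
	/* ... rlimit checks ... */
	error = do_mlock(start, len, 1);
	up_write(&current->mm->mmap_sem);
	return error;
}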



2009-09-08 08:20:14

by Peter Zijlstra

Subject: Re: [rfc] lru_add_drain_all() vs isolation

On Tue, 2009-09-08 at 08:56 +0900, KOSAKI Motohiro wrote:
> Hi Peter,
>
> > On Mon, 2009-09-07 at 10:17 +0200, Mike Galbraith wrote:
> >
> > > [ 774.651779] SysRq : Show Blocked State
> > > [ 774.655770] task PC stack pid father
> > > [ 774.655770] evolution.bin D ffff8800bc1575f0 0 7349 6459 0x00000000
> > > [ 774.676008] ffff8800bc3c9d68 0000000000000086 ffff8800015d9340 ffff8800bb91b780
> > > [ 774.676008] 000000000000dd28 ffff8800bc3c9fd8 0000000000013340 0000000000013340
> > > [ 774.676008] 00000000000000fd ffff8800015d9340 ffff8800bc1575f0 ffff8800bc157888
> > > [ 774.676008] Call Trace:
> > > [ 774.676008] [<ffffffff812c4a11>] schedule_timeout+0x2d/0x20c
> > > [ 774.676008] [<ffffffff812c4891>] wait_for_common+0xde/0x155
> > > [ 774.676008] [<ffffffff8103f1cd>] ? default_wake_function+0x0/0x14
> > > [ 774.676008] [<ffffffff810c0e63>] ? lru_add_drain_per_cpu+0x0/0x10
> > > [ 774.676008] [<ffffffff810c0e63>] ? lru_add_drain_per_cpu+0x0/0x10
> > > [ 774.676008] [<ffffffff812c49ab>] wait_for_completion+0x1d/0x1f
> > > [ 774.676008] [<ffffffff8105fdf5>] flush_work+0x7f/0x93
> > > [ 774.676008] [<ffffffff8105f870>] ? wq_barrier_func+0x0/0x14
> > > [ 774.676008] [<ffffffff81060109>] schedule_on_each_cpu+0xb4/0xed
> > > [ 774.676008] [<ffffffff810c0c78>] lru_add_drain_all+0x15/0x17
> > > [ 774.676008] [<ffffffff810d1dbd>] sys_mlock+0x2e/0xde
> > > [ 774.676008] [<ffffffff8100bc1b>] system_call_fastpath+0x16/0x1b
> >
> > FWIW, something like the below (prone to explode since its utterly
> > untested) should (mostly) fix that one case. Something similar needs to
> > be done for pretty much all machine wide workqueue thingies, possibly
> > also flush_workqueue().
>
> Can you please explain reproduce way and problem detail?
>
> AFAIK, mlock() call lru_add_drain_all() _before_ grab semaphoe. Then,
> it doesn't cause any deadlock.

Suppose you have 2 cpus, cpu1 is busy doing a SCHED_FIFO-99 while(1),
cpu0 does mlock()->lru_add_drain_all(), which does
schedule_on_each_cpu(), which then waits for all cpus to complete the
work. Except that cpu1, which is busy with the RT task, will never run
keventd until the RT load goes away.

This is not so much an actual deadlock as a serious starvation case.
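
For concreteness, a minimal userspace sketch of that scenario (hypothetical
reproducer, not Mike's actual test; assumes a 2-cpu box, root privileges for
SCHED_FIFO, and cpus numbered 0 and 1):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

static void pin_to(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	sched_setaffinity(0, sizeof(set), &set);
}

int main(void)
{
	size_t sz = 1 << 20;
	char *buf;

	if (fork() == 0) {
		/* cpu1: SCHED_FIFO-99 busy loop, keventd/1 never runs */
		struct sched_param sp = { .sched_priority = 99 };

		pin_to(1);
		sched_setscheduler(0, SCHED_FIFO, &sp);
		for (;;)
			;
	}

	/* cpu0: mlock() blocks in lru_add_drain_all() waiting for cpu1 */
	pin_to(0);
	sleep(1);
	buf = malloc(sz);
	memset(buf, 0, sz);
	printf("calling mlock()...\n");
	mlock(buf, sz);
	printf("mlock() returned\n");	/* only prints once the RT hog stops */
	return 0;
}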

2009-09-08 10:06:21

by KOSAKI Motohiro

Subject: Re: [rfc] lru_add_drain_all() vs isolation

> On Tue, 2009-09-08 at 08:56 +0900, KOSAKI Motohiro wrote:
> > Hi Peter,
> >
> > > On Mon, 2009-09-07 at 10:17 +0200, Mike Galbraith wrote:
> > >
> > > > [ 774.651779] SysRq : Show Blocked State
> > > > [ 774.655770] task PC stack pid father
> > > > [ 774.655770] evolution.bin D ffff8800bc1575f0 0 7349 6459 0x00000000
> > > > [ 774.676008] ffff8800bc3c9d68 0000000000000086 ffff8800015d9340 ffff8800bb91b780
> > > > [ 774.676008] 000000000000dd28 ffff8800bc3c9fd8 0000000000013340 0000000000013340
> > > > [ 774.676008] 00000000000000fd ffff8800015d9340 ffff8800bc1575f0 ffff8800bc157888
> > > > [ 774.676008] Call Trace:
> > > > [ 774.676008] [<ffffffff812c4a11>] schedule_timeout+0x2d/0x20c
> > > > [ 774.676008] [<ffffffff812c4891>] wait_for_common+0xde/0x155
> > > > [ 774.676008] [<ffffffff8103f1cd>] ? default_wake_function+0x0/0x14
> > > > [ 774.676008] [<ffffffff810c0e63>] ? lru_add_drain_per_cpu+0x0/0x10
> > > > [ 774.676008] [<ffffffff810c0e63>] ? lru_add_drain_per_cpu+0x0/0x10
> > > > [ 774.676008] [<ffffffff812c49ab>] wait_for_completion+0x1d/0x1f
> > > > [ 774.676008] [<ffffffff8105fdf5>] flush_work+0x7f/0x93
> > > > [ 774.676008] [<ffffffff8105f870>] ? wq_barrier_func+0x0/0x14
> > > > [ 774.676008] [<ffffffff81060109>] schedule_on_each_cpu+0xb4/0xed
> > > > [ 774.676008] [<ffffffff810c0c78>] lru_add_drain_all+0x15/0x17
> > > > [ 774.676008] [<ffffffff810d1dbd>] sys_mlock+0x2e/0xde
> > > > [ 774.676008] [<ffffffff8100bc1b>] system_call_fastpath+0x16/0x1b
> > >
> > > FWIW, something like the below (prone to explode since its utterly
> > > untested) should (mostly) fix that one case. Something similar needs to
> > > be done for pretty much all machine wide workqueue thingies, possibly
> > > also flush_workqueue().
> >
> > Can you please explain reproduce way and problem detail?
> >
> > AFAIK, mlock() call lru_add_drain_all() _before_ grab semaphoe. Then,
> > it doesn't cause any deadlock.
>
> Suppose you have 2 cpus, cpu1 is busy doing a SCHED_FIFO-99 while(1),
> cpu0 does mlock()->lru_add_drain_all(), which does
> schedule_on_each_cpu(), which then waits for all cpus to complete the
> work. Except that cpu1, which is busy with the RT task, will never run
> keventd until the RT load goes away.
>
> This is not so much an actual deadlock as a serious starvation case.

This seems to be a flush_work vs RT-thread problem, not only lru_add_drain_all().
Why don't other workqueue flushers hit this issue?



2009-09-08 10:20:21

by Peter Zijlstra

Subject: Re: [rfc] lru_add_drain_all() vs isolation

On Tue, 2009-09-08 at 19:06 +0900, KOSAKI Motohiro wrote:
> > On Tue, 2009-09-08 at 08:56 +0900, KOSAKI Motohiro wrote:
> > > Hi Peter,
> > >
> > > > On Mon, 2009-09-07 at 10:17 +0200, Mike Galbraith wrote:
> > > >
> > > > > [ 774.651779] SysRq : Show Blocked State
> > > > > [ 774.655770] task PC stack pid father
> > > > > [ 774.655770] evolution.bin D ffff8800bc1575f0 0 7349 6459 0x00000000
> > > > > [ 774.676008] ffff8800bc3c9d68 0000000000000086 ffff8800015d9340 ffff8800bb91b780
> > > > > [ 774.676008] 000000000000dd28 ffff8800bc3c9fd8 0000000000013340 0000000000013340
> > > > > [ 774.676008] 00000000000000fd ffff8800015d9340 ffff8800bc1575f0 ffff8800bc157888
> > > > > [ 774.676008] Call Trace:
> > > > > [ 774.676008] [<ffffffff812c4a11>] schedule_timeout+0x2d/0x20c
> > > > > [ 774.676008] [<ffffffff812c4891>] wait_for_common+0xde/0x155
> > > > > [ 774.676008] [<ffffffff8103f1cd>] ? default_wake_function+0x0/0x14
> > > > > [ 774.676008] [<ffffffff810c0e63>] ? lru_add_drain_per_cpu+0x0/0x10
> > > > > [ 774.676008] [<ffffffff810c0e63>] ? lru_add_drain_per_cpu+0x0/0x10
> > > > > [ 774.676008] [<ffffffff812c49ab>] wait_for_completion+0x1d/0x1f
> > > > > [ 774.676008] [<ffffffff8105fdf5>] flush_work+0x7f/0x93
> > > > > [ 774.676008] [<ffffffff8105f870>] ? wq_barrier_func+0x0/0x14
> > > > > [ 774.676008] [<ffffffff81060109>] schedule_on_each_cpu+0xb4/0xed
> > > > > [ 774.676008] [<ffffffff810c0c78>] lru_add_drain_all+0x15/0x17
> > > > > [ 774.676008] [<ffffffff810d1dbd>] sys_mlock+0x2e/0xde
> > > > > [ 774.676008] [<ffffffff8100bc1b>] system_call_fastpath+0x16/0x1b
> > > >
> > > > FWIW, something like the below (prone to explode since its utterly
> > > > untested) should (mostly) fix that one case. Something similar needs to
> > > > be done for pretty much all machine wide workqueue thingies, possibly
> > > > also flush_workqueue().
> > >
> > > Can you please explain reproduce way and problem detail?
> > >
> > > AFAIK, mlock() call lru_add_drain_all() _before_ grab semaphoe. Then,
> > > it doesn't cause any deadlock.
> >
> > Suppose you have 2 cpus, cpu1 is busy doing a SCHED_FIFO-99 while(1),
> > cpu0 does mlock()->lru_add_drain_all(), which does
> > schedule_on_each_cpu(), which then waits for all cpus to complete the
> > work. Except that cpu1, which is busy with the RT task, will never run
> > keventd until the RT load goes away.
> >
> > This is not so much an actual deadlock as a serious starvation case.
>
> This seems flush_work vs RT-thread problem, not only lru_add_drain_all().
> Why other workqueue flusher doesn't affect this issue?

flush_work() will only flush workqueues on which work has been enqueued
as Oleg pointed out.

The problem is with lru_add_drain_all() enqueueing work on all
workqueues.

There is nothing that makes lru_add_drain_all() the only such site, it's
the one Mike posted to me, and my patch was a way to deal with that.

I also explained that it's not only RT related in that the HPC folks also
want to avoid unneeded work -- for them it's not starvation but a
performance issue.

In general we should avoid doing work when there is no work to be done.

2009-09-08 11:41:39

by KOSAKI Motohiro

Subject: Re: [rfc] lru_add_drain_all() vs isolation

> On Tue, 2009-09-08 at 19:06 +0900, KOSAKI Motohiro wrote:
> > > On Tue, 2009-09-08 at 08:56 +0900, KOSAKI Motohiro wrote:
> > > > Hi Peter,
> > > >
> > > > > On Mon, 2009-09-07 at 10:17 +0200, Mike Galbraith wrote:
> > > > >
> > > > > > [ 774.651779] SysRq : Show Blocked State
> > > > > > [ 774.655770] task PC stack pid father
> > > > > > [ 774.655770] evolution.bin D ffff8800bc1575f0 0 7349 6459 0x00000000
> > > > > > [ 774.676008] ffff8800bc3c9d68 0000000000000086 ffff8800015d9340 ffff8800bb91b780
> > > > > > [ 774.676008] 000000000000dd28 ffff8800bc3c9fd8 0000000000013340 0000000000013340
> > > > > > [ 774.676008] 00000000000000fd ffff8800015d9340 ffff8800bc1575f0 ffff8800bc157888
> > > > > > [ 774.676008] Call Trace:
> > > > > > [ 774.676008] [<ffffffff812c4a11>] schedule_timeout+0x2d/0x20c
> > > > > > [ 774.676008] [<ffffffff812c4891>] wait_for_common+0xde/0x155
> > > > > > [ 774.676008] [<ffffffff8103f1cd>] ? default_wake_function+0x0/0x14
> > > > > > [ 774.676008] [<ffffffff810c0e63>] ? lru_add_drain_per_cpu+0x0/0x10
> > > > > > [ 774.676008] [<ffffffff810c0e63>] ? lru_add_drain_per_cpu+0x0/0x10
> > > > > > [ 774.676008] [<ffffffff812c49ab>] wait_for_completion+0x1d/0x1f
> > > > > > [ 774.676008] [<ffffffff8105fdf5>] flush_work+0x7f/0x93
> > > > > > [ 774.676008] [<ffffffff8105f870>] ? wq_barrier_func+0x0/0x14
> > > > > > [ 774.676008] [<ffffffff81060109>] schedule_on_each_cpu+0xb4/0xed
> > > > > > [ 774.676008] [<ffffffff810c0c78>] lru_add_drain_all+0x15/0x17
> > > > > > [ 774.676008] [<ffffffff810d1dbd>] sys_mlock+0x2e/0xde
> > > > > > [ 774.676008] [<ffffffff8100bc1b>] system_call_fastpath+0x16/0x1b
> > > > >
> > > > > FWIW, something like the below (prone to explode since its utterly
> > > > > untested) should (mostly) fix that one case. Something similar needs to
> > > > > be done for pretty much all machine wide workqueue thingies, possibly
> > > > > also flush_workqueue().
> > > >
> > > > Can you please explain reproduce way and problem detail?
> > > >
> > > > AFAIK, mlock() call lru_add_drain_all() _before_ grab semaphoe. Then,
> > > > it doesn't cause any deadlock.
> > >
> > > Suppose you have 2 cpus, cpu1 is busy doing a SCHED_FIFO-99 while(1),
> > > cpu0 does mlock()->lru_add_drain_all(), which does
> > > schedule_on_each_cpu(), which then waits for all cpus to complete the
> > > work. Except that cpu1, which is busy with the RT task, will never run
> > > keventd until the RT load goes away.
> > >
> > > This is not so much an actual deadlock as a serious starvation case.
> >
> > This seems flush_work vs RT-thread problem, not only lru_add_drain_all().
> > Why other workqueue flusher doesn't affect this issue?
>
> flush_work() will only flush workqueues on which work has been enqueued
> as Oleg pointed out.
>
> The problem is with lru_add_drain_all() enqueueing work on all
> workqueues.

Thank you for the kind explanation. I'm gradually coming to understand this issue.
Yes, lru_add_drain_all() uses schedule_on_each_cpu(), which has the following code:

	for_each_online_cpu(cpu)
		flush_work(per_cpu_ptr(works, cpu));

However, I don't think your approach solves this issue.
lru_add_drain_all() flushes lru_add_pvecs and lru_rotate_pvecs.

lru_add_pvecs is accounted when
- lru move
  e.g. read(2), write(2), page fault, vmscan, page migration, et al

lru_rotate_pvecs is accounted when
- page writeback

IOW, if the RT thread calls the write(2) syscall or takes a page fault, we
face the same problem. I don't think we can assume the RT thread never takes
a page fault....

Hmm, this seems to be a difficult problem. I guess any mm code should use
schedule_on_each_cpu(). I'll continue to think about this issue for a while.


> There is nothing that makes lru_add_drain_all() the only such site, its
> the one Mike posted to me, and my patch was a way to deal with that.

Well, schedule_on_each_cpu() is a function with very limited use.
Practically we can ignore the other callers.


> I also explained that its not only RT related in that the HPC folks also
> want to avoid unneeded work -- for them its not starvation but a
> performance issue.

I think you are talking about the OS jitter issue. If so, I don't think this
issue makes for a serious problem. OS jitter is mainly caused by periodic
actions (e.g. tick update, timer, vmstat update), because
	little-delay x plenty-times = large-delay

lru_add_drain_all() is called from very limited points, e.g. mlock, shm-lock,
page-migration, memory-hotplug. None of the callers is periodic.


> In generic we should avoid doing work when there is no work to be done.

Probably, but I'm not sure ;)


2009-09-08 12:05:31

by Peter Zijlstra

Subject: Re: [rfc] lru_add_drain_all() vs isolation

On Tue, 2009-09-08 at 20:41 +0900, KOSAKI Motohiro wrote:

> Thank you for kindly explanation. I gradually become to understand this isssue.
> Yes, lru_add_drain_all() use schedule_on_each_cpu() and it have following code
>
> for_each_online_cpu(cpu)
> flush_work(per_cpu_ptr(works, cpu));
>
> However, I don't think your approach solve this issue.
> lru_add_drain_all() flush lru_add_pvecs and lru_rotate_pvecs.
>
> lru_add_pvecs is accounted when
> - lru move
> e.g. read(2), write(2), page fault, vmscan, page migration, et al
>
> lru_rotate_pves is accounted when
> - page writeback
>
> IOW, if RT-thread call write(2) syscall or page fault, we face the same
> problem. I don't think we can assume RT-thread don't make page fault....
>
> hmm, this seems difficult problem. I guess any mm code should use
> schedule_on_each_cpu(). I continue to think this issue awhile.

This is about avoiding work when there is none; clearly, when an
application does use the kernel it creates work.

But a clearly userspace, cpu-bound process, while(1), should not get
interrupted by things like lru_add_drain() when it doesn't have any
pages to drain.

> > There is nothing that makes lru_add_drain_all() the only such site, its
> > the one Mike posted to me, and my patch was a way to deal with that.
>
> Well, schedule_on_each_cpu() is very limited used function.
> Practically we can ignore other caller.

No, we need to inspect all callers, having only a few makes that easier.

> > I also explained that its not only RT related in that the HPC folks also
> > want to avoid unneeded work -- for them its not starvation but a
> > performance issue.
>
> I think you talked about OS jitter issue. if so, I don't think this issue
> make serious problem. OS jitter mainly be caused by periodic action
> (e.g. tick update, timer, vmstat update). it's because
> little-delay x plenty-times = large-delay
>
> lru_add_drain_all() is called from very limited point. e.g. mlock, shm-lock,
> page-migration, memory-hotplug. all caller is not periodic.

Doesn't matter; if you want to reduce it, you need to address all of
them. A process 4 nodes away calling mlock(), while this partition has
been user-bound for the last hour or so and doesn't have any lru pages,
simply needn't wake it.

2009-09-08 14:04:45

by Christoph Lameter

Subject: Re: [rfc] lru_add_drain_all() vs isolation

On Tue, 8 Sep 2009, Peter Zijlstra wrote:

> This is about avoiding work when there is non, clearly when an
> application does use the kernel it creates work.

Hmmm. The lru draining in page migration is to reduce the number of pages
that are not on the lru to increase the chance of page migration to be
successful. A page on a per cpu list cannot be drained.

Reducing the number of cpus where we perform the drain results in
increased likelyhood that we cannot migrate a page because its on the per
cpu lists of a cpu not covered.

On the other hand if the cpu is offline then we know that it has no per
cpu pages. That is why I found the idea of the OFFLINE
scheduler attractive.

2009-09-08 14:20:11

by Peter Zijlstra

Subject: Re: [rfc] lru_add_drain_all() vs isolation

On Tue, 2009-09-08 at 10:03 -0400, Christoph Lameter wrote:
> On Tue, 8 Sep 2009, Peter Zijlstra wrote:
>
> > This is about avoiding work when there is non, clearly when an
> > application does use the kernel it creates work.
>
> Hmmm. The lru draining in page migration is to reduce the number of pages
> that are not on the lru to increase the chance of page migration to be
> successful. A page on a per cpu list cannot be drained.
>
> Reducing the number of cpus where we perform the drain results in
> increased likelyhood that we cannot migrate a page because its on the per
> cpu lists of a cpu not covered.

Did you even read the patch?

There is _no_ functional difference between before and after, except
fewer wakeups on cpus that don't have any __lru_cache_add activity.

If there are pages on a cpu's lru_add_pvecs list, that cpu will be present
in the mask and will be sent a drain request. If not, it won't be.


2009-09-08 15:23:46

by Christoph Lameter

Subject: Re: [rfc] lru_add_drain_all() vs isolation

On Tue, 8 Sep 2009, Peter Zijlstra wrote:

> There is _no_ functional difference between before and after, except
> less wakeups on cpus that don't have any __lru_cache_add activity.
>
> If there's pages on the per cpu lru_add_pvecs list it will be present in
> the mask and will be send a drain request. If its not, then it won't be
> send.

Ok I see.

A global cpu mask like this will cause cacheline bouncing. After all this
is a hot cpu path. Maybe do not set the bit if it's already set
(which may be very frequent)? Then add some benchmarks to show that it
does not cause a regression on a 16p box (Nehalem) or so?
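
For example, the __lru_cache_add() hunk above could avoid dirtying the shared
cacheline with something like this (sketch only):

	int cpu = get_cpu();
	struct pagevec *pvec = &per_cpu(lru_add_pvecs, cpu)[lru];

	/* only touch the shared mask when the bit actually changes */
	if (!cpumask_test_cpu(cpu, &lru_drain_mask))
		cpumask_set_cpu(cpu, &lru_drain_mask);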





2009-09-08 15:27:27

by Peter Zijlstra

Subject: Re: [rfc] lru_add_drain_all() vs isolation

On Tue, 2009-09-08 at 11:22 -0400, Christoph Lameter wrote:
> On Tue, 8 Sep 2009, Peter Zijlstra wrote:
>
> > There is _no_ functional difference between before and after, except
> > less wakeups on cpus that don't have any __lru_cache_add activity.
> >
> > If there's pages on the per cpu lru_add_pvecs list it will be present in
> > the mask and will be send a drain request. If its not, then it won't be
> > send.
>
> Ok I see.
>
> A global cpu mask like this will cause cacheline bouncing. After all this
> is a hot cpu path. Maybe do not set the bit if its already set
> (which may be very frequent)? Then add some benchmarks to show that it
> does not cause a regression on a 16p box (Nehalem) or so?

Yeah, testing the bit before poking at it sounds like a good plan.

Unless someone feels inclined to finish this and audit the kernel for
more such places, I'll stick it on the ever growing todo pile.


2009-09-08 15:32:57

by Christoph Lameter

Subject: Re: [rfc] lru_add_drain_all() vs isolation

The usefulness of a scheme like this requires:

1. There are cpus that continually execute user space code
without system interaction.

2. There are repeated VM activities that require page isolation /
migration.

The first page isolation activity will then clear the lru caches of the
processes doing number crunching in user space (and therefore the first
isolation will still interrupt). The second and following isolation will
then no longer interrupt the processes.

2. is rare. So the question is if the additional code in the LRU handling
can be justified. If lru handling is not time sensitive then yes.



2009-09-09 02:06:04

by KOSAKI Motohiro

Subject: Re: [rfc] lru_add_drain_all() vs isolation

Hi

> > Thank you for kindly explanation. I gradually become to understand this isssue.
> > Yes, lru_add_drain_all() use schedule_on_each_cpu() and it have following code
> >
> > for_each_online_cpu(cpu)
> > flush_work(per_cpu_ptr(works, cpu));
> >
> > However, I don't think your approach solve this issue.
> > lru_add_drain_all() flush lru_add_pvecs and lru_rotate_pvecs.
> >
> > lru_add_pvecs is accounted when
> > - lru move
> > e.g. read(2), write(2), page fault, vmscan, page migration, et al
> >
> > lru_rotate_pves is accounted when
> > - page writeback
> >
> > IOW, if RT-thread call write(2) syscall or page fault, we face the same
> > problem. I don't think we can assume RT-thread don't make page fault....
> >
> > hmm, this seems difficult problem. I guess any mm code should use
> > schedule_on_each_cpu(). I continue to think this issue awhile.
>
> This is about avoiding work when there is non, clearly when an
> application does use the kernel it creates work.
>
> But a clearly userspace, cpu-bound process, while(1), should not get
> interrupted by things like lru_add_drain() when it doesn't have any
> pages to drain.

Yup, makes sense.
So I think you mean you'd like to tackle this special case as a first step, right?
If yes, I agree.


> > > There is nothing that makes lru_add_drain_all() the only such site, its
> > > the one Mike posted to me, and my patch was a way to deal with that.
> >
> > Well, schedule_on_each_cpu() is very limited used function.
> > Practically we can ignore other caller.
>
> No, we need to inspect all callers, having only a few makes that easier.

Sorry for my poor English. I meant that I don't oppose your patch approach.
I don't oppose additional work at all.


>
> > > I also explained that its not only RT related in that the HPC folks also
> > > want to avoid unneeded work -- for them its not starvation but a
> > > performance issue.
> >
> > I think you talked about OS jitter issue. if so, I don't think this issue
> > make serious problem. OS jitter mainly be caused by periodic action
> > (e.g. tick update, timer, vmstat update). it's because
> > little-delay x plenty-times = large-delay
> >
> > lru_add_drain_all() is called from very limited point. e.g. mlock, shm-lock,
> > page-migration, memory-hotplug. all caller is not periodic.
>
> Doesn't matter, if you want to reduce it, you need to address all of
> them, a process 4 nodes away calling mlock() while this partition has
> been user-bound for the last hour or so and doesn't have any lru pages
> simply needn't be woken.

Doesn't matter? You mean we can stop discussing the HPC performance issue
that Christoph pointed out?
Hmm, sorry, I haven't caught your point.


2009-09-09 04:27:19

by KOSAKI Motohiro

Subject: Re: [rfc] lru_add_drain_all() vs isolation

> The usefulness of a scheme like this requires:
>
> 1. There are cpus that continually execute user space code
> without system interaction.
>
> 2. There are repeated VM activities that require page isolation /
> migration.
>
> The first page isolation activity will then clear the lru caches of the
> processes doing number crunching in user space (and therefore the first
> isolation will still interrupt). The second and following isolation will
> then no longer interrupt the processes.
>
> 2. is rare. So the question is if the additional code in the LRU handling
> can be justified. If lru handling is not time sensitive then yes.

Christoph, I'd like to discuss a somewhat related (and almost unrelated) thing.
I think page migration doesn't need lru_add_drain_all() to be synchronous,
because page migration retries up to 10 times.

An asynchronous lru_add_drain_all() would then mean:

- if the system isn't under heavy pressure, the retry succeeds.
- if the system is under heavy pressure or an RT thread is busy looping, the retry fails.

I don't think this is problematic behavior. Also, mlock can use an asynchronous lru drain.

What do you think?

2009-09-09 14:09:32

by Christoph Lameter

Subject: Re: [rfc] lru_add_drain_all() vs isolation

On Wed, 9 Sep 2009, KOSAKI Motohiro wrote:

> Christoph, I'd like to discuss a bit related (and almost unrelated) thing.
> I think page migration don't need lru_add_drain_all() as synchronous, because
> page migration have 10 times retry.

True, this is only an optimization that increases the chance of isolation
being successful. You don't need draining at all.

> Then asynchronous lru_add_drain_all() cause
>
> - if system isn't under heavy pressure, retry succussfull.
> - if system is under heavy pressure or RT-thread work busy busy loop, retry failure.
>
> I don't think this is problematic bahavior. Also, mlock can use asynchrounous lru drain.
>
> What do you think?

The retries can be very fast if the migrate pages list is small. The
migrate attempts may be finished before the IPI can be processed by the
other cpus.

2009-09-09 15:39:00

by Minchan Kim

Subject: Re: [rfc] lru_add_drain_all() vs isolation

On Wed, Sep 9, 2009 at 1:27 PM, KOSAKI Motohiro
<[email protected]> wrote:
>> The usefulness of a scheme like this requires:
>>
>> 1. There are cpus that continually execute user space code
>>    without system interaction.
>>
>> 2. There are repeated VM activities that require page isolation /
>>    migration.
>>
>> The first page isolation activity will then clear the lru caches of the
>> processes doing number crunching in user space (and therefore the first
>> isolation will still interrupt). The second and following isolation will
>> then no longer interrupt the processes.
>>
>> 2. is rare. So the question is if the additional code in the LRU handling
>> can be justified. If lru handling is not time sensitive then yes.
>
> Christoph, I'd like to discuss a bit related (and almost unrelated) thing.
> I think page migration don't need lru_add_drain_all() as synchronous, because
> page migration have 10 times retry.
>
> Then asynchronous lru_add_drain_all() cause
>
> - if system isn't under heavy pressure, retry succussfull.
> - if system is under heavy pressure or RT-thread work busy busy loop, retry failure.
>
> I don't think this is problematic bahavior. Also, mlock can use asynchrounous lru drain.

I think, more exactly, we don't have to drain lru pages for mlocking.
Mlocked pages will go into the unevictable lru via try_to_unmap
when lru shrinking happens.
How about removing the draining in the mlock case?

>
> What do you think?
>
>



--
Kind regards,
Minchan Kim

2009-09-09 16:18:25

by Lee Schermerhorn

Subject: Re: [rfc] lru_add_drain_all() vs isolation

On Thu, 2009-09-10 at 00:39 +0900, Minchan Kim wrote:
> On Wed, Sep 9, 2009 at 1:27 PM, KOSAKI Motohiro
> <[email protected]> wrote:
> >> The usefulness of a scheme like this requires:
> >>
> >> 1. There are cpus that continually execute user space code
> >> without system interaction.
> >>
> >> 2. There are repeated VM activities that require page isolation /
> >> migration.
> >>
> >> The first page isolation activity will then clear the lru caches of the
> >> processes doing number crunching in user space (and therefore the first
> >> isolation will still interrupt). The second and following isolation will
> >> then no longer interrupt the processes.
> >>
> >> 2. is rare. So the question is if the additional code in the LRU handling
> >> can be justified. If lru handling is not time sensitive then yes.
> >
> > Christoph, I'd like to discuss a bit related (and almost unrelated) thing.
> > I think page migration don't need lru_add_drain_all() as synchronous, because
> > page migration have 10 times retry.
> >
> > Then asynchronous lru_add_drain_all() cause
> >
> > - if system isn't under heavy pressure, retry succussfull.
> > - if system is under heavy pressure or RT-thread work busy busy loop, retry failure.
> >
> > I don't think this is problematic bahavior. Also, mlock can use asynchrounous lru drain.
>
> I think, more exactly, we don't have to drain lru pages for mlocking.
> Mlocked pages will go into unevictable lru due to
> try_to_unmap when shrink of lru happens.
> How about removing draining in case of mlock?
>
> >
> > What do you think?


Remember how the code works: __mlock_vma_pages_range() loops calling
get_user_pages() to fault in batches of 16 pages and returns the page
pointers for mlocking. Mlocking now requires isolation from the lru.
If you don't drain after each call to get_user_pages(), up to a
pagevec's worth of pages [~14] will likely still be in the pagevec and
won't be isolatable/mlockable. We can end up with most of the pages
still on the normal lru lists. If we want to move to an almost
exclusively lazy culling of mlocked pages to the unevictable list, then
we can remove the drain. If we want to be more proactive in culling the
unevictable pages as we populate the vma, we'll want to keep the drain.
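
Roughly, the flow described above looks like this (a simplified sketch, not
the exact kernel code; locking and error handling omitted):

	/* simplified sketch of __mlock_vma_pages_range(), not verbatim */
	while (nr_pages > 0) {
		int i, ret;

		ret = get_user_pages(current, mm, addr,
				     min_t(int, nr_pages, 16),
				     1 /* write */, 0, pages, NULL);
		if (ret <= 0)
			break;

		/* without this, up to a pagevec's worth stays unisolatable */
		lru_add_drain();

		for (i = 0; i < ret; i++)
			mlock_vma_page(pages[i]);	/* isolate, move to unevictable */

		addr += ret * PAGE_SIZE;
		nr_pages -= ret;
	}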

Lee

2009-09-09 16:45:59

by Minchan Kim

Subject: Re: [rfc] lru_add_drain_all() vs isolation

Hi, Lee.
Long time no see. :)

On Thu, Sep 10, 2009 at 1:18 AM, Lee Schermerhorn
<[email protected]> wrote:
> On Thu, 2009-09-10 at 00:39 +0900, Minchan Kim wrote:
>> On Wed, Sep 9, 2009 at 1:27 PM, KOSAKI Motohiro
>> <[email protected]> wrote:
>> >> The usefulness of a scheme like this requires:
>> >>
>> >> 1. There are cpus that continually execute user space code
>> >>    without system interaction.
>> >>
>> >> 2. There are repeated VM activities that require page isolation /
>> >>    migration.
>> >>
>> >> The first page isolation activity will then clear the lru caches of the
>> >> processes doing number crunching in user space (and therefore the first
>> >> isolation will still interrupt). The second and following isolation will
>> >> then no longer interrupt the processes.
>> >>
>> >> 2. is rare. So the question is if the additional code in the LRU handling
>> >> can be justified. If lru handling is not time sensitive then yes.
>> >
>> > Christoph, I'd like to discuss a bit related (and almost unrelated) thing.
>> > I think page migration don't need lru_add_drain_all() as synchronous, because
>> > page migration have 10 times retry.
>> >
>> > Then asynchronous lru_add_drain_all() cause
>> >
>> > - if system isn't under heavy pressure, retry succussfull.
>> > - if system is under heavy pressure or RT-thread work busy busy loop, retry failure.
>> >
>> > I don't think this is problematic bahavior. Also, mlock can use asynchrounous lru drain.
>>
>> I think, more exactly, we don't have to drain lru pages for mlocking.
>> Mlocked pages will go into unevictable lru due to
>> try_to_unmap when shrink of lru happens.
>> How about removing draining in case of mlock?
>>
>> >
>> > What do you think?
>
>
> Remember how the code works: __mlock_vma_pages_range() loops calliing
> get_user_pages() to fault in batches of 16 pages and returns the page
> pointers for mlocking. Mlocking now requires isolation from the lru.
> If you don't drain after each call to get_user_pages(), up to a
> pagevec's worth of pages [~14] will likely still be in the pagevec and
> won't be isolatable/mlockable(). We can end up with most of the pages

Sorry for the confusion.
I meant not lru_add_drain but lru_add_drain_all;
the problem now is schedule_on_each_cpu.

Anyway, in that case a pagevec's worth of pages will be multiplied by the
number of CPUs, as you pointed out.

> still on the normal lru lists. If we want to move to an almost
> exclusively lazy culling of mlocked pages to the unevictable then we can
> remove the drain. If we want to be more proactive in culling the
> unevictable pages as we populate the vma, we'll want to keep the drain.
>

It's not good that lazy culling of many pages causes high reclaim overhead.
But right now lazy culling in reclaim is done only in shrink_page_list.
We could also do it in shrink_active_list's page_referenced so that the
cost of lazy culling is spread out.

> Lee
>
>



--
Kind regards,
Minchan Kim

2009-09-09 23:44:03

by KOSAKI Motohiro

Subject: Re: [rfc] lru_add_drain_all() vs isolation

> On Wed, 9 Sep 2009, KOSAKI Motohiro wrote:
>
> > Christoph, I'd like to discuss a bit related (and almost unrelated) thing.
> > I think page migration don't need lru_add_drain_all() as synchronous, because
> > page migration have 10 times retry.
>
> True this is only an optimization that increases the chance of isolation
> being successful. You dont need draining at all.
>
> > Then asynchronous lru_add_drain_all() cause
> >
> > - if system isn't under heavy pressure, retry succussfull.
> > - if system is under heavy pressure or RT-thread work busy busy loop, retry failure.
> >
> > I don't think this is problematic bahavior. Also, mlock can use asynchrounous lru drain.
> >
> > What do you think?
>
> The retries can be very fast if the migrate pages list is small. The
> migrate attempts may be finished before the IPI can be processed by the
> other cpus.

Ah, I see. Yes, my last proposal is not good; a small migration might fail.

How about this?
- pass 1-2: lru_add_drain_all_async()
- pass 3-10: lru_add_drain_all()

This scheme might save the RT-thread case and never cause a regression
(I think); see the sketch below.

The last remaining problem is that if the RT-thread-bound cpu's pagevec holds
a migration-targeted page, migration still faces the same issue,
but we can't solve that...
The RT thread must use /proc/sys/vm/drop_caches properly.
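
As a sketch, the migration path could then look something like this
(lru_add_drain_all_async() is the proposed helper and does not exist yet;
the loop structure is illustrative, not the actual migrate_pages() code):

	int pass;

	for (pass = 0; pass < 10 && !list_empty(from); pass++) {
		if (pass < 2)
			lru_add_drain_all_async();	/* fire and forget */
		else
			lru_add_drain_all();		/* wait for every cpu */

		/* ... try to isolate and migrate what is on the LRU now ... */
	}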


2009-09-09 23:58:21

by KOSAKI Motohiro

Subject: Re: [rfc] lru_add_drain_all() vs isolation

> On Wed, Sep 9, 2009 at 1:27 PM, KOSAKI Motohiro
> <[email protected]> wrote:
> >> The usefulness of a scheme like this requires:
> >>
> >> 1. There are cpus that continually execute user space code
> >>    without system interaction.
> >>
> >> 2. There are repeated VM activities that require page isolation /
> >>    migration.
> >>
> >> The first page isolation activity will then clear the lru caches of the
> >> processes doing number crunching in user space (and therefore the first
> >> isolation will still interrupt). The second and following isolation will
> >> then no longer interrupt the processes.
> >>
> >> 2. is rare. So the question is if the additional code in the LRU handling
> >> can be justified. If lru handling is not time sensitive then yes.
> >
> > Christoph, I'd like to discuss a bit related (and almost unrelated) thing.
> > I think page migration don't need lru_add_drain_all() as synchronous, because
> > page migration have 10 times retry.
> >
> > Then asynchronous lru_add_drain_all() cause
> >
> > - if system isn't under heavy pressure, retry succussfull.
> > - if system is under heavy pressure or RT-thread work busy busy loop, retry failure.
> >
> > I don't think this is problematic bahavior. Also, mlock can use asynchrounous lru drain.
>
> I think, more exactly, we don't have to drain lru pages for mlocking.
> Mlocked pages will go into unevictable lru due to
> try_to_unmap when shrink of lru happens.

Right.

> How about removing draining in case of mlock?

Umm, I don't like this, because doing no draining at all often makes for
strange test results. I mean /proc/meminfo::Mlock might show an unexpected
value. It is not a leak, it's only lazy culling, but many testers and
administrators will think it's a bug... ;)

Practically, lru_add_drain_all() is nearly zero cost, because mlock's page
fault is a very costly operation that hides the drain cost. Right now we only
want to handle a corner-case issue; I don't want a dramatic change.


2009-09-10 01:01:23

by Minchan Kim

Subject: Re: [rfc] lru_add_drain_all() vs isolation

On Thu, 10 Sep 2009 08:58:20 +0900 (JST)
KOSAKI Motohiro <[email protected]> wrote:

> > On Wed, Sep 9, 2009 at 1:27 PM, KOSAKI Motohiro
> > <[email protected]> wrote:
> > >> The usefulness of a scheme like this requires:
> > >>
> > >> 1. There are cpus that continually execute user space code
> > >>    without system interaction.
> > >>
> > >> 2. There are repeated VM activities that require page isolation /
> > >>    migration.
> > >>
> > >> The first page isolation activity will then clear the lru caches of the
> > >> processes doing number crunching in user space (and therefore the first
> > >> isolation will still interrupt). The second and following isolation will
> > >> then no longer interrupt the processes.
> > >>
> > >> 2. is rare. So the question is if the additional code in the LRU handling
> > >> can be justified. If lru handling is not time sensitive then yes.
> > >
> > > Christoph, I'd like to discuss a bit related (and almost unrelated) thing.
> > > I think page migration don't need lru_add_drain_all() as synchronous, because
> > > page migration have 10 times retry.
> > >
> > > Then asynchronous lru_add_drain_all() cause
> > >
> > >  - if system isn't under heavy pressure, retry succussfull.
> > >  - if system is under heavy pressure or RT-thread work busy busy loop, retry failure.
> > >
> > > I don't think this is problematic bahavior. Also, mlock can use asynchrounous lru drain.
> >
> > I think, more exactly, we don't have to drain lru pages for mlocking.
> > Mlocked pages will go into unevictable lru due to
> > try_to_unmap when shrink of lru happens.
>
> Right.
>
> > How about removing draining in case of mlock?
>
> Umm, I don't like this. because perfectly no drain often make strange test result.
> I mean /proc/meminfo::Mlock might be displayed unexpected value. it is not leak. it's only lazy cull.
> but many tester and administrator wiill think it's bug... ;)

I agree. I have no objection to your approach. :)

> Practically, lru_add_drain_all() is nearly zero cost. because mlock's page fault is very
> costly operation. it hide drain cost. now, we only want to treat corner case issue.
> I don't hope dramatic change.

Another problem is as follows.

Although some CPUs don't have anything to do, we make them do the work anyway.
HPC guys don't want to consume CPU cycles, as Christoph pointed out.
I liked Peter's idea with regard to this.
My approach can solve it, too,
but I agree it would be a dramatic change.

--
Kind regards,
Minchan Kim

2009-09-10 01:15:08

by KOSAKI Motohiro

Subject: Re: [rfc] lru_add_drain_all() vs isolation

> On Thu, 10 Sep 2009 08:58:20 +0900 (JST)
> KOSAKI Motohiro <[email protected]> wrote:
>
> > > On Wed, Sep 9, 2009 at 1:27 PM, KOSAKI Motohiro
> > > <[email protected]> wrote:
> > > >> The usefulness of a scheme like this requires:
> > > >>
> > > >> 1. There are cpus that continually execute user space code
> > > >>    without system interaction.
> > > >>
> > > >> 2. There are repeated VM activities that require page isolation /
> > > >>    migration.
> > > >>
> > > >> The first page isolation activity will then clear the lru caches of the
> > > >> processes doing number crunching in user space (and therefore the first
> > > >> isolation will still interrupt). The second and following isolation will
> > > >> then no longer interrupt the processes.
> > > >>
> > > >> 2. is rare. So the question is if the additional code in the LRU handling
> > > >> can be justified. If lru handling is not time sensitive then yes.
> > > >
> > > > Christoph, I'd like to discuss a bit related (and almost unrelated) thing.
> > > > I think page migration don't need lru_add_drain_all() as synchronous, because
> > > > page migration have 10 times retry.
> > > >
> > > > Then asynchronous lru_add_drain_all() cause
> > > >
> > > >  - if system isn't under heavy pressure, retry succussfull.
> > > >  - if system is under heavy pressure or RT-thread work busy busy loop, retry failure.
> > > >
> > > > I don't think this is problematic bahavior. Also, mlock can use asynchrounous lru drain.
> > >
> > > I think, more exactly, we don't have to drain lru pages for mlocking.
> > > Mlocked pages will go into unevictable lru due to
> > > try_to_unmap when shrink of lru happens.
> >
> > Right.
> >
> > > How about removing draining in case of mlock?
> >
> > Umm, I don't like this. because perfectly no drain often make strange test result.
> > I mean /proc/meminfo::Mlock might be displayed unexpected value. it is not leak. it's only lazy cull.
> > but many tester and administrator wiill think it's bug... ;)
>
> I agree. I have no objection to your approach. :)
>
> > Practically, lru_add_drain_all() is nearly zero cost. because mlock's page fault is very
> > costly operation. it hide drain cost. now, we only want to treat corner case issue.
> > I don't hope dramatic change.
>
> Another problem is as follow.
>
> Although some CPUs don't have any thing to do, we do it.
> HPC guys don't want to consume CPU cycle as Christoph pointed out.
> I liked Peter's idea with regard to this.
> My approach can solve it, too.
> But I agree it would be dramatic change.

Is Peter's + my approach bad?

It means:

- the RT-thread-bound cpu is not holding the page
  -> mlock succeeds thanks to Peter's improvement
- the RT-thread-bound cpu is holding the page
  -> mlock succeeds thanks to my approach;
     the page is culled later.



2009-09-10 01:24:08

by Minchan Kim

Subject: Re: [rfc] lru_add_drain_all() vs isolation

On Thu, 10 Sep 2009 10:15:07 +0900 (JST)
KOSAKI Motohiro <[email protected]> wrote:

> > On Thu, 10 Sep 2009 08:58:20 +0900 (JST)
> > KOSAKI Motohiro <[email protected]> wrote:
> >
> > > > On Wed, Sep 9, 2009 at 1:27 PM, KOSAKI Motohiro
> > > > <[email protected]> wrote:
> > > > >> The usefulness of a scheme like this requires:
> > > > >>
> > > > >> 1. There are cpus that continually execute user space code
> > > > >>    without system interaction.
> > > > >>
> > > > >> 2. There are repeated VM activities that require page isolation /
> > > > >>    migration.
> > > > >>
> > > > >> The first page isolation activity will then clear the lru caches of the
> > > > >> processes doing number crunching in user space (and therefore the first
> > > > >> isolation will still interrupt). The second and following isolation will
> > > > >> then no longer interrupt the processes.
> > > > >>
> > > > >> 2. is rare. So the question is if the additional code in the LRU handling
> > > > >> can be justified. If lru handling is not time sensitive then yes.
> > > > >
> > > > > Christoph, I'd like to discuss a bit related (and almost unrelated) thing.
> > > > > I think page migration don't need lru_add_drain_all() as synchronous, because
> > > > > page migration have 10 times retry.
> > > > >
> > > > > Then asynchronous lru_add_drain_all() cause
> > > > >
> > > > >  - if system isn't under heavy pressure, retry succussfull.
> > > > >  - if system is under heavy pressure or RT-thread work busy busy loop, retry failure.
> > > > >
> > > > > I don't think this is problematic bahavior. Also, mlock can use asynchrounous lru drain.
> > > >
> > > > I think, more exactly, we don't have to drain lru pages for mlocking.
> > > > Mlocked pages will go into unevictable lru due to
> > > > try_to_unmap when shrink of lru happens.
> > >
> > > Right.
> > >
> > > > How about removing draining in case of mlock?
> > >
> > > Umm, I don't like this. because perfectly no drain often make strange test result.
> > > I mean /proc/meminfo::Mlock might be displayed unexpected value. it is not leak. it's only lazy cull.
> > > but many tester and administrator wiill think it's bug... ;)
> >
> > I agree. I have no objection to your approach. :)
> >
> > > Practically, lru_add_drain_all() is nearly zero cost. because mlock's page fault is very
> > > costly operation. it hide drain cost. now, we only want to treat corner case issue.
> > > I don't hope dramatic change.
> >
> > Another problem is as follow.
> >
> > Although some CPUs don't have any thing to do, we do it.
> > HPC guys don't want to consume CPU cycle as Christoph pointed out.
> > I liked Peter's idea with regard to this.
> > My approach can solve it, too.
> > But I agree it would be dramatic change.
>
> Is Perter's + mine approach bad?

It's good to me! :)

> It mean,
>
> - RT-thread binding cpu is not grabbing the page
> -> mlock successful by Peter's improvement
> - RT-thread binding cpu is grabbing the page
> -> mlock successful by mine approach
> the page is culled later.
>
>
>
>


--
Kind regards,
Minchan Kim

2009-09-10 18:05:22

by Christoph Lameter

Subject: Re: [rfc] lru_add_drain_all() vs isolation

On Thu, 10 Sep 2009, KOSAKI Motohiro wrote:

> How about this?
> - pass 1-2, lru_add_drain_all_async()
> - pass 3-10, lru_add_drain_all()
>
> this scheme might save RT-thread case and never cause regression. (I think)

Sounds good.

> The last remain problem is, if RT-thread binding cpu's pagevec has migrate
> targetted page, migration still face the same issue.
> but we can't solve it...
> RT-thread must use /proc/sys/vm/drop_caches properly.

A system call "sys_os_quiet_down" may be useful. It would drain all
caches, fold counters etc etc so that there will be no OS activities
needed for those things later.