2011-04-25 09:41:13

by KOSAKI Motohiro

Subject: [PATCH] cpumask: convert cpumask_of_cpu() with cpumask_of()

This patch adapts the code to the new cpumask API style.

Signed-off-by: KOSAKI Motohiro <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
kernel/kthread.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 3b34d27..4102518 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -202,7 +202,7 @@ void kthread_bind(struct task_struct *p, unsigned int cpu)
return;
}

- p->cpus_allowed = cpumask_of_cpu(cpu);
+ cpumask_copy(&p->cpus_allowed, cpumask_of(cpu));
p->rt.nr_cpus_allowed = 1;
p->flags |= PF_THREAD_BOUND;
}
--
1.7.3.1



2011-04-26 10:43:16

by Peter Zijlstra

Subject: Re: [PATCH] cpumask: convert cpumask_of_cpu() with cpumask_of()

On Mon, 2011-04-25 at 18:41 +0900, KOSAKI Motohiro wrote:
> This patch adapts the code to the new cpumask API style.
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Mike Galbraith <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> ---
> kernel/kthread.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 3b34d27..4102518 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -202,7 +202,7 @@ void kthread_bind(struct task_struct *p, unsigned int cpu)
> return;
> }
>
> - p->cpus_allowed = cpumask_of_cpu(cpu);
> + cpumask_copy(&p->cpus_allowed, cpumask_of(cpu));
> p->rt.nr_cpus_allowed = 1;
> p->flags |= PF_THREAD_BOUND;
> }

But why? Are we going to get rid of cpumask_t (which is a fixed-size
struct, so direct assignment is perfectly fine)?

Also, do we want to convert cpus_allowed to cpumask_var_t? It would save
quite a lot of memory on distro configs that set NR_CPUS silly high.
Currently NR_CPUS=4096 configs allocate 512 bytes per task for this
bitmap, 511 of which will never be used on most machines (510 in the
near future).

The cost is of course an extra memory dereference in scheduler hot
paths.. also not nice.
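
For reference, the trade-off hinges on the two possible definitions of
cpumask_var_t, sketched here in simplified form (the real ones live in
include/linux/cpumask.h; the comments are only an approximation of the
cost model being discussed):

#ifdef CONFIG_CPUMASK_OFFSTACK
/*
 * Off-stack: the containing object only embeds a pointer; the bitmap
 * itself is allocated separately and sized for nr_cpu_ids bits, so
 * NR_CPUS=4096 no longer costs 512 bytes per task, but every access
 * to the mask is one extra dereference away.
 */
typedef struct cpumask *cpumask_var_t;
#else
/*
 * On-stack: a full NR_CPUS-bit struct cpumask is embedded directly,
 * so access is a plain offset from the containing structure, but the
 * space is reserved whether or not those CPUs can ever exist.
 */
typedef struct cpumask cpumask_var_t[1];
#endif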

2011-04-26 11:33:27

by KOSAKI Motohiro

Subject: Re: [PATCH] cpumask: convert cpumask_of_cpu() with cpumask_of()

> On Mon, 2011-04-25 at 18:41 +0900, KOSAKI Motohiro wrote:
> > This patch adapts the code to the new cpumask API style.
> >
> > Signed-off-by: KOSAKI Motohiro <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Mike Galbraith <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > ---
> > kernel/kthread.c | 2 +-
> > 1 files changed, 1 insertions(+), 1 deletions(-)
> >
> > diff --git a/kernel/kthread.c b/kernel/kthread.c
> > index 3b34d27..4102518 100644
> > --- a/kernel/kthread.c
> > +++ b/kernel/kthread.c
> > @@ -202,7 +202,7 @@ void kthread_bind(struct task_struct *p, unsigned int cpu)
> > return;
> > }
> >
> > - p->cpus_allowed = cpumask_of_cpu(cpu);
> > + cpumask_copy(&p->cpus_allowed, cpumask_of(cpu));
> > p->rt.nr_cpus_allowed = 1;
> > p->flags |= PF_THREAD_BOUND;
> > }
>
> But why? Are we going to get rid of cpumask_t (which is a fixed-size
> struct, so direct assignment is perfectly fine)?
>
> Also, do we want to convert cpus_allowed to cpumask_var_t? It would save
> quite a lot of memory on distro configs that set NR_CPUS silly high.
> Currently NR_CPUS=4096 configs allocate 512 bytes per task for this
> bitmap, 511 of which will never be used on most machines (510 in the
> near future).
>
> The cost is of course an extra memory dereference in scheduler hot
> paths.. also not nice.

To be honest, I dislike that the current cpumask_size() always returns
NR_CPUS. It throws away almost all of the benefit of cpumask_var_t. But we
have to eliminate every dangerous '=' operator usage before changing its
implementation.

(Btw, this limitation isn't documented anywhere in the code. Should we add
that?)

So now I'm finding all of the '=' operator usages with a tool and replacing
them. The second motivation for eliminating the old API is that it makes
obsolete usage easy to detect by tool.

So I personally hope task->cpus_allowed can be converted to cpumask_var_t,
as cpuset->cpus_allowed already was. For some years the storage folks have
repeatedly warned that the chance of stack overflow keeps growing over time.

However, yes, the extra memory dereference is also bad. If the scheduler
folks dislike cpumask_var_t, I can drop the idea of converting
task->cpus_allowed.

But if we can't convert cpus_allowed, I'd like to move it to the end of
task_struct, because usually only the first bytes of cpus_allowed are ever
accessed (as you described).

Thanks.



From 8844bba7597ac1c7fd2e88406da17d818ce271d1 Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <[email protected]>
Date: Tue, 26 Apr 2011 19:58:39 +0900
Subject: [PATCH] cpumask: add cpumask_var_t documentation

cpumask_var_t has one notable difference from cpumask_t.
This patch adds an explanation of it.

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
include/linux/cpumask.h | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index a3ff24f..3d09505 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -617,6 +617,12 @@ static inline size_t cpumask_size(void)
* ... use 'tmpmask' like a normal struct cpumask * ...
*
* free_cpumask_var(tmpmask);
+ *
+ * However, there is one notable exception: cpumask_var_t is allocated
+ * with only nr_cpu_ids bits (whereas a real cpumask_t always has NR_CPUS
+ * bits), so the '=' operator must not be used on it. Direct assignment
+ * does an NR_CPUS-sized memcpy and can corrupt memory. Use
+ * cpumask_copy() instead.
*/
#ifdef CONFIG_CPUMASK_OFFSTACK
typedef struct cpumask *cpumask_var_t;
--
1.7.3.1
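
To make the hazard described in that comment concrete, here is a small
userspace model (all names in it are invented for illustration; this is
not kernel code): a full-width struct assignment through a pointer that
only points at an nr_cpu_ids-sized allocation writes past the end of the
allocation, which is exactly what the bounded cpumask_copy() avoids.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NR_CPUS_MODEL 4096              /* compile-time maximum         */
static int nr_cpu_ids_model = 8;        /* CPUs actually present        */

struct mask_model {                     /* stands in for struct cpumask */
	unsigned long bits[NR_CPUS_MODEL / (8 * sizeof(unsigned long))];
};

/* copy only the bits that can exist, like cpumask_copy() does */
static void mask_copy(struct mask_model *dst, const struct mask_model *src)
{
	memcpy(dst->bits, src->bits, (nr_cpu_ids_model + 7) / 8);
}

int main(void)
{
	static struct mask_model full;  /* full NR_CPUS-sized mask        */
	/* "off-stack" mask: only nr_cpu_ids bits are really allocated    */
	struct mask_model *var = malloc((nr_cpu_ids_model + 7) / 8);

	if (!var)
		return 1;

	/* *var = full;   BAD: copies all 512 bytes and writes far past
	 *                the small allocation (the '=' hazard above)    */
	mask_copy(var, &full);          /* OK: bounded by nr_cpu_ids     */

	printf("copied %d bits without overrunning the allocation\n",
	       nr_cpu_ids_model);
	free(var);
	return 0;
}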


2011-04-27 10:32:27

by KOSAKI Motohiro

Subject: Re: [PATCH] cpumask: convert cpumask_of_cpu() with cpumask_of()

> > But why? Are we going to get rid of cpumask_t (which is a fixed-size
> > struct, so direct assignment is perfectly fine)?
> >
> > Also, do we want to convert cpus_allowed to cpumask_var_t? It would save
> > quite a lot of memory on distro configs that set NR_CPUS silly high.
> > Currently NR_CPUS=4096 configs allocate 512 bytes per task for this
> > bitmap, 511 of which will never be used on most machines (510 in the
> > near future).
> >
> > The cost is of course an extra memory dereference in scheduler hot
> > paths.. also not nice.

Probably measurement data speaks more clearly than my poor English...

I've made a proof-of-concept patch today. The result is better than I expected.

<before>
Performance counter stats for 'hackbench 10 thread 1000' (10 runs):

1603777813 cache-references # 56.987 M/sec ( +- 1.824% ) (scaled from 25.36%)
13780381 cache-misses # 0.490 M/sec ( +- 1.360% ) (scaled from 25.55%)
24872032348 L1-dcache-loads # 883.770 M/sec ( +- 0.666% ) (scaled from 25.51%)
640394580 L1-dcache-load-misses # 22.755 M/sec ( +- 0.796% ) (scaled from 25.47%)

14.162411769 seconds time elapsed ( +- 0.675% )

<after>
Performance counter stats for 'hackbench 10 thread 1000' (10 runs):

1416147603 cache-references # 51.566 M/sec ( +- 4.407% ) (scaled from 25.40%)
10920284 cache-misses # 0.398 M/sec ( +- 5.454% ) (scaled from 25.56%)
24666962632 L1-dcache-loads # 898.196 M/sec ( +- 1.747% ) (scaled from 25.54%)
598640329 L1-dcache-load-misses # 21.798 M/sec ( +- 2.504% ) (scaled from 25.50%)

13.812193312 seconds time elapsed ( +- 1.696% )

* detailed data is in result.txt


The trick is:
- Typical Linux userland applications don't use the mempolicy and/or cpusets
APIs at all.
- So 99.99% of threads have cpu_all_mask in tsk->cpus_allowed.
- In the cpu_all_mask case, every thread can share the same bitmap, which may
help reduce L1 cache misses in the scheduler.

What do you think?
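
A minimal sketch of the layout this implies (illustrative names only; it
need not match the attached 0002 patch, which changes task->cpus_allowed
to a pointer): an unbound task carries no private 512-byte bitmap and
simply points at the one global cpu_all_mask, so the scheduler's hot-path
checks all hit the same read-mostly data.

struct task_struct_sketch {
	const struct cpumask *cpus_allowed;  /* == cpu_all_mask for ~all tasks */
	/* ... */
};

static inline bool task_may_run_on(const struct task_struct_sketch *p, int cpu)
{
	/* the kind of test the scheduler hot paths keep doing */
	return cpumask_test_cpu(cpu, p->cpus_allowed);
}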


Attachments:
result.txt (3.10 kB)
0001-s-task-cpus_allowed-tsk_cpus_allowed.patch (18.96 kB)
0002-change-task-cpus_allowed-to-pointer.patch (7.93 kB)

2011-05-27 01:07:37

by Peter Zijlstra

Subject: Re: [PATCH] cpumask: convert cpumask_of_cpu() with cpumask_of()

On Wed, 2011-04-27 at 19:32 +0900, KOSAKI Motohiro wrote:
>
> I've made a proof-of-concept patch today. The result is better than I expected.
>
> <before>
> Performance counter stats for 'hackbench 10 thread 1000' (10 runs):
>
> 1603777813 cache-references # 56.987 M/sec ( +- 1.824% ) (scaled from 25.36%)
> 13780381 cache-misses # 0.490 M/sec ( +- 1.360% ) (scaled from 25.55%)
> 24872032348 L1-dcache-loads # 883.770 M/sec ( +- 0.666% ) (scaled from 25.51%)
> 640394580 L1-dcache-load-misses # 22.755 M/sec ( +- 0.796% ) (scaled from 25.47%)
>
> 14.162411769 seconds time elapsed ( +- 0.675% )
>
> <after>
> Performance counter stats for 'hackbench 10 thread 1000' (10 runs):
>
> 1416147603 cache-references # 51.566 M/sec ( +- 4.407% ) (scaled from 25.40%)
> 10920284 cache-misses # 0.398 M/sec ( +- 5.454% ) (scaled from 25.56%)
> 24666962632 L1-dcache-loads # 898.196 M/sec ( +- 1.747% ) (scaled from 25.54%)
> 598640329 L1-dcache-load-misses # 21.798 M/sec ( +- 2.504% ) (scaled from 25.50%)
>
> 13.812193312 seconds time elapsed ( +- 1.696% )
>
> * detailed data is in result.txt
>
>
> The trick is:
> - Typical Linux userland applications don't use the mempolicy and/or cpusets
> APIs at all.
> - So 99.99% of threads have cpu_all_mask in tsk->cpus_allowed.
> - In the cpu_all_mask case, every thread can share the same bitmap, which may
> help reduce L1 cache misses in the scheduler.
>
> What do you think?

Nice!

If you finish the first patch (sort the TODOs) I'll take it.

I'm unsure about the PF_THREAD_UNBOUND thing though; then again, the
alternative is adding another struct cpumask * and having that point to
the shared mask or the private mask.

But yeah, looks quite feasible.

2011-05-30 01:40:18

by KOSAKI Motohiro

Subject: Re: [PATCH] cpumask: convert cpumask_of_cpu() with cpumask_of()

>> The trick is:
>> - Typical Linux userland applications don't use the mempolicy and/or cpusets
>> APIs at all.
>> - So 99.99% of threads have cpu_all_mask in tsk->cpus_allowed.
>> - In the cpu_all_mask case, every thread can share the same bitmap, which may
>> help reduce L1 cache misses in the scheduler.
>>
>> What do you think?
>
> Nice!
>
> If you finish the first patch (sort the TODOs) I'll take it.

Yeah, I'm now submitting a lot of cpumask cleanup patches to various arches
and subsystems, so I expect I can finish this work in June.

> I'm unsure about the PF_THREAD_UNBOUND thing though; then again, the
> alternative is adding another struct cpumask * and having that point to
> the shared mask or the private mask.

Ahhh, I'm sorry, my explanation was bad. PF_THREAD_UNBOUND is not my point.
It's only a proof-of-concept patch, not meant for submission; yes, I cheated
there to get the numbers easily. I think the good way is probably to add
another struct cpumask * and implement a COW shared mask, but I'm OK with
another way too.
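
A rough sketch of what that COW could look like (illustrative only; it
ignores locking and freeing of an old private mask, and none of the names
below are taken from the posted patches): tasks keep pointing at the shared
cpu_all_mask until the first time a restricted affinity is set, at which
point a private nr_cpu_ids-sized mask is allocated and the pointer is
switched over.

struct task_struct_sketch {
	const struct cpumask *cpus_allowed;  /* cpu_all_mask while unbound */
};

static int set_cpus_allowed_cow(struct task_struct_sketch *p,
				const struct cpumask *new_mask)
{
	struct cpumask *priv;

	if (p->cpus_allowed != cpu_all_mask) {
		/* already private: just rewrite it in place */
		cpumask_copy((struct cpumask *)p->cpus_allowed, new_mask);
		return 0;
	}

	/* first restriction: break the share with a private copy */
	priv = kmalloc(cpumask_size(), GFP_KERNEL);
	if (!priv)
		return -ENOMEM;
	cpumask_copy(priv, new_mask);
	p->cpus_allowed = priv;
	return 0;
}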


> But yeah, looks quite feasible.

Thank you for paying attention to my patch!