Hi Peter -
Commit 34d76c41 introduced the update_shares_data percpu, but it ends up
causing problems on large ia64 machines. Specifically, ia64 is limited
to 64k in percpu vars and with NR_CPUS=4096, that ends up being 32k by
itself. It ends up causing link errors since that is how ia64 enforces
the 64k limit.
I can take a deeper look at finding a workable solution but thought I'd
mention it in case you had ideas already.
-Jeff
--
Jeff Mahoney
SuSE Labs
On Sun, 18 Oct 2009, Jeff Mahoney wrote:
> Hi Peter -
>
> Commit 34d76c41 introduced the update_shares_data percpu, but it ends up
> causing problems on large ia64 machines. Specifically, ia64 is limited
> to 64k in percpu vars and with NR_CPUS=4096, that ends up being 32k by
> itself. It ends up causing link errors since that is how ia64 enforces
> the 64k limit.
>
> I can take a deeper look at finding a workable solution but thought I'd
> mention it in case you had ideas already.
I am adding some IA64 CCs, as the failure is solely caused by the ia64
percpu implementation/pagefault handler optimization, which requires that
the .percpu section not be larger than 64k. This blows up with 34d76c41
and NR_CPUS high enough, due to the introduction of a percpu array whose
size depends on NR_CPUS.
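For reference, the object in question (visible in the removal hunks of the
patches later in this thread) is a percpu struct wrapping an NR_CPUS-sized
array, so with NR_CPUS=4096 and 8-byte longs that array alone eats half of
the 64k ia64 reserves for .percpu:
	struct update_shares_data {
		unsigned long rq_weight[NR_CPUS];	/* 4096 * 8 bytes = 32k */
	};
	static DEFINE_PER_CPU(struct update_shares_data, update_shares_data);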
--
Jiri Kosina
SUSE Labs, Novell Inc.
On Tue, 20 Oct 2009, Jiri Kosina wrote:
> > Commit 34d76c41 introduced the update_shares_data percpu, but it ends up
> > causing problems on large ia64 machines. Specifically, ia64 is limited
> > to 64k in percpu vars and with NR_CPUS=4096, that ends up being 32k by
> > itself. It ends up causing link errors since that is how ia64 enforces
> > the 64k limit.
> >
> > I can take a deeper look at finding a workable solution but thought I'd
> > mention it in case you had ideas already.
>
> I am adding some IA64 CCs, as the failure is solely caused by the ia64
> percpu implementation/pagefault handler optimization, which requires that
> the .percpu section not be larger than 64k. This blows up with 34d76c41
> and NR_CPUS high enough, due to the introduction of a percpu array whose
> size depends on NR_CPUS.
How about this one? (untested)
From: Jiri Kosina <[email protected]>
Subject: sched: move rq_weight data array out of .percpu
Commit 34d76c41 introduced the percpu array update_shares_data, whose size
is proportional to NR_CPUS. Unfortunately this blows up ia64 with large
NR_CPUS configurations, as ia64 allows only 64k for the .percpu section.
Fix this by allocating this array dynamically and keeping only a pointer
to it percpu.
Signed-off-by: Jiri Kosina <[email protected]>
---
kernel/sched.c | 15 +++++++--------
1 files changed, 7 insertions(+), 8 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index e886895..21337da 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1564,11 +1564,9 @@ static unsigned long cpu_avg_load_per_task(int cpu)
#ifdef CONFIG_FAIR_GROUP_SCHED
-struct update_shares_data {
- unsigned long rq_weight[NR_CPUS];
-};
+unsigned long *update_shares_data;
-static DEFINE_PER_CPU(struct update_shares_data, update_shares_data);
+static DEFINE_PER_CPU(unsigned long *, update_shares_data);
static void __set_se_shares(struct sched_entity *se, unsigned long shares);
@@ -1578,12 +1576,12 @@ static void __set_se_shares(struct sched_entity *se, unsigned long shares);
static void update_group_shares_cpu(struct task_group *tg, int cpu,
unsigned long sd_shares,
unsigned long sd_rq_weight,
- struct update_shares_data *usd)
+ unsigned long *usd)
{
unsigned long shares, rq_weight;
int boost = 0;
- rq_weight = usd->rq_weight[cpu];
+ rq_weight = usd[cpu];
if (!rq_weight) {
boost = 1;
rq_weight = NICE_0_LOAD;
@@ -1618,7 +1616,7 @@ static void update_group_shares_cpu(struct task_group *tg, int cpu,
static int tg_shares_up(struct task_group *tg, void *data)
{
unsigned long weight, rq_weight = 0, shares = 0;
- struct update_shares_data *usd;
+ unsigned long *usd;
struct sched_domain *sd = data;
unsigned long flags;
int i;
@@ -1631,7 +1629,7 @@ static int tg_shares_up(struct task_group *tg, void *data)
for_each_cpu(i, sched_domain_span(sd)) {
weight = tg->cfs_rq[i]->load.weight;
- usd->rq_weight[i] = weight;
+ usd[i] = weight;
/*
* If there are currently no tasks on the cpu pretend there
@@ -9420,6 +9418,7 @@ void __init sched_init(void)
#ifdef CONFIG_FAIR_GROUP_SCHED
init_task_group.shares = init_task_group_load;
INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
+ __get_cpu_var(update_shares_data) = kmalloc(NR_CPUS * sizeof(unsigned long));
#ifdef CONFIG_CGROUP_SCHED
/*
* How much cpu bandwidth does init_task_group get?
--
Jiri Kosina
SUSE Labs, Novell Inc.
Hello,
Jiri Kosina wrote:
> How about this one? (untested)
>
> From: Jiri Kosina <[email protected]>
> Subject: sched: move rq_weight data array out of .percpu
>
> Commit 34d76c41 introduced the percpu array update_shares_data, whose size
> is proportional to NR_CPUS. Unfortunately this blows up ia64 with large
> NR_CPUS configurations, as ia64 allows only 64k for the .percpu section.
>
> Fix this by allocating this array dynamically and keeping only a pointer
> to it percpu.
If the structure can be dynamically allocated at all, just using
percpu_alloc() to allocate the whole thing should solve it with the
correct numa locality.
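Concretely, that would be something along these lines (just a sketch of the
suggestion; it is essentially what the follow-up patches below converge on):
	/* one rq_weight slot per cpu, in the dynamic percpu area */
	unsigned long *update_shares_data;

	update_shares_data = __alloc_percpu(nr_cpu_ids * sizeof(unsigned long),
					    __alignof__(unsigned long));

	/* writer side, on the local cpu */
	unsigned long *usd = per_cpu_ptr(update_shares_data, smp_processor_id());
	usd[i] = weight;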
Thanks.
--
tejun
On Tue, 20 Oct 2009, Tejun Heo wrote:
> If the structure can be dynamically allocated at all, just using
> percpu_alloc() to allocate the whole thing should solve it with the
> correct numa locality.
>
> Thanks.
So how about something along the lines below? (completely untested as
well)
From: Jiri Kosina <[email protected]>
Subject: sched: move rq_weight data array out of .percpu
Commit 34d76c41 introduced the percpu array update_shares_data, whose size
is proportional to NR_CPUS. Unfortunately this blows up ia64 with large
NR_CPUS configurations, as ia64 allows only 64k for the .percpu section.
Fix this by allocating this array dynamically and keeping only a pointer
to it percpu.
Signed-off-by: Jiri Kosina <[email protected]>
---
kernel/sched.c | 19 +++++++++----------
1 files changed, 9 insertions(+), 10 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index e886895..071cc66 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1564,11 +1564,7 @@ static unsigned long cpu_avg_load_per_task(int cpu)
#ifdef CONFIG_FAIR_GROUP_SCHED
-struct update_shares_data {
- unsigned long rq_weight[NR_CPUS];
-};
-
-static DEFINE_PER_CPU(struct update_shares_data, update_shares_data);
+unsigned long *update_shares_data;
static void __set_se_shares(struct sched_entity *se, unsigned long shares);
@@ -1578,12 +1574,12 @@ static void __set_se_shares(struct sched_entity *se, unsigned long shares);
static void update_group_shares_cpu(struct task_group *tg, int cpu,
unsigned long sd_shares,
unsigned long sd_rq_weight,
- struct update_shares_data *usd)
+ unsigned long *usd)
{
unsigned long shares, rq_weight;
int boost = 0;
- rq_weight = usd->rq_weight[cpu];
+ rq_weight = usd[cpu];
if (!rq_weight) {
boost = 1;
rq_weight = NICE_0_LOAD;
@@ -1618,7 +1614,7 @@ static void update_group_shares_cpu(struct task_group *tg, int cpu,
static int tg_shares_up(struct task_group *tg, void *data)
{
unsigned long weight, rq_weight = 0, shares = 0;
- struct update_shares_data *usd;
+ unsigned long *usd;
struct sched_domain *sd = data;
unsigned long flags;
int i;
@@ -1627,11 +1623,10 @@ static int tg_shares_up(struct task_group *tg, void *data)
return 0;
local_irq_save(flags);
- usd = &__get_cpu_var(update_shares_data);
for_each_cpu(i, sched_domain_span(sd)) {
weight = tg->cfs_rq[i]->load.weight;
- usd->rq_weight[i] = weight;
+ usd = *per_cpu_ptr(update_shares_data, i) = weight;
/*
* If there are currently no tasks on the cpu pretend there
@@ -9407,6 +9402,10 @@ void __init sched_init(void)
#endif /* CONFIG_USER_SCHED */
#endif /* CONFIG_GROUP_SCHED */
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ update_shares_data = __alloc_percpu(NR_CPUS * sizeof(unsigned long),
+ unsigned long);
+#endif
for_each_possible_cpu(i) {
struct rq *rq;
--
Jiri Kosina
SUSE Labs, Novell Inc.
Jiri Kosina wrote:
> So how about something along the lines below? (completely untested as
> well)
>
> From: Jiri Kosina <[email protected]>
> Subject: sched: move rq_weight data array out of .percpu
>
> Commit 34d76c41 introduced the percpu array update_shares_data, whose size
> is proportional to NR_CPUS. Unfortunately this blows up ia64 with large
> NR_CPUS configurations, as ia64 allows only 64k for the .percpu section.
>
> Fix this by allocating this array dynamically and keeping only a pointer
> to it percpu.
Looks good to me from percpu POV.
Thanks.
--
tejun
Tejun Heo wrote:
> Jiri Kosina wrote:
>> So how about something along the lines below? (completely untested as
>> well)
>>
>> From: Jiri Kosina <[email protected]>
>> Subject: sched: move rq_weight data array out of .percpu
>>
>> Commit 34d76c41 introduced the percpu array update_shares_data, whose size
>> is proportional to NR_CPUS. Unfortunately this blows up ia64 with large
>> NR_CPUS configurations, as ia64 allows only 64k for the .percpu section.
>>
>> Fix this by allocating this array dynamically and keeping only a pointer
>> to it percpu.
>
> Looks good to me from percpu POV.
Oh... one thing. If you're doing dynamic allocation you can use
nr_cpu_ids instead of NR_CPUS.
--
tejun
* Jiri Kosina <[email protected]> wrote:
> On Tue, 20 Oct 2009, Jiri Kosina wrote:
>
> > > Commit 34d76c41 introduced the update_shares_data percpu, but it ends up
> > > causing problems on large ia64 machines. Specifically, ia64 is limited
> > > to 64k in percpu vars and with NR_CPUS=4096, that ends up being 32k by
> > > itself. It ends up causing link errors since that is how ia64 enforces
> > > the 64k limit.
> > >
> > > I can take a deeper look at finding a workable solution but thought I'd
> > > mention it in case you had ideas already.
> >
> > I am adding some IA64 CCs, as the failure is solely caused by the ia64
> > percpu implementation/pagefault handler optimization, which requires that
> > the .percpu section not be larger than 64k. This blows up with 34d76c41
> > and NR_CPUS high enough, due to the introduction of a percpu array whose
> > size depends on NR_CPUS.
>
> How about this one? (untested)
>
>
>
> From: Jiri Kosina <[email protected]>
> Subject: sched: move rq_weight data array out of .percpu
>
> Commit 34d76c41 introduced the percpu array update_shares_data, whose size
> is proportional to NR_CPUS. Unfortunately this blows up ia64 with large
> NR_CPUS configurations, as ia64 allows only 64k for the .percpu section.
>
> Fix this by allocating this array dynamically and keeping only a pointer
> to it percpu.
>
> Signed-off-by: Jiri Kosina <[email protected]>
> ---
> kernel/sched.c | 15 +++++++--------
> 1 files changed, 7 insertions(+), 8 deletions(-)
Seems like an IA64 bug to me. It's _very_ easy to get more than 64K of
percpu data - i'm wondering why IA64 only triggered it now? Sure
lockdep/lockstat must have triggered it before.
Ingo
On Tue, 20 Oct 2009, Ingo Molnar wrote:
> > Commit 34d76c41 introduced the percpu array update_shares_data, whose size
> > is proportional to NR_CPUS. Unfortunately this blows up ia64 with large
> > NR_CPUS configurations, as ia64 allows only 64k for the .percpu section.
> >
> > Fix this by allocating this array dynamically and keeping only a pointer
> > to it percpu.
> >
> > Signed-off-by: Jiri Kosina <[email protected]>
> > ---
> > kernel/sched.c | 15 +++++++--------
> > 1 files changed, 7 insertions(+), 8 deletions(-)
>
> Seems like an IA64 bug to me.
IA64 guys actually use that as some kind of optimization for fast access
to the percpu data in their pagefault handler, as far as I know.
> It's _very_ easy to get more than 64K of percpu data - i'm wondering why
> IA64 only triggered it now? Sure lockdep/lockstat must have triggered it
> before.
Do we actually have lockdep on ia64?
--
Jiri Kosina
SUSE Labs, Novell Inc.
On Tue, 20 Oct 2009, Tejun Heo wrote:
> Oh... one thing. If you're doing dynamic allocation you can use
> nr_cpu_ids instead of NR_CPUS.
Right, that's proper optimization.
From: Jiri Kosina <[email protected]>
Subject: sched: move rq_weight data array out of .percpu
Commit 34d76c41 introduced the percpu array update_shares_data, whose size
is proportional to NR_CPUS. Unfortunately this blows up ia64 with large
NR_CPUS configurations, as ia64 allows only 64k for the .percpu section.
Fix this by allocating this array dynamically and keeping only a pointer
to it percpu.
Signed-off-by: Jiri Kosina <[email protected]>
---
kernel/sched.c | 19 +++++++++----------
1 files changed, 9 insertions(+), 10 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index e886895..e70b526 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1564,11 +1564,7 @@ static unsigned long cpu_avg_load_per_task(int cpu)
#ifdef CONFIG_FAIR_GROUP_SCHED
-struct update_shares_data {
- unsigned long rq_weight[NR_CPUS];
-};
-
-static DEFINE_PER_CPU(struct update_shares_data, update_shares_data);
+unsigned long *update_shares_data;
static void __set_se_shares(struct sched_entity *se, unsigned long shares);
@@ -1578,12 +1574,12 @@ static void __set_se_shares(struct sched_entity *se, unsigned long shares);
static void update_group_shares_cpu(struct task_group *tg, int cpu,
unsigned long sd_shares,
unsigned long sd_rq_weight,
- struct update_shares_data *usd)
+ unsigned long *usd)
{
unsigned long shares, rq_weight;
int boost = 0;
- rq_weight = usd->rq_weight[cpu];
+ rq_weight = usd[cpu];
if (!rq_weight) {
boost = 1;
rq_weight = NICE_0_LOAD;
@@ -1618,7 +1614,7 @@ static void update_group_shares_cpu(struct task_group *tg, int cpu,
static int tg_shares_up(struct task_group *tg, void *data)
{
unsigned long weight, rq_weight = 0, shares = 0;
- struct update_shares_data *usd;
+ unsigned long *usd;
struct sched_domain *sd = data;
unsigned long flags;
int i;
@@ -1627,11 +1623,10 @@ static int tg_shares_up(struct task_group *tg, void *data)
return 0;
local_irq_save(flags);
- usd = &__get_cpu_var(update_shares_data);
for_each_cpu(i, sched_domain_span(sd)) {
weight = tg->cfs_rq[i]->load.weight;
- usd->rq_weight[i] = weight;
+ usd = *per_cpu_ptr(update_shares_data, i) = weight;
/*
* If there are currently no tasks on the cpu pretend there
@@ -9407,6 +9402,10 @@ void __init sched_init(void)
#endif /* CONFIG_USER_SCHED */
#endif /* CONFIG_GROUP_SCHED */
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ update_shares_data = __alloc_percpu(nr_cpu_ids * sizeof(unsigned long),
+ unsigned long);
+#endif
for_each_possible_cpu(i) {
struct rq *rq;
--
Jiri Kosina
SUSE Labs, Novell Inc.
* Jiri Kosina <[email protected]> wrote:
> On Tue, 20 Oct 2009, Ingo Molnar wrote:
>
> > > Commit 34d76c41 introduced the percpu array update_shares_data, whose size
> > > is proportional to NR_CPUS. Unfortunately this blows up ia64 with large
> > > NR_CPUS configurations, as ia64 allows only 64k for the .percpu section.
> > >
> > > Fix this by allocating this array dynamically and keeping only a pointer
> > > to it percpu.
> > >
> > > Signed-off-by: Jiri Kosina <[email protected]>
> > > ---
> > > kernel/sched.c | 15 +++++++--------
> > > 1 files changed, 7 insertions(+), 8 deletions(-)
> >
> > Seems like an IA64 bug to me.
>
> IA64 guys actually use that as some kind of optimization for fast
> access to the percpu data in their pagefault handler, as far as I
> know.
Still looks like a bug if it causes a breakage (linker error) on IA64,
and if the 'fix' (i'd call it a workaround) causes a (small but nonzero)
performance regression on other architectures.
Ingo
Ingo Molnar wrote:
> Still looks like a bug if it causes a breakage (linker error) on IA64,
> and if the 'fix' (i'd call it a workaround) causes a (small but nonzero)
> performance regression on other architectures.
>
True, but this also saves some amount of ram for some distro kernels.
If we keep this static NR_CPUS thing, we might be able to free
the end of the table for other per_cpu users?
	if (nr_cpu_ids < NR_CPUS) {
		per_cpu_free_static_zone(&update_shares_data[nr_cpu_ids],
					 sizeof(long) * (NR_CPUS - nr_cpu_ids));
	}
Hello,
Ingo Molnar wrote:
> Still looks like a bug if it causes a breakage (linker error) on IA64,
In a sense, yes.
> and if the 'fix' (i'd call it a workaround) causes a (small but nonzero)
> performance regression on other architectures.
Yeah, by a small amount, but it also reduces memory usage quite a bit.
For this case, I think dynamic allocation is actually the right thing to
do.
Extending the first chunk size shouldn't be difficult on ia64 but
other than this, there still is a lot of room left with the current
64k limit, so I think we should be safe for a while. BTW, I also am
curious how lockdep has been coping with the limit.
Thanks.
--
tejun
On Tue, 20 Oct 2009, Tejun Heo wrote:
> Extending the first chunk size shouldn't be difficult on ia64 but other
> than this, there still is a lot of room left with the current 64k limit,
> so I think we should be safe for a while. BTW, I also am curious how
> lockdep has been coping with the limit.
There is no lockdep on ia64, and we don't have such a restriction on the
.percpu section on any other architecture which supports lockdep, right?
--
Jiri Kosina
SUSE Labs, Novell Inc.
Jiri Kosina wrote:
> On Tue, 20 Oct 2009, Tejun Heo wrote:
>
>> Extending the first chunk size shouldn't be difficult on ia64 but other
>> than this, there still is a lot of room left with the current 64k limit,
>> so I think we should be safe for a while. BTW, I also am curious how
>> lockdep has been coping with the limit.
>
> There is no lockdep on ia64, and we don't have such a restriction on the
> .percpu section on any other architecture which supports lockdep, right?
Yeap, the restriction is only on ia64.
--
tejun
Eric Dumazet wrote:
> Ingo Molnar wrote:
>
>> Still looks like a bug if it causes a breakage (linker error) on IA64,
>> and if the 'fix' (i'd call it a workaround) causes a (small but nonzero)
>> performance regression on other architectures.
>>
>
> True, but this also saves some amount of ram for some distro kernels.
>
> If we keep this static NR_CPUS thing, we might be able to free
> the end of the table for other per_cpu users?
>
> 	if (nr_cpu_ids < NR_CPUS) {
> 		per_cpu_free_static_zone(&update_shares_data[nr_cpu_ids],
> 					 sizeof(long) * (NR_CPUS - nr_cpu_ids));
> 	}
That's doable, but considering that the users of NR_CPUS are pretty few
(and should be kept that way), I think it would be better to just use
dynamic allocation for those cases, which no longer incurs any major
performance difference (the only difference is a constant offset vs. a
pointer in a variable).
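To spell out the access pattern difference being referred to (a rough
sketch; foo and foo_ptr are made-up names, not anything from the patch):
	static DEFINE_PER_CPU(unsigned long, foo);	/* static percpu */
	static unsigned long *foo_ptr;			/* set up via __alloc_percpu() */

	/* static: percpu base + link-time constant offset */
	unsigned long v1 = __get_cpu_var(foo);
	/* dynamic: one extra load of foo_ptr, then the per-cpu offset is added */
	unsigned long v2 = *per_cpu_ptr(foo_ptr, smp_processor_id());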
Thanks.
--
tejun
On Tue, 2009-10-20 at 08:15 +0200, Ingo Molnar wrote:
>
> Seems like an IA64 bug to me. It's _very_ easy to get more than 64K of
> percpu data - i'm wondering why IA64 only triggered it now? Sure
> lockdep/lockstat must have triggered it before.
I seem to remember people wanting to make lockstat data dynamically
allocated because of this.
But then, I'm not sure ia64 actually has lockdep support.
On 10/20/2009 02:35 AM, Ingo Molnar wrote:
>
> * Jiri Kosina <[email protected]> wrote:
>
>> On Tue, 20 Oct 2009, Ingo Molnar wrote:
>>
>>>> Commit 34d76c41 introduced the percpu array update_shares_data, whose size
>>>> is proportional to NR_CPUS. Unfortunately this blows up ia64 with large
>>>> NR_CPUS configurations, as ia64 allows only 64k for the .percpu section.
>>>>
>>>> Fix this by allocating this array dynamically and keeping only a pointer
>>>> to it percpu.
>>>>
>>>> Signed-off-by: Jiri Kosina <[email protected]>
>>>> ---
>>>> kernel/sched.c | 15 +++++++--------
>>>> 1 files changed, 7 insertions(+), 8 deletions(-)
>>>
>>> Seems like an IA64 bug to me.
>>
>> IA64 guys actually use that as some kind of optimization for fast
>> access to the percpu data in their pagefault handler, as far as I
>> know.
>
> Still looks like a bug if it causes a breakage (linker error) on IA64,
> and if the 'fix' (i'd call it a workaround) causes a (small but nonzero)
> performance regression on other architectures.
The linker error isn't a bug, it's enforcement. The ia64 linker script
explicitly rewinds the location pointer back to the start of
.data.percpu + 64k to start the .data section to cause the error if
.data.percpu is larger than 64k.
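For the curious, the enforcement in arch/ia64/kernel/vmlinux.lds.S is
roughly of this shape (paraphrased from memory, not a verbatim quote of the
linker script):
	. = ALIGN(PERCPU_PAGE_SIZE);
	__phys_per_cpu_start = __per_cpu_load;
	/* ensure percpu data fits into the percpu page size (64k): the
	 * location counter is moved to start + PERCPU_PAGE_SIZE, so if
	 * .data.percpu grew beyond that, the following sections would
	 * overlap it and the link fails */
	. = __phys_per_cpu_start + PERCPU_PAGE_SIZE;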
-Jeff
--
Jeff Mahoney
SUSE Labs
* Jeff Mahoney <[email protected]> wrote:
> On 10/20/2009 02:35 AM, Ingo Molnar wrote:
> >
> > * Jiri Kosina <[email protected]> wrote:
> >
> >> On Tue, 20 Oct 2009, Ingo Molnar wrote:
> >>
> >>>> Commit 34d76c41 introduced the percpu array update_shares_data, whose size
> >>>> is proportional to NR_CPUS. Unfortunately this blows up ia64 with large
> >>>> NR_CPUS configurations, as ia64 allows only 64k for the .percpu section.
> >>>>
> >>>> Fix this by allocating this array dynamically and keeping only a pointer
> >>>> to it percpu.
> >>>>
> >>>> Signed-off-by: Jiri Kosina <[email protected]>
> >>>> ---
> >>>> kernel/sched.c | 15 +++++++--------
> >>>> 1 files changed, 7 insertions(+), 8 deletions(-)
> >>>
> >>> Seems like an IA64 bug to me.
> >>
> >> IA64 guys actually use that as some kind of optimization for fast
> >> access to the percpu data in their pagefault handler, as far as I
> >> know.
> >
> > Still looks like a bug if it causes a breakage (linker error) on IA64,
> > and if the 'fix' (i'd call it a workaround) causes a (small but nonzero)
> > performance regression on other architectures.
>
> The linker error isn't a bug, it's enforcement. The ia64 linker script
> explicitly rewinds the location pointer back to the start of
> .data.percpu + 64k to start the .data section to cause the error if
> .data.percpu is larger than 64k.
Since every other SMP architecture manages to support more than 64K of
percpu data, this is clearly an ugly, self-inflicted limitation of IA64
that has now escalated into a link failure.
Now, 34d76c41 could certainly be improved in a way that works around the
IA64 problem too: we can allocate the data dynamically as long as the
proper percpu allocator is used (not kmalloc as in the patch in this
thread). But arguing that the current IA64 64K limit behavior is
anything but very broken is rather shortsighted.
IA64 should be fixed really - we can get past the 64K of percpu data
limit anytime we add a few more pages of per-cpu data to the kernel -
the scheduler just happened to be the one to cross it this time.
The scheduler change in 34d76c41 was done two months ago and has been
upstream for a month, so this complaint is rather late and at minimum a
certain degree of honesty about the situation is warranted.
Saying that all static percpu data must be below 64K, which will only be
noticed once IA64 gets its testing act together months after it's been
created is silly. If you want to enforce such a limit make it testable
in a _timely_ fashion. Or fix the limit really.
Ingo
Ingo Molnar wrote:
> IA64 should be fixed really - we can get past the 64K of percpu data
> limit anytime we add a few more pages of per-cpu data to the kernel -
> the scheduler just happened to be the one to cross it this time.
Christoph (cc'd hi!) was talking about re-purposing one of reserved
generic registers (for current or something, I don't remember the
details) for percpu base and just getting the original one via percpu
access. With that we should be able to lift the limitation without
much performance penalty.
> Saying that all static percpu data must be below 64K, which will only be
> noticed once IA64 gets its testing act together months after it's been
> created is silly. If you want to enforce such a limit make it testable
> in a _timely_ fashion. Or fix the limit really.
Yeah, the problem probably was that it only pushes the percpu area
very slightly over the limit depending on configuration, so it's not
too surprising that it didn't get reported for some time. Also, the
linker script thing is a sanity check which is intentionally put there
to trigger when this happens, so that it can be caught clearly
at build time.
In the long term, it will be a good idea to lift that restriction.
That said, I really don't think we should be adding NR_CPUS-sized
arrays to the percpu area either. Things like that quickly become very
scary with N*1024 cpu configurations.
Thanks.
--
tejun
Of course I forgot to actually cc. Cc'ing and quoting whole body.
Sorry.
Tejun Heo wrote:
> Ingo Molnar wrote:
>> IA64 should be fixed really - we can get past the 64K of percpu data
>> limit anytime we add a few more pages of per-cpu data to the kernel -
>> the scheduler just happened to be the one to cross it this time.
>
> Christoph (cc'd hi!) was talking about re-purposing one of reserved
> generic registers (for current or something, I don't remember the
> details) for percpu base and just getting the original one via percpu
> access. With that we should be able to lift the limitation without
> much performance penalty.
>
>> Saying that all static percpu data must be below 64K, which will only be
>> noticed once IA64 gets its testing act together months after it's been
>> created is silly. If you want to enforce such a limit make it testable
>> in a _timely_ fashion. Or fix the limit really.
>
> Yeah, the problem probably was that it only pushes the percpu area
> very slightly over the limit depending on configuration, so it's not
> too surprising that it didn't get reported for some time. Also, the
> linker script thing is a sanity check which is intentionally put there
> to trigger when this happens, so that it can be caught clearly
> at build time.
>
> In the long term, it will be a good idea to lift that restriction.
> That said, I really don't think we should be adding NR_CPUS-sized
> arrays to the percpu area either. Things like that quickly become very
> scary with N*1024 cpu configurations.
--
tejun
On 10/20/2009 02:27 AM, Jiri Kosina wrote:
> @@ -1627,11 +1623,10 @@ static int tg_shares_up(struct task_group *tg, void *data)
> return 0;
>
> local_irq_save(flags);
> - usd = &__get_cpu_var(update_shares_data);
>
> for_each_cpu(i, sched_domain_span(sd)) {
> weight = tg->cfs_rq[i]->load.weight;
> - usd->rq_weight[i] = weight;
> + usd = *per_cpu_ptr(update_shares_data, i) = weight;
>
> /*
> * If there are currently no tasks on the cpu pretend there
I don't think this is what you want here.
In the original version, usd is the percpu var using the current cpu. In
your version, usd is the percpu var using i instead of the current cpu.
I'll post my version of the patch shortly. I don't think keeping most of
the original version is a bad thing. We can just allocate it dynamically
instead.
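In other words (sketch of the difference, matching the follow-up patch
below):
	/* hunk above: usd is derived from cpu i's copy, not the current cpu's */
	usd = *per_cpu_ptr(update_shares_data, i) = weight;

	/* follow-up patch: take the current cpu's copy once, then fill slot i */
	usd = per_cpu_ptr(update_shares_data, smp_processor_id());
	for_each_cpu(i, sched_domain_span(sd)) {
		weight = tg->cfs_rq[i]->load.weight;
		usd[i] = weight;
	}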
-Jeff
--
Jeff Mahoney
SUSE Labs
On 10/20/2009 10:18 AM, Jeff Mahoney wrote:
> On 10/20/2009 02:27 AM, Jiri Kosina wrote:
>> @@ -1627,11 +1623,10 @@ static int tg_shares_up(struct task_group *tg, void *data)
>> return 0;
>>
>> local_irq_save(flags);
>> - usd = &__get_cpu_var(update_shares_data);
>>
>> for_each_cpu(i, sched_domain_span(sd)) {
>> weight = tg->cfs_rq[i]->load.weight;
>> - usd->rq_weight[i] = weight;
>> + usd = *per_cpu_ptr(update_shares_data, i) = weight;
>>
>> /*
>> * If there are currently no tasks on the cpu pretend there
>
> I don't think this is what you want here.
>
> In the original version, usd is the percpu var using the current cpu. In
> your version, usd is the percpu var using i instead of the current cpu.
>
> I'll post my version of the patch shortly. I don't think keeping most of
> the original version is a bad thing. We can just allocate it dynamically
> instead.
This version fixes a build issue (__alignof__(unsigned long)) and fixes the
percpu lookup to be usd = percpu array pointer, usd[i] = actual variable.
-Jeff
From: Jiri Kosina <[email protected]>
Subject: sched: move rq_weight data array out of .percpu
Commit 34d76c41 introduced the percpu array update_shares_data, whose size
is proportional to NR_CPUS. Unfortunately this blows up ia64 with large
NR_CPUS configurations, as ia64 allows only 64k for the .percpu section.
Fix this by allocating this array dynamically and keeping only a pointer
to it percpu.
Signed-off-by: Jiri Kosina <[email protected]>
---
kernel/sched.c | 20 ++++++++++----------
1 file changed, 10 insertions(+), 10 deletions(-)
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1564,11 +1564,7 @@ static unsigned long cpu_avg_load_per_ta
#ifdef CONFIG_FAIR_GROUP_SCHED
-struct update_shares_data {
- unsigned long rq_weight[NR_CPUS];
-};
-
-static DEFINE_PER_CPU(struct update_shares_data, update_shares_data);
+unsigned long *update_shares_data;
static void __set_se_shares(struct sched_entity *se, unsigned long shares);
@@ -1578,12 +1574,12 @@ static void __set_se_shares(struct sched
static void update_group_shares_cpu(struct task_group *tg, int cpu,
unsigned long sd_shares,
unsigned long sd_rq_weight,
- struct update_shares_data *usd)
+ unsigned long *usd)
{
unsigned long shares, rq_weight;
int boost = 0;
- rq_weight = usd->rq_weight[cpu];
+ rq_weight = usd[cpu];
if (!rq_weight) {
boost = 1;
rq_weight = NICE_0_LOAD;
@@ -1618,7 +1614,7 @@ static void update_group_shares_cpu(stru
static int tg_shares_up(struct task_group *tg, void *data)
{
unsigned long weight, rq_weight = 0, shares = 0;
- struct update_shares_data *usd;
+ unsigned long *usd;
struct sched_domain *sd = data;
unsigned long flags;
int i;
@@ -1627,11 +1623,11 @@ static int tg_shares_up(struct task_grou
return 0;
local_irq_save(flags);
- usd = &__get_cpu_var(update_shares_data);
+ usd = per_cpu_ptr(update_shares_data, smp_processor_id());
for_each_cpu(i, sched_domain_span(sd)) {
weight = tg->cfs_rq[i]->load.weight;
- usd->rq_weight[i] = weight;
+ usd[i] = weight;
/*
* If there are currently no tasks on the cpu pretend there
@@ -9407,6 +9403,10 @@ void __init sched_init(void)
#endif /* CONFIG_USER_SCHED */
#endif /* CONFIG_GROUP_SCHED */
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ update_shares_data = __alloc_percpu(nr_cpu_ids * sizeof(unsigned long),
+ __alignof__(unsigned long));
+#endif
for_each_possible_cpu(i) {
struct rq *rq;
--
Jeff Mahoney
SUSE Labs
* Jeff Mahoney <[email protected]> wrote:
> This version fixes a build issue (__alignof__(unsigned long)) and
> fixes the percpu lookup to be usd = percpu array pointer, usd[i] =
> actual variable.
Before applying this part-workaround i'd like to see confirmation that
the IA64 limitation is being fixed. It would be bad for such bugs to pop
up again and again in the future, with a 2 months bug fixing latency.
Ingo
The suggestion to Tony was to stop using the fixed TLB address for the per
cpu areas and instead use a global register like on x86 and sparc. Peter
Chubb then said that we had available global registers (cced him) that
could be used for this purpose. Doing so would make percpu support conform
to the schemes used by other architectures.
However, IA64 supports TLBs for larger pages (128MB, 256MB ...). So we
could change the maximum size of the static percpu data by using a
different sized TLB entry.
Tony?
Ingo Molnar wrote:
> * Jeff Mahoney <[email protected]> wrote:
>
>> This version fixes a build issue (__alignof__(unsigned long)) and
>> fixes the percpu lookup to be usd = percpu array pointer, usd[i] =
>> actual variable.
>
> Before applying this part-workaround i'd like to see confirmation that
> the IA64 limitation is being fixed. It would be bad for such bugs to pop
> up again and again in the future, with a 2 months bug fixing latency.
I talked with Christoph earlier and he basically said that the
constant can just be bumped up without any ill effect (it would be
great if Tony Luck can confirm this). So, it's an annoying thing and
in the long term removing it would be a good idea, but I don't think it's
anything major as long as it blows up at build time, not at runtime.
Thanks.
--
tejun
> I talked with Christoph earlier and he basically said that the
> constant can just be bumped up without any ill effect (it would be
> great if Tony Luck can confirm this). So, it's an annoying thing and
> in the long term removing it would be a good idea, but I don't think it's
> anything major as long as it blows up at build time, not at runtime.
Yes, ia64 can use any supported page size for the percpu area (I
tried a couple of different options, and saw no obvious problems).
But ... the architecturally supported page sizes go up by powers
of 4, so the next choice from 64K is 256K, then 1M, 4M, etc. This also
requires an edit of source code and re-compile. We could easily
make it a config option ... but that is still inconvenient.
The bloat introduced by adding percpu variables is multiplied by
NR_CPUS ... and in my case that is 4096. It is easy to just shrug
this off and say that such big systems have plenty of memory anyway,
but the case that led to this issue (adding a percpu object that
included a [NR_CPUS] array) shows that, IMHO, people do not
care enough about the bloat.
I suspect that if I just increase the percpu area to 256K or 1M,
I'll see this same issue when someone adds:
struct foo {
char buf[NR_CPUS][PAGE_SIZE];
};
DECLARE_PER_CPU(struct foo, bar);
which needs 4k * 64k = 256M of per-cpu space ... i.e. 1T total.
If such code is going to be deemed acceptable, then we do need
to move away from the ia64 TLB mapped percpu area.
-Tony
On Wed, 21 Oct 2009, Luck, Tony wrote:
> But ... the architecturally supported page sizes go up by powers of 4,
> so the next choice from 64K is 256K, then 1M, 4M, etc. This also requires
> an edit of source code and re-compile. We could easily make it a config
> option ... but that is still inconvenient.
>
> The bloat introduced by adding percpu variables is multiplied by NR_CPUS
> ... and in my case that is 4096. It is easy to just shrug this off and
> say that such big systems have plenty of memory anyway, but the case
> that led to this issue (adding a percpu object that included a [NR_CPUS]
> array) shows that, IMHO, people do not care enough about the bloat.
>
> I suspect that if I just increase the percpu area to 256K or 1M, I'll
> see this same issue when someone adds:
>
> struct foo {
> char buf[NR_CPUS][PAGE_SIZE];
> };
> DECLARE_PER_CPU(struct foo, bar);
>
> which needs 4k * 64k = 256M of per-cpu space ... i.e. 1T total.
>
> If such code is going to be deemed acceptable, then we do need
> to move away from the ia64 TLB mapped percpu area.
Well, I must say I slightly agree; my gut feeling is that we should
try to avoid arrays whose size depends on NR_CPUS as much as possible.
Now, what to do for 2.6.32? We definitely need some kind of fix for this,
otherwise the Altix guys would kill us.
Tony, is the change that will eventually have to be made to ia64 pagefault
handler too intrusive for -rc6, and should we rather go with my workaround
instead, and try to find something proper for 2.6.33?
A pure revert of 34d76c41 isn't possible either, as it fixes a real bug.
--
Jiri Kosina
SUSE Labs, Novell Inc.
On 10/22/2009 10:49 AM, Jiri Kosina wrote:
> On Wed, 21 Oct 2009, Luck, Tony wrote:
>
>> But ... the architecturally supported page sizes go up by powers of 4,
>> so the next choice from 64K is 256K, then 1M, 4M, etc. This also requires
>> an edit of source code and re-compile. We could easily make it a config
>> option ... but that is still inconvenient.
>>
>> The bloat introduced by adding percpu variables is multiplied by NR_CPUS
>> ... and in my case that is 4096. It is easy to just shrug this off and
>> say that such big systems have plenty of memory anyway, but the case
>> that led to this issue (adding a percpu object that included a [NR_CPUS]
>> array) shows that, IMHO, people do not care enough about the bloat.
>>
>> I suspect that if I just increase the percpu area to 256K or 1M, I'll
>> see this same issue when someone adds:
>>
>> struct foo {
>> char buf[NR_CPUS][PAGE_SIZE];
>> };
>> DECLARE_PER_CPU(struct foo, bar);
>>
>> which needs 4k * 64k = 256M of per-cpu space ... i.e. 1T total.
>>
>> If such code is going to be deemed acceptable, then we do need
>> to move away from the ia64 TLB mapped percpu area.
>
> Well, I must say I slightly agree; my gut feeling is that we should
> try to avoid arrays whose size depends on NR_CPUS as much as possible.
Agreed.
> Now, what to do for 2.6.32? We definitely need some kind of fix for this,
> otherwise the Altix guys would kill us.
>
> Tony, is the change that will eventually have to be made to ia64 pagefault
> handler too intrusive for -rc6, and should we rather go with my workaround
> instead, and try to find something proper for 2.6.33?
It's not an either-or situation. It's not really a workaround. Distros
configure NR_CPUS for the maximum possible for a given architecture even
though the majority of systems don't come close to it. We're just
wasting memory. I've seen other fixes that are just "make percpu xyz
dynamic" without much debate. I don't see why this one should be much
different other than the fact that the original report raised a red flag
about an implementation limitation.
Yes, the ia64 limitation should be revisited, but that is a separate issue.
-Jeff
--
Jeff Mahoney
SUSE Labs
> Tony, is the change that will eventually have to be made to ia64 pagefault
> handler too intrusive for -rc6, and should we rather go with my workaround
> instead, and try to find something proper for 2.6.33?
Using __alloc_percpu() rather than static declaration looks to be
the right fix here. Not a "workaround".
-Tony
* Luck, Tony <[email protected]> wrote:
> > Tony, is the change that will eventually have to be made to ia64
> > pagefault handler too intrusive for -rc6, and should we rather go
> > with my workaround instead, and try to find something proper for
> > 2.6.33?
>
> Using __alloc_percpu() rather than static declaration looks to be the
> right fix here. Not a "workaround".
It is a workaround for the IA64 build failure.
It's also an improvement of the scheduler code (we generally try to
eliminate NR_CPUs scaling of allocations) - but code improvements need
to happen much sooner than -rc6/-rc7.
Ingo
On Fri, 23 Oct 2009, Ingo Molnar wrote:
> > > Tony, is the change that will eventually have to be made to ia64
> > > pagefault handler too intrusive for -rc6, and should we rather go
> > > with my workaround instead, and try to find something proper for
> > > 2.6.33?
> >
> > Using __alloc_percpu() rather than static declaration looks to be the
> > right fix here. Not a "workaround".
>
> It is a workaround for the IA64 build failure.
>
> It's also an improvement of the scheduler code (we generally try to
> eliminate NR_CPUs scaling of allocations) - but code improvements need
> to happen much sooner than -rc6/-rc7.
Unfortunately, eliminating the 64k limitation would require refactoring
the ia64 pagefault handler, as far as I know, which doesn't seem feasible
in this timeframe either.
So if I personally have to choose from these two options, I'd prefer my
patch for 2.6.32, and it'd be helpful if the ia64 people could look into
doing something about the 64k limit for 2.6.33.
Acceptable for everyone? Ingo/Peter, will you merge this asap please, so
that we don't miss 2.6.32 with it?
Thanks,
--
Jiri Kosina
SUSE Labs, Novell Inc.
On Fri, 23 Oct 2009, Jiri Kosina wrote:
> > > > Tony, is the change that will eventually have to be made to ia64
> > > > pagefault handler too intrusive for -rc6, and should we rather go
> > > > with my workaround instead, and try to find something proper for
> > > > 2.6.33?
> > >
> > > Using __alloc_percpu() rather than static declaration looks to be the
> > > right fix here. Not a "workaround".
> >
> > It is a workaround for the IA64 build failure.
> >
> > It's also an improvement of the scheduler code (we generally try to
> > eliminate NR_CPUs scaling of allocations) - but code improvements need
> > to happen much sooner than -rc6/-rc7.
>
> Unfortunately, eliminating the 64k limitation would require refactoring
> the ia64 pagefault handler, as far as I know, which doesn't seem feasible
> in this timeframe either.
>
> So if I personally have to choose from these two options, I'd prefer my
> patch for 2.6.32, and it'd be helpful if the ia64 people could look into
> doing something about the 64k limit for 2.6.33.
>
> Acceptable for everyone? Ingo/Peter, will you merge this asap please, so
> that we don't miss 2.6.32 with it?
This still seems unfixed in the relevant trees I have checked ... does
anyone have strong objections to this that are causing the patch to be delayed?
Thanks,
--
Jiri Kosina
SUSE Labs, Novell Inc.
* Jeff Mahoney <[email protected]> wrote:
> On 10/20/2009 10:18 AM, Jeff Mahoney wrote:
> > On 10/20/2009 02:27 AM, Jiri Kosina wrote:
> >> @@ -1627,11 +1623,10 @@ static int tg_shares_up(struct task_group *tg, void *data)
> >> return 0;
> >>
> >> local_irq_save(flags);
> >> - usd = &__get_cpu_var(update_shares_data);
> >>
> >> for_each_cpu(i, sched_domain_span(sd)) {
> >> weight = tg->cfs_rq[i]->load.weight;
> >> - usd->rq_weight[i] = weight;
> >> + usd = *per_cpu_ptr(update_shares_data, i) = weight;
> >>
> >> /*
> >> * If there are currently no tasks on the cpu pretend there
> >
> > I don't think this is what you want here.
> >
> > In the original version, usd is the percpu var using the current cpu. In
> > your version, usd is the percpu var using i instead of the current cpu.
> >
> > I'll post my version of the patch shortly. I don't think keeping most of
> > the original version is a bad thing. We can just allocate it dynamically
> > instead.
>
> This version fixes a build issue (__alignof__(unsigned long)) and fixes the
> percpu lookup to be usd = percpu array pointer, usd[i] = actual variable.
>
> -Jeff
>
> From: Jiri Kosina <[email protected]>
> Subject: sched: move rq_weight data array out of .percpu
>
> Commit 34d76c41 introduced the percpu array update_shares_data, whose size
> is proportional to NR_CPUS. Unfortunately this blows up ia64 with large
> NR_CPUS configurations, as ia64 allows only 64k for the .percpu section.
>
> Fix this by allocating this array dynamically and keeping only a pointer
> to it percpu.
>
> Signed-off-by: Jiri Kosina <[email protected]>
> ---
> kernel/sched.c | 20 ++++++++++----------
> 1 file changed, 10 insertions(+), 10 deletions(-)
>
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -1564,11 +1564,7 @@ static unsigned long cpu_avg_load_per_ta
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
>
> -struct update_shares_data {
> - unsigned long rq_weight[NR_CPUS];
> -};
> -
> -static DEFINE_PER_CPU(struct update_shares_data, update_shares_data);
> +unsigned long *update_shares_data;
That should be __read_mostly - this can be a hotly accessed data
structure of the scheduler - if it happens to go next to a frequently
bouncing variable that can be bad for performance.
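(That is, the declaration would become something like the following - just
a sketch of the suggestion:)
	unsigned long *update_shares_data __read_mostly;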
> - struct update_shares_data *usd)
> + unsigned long *usd)
I don't think using usd[cpu] is clearer than usd->rq_weight[cpu]. At
minimum it should be renamed to usd_rq_weight, not usd.
> local_irq_save(flags);
> - usd = &__get_cpu_var(update_shares_data);
> + usd = per_cpu_ptr(update_shares_data, smp_processor_id());
Could we please have a look at the before/after assembly of this
sequence on x86, to make sure the claims in this thread are true and we
don't lose performance? (And include it in the changelog with a
resubmission - with a new, changed '[PATCH] ...' subject line, not
hidden inside a discussion thread.)
From a merge POV i'm quite nervous about such a change to the scheduler
this late in the .32 cycle - to offset that risk i'd really like to see
that this change has been pursued carefully to the edge of possibilities
- currently it does not give that impression.
Thanks,
Ingo
On Mon, 26 Oct 2009, Ingo Molnar wrote:
> > -struct update_shares_data {
> > - unsigned long rq_weight[NR_CPUS];
> > -};
> > -
> > -static DEFINE_PER_CPU(struct update_shares_data, update_shares_data);
> > +unsigned long *update_shares_data;
>
> That should be __read_mostly - this can be a hotly accessed data
> structure of the scheduler - if it happens to go next to a frequently
> bouncing variable that can be bad for performance.
Good point, will resend with this annotation added.
> > - struct update_shares_data *usd)
> > + unsigned long *usd)
>
> I don't think using usd[cpu] is clearer than usd->rq_weight[cpu]. At
> minimum it should be renamed to usd_rq_weight, not usd.
OK.
> > local_irq_save(flags);
> > - usd = &__get_cpu_var(update_shares_data);
> > + usd = per_cpu_ptr(update_shares_data, smp_processor_id());
>
> Could we please have a look at the before/after assembly of this
> sequence on x86, to make sure the claims in this thread are true and we
> don't lose performance? (And include it in the changelog with a
> resubmission - with a new, changed '[PATCH] ...' subject line, not
> hidden inside a discussion thread.)
Just for completeness (I will include it in the changelog with the next
resubmission shortly):
Before:
...
ffffffff8104337c: 65 48 8b 14 25 20 cd mov %gs:0xcd20,%rdx
ffffffff81043383: 00 00
ffffffff81043385: 48 c7 c0 00 e1 00 00 mov $0xe100,%rax
ffffffff8104338c: 48 c7 45 a0 00 00 00 movq $0x0,-0x60(%rbp)
ffffffff81043393: 00
ffffffff81043394: 48 c7 45 a8 00 00 00 movq $0x0,-0x58(%rbp)
ffffffff8104339b: 00
ffffffff8104339c: 48 01 d0 add %rdx,%rax
ffffffff8104339f: 49 8d 94 24 08 01 00 lea 0x108(%r12),%rdx
ffffffff810433a6: 00
ffffffff810433a7: b9 ff ff ff ff mov $0xffffffff,%ecx
ffffffff810433ac: 48 89 45 b0 mov %rax,-0x50(%rbp)
ffffffff810433b0: bb 00 04 00 00 mov $0x400,%ebx
ffffffff810433b5: 48 89 55 c0 mov %rdx,-0x40(%rbp)
...
After:
...
ffffffff8104337c: 65 8b 04 25 28 cd 00 mov %gs:0xcd28,%eax
ffffffff81043383: 00
ffffffff81043384: 48 98 cltq
ffffffff81043386: 49 8d bc 24 08 01 00 lea 0x108(%r12),%rdi
ffffffff8104338d: 00
ffffffff8104338e: 48 8b 15 d3 7f 76 00 mov 0x767fd3(%rip),%rdx # ffffffff817ab368 <update_shares_data>
ffffffff81043395: 48 8b 34 c5 00 ee 6d mov -0x7e921200(,%rax,8),%rsi
ffffffff8104339c: 81
ffffffff8104339d: 48 c7 45 a0 00 00 00 movq $0x0,-0x60(%rbp)
ffffffff810433a4: 00
ffffffff810433a5: b9 ff ff ff ff mov $0xffffffff,%ecx
ffffffff810433aa: 48 89 7d c0 mov %rdi,-0x40(%rbp)
ffffffff810433ae: 48 c7 45 a8 00 00 00 movq $0x0,-0x58(%rbp)
ffffffff810433b5: 00
ffffffff810433b6: bb 00 04 00 00 mov $0x400,%ebx
ffffffff810433bb: 48 01 f2 add %rsi,%rdx
ffffffff810433be: 48 89 55 b0 mov %rdx,-0x50(%rbp)
...
> From a merge POV i'm quite nervous about such a change to the scheduler
> this late in the .32 cycle - to offset that risk i'd really like to see
> that this change has been pursued carefully to the edge of possibilities
> - currently it does not give that impression.
So what's your other proposal?
Refactoring the ia64 pagefault handler is a no-go this late in the cycle,
IMHO.
So either this, or reverting 34d76c41 (which is also a no-go, I believe).
--
Jiri Kosina
SUSE Labs, Novell Inc.
On Tue, 20 Oct 2009, Tejun Heo wrote:
> > How about this one? (untested)
> >
> > From: Jiri Kosina <[email protected]>
> > Subject: sched: move rq_weight data array out of .percpu
> >
> > Commit 34d76c41 introduced the percpu array update_shares_data, whose size
> > is proportional to NR_CPUS. Unfortunately this blows up ia64 with large
> > NR_CPUS configurations, as ia64 allows only 64k for the .percpu section.
> >
> > Fix this by allocating this array dynamically and keeping only a pointer
> > to it percpu.
>
> If the structure can be dynamically allocated at all, just using
> percpu_alloc() to allocate the whole thing should solve it with the
> correct numa locality.
Unfortunately it's not that straightforward and easy, as we can't simply
call __alloc_percpu() from sched_init().
The reason is that it does spin_unlock_irq(), and at sched_init() time it's
too early for that, so we end up with
WARNING: at kernel/lockdep.c:2311 trace_hardirqs_on_caller+0xd2/0x1a0()
Hardware name: Kfisher HT2100 HT1000
Modules linked in:
Pid: 0, comm: swapper Not tainted 2.6.32-rc5-default #66
Call Trace:
[<ffffffff81086a62>] ? trace_hardirqs_on_caller+0xd2/0x1a0
[<ffffffff81054528>] warn_slowpath_common+0x78/0xd0
[<ffffffff81356b8b>] ? _spin_unlock_irq+0x2b/0x40
[<ffffffff8105458f>] warn_slowpath_null+0xf/0x20
[<ffffffff81086a62>] trace_hardirqs_on_caller+0xd2/0x1a0
[<ffffffff81086b3d>] trace_hardirqs_on+0xd/0x10
[<ffffffff81356b8b>] _spin_unlock_irq+0x2b/0x40
[<ffffffff811154c1>] pcpu_extend_area_map+0x41/0x110
[<ffffffff81115ce0>] pcpu_alloc+0x120/0x890
[<ffffffff81084577>] ? trace_hardirqs_off_caller+0x67/0xa0
[<ffffffff8108578d>] ? trace_hardirqs_off+0xd/0x10
[<ffffffff8111646b>] __alloc_percpu+0xb/0x10
[<ffffffff81712215>] sched_init+0x147/0x5d2
[<ffffffff816fabca>] start_kernel+0x1fc/0x40c
[<ffffffff816fa299>] x86_64_start_reservations+0x99/0xb9
[<ffffffff816fa3bf>] x86_64_start_kernel+0x106/0x121
[<ffffffff816fa140>] ? early_idt_handler+0x0/0x71
---[ end trace 4eaa2a86a8e2da22 ]---
start_kernel(): bug: interrupts were enabled *very* early, fixing it
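As far as I can tell from the trace, the problem is in the allocator path
itself; a simplified sketch (not the actual mm/percpu.c code) of what
happens:
	/* pcpu_alloc() may need to grow a chunk's allocation map ... */
	pcpu_extend_area_map()
	{
		spin_unlock_irq(&pcpu_lock);	/* drop the lock to allocate a bigger map */
		/* ... allocate the new map ... */
		spin_lock_irq(&pcpu_lock);
	}
spin_unlock_irq() unconditionally re-enables interrupts, which is not legal
this early in start_kernel(), hence the "interrupts were enabled *very*
early" complaint above.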
--
Jiri Kosina
SUSE Labs, Novell Inc.