2008-11-19 06:49:08

by Ken Chen

Subject: [patch] sched: add locking when updating the task_group's cfs_rq[] array.

Add locking when updating the task_group's cfs_rq[] array. tg_shares_up()
can potentially be executed concurrently on multiple CPUs with overlapping
cpu masks, depending on where task_cpu() was when a task got woken up. The
lack of any locking while redistributing tg->shares over the cfs_rq[] array
opens up a large window for conflicting updates and ultimately corrupts the
per-cpu cfs_rq shares. Add a tg_lock to protect the operations.


Signed-off-by: Ken Chen <[email protected]>

diff --git a/kernel/sched.c b/kernel/sched.c
index 1ff78b6..907a44e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -267,6 +267,8 @@ struct task_group {
/* runqueue "owned" by this group on each cpu */
struct cfs_rq **cfs_rq;
unsigned long shares;
+ /* protect integrity of per-cpu cfs_rq[i]->shares */
+ spinlock_t tg_lock;
#endif

#ifdef CONFIG_RT_GROUP_SCHED
@@ -1493,13 +1495,11 @@ update_group_shares_cpu
if (abs(shares - tg->se[cpu]->load.weight) >
sysctl_sched_shares_thresh) {
struct rq *rq = cpu_rq(cpu);
- unsigned long flags;

- spin_lock_irqsave(&rq->lock, flags);
+ spin_lock(&rq->lock);
tg->cfs_rq[cpu]->shares = shares;
-
__set_se_shares(tg->se[cpu], shares);
- spin_unlock_irqrestore(&rq->lock, flags);
+ spin_unlock(&rq->lock);
}
}

@@ -1513,8 +1513,12 @@ static int tg_shares_up
unsigned long weight, rq_weight = 0;
unsigned long shares = 0;
struct sched_domain *sd = data;
+ unsigned long flags;
int i;

+ if (!spin_trylock_irqsave(&tg->tg_lock, flags))
+ return 0;
+
for_each_cpu_mask(i, sd->span) {
/*
* If there are currently no tasks on the cpu pretend there
@@ -1539,6 +1543,7 @@ static int tg_shares_up
for_each_cpu_mask(i, sd->span)
update_group_shares_cpu(tg, i, shares, rq_weight);

+ spin_unlock_irqrestore(&tg->tg_lock, flags);
return 0;
}

@@ -8195,6 +8200,10 @@ void __init sched_init(void)
list_add(&init_task_group.list, &task_groups);
INIT_LIST_HEAD(&init_task_group.children);

+#ifdef CONFIG_FAIR_GROUP_SCHED
+ spin_lock_init(&init_task_group.tg_lock);
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
#ifdef CONFIG_USER_SCHED
INIT_LIST_HEAD(&root_task_group.children);
init_task_group.parent = &root_task_group;
@@ -8491,6 +8500,10 @@ int alloc_fair_sched_group

tg->shares = NICE_0_LOAD;

+#ifdef CONFIG_FAIR_GROUP_SCHED
+ spin_lock_init(&tg->tg_lock);
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
for_each_possible_cpu(i) {
rq = cpu_rq(i);


2008-11-19 07:53:41

by Ingo Molnar

Subject: Re: [patch] sched: add locking when updating the task_group's cfs_rq[] array.


* Ken Chen <[email protected]> wrote:

> @@ -1513,8 +1513,12 @@ static int tg_shares_up
> unsigned long weight, rq_weight = 0;
> unsigned long shares = 0;
> struct sched_domain *sd = data;
> + unsigned long flags;
> int i;
>
> + if (!spin_trylock_irqsave(&tg->tg_lock, flags))
> + return 0;

hm, why trylock?

Ingo

2008-11-19 08:22:38

by Ken Chen

Subject: Re: [patch] sched: add locking when updating the task_group's cfs_rq[] array.

On Tue, Nov 18, 2008 at 11:53 PM, Ingo Molnar <[email protected]> wrote:
>
> * Ken Chen <[email protected]> wrote:
>
>> @@ -1513,8 +1513,12 @@ static int tg_shares_up
>> unsigned long weight, rq_weight = 0;
>> unsigned long shares = 0;
>> struct sched_domain *sd = data;
>> + unsigned long flags;
>> int i;
>>
>> + if (!spin_trylock_irqsave(&tg->tg_lock, flags))
>> + return 0;
>
> hm, why trylock?

I'm paranoid about potential lock contention. Since calls to
tg_shares_up() are more or less sample based, I opted to skip the
update if there is lock contention. Then again, the kernel only walks
the tg tree once every sysctl_sched_shares_ratelimit, so the chances of
running into contention probably aren't that high anyway, in which case
the trylock will almost always get the lock.

- Ken
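
(For illustration only: a minimal userspace sketch of the trylock-and-skip
pattern discussed above, using a pthread mutex instead of a kernel
spinlock. The names tg_shares, cfs_shares[] and NR_CPUS are simplified
stand-ins for this example, not the actual kernel symbols.)

#include <pthread.h>
#include <stdio.h>

#define NR_CPUS 4

static pthread_mutex_t tg_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long tg_shares = 1024;		/* stand-in for tg->shares */
static unsigned long cfs_shares[NR_CPUS];	/* stand-in for cfs_rq[i]->shares */

static int tg_shares_up_sketch(void)
{
	int i;

	/* Someone else is already redistributing: skip this round. */
	if (pthread_mutex_trylock(&tg_lock) != 0)
		return 0;

	/* Trivial redistribution: split the group shares evenly. */
	for (i = 0; i < NR_CPUS; i++)
		cfs_shares[i] = tg_shares / NR_CPUS;

	pthread_mutex_unlock(&tg_lock);
	return 0;
}

int main(void)
{
	tg_shares_up_sketch();
	printf("cfs_shares[0] = %lu\n", cfs_shares[0]);
	return 0;
}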

2008-11-19 16:58:23

by Peter Zijlstra

Subject: Re: [patch] sched: add locking when updating the task_group's cfs_rq[] array.

On Tue, 2008-11-18 at 22:48 -0800, Ken Chen wrote:
> Add locking when updating the task_group's cfs_rq[] array. tg_shares_up()
> can potentially be executed concurrently on multiple CPUs with overlapping
> cpu masks, depending on where task_cpu() was when a task got woken up. The
> lack of any locking while redistributing tg->shares over the cfs_rq[] array
> opens up a large window for conflicting updates and ultimately corrupts the
> per-cpu cfs_rq shares. Add a tg_lock to protect the operations.

I see why you want to do this, but introducing a global lock makes me
sad :/

Let me ponder this a while...

> Signed-off-by: Ken Chen <[email protected]>
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 1ff78b6..907a44e 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -267,6 +267,8 @@ struct task_group {
> /* runqueue "owned" by this group on each cpu */
> struct cfs_rq **cfs_rq;
> unsigned long shares;
> + /* protect integrity of per-cpu cfs_rq[i]->shares */
> + spinlock_t tg_lock;
> #endif
>
> #ifdef CONFIG_RT_GROUP_SCHED
> @@ -1493,13 +1495,11 @@ update_group_shares_cpu
> if (abs(shares - tg->se[cpu]->load.weight) >
> sysctl_sched_shares_thresh) {
> struct rq *rq = cpu_rq(cpu);
> - unsigned long flags;
>
> - spin_lock_irqsave(&rq->lock, flags);
> + spin_lock(&rq->lock);
> tg->cfs_rq[cpu]->shares = shares;
> -
> __set_se_shares(tg->se[cpu], shares);
> - spin_unlock_irqrestore(&rq->lock, flags);
> + spin_unlock(&rq->lock);
> }
> }
>
> @@ -1513,8 +1513,12 @@ static int tg_shares_up
> unsigned long weight, rq_weight = 0;
> unsigned long shares = 0;
> struct sched_domain *sd = data;
> + unsigned long flags;
> int i;
>
> + if (!spin_trylock_irqsave(&tg->tg_lock, flags))
> + return 0;
> +
> for_each_cpu_mask(i, sd->span) {
> /*
> * If there are currently no tasks on the cpu pretend there
> @@ -1539,6 +1543,7 @@ static int tg_shares_up
> for_each_cpu_mask(i, sd->span)
> update_group_shares_cpu(tg, i, shares, rq_weight);
>
> + spin_unlock_irqrestore(&tg->tg_lock, flags);
> return 0;
> }
>
> @@ -8195,6 +8200,10 @@ void __init sched_init(void)
> list_add(&init_task_group.list, &task_groups);
> INIT_LIST_HEAD(&init_task_group.children);
>
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> + spin_lock_init(&init_task_group.tg_lock);
> +#endif /* CONFIG_FAIR_GROUP_SCHED */
> +
> #ifdef CONFIG_USER_SCHED
> INIT_LIST_HEAD(&root_task_group.children);
> init_task_group.parent = &root_task_group;
> @@ -8491,6 +8500,10 @@ int alloc_fair_sched_group
>
> tg->shares = NICE_0_LOAD;
>
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> + spin_lock_init(&tg->tg_lock);
> +#endif /* CONFIG_FAIR_GROUP_SCHED */
> +
> for_each_possible_cpu(i) {
> rq = cpu_rq(i);

2008-11-19 17:22:00

by Ken Chen

Subject: Re: [patch] sched: add locking when updating the task_group's cfs_rq[] array.

On Wed, Nov 19, 2008 at 8:54 AM, Peter Zijlstra <[email protected]> wrote:
> On Tue, 2008-11-18 at 22:48 -0800, Ken Chen wrote:
>> Add locking when updating the task_group's cfs_rq[] array. tg_shares_up()
>> can potentially be executed concurrently on multiple CPUs with overlapping
>> cpu masks, depending on where task_cpu() was when a task got woken up. The
>> lack of any locking while redistributing tg->shares over the cfs_rq[] array
>> opens up a large window for conflicting updates and ultimately corrupts the
>> per-cpu cfs_rq shares. Add a tg_lock to protect the operations.
>
> I see why you want to do this, but introducing a global lock makes me
> sad :/

I wholly agree on the scalability concern. The bigger the system, the
more it needs to protect the integrity of cfs_rq[]->shares so that the
sum still adds up to tg->shares. Otherwise, the shares distributed to
each CPU's cfs_rq can swing wildly and indirectly lead to fluctuation
of the effective total tg->shares. However, I have the same doubt that
this will scale on a large CPU system. Does CFS really have to iterate
the whole task_group tree?

- Ken

2008-11-19 20:59:18

by Peter Zijlstra

Subject: Re: [patch] sched: add locking when updating the task_group's cfs_rq[] array.

On Wed, 2008-11-19 at 09:21 -0800, Ken Chen wrote:
> On Wed, Nov 19, 2008 at 8:54 AM, Peter Zijlstra <[email protected]> wrote:
> > On Tue, 2008-11-18 at 22:48 -0800, Ken Chen wrote:
> >> Add locking when updating the task_group's cfs_rq[] array. tg_shares_up()
> >> can potentially be executed concurrently on multiple CPUs with overlapping
> >> cpu masks, depending on where task_cpu() was when a task got woken up. The
> >> lack of any locking while redistributing tg->shares over the cfs_rq[] array
> >> opens up a large window for conflicting updates and ultimately corrupts the
> >> per-cpu cfs_rq shares. Add a tg_lock to protect the operations.
> >
> > I see why you want to do this, but introducing a global lock makes me
> > sad :/
>
> I wholly agree on the scalability concern. The bigger the system, the
> more it needs to protect the integrity of cfs_rq[]->shares so that the
> sum still adds up to tg->shares. Otherwise, the shares distributed to
> each CPU's cfs_rq can swing wildly and indirectly lead to fluctuation
> of the effective total tg->shares. However, I have the same doubt that
> this will scale on a large CPU system. Does CFS really have to iterate
> the whole task_group tree?

Yes, sadly. The weight of a group's per-cpu super-task representation
depends on how the group's tasks are distributed over all CPUs :/

(Dhaval, could you send Ken a copy of the paper we did on this?)

The idea was that we balance the stuff using the sched-domain tree and
update it incrementally, and then at the top-level sched domain fix it
all up.

Will it scale, half-way, I'd say. It races a little, but should
converge. The biggest issue is that we're running with 10 bit fixed
point math, and on large cpu machines you get into granularity problems.
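
(A rough userspace illustration of the granularity point above, using
assumed numbers rather than real scheduler code: with shares expressed
in 10-bit fixed point units of NICE_0_LOAD = 1024, spreading a group's
default weight over many CPUs leaves only a handful of distinct per-cpu
values, so small differences in load simply round away.)

#include <stdio.h>

#define NICE_0_LOAD 1024UL	/* 10-bit fixed point unit */

int main(void)
{
	unsigned long tg_shares = NICE_0_LOAD;	/* default group weight */
	unsigned long nr_cpus;

	for (nr_cpus = 4; nr_cpus <= 1024; nr_cpus *= 4) {
		/* even distribution, integer math as in the kernel */
		unsigned long per_cpu = tg_shares / nr_cpus;

		printf("%4lu cpus -> %4lu share units per cpu\n",
		       nr_cpus, per_cpu);
	}
	/* at 1024 cpus each cfs_rq is down to a single unit, so any
	 * imbalance between cpus can no longer be expressed at all */
	return 0;
}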


2008-11-19 21:51:12

by Chris Friesen

Subject: Re: [patch] sched: add locking when updating the task_group's cfs_rq[] array.

Peter Zijlstra wrote:

> Will it scale, half-way, I'd say. It races a little, but should
> converge. The biggest issue is that we're running with 10 bit fixed
> point math, and on large cpu machines you get into granularity problems.

Is there any way to get higher resolution math on 64-bit machines? (I'm
assuming that most larger SMP boxes will be 64-bit capable.)

Chris

2008-11-22 10:09:50

by Peter Zijlstra

Subject: Re: [patch] sched: add locking when updating the task_group's cfs_rq[] array.

On Wed, 2008-11-19 at 15:50 -0600, Chris Friesen wrote:
> Peter Zijlstra wrote:
>
> > Will it scale, half-way, I'd say. It races a little, but should
> > converge. The biggest issue is that we're running with 10 bit fixed
> > point math, and on large cpu machines you get into granularity problems.
>
> Is there any way to get higher resolution math on 64-bit machines? (I'm
> assuming that most larger SMP boxes will be 64-bit capable.)

Yes, if we do the same thing we do for 64bit math on 32bit boxen to
obtain 128bit math on 64bit boxen.

The trouble seems to be gcc, which imho f*'ed up by not providing a
long long long (or whatever). There does seem to be something, but all
the gcc people tell me it's fragile and not to be used.

Anyway, I've been meaning to send an RFC to linux-arch about adding
something like this, which we'd then have to do ourselves with C
functions instead of nice integer types :/
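
(A sketch of what "doing it ourselves with C functions" could look like,
analogous to how 64-bit math is open-coded on 32-bit machines: a 128-bit
value built from two 64-bit halves. The names u128, u128_add and
u128_mul_u64 are made up for this example, not existing kernel API.)

#include <stdint.h>
#include <stdio.h>

typedef struct { uint64_t lo, hi; } u128;

static u128 u128_add(u128 a, u128 b)
{
	u128 r;

	r.lo = a.lo + b.lo;
	r.hi = a.hi + b.hi + (r.lo < a.lo);	/* carry out of the low word */
	return r;
}

/* 64x64 -> 128 multiply built from 32-bit partial products */
static u128 u128_mul_u64(uint64_t a, uint64_t b)
{
	uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
	uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
	uint64_t p0   = a_lo * b_lo;
	uint64_t mid  = a_lo * b_hi + (p0 >> 32);
	uint64_t mid2 = a_hi * b_lo + (uint32_t)mid;
	u128 r;

	r.lo = (mid2 << 32) | (uint32_t)p0;
	r.hi = a_hi * b_hi + (mid >> 32) + (mid2 >> 32);
	return r;
}

int main(void)
{
	u128 x = u128_mul_u64(~0ULL, ~0ULL);	/* (2^64 - 1)^2 */

	x = u128_add(x, (u128){ .lo = 1, .hi = 0 });
	printf("hi=%016llx lo=%016llx\n",
	       (unsigned long long)x.hi, (unsigned long long)x.lo);
	return 0;
}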