2009-07-30 14:57:22

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH 0/2] scheduler fixes

The following changes since commit 1314bb58bbe2a2c146490c59986c511687543d3c:
  Gregory Haskins (1):
        sched: enhance the pre/post scheduling logic

are available in the git repository at:

git://git.kernel.org/pub/scm/linux/kernel/git/ghaskins/linux-2.6-hacks.git sched/updates

Gregory Haskins (2):
      [RESEND] sched: Fully integrate cpus_active_map and root-domain code
      sched: fix race in cpupri introduced by cpumask_var changes

 kernel/sched.c        |    2 +-
 kernel/sched_cpupri.c |   15 ++++++++++++++-
 kernel/sched_fair.c   |   10 +++++++---
 kernel/sched_rt.c     |    7 -------
 4 files changed, 22 insertions(+), 12 deletions(-)


Regards,
-Greg


2009-07-30 14:57:31

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH 1/2] [RESEND] sched: Fully integrate cpus_active_map and root-domain code

(Applies to 2.6.31-rc4)

[
This patch was originally sent about a year ago, but got dropped,
presumably by accident. Here is the original thread.

http://lkml.org/lkml/2008/7/22/281

At that time, Peter and Max acked it. It has now been forward
ported to the new cpumask interface. I will be so bold as to
carry their acks forward since the basic logic is the same.
However, a new acknowledgement, if they have the time to review,
would be ideal.

I have tested this patch on a 4-way system using Max's recommended
"echo 0|1 > cpu1/online" technique and it appears to work properly
]

What: Reflect "active" cpus in the rq->rd->online field, instead of the
online_map.

Motivation: Things that use the root-domain code (such as cpupri) only
care about cpus classified as "active" anyway. By synchronizing the
root-domain state with the active map, we allow several optimizations.

For instance, we can remove an extra cpumask_and from the scheduler
hotpath by utilizing rq->rd->online (since it is now a cached version
of cpu_active_map & rq->rd->span).
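
To make the invariant concrete, here is a minimal sketch (illustration
only, not part of the patch; the helper name cpu_usable() is made up):

/*
 * rq->rd->online is now maintained by set_rq_online()/set_rq_offline()
 * as a cached copy of (rq->rd->span & cpu_active_mask), so callers can
 * test a single mask instead of and-ing against cpu_active_mask first.
 */
static inline bool cpu_usable(int cpu, struct rq *rq)
{
        return cpumask_test_cpu(cpu, rq->rd->online);
}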

Signed-off-by: Gregory Haskins <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Acked-by: Max Krasnyansky <[email protected]>
---

kernel/sched.c | 2 +-
kernel/sched_fair.c | 10 +++++++---
kernel/sched_rt.c | 7 -------
3 files changed, 8 insertions(+), 11 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 1a104e6..38a1526 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -7874,7 +7874,7 @@ static void rq_attach_root(struct rq *rq, struct root_domain *rd)
rq->rd = rd;

cpumask_set_cpu(rq->cpu, rd->span);
- if (cpumask_test_cpu(rq->cpu, cpu_online_mask))
+ if (cpumask_test_cpu(rq->cpu, cpu_active_mask))
set_rq_online(rq);

spin_unlock_irqrestore(&rq->lock, flags);
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 9ffb2b2..2b9cae6 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1040,17 +1040,21 @@ static void yield_task_fair(struct rq *rq)
* search starts with cpus closest then further out as needed,
* so we always favor a closer, idle cpu.
* Domains may include CPUs that are not usable for migration,
- * hence we need to mask them out (cpu_active_mask)
+ * hence we need to mask them out (rq->rd->online)
*
* Returns the CPU we should wake onto.
*/
#if defined(ARCH_HAS_SCHED_WAKE_IDLE)
+
+#define cpu_rd_active(cpu, rq) cpumask_test_cpu(cpu, rq->rd->online)
+
static int wake_idle(int cpu, struct task_struct *p)
{
struct sched_domain *sd;
int i;
unsigned int chosen_wakeup_cpu;
int this_cpu;
+ struct rq *task_rq = task_rq(p);

/*
* At POWERSAVINGS_BALANCE_WAKEUP level, if both this_cpu and prev_cpu
@@ -1083,10 +1087,10 @@ static int wake_idle(int cpu, struct task_struct *p)
for_each_domain(cpu, sd) {
if ((sd->flags & SD_WAKE_IDLE)
|| ((sd->flags & SD_WAKE_IDLE_FAR)
- && !task_hot(p, task_rq(p)->clock, sd))) {
+ && !task_hot(p, task_rq->clock, sd))) {
for_each_cpu_and(i, sched_domain_span(sd),
&p->cpus_allowed) {
- if (cpu_active(i) && idle_cpu(i)) {
+ if (cpu_rd_active(i, task_rq) && idle_cpu(i)) {
if (i != task_cpu(p)) {
schedstat_inc(p,
se.nr_wakeups_idle);
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index a8f89bc..13f728e 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -1173,13 +1173,6 @@ static int find_lowest_rq(struct task_struct *task)
return -1; /* No targets found */

/*
- * Only consider CPUs that are usable for migration.
- * I guess we might want to change cpupri_find() to ignore those
- * in the first place.
- */
- cpumask_and(lowest_mask, lowest_mask, cpu_active_mask);
-
- /*
* At this point we have built a mask of cpus representing the
* lowest priority tasks in the system. Now we want to elect
* the best one based on our affinity and topology.

2009-07-30 14:57:41

by Gregory Haskins

[permalink] [raw]
Subject: [PATCH 2/2] sched: fix race in cpupri introduced by cpumask_var changes

[
Note: obsoletes c8cab1ddfafd112c82a69a6cc4ec608984eeac97,
though it should be harmless to leave the old patch applied.
]

Background:

Several race conditions in the scheduler have cropped up recently,
which Steven and I have tracked down using ftrace. The most recent
one turns out to be a race in how the scheduler determines a
suitable migration target for RT tasks, introduced with commit:

commit 68e74568fbe5854952355e942acca51f138096d9
Author: Rusty Russell <[email protected]>
Date: Tue Nov 25 02:35:13 2008 +1030

sched: convert struct cpupri_vec cpumask_var_t.

The original design of cpupri allowed lockless readers to quickly
determine a best-estimate target. Races between the pri_active
bitmap and the vec->mask were handled in the original code by
detecting them and returning "0" when they occurred. The design was
predicated on the *effective* atomicity (*) of caching the result
of cpus_and() between the cpus_allowed and the vec->mask.

Commit 68e74568 changed the behavior such that vec->mask is accessed
multiple times. This introduces a subtle race whereby we can get a
return value of "1" yet be left with an empty bitmap.

*) Yes, we know cpus_and() is not a locked operation across the entire
composite array, but it is implicitly atomic on a per-word basis,
which is all the design required in order to work.
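
To make the window concrete, here is a sketch of one possible
interleaving (illustration only; the reader side follows the
cpupri_find() flow in the diff below, the writer side stands for a
concurrent update clearing the last matching cpu from vec->mask):

/*
 *   reader: cpupri_find()                writer: concurrent update
 *   ---------------------                -------------------------
 *   cpumask_any_and(&p->cpus_allowed,
 *                   vec->mask)           -> finds a cpu, no "continue"
 *
 *                                        cpumask_clear_cpu(cpu,
 *                                                          vec->mask)
 *
 *   cpumask_and(lowest_mask,
 *               &p->cpus_allowed,
 *               vec->mask)               -> lowest_mask ends up empty
 *
 *   return 1;                            -> "success", empty bitmap
 */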

Implementation:

Rather than forgoing the lockless design or reverting to a stack-based
cpumask_t, we simply detect when the race has been hit and, in that
case, continue on to the next priority level. This renders the removal
race as though the priority bit had been atomically cleared as well,
and allows the algorithm to execute correctly.

Signed-off-by: Gregory Haskins <[email protected]>
CC: Rusty Russell <[email protected]>
CC: Steven Rostedt <[email protected]>
---

kernel/sched_cpupri.c | 15 ++++++++++++++-
1 files changed, 14 insertions(+), 1 deletions(-)

diff --git a/kernel/sched_cpupri.c b/kernel/sched_cpupri.c
index aef68b6..0f052fc 100644
--- a/kernel/sched_cpupri.c
+++ b/kernel/sched_cpupri.c
@@ -81,8 +81,21 @@ int cpupri_find(struct cpupri *cp, struct task_struct *p,
if (cpumask_any_and(&p->cpus_allowed, vec->mask) >= nr_cpu_ids)
continue;

- if (lowest_mask)
+ if (lowest_mask) {
cpumask_and(lowest_mask, &p->cpus_allowed, vec->mask);
+
+ /*
+ * We have to ensure that we have at least one bit
+ * still set in the array, since the map could have
+ * been concurrently emptied between the first and
+ * second reads of vec->mask. If we hit this
+ * condition, simply act as though we never hit this
+ * priority level and continue on.
+ */
+ if (cpumask_any(lowest_mask) >= nr_cpu_ids)
+ continue;
+ }
+
return 1;
}

2009-07-30 15:01:54

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH 1/2] [RESEND] sched: Fully integrate cpus_active_map and root-domain code

Gregory Haskins wrote:
> (Applies to 2.6.31-rc4)
>
> [
> This patch was originally sent about a year ago, but got dropped,
> presumably by accident. Here is the original thread.
>
> http://lkml.org/lkml/2008/7/22/281
>
> At that time, Peter and Max acked it. It has now been forward
> ported to the new cpumask interface. I will be so bold as to
> carry their acks forward since the basic logic is the same.
> However, a new acknowledgement, if they have the time to review,
> would be ideal.
>
> I have tested this patch on a 4-way system using Max's recommended
> "echo 0|1 > cpu1/online" technique and it appears to work properly
> ]
>
> What: Reflect "active" cpus in the rq->rd->online field, instead of the
> online_map.
>
> Motivation: Things that use the root-domain code (such as cpupri) only
> care about cpus classified as "active" anyway. By synchronizing the
> root-domain state with the active map, we allow several optimizations.
>
> For instance, we can remove an extra cpumask_and from the scheduler
> hotpath by utilizing rq->rd->online (since it is now a cached version
> of cpu_active_map & rq->rd->span).
>
> Signed-off-by: Gregory Haskins <[email protected]>
> Acked-by: Peter Zijlstra <[email protected]>
> Acked-by: Max Krasnyansky <[email protected]>
> ---
>
> kernel/sched.c | 2 +-
> kernel/sched_fair.c | 10 +++++++---
> kernel/sched_rt.c | 7 -------
> 3 files changed, 8 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 1a104e6..38a1526 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -7874,7 +7874,7 @@ static void rq_attach_root(struct rq *rq, struct root_domain *rd)
> rq->rd = rd;
>
> cpumask_set_cpu(rq->cpu, rd->span);
> - if (cpumask_test_cpu(rq->cpu, cpu_online_mask))
> + if (cpumask_test_cpu(rq->cpu, cpu_active_mask))
> set_rq_online(rq);
>
> spin_unlock_irqrestore(&rq->lock, flags);
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 9ffb2b2..2b9cae6 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -1040,17 +1040,21 @@ static void yield_task_fair(struct rq *rq)
> * search starts with cpus closest then further out as needed,
> * so we always favor a closer, idle cpu.
> * Domains may include CPUs that are not usable for migration,
> - * hence we need to mask them out (cpu_active_mask)
> + * hence we need to mask them out (rq->rd->online)
> *
> * Returns the CPU we should wake onto.
> */
> #if defined(ARCH_HAS_SCHED_WAKE_IDLE)
> +
> +#define cpu_rd_active(cpu, rq) cpumask_test_cpu(cpu, rq->rd->online)
> +
> static int wake_idle(int cpu, struct task_struct *p)
> {
> struct sched_domain *sd;
> int i;
> unsigned int chosen_wakeup_cpu;
> int this_cpu;
> + struct rq *task_rq = task_rq(p);
>
> /*
> * At POWERSAVINGS_BALANCE_WAKEUP level, if both this_cpu and prev_cpu
> @@ -1083,10 +1087,10 @@ static int wake_idle(int cpu, struct task_struct *p)
> for_each_domain(cpu, sd) {
> if ((sd->flags & SD_WAKE_IDLE)
> || ((sd->flags & SD_WAKE_IDLE_FAR)
> - && !task_hot(p, task_rq(p)->clock, sd))) {
> + && !task_hot(p, task_rq->clock, sd))) {
> for_each_cpu_and(i, sched_domain_span(sd),
> &p->cpus_allowed) {

Hmm, something got suboptimal in translation from the original patch.

This would be better expressed as:

for_each_cpu_and(i, rq->rd->online, &p->cpus_allowed) {
if (idle_cpu(i)...
}

> - if (cpu_active(i) && idle_cpu(i)) {
> + if (cpu_rd_active(i, task_rq) && idle_cpu(i)) {
> if (i != task_cpu(p)) {
> schedstat_inc(p,
> se.nr_wakeups_idle);
> diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
> index a8f89bc..13f728e 100644
> --- a/kernel/sched_rt.c
> +++ b/kernel/sched_rt.c
> @@ -1173,13 +1173,6 @@ static int find_lowest_rq(struct task_struct *task)
> return -1; /* No targets found */
>
> /*
> - * Only consider CPUs that are usable for migration.
> - * I guess we might want to change cpupri_find() to ignore those
> - * in the first place.
> - */
> - cpumask_and(lowest_mask, lowest_mask, cpu_active_mask);
> -
> - /*
> * At this point we have built a mask of cpus representing the
> * lowest priority tasks in the system. Now we want to elect
> * the best one based on our affinity and topology.
>

2009-07-30 15:10:57

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH 1/2] [RESEND] sched: Fully integrate cpus_active_map and root-domain code

Gregory Haskins wrote:
> Gregory Haskins wrote:
>> (Applies to 2.6.31-rc4)
>>
>> [
>> This patch was originally sent about a year ago, but got dropped,
>> presumably by accident. Here is the original thread.
>>
>> http://lkml.org/lkml/2008/7/22/281
>>
>> At that time, Peter and Max acked it. It has now been forward
>> ported to the new cpumask interface. I will be so bold as to
>> carry their acks forward since the basic logic is the same.
>> However, a new acknowledgement, if they have the time to review,
>> would be ideal.
>>
>> I have tested this patch on a 4-way system using Max's recommended
>> "echo 0|1 > cpu1/online" technique and it appears to work properly
>> ]
>>
>> What: Reflect "active" cpus in the rq->rd->online field, instead of the
>> online_map.
>>
>> Motivation: Things that use the root-domain code (such as cpupri) only
>> care about cpus classified as "active" anyway. By synchronizing the
>> root-domain state with the active map, we allow several optimizations.
>>
>> For instance, we can remove an extra cpumask_and from the scheduler
>> hotpath by utilizing rq->rd->online (since it is now a cached version
>> of cpu_active_map & rq->rd->span).
>>
>> Signed-off-by: Gregory Haskins <[email protected]>
>> Acked-by: Peter Zijlstra <[email protected]>
>> Acked-by: Max Krasnyansky <[email protected]>
>> ---
>>
>> kernel/sched.c | 2 +-
>> kernel/sched_fair.c | 10 +++++++---
>> kernel/sched_rt.c | 7 -------
>> 3 files changed, 8 insertions(+), 11 deletions(-)
>>
>> diff --git a/kernel/sched.c b/kernel/sched.c
>> index 1a104e6..38a1526 100644
>> --- a/kernel/sched.c
>> +++ b/kernel/sched.c
>> @@ -7874,7 +7874,7 @@ static void rq_attach_root(struct rq *rq, struct root_domain *rd)
>> rq->rd = rd;
>>
>> cpumask_set_cpu(rq->cpu, rd->span);
>> - if (cpumask_test_cpu(rq->cpu, cpu_online_mask))
>> + if (cpumask_test_cpu(rq->cpu, cpu_active_mask))
>> set_rq_online(rq);
>>
>> spin_unlock_irqrestore(&rq->lock, flags);
>> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
>> index 9ffb2b2..2b9cae6 100644
>> --- a/kernel/sched_fair.c
>> +++ b/kernel/sched_fair.c
>> @@ -1040,17 +1040,21 @@ static void yield_task_fair(struct rq *rq)
>> * search starts with cpus closest then further out as needed,
>> * so we always favor a closer, idle cpu.
>> * Domains may include CPUs that are not usable for migration,
>> - * hence we need to mask them out (cpu_active_mask)
>> + * hence we need to mask them out (rq->rd->online)
>> *
>> * Returns the CPU we should wake onto.
>> */
>> #if defined(ARCH_HAS_SCHED_WAKE_IDLE)
>> +
>> +#define cpu_rd_active(cpu, rq) cpumask_test_cpu(cpu, rq->rd->online)
>> +
>> static int wake_idle(int cpu, struct task_struct *p)
>> {
>> struct sched_domain *sd;
>> int i;
>> unsigned int chosen_wakeup_cpu;
>> int this_cpu;
>> + struct rq *task_rq = task_rq(p);
>>
>> /*
>> * At POWERSAVINGS_BALANCE_WAKEUP level, if both this_cpu and prev_cpu
>> @@ -1083,10 +1087,10 @@ static int wake_idle(int cpu, struct task_struct *p)
>> for_each_domain(cpu, sd) {
>> if ((sd->flags & SD_WAKE_IDLE)
>> || ((sd->flags & SD_WAKE_IDLE_FAR)
>> - && !task_hot(p, task_rq(p)->clock, sd))) {
>> + && !task_hot(p, task_rq->clock, sd))) {
>> for_each_cpu_and(i, sched_domain_span(sd),
>> &p->cpus_allowed) {
>
> Hmm, something got suboptimal in translation from the original patch.
>
> This would be better expressed as:
>
> for_each_cpu_and(i, rq->rd->online, &p->cpus_allowed) {
> if (idle_cpu(i)...
> }


NM. My first instinct was correct.

We need the result of sd->span, cpus_allowed, and active_map. The
submitted logic (or the original) is correct, and my comment above is wrong.
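
For reference, a sketch using the names from the submitted patch: the
iterator already intersects sd->span with p->cpus_allowed, and the
per-cpu test adds the active map via the cached rq->rd->online, so all
three constraints are honored.

for_each_cpu_and(i, sched_domain_span(sd), &p->cpus_allowed) {
        if (cpu_rd_active(i, task_rq) && idle_cpu(i)) {
                /* candidate idle cpu within this domain */
        }
}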

Sorry for the noise,
-Greg

