2008-08-25 20:17:47

by Gregory Haskins

Subject: [PATCH 0/5] sched: misc rt fixes for tip/sched/devel

Hi Ingo,

The following repositories

git://git.kernel.org/pub/scm/linux/kernel/git/ghaskins/linux-2.6-hacks.git tip/sched/devel/rtbalance

ftp://ftp.novell.com/dev/ghaskins/tip-rt-fixes.tar.bz2

apply to tip/sched/devel to pick up the following fixes for the RT
scheduler.

---

Gregory Haskins (5):
sched: create "pushable_tasks" list to limit pushing to one attempt
sched: add sched_class->needs_post_schedule() member
sched: make double-lock-balance fair
sched: pull only one task during NEWIDLE balancing to limit critical section
sched: only try to push a task on wakeup if it is migratable


include/linux/init_task.h | 1
include/linux/sched.h | 2 +
kernel/sched.c | 37 +++++++-----
kernel/sched_rt.c | 139 ++++++++++++++++++++++++++++++++++++++-------
4 files changed, 143 insertions(+), 36 deletions(-)

--

These patches were developed in the interest of shortening latencies in
PREEMPT_RT, but they apply to the mainline scheduler as well, so I am
offering them here first.

This has been tested under both 26.3-rt3 and tip/sched/devel on x86_64 for
both CONFIG_SMP and !CONFIG_SMP.

Comments/bug-fixes welcome!

Regards,
-Greg


2008-08-25 20:18:04

by Gregory Haskins

Subject: [PATCH 1/5] sched: only try to push a task on wakeup if it is migratable

There is no sense in wasting time trying to push a task away that
cannot move anywhere else. We gain no benefit from trying to push
other tasks at this point, so if the task being woken up is
non-migratable, just skip the whole operation. This reduces overhead
in the wakeup path for certain tasks.

Signed-off-by: Gregory Haskins <[email protected]>
---

kernel/sched_rt.c | 8 +++++---
1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 6163e4c..196c024 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -1229,14 +1229,16 @@ static void post_schedule_rt(struct rq *rq)
}

/*
- * If we are not running and we are not going to reschedule soon, we should
- * try to push tasks away now
+ * If we are not running, we are not going to reschedule soon,
+ * and we have tasks that could be pushed, we should try to push tasks
+ * away now
*/
static void task_wake_up_rt(struct rq *rq, struct task_struct *p)
{
if (!task_running(rq, p) &&
!test_tsk_need_resched(rq->curr) &&
- rq->rt.overloaded)
+ rq->rt.overloaded &&
+ p->rt.nr_cpus_allowed > 1)
push_rt_tasks(rq);
}

2008-08-25 20:18:30

by Gregory Haskins

Subject: [PATCH 2/5] sched: pull only one task during NEWIDLE balancing to limit critical section

git-id c4acb2c0669c5c5c9b28e9d02a34b5c67edf7092 attempted to limit
newidle critical section length by stopping after at least one task
was moved. Further investigation has shown that other paths nested
deeper inside the algorithm still allow long latencies to occur
during newidle balancing. This patch applies
the same technique inside balance_tasks() to limit the duration of
this optional balancing operation.

Signed-off-by: Gregory Haskins <[email protected]>
CC: Nick Piggin <[email protected]>
---

kernel/sched.c | 6 ++++--
1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 6a43c89..6e0bde6 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2963,9 +2963,11 @@ next:
rem_load_move -= p->se.load.weight;

/*
- * We only want to steal up to the prescribed amount of weighted load.
+ * We only want to steal up to the prescribed amount of weighted load
+ * and we don't want to take more than one task if we are NEWLY idle
+ * since this can cause unbounded latencies
*/
- if (rem_load_move > 0) {
+ if (idle != CPU_NEWLY_IDLE && rem_load_move > 0) {
if (p->prio < *this_best_prio)
*this_best_prio = p->prio;
p = iterator->next(iterator->arg);

2008-08-25 20:19:00

by Gregory Haskins

Subject: [PATCH 4/5] sched: add sched_class->needs_post_schedule() member

We currently run class->post_schedule() outside of the rq->lock, which
means that we need to test for the need to post_schedule outside of
the lock to avoid a forced reacquisition. This is currently not a problem
as we only look at rq->rt.overloaded. However, we want to enhance this
going forward to look at more state to reduce the need to post_schedule
to a bare minimum set. Therefore, we introduce a new member function,
needs_post_schedule(), which tests for the post_schedule condition
without actually performing the work. It is safe to call this
function before the rq->lock is released, because we are guaranteed not
to drop the lock at an intermediate point (such as what post_schedule()
may do).

We will use this later in the series.

Signed-off-by: Gregory Haskins <[email protected]>
---

include/linux/sched.h | 1 +
kernel/sched.c | 10 +++++++++-
kernel/sched_rt.c | 24 ++++++++++++++----------
3 files changed, 24 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 08a87b5..cf8cd8c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -912,6 +912,7 @@ struct sched_class {
struct rq *busiest, struct sched_domain *sd,
enum cpu_idle_type idle);
void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
+ int (*needs_post_schedule) (struct rq *this_rq);
void (*post_schedule) (struct rq *this_rq);
void (*task_wake_up) (struct rq *this_rq, struct task_struct *task);
#endif
diff --git a/kernel/sched.c b/kernel/sched.c
index b7326cd..78fae75 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2525,6 +2525,14 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
{
struct mm_struct *mm = rq->prev_mm;
long prev_state;
+#ifdef CONFIG_SMP
+ int post_schedule = 0;
+
+ if (current->sched_class->needs_post_schedule) {
+ BUG_ON(!current->sched_class->post_schedule);
+ post_schedule = current->sched_class->needs_post_schedule(rq);
+ }
+#endif

rq->prev_mm = NULL;

@@ -2543,7 +2551,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
finish_arch_switch(prev);
finish_lock_switch(rq, prev);
#ifdef CONFIG_SMP
- if (current->sched_class->post_schedule)
+ if (post_schedule)
current->sched_class->post_schedule(rq);
#endif

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 196c024..a3d695b 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -1212,20 +1212,23 @@ static void pre_schedule_rt(struct rq *rq, struct task_struct *prev)
pull_rt_task(rq);
}

+/*
+ * assumes rq->lock is held
+ */
+static int needs_post_schedule_rt(struct rq *rq)
+{
+ return rq->rt.overloaded ? 1 : 0;
+}
+
static void post_schedule_rt(struct rq *rq)
{
/*
- * If we have more than one rt_task queued, then
- * see if we can push the other rt_tasks off to other CPUS.
- * Note we may release the rq lock, and since
- * the lock was owned by prev, we need to release it
- * first via finish_lock_switch and then reaquire it here.
+ * This is only called if needs_post_schedule_rt() indicates that
+ * we need to push tasks away
*/
- if (unlikely(rq->rt.overloaded)) {
- spin_lock_irq(&rq->lock);
- push_rt_tasks(rq);
- spin_unlock_irq(&rq->lock);
- }
+ spin_lock_irq(&rq->lock);
+ push_rt_tasks(rq);
+ spin_unlock_irq(&rq->lock);
}

/*
@@ -1473,6 +1476,7 @@ static const struct sched_class rt_sched_class = {
.rq_online = rq_online_rt,
.rq_offline = rq_offline_rt,
.pre_schedule = pre_schedule_rt,
+ .needs_post_schedule = needs_post_schedule_rt,
.post_schedule = post_schedule_rt,
.task_wake_up = task_wake_up_rt,
.switched_from = switched_from_rt,

2008-08-25 20:18:44

by Gregory Haskins

Subject: [PATCH 3/5] sched: make double-lock-balance fair

double_lock_balance() currently favors logically lower cpus since they
often do not have to release their own lock to acquire a second lock.
The result is that logically higher cpus can get starved when there is
a lot of pressure on the RQs. This can result in higher latencies on
higher cpu-ids.

This patch makes the algorithm more fair by forcing all paths to have
to release both locks before acquiring them again. Since callsites to
double_lock_balance already consider it a potential preemption/reschedule
point, they have the proper logic to recheck for atomicity violations.

Signed-off-by: Gregory Haskins <[email protected]>
---

kernel/sched.c | 17 +++++------------
1 files changed, 5 insertions(+), 12 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 6e0bde6..b7326cd 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2790,23 +2790,16 @@ static int double_lock_balance(struct rq *this_rq, struct rq *busiest)
__acquires(busiest->lock)
__acquires(this_rq->lock)
{
- int ret = 0;
-
if (unlikely(!irqs_disabled())) {
/* printk() doesn't work good under rq->lock */
spin_unlock(&this_rq->lock);
BUG_ON(1);
}
- if (unlikely(!spin_trylock(&busiest->lock))) {
- if (busiest < this_rq) {
- spin_unlock(&this_rq->lock);
- spin_lock(&busiest->lock);
- spin_lock_nested(&this_rq->lock, SINGLE_DEPTH_NESTING);
- ret = 1;
- } else
- spin_lock_nested(&busiest->lock, SINGLE_DEPTH_NESTING);
- }
- return ret;
+
+ spin_unlock(&this_rq->lock);
+ double_rq_lock(this_rq, busiest);
+
+ return 1;
}

static void double_unlock_balance(struct rq *this_rq, struct rq *busiest)

2008-08-25 20:19:22

by Gregory Haskins

Subject: [PATCH 5/5] sched: create "pushable_tasks" list to limit pushing to one attempt

The RT scheduler employs a "push/pull" design to actively balance tasks
within the system (on a per disjoint cpuset basis). When a task is
awoken, it is immediately determined if there are any lower priority
cpus which should be preempted. This is opposed to the way normal
SCHED_OTHER tasks behave, which will wait for a periodic rebalancing
operation to occur before spreading out load.

When a particular RQ has more than 1 active RT task, it is said to
be in an "overloaded" state. Once this occurs, the system enters
the active balancing mode, where it will try to push the task away,
or persuade a different cpu to pull it over. The system will stay
in this state until the system falls back to <= 1 queued RT
task per RQ.

However, the current implementation suffers from a limitation in the
push logic. Once overloaded, all tasks (other than current) on the
RQ are analyzed on every push operation, even if they were previously
unpushable (due to affinity, etc). What's more, the operation stops
at the first task that is unpushable and will not look at items
lower in the queue. This causes two problems:

1) We can have the same tasks analyzed over and over again during each
push, which extends out the fast path in the scheduler for no
gain. Consider an RQ that has dozens of tasks that are bound to a
core. Each one of those tasks will be encountered and skipped
for each push operation while they are queued.

2) There may be lower-priority tasks under the unpushable task that
could have been successfully pushed, but will never be considered
until either the unpushable task is cleared, or a pull operation
succeeds. The net result is a potential latency source for
mid-priority tasks.

This patch aims to rectify these two conditions by introducing a new
priority-sorted list: "pushable_tasks". A task is added to the list
each time it is activated or preempted. It is removed from the
list any time it is deactivated, made current, or fails to push.

This works because a push only needs to be attempted once per task.
After an initial failure to push, the other cpus will eventually try to
pull the task when the conditions are proper. This also solves the
problem that we don't completely analyze all tasks due to encountering
an unpushable task. Now every task will have a push attempted (when
appropriate).

This reduces latency both by shortening the critical section of the
rq->lock for certain workloads, and by making sure the algorithm
considers all eligible tasks in the system.

Signed-off-by: Gregory Haskins <[email protected]>
CC: Steven Rostedt <[email protected]>
---

include/linux/init_task.h | 1
include/linux/sched.h | 1
kernel/sched.c | 4 ++
kernel/sched_rt.c | 111 +++++++++++++++++++++++++++++++++++++++++----
4 files changed, 107 insertions(+), 10 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 021d8e7..7f910f4 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -140,6 +140,7 @@ extern struct group_info init_groups;
.nr_cpus_allowed = NR_CPUS, \
}, \
.tasks = LIST_HEAD_INIT(tsk.tasks), \
+ .pushable_tasks = PLIST_NODE_INIT(tsk.pushable_tasks, MAX_PRIO), \
.ptraced = LIST_HEAD_INIT(tsk.ptraced), \
.ptrace_entry = LIST_HEAD_INIT(tsk.ptrace_entry), \
.real_parent = &tsk, \
diff --git a/include/linux/sched.h b/include/linux/sched.h
index cf8cd8c..3480c8a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1078,6 +1078,7 @@ struct task_struct {
#endif

struct list_head tasks;
+ struct plist_node pushable_tasks;

struct mm_struct *mm, *active_mm;

diff --git a/kernel/sched.c b/kernel/sched.c
index 78fae75..6ac9aa8 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -447,6 +447,7 @@ struct rt_rq {
#ifdef CONFIG_SMP
unsigned long rt_nr_migratory;
int overloaded;
+ struct plist_head pushable_tasks;
#endif
int rt_throttled;
u64 rt_time;
@@ -2383,6 +2384,8 @@ void sched_fork(struct task_struct *p, int clone_flags)
/* Want to start with kernel preemption disabled. */
task_thread_info(p)->preempt_count = 1;
#endif
+ plist_node_init(&p->pushable_tasks, MAX_PRIO);
+
put_cpu();
}

@@ -7846,6 +7849,7 @@ static void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
#ifdef CONFIG_SMP
rt_rq->rt_nr_migratory = 0;
rt_rq->overloaded = 0;
+ plist_head_init(&rq->rt.pushable_tasks, &rq->lock);
#endif

rt_rq->rt_time = 0;
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index a3d695b..19cbbe4 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -49,6 +49,25 @@ static void update_rt_migration(struct rq *rq)
rq->rt.overloaded = 0;
}
}
+
+static void enqueue_pushable_task(struct rq *rq, struct task_struct *p)
+{
+ BUG_ON(!plist_node_empty(&p->pushable_tasks));
+
+ plist_node_init(&p->pushable_tasks, p->prio);
+ plist_add(&p->pushable_tasks, &rq->rt.pushable_tasks);
+}
+
+static void dequeue_pushable_task(struct rq *rq, struct task_struct *p)
+{
+ plist_del(&p->pushable_tasks, &rq->rt.pushable_tasks);
+}
+
+#else
+
+#define enqueue_pushable_task(rq, p) do { } while (0)
+#define dequeue_pushable_task(rq, p) do { } while (0)
+
#endif /* CONFIG_SMP */

static inline struct task_struct *rt_task_of(struct sched_rt_entity *rt_se)
@@ -669,6 +688,9 @@ static void enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup)

enqueue_rt_entity(rt_se);

+ if (!task_current(rq, p) && p->rt.nr_cpus_allowed > 1)
+ enqueue_pushable_task(rq, p);
+
inc_cpu_load(rq, p->se.load.weight);
}

@@ -679,6 +701,8 @@ static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int sleep)
update_curr_rt(rq);
dequeue_rt_entity(rt_se);

+ dequeue_pushable_task(rq, p);
+
dec_cpu_load(rq, p->se.load.weight);
}

@@ -824,7 +848,7 @@ static struct sched_rt_entity *pick_next_rt_entity(struct rq *rq,
return next;
}

-static struct task_struct *pick_next_task_rt(struct rq *rq)
+static struct task_struct *_pick_next_task_rt(struct rq *rq)
{
struct sched_rt_entity *rt_se;
struct task_struct *p;
@@ -846,6 +870,18 @@ static struct task_struct *pick_next_task_rt(struct rq *rq)

p = rt_task_of(rt_se);
p->se.exec_start = rq->clock;
+
+ return p;
+}
+
+static struct task_struct *pick_next_task_rt(struct rq *rq)
+{
+ struct task_struct *p = _pick_next_task_rt(rq);
+
+ /* The running task is never eligible for pushing */
+ if (p)
+ dequeue_pushable_task(rq, p);
+
return p;
}

@@ -853,6 +889,13 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
{
update_curr_rt(rq);
p->se.exec_start = 0;
+
+ /*
+ * The previous task needs to be made eligible for pushing
+ * if it is still active
+ */
+ if (p->se.on_rq && p->rt.nr_cpus_allowed > 1)
+ enqueue_pushable_task(rq, p);
}

#ifdef CONFIG_SMP
@@ -1031,6 +1074,28 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
return lowest_rq;
}

+static inline int has_pushable_tasks(struct rq *rq)
+{
+ return !plist_head_empty(&rq->rt.pushable_tasks);
+}
+
+static struct task_struct *pick_next_pushable_task(struct rq *rq)
+{
+ struct task_struct *p;
+
+ if (!has_pushable_tasks(rq))
+ return NULL;
+
+ p = plist_first_entry(&rq->rt.pushable_tasks,
+ struct task_struct, pushable_tasks);
+
+ BUG_ON(rq->cpu != task_cpu(p));
+ BUG_ON(task_current(rq, p));
+ BUG_ON(p->rt.nr_cpus_allowed <= 1);
+
+ return p;
+}
+
/*
* If the current CPU has more than one RT task, see if the non
* running task can migrate over to a CPU that is running a task
@@ -1040,13 +1105,12 @@ static int push_rt_task(struct rq *rq)
{
struct task_struct *next_task;
struct rq *lowest_rq;
- int ret = 0;
int paranoid = RT_MAX_TRIES;

if (!rq->rt.overloaded)
return 0;

- next_task = pick_next_highest_task_rt(rq, -1);
+ next_task = pick_next_pushable_task(rq);
if (!next_task)
return 0;

@@ -1078,12 +1142,19 @@ static int push_rt_task(struct rq *rq)
* so it is possible that next_task has changed.
* If it has, then try again.
*/
- task = pick_next_highest_task_rt(rq, -1);
+ task = pick_next_pushable_task(rq);
if (unlikely(task != next_task) && task && paranoid--) {
put_task_struct(next_task);
next_task = task;
goto retry;
}
+
+ /*
+ * Once we have failed to push this task, we will not
+ * try again, since the other cpus will pull from us
+ * when they are ready
+ */
+ dequeue_pushable_task(rq, next_task);
goto out;
}

@@ -1095,11 +1166,10 @@ static int push_rt_task(struct rq *rq)

double_unlock_balance(rq, lowest_rq);

- ret = 1;
out:
put_task_struct(next_task);

- return ret;
+ return 1;
}

/*
@@ -1128,7 +1198,7 @@ static int pull_rt_task(struct rq *this_rq)
if (likely(!rt_overloaded(this_rq)))
return 0;

- next = pick_next_task_rt(this_rq);
+ next = _pick_next_task_rt(this_rq);

for_each_cpu_mask_nr(cpu, this_rq->rd->rto_mask) {
if (this_cpu == cpu)
@@ -1145,7 +1215,7 @@ static int pull_rt_task(struct rq *this_rq)
if (double_lock_balance(this_rq, src_rq)) {
struct task_struct *old_next = next;

- next = pick_next_task_rt(this_rq);
+ next = _pick_next_task_rt(this_rq);
if (next != old_next)
ret = 1;
}
@@ -1217,7 +1287,7 @@ static void pre_schedule_rt(struct rq *rq, struct task_struct *prev)
*/
static int needs_post_schedule_rt(struct rq *rq)
{
- return rq->rt.overloaded ? 1 : 0;
+ return has_pushable_tasks(rq);
}

static void post_schedule_rt(struct rq *rq)
@@ -1240,7 +1310,7 @@ static void task_wake_up_rt(struct rq *rq, struct task_struct *p)
{
if (!task_running(rq, p) &&
!test_tsk_need_resched(rq->curr) &&
- rq->rt.overloaded &&
+ has_pushable_tasks(rq) &&
p->rt.nr_cpus_allowed > 1)
push_rt_tasks(rq);
}
@@ -1277,6 +1347,24 @@ static void set_cpus_allowed_rt(struct task_struct *p,
if (p->se.on_rq && (weight != p->rt.nr_cpus_allowed)) {
struct rq *rq = task_rq(p);

+ if (!task_current(rq, p)) {
+ /*
+ * Make sure we dequeue this task from the pushable list
+ * before going further. It will either remain off of
+ * the list because we are no longer pushable, or it
+ * will be requeued.
+ */
+ if (p->rt.nr_cpus_allowed > 1)
+ dequeue_pushable_task(rq, p);
+
+ /*
+ * Requeue if our weight is changing and still > 1
+ */
+ if (weight > 1)
+ enqueue_pushable_task(rq, p);
+
+ }
+
if ((p->rt.nr_cpus_allowed <= 1) && (weight > 1)) {
rq->rt.rt_nr_migratory++;
} else if ((p->rt.nr_cpus_allowed > 1) && (weight <= 1)) {
@@ -1453,6 +1541,9 @@ static void set_curr_task_rt(struct rq *rq)
struct task_struct *p = rq->curr;

p->se.exec_start = rq->clock;
+
+ /* The running task is never eligible for pushing */
+ dequeue_pushable_task(rq, p);
}

static const struct sched_class rt_sched_class = {

2008-08-26 06:15:10

by Nick Piggin

Subject: Re: [PATCH 3/5] sched: make double-lock-balance fair

On Tuesday 26 August 2008 06:15, Gregory Haskins wrote:
> double_lock_balance() currently favors logically lower cpus since they
> often do not have to release their own lock to acquire a second lock.
> The result is that logically higher cpus can get starved when there is
> a lot of pressure on the RQs. This can result in higher latencies on
> higher cpu-ids.
>
> This patch makes the algorithm more fair by forcing all paths to have
> to release both locks before acquiring them again. Since callsites to
> double_lock_balance already consider it a potential preemption/reschedule
> point, they have the proper logic to recheck for atomicity violations.
>
> Signed-off-by: Gregory Haskins <[email protected]>
> ---
>
> kernel/sched.c | 17 +++++------------
> 1 files changed, 5 insertions(+), 12 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 6e0bde6..b7326cd 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -2790,23 +2790,16 @@ static int double_lock_balance(struct rq *this_rq,
> struct rq *busiest) __acquires(busiest->lock)
> __acquires(this_rq->lock)
> {
> - int ret = 0;
> -
> if (unlikely(!irqs_disabled())) {
> /* printk() doesn't work good under rq->lock */
> spin_unlock(&this_rq->lock);
> BUG_ON(1);
> }
> - if (unlikely(!spin_trylock(&busiest->lock))) {
> - if (busiest < this_rq) {
> - spin_unlock(&this_rq->lock);
> - spin_lock(&busiest->lock);
> - spin_lock_nested(&this_rq->lock, SINGLE_DEPTH_NESTING);
> - ret = 1;
> - } else
> - spin_lock_nested(&busiest->lock, SINGLE_DEPTH_NESTING);
> - }
> - return ret;
> +
> + spin_unlock(&this_rq->lock);
> + double_rq_lock(this_rq, busiest);

Rather than adding the extra atomic operation, can't you just put this
into the unlikely spin_trylock failure path rather than the unfair logic
there?
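
For concreteness, a minimal sketch of that suggestion (illustrative only,
not a patch from this thread; it reuses the existing spin_trylock and
double_rq_lock helpers): keep the trylock fast path, and only fall back
to dropping this_rq->lock and taking both locks when the trylock fails.

/*
 * Illustrative sketch of the suggestion above -- not a patch from this
 * thread: keep the trylock fast path, and pay the "drop and retake
 * both" cost only on the unlikely trylock failure.
 */
static int double_lock_balance(struct rq *this_rq, struct rq *busiest)
        __releases(this_rq->lock)
        __acquires(busiest->lock)
        __acquires(this_rq->lock)
{
        if (likely(spin_trylock(&busiest->lock)))
                return 0;       /* got busiest without dropping this_rq->lock */

        spin_unlock(&this_rq->lock);
        double_rq_lock(this_rq, busiest);

        return 1;       /* this_rq->lock was dropped; callers must recheck */
}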

FWIW, this is always going to be a *tiny* bit unfair, because of double
rq lock taking the lower lock first. I guess to fix that you need to just
have a single lock to take before taking 2 rq locks. But that's not
really appropriate for mainline (and maybe not -rt either).
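
A purely hypothetical sketch of that "single lock" idea (the serializer
lock and function name are made up; nothing like this is proposed in the
series): every double-lock attempt funnels through one outer lock, so
service order no longer depends on cpu numbering, at the cost of a
global contention point.

/*
 * Hypothetical sketch only -- the "single lock" idea mentioned above.
 * The outer lock serializes all double-lock attempts, so ordering no
 * longer favors lower cpus, but it also becomes a global hot spot.
 */
static DEFINE_SPINLOCK(double_lock_serializer);

static int serialized_double_lock_balance(struct rq *this_rq, struct rq *busiest)
        __releases(this_rq->lock)
        __acquires(busiest->lock)
        __acquires(this_rq->lock)
{
        spin_unlock(&this_rq->lock);

        spin_lock(&double_lock_serializer);
        double_rq_lock(this_rq, busiest);
        spin_unlock(&double_lock_serializer);

        return 1;       /* this_rq->lock was dropped; callers must recheck */
}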

2008-08-26 06:21:50

by Nick Piggin

Subject: Re: [PATCH 2/5] sched: pull only one task during NEWIDLE balancing to limit critical section

On Tuesday 26 August 2008 06:15, Gregory Haskins wrote:
> git-id c4acb2c0669c5c5c9b28e9d02a34b5c67edf7092 attempted to limit
> newidle critical section length by stopping after at least one task
> was moved. Further investigation has shown that there are other
> paths nested further inside the algorithm which still remain that allow
> long latencies to occur with newidle balancing. This patch applies
> the same technique inside balance_tasks() to limit the duration of
> this optional balancing operation.
>
> Signed-off-by: Gregory Haskins <[email protected]>
> CC: Nick Piggin <[email protected]>

Hmm, this (and c4acb2c0669c5c5c9b28e9d02a34b5c67edf7092) still could
increase the amount of work to do significantly for workloads where
the CPU is going idle and pulling tasks over frequently. I don't
really like either of them too much.

Maybe increasing the limit would effectively amortize most of the
problem (say, limit to move 16 tasks at most).

2008-08-26 11:39:28

by Gregory Haskins

Subject: Re: [PATCH 2/5] sched: pull only one task during NEWIDLE balancing to limit critical section

Nick Piggin wrote:
> On Tuesday 26 August 2008 06:15, Gregory Haskins wrote:
>
>> git-id c4acb2c0669c5c5c9b28e9d02a34b5c67edf7092 attempted to limit
>> newidle critical section length by stopping after at least one task
>> was moved. Further investigation has shown that there are other
>> paths nested further inside the algorithm which still remain that allow
>> long latencies to occur with newidle balancing. This patch applies
>> the same technique inside balance_tasks() to limit the duration of
>> this optional balancing operation.
>>
>> Signed-off-by: Gregory Haskins <[email protected]>
>> CC: Nick Piggin <[email protected]>
>>
>
> Hmm, this (and c4acb2c0669c5c5c9b28e9d02a34b5c67edf7092) still could
> increase the amount of work to do significantly for workloads where
> the CPU is going idle and pulling tasks over frequently. I don't
> really like either of them too much.
>

I had a feeling you might object to this patch based on your comments on
the first one. That's why I CC'd you, so you wouldn't think I was trying
to sneak something past ;)

> Maybe increasing the limit would effectively amortize most of the
> problem (say, limit to move 16 tasks at most).
>

The problem I was seeing was that even moving 2 was too many in the
ftrace traces I looked at. I think the idea of making the limit variable
(set via a sysctl, etc.) here is a good one, but I would recommend we
have the default be "1" for CONFIG_PREEMPT (or at least
CONFIG_PREEMPT_RT) based on what I know right now. I know last time
you objected to any kind of special cases for the preemptible kernels,
but I think this is a good compromise. Would this be acceptable?
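
Just to illustrate, a rough sketch of such a tunable (the sysctl name,
default, and plumbing here are made up; this is not part of any patch in
this thread): expose the limit as a variable and check it at the point
where balance_tasks() has pulled a task.

/* Hypothetical tunable; the name and default are made up. */
unsigned int sysctl_sched_newidle_nr_migrate = 1;

/* ... and inside balance_tasks(), after a task has been pulled ... */
        if (idle == CPU_NEWLY_IDLE &&
            pulled >= sysctl_sched_newidle_nr_migrate)
                goto out;       /* bound the newidle critical section */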

Thanks Nick,

-Greg



2008-08-26 12:25:48

by Gregory Haskins

Subject: Re: [PATCH 3/5] sched: make double-lock-balance fair

Nick Piggin wrote:
> On Tuesday 26 August 2008 06:15, Gregory Haskins wrote:
>
>> double_lock_balance() currently favors logically lower cpus since they
>> often do not have to release their own lock to acquire a second lock.
>> The result is that logically higher cpus can get starved when there is
>> a lot of pressure on the RQs. This can result in higher latencies on
>> higher cpu-ids.
>>
>> This patch makes the algorithm more fair by forcing all paths to have
>> to release both locks before acquiring them again. Since callsites to
>> double_lock_balance already consider it a potential preemption/reschedule
>> point, they have the proper logic to recheck for atomicity violations.
>>
>> Signed-off-by: Gregory Haskins <[email protected]>
>> ---
>>
>> kernel/sched.c | 17 +++++------------
>> 1 files changed, 5 insertions(+), 12 deletions(-)
>>
>> diff --git a/kernel/sched.c b/kernel/sched.c
>> index 6e0bde6..b7326cd 100644
>> --- a/kernel/sched.c
>> +++ b/kernel/sched.c
>> @@ -2790,23 +2790,16 @@ static int double_lock_balance(struct rq *this_rq,
>> struct rq *busiest) __acquires(busiest->lock)
>> __acquires(this_rq->lock)
>> {
>> - int ret = 0;
>> -
>> if (unlikely(!irqs_disabled())) {
>> /* printk() doesn't work good under rq->lock */
>> spin_unlock(&this_rq->lock);
>> BUG_ON(1);
>> }
>> - if (unlikely(!spin_trylock(&busiest->lock))) {
>> - if (busiest < this_rq) {
>> - spin_unlock(&this_rq->lock);
>> - spin_lock(&busiest->lock);
>> - spin_lock_nested(&this_rq->lock, SINGLE_DEPTH_NESTING);
>> - ret = 1;
>> - } else
>> - spin_lock_nested(&busiest->lock, SINGLE_DEPTH_NESTING);
>> - }
>> - return ret;
>> +
>> + spin_unlock(&this_rq->lock);
>> + double_rq_lock(this_rq, busiest);
>>
>
> Rather than adding the extra atomic operation, can't you just put this
> into the unlikely spin_trylock failure path rather than the unfair logic
> there?
>

The trick is that we *must* first release this_rq before proceeding or
the new proposal doesn't work as intended. This patch effectively
breaks up the this_rq->lock critical section evenly across all CPUs as
if it hit the case common for higher cpus. This modification decreased
latency roughly 8-fold (from > 400us to < 50us) on cpus 6 and 7 in my
8-way box, mainly because they were not forced to wait for all the other
lower cores to finish, but rather completions of double_lock_balance
were handled in true FIFO order w.r.t. other calls to
double_lock_balance(). It has to do with the positioning within your
FIFO ticket locks (though even if ticket locks are not present on a
given architecture we should still see an improvement).

When a low cpu wants to double lock, it tends to hold this_rq and gets
in line for busiest_rq with no bearing on how long it held this_rq.
Therefore the following scenario can occur:

cpu 0                            cpu N
----------------------------------------------------
rq[0] locked
..
..
..
                                 double_lock(N, 0)
                                 rq[N] released
                                 blocked on rq[0]
                                 ..
                                 ..
                                 ..
                                 ..
double_lock(0, N)
rq[N] locked
double_lock returns
..
..
..
..
rq[0] released                   rq[0] locked
                                 double_lock returns
                                 ...
                                 ...
                                 ...

----------------------------------------------------

So double lock acquisition favors the lower cpus unfairly. They will
always win, even if they were not first. Now with the combination of my
patch plus your ticket locks, entry into the double lock becomes FIFO
because the "blocked on rq[0]" would have inserted it in the
time-ordered head of rq[0]. The later call to double_lock(0, N)
performed by cpu0 would be forced to queue up behind cpuN trying to
reacquire its own lock.

Effectively, double_lock_balance becomes as fair as the underlying
spinlock governing the RQ. If we have ticket locks, double_lock_balance
is FIFO. Lack of ticket locks means we have arbitrary fairness. And
lack of this patch means we have favor for the lower cpus.
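
For reference, here is a toy ticket-lock sketch (illustrative only, not
the kernel's arch-specific implementation) of why the ordering becomes
FIFO: each contender atomically takes the next ticket, and the lock is
granted strictly in ticket order, so the cpu that blocked on rq[0] first
is served first.

/* Toy ticket lock, illustrative only -- not the kernel's arch code. */
struct toy_ticket_lock {
        unsigned int next_ticket;               /* next ticket to hand out */
        volatile unsigned int now_serving;      /* ticket currently granted */
};

static void toy_ticket_lock(struct toy_ticket_lock *l)
{
        /* earlier callers atomically grab lower ticket numbers */
        unsigned int me = __sync_fetch_and_add(&l->next_ticket, 1);

        while (l->now_serving != me)
                ;       /* spin: the lock is granted strictly in FIFO order */
}

static void toy_ticket_unlock(struct toy_ticket_lock *l)
{
        l->now_serving++;       /* hand the lock to the next waiter in line */
}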

To your point about the extra atomic ops: Perhaps I should predicate
this change on CONFIG_PREEMPT so as to not slow down mainline where
throughput is more valued.

> FWIW, this is always going to be a *tiny* bit unfair, because of double
> rq lock taking the lower lock first.

While this is technically true, I am not as concerned about the fairness
while acquiring the locks.




2008-08-26 17:37:27

by Gregory Haskins

Subject: [PATCH v2 1/6] sched: only try to push a task on wakeup if it is migratable

There is no sense in wasting time trying to push a task away that
cannot move anywhere else. We gain no benefit from trying to push
other tasks at this point, so if the task being woken up is
non-migratable, just skip the whole operation. This reduces overhead
in the wakeup path for certain tasks.

Signed-off-by: Gregory Haskins <[email protected]>
---

kernel/sched_rt.c | 8 +++++---
1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 6163e4c..196c024 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -1229,14 +1229,16 @@ static void post_schedule_rt(struct rq *rq)
}

/*
- * If we are not running and we are not going to reschedule soon, we should
- * try to push tasks away now
+ * If we are not running, we are not going to reschedule soon,
+ * and we have tasks that could be pushed, we should try to push tasks
+ * away now
*/
static void task_wake_up_rt(struct rq *rq, struct task_struct *p)
{
if (!task_running(rq, p) &&
!test_tsk_need_resched(rq->curr) &&
- rq->rt.overloaded)
+ rq->rt.overloaded &&
+ p->rt.nr_cpus_allowed > 1)
push_rt_tasks(rq);
}

2008-08-26 17:37:44

by Gregory Haskins

Subject: [PATCH v2 0/6] Series short description

Hi Ingo,

The following repositories

git://git.kernel.org/pub/scm/linux/kernel/git/ghaskins/linux-2.6-hacks.git tip/sched/devel/rtbalance

ftp://ftp.novell.com/dev/ghaskins/tip-rt-fixes.tar.bz2

apply to tip/sched/devel to pick up fixes for the RT scheduler.

[
Changelog:

v2:
*) predicated newidle and double_lock changes on
CONFIG_PREEMPT to address a concern by Nick Piggin

*) fixed a build issue when DEBUG_PI_LIST is defined

v1:
*) initial release
]

These patches were developed in the interest of shortening latencies in
PREEMPT_RT, but they apply to the mainline scheduler as well, so I am
offering them here first.

This has been tested under both 26.3-rt3 and tip/sched/devel on x86_64 for
both CONFIG_SMP and !CONFIG_SMP.

Comments/bug-fixes welcome!

Regards,
-Greg


---

Gregory Haskins (6):
sched: create "pushable_tasks" list to limit pushing to one attempt
plist: fix PLIST_NODE_INIT to work with debug enabled
sched: add sched_class->needs_post_schedule() member
sched: make double-lock-balance fair
sched: pull only one task during NEWIDLE balancing to limit critical section
sched: only try to push a task on wakeup if it is migratable


include/linux/init_task.h | 1
include/linux/plist.h | 9 ++-
include/linux/sched.h | 2 +
kernel/sched.c | 84 ++++++++++++++++++++++++---
kernel/sched_rt.c | 139 ++++++++++++++++++++++++++++++++++++++-------
5 files changed, 202 insertions(+), 33 deletions(-)

2008-08-26 17:37:58

by Gregory Haskins

Subject: [PATCH v2 2/6] sched: pull only one task during NEWIDLE balancing to limit critical section

git-id c4acb2c0669c5c5c9b28e9d02a34b5c67edf7092 attempted to limit
newidle critical section length by stopping after at least one task
was moved. Further investigation has shown that other paths nested
deeper inside the algorithm still allow long latencies to occur
during newidle balancing. This patch applies
the same technique inside balance_tasks() to limit the duration of
this optional balancing operation.

Signed-off-by: Gregory Haskins <[email protected]>
CC: Nick Piggin <[email protected]>
---

kernel/sched.c | 18 +++++++++++++++++-
1 files changed, 17 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 6a43c89..df6b447 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2962,6 +2962,16 @@ next:
pulled++;
rem_load_move -= p->se.load.weight;

+#ifdef CONFIG_PREEMPT
+ /*
+ * NEWIDLE balancing is a source of latency, so preemptible kernels
+ * will stop after the first task is pulled to minimize the critical
+ * section.
+ */
+ if (idle == CPU_NEWLY_IDLE)
+ goto out;
+#endif
+
/*
* We only want to steal up to the prescribed amount of weighted load.
*/
@@ -3008,9 +3018,15 @@ static int move_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
sd, idle, all_pinned, &this_best_prio);
class = class->next;

+#ifdef CONFIG_PREEMPT
+ /*
+ * NEWIDLE balancing is a source of latency, so preemptible
+ * kernels will stop after the first task is pulled to minimize
+ * the critical section.
+ */
if (idle == CPU_NEWLY_IDLE && this_rq->nr_running)
break;
-
+#endif
} while (class && max_load_move > total_load_moved);

return total_load_moved > 0;

2008-08-26 17:38:23

by Gregory Haskins

Subject: [PATCH v2 3/6] sched: make double-lock-balance fair

double_lock_balance() currently favors logically lower cpus since they
often do not have to release their own lock to acquire a second lock.
The result is that logically higher cpus can get starved when there is
a lot of pressure on the RQs. This can result in higher latencies on
higher cpu-ids.

This patch makes the algorithm more fair by forcing all paths to have
to release both locks before acquiring them again. Since callsites to
double_lock_balance already consider it a potential preemption/reschedule
point, they have the proper logic to recheck for atomicity violations.

Signed-off-by: Gregory Haskins <[email protected]>
---

kernel/sched.c | 52 +++++++++++++++++++++++++++++++++++++++++++++-------
1 files changed, 45 insertions(+), 7 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index df6b447..850b454 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2782,21 +2782,43 @@ static void double_rq_unlock(struct rq *rq1, struct rq *rq2)
__release(rq2->lock);
}

+#ifdef CONFIG_PREEMPT
+
/*
- * double_lock_balance - lock the busiest runqueue, this_rq is locked already.
+ * fair double_lock_balance: Safely acquires both rq->locks in a fair
+ * way at the expense of forcing extra atomic operations in all
+ * invocations. This assures that the double_lock is acquired using the
+ * same underlying policy as the spinlock_t on this architecture, which
+ * reduces latency compared to the unfair variant below. However, it
+ * also adds more overhead and therefore may reduce throughput.
*/
-static int double_lock_balance(struct rq *this_rq, struct rq *busiest)
+static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
+ __releases(this_rq->lock)
+ __acquires(busiest->lock)
+ __acquires(this_rq->lock)
+{
+ spin_unlock(&this_rq->lock);
+ double_rq_lock(this_rq, busiest);
+
+ return 1;
+}
+
+#else
+
+/*
+ * Unfair double_lock_balance: Optimizes throughput at the expense of
+ * latency by eliminating extra atomic operations when the locks are
+ * already in proper order on entry. This favors lower cpu-ids and will
+ * grant the double lock to lower cpus over higher ids under contention,
+ * regardless of entry order into the function.
+ */
+static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
__releases(this_rq->lock)
__acquires(busiest->lock)
__acquires(this_rq->lock)
{
int ret = 0;

- if (unlikely(!irqs_disabled())) {
- /* printk() doesn't work good under rq->lock */
- spin_unlock(&this_rq->lock);
- BUG_ON(1);
- }
if (unlikely(!spin_trylock(&busiest->lock))) {
if (busiest < this_rq) {
spin_unlock(&this_rq->lock);
@@ -2809,6 +2831,22 @@ static int double_lock_balance(struct rq *this_rq, struct rq *busiest)
return ret;
}

+#endif /* CONFIG_PREEMPT */
+
+/*
+ * double_lock_balance - lock the busiest runqueue, this_rq is locked already.
+ */
+static int double_lock_balance(struct rq *this_rq, struct rq *busiest)
+{
+ if (unlikely(!irqs_disabled())) {
+ /* printk() doesn't work good under rq->lock */
+ spin_unlock(&this_rq->lock);
+ BUG_ON(1);
+ }
+
+ return _double_lock_balance(this_rq, busiest);
+}
+
static void double_unlock_balance(struct rq *this_rq, struct rq *busiest)
__releases(busiest->lock)
{

2008-08-26 17:39:00

by Gregory Haskins

Subject: [PATCH v2 5/6] plist: fix PLIST_NODE_INIT to work with debug enabled

It seems that PLIST_NODE_INIT breaks if it is used while DEBUG_PI_LIST is
defined. Since there are no current users of PLIST_NODE_INIT, this has gone
undetected. This patch fixes the build breakage, which would otherwise
surface later in the series when we use PLIST_NODE_INIT in init_task.h
with DEBUG_PI_LIST enabled.

Signed-off-by: Gregory Haskins <[email protected]>
---

include/linux/plist.h | 9 ++++++---
1 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/linux/plist.h b/include/linux/plist.h
index 85de2f0..2b65483 100644
--- a/include/linux/plist.h
+++ b/include/linux/plist.h
@@ -96,6 +96,10 @@ struct plist_node {
# define PLIST_HEAD_LOCK_INIT(_lock)
#endif

+#define _PLIST_HEAD_INIT(head) \
+ .prio_list = LIST_HEAD_INIT((head).prio_list), \
+ .node_list = LIST_HEAD_INIT((head).node_list)
+
/**
* PLIST_HEAD_INIT - static struct plist_head initializer
* @head: struct plist_head variable name
@@ -103,8 +107,7 @@ struct plist_node {
*/
#define PLIST_HEAD_INIT(head, _lock) \
{ \
- .prio_list = LIST_HEAD_INIT((head).prio_list), \
- .node_list = LIST_HEAD_INIT((head).node_list), \
+ _PLIST_HEAD_INIT(head), \
PLIST_HEAD_LOCK_INIT(&(_lock)) \
}

@@ -116,7 +119,7 @@ struct plist_node {
#define PLIST_NODE_INIT(node, __prio) \
{ \
.prio = (__prio), \
- .plist = PLIST_HEAD_INIT((node).plist, NULL), \
+ .plist = { _PLIST_HEAD_INIT((node).plist) }, \
}

/**

2008-08-26 17:38:38

by Gregory Haskins

Subject: [PATCH v2 4/6] sched: add sched_class->needs_post_schedule() member

We currently run class->post_schedule() outside of the rq->lock, which
means that we need to test for the need to post_schedule outside of
the lock to avoid a forced reacquisition. This is currently not a problem
as we only look at rq->rt.overloaded. However, we want to enhance this
going forward to look at more state to reduce the need to post_schedule
to a bare minimum set. Therefore, we introduce a new member function,
needs_post_schedule(), which tests for the post_schedule condition
without actually performing the work. It is safe to call this
function before the rq->lock is released, because we are guaranteed not
to drop the lock at an intermediate point (such as what post_schedule()
may do).

We will use this later in the series.

Signed-off-by: Gregory Haskins <[email protected]>
---

include/linux/sched.h | 1 +
kernel/sched.c | 10 +++++++++-
kernel/sched_rt.c | 24 ++++++++++++++----------
3 files changed, 24 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 08a87b5..cf8cd8c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -912,6 +912,7 @@ struct sched_class {
struct rq *busiest, struct sched_domain *sd,
enum cpu_idle_type idle);
void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
+ int (*needs_post_schedule) (struct rq *this_rq);
void (*post_schedule) (struct rq *this_rq);
void (*task_wake_up) (struct rq *this_rq, struct task_struct *task);
#endif
diff --git a/kernel/sched.c b/kernel/sched.c
index 850b454..c5b4459 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2525,6 +2525,14 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
{
struct mm_struct *mm = rq->prev_mm;
long prev_state;
+#ifdef CONFIG_SMP
+ int post_schedule = 0;
+
+ if (current->sched_class->needs_post_schedule) {
+ BUG_ON(!current->sched_class->post_schedule);
+ post_schedule = current->sched_class->needs_post_schedule(rq);
+ }
+#endif

rq->prev_mm = NULL;

@@ -2543,7 +2551,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
finish_arch_switch(prev);
finish_lock_switch(rq, prev);
#ifdef CONFIG_SMP
- if (current->sched_class->post_schedule)
+ if (post_schedule)
current->sched_class->post_schedule(rq);
#endif

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 196c024..a3d695b 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -1212,20 +1212,23 @@ static void pre_schedule_rt(struct rq *rq, struct task_struct *prev)
pull_rt_task(rq);
}

+/*
+ * assumes rq->lock is held
+ */
+static int needs_post_schedule_rt(struct rq *rq)
+{
+ return rq->rt.overloaded ? 1 : 0;
+}
+
static void post_schedule_rt(struct rq *rq)
{
/*
- * If we have more than one rt_task queued, then
- * see if we can push the other rt_tasks off to other CPUS.
- * Note we may release the rq lock, and since
- * the lock was owned by prev, we need to release it
- * first via finish_lock_switch and then reaquire it here.
+ * This is only called if needs_post_schedule_rt() indicates that
+ * we need to push tasks away
*/
- if (unlikely(rq->rt.overloaded)) {
- spin_lock_irq(&rq->lock);
- push_rt_tasks(rq);
- spin_unlock_irq(&rq->lock);
- }
+ spin_lock_irq(&rq->lock);
+ push_rt_tasks(rq);
+ spin_unlock_irq(&rq->lock);
}

/*
@@ -1473,6 +1476,7 @@ static const struct sched_class rt_sched_class = {
.rq_online = rq_online_rt,
.rq_offline = rq_offline_rt,
.pre_schedule = pre_schedule_rt,
+ .needs_post_schedule = needs_post_schedule_rt,
.post_schedule = post_schedule_rt,
.task_wake_up = task_wake_up_rt,
.switched_from = switched_from_rt,

2008-08-26 17:39:22

by Gregory Haskins

Subject: [PATCH v2 6/6] sched: create "pushable_tasks" list to limit pushing to one attempt

The RT scheduler employs a "push/pull" design to actively balance tasks
within the system (on a per disjoint cpuset basis). When a task is
awoken, it is immediately determined if there are any lower priority
cpus which should be preempted. This is opposed to the way normal
SCHED_OTHER tasks behave, which will wait for a periodic rebalancing
operation to occur before spreading out load.

When a particular RQ has more than 1 active RT task, it is said to
be in an "overloaded" state. Once this occurs, the system enters
the active balancing mode, where it will try to push the task away,
or persuade a different cpu to pull it over. The system will stay
in this state until the system falls back to <= 1 queued RT
task per RQ.

However, the current implementation suffers from a limitation in the
push logic. Once overloaded, all tasks (other than current) on the
RQ are analyzed on every push operation, even if they were previously
unpushable (due to affinity, etc). What's more, the operation stops
at the first task that is unpushable and will not look at items
lower in the queue. This causes two problems:

1) We can have the same tasks analyzed over and over again during each
push, which extends out the fast path in the scheduler for no
gain. Consider an RQ that has dozens of tasks that are bound to a
core. Each one of those tasks will be encountered and skipped
for each push operation while they are queued.

2) There may be lower-priority tasks under the unpushable task that
could have been successfully pushed, but will never be considered
until either the unpushable task is cleared, or a pull operation
succeeds. The net result is a potential latency source for
mid-priority tasks.

This patch aims to rectify these two conditions by introducing a new
priority-sorted list: "pushable_tasks". A task is added to the list
each time it is activated or preempted. It is removed from the
list any time it is deactivated, made current, or fails to push.

This works because a push only needs to be attempted once per task.
After an initial failure to push, the other cpus will eventually try to
pull the task when the conditions are proper. This also solves the
problem that we don't completely analyze all tasks due to encountering
an unpushable task. Now every task will have a push attempted (when
appropriate).

This reduces latency both by shortening the critical section of the
rq->lock for certain workloads, and by making sure the algorithm
considers all eligible tasks in the system.

Signed-off-by: Gregory Haskins <[email protected]>
CC: Steven Rostedt <[email protected]>
---

include/linux/init_task.h | 1
include/linux/sched.h | 1
kernel/sched.c | 4 ++
kernel/sched_rt.c | 111 +++++++++++++++++++++++++++++++++++++++++----
4 files changed, 107 insertions(+), 10 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 021d8e7..7f910f4 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -140,6 +140,7 @@ extern struct group_info init_groups;
.nr_cpus_allowed = NR_CPUS, \
}, \
.tasks = LIST_HEAD_INIT(tsk.tasks), \
+ .pushable_tasks = PLIST_NODE_INIT(tsk.pushable_tasks, MAX_PRIO), \
.ptraced = LIST_HEAD_INIT(tsk.ptraced), \
.ptrace_entry = LIST_HEAD_INIT(tsk.ptrace_entry), \
.real_parent = &tsk, \
diff --git a/include/linux/sched.h b/include/linux/sched.h
index cf8cd8c..3480c8a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1078,6 +1078,7 @@ struct task_struct {
#endif

struct list_head tasks;
+ struct plist_node pushable_tasks;

struct mm_struct *mm, *active_mm;

diff --git a/kernel/sched.c b/kernel/sched.c
index c5b4459..d0b189d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -447,6 +447,7 @@ struct rt_rq {
#ifdef CONFIG_SMP
unsigned long rt_nr_migratory;
int overloaded;
+ struct plist_head pushable_tasks;
#endif
int rt_throttled;
u64 rt_time;
@@ -2383,6 +2384,8 @@ void sched_fork(struct task_struct *p, int clone_flags)
/* Want to start with kernel preemption disabled. */
task_thread_info(p)->preempt_count = 1;
#endif
+ plist_node_init(&p->pushable_tasks, MAX_PRIO);
+
put_cpu();
}

@@ -7905,6 +7908,7 @@ static void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
#ifdef CONFIG_SMP
rt_rq->rt_nr_migratory = 0;
rt_rq->overloaded = 0;
+ plist_head_init(&rq->rt.pushable_tasks, &rq->lock);
#endif

rt_rq->rt_time = 0;
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index a3d695b..19cbbe4 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -49,6 +49,25 @@ static void update_rt_migration(struct rq *rq)
rq->rt.overloaded = 0;
}
}
+
+static void enqueue_pushable_task(struct rq *rq, struct task_struct *p)
+{
+ BUG_ON(!plist_node_empty(&p->pushable_tasks));
+
+ plist_node_init(&p->pushable_tasks, p->prio);
+ plist_add(&p->pushable_tasks, &rq->rt.pushable_tasks);
+}
+
+static void dequeue_pushable_task(struct rq *rq, struct task_struct *p)
+{
+ plist_del(&p->pushable_tasks, &rq->rt.pushable_tasks);
+}
+
+#else
+
+#define enqueue_pushable_task(rq, p) do { } while (0)
+#define dequeue_pushable_task(rq, p) do { } while (0)
+
#endif /* CONFIG_SMP */

static inline struct task_struct *rt_task_of(struct sched_rt_entity *rt_se)
@@ -669,6 +688,9 @@ static void enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup)

enqueue_rt_entity(rt_se);

+ if (!task_current(rq, p) && p->rt.nr_cpus_allowed > 1)
+ enqueue_pushable_task(rq, p);
+
inc_cpu_load(rq, p->se.load.weight);
}

@@ -679,6 +701,8 @@ static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int sleep)
update_curr_rt(rq);
dequeue_rt_entity(rt_se);

+ dequeue_pushable_task(rq, p);
+
dec_cpu_load(rq, p->se.load.weight);
}

@@ -824,7 +848,7 @@ static struct sched_rt_entity *pick_next_rt_entity(struct rq *rq,
return next;
}

-static struct task_struct *pick_next_task_rt(struct rq *rq)
+static struct task_struct *_pick_next_task_rt(struct rq *rq)
{
struct sched_rt_entity *rt_se;
struct task_struct *p;
@@ -846,6 +870,18 @@ static struct task_struct *pick_next_task_rt(struct rq *rq)

p = rt_task_of(rt_se);
p->se.exec_start = rq->clock;
+
+ return p;
+}
+
+static struct task_struct *pick_next_task_rt(struct rq *rq)
+{
+ struct task_struct *p = _pick_next_task_rt(rq);
+
+ /* The running task is never eligible for pushing */
+ if (p)
+ dequeue_pushable_task(rq, p);
+
return p;
}

@@ -853,6 +889,13 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
{
update_curr_rt(rq);
p->se.exec_start = 0;
+
+ /*
+ * The previous task needs to be made eligible for pushing
+ * if it is still active
+ */
+ if (p->se.on_rq && p->rt.nr_cpus_allowed > 1)
+ enqueue_pushable_task(rq, p);
}

#ifdef CONFIG_SMP
@@ -1031,6 +1074,28 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
return lowest_rq;
}

+static inline int has_pushable_tasks(struct rq *rq)
+{
+ return !plist_head_empty(&rq->rt.pushable_tasks);
+}
+
+static struct task_struct *pick_next_pushable_task(struct rq *rq)
+{
+ struct task_struct *p;
+
+ if (!has_pushable_tasks(rq))
+ return NULL;
+
+ p = plist_first_entry(&rq->rt.pushable_tasks,
+ struct task_struct, pushable_tasks);
+
+ BUG_ON(rq->cpu != task_cpu(p));
+ BUG_ON(task_current(rq, p));
+ BUG_ON(p->rt.nr_cpus_allowed <= 1);
+
+ return p;
+}
+
/*
* If the current CPU has more than one RT task, see if the non
* running task can migrate over to a CPU that is running a task
@@ -1040,13 +1105,12 @@ static int push_rt_task(struct rq *rq)
{
struct task_struct *next_task;
struct rq *lowest_rq;
- int ret = 0;
int paranoid = RT_MAX_TRIES;

if (!rq->rt.overloaded)
return 0;

- next_task = pick_next_highest_task_rt(rq, -1);
+ next_task = pick_next_pushable_task(rq);
if (!next_task)
return 0;

@@ -1078,12 +1142,19 @@ static int push_rt_task(struct rq *rq)
* so it is possible that next_task has changed.
* If it has, then try again.
*/
- task = pick_next_highest_task_rt(rq, -1);
+ task = pick_next_pushable_task(rq);
if (unlikely(task != next_task) && task && paranoid--) {
put_task_struct(next_task);
next_task = task;
goto retry;
}
+
+ /*
+ * Once we have failed to push this task, we will not
+ * try again, since the other cpus will pull from us
+ * when they are ready
+ */
+ dequeue_pushable_task(rq, next_task);
goto out;
}

@@ -1095,11 +1166,10 @@ static int push_rt_task(struct rq *rq)

double_unlock_balance(rq, lowest_rq);

- ret = 1;
out:
put_task_struct(next_task);

- return ret;
+ return 1;
}

/*
@@ -1128,7 +1198,7 @@ static int pull_rt_task(struct rq *this_rq)
if (likely(!rt_overloaded(this_rq)))
return 0;

- next = pick_next_task_rt(this_rq);
+ next = _pick_next_task_rt(this_rq);

for_each_cpu_mask_nr(cpu, this_rq->rd->rto_mask) {
if (this_cpu == cpu)
@@ -1145,7 +1215,7 @@ static int pull_rt_task(struct rq *this_rq)
if (double_lock_balance(this_rq, src_rq)) {
struct task_struct *old_next = next;

- next = pick_next_task_rt(this_rq);
+ next = _pick_next_task_rt(this_rq);
if (next != old_next)
ret = 1;
}
@@ -1217,7 +1287,7 @@ static void pre_schedule_rt(struct rq *rq, struct task_struct *prev)
*/
static int needs_post_schedule_rt(struct rq *rq)
{
- return rq->rt.overloaded ? 1 : 0;
+ return has_pushable_tasks(rq);
}

static void post_schedule_rt(struct rq *rq)
@@ -1240,7 +1310,7 @@ static void task_wake_up_rt(struct rq *rq, struct task_struct *p)
{
if (!task_running(rq, p) &&
!test_tsk_need_resched(rq->curr) &&
- rq->rt.overloaded &&
+ has_pushable_tasks(rq) &&
p->rt.nr_cpus_allowed > 1)
push_rt_tasks(rq);
}
@@ -1277,6 +1347,24 @@ static void set_cpus_allowed_rt(struct task_struct *p,
if (p->se.on_rq && (weight != p->rt.nr_cpus_allowed)) {
struct rq *rq = task_rq(p);

+ if (!task_current(rq, p)) {
+ /*
+ * Make sure we dequeue this task from the pushable list
+ * before going further. It will either remain off of
+ * the list because we are no longer pushable, or it
+ * will be requeued.
+ */
+ if (p->rt.nr_cpus_allowed > 1)
+ dequeue_pushable_task(rq, p);
+
+ /*
+ * Requeue if our weight is changing and still > 1
+ */
+ if (weight > 1)
+ enqueue_pushable_task(rq, p);
+
+ }
+
if ((p->rt.nr_cpus_allowed <= 1) && (weight > 1)) {
rq->rt.rt_nr_migratory++;
} else if ((p->rt.nr_cpus_allowed > 1) && (weight <= 1)) {
@@ -1453,6 +1541,9 @@ static void set_curr_task_rt(struct rq *rq)
struct task_struct *p = rq->curr;

p->se.exec_start = rq->clock;
+
+ /* The running task is never eligible for pushing */
+ dequeue_pushable_task(rq, p);
}

static const struct sched_class rt_sched_class = {

2008-08-26 18:18:41

by Gregory Haskins

Subject: Re: [PATCH v2 0/6] sched: misc rt fixes for tip/sched/devel (was: Series short description)

Gregory Haskins wrote:
> Hi Ingo,
>
> The following repositories
>
> git://git.kernel.org/pub/scm/linux/kernel/git/ghaskins/linux-2.6-hacks.git tip/sched/devel/rtbalance
>
> ftp://ftp.novell.com/dev/ghaskins/tip-rt-fixes.tar.bz2
>
> apply to tip/sched/devel to pick up fixes for the RT scheduler.
>

[ It would help if I remembered to rename the default subject ;) ]

I just wanted to add that these are not urgent fixes: 2.6.28 at the
earliest. Thanks

-Greg

> [
> Changelog:
>
> v2:
> *) predicated newidle and double_lock changes on
> CONFIG_PREEMPT to address a concern by Nick Piggin
>
> *) fixed a build issue when DEBUG_PI_LIST is defined
>
> v1:
> *) initial release
> ]
>
> These patches were developed in the interest of shortening latencies in
> PREEMPT_RT, but they apply to the mainline scheduler as well, so I am
> offering them here first.
>
> This has been tested under both 26.3-rt3 and tip/sched/devel on x86_64 for
> both CONFIG_SMP and !CONFIG_SMP.
>
> Comments/bug-fixes welcome!
>
> Regards,
> -Greg
>
>
> ---
>
> Gregory Haskins (6):
> sched: create "pushable_tasks" list to limit pushing to one attempt
> plist: fix PLIST_NODE_INIT to work with debug enabled
> sched: add sched_class->needs_post_schedule() member
> sched: make double-lock-balance fair
> sched: pull only one task during NEWIDLE balancing to limit critical section
> sched: only try to push a task on wakeup if it is migratable
>
>
> include/linux/init_task.h | 1
> include/linux/plist.h | 9 ++-
> include/linux/sched.h | 2 +
> kernel/sched.c | 84 ++++++++++++++++++++++++---
> kernel/sched_rt.c | 139 ++++++++++++++++++++++++++++++++++++++-------
> 5 files changed, 202 insertions(+), 33 deletions(-)
>
>




2008-08-27 06:36:32

by Nick Piggin

Subject: Re: [PATCH 3/5] sched: make double-lock-balance fair

On Tuesday 26 August 2008 22:23, Gregory Haskins wrote:
> Nick Piggin wrote:
> > On Tuesday 26 August 2008 06:15, Gregory Haskins wrote:
> >> double_lock_balance() currently favors logically lower cpus since they
> >> often do not have to release their own lock to acquire a second lock.
> >> The result is that logically higher cpus can get starved when there is
> >> a lot of pressure on the RQs. This can result in higher latencies on
> >> higher cpu-ids.
> >>
> >> This patch makes the algorithm more fair by forcing all paths to have
> >> to release both locks before acquiring them again. Since callsites to
> >> double_lock_balance already consider it a potential
> >> preemption/reschedule point, they have the proper logic to recheck for
> >> atomicity violations.
> >>
> >> Signed-off-by: Gregory Haskins <[email protected]>
> >> ---
> >>
> >> kernel/sched.c | 17 +++++------------
> >> 1 files changed, 5 insertions(+), 12 deletions(-)
> >>
> >> diff --git a/kernel/sched.c b/kernel/sched.c
> >> index 6e0bde6..b7326cd 100644
> >> --- a/kernel/sched.c
> >> +++ b/kernel/sched.c
> >> @@ -2790,23 +2790,16 @@ static int double_lock_balance(struct rq
> >> *this_rq, struct rq *busiest) __acquires(busiest->lock)
> >> __acquires(this_rq->lock)
> >> {
> >> - int ret = 0;
> >> -
> >> if (unlikely(!irqs_disabled())) {
> >> /* printk() doesn't work good under rq->lock */
> >> spin_unlock(&this_rq->lock);
> >> BUG_ON(1);
> >> }
> >> - if (unlikely(!spin_trylock(&busiest->lock))) {
> >> - if (busiest < this_rq) {
> >> - spin_unlock(&this_rq->lock);
> >> - spin_lock(&busiest->lock);
> >> - spin_lock_nested(&this_rq->lock, SINGLE_DEPTH_NESTING);
> >> - ret = 1;
> >> - } else
> >> - spin_lock_nested(&busiest->lock, SINGLE_DEPTH_NESTING);
> >> - }
> >> - return ret;
> >> +
> >> + spin_unlock(&this_rq->lock);
> >> + double_rq_lock(this_rq, busiest);
> >
> > Rather than adding the extra atomic operation, can't you just put this
> > into the unlikely spin_trylock failure path rather than the unfair logic
> > there?
>
> The trick is that we *must* first release this_rq before proceeding or
> the new proposal doesn't work as intended. This patch effectively
> breaks up the this_rq->lock critical section evenly across all CPUs as
> if it hit the case common for higher cpus.

I don't exactly see why my proposal would introduce any more latency, because
we only trylock while holding the existing lock -- this will only ever add
a small ~constant time to the critical section, regardless of whether it is a
high or low CPU runqueue.


> This modification decreased
> latency by over 800% (went from > 400us to < 50us) on cpus 6 and 7 in my
> 8-way box namely because they were not forced to wait for all the other
> lower cores to finish, but rather completions of double_lock_balance
> were handled in true FIFO w.r.t. to other calls to
> double_lock_balance(). It has to do with the positioning within your
> FIFO ticket locks (though even if ticket locks are not present on a
> given architecture we should still see an improvement.)
>
> When a low cpu wants to double lock, it tends to hold this_rq and gets
> in line for busiest_rq with no bearing on how long it held this_rq.
> Therefore the following scenario can occur:
>
> cpu 0 cpu N
> ----------------------------------
> rq[0] locked
> ..
> ..
> ..
> double_lock(N, 0)
> rq[N] released
> blocked on rq[0]
> ..
> ..
> ..
> ..
> double_lock(0, N)
> rq[N] locked
> double_lock returns
> ..
> ..
> ..
> ..
> rq[0] released rq[0] locked
> double_lock returns
> ...
> ...
> ...
>
> ---------------------------------
>
> So double lock acquisition favors the lower cpus unfairly. They will
> always win, even if they were not first. Now with the combination of my
> patch plus your ticket locks, entry into the double lock becomes FIFO
> because the "blocked on rq[0]" would have inserted it in the
> time-ordered head of rq[0].

Right, but I don't think it is particularly wrong to allow a given
CPU to double_lock_balance ahead of another guy if we're already holding
the lock. _So long as_ the lock we are trying to acquire is uncontended,
and we don't introduce this skewed unfairness due to lower CPUs being
allowed to hold their lower lock while higher CPUs have to release their
lock and first queue on the lower.

The difference is that with my patch, there is a small window where the
guy who asks for the double lock first will go through second. I don't
think this really adds a fundamental amount of latency, and the
performance benefit should not be ignored.

Linux's traditional and, I suppose, by far largest user base does not require
realtime or really strict fairness, so IMO it is always questionable to
make changes like this.


> The later call to double_lock(0, N)
> performed by cpu0 would be forced to queue up behind cpuN trying to
> reacquire its own lock.
>
> Effectively, double_lock_balance becomes as fair as the underlying
> spinlock governing the RQ. If we have ticket locks, double_lock_balance
> is FIFO. Lack of ticket locks means we have arbitrary fairness. And
> lack of this patch means we have favor for the lower cpus.
>
> To your point about the extra atomic ops: Perhaps I should predicate
> this change on CONFIG_PREEMPT so as to not slow down mainline where
> throughput is more valued.

I am happy with that. Actually, I think we should have a CONFIG_ to trade
off throughput for latency, and have that config select PREEMPT as well as
enable small misc changes like this one and the tlb flushing latency
reduction. But for the moment, CONFIG_PREEMPT is the traditional placeholder
for such a config option.

2008-08-27 06:42:06

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 2/5] sched: pull only one task during NEWIDLE balancing to limit critical section

On Tuesday 26 August 2008 21:36, Gregory Haskins wrote:
> Nick Piggin wrote:
> > On Tuesday 26 August 2008 06:15, Gregory Haskins wrote:
> >> git-id c4acb2c0669c5c5c9b28e9d02a34b5c67edf7092 attempted to limit
> >> newidle critical section length by stopping after at least one task
> >> was moved. Further investigation has shown that there are other
> >> paths nested further inside the algorithm which still remain that allow
> >> long latencies to occur with newidle balancing. This patch applies
> >> the same technique inside balance_tasks() to limit the duration of
> >> this optional balancing operation.
> >>
> >> Signed-off-by: Gregory Haskins <[email protected]>
> >> CC: Nick Piggin <[email protected]>
> >
> > Hmm, this (and c4acb2c0669c5c5c9b28e9d02a34b5c67edf7092) still could
> > increase the amount of work to do significantly for workloads where
> > the CPU is going idle and pulling tasks over frequently. I don't
> > really like either of them too much.
>
> I had a feeling you may object to this patch based on your comments on
> the first one. Thats why I CC'd you so you wouldnt think I was trying
> to sneak something past ;)

Appreciated.


> > Maybe increasing the limit would effectively amortize most of the
> > problem (say, limit to move 16 tasks at most).
>
> The problem I was seeing was that even moving 2 was too many in the
> ftraces traces I looked at. I think the idea of making a variable limit
> (set via a sysctl, etc) here is a good one, but I would recommend we
> have the default be "1" for CONFIG_PREEMPT (or at least
> CONFIG_PREEMPT_RT) based on what I know right now. I know last time
> you objected to any kind of special cases for the preemptible kernels,
> but I think this is a good compromise. Would this be acceptable?

Well I _prefer_ not to have a special case for preemptible kernels, but
we already have similar arbitrary kind of changes like in tlb flushing,
so...

I understand and accept there are some places where fundamentally you
have to trade latency for throughput, so at some point we have to have a
config and/or sysctl for that.

I'm surprised 2 is too much but 1 is OK. Seems pretty fragile to me. Are
you just running insane tests that load up the runqueues heaps and tests
latency? -rt users will have to understand that some algorithms scale
linearly or so with the number of a particular resource allocated, so
they aren't going to get a constant low latency under arbitrary
conditions.

FWIW, if you haven't already, then for -rt you might want to look at a
more advanced data structure than simple run ordered list for moving tasks
from one rq to the other. A simple one I was looking at is a time ordered
list to pull the most cache cold tasks (and thus we can stop searching
when we encounter the first cache hot task, in situations where it is
appropriate, etc).

Anyway... yeah I'm OK with this if it is under a config option.
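
[ For context, a rough sketch of the compromise being discussed above: a
config-dependent cap on how many tasks a newidle pass may pull. The
constant name and the exact hook point inside balance_tasks() are
illustrative only, not the actual patch. ]

#ifdef CONFIG_PREEMPT
/* latency matters more than throughput: stop after the first pull */
# define NEWIDLE_MAX_PULL	1
#else
/* throughput build: allow a larger batch to amortize the balance cost */
# define NEWIDLE_MAX_PULL	16
#endif

	/* ...inside the task-moving loop of balance_tasks()... */
	if (idle == CPU_NEWLY_IDLE && pulled >= NEWIDLE_MAX_PULL)
		goto out;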

2008-08-27 08:22:06

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 3/6] sched: make double-lock-balance fair

On Tue, 2008-08-26 at 13:35 -0400, Gregory Haskins wrote:
> double_lock balance() currently favors logically lower cpus since they
> often do not have to release their own lock to acquire a second lock.
> The result is that logically higher cpus can get starved when there is
> a lot of pressure on the RQs. This can result in higher latencies on
> higher cpu-ids.
>
> This patch makes the algorithm more fair by forcing all paths to have
> to release both locks before acquiring them again. Since callsites to
> double_lock_balance already consider it a potential preemption/reschedule
> point, they have the proper logic to recheck for atomicity violations.
>
> Signed-off-by: Gregory Haskins <[email protected]>
> ---
>
> kernel/sched.c | 52 +++++++++++++++++++++++++++++++++++++++++++++-------
> 1 files changed, 45 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index df6b447..850b454 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -2782,21 +2782,43 @@ static void double_rq_unlock(struct rq *rq1, struct rq *rq2)
> __release(rq2->lock);
> }
>
> +#ifdef CONFIG_PREEMPT
> +
> /*
> - * double_lock_balance - lock the busiest runqueue, this_rq is locked already.
> + * fair double_lock_balance: Safely acquires both rq->locks in a fair
> + * way at the expense of forcing extra atomic operations in all
> + * invocations. This assures that the double_lock is acquired using the
> + * same underlying policy as the spinlock_t on this architecture, which
> + * reduces latency compared to the unfair variant below. However, it
> + * also adds more overhead and therefore may reduce throughput.
> */
> -static int double_lock_balance(struct rq *this_rq, struct rq *busiest)
> +static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
> + __releases(this_rq->lock)
> + __acquires(busiest->lock)
> + __acquires(this_rq->lock)
> +{
> + spin_unlock(&this_rq->lock);
> + double_rq_lock(this_rq, busiest);
> +
> + return 1;
> +}

Right - so to belabour Nick's point:

if (!spin_trylock(&busiest->lock)) {
spin_unlock(&this_rq->lock);
double_rq_lock(this_rq, busiest);
}

might unfairly treat someone who is waiting on this_rq if I understand
it right?

I suppose one could then write it like:

if (spin_is_contended(&this_rq->lock) || !spin_trylock(&busiest->lock)) {
spin_unlock(&this_rq->lock);
double_rq_lock(this_rq, busiest);
}

But, I'm not sure that's worth the effort at that point..

Anyway - I think all this is utterly defeated on CONFIG_PREEMPT by the
spin with IRQs enabled logic in kernel/spinlock.c.

Making this an -rt only patch...
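
[ To summarize the two variants being weighed here, the shape of the
double_lock_balance() helper under discussion is roughly as follows. The
CONFIG_PREEMPT branch is the fair variant from the v2 patch quoted above,
the other branch is the existing unfair-but-cheaper mainline logic; the
sparse annotations and the irqs_disabled() sanity check are omitted for
brevity. ]

#ifdef CONFIG_PREEMPT
/* fair: always drop this_rq and queue on both locks in order */
static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
{
	spin_unlock(&this_rq->lock);
	double_rq_lock(this_rq, busiest);

	return 1;
}
#else
/* unfair but cheaper: only drop this_rq when lock ordering forces it */
static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
{
	int ret = 0;

	if (unlikely(!spin_trylock(&busiest->lock))) {
		if (busiest < this_rq) {
			spin_unlock(&this_rq->lock);
			spin_lock(&busiest->lock);
			spin_lock_nested(&this_rq->lock, SINGLE_DEPTH_NESTING);
			ret = 1;
		} else
			spin_lock_nested(&busiest->lock, SINGLE_DEPTH_NESTING);
	}
	return ret;
}
#endif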

2008-08-27 08:25:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 3/6] sched: make double-lock-balance fair

On Wed, 2008-08-27 at 10:21 +0200, Peter Zijlstra wrote:
> On Tue, 2008-08-26 at 13:35 -0400, Gregory Haskins wrote:
> > double_lock balance() currently favors logically lower cpus since they
> > often do not have to release their own lock to acquire a second lock.
> > The result is that logically higher cpus can get starved when there is
> > a lot of pressure on the RQs. This can result in higher latencies on
> > higher cpu-ids.
> >
> > This patch makes the algorithm more fair by forcing all paths to have
> > to release both locks before acquiring them again. Since callsites to
> > double_lock_balance already consider it a potential preemption/reschedule
> > point, they have the proper logic to recheck for atomicity violations.
> >
> > Signed-off-by: Gregory Haskins <[email protected]>
> > ---
> >
> > kernel/sched.c | 52 +++++++++++++++++++++++++++++++++++++++++++++-------
> > 1 files changed, 45 insertions(+), 7 deletions(-)
> >
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index df6b447..850b454 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -2782,21 +2782,43 @@ static void double_rq_unlock(struct rq *rq1, struct rq *rq2)
> > __release(rq2->lock);
> > }
> >
> > +#ifdef CONFIG_PREEMPT
> > +
> > /*
> > - * double_lock_balance - lock the busiest runqueue, this_rq is locked already.
> > + * fair double_lock_balance: Safely acquires both rq->locks in a fair
> > + * way at the expense of forcing extra atomic operations in all
> > + * invocations. This assures that the double_lock is acquired using the
> > + * same underlying policy as the spinlock_t on this architecture, which
> > + * reduces latency compared to the unfair variant below. However, it
> > + * also adds more overhead and therefore may reduce throughput.
> > */
> > -static int double_lock_balance(struct rq *this_rq, struct rq *busiest)
> > +static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
> > + __releases(this_rq->lock)
> > + __acquires(busiest->lock)
> > + __acquires(this_rq->lock)
> > +{
> > + spin_unlock(&this_rq->lock);
> > + double_rq_lock(this_rq, busiest);
> > +
> > + return 1;
> > +}
>
> Right - so to belabour Nick's point:
>
> if (!spin_trylock(&busiest->lock)) {
> spin_unlock(&this_rq->lock);
> double_rq_lock(this_rq, busiest);
> }
>
> might unfairly treat someone who is waiting on this_rq if I understand
> it right?
>
> I suppose one could then write it like:
>
> if (spin_is_contended(&this_rq->lock) || !spin_trylock(&busiest->lock)) {
> spin_unlock(&this_rq->lock);
> double_rq_lock(this_rq, busiest);
> }
>
> But, I'm not sure that's worth the effort at that point..
>
> Anyway - I think all this is utterly defeated on CONFIG_PREEMPT by the
> spin with IRQs enabled logic in kernel/spinlock.c.
>
> Making this an -rt only patch...

n/m my last bit, that's only for spin_lock_irq*() which we're not using
here, so yes, it ought to work.

2008-08-27 08:33:54

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] Series short description

On Tue, 2008-08-26 at 13:34 -0400, Gregory Haskins wrote:
> Hi Ingo,
>
> The following repositories
>
> git://git.kernel.org/pub/scm/linux/kernel/git/ghaskins/linux-2.6-hacks.git tip/sched/devel/rtbalance
>
> ftp://ftp.novell.com/dev/ghaskins/tip-rt-fixes.tar.bz2
>
> apply to tip/sched/devel to pick up fixes for the RT scheduler.
>
> [
> Changelog:
>
> v2:
> *) predicated newidle and double_lock changes on
> CONFIG_PREEMPT to address a concern by Nick Piggin
>
> *) fixed a build issue when DEBUG_PI_LIST is defined
>
> v1:
> *) initial release
> ]
>
> These patches were developed in the interest of shorting latencies in
> PREEMPT_RT, but they apply to the mainline scheduler as well, so I am
> offering them here first.
>
> This has been tested under both 26.3-rt3 and tip/sched/devel on x86_64 for
> both CONFIG_SMP and !CONFIG_SMP.
>
> Comments/bug-fixes welcome!

Looks good to me,

Acked-by: Peter Zijlstra <[email protected]>

Ingo please consider.

> ---
>
> Gregory Haskins (6):
> sched: create "pushable_tasks" list to limit pushing to one attempt
> plist: fix PLIST_NODE_INIT to work with debug enabled
> sched: add sched_class->needs_post_schedule() member
> sched: make double-lock-balance fair
> sched: pull only one task during NEWIDLE balancing to limit critical section
> sched: only try to push a task on wakeup if it is migratable
>
>
> include/linux/init_task.h | 1
> include/linux/plist.h | 9 ++-
> include/linux/sched.h | 2 +
> kernel/sched.c | 84 ++++++++++++++++++++++++---
> kernel/sched_rt.c | 139 ++++++++++++++++++++++++++++++++++++++-------
> 5 files changed, 202 insertions(+), 33 deletions(-)
>
>

2008-08-27 10:27:00

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH v2 3/6] sched: make double-lock-balance fair

On Wed, Aug 27, 2008 at 10:21:35AM +0200, Peter Zijlstra wrote:
> On Tue, 2008-08-26 at 13:35 -0400, Gregory Haskins wrote:
> > double_lock balance() currently favors logically lower cpus since they
> > often do not have to release their own lock to acquire a second lock.
> > The result is that logically higher cpus can get starved when there is
> > a lot of pressure on the RQs. This can result in higher latencies on
> > higher cpu-ids.
> >
> > This patch makes the algorithm more fair by forcing all paths to have
> > to release both locks before acquiring them again. Since callsites to
> > double_lock_balance already consider it a potential preemption/reschedule
> > point, they have the proper logic to recheck for atomicity violations.
> >
> > Signed-off-by: Gregory Haskins <[email protected]>
> > ---
> >
> > kernel/sched.c | 52 +++++++++++++++++++++++++++++++++++++++++++++-------
> > 1 files changed, 45 insertions(+), 7 deletions(-)
> >
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index df6b447..850b454 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -2782,21 +2782,43 @@ static void double_rq_unlock(struct rq *rq1, struct rq *rq2)
> > __release(rq2->lock);
> > }
> >
> > +#ifdef CONFIG_PREEMPT
> > +
> > /*
> > - * double_lock_balance - lock the busiest runqueue, this_rq is locked already.
> > + * fair double_lock_balance: Safely acquires both rq->locks in a fair
> > + * way at the expense of forcing extra atomic operations in all
> > + * invocations. This assures that the double_lock is acquired using the
> > + * same underlying policy as the spinlock_t on this architecture, which
> > + * reduces latency compared to the unfair variant below. However, it
> > + * also adds more overhead and therefore may reduce throughput.
> > */
> > -static int double_lock_balance(struct rq *this_rq, struct rq *busiest)
> > +static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
> > + __releases(this_rq->lock)
> > + __acquires(busiest->lock)
> > + __acquires(this_rq->lock)
> > +{
> > + spin_unlock(&this_rq->lock);
> > + double_rq_lock(this_rq, busiest);
> > +
> > + return 1;
> > +}
>
> Right - so to belabour Nick's point:
>
> if (!spin_trylock(&busiest->lock)) {
> spin_unlock(&this_rq->lock);
> double_rq_lock(this_rq, busiest);
> }
>
> might unfairly treat someone who is waiting on this_rq if I understand
> it right?
>
> I suppose one could then write it like:
>
> if (spin_is_contended(&this_rq->lock) || !spin_trylock(&busiest->lock)) {
> spin_unlock(&this_rq->lock);
> double_rq_lock(this_rq, busiest);
> }
>
> But, I'm not sure that's worth the effort at that point..

Yeah, that could work, but hmm it might cause 2 cache coherency transactions
anyway even in the fastpath, so it might even be slower than just unlocking
unconditionally and taking both locks :(


> Anyway - I think all this is utterly defeated on CONFIG_PREEMPT by the
> spin with IRQs enabled logic in kernel/spinlock.c.
>
> Making this an -rt only patch...

Hmm, and also on x86 with ticket locks we don't spin with preempt or
interrupts enabled any more (although we still do of course on other
architectures)

2008-08-27 10:41:34

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 3/6] sched: make double-lock-balance fair

On Wed, 2008-08-27 at 12:26 +0200, Nick Piggin wrote:
> On Wed, Aug 27, 2008 at 10:21:35AM +0200, Peter Zijlstra wrote:

> > I suppose one could then write it like:
> >
> > if (spin_is_contended(&this_rq->lock) || !spin_trylock(&busiest->lock)) {
> > spin_unlock(&this_rq->lock);
> > double_rq_lock(this_rq, busiest);
> > }
> >
> > But, I'm not sure that's worth the effort at that point..
>
> Yeah, that could work, but hmm it might cause 2 cache coherency transactions
> anyway even in the fastpath, so it might even be slower than just unlocking
> unconditionally and taking both locks :(

right,..

> > Anyway - I think all this is utterly defeated on CONFIG_PREEMPT by the
> > spin with IRQs enabled logic in kernel/spinlock.c.
> >
> > Making this an -rt only patch...
>
> Hmm, and also on x86 with ticket locks we don't spin with preempt or
> interrupts enabled any more (although we still do of course on other
> architectures)

Aah, we don't do CONFIG_GENERIC_LOCKBREAK anymore?

Does it make sense to make this _double_lock_balance() thing depend on
that too?

2008-08-27 10:56:26

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH v2 3/6] sched: make double-lock-balance fair

On Wed, Aug 27, 2008 at 12:41:08PM +0200, Peter Zijlstra wrote:
> On Wed, 2008-08-27 at 12:26 +0200, Nick Piggin wrote:
> > On Wed, Aug 27, 2008 at 10:21:35AM +0200, Peter Zijlstra wrote:
>
> > > I suppose one could then write it like:
> > >
> > > if (spin_is_contended(&this_rq->lock) || !spin_trylock(&busiest->lock)) {
> > > spin_unlock(&this_rq->lock);
> > > double_rq_lock(this_rq, busiest);
> > > }
> > >
> > > But, I'm not sure that's worth the effort at that point..
> >
> > Yeah, that could work, but hmm it might cause 2 cache coherency transactions
> > anyway even in the fastpath, so it might even be slower than just unlocking
> > unconditionally and taking both locks :(
>
> right,..


Although I guess we could prefetch it... But OTOH I don't know exactly
what Intel CPUs do with prefetch -- I don't think they have a prefetchw.

I would support your idea if it is faster, of course ;)


> > > Anyway - I think all this is utterly defeated on CONFIG_PREEMPT by the
> > > spin with IRQs enabled logic in kernel/spinlock.c.
> > >
> > > Making this an -rt only patch...
> >
> > Hmm, and also on x86 with ticket locks we don't spin with preempt or
> > interrupts enabled any more (although we still do of course on other
> > architectures)
>
> Aah, we don't do CONFIG_GENERIC_LOCKBREAK anymore?

Right. My reasoning said that if our critical sections are short enough,
*and not subject to starvation*, then we should not really need it, and
at any rate often it is just luck if it helps because in other cases
we might be taking the lock under an irq save region so it wouldn't help
there...


> Does it make sense to make this _double_lock_balance() thing depend on
> that too?

Hmm, you might have a good point there. Greg?

BTW. I wonder about other architectures that are of interest to -rt? Like
mips or arm perhaps... Any plans to implement ticket locks on those, or
do they not tend to be used in SMP configurations on -rt yet?

2008-08-27 10:57:42

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH v2 3/6] sched: make double-lock-balance fair

On Wed, Aug 27, 2008 at 12:56:15PM +0200, Nick Piggin wrote:
> On Wed, Aug 27, 2008 at 12:41:08PM +0200, Peter Zijlstra wrote:
> > Does it make sense to make this _double_lock_balance() thing depend on
> > that too?
>
> Hmm, you might have a good point there. Greg?
>
> BTW. I wonder about other architectures that are of interest to -rt? Like
> mips or arm perhaps... Any plans to implement ticket locks on those, or
> do they not tend to be used in SMP configurations on -rt yet?

Oh, or maybe I guess again: -rt might be using Greg's generic ticket lock
implementation?

2008-08-27 11:07:23

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 3/6] sched: make double-lock-balance fair

On Wed, 2008-08-27 at 12:56 +0200, Nick Piggin wrote:

> BTW. I wonder about other architectures that are of interest to -rt? Like
> mips or arm perhaps... Any plans to implement ticket locks on those, or
> do they not tend to be used in SMP configurations on -rt yet?

I think both are starting to show more and more SMP machines, so yes,
eventually we'll want ticket locks for them too.

That said, I'm not aware of anyone working on it - I suppose we could
poke the regular arch maintainers.. Ralf, Russell ?

2008-08-27 11:23:40

by Russell King

[permalink] [raw]
Subject: Re: [PATCH v2 3/6] sched: make double-lock-balance fair

On Wed, Aug 27, 2008 at 01:07:04PM +0200, Peter Zijlstra wrote:
> On Wed, 2008-08-27 at 12:56 +0200, Nick Piggin wrote:
>
> > BTW. I wonder about other architectures that are of interest to -rt? Like
> > mips or arm perhaps... Any plans to implement ticket locks on those, or
> > do they not tend to be used in SMP configurations on -rt yet?
>
> I think both are starting to show more and more SMP machines, so yes,
> eventually we'll want ticket locks for them too.
>
> That said, I'm not aware of anyone working on it - I suppose we could
> poke the regular arch maintainers.. Ralf, Russell ?

I've no idea; SMP on ARM as far as I've been involved has been silent
for well over a year with very little visible interest from anyone.
That's situation normal for things that "just work" though.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

2008-08-27 11:44:31

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH 3/5] sched: make double-lock-balance fair

Nick Piggin wrote:
> On Tuesday 26 August 2008 22:23, Gregory Haskins wrote:
>
>> Nick Piggin wrote:
>>
>>> On Tuesday 26 August 2008 06:15, Gregory Haskins wrote:
>>>
>>>> double_lock balance() currently favors logically lower cpus since they
>>>> often do not have to release their own lock to acquire a second lock.
>>>> The result is that logically higher cpus can get starved when there is
>>>> a lot of pressure on the RQs. This can result in higher latencies on
>>>> higher cpu-ids.
>>>>
>>>> This patch makes the algorithm more fair by forcing all paths to have
>>>> to release both locks before acquiring them again. Since callsites to
>>>> double_lock_balance already consider it a potential
>>>> preemption/reschedule point, they have the proper logic to recheck for
>>>> atomicity violations.
>>>>
>>>> Signed-off-by: Gregory Haskins <[email protected]>
>>>> ---
>>>>
>>>> kernel/sched.c | 17 +++++------------
>>>> 1 files changed, 5 insertions(+), 12 deletions(-)
>>>>
>>>> diff --git a/kernel/sched.c b/kernel/sched.c
>>>> index 6e0bde6..b7326cd 100644
>>>> --- a/kernel/sched.c
>>>> +++ b/kernel/sched.c
>>>> @@ -2790,23 +2790,16 @@ static int double_lock_balance(struct rq
>>>> *this_rq, struct rq *busiest) __acquires(busiest->lock)
>>>> __acquires(this_rq->lock)
>>>> {
>>>> - int ret = 0;
>>>> -
>>>> if (unlikely(!irqs_disabled())) {
>>>> /* printk() doesn't work good under rq->lock */
>>>> spin_unlock(&this_rq->lock);
>>>> BUG_ON(1);
>>>> }
>>>> - if (unlikely(!spin_trylock(&busiest->lock))) {
>>>> - if (busiest < this_rq) {
>>>> - spin_unlock(&this_rq->lock);
>>>> - spin_lock(&busiest->lock);
>>>> - spin_lock_nested(&this_rq->lock, SINGLE_DEPTH_NESTING);
>>>> - ret = 1;
>>>> - } else
>>>> - spin_lock_nested(&busiest->lock, SINGLE_DEPTH_NESTING);
>>>> - }
>>>> - return ret;
>>>> +
>>>> + spin_unlock(&this_rq->lock);
>>>> + double_rq_lock(this_rq, busiest);
>>>>
>>> Rather than adding the extra atomic operation, can't you just put this
>>> into the unlikely spin_trylock failure path rather than the unfair logic
>>> there?
>>>
>> The trick is that we *must* first release this_rq before proceeding or
>> the new proposal doesn't work as intended. This patch effectively
>> breaks up the this_rq->lock critical section evenly across all CPUs as
>> if it hit the case common for higher cpus.
>>
>
> I don't exactly see why my proposal would introduce any more latency, because
> we only trylock while holding the existing lock -- this will only ever add
> a small ~constant time to the critical section, regardless of whether it is a
> high or low CPU runqueue.
>

Its because we are trying to create a break in the critical section of
this_rq->lock, not improve the acquisition of busiest->lock. So whether
you do spin_lock or spin_trylock on busiest does not matter. Busiest
will not be contended in the case that I am concerned with. If you use
my example below: rq[N] will not be contended because cpuN is blocked on
rq[0] after already having released rq[N]. So its the contention
against this_rq that is the problem.

Or am I missing your point completely?


>
>
>> This modification decreased
>> latency by over 800% (went from > 400us to < 50us) on cpus 6 and 7 in my
>> 8-way box namely because they were not forced to wait for all the other
>> lower cores to finish, but rather completions of double_lock_balance
>> were handled in true FIFO w.r.t. to other calls to
>> double_lock_balance(). It has to do with the positioning within your
>> FIFO ticket locks (though even if ticket locks are not present on a
>> given architecture we should still see an improvement.)
>>
>> When a low cpu wants to double lock, it tends to hold this_rq and gets
>> in line for busiest_rq with no bearing on how long it held this_rq.
>> Therefore the following scenario can occur:
>>
>> cpu 0 cpu N
>> ----------------------------------
>> rq[0] locked
>> ..
>> ..
>> ..
>> double_lock(N, 0)
>> rq[N] released
>> blocked on rq[0]
>> ..
>> ..
>> ..
>> ..
>> double_lock(0, N)
>> rq[N] locked
>> double_lock returns
>> ..
>> ..
>> ..
>> ..
>> rq[0] released rq[0] locked
>> double_lock returns
>> ...
>> ...
>> ...
>>
>> ---------------------------------
>>
>> So double lock acquisition favors the lower cpus unfairly. They will
>> always win, even if they were not first. Now with the combination of my
>> patch plus your ticket locks, entry into the double lock becomes FIFO
>> because the "blocked on rq[0]" would have inserted it in the
>> time-ordered head of rq[0].
>>
>
> Right, but I don't think it is particularly wrong to allow a given
> CPU to double_lock_balance ahead of another guy if we're already holding
> the lock.

Its not "wrong". Its just a latency source ;)

> _So long as_ the lock we are trying to acquire is uncontended,
> and we don't introduce this skewed unfairness due to lower CPUs being
> allowed to hold their lower lock while higher CPUs have to release their
> lock and first queue on the lower.
>
> The difference is that with my patch, there is a small window where the
> guy who asks for the double lock first will go through second. I don't
> think this really adds a fundamental amount of latency, and the
> performance benefit should not be ignored.
>
> Linux's traditional and, I suppose, by far largest user base does not require
> realtime or really strict fairness, so IMO it is always questionable to
> make changes like this.
>
Please take a look at the v2 series that I sent out yesterday. I have
now predicated this on CONFIG_PREEMPT, per your comments.

-Greg



2008-08-27 11:53:28

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH 2/5] sched: pull only one task during NEWIDLE balancing to limit critical section

Nick Piggin wrote:
> On Tuesday 26 August 2008 21:36, Gregory Haskins wrote:
>
>> Nick Piggin wrote:
>>
>>> On Tuesday 26 August 2008 06:15, Gregory Haskins wrote:
>>>
>>>> git-id c4acb2c0669c5c5c9b28e9d02a34b5c67edf7092 attempted to limit
>>>> newidle critical section length by stopping after at least one task
>>>> was moved. Further investigation has shown that there are other
>>>> paths nested further inside the algorithm which still remain that allow
>>>> long latencies to occur with newidle balancing. This patch applies
>>>> the same technique inside balance_tasks() to limit the duration of
>>>> this optional balancing operation.
>>>>
>>>> Signed-off-by: Gregory Haskins <[email protected]>
>>>> CC: Nick Piggin <[email protected]>
>>>>
>>> Hmm, this (and c4acb2c0669c5c5c9b28e9d02a34b5c67edf7092) still could
>>> increase the amount of work to do significantly for workloads where
>>> the CPU is going idle and pulling tasks over frequently. I don't
>>> really like either of them too much.
>>>
>> I had a feeling you may object to this patch based on your comments on
>> the first one. Thats why I CC'd you so you wouldnt think I was trying
>> to sneak something past ;)
>>
>
> Appreciated.
>
>
>
>>> Maybe increasing the limit would effectively amortize most of the
>>> problem (say, limit to move 16 tasks at most).
>>>
>> The problem I was seeing was that even moving 2 was too many in the
>> ftraces traces I looked at. I think the idea of making a variable limit
>> (set via a sysctl, etc) here is a good one, but I would recommend we
>> have the default be "1" for CONFIG_PREEMPT (or at least
>> CONFIG_PREEMPT_RT) based on what I know right now. I know last time
>> you objected to any kind of special cases for the preemptible kernels,
>> but I think this is a good compromise. Would this be acceptable?
>>
>
> Well I _prefer_ not to have a special case for preemptible kernels, but
> we already have similar arbitrary kind of changes like in tlb flushing,
> so...
>
> I understand and accept there are some places where fundamentally you
> have to trade latency for throughput, so at some point we have to have a
> config and/or sysctl for that.
>
> I'm surprised 2 is too much but 1 is OK. Seems pretty fragile to me.

Its not that 1 is magically "ok". Its simply that newidle balancing
hurts latency, and 1 is the minimum to pull to reasonably reduce the
critical section. I already check if we NEEDS_RESCHED before taking the
rq->lock in newidle, so waiting for one task to pull is the first
opportunity I have to end the section as quickly as possible. It would
be nice if I could just keep going if I could detect whether there was
not any real contention. Let me give this angle some more thought.

> Are
> you just running insane tests that load up the runqueues heaps and tests
> latency? -rt users will have to understand that some algorithms scale
> linearly or so with the number of a particular resource allocated, so
> they aren't going to get a constant low latency under arbitrary
> conditions.
>
> FWIW, if you haven't already, then for -rt you might want to look at a
> more advanced data structure than simple run ordered list for moving tasks
> from one rq to the other. A simple one I was looking at is a time ordered
> list to pull the most cache cold tasks (and thus we can stop searching
> when we encounter the first cache hot task, in situations where it is
> appropriate, etc).
>

I'm not sure I follow your point, but if I do: note that the RT scheduler
uses a completely different load balancer (that is priority ordered).


> Anyway... yeah I'm OK with this if it is under a config option.
>

Cool.. See v2 ;)

Thanks Nick,
-Greg
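
[ As an aside, the "time ordered list" idea floated by Nick amounts to
roughly the following. Everything here is purely illustrative -- the
'cold_entry' and 'last_ran' fields are hypothetical and no such patch was
posted; the point is only that a list kept sorted by last-ran time lets a
balancer walk from the cache-coldest end and stop at the first task that
is still cache hot. ]

/* assumes the list is sorted with the least recently run task first */
static struct task_struct *
pick_cache_cold_task(struct list_head *tasks_by_age, int dst_cpu,
		     u64 now, u64 hot_ns)
{
	struct task_struct *p;

	list_for_each_entry(p, tasks_by_age, cold_entry) {
		if (now - p->last_ran < hot_ns)
			break;		/* sorted: everything later is hotter */

		if (cpu_isset(dst_cpu, p->cpus_allowed))
			return p;	/* coldest task we are allowed to migrate */
	}

	return NULL;
}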




2008-08-27 11:53:52

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 3/5] sched: make double-lock-balance fair

On Wednesday 27 August 2008 21:41, Gregory Haskins wrote:
> Nick Piggin wrote:
> > On Tuesday 26 August 2008 22:23, Gregory Haskins wrote:
> >> Nick Piggin wrote:
> >>> On Tuesday 26 August 2008 06:15, Gregory Haskins wrote:
> >>>> double_lock balance() currently favors logically lower cpus since they
> >>>> often do not have to release their own lock to acquire a second lock.
> >>>> The result is that logically higher cpus can get starved when there is
> >>>> a lot of pressure on the RQs. This can result in higher latencies on
> >>>> higher cpu-ids.
> >>>>
> >>>> This patch makes the algorithm more fair by forcing all paths to have
> >>>> to release both locks before acquiring them again. Since callsites to
> >>>> double_lock_balance already consider it a potential
> >>>> preemption/reschedule point, they have the proper logic to recheck for
> >>>> atomicity violations.
> >>>>
> >>>> Signed-off-by: Gregory Haskins <[email protected]>
> >>>> ---
> >>>>
> >>>> kernel/sched.c | 17 +++++------------
> >>>> 1 files changed, 5 insertions(+), 12 deletions(-)
> >>>>
> >>>> diff --git a/kernel/sched.c b/kernel/sched.c
> >>>> index 6e0bde6..b7326cd 100644
> >>>> --- a/kernel/sched.c
> >>>> +++ b/kernel/sched.c
> >>>> @@ -2790,23 +2790,16 @@ static int double_lock_balance(struct rq
> >>>> *this_rq, struct rq *busiest) __acquires(busiest->lock)
> >>>> __acquires(this_rq->lock)
> >>>> {
> >>>> - int ret = 0;
> >>>> -
> >>>> if (unlikely(!irqs_disabled())) {
> >>>> /* printk() doesn't work good under rq->lock */
> >>>> spin_unlock(&this_rq->lock);
> >>>> BUG_ON(1);
> >>>> }
> >>>> - if (unlikely(!spin_trylock(&busiest->lock))) {
> >>>> - if (busiest < this_rq) {
> >>>> - spin_unlock(&this_rq->lock);
> >>>> - spin_lock(&busiest->lock);
> >>>> - spin_lock_nested(&this_rq->lock, SINGLE_DEPTH_NESTING);
> >>>> - ret = 1;
> >>>> - } else
> >>>> - spin_lock_nested(&busiest->lock, SINGLE_DEPTH_NESTING);
> >>>> - }
> >>>> - return ret;
> >>>> +
> >>>> + spin_unlock(&this_rq->lock);
> >>>> + double_rq_lock(this_rq, busiest);
> >>>
> >>> Rather than adding the extra atomic operation, can't you just put this
> >>> into the unlikely spin_trylock failure path rather than the unfair
> >>> logic there?
> >>
> >> The trick is that we *must* first release this_rq before proceeding or
> >> the new proposal doesn't work as intended. This patch effectively
> >> breaks up the this_rq->lock critical section evenly across all CPUs as
> >> if it hit the case common for higher cpus.
> >
> > I don't exactly see why my proposal would introduce any more latency,
> > because we only trylock while holding the existing lock -- this will
> > only ever add a small ~constant time to the critical section, regardless
> > of whether it is a high or low CPU runqueue.
>
> Its because we are trying to create a break in the critical section of
> this_rq->lock, not improve the acquisition of busiest->lock. So whether
> you do spin_lock or spin_trylock on busiest does not matter. Busiest
> will not be contended in the case that I am concerned with. If you use
> my example below: rq[N] will not be contended because cpuN is blocked on
> rq[0] after already having released rq[N]. So its the contention
> against this_rq that is the problem.
>
> Or am I missing your point completely?

Well my point is just that after my change, there may be some windows
of "unfair" behaviour where one CPU gets to go ahead early, but AFAIKS
now there is no systemic bias against higher numbered CPUs (except the
small issue of taking lower-numbered locks first which is also present
in any solution).

As such, I would actually like to see my proposal implemented in the
!lowlatency case as well... unless my reasoning has a hole in it?

But if you are _also_ wanting to eliminate the long lock hold time and
strictly improve fairness for lowlatency kernels, then I agree that your
patch goes much further than mine, so I don't object to putting that
under CONFIG_PREEMPT.


> >> This modification decreased
> >> latency by over 800% (went from > 400us to < 50us) on cpus 6 and 7 in my
> >> 8-way box namely because they were not forced to wait for all the other
> >> lower cores to finish, but rather completions of double_lock_balance
> >> were handled in true FIFO w.r.t. to other calls to
> >> double_lock_balance(). It has to do with the positioning within your
> >> FIFO ticket locks (though even if ticket locks are not present on a
> >> given architecture we should still see an improvement.)
> >>
> >> When a low cpu wants to double lock, it tends to hold this_rq and gets
> >> in line for busiest_rq with no bearing on how long it held this_rq.
> >> Therefore the following scenario can occur:
> >>
> >> cpu 0 cpu N
> >> ----------------------------------
> >> rq[0] locked
> >> ..
> >> ..
> >> ..
> >> double_lock(N, 0)
> >> rq[N] released
> >> blocked on rq[0]
> >> ..
> >> ..
> >> ..
> >> ..
> >> double_lock(0, N)
> >> rq[N] locked
> >> double_lock returns
> >> ..
> >> ..
> >> ..
> >> ..
> >> rq[0] released rq[0] locked
> >> double_lock returns
> >> ...
> >> ...
> >> ...
> >>
> >> ---------------------------------
> >>
> >> So double lock acquisition favors the lower cpus unfairly. They will
> >> always win, even if they were not first. Now with the combination of my
> >> patch plus your ticket locks, entry into the double lock becomes FIFO
> >> because the "blocked on rq[0]" would have inserted it in the
> >> time-ordered head of rq[0].
> >
> > Right, but I don't think it is particularly wrong to allow a given
> > CPU to double_lock_balance ahead of another guy if we're already holding
> > the lock.
>
> Its not "wrong". Its just a latency source ;)

OK, sure.


> > _So long as_ the lock we are trying to acquire is uncontended,
> > and we don't introduce this skewed unfairness due to lower CPUs being
> > allowed to hold their lower lock while higher CPUs have to release their
> > lock and first queue on the lower.
> >
> > The difference is that with my patch, there is a small window where the
> > guy who asks for the double lock first will go through second. I don't
> > think this really adds a fundamental amount of latency, and the
> > performance benefit should not be ignored.
> >
> > Linux's traditional and, I suppose, by far largest user base does not require
> > realtime or really strict fairness, so IMO it is always questionable to
> > make changes like this.
>
> Please take a look at the v2 series that I sent out yesterday. I have
> now predicated this on CONFIG_PREEMPT, per your comments.

Those patches look pretty OK with me now.

2008-08-27 11:57:52

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 2/5] sched: pull only one task during NEWIDLE balancing to limit critical section

On Wednesday 27 August 2008 21:50, Gregory Haskins wrote:
> Nick Piggin wrote:

> > I'm surprised 2 is too much but 1 is OK. Seems pretty fragile to me.
>
> Its not that 1 is magically "ok". Its simply that newidle balancing
> hurts latency, and 1 is the minimum to pull to reasonably reduce the
> critical section. I already check if we NEEDS_RESCHED before taking the
> rq->lock in newidle, so waiting for one task to pull is the first
> opportunity I have to end the section as quickly as possible. It would
> be nice if I could just keep going if I could detect whether there was
> not any real contention. Let me give this angle some more thought.

OK. Beware of introducing more cache coherency transactions, branches,
icache, etc. If you are already resigned to put some special cases under
CONFIG_PREEMPT, I think there is a lot to say for keeping the logic
really simple and correct even if there is a small expense to performance.

But of course I never object to speedups ;)


> > FWIW, if you haven't already, then for -rt you might want to look at a
> > more advanced data structure than simple run ordered list for moving
> > tasks from one rq to the other. A simple one I was looking at is a time
> > ordered list to pull the most cache cold tasks (and thus we can stop
> > searching when we encounter the first cache hot task, in situations where
> > it is appropriate, etc).
>
>> I'm not sure I follow your point, but if I do: note that the RT scheduler
> uses a completely different load balancer (that is priority ordered).

You still use the normal scheduler for other tasks, though? At any
rate, no I haven't looked closely at the scheduler for a while, so
I'm quite likely to be talking nonsense.

2008-08-27 12:02:35

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH v2 3/6] sched: make double-lock-balance fair

Peter Zijlstra wrote:
> On Wed, 2008-08-27 at 12:56 +0200, Nick Piggin wrote:
>
>
>> BTW. I wonder about other architectures that are of interest to -rt? Like
>> mips or arm perhaps... Any plans to implement ticket locks on those, or
>> do they not tend to be used in SMP configurations on -rt yet?
>>
>
> I think both are starting to show more and more SMP machines, so yes,
> eventually we'll want ticket locks for them too.
>
> That said, I'm not aware of anyone working on it - I suppose we could
> poke the regular arch maintainers.. Ralf, Russell ?
>
>
As an aside: I have an implementation of C-based ticket lock
replacements for raw_spinlock_t that I could post. I never posted them
because I mostly care about x86 and Nick's solution is superior there
since its hand-tuned asm. But mine could be used in places that have
not yet written arch support but need FIFO. I basically have it as a
config option to enable "generic fifo spinlocks" or something like that.

-Greg
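
[ For reference, the general shape of a C-only FIFO ticket lock of the
kind being described; this is a generic illustration with made-up names,
not the actual patch Greg is referring to. ]

typedef struct {
	atomic_t next;		/* next ticket to hand out */
	atomic_t owner;		/* ticket currently being served */
} fifo_spinlock_t;

static inline void fifo_spin_lock(fifo_spinlock_t *lock)
{
	/* atomic_add_return() implies a full barrier; take my ticket */
	int ticket = atomic_add_return(1, &lock->next) - 1;

	while (atomic_read(&lock->owner) != ticket)
		cpu_relax();

	smp_mb();	/* keep the critical section after the acquire */
}

static inline void fifo_spin_unlock(fifo_spinlock_t *lock)
{
	smp_mb();	/* flush the critical section before handing off */
	atomic_inc(&lock->owner);
}

Each acquirer gets a strictly increasing ticket, so waiters are served in
arrival order -- the FIFO property that makes double_lock_balance() behave
as described earlier in the thread.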



2008-08-27 12:05:07

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH v2 3/6] sched: make double-lock-balance fair

Peter Zijlstra wrote:
> On Tue, 2008-08-26 at 13:35 -0400, Gregory Haskins wrote:
>
>> double_lock balance() currently favors logically lower cpus since they
>> often do not have to release their own lock to acquire a second lock.
>> The result is that logically higher cpus can get starved when there is
>> a lot of pressure on the RQs. This can result in higher latencies on
>> higher cpu-ids.
>>
>> This patch makes the algorithm more fair by forcing all paths to have
>> to release both locks before acquiring them again. Since callsites to
>> double_lock_balance already consider it a potential preemption/reschedule
>> point, they have the proper logic to recheck for atomicity violations.
>>
>> Signed-off-by: Gregory Haskins <[email protected]>
>> ---
>>
>> kernel/sched.c | 52 +++++++++++++++++++++++++++++++++++++++++++++-------
>> 1 files changed, 45 insertions(+), 7 deletions(-)
>>
>> diff --git a/kernel/sched.c b/kernel/sched.c
>> index df6b447..850b454 100644
>> --- a/kernel/sched.c
>> +++ b/kernel/sched.c
>> @@ -2782,21 +2782,43 @@ static void double_rq_unlock(struct rq *rq1, struct rq *rq2)
>> __release(rq2->lock);
>> }
>>
>> +#ifdef CONFIG_PREEMPT
>> +
>> /*
>> - * double_lock_balance - lock the busiest runqueue, this_rq is locked already.
>> + * fair double_lock_balance: Safely acquires both rq->locks in a fair
>> + * way at the expense of forcing extra atomic operations in all
>> + * invocations. This assures that the double_lock is acquired using the
>> + * same underlying policy as the spinlock_t on this architecture, which
>> + * reduces latency compared to the unfair variant below. However, it
>> + * also adds more overhead and therefore may reduce throughput.
>> */
>> -static int double_lock_balance(struct rq *this_rq, struct rq *busiest)
>> +static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
>> + __releases(this_rq->lock)
>> + __acquires(busiest->lock)
>> + __acquires(this_rq->lock)
>> +{
>> + spin_unlock(&this_rq->lock);
>> + double_rq_lock(this_rq, busiest);
>> +
>> + return 1;
>> +}
>>
>
> Right - so to belabour Nick's point:
>
> if (!spin_trylock(&busiest->lock)) {
> spin_unlock(&this_rq->lock);
> double_rq_lock(this_rq, busiest);
> }
>
> might unfairly treat someone who is waiting on this_rq if I understand
> it right?
>
> I suppose one could then write it like:
>
> if (spin_is_contended(&this_rq->lock) || !spin_trylock(&busiest->lock)) {
> spin_unlock(&this_rq->lock);
> double_rq_lock(this_rq, busiest);
> }
>

Indeed. This does get to the heart of the problem: contention against
this_rq->lock.

> But, I'm not sure that's worth the effort at that point..
>
> Anyway - I think all this is utterly defeated on CONFIG_PREEMPT by the
> spin with IRQs enabled logic in kernel/spinlock.c.
>

I submitted some patches related to this a while back. The gist of it
is that the presence of ticketlocks for a given config *should* defeat
the preemptible version of the generic spinlocks or there is no point.
Let me see if I can dig it up.

-Greg




2008-08-27 12:06:16

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH v2 3/6] sched: make double-lock-balance fair

Nick Piggin wrote:
> On Wed, Aug 27, 2008 at 12:56:15PM +0200, Nick Piggin wrote:
>
>> On Wed, Aug 27, 2008 at 12:41:08PM +0200, Peter Zijlstra wrote:
>>
>>> Does it make sense to make this _double_lock_balance() thing depend on
>>> that too?
>>>
>> Hmm, you might have a good point there. Greg?
>>
>> BTW. I wonder about other architectures that are of interest to -rt? Like
>> mips or arm perhaps... Any plans to implement ticket locks on those, or
>> do they not tend to be used in SMP configurations on -rt yet?
>>
>
> Oh, or maybe I guess again: -rt might be using Greg's generic ticket lock
> implementation?
>
I never submitted them since yours were better for x86. But I can dust
them off and post

-Greg



2008-08-27 12:13:18

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH 3/5] sched: make double-lock-balance fair

Nick Piggin wrote:
> On Wednesday 27 August 2008 21:41, Gregory Haskins wrote:
>
>> Nick Piggin wrote:
>>
>>> On Tuesday 26 August 2008 22:23, Gregory Haskins wrote:
>>>
>>>> Nick Piggin wrote:
>>>>
>>>>> On Tuesday 26 August 2008 06:15, Gregory Haskins wrote:
>>>>>
>>>>>> double_lock balance() currently favors logically lower cpus since they
>>>>>> often do not have to release their own lock to acquire a second lock.
>>>>>> The result is that logically higher cpus can get starved when there is
>>>>>> a lot of pressure on the RQs. This can result in higher latencies on
>>>>>> higher cpu-ids.
>>>>>>
>>>>>> This patch makes the algorithm more fair by forcing all paths to have
>>>>>> to release both locks before acquiring them again. Since callsites to
>>>>>> double_lock_balance already consider it a potential
>>>>>> preemption/reschedule point, they have the proper logic to recheck for
>>>>>> atomicity violations.
>>>>>>
>>>>>> Signed-off-by: Gregory Haskins <[email protected]>
>>>>>> ---
>>>>>>
>>>>>> kernel/sched.c | 17 +++++------------
>>>>>> 1 files changed, 5 insertions(+), 12 deletions(-)
>>>>>>
>>>>>> diff --git a/kernel/sched.c b/kernel/sched.c
>>>>>> index 6e0bde6..b7326cd 100644
>>>>>> --- a/kernel/sched.c
>>>>>> +++ b/kernel/sched.c
>>>>>> @@ -2790,23 +2790,16 @@ static int double_lock_balance(struct rq
>>>>>> *this_rq, struct rq *busiest) __acquires(busiest->lock)
>>>>>> __acquires(this_rq->lock)
>>>>>> {
>>>>>> - int ret = 0;
>>>>>> -
>>>>>> if (unlikely(!irqs_disabled())) {
>>>>>> /* printk() doesn't work good under rq->lock */
>>>>>> spin_unlock(&this_rq->lock);
>>>>>> BUG_ON(1);
>>>>>> }
>>>>>> - if (unlikely(!spin_trylock(&busiest->lock))) {
>>>>>> - if (busiest < this_rq) {
>>>>>> - spin_unlock(&this_rq->lock);
>>>>>> - spin_lock(&busiest->lock);
>>>>>> - spin_lock_nested(&this_rq->lock, SINGLE_DEPTH_NESTING);
>>>>>> - ret = 1;
>>>>>> - } else
>>>>>> - spin_lock_nested(&busiest->lock, SINGLE_DEPTH_NESTING);
>>>>>> - }
>>>>>> - return ret;
>>>>>> +
>>>>>> + spin_unlock(&this_rq->lock);
>>>>>> + double_rq_lock(this_rq, busiest);
>>>>>>
>>>>> Rather than adding the extra atomic operation, can't you just put this
>>>>> into the unlikely spin_trylock failure path rather than the unfair
>>>>> logic there?
>>>>>
>>>> The trick is that we *must* first release this_rq before proceeding or
>>>> the new proposal doesn't work as intended. This patch effectively
>>>> breaks up the this_rq->lock critical section evenly across all CPUs as
>>>> if it hit the case common for higher cpus.
>>>>
>>> I don't exactly see why my proposal would introduce any more latency,
>>> because we only trylock while holding the existing lock -- this will
>>> only ever add a small ~constant time to the critical section, regardless
>>> of whether it is a high or low CPU runqueue.
>>>
>> Its because we are trying to create a break in the critical section of
>> this_rq->lock, not improve the acquisition of busiest->lock. So whether
>> you do spin_lock or spin_trylock on busiest does not matter. Busiest
>> will not be contended in the case that I am concerned with. If you use
>> my example below: rq[N] will not be contended because cpuN is blocked on
>> rq[0] after already having released rq[N]. So its the contention
>> against this_rq that is the problem.
>>
>> Or am I missing your point completely?
>>
>
> Well my point is just that after my change, there may be some windows
> of "unfair" behaviour where one CPU gets to go ahead early, but AFAIKS
> now there is no systemic bias against higher numbered CPUs (except the
> small issue of taking lower-numbered locks first which is also present
> in any solution).
>
> As such, I would actually like to see my proposal implemented in the
> !lowlatency case as well... unless my reasoning has a hole in it?
>
> But if you are _also_ wanting to eliminate the long lock hold time and
> strictly improve fairness for lowlatency kernels, then I agree that your
> patch goes much further than mine, so I don't object to putting that
> under CONFIG_PREEMPT.
>

Ok, I understand what you are saying now, and that makes sense.

-Greg



2008-08-27 12:16:11

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH v2 3/6] sched: make double-lock-balance fair

Nick Piggin wrote:
> On Wed, Aug 27, 2008 at 12:41:08PM +0200, Peter Zijlstra wrote:
>
>
>> Does it make sense to make this _double_lock_balance() thing depend on
>> that too?
>>
>
> Hmm, you might have a good point there. Greg?
>

This makes sense to me.




2008-08-29 13:18:50

by Ralf Baechle

[permalink] [raw]
Subject: Re: [PATCH v2 3/6] sched: make double-lock-balance fair

On Wed, Aug 27, 2008 at 01:07:04PM +0200, Peter Zijlstra wrote:

> > BTW. I wonder about other architectures that are of interest to -rt? Like
> > mips or arm perhaps... Any plans to implement ticket locks on those, or
> > do they not tend to be used in SMP configurations on -rt yet?
>
> I think both are starting to show more and more SMP machines, so yes,
> eventually we'll want ticket locks for them too.
>
> That said, I'm not aware of anyone working on it - I suppose we could
> poke the regular arch maintainers.. Ralf, Russell ?

I looked into this a bit and we definitely need it. One large MIPS user
had already converted his kernel to ticket locks due to an excessive
number of cacheline bounces having been observed.

Ralf

2008-08-29 13:26:46

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH v2 6/6] sched: create "pushable_tasks" list to limit pushing to one attempt

Hi Ingo,

Gregory Haskins wrote:
> The RT scheduler employs a "push/pull" design to actively balance tasks
> within the system (on a per disjoint cpuset basis). When a task is
> awoken, it is immediately determined if there are any lower priority
> cpus which should be preempted. This is opposed to the way normal
> SCHED_OTHER tasks behave, which will wait for a periodic rebalancing
> operation to occur before spreading out load.
>
> When a particular RQ has more than 1 active RT task, it is said to
> be in an "overloaded" state. Once this occurs, the system enters
> the active balancing mode, where it will try to push the task away,
> or persuade a different cpu to pull it over. The system will stay
> in this state until the system falls back below the <= 1 queued RT
> task per RQ.
>
> However, the current implementation suffers from a limitation in the
> push logic. Once overloaded, all tasks (other than current) on the
> RQ are analyzed on every push operation, even if they were previously
> found unpushable (due to affinity, etc). What's more, the operation stops
> at the first task that is unpushable and will not look at items
> lower in the queue. This causes two problems:
>
> 1) We can have the same tasks analyzed over and over again during each
> push, which extends out the fast path in the scheduler for no
> gain. Consider a RQ that has dozens of tasks that are bound to a
> core. Each one of those tasks will be encountered and skipped
> for each push operation while they are queued.
>
> 2) There may be lower-priority tasks under the unpushable task that
> could have been successfully pushed, but will never be considered
> until either the unpushable task is cleared, or a pull operation
> succeeds. The net result is a potential latency source for mid
> priority tasks.
>
> This patch aims to rectify these two conditions by introducing a new
> priority sorted list: "pushable_tasks". A task is added to the list
> each time a task is activated or preempted. It is removed from the
> list any time it is deactivated, made current, or fails to push.
>
> This works because a task only needs to be attempted to push once.
> After an initial failure to push, the other cpus will eventually try to
> pull the task when the conditions are proper. This also solves the
> problem that we don't completely analyze all tasks due to encountering
> an unpushable task. Now every task will have a push attempted (when
> appropriate).
>
> This reduces latency both by shortening the critical section of
> rq->lock for certain workloads, and by making sure the algorithm
> considers all eligible tasks in the system.
>
> Signed-off-by: Gregory Haskins <[email protected]>
> CC: Steven Rostedt <[email protected]>
>

We have found a crash bug in our lab that is caused by this
patch. So if you had any plans to pull this series in, please skip this
last patch (6/6) for now. I will send out a refresh after I fix the issue.

Thanks,
-Greg
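
[ For readers following the changelog above: the bookkeeping it describes
boils down to roughly the plist usage below. This is a sketch of the idea,
not the patch itself, and details such as where the hooks are called from
are omitted. ]

/*
 * struct rt_rq grows a priority-sorted "pushable_tasks" plist_head and
 * struct task_struct grows a matching plist_node, so the push path only
 * ever walks tasks that have not yet failed a push attempt.
 */
static void enqueue_pushable_task(struct rq *rq, struct task_struct *p)
{
	plist_del(&p->pushable_tasks, &rq->rt.pushable_tasks);
	plist_node_init(&p->pushable_tasks, p->prio);
	plist_add(&p->pushable_tasks, &rq->rt.pushable_tasks);
}

static void dequeue_pushable_task(struct rq *rq, struct task_struct *p)
{
	plist_del(&p->pushable_tasks, &rq->rt.pushable_tasks);
}

static struct task_struct *pick_next_pushable_task(struct rq *rq)
{
	struct plist_node *n;

	if (plist_head_empty(&rq->rt.pushable_tasks))
		return NULL;

	n = plist_first(&rq->rt.pushable_tasks);
	return container_of(n, struct task_struct, pushable_tasks);
}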


