2011-03-01 23:34:24

by Venkatesh Pallipadi

Subject: [PATCH] sched: next buddy hint on sleep and preempt path

When a task in a taskgroup sleeps, pick_next_task starts all the way back at
the root and picks the task/taskgroup with the min vruntime across all
runnable tasks. But, when there are many frequently sleeping tasks
across different taskgroups, it makes better sense to stay with the same
taskgroup for its slice period (or until all tasks in the taskgroup sleep)
instead of switching across taskgroups on each sleep after a short runtime.
This helps specifically where taskgroups correspond to a process with
multiple threads. The change reduces the number of CR3 switches in this case.

Example:
Two taskgroups with 2 threads each, where each thread runs for 2ms and then
sleeps for 1ms. Looking at the sched:sched_switch tracepoint shows -

BEFORE: taskgroup_1 threads [5004, 5005], taskgroup_2 threads [5016, 5017]
cpu-soaker-5004 [003] 3683.391089
cpu-soaker-5016 [003] 3683.393106
cpu-soaker-5005 [003] 3683.395119
cpu-soaker-5017 [003] 3683.397130
cpu-soaker-5004 [003] 3683.399143
cpu-soaker-5016 [003] 3683.401155
cpu-soaker-5005 [003] 3683.403168
cpu-soaker-5017 [003] 3683.405170

AFTER: taskgroup_1 threads [21890, 21891], taskgroup_2 threads [21934, 21935]
cpu-soaker-21890 [003] 865.895494
cpu-soaker-21935 [003] 865.897506
cpu-soaker-21934 [003] 865.899520
cpu-soaker-21935 [003] 865.901532
cpu-soaker-21934 [003] 865.903543
cpu-soaker-21935 [003] 865.905546
cpu-soaker-21891 [003] 865.907548
cpu-soaker-21890 [003] 865.909560
cpu-soaker-21891 [003] 865.911571
cpu-soaker-21890 [003] 865.913582
cpu-soaker-21891 [003] 865.915594
cpu-soaker-21934 [003] 865.917606
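
For reference, a minimal sketch of one such cpu-soaker thread (this is an
assumption about the test program, which is not included in this thread):
each thread burns CPU for ~2ms, then sleeps ~1ms; two such threads per
process, with each process placed in its own cpu cgroup.

#include <pthread.h>
#include <time.h>
#include <unistd.h>

static long long now_ns(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Burn CPU for ~2ms, then sleep ~1ms, forever. */
static void *soaker(void *arg)
{
        for (;;) {
                long long start = now_ns();

                while (now_ns() - start < 2000000LL)
                        ;               /* soak the CPU */
                usleep(1000);           /* voluntary sleep */
        }
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;

        pthread_create(&t1, NULL, soaker, NULL);
        pthread_create(&t2, NULL, soaker, NULL);
        pthread_join(t1, NULL);
        return 0;
}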

A similar problem exists when there are multiple taskgroups and, say, a task A
preempts the currently running task B of taskgroup_1. On the next schedule,
pick_next_task can pick an unrelated task from taskgroup_2. Here it would be
better to give some preference to task B in pick_next_task.

A simple (maybe extreme) benchmark I tried was tbench with 2 tbench client
processes, each with 2 threads, running on a single CPU. Average throughput
across five 50-second runs was -
BEFORE: 105.84 MB/sec
AFTER: 112.42 MB/sec

Signed-off-by: Venkatesh Pallipadi <[email protected]>
---
kernel/sched_fair.c | 20 ++++++++++++++++++--
1 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 3a88dee..36e8f02 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1339,6 +1339,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
hrtick_update(rq);
}

+static void set_next_buddy(struct sched_entity *se);
+
/*
* The dequeue_task method is called before nr_running is
* decreased. We remove the task from the rbtree and
@@ -1348,14 +1350,22 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
+ int task_flags = flags;

for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
dequeue_entity(cfs_rq, se, flags);

/* Don't dequeue parent if it has other entities besides us */
- if (cfs_rq->load.weight)
+ if (cfs_rq->load.weight) {
+ /*
+ * Bias pick_next to pick a task from this cfs_rq, as
+ * p is sleeping when it is within its sched_slice.
+ */
+ if (task_flags & DEQUEUE_SLEEP && se->parent)
+ set_next_buddy(se->parent);
break;
+ }
flags |= DEQUEUE_SLEEP;
}

@@ -1887,8 +1897,14 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
update_curr(cfs_rq);
find_matching_se(&se, &pse);
BUG_ON(!pse);
- if (wakeup_preempt_entity(se, pse) == 1)
+ if (wakeup_preempt_entity(se, pse) == 1) {
+ /*
+ * Bias pick_next to pick the sched entity that is
+ * triggering this preemption.
+ */
+ set_next_buddy(pse);
goto preempt;
+ }

return;

--
1.7.3.1


2011-03-02 02:44:30

by Rik van Riel

Subject: Re: [PATCH] sched: next buddy hint on sleep and preempt path

On 03/01/2011 06:33 PM, Venkatesh Pallipadi wrote:
> When a task in a taskgroup sleeps, pick_next_task starts all the way back at
> the root and picks the task/taskgroup with the min vruntime across all
> runnable tasks. But, when there are many frequently sleeping tasks
> across different taskgroups, it makes better sense to stay with the same
> taskgroup for its slice period (or until all tasks in the taskgroup sleep)
> instead of switching across taskgroups on each sleep after a short runtime.
> This helps specifically where taskgroups correspond to a process with
> multiple threads. The change reduces the number of CR3 switches in this case.

Nice!

> Signed-off-by: Venkatesh Pallipadi<[email protected]>

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed

2011-03-02 05:44:32

by Paul Turner

Subject: Re: [PATCH] sched: next buddy hint on sleep and preempt path

On Tue, Mar 1, 2011 at 3:33 PM, Venkatesh Pallipadi <[email protected]> wrote:
> When a task in a taskgroup sleeps, pick_next_task starts all the way back at
> the root and picks the task/taskgroup with the min vruntime across all
> runnable tasks. But, when there are many frequently sleeping tasks
> across different taskgroups, it makes better sense to stay with the same
> taskgroup for its slice period (or until all tasks in the taskgroup sleep)
> instead of switching across taskgroups on each sleep after a short runtime.
> This helps specifically where taskgroups correspond to a process with
> multiple threads. The change reduces the number of CR3 switches in this case.
<snip>
>
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 3a88dee..36e8f02 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -1339,6 +1339,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>        hrtick_update(rq);
>  }
>
> +static void set_next_buddy(struct sched_entity *se);
> +
>  /*
>  * The dequeue_task method is called before nr_running is
>  * decreased. We remove the task from the rbtree and
> @@ -1348,14 +1350,22 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  {
>        struct cfs_rq *cfs_rq;
>        struct sched_entity *se = &p->se;
> +       int task_flags = flags;

simpler: int voluntary = flags & DEQUEUE_SLEEP;
>
>        for_each_sched_entity(se) {
>                cfs_rq = cfs_rq_of(se);
>                dequeue_entity(cfs_rq, se, flags);
>
>                /* Don't dequeue parent if it has other entities besides us */
> -               if (cfs_rq->load.weight)
> +               if (cfs_rq->load.weight) {
> +                       /*
> +                        * Bias pick_next to pick a task from this cfs_rq, as
> +                        * p is sleeping when it is within its sched_slice.
> +                        */
> +                       if (task_flags & DEQUEUE_SLEEP && se->parent)
> +                               set_next_buddy(se->parent);

re-using the last_buddy would seem like a more natural fit here; also
doesn't have a clobber race with a wakeup

>                        break;
> +               }
>                flags |= DEQUEUE_SLEEP;
>        }
>
> @@ -1887,8 +1897,14 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
>        update_curr(cfs_rq);
>        find_matching_se(&se, &pse);
>        BUG_ON(!pse);
> -       if (wakeup_preempt_entity(se, pse) == 1)
> +       if (wakeup_preempt_entity(se, pse) == 1) {
> +               /*
> +                * Bias pick_next to pick the sched entity that is
> +                * triggering this preemption.
> +                */
> +               set_next_buddy(pse);

this probably wants some sort of unification with the scale-based next
buddy above

>                goto preempt;
> +       }
>
>        return;
>
> --
> 1.7.3.1
>
>

2011-03-02 06:47:34

by Mike Galbraith

Subject: Re: [PATCH] sched: next buddy hint on sleep and preempt path

On Tue, 2011-03-01 at 21:43 -0800, Paul Turner wrote:
> On Tue, Mar 1, 2011 at 3:33 PM, Venkatesh Pallipadi <[email protected]> wrote:

> > for_each_sched_entity(se) {
> > cfs_rq = cfs_rq_of(se);
> > dequeue_entity(cfs_rq, se, flags);
> >
> > /* Don't dequeue parent if it has other entities besides us */
> > - if (cfs_rq->load.weight)
> > + if (cfs_rq->load.weight) {
> > + /*
> > + * Bias pick_next to pick a task from this cfs_rq, as
> > + * p is sleeping when it is within its sched_slice.
> > + */
> > + if (task_flags & DEQUEUE_SLEEP && se->parent)
> > + set_next_buddy(se->parent);
>
> re-using the last_buddy would seem like a more natural fit here; also
> doesn't have a clobber race with a wakeup

Hm, that would break last_buddy no? A preempted task won't get the CPU
back after light preempting thread deactivates. (it's disabled atm
unless heavily overloaded anyway, but..)

This wants a tweak either way though.

static inline struct task_struct *task_of(struct sched_entity *se)
{
#ifdef CONFIG_SCHED_DEBUG
        WARN_ON_ONCE(!entity_is_task(se));
#endif
        return container_of(se, struct task_struct, se);
}

static void set_next_buddy(struct sched_entity *se)
{
        if (likely(task_of(se)->policy != SCHED_IDLE)) {
                for_each_sched_entity(se)
                        cfs_rq_of(se)->next = se;
        }
}

2011-03-02 07:08:38

by Paul Turner

Subject: Re: [PATCH] sched: next buddy hint on sleep and preempt path

On Tue, Mar 1, 2011 at 10:47 PM, Mike Galbraith <[email protected]> wrote:
> On Tue, 2011-03-01 at 21:43 -0800, Paul Turner wrote:
>> On Tue, Mar 1, 2011 at 3:33 PM, Venkatesh Pallipadi <[email protected]> wrote:
>
>> >        for_each_sched_entity(se) {
>> >                cfs_rq = cfs_rq_of(se);
>> >                dequeue_entity(cfs_rq, se, flags);
>> >
>> >                /* Don't dequeue parent if it has other entities besides us */
>> > -               if (cfs_rq->load.weight)
>> > +               if (cfs_rq->load.weight) {
>> > +                       /*
>> > +                        * Bias pick_next to pick a task from this cfs_rq, as
>> > +                        * p is sleeping when it is within its sched_slice.
>> > +                        */
>> > +                       if (task_flags & DEQUEUE_SLEEP && se->parent)
>> > +                               set_next_buddy(se->parent);
>>
>> re-using the last_buddy would seem like a more natural fit here; also
>> doesn't have a clobber race with a wakeup
>
> Hm, that would break last_buddy no?  A preempted task won't get the CPU
> back after light preempting thread deactivates.  (it's disabled atm
> unless heavily overloaded anyway, but..)

Ommm yeah.. we're actually a little snookered in this case since the
pre-empting thread's sleep will be voluntary which will try to return
time to its hierarchy.

I suppose keeping the last_buddy is preferable to the occasional clobber.

>
> This wants a tweak either way though.
>
> static inline struct task_struct *task_of(struct sched_entity *se)
> {
> #ifdef CONFIG_SCHED_DEBUG
>        WARN_ON_ONCE(!entity_is_task(se));
> #endif
>        return container_of(se, struct task_struct, se);
> }
>
> static void set_next_buddy(struct sched_entity *se)
> {
>        if (likely(task_of(se)->policy != SCHED_IDLE)) {
>                for_each_sched_entity(se)
>                        cfs_rq_of(se)->next = se;
>        }
> }
>

yeah good catch, the ifs do need to be broken out so that setting
buddies works properly with non-task entities

>

2011-03-02 07:40:49

by Mike Galbraith

Subject: Re: [PATCH] sched: next buddy hint on sleep and preempt path

On Tue, 2011-03-01 at 23:08 -0800, Paul Turner wrote:
> On Tue, Mar 1, 2011 at 10:47 PM, Mike Galbraith <[email protected]> wrote:
> > On Tue, 2011-03-01 at 21:43 -0800, Paul Turner wrote:
> >> On Tue, Mar 1, 2011 at 3:33 PM, Venkatesh Pallipadi <[email protected]> wrote:
> >
> >> > for_each_sched_entity(se) {
> >> > cfs_rq = cfs_rq_of(se);
> >> > dequeue_entity(cfs_rq, se, flags);
> >> >
> >> > /* Don't dequeue parent if it has other entities besides us */
> >> > - if (cfs_rq->load.weight)
> >> > + if (cfs_rq->load.weight) {
> >> > + /*
> >> > + * Bias pick_next to pick a task from this cfs_rq, as
> >> > + * p is sleeping when it is within its sched_slice.
> >> > + */
> >> > + if (task_flags & DEQUEUE_SLEEP && se->parent)
> >> > + set_next_buddy(se->parent);
> >>
> >> re-using the last_buddy would seem like a more natural fit here; also
> >> doesn't have a clobber race with a wakeup
> >
> > Hm, that would break last_buddy no? A preempted task won't get the CPU
> > back after light preempting thread deactivates. (it's disabled atm
> > unless heavily overloaded anyway, but..)
>
> Ommm yeah.. we're actually a little snookered in this case since the
> pre-empting thread's sleep will be voluntary which will try to return
> time to its hierarchy.
>
> I suppose keeping the last_buddy is preferable to the occasional clobber.

Yeah, I think we don't want to break it. I don't know if pgsql still
uses userland spinlocks, haven't run it in quite a while now, but with
those nasty things, last_buddy was the only thing that kept it from
collapsing into a quivering heap when you try to scale. Preempting a
userland spinlock holder gets ugly in the extreme.

I'm going to test this patch some more, but in light testing, I saw no
interactivity problems with it, and it does _seem_ to be improving
throughput when there are competing grouped loads sharing the box. I
haven't tested that heftily though, that's just watching the numbers and
recalling the relative effect of mixing loads previously.

-Mike

2011-03-02 10:30:13

by Peter Zijlstra

Subject: Re: [PATCH] sched: next buddy hint on sleep and preempt path

On Tue, 2011-03-01 at 15:33 -0800, Venkatesh Pallipadi wrote:
> When a task in a taskgroup sleeps, pick_next_task starts all the way back at
> the root and picks the task/taskgroup with the min vruntime across all
> runnable tasks. But, when there are many frequently sleeping tasks
> across different taskgroups, it makes better sense to stay with the same
> taskgroup for its slice period (or until all tasks in the taskgroup sleep)
> instead of switching across taskgroups on each sleep after a short runtime.
> This helps specifically where taskgroups correspond to a process with
> multiple threads. The change reduces the number of CR3 switches in this case.

I wasn't expecting this approach to this problem, and was dreading a
pick_next_task() rewrite, but aside from all the mentioned problems it
does look quite nice :-)

It doesn't avoid iterating the whole hierarchy every schedule, but like
you say, it should avoid the expensive cr3 switches.
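
(For context, a simplified sketch of how the next/last buddy hints bias
entity selection on the pick side; this is paraphrased from the
pick_next_entity() of that era, with the skip-buddy handling omitted and
details possibly differing from the tree under discussion:)

static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
{
        struct sched_entity *left = __pick_first_entity(cfs_rq);
        struct sched_entity *se = left;

        /* Prefer the last buddy: try to return the CPU to a preempted task. */
        if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
                se = cfs_rq->last;

        /* Someone really wants this to run; if it is not too unfair, run it. */
        if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
                se = cfs_rq->next;

        clear_buddies(cfs_rq, se);
        return se;
}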

2011-03-02 15:26:04

by Mike Galbraith

Subject: Re: [PATCH] sched: next buddy hint on sleep and preempt path

On Wed, 2011-03-02 at 11:31 +0100, Peter Zijlstra wrote:
> On Tue, 2011-03-01 at 15:33 -0800, Venkatesh Pallipadi wrote:
> > When a task in a taskgroup sleeps, pick_next_task starts all the way back at
> > the root and picks the task/taskgroup with the min vruntime across all
> > runnable tasks. But, when there are many frequently sleeping tasks
> > across different taskgroups, it makes better sense to stay with the same
> > taskgroup for its slice period (or until all tasks in the taskgroup sleep)
> > instead of switching across taskgroups on each sleep after a short runtime.
> > This helps specifically where taskgroups correspond to a process with
> > multiple threads. The change reduces the number of CR3 switches in this case.
>
> I wasn't expecting this approach to this problem, and was dreading a
> pick_next_task() rewrite, but aside from all the mentioned problems it
> does look quite nice :-)
>
> It doesn't avoid iterating the whole hierarchy every schedule, but like
> you say, it should avoid the expensive cr3 switches.

FWIW, some numbers from banging tbench and mysql+oltp together to see
what happens.

unpatched tip

mysql+oltp solo

clients 1 2 4 8 16 32 64 128 256
11078.14 20600.50 36517.18 36097.94 34989.94 33628.60 31682.38 26809.03 20311.53
10987.48 20636.78 36983.98 36556.50 35035.15 33626.54 31740.77 27124.11 20448.92
10977.32 20640.12 36362.76 36315.82 34794.60 33675.73 31817.49 26927.34 20386.41
avg 11014.31 20625.80 36621.30 36323.42 34939.89 33643.62 31746.88 26953.49 20382.28

tbench 16 solo 1076.32 MB/sec

mysql+oltp + tbench 16
clients 1 2 4 8 16 32 64 128 256
9147.08 16458.56 17613.71 20514.14 18233.71 18115.45 17205.45 14346.86 9073.38
9700.07 16409.88 19206.53 19311.40 18644.14 17926.43 17030.01 13646.80 9137.17
9324.63 16705.00 18909.88 19535.12 18718.56 17794.90 16957.54 13591.91 8858.30
avg 9390.59 16524.48 18576.70 19786.88 18532.13 17945.59 17064.33 13861.85 9022.95

tbench 16 + mysql+oltp 590.49 MB/sec

patched tip

mysql+oltp solo

clients 1 2 4 8 16 32 64 128 256
11040.91 20535.52 32602.98 36062.36 34890.45 33663.40 31923.10 27007.86 20285.95
11076.79 20708.12 35328.35 36502.54 35251.00 33568.11 31633.70 26846.61 20336.28
11071.31 20697.78 37281.81 36451.19 35285.16 33502.64 31353.73 26733.23 20151.40
avg 11063.00 20647.14 35071.04 36338.69 35142.20 33578.05 31636.84 26862.56 20257.87
1.004 1.001 .957 1.000 1.005 .998 .996 .996 .993

tbench 16 solo 1080.23 MB/sec
1.003

mysql+oltp + tbench 16
clients 1 2 4 8 16 32 64 128 256
9649.64 17510.35 18231.25 19089.73 19363.54 18528.36 17190.88 14393.28 8747.72
10010.87 17308.15 19926.13 20421.10 19757.92 18901.87 18127.18 14701.35 9157.01
9759.10 16897.08 19305.13 19289.35 20086.89 18777.02 17144.39 14690.63 9160.61
avg 9806.53 17238.52 19154.17 19600.06 19736.11 18735.75 17487.48 14595.08 9021.78
1.044 1.043 1.031 .990 1.064 1.044 1.024 1.052 .999

tbench 16 + mysql+oltp 593.36 MB/sec
1.004

Note: the variance at the peak (4 clients) is a cgroups oddity with
mysql+oltp. The unpatched solo run happened to hit 3 consecutive good runs.
Variance right at the peak is the norm, for some odd reason.

-Mike

2011-03-02 19:12:07

by Venkatesh Pallipadi

Subject: Re: [PATCH] sched: next buddy hint on sleep and preempt path

On Tue, Mar 1, 2011 at 10:47 PM, Mike Galbraith <[email protected]> wrote:
> On Tue, 2011-03-01 at 21:43 -0800, Paul Turner wrote:
>> On Tue, Mar 1, 2011 at 3:33 PM, Venkatesh Pallipadi <[email protected]> wrote:
>
>> >        for_each_sched_entity(se) {
>> >                cfs_rq = cfs_rq_of(se);
>> >                dequeue_entity(cfs_rq, se, flags);
>> >
>> >                /* Don't dequeue parent if it has other entities besides us */
>> > -               if (cfs_rq->load.weight)
>> > +               if (cfs_rq->load.weight) {
>> > +                       /*
>> > +                        * Bias pick_next to pick a task from this cfs_rq, as
>> > +                        * p is sleeping when it is within its sched_slice.
>> > +                        */
>> > +                       if (task_flags & DEQUEUE_SLEEP && se->parent)
>> > +                               set_next_buddy(se->parent);
>>
>> re-using the last_buddy would seem like a more natural fit here; also
>> doesn't have a clobber race with a wakeup
>
> Hm, that would break last_buddy no?  A preempted task won't get the CPU
> back after light preempting thread deactivates.  (it's disabled atm
> unless heavily overloaded anyway, but..)
>
> This wants a tweak either way though.
>
> static inline struct task_struct *task_of(struct sched_entity *se)
> {
> #ifdef CONFIG_SCHED_DEBUG
>        WARN_ON_ONCE(!entity_is_task(se));
> #endif
>        return container_of(se, struct task_struct, se);
> }
>
> static void set_next_buddy(struct sched_entity *se)
> {
>        if (likely(task_of(se)->policy != SCHED_IDLE)) {
>                for_each_sched_entity(se)
>                        cfs_rq_of(se)->next = se;
>        }
> }
>

Yes. Thanks for pointing this out. Will handle this in patch-respin.

Thanks,
Venki

2011-03-02 19:22:09

by Venkatesh Pallipadi

Subject: Re: [PATCH] sched: next buddy hint on sleep and preempt path

On Tue, Mar 1, 2011 at 9:43 PM, Paul Turner <[email protected]> wrote:
> On Tue, Mar 1, 2011 at 3:33 PM, Venkatesh Pallipadi <[email protected]> wrote:
>> When a task in a taskgroup sleeps, pick_next_task starts all the way back at
>> the root and picks the task/taskgroup with the min vruntime across all
>> runnable tasks. But, when there are many frequently sleeping tasks
>> across different taskgroups, it makes better sense to stay with the same
>> taskgroup for its slice period (or until all tasks in the taskgroup sleep)
>> instead of switching across taskgroups on each sleep after a short runtime.
>> This helps specifically where taskgroups correspond to a process with
>> multiple threads. The change reduces the number of CR3 switches in this case.

<snip>

>> ---
>>  kernel/sched_fair.c |   20 ++++++++++++++++++--
>>  1 files changed, 18 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
>> index 3a88dee..36e8f02 100644
>> --- a/kernel/sched_fair.c
>> +++ b/kernel/sched_fair.c
>> @@ -1339,6 +1339,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>        hrtick_update(rq);
>>  }
>>
>> +static void set_next_buddy(struct sched_entity *se);
>> +
>>  /*
>>  * The dequeue_task method is called before nr_running is
>>  * decreased. We remove the task from the rbtree and
>> @@ -1348,14 +1350,22 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>  {
>>        struct cfs_rq *cfs_rq;
>>        struct sched_entity *se = &p->se;
>> +       int task_flags = flags;
>
> simpler: int voluntary = flags & DEQUEUE_SLEEP;

Agree. This looks cleaner. Will change.

>>
>>        for_each_sched_entity(se) {
>>                cfs_rq = cfs_rq_of(se);
>>                dequeue_entity(cfs_rq, se, flags);
>>
>>                /* Don't dequeue parent if it has other entities besides us */
>> -               if (cfs_rq->load.weight)
>> +               if (cfs_rq->load.weight) {
>> +                       /*
>> +                        * Bias pick_next to pick a task from this cfs_rq, as
>> +                        * p is sleeping when it is within its sched_slice.
>> +                        */
>> +                       if (task_flags & DEQUEUE_SLEEP && se->parent)
>> +                               set_next_buddy(se->parent);
>
> re-using the last_buddy would seem like a more natural fit here; also
> doesn't have a clobber race with a wakeup

Yes, using next_buddy will be racy. There will be races with
yield_to and preempt as well. But as long as we use it only as a hint,
I thought the occasional clobber would be OK.

>
>>                        break;
>> +               }
>>                flags |= DEQUEUE_SLEEP;
>>        }
>>
>> @@ -1887,8 +1897,14 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
>>        update_curr(cfs_rq);
>>        find_matching_se(&se, &pse);
>>        BUG_ON(!pse);
>> -       if (wakeup_preempt_entity(se, pse) == 1)
>> +       if (wakeup_preempt_entity(se, pse) == 1) {
>> +               /*
>> +                * Bias pick_next to pick the sched entity that is
>> +                * triggering this preemption.
>> +                */
>> +               set_next_buddy(pse);
>
> this probably wants some sort of unification with the scale-based next
> buddy above
>

Yes. I can skip this if it is already set by the scale-based next buddy above.

Thanks,
Venki

>>                goto preempt;
>> +       }
>>
>>        return;
>>
>> --
>> 1.7.3.1
>>
>>
>

2011-03-08 01:00:36

by Venkatesh Pallipadi

Subject: [PATCH] sched: next buddy hint on sleep and preempt path - v1

When a task in a taskgroup sleeps, pick_next_task starts all the way back at
the root and picks the task/taskgroup with the min vruntime across all
runnable tasks. But, when there are many frequently sleeping tasks
across different taskgroups, it makes better sense to stay with the same
taskgroup for its slice period (or until all tasks in the taskgroup sleep)
instead of switching across taskgroups on each sleep after a short runtime.
This helps specifically where taskgroups correspond to a process with
multiple threads. The change reduces the number of CR3 switches in this case.

Example:
Two taskgroups with 2 threads each, where each thread runs for 2ms and then
sleeps for 1ms. Looking at the sched:sched_switch tracepoint shows -

BEFORE: taskgroup_1 threads [5004, 5005], taskgroup_2 threads [5016, 5017]
cpu-soaker-5004 [003] 3683.391089
cpu-soaker-5016 [003] 3683.393106
cpu-soaker-5005 [003] 3683.395119
cpu-soaker-5017 [003] 3683.397130
cpu-soaker-5004 [003] 3683.399143
cpu-soaker-5016 [003] 3683.401155
cpu-soaker-5005 [003] 3683.403168
cpu-soaker-5017 [003] 3683.405170

AFTER: taskgroup_1 threads [21890, 21891], taskgroup_2 threads [21934, 21935]
cpu-soaker-21890 [003] 865.895494
cpu-soaker-21935 [003] 865.897506
cpu-soaker-21934 [003] 865.899520
cpu-soaker-21935 [003] 865.901532
cpu-soaker-21934 [003] 865.903543
cpu-soaker-21935 [003] 865.905546
cpu-soaker-21891 [003] 865.907548
cpu-soaker-21890 [003] 865.909560
cpu-soaker-21891 [003] 865.911571
cpu-soaker-21890 [003] 865.913582
cpu-soaker-21891 [003] 865.915594
cpu-soaker-21934 [003] 865.917606

A similar problem exists when there are multiple taskgroups and, say, a task A
preempts the currently running task B of taskgroup_1. On the next schedule,
pick_next_task can pick an unrelated task from taskgroup_2. Here it would be
better to give some preference to task B in pick_next_task.

A simple (maybe extreme) benchmark I tried was tbench with 2 tbench client
processes, each with 2 threads, running on a single CPU. Average throughput
across five 50-second runs was -
BEFORE: 105.84 MB/sec
AFTER: 112.42 MB/sec

Changes from v0:
* Always pass task se to set_next_buddy
* Avoid repeated set_next_buddy in check_preempt_wakeup
* Minor flag cleanup in dequeue_task_fair

Signed-off-by: Venkatesh Pallipadi <[email protected]>
---
kernel/sched_fair.c | 41 ++++++++++++++++++++++++++++++++++++++---
1 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 3a88dee..cbe442e 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1339,6 +1339,20 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
hrtick_update(rq);
}

+static struct sched_entity *pick_next_taskse_on_cfsrq(struct cfs_rq *cfs_rq)
+{
+ struct sched_entity *se;
+
+ do {
+ se = pick_next_entity(cfs_rq);
+ cfs_rq = group_cfs_rq(se);
+ } while (cfs_rq);
+
+ return se;
+}
+
+static void set_next_buddy(struct sched_entity *se);
+
/*
* The dequeue_task method is called before nr_running is
* decreased. We remove the task from the rbtree and
@@ -1348,14 +1362,25 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
+ int task_sleep = flags & DEQUEUE_SLEEP;

for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
dequeue_entity(cfs_rq, se, flags);

/* Don't dequeue parent if it has other entities besides us */
- if (cfs_rq->load.weight)
+ if (cfs_rq->load.weight) {
+ /*
+ * Bias pick_next to pick a task from this cfs_rq, as
+ * p is sleeping when it is within its sched_slice.
+ */
+ if (task_sleep) {
+ struct sched_entity *next_se;
+ next_se = pick_next_taskse_on_cfsrq(cfs_rq);
+ set_next_buddy(next_se);
+ }
break;
+ }
flags |= DEQUEUE_SLEEP;
}

@@ -1856,12 +1881,15 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
struct sched_entity *se = &curr->se, *pse = &p->se;
struct cfs_rq *cfs_rq = task_cfs_rq(curr);
int scale = cfs_rq->nr_running >= sched_nr_latency;
+ int next_buddy_marked = 0;

if (unlikely(se == pse))
return;

- if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK))
+ if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK)) {
set_next_buddy(pse);
+ next_buddy_marked = 1;
+ }

/*
* We can come here with TIF_NEED_RESCHED already set from new task
@@ -1887,8 +1915,15 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
update_curr(cfs_rq);
find_matching_se(&se, &pse);
BUG_ON(!pse);
- if (wakeup_preempt_entity(se, pse) == 1)
+ if (wakeup_preempt_entity(se, pse) == 1) {
+ /*
+ * Bias pick_next to pick the task that is
+ * triggering this preemption.
+ */
+ if (!next_buddy_marked)
+ set_next_buddy(&p->se);
goto preempt;
+ }

return;

--
1.7.3.1

2011-03-08 01:29:37

by Paul Turner

Subject: Re: [PATCH] sched: next buddy hint on sleep and preempt path - v1

On Mon, Mar 7, 2011 at 4:59 PM, Venkatesh Pallipadi <[email protected]> wrote:
> When a task in a taskgroup sleeps, pick_next_task starts all the way back at
> the root and picks the task/taskgroup with the min vruntime across all
> runnable tasks. But, when there are many frequently sleeping tasks
> across different taskgroups, it makes better sense to stay with the same
> taskgroup for its slice period (or until all tasks in the taskgroup sleep)
> instead of switching across taskgroups on each sleep after a short runtime.
> This helps specifically where taskgroups correspond to a process with
> multiple threads. The change reduces the number of CR3 switches in this case.
>
<snip>
>
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 3a88dee..cbe442e 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -1339,6 +1339,20 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>        hrtick_update(rq);
>  }
>
> +static struct sched_entity *pick_next_taskse_on_cfsrq(struct cfs_rq *cfs_rq)
> +{
> +       struct sched_entity *se;
> +
> +       do {
> +               se = pick_next_entity(cfs_rq);
> +               cfs_rq = group_cfs_rq(se);
> +       } while (cfs_rq);
> +
> +       return se;
> +}
> +

I think the original approach was much cleaner; the notion of a
SCHED_IDLE task is only relative versus siblings in group scheduling.

Generalizing the buddies to work on entities, e.g.:

@@ -2137,10 +2180,11 @@ static void set_last_buddy(struct sched_entity *se)

 static void set_next_buddy(struct sched_entity *se)
 {
-       if (likely(task_of(se)->policy != SCHED_IDLE)) {
-               for_each_sched_entity(se)
-                       cfs_rq_of(se)->next = se;
-       }
+       if (entity_is_task(se) && unlikely(task_of(se)->policy == SCHED_IDLE))
+               return;
+
+       for_each_sched_entity(se)
+               cfs_rq_of(se)->next = se;
 }

Avoids all the picking descent and gets us back there.

> +static void set_next_buddy(struct sched_entity *se);
> +
>  /*
>  * The dequeue_task method is called before nr_running is
>  * decreased. We remove the task from the rbtree and
> @@ -1348,14 +1362,25 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  {
>        struct cfs_rq *cfs_rq;
>        struct sched_entity *se = &p->se;
> +       int task_sleep = flags & DEQUEUE_SLEEP;
>
>        for_each_sched_entity(se) {
>                cfs_rq = cfs_rq_of(se);
>                dequeue_entity(cfs_rq, se, flags);
>
>                /* Don't dequeue parent if it has other entities besides us */
> -               if (cfs_rq->load.weight)
> +               if (cfs_rq->load.weight) {
> +                       /*
> +                        * Bias pick_next to pick a task from this cfs_rq, as
> +                        * p is sleeping when it is within its sched_slice.
> +                        */
> +                       if (task_sleep) {
> +                               struct sched_entity *next_se;
> +                               next_se = pick_next_taskse_on_cfsrq(cfs_rq);
> +                               set_next_buddy(next_se);
> +                       }
>                        break;
> +               }
>                flags |= DEQUEUE_SLEEP;
>        }
>
> @@ -1856,12 +1881,15 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
>        struct sched_entity *se = &curr->se, *pse = &p->se;
>        struct cfs_rq *cfs_rq = task_cfs_rq(curr);
>        int scale = cfs_rq->nr_running >= sched_nr_latency;
> +       int next_buddy_marked = 0;
>
>        if (unlikely(se == pse))
>                return;
>
> -       if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK))
> +       if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK)) {
>                set_next_buddy(pse);
> +               next_buddy_marked = 1;
> +       }
>
>        /*
>         * We can come here with TIF_NEED_RESCHED already set from new task
> @@ -1887,8 +1915,15 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
>        update_curr(cfs_rq);
>        find_matching_se(&se, &pse);
>        BUG_ON(!pse);
> -       if (wakeup_preempt_entity(se, pse) == 1)
> +       if (wakeup_preempt_entity(se, pse) == 1) {

Can't this just be:

if ((wakeup_preempt_entity(se, pse) == 1) || (scale && !fork))

Or even:

( wakeup || scale ) && !fork

Storing the state seems messy just for the
preempt-with-resched-already-set case (the effective behavioral difference).
With this, the other case can be deleted.

> +               /*
> +                * Bias pick_next to pick the task that is
> +                * triggering this preemption.
> +                */
> +               if (!next_buddy_marked)
> +                       set_next_buddy(&p->se);
>                goto preempt;
> +       }
>
>        return;
>
> --
> 1.7.3.1
>
>

2011-03-08 01:47:09

by Venkatesh Pallipadi

Subject: Re: [PATCH] sched: next buddy hint on sleep and preempt path - v1

On Mon, Mar 7, 2011 at 5:29 PM, Paul Turner <[email protected]> wrote:
> On Mon, Mar 7, 2011 at 4:59 PM, Venkatesh Pallipadi <[email protected]> wrote:
<snip>
>> @@ -1339,6 +1339,20 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>        hrtick_update(rq);
>>  }
>>
>> +static struct sched_entity *pick_next_taskse_on_cfsrq(struct cfs_rq *cfs_rq)
>> +{
>> +       struct sched_entity *se;
>> +
>> +       do {
>> +               se = pick_next_entity(cfs_rq);
>> +               cfs_rq = group_cfs_rq(se);
>> +       } while (cfs_rq);
>> +
>> +       return se;
>> +}
>> +
>
> I think the original approach was much cleaner; the notion of a
> SCHED_IDLE task is only relative versus siblings in group scheduling
>
> Generalizing the buddies to work on entities, e.g.:
>
> @@ -2137,10 +2180,11 @@ static void set_last_buddy(struct sched_entity *se)
>
>  static void set_next_buddy(struct sched_entity *se)
>  {
> -       if (likely(task_of(se)->policy != SCHED_IDLE)) {
> -               for_each_sched_entity(se)
> -                       cfs_rq_of(se)->next = se;
> -       }
> +       if (entity_is_task(se) && unlikely(task_of(se)->policy == SCHED_IDLE))
> +               return;
> +
> +       for_each_sched_entity(se)
> +               cfs_rq_of(se)->next = se;
>  }
>
> Avoids all the picking descent and gets us back there.
>

The reason I did not go with something like the above was for the case
where a task (SCHED_IDLE or otherwise) is going to sleep, there is
another runnable SCHED_IDLE task in the same group, and there is a
non-SCHED_IDLE task in some other group. In that case I thought we should
not give the next_buddy privilege within the group and potentially pick a
non-SCHED_IDLE task from the other group based on vruntime.

I agree that something like the above is simpler, and if we do not really
care about the buddy privilege picking a SCHED_IDLE task ahead of a
non-SCHED_IDLE task from a different group, then the above approach seems
simpler.

<snip>
>> @@ -1887,8 +1915,15 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
>>        update_curr(cfs_rq);
>>        find_matching_se(&se, &pse);
>>        BUG_ON(!pse);
>> -       if (wakeup_preempt_entity(se, pse) == 1)
>> +       if (wakeup_preempt_entity(se, pse) == 1) {
>
> Can't this just be:
>
> if ((wakeup_preempt_entity(se, pse) == 1) || (scale && !fork))
>
> Or even:
>
> ( wakeup || scale ) && !fork
>
> Storing the state seems messy just for the
> preempt-with-resched-already-set case (the effective behavioral difference).
> With this, the other case can be deleted.

Removing the other case is not straightforward, as there are a couple of
non-preempt returns before this point which all want the next_buddy set
with (scale && !fork).
I cannot move this case ahead, as we have to find the matching se before
doing the preempt check. Hence I ended up with the stored state.
Also, the NEXT_BUDDY feature is off by default, so this change should be
optimized to a nop by the compiler by default.


Thanks,
Venki

2011-03-08 02:33:07

by Venkatesh Pallipadi

Subject: Re: [PATCH] sched: next buddy hint on sleep and preempt path - v1

On Mon, Mar 7, 2011 at 5:29 PM, Paul Turner <[email protected]> wrote:
> On Mon, Mar 7, 2011 at 4:59 PM, Venkatesh Pallipadi <[email protected]> wrote:
<snip>
>> @@ -1339,6 +1339,20 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>        hrtick_update(rq);
>>  }
>>
>> +static struct sched_entity *pick_next_taskse_on_cfsrq(struct cfs_rq *cfs_rq)
>> +{
>> +       struct sched_entity *se;
>> +
>> +       do {
>> +               se = pick_next_entity(cfs_rq);
>> +               cfs_rq = group_cfs_rq(se);
>> +       } while (cfs_rq);
>> +
>> +       return se;
>> +}
>> +
>
> I think the original approach was much cleaner; the notion of a
> SCHED_IDLE task is only relative versus siblings in group scheduling

Looking at the related code:

static void set_skip_buddy(struct sched_entity *se)
{
        if (likely(task_of(se)->policy != SCHED_IDLE)) {
                for_each_sched_entity(se)
                        cfs_rq_of(se)->skip = se;
        }
}

Shouldn't it always set the skip se, irrespective of the current task's
SCHED_IDLE setting?

Thanks,
Venki

2011-04-14 01:21:30

by Venkatesh Pallipadi

Subject: [PATCH 0/2] sched: Avoid frequent cross taskgroup switches -v2


As I did not hear any other comments on v1, I am assuming that the
previously mentioned case of giving buddy privilege to a SCHED_IDLE task
from one cgroup when there is a non-SCHED_IDLE task in another cgroup
is not an important case to handle, and that keeping this change simpler is better.

So, I am going with Paul's suggestion of sticking with the original change,
with some cleanups, to handle the frequent CR3 switches problem.

Here is the v2 version of the change, split into two patches.

Signed-off-by: Venkatesh Pallipadi <[email protected]>

2011-04-14 01:21:41

by Venkatesh Pallipadi

Subject: [PATCH 1/2] sched: Make set_*_buddy work on non-task entity -v2

Make set_*_buddy work on a non-task sched_entity, to facilitate the
use of next_buddy to cache a group entity in cases where one of the tasks
within that entity sleeps or gets preempted.

set_skip_buddy was wrongly checking that the policy of the yielding task is
not SCHED_IDLE. Yielding should happen even when the yielding task is
SCHED_IDLE. This change removes the policy check on the yielding task.

Signed-off-by: Venkatesh Pallipadi <[email protected]>
---
kernel/sched_fair.c | 24 ++++++++++++------------
1 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 80ecd09..d3513b7 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1826,26 +1826,26 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)

static void set_last_buddy(struct sched_entity *se)
{
- if (likely(task_of(se)->policy != SCHED_IDLE)) {
- for_each_sched_entity(se)
- cfs_rq_of(se)->last = se;
- }
+ if (entity_is_task(se) && unlikely(task_of(se)->policy == SCHED_IDLE))
+ return;
+
+ for_each_sched_entity(se)
+ cfs_rq_of(se)->last = se;
}

static void set_next_buddy(struct sched_entity *se)
{
- if (likely(task_of(se)->policy != SCHED_IDLE)) {
- for_each_sched_entity(se)
- cfs_rq_of(se)->next = se;
- }
+ if (entity_is_task(se) && unlikely(task_of(se)->policy == SCHED_IDLE))
+ return;
+
+ for_each_sched_entity(se)
+ cfs_rq_of(se)->next = se;
}

static void set_skip_buddy(struct sched_entity *se)
{
- if (likely(task_of(se)->policy != SCHED_IDLE)) {
- for_each_sched_entity(se)
- cfs_rq_of(se)->skip = se;
- }
+ for_each_sched_entity(se)
+ cfs_rq_of(se)->skip = se;
}

/*
--
1.7.3.1

2011-04-14 01:21:45

by Venkatesh Pallipadi

Subject: [PATCH 2/2] sched: next buddy hint on sleep and preempt path -v2

When a task in a taskgroup sleeps, pick_next_task starts all the way back at
the root and picks the task/taskgroup with the min vruntime across all
runnable tasks. But, when there are many frequently sleeping tasks
across different taskgroups, it makes better sense to stay with the same
taskgroup for its slice period (or until all tasks in the taskgroup sleep)
instead of switching across taskgroups on each sleep after a short runtime.
This helps specifically where taskgroups correspond to a process with
multiple threads. The change reduces the number of CR3 switches in this case.

Example:
Two taskgroups with 2 threads each, where each thread runs for 2ms and then
sleeps for 1ms. Looking at the sched:sched_switch tracepoint shows -

BEFORE: taskgroup_1 threads [5004, 5005], taskgroup_2 threads [5016, 5017]
cpu-soaker-5004 [003] 3683.391089
cpu-soaker-5016 [003] 3683.393106
cpu-soaker-5005 [003] 3683.395119
cpu-soaker-5017 [003] 3683.397130
cpu-soaker-5004 [003] 3683.399143
cpu-soaker-5016 [003] 3683.401155
cpu-soaker-5005 [003] 3683.403168
cpu-soaker-5017 [003] 3683.405170

AFTER: taskgroup_1 threads [21890, 21891], taskgroup_2 threads [21934, 21935]
cpu-soaker-21890 [003] 865.895494
cpu-soaker-21935 [003] 865.897506
cpu-soaker-21934 [003] 865.899520
cpu-soaker-21935 [003] 865.901532
cpu-soaker-21934 [003] 865.903543
cpu-soaker-21935 [003] 865.905546
cpu-soaker-21891 [003] 865.907548
cpu-soaker-21890 [003] 865.909560
cpu-soaker-21891 [003] 865.911571
cpu-soaker-21890 [003] 865.913582
cpu-soaker-21891 [003] 865.915594
cpu-soaker-21934 [003] 865.917606

A similar problem exists when there are multiple taskgroups and, say, a task A
preempts the currently running task B of taskgroup_1. On the next schedule,
pick_next_task can pick an unrelated task from taskgroup_2. Here it would be
better to give some preference to task B in pick_next_task.

A simple (maybe extreme) benchmark I tried was tbench with 2 tbench client
processes, each with 2 threads, running on a single CPU. Average throughput
across five 50-second runs was -
BEFORE: 105.84 MB/sec
AFTER: 112.42 MB/sec

Signed-off-by: Venkatesh Pallipadi <[email protected]>
---
kernel/sched_fair.c | 26 +++++++++++++++++++++++---
1 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index d3513b7..8828a7e 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1340,6 +1340,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
hrtick_update(rq);
}

+static void set_next_buddy(struct sched_entity *se);
+
/*
* The dequeue_task method is called before nr_running is
* decreased. We remove the task from the rbtree and
@@ -1349,14 +1351,22 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
+ int task_sleep = flags & DEQUEUE_SLEEP;

for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
dequeue_entity(cfs_rq, se, flags);

/* Don't dequeue parent if it has other entities besides us */
- if (cfs_rq->load.weight)
+ if (cfs_rq->load.weight) {
+ /*
+ * Bias pick_next to pick a task from this cfs_rq, as
+ * p is sleeping when it is within its sched_slice.
+ */
+ if (task_sleep && se->parent)
+ set_next_buddy(se->parent);
break;
+ }
flags |= DEQUEUE_SLEEP;
}

@@ -1857,12 +1867,15 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
struct sched_entity *se = &curr->se, *pse = &p->se;
struct cfs_rq *cfs_rq = task_cfs_rq(curr);
int scale = cfs_rq->nr_running >= sched_nr_latency;
+ int next_buddy_marked = 0;

if (unlikely(se == pse))
return;

- if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK))
+ if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK)) {
set_next_buddy(pse);
+ next_buddy_marked = 1;
+ }

/*
* We can come here with TIF_NEED_RESCHED already set from new task
@@ -1890,8 +1903,15 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
update_curr(cfs_rq);
find_matching_se(&se, &pse);
BUG_ON(!pse);
- if (wakeup_preempt_entity(se, pse) == 1)
+ if (wakeup_preempt_entity(se, pse) == 1) {
+ /*
+ * Bias pick_next to pick the sched entity that is
+ * triggering this preemption.
+ */
+ if (!next_buddy_marked)
+ set_next_buddy(pse);
goto preempt;
+ }

return;

--
1.7.3.1

2011-04-14 10:47:38

by Peter Zijlstra

Subject: Re: [PATCH 2/2] sched: next buddy hint on sleep and preempt path -v2

On Wed, 2011-04-13 at 18:21 -0700, Venkatesh Pallipadi wrote:
> ---
> kernel/sched_fair.c | 26 +++++++++++++++++++++++---
> 1 files changed, 23 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index d3513b7..8828a7e 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -1340,6 +1340,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> hrtick_update(rq);
> }
>
> +static void set_next_buddy(struct sched_entity *se);
> +
> /*
> * The dequeue_task method is called before nr_running is
> * decreased. We remove the task from the rbtree and
> @@ -1349,14 +1351,22 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> {
> struct cfs_rq *cfs_rq;
> struct sched_entity *se = &p->se;
> + int task_sleep = flags & DEQUEUE_SLEEP;
>
> for_each_sched_entity(se) {
> cfs_rq = cfs_rq_of(se);
> dequeue_entity(cfs_rq, se, flags);
>
> /* Don't dequeue parent if it has other entities besides us */
> - if (cfs_rq->load.weight)
> + if (cfs_rq->load.weight) {
> + /*
> + * Bias pick_next to pick a task from this cfs_rq, as
> + * p is sleeping when it is within its sched_slice.
> + */
> + if (task_sleep && se->parent)
> + set_next_buddy(se->parent);
> break;
> + }
> flags |= DEQUEUE_SLEEP;
> }
>

You made my compiler sad:

In file included from /home/root/src/linux-2.6/kernel/sched.c:2006:
/home/root/src/linux-2.6/kernel/sched_fair.c: In function ‘dequeue_task_fair’:
/home/root/src/linux-2.6/kernel/sched_fair.c:1370: error: ‘struct sched_entity’ has no member named ‘parent’
/home/root/src/linux-2.6/kernel/sched_fair.c:1371: error: ‘struct sched_entity’ has no member named ‘parent’
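
(The errors come from the !CONFIG_FAIR_GROUP_SCHED build: sched_entity only
has a ->parent member when group scheduling is enabled. Roughly, sketching
the relevant definitions of this era -- simplified, not the exact source:)

struct sched_entity {
        /* ... */
#ifdef CONFIG_FAIR_GROUP_SCHED
        struct sched_entity     *parent;
        /* ... */
#endif
};

/* sched_fair.c wraps the access so it compiles in both configurations */
#ifdef CONFIG_FAIR_GROUP_SCHED
static inline struct sched_entity *parent_entity(struct sched_entity *se)
{
        return se->parent;
}
#else
static inline struct sched_entity *parent_entity(struct sched_entity *se)
{
        return NULL;
}
#endif

Hence the -v2 below switches from se->parent to parent_entity(se).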

2011-04-14 17:31:44

by Venkatesh Pallipadi

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched: next buddy hint on sleep and preempt path -v2

Sorry. My mistake. The below version should fare better.

Thanks,
Venki

When a task in a taskgroup sleeps, pick_next_task starts all the way back at
the root and picks the task/taskgroup with the min vruntime across all
runnable tasks. But when there are many frequently sleeping tasks across
different taskgroups, it makes better sense to stay with the same taskgroup
for its slice period (or until all tasks in the taskgroup sleep) instead of
switching across taskgroups on each sleep after a short runtime.
This helps specifically where a taskgroup corresponds to a process with
multiple threads. The change reduces the number of CR3 switches in this case.

Example:
Two taskgroups with 2 threads each, where each thread runs for 2ms and
sleeps for 1ms. Looking at sched:sched_switch shows -

BEFORE: taskgroup_1 threads [5004, 5005], taskgroup_2 threads [5016, 5017]
cpu-soaker-5004 [003] 3683.391089
cpu-soaker-5016 [003] 3683.393106
cpu-soaker-5005 [003] 3683.395119
cpu-soaker-5017 [003] 3683.397130
cpu-soaker-5004 [003] 3683.399143
cpu-soaker-5016 [003] 3683.401155
cpu-soaker-5005 [003] 3683.403168
cpu-soaker-5017 [003] 3683.405170

AFTER: taskgroup_1 threads [21890, 21891], taskgroup_2 threads [21934, 21935]
cpu-soaker-21890 [003] 865.895494
cpu-soaker-21935 [003] 865.897506
cpu-soaker-21934 [003] 865.899520
cpu-soaker-21935 [003] 865.901532
cpu-soaker-21934 [003] 865.903543
cpu-soaker-21935 [003] 865.905546
cpu-soaker-21891 [003] 865.907548
cpu-soaker-21890 [003] 865.909560
cpu-soaker-21891 [003] 865.911571
cpu-soaker-21890 [003] 865.913582
cpu-soaker-21891 [003] 865.915594
cpu-soaker-21934 [003] 865.917606

A similar problem exists when there are multiple taskgroups and, say, a task A
preempts the currently running task B of taskgroup_1. When schedule() then
runs, pick_next_task can pick an unrelated task from taskgroup_2. Here it
would be better to give some preference to task B in pick_next_task.

A simple (maybe extreme) benchmark I tried was tbench with 2 tbench client
processes of 2 threads each, all running on a single CPU. Avg throughput
across 5 runs of 50 sec each was -
BEFORE: 105.84 MB/sec
AFTER: 112.42 MB/sec

Signed-off-by: Venkatesh Pallipadi <[email protected]>
---
kernel/sched_fair.c | 26 +++++++++++++++++++++++---
1 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 1b3a4d4..80931f0 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1340,6 +1340,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
hrtick_update(rq);
}

+static void set_next_buddy(struct sched_entity *se);
+
/*
* The dequeue_task method is called before nr_running is
* decreased. We remove the task from the rbtree and
@@ -1349,14 +1351,22 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
+ int task_sleep = flags & DEQUEUE_SLEEP;

for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
dequeue_entity(cfs_rq, se, flags);

/* Don't dequeue parent if it has other entities besides us */
- if (cfs_rq->load.weight)
+ if (cfs_rq->load.weight) {
+ /*
+ * Bias pick_next to pick a task from this cfs_rq, as
+ * p is sleeping when it is within its sched_slice.
+ */
+ if (task_sleep && parent_entity(se))
+ set_next_buddy(parent_entity(se));
break;
+ }
flags |= DEQUEUE_SLEEP;
}

@@ -1857,12 +1867,15 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
struct sched_entity *se = &curr->se, *pse = &p->se;
struct cfs_rq *cfs_rq = task_cfs_rq(curr);
int scale = cfs_rq->nr_running >= sched_nr_latency;
+ int next_buddy_marked = 0;

if (unlikely(se == pse))
return;

- if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK))
+ if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK)) {
set_next_buddy(pse);
+ next_buddy_marked = 1;
+ }

/*
* We can come here with TIF_NEED_RESCHED already set from new task
@@ -1890,8 +1903,15 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
update_curr(cfs_rq);
find_matching_se(&se, &pse);
BUG_ON(!pse);
- if (wakeup_preempt_entity(se, pse) == 1)
+ if (wakeup_preempt_entity(se, pse) == 1) {
+ /*
+ * Bias pick_next to pick the sched entity that is
+ * triggering this preemption.
+ */
+ if (!next_buddy_marked)
+ set_next_buddy(pse);
goto preempt;
+ }

return;

--
1.7.3.1

2011-04-15 21:46:15

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched: next buddy hint on sleep and preempt path -v2

On 04/14/2011 01:30 PM, Venkatesh Pallipadi wrote:

> A simple (maybe extreme) benchmark I tried was tbench with 2 tbench client
> processes of 2 threads each, all running on a single CPU. Avg throughput
> across 5 runs of 50 sec each was -
> BEFORE: 105.84 MB/sec
> AFTER: 112.42 MB/sec
>
> Signed-off-by: Venkatesh Pallipadi<[email protected]>

Those performance numbers and the general idea sure get my

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed

2011-04-19 12:05:32

by Venkatesh Pallipadi

[permalink] [raw]
Subject: [tip:sched/core] sched: Make set_*_buddy() work on non-task entities

Commit-ID: 69c80f3e9d3c569f8a3cee94ba1a324b5a7fa6b9
Gitweb: http://git.kernel.org/tip/69c80f3e9d3c569f8a3cee94ba1a324b5a7fa6b9
Author: Venkatesh Pallipadi <[email protected]>
AuthorDate: Wed, 13 Apr 2011 18:21:09 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 19 Apr 2011 10:08:37 +0200

sched: Make set_*_buddy() work on non-task entities

Make set_*_buddy() work on non-task sched_entity, to facilitate the
use of next_buddy to cache a group entity in cases where one of the
tasks within that entity sleeps or gets preempted.
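
For orientation (a simplified sketch, approximate to this era's sched_fair.c,
not part of the commit): task_of() is only valid for a task entity, which is
why the buddy helpers now check entity_is_task() before looking at the task's
policy:

#ifdef CONFIG_FAIR_GROUP_SCHED
/* An entity is a task iff it does not own a cfs_rq of its own */
#define entity_is_task(se)      (!(se)->my_q)
#else
#define entity_is_task(se)      1
#endif

static inline struct task_struct *task_of(struct sched_entity *se)
{
        /* only meaningful when entity_is_task(se) holds */
        return container_of(se, struct task_struct, se);
}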

set_skip_buddy() was incorrectly checking that the policy of the yielding
task was not SCHED_IDLE before marking it as the skip buddy. Yielding should
take effect even when the yielding task is SCHED_IDLE, so this change removes
the policy check on the yielding task.

Signed-off-by: Venkatesh Pallipadi <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched_fair.c | 24 ++++++++++++------------
1 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 8744593..501ab63 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1846,26 +1846,26 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)

static void set_last_buddy(struct sched_entity *se)
{
- if (likely(task_of(se)->policy != SCHED_IDLE)) {
- for_each_sched_entity(se)
- cfs_rq_of(se)->last = se;
- }
+ if (entity_is_task(se) && unlikely(task_of(se)->policy == SCHED_IDLE))
+ return;
+
+ for_each_sched_entity(se)
+ cfs_rq_of(se)->last = se;
}

static void set_next_buddy(struct sched_entity *se)
{
- if (likely(task_of(se)->policy != SCHED_IDLE)) {
- for_each_sched_entity(se)
- cfs_rq_of(se)->next = se;
- }
+ if (entity_is_task(se) && unlikely(task_of(se)->policy == SCHED_IDLE))
+ return;
+
+ for_each_sched_entity(se)
+ cfs_rq_of(se)->next = se;
}

static void set_skip_buddy(struct sched_entity *se)
{
- if (likely(task_of(se)->policy != SCHED_IDLE)) {
- for_each_sched_entity(se)
- cfs_rq_of(se)->skip = se;
- }
+ for_each_sched_entity(se)
+ cfs_rq_of(se)->skip = se;
}

/*

2011-04-19 12:06:01

by Venkatesh Pallipadi

[permalink] [raw]
Subject: [tip:sched/core] sched: Next buddy hint on sleep and preempt path

Commit-ID: 2f36825b176f67e5c5228aa33d828bc39718811f
Gitweb: http://git.kernel.org/tip/2f36825b176f67e5c5228aa33d828bc39718811f
Author: Venkatesh Pallipadi <[email protected]>
AuthorDate: Thu, 14 Apr 2011 10:30:53 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 19 Apr 2011 10:08:38 +0200

sched: Next buddy hint on sleep and preempt path

When a task in a taskgroup sleeps, pick_next_task starts all the way back at
the root and picks the task/taskgroup with the min vruntime across all
runnable tasks.

But when there are many frequently sleeping tasks across different taskgroups,
it makes better sense to stay with the same taskgroup for its slice period (or
until all tasks in the taskgroup sleep) instead of switching across taskgroups
on each sleep after a short runtime.

This helps specifically where a taskgroup corresponds to a process with
multiple threads. The change reduces the number of CR3 switches in this case.

Example:

Two taskgroups with 2 threads each, where each thread runs for 2ms and
sleeps for 1ms. Looking at sched:sched_switch shows:

BEFORE: taskgroup_1 threads [5004, 5005], taskgroup_2 threads [5016, 5017]
cpu-soaker-5004 [003] 3683.391089
cpu-soaker-5016 [003] 3683.393106
cpu-soaker-5005 [003] 3683.395119
cpu-soaker-5017 [003] 3683.397130
cpu-soaker-5004 [003] 3683.399143
cpu-soaker-5016 [003] 3683.401155
cpu-soaker-5005 [003] 3683.403168
cpu-soaker-5017 [003] 3683.405170

AFTER: taskgroup_1 threads [21890, 21891], taskgroup_2 threads [21934, 21935]
cpu-soaker-21890 [003] 865.895494
cpu-soaker-21935 [003] 865.897506
cpu-soaker-21934 [003] 865.899520
cpu-soaker-21935 [003] 865.901532
cpu-soaker-21934 [003] 865.903543
cpu-soaker-21935 [003] 865.905546
cpu-soaker-21891 [003] 865.907548
cpu-soaker-21890 [003] 865.909560
cpu-soaker-21891 [003] 865.911571
cpu-soaker-21890 [003] 865.913582
cpu-soaker-21891 [003] 865.915594
cpu-soaker-21934 [003] 865.917606

A similar problem exists when there are multiple taskgroups and, say, a task A
preempts the currently running task B of taskgroup_1. When schedule() then
runs, pick_next_task can pick an unrelated task from taskgroup_2. Here it
would be better to give some preference to task B in pick_next_task.

A simple (maybe extreme) benchmark I tried was tbench with 2 tbench client
processes of 2 threads each, all running on a single CPU. Avg throughput
across 5 runs of 50 sec each was:

BEFORE: 105.84 MB/sec
AFTER: 112.42 MB/sec

Signed-off-by: Venkatesh Pallipadi <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched_fair.c | 26 +++++++++++++++++++++++---
1 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 501ab63..5280272 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1344,6 +1344,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
hrtick_update(rq);
}

+static void set_next_buddy(struct sched_entity *se);
+
/*
* The dequeue_task method is called before nr_running is
* decreased. We remove the task from the rbtree and
@@ -1353,14 +1355,22 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
+ int task_sleep = flags & DEQUEUE_SLEEP;

for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
dequeue_entity(cfs_rq, se, flags);

/* Don't dequeue parent if it has other entities besides us */
- if (cfs_rq->load.weight)
+ if (cfs_rq->load.weight) {
+ /*
+ * Bias pick_next to pick a task from this cfs_rq, as
+ * p is sleeping when it is within its sched_slice.
+ */
+ if (task_sleep && parent_entity(se))
+ set_next_buddy(parent_entity(se));
break;
+ }
flags |= DEQUEUE_SLEEP;
}

@@ -1877,12 +1887,15 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
struct sched_entity *se = &curr->se, *pse = &p->se;
struct cfs_rq *cfs_rq = task_cfs_rq(curr);
int scale = cfs_rq->nr_running >= sched_nr_latency;
+ int next_buddy_marked = 0;

if (unlikely(se == pse))
return;

- if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK))
+ if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK)) {
set_next_buddy(pse);
+ next_buddy_marked = 1;
+ }

/*
* We can come here with TIF_NEED_RESCHED already set from new task
@@ -1910,8 +1923,15 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
update_curr(cfs_rq);
find_matching_se(&se, &pse);
BUG_ON(!pse);
- if (wakeup_preempt_entity(se, pse) == 1)
+ if (wakeup_preempt_entity(se, pse) == 1) {
+ /*
+ * Bias pick_next to pick the sched entity that is
+ * triggering this preemption.
+ */
+ if (!next_buddy_marked)
+ set_next_buddy(pse);
goto preempt;
+ }

return;