2008-02-21 05:38:15

by Balbir Singh

Subject: Make yield_task_fair more efficient

__pick_last_entity() walks down the tree in O(lg n) time to find the
rightmost entry. This patch makes the routine more efficient by
reducing the cost of the lookup to O(1).

Description
-----------

Cache the rightmost entry in the rb_tree in the same way that we cache
the leftmost entry. __pick_last_entity() now uses the cached value.
A cautious reviewer might note that the rb_erase() in __dequeue_entity()
could cause the tree to be rebalanced and thus affect the cached value.
In the case of the rightmost entity, we point the cache at the removed
node's parent, which becomes the rightmost entity even if the tree is
rebalanced.

NOTE: This depends on Peter Zijlstra's patch which checks for !rightmost.
Check out http://lkml.org/lkml/2008/2/18/295. Peter, that patch works
for me; could you please push that as well?

I've seen improvements with this patch, specifically for the case
reported by Yanmin (http://lkml.org/lkml/2008/2/13/128).

Comments?

Signed-off-by: Balbir Singh <[email protected]>
---

kernel/sched.c | 7 +++++++
kernel/sched_fair.c | 30 +++++++++++++++++++-----------
2 files changed, 26 insertions(+), 11 deletions(-)

diff -puN kernel/sched.c~make-sched-yield-fair-more-efficient kernel/sched.c
--- linux-2.6.25-rc2/kernel/sched.c~make-sched-yield-fair-more-efficient 2008-02-21 08:49:06.000000000 +0530
+++ linux-2.6.25-rc2-balbir/kernel/sched.c 2008-02-21 09:50:13.000000000 +0530
@@ -537,7 +537,13 @@ struct cfs_rq {
u64 min_vruntime;

struct rb_root tasks_timeline;
+
+ /*
+ * Cached data to help improve performance (avoid walking the
+ * tree)
+ */
struct rb_node *rb_leftmost;
+ struct rb_node *rb_rightmost;

struct list_head tasks;
struct list_head *balance_iterator;
@@ -7332,6 +7338,7 @@ static void init_cfs_rq(struct cfs_rq *c
cfs_rq->tasks_timeline = RB_ROOT;
INIT_LIST_HEAD(&cfs_rq->tasks);
cfs_rq->rb_leftmost = NULL;
+ cfs_rq->rb_rightmost = NULL;
#ifdef CONFIG_FAIR_GROUP_SCHED
cfs_rq->rq = rq;
#endif
diff -puN kernel/sched_fair.c~make-sched-yield-fair-more-efficient kernel/sched_fair.c
--- linux-2.6.25-rc2/kernel/sched_fair.c~make-sched-yield-fair-more-efficient 2008-02-21 08:49:06.000000000 +0530
+++ linux-2.6.25-rc2-balbir/kernel/sched_fair.c 2008-02-21 10:26:06.000000000 +0530
@@ -233,6 +233,7 @@ static void __enqueue_entity(struct cfs_
struct sched_entity *entry;
s64 key;
int leftmost = 1;
+ int rightmost = 1;

if (!entity_is_task(se))
return;
@@ -256,6 +257,7 @@ static void __enqueue_entity(struct cfs_
*/
if (key < entity_key(cfs_rq, entry)) {
link = &parent->rb_left;
+ rightmost = 0;
} else {
link = &parent->rb_right;
leftmost = 0;
@@ -268,6 +270,8 @@ static void __enqueue_entity(struct cfs_
*/
if (leftmost)
cfs_rq->rb_leftmost = &se->run_node;
+ if (rightmost)
+ cfs_rq->rb_rightmost = &se->run_node;

rb_link_node(&se->run_node, parent, link);
rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
@@ -286,6 +290,12 @@ static void __dequeue_entity(struct cfs_
if (cfs_rq->rb_leftmost == &se->run_node)
cfs_rq->rb_leftmost = rb_next(&se->run_node);

+ /*
+ * The parent becomes the rightmost node
+ */
+ if (cfs_rq->rb_rightmost == &se->run_node)
+ cfs_rq->rb_rightmost = rb_parent(&se->run_node);
+
rb_erase(&se->run_node, &cfs_rq->tasks_timeline);
}

@@ -294,6 +304,11 @@ static inline struct rb_node *first_fair
return cfs_rq->rb_leftmost;
}

+static inline struct rb_node *last_fair(struct cfs_rq *cfs_rq)
+{
+ return cfs_rq->rb_rightmost;
+}
+
static struct sched_entity *__pick_next_entity(struct cfs_rq *cfs_rq)
{
return rb_entry(first_fair(cfs_rq), struct sched_entity, run_node);
@@ -301,17 +316,10 @@ static struct sched_entity *__pick_next_

static inline struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
{
- struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;
- struct sched_entity *se = NULL;
- struct rb_node *parent;
-
- while (*link) {
- parent = *link;
- se = rb_entry(parent, struct sched_entity, run_node);
- link = &parent->rb_right;
- }
-
- return se;
+ struct rb_node *rightmost = last_fair(cfs_rq);
+ if (!rightmost)
+ return NULL;
+ return rb_entry(rightmost, struct sched_entity, run_node);
}

/**************************************************************
_

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


2008-02-21 06:04:53

by Ingo Molnar

Subject: Re: Make yield_task_fair more efficient


* Balbir Singh <[email protected]> wrote:

> __pick_last_entity() walks the entire tree on O(lgn) time to find the
> rightmost entry. This patch makes the routine more efficient by
> reducing the cost of the lookup

hm, i'm not sure we want to do this: we'd be slowing down the fastpath
of all the other common scheduler functions, for the sake of a rarely
used (and broken ...) API: yield. And note that an rbtree walk is not
slow at all - if you are yielding frequently it's probably all cached.

Ingo

2008-02-21 06:56:26

by Balbir Singh

Subject: Re: Make yield_task_fair more efficient

Ingo Molnar wrote:
> * Balbir Singh <[email protected]> wrote:
>
>> __pick_last_entity() walks the entire tree on O(lgn) time to find the
>> rightmost entry. This patch makes the routine more efficient by
>> reducing the cost of the lookup
>
> hm, i'm not sure we want to do this: we'd be slowing down the fastpath
> of all the other common scheduler functions, for the sake of a rarely
> used (and broken ...) API: yield. And note that an rbtree walk is not
> slow at all - if you are yielding frequently it's probably all cached.
>
> Ingo

Ingo,

I disagree. The cost is only adding a field to cfs_rq, and we already have the
logic to track the leftmost node; we just update the rightmost node as well.
For a large number of tasks - say 10000, we need to walk 14 levels before we
reach the node (each time). Doesn't matter if the data is cached, we are still
spending CPU time looking through pointers and walking to the right node. If all
the threads get a chance to run, you can imagine the effect it has on efficiency
and the data cache.

I see a boost in sched_yield performance and no big hit on regular performance.
Could you please test it to see if your apprehensions are correct?

PS: You forgot to add me to the cc/to list.


--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2008-02-21 07:07:54

by Ingo Molnar

Subject: Re: Make yield_task_fair more efficient


* Balbir Singh <[email protected]> wrote:

> I disagree. The cost is only adding a field to cfs_rq [...]

wrong. The cost is "only" that of adding a field to cfs_rq and _updating it_,
in the hottest paths of the scheduler:

@@ -256,6 +257,7 @@ static void __enqueue_entity(struct cfs_
*/
if (key < entity_key(cfs_rq, entry)) {
link = &parent->rb_left;
+ rightmost = 0;
} else {
link = &parent->rb_right;
leftmost = 0;
@@ -268,6 +270,8 @@ static void __enqueue_entity(struct cfs_
*/
if (leftmost)
cfs_rq->rb_leftmost = &se->run_node;
+ if (rightmost)
+ cfs_rq->rb_rightmost = &se->run_node;

> [...] For a large number of tasks - say 10000, we need to walk 14
> levels before we reach the node (each time). [...]

10,000 yielding tasks is not a common workload we care about. It's not
even a rare workload we care about. _Especially_ we don't care about it
if it slows down every other workload (a tiny bit).

> [...] Doesn't matter if the data is cached, we are still spending CPU
> time looking through pointers and walking to the right node. [...]

have you actually measured how much it takes to walk the tree that deep
on recent hardware? I have.

Ingo

2008-02-21 07:45:09

by Balbir Singh

Subject: Re: Make yield_task_fair more efficient

Ingo Molnar wrote:
> * Balbir Singh <[email protected]> wrote:
>
>> I disagree. The cost is only adding a field to cfs_rq [...]
>
> wrong. The cost is "only" of adding a field to cfs_rq and _updating it_,
> in the hottest paths of the scheduler:
>
> @@ -256,6 +257,7 @@ static void __enqueue_entity(struct cfs_
> */
> if (key < entity_key(cfs_rq, entry)) {
> link = &parent->rb_left;
> + rightmost = 0;

That's an update when we move leftwards.

> } else {
> link = &parent->rb_right;
> leftmost = 0;
> @@ -268,6 +270,8 @@ static void __enqueue_entity(struct cfs_
> */
> if (leftmost)
> cfs_rq->rb_leftmost = &se->run_node;
> + if (rightmost)
> + cfs_rq->rb_rightmost = &se->run_node;
>

&se->run_node is already in the cache; we are just assigning it to cfs_rq->rb_rightmost.

>> [...] For a large number of tasks - say 10000, we need to walk 14
>> levels before we reach the node (each time). [...]
>
> 10,000 yield-ing tasks is not a common workload we care about. It's not
> even a rare workload we care about. _Especially_ we dont care about it
> if it slows down every other workload (a tiny bit).
>

sched_yield() is a supported API; also look at
http://lkml.org/lkml/2007/9/19/351. I am trying to make sched_yield() efficient
when compat_sched_yield is turned on (which is most likely), since people will
want that behaviour (Hint, please read the man-page for sched_yield). There are
already several applications using sched_yield(), so they all suffer.

>> [...] Doesn't matter if the data is cached, we are still spending CPU
>> time looking through pointers and walking to the right node. [...]
>
> have you actually measured how much it takes to walk the tree that deep
> on recent hardware? I have.

I have measured how much time can be saved by not doing that and it's quite a lot.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2008-02-21 08:44:23

by Peter Zijlstra

Subject: Re: Make yield_task_fair more efficient


On Thu, 2008-02-21 at 13:09 +0530, Balbir Singh wrote:

> sched_yield() is supported API

For SCHED_FIFO/SCHED_RR.

> and also look at http://lkml.org/lkml/2007/9/19/351.

Read on (http://lkml.org/lkml/2007/9/19/371) and find:

The sched_yield() behaviour is actually very well-defined for RT
tasks (now, whether it's a good interface to use or not is still
open to debate, but at least it's a _defined_ interface ;), and
I think we should at least try to approximate that behaviour for
normal tasks, even if they aren't RT.

sched_yield() just isn't a very useful API except in RT and even there
one can question it.

> I am trying to make sched_yield() efficient
> when compat_sched_yield is turned on (which is most likely), since people will
> want that behaviour (Hint, please read the man-page for sched_yield).

I did, so what? Read the POSIX spec - it's just plain undefined for
SCHED_OTHER. We already do something 'sensible' for those few broken
applications relying on unspecified behaviour.

> There are
> already several applications using sched_yield(), so they all suffer.

Who is using it, where is their source so we can show it's faster to not
use it?

Really, hiding behind closed sores doesn't make it good.

2008-02-21 08:55:19

by Balbir Singh

Subject: Re: Make yield_task_fair more efficient

Peter Zijlstra wrote:
> On Thu, 2008-02-21 at 13:09 +0530, Balbir Singh wrote:
>
>> sched_yield() is supported API
>
> For SCHED_FIFO/SCHED_RR.
>
>> and also look at http://lkml.org/lkml/2007/9/19/351.
>
> Read on (http://lkml.org/lkml/2007/9/19/371) and find:
>
> The sched_yield() behaviour is actually very well-defined for RT
> tasks (now, whether it's a good interface to use or not is still
> open to debate, but at least it's a _defined_ interface ;), and
> I think we should at least try to approximate that behaviour for
> normal tasks, even if they aren't RT.
>
> sched_yield() just isn't a very useful API except in RT and even there
> one can question it.
>

But it's being used very heavily by existing applications. Can we break/penalize
them because we think it's not a good interface or a good way to program?


>> I am trying to make sched_yield() efficient
>> when compat_sched_yield is turned on (which is most likely), since people will
>> want that behaviour (Hint, please read the man-page for sched_yield).
>
> I did, so what? read the POSIX spec - its just plain undefined for
> SCHED_OTHER. We already do something 'sensible' for those few broken
> applications relying on unspecified behaviour.
>
>> There are
>> already several applications using sched_yield(), so they all suffer.
>
> Who is using it, where is their source so we can show its faster to not
> use it?
>
> Really, hiding behind closed sores doesn't make it good.
>
>

I am not hiding behind closed source. Please refer to the regression reported by
Yanmin where he used compat_sched_yield=1 (see
http://lkml.org/lkml/2008/2/18/7). The regression might be due to other reasons,
but the point is that people are using compat_sched_yield=1.

If you insist that sched_yield() is bad, I might agree, but how does my patch
make things worse? In my benchmarks it has helped the sched_yield case; why is
that bad? As far as touching the hot path is concerned, give me a benchmark that
I can run on my desktop to convince you that the overhead of the patch is not
all that high. If there is an overhead that is very visible, I'll stop
pushing the patch.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2008-02-21 09:05:19

by Ingo Molnar

Subject: Re: Make yield_task_fair more efficient


* Balbir Singh <[email protected]> wrote:

> If you insist that sched_yield() is bad, I might agree, but how does
> my patch make things worse. [...]

it puts new instructions into the hotpath.

> [...] In my benchmarks, it has helped the sched_yield case, why is
> that bad? [...]

I had the same cache for the rightmost task in earlier CFS (it's a
really obvious thing) but removed it. It wasn't a bad idea, but it hurt
the fastpath, hence I removed it. Algorithms and implementations are a
constant balancing act.

Ingo

2008-02-21 09:37:53

by Balbir Singh

Subject: Re: Make yield_task_fair more efficient

Ingo Molnar wrote:
> * Balbir Singh <[email protected]> wrote:
>
>> If you insist that sched_yield() is bad, I might agree, but how does
>> my patch make things worse. [...]
>
> it puts new instructions into the hotpath.
>
>> [...] In my benchmarks, it has helped the sched_yield case, why is
>> that bad? [...]
>
> I had the same cache for the rightmost task in earlier CFS (it's a
> really obvious thing) but removed it. It wasnt a bad idea, but it hurt
> the fastpath hence i removed it. Algorithms and implementations are a
> constant balancing act.

This is more convincing. Was the code ever in git? How did you measure the
overhead? What are your plans for regression reports where
kernel.compat_sched_yield is set to 1?

I have an alternate approach in mind (that I need to find time for):
threaded rbtrees. Walking the tree is really efficient, especially finding
the successor of a node.


--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2008-02-21 09:42:39

by Ingo Molnar

Subject: Re: Make yield_task_fair more efficient


* Balbir Singh <[email protected]> wrote:

> I have an alternate approach in mind (that I need to find time for),
> threaded-rbtrees. Walking the tree is really efficient, specially
> finding successor of a node.

sure, feel free to experiment with those details.

But if you want to improve Java workloads then you should probably start
by converting them to futexes instead of sched_yield(); that will
probably give far more performance than micro-optimizing the
sys_sched_yield() codepath (especially when the latter happens at the expense
of other workloads, which is not acceptable for mainline). You can
rebuild your JVM easily and re-test with its locking fixed, right?
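
For illustration only - a user-space sketch, not code from this thread, and
the helper names below are made up for the example - the contrast is between
spinning on sched_yield() and blocking on a futex-backed pthread primitive
(glibc's pthread_cond_wait() sleeps on a futex):

#include <pthread.h>
#include <sched.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static volatile int ready;

/* Anti-pattern: burn timeslices until the flag flips. */
static void wait_spin_yield(void)
{
	while (!ready)
		sched_yield();
}

/* Preferred: sleep until signalled; the waiter consumes no CPU.
 * Assumes the producer sets 'ready' under 'lock' and signals 'cond'. */
static void wait_blocking(void)
{
	pthread_mutex_lock(&lock);
	while (!ready)
		pthread_cond_wait(&cond, &lock);
	pthread_mutex_unlock(&lock);
}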

Ingo

2008-02-21 09:43:34

by Peter Zijlstra

Subject: Re: Make yield_task_fair more efficient


On Thu, 2008-02-21 at 15:01 +0530, Balbir Singh wrote:

> I have an alternate approach in mind (that I need to find time for),
> threaded-rbtrees. Walking the tree is really efficient, specially finding
> successor of a node.

Threading the rbtrees would be even more expensive; it would require a
list_head in each node and a full list operation for every tree
operation.

2008-02-21 09:48:51

by Balbir Singh

Subject: Re: Make yield_task_fair more efficient

Peter Zijlstra wrote:
> On Thu, 2008-02-21 at 15:01 +0530, Balbir Singh wrote:
>
>> I have an alternate approach in mind (that I need to find time for),
>> threaded-rbtrees. Walking the tree is really efficient, specially finding
>> successor of a node.
>
> Threading the rbtrees would be even more expensive, it would require a
> list_head in each node and a full list operation for every tree
> operation.
>

Peter, when I say threaded, I don't mean threaded as in tasks. Please see
http://www.nist.gov/dads/HTML/threadedtree.html and
http://datastructures.itgo.com/trees/tbt.htm

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2008-02-21 09:49:31

by Balbir Singh

Subject: Re: Make yield_task_fair more efficient

Ingo Molnar wrote:
> * Balbir Singh <[email protected]> wrote:
>

You did not answer some of my earlier questions.

>> I have an alternate approach in mind (that I need to find time for),
>> threaded-rbtrees. Walking the tree is really efficient, specially
>> finding successor of a node.
>
> sure, feel free to experiment with those details.
>
> But if you want to improve Java workloads then you should probably start
> by converting them to futexes instead of sched_yield(), that will
> probably give far more performance than micro-optimizing the
> sys_sched_yield() codepath. (especially when it happens at the expense
> of other workloads, which is not acceptable for mainline) You can
> rebuild your JVM easily and re-test with its locking fixed, right?

No, I don't want to optimize the JVM or rebuild it. I don't have access to any
JVM code either. I wanted to do the threaded rb-trees for the case where we
spend time finding the successor using rb_next() (walking the tree).

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2008-02-21 10:01:55

by Peter Zijlstra

Subject: Re: Make yield_task_fair more efficient


On Thu, 2008-02-21 at 15:12 +0530, Balbir Singh wrote:
> Peter Zijlstra wrote:
> > On Thu, 2008-02-21 at 15:01 +0530, Balbir Singh wrote:
> >
> >> I have an alternate approach in mind (that I need to find time for),
> >> threaded-rbtrees. Walking the tree is really efficient, specially finding
> >> successor of a node.
> >
> > Threading the rbtrees would be even more expensive, it would require a
> > list_head in each node and a full list operation for every tree
> > operation.
> >
>
> Peter, when I say threaded, I don't mean threaded as in tasks. Please see
> http://www.nist.gov/dads/HTML/threadedtree.html and
> http://datastructures.itgo.com/trees/tbt.htm

I'm quite aware of the meaning of threaded in the context of data
structures. They usually require additional list data and operations to
implement.

Look at how the latter link you provided grows the rb_node with one
pointer to provide a singly linked list.

2008-02-21 10:12:16

by Balbir Singh

Subject: Re: Make yield_task_fair more efficient

Peter Zijlstra wrote:
> On Thu, 2008-02-21 at 15:12 +0530, Balbir Singh wrote:
>> Peter Zijlstra wrote:
>>> On Thu, 2008-02-21 at 15:01 +0530, Balbir Singh wrote:
>>>
>>>> I have an alternate approach in mind (that I need to find time for),
>>>> threaded-rbtrees. Walking the tree is really efficient, specially finding
>>>> successor of a node.
>>> Threading the rbtrees would be even more expensive, it would require a
>>> list_head in each node and a full list operation for every tree
>>> operation.
>>>
>> Peter, when I say threaded, I don't mean threaded as in tasks. Please see
>> http://www.nist.gov/dads/HTML/threadedtree.html and
>> http://datastructures.itgo.com/trees/tbt.htm
>
> I'm quite aware of the meaning of threaded in the context of data
> structures.

Oh! I did not mean to assume anything. I just wanted to make sure we are talking
about the same thing.

> They usually require additional list data and operations to
> implement.
>

From the definition:
A Threaded Binary Tree is a binary tree in which every node that does not have a
right child has a THREAD (in effect, a link) to its INORDER successor.

You use the empty pointer (the missing right child), so why do we need a list?
Maybe I am missing something.
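
As a rough standalone illustration of that definition (a toy structure with
made-up names, not kernel code): with a right thread in place, the in-order
successor is found without walking back up through parent pointers or keeping
any extra list:

struct tnode {
	struct tnode *left;
	struct tnode *right;
	int right_is_thread;	/* 'right' points to the in-order successor */
};

static struct tnode *tnode_next(struct tnode *n)
{
	if (n->right_is_thread)
		return n->right;	/* follow the thread directly */
	n = n->right;			/* real right child: descend leftwards */
	if (!n)
		return NULL;		/* rightmost node, no successor */
	while (n->left)
		n = n->left;
	return n;
}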

> Look at how the latter link you provided grows the rb_node with one
> pointer to provide a single linked list.
>

Yes, that's a bad example I suppose. Look at
http://www.cs.rutgers.edu/~kaplan/503/thread.html

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2008-02-21 10:22:19

by Balbir Singh

Subject: Re: Make yield_task_fair more efficient

Balbir Singh wrote:
> Ingo Molnar wrote:
>> * Balbir Singh <[email protected]> wrote:
>>
>>> If you insist that sched_yield() is bad, I might agree, but how does
>>> my patch make things worse. [...]
>> it puts new instructions into the hotpath.
>>
>>> [...] In my benchmarks, it has helped the sched_yield case, why is
>>> that bad? [...]
>> I had the same cache for the rightmost task in earlier CFS (it's a
>> really obvious thing) but removed it. It wasnt a bad idea, but it hurt
>> the fastpath hence i removed it. Algorithms and implementations are a
>> constant balancing act.
>
> This is more convincing, was the code ever in git? How did you measure the
> overhead? What are your plans for reports with regressions where
> kernel.compat_sched_yield is set to 1?
>

Ingo,

I was looking through nptl/sysdeps/unix/sysv/linux/pthread_yield.c in the glibc
sources and it looks like pthread_yield() calls sched_yield(). Applications
using compat_sched_yield and pthread_yield() are also going to be impacted.

I searched for pthread_yield and sched_yield using Google Code Search to look at
the applications that use these routines; the search list is too big for me to
mark applications as candidates for potential improvement.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2008-02-21 11:13:20

by Peter Zijlstra

Subject: Re: Make yield_task_fair more efficient


On Thu, 2008-02-21 at 15:37 +0530, Balbir Singh wrote:

> You use the empty pointer (missing right child), so why do we need a list. May
> be I am missing something.

A fully threaded tree also has a back-pointer to traverse backwards
through the ordered elements.

That said, overloading the right child pointer might not be the best
thing for the linux kernel, as it will impact all the rb-tree lookups
which are open-coded and often performance critical (this is the reason
the colour isn't bit encoded in either of the child pointers either).

But if you only want a uni-directional thread, I guess we can stick it
in the unsigned long we use for the node colour.
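
A minimal sketch of that idea (purely hypothetical: the RB_RIGHT_THREAD flag
and the helpers below are not in rbtree.h; the only assumption is that
rb_node's alignment leaves bit 1 of rb_parent_color free, bit 0 holding the
colour):

#include <linux/rbtree.h>

#define RB_RIGHT_THREAD	2UL	/* hypothetical: bit 1 of rb_parent_color */

static inline int rb_right_is_thread(const struct rb_node *n)
{
	return n->rb_parent_color & RB_RIGHT_THREAD;
}

/* With such a flag, a successor lookup no longer needs the "walk up
 * until we come from a left child" loop of today's rb_next(). */
static struct rb_node *rb_next_threaded(struct rb_node *n)
{
	if (rb_right_is_thread(n))
		return n->rb_right;	/* rb_right holds the successor */
	n = n->rb_right;
	while (n && n->rb_left)
		n = n->rb_left;
	return n;
}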

Still, perhaps it's worth it to grow rb_node to 4 words and do the fully
threaded thing as there are also a lot of rb_prev() users in the kernel.
Who knows..

Anyway, I agree that improving rb_next() is worth looking into for the
scheduler.

2008-02-21 11:31:52

by Balbir Singh

Subject: Re: Make yield_task_fair more efficient

Peter Zijlstra wrote:
> On Thu, 2008-02-21 at 15:37 +0530, Balbir Singh wrote:
>
>> You use the empty pointer (missing right child), so why do we need a list. May
>> be I am missing something.
>
> A fully threaded tree also has back-pointer to traverse backwards
> through the ordered elements.
>

The back pointer is implemented using the missing left children. I am referring to
Don Knuth's The Art of Computer Programming, volume 1, section 2.3.1, page 322.
Please see the threaded representation of the tree; it is compared with the
unthreaded representation.

> That said, overloading the right child pointer might not be the best
> thing for the linux kernel, as it will impact all the rb-tree lookups
> which are open-coded and often performance critical (this is the reason
> the colour isn't bit encoded in either of the child pointers either).
>

We always look at the child, irrespective of whether it is to the right or left,
to terminate our walk. Overloading the child pointers and setting a bit that
marks a link as a thread should not be that bad.

> But if you only want a uni directional thread, I guess we can stick it
> in the unsigned long we use for the node colour.
>
> Still, perhaps it's worth it to grow rb_node to 4 words and do the fully
> threaded thing as there are also a lot of rb_prev() users in the kernel.
> Who knows..
>

Why does the rb_node need to grow? We can encode the bits in the child pointers.

> Anyway, I agree that improving rb_next() is worth looking into for the
> scheduler.

Sure, will experiment and see if we can bump up performance.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2008-02-21 12:02:49

by Mike Galbraith

Subject: Re: Make yield_task_fair more efficient

Hi,

On Thu, 2008-02-21 at 15:01 +0530, Balbir Singh wrote:
> Ingo Molnar wrote:
> > * Balbir Singh <[email protected]> wrote:
> >
> >> If you insist that sched_yield() is bad, I might agree, but how does
> >> my patch make things worse. [...]
> >
> > it puts new instructions into the hotpath.
> >
> >> [...] In my benchmarks, it has helped the sched_yield case, why is
> >> that bad? [...]
> >
> > I had the same cache for the rightmost task in earlier CFS (it's a
> > really obvious thing) but removed it. It wasnt a bad idea, but it hurt
> > the fastpath hence i removed it. Algorithms and implementations are a
> > constant balancing act.
>
> This is more convincing, was the code ever in git? How did you measure the
> overhead?

Counting enqueue/dequeue cycles on my 3GHz P4/HT running a 60 second
netperf test that does ~85k/s context switches shows:

sched_cycles: 7198444348 unpatched
vs
sched_cycles: 8574036268 patched

-Mike

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e217d18..a71edfe 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1262,6 +1262,7 @@ struct task_struct {
int latency_record_count;
struct latency_record latency_record[LT_SAVECOUNT];
#endif
+ unsigned long long sched_cycles;
};

/*
diff --git a/kernel/exit.c b/kernel/exit.c
index 506a957..c5c3d5e 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -174,6 +174,8 @@ repeat:
}

write_unlock_irq(&tasklist_lock);
+ if (!strcmp("netperf", p->comm) || !strcmp("netserver", p->comm))
+ printk(KERN_WARNING "%s:%d sched_cycles: %Ld\n", p->comm, p->pid, p->sched_cycles);
release_thread(p);
call_rcu(&p->rcu, delayed_put_task_struct);

diff --git a/kernel/sched.c b/kernel/sched.c
index f28f19e..d7c1b0b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1301,15 +1301,25 @@ static void set_load_weight(struct task_struct *p)

static void enqueue_task(struct rq *rq, struct task_struct *p, int wakeup)
{
+ unsigned long long then, now;
+
+ rdtscll(then);
sched_info_queued(p);
p->sched_class->enqueue_task(rq, p, wakeup);
p->se.on_rq = 1;
+ rdtscll(now);
+ p->sched_cycles += now - then;
}

static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
{
+ unsigned long long then, now;
+
+ rdtscll(then);
p->sched_class->dequeue_task(rq, p, sleep);
p->se.on_rq = 0;
+ rdtscll(now);
+ p->sched_cycles += now - then;
}

/*
@@ -2009,6 +2019,7 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
update_rq_clock(rq);

p->prio = effective_prio(p);
+ p->sched_cycles = 0;

if (!p->sched_class->task_new || !current->se.on_rq) {
activate_task(rq, p, 0);

2008-02-21 12:06:34

by Mike Galbraith

Subject: Re: Make yield_task_fair more efficient


On Thu, 2008-02-21 at 13:02 +0100, Mike Galbraith wrote:
> sched_cycles: 7198444348 unpatched
> vs
> sched_cycles: 8574036268 patched

(in case it's not clear, patched means your patch, not my quick/dirty
countem patch:)

2008-02-21 13:33:57

by Balbir Singh

Subject: Re: Make yield_task_fair more efficient

Mike Galbraith wrote:
> Hi,
>
> On Thu, 2008-02-21 at 15:01 +0530, Balbir Singh wrote:
>> Ingo Molnar wrote:
>>> * Balbir Singh <[email protected]> wrote:
>>>
>>>> If you insist that sched_yield() is bad, I might agree, but how does
>>>> my patch make things worse. [...]
>>> it puts new instructions into the hotpath.
>>>
>>>> [...] In my benchmarks, it has helped the sched_yield case, why is
>>>> that bad? [...]
>>> I had the same cache for the rightmost task in earlier CFS (it's a
>>> really obvious thing) but removed it. It wasnt a bad idea, but it hurt
>>> the fastpath hence i removed it. Algorithms and implementations are a
>>> constant balancing act.
>> This is more convincing, was the code ever in git? How did you measure the
>> overhead?
>
> Counting enqueue/dequeue cycles on my 3GHz P4/HT running a 60 seconds
> netperf test that does ~85k/s context switches shows:
>
> sched_cycles: 7198444348 unpatched
> vs
> sched_cycles: 8574036268 patched


Thanks for the numbers! I am very convinced that the patch should stay out until
we can find a way to reduce the overhead. I'll try your patch and see what the
numbers look like as well.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2008-02-21 14:39:41

by Jörn Engel

Subject: Re: Make yield_task_fair more efficient

On Thu, 21 February 2008 12:21:33 +0530, Balbir Singh wrote:
>
> For a large number of tasks - say 10000, we need to walk 14 levels before we

16.7, actually. rbtrees are balanced trees, but they are not perfectly balanced
binary trees. The average fan-out is sqrt(3) instead of 2, so the depth for
10000 tasks is log(10000)/log(sqrt(3)).

Jörn

--
The cost of changing business rules is much more expensive for software
than for a secretary.
-- unknown

2008-02-21 20:39:20

by Jens Axboe

Subject: Re: Make yield_task_fair more efficient

On Thu, Feb 21 2008, Peter Zijlstra wrote:
>
> On Thu, 2008-02-21 at 15:37 +0530, Balbir Singh wrote:
>
> > You use the empty pointer (missing right child), so why do we need a list. May
> > be I am missing something.
>
> A fully threaded tree also has back-pointer to traverse backwards
> through the ordered elements.
>
> That said, overloading the right child pointer might not be the best
> thing for the linux kernel, as it will impact all the rb-tree lookups
> which are open-coded and often performance critical (this is the reason
> the colour isn't bit encoded in either of the child pointers either).
>
> But if you only want a uni directional thread, I guess we can stick it
> in the unsigned long we use for the node colour.
>
> Still, perhaps it's worth it to grow rb_node to 4 words and do the fully
> threaded thing as there are also a lot of rb_prev() users in the kernel.
> Who knows..
>
> Anyway, I agree that improving rb_next() is worth looking into for the
> scheduler.

For the IO scheduler as well, it's used quite extensively! So speeding
up rb_next() would definitely help, as it's typically invoked for every
bio queued (attempting to back merge with the next request). CFQ and AS
additionally do an rb_next() and rb_prev() when trying to decide which
request to do next.

--
Jens Axboe

2008-02-21 20:55:59

by Jens Axboe

Subject: Re: Make yield_task_fair more efficient

On Thu, Feb 21 2008, Jens Axboe wrote:
> On Thu, Feb 21 2008, Peter Zijlstra wrote:
> >
> > On Thu, 2008-02-21 at 15:37 +0530, Balbir Singh wrote:
> >
> > > You use the empty pointer (missing right child), so why do we need a list. May
> > > be I am missing something.
> >
> > A fully threaded tree also has back-pointer to traverse backwards
> > through the ordered elements.
> >
> > That said, overloading the right child pointer might not be the best
> > thing for the linux kernel, as it will impact all the rb-tree lookups
> > which are open-coded and often performance critical (this is the reason
> > the colour isn't bit encoded in either of the child pointers either).
> >
> > But if you only want a uni directional thread, I guess we can stick it
> > in the unsigned long we use for the node colour.
> >
> > Still, perhaps it's worth it to grow rb_node to 4 words and do the fully
> > threaded thing as there are also a lot of rb_prev() users in the kernel.
> > Who knows..
> >
> > Anyway, I agree that improving rb_next() is worth looking into for the
> > scheduler.
>
> For the IO scheduler as well, it's used quite extensively! So speeding
> up rb_next() would definitely help, as it's typically invoked for every
> bio queued (attempting to back merge with the next request). CFQ and AS
> additionally does an rb_next() and rb_prev() when trying to decide which
> request to do next.

One possible course of action to implement this without eating extra
space in the rb_node would be:

- Add rb_right() and rb_set_right() (plus ditto _left variants) to
rbtree.h
- Convert all in-kernel users to use these. Quite extensive, as the
rbtree code search/insert functions are coded in situ and not in
rbtree.[ch]
- Now we can overload bit 0 of ->rb_right and ->rb_left to indicate
whether this is a node or thread pointer and modify rbtree.c to tag
and add the thread links when appropriate.
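
A rough sketch of what steps 1 and 3 might look like together (illustrative
only; none of these helpers exist in rbtree.h today, and the bit-0 tagging is
the hypothetical part):

#include <linux/rbtree.h>

#define RB_THREAD	1UL	/* hypothetical tag in bit 0 of a child pointer */

static inline struct rb_node *rb_right(const struct rb_node *n)
{
	/* A tagged pointer is a thread, i.e. no real right child, so
	 * open-coded searches terminate exactly where they do today. */
	if ((unsigned long)n->rb_right & RB_THREAD)
		return NULL;
	return n->rb_right;
}

static inline void rb_set_right(struct rb_node *n, struct rb_node *right)
{
	n->rb_right = right;
}

static inline void rb_set_right_thread(struct rb_node *n, struct rb_node *succ)
{
	n->rb_right = (struct rb_node *)((unsigned long)succ | RB_THREAD);
}

Once every user goes through rb_right()/rb_set_right(), rbtree.c can start
setting the tag without the open-coded lookups ever seeing a thread pointer.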

So we can definitely do this in a compatible fashion. Given that I have
a flight coming up in a few days' time, I may give it a go if no one
beats me to it :-)

--
Jens Axboe

2008-02-22 03:32:16

by Balbir Singh

Subject: Re: Make yield_task_fair more efficient

Jens Axboe wrote:
> On Thu, Feb 21 2008, Jens Axboe wrote:
>> On Thu, Feb 21 2008, Peter Zijlstra wrote:
>>> On Thu, 2008-02-21 at 15:37 +0530, Balbir Singh wrote:
>>>
>>>> You use the empty pointer (missing right child), so why do we need a list. May
>>>> be I am missing something.
>>> A fully threaded tree also has back-pointer to traverse backwards
>>> through the ordered elements.
>>>
>>> That said, overloading the right child pointer might not be the best
>>> thing for the linux kernel, as it will impact all the rb-tree lookups
>>> which are open-coded and often performance critical (this is the reason
>>> the colour isn't bit encoded in either of the child pointers either).
>>>
>>> But if you only want a uni directional thread, I guess we can stick it
>>> in the unsigned long we use for the node colour.
>>>
>>> Still, perhaps it's worth it to grow rb_node to 4 words and do the fully
>>> threaded thing as there are also a lot of rb_prev() users in the kernel.
>>> Who knows..
>>>
>>> Anyway, I agree that improving rb_next() is worth looking into for the
>>> scheduler.
>> For the IO scheduler as well, it's used quite extensively! So speeding
>> up rb_next() would definitely help, as it's typically invoked for every
>> bio queued (attempting to back merge with the next request). CFQ and AS
>> additionally does an rb_next() and rb_prev() when trying to decide which
>> request to do next.
>
> One possible course of action to implement this without eating extra
> space in the rb_node would be:
>
> - Add rb_right() and rb_set_right() (plus ditto _left variants) to
> rbtree.h
> - Convert all in-kernel users to use these. Quite extensive, as the
> rbtree code search/insert functions are coded in situ and not in
> rbtree.[ch]
> - Now we can overload bit 0 of ->rb_right and ->rb_left to indicate
> whether this is a node or thread pointer and modify rbtree.c to tag
> and add the thread links when appropriate.
>

Exactly along the lines I was thinking of and discussing with David.

> So we can definitely do this in a compatible fashion. Given that I have
> a flight coming up in a few days time, I may give it a got if no one
> beats me to it :-)
>

Feel free to do so; please do keep me on the cc. I am very interested in getting
threaded rb-trees done, but my bandwidth is a little limited this month.


--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL