fresh back from the Kernel Summit, Peter Zijlstra and I are pleased to
announce the latest iteration of the CFS scheduler development tree. Our
main focus has been on simplifications and performance - and as part of
that we've also picked up some ideas from Roman Zippel's 'Really Fair
Scheduler' patch and integrated them into CFS. We'd like to ask
people to give these patches a good workout, especially with an eye on
any interactivity regressions.
The combo patch against 2.6.23-rc6 can be picked up from:
http://people.redhat.com/mingo/cfs-scheduler/devel/
The sched-devel.git tree can be pulled from:
git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git
There are lots of small performance improvements in the form of a
fine-grained 29-patch series. We have removed a number of features and
metrics from CFS that may once have been needed but ended up being
superfluous - while keeping the things that worked out fine, like
sleeper fairness. On 32-bit x86 there's a ~16% speedup (over -rc6) in
lmbench (lat_ctx -s 0 2) results:
(microseconds, lower is better)
------------------------------------------------------------
v2.6.22 2.6.23-rc6(CFS) v2.6.23-rc6-CFS-devel
----------------------------------------------------
0.70 0.75 0.65
0.62 0.66 0.63
0.60 0.72 0.69
0.62 0.74 0.61
0.69 0.73 0.53
0.66 0.73 0.63
0.63 0.69 0.61
0.63 0.70 0.64
0.61 0.76 0.61
0.69 0.74 0.63
----------------------------------------------------
avg: 0.64 0.72 (+12%) 0.62 (-3%)
there is a similar speedup on 64-bit x86 as well. We are now a bit
faster than the O(1) scheduler was under v2.6.22 - even on 32-bit. The
main speedup comes from the avoidance of divisions (or shifts) in the
wakeup and context-switch fastpaths.
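The reciprocal trick behind that speedup can be sketched roughly as follows; this is an illustrative, simplified version of our own (names and constants are ours, not the actual kernel code) showing how a division by a load weight in a hot path can be replaced by a multiplication with a precomputed fixed-point inverse:

```c
#include <assert.h>
#include <stdint.h>

#define WMULT_SHIFT 32

struct load_weight {
	uint64_t weight;     /* e.g. 1024 for nice 0 */
	uint64_t inv_weight; /* precomputed 2^32 / weight */
};

static void set_load_weight(struct load_weight *lw, uint64_t weight)
{
	lw->weight = weight;
	/* one division at setup time... */
	lw->inv_weight = (1ULL << WMULT_SHIFT) / weight;
}

/* ...so the fastpath needs only a multiply and a shift */
static uint64_t calc_delta(uint64_t delta, const struct load_weight *lw)
{
	return (delta * lw->inv_weight) >> WMULT_SHIFT;
}
```

For weights that divide 2^32 evenly (like 1024) the result is exact; for others it is a close approximation, which is acceptable for scheduling arithmetic.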
there's also a visible reduction in code size:
text data bss dec hex filename
13369 228 2036 15633 3d11 sched.o.before (UP, nodebug)
11167 224 1988 13379 3443 sched.o.after (UP, nodebug)
which obviously helps embedded and is good for performance as well. Even
on 32-bit we are now within 1% of the size of v2.6.22's sched.o, which
was:
text data bss dec hex filename
9915 24 3344 13283 33e3 sched.o.v2.6.22
and on SMP the new scheduler is now substantially smaller:
text data bss dec hex filename
24972 4149 24 29145 71d9 sched.o-v2.6.22
24056 2594 16 26666 682a sched.o-CFS-devel
Changes: besides the many micro-optimizations, one of the changes is
that se->vruntime (virtual runtime) based scheduling has been introduced
gradually, step by step, while keeping the wait_runtime metric working
too (so that the two methods are comparable side by side, in the same
scheduler).
The ->vruntime metric is similar to the ->time_norm metric used by
Roman's patch (and both are loosely related to the already existing
sum_exec_runtime metric in CFS): it is in essence the sum of CPU time
executed by a task, in nanoseconds, weighted up or down by the task's
nice level (or kept the same at the default nice 0 level). Beyond this
basic metric, our implementation and math differ from RFS. The two
approaches should be conceptually more comparable from now on.
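As a rough sketch of the metric described above (illustrative names and structures, not the actual CFS code): each entity accumulates executed CPU time scaled by its load weight, so a nice-0 task's vruntime advances in real nanoseconds while a heavier (lower-nice) task's vruntime advances more slowly:

```c
#include <assert.h>
#include <stdint.h>

#define NICE_0_LOAD 1024ULL

struct sched_entity {
	uint64_t vruntime;    /* weighted sum of executed CPU time, in ns */
	uint64_t load_weight; /* NICE_0_LOAD at nice 0, larger below it */
};

/* charge delta_exec nanoseconds of CPU time to the entity, weighted
 * by its nice level (expressed here as a load weight) */
static void update_vruntime(struct sched_entity *se, uint64_t delta_exec)
{
	se->vruntime += delta_exec * NICE_0_LOAD / se->load_weight;
}
```

Picking the entity with the smallest vruntime then yields weighted fairness: heavier tasks get proportionally more CPU before their virtual clock catches up.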
We have also picked up two cleanups from RFS (the cfs_rq->curr approach
and an uninlining optimization), and there's also a cleanup patch from
Matthias Kaehlcke. We welcome and encourage fine-grained patches against
this patchset. As usual, bug reports, fixes and suggestions are welcome,
Ingo, Peter
------------------>
Matthias Kaehlcke (1):
sched: use list_for_each_entry_safe() in __wake_up_common()
Peter Zijlstra (5):
sched: simplify SCHED_FEAT_* code
sched: new task placement for vruntime
sched: simplify adaptive latency
sched: clean up new task placement
sched: add tree based averages
Ingo Molnar (23):
sched: fix new-task method
sched: small sched_debug cleanup
sched: debug: track maximum 'slice'
sched: uniform tunings
sched: use constants if !CONFIG_SCHED_DEBUG
sched: remove stat_gran
sched: remove precise CPU load
sched: remove precise CPU load calculations #2
sched: track cfs_rq->curr on !group-scheduling too
sched: cleanup: simplify cfs_rq_curr() methods
sched: uninline __enqueue_entity()/__dequeue_entity()
sched: speed up update_load_add/_sub()
sched: clean up calc_weighted()
sched: introduce se->vruntime
sched: move sched_feat() definitions
sched: optimize vruntime based scheduling
sched: simplify check_preempt() methods
sched: wakeup granularity fix
sched: add se->vruntime debugging
sched: debug: update exec_clock only when SCHED_DEBUG
sched: remove wait_runtime limit
sched: remove wait_runtime fields and features
sched: x86: allow single-depth wchan output
arch/i386/Kconfig | 11
include/linux/sched.h | 17 -
kernel/sched.c | 196 ++++-------------
kernel/sched_debug.c | 86 +++----
kernel/sched_fair.c | 557 +++++++++++++-------------------------------------
kernel/sysctl.c | 22 -
6 files changed, 243 insertions(+), 646 deletions(-)
Hi,
Out of curiosity: will I ever get answers to my questions?
bye, Roman
Hi Ingo,
When compiling, I get:
In file included from kernel/sched.c:794:
kernel/sched_fair.c: In function 'task_new_fair':
kernel/sched_fair.c:857: error: 'sysctl_sched_child_runs_first' undeclared (first use in this function)
kernel/sched_fair.c:857: error: (Each undeclared identifier is reported only once
kernel/sched_fair.c:857: error: for each function it appears in.)
Presumably because sched_fair.c is being included into sched.c before
sysctl_sched_child_runs_first is defined.
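A minimal illustration of the failure mode Rob describes (hypothetical file layout, not the actual source): when a .c file is textually #included, every symbol it uses must already be declared at the point of inclusion, so hoisting the declaration above the #include is one way to fix the build:

```c
#include <assert.h>

/* In sched.c, the definition (or at least a declaration) must appear
 * before the line that does #include "sched_fair.c"... */
unsigned int sysctl_sched_child_runs_first = 1;

/* ...otherwise this function, textually included below that point,
 * fails with "undeclared (first use in this function)". */
static int task_new_fair(void)
{
	return sysctl_sched_child_runs_first != 0;
}
```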
Regards,
Rob
On Tue, 2007-09-11 at 22:04 +0200, Ingo Molnar wrote:
> fresh back from the Kernel Summit, Peter Zijlstra and I are pleased to
> announce the latest iteration of the CFS scheduler development tree. Our
> main focus has been on simplifications and performance - and as part of
> that we've also picked up some ideas from Roman Zippel's 'Really Fair
> Scheduler' patch and integrated them into CFS. We'd like to ask
> people to give these patches a good workout, especially with an eye on
> any interactivity regressions.
Initial test-drive looks good here, but I do see a regression. First
the good news.
fairtest2 is perfect, more perfect than ever seen before in fact. Mixed
interval sleepers/hog looks fine as well (can't say perfect due to
startup differences with the various proggies, but cpu% looks perfect).
Amarok song switch time under hefty kbuild load is fine as well. I
haven't done heavy multimedia testing yet, but will give it a more
thorough workout later (errands).
The regression: I see some GUI lurch, easily reproducible by running a
make -j5 and moving the mouse in a circle... perceptible (100ms or so)
lurches not present in rc5.
-Mike
Hi,
On Tue, 11 Sep 2007, Ingo Molnar wrote:
> fresh back from the Kernel Summit, Peter Zijlstra and I are pleased to
> announce the latest iteration of the CFS scheduler development tree. Our
> main focus has been on simplifications and performance - and as part of
> that we've also picked up some ideas from Roman Zippel's 'Really Fair
> Scheduler' patch and integrated them into CFS. We'd like to ask
> people to give these patches a good workout, especially with an eye on
> any interactivity regressions.
I must really say, I'm quite impressed by your efforts to give me as
little credit as possible.
On the one hand it's of course positive to see so much sudden activity;
on the other hand I'm not sure how much would have happened if I hadn't
posted my patch - I don't really think it was my complaints about CFS's
complexity that finally led to the improvements in this area. I
presented the basic concepts of my patch already with my first CFS
review, but at that time you didn't show any interest and instead were
rather quick to simply dismiss it. My patch did not add that much new;
it's mostly a conceptual improvement and describes the math in more
detail, but it also demonstrated a number of improvements.
> The combo patch against 2.6.23-rc6 can be picked up from:
>
> http://people.redhat.com/mingo/cfs-scheduler/devel/
>
> The sched-devel.git tree can be pulled from:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git
Am I the only one who can't clone that thing? So I can't go into much
detail about the individual changes here.
What makes me curious is that it also includes patches by others. It
can't be entirely explained by the Kernel Summit, as this is not the
first time patches have appeared out of the blue in the form of a git
tree. The funny/sad thing is that at some point Linus complained about
Con that his development activity happened on a separate mailing list,
but there was at least a place to go to. CFS's development appears to
mostly happen in private. Patches may be your primary form of
communication, but that isn't true for many other people; with patches,
a lot of the intent and motivation for a change is lost. I know it's
rather tempting to immediately try out an idea first, but would it
really hurt you so much to formulate an idea in a more conventional
manner? Are you afraid it might hurt your ueberhacker status to
occasionally screw up in public? Patches, on the other hand, have the
advantage of more easily covering that up by simply posting a fix -
which makes it more difficult to understand what's going on.
A more conventional way of communicating would give more people a
chance to participate. They may not understand every detail of the
patch, but they can try to understand the general concepts, apply them
to their own situation, and eventually come up with some
ideas/improvements of their own; they would be less dependent on you to
come up with a solution to their problem. Unless of course that's
exactly what you want - unless you want to be in full control of the
situation and to be the hero that saves the day.
> There are lots of small performance improvements in the form of a
> fine-grained 29-patch series. We have removed a number of features and
> metrics from CFS that may once have been needed but ended up being
> superfluous - while keeping the things that worked out fine, like
> sleeper fairness. On 32-bit x86 there's a ~16% speedup (over -rc6) in
> lmbench (lat_ctx -s 0 2) results:
In the patch you really remove _a_lot_ of stuff. You also removed a lot
of the things I tried to get you to explain to me. On the one hand I
could be happy that these things are gone, as they were the major road
block to splitting up my own patch. On the other hand it still leaves
me somewhat unsatisfied, as I still don't know what that stuff was good
for.
In a more collaborative development model I would have expected you to
try to explain these features, which could have resulted in a
discussion of how else things could be implemented or whether they're
still needed at all. Instead you now simply decide unilaterally that
these things are not needed anymore.
BTW the old sleeper fairness logic "that worked out fine" is actually
completely gone, and what replaced it is conceptually closer to what
I'm already doing in my patch (only the amount of sleeper bonus
differs).
> (microseconds, lower is better)
> ------------------------------------------------------------
> v2.6.22 2.6.23-rc6(CFS) v2.6.23-rc6-CFS-devel
> ----------------------------------------------------
> 0.70 0.75 0.65
> 0.62 0.66 0.63
> 0.60 0.72 0.69
> 0.62 0.74 0.61
> 0.69 0.73 0.53
> 0.66 0.73 0.63
> 0.63 0.69 0.61
> 0.63 0.70 0.64
> 0.61 0.76 0.61
> 0.69 0.74 0.63
> ----------------------------------------------------
> avg: 0.64 0.72 (+12%) 0.62 (-3%)
>
> there is a similar speedup on 64-bit x86 as well. We are now a bit
> faster than the O(1) scheduler was under v2.6.22 - even on 32-bit. The
> main speedup comes from the avoidance of divisions (or shifts) in the
> wakeup and context-switch fastpaths.
>
> there's also a visible reduction in code size:
>
> text data bss dec hex filename
> 13369 228 2036 15633 3d11 sched.o.before (UP, nodebug)
> 11167 224 1988 13379 3443 sched.o.after (UP, nodebug)
Well, one could say that you used every little trick in the book to get
these numbers down. On the other hand at this point it's a little unclear
whether you maybe removed it a little too much to get there, so the
significance of these numbers is a bit limited.
> Changes: besides the many micro-optimizations, one of the changes is
> that se->vruntime (virtual runtime) based scheduling has been introduced
> gradually, step by step, while keeping the wait_runtime metric working
> too (so that the two methods are comparable side by side, in the same
> scheduler).
I can't quite see that: the wait_runtime metric is relative to
fair_clock, and that is gone without any replacement. In my patch I at
least calculate these values for the debug output, but in your patch
even that is simply gone, so I'm not sure what you actually compare
"side by side".
> The ->vruntime metric is similar to the ->time_norm metric used by
> Roman's patch (and both are loosely related to the already existing
> sum_exec_runtime metric in CFS): it is in essence the sum of CPU time
> executed by a task, in nanoseconds, weighted up or down by the task's
> nice level (or kept the same at the default nice 0 level). Beyond this
> basic metric, our implementation and math differ from RFS.
At this point it gets really interesting - I'm amazed how much you
stress the differences. If we take the basic math as I more simply
explained it in this example http://lkml.org/lkml/2007/9/3/168, you now
also make the step from the relative wait_runtime value to an absolute
virtual time value. Basically it's really the same thing; only the
resolution differs. This means you already reimplemented a key element
of my patch, so would you please give me at least that much credit?
The rest of the math is indeed different - it's simply missing. What is
there is IMO not really adequate. I guess you will see the differences
once you test a bit more with different nice levels. There's a good
reason I put that much effort into maintaining a good, but still cheap,
average: it's needed for good task placement. There is of course more
than one way to implement this, so you'll have a good chance of simply
reimplementing it somewhat differently, but I'd be surprised if it were
something completely different.
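The equivalence Roman points to here can be sketched like this (a simplified model of our own, not either patch's actual code): tracking a relative lag against a global fair clock and tracking an absolute per-task virtual clock carry the same information, since wait_runtime == fair_clock - vruntime at all times:

```c
#include <assert.h>
#include <stdint.h>

struct task {
	int64_t vruntime;     /* absolute virtual time consumed */
	int64_t wait_runtime; /* relative lag versus the fair clock */
};

/* advance both bookkeeping schemes for one accounting period:
 * the fair clock moved by clock_delta, the task ran for exec_delta */
static void account(struct task *t, int64_t clock_delta, int64_t exec_delta)
{
	t->vruntime += exec_delta;
	t->wait_runtime += clock_delta - exec_delta;
}
```

Because the two quantities differ only by the (global) fair clock, sorting tasks by smallest vruntime and sorting by largest wait_runtime pick the same task; only the representation and resolution differ.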
To make it very clear to everyone else: this is primarily not about
getting some credit, although that's not completely unimportant of
course. This is about getting adequate help. I had to push very hard to
get any kind of acknowledgment of scheduler issues, only to be rewarded
with silence, so that I was occasionally wondering myself whether I'm
just hallucinating all this. Only after I provide the proof that
further improvements are possible is there some activity. But instead
of providing help (e.g. by answering my questions), Ingo just goes
ahead and reimplements the damn thing himself and simply throws out all
the questionable items instead of explaining them.
The point is that I have no real interest in any stinking competition;
I have no interest in being reduced to simply complaining all the time.
I'm more interested in cooperation, but that requires better
communication and an actual exchange of ideas. Patches are no real
communication; they are a supplement, and should rather be the end
result than the means to get anything started. A collaboration could
bundle our individual strengths, so that this doesn't degenerate into a
contest of who's the greatest hacker.
Is this really too much to expect?
bye, Roman
On 9/12/07, Roman Zippel <[email protected]> wrote:
> Hi,
>
> On Tue, 11 Sep 2007, Ingo Molnar wrote:
>
> > fresh back from the Kernel Summit, Peter Zijlstra and I are pleased to
> > announce the latest iteration of the CFS scheduler development tree. Our
> > main focus has been on simplifications and performance - and as part of
> > that we've also picked up some ideas from Roman Zippel's 'Really Fair
> > Scheduler' patch and integrated them into CFS. We'd like to ask
> > people to give these patches a good workout, especially with an eye on
> > any interactivity regressions.
>
> I must really say, I'm quite impressed by your efforts to give me as
> little credit as possible.
> On the one hand it's of course positive to see so much sudden activity;
> on the other hand I'm not sure how much would have happened if I hadn't
> posted my patch - I don't really think it was my complaints about CFS's
> complexity that finally led to the improvements in this area. I
> presented the basic concepts of my patch already with my first CFS
> review, but at that time you didn't show any interest and instead were
> rather quick to simply dismiss it. My patch did not add that much new;
> it's mostly a conceptual improvement and describes the math in more
> detail, but it also demonstrated a number of improvements.
>
> > The combo patch against 2.6.23-rc6 can be picked up from:
> >
> > http://people.redhat.com/mingo/cfs-scheduler/devel/
> >
> > The sched-devel.git tree can be pulled from:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git
>
> Am I the only one who can't clone that thing? So I can't go into much
> detail about the individual changes here.
> What makes me curious is that it also includes patches by others. It
> can't be entirely explained by the Kernel Summit, as this is not the
> first time patches have appeared out of the blue in the form of a git
> tree. The funny/sad thing is that at some point Linus complained about
> Con that his development activity happened on a separate mailing list,
> but there was at least a place to go to. CFS's development appears to
> mostly happen in private. Patches may be your primary form of
> communication, but that isn't true for many other people; with patches,
> a lot of the intent and motivation for a change is lost. I know it's
> rather tempting to immediately try out an idea first, but would it
> really hurt you so much to formulate an idea in a more conventional
> manner? Are you afraid it might hurt your ueberhacker status to
> occasionally screw up in public? Patches, on the other hand, have the
> advantage of more easily covering that up by simply posting a fix -
> which makes it more difficult to understand what's going on.
> A more conventional way of communicating would give more people a
> chance to participate. They may not understand every detail of the
> patch, but they can try to understand the general concepts, apply them
> to their own situation, and eventually come up with some
> ideas/improvements of their own; they would be less dependent on you to
> come up with a solution to their problem. Unless of course that's
> exactly what you want - unless you want to be in full control of the
> situation and to be the hero that saves the day.
>
> > There are lots of small performance improvements in the form of a
> > fine-grained 29-patch series. We have removed a number of features and
> > metrics from CFS that may once have been needed but ended up being
> > superfluous - while keeping the things that worked out fine, like
> > sleeper fairness. On 32-bit x86 there's a ~16% speedup (over -rc6) in
> > lmbench (lat_ctx -s 0 2) results:
>
> In the patch you really remove _a_lot_ of stuff. You also removed a lot
> of the things I tried to get you to explain to me. On the one hand I
> could be happy that these things are gone, as they were the major road
> block to splitting up my own patch. On the other hand it still leaves
> me somewhat unsatisfied, as I still don't know what that stuff was good
> for.
> In a more collaborative development model I would have expected you to
> try to explain these features, which could have resulted in a
> discussion of how else things could be implemented or whether they're
> still needed at all. Instead you now simply decide unilaterally that
> these things are not needed anymore.
>
> BTW the old sleeper fairness logic "that worked out fine" is actually
> completely gone, and what replaced it is conceptually closer to what
> I'm already doing in my patch (only the amount of sleeper bonus
> differs).
>
> > (microseconds, lower is better)
> > ------------------------------------------------------------
> > v2.6.22 2.6.23-rc6(CFS) v2.6.23-rc6-CFS-devel
> > ----------------------------------------------------
> > 0.70 0.75 0.65
> > 0.62 0.66 0.63
> > 0.60 0.72 0.69
> > 0.62 0.74 0.61
> > 0.69 0.73 0.53
> > 0.66 0.73 0.63
> > 0.63 0.69 0.61
> > 0.63 0.70 0.64
> > 0.61 0.76 0.61
> > 0.69 0.74 0.63
> > ----------------------------------------------------
> > avg: 0.64 0.72 (+12%) 0.62 (-3%)
> >
> > there is a similar speedup on 64-bit x86 as well. We are now a bit
> > faster than the O(1) scheduler was under v2.6.22 - even on 32-bit. The
> > main speedup comes from the avoidance of divisions (or shifts) in the
> > wakeup and context-switch fastpaths.
> >
> > there's also a visible reduction in code size:
> >
> > text data bss dec hex filename
> > 13369 228 2036 15633 3d11 sched.o.before (UP, nodebug)
> > 11167 224 1988 13379 3443 sched.o.after (UP, nodebug)
>
> Well, one could say that you used every little trick in the book to get
> these numbers down. On the other hand at this point it's a little unclear
> whether you maybe removed it a little too much to get there, so the
> significance of these numbers is a bit limited.
Isn't it good to use all those tricks if it helps? We'll know soon
whether it helps, from the testing it'll get. I'm just concerned about
doing these cleanups so late in the rc cycle.
And Ingo, please do explain the reasons for all these cleanups and why
those things were put there in the first place.
> > Changes: besides the many micro-optimizations, one of the changes is
> > that se->vruntime (virtual runtime) based scheduling has been introduced
> > gradually, step by step, while keeping the wait_runtime metric working
> > too (so that the two methods are comparable side by side, in the same
> > scheduler).
>
> I can't quite see that: the wait_runtime metric is relative to
> fair_clock, and that is gone without any replacement. In my patch I at
> least calculate these values for the debug output, but in your patch
> even that is simply gone, so I'm not sure what you actually compare
> "side by side".
>
> > The ->vruntime metric is similar to the ->time_norm metric used by
> > Roman's patch (and both are loosely related to the already existing
> > sum_exec_runtime metric in CFS): it is in essence the sum of CPU time
> > executed by a task, in nanoseconds, weighted up or down by the task's
> > nice level (or kept the same at the default nice 0 level). Beyond this
> > basic metric, our implementation and math differ from RFS.
>
> At this point it gets really interesting - I'm amazed how much you
> stress the differences. If we take the basic math as I more simply
> explained it in this example http://lkml.org/lkml/2007/9/3/168, you now
> also make the step from the relative wait_runtime value to an absolute
> virtual time value. Basically it's really the same thing; only the
> resolution differs. This means you already reimplemented a key element
> of my patch, so would you please give me at least that much credit?
> The rest of the math is indeed different - it's simply missing. What is
> there is IMO not really adequate. I guess you will see the differences
> once you test a bit more with different nice levels. There's a good
> reason I put that much effort into maintaining a good, but still cheap,
> average: it's needed for good task placement. There is of course more
> than one way to implement this, so you'll have a good chance of simply
> reimplementing it somewhat differently, but I'd be surprised if it were
> something completely different.
>
> To make it very clear to everyone else: this is primarely not about
> getting some credit, although that's not completely unimportant of course.
I give you credit for coming up with math that is so easily
understandable compared to CFS. Don't lose patience, like Con did.
Please keep fighting if you think your code is better. It'll help all of
us out here.
regards,
debian dev
* Roman Zippel <[email protected]> wrote:
> Hi,
>
> Out of curiousity: will I ever get answers to my questions?
the last few weeks/months have been pretty hectic - i get more than 50
non-list emails a day so i could easily have missed some. (and to take a
line from Linus: my attention span is roughly that of a slightly
retarded golden retriever ;)
so it would be helpful if you could please re-state any questions you
still have, in context of our latest CFS-devel queue. I tried to answer
the error/rounding worries you had - which seemed to be the main theme
of your patch. There are lots of good kernel hackers on lkml who know
the new scheduler code pretty well and who might be able to provide an
answer even if i don't manage to. (Perhaps asking the questions
without heavy math will also help more people be able to understand and
answer your questions and their practical relevance.) In any case - if
you see packet loss on my side then please resend :) That would be
hugely helpful. Thanks,
Ingo
On 9/11/07, Rob Hussey <[email protected]> wrote:
> Hi Ingo,
>
> When compiling, I get:
Yeah, this was my fault :(
I've had a chance to test this now, and everything feels great. I did
some benchmarks for 2.6.23-rc1, 2.6.23-rc6-cfs, and
2.6.23-rc6-cfs-devel:
lat_ctx -s 0 2:
2.6.23-rc1 2.6.23-rc6-cfs 2.6.23-rc6-cfs-devel
5.15 4.91 5.05
5.23 5.18 4.85
5.19 4.89 5.17
5.36 5.23 4.86
5.35 5.00 5.13
5.34 5.05 5.12
5.26 4.99 5.06
5.11 5.04 4.96
5.29 5.19 5.18
5.40 4.93 5.07
hackbench 50:
2.6.23-rc1 2.6.23-rc6-cfs 2.6.23-rc6-cfs-devel
6.301 5.963 5.837
6.417 5.961 5.814
6.468 5.965 5.757
6.525 5.926 5.840
6.320 5.929 5.751
6.457 5.909 5.825
pipe-test (http://redhat.com/~mingo/cfs-scheduler/tools/pipe-test.c):
2.6.23-rc1 2.6.23-rc6-cfs 2.6.23-rc6-cfs-devel
14.29 14.03 13.89
14.31 14.01 14.10
14.27 13.99 14.15
14.31 14.02 14.16
14.53 14.02 14.14
14.53 14.27 14.16
14.51 14.36 14.12
14.48 14.33 14.16
14.52 14.36 14.17
14.47 14.36 14.15
I turned the results into graphs as well. I'll attach them, but they're also at:
http://www.healthcarelinen.com/misc/lat_ctx_benchmark.png
http://www.healthcarelinen.com/misc/hackbench_benchmark.png
http://www.healthcarelinen.com/misc/pipe-test_benchmark.png
The hackbench and pipe-test numbers are very encouraging. The avg
between the 2.6.23-rc6-cfs and 2.6.23-rc6-cfs-devel lat_ctx numbers
are nearly identical (5.041 and 5.045 respectively).
* Rob Hussey <[email protected]> wrote:
> On 9/11/07, Rob Hussey <[email protected]> wrote:
> > Hi Ingo,
> >
> > When compiling, I get:
>
> Yeah, this was my fault :(
>
> I've had a chance to test this now, and everything feels great. I did
> some benchmarks for 2.6.23-rc1, 2.6.23-rc6-cfs, and
> 2.6.23-rc6-cfs-devel:
thanks for the numbers! Could you please also post the .config you used?
Thx,
Ingo
* Roman Zippel <[email protected]> wrote:
> > The sched-devel.git tree can be pulled from:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git
>
> Am I the only one who can't clone that thing? [...]
Ah - i have messed up my sched-devel.git script so the git-push went to
kernel.org but into my home directory :-/ Should work now - let me know
if it doesn't.
i've also uploaded the patch series in quilt format, to:
http://people.redhat.com/mingo/cfs-scheduler/devel/patches.tar.gz
> [...] It can't be entirely explained with the Kernel Summit, as this
> is not the first time patches appear out of the blue in form of a git
> tree.
i'm not sure what you mean, but i can definitely tell you that there was
no scheduler hacking at the Kernel Summit. (there's no good wireless in
the pubs and not enough space for a laptop anyway ;)
The impressive linecount has been mostly achieved by dumb removal:
sched: remove wait_runtime fields and features
4 files changed, 14 insertions(+), 161 deletions(-)
sched: remove wait_runtime limit
5 files changed, 3 insertions(+), 124 deletions(-)
sched: remove precise CPU load calculations #2
1 file changed, 1 insertion(+), 31 deletions(-)
sched: remove precise CPU load
3 files changed, 9 insertions(+), 41 deletions(-)
sched: remove stat_gran
4 files changed, 15 insertions(+), 50 deletions(-)
Hack time to do them: ~10 minutes apiece. Removing stuff is _easy_ :-)
The rest is finegrained, small changes. One of the harder patches was
this one:
commit 28c4b8ed35f0fc7050f186147da9e10b55e1e446
sched: introduce se->vruntime
3 files changed, 50 insertions(+), 33 deletions(-)
And i sent you the first variant of that already:
http://lkml.org/lkml/2007/9/2/76
we needed 2 days after the KS to put it into shape and send it out for
feedback.
Ingo
On 9/13/07, Ingo Molnar <[email protected]> wrote:
>
> thanks for the numbers! Could you please also post the .config you used?
Sure, .config for 2.6.23-rc1 and 2.6.23-rc6 attached.
* Rob Hussey <[email protected]> wrote:
> On 9/13/07, Ingo Molnar <[email protected]> wrote:
> >
> > thanks for the numbers! Could you please also post the .config you used?
>
> Sure, .config for 2.6.23-rc1 and 2.6.23-rc6 attached.
thx! If you've got some time, could you perhaps re-measure with these
disabled:
CONFIG_SCHED_DEBUG=y
CONFIG_SCHEDSTATS=y
these options mask some of the performance enhancements we made. There's
also a new code drop at:
http://people.redhat.com/mingo/cfs-scheduler/devel/
with some fixes for SMP. (and you've got an SMP box it appears)
also, if you want to maximize performance, it usually makes more sense
to build with these flipped around:
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_FORCED_INLINING=y
i.e.:
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
# CONFIG_FORCED_INLINING is not set
because especially on modern x86 CPUs, smaller x86 code is faster. (and
it also takes up less I-cache space)
Ingo
On 9/13/07, Ingo Molnar <[email protected]> wrote:
>
> * Rob Hussey <[email protected]> wrote:
>
> > On 9/13/07, Ingo Molnar <[email protected]> wrote:
> > >
> > > thanks for the numbers! Could you please also post the .config you used?
> >
> > Sure, .config for 2.6.23-rc1 and 2.6.23-rc6 attached.
>
> thx! If you've got some time, could you perhaps re-measure with these
> disabled:
>
> CONFIG_SCHED_DEBUG=y
Well, I was going over my config myself after you asked for me to post
it, and I thought to do the same thing. Except, disabling sched_debug
caused the same error as before:
In file included from kernel/sched.c:794:
kernel/sched_fair.c: In function 'task_new_fair':
kernel/sched_fair.c:857: error: 'sysctl_sched_child_runs_first'
undeclared (first use in this function)
kernel/sched_fair.c:857: error: (Each undeclared identifier is
reported only once
kernel/sched_fair.c:857: error: for each function it appears in.)
make[1]: *** [kernel/sched.o] Error 1
make: *** [kernel] Error 2
It only happens with sched_debug=y. I take it back, it wasn't my fault :)
As for everything else, I'd be happy to.
On 9/13/07, Rob Hussey <[email protected]> wrote:
> On 9/13/07, Ingo Molnar <[email protected]> wrote:
> >
> > * Rob Hussey <[email protected]> wrote:
> >
> > > On 9/13/07, Ingo Molnar <[email protected]> wrote:
> > > >
> > > > thanks for the numbers! Could you please also post the .config you used?
> > >
> > > Sure, .config for 2.6.23-rc1 and 2.6.23-rc6 attached.
> >
> > thx! If you've got some time, could you perhaps re-measure with these
> > disabled:
> >
> > CONFIG_SCHED_DEBUG=y
>
> Well, I was going over my config myself after you asked for me to post
> it, and I thought to do the same thing. Except, disabling sched_debug
> caused the same error as before:
> In file included from kernel/sched.c:794:
> kernel/sched_fair.c: In function 'task_new_fair':
> kernel/sched_fair.c:857: error: 'sysctl_sched_child_runs_first'
> undeclared (first use in this function)
> kernel/sched_fair.c:857: error: (Each undeclared identifier is
> reported only once
> kernel/sched_fair.c:857: error: for each function it appears in.)
> make[1]: *** [kernel/sched.o] Error 1
> make: *** [kernel] Error 2
>
> It only happens with sched_debug=y. I take it back, it wasn't my fault :)
>
I'm trying the patches now to see if they help.
On 9/13/07, Rob Hussey <[email protected]> wrote:
> On 9/13/07, Rob Hussey <[email protected]> wrote:
> > On 9/13/07, Ingo Molnar <[email protected]> wrote:
> > >
> > > * Rob Hussey <[email protected]> wrote:
> > >
> > > > On 9/13/07, Ingo Molnar <[email protected]> wrote:
> > > > >
> > > > > thanks for the numbers! Could you please also post the .config you used?
> > > >
> > > > Sure, .config for 2.6.23-rc1 and 2.6.23-rc6 attached.
> > >
> > > thx! If you've got some time, could you perhaps re-measure with these
> > > disabled:
> > >
> > > CONFIG_SCHED_DEBUG=y
> >
> > Well, I was going over my config myself after you asked for me to post
> > it, and I thought to do the same thing. Except, disabling sched_debug
> > caused the same error as before:
> > In file included from kernel/sched.c:794:
> > kernel/sched_fair.c: In function 'task_new_fair':
> > kernel/sched_fair.c:857: error: 'sysctl_sched_child_runs_first'
> > undeclared (first use in this function)
> > kernel/sched_fair.c:857: error: (Each undeclared identifier is
> > reported only once
> > kernel/sched_fair.c:857: error: for each function it appears in.)
> > make[1]: *** [kernel/sched.o] Error 1
> > make: *** [kernel] Error 2
> >
> > It only happens with sched_debug=y. I take it back, it wasn't my fault :)
> >
> I'm trying the patches now to see if they help.
>
Current cfs-devel git compiles fine without sched_debug. Not sure how
I broke things, but I need some sleep. I know the 2.6.23-rc1 numbers
were good, but not sure about the others. I'll make the changes you
suggested, and get some new and hopefully good numbers for
2.6.23-rc6-cfs and 2.6.23-rc6-cfs-devel.
On Thu, 2007-09-13 at 00:17 +0200, Roman Zippel wrote:
> The rest of the math is indeed different - it's simply missing. What is
> there is IMO not really adequate. I guess you will see the differences,
> once you test a bit more with different nice levels.
The rounding error we still have accumulates over the long term, but
has no real effect. The only consequence is that a nice level would be
a little different than it would have been had the division been
perfect, not dissimilar to having a small error in the divisor series
to begin with. (Note that in order to see this little fuzz you need
amazingly high context switch rates.)
We've measured the effect with the strongest nice levels, -20 and 19, a
normal loop against two yield loops (this generated 700,000 context
switches per second), and the effect is <1%. Not something worth fixing
IMHO (unless it comes for free).
At such high switching rates the overhead of scheduling itself and of
caching causes more skew than this - the small error is totally swamped
by the time lost scheduling.
> There's a good reason
> I put that much effort into maintaining a good, but still cheap average,
> it's needed for a good task placement.
While I agree that having this average is nice, your particular
implementation has the problem that it quickly overflows u64 at which
point it becomes a huge problem (a CPU hog could basically lock up your
box when that happens).
I solved the wrap around problem in cfs-devel, and from that base I
_could_ probably maintain the average without overflow problems, but
have yet to try.
> There is of course more than one
> way to implement this, so you'll have good chances to simply reimplement
> it somewhat differently, but I'd be surprised if it would be something
> completely different.
Currently we have 2 approximations in place:
(leftmost + rightmost) / 2
and
leftmost + period/2 (where period should match the span of the tree)
neither are perfect but they seem to work quite well.
* Rob Hussey <[email protected]> wrote:
> Well, I was going over my config myself after you asked for me to post
> it, and I thought to do the same thing. Except, disabling sched_debug
> caused the same error as before:
> In file included from kernel/sched.c:794:
> kernel/sched_fair.c: In function 'task_new_fair':
> kernel/sched_fair.c:857: error: 'sysctl_sched_child_runs_first'
> undeclared (first use in this function)
> kernel/sched_fair.c:857: error: (Each undeclared identifier is
> reported only once
> kernel/sched_fair.c:857: error: for each function it appears in.)
> make[1]: *** [kernel/sched.o] Error 1
> make: *** [kernel] Error 2
>
> It only happens with sched_debug=y. I take it back, it wasn't my fault :)
>
> As for everything else, I'd be happy to.
are you sure this is happening with the latest iteration of the patch
too? (with the combo-3.patch?) You can pick it up from here:
http://people.redhat.com/mingo/cfs-scheduler/devel/sched-cfs-v2.6.23-rc6-v21-combo-3.patch
I tried your config and it builds fine here.
Ingo
Hi,
On Thu, 13 Sep 2007, Peter Zijlstra wrote:
> > There's a good reason
> > I put that much effort into maintaining a good, but still cheap average,
> > it's needed for a good task placement.
>
> While I agree that having this average is nice, your particular
> implementation has the problem that it quickly overflows u64 at which
> point it becomes a huge problem (a CPU hog could basically lock up your
> box when that happens).
If you look at the math, you'll see that I took the overflow into account,
I even expected it. If you see this effect in my implementation, it would
be a bug.
> > There is of course more than one
> > way to implement this, so you'll have good chances to simply reimplement
> > it somewhat differently, but I'd be surprised if it would be something
> > completely different.
>
> Currently we have 2 approximations in place:
>
> (leftmost + rightmost) / 2
>
> and
>
> leftmost + period/2 (where period should match the span of the tree)
>
> neither are perfect but they seem to work quite well.
You need more than two busy loops.
There's a reason I implemented a simple simulator first, so I could
actually study the scheduling behaviour of different load situations. That
doesn't protect from all surprises of course, but it gives me the
necessary confidence the scheduler will work reasonably even in weird
situations.
From these tests I already know that your approximations only work with
rather simple loads.
bye, Roman
Hi,
On Thu, 13 Sep 2007, Ingo Molnar wrote:
> > Out of curiousity: will I ever get answers to my questions?
>
> the last few weeks/months have been pretty hectic - i get more than 50
> non-list emails a day so i could easily have missed some.
Well, let's just take the recent "Really Simple Really Fair Scheduler"
thread. You had the time to ask me questions about my scheduler, I even
explained to you how the sleeping bonus works in my model. At the end I
was sort of hoping you would start answering my questions and explaining
things how the same things work in CFS - but nothing.
Then you had the time to reimplement the very things you've just asked me
about and what do I get credit for - "two cleanups from RFS".
And now I get this lame ass excuse for not answering my questions? :-(
bye, Roman
On Thu, 2007-09-13 at 14:14 +0200, Roman Zippel wrote:
> Hi,
>
> On Thu, 13 Sep 2007, Peter Zijlstra wrote:
>
> > > There's a good reason
> > > I put that much effort into maintaining a good, but still cheap average,
> > > it's needed for a good task placement.
> >
> > While I agree that having this average is nice, your particular
> > implementation has the problem that it quickly overflows u64 at which
> > point it becomes a huge problem (a CPU hog could basically lock up your
> > box when that happens).
>
> If you look at the math, you'll see that I took the overflow into account,
> I even expected it. If you see this effect in my implementation, it would
> be a bug.
Ah, ok, I shall look at your patches in more detail; it was not obvious
from the formulae you posted.
> > > There is of course more than one
> > > way to implement this, so you'll have good chances to simply reimplement
> > > it somewhat differently, but I'd be surprised if it would be something
> > > completely different.
> >
> > Currently we have 2 approximations in place:
> >
> > (leftmost + rightmost) / 2
> >
> > and
> >
> > leftmost + period/2 (where period should match the span of the tree)
> >
> > neither are perfect but they seem to work quite well.
>
> You need more than two busy loops.
I'm missing context here, are you referring to the nice level error or
the avg approximation?
> There's a reason I implemented a simple simulator first, so I could
> actually study the scheduling behaviour of different load situations. That
> doesn't protect from all surprises of course, but it gives me the
> necessary confidence the scheduler will work reasonably even in weird
> situations.
Right, I've built user-space simulators too, handy little things to play
with :-)
> From these tests I already know that your approximations only work with
> rather simple loads.
I've not yet seen it go spectacularly wrong, although admittedly a
highly concurrent kbuild is the most complex task I let loose on it.
Could you perhaps be more specific on the circumstances it breaks down
and what the negative impact is?
* Roman Zippel <[email protected]> wrote:
> The rest of the math is indeed different - it's simply missing. What
> is there is IMO not really adequate. I guess you will see the
> differences, once you test a bit more with different nice levels.
Roman, i disagree strongly. I did test with different nice levels. Here
are some hard numbers: the CPU usage table of 40 busy loops started at
once, all running at a different nice level, from nice -20 to nice +19:
top - 12:25:07 up 19 min, 2 users, load average: 40.00, 39.15, 28.35
Tasks: 172 total, 41 running, 131 sleeping, 0 stopped, 0 zombie
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2455 root 0 -20 1576 248 196 R 20 0.0 3:47.56 loop
2456 root 1 -19 1576 244 196 R 16 0.0 3:03.96 loop
2457 root 2 -18 1576 244 196 R 13 0.0 2:24.80 loop
2458 root 3 -17 1576 248 196 R 10 0.0 1:58.63 loop
2459 root 4 -16 1576 244 196 R 8 0.0 1:33.04 loop
2460 root 5 -15 1576 248 196 R 7 0.0 1:14.73 loop
2461 root 6 -14 1576 248 196 R 5 0.0 0:59.61 loop
2462 root 7 -13 1576 244 196 R 4 0.0 0:47.95 loop
2463 root 8 -12 1576 248 196 R 3 0.0 0:38.31 loop
2464 root 9 -11 1576 244 196 R 3 0.0 0:30.54 loop
2465 root 10 -10 1576 244 196 R 2 0.0 0:24.47 loop
2466 root 11 -9 1576 244 196 R 2 0.0 0:19.52 loop
2467 root 12 -8 1576 248 196 R 1 0.0 0:15.63 loop
2468 root 13 -7 1576 248 196 R 1 0.0 0:12.56 loop
2469 root 14 -6 1576 248 196 R 1 0.0 0:10.00 loop
2470 root 15 -5 1576 244 196 R 1 0.0 0:07.99 loop
2471 root 16 -4 1576 244 196 R 1 0.0 0:06.40 loop
2472 root 17 -3 1576 244 196 R 0 0.0 0:05.09 loop
2473 root 18 -2 1576 244 196 R 0 0.0 0:04.05 loop
2474 root 19 -1 1576 248 196 R 0 0.0 0:03.26 loop
2475 root 20 0 1576 244 196 R 0 0.0 0:02.61 loop
2476 root 21 1 1576 244 196 R 0 0.0 0:02.09 loop
2477 root 22 2 1576 244 196 R 0 0.0 0:01.67 loop
2478 root 23 3 1576 244 196 R 0 0.0 0:01.33 loop
2479 root 24 4 1576 248 196 R 0 0.0 0:01.07 loop
2480 root 25 5 1576 244 196 R 0 0.0 0:00.84 loop
2481 root 26 6 1576 248 196 R 0 0.0 0:00.68 loop
2482 root 27 7 1576 248 196 R 0 0.0 0:00.54 loop
2483 root 28 8 1576 248 196 R 0 0.0 0:00.43 loop
2484 root 29 9 1576 248 196 R 0 0.0 0:00.34 loop
2485 root 30 10 1576 244 196 R 0 0.0 0:00.27 loop
2486 root 31 11 1576 248 196 R 0 0.0 0:00.21 loop
2487 root 32 12 1576 244 196 R 0 0.0 0:00.17 loop
2488 root 33 13 1576 244 196 R 0 0.0 0:00.13 loop
2489 root 34 14 1576 244 196 R 0 0.0 0:00.10 loop
2490 root 35 15 1576 244 196 R 0 0.0 0:00.08 loop
2491 root 36 16 1576 248 196 R 0 0.0 0:00.06 loop
2493 root 38 18 1576 248 196 R 0 0.0 0:00.03 loop
2492 root 37 17 1576 244 196 R 0 0.0 0:00.04 loop
2494 root 39 19 1576 244 196 R 0 0.0 0:00.02 loop
check a few select rows (the ratio of CPU time should be 1.25 at every
step) and see that CPU time is distributed very exactly. (and the same
is true for both -rc6 and -rc6-cfs-devel)
So even in this pretty extreme example (who on this planet runs 40 busy
loops with each loop on exactly one separate nice level, creating a load
average of 40.0 and expects perfect distribution after just a few
minutes?) CFS still distributes CPU time perfectly.
When you first raised accuracy issues i asked you to provide specific
real-world examples showing any of the "problems" with nice levels you
alluded to repeatedly:
http://lkml.org/lkml/2007/9/2/38
In the announcement of your "Really Fair Scheduler" patch you used the
following very strong statement:
" This model is far more accurate than CFS is [...]"
http://lkml.org/lkml/2007/8/30/307
but when i stressed you for actual real-world proof of CFS misbehavior,
you said:
"[...] they have indeed little effect in the short term, [...] "
http://lkml.org/lkml/2007/9/2/282
so how can CFS be "far less accurate" (paraphrased) while it has "little
effect in the short term"?
so to repeat my question: my (and Peter's) claim is that there is no
real-world significance to much of the complexity you added to avoid
rounding effects. You do disagree with that, so our follow-up question
is: what actual real-world significance does it have in your opinion?
What is the worst-case effect? Do we even care? We have measured it
every which way and it just does not matter. (but we could easily be
wrong, so please be specific if you know about something that we
overlooked.) Thanks,
Ingo
* Roman Zippel <[email protected]> wrote:
> Then you had the time to reimplement the very things you've just asked
> me about and what do I get credit for - "two cleanups from RFS".
i'm sorry to say this, but you must be reading some other email list and
a different git tree than what i am reading.
Firstly, about communications - in the past 3 months i've written you 40
emails regarding CFS - and that's more emails than my wife (or any
member of my family) got in that timeframe :-( I just ran a quick
script: i sent more CFS related emails to you than to any other person
on this planet. I bent over backwards trying to somehow get you to
cooperate with us (and i still haven't given up on that!) - instead of
you disparaging CFS and me frequently :-(
Secondly, i prominently credited you as early as in the second sentence
of our announcement:
| fresh back from the Kernel Summit, Peter Zijlstra and me are pleased
| to announce the latest iteration of the CFS scheduler development
| tree. Our main focus has been on simplifications and performance -
| and as part of that we've also picked up some ideas from Roman
| Zippel's 'Really Fair Scheduler' patch as well and integrated them
| into CFS. We'd like to ask people go give these patches a good
| workout, especially with an eye on any interactivity regressions.
http://lkml.org/lkml/2007/9/11/395
And you are duly credited in 3 patches:
------------------->
Subject: sched: introduce se->vruntime
introduce se->vruntime as a sum of weighted delta-exec's, and use
that as the key into the tree.
the idea to use absolute virtual time as the basic metric of
scheduling has been first raised by William Lee Irwin, advanced by
Tong Li and first prototyped by Roman Zippel in the "Really Fair
Scheduler" (RFS) patchset.
also see:
http://lkml.org/lkml/2007/9/2/76
for a simpler variant of this patch.
------------------->
Subject: sched: track cfs_rq->curr on !group-scheduling too
Noticed by Roman Zippel: use cfs_rq->curr in the !group-scheduling
case too. Small micro-optimization and cleanup effect:
------------------->
Subject: sched: uninline __enqueue_entity()/__dequeue_entity()
suggested by Roman Zippel: uninline __enqueue_entity() and
__dequeue_entity().
------------------->
We could not add you as the author, because you unfortunately did not
make your changes applicable to CFS. I've asked you _three_ separate
times to send a nicely split up series so that we can apply your code:
" it's far easier to review and merge stuff if it's nicely split up. "
http://lkml.org/lkml/2007/9/2/38
" I also think that the core math changes should be split from the
Breshenham optimizations. "
http://lkml.org/lkml/2007/9/2/43
" That's also why i've asked for a split-up patch series - it makes
it far easier to review and test the code and it makes it far
easier to quickly apply the obviously correct bits. "
http://www.mail-archive.com/[email protected]/msg204094.html
You never directly replied to these pretty explicit requests, all you
did was this side remark 5 days later in one of your patch
announcements:
" For a split version I'm still waiting for some more explanation
about the CFS tuning parameter. "
http://lkml.org/lkml/2007/9/7/87
You are an experienced kernel hacker. How can you credibly claim that
while you were capable of writing a new scheduler along with a series of
25 complex mathematical equations that few if any lkml readers are able
to understand (and which scheduler came in one intermixed patch that
added no new comments at all!), and while you are able to maintain the
m68k Linux architecture code, some supposed missing explanation from
_me_ makes you magically incapable of splitting up _your own fine
code_? This is really beyond me.
I even gave you the first baby step of the split-up by sending this:
http://lkml.org/lkml/2007/9/2/76
And your reaction to this was dismissive:
" It simplifies the math too much, the nice level weighting is an
essential part of the math and without it one can't really
understand the problem I'm trying to solve. "
http://lkml.org/lkml/2007/9/3/174
So we advanced this whole issue by trying the vruntime concept in CFS
and adding the 2 cleanups from RFS (we couldn't actually use any code
from you, due to the way you shaped your patch - but we'd certainly be
glad to!). You've seen the earliest iteration of that at:
http://lkml.org/lkml/2007/9/2/76
So far you've sent 3 updates of your patch without addressing any of the
structural feedback we gave. We virtually begged you to make your code
finegrained and applicable - but you did not do that.
And please understand, splitting up patches is paramount when
cooperating with others: we are not against adding code that makes sense
(to the contrary and we do that every day), but it has to be done
gradually, in order of utility and impact, so please do send finegrained
patches if you wish to contribute. (but plain verbal feedback is useful
too - whichever you prefer.)
I asked you to send a split-up queue repeatedly and finally we ended up
extracting _one_ concept from your patch (which concept was suggested by
others months ago already, in the CFS discussions) and two cleanups. You
are credited for that in the patches. Please send us your other changes
as a finegrained series and if they are applied you are (of course)
credited as the author. Does this sound good to you?
Ingo
Hi,
On Thu, 13 Sep 2007, Ingo Molnar wrote:
> > Then you had the time to reimplement the very things you've just asked
> > me about and what do I get credit for - "two cleanups from RFS".
>
> i'm sorry to say this, but you must be reading some other email list and
> a different git tree than what i am reading.
>
> Firstly, about communications - in the past 3 months i've written you 40
> emails regarding CFS - and that's more emails than my wife (or any
> member of my family) got in that timeframe :-( I just ran a quick
> script: i sent more CFS related emails to you than to any other person
> on this planet. I bent backwards trying to somehow get you to cooperate
> with us (and i still havent given up on that!) - instead of you
> disparaging CFS and me frequently :-(
>
> Secondly, i prominently credited you as early as in the second sentence
> of our announcement:
>
> | fresh back from the Kernel Summit, Peter Zijlstra and me are pleased
> | to announce the latest iteration of the CFS scheduler development
> | tree. Our main focus has been on simplifications and performance -
> | and as part of that we've also picked up some ideas from Roman
> | Zippel's 'Really Fair Scheduler' patch as well and integrated them
> | into CFS. We'd like to ask people go give these patches a good
> | workout, especially with an eye on any interactivity regressions.
>
> http://lkml.org/lkml/2007/9/11/395
>
> And you are duly credited in 3 patches:
This needs a little perspective: as I couldn't clone the repository (and
you know that), all I had was this announcement, so using the patch
descriptions as a defense now is unfair of you.
In this announcement you make relatively few references to how this
relates to my work. Maybe someone else can show me how to read that
announcement differently, but IMO the casual reader is likely to get the
impression that you only picked some minor cleanups from my patch; it's
rather unclear that you already reimplemented key aspects of it. Don't
blame me for your own ambiguity.
> ------------------->
>
> Subject: sched: introduce se->vruntime
>
> introduce se->vruntime as a sum of weighted delta-exec's, and use
> that as the key into the tree.
>
> the idea to use absolute virtual time as the basic metric of
> scheduling has been first raised by William Lee Irwin, advanced by
> Tong Li and first prototyped by Roman Zippel in the "Really Fair
> Scheduler" (RFS) patchset.
>
> also see:
>
> http://lkml.org/lkml/2007/9/2/76
>
> for a simpler variant of this patch.
Let's compare this to the relevant part of the announcement:
| The ->vruntime metric is similar to the ->time_norm metric used by
| Roman's patch (and both are losely related to the already existing
| sum_exec_runtime metric in CFS), it's in essence the sum of CPU time
| executed by a task, in nanoseconds - weighted up or down by their nice
| level (or kept the same on the default nice 0 level). Besides this basic
| metric our implementation and math differs from RFS.
In the patch you are more explicit about the virtual time aspect; in the
announcement you're less clear that it's all based on the same idea, and
somehow it's important to stress the point that "implementation and math
differs" - which is not untrue, but you forget to mention that the
differences are rather small.
> You never directly replied to these pretty explicit requests, all you
> did was this side remark 5 days later in one of your patch
> announcements:
This is ridiculous. I asked you multiple times, in response to the
split-up requests, to explain to me some of the differences relative to
CFS. Not once did you react; you didn't even ask what I'd like to know
specifically.
>
> " For a split version I'm still waiting for some more explanation
> about the CFS tuning parameter. "
>
> http://lkml.org/lkml/2007/9/7/87
>
> You are an experienced kernel hacker. How you can credibly claim that
> while you were capable of writing a new scheduler along with a series of
> 25 complex mathematical equations that few if any lkml readers are able
> to understand (and which scheduler came in one intermixed patch that
> added no new comments at all!), and that you are able to maintain the
> m68k Linux architecture code, but that at the same time some supposed
> missing explanation from _me_ makes you magically incapable to split up
> _your own fine code_? This is really beyond me.
I never claimed to understand every detail of CFS, I can _guess_ what
_might_ have been intended, but from that it's impossible to know for
certain how important they are. Let's take this patch fragment:
- /*
- * Fix up delta_fair with the effect of us running
- * during the whole sleep period:
- */
- if (sched_feat(SLEEPER_AVG))
- delta_fair = div64_likely32((u64)delta_fair * load,
- load + se->load.weight);
-
- delta_fair = calc_weighted(delta_fair, se);
You simply remove this logic, without ever explaining what it was really
good for. There is no indication of how it has been replaced. AFAICT the
comment refers to the calc_weighted() call, which is not the problem.
I can _guess_ that it was meant to scale the bonus based on cpu load; what
I can't guess from this is why this logic was added in the first place, _why_
it was necessary.
It would have been easy for me to simply remove this as well, but I preferred
to _know_ what the motivation for this logic was, so I could take it into
account in my patch.
bye, Roman
On Thu, 2007-09-13 at 18:50 +0200, Roman Zippel wrote:
> I never claimed to understand every detail of CFS, I can _guess_ what
> _might_ have been intended, but from that it's impossible to know for
> certain how important they are. Let's take this patch fragment:
>
delta_fair = se->delta_fair_sleep;
we slept that much
> - /*
> - * Fix up delta_fair with the effect of us running
> - * during the whole sleep period:
> - */
> - if (sched_feat(SLEEPER_AVG))
> - delta_fair = div64_likely32((u64)delta_fair * load,
> - load + se->load.weight);
if we had kept running we would not have been removed from the rq and
the weight would have been: rq_weight + weight
so compensate for us having been removed from the rq by scaling the
delta with: rq_weight/(rq_weight + weight)
> - delta_fair = calc_weighted(delta_fair, se);
scale for nice levels
On Thu, 2007-09-13 at 19:06 +0200, Peter Zijlstra wrote:
> On Thu, 2007-09-13 at 18:50 +0200, Roman Zippel wrote:
>
> > I never claimed to understand every detail of CFS, I can _guess_ what
> > _might_ have been intended, but from that it's impossible to know for
> > certain how important they are. Let's take this patch fragment:
> >
>
> delta_fair = se->delta_fair_sleep;
>
> we slept that much
>
> > - /*
> > - * Fix up delta_fair with the effect of us running
> > - * during the whole sleep period:
> > - */
> > - if (sched_feat(SLEEPER_AVG))
> > - delta_fair = div64_likely32((u64)delta_fair * load,
> > - load + se->load.weight);
>
> if we had kept running we would not have been removed from the rq and
> the weight would have been: rq_weight + weight
>
> so compensate for us having been removed from the rq by scaling the
> delta with: rq_weight/(rq_weight + weight)
>
> > - delta_fair = calc_weighted(delta_fair, se);
>
> scale for nice levels
>
Or at least, I think that is how to read it :-)
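Peter's description of the SLEEPER_AVG scaling can be sketched as a tiny standalone C model (illustrative only, not the kernel code; the function name and the use of plain 64-bit division instead of div64_likely32() are simplifications):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of the SLEEPER_AVG compensation: while a task sleeps it is
 * off the runqueue, so the fair clock advanced as if the load were just
 * rq_weight rather than rq_weight + se_weight. Scaling the accumulated
 * sleep delta by rq_weight / (rq_weight + se_weight) compensates, as if
 * the sleeper had stayed on the queue the whole time.
 */
static uint64_t sleeper_avg_delta(uint64_t delta_fair,
                                  unsigned long rq_weight,
                                  unsigned long se_weight)
{
	return delta_fair * rq_weight / (rq_weight + se_weight);
}
```

For example, a nice-0 task (weight 1024) sleeping under a runqueue load of 3072 would keep 3072/4096, i.e. 75%, of its sleep delta before the nice-level scaling that calc_weighted() applies.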
On Sep 13, 2007, at 12:50:12, Roman Zippel wrote:
> On Thu, 13 Sep 2007, Ingo Molnar wrote:
>> And you are duly credited in 3 patches:
>
> This needs a little perspective: as I couldn't clone the repository
> (and you know that), all I had was this announcement, so using the
> patch descriptions now as a defense is unfair of you.
How the hell is that unfair? The fact that nobody could clone the
repo for about 24 hours is *totally* *irrelevant* to the whole
discussion as it's simply a matter of a technical glitch. His point
in referencing patch descriptions is to clear up matters of credit.
Ingo has never in this discussion been "out to get you". From the
point of view of a sideline observer it's been *you* that has been
demanding answers and refusing to answer questions directed at you.
The most brilliant mathematician in the world would have nothing to
contribute to the Linux scheduler if he couldn't describe, code, and
comment his algorithm in detail so that others (even code-monkeys
like myself) could grok at least the basic outline and be able to
give useful commentary and suggestions.
> In this announcement you make relatively few references how this
> relates to my work. Maybe someone else can show me how to read
> that announcement differently, but IMO the casual reader is likely
> to get the impression, that you only picked some minor cleanups
> from my patch, but it's rather unclear that you already
> reimplemented key aspects of my patch.
As a casual reader and reviewer I have yet to actually see you post
readable/reviewable patches in this thread. I was basically
completely unable to follow the detailed math you go into (even with
a math minor) due to your *complete* lack of comments. The fact that
you renamed files and didn't split up your patch made it useless for
actual practical kernel development, its only value was as a
comparison point. I did however get the impression that Ingo got
something significantly useful out of your code despite the problems,
but I still haven't had time to read through his and Peter's patches
in detail to understand exactly what it was. From personal
inspection of a fair percentage of the changes that Ingo and Peter
committed, they certainly appear to be deleting a lot more code than
they add. More specifically they appear to describe in detail what
they are deleting and why, with the exception of one patch that's
missing a changelog entry.
So yeah, I get the impression that Ingo re-implemented some ideas
that you had because you refused to do so in a way that was
acceptable for the upstream kernel. How exactly is this a bad
thing? You came up with a great idea that worked and somebody else
did the ugly grunt work to get it ready to go upstream! On the other
hand, given the "pleasant" attitude that you've showed Ingo during
this whole thing I doubt he'd be likely to do it again.
>> You never directly replied to these pretty explicit requests, all
>> you did was this side remark 5 days later in one of your patch
>> announcements:
>
> This is ridiculous, I asked you multiple times to explain to me
> some of the differences relative to CFS in response to the split-up
> requests. Not once did you react, you didn't even ask what I'd like
> to know specifically.
How exactly is Ingo supposed to explain to YOU the differences
between his scheduler and your modified one? Completely ignoring the
fact that you merged all your changes into a single patch and didn't
add a single comment, it's not *his* algorithm that I have trouble
understanding. From a relatively basic scan of the source-code and
comments I was able to figure out how the algorithm works in general,
enough to ask much more specific questions than yours. If anything,
Ingo should have been asking *you* how your scheduler differed from
the one it was based on.
> I never claimed to understand every detail of CFS, I can _guess_
> what _might_ have been intended, but from that it's impossible to
> know for certain how important they are. Let's take this patch
> fragment:
Oh come on, you appear to be quite knowledgeable about CPU scheduling
and the algorithms involved; surely you should have a much
easier time reading the comments and asking specific questions.
For example, your question below, specifically about the sleep
averaging, could have been answered in fifteen minutes had you
actually *ASKED* it. You'll notice that in fact Peter Zijlstra's
email response did come almost exactly 15 minutes after you sent this
email, and for a casual reader like me it seems perfectly
sufficient; it does depend on you asking specific questions instead
of "how does it differ from my hundred-kbyte patch".
As for that specific patch, it's very clear that the affected logic
is controlled by one of the sched-feature tweaking tools, so you
could very easily experiment with it yourself to see what the
differences are when the feature is on or off and whether or not it's
useful or harmful for your workloads. Such evidence would help
indicate which way the scheduler feature should be hard-coded when
the tunable is finally removed.
Cheers,
Kyle Moffett
Hi Roman.
On Thu, Sep 13, 2007 at 02:35:35PM +0200, Roman Zippel wrote:
> Hi,
>
> On Thu, 13 Sep 2007, Ingo Molnar wrote:
>
> > > Out of curiousity: will I ever get answers to my questions?
> >
> > the last few weeks/months have been pretty hectic - i get more than 50
> > non-list emails a day so i could easily have missed some.
>
> Well, let's just take the recent "Really Simple Really Fair Scheduler"
> thread. You had the time to ask me questions about my scheduler, I even
> explained to you how the sleeping bonus works in my model. At the end I
> was sort of hoping you would start answering my questions and explaining
> things how the same things work in CFS - but nothing.
> Then you had the time to reimplement the very things you've just asked me
> about and what do I get credit for - "two cleanups from RFS".
I have read the announcement from Ingo, and after reading it I concluded
that it was good to see that Ingo had taken into consideration the feedback
from you and improved the scheduler based on it.
And when I read that he removed a lot of stuff I smiled. It reminded
me of countless "monkey" (aka code review) sessions where, like my
children, I repeatedly ask why so many times that the author realizes
that something is not needed or no longer used.
The above was my impression after reading the announcement with
respect to your influence, and that goes far beyond "two cleanups".
I bet many others read it roughly like I did.
And no - I did not go back and re-read it, so do not answer
by quoting the announcement or the like.
That will NOT change what my first impression was.
So keep up the review - we get a better scheduler this way.
Sam
On Thu, 2007-09-13 at 14:28 -0400, Kyle Moffett wrote:
> with the exception of one patch that's missing a changelog entry.
Ah, that would have been one of mine.
---
From: Peter Zijlstra <[email protected]>
Handle vruntime overflow by centering the key space around min_vruntime.
Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched_fair.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index a306f05..b8e2a0d 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -116,11 +116,18 @@ set_leftmost(struct cfs_rq *cfs_rq, struct rb_node *leftmost)
cfs_rq->rb_leftmost = leftmost;
if (leftmost) {
se = rb_entry(leftmost, struct sched_entity, run_node);
- cfs_rq->min_vruntime = max(se->vruntime,
- cfs_rq->min_vruntime);
+ if ((se->vruntime > cfs_rq->min_vruntime) ||
+ (cfs_rq->min_vruntime > (1ULL << 61) &&
+ se->vruntime < (1ULL << 50)))
+ cfs_rq->min_vruntime = se->vruntime;
}
}
+s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ return se->fair_key - cfs_rq->min_vruntime;
+}
+
/*
* Enqueue an entity into the rb-tree:
*/
@@ -130,7 +137,7 @@ __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;
struct rb_node *parent = NULL;
struct sched_entity *entry;
- s64 key = se->fair_key;
+ s64 key = entity_key(cfs_rq, se);
int leftmost = 1;
/*
@@ -143,7 +150,7 @@ __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
* We dont care about collisions. Nodes with
* the same key stay together.
*/
- if (key - entry->fair_key < 0) {
+ if (key < entity_key(cfs_rq, entry)) {
link = &parent->rb_left;
} else {
link = &parent->rb_right;
Hi,
please find a couple of minor cleanups below (on top of sched-cfs-v2.6.23-rc6-v21-combo-3.patch):
(1)
Better placement of #ifdef CONFIG_SCHEDSTAT block in dequeue_entity().
Signed-off-by: Dmitry Adamushko <[email protected]>
---
diff -upr linux-2.6.23-rc6/kernel/sched_fair.c linux-2.6.23-rc6-my/kernel/sched_fair.c
--- linux-2.6.23-rc6/kernel/sched_fair.c 2007-09-13 21:38:49.000000000 +0200
+++ linux-2.6.23-rc6-my/kernel/sched_fair.c 2007-09-13 21:48:50.000000000 +0200
@@ -453,8 +453,8 @@ static void
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int sleep)
{
update_stats_dequeue(cfs_rq, se);
- if (sleep) {
#ifdef CONFIG_SCHEDSTATS
+ if (sleep) {
if (entity_is_task(se)) {
struct task_struct *tsk = task_of(se);
@@ -463,8 +463,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
if (tsk->state & TASK_UNINTERRUPTIBLE)
se->block_start = rq_of(cfs_rq)->clock;
}
-#endif
}
+#endif
__dequeue_entity(cfs_rq, se);
}
---
(2)
unless we are very eager to keep an additional layer of abstraction,
'struct load_stat' is redundant now so let's get rid of it.
Signed-off-by: Dmitry Adamushko <[email protected]>
---
diff -upr linux-2.6.23-rc6/kernel/sched.c linux-2.6.23-rc6-sched-dev/kernel/sched.c
--- linux-2.6.23-rc6/kernel/sched.c 2007-09-12 21:37:41.000000000 +0200
+++ linux-2.6.23-rc6-sched-dev/kernel/sched.c 2007-09-12 21:26:10.000000000 +0200
@@ -170,10 +170,6 @@ struct rt_prio_array {
struct list_head queue[MAX_RT_PRIO];
};
-struct load_stat {
- struct load_weight load;
-};
-
/* CFS-related fields in a runqueue */
struct cfs_rq {
struct load_weight load;
@@ -232,7 +228,7 @@ struct rq {
#ifdef CONFIG_NO_HZ
unsigned char in_nohz_recently;
#endif
- struct load_stat ls; /* capture load from *all* tasks on this cpu */
+ struct load_weight load; /* capture load from *all* tasks on this cpu */
unsigned long nr_load_updates;
u64 nr_switches;
@@ -804,7 +800,7 @@ static int balance_tasks(struct rq *this
* Update delta_exec, delta_fair fields for rq.
*
* delta_fair clock advances at a rate inversely proportional to
- * total load (rq->ls.load.weight) on the runqueue, while
+ * total load (rq->load.weight) on the runqueue, while
* delta_exec advances at the same rate as wall-clock (provided
* cpu is not idle).
*
@@ -812,17 +808,17 @@ static int balance_tasks(struct rq *this
* runqueue over any given interval. This (smoothened) load is used
* during load balance.
*
- * This function is called /before/ updating rq->ls.load
+ * This function is called /before/ updating rq->load
* and when switching tasks.
*/
static inline void inc_load(struct rq *rq, const struct task_struct *p)
{
- update_load_add(&rq->ls.load, p->se.load.weight);
+ update_load_add(&rq->load, p->se.load.weight);
}
static inline void dec_load(struct rq *rq, const struct task_struct *p)
{
- update_load_sub(&rq->ls.load, p->se.load.weight);
+ update_load_sub(&rq->load, p->se.load.weight);
}
static void inc_nr_running(struct task_struct *p, struct rq *rq)
@@ -967,7 +963,7 @@ inline int task_curr(const struct task_s
/* Used instead of source_load when we know the type == 0 */
unsigned long weighted_cpuload(const int cpu)
{
- return cpu_rq(cpu)->ls.load.weight;
+ return cpu_rq(cpu)->load.weight;
}
static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
@@ -1933,7 +1929,7 @@ unsigned long nr_active(void)
*/
static void update_cpu_load(struct rq *this_rq)
{
- unsigned long this_load = this_rq->ls.load.weight;
+ unsigned long this_load = this_rq->load.weight;
int i, scale;
this_rq->nr_load_updates++;
diff -upr linux-2.6.23-rc6/kernel/sched_debug.c linux-2.6.23-rc6-sched-dev/kernel/sched_debug.c
--- linux-2.6.23-rc6/kernel/sched_debug.c 2007-09-12 21:37:41.000000000 +0200
+++ linux-2.6.23-rc6-sched-dev/kernel/sched_debug.c 2007-09-12 21:36:04.000000000 +0200
@@ -137,7 +137,7 @@ static void print_cpu(struct seq_file *m
P(nr_running);
SEQ_printf(m, " .%-30s: %lu\n", "load",
- rq->ls.load.weight);
+ rq->load.weight);
P(nr_switches);
P(nr_load_updates);
P(nr_uninterruptible);
diff -upr linux-2.6.23-rc6/kernel/sched_fair.c linux-2.6.23-rc6-sched-dev/kernel/sched_fair.c
--- linux-2.6.23-rc6/kernel/sched_fair.c 2007-09-12 21:37:41.000000000 +0200
+++ linux-2.6.23-rc6-sched-dev/kernel/sched_fair.c 2007-09-12 21:35:27.000000000 +0200
@@ -499,7 +499,7 @@ set_next_entity(struct cfs_rq *cfs_rq, s
* least twice that of our own weight (i.e. dont track it
* when there are only lesser-weight tasks around):
*/
- if (rq_of(cfs_rq)->ls.load.weight >= 2*se->load.weight) {
+ if (rq_of(cfs_rq)->load.weight >= 2*se->load.weight) {
se->slice_max = max(se->slice_max,
se->sum_exec_runtime - se->prev_sum_exec_runtime);
}
---
Roman,
I've been trying to follow your mails about CFS since your review posted
on Aug 1st. Back to that date, I was thinking "cool, an in-depth review
by someone who understands schedulers and mathematics very well, we'll
quickly have a very solid design".
On Aug 10th, I was disappointed to see that you still had not provided
the critical information that Ingo had been requesting from you for 9 days
(the cfs-sched-debug output). Your motivations in this work started to
become a bit fuzzy to me, since people who behave like this generally
do so to get all the lights on them, and you really don't need this.
Your explanation was kind of "show me yours and only then I'll show
you mine". Pretty childish, but you finally sent that long-requested
information.
Since then, I've been noticing your now popular "will I get a response
to my questions" stuffed into most of your mails. That was getting very
suspicious coming from someone who can write down mathematical equations to
prove his design is right, especially considering the fact that your
"question" only relates to what a few lines were supposed to do. Nobody
believes that someone as smart as you is still blocked on the same
line of code after one month!
And if getting CFS fixed wasn't your real motivation...
On Thu, Sep 13, 2007 at 12:17:42AM +0200, Roman Zippel wrote:
> On Tue, 11 Sep 2007, Ingo Molnar wrote:
>
> > fresh back from the Kernel Summit, Peter Zijlstra and me are pleased to
> > announce the latest iteration of the CFS scheduler development tree. Our
> > main focus has been on simplifications and performance - and as part of
> > that we've also picked up some ideas from Roman Zippel's 'Really Fair
> > Scheduler' patch as well and integrated them into CFS. We'd like to ask
> > people go give these patches a good workout, especially with an eye on
> > any interactivity regressions.
>
> I'm must really say, I'm quite impressed by your efforts to give me as
> little credit as possible.
> On the one hand it's of course positive to see so much sudden activity, on
> the other hand I'm not sure how much had happened if I hadn't posted my
> patch, I don't really think it were my complaints about CFS's complexity
> that finally lead to the improvements in this area.
I'm now fairly convinced that you're not seeking credit either. There
is more credit to your name per line of patch here than there is in
your own code in the kernel. That complaint does not stand by itself.
In fact, I'm beginning to think that you're like a cat who has found a mouse.
Why kill it if you can play with it? Each of your "will I get a response"
remarks is just like a small kick in the mouse's back to make it move. But by
dint of doing this, you're slowly pushing the mouse toward the door, where it
risks escaping from you, and you're losing your toy.
So right now, I'm sure you really do not want to get any code merged. It's
so much fun for you to say "hey, Ingo, respond to me" that you would lose
that ability if your code got merged.
> I presented the basic
> concepts of my patch already with my first CFS review, but at that time
> you didn't show any interest and instead you were rather quick to simply
> dismiss it.
At that time, if my memory serves me, you were complaining about a fairness
problem you had with a few programs whose sources you took days to show.
Proposing an alternate design along with a bug report generally has no
chance of being considered, because the developer mostly focuses on the bug
report. You should have spent time explaining how your design would work
*after* your problems were solved.
> My patch did not add that much new, it's mostly a conceptual
> improvement and describes the math in more detail
- why were those details never explained in plain English when nobody could
understand your maths, then?
- if you have no problem reading code and translating it to concepts, without
any comments around it, then how is it believable that you have a problem
understanding 10 lines of code after 1 month?
>, but it also demonstrated a number of improvements.
Very likely, which is why Ingo and Peter agreed to take parts of those
improvements. But do you realize that your lack of ability to communicate
on this list has probably delayed mainline integration of parts of your
work, because a patch was required just to try to understand your
intents? It's not sci.math here, it's linux-kernel, the _development_
mailing list, where the raw material and common language between people
is the _code_. Some people do not have the skills required to code their
excellent ideas, but they can spend time explaining those ideas to other
people. In your case, it was just a guessing game. It does not work like
this, and you know it. I really think that you deliberately slowed the
whole process down in order to stay on the scene playing this game.
> > The sched-devel.git tree can be pulled from:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git
>
> Am I the only one who can't clone that thing? So I can't go into much
> detail about the individual changes here.
Even your question here is suspicious: the fact that you wonder whether you're
the only one implies you think it could be possible, thus implying something
intentionally targeted at you. And no, do not tell me you meant that your
git-clone command could have failed; you would have asked differently, such
as "sorry, I cannot clone from there right now".
You're taking advantage of everything around you to show that there is a
deliberate intention not to cooperate.
> The thing that makes me curious, is that it also includes patches by
> others. It can't be entirely explained with the Kernel Summit, as this is
> not the first time patches appear out of the blue in form of a git tree.
Once again: an implied accusation of things being done without you knowing
about them. What's wrong with this? Fortunately, Linus does not tell you when
he merges a patch by someone other than you.
> The funny/sad thing is that at some point Linus complained about Con that
> his development activity happend on a separate mailing list, but there was
> at least a place to go to. CFS's development appears to mostly happen in
> private.
I'm not trying to take Ingo's defense, because I too think he tends to show
his code only when it's well advanced, but it's often necessary to work with
pencil and paper for hours before you can suddenly start. After that, it's
true that changes can advance very fast. After all, how many iterations did
you send before the patch that Ingo and Peter used? Only one (or maybe zero,
depending on what patch they started with). So you're dishonest again.
> Patches may be your primary form of communication, but that isn't
> true for many other people,
It's true that most of my family and relatives do not speak this language,
but you would find it funny to discover that on this list, it's the most
common form of expression. Look at the subjects: most of them begin with
'[PATCH]'. And even code reviews are done in patch form, with lines starting
with '-' and '+'. I won't go on; I know you know it, you were
just playing dumb.
> with patches a lot of intent and motivation for a change is lost.
Yes, that's true. And I think that you deliberately avoided any comments
in your code exactly for this reason: slow down its integration process
to play a bit longer here. Would it have been that hard to put comments
to indicate people *your* intents ?
> I know it's rather tempting to immediately try out
> an idea first, but would it really hurt you so much to formulate an idea
> in a more conventional manner?
This is funny! Several people have been asking you to reformulate the
ideas that nobody could understand because of your math notation, which
you never did (at least not completely, just in some parts). The conventional
manner _is_ the patch on LKML.
> Are you afraid it might hurt your
> ueberhacker status by occasionally screwing up in public?
Not speaking for Ingo of course, but I'd ask "and you?". Do you feel any
particular pride in being able to send formulas nobody understands, and
would it hurt your status to explain them to the normal people? (I mean
"normal" for this list, you remember, the ones who only communicate in
English or Patch.)
> Patches on the
> other hand have the advantage to more easily cover that up by simply
> posting a fix - it makes it more difficult to understand what's going on.
I don't get it; a patch cannot hide its history. I very often happen to
rediff a set of patches and/or run interdiff to see what changed
between multiple versions. On the other hand, if you sent 3 consecutive
mails with your magnificent formulas, nobody would notice any change!
> A more conventional way of communication would give more people a chance
> to participate
Exactly, and that's what was asked of you! After 15 minutes of reading your
mail and trying to decipher it, I finally gave up. That's sad, because it
looked very interesting. I'm all for demonstrable designs instead of
empirical ones.
> they may not understand every detail of the patch, but
> they can try to understand the general concepts and apply them to their
> own situation and eventually come up with some ideas/improvements of their
> own, they would be less dependent on you to come up with a solution to
> their problem.
I think that is what Ingo and Peter did: try to apply their understanding
of your concepts to their implementation, without being too dependent
on you to come up with a solution.
> Unless of course that's exactly what you want - unless you want to be in
> full control of the situation
And there we are! You've been controlling the situation pretty well for the
last month, with people politely entreating you to explain what you
considered wrong, how your design worked, etc. Even your mail rate on LKML has
doubled since August. You might have been feeling horny!
> and you want to be the hero that saves the day.
Right now, nobody is saving the day. The Linux development process looks like
a playground with little kids throwing sand into each other's eyes. Lamentable!
> In the patch you really remove _a_lot_ of stuff. You also removed a lot of
> things I tried to get you to explain them to me. On the one hand I could
> be happy that these things are gone, as they were the major road block to
> splitting up my own patch. On the other hand it still leaves me somewhat
> unsatisfied, as I still don't know what that stuff was good for.
You do not appear sincere. You might have believed this for the first
few days, but insisting for ONE MONTH on this part of the code means
that you found a flaw in it, or you found it did not serve any purpose,
and you wanted Ingo to tell you anything about it so that you could
reply "bullshit, it does not work". Now I suspect it was simply useless,
and they finally realized it and removed the code. What would it have
cost you to say "it seems to me that this code does nothing"? You would
have been credited for it, since you're asking for that.
> In a more collaborative development model I would have expected that you
> tried to explain these features, which could have resulted in a discussion
> how else things can be implemented or if it's still needed at all. Instead
> of this you now simply decide unilaterally that these things are not
> needed anymore.
You know as well as I do that explaining concepts by mail takes *a lot* of
time. I even refuse to do this anymore with the people I work with. Wasting
4 hours writing down something that goes into the bin in 5 minutes is stupid
at best. Better to refine the thinking in our own corners, and either meet
or discuss the small pieces by mail.
> > there's also a visible reduction in code size:
> >
> > text data bss dec hex filename
> > 13369 228 2036 15633 3d11 sched.o.before (UP, nodebug)
> > 11167 224 1988 13379 3443 sched.o.after (UP, nodebug)
>
> Well, one could say that you used every little trick in the book to get
> these numbers down.
And why is this wrong? I too spend a lot of time reducing and optimizing
code, sometimes even an hour to reduce some primitives by a few bytes or
cycles on most architectures I can test, and it often pays off. At this
stage of the development, it's not unreasonable to try to reduce code size,
since it is not meant to change a lot. And 15% is not bad at all!
> On the other hand at this point it's a little unclear
> whether you maybe removed it a little too much to get there, so the
> significance of these numbers is a bit limited.
That's clearly possible. But how would one tell, given the level of outbound
filtering you apply to your advice?
> > Changes: besides the many micro-optimizations, one of the changes is
> > that se->vruntime (virtual runtime) based scheduling has been introduced
> > gradually, step by step - while keeping the wait_runtime metric working
> > too. (so that the two methods are comparable side by side, in the same
> > scheduler)
>
> I can't quite see that, the wait_runtime metric is relative to fair_clock
> and this is gone without any replacement, in my patch I at least
> calculate these values for the debug output, but in your patch even that
> is simply gone, so I'm not sure what you actually compare "side by side".
Ah, this is where the useful information was hidden. In most mails from you,
there are often:
- a ton of crap
- one complaint
- a ton of crap
- one very useful piece of advice
- a ton of crap
Very easy after that to ask for responses to your question and to say "I told
you 1 month ago...". And don't pretend it's unintentional; I've been playing
the same game with some other people for years in other contexts!
> > The ->vruntime metric is similar to the ->time_norm metric used by
> > Roman's patch (and both are losely related to the already existing
> > sum_exec_runtime metric in CFS), it's in essence the sum of CPU time
> > executed by a task, in nanoseconds - weighted up or down by their nice
> > level (or kept the same on the default nice 0 level). Besides this basic
> > metric our implementation and math differs from RFS.
>
> At this point it gets really interesting - I'm amazed how much you stress
> the differences.
Many people would be amazed at how much you exaggerate the fact that there
are differences. Indeed, of those 6 lines, 5 are about similarity, and one is
about a different implementation and math. I don't see "how much he stresses
the differences".
> If we take the basic math as I more simply explained it
> in this example http://lkml.org/lkml/2007/9/3/168, you now also make the
> step from the relative wait_runtime value to an absolute virtual time
> value. Basically it's really the same thing, only the resolution differs.
> This means you already reimplemented a key element of my patch, so would
> you please give me at least that much credit?
Ah, the episode of the guy having his code counterfeited with no credit.
Anyway, since it's your idea, I too think that there should be comments in
the code stating this, close to the explanations.
> The rest of the math is indeed different - it's simply missing. What is
> there is IMO not really adequate. I guess you will see the differences,
> once you test a bit more with different nice levels. There's a good reason
> I put that much effort into maintaining a good, but still cheap average,
> it's needed for a good task placement.
And this reason is?
> There is of course more than one
> way to implement this, so you'll have good chances to simply reimplement
> it somewhat differently, but I'd be surprised if it would be something
> completely different.
It would be stupid if they had to reimplement something they did not
understand from your work. I would personally feel really desperate if
I had spent that much time inventing very smart concepts, only for people
to get them wrong because I was totally unable to explain them in
human-understandable words.
> To make it very clear to everyone else: this is primarely not about
> getting some credit, although that's not completely unimportant of course.
I believe you on this one.
> This is about getting adequate help.
I don't believe you on this one. Getting help is mostly what Ingo and Peter
have been seeking from you and got in small parts with lots of difficulties.
You could show everyone here that your brain really needs no help when it
comes to playing with those algorithms, but it likes to play, and often with
the same games.
> I had to push very hard to get any
> kind of acknowledgment with scheduler issues only to be rewarded with
> silence, so that I was occasionally wondering myself, whether I'm just
> hallucinating all this.
Not credible; you should renice the amazing factor in your complaints, it's
just a poor theatre play we're watching. With slightly less exaggeration,
it might be believable.
> Only after I provide the proof that further
> improvements are possible is there some activity.
That's true. What's unfortunate is that this proof was also the first
understandable starting point.
> But instead of providing
> help (e.g. by answering my questions) Ingo just goes ahead and
> reimplements the damn thing himself and simply throws out all questionable
> items instead of explaining them.
Oh, the persecuted guy again, with his persistent pain over the lack of
an answer to the same question since last month.
> The point is that I have no real interest in any stinking competition, I
> have no interest to be reduced to simply complaining all the time.
False! It's the way you're trying to prove Ingo is a bastard and that you're
a victim. But if we just re-read a few pick-ups of your mails since Aug 1st,
it's getting pretty obvious that you completely made up this situation. And
I can only applaud your very high manipulation skills; I'm impressed, because
you got me for a long time. But as always when such people are constantly
pushing the limits further, they reveal themselves.
> I'm more interested in a cooperation,
Did I say that I doubt about it now ?
> but that requires better communication and an actual exchange of ideas,
> patches are no real communication, they are a supplement and should
> rather be the end result instead of means to get anything started.
On some complex algorithms, you may be right. But a quick and dirty patch
has the advantage of showing the ideas and concepts in a way that many
people can understand and comment on.
> A collaboration could bundle the individual
> strengths, so that this doesn't degenerate into a contest of who's the
> greatest hacker.
> Is this really too much to expect?
Well, what are you going to do next?
- wait for no response and say "everybody, look, the weak bastard in front
of me refuses the fight" ?
- split up your patches and add comments in them so that Ingo and Peter
finally understand what you really mean and not only what you're willing
to show them ?
- open a new thread on LKML detailing your ideas one at a time and proposing
others to implement them if you cannot code cleanly ?
- anything else? (eg: consult a specialist in schizophrenia?)
You could at least choose to prove your intent to contribute by rediffing
your patch against the last one which tries to imitate it, and commenting
the result, then splitting it up in as many parts as you see fit. And to
reuse a phrase from your last mail:
> Is this really too much to expect?
I sincerely hope you'll make everyone benefit from your unequalled skills,
and that you will stop playing cat and mouse. It's boring for many people,
and counter-productive.
> bye, Roman
Thanks,
Willy
and here's something a bit more intrusive.
The initial idea was to completely get rid of 'se->fair_key'. It's always equal to 'se->vruntime' for
all runnable tasks but the 'current'. The exact key within the tree for the 'current' has to be known in
order for __enqueue_entity() to work properly (if we just use 'vruntime', we may go the wrong way down the tree
while looking for the correct position for a new element).
Sure, it's possible to cache the current's key in the 'cfs_rq' and add a few additional checks, but that's
not very nice... so what if we don't keep the 'current' within the tree? :-)
The illustration is below. Some bits may still be missing, but a patched kernel boots/works
(I haven't done real regression tests yet... I can say that the mail client is still working
at this very moment :-).
There are 2 benefits:
(1) no more 'fair_key' ;
(2) entity_tick() is simpler/more efficient: 'update_curr()' now vs.
'dequeue_entity() + enqueue_entity()' before.
anyway, consider it mainly an illustration of the idea so far.
---
diff -upr linux-2.6.23-rc6/include/linux/sched.h linux-2.6.23-rc6-my/include/linux/sched.h
--- linux-2.6.23-rc6/include/linux/sched.h 2007-09-13 21:38:49.000000000 +0200
+++ linux-2.6.23-rc6-my/include/linux/sched.h 2007-09-13 23:01:21.000000000 +0200
@@ -890,7 +890,6 @@ struct load_weight {
* 6 se->load.weight
*/
struct sched_entity {
- s64 fair_key;
struct load_weight load; /* for load-balancing */
struct rb_node run_node;
unsigned int on_rq;
diff -upr linux-2.6.23-rc6/kernel/sched.c linux-2.6.23-rc6-my/kernel/sched.c
--- linux-2.6.23-rc6/kernel/sched.c 2007-09-13 21:52:13.000000000 +0200
+++ linux-2.6.23-rc6-my/kernel/sched.c 2007-09-13 23:00:19.000000000 +0200
@@ -6534,7 +6534,6 @@ void normalize_rt_tasks(void)
read_lock_irq(&tasklist_lock);
do_each_thread(g, p) {
- p->se.fair_key = 0;
p->se.exec_start = 0;
#ifdef CONFIG_SCHEDSTATS
p->se.wait_start = 0;
diff -upr linux-2.6.23-rc6/kernel/sched_debug.c linux-2.6.23-rc6-my/kernel/sched_debug.c
--- linux-2.6.23-rc6/kernel/sched_debug.c 2007-09-13 21:52:13.000000000 +0200
+++ linux-2.6.23-rc6-my/kernel/sched_debug.c 2007-09-13 23:00:50.000000000 +0200
@@ -38,7 +38,7 @@ print_task(struct seq_file *m, struct rq
SEQ_printf(m, "%15s %5d %15Ld %13Ld %5d ",
p->comm, p->pid,
- (long long)p->se.fair_key,
+ (long long)p->se.vruntime,
(long long)(p->nvcsw + p->nivcsw),
p->prio);
#ifdef CONFIG_SCHEDSTATS
diff -upr linux-2.6.23-rc6/kernel/sched_fair.c linux-2.6.23-rc6-my/kernel/sched_fair.c
--- linux-2.6.23-rc6/kernel/sched_fair.c 2007-09-13 21:52:13.000000000 +0200
+++ linux-2.6.23-rc6-my/kernel/sched_fair.c 2007-09-13 23:48:02.000000000 +0200
@@ -125,7 +125,7 @@ set_leftmost(struct cfs_rq *cfs_rq, stru
s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- return se->fair_key - cfs_rq->min_vruntime;
+ return se->vruntime - cfs_rq->min_vruntime;
}
/*
@@ -167,9 +167,6 @@ __enqueue_entity(struct cfs_rq *cfs_rq,
rb_link_node(&se->run_node, parent, link);
rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
- update_load_add(&cfs_rq->load, se->load.weight);
- cfs_rq->nr_running++;
- se->on_rq = 1;
}
static void
@@ -179,9 +176,6 @@ __dequeue_entity(struct cfs_rq *cfs_rq,
set_leftmost(cfs_rq, rb_next(&se->run_node));
rb_erase(&se->run_node, &cfs_rq->tasks_timeline);
- update_load_sub(&cfs_rq->load, se->load.weight);
- cfs_rq->nr_running--;
- se->on_rq = 0;
}
static inline struct rb_node *first_fair(struct cfs_rq *cfs_rq)
@@ -320,10 +314,6 @@ static void update_stats_enqueue(struct
*/
if (se != cfs_rq->curr)
update_stats_wait_start(cfs_rq, se);
- /*
- * Update the key:
- */
- se->fair_key = se->vruntime;
}
static void
@@ -371,6 +361,22 @@ update_stats_curr_end(struct cfs_rq *cfs
* Scheduling class queueing methods:
*/
+static void
+account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ update_load_add(&cfs_rq->load, se->load.weight);
+ cfs_rq->nr_running++;
+ se->on_rq = 1;
+}
+
+static void
+account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ update_load_sub(&cfs_rq->load, se->load.weight);
+ cfs_rq->nr_running--;
+ se->on_rq = 0;
+}
+
static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
#ifdef CONFIG_SCHEDSTATS
@@ -446,7 +452,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
}
update_stats_enqueue(cfs_rq, se);
- __enqueue_entity(cfs_rq, se);
+ if (se != cfs_rq->curr)
+ __enqueue_entity(cfs_rq, se);
+ account_entity_enqueue(cfs_rq, se);
}
static void
@@ -465,7 +473,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
}
}
#endif
- __dequeue_entity(cfs_rq, se);
+ if (se != cfs_rq->curr)
+ __dequeue_entity(cfs_rq, se);
+ account_entity_dequeue(cfs_rq, se);
}
/*
@@ -511,6 +521,9 @@ static struct sched_entity *pick_next_en
{
struct sched_entity *se = __pick_next_entity(cfs_rq);
+ if (se)
+ __dequeue_entity(cfs_rq, se);
+
set_next_entity(cfs_rq, se);
return se;
@@ -522,8 +535,11 @@ static void put_prev_entity(struct cfs_r
* If still on the runqueue then deactivate_task()
* was not called and update_curr() has to be done:
*/
- if (prev->on_rq)
+ if (prev->on_rq) {
update_curr(cfs_rq);
+ /* Put the current back into the tree. */
+ __enqueue_entity(cfs_rq, prev);
+ }
update_stats_curr_end(cfs_rq, prev);
@@ -535,11 +551,9 @@ static void put_prev_entity(struct cfs_r
static void entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
/*
- * Dequeue and enqueue the task to update its
- * position within the tree:
+ * Update run-time statistics of the 'current'.
*/
- dequeue_entity(cfs_rq, curr, 0);
- enqueue_entity(cfs_rq, curr, 0);
+ update_curr(cfs_rq);
if (cfs_rq->nr_running > 1)
check_preempt_tick(cfs_rq, curr);
@@ -890,6 +904,7 @@ static void task_new_fair(struct rq *rq,
update_stats_enqueue(cfs_rq, se);
__enqueue_entity(cfs_rq, se);
+ account_entity_enqueue(cfs_rq, se);
resched_task(rq->curr);
}
---
On 9/13/07, Ingo Molnar <[email protected]> wrote:
>
> * Rob Hussey <[email protected]> wrote:
>
> > Well, I was going over my config myself after you asked for me to post
> > it, and I thought to do the same thing. Except, disabling sched_debug
> > caused the same error as before:
> > In file included from kernel/sched.c:794:
> > kernel/sched_fair.c: In function 'task_new_fair':
> > kernel/sched_fair.c:857: error: 'sysctl_sched_child_runs_first'
> > undeclared (first use in this function)
> > kernel/sched_fair.c:857: error: (Each undeclared identifier is
> > reported only once
> > kernel/sched_fair.c:857: error: for each function it appears in.)
> > make[1]: *** [kernel/sched.o] Error 1
> > make: *** [kernel] Error 2
> >
> > It only happens with sched_debug=y. I take it back, it wasn't my fault :)
> >
> > As for everything else, I'd be happy to.
>
> are you sure this is happening with the latest iteration of the patch
> too? (with the combo-3.patch?) You can pick it up from here:
>
> http://people.redhat.com/mingo/cfs-scheduler/devel/sched-cfs-v2.6.23-rc6-v21-combo-3.patch
>
> I tried your config and it builds fine here.
>
> Ingo
>
I managed to work it all out (it was my fault after all), and I've now
made the changes you suggested to my .configs for 2.6.23-rc1 and
2.6.23-rc6. I've done the benchmarks all over, including tests with
the task bound to a single core. Without further ado, the numbers I
promised:
lat_ctx -s 0 2
# rc1 rc6 cfs-devel
1 4.58 4.39 4.42
2 4.76 4.42 4.41
3 4.74 4.52 4.67
4 4.74 4.44 4.76
5 4.79 4.74 4.59
6 4.80 4.65 4.76
7 4.52 4.54 4.50
8 4.72 4.57 4.62
9 4.87 4.67 4.80
10 4.69 4.47 4.65
hackbench 50
# rc1 rc6 cfs-devel
1 6.634 5.969 5.894
2 6.342 5.974 5.903
3 6.219 5.913 5.941
4 6.702 5.980 5.916
5 6.287 6.007 5.943
6 6.239 6.022 5.899
7 6.434 5.946 5.904
8 6.229 6.007 5.941
9 6.387 5.947 5.880
10 6.383 5.946 5.933
pipe-test
# rc1 rc6 cfs-devel
1 13.39 13.16 13.20
2 13.37 13.12 13.22
3 13.19 13.17 13.26
4 13.17 13.16 13.18
5 13.16 13.22 13.23
6 13.15 13.19 13.18
7 13.18 13.42 13.21
8 13.45 13.39 13.26
9 13.40 13.40 13.28
10 13.39 13.44 13.24
Bound to single core:
lat_ctx -s 0 2
# rc1 rc6 cfs-devel
1 3.20 2.61 2.37
2 3.20 2.60 2.40
3 3.21 2.67 2.38
4 3.19 2.66 2.34
5 3.19 2.64 2.37
6 3.22 2.67 2.36
7 3.21 3.29 2.36
8 3.22 2.61 2.44
9 3.23 2.68 2.36
10 3.22 2.60 2.37
hackbench 50
# rc1 rc6 cfs-devel
1 7.528 7.950 7.538
2 7.649 8.026 7.548
3 7.613 8.160 7.580
4 7.550 8.054 7.558
5 7.563 8.373 7.559
6 7.617 8.152 7.550
7 7.593 7.831 7.562
8 7.602 8.311 7.588
9 7.589 8.010 7.552
10 7.682 8.059 7.556
pipe-test
# rc1 rc6 cfs-devel
1 10.29 9.27 8.54
2 10.30 9.29 8.54
3 10.31 9.28 8.54
4 10.29 9.27 8.54
5 10.28 9.28 8.53
6 10.30 9.28 8.53
7 10.30 9.28 8.54
I've made graphs like last time:
http://www.healthcarelinen.com/misc/lat_ctx_benchmark.png
http://www.healthcarelinen.com/misc/hackbench_benchmark.png
http://www.healthcarelinen.com/misc/pipe-test_benchmark.png
http://www.healthcarelinen.com/misc/BOUND_lat_ctx_benchmark.png
http://www.healthcarelinen.com/misc/BOUND_hackbench_benchmark.png
http://www.healthcarelinen.com/misc/BOUND_pipe-test_benchmark.png
I'll also attach the graphs, as well as my .configs.
As for the data, I really can't read much into it. I think the graphs
tell the story better (if only I could understand the story). Overall
though, a definite improvement over 2.6.23-rc1.
On 9/13/07, Rob Hussey <[email protected]> wrote:
> Bound to single core:
...
> hackbench 50
> # rc1 rc6 cfs-devel
> 1 7.528 7.950 7.538
> 2 7.649 8.026 7.548
> 3 7.613 8.160 7.580
> 4 7.550 8.054 7.558
> 5 7.563 8.373 7.559
> 6 7.617 8.152 7.550
> 7 7.593 7.831 7.562
> 8 7.602 8.311 7.588
> 9 7.589 8.010 7.552
> 10 7.682 8.059 7.556
>
I knew there was no way I'd post all these numbers and not screw
something up. Switch rc6 and rc1 for hackbench 50 (bound to single
core). Updated graph:
http://www.healthcarelinen.com/misc/BOUND_hackbench_benchmark_fixed.png
Also attached.
On Sep 13, 2007, at 21:47:25, Rob Hussey wrote:
> On 9/13/07, Ingo Molnar <[email protected]> wrote:
>> are you sure this is happening with the latest iteration of the
>> patch too? (with the combo-3.patch?) You can pick it up from here:
>>
>> http://people.redhat.com/mingo/cfs-scheduler/devel/sched-cfs-
>> v2.6.23-rc6-v21-combo-3.patch
>
> I managed to work it all out (it was my fault after all), and I've now
> made the changes you suggested to my .configs for 2.6.23-rc1 and
> 2.6.23-rc6. I've done the benchmarks all over, including tests with
> the task bound to a single core. Without further ado, the numbers I
> promised:
>
> [...]
>
> I've made graphs like last time:
> http://www.healthcarelinen.com/misc/lat_ctx_benchmark.png
> http://www.healthcarelinen.com/misc/hackbench_benchmark.png
> http://www.healthcarelinen.com/misc/pipe-test_benchmark.png
> http://www.healthcarelinen.com/misc/BOUND_lat_ctx_benchmark.png
> http://www.healthcarelinen.com/misc/BOUND_hackbench_benchmark.png
> http://www.healthcarelinen.com/misc/BOUND_pipe-test_benchmark.png
Well looking at these graphs (and the fixed one from your second
email), it sure looks a lot like CFS is doing at *least* as well as
the old scheduler in every single test, and doing much better in most
of them (in addition it's much more consistent between runs). This
seems to jibe with all the other benchmarks and overall empirical
testing that everyone has been doing. Overall I have to say a job
well done for Ingo, Peter, Con, and all the other major contributors
to this impressive endeavor.
Cheers,
Kyle Moffett
* dimm <[email protected]> wrote:
> Better placement of #ifdef CONFIG_SCHEDSTAT block in dequeue_entity().
thanks, applied.
Ingo
* dimm <[email protected]> wrote:
> unless we are very eager to keep an additional layer of abstraction,
> 'struct load_stat' is redundant now so let's get rid of it.
yeah - good one, it's indeed redundant. Applied.
Ingo
* dimm <[email protected]> wrote:
> and here's something a bit more intrusive.
>
> The initial idea was to completely get rid of 'se->fair_key'. It's
> always equal to 'se->vruntime' for all runnable tasks but the
> 'current'. The exact key within the tree for the 'current' has to be
> known in order for __enqueue_entity() to work properly (if we just use
> 'vruntime', we may go a wrong way down the tree while looking for the
> correct position for a new element). Sure, it's possible to cache the
> current's key in the 'cfs_rq' and add a few additional checks, but
> that's not very nice... so what if we don't keep the 'current' within
> the tree? :-)
>
> The illustration is below. Some bits can be missed so far but a
> patched kernel boots/works (haven't done real regression tests yet...
> can say that the mail client is still working at this very moment :-).
>
> There are 2 benefits:
>
> (1) no more 'fair_key' ;
> (2) entity_tick() is simpler/more efficient: 'update_curr()' now vs.
> 'dequeue_entity() + enqueue_entity()' before.
cool patch - i like it! It removes some code as well, besides shrinking
struct task_struct by another 64-bit variable - so it's a nice
speedup:
text data bss dec hex filename
34467 3466 24 37957 9445 sched.o.before
34414 3466 24 37904 9410 sched.o.after
i've applied it to the end of the queue - it depends on whether
->vruntime works out well.
Ingo
On Thu, 2007-09-13 at 14:44 +0200, Peter Zijlstra wrote:
> > If you look at the math, you'll see that I took the overflow into account,
> > I even expected it. If you see this effect in my implementation, it would
> > be a bug.
>
> Ah, ok, I shall look to your patches in more detail, it was not obvious
> from the formulae you posted.
You indeed outlined it in your email, I must have forgotten it.
I have the attention span of a goldfish these days :-/
Hi,
On Thu, 13 Sep 2007, Ingo Molnar wrote:
> > The rest of the math is indeed different - it's simply missing. What
> > is there is IMO not really adequate. I guess you will see the
> > differences, once you test a bit more with different nice levels.
>
> Roman, i disagree strongly. I did test with different nice levels. Here
> are some hard numbers: the CPU usage table of 40 busy loops started at
> once, all running at a different nice level, from nice -20 to nice +19:
Ingo, you should have read the rest of the paragraph too, I said "it's
needed for a good task placement", I didn't say anything about time
distribution.
Try to start a few niced busy loops and then try some interactivity tests.
You should also increase the granularity, the rather small time slices can
cover up a lot of bad scheduling decisions.
> In the announcement of your "Really Fair Scheduler" patch you used the
> following very strong statement:
>
> " This model is far more accurate than CFS is [...]"
>
> http://lkml.org/lkml/2007/8/30/307
>
> but when i stressed you for actual real-world proof of CFS misbehavior,
You're forgetting that only a few days before that announcement, the worst
issues had been fixed, which at that time I hadn't taken into account yet.
> you said:
>
> "[...] they have indeed little effect in the short term, [...] "
>
> http://lkml.org/lkml/2007/9/2/282
>
> so how can CFS be "far less accurate" (paraphrased) while it has "little
> effect in the short term"?
>
> so to repeat my question: my (and Peter's) claim is that there is no
> real-world significance of much of the complexity you added to avoid
> rounding effects. You do disagree with that, so our follow-up question
> is: what actual real-world significance does it have in your opinion?
> What is the worst-case effect? Do we even care? We have measured it
> every which way and it just does not matter. (but we could easily be
> wrong, so please be specific if you know about something that we
> overlooked.) Thanks,
Did you read the rest of the mail? I said a little bit more than that, which
actually explains this already in large parts.
(BTW this mail also has one example where I almost begged you to explain
me some of the CFS features in response to your splitup request - no
response.)
Accuracy is an important aspect, but it's not really the primary goal.
As I said I wanted a correct mathematical model of CFS, but due to the
complexity of CFS (of which a lot has been removed now in CFS-devel) it
was rather difficult to produce such a model.
Producing an accurate model is meant as a _tool_ for further
transformations, e.g. to analyze where further simplifications are
possible, and where the 64-bit math can be replaced with something simpler
without reducing scheduling quality significantly.
The added accuracy of course increases the complexity, but compared to the
already existing complexity it was still less (at least according to the
lmbench numbers), so IMO it's worth it. The advantage is that I didn't have
to worry about any effects of unexpected rounding errors. This scheduler
has to work with a wide range of clock implementations and AFAICT it's
impossible to guarantee that it works in every situation; it may not
break down completely, but I couldn't exclude unexplainable anomalies,
especially after seeing the problems in the early CFS version, which got
merged.
As I also mentioned, this is only part of the problem (but one to which the
early CFS version significantly contributed). The main problem was the limits:
once the limits are exceeded, that overflow/underflow time is simply lost,
and that is what finally resulted in the misbehaviour. The rounding
problems were one possible cause but not the only one. Other possibilities
would require more complex scheduling patterns, where de-/enqueuing of
tasks would push some tasks into these limits. Prime suspect here was the
sleeper bonus, and the question was: is it possible to accumulate the
bonus, is it possible to force the punishment onto specific tasks?
The complexity of CFS makes it now hard to quantify the problem; it's easy
to say that it will work in most cases, but e.g. the rounding fixes
changed the common case more than the worst case. The point is what it
would cost to be a little more accurate, and as my patch proves, not
much; in the end we would have a more reliable scheduler that works
well not only in the common cases.
Anyway, as I said already earlier, with the step to an absolute virtual
time the biggest error source is gone, so in a way you also proved my
point that it's worth it, even if you don't want to admit it.
bye, Roman
Hi,
On Thu, 13 Sep 2007, Peter Zijlstra wrote:
> On Thu, 2007-09-13 at 18:50 +0200, Roman Zippel wrote:
>
> > I never claimed to understand every detail of CFS, I can _guess_ what
> > _might_ have been intended, but from that it's impossible to know for
> > certain how important they are. Let's take this patch fragment:
> >
>
> delta_fair = se->delta_fair_sleep;
>
> we slept that much
>
> > - /*
> > - * Fix up delta_fair with the effect of us running
> > - * during the whole sleep period:
> > - */
> > - if (sched_feat(SLEEPER_AVG))
> > - delta_fair = div64_likely32((u64)delta_fair * load,
> > - load + se->load.weight);
>
> if we would have ran we would not have been removed from the rq and the
> weight would have been: rq_weight + weight
>
> so compensate for us having been removed from the rq by scaling the
> delta with: rq_weight/(rq_weight + weight)
>
> > - delta_fair = calc_weighted(delta_fair, se);
>
> scale for nice levels
AFAICT the compensation part is already done by the scaling part; without
the load part it largely mirrors what __update_stats_wait_end() does, i.e.
it gets the same time as other tasks that have been on the rq.
bye, Roman
On Fri, 2007-09-14 at 14:04 +0200, Roman Zippel wrote:
> AFAICT the compensation part is already done by the scaling part, without
> the load part it largely mirrors what __update_stats_wait_end() does, i.e.
> it gets the same time as other tasks, which have been on the rq.
All it tried to do was approximate the situation where the task never
left the rq. I'm not saying it makes sense or is the right thing to do,
just what the thought behind that particular bit was.
There is a reason it was turned off by default:
- SCHED_FEAT_SLEEPER_AVG *0 |
Hi,
On Thu, 13 Sep 2007, Sam Ravnborg wrote:
> I have read the announcement from Ingo and after reading it I concluded
> that it was good to see that Ingo had taken into consideration the feedback
> from you and improved the scheduler based on this.
> And when I read that he removed a lot of stuff I smiled. This reminded
> me of countless monkey aka code review sessions where I repeatedly do
> like my children and ask why so many times that the author realizes that
> something is not needed or no longer used.
>
>
> The above were my impression after reading the announcement with
> respect to your influence and that goes far beyond "two cleanups".
> I bet many others read it roughly like I did.
Sam, in a way you actually prove my point. Thanks. :)
The primary thing you remember here from the announcements is the cleanup
and tuning part, during which he picked up some small parts from my patch.
That's what I was afraid of: most people won't realize what was added in
this process, and even if they notice it, Ingo describes it as somewhat
"similar", but actually "different". That part is pretty important to me,
but Ingo treats it more as a minor matter.
bye, Roman
Hi,
On Fri, 14 Sep 2007, Willy Tarreau wrote:
> On Aug 10th, I was disappointed to see that you still had not provided
> the critical information that Ingo had been asking to you for 9 days
> (cfs-sched-debug output). Your motivations in this work started to
> become a bit fuzzy to me, since people who behave like this generally
> do so to get all the lights on them and you really don't need this.
>
> Your explanation was kind of "show me yours and only then I'll show
> you mine". Pretty childish but you finally sent that long-requested
> information.
Well, I admit it was a rather fruitless attempt to get some information out
of Ingo, but I only did it _once_.
The problem is that the flow of information hasn't improved since; later
I actually answered his questions, but my information requests still go to
/dev/null.
> I'm now fairly convinced that you're not seeking credits either. There
> are more credits to your name per line of patch here than there is in
> your own code in the kernel. That complaint does not stand by itself.
>
> In fact, I'm beginning to think that you're like a cat who has found a mouse.
> Why kill it if you can play with it ? Each of your "will I get a response"
> are just like a small kick in the mouse's back to make it move. But by dint
> of doing this, you're slowly pushing the mouse to the door where it risks
> to escape from you, and you're losing your toy.
Getting credit is indeed not really that important to me, but apparently
some lousy credit notes are the only way to get any kind of
acknowledgement.
I want to get more attention, but not in a way you suspect. I don't think
that a mouse is a really good analogy, is he really that defenseless?
All I want is to be taken a bit more seriously, the communication aspect I
mentioned is really important. From my perspective Ingo is somewhere up on
his pedestal and I have to scream to get any kind of attention.
I skip the rest of the mail; it's one big attempt to prove that I'm
dishonest, but you only look at the issue from one side, thus making it
very easy for yourself.
bye, Roman
On Thu, 13 Sep 2007 18:50:12 +0200 (CEST)
Roman Zippel <[email protected]> wrote:
> > You never directly replied to these pretty explicit requests, all
> > you did was this side remark 5 days later in one of your patch
> > announcements:
>
> This is ridiculous, I asked you multiple times to explain to me some
> of the differences relative to CFS as response to the splitup
> requests. Not once did you react, you didn't even ask what I'd like
> to know specifically.
Roman,
this is... a strange comment. It almost sounds like you were holding
the splitup hostage depending on some other thing happening.... that's
not a good attitude in my book. Having big-blob patches that do many
things at the same time leads to them being impossible to apply. Linux
works by having smaller incrementals. You know that; you've been around
for a long time.
Complaining that someone finally did splitup work after you refused,
and even puts credit in for you... that's beyond my comprehension.
Sorry.
Hi,
On Fri, 14 Sep 2007, Arjan van de Ven wrote:
> > This is ridiculous, I asked you multiple times to explain to me some
> > of the differences relative to CFS as response to the splitup
> > requests. Not once did you react, you didn't even ask what I'd like
> > to know specifically.
>
> Roman,
>
> this is... a strange comment. It almost sounds like you were holding
> the splitup hostage depending on some other thing happening.... that's
> not a good attitude in my book. Having big-blob patches that do many
> things at the same time leads to them being impossible to apply. Linux
> works by having smaller incrementals. You know that; you've been around
> for a long time.
There is actually a very simple reason for that: the patch is not
my primary focus; for me it's more an afterthought of the
design, to show that it actually works.
My primary interest is a _discussion_ of the scheduler design, but Ingo
insists on patches. Sorry, but I don't really work this way, I want to
think things through _first_, I need a solid concept and I don't like to
rely on guesswork.
How much response would I have gotten if I had only posted the example
program and the math description as I initially planned?
bye, Roman
On Fri, 14 Sep 2007 16:50:22 +0200 (CEST)
Roman Zippel <[email protected]> wrote:
> Hi,
>
> On Fri, 14 Sep 2007, Arjan van de Ven wrote:
>
> > > This is ridiculous, I asked you multiple times to explain to me
> > > some of the differences relative to CFS as response to the splitup
> > > requests. Not once did you react, you didn't even ask what I'd
> > > like to know specifically.
> >
> > Roman,
> >
> > this is... a strange comment. It almost sounds like you were holding
> > the splitup hostage depending on some other thing happening....
> > that's not a good attitude in my book. Having big-blob patches that
> > do many things at the same time leads to them being impossible to
> > apply. Linux works by having smaller incrementals. You know that;
> > you've been around for a long time.
>
> There is actually a very simple reason for that, the actual patch is
> not my primary focus,
for someone who's not focused on patches/code, you make quite a bit of
noise when someone does turn your discussion into smaller patches and
only credits you three times.
Hi,
On Fri, 14 Sep 2007, Arjan van de Ven wrote:
> > There is actually a very simple reason for that, the actual patch is
> > not my primary focus,
>
>
> for someone who's not focused on patches/code, you make quite a bit of
> noise when someone does turn your discussion into smaller patches and
> only credits you three times.
As I said before, it's not really the lack of credit, it's the lack of
discussion.
bye, Roman
On Fri, Sep 14, 2007 at 03:10:47PM +0200, Roman Zippel wrote:
> Hi,
> Getting credit is indeed not really that important to me, but apparently
> some lousy credit notes is the only way to get any kind of
> acknowledgement.
Acknowledgement has always been one of the kernel's weaknesses it seems,
given the recent issues on other subjects. But it's not always easy either,
especially when you just change sparse parts of code based on someone else's
analysis. I personally do credit people in the GIT changelogs for their ideas
or patches, but that does not appear in the code.
> I want to get more attention, but not in a way you suspect. I don't think
> that a mouse is a really good analogy, is he really that defenseless?
I don't know. But I observe that you're very efficient at building the road
you want him to walk on.
> All I want is to be taken a bit more seriously, the communication aspect I
> mentioned is really important. From my perspective Ingo is somewhere up on
> his pedestal and I have to scream to get any kind of attention.
In my opinion, you're screaming in a language he does not understand, and
when he proposes random responses, you don't understand them either. That
game can last very long. You want to speak maths, he cannot. He wants to
speak patches, you cannot. I'm not saying one is better than the other, but
I know for sure that the common language here on LKML is patches. So my
conclusion is that you need someone to act as a translator when you want
to communicate here. It would be very hard work, BTW!
> I skip the rest of the mail, it's one big attempt trying to prove that I'm
> dishonest, but you only look at the issue from one side and thus making it
> yourself very easy.
Maybe there was a very prominent side then. I might be wrong in my analysis,
but I cannot find any other interpretation; there are too many coincidences,
and I don't believe in those, especially from smart people ;-)
Willy