2007-09-12 23:16:40

by Antoine Martin

Subject: CFS: some bad numbers with Java/database threading

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

Hi list,

I was working on some unit tests and thought I'd give CFS a whirl to see
whether it had any impact on my workloads (and what the fuss was about),
and I came up with some pretty disturbing numbers:
http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests-noload2.png
As above but also showing the load average:
http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests2.png
Looks like a regression to me...

Basically, all the previous kernels are pretty close (2.6.16 through to
2.6.20 performed almost identically to 2.6.22 and are not shown here to
avoid cluttering the graphs).

All the 2.6.23-rc kernels performed poorly (except -rc3!): much more
erratically and with a sharp performance drop above 800 threads. The
load starts to go up and the performance takes a nosedive.

With fewer threads (under 50) there is hardly any difference at all
between the kernels.

Notes about the tests and setup:
* environment is:
Dual Opteron 252 with 3GB RAM, SCSI disk, etc.
Sun Java 1.6
MySQL 5.0.44
JUnit + Ant + my test code (devloop.org.uk)
* Java threads are created first and the data is prepared, then all the
threads are started in a tight loop. Each thread runs multiple queries
with a 10ms pause (to allow the other threads to get scheduled) - a rough
sketch of this loop follows these notes
* load average is divided by the number of cpus (2)
* more general information (which also covers some irrelevant
information about some other tests I have published) is here:
http://devloop.org.uk/documentation/database-performance/Setup/
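
To make the loop structure concrete, here is a rough sketch of what each
worker thread does (the class and method names below are illustrative only,
they are not the actual test code):

import java.sql.Connection;
import java.util.List;

// Illustrative sketch only - not the actual devloop.org.uk test code.
class QueryWorker extends Thread {
    static final int ITERATIONS = 20;   // assumed per-thread loop count
    private final Connection conn;      // one JDBC connection per thread

    QueryWorker(Connection conn) { this.conn = conn; }

    public void run() {
        try {
            for (int i = 0; i < ITERATIONS; i++) {
                runQueries(conn);       // insert/select/update/delete on the test table
                Thread.sleep(10);       // 10ms pause so the other threads get scheduled
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private void runQueries(Connection conn) throws Exception {
        // placeholder for the JUnit/JDBC statements
    }

    // the data is prepared beforehand, then all threads are started in a tight loop:
    static void runAll(List<QueryWorker> workers) throws InterruptedException {
        for (QueryWorker w : workers)
            w.start();
        for (QueryWorker w : workers)
            w.join();
    }
}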

Don't shoot the messenger!
I can run some more tests if needed (bearing in mind that a full test
run takes a few hours) or you can run the tests yourself: instructions
on running the tests are included.
I am now re-testing with sched-cfs-v2.6.23-rc6-v21-combo-2.patch
but feel free to send some other patches.

Antoine
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFG6HH+GK2zHPGK1rsRCl3oAJ9c4crCtNQfGs9gWO7Y5CvcIno8TACbBPTw
0TEHkqLMGAfH0ILwWVKc0oo=
=1iBA
-----END PGP SIGNATURE-----


2007-09-13 07:19:39

by David Schwartz

Subject: RE: some bad numbers with Java/database threading


> I was working on some unit tests and thought I'd give CFS a whirl to see
> if it had any impact on my workloads (to see what the fuss was about),
> and I came up with some pretty disturbing numbers:
> http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests-noload2.png
> As above but also showing the load average:
> http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests2.png
> Looks like a regression to me...

I've tried reasonably diligently to figure out what the hell you're doing and gone through quite a bit of your documentation, and I just can't figure it out. This could entirely be the result of your test's sensitivity to execution order.

For example, if you run ten threads that all insert, query, and delete from the *same* table, then the exact interleaving pattern will determine the size of the results. A slight change in the scheduling quantum could multiply the size of the result data by a huge factor. There is a big difference between:

1) Thread A inserts data.
2) Thread A queries data.
3) Thread A deletes data.
4) Thread B inserts data.
...


and
1) Thread A inserts data.
2) Thread B inserts data.
...
101) Thread A queries data.
102) Thread B queries data.
...

Now, even if they're using separate tables, your test is still very sensitive to execution order. If thread A runs to completion and then thread B does, the database data will fit better into cache. If thread A runs partially, then thread B runs partially; when thread A runs again, its database stuff will not be hot.

>* java threads are created first and the data is prepared, then all the
>threads are started in a tight loop. Each thread runs multiple queries
>with a 10ms pause (to allow the other threads to get scheduled)

There are a number of ways you might be measuring nothing but how the scheduler chooses to interleave your threads. Benchmarking threads that yield suggests just this type of thing -- if a thread has useful work to do and another thread is not going to help it, *why* *yield*?

Are you worried the scheduler isn't going to schedule other threads?! Or is there some sane reason to force suboptimal scheduling when you're trying to benchmark a scheduler? Are you trying to see how it deals with pathological patterns? ;)

The only documentation I can see about what you're actually *doing* says things like "The schema and statements are almost identical to the non-threaded tests." Do you see why that's not helpful?

DS


2007-09-13 11:24:41

by Ingo Molnar

Subject: Re: CFS: some bad numbers with Java/database threading


* Antoine Martin <[email protected]> wrote:

> Basically, all the previous kernels are pretty close (2.6.16 through
> to 2.6.20 performed almost identically to 2.6.22 and are not shown
> here to avoid cluttering the graphs)
>
> All the 2.6.23-rc kernels performed poorly (except -rc3!): much more
> erratically and with a sharp performance drop above 800 threads. The
> load starts to go up and the performance takes a nosedive.

hm, could you try the patch below on top of 2.6.23-rc6 and do:

echo 1 > /proc/sys/kernel/sched_yield_bug_workaround

does this improve the numbers?

Ingo

-------------->
Subject: sched: yield debugging
From: Ingo Molnar <[email protected]>

introduce various sched_yield implementations:

# default one:
echo 0 > /proc/sys/kernel/sched_yield_bug_workaround

# always queues the current task next to the next task:
echo 1 > /proc/sys/kernel/sched_yield_bug_workaround

# NOP:
echo 2 > /proc/sys/kernel/sched_yield_bug_workaround

tunability depends on CONFIG_SCHED_DEBUG=y.

Not-yet-signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/sched.h | 2 +
kernel/sched_fair.c | 73 +++++++++++++++++++++++++++++++++++++++++++++-----
kernel/sysctl.c | 19 +++++++++++++
3 files changed, 88 insertions(+), 6 deletions(-)

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1392,10 +1392,12 @@ extern void sched_idle_next(void);
#ifdef CONFIG_SCHED_DEBUG
extern unsigned int sysctl_sched_latency;
extern unsigned int sysctl_sched_min_granularity;
+extern unsigned int sysctl_sched_yield_granularity;
extern unsigned int sysctl_sched_wakeup_granularity;
extern unsigned int sysctl_sched_batch_wakeup_granularity;
extern unsigned int sysctl_sched_stat_granularity;
extern unsigned int sysctl_sched_runtime_limit;
+extern unsigned int sysctl_sched_yield_bug_workaround;
extern unsigned int sysctl_sched_child_runs_first;
extern unsigned int sysctl_sched_features;
#endif
Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -42,6 +42,16 @@ const_debug unsigned int sysctl_sched_la
*/
const_debug unsigned int sysctl_sched_child_runs_first = 1;

+const_debug unsigned int sysctl_sched_yield_granularity = 10000000ULL;
+
+/*
+ * sys_sched_yield workaround switch.
+ *
+ * This option switches the yield implementation of the
+ * old scheduler back on.
+ */
+const_debug unsigned int sysctl_sched_yield_bug_workaround;
+
/*
* Minimal preemption granularity for CPU-bound tasks:
* (default: 2 msec, units: nanoseconds)
@@ -675,15 +685,66 @@ static void dequeue_task_fair(struct rq
*/
static void yield_task_fair(struct rq *rq, struct task_struct *p)
{
- struct cfs_rq *cfs_rq = task_cfs_rq(p);
+ if (!sysctl_sched_yield_bug_workaround) {
+ struct cfs_rq *cfs_rq = task_cfs_rq(p);
+ __update_rq_clock(rq);
+
+ /*
+ * Dequeue and enqueue the task to update its
+ * position within the tree:
+ */
+ dequeue_entity(cfs_rq, &p->se, 0);
+ enqueue_entity(cfs_rq, &p->se, 0);
+ return;
+ }
+
+ if (sysctl_sched_yield_bug_workaround == 1) {
+ struct cfs_rq *cfs_rq = task_cfs_rq(p);
+ struct rb_node *curr, *next, *first;
+ struct task_struct *p_next;
+ s64 yield_key;
+
+ __update_rq_clock(rq);
+ curr = &p->se.run_node;
+ first = first_fair(cfs_rq);
+ /*
+ * Move this task to the second place in the tree:
+ */
+ if (unlikely(curr != first)) {
+ next = first;
+ } else {
+ next = rb_next(curr);
+ /*
+ * We were the last one already - nothing to do, return
+ * and reschedule:
+ */
+ if (unlikely(!next))
+ return;
+ }
+
+ p_next = rb_entry(next, struct task_struct, se.run_node);
+ /*
+ * Minimally necessary key value to be the second in the tree:
+ */
+ yield_key = p_next->se.fair_key + (int)sysctl_sched_yield_granularity;
+
+ dequeue_entity(cfs_rq, &p->se, 0);
+
+ /*
+ * Only update the key if we need to move more backwards
+ * than the minimally necessary position to be the second:
+ */
+ if (p->se.fair_key < yield_key)
+ p->se.fair_key = yield_key;
+
+ __enqueue_entity(cfs_rq, &p->se);
+ return;
+ }

- __update_rq_clock(rq);
/*
- * Dequeue and enqueue the task to update its
- * position within the tree:
+ * Just reschedule, do nothing else:
*/
- dequeue_entity(cfs_rq, &p->se, 0);
- enqueue_entity(cfs_rq, &p->se, 0);
+ resched_task(p);
}

/*
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c
+++ linux/kernel/sysctl.c
@@ -244,6 +244,17 @@ static ctl_table kern_table[] = {
},
{
.ctl_name = CTL_UNNUMBERED,
+ .procname = "sched_yield_granularity_ns",
+ .data = &sysctl_sched_yield_granularity,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec_minmax,
+ .strategy = &sysctl_intvec,
+ .extra1 = &min_sched_granularity_ns,
+ .extra2 = &max_sched_granularity_ns,
+ },
+ {
+ .ctl_name = CTL_UNNUMBERED,
.procname = "sched_wakeup_granularity_ns",
.data = &sysctl_sched_wakeup_granularity,
.maxlen = sizeof(unsigned int),
@@ -266,6 +277,14 @@ static ctl_table kern_table[] = {
},
{
.ctl_name = CTL_UNNUMBERED,
+ .procname = "sched_yield_bug_workaround",
+ .data = &sysctl_sched_yield_bug_workaround,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
+ {
+ .ctl_name = CTL_UNNUMBERED,
.procname = "sched_child_runs_first",
.data = &sysctl_sched_child_runs_first,
.maxlen = sizeof(unsigned int),

2007-09-13 15:15:49

by Nick Piggin

Subject: Re: some bad numbers with Java/database threading

On Thursday 13 September 2007 17:18, David Schwartz wrote:
> > I was working on some unit tests and thought I'd give CFS a whirl to see
> > if it had any impact on my workloads (to see what the fuss was about),
> > and I came up with some pretty disturbing numbers:
> > http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests-noload2.png
> > As above but also showing the load average:
> > http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests2.png
> > Looks like a regression to me...
>
> I've tried reasonalby diligently to figure out what the hell you're doing

(cc's re-added; please reply to all when replying to lkml)

Hi David,

You might be sounding a bit too abrasive here... I understand you're
also trying to help, but your tone just might be taken the wrong way.

Antoine is really doing the right thing here: testing such a new feature
early, on the code he cares about as a user, and, most importantly,
reporting it here. This is probably the most useful resource we have in
Linux.

Maybe the workload is quirky, but regardless, if it is a *regression*
from a previous kernel then it is really important that it be brought to
our attention.


> and gone through quite a bit of your documentation, and I just can't figure
> it out. This could entirely be the result of your test's sensitivity to
> execution order.
>
> For example, if you run ten threads that all insert, query, and delete from
> the *same* table, then the exact interleaving pattern will determine the
> size of the results. A slight change in the scheduling quantum could
> multiply the size of the result data by a huge factor. There is a big
> difference between:
>
> 1) Thread A inserts data.
> 2) Thread A queries data.
> 3) Thread A deletes data.
> 4) Thread B inserts data.
> ...
>
>
> and
> 1) Thread A inserts data.
> 2) Thread B insers data.
> ...
> 101) Thread A queries data.
> 102) Thread B queries data.
> ...
>
> Now, even if they're using separate tables, your test is still very
> sensitive to execution order. If thread A runs to completion and then
> thread B does, the database data will fit better into cache. If thread A
> runs partially, then thread B runs partially, when thread A runs again, its
> database stuff will not be hot.
>
> >* java threads are created first and the data is prepared, then all the
> >threads are started in a tight loop. Each thread runs multiple queries
> >with a 10ms pause (to allow the other threads to get scheduled)
>
> There are a number of ways you might be measuring nothing but how the
> scheduler chooses to interleave your threads. Benchmarking threads that
> yield suggests just this type of thing -- if a thread has useful work to do
> and another thread is not going to help it, *why* *yield*?
>
> Are you worried the scheduler isn't going to schedule other threads?! Or is
> there some sane reason to force suboptimal scheduling when you're trying to
> benchmark a scheduler? Are you trying to see how it deals with pathological
> patterns? ;)
>
> The only documentation I can see about what you're actually *doing* says
> things like "The schema and statements are almost identical to the
> non-threaded tests." Do you see why that's not helpful?

2007-09-13 19:02:33

by Antoine Martin

Subject: Re: some bad numbers with Java/database threading

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

I've tested a couple more kernels: 2.6.23-rc4-mm1 and 2.6.23-rc6 with
the "sched_yield_bug_workaround" patch from Ingo, results are here:
http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests3-10msYield.png
Same, with the load average (dotted lines):
http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests3-10msYield-withload.png
Still slows down after 800 threads...

> On Thursday 13 September 2007 17:18, David Schwartz wrote:
>>> I was working on some unit tests and thought I'd give CFS a whirl to see
>>> if it had any impact on my workloads (to see what the fuss was about),
>>> and I came up with some pretty disturbing numbers:
>>> http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests-noload2.png
>>> As above but also showing the load average:
>>> http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests2.png
>>> Looks like a regression to me...
>> I've tried reasonalby diligently to figure out what the hell you're doing
>
> (cc's readded please reply to all when replying to lkml)
(...)
>> and gone through quite a bit of your documentation, and I just can't figure
>> it out.
Until I find enough time to update the docs, feel free to ask.

>> This could entirely be the result of your test's sensitivity to
>> execution order.
It very well might be, but this is still a visible regression.

>> For example, if you run ten threads that all insert, query, and delete from
>> the *same* table, then the exact interleaving pattern will determine the
>> size of the results. A slight change in the scheduling quantum could
>> multiply the size of the result data by a huge factor. There is a big
>> difference between:
>>
>> 1) Thread A inserts data.
>> 2) Thread A queries data.
>> 3) Thread A deletes data.
>> 4) Thread B inserts data.
>> ...
>>
>>
>> and
>> 1) Thread A inserts data.
>> 2) Thread B insers data.
>> ...
>> 101) Thread A queries data.
>> 102) Thread B queries data.
Thanks for the suggestion, I'll add this to the test series.

>> Now, even if they're using separate tables, your test is still very
>> sensitive to execution order. If thread A runs to completion and then
>> thread B does, the database data will fit better into cache. If thread A
>> runs partially, then thread B runs partially, when thread A runs again, its
>> database stuff will not be hot.
You are correct, it is very sensitive to the execution order and
caching. When I vary the thread pause, the total execution time varies
widely. 10ms just happens to be the sweet spot that provides the best
contrast in the results (for both kernels and RDBMSes).

>>> * java threads are created first and the data is prepared, then all the
>>> threads are started in a tight loop. Each thread runs multiple queries
>>> with a 10ms pause (to allow the other threads to get scheduled)
>> There are a number of ways you might be measuring nothing but how the
>> scheduler chooses to interleave your threads. Benchmarking threads that
>> yield suggests just this type of thing -- if a thread has useful work to do
>> and another thread is not going to help it, *why* *yield*?
As I said above, I have tried varying the pause and this is the sweet
spot. If you don't yield, the first hundred threads will be finished by
the time you start the 800th - which is not what you want.
I could try just yielding at the start, or only during the first
iteration, etc.; all suggestions are most welcome.
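
One such variant, purely as an illustration (not the actual test code):
pause only once per thread, at the start, so the whole batch gets started
before the queries begin, then run without further pauses:

// Illustrative variant only: pause once at the start, then run back to back.
class StaggeredWorker extends Thread {
    static final int ITERATIONS = 20;   // assumed per-thread loop count

    public void run() {
        try {
            Thread.sleep(10);           // pause once so all threads get started first
            for (int i = 0; i < ITERATIONS; i++)
                runQueries();           // then run the queries without further pauses
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private void runQueries() throws Exception {
        // placeholder for the JDBC statements
    }
}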

>> Are you worried the scheduler isn't going to schedule other threads?!
Yes!

>> Or is
>> there some sane reason to force suboptimal scheduling when you're trying to
>> benchmark a scheduler? Are you trying to see how it deals with pathological
>> patterns? ;)
I know it sounds sub-optimal, but this benchmark wasn't designed to test
the scheduler! It is meant to thrash just one database table. It isn't
meant to be anything like TPC either.
Previously it did uncover some very noticeable differences in JVM
performance when stress-tested with a large number of threads.

>> The only documentation I can see about what you're actually *doing* says
>> things like "The schema and statements are almost identical to the
>> non-threaded tests." Do you see why that's not helpful?
I sure do! I'll try to improve on that when I get a chance. Here is a
start: the schema is configurable via simple text files, and the
database layer translates it into the SQL syntax (and types) supported
by the backend (MySQL in this case):
# cat ManyThreadedCombinedTest.properties
table=update
columns=i1:integer,i2:integer,s1:varchar,s2:varchar
I'll make a click&run tarball if anyone is interested in running the
tests themselves (it outputs CSV data).

Thanks for the feedback.

Antoine
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFG6Yk8GK2zHPGK1rsRCtVaAKCAVyU4woztnl6q0NAZBNsM94I2JgCcD4M3
+MpR1gAom65Mn/tQX8ig1cI=
=Q3IY
-----END PGP SIGNATURE-----

2007-09-13 21:48:31

by David Schwartz

Subject: RE: some bad numbers with Java/database threading


First, let me apologize if the tone of my other post came through as angry or frustrated. Text can sometimes be misleading as to tone. I certainly wasn't angry.

Second, let me say that I'm definitely not suggesting that you were wrong to bring this to everyone's attention. Even if it turns out that your code is horribly broken and it's all your fault, any apparent regression still has to be investigated. It is virtually certain that we will learn something interesting about either your code, CFS, or both.

I was definitely *not* saying "how dare you challenge CFS' supremacy without a perfect test case".

Antoine Martin wrote:

> >> Now, even if they're using separate tables, your test is still very
> >> sensitive to execution order. If thread A runs to completion and then
> >> thread B does, the database data will fit better into cache.
> >> If thread A
> >> runs partially, then thread B runs partially, when thread A
> >> runs again, its
> >> database stuff will not be hot.

> You are correct, it is very sensitive to the execution order and
> caching. When I vary the thread pause, the total execution time varies
> widely. 10ms just happens to be the sweet spot which provides the best
> contrast in the results (for both kernels and rdbms)

> >> Or is
> >> there some sane reason to force suboptimal scheduling when
> >> you're trying to
> >> benchmark a scheduler? Are you trying to see how it deals with
> >> pathological
> >> patterns? ;)

> I know it sounds sub-optimal, but this benchmark wasn't designed to test
> the scheduler! It is meant to thrash just one database table. It isn't
> meant to be anything like TPC either.
> Previously it did uncover some very noticeable differences in JVM
> performance when stress-tested with a large amount of threads.

The problem is that because your access pattern is pathological, schedulers that are objectively worse may turn in much better performance. For example, a scheduler that completely ignores your attempt to sleep will probably perform significantly better than a scheduler that goes out of its way to honor it.

That the execution time is very sensitive to the pause is strong evidence of this.

The problem is simply that your test program doesn't do a fixed amount of work. It does a variable amount of work, and that amount depends upon scheduler details. It's like a job that has fifty threads that do this:

1) Increment a shared variable.
2) Do a math problem a number of times equal to the value of that shared variable.
3) Decrement the shared variable.

The problem is that the number of times the math problem is done depends upon the execution order of the threads. To be fair, you would need to benchmark how many times the math problem gets done, not how long the threads take to complete.

Now, imagine if I insert a yield between steps 1 and 2. The more a scheduler honors your yield request, the worse it will appear to perform. The scheduler that ignores it (which is legal, but definitely not The Right Thing) will seem to perform *much* better.
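
To make that concrete, here is a toy Java version of what I mean (entirely
made up, it is not your test):

import java.util.concurrent.atomic.AtomicInteger;

// Toy example: the amount of work each thread does grows with the number of
// threads currently inside the loop, so the total work done depends on how
// the scheduler interleaves them.
class VariableWork implements Runnable {
    static final AtomicInteger shared = new AtomicInteger();

    public void run() {
        int n = shared.incrementAndGet();   // 1) increment the shared variable
        Thread.yield();                     // the more this yield is honored, the more
                                            // threads overlap and the bigger n becomes
        // 2) do the math problem n times (one "problem" = a million sqrt ops)
        double x = 0;
        for (long i = 0; i < n * 1000000L; i++)
            x += Math.sqrt(i);
        shared.decrementAndGet();           // 3) decrement the shared variable
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] t = new Thread[50];
        long start = System.nanoTime();
        for (int i = 0; i < t.length; i++)
            (t[i] = new Thread(new VariableWork())).start();
        for (Thread th : t)
            th.join();
        System.out.println("elapsed ms: " + (System.nanoTime() - start) / 1000000);
    }
}

Timing how long the fifty threads take rewards the scheduler that ignores the
yield; counting how many work units actually got done would not.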

Of course, it's also possible that this is not what's going on.

DS


2007-09-14 08:33:07

by Ingo Molnar

Subject: Re: CFS: some bad numbers with Java/database threading


* Ingo Molnar <[email protected]> wrote:

> hm, could you try the patch below ontop of 2.6.23-rc6 and do:
>
> echo 1 > /proc/sys/kernel/sched_yield_bug_workaround
>
> does this improve the numbers?

the patch i sent was against CFS-devel. Could you try the one below,
which is against vanilla -rc6 - does it improve the numbers? (it should
have an impact) Keep CONFIG_SCHED_DEBUG=y to be able to twiddle the
sysctl.

Ingo

------------------------>
Subject: sched: yield debugging
From: Ingo Molnar <[email protected]>

introduce various sched_yield implementations:

# default one:
echo 0 > /proc/sys/kernel/sched_yield_bug_workaround

# always queues the current task next to the next task:
echo 1 > /proc/sys/kernel/sched_yield_bug_workaround

# NOP:
echo 2 > /proc/sys/kernel/sched_yield_bug_workaround

tunability depends on CONFIG_SCHED_DEBUG=y.

Not-yet-signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/sched.h | 2 +
kernel/sched_fair.c | 73 +++++++++++++++++++++++++++++++++++++++++++++-----
kernel/sysctl.c | 19 +++++++++++++
3 files changed, 88 insertions(+), 6 deletions(-)

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1402,10 +1402,12 @@ extern void sched_idle_next(void);

extern unsigned int sysctl_sched_latency;
extern unsigned int sysctl_sched_min_granularity;
+extern unsigned int sysctl_sched_yield_granularity;
extern unsigned int sysctl_sched_wakeup_granularity;
extern unsigned int sysctl_sched_batch_wakeup_granularity;
extern unsigned int sysctl_sched_stat_granularity;
extern unsigned int sysctl_sched_runtime_limit;
+extern unsigned int sysctl_sched_yield_bug_workaround;
extern unsigned int sysctl_sched_child_runs_first;
extern unsigned int sysctl_sched_features;

Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -42,6 +42,16 @@ unsigned int sysctl_sched_latency __read
*/
unsigned int sysctl_sched_min_granularity __read_mostly = 2000000ULL;

+unsigned int sysctl_sched_yield_granularity __read_mostly = 10000000ULL;
+
+/*
+ * sys_sched_yield workaround switch.
+ *
+ * This option switches the yield implementation of the
+ * old scheduler back on.
+ */
+unsigned int __read_mostly sysctl_sched_yield_bug_workaround;
+
/*
* SCHED_BATCH wake-up granularity.
* (default: 25 msec, units: nanoseconds)
@@ -901,15 +911,66 @@ static void dequeue_task_fair(struct rq
*/
static void yield_task_fair(struct rq *rq, struct task_struct *p)
{
- struct cfs_rq *cfs_rq = task_cfs_rq(p);
+ if (!sysctl_sched_yield_bug_workaround) {
+ struct cfs_rq *cfs_rq = task_cfs_rq(p);
+ __update_rq_clock(rq);
+
+ /*
+ * Dequeue and enqueue the task to update its
+ * position within the tree:
+ */
+ dequeue_entity(cfs_rq, &p->se, 0);
+ enqueue_entity(cfs_rq, &p->se, 0);
+ return;
+ }
+
+ if (sysctl_sched_yield_bug_workaround == 1) {
+ struct cfs_rq *cfs_rq = task_cfs_rq(p);
+ struct rb_node *curr, *next, *first;
+ struct task_struct *p_next;
+ s64 yield_key;
+
+ __update_rq_clock(rq);
+ curr = &p->se.run_node;
+ first = first_fair(cfs_rq);
+ /*
+ * Move this task to the second place in the tree:
+ */
+ if (unlikely(curr != first)) {
+ next = first;
+ } else {
+ next = rb_next(curr);
+ /*
+ * We were the last one already - nothing to do, return
+ * and reschedule:
+ */
+ if (unlikely(!next))
+ return;
+ }
+
+ p_next = rb_entry(next, struct task_struct, se.run_node);
+ /*
+ * Minimally necessary key value to be the second in the tree:
+ */
+ yield_key = p_next->se.fair_key + (int)sysctl_sched_yield_granularity;
+
+ dequeue_entity(cfs_rq, &p->se, 0);
+
+ /*
+ * Only update the key if we need to move more backwards
+ * than the minimally necessary position to be the second:
+ */
+ if (p->se.fair_key < yield_key)
+ p->se.fair_key = yield_key;
+
+ __enqueue_entity(cfs_rq, &p->se);
+ return;
+ }

- __update_rq_clock(rq);
/*
- * Dequeue and enqueue the task to update its
- * position within the tree:
+ * Just reschedule, do nothing else:
*/
- dequeue_entity(cfs_rq, &p->se, 0);
- enqueue_entity(cfs_rq, &p->se, 0);
+ resched_task(p);
}

/*
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c
+++ linux/kernel/sysctl.c
@@ -244,6 +244,17 @@ static ctl_table kern_table[] = {
},
{
.ctl_name = CTL_UNNUMBERED,
+ .procname = "sched_yield_granularity_ns",
+ .data = &sysctl_sched_yield_granularity,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec_minmax,
+ .strategy = &sysctl_intvec,
+ .extra1 = &min_sched_granularity_ns,
+ .extra2 = &max_sched_granularity_ns,
+ },
+ {
+ .ctl_name = CTL_UNNUMBERED,
.procname = "sched_wakeup_granularity_ns",
.data = &sysctl_sched_wakeup_granularity,
.maxlen = sizeof(unsigned int),
@@ -288,6 +299,14 @@ static ctl_table kern_table[] = {
},
{
.ctl_name = CTL_UNNUMBERED,
+ .procname = "sched_yield_bug_workaround",
+ .data = &sysctl_sched_yield_bug_workaround,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
+ {
+ .ctl_name = CTL_UNNUMBERED,
.procname = "sched_child_runs_first",
.data = &sysctl_sched_child_runs_first,
.maxlen = sizeof(unsigned int),

2007-09-14 10:07:12

by Satyam Sharma

Subject: Re: CFS: some bad numbers with Java/database threading

Hi Antoine, Ingo,


On 9/14/07, Ingo Molnar <[email protected]> wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
> > hm, could you try the patch below ontop of 2.6.23-rc6 and do:
> >
> > echo 1 > /proc/sys/kernel/sched_yield_bug_workaround
> >
> > does this improve the numbers?

Hmm, I know diddly about Java, and I don't want to preempt Antoine's next
test, but I noticed that he uses Thread.sleep() in his test code and not
Thread.yield(), so it would be interesting if Antoine could test with this
patch and report whether something shows up ...


> the patch i sent was against CFS-devel. Could you try the one below,
> which is against vanilla -rc6, does it improve the numbers? (it should
> have an impact) Keep CONFIG_SCHED_DEBUG=y to be able to twiddle the
> sysctl.


On 9/13/07, Antoine Martin <[email protected]> wrote:
>
> All the 2.6.23-rc kernels performed poorly (except -rc3!):

This is an interesting data point, IMHO ... considering these tests are long,
I suspect you ran them only once per kernel. So I wonder how reliable
that -rc3 test point is. If this oddity is reproducible, it would be great if
you could git-bisect:

1. between 23-rc1 and 23-rc3, and find out which commit led to the
improvement in performance, and,
2. between 23-rc3 and 23-rc6, and find out which commit brought down
the numbers again.

[ http://www.kernel.org/pub/software/scm/git/docs/git-bisect.html,
git-bisect is easy and amazingly helpful on certain occasions. ]


> Notes about the tests and setup:
> * environment is:
> Dual Opteron 252 with 3GB ram, scsi disk, etc..
> Sun Java 1.6
> MySQL 5.0.44
> Junit + ant + my test code (devloop.org.uk)
> * java threads are created first and the data is prepared, then all the
> threads are started in a tight loop. Each thread runs multiple queries
> with a 10ms pause (to allow the other threads to get scheduled)

Don't know much about CFS either, but does that constant "10 ms" sleep
somehow lead to evil synchronization issues between the test threads?
Does randomizing that time (say from 2-20 ms) lead to different numbers?


> * load average is divided by the number of cpus (2)
> * more general information (which also covers some irrelevant
> information about some other tests I have published) is here:
> http://devloop.org.uk/documentation/database-performance/Setup/


Thanks,

Satyam

2007-09-14 15:25:35

by Antoine Martin

Subject: Re: CFS: some bad numbers with Java/database threading [FIXED]

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

Satyam Sharma wrote:
> Hi Antoine, Ingo,
> On 9/14/07, Ingo Molnar <[email protected]> wrote:
>> * Ingo Molnar <[email protected]> wrote:
>>
>>> hm, could you try the patch below ontop of 2.6.23-rc6 and do:
>>>
>>> echo 1 > /proc/sys/kernel/sched_yield_bug_workaround
>>>
>>> does this improve the numbers?
>
> Hmm, I know diddly about Java, and I don't want to preempt Antoine's next
> test, but I noticed that he uses Thread.sleep() in his testcode and not the
> Thread.yield() so it would be interesting if Antoine can test with this patch
> and report if something shows up ..
See below...
I'll add a new test using yield() and see what that does.

>> the patch i sent was against CFS-devel. Could you try the one below,
>> which is against vanilla -rc6, does it improve the numbers? (it should
>> have an impact) Keep CONFIG_SCHED_DEBUG=y to be able to twiddle the
>> sysctl.
It looks good now! Updated results here:
http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests5-10msYield-noload.png
http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests5-10msYield.png
Compared with more kernels here - a bit more cluttered:
http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests4-10msYield-noload.png

Thanks Ingo!
Does this mean that I'll have to keep doing:
echo 1 > /proc/sys/kernel/sched_yield_bug_workaround
Or are you planning on finding a more elegant solution?
# find /proc -name "*workaround*"
/proc/sys/kernel/sched_yield_bug_workaround
/proc/sys/net/ipv4/tcp_workaround_signed_windows

> On 9/13/07, Antoine Martin <[email protected]> wrote:
>> All the 2.6.23-rc kernels performed poorly (except -rc3!):
>
> This is an interesting data point, IMHO ... considering these tests are long,
> I suspect you ran them only once each per kernel. So I wonder how reliable
> that -rc3 testpoint is. If this oddity is reproducible, it would be great if you
> could git-bisect:
Yeah, I thought that was quite suspicious.
-rc2 is just like -rc1 (see above), so I'll re-test -rc3 first;
git-bisect could take a while with those tests... just wiping the disk
between tests takes about 30 minutes.

>> * java threads are created first and the data is prepared, then all the
>> threads are started in a tight loop. Each thread runs multiple queries
>> with a 10ms pause (to allow the other threads to get scheduled)
>
> Don't know much about CFS either, but does that constant "10 ms" sleep
> somehow lead to evil synchronization issues between the test threads?
> Does randomizing that time (say from 2-20 ms) lead to different numbers?
I've tested before with varying timings, but I had not thought of using
a randomized delay.
Will add that too.

Many thanks to you all for the feedback!
Antoine
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFG6qfkGK2zHPGK1rsRCgeEAJ9HUUtHUNScvTVKo5z2sSmo+G+BVgCfdYmK
rcd1VYUuzQA2oFEmakjZxgM=
=jmI8
-----END PGP SIGNATURE-----

2007-09-14 15:32:29

by Ingo Molnar

Subject: Re: CFS: some bad numbers with Java/database threading [FIXED]


* Antoine Martin <[email protected]> wrote:

> >> have an impact) Keep CONFIG_SCHED_DEBUG=y to be able to twiddle the
> >> sysctl.
> It looks good now! Updated results here:
> http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests5-10msYield-noload.png
> http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests5-10msYield.png
> Compared with more kernels here - a bit more cluttered:
> http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests4-10msYield-noload.png
>
> Thanks Ingo!
> Does this mean that I'll have to keep doing:
> echo 1 > /proc/sys/kernel/sched_yield_bug_workaround
> Or are you planning on finding a more elegant solution?

just to make sure - can you get it to work fast with the
-rc6+yield-patch solution too? (i.e. not CFS-devel) We need a (tested)
solution for 2.6.23 and the CFS-devel patches are not for 2.6.23. I've
attached below the latest version of the -rc6 yield patch - the switch
no longer depends on SCHED_DEBUG and is always available.

Ingo

------------------->
Subject: sched: yield workaround
From: Ingo Molnar <[email protected]>

sched_yield() is fundamentally broken, and CFS has changed
its behavior. Some apps that mistakenly rely on sched_yield()
want "weak" sched yield (such as desktop apps) - while some
apps want "strong" sched_yield() (for example some JDKs).

There's no way for the scheduler to figure out which of the
two variants the app really wants - because sched_yield() is
all about hiding from the kernel the true structure of the
user-space locking code.

As a solution, provide a workaround that introduces a more
aggressive sched_yield implementation:

# default one:
echo 0 > /proc/sys/kernel/sched_yield_bug_workaround

# always queues the current task next to the next task:
echo 1 > /proc/sys/kernel/sched_yield_bug_workaround

# NOP:
echo 2 > /proc/sys/kernel/sched_yield_bug_workaround

in the future, the use of this sysctl might generate
a deprecation warning, so that apps start moving away
from their reliance on sched_yield().

Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/sched.h | 2 +
kernel/sched_fair.c | 73 +++++++++++++++++++++++++++++++++++++++++++++-----
kernel/sysctl.c | 19 +++++++++++++
3 files changed, 88 insertions(+), 6 deletions(-)

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1402,10 +1402,12 @@ extern void sched_idle_next(void);

extern unsigned int sysctl_sched_latency;
extern unsigned int sysctl_sched_min_granularity;
+extern unsigned int sysctl_sched_yield_granularity;
extern unsigned int sysctl_sched_wakeup_granularity;
extern unsigned int sysctl_sched_batch_wakeup_granularity;
extern unsigned int sysctl_sched_stat_granularity;
extern unsigned int sysctl_sched_runtime_limit;
+extern unsigned int sysctl_sched_yield_bug_workaround;
extern unsigned int sysctl_sched_child_runs_first;
extern unsigned int sysctl_sched_features;

Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -42,6 +42,16 @@ unsigned int sysctl_sched_latency __read
*/
unsigned int sysctl_sched_min_granularity __read_mostly = 2000000ULL;

+unsigned int sysctl_sched_yield_granularity __read_mostly = 10000000ULL;
+
+/*
+ * sys_sched_yield workaround switch.
+ *
+ * This option switches the yield implementation of the
+ * old scheduler back on.
+ */
+unsigned int __read_mostly sysctl_sched_yield_bug_workaround;
+
/*
* SCHED_BATCH wake-up granularity.
* (default: 25 msec, units: nanoseconds)
@@ -901,15 +911,66 @@ static void dequeue_task_fair(struct rq
*/
static void yield_task_fair(struct rq *rq, struct task_struct *p)
{
- struct cfs_rq *cfs_rq = task_cfs_rq(p);
+ if (!sysctl_sched_yield_bug_workaround) {
+ struct cfs_rq *cfs_rq = task_cfs_rq(p);
+ __update_rq_clock(rq);
+
+ /*
+ * Dequeue and enqueue the task to update its
+ * position within the tree:
+ */
+ dequeue_entity(cfs_rq, &p->se, 0);
+ enqueue_entity(cfs_rq, &p->se, 0);
+ return;
+ }
+
+ if (sysctl_sched_yield_bug_workaround == 1) {
+ struct cfs_rq *cfs_rq = task_cfs_rq(p);
+ struct rb_node *curr, *next, *first;
+ struct task_struct *p_next;
+ s64 yield_key;
+
+ __update_rq_clock(rq);
+ curr = &p->se.run_node;
+ first = first_fair(cfs_rq);
+ /*
+ * Move this task to the second place in the tree:
+ */
+ if (curr != first)
+ next = rb_next(curr);
+ else
+ next = first;
+ /*
+ * We were the last one already - nothing to do, return
+ * and reschedule:
+ */
+ if (unlikely(!next))
+ return;
+
+ p_next = rb_entry(next, struct task_struct, se.run_node);
+ /*
+ * Minimally necessary key value to be the second in the tree:
+ */
+ yield_key = p_next->se.fair_key +
+ (int)sysctl_sched_yield_granularity;
+
+ dequeue_entity(cfs_rq, &p->se, 0);
+
+ /*
+ * Only update the key if we need to move more backwards
+ * than the minimally necessary position to be the second:
+ */
+ if (p->se.fair_key < yield_key)
+ p->se.fair_key = yield_key;
+
+ __enqueue_entity(cfs_rq, &p->se);
+ return;
+ }

- __update_rq_clock(rq);
/*
- * Dequeue and enqueue the task to update its
- * position within the tree:
+ * Just reschedule, do nothing else:
*/
- dequeue_entity(cfs_rq, &p->se, 0);
- enqueue_entity(cfs_rq, &p->se, 0);
+ resched_task(p);
}

/*
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c
+++ linux/kernel/sysctl.c
@@ -244,6 +244,17 @@ static ctl_table kern_table[] = {
},
{
.ctl_name = CTL_UNNUMBERED,
+ .procname = "sched_yield_granularity_ns",
+ .data = &sysctl_sched_yield_granularity,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec_minmax,
+ .strategy = &sysctl_intvec,
+ .extra1 = &min_sched_granularity_ns,
+ .extra2 = &max_sched_granularity_ns,
+ },
+ {
+ .ctl_name = CTL_UNNUMBERED,
.procname = "sched_wakeup_granularity_ns",
.data = &sysctl_sched_wakeup_granularity,
.maxlen = sizeof(unsigned int),
@@ -288,6 +299,14 @@ static ctl_table kern_table[] = {
},
{
.ctl_name = CTL_UNNUMBERED,
+ .procname = "sched_yield_bug_workaround",
+ .data = &sysctl_sched_yield_bug_workaround,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
+ {
+ .ctl_name = CTL_UNNUMBERED,
.procname = "sched_child_runs_first",
.data = &sysctl_sched_child_runs_first,
.maxlen = sizeof(unsigned int),

2007-09-14 16:01:38

by Satyam Sharma

Subject: Re: CFS: some bad numbers with Java/database threading

[ Argh, just noticed this thread got broken and had been living a parallel
life due to Subject: changes, dropped Cc:'s, and munged In-Reply-To:'s.
Adding back all interested folk here. ]


> Hi Antoine, Ingo,
>
>
> On 9/14/07, Ingo Molnar <[email protected]> wrote:
> >
> > * Ingo Molnar <[email protected]> wrote:
> >
> > > hm, could you try the patch below ontop of 2.6.23-rc6 and do:
> > >
> > > echo 1 > /proc/sys/kernel/sched_yield_bug_workaround
> > >
> > > does this improve the numbers?
>
> Hmm, I know diddly about Java, and I don't want to preempt Antoine's next
> test, but I noticed that he uses Thread.sleep() in his testcode and not the
> Thread.yield() so it would be interesting if Antoine can test with this patch
> and report if something shows up ...

Ok, it appears Antoine tested with the patch and got:

http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests3-10msYield.png
http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests3-10msYield-withload.png

which leads me to believe this probably wasn't a yield problem after all, though
it would still be useful if someone with more knowledge of Java could give that
code a look over ...

Curiously, the -rc3 oddity is still plainly visible there -- how do we
explain that?
Ingo, does that oddity (and the commits that went in around -rc3 time) give some
clue as to the behaviour / characteristics of Antoine's workloads?


> > the patch i sent was against CFS-devel. Could you try the one below,
> > which is against vanilla -rc6, does it improve the numbers? (it should
> > have an impact) Keep CONFIG_SCHED_DEBUG=y to be able to twiddle the
> > sysctl.
>
>
> On 9/13/07, Antoine Martin <[email protected]> wrote:
> >
> > All the 2.6.23-rc kernels performed poorly (except -rc3!):
>
> This is an interesting data point, IMHO ... considering these tests are long,
> I suspect you ran them only once each per kernel. So I wonder how reliable
> that -rc3 testpoint is. If this oddity is reproducible,

Ok, it's reproducible, making our job easier. Which also means, Antoine,
please do try the following git-bisecting:


> 1. between 23-rc1 and 23-rc3, and find out which commit led to the
> improvement in performance, and,
> 2. between 23-rc3 and 23-rc6, and find out which commit brought down
> the numbers again.
>
> [ http://www.kernel.org/pub/software/scm/git/docs/git-bisect.html,
> git-bisect is easy and amazingly helpful on certain occasions. ]

I don't have access to any real/meaningful SMP systems, so I wonder how much
sense it makes in practical terms for me to try and run these tests locally on
my little boxen ... it would be helpful if someone with 4/8 CPU systems could
give Antoine's testsuite a whirl :-)


> > Notes about the tests and setup:
> > * environment is:
> > Dual Opteron 252 with 3GB ram, scsi disk, etc..
> > Sun Java 1.6
> > MySQL 5.0.44
> > Junit + ant + my test code (devloop.org.uk)
> > * java threads are created first and the data is prepared, then all the
> > threads are started in a tight loop. Each thread runs multiple queries
> > with a 10ms pause (to allow the other threads to get scheduled)
>
> Don't know much about CFS either, but does that constant "10 ms" sleep
> somehow lead to evil synchronization issues between the test threads?
> Does randomizing that time (say from 2-20 ms) lead to different numbers?

Umm, you mention _changing_ this value earlier, but it still remains the same
for every thread during every loop for a given test run -- what I'm suggesting
is making that code do something like: Thread.sleep(random(x, y)); where x=2,
y=20 and random(x, y) returns a random integer between x and y, so all threads
sleep for different durations in every loop but still average out to about
~10 ms over a period. Try varying x and y (to vary the average) and post the
resulting graphs too? CONFIG_HZ (actually, the full .config) and dmesg might
be useful for us as well.
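
For the randomized pause, something along these lines, purely as an
illustration (this is not your actual code):

import java.util.Random;

// Illustrative only: replace the fixed Thread.sleep(10) with a random pause
// between 2 and 20 ms so the threads do not wake up in lockstep.
class RandomPause {
    private static final Random rnd = new Random();

    static void pause() throws InterruptedException {
        Thread.sleep(2 + rnd.nextInt(19));   // uniform in [2, 20] ms
    }
}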

Also, like David mentioned, counting the _number_ of times the test threads
managed to execute those SQL queries is probably a better benchmark than
measuring the time it takes for threads to finish execution itself -- uniformity
in that number across threads would bring out how "fair" CFS is compared to
previous kernels, for one ...

And finally I need a clarification from you: from the code that I
read, it appears
you have *different* threads for purely inserting/selecting/updating/deleting
records from the table, right? So I think David was barking up the wrong tree
in that other reply there, where he said the *same* thread needs to execute
multiple queries on the same data, and therefore your test code is susceptible
to cache hotness and thread execution order effects ... but I don't see any
such pathologies. I could be wrong, of course ...

Which brings us to another issue -- how well does the testsuite capture real
world workloads? Wouldn't multiple threads in the real world that arbitrarily
execute insert/update/select/delete queries on the same table also need to
implement some form of locking? How would that affect the numbers?


> > * load average is divided by the number of cpus (2)
> > * more general information (which also covers some irrelevant
> > information about some other tests I have published) is here:
> > http://devloop.org.uk/documentation/database-performance/Setup/
>
>
> Thanks,
>
> Satyam
>

2007-09-14 16:08:40

by Satyam Sharma

Subject: Re: CFS: some bad numbers with Java/database threading

On 9/14/07, Satyam Sharma <[email protected]> wrote:
>
[snip]

Ick, threading breaks in Gmail with Subject: changes, so I missed the latest
updates on this thread ... oh well, never mind.

Satyam

2007-09-17 12:17:19

by Antoine Martin

[permalink] [raw]
Subject: Re: CFS: some bad numbers with Java/database threading

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

Satyam Sharma wrote:
> I don't have access to any real/meaningful SMP systems, so I wonder
> how much sense it makes in practical terms for me to try and run these
> tests locally on my little boxen ... would be helpful if someone with
> 4/8 CPU systems could give Antoine's testsuite a whirl :-)
It should still work... if you're patient.

[snip] covered in separate mail
> CONFIG_HZ (actually, full .config) and dmesg might be
> useful for us as well.
http://devloop.org.uk/documentation/database-performance/dmesg-dualopteron.gz
CONFIG_HZ=1000

> Also, like David mentioned, counting the _number_ of times the test
> threads managed to execute those SQL queries is probably a better
> benchmark than measuring the time it takes for threads to finish
> execution itself -- uniformity in that number across threads would
> bring out how "fair" CFS is compared to previous kernels, for one ...
Good idea, I've added code to capture more fairness-oriented attributes
(a rough sketch follows the list below):
* time until the first thread finishes - this should actually be higher
when the scheduler is fairer.
* total number of loop iterations executed when the first thread
finishes (higher is better)
etc.
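
Roughly along these lines (the names below are illustrative, not the actual
test code):

import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of the fairness counters:
class FairnessStats {
    static final AtomicLong iterations = new AtomicLong();   // total loop iterations so far
    static final AtomicBoolean firstFinished = new AtomicBoolean(false);
    static volatile long startTime;             // set just before the threads are started
    static volatile long timeToFirstFinish;     // ms until the first thread exits
    static volatile long iterationsAtFirstFinish;

    static void loopIterationDone() {
        iterations.incrementAndGet();
    }

    static void threadFinished() {
        if (firstFinished.compareAndSet(false, true)) {
            timeToFirstFinish = System.currentTimeMillis() - startTime;
            iterationsAtFirstFinish = iterations.get();
        }
    }
}

Each worker calls loopIterationDone() after every iteration and
threadFinished() when it exits; a fairer scheduler should give a higher
timeToFirstFinish and a higher iterationsAtFirstFinish.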

> And finally I need a clarification from you: from the code that I
> read, it appears you have *different* threads for purely
> inserting/selecting/updating/deleting records from the table, right?
Correct.

> So I think David was barking up the wrong tree
> in that other reply there, where he said the *same* thread needs to
> execute multiple queries on the same data, and therefore your test
> code is susceptible to cache hotness and thread execution order
> effects ... but I don't see any such pathologies. I could be wrong, of
> course ...
I think he does have a valid point:
maybe the locality of the Java threads is important, but there are so
many of them and so many layers below them (JDBC pool, backend database
threads, filesystem, device) that the various caches get thrashed quite
often anyway, no matter what the interleaving factor is.
Which means that being fairer in this case also means switching threads
more often and hitting the caches' capacity limits earlier.
In which case the granularity and CONFIG_HZ should have a noticeable
impact. Ingo did suggest tweaking
/proc/sys/kernel/sched_yield_granularity_ns, which I did, but this
particular test timed out when it ran over the weekend... will run it
again now. (results in a couple of days)

Maybe the granularity should auto-adjust to prevent such cases?
(probably not)
Granularity (and therefore fairness) becomes less important as the number
of threads becomes very large, as fairness starts to adversely affect
throughput?

I have changed my mind and now think that this is not a regression, or
at least not one we should worry too much about. As David said, this
workload is "pathological".

And CFS does show some noticeable improvements with the new
measurements (now using a shorter thread yield of 5ms and a higher
number of loop iterations per thread: 25):
It does not adversely affect throughput as much (will test with more
threads later):
http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests8-5msYield.png
The number of threads that exit before we reach half the total number of
loop iterations is lower and more predictable (meaning a more even
distribution between all the threads):
http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests8-5msYield-ThreadsFinished.png
And the time it takes to reach this half way point is also more consistent:
http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests8-5msYield-TimeToHalfWay.png

"2.6.23-rc6-yield2" is the yield patch meant for 2.6.23,
"2.6.23-rc6-yield3" is the one that is not meant to be merged.
Both include:
echo 1 > /proc/sys/kernel/sched_yield_bug_workaround

The high-latency case (coming soon) uses HZ=250 and no preempt, and
doubles /proc/sys/kernel/sched_yield_granularity_ns.

> Which brings us to another issue -- how well does the testsuite
> capture real world workloads?
Not well... This particular test is not meant to represent any real-life
scenario - it is very, very hard not to have a bias
(that is why there are so many variations of TPC: TPC-W, TPC-H, ...).
It is just meant to try to measure the impact of changing individual
components (anything from kernels, databases, threads, coding
techniques, ...) one at a time, running a variety of scenarios and
spotting any anomalies. More often than not, the anomalies will tell you
about which combinations to avoid, and not about unexpected improvements.

> Wouldn't multiple threads in the real world that arbitrarily
> execute insert/update/select/delete queries on the same table also
> need to implement some form of locking?
Generally yes, but not always - I have a personal preference for
lock-less algorithms (like variations on optimistic locking) because
they're a lot easier to get right (and understand), but they're not
always suitable (mostly because of ordering/timing/QoS issues).
But I digress.

> How would that affect the numbers?
Hmmm, not entirely sure. If the locking is implemented using database
transactions, the threads would spend a bit longer in a wait state whilst
the database does its stuff (even more so for the databases that do not
use row-level locking) and would switch tasks less, so it might actually
help throughput in some cases (as long as the workload is CPU-bound
rather than I/O-bound)?
We'll know more when I get the results for the high-latency case.

Antoine
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFG7nBCGK2zHPGK1rsRCrs3AJ9e5Ye1KVgydd06aD6akoLe5Z2RYQCfQey2
PpQYOdsvdo5bLE68ro/KFbE=
=c21f
-----END PGP SIGNATURE-----

2007-09-18 21:29:36

by Chuck Ebbert

Subject: Re: CFS: some bad numbers with Java/database threading [FIXED]

On 09/14/2007 11:32 AM, Ingo Molnar wrote:
> * Antoine Martin <[email protected]> wrote:
>
>>>> have an impact) Keep CONFIG_SCHED_DEBUG=y to be able to twiddle the
>>>> sysctl.
>> It looks good now! Updated results here:
>> http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests5-10msYield-noload.png
>> http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests5-10msYield.png
>> Compared with more kernels here - a bit more cluttered:
>> http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests4-10msYield-noload.png
>>
>> Thanks Ingo!
>> Does this mean that I'll have to keep doing:
>> echo 1 > /proc/sys/kernel/sched_yield_bug_workaround
>> Or are you planning on finding a more elegant solution?
>
> just to make sure - can you get it to work fast with the
> -rc6+yield-patch solution too? (i.e. not CFS-devel) We need a (tested)
> solution for 2.6.23 and the CFS-devel patches are not for 2.6.23. I've
> attached below the latest version of the -rc6 yield patch - the switch
> is not dependent on SCHED_DEBUG anymore but always available.
>

Is this going to be merged? And will you be making the default == 1 or
just leaving it at 0, which forces people who want the older behavior to
modify the default?

2007-09-18 22:50:49

by Ingo Molnar

Subject: Re: CFS: some bad numbers with Java/database threading [FIXED]


* Chuck Ebbert <[email protected]> wrote:

> On 09/14/2007 11:32 AM, Ingo Molnar wrote:
> > * Antoine Martin <[email protected]> wrote:
> >
> >>>> have an impact) Keep CONFIG_SCHED_DEBUG=y to be able to twiddle the
> >>>> sysctl.
> >> It looks good now! Updated results here:
> >> http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests5-10msYield-noload.png
> >> http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests5-10msYield.png
> >> Compared with more kernels here - a bit more cluttered:
> >> http://devloop.org.uk/documentation/database-performance/Linux-Kernels/Kernels-ManyThreads-CombinedTests4-10msYield-noload.png
> >>
> >> Thanks Ingo!
> >> Does this mean that I'll have to keep doing:
> >> echo 1 > /proc/sys/kernel/sched_yield_bug_workaround
> >> Or are you planning on finding a more elegant solution?
> >
> > just to make sure - can you get it to work fast with the
> > -rc6+yield-patch solution too? (i.e. not CFS-devel) We need a (tested)
> > solution for 2.6.23 and the CFS-devel patches are not for 2.6.23. I've
> > attached below the latest version of the -rc6 yield patch - the switch
> > is not dependent on SCHED_DEBUG anymore but always available.
> >
>
> Is this going to be merged? And will you be making the default == 1 or
> just leaving it at 0, which forces people who want the older behavior
> to modify the default?

not at the moment - Antoine suggested that the workload is probably fine
and the patch against -rc6 would have no clear effect anyway so we have
nothing to merge right now. (Note that there's no "older behavior"
possible, unless we want to emulate all of the O(1) scheduler's
behavior.) But ... we could still merge something like that patch, but a
clearer testcase is needed. The JVMs i have access to work fine.

Ingo

2007-09-18 23:02:31

by Chuck Ebbert

Subject: Re: CFS: some bad numbers with Java/database threading [FIXED]

On 09/18/2007 06:46 PM, Ingo Molnar wrote:
>>> We need a (tested)
>>> solution for 2.6.23 and the CFS-devel patches are not for 2.6.23. I've
>>> attached below the latest version of the -rc6 yield patch - the switch
>>> is not dependent on SCHED_DEBUG anymore but always available.
>>>
>> Is this going to be merged? And will you be making the default == 1 or
>> just leaving it at 0, which forces people who want the older behavior
>> to modify the default?
>
> not at the moment - Antoine suggested that the workload is probably fine
> and the patch against -rc6 would have no clear effect anyway so we have
> nothing to merge right now. (Note that there's no "older behavior"
> possible, unless we want to emulate all of the O(1) scheduler's
> behavior.) But ... we could still merge something like that patch, but a
> clearer testcase is needed. The JVM's i have access to work fine.

I just got a bug report today:

https://bugzilla.redhat.com/show_bug.cgi?id=295071

==================================================

Description of problem:

The CFS scheduler does not seem to implement sched_yield correctly. If one
program loops on sched_yield and another program prints out timing
information in a loop, you will see that when both are taskset to the same core
the timing stats are twice as long as when they are on different cores.
This problem was not in 2.6.21-1.3194 but showed up in 2.6.22.4-65 and continues
in the newest released kernel 2.6.22.5-76.

Version-Release number of selected component (if applicable):

2.6.22.4-65 through 2.6.22.5-76

How reproducible:

Very

Steps to Reproduce:
compile task1

#include <sched.h>

int main() {
    /* yield in a tight loop, never blocking */
    while (1) {
        sched_yield();
    }
    return 0;
}

and compile task2

#include <stdio.h>
#include <sys/time.h>

int main() {
    while (1) {
        int i;
        struct timeval t0, t1;
        double usec;

        /* time a fixed amount of busy work */
        gettimeofday(&t0, 0);
        for (i = 0; i < 100000000; ++i)
            ;
        gettimeofday(&t1, 0);

        usec = (t1.tv_sec * 1e6 + t1.tv_usec) - (t0.tv_sec * 1e6 + t0.tv_usec);
        printf("%8.0f\n", usec);
    }
    return 0;
}

Then run:
"taskset -c 0 ./task1"
"taskset -c 0 ./task2"

You will see that both tasks use 50% of the CPU.
Then kill task2 and run:
"taskset -c 1 ./task2"

Now task2 will run twice as fast, verifying that it is not some anomaly with the
way top calculates CPU usage with sched_yield.

Actual results:
Tasks with sched_yield do not yield like they are supposed to.

Expected results:
The sched_yield task's CPU usage should go to near 0% when another task is on
the same CPU.

2007-09-19 18:46:21

by David Schwartz

[permalink] [raw]
Subject: RE: CFS: some bad numbers with Java/database threading [FIXED]


> The CFS scheduler does not seem to implement sched_yield correctly. If one
> program loops on sched_yield and another program prints out timing
> information in a loop, you will see that when both are taskset to the same
> core the timing stats are twice as long as when they are on different
> cores. This problem was not in 2.6.21-1.3194 but showed up in 2.6.22.4-65
> and continues in the newest released kernel 2.6.22.5-76.

I disagree with the bug report.

> You will see that both tasks use 50% of the CPU.
> Then kill task2 and run:
> "taskset -c 1 ./task2"

This seems right. They're both always ready to run. They're at the same
priority. Neither ever blocks. There is no reason one should get more CPU
than the other.

> Now task2 will run twice as fast verifying that it is not some
> anomaly with the
> way top calculates CPU usage with sched_yield.
>
> Actual results:
> Tasks with sched_yield do not yield like they are supposed to.

Umm, how does he get that? It's yielding at blinding speed.

> Expected results:
> The sched_yield task's CPU usage should go to near 0% when
> another task is on
> the same CPU.

Nonsense. The task is always ready-to-run. There is no reason its CPU should
be low. This bug report is based on a misunderstanding of what yielding
means.

The Linux page says:

"A process can relinquish the processor voluntarily without blocking by
calling sched_yield(). The process will then be moved to the end of
the queue for its static priority and a new process gets to run."

Notice the "without blocking" part?

POSIX says:

"The sched_yield() function forces the running thread to relinquish the
processor until it again becomes the head of its thread list. It takes no
arguments."

CFS is perfectly complying with both of these. This bug report is a great
example of how sched_yield can be misunderstood and misused.

You can even argue that the sched_yield process should get even more CPU,
since it's voluntarily relinquishing (which should be rewarded) rather than
infinitely spinning (which should be punished). (Not that I agree with this
argument, I'm just using it to counter-balance the other argument.)

DS


2007-09-19 19:19:25

by Ingo Molnar

[permalink] [raw]
Subject: Re: CFS: some bad numbers with Java/database threading [FIXED]


* Chuck Ebbert <[email protected]> wrote:

> I just got a bug report today:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=295071
>
> ==================================================
>
> Description of problem:
>
> The CFS scheduler does not seem to implement sched_yield correctly. If
> one program loops on sched_yield and another program prints out
> timing information in a loop, you will see that when both are taskset to
> the same core the timing stats are twice as long as when they
> are on different cores. This problem was not in 2.6.21-1.3194 but
> showed up in 2.6.22.4-65 and continues in the newest released kernel
> 2.6.22.5-76.

sched_yield is a very poorly defined interface as it does not tell the
kernel anything about what kind of locking user-space does. When an app
calls the syscall it basically tells the scheduler: "uhm, ok, i'm
blocked and i want you to do something else now. I dont want to tell you
how long this task wants to wait, and i dont want to tell you on what
condition it should be unblocked. Just ... do some stuff and let me
alone. See ya."

That's not an intelligent interface upon which the scheduler can do
anything particularly intelligent (in fact, it's a very stupid interface
upon which the scheduler can only do stupid things), and hence
schedulers tend to implement sched_yield() quite arbitrarily and in a
very scheduler-internals dependent way - which internals change when the
scheduler is changed materially.

The correct way to tell the kernel that the task is blocked is to use
futexes for example, or any kernel-based locking or wait object - there
are myriads of APIs for these. (The only well-defined behavior of yield
is for SCHED_FIFO/RR tasks - and that is fully preserved in CFS.)
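
As an illustration of that kind of kernel-backed waiting, here is a
minimal sketch (not from this thread; the "ready" flag and "waiter"
thread are made up for the example) using a pthread condition variable,
which is futex-based on Linux:

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
    static bool ready;

    static void *waiter(void *arg)
    {
        pthread_mutex_lock(&lock);
        while (!ready)                  /* blocks in the kernel - no yield()/spin loop */
            pthread_cond_wait(&cond, &lock);
        pthread_mutex_unlock(&lock);
        printf("woken up\n");
        return arg;
    }

    int main(void)
    {
        pthread_t t;

        pthread_create(&t, NULL, waiter, NULL);

        pthread_mutex_lock(&lock);
        ready = true;
        pthread_cond_signal(&cond);     /* wake the waiter when the work is ready */
        pthread_mutex_unlock(&lock);

        pthread_join(t, NULL);
        return 0;
    }

Unlike a sched_yield() polling loop, this tells the scheduler exactly
what the task is waiting for and exactly when it becomes runnable again.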

Regarding compatible behavior: it's not possible to fully emulate the
O(1) scheduler's yield behavior, because the yield implementation
depended on so many scheduler internals. We changed yield when we went
from 2.4 to 2.6, and even in 2.6 we changed it a number of times. To
avoid the recurring problems of applications mistakenly relying on
sched_yield(), we now context-switch on yield very weakly for
SCHED_OTHER tasks (SCHED_FIFO/RR behavior is unchanged) - this is the
only sane behavior that will get apps to stop using sched_yield() - and
that's the difference that the above testcase shows. (there's actual
user-space code executed, instead of the frequent overscheduling.)

My patch below adds a sysctl flag that triggers a context-switch when
yield is called (however futile, undefined and broken that might be), but
that would only be for legacy reasons. We could still add this patch,
but i was reluctant to send it to Linus without having at least one
application that relies on having it and that benefits from it
(Antoine's test turned out to not rely on it - see the '[FIXED]'
qualifier in the subject line).

Linus, what do you think? I have no strong feelings, I think the patch
cannot hurt (it does not change anything by default) - but we should not
turn the workaround flag on by default. If you agree that we should do
this, then please pull this single patch from the sched.git tree:

git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git

i've tested it with the flag off and on as well, and it works as
expected.

Ingo

------------------>
Ingo Molnar (1):
sched: yield workaround

include/linux/sched.h | 2 +
kernel/sched_fair.c | 73 +++++++++++++++++++++++++++++++++++++++++++++-----
kernel/sysctl.c | 19 +++++++++++++
3 files changed, 88 insertions(+), 6 deletions(-)

Ingo

------------------------------>
Subject: sched: yield workaround
From: Ingo Molnar <[email protected]>

sched_yield() is fundamentally broken, and CFS has changed
its behavior. Some apps that mistakenly rely on sched_yield()
want "weak" sched yield (such as desktop apps) - while some
apps want "strong" sched_yield() (for example some JDKs).

There's no way for the scheduler to figure out which of the
two variants the app really wants - because sched_yield() is
all about hiding from the kernel the true structure of the
user-space locking code.

As a solution, provide a workaround, to introduce a more
agressive sched_yield implementation:

# default one:
echo 0 > /proc/sys/kernel/sched_yield_bug_workaround

# always queues the current task next to the next task:
echo 1 > /proc/sys/kernel/sched_yield_bug_workaround

# NOP:
echo 2 > /proc/sys/kernel/sched_yield_bug_workaround

in the future, the use of this sysctl might generate
a deprecation warning, so that apps start moving away
from their reliance on sched_yield().

Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/sched.h | 2 +
kernel/sched_fair.c | 73 +++++++++++++++++++++++++++++++++++++++++++++-----
kernel/sysctl.c | 19 +++++++++++++
3 files changed, 88 insertions(+), 6 deletions(-)

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1402,10 +1402,12 @@ extern void sched_idle_next(void);

extern unsigned int sysctl_sched_latency;
extern unsigned int sysctl_sched_min_granularity;
+extern unsigned int sysctl_sched_yield_granularity;
extern unsigned int sysctl_sched_wakeup_granularity;
extern unsigned int sysctl_sched_batch_wakeup_granularity;
extern unsigned int sysctl_sched_stat_granularity;
extern unsigned int sysctl_sched_runtime_limit;
+extern unsigned int sysctl_sched_yield_bug_workaround;
extern unsigned int sysctl_sched_child_runs_first;
extern unsigned int sysctl_sched_features;

Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -42,6 +42,16 @@ unsigned int sysctl_sched_latency __read
*/
unsigned int sysctl_sched_min_granularity __read_mostly = 2000000ULL;

+unsigned int sysctl_sched_yield_granularity __read_mostly = 10000000ULL;
+
+/*
+ * sys_sched_yield workaround switch.
+ *
+ * This option switches the yield implementation of the
+ * old scheduler back on.
+ */
+unsigned int __read_mostly sysctl_sched_yield_bug_workaround;
+
/*
* SCHED_BATCH wake-up granularity.
* (default: 25 msec, units: nanoseconds)
@@ -901,15 +911,66 @@ static void dequeue_task_fair(struct rq
*/
static void yield_task_fair(struct rq *rq, struct task_struct *p)
{
- struct cfs_rq *cfs_rq = task_cfs_rq(p);
+ if (!sysctl_sched_yield_bug_workaround) {
+ struct cfs_rq *cfs_rq = task_cfs_rq(p);
+ __update_rq_clock(rq);
+
+ /*
+ * Dequeue and enqueue the task to update its
+ * position within the tree:
+ */
+ dequeue_entity(cfs_rq, &p->se, 0);
+ enqueue_entity(cfs_rq, &p->se, 0);
+ return;
+ }
+
+ if (sysctl_sched_yield_bug_workaround == 1) {
+ struct cfs_rq *cfs_rq = task_cfs_rq(p);
+ struct rb_node *curr, *next, *first;
+ struct task_struct *p_next;
+ s64 yield_key;
+
+ __update_rq_clock(rq);
+ curr = &p->se.run_node;
+ first = first_fair(cfs_rq);
+ /*
+ * Move this task to the second place in the tree:
+ */
+ if (curr != first)
+ next = rb_next(curr);
+ else
+ next = first;
+ /*
+ * We were the last one already - nothing to do, return
+ * and reschedule:
+ */
+ if (unlikely(!next))
+ return;
+
+ p_next = rb_entry(next, struct task_struct, se.run_node);
+ /*
+ * Minimally necessary key value to be the second in the tree:
+ */
+ yield_key = p_next->se.fair_key +
+ (int)sysctl_sched_yield_granularity;
+
+ dequeue_entity(cfs_rq, &p->se, 0);
+
+ /*
+ * Only update the key if we need to move more backwards
+ * than the minimally necessary position to be the second:
+ */
+ if (p->se.fair_key < yield_key)
+ p->se.fair_key = yield_key;
+
+ __enqueue_entity(cfs_rq, &p->se);
+ return;
+ }

- __update_rq_clock(rq);
/*
- * Dequeue and enqueue the task to update its
- * position within the tree:
+ * Just reschedule, do nothing else:
*/
- dequeue_entity(cfs_rq, &p->se, 0);
- enqueue_entity(cfs_rq, &p->se, 0);
+ resched_task(p);
}

/*
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c
+++ linux/kernel/sysctl.c
@@ -244,6 +244,17 @@ static ctl_table kern_table[] = {
},
{
.ctl_name = CTL_UNNUMBERED,
+ .procname = "sched_yield_granularity_ns",
+ .data = &sysctl_sched_yield_granularity,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec_minmax,
+ .strategy = &sysctl_intvec,
+ .extra1 = &min_sched_granularity_ns,
+ .extra2 = &max_sched_granularity_ns,
+ },
+ {
+ .ctl_name = CTL_UNNUMBERED,
.procname = "sched_wakeup_granularity_ns",
.data = &sysctl_sched_wakeup_granularity,
.maxlen = sizeof(unsigned int),
@@ -303,6 +314,14 @@ static ctl_table kern_table[] = {
.proc_handler = &proc_dointvec,
},
#endif
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "sched_yield_bug_workaround",
+ .data = &sysctl_sched_yield_bug_workaround,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
#ifdef CONFIG_PROVE_LOCKING
{
.ctl_name = CTL_UNNUMBERED,

2007-09-19 19:43:22

by Linus Torvalds

[permalink] [raw]
Subject: Re: CFS: some bad numbers with Java/database threading [FIXED]



On Wed, 19 Sep 2007, Ingo Molnar wrote:
>
> Linus, what do you think? I have no strong feelings, I think the patch
> cannot hurt (it does not change anything by default) - but we should not
> turn the workaround flag on by default.

I disagree. I think CFS made "sched_yield()" worse, and what you call "bug
workaround" is likely the *better* behaviour.

The fact is, sched_yield() is not - and should not be - about
"recalculating the position in the scheduler queue" like you do now in
CFS.

It very much is about moving the thread *dead last* within its priority
group.

That's what it does for round-robin, and it's not about fairness, it's
about

- Opengroup:

DESCRIPTION

The sched_yield() function forces the running thread to
relinquish the processor until it again becomes the head of its
thread list. It takes no arguments.

- Linux man-page:

DESCRIPTION

A process can relinquish the processor voluntarily without
blocking by calling sched_yield. The process will then be moved
to the end of the queue for its static priority and a new process
gets to run.

and quite frankly, the current CFS behaviour simply looks buggy. It should
simply not move it to the "right place" in the rbtree. It should move it
*last*.

Linus

2007-09-19 19:48:40

by Chris Friesen

[permalink] [raw]
Subject: Re: CFS: some bad numbers with Java/database threading [FIXED]

David Schwartz wrote:

> Nonsense. The task is always ready-to-run. There is no reason its CPU should
> be low. This bug report is based on a misunderstanding of what yielding
> means.

The yielding task has given up the cpu. The other task should get to
run for a timeslice (or whatever the equivalent is in CFS) until the
yielding task again "becomes head of the thread list".

Chris

2007-09-19 19:57:19

by Ingo Molnar

[permalink] [raw]
Subject: Re: CFS: some bad numbers with Java/database threading [FIXED]


* Linus Torvalds <[email protected]> wrote:

> > Linus, what do you think? I have no strong feelings, I think the
> > patch cannot hurt (it does not change anything by default) - but we
> > should not turn the workaround flag on by default.
>
> I disagree. I think CFS made "sched_yield()" worse, and what you call
> "bug workaround" is likely the *better* behaviour.
>
> The fact is, sched_yield() is not - and should not be - about
> "recalculating the position in the scheduler queue" like you do now in
> CFS.
>
> It very much is about moving the thread *dead last* within its
> priority group.
[...]
> and quite frankly, the current CFS behaviour simply looks buggy. It
> should simply not move it to the "right place" in the rbtree. It
> should move it *last*.

ok, we can do that.

the O(1) implementation of yield() was pretty arbitrary: it did not move
it last on the same priority level - it only did it within the active
array. So expired tasks (such as CPU hogs) would come _after_ a
yield()-ing task.

so the yield() implementation was so much tied to the data structures of
the O(1) scheduler that it was impossible to fully emulate it in CFS.

in CFS we dont have a per-nice-level rbtree, so we cannot move it dead
last within the same priority group - but we can move it dead last in
the whole tree. (then they'd be put even after nice +19 tasks.) People
might complain about _that_.

another practical problem is that this will break certain desktop apps
that do calls to yield() [some firefox plugins do that, some 3D apps do
that, etc.] but they dont expect to be moved 'very late' into the queue
- they expect the O(1) scheduler's behavior of being delayed "a bit".
(That's why i added the yield-granularity tunable.)

we can make yield super-aggressive, that is pretty much the only sane
(because well-defined) thing to do (besides turning yield into a NOP),
but there will be lots of regression reports about lost interactivity
during load. sched_yield() is a mortally broken API. "fix the app" would
be the answer, but still there will be lots of complaints.

Ingo

2007-09-19 20:00:49

by Chris Friesen

[permalink] [raw]
Subject: Re: CFS: some bad numbers with Java/database threading [FIXED]

Ingo Molnar wrote:
> The correct way to tell the kernel that the task is blocked is to use
> futexes for example, or any kernel-based locking or wait object - there
> are myriads of APIs for these. (The only well-defined behavior of yield
> is for SCHED_FIFO/RR tasks - and that is fully preserved in CFS.)

Certainly this is reasonable for applications for which the source is
available and readily recompilable. However, there are legacy
closed-source apps out there expecting sched_yield() to result in a
reasonable amount of time passing before the task is scheduled again.

Also, there are installed bases of users with older versions of code
who may wish to upgrade to newer kernels without upgrading the
rest of the system. It seems odd to force them to update userspace apps
just because we don't like the undefined semantics.

> To avoid the recurring problems of applications mistakenly relying on
> sched_yield(), we now context-switch on yield very weakly for
> SCHED_OTHER tasks...

<snip>

> My patch below adds a sysctl flag that triggers a context-switch when
> yield is called... <snip> I think the patch
> cannot hurt (it does not change anything by default) - but we should not
> turn the workaround flag on by default. If you agree that we should do
> this, then please pull this single patch from the sched.git tree:

I've always understood one of the kernel's basic tenets to be that we
don't break userspace without a good reason. If there are apps out
there that expect sched_yield() to give up the cpu, it seems
counter-intuitive to ignore that expectation.

Personally, I'd be in favour of making the context-switch the default
behaviour, but at the very least it should be possible to enable a
"backwards-compatibility mode" for sched_yield().

Chris

2007-09-19 20:27:26

by Ingo Molnar

[permalink] [raw]
Subject: Re: CFS: some bad numbers with Java/database threading [FIXED]


* Ingo Molnar <[email protected]> wrote:

> ok, we can do that.
>
> the O(1) implementation of yield() was pretty arbitrary: it did not
> move it last on the same priority level - it only did it within the
> active array. So expired tasks (such as CPU hogs) would come _after_ a
> yield()-ing task.
>
> so the yield() implementation was so much tied to the data structures
> of the O(1) scheduler that it was impossible to fully emulate it in
> CFS.
>
> in CFS we dont have a per-nice-level rbtree, so we cannot move it dead
> last within the same priority group - but we can move it dead last in
> the whole tree. (then they'd be put even after nice +19 tasks.) People
> might complain about _that_.
>
> another practical problem is that this will break certain desktop apps
> that do calls to yield() [some firefox plugins do that, some 3D apps
> do that, etc.] but they dont expect to be moved 'very late' into the
> queue - they expect the O(1) scheduler's behavior of being delayed "a
> bit". (That's why i added the yield-granularity tunable.)
>
> we can make yield super-aggressive, that is pretty much the only sane
> (because well-defined) thing to do (besides turning yield into a NOP),
> but there will be lots of regression reports about lost interactivity
> during load. sched_yield() is a mortally broken API. "fix the app"
> would be the answer, but still there will be lots of complaints.

find below the fix that puts yielding tasks to the rightmost position in
the rbtree. I have not tested it extensively yet, but it appears to work
on x86. (i tested yield using interactive tasks and they get hurt now
under load - but this would be expected.)

Ingo

---------------------->
Subject: sched: make yield more agressive
From: Ingo Molnar <[email protected]>

make sys_sched_yield() more agressive, by moving the yielding task
to the last position in the rbtree.

Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched.c | 5 +----
kernel/sched_fair.c | 39 ++++++++++++++++++++++++++++++++++-----
2 files changed, 35 insertions(+), 9 deletions(-)

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -4550,10 +4550,7 @@ asmlinkage long sys_sched_yield(void)
struct rq *rq = this_rq_lock();

schedstat_inc(rq, yld_cnt);
- if (unlikely(rq->nr_running == 1))
- schedstat_inc(rq, yld_act_empty);
- else
- current->sched_class->yield_task(rq, current);
+ current->sched_class->yield_task(rq, current);

/*
* Since we are going to call schedule() anyway, there's
Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -902,14 +902,43 @@ static void dequeue_task_fair(struct rq
static void yield_task_fair(struct rq *rq, struct task_struct *p)
{
struct cfs_rq *cfs_rq = task_cfs_rq(p);
+ struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;
+ struct sched_entity *rightmost, *se = &p->se;
+ struct rb_node *parent;

- __update_rq_clock(rq);
/*
- * Dequeue and enqueue the task to update its
- * position within the tree:
+ * Are we the only task in the tree?
*/
- dequeue_entity(cfs_rq, &p->se, 0);
- enqueue_entity(cfs_rq, &p->se, 0);
+ if (unlikely(cfs_rq->nr_running == 1))
+ return;
+ /*
+ * Find the rightmost entry in the rbtree:
+ */
+ do {
+ parent = *link;
+ link = &parent->rb_right;
+ } while (*link);
+
+ rightmost = rb_entry(parent, struct sched_entity, run_node);
+ /*
+ * Already in the rightmost position?
+ */
+ if (unlikely(rightmost == se))
+ return;
+
+ /*
+ * Minimally necessary key value to be last in the tree:
+ */
+ se->fair_key = rightmost->fair_key + 1;
+
+ if (cfs_rq->rb_leftmost == &se->run_node)
+ cfs_rq->rb_leftmost = rb_next(&se->run_node);
+ /*
+ * Relink the task to the rightmost position:
+ */
+ rb_erase(&se->run_node, &cfs_rq->tasks_timeline);
+ rb_link_node(&se->run_node, parent, link);
+ rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
}

/*

2007-09-19 20:29:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: CFS: some bad numbers with Java/database threading [FIXED]



On Wed, 19 Sep 2007, Ingo Molnar wrote:
>
> in CFS we dont have a per-nice-level rbtree, so we cannot move it dead
> last within the same priority group - but we can move it dead last in
> the whole tree. (then they'd be put even after nice +19 tasks.) People
> might complain about _that_.

Yeah, with reasonably good reason.

The sched_yield() behaviour is actually very well-defined for RT tasks
(now, whether it's a good interface to use or not is still open to debate,
but at least it's a _defined_ interface ;), and I think we should at least
try to approximate that behaviour for normal tasks, even if they aren't
RT.

And the closest you can get is basically to say something that is close to

"sched_yield()" on a non-RT process still means that all
other runnable tasks in that same nice-level will be scheduled
before the task is scheduled again.

and I think that would be the optimal approximation of the RT behaviour.

Now, quite understandably we might not be able to actually get that
*exact* behaviour (since we mix up different nice levels), but I think
from a QoI standpoint we should really have that as a "target" behaviour.

So I think it should be moved back in the RB-tree, but really preferably
only back to behind all other processes at the same nice-level.

So I think we have two problems with yield():

- it really doesn't have very nice/good semantics in the first place for
anything but strict round-robin RT tasks.

We can't do much about this problem, apart from trying to make people
use proper locking and other models that do *not* depend on yield().

- Linux has made the problem a bit worse by then having fairly arbitrary
and different semantics over time.

This is where I think it would be a good idea to try to have the above
kind of "this is the ideal behaviour - we don't guarantee it being
precise, but at least we'd have some definition of what people should
be able to expect is the "best" behaviour.

So I don't think a "globally last" option is at all the best thing, but I
think it's likely better than what CFS does now. And if we then have some
agreement on what would be considered a further _improvement_, then that
would be a good thing - maybe we're not perfect, but at least having a
view of what we want to do is a good idea.

Btw, the "enqueue at the end" could easily be a statistical thing, not an
absolute thing. So it's possible that we could perhaps implement the CFS
"yield()" using the same logic as we have now, except *not* calling the
"update_stats()" stuff:

__dequeue_entity(..);
__enqueue_entity(..);

and then just force the "fair_key" of the task to something that
*statistically* means that it should be at the end of its nice queue.

I dunno.

Linus

2007-09-19 21:41:27

by Ingo Molnar

[permalink] [raw]
Subject: Re: CFS: some bad numbers with Java/database threading [FIXED]


* Linus Torvalds <[email protected]> wrote:

> Btw, the "enqueue at the end" could easily be a statistical thing, not
> an absolute thing. So it's possible that we could perhaps implement
> the CFS "yield()" using the same logic as we have now, except *not*
> calling the "update_stats()" stuff:
>
> __dequeue_entity(..);
> __enqueue_entity(..);
>
> and then just force the "fair_key" of the to something that
> *statistically* means that it should be at the end of its nice queue.
>
> I dunno.

i thought a bit about the statistical approach, and it's good in
principle, but it has an implementational problem/complication: if there
are only yielding tasks in the system, then the "queue rightwards in the
tree, statistically" approach cycles through the key-space artificially
fast. That can cause various problems. (this means that the
workload-flag patch that uses yield_granularity is buggy as well. The
queue-rightmost patch did not have this problem.)

So right now there are only two viable options i think: either do the
current weak thing, or do the rightmost thing. The statistical method
might work too, but it needs more thought and more testing - i'm not
sure we can get that ready for 2.6.23.

So what we have as working code right now is the two extremes, and apps
will really mostly prefer either the first (if they dont truly want to
use yield but somehow it got into their code) or the second (if they
want some massive delay). So while it does not have a good QoI, how
about doing a compat_yield sysctl that allows the turning on of the
"queue rightmost" logic? Find tested patch below.

Peter, what do you think?

Linus, if this would be acceptable for .23 then i'll push it out into
sched.git together with another fix that Hiroshi Shimamoto just posted
to lkml.

Ingo

-------------------------------------->
Subject: sched: add /proc/sys/kernel/sched_compat_yield
From: Ingo Molnar <[email protected]>

add /proc/sys/kernel/sched_compat_yield to make sys_sched_yield()
more agressive, by moving the yielding task to the last position
in the rbtree.

with sched_compat_yield=0:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2539 mingo 20 0 1576 252 204 R 50 0.0 0:02.03 loop_yield
2541 mingo 20 0 1576 244 196 R 50 0.0 0:02.05 loop

with sched_compat_yield=1:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2584 mingo 20 0 1576 248 196 R 99 0.0 0:52.45 loop
2582 mingo 20 0 1576 256 204 R 0 0.0 0:00.00 loop_yield

Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/sched.h | 1
kernel/sched.c | 5 ---
kernel/sched_fair.c | 63 +++++++++++++++++++++++++++++++++++++++++++++-----
kernel/sysctl.c | 8 ++++++
4 files changed, 67 insertions(+), 10 deletions(-)

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1406,6 +1406,7 @@ extern unsigned int sysctl_sched_wakeup_
extern unsigned int sysctl_sched_batch_wakeup_granularity;
extern unsigned int sysctl_sched_stat_granularity;
extern unsigned int sysctl_sched_runtime_limit;
+extern unsigned int sysctl_sched_compat_yield;
extern unsigned int sysctl_sched_child_runs_first;
extern unsigned int sysctl_sched_features;

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -4550,10 +4550,7 @@ asmlinkage long sys_sched_yield(void)
struct rq *rq = this_rq_lock();

schedstat_inc(rq, yld_cnt);
- if (unlikely(rq->nr_running == 1))
- schedstat_inc(rq, yld_act_empty);
- else
- current->sched_class->yield_task(rq, current);
+ current->sched_class->yield_task(rq, current);

/*
* Since we are going to call schedule() anyway, there's
Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -43,6 +43,14 @@ unsigned int sysctl_sched_latency __read
unsigned int sysctl_sched_min_granularity __read_mostly = 2000000ULL;

/*
+ * sys_sched_yield() compat mode
+ *
+ * This option switches the agressive yield implementation of the
+ * old scheduler back on.
+ */
+unsigned int __read_mostly sysctl_sched_compat_yield;
+
+/*
* SCHED_BATCH wake-up granularity.
* (default: 25 msec, units: nanoseconds)
*
@@ -897,19 +905,62 @@ static void dequeue_task_fair(struct rq
}

/*
- * sched_yield() support is very simple - we dequeue and enqueue
+ * sched_yield() support is very simple - we dequeue and enqueue.
+ *
+ * If compat_yield is turned on then we requeue to the end of the tree.
*/
static void yield_task_fair(struct rq *rq, struct task_struct *p)
{
struct cfs_rq *cfs_rq = task_cfs_rq(p);
+ struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;
+ struct sched_entity *rightmost, *se = &p->se;
+ struct rb_node *parent;

- __update_rq_clock(rq);
/*
- * Dequeue and enqueue the task to update its
- * position within the tree:
+ * Are we the only task in the tree?
+ */
+ if (unlikely(cfs_rq->nr_running == 1))
+ return;
+
+ if (likely(!sysctl_sched_compat_yield)) {
+ __update_rq_clock(rq);
+ /*
+ * Dequeue and enqueue the task to update its
+ * position within the tree:
+ */
+ dequeue_entity(cfs_rq, &p->se, 0);
+ enqueue_entity(cfs_rq, &p->se, 0);
+
+ return;
+ }
+ /*
+ * Find the rightmost entry in the rbtree:
*/
- dequeue_entity(cfs_rq, &p->se, 0);
- enqueue_entity(cfs_rq, &p->se, 0);
+ do {
+ parent = *link;
+ link = &parent->rb_right;
+ } while (*link);
+
+ rightmost = rb_entry(parent, struct sched_entity, run_node);
+ /*
+ * Already in the rightmost position?
+ */
+ if (unlikely(rightmost == se))
+ return;
+
+ /*
+ * Minimally necessary key value to be last in the tree:
+ */
+ se->fair_key = rightmost->fair_key + 1;
+
+ if (cfs_rq->rb_leftmost == &se->run_node)
+ cfs_rq->rb_leftmost = rb_next(&se->run_node);
+ /*
+ * Relink the task to the rightmost position:
+ */
+ rb_erase(&se->run_node, &cfs_rq->tasks_timeline);
+ rb_link_node(&se->run_node, parent, link);
+ rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
}

/*
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c
+++ linux/kernel/sysctl.c
@@ -303,6 +303,14 @@ static ctl_table kern_table[] = {
.proc_handler = &proc_dointvec,
},
#endif
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "sched_compat_yield",
+ .data = &sysctl_sched_compat_yield,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
#ifdef CONFIG_PROVE_LOCKING
{
.ctl_name = CTL_UNNUMBERED,

2007-09-19 21:49:45

by Ingo Molnar

[permalink] [raw]
Subject: Re: CFS: some bad numbers with Java/database threading [FIXED]


* Ingo Molnar <[email protected]> wrote:

> Peter, what do you think?
>
> Linus, if this would be acceptable for .23 then i'll push it out into
> sched.git together with another fix that Hiroshi Shimamoto just posted
> to lkml.

it's getting late here so i've pushed the current version of those two
patches out to:

git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git

I'll redo this git tree if we want some other solution for yield. (but i
think this is the safest approach for 2.6.23 - some apps will complain
about too strong yield, some apps will complain about too weak yield. So
by providing the two extremes we at least cover the practical range of
behavior.)

there's nothing else pending for 2.6.23 otherwise at the moment,
scheduler-wise.

Ingo

------------------>
Hiroshi Shimamoto (1):
sched: fix invalid sched_class use

Ingo Molnar (1):
sched: add /proc/sys/kernel/sched_compat_yield

include/linux/sched.h | 1
kernel/sched.c | 10 ++++---
kernel/sched_fair.c | 63 +++++++++++++++++++++++++++++++++++++++++++++-----
kernel/sysctl.c | 8 ++++++
4 files changed, 72 insertions(+), 10 deletions(-)

2007-09-19 22:00:08

by Peter Zijlstra

[permalink] [raw]
Subject: Re: CFS: some bad numbers with Java/database threading [FIXED]

On Wed, 19 Sep 2007 23:41:05 +0200 Ingo Molnar <[email protected]> wrote:

>
> * Linus Torvalds <[email protected]> wrote:
>
> > Btw, the "enqueue at the end" could easily be a statistical thing, not
> > an absolute thing. So it's possible that we could perhaps implement
> > the CFS "yield()" using the same logic as we have now, except *not*
> > calling the "update_stats()" stuff:
> >
> > __dequeue_entity(..);
> > __enqueue_entity(..);
> >
> > and then just force the "fair_key" of the to something that
> > *statistically* means that it should be at the end of its nice queue.
> >
> > I dunno.
>
> i thought a bit about the statistical approach, and it's good in
> principle, but it has an implementational problem/complication: if there
> are only yielding tasks in the system, then the "queue rightwards in the
> tree, statistically" approach cycles through the key-space artificially
> fast. That can cause various problems. (this means that the
> workload-flag patch that uses yield_granularity is buggy as well. The
> queue-rightmost patch did not have this problem.)
>
> So right now there are only two viable options i think: either do the
> current weak thing, or do the rightmost thing. The statistical method
> might work too, but it needs more thought and more testing - i'm not
> sure we can get that ready for 2.6.23.
>
> So what we have as working code right now is the two extremes, and apps
> will really mostly prefer either the first (if they dont truly want to
> use yield but somehow it got into their code) or the second (if they
> want some massive delay). So while it does not have a good QoI, how
> about doing a compat_yield sysctl that allows the turning on of the
> "queue rightmost" logic? Find tested patch below.
>
> Peter, what do you think?

I have to agree that for .23 we can't do much more than this. And tasks
moving to the right without actually doing work and advancing
fair_clock do scare me a little.

Also, while I agree with Linus' definition of sched_yield, I'm afraid
it will cause 'regressions' for all the interactivity people out there.
Somehow this yield thing has made it into all sorts of unfortunate
places like video drivers, so a heavily penalizing yield will hurt them.

2007-09-19 22:57:17

by David Schwartz

[permalink] [raw]
Subject: RE: CFS: some bad numbers with Java/database threading [FIXED]


> David Schwartz wrote:

> > Nonsense. The task is always ready-to-run. There is no reason
> > its CPU should
> > be low. This bug report is based on a misunderstanding of what yielding
> > means.

> The yielding task has given up the cpu. The other task should get to
> run for a timeslice (or whatever the equivalent is in CFS) until the
> yielding task again "becomes head of the thread list".

Are you sure this isn't happening? I'll run some tests on my SMP system
running CFS. But I'll bet the context switch rate is quite rapid.

Honestly, I can't imagine what else could be happening here. Does CFS spin
in a loop doing nothing when you call sched_yield even though another task
is ready-to-run? That seems kind of bizarre. Is sched_yield acting as a
no-op?

DS


2007-09-19 23:05:55

by David Schwartz

[permalink] [raw]
Subject: RE: CFS: some bad numbers with Java/database threading [FIXED]


Chris Friesen wrote:

> > The yielding task has given up the cpu. The other task should get to
> > run for a timeslice (or whatever the equivalent is in CFS) until the
> > yielding task again "becomes head of the thread list".

> Are you sure this isn't happening? I'll run some tests on my SMP
> system running CFS. But I'll bet the context switch rate is quite rapid.

Yep, that's exactly what's happening. The tasks are alternating. They are
both always ready-to-run. The yielding task is put at the end of the queue
for its priority level.

There is no reason the yielding task should get less CPU since they're both
always ready-to-run.

The only downside here is that a yielding task results in very small
timeslices, which causes cache inefficiencies. A sane lower bound on the
timeslice might be a good idea. But there is no semantic problem.

DS


2007-09-19 23:53:36

by David Schwartz

[permalink] [raw]
Subject: RE: CFS: some bad numbers with Java/database threading [FIXED]


Ack, sorry, I'm wrong.

Please ignore me, if you weren't already.

I'm glad to hear this will be fixed. The task should be moved last for its
priority level.

DS


2007-09-26 01:46:53

by Antoine Martin

[permalink] [raw]
Subject: Re: CFS: new java yield graphs

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

These are pure cpu scheduling tests, not doing any I/O this time.
All these tests are still "pathological" in the sense that they are only
meant to show differences between schedulers rather than try to simulate
real usage scenarios.

all the graphs are here:
http://devloop.org.uk/lkml/

Legend:
* 2.6.23-rc6-yield2: "yield2" patch is this one:
http://lkml.org/lkml/2007/9/14/157
* 2.6.23-rc6-yield2-highlatency is the same patch, but the kernel is
built without preempt, with HZ=100, and the scheduling granularity is
doubled using sysctl.
* 2.6.23-rc6-yield3 is CFS-v21 combo3 + the yield patch
* 2.6.23-rc6-yield4 is this patch (queued for mainline? and in mm?):
http://lkml.org/lkml/2007/9/19/409

Of interest, I found:
* rc6-mm1 is not always fair (see "ManyThreadsYieldOften" tests) - it is the
only one to have almost half the threads already finished at the halfway
point when yielding often. It is also slower for the "RandomSleep" tests.
* increasing latency makes a noticeable difference (see "ShortPause"):
it can be more fair, but it also makes things a lot more erratic (see
"Yield" tests)
* most changes are only noticeable with a large number of threads (look
for 'ManyThreads' in the filename)


Summary of the code: each thread loops 25 times, incrementing a shared
counter each time and calling "iteration()", which does:
* "Long Pause" tests:
1000 sqrt() calls followed by a 25 millisecond sleep.
* "Random Sleep" tests:
sleep(n) after 1000 sqrt() calculations, where n is a random number
between 0 and 99 milliseconds.
* "Short Pause" tests:
1 million sqrt() calls followed by a 5ms sleep.
* "Yield Often" tests:
sqrt() then yield(), both 250 times per iteration.
* "Yield" tests:
1 million sqrt() calls followed by a single yield().
(A rough sketch of one "Yield" iteration is included after the source
link below.)
All the source code is here:
http://bugs.devloop.org.uk/devloop/browser/metastores/examples-test/src/com/metastores/examples/perftest/
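
A rough, hypothetical C analogue of one "Yield" iteration (the real
tests are Java/JUnit at the URL above; "iteration", "run_thread" and
"sink" are invented for this sketch):

    #include <math.h>
    #include <sched.h>

    static volatile double sink;        /* keeps the sqrt() loop from being optimized away */

    static void iteration(void)
    {
        int i;

        /* 1 million sqrt() calls followed by a single yield */
        for (i = 0; i < 1000000; i++)
            sink = sqrt((double)i);
        sched_yield();
    }

    static void run_thread(void)
    {
        int loop;

        /* each thread runs 25 iterations; the shared counter is omitted here */
        for (loop = 0; loop < 25; loop++)
            iteration();
    }

    int main(void)
    {
        run_thread();   /* the real harness starts many such threads concurrently */
        return 0;
    }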

Antoine

PS: now testing -rc8; also added a test that consumes memory in each
thread, and now also recording context switches and idle time.



Ingo Molnar wrote:
> * Linus Torvalds <[email protected]> wrote:
>
>> Btw, the "enqueue at the end" could easily be a statistical thing, not
>> an absolute thing. So it's possible that we could perhaps implement
>> the CFS "yield()" using the same logic as we have now, except *not*
>> calling the "update_stats()" stuff:
>>
>> __dequeue_entity(..);
>> __enqueue_entity(..);
>>
>> and then just force the "fair_key" of the to something that
>> *statistically* means that it should be at the end of its nice queue.
>>
>> I dunno.
>
> i thought a bit about the statistical approach, and it's good in
> principle, but it has an implementational problem/complication: if there
> are only yielding tasks in the system, then the "queue rightwards in the
> tree, statistically" approach cycles through the key-space artificially
> fast. That can cause various problems. (this means that the
> workload-flag patch that uses yield_granularity is buggy as well. The
> queue-rightmost patch did not have this problem.)
>
> So right now there are only two viable options i think: either do the
> current weak thing, or do the rightmost thing. The statistical method
> might work too, but it needs more thought and more testing - i'm not
> sure we can get that ready for 2.6.23.
>
> So what we have as working code right now is the two extremes, and apps
> will really mostly prefer either the first (if they dont truly want to
> use yield but somehow it got into their code) or the second (if they
> want some massive delay). So while it does not have a good QoI, how
> about doing a compat_yield sysctl that allows the turning on of the
> "queue rightmost" logic? Find tested patch below.
>
> Peter, what do you think?
>
> Linus, if this would be acceptable for .23 then i'll push it out into
> sched.git together with another fix that Hiroshi Shimamoto just posted
> to lkml.
>
> Ingo
>
> -------------------------------------->
> Subject: sched: add /proc/sys/kernel/sched_compat_yield
> From: Ingo Molnar <[email protected]>
>
> add /proc/sys/kernel/sched_compat_yield to make sys_sched_yield()
> more agressive, by moving the yielding task to the last position
> in the rbtree.
>
> with sched_compat_yield=0:
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 2539 mingo 20 0 1576 252 204 R 50 0.0 0:02.03 loop_yield
> 2541 mingo 20 0 1576 244 196 R 50 0.0 0:02.05 loop
>
> with sched_compat_yield=1:
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 2584 mingo 20 0 1576 248 196 R 99 0.0 0:52.45 loop
> 2582 mingo 20 0 1576 256 204 R 0 0.0 0:00.00 loop_yield
>
> Signed-off-by: Ingo Molnar <[email protected]>
> ---
> include/linux/sched.h | 1
> kernel/sched.c | 5 ---
> kernel/sched_fair.c | 63 +++++++++++++++++++++++++++++++++++++++++++++-----
> kernel/sysctl.c | 8 ++++++
> 4 files changed, 67 insertions(+), 10 deletions(-)
>
> Index: linux/include/linux/sched.h
> ===================================================================
> --- linux.orig/include/linux/sched.h
> +++ linux/include/linux/sched.h
> @@ -1406,6 +1406,7 @@ extern unsigned int sysctl_sched_wakeup_
> extern unsigned int sysctl_sched_batch_wakeup_granularity;
> extern unsigned int sysctl_sched_stat_granularity;
> extern unsigned int sysctl_sched_runtime_limit;
> +extern unsigned int sysctl_sched_compat_yield;
> extern unsigned int sysctl_sched_child_runs_first;
> extern unsigned int sysctl_sched_features;
>
> Index: linux/kernel/sched.c
> ===================================================================
> --- linux.orig/kernel/sched.c
> +++ linux/kernel/sched.c
> @@ -4550,10 +4550,7 @@ asmlinkage long sys_sched_yield(void)
> struct rq *rq = this_rq_lock();
>
> schedstat_inc(rq, yld_cnt);
> - if (unlikely(rq->nr_running == 1))
> - schedstat_inc(rq, yld_act_empty);
> - else
> - current->sched_class->yield_task(rq, current);
> + current->sched_class->yield_task(rq, current);
>
> /*
> * Since we are going to call schedule() anyway, there's
> Index: linux/kernel/sched_fair.c
> ===================================================================
> --- linux.orig/kernel/sched_fair.c
> +++ linux/kernel/sched_fair.c
> @@ -43,6 +43,14 @@ unsigned int sysctl_sched_latency __read
> unsigned int sysctl_sched_min_granularity __read_mostly = 2000000ULL;
>
> /*
> + * sys_sched_yield() compat mode
> + *
> + * This option switches the agressive yield implementation of the
> + * old scheduler back on.
> + */
> +unsigned int __read_mostly sysctl_sched_compat_yield;
> +
> +/*
> * SCHED_BATCH wake-up granularity.
> * (default: 25 msec, units: nanoseconds)
> *
> @@ -897,19 +905,62 @@ static void dequeue_task_fair(struct rq
> }
>
> /*
> - * sched_yield() support is very simple - we dequeue and enqueue
> + * sched_yield() support is very simple - we dequeue and enqueue.
> + *
> + * If compat_yield is turned on then we requeue to the end of the tree.
> */
> static void yield_task_fair(struct rq *rq, struct task_struct *p)
> {
> struct cfs_rq *cfs_rq = task_cfs_rq(p);
> + struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;
> + struct sched_entity *rightmost, *se = &p->se;
> + struct rb_node *parent;
>
> - __update_rq_clock(rq);
> /*
> - * Dequeue and enqueue the task to update its
> - * position within the tree:
> + * Are we the only task in the tree?
> + */
> + if (unlikely(cfs_rq->nr_running == 1))
> + return;
> +
> + if (likely(!sysctl_sched_compat_yield)) {
> + __update_rq_clock(rq);
> + /*
> + * Dequeue and enqueue the task to update its
> + * position within the tree:
> + */
> + dequeue_entity(cfs_rq, &p->se, 0);
> + enqueue_entity(cfs_rq, &p->se, 0);
> +
> + return;
> + }
> + /*
> + * Find the rightmost entry in the rbtree:
> */
> - dequeue_entity(cfs_rq, &p->se, 0);
> - enqueue_entity(cfs_rq, &p->se, 0);
> + do {
> + parent = *link;
> + link = &parent->rb_right;
> + } while (*link);
> +
> + rightmost = rb_entry(parent, struct sched_entity, run_node);
> + /*
> + * Already in the rightmost position?
> + */
> + if (unlikely(rightmost == se))
> + return;
> +
> + /*
> + * Minimally necessary key value to be last in the tree:
> + */
> + se->fair_key = rightmost->fair_key + 1;
> +
> + if (cfs_rq->rb_leftmost == &se->run_node)
> + cfs_rq->rb_leftmost = rb_next(&se->run_node);
> + /*
> + * Relink the task to the rightmost position:
> + */
> + rb_erase(&se->run_node, &cfs_rq->tasks_timeline);
> + rb_link_node(&se->run_node, parent, link);
> + rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
> }
>
> /*
> Index: linux/kernel/sysctl.c
> ===================================================================
> --- linux.orig/kernel/sysctl.c
> +++ linux/kernel/sysctl.c
> @@ -303,6 +303,14 @@ static ctl_table kern_table[] = {
> .proc_handler = &proc_dointvec,
> },
> #endif
> + {
> + .ctl_name = CTL_UNNUMBERED,
> + .procname = "sched_compat_yield",
> + .data = &sysctl_sched_compat_yield,
> + .maxlen = sizeof(unsigned int),
> + .mode = 0644,
> + .proc_handler = &proc_dointvec,
> + },
> #ifdef CONFIG_PROVE_LOCKING
> {
> .ctl_name = CTL_UNNUMBERED,
>


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFG+bn+GK2zHPGK1rsRCg9aAJ42JXUKKf5V5fkSy48sxDZwIjs95gCeLDQ/
QdKYGH3mW8Cubn5IcfpArYY=
=mZPq
-----END PGP SIGNATURE-----

2007-09-27 08:35:52

by Ingo Molnar

[permalink] [raw]
Subject: Re: CFS: new java yield graphs


* Antoine Martin <[email protected]> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA512
>
> These are pure cpu scheduling tests, not doing any I/O this time. All
> these tests are still "pathological" in the sense that they are only
> meant to show differences between schedulers rather than try to
> simulate real usage scenarios.

thanks for testing this!

> all the graphs are here:
> http://devloop.org.uk/lkml/

wow - really nice graphs!

> Legend:
> * 2.6.23-rc6-yield2: "yield2" patch is this one:
> http://lkml.org/lkml/2007/9/14/157
> * 2.6.23-rc6-yield2-highlatency is the same patch, but the kernel is
> built without preempt, with HZ100 and the scheduling granularity is
> doubled using sysctl.
> * 2.6.23-rc6-yield3 is CFS-v21 combo3 + the yield patch
> * 2.6.23-rc6-yield4 is this patch (queued for mainline? and in mm?):
> http://lkml.org/lkml/2007/9/19/409

wrt. yield4 did you set /proc/sys/kernel/sched_compat_yield to 1?
(with sched_compat_yield at 0, which is the default, nothing
changes)

which one would be your favorite kernel? To me yield4 looks pretty good.

> of interest I found:
> * rc6-mm1 is not always fair (see "ManyThreadsYieldOften" tests) - the
> only one to have almost half the threads already finished at the half
> way point when yielding often. Also slower for the "RandomSleep".
> * increasing latency makes a noticeable difference (see "ShortPause")
> it can be more fair, but it also makes it a lot more erratic (see
> "Yield" tests)
> * most changes are only noticeable with a large number for threads (look
> for 'ManyThreads' in the filename)

i'm wondering, how easy would it be for you to test the sched-devel.git
tree? If you haven't used git before, then first install the 'git'
package, then do:

git-clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git linux-2.6.git
cd linux-2.6.git
git-pull git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git

> PS: now testing -rc8, also added a test that consumes memory in each
> thread. also now recording context switches and idle time.

ok.

Ingo