Hi Ingo and rest:
I have been playing around with the sched_rt_runtime_us cap that can be
used to limit the amount of CPU time allocated towards scheduling rt
group threads. I am using 2.6.26 with CONFIG_GROUP_SCHED disabled (we
use only the root user in our embedded setup). I have no other CPU
intensive workloads (RT or otherwise) running on my system. I have
changed no other scheduling parameters from /proc.
I have written a small test program that:
(a) forks two threads, one SCHED_FIFO and one SCHED_OTHER (this thread
is reniced to -20) and ties both of them to a specific core.
(b) runs both the threads in a tight loop (same number of iterations for
both threads) until the SCHED_FIFO thread terminates.
(c) calculates the number of completed iterations of the regular
SCHED_OTHER thread against the fixed number of iterations of the
SCHED_FIFO thread. It then calculates a percentage based on that.
I am running the above workload against varying sched_rt_runtime_us
values (200 ms to 700 ms) keeping the sched_rt_period_us constant at
1000 ms. I have also experimented a little bit by decreasing the value
of sched_rt_period_us (thus increasing the sched granularity) with no
apparent change in behavior.
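(For each data point, the values were simply written into
/proc/sys/kernel/sched_rt_runtime_us and
/proc/sys/kernel/sched_rt_period_us between runs. As a sketch, the C
equivalent of echoing the values in by hand would be something like:)

/* set_rt_cap.c - sketch only: write the RT throttling knobs.
 * Equivalent to:
 *   echo <period_us>  > /proc/sys/kernel/sched_rt_period_us
 *   echo <runtime_us> > /proc/sys/kernel/sched_rt_runtime_us
 */
#include <stdio.h>
#include <stdlib.h>

static void write_knob(const char *path, long us)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		exit(1);
	}
	fprintf(f, "%ld\n", us);
	fclose(f);
}

int main(int argc, char *argv[])
{
	if (argc != 3) {
		fprintf(stderr, "Usage: %s <runtime_us> <period_us>\n",
			argv[0]);
		return 1;
	}
	/* write the period first; as I understand it the kernel
	 * rejects runtime > period */
	write_knob("/proc/sys/kernel/sched_rt_period_us", atol(argv[2]));
	write_knob("/proc/sys/kernel/sched_rt_runtime_us", atol(argv[1]));
	return 0;
}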
My observations are listed in tabular form:
sched_rt_runtime_us /     # of completed iterations of reg thread /
sched_rt_period_us        # of iterations of RT thread (in %)

0.2                       100 % (regular thread completed all its iterations)
0.3                        73 %
0.4                        45 %
0.5                        17 %
0.6                         0 % (SCHED_OTHER thread completely throttled; never ran)
0.7                         0 %
This result kind of baffles me. Even when we cap the RT group to a
fraction of 0.6 of overall CPU time, the remaining 0.4 *should* still
be available for running regular threads. So my SCHED_OTHER thread
*should* make some progress, as opposed to being completely throttled.
Similarly, with any fraction less than 0.5, the SCHED_OTHER thread
should complete before SCHED_FIFO.
I do not have an easy way to verify my results on the latest kernel
(2.6.31). Were there any regressions in the scheduling subsystem in
2.6.26? Can this behavior be explained? Do we need to tweak any other
/proc parameters?
Cheers,
Ani
Hi again:
I am copying my test code here. I am really hoping to get some answers/
pointers. If there are whitespace/formatting issues in this mail,
please let me know. I am using an alternate mailer.
Cheers,
Ani
/* Test code to experiment with the CPU allocation cap for a SCHED_FIFO
 * RT thread spinning in a tight loop. Yes, you read it right: an RT
 * thread in a tight loop.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>	/* atoi() */
#include <unistd.h>	/* nice() */
#include <limits.h>
#include <assert.h>

/* progress counter shared between the threads; volatile so the
 * updates don't get optimized away */
volatile unsigned long reg_count;

void *fifo_thread(void *arg)
{
	int core = (int)(long)arg;
	int i, j;
	cpu_set_t cpuset;
	struct sched_param fifo_schedparam;
	unsigned long start, end;
	unsigned long fifo_count = 0;

	CPU_ZERO(&cpuset);
	CPU_SET(core, &cpuset);
	assert(sched_setaffinity(0, sizeof cpuset, &cpuset) == 0);

	/* RT priority 1 - lowest */
	fifo_schedparam.sched_priority = 1;
	assert(pthread_setschedparam(pthread_self(), SCHED_FIFO,
				     &fifo_schedparam) == 0);
	start = reg_count;
	printf("start reg_count=%lu\n", start);
	for (i = 0; i < 5; i++) {
		for (j = 0; j < UINT_MAX/10; j++)
			fifo_count++;
	}
	printf("\nRT thread has terminated\n");
	end = reg_count;
	printf("end reg_count=%lu\n", end);
	printf("delta reg count = %lu\n", end - start);
	printf("fifo count = %lu\n", fifo_count);
	printf("%% = %f\n", ((float)(end - start) * 100) / (float)fifo_count);
	return NULL;
}

void *reg_thread(void *arg)
{
	int core = (int)(long)arg;
	int i, j;
	int new_nice;
	cpu_set_t cpuset;

	/* let's renice it to the highest priority level */
	new_nice = nice(-20);
	printf("new nice value for regular thread=%d\n", new_nice);
	printf("regular thread dispatch(%d)\n", core);
	CPU_ZERO(&cpuset);
	CPU_SET(core, &cpuset);
	assert(sched_setaffinity(0, sizeof cpuset, &cpuset) == 0);
	for (i = 0; i < 5; i++) {
		for (j = 0; j < UINT_MAX/10; j++)
			reg_count++;
	}
	printf("\nregular thread has terminated\n");
	return NULL;
}

int main(int argc, char *argv[])
{
	int core;
	pthread_t tid1, tid2;
	pthread_attr_t attr;

	if (argc != 2) {
		fprintf(stderr, "Usage: %s <core-ID>\n", argv[0]);
		return -1;
	}
	reg_count = 0;
	core = atoi(argv[1]);
	pthread_attr_init(&attr);
	/* note: without pthread_attr_setinheritsched(&attr,
	 * PTHREAD_EXPLICIT_SCHED) the policy set in the attr is ignored
	 * and both threads start as SCHED_OTHER; fifo_thread therefore
	 * switches itself to SCHED_FIFO via pthread_setschedparam() */
	assert(pthread_attr_setschedpolicy(&attr, SCHED_FIFO) == 0);
	assert(pthread_create(&tid1, &attr, fifo_thread,
			      (void *)(long)core) == 0);
	assert(pthread_attr_setschedpolicy(&attr, SCHED_OTHER) == 0);
	assert(pthread_create(&tid2, &attr, reg_thread,
			      (void *)(long)core) == 0);
	pthread_join(tid1, NULL);
	pthread_join(tid2, NULL);
	return 0;
}
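For what it's worth, I build this with something like
'gcc -Wall -o rt-test <file>.c -lpthread' and run './rt-test <core-ID>'
as root, after writing the desired runtime/period values into the two
/proc files mentioned above.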
On Sat, Sep 5, 2009 at 02:55, Anirban Sinha<[email protected]> wrote:
> Hi Ingo and rest:
>
> I have been playing around with the sched_rt_runtime_us cap that can be
> used to limit the amount of CPU time allocated towards scheduling rt
> group threads. I am using 2.6.26 with CONFIG_GROUP_SCHED disabled (we
> use only the root user in our embedded setup). I have no other CPU
> intensive workloads (RT or otherwise) running on my system. I have
> changed no other scheduling parameters from /proc.
>
> I have written a small test program that:
Would you mind sending the source of this test?
Lucas De Marchi
> From: Anirban Sinha <[email protected]>
> Date: Fri, Sep 04, 2009 05:55:15PM -0700
>
> Hi Ingo and rest:
>
> I have been playing around with the sched_rt_runtime_us cap that can be
> used to limit the amount of CPU time allocated towards scheduling rt
> group threads. I am using 2.6.26 with CONFIG_GROUP_SCHED disabled (we
> use only the root user in our embedded setup). I have no other CPU
> intensive workloads (RT or otherwise) running on my system. I have
> changed no other scheduling parameters from /proc.
>
> I have written a small test program that:
>
> (a) forks two threads, one SCHED_FIFO and one SCHED_OTHER (this thread
> is reniced to -20) and ties both of them to a specific core.
> (b) runs both the threads in a tight loop (same number of iterations for
> both threads) until the SCHED_FIFO thread terminates.
> (c) calculates the number of completed iterations of the regular
> SCHED_OTHER thread against the fixed number of iterations of the
> SCHED_FIFO thread. It then calculates a percentage based on that.
>
> I am running the above workload against varying sched_rt_runtime_us
> values (200 ms to 700 ms) keeping the sched_rt_period_us constant at
> 1000 ms. I have also experimented a little bit by decreasing the value
> of sched_rt_period_us (thus increasing the sched granularity) with no
> apparent change in behavior.
>
> My observations are listed in tabular form:
>
> sched_rt_runtime_us /     # of completed iterations of reg thread /
> sched_rt_period_us        # of iterations of RT thread (in %)
>
> 0.2                       100 % (regular thread completed all its iterations)
> 0.3                        73 %
> 0.4                        45 %
> 0.5                        17 %
> 0.6                         0 % (SCHED_OTHER thread completely throttled; never ran)
> 0.7                         0 %
>
> This result kind of baffles me. Even when we cap the RT group to a
> fraction of 0.6 of overall CPU time, the remaining 0.4 *should* still be
> available for running regular threads. So my SCHED_OTHER *should* make
> some progress as opposed to being completely throttled. Similarly, with
> any fraction less than 0.5, the SCHED_OTHER should complete before
> SCHED_FIFO.
>
> I do not have an easy way to verify my results over the latest kernel
> (2.6.31). Were there any regressions in the scheduling subsystem in
> 2.6.26? Can this behavior be explained? Do we need to tweak any other
> /proc parameters?
>
You say you pin the threads to a single core: how many cores does your
system have?
I don't know if 2.6.26 had anything wrong (from a quick look the relevant
code seems similar to what we have now), but something like that can be
the consequence of the runtime migration logic moving bandwidth from a
second core to the one executing the two tasks.
If this is the case, this behavior is the expected one: the scheduler
tries to reduce the number of migrations, concentrating the bandwidth
of rt tasks on a single core. With your workload it doesn't work well
because runtime migration has freed the other core(s) from rt bandwidth,
so those cores are available to SCHED_OTHER tasks, but your SCHED_OTHER
thread is pinned and cannot make use of them.
> You say you pin the threads to a single core: how many cores does your
> system have?
>
> If this is the case, this behavior is the expected one, the scheduler
> tries to reduce the number of migrations, concentrating the bandwidth
> of rt tasks on a single core. With your workload it doesn't work well
> because runtime migration has freed the other core(s) from rt bandwidth,
> so these cores are available to SCHED_OTHER ones, but your SCHED_OTHER
> thread is pinned and cannot make use of them.
Indeed. I've tested this same test program in a single core machine and it
produces the expected behavior:
rt_runtime_us / rt_period_us % loops executed in SCHED_OTHER
95% 4.48%
60% 54.84%
50% 86.03%
40% OTHER completed first
Lucas De Marchi
> You say you pin the threads to a single core: how many cores does
> your system have?

The results I sent you were on a dual core blade.

> If this is the case, this behavior is the expected one, the scheduler
> tries to reduce the number of migrations, concentrating the bandwidth
> of rt tasks on a single core. With your workload it doesn't work well
> because runtime migration has freed the other core(s) from rt
> bandwidth, so these cores are available to SCHED_OTHER ones, but your
> SCHED_OTHER thread is pinned and cannot make use of them.
But I ran the same routine on a quad core blade, and the results this
time were:
rt_runtime/rt_period % of iterations of reg thrd against rt thrd
0.20 46%
0.25 18%
0.26 7%
0.3 0%
0.4 0%
(rest of the cases) 0%
So if the scheduler is concentrating all rt bandwidth on one core, it
should effectively be 0.2 * 4 = 0.8 for this core. Hence, we should
see a percentage closer to 20%, but it seems to be more than double
that. And at ~0.25, the regular thread should make no progress, yet it
seems it does make a little progress.
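Just to check my own understanding of the borrowing idea, here is a
back-of-envelope model I threw together (purely illustrative; the
assumption that every other core donates its entire unused budget, up
to the full period, is mine - I have not verified it against
sched_rt.c):

/* toy-model.c: crude model of RT runtime concentration on one core */
#include <stdio.h>

int main(void)
{
	const int ncpus = 4;
	const double caps[] = { 0.20, 0.25, 0.30 };
	int i;

	for (i = 0; i < 3; i++) {
		/* assume all unused budget migrates to the pinned core */
		double rt = ncpus * caps[i];
		double other, pct;

		if (rt > 1.0)		/* cannot exceed the period */
			rt = 1.0;
		other = 1.0 - rt;	/* share left for SCHED_OTHER */
		/* reg iterations per RT iteration, in percent */
		pct = 100.0 * other / rt;
		if (pct > 100.0)
			pct = 100.0;
		printf("cap %.2f -> rt share %.2f, predicted reg/rt = %.0f%%\n",
		       caps[i], rt, pct);
	}
	return 0;
}

This predicts 25% at the 0.2 cap and total starvation from 0.25 up, so
the 46% and 18% I measured are roughly double what pure concentration
would give.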
On Sep 5, 3:50 pm, Lucas De Marchi <[email protected]> wrote:
>
> Indeed. I've tested this same test program in a single core machine and it
> produces the expected behavior:
>
> rt_runtime_us / rt_period_us     % loops executed in SCHED_OTHER
> 95%                              4.48%
> 60%                              54.84%
> 50%                              86.03%
> 40%                              OTHER completed first
>
Hmm. This does seem to indicate that there is some kind of
relationship with SMP. So I wonder whether there is a way to turn this
'RT bandwidth accumulation' heuristic off. I did an
echo 0 > /proc/sys/kernel/sched_migration_cost
but results were identical to previous.
I figure that if I set it to zero, the regular sched-fair (non-RT)
tasks will be treated as not being cache hot and hence susceptible to
migration. From the code it looks like sched-rt tasks are always
treated as cache cold? Mind you, I have not yet looked into the code
very rigorously. I knew the O(1) scheduler relatively well, but I have
only just begun digging into the new CFS scheduler code.
On a side note, why is there no documentation explaining the
sched_migration_cost tuning knob? It would be nice to have one - at
least where the sysctl variable is defined.
--Ani
On Sat, 2009-09-05 at 19:32 -0700, Ani wrote:
> On Sep 5, 3:50 pm, Lucas De Marchi <[email protected]> wrote:
> >
> > Indeed. I've tested this same test program in a single core machine and it
> > produces the expected behavior:
> >
> > rt_runtime_us / rt_period_us % loops executed in SCHED_OTHER
> > 95% 4.48%
> > 60% 54.84%
> > 50% 86.03%
> > 40% OTHER completed first
> >
>
> Hmm. This does seem to indicate that there is some kind of
> relationship with SMP. So I wonder whether there is a way to turn this
> 'RT bandwidth accumulation' heuristic off.
No there isn't, but maybe there should be, since this isn't the first
time it's come up. One pro argument is that pinned tasks are thoroughly
screwed when an RT hog lands on their runqueue. On the con side, the
whole RT bandwidth restriction thing is intended (AFAIK) to allow an
admin to regain control should RT app go insane, which the default 5%
aggregate accomplishes just fine.
Dunno. Fly or die little patchlet (toss).
sched: allow the user to disable RT bandwidth aggregation.
Signed-off-by: Mike Galbraith <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
LKML-Reference: <new-submission>
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8736ba1..6e6d4c7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1881,6 +1881,7 @@ static inline unsigned int get_sysctl_timer_migration(void)
#endif
extern unsigned int sysctl_sched_rt_period;
extern int sysctl_sched_rt_runtime;
+extern int sysctl_sched_rt_bandwidth_aggregate;
int sched_rt_handler(struct ctl_table *table, int write,
struct file *filp, void __user *buffer, size_t *lenp,
diff --git a/kernel/sched.c b/kernel/sched.c
index c512a02..ca6a378 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -864,6 +864,12 @@ static __read_mostly int scheduler_running;
*/
int sysctl_sched_rt_runtime = 950000;
+/*
+ * aggregate bandwidth, ie allow borrowing from neighbors when
+ * bandwidth for an individual runqueue is exhausted.
+ */
+int sysctl_sched_rt_bandwidth_aggregate = 1;
+
static inline u64 global_rt_period(void)
{
return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 2eb4bd6..75daf88 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -495,6 +495,9 @@ static int balance_runtime(struct rt_rq *rt_rq)
{
int more = 0;
+ if (!sysctl_sched_rt_bandwidth_aggregate)
+ return 0;
+
if (rt_rq->rt_time > rt_rq->rt_runtime) {
spin_unlock(&rt_rq->rt_runtime_lock);
more = do_balance_runtime(rt_rq);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index cdbe8d0..0ad08e5 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -368,6 +368,14 @@ static struct ctl_table kern_table[] = {
},
{
.ctl_name = CTL_UNNUMBERED,
+ .procname = "sched_rt_bandwidth_aggregate",
+ .data = &sysctl_sched_rt_bandwidth_aggregate,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &sched_rt_handler,
+ },
+ {
+ .ctl_name = CTL_UNNUMBERED,
.procname = "sched_compat_yield",
.data = &sysctl_sched_compat_yield,
.maxlen = sizeof(unsigned int),
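If it flies, toggling at runtime would then be just:
echo 0 > /proc/sys/kernel/sched_rt_bandwidth_aggregate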
On Sun, 2009-09-06 at 08:32 +0200, Mike Galbraith wrote:
> On Sat, 2009-09-05 at 19:32 -0700, Ani wrote:
> > On Sep 5, 3:50 pm, Lucas De Marchi <[email protected]> wrote:
> > >
> > > Indeed. I've tested this same test program in a single core machine and it
> > > produces the expected behavior:
> > >
> > > rt_runtime_us / rt_period_us % loops executed in SCHED_OTHER
> > > 95% 4.48%
> > > 60% 54.84%
> > > 50% 86.03%
> > > 40% OTHER completed first
> > >
> >
> > Hmm. This does seem to indicate that there is some kind of
> > relationship with SMP. So I wonder whether there is a way to turn this
> > 'RT bandwidth accumulation' heuristic off.
>
> No there isn't, but maybe there should be, since this isn't the first
> time it's come up. One pro argument is that pinned tasks are thoroughly
> screwed when an RT hog lands on their runqueue. On the con side, the
> whole RT bandwidth restriction thing is intended (AFAIK) to allow an
> admin to regain control should RT app go insane, which the default 5%
> aggregate accomplishes just fine.
>
> Dunno. Fly or die little patchlet (toss).
btw, a _kinda sorta_ pro is that it can prevent IO lockups like the
below. Seems kjournald can end up depending on kblockd/3, which ain't
going anywhere with that 100% RT hog in the way, so the whole box is
fairly hosed. (much better would be to wake some other kblockd)
top - 12:01:49 up 56 min, 20 users, load average: 8.01, 4.96, 2.39
Tasks: 304 total, 4 running, 300 sleeping, 0 stopped, 0 zombie
Cpu(s): 25.8%us, 0.3%sy, 0.0%ni, 0.0%id, 73.7%wa, 0.3%hi, 0.0%si, 0.0%st
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
13897 root -2 0 7920 592 484 R 100 0.0 1:13.43 3 xx
12716 root 20 0 8868 1328 860 R 1 0.0 0:01.44 0 top
14 root 15 -5 0 0 0 R 0 0.0 0:00.02 3 events/3
94 root 15 -5 0 0 0 R 0 0.0 0:00.00 3 kblockd/3
1212 root 15 -5 0 0 0 D 0 0.0 0:00.04 2 kjournald
14393 root 20 0 9848 2296 756 D 0 0.1 0:00.01 0 make
14404 root 20 0 38012 25m 5552 D 0 0.8 0:00.21 1 cc1
14405 root 20 0 20220 8852 2388 D 0 0.3 0:00.02 1 as
14437 root 20 0 24132 10m 2680 D 0 0.3 0:00.06 2 cc1
14448 root 20 0 18324 1724 1240 D 0 0.1 0:00.00 2 cc1
14452 root 20 0 12540 792 656 D 0 0.0 0:00.00 2 mv
> From: Anirban Sinha <[email protected]>
> Date: Sat, Sep 05, 2009 05:47:39PM -0700
>
> > You say you pin the threads to a single core: how many cores does
> > your system have?
>
> The results I sent you were on a dual core blade.
>
> > If this is the case, this behavior is the expected one, the scheduler
> > tries to reduce the number of migrations, concentrating the bandwidth
> > of rt tasks on a single core. With your workload it doesn't work
> > well because runtime migration has freed the other core(s) from rt
> > bandwidth, so these cores are available to SCHED_OTHER ones, but your
> > SCHED_OTHER thread is pinned and cannot make use of them.
>
> But, I ran the same routine on a quadcore blade and the results this
> time were:
>
> rt_runtime/rt_period % of iterations of reg thrd against rt thrd
>
> 0.20 46%
> 0.25 18%
> 0.26 7%
> 0.3 0%
> 0.4 0%
> (rest of the cases) 0%
>
> So if the scheduler is concentrating all rt bandwidth to one core, it
> should be effectively 0.2 * 4 = 0.8 for this core. Hence, we should
> see the percentage closer to 20% but it seems that it's more than
> double. At ~0.25, the regular thread should make no progress, but it
> seems it does make a little progress.
So this can be a bug. While it is possible that the kernel does
not succeed in migrating all the runtime (e.g., due to a (system) rt
task consuming some bandwidth on a remote cpu), 46% instead of 20%
is too much.
Running your program I'm unable to reproduce the same issue on a recent
kernel here; for 25ms over 100ms across several runs I get less than 2%.
This number increases, reaching your values, only when using short
periods (where the meaning for short depends on your HZ value), which
is something to be expected, due to the fact that rt throttling uses
the tick to charge runtimes to tasks.
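(For instance, with HZ=100 a tick is 10 ms, so a 25 ms budget in a
100 ms period is charged in 10 ms chunks; being off by even one tick
per period is 10 ms out of 25 ms of budget, i.e. a 40% error.)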
Looking at the git history, there have been several bugfixes to the rt
bandwidth code from 2.6.26, one of them seems to be strictly related to
runtime accounting with your setup:
commit f6121f4f8708195e88cbdf8dd8d171b226b3f858
Author: Dario Faggioli <[email protected]>
Date: Fri Oct 3 17:40:46 2008 +0200
sched_rt.c: resch needed in rt_rq_enqueue() for the root rt_rq
On Sun, 2009-09-06 at 07:53 -0700, Anirban Sinha wrote:
>
>
>
> > Seems kjournald can end up depending on kblockd/3, which ain't
> > going anywhere with that 100% RT hog in the way,
>
> I think in the past AKPM's response to this has been "just don't do
> it", i.e, don't hog the CPU with an RT thread.
Oh yeah, sure. Best to run RT oinkers on isolated cpus. It just
surprised me that the 100% compute RT cpu became involved in IO.
-Mike
> Running your program I'm unable to reproduce the same issue on a
> recent kernel here; for 25ms over 100ms across several runs I get
> less than 2%.
> This number increases, reaching your values, only when using short
> periods (where the meaning for short depends on your HZ value),

In our kernel, the jiffies are configured as 100 HZ.

> which is something to be expected, due to the fact that rt throttling
> uses the tick to charge runtimes to tasks.

Hmm. I see. I understand that.

> Looking at the git history, there have been several bugfixes to the
> rt bandwidth code from 2.6.26, one of them seems to be strictly
> related to runtime accounting with your setup:

I will apply these patches on Tuesday and rerun the tests.
> Dunno. Fly or die little patchlet (toss).
> sched: allow the user to disable RT bandwidth aggregation.
Hmm. Interesting. With this change, my results are as follows:
rt_runtime/rt_period % of reg iterations
0.2 100%
0.25 100%
0.3 100%
0.4 100%
0.5 82%
0.6 66%
0.7 54%
0.8 46%
0.9 38.5%
0.95 32%
These results are on a quad core blade. Do they still make sense though?
Can anyone else run the same tests on a quad core over the latest kernel?
I will patch our 2.6.26 kernel with upstream fixes and rerun these
tests on tuesday.
Ani
On 2009-09-06, at 8:09 AM, Mike Galbraith wrote:
> On Sun, 2009-09-06 at 07:53 -0700, Anirban Sinha wrote:
>>
>>
>>
>>> Seems kjournald can end up depending on kblockd/3, which ain't
>>> going anywhere with that 100% RT hog in the way,
>>
>> I think in the past AKPM's response to this has been "just don't do
>> it", i.e, don't hog the CPU with an RT thread.
>
> Oh yeah, sure. Best to run RT oinkers on isolated cpus.
Correct. Unfortunately, at some places the application coders do
stupid things, and then the onus falls on the kernel guys to make
things 'just work'.
I would not have any problem if such a cap mechanism did not exist at
all. However, since we do have such a tuning knob, I would say let's
make it do what it is supposed to do. The documentation says "0.05s to
be used by SCHED_OTHER". Unfortunately, it never hints that if your
thread is tied to the RT core, you are screwed. The bandwidth
accumulation logic will virtually kill all the remaining SCHED_OTHER
threads well before that 95% cap is reached. That doesn't quite seem
right. At the very least, can we have this clearly written in
sched-rt-group.txt?
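Something along the lines of: "On SMP, unused rt runtime can be
migrated between cpus in the same root domain, so rt tasks may consume
up to the full period on one cpu. A SCHED_OTHER task pinned to that
cpu can then be starved long before the global cap is reached."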
Cheers,
Ani
On Sun, 2009-09-06 at 17:18 -0700, Anirban Sinha wrote:
>
>
> > Dunno. Fly or die little patchlet (toss).
>
> > sched: allow the user to disable RT bandwidth aggregation.
>
> Hmm. Interesting. With this change, my results are as follows:
>
> rt_runtime/rt_period % of reg iterations
>
> 0.2 100%
> 0.25 100%
> 0.3 100%
> 0.4 100%
> 0.5 82%
> 0.6 66%
> 0.7 54%
> 0.8 46%
> 0.9 38.5%
> 0.95 32%
>
>
> These results are on a quad core blade. Do they still make sense
> though?
> Can anyone else run the same tests on a quad core over the latest
> kernel? I will patch our 2.6.26 kernel with upstream fixes and rerun
> these tests on tuesday.
I tested tip (v2.6.31-rc9-1357-ge6a3cd0) with a little perturbation
measurement proglet on an isolated Q6600 core.
10s measurement interval results:
sched_rt_runtime_us RT utilization
950000 94.99%
750000 75.00%
500000 50.04%
250000 25.02%
50000 5.03%
Seems to work fine here.
-Mike
On Sun, 2009-09-06 at 08:32 +0200, Mike Galbraith wrote:
> On Sat, 2009-09-05 at 19:32 -0700, Ani wrote:
> > On Sep 5, 3:50 pm, Lucas De Marchi <[email protected]> wrote:
> > >
> > > Indeed. I've tested this same test program in a single core machine and it
> > > produces the expected behavior:
> > >
> > > rt_runtime_us / rt_period_us % loops executed in SCHED_OTHER
> > > 95% 4.48%
> > > 60% 54.84%
> > > 50% 86.03%
> > > 40% OTHER completed first
> > >
> >
> > Hmm. This does seem to indicate that there is some kind of
> > relationship with SMP. So I wonder whether there is a way to turn this
> > 'RT bandwidth accumulation' heuristic off.
>
> No there isn't..
Actually there is, use cpusets to carve the system into partitions.
On Mon, 2009-09-07 at 09:59 +0200, Peter Zijlstra wrote:
> On Sun, 2009-09-06 at 08:32 +0200, Mike Galbraith wrote:
> > On Sat, 2009-09-05 at 19:32 -0700, Ani wrote:
> > > On Sep 5, 3:50 pm, Lucas De Marchi <[email protected]> wrote:
> > > >
> > > > Indeed. I've tested this same test program in a single core machine and it
> > > > produces the expected behavior:
> > > >
> > > > rt_runtime_us / rt_period_us % loops executed in SCHED_OTHER
> > > > 95% 4.48%
> > > > 60% 54.84%
> > > > 50% 86.03%
> > > > 40% OTHER completed first
> > > >
> > >
> > > Hmm. This does seem to indicate that there is some kind of
> > > relationship with SMP. So I wonder whether there is a way to turn this
> > > 'RT bandwidth accumulation' heuristic off.
> >
> > No there isn't..
>
> Actually there is, use cpusets to carve the system into partitions.
Yeah, I stand corrected. I tend to think in terms of the dirt simplest
configuration only.
-Mike
On 2009-09-07, at 9:42 AM, Anirban Sinha wrote:
>
>
>
> -----Original Message-----
> From: Peter Zijlstra [mailto:[email protected]]
> Sent: Mon 9/7/2009 12:59 AM
> To: Mike Galbraith
> Cc: Anirban Sinha; Lucas De Marchi; [email protected];
> Ingo Molnar
> Subject: Re: question on sched-rt group allocation cap:
> sched_rt_runtime_us
>
> On Sun, 2009-09-06 at 08:32 +0200, Mike Galbraith wrote:
> > On Sat, 2009-09-05 at 19:32 -0700, Ani wrote:
> > > On Sep 5, 3:50 pm, Lucas De Marchi <[email protected]>
> wrote:
> > > >
> > > > Indeed. I've tested this same test program in a single core
> machine and it
> > > > produces the expected behavior:
> > > >
> > > > rt_runtime_us / rt_period_us % loops executed in SCHED_OTHER
> > > > 95% 4.48%
> > > > 60% 54.84%
> > > > 50% 86.03%
> > > > 40% OTHER completed first
> > > >
> > >
> > > Hmm. This does seem to indicate that there is some kind of
> > > relationship with SMP. So I wonder whether there is a way to
> turn this
> > > 'RT bandwidth accumulation' heuristic off.
> >
> > No there isn't..
>
> Actually there is, use cpusets to carve the system into partitions.
hmm. ok. I looked at the code a little bit. It seems to me that the
'borrowing' of RT runtimes occurs only from rt runqueues belonging to
the same root domain, and partition_sched_domains() is the only
external interface that can be used to create a root domain out of a
CPU set. But then I think it needs CGROUPS/USER groups enabled, right?
--Ani
On 2009-09-07, at 9:44 AM, Anirban Sinha wrote:
>
>
>
> -----Original Message-----
> From: Mike Galbraith [mailto:[email protected]]
> Sent: Sun 9/6/2009 11:54 PM
> To: Anirban Sinha
> Cc: Lucas De Marchi; [email protected]; Peter Zijlstra;
> Ingo Molnar
> Subject: RE: question on sched-rt group allocation cap:
> sched_rt_runtime_us
>
> On Sun, 2009-09-06 at 17:18 -0700, Anirban Sinha wrote:
> >
> >
> > > Dunno. Fly or die little patchlet (toss).
> >
> > > sched: allow the user to disable RT bandwidth aggregation.
> >
> > Hmm. Interesting. With this change, my results are as follows:
> >
> > rt_runtime/rt_period % of reg iterations
> >
> > 0.2 100%
> > 0.25 100%
> > 0.3 100%
> > 0.4 100%
> > 0.5 82%
> > 0.6 66%
> > 0.7 54%
> > 0.8 46%
> > 0.9 38.5%
> > 0.95 32%
> >
> >
> > These results are on a quad core blade. Do they still make sense
> > though?
> > Can anyone else run the same tests on a quad core over the latest
> > kernel? I will patch our 2.6.26 kernel with upstream fixes and rerun
> > these tests on tuesday.
>
> I tested tip (v2.6.31-rc9-1357-ge6a3cd0) with a little perturbation
> measurement proglet on an isolated Q6600 core.
Thanks Mike. Is this on a single core machine (or one core carved out
of N)? We may have some newer patches missing from the 2.6.26 kernel
that fixes some accounting bugs. I will do a review and rerun the test
after applying the upstream patches.
Ani
On Tue, 2009-09-08 at 00:08 -0700, Anirban Sinha wrote:
> > Actually there is, use cpusets to carve the system into partitions.
>
> hmm. ok. I looked at the code a little bit. It seems to me that the
> 'borrowing' of RT runtimes occurs only from rt runqueues belonging to
> the same root domain. And partition_sched_domains() is the only
> external interface that can be used to create root domain out of a CPU
> set. But then I think it needs to have CGROUPS/USER groups enabled?
> Right?
No, you need cpusets: you create a partition by disabling load-balancing
on the top set, thereby only allowing load-balancing within the
children.
The runtime sharing is a form of load-balancing.
CONFIG_CPUSETS=y
Documentation/cgroups/cpusets.txt
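Roughly: mount the cpuset filesystem, set sched_load_balance to 0 in
the top set, then create child sets with disjoint cpus and stick the
RT hog in one and everything else in the other (exact file names are
in the doc above).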
On Tue, 2009-09-08 at 00:10 -0700, Anirban Sinha wrote:
> > I tested tip (v2.6.31-rc9-1357-ge6a3cd0) with a little perturbation
> > measurement proglet on an isolated Q6600 core.
>
>
> Thanks Mike. Is this on a single core machine (or one core carved out
> of N)? We may have some newer patches missing from the 2.6.26 kernel
> that fixes some accounting bugs. I will do a review and rerun the test
> after applying the upstream patches.
Q6600 is a quad, test was 1 carved out of 4 (thought I said that).
-Mike
On 2009-09-08, at 1:42 AM, Peter Zijlstra wrote:
> On Tue, 2009-09-08 at 00:08 -0700, Anirban Sinha wrote:
>
>>> Actually there is, use cpusets to carve the system into partitions.
>>
>> hmm. ok. I looked at the code a little bit. It seems to me that the
>> 'borrowing' of RT runtimes occurs only from rt runqueues belonging to
>> the same root domain. And partition_sched_domains() is the only
>> external interface that can be used to create root domain out of a
>> CPU
>> set. But then I think it needs to have CGROUPS/USER groups enabled?
>> Right?
>
> No you need cpusets, you create a partition by disabling load-
> balancing
> on the top set, thereby only allowing load-balancing withing the
> children.
>
Ah I see. Thanks for the clarification.
> The runtime sharing is a form of load-balancing.
sure.
>
> CONFIG_CPUSETS=y
Hmm. Ok. I guess what I meant but did not articulate properly (because
I was thinking in terms of code) was that CPUSETS needs CGROUPS support:
config CPUSETS
bool "Cpuset support"
depends on CGROUPS
Anyway, that's fine. I'll dig around the code a little bit more.
>
> Documentation/cgroups/cpusets.txt
Thanks for the pointer. My bad, I did not care to see the docs. I tend
to ignore docs and read code instead. :D
>
> Looking at the git history, there have been several bugfixes to the rt
> bandwidth code from 2.6.26, one of them seems to be strictly related
> to
> runtime accounting with your setup:
>
> commit f6121f4f8708195e88cbdf8dd8d171b226b3f858
> Author: Dario Faggioli <[email protected]>
> Date: Fri Oct 3 17:40:46 2008 +0200
>
> sched_rt.c: resch needed in rt_rq_enqueue() for the root rt_rq
Hmm. Indeed, there do seem to have been quite a few fixes to the
accounting logic. I back-patched our 2.6.26 kernel with the upstream
patches that seemed relevant, and my test code now yields reasonable
results. Applying the above patch did not fix it though, which kind of
makes sense, since from the commit log it seems that the patch fixed
cases where the RT task was getting *less* CPU than its bandwidth
allocation, as opposed to more, as in my case. I haven't bisected the
patchset to figure out exactly which one fixed it, but I intend to do
that later just for fun.
For completeness, these are the results after applying the upstream
patches *and* disabling bandwidth borrowing logic on my 2.6.26 kernel
running on a quad core blade with CONFIG_GROUP_SCHED turned off (100 HZ
jiffies):
rt_runtime/rt_period     % of SCHED_OTHER iterations
.40 100%
.50 74%
.60 47%
.70 31%
.80 18%
.90 8%
.95 4%
--Ani
On 2009-09-08, at 10:32 AM, Anirban Sinha wrote:
>
>
>
> -----Original Message-----
> From: Mike Galbraith [mailto:[email protected]]
> Sent: Sat 9/5/2009 11:32 PM
> To: Anirban Sinha
> Cc: Lucas De Marchi; [email protected]; Peter Zijlstra;
> Ingo Molnar
> Subject: Re: question on sched-rt group allocation cap:
> sched_rt_runtime_us
>
> On Sat, 2009-09-05 at 19:32 -0700, Ani wrote:
> > On Sep 5, 3:50 pm, Lucas De Marchi <[email protected]>
> wrote:
> > >
> > > Indeed. I've tested this same test program in a single core
> machine and it
> > > produces the expected behavior:
> > >
> > > rt_runtime_us / rt_period_us % loops executed in SCHED_OTHER
> > > 95% 4.48%
> > > 60% 54.84%
> > > 50% 86.03%
> > > 40% OTHER completed first
> > >
> >
> > Hmm. This does seem to indicate that there is some kind of
> > relationship with SMP. So I wonder whether there is a way to turn
> this
> > 'RT bandwidth accumulation' heuristic off.
>
> No there isn't, but maybe there should be, since this isn't the first
> time it's come up. One pro argument is that pinned tasks are
> thoroughly
> screwed when an RT hog lands on their runqueue. On the con side, the
> whole RT bandwidth restriction thing is intended (AFAIK) to allow an
> admin to regain control should RT app go insane, which the default 5%
> aggregate accomplishes just fine.
>
> Dunno. Fly or die little patchlet (toss).
So it would be nice to have a knob like this when CGROUPS is disabled
(it says 'say N when unsure' :)). CPUSETS depends on CGROUPS.
>
> sched: allow the user to disable RT bandwidth aggregation.
>
> Signed-off-by: Mike Galbraith <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
Verified-by: Anirban Sinha <[email protected]>
> LKML-Reference: <new-submission>
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 8736ba1..6e6d4c7 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1881,6 +1881,7 @@ static inline unsigned int get_sysctl_timer_migration(void)
> #endif
> extern unsigned int sysctl_sched_rt_period;
> extern int sysctl_sched_rt_runtime;
> +extern int sysctl_sched_rt_bandwidth_aggregate;
>
> int sched_rt_handler(struct ctl_table *table, int write,
> struct file *filp, void __user *buffer, size_t *lenp,
> diff --git a/kernel/sched.c b/kernel/sched.c
> index c512a02..ca6a378 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -864,6 +864,12 @@ static __read_mostly int scheduler_running;
> */
> int sysctl_sched_rt_runtime = 950000;
>
> +/*
> + * aggregate bandwidth, ie allow borrowing from neighbors when
> + * bandwidth for an individual runqueue is exhausted.
> + */
> +int sysctl_sched_rt_bandwidth_aggregate = 1;
> +
> static inline u64 global_rt_period(void)
> {
> return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
> diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
> index 2eb4bd6..75daf88 100644
> --- a/kernel/sched_rt.c
> +++ b/kernel/sched_rt.c
> @@ -495,6 +495,9 @@ static int balance_runtime(struct rt_rq *rt_rq)
> {
> int more = 0;
>
> + if (!sysctl_sched_rt_bandwidth_aggregate)
> + return 0;
> +
> if (rt_rq->rt_time > rt_rq->rt_runtime) {
> spin_unlock(&rt_rq->rt_runtime_lock);
> more = do_balance_runtime(rt_rq);
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index cdbe8d0..0ad08e5 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -368,6 +368,14 @@ static struct ctl_table kern_table[] = {
> },
> {
> .ctl_name = CTL_UNNUMBERED,
> + .procname = "sched_rt_bandwidth_aggregate",
> + .data = &sysctl_sched_rt_bandwidth_aggregate,
> + .maxlen = sizeof(int),
> + .mode = 0644,
> + .proc_handler = &sched_rt_handler,
> + },
> + {
> + .ctl_name = CTL_UNNUMBERED,
> .procname = "sched_compat_yield",
> .data = &sysctl_sched_compat_yield,
> .maxlen = sizeof(unsigned int),
On Tue, 2009-09-08 at 10:41 -0700, Anirban Sinha wrote:
> On 2009-09-08, at 10:32 AM, Anirban Sinha wrote:
>
> > Dunno. Fly or die little patchlet (toss).
>
> So it would be nice to have a knob like this when CGROUPS is disabled
> (it say 'say N when unsure' :)). CPUSETS depends on CGROUPS.
Maybe. Short term hack. My current thoughts on the subject, after some
testing, are that the patchlet should just die, and pondering the larger
solution should happen.
-Mike
> Maybe. Short term hack. My current thoughts on the subject, after
> some testing, are that the patchlet should just die, and pondering
> the larger solution should happen.

Just curious, what is the larger solution? When everyone adapts to
using control groups?
On Tue, 2009-09-08 at 12:34 -0700, Anirban Sinha wrote:
> >Maybe. Short term hack. My current thoughts on the subject, after
> some
> >testing, is that the patchlet should just die, and pondering the larger
> >solution should happen.
>
> Just curious, what is the larger solution?
That's what needs pondering :)
-Mike