This is version 2 of the SCHED_ISO patch with the yield bug fixed and
code cleanups.
This patch for 2.6.11-rc1 provides a means of granting real-time
scheduling to unprivileged users, something increasingly desired for
multimedia workloads.
It does this by adding a new scheduling class called SCHED_ISO, or
Isochronous scheduling, which means "same time" scheduling. This class
does not require superuser privileges and is starvation free. The
scheduling class number 4 was chosen since there are quite a few userspace
applications already supporting 3 and 4 for SCHED_BATCH and SCHED_ISO
respectively on non-mainline kernels. As a way of immediately providing
support for current userspace apps, any application started by an
unprivileged user requesting SCHED_RR or SCHED_FIFO will be demoted to
SCHED_ISO. This may or may not be worth removing later.
The SCHED_ISO class runs as SCHED_RR effectively at a priority just
above all SCHED_NORMAL tasks and below all true real time tasks. Once a
cpu usage limit is exceeded by tasks of this class (per cpu), SCHED_ISO
tasks will then run as SCHED_NORMAL until the cpu usage drops to 90% of
the cpu limit.
By default the cpu limit is set to 70%, which the literature suggests
should provide good real-time behaviour for most applications without
gross unfairness. This cpu limit is calculated as a decaying average over
5 seconds. These limits are configurable via the tunables
/proc/sys/kernel/iso_cpu
/proc/sys/kernel/iso_period
iso_cpu can be set to 100, which would give all unprivileged users
access to unrestricted SCHED_RR behaviour. OS X provides a class similar
to SCHED_ISO and uses 90% as its cpu limit.
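To illustrate the limiting scheme, here is a small userspace mock-up of
the accounting described above (the constants mirror the defaults; the
names iso_ticks/iso_refractory follow the patch, but the decay arithmetic
here is illustrative, not the exact kernel code):

/* iso_sim.c -- build with: gcc -o iso_sim iso_sim.c */
#include <stdio.h>

#define HZ		1000	/* timer tick rate assumed for this sketch */
#define ISO_CPU		70	/* %cpu limit, cf. /proc/sys/kernel/iso_cpu */
#define ISO_PERIOD	5	/* seconds, cf. /proc/sys/kernel/iso_period */

static long iso_ticks;		/* decaying estimate of ISO cpu usage */
static int iso_refractory;	/* set while SCHED_ISO runs as SCHED_NORMAL */

/* One timer tick; iso_running is nonzero if a SCHED_ISO task was running. */
static void iso_tick(int iso_running)
{
	const long window = ISO_PERIOD * HZ;

	if (iso_running && !iso_refractory)
		iso_ticks += 100;		/* credit this tick as 100% */
	iso_ticks -= iso_ticks / window;	/* decay over the period */

	if (iso_ticks > ISO_CPU * window)
		iso_refractory = 1;	/* limit exceeded: demote the class */
	else if (iso_ticks < ISO_CPU * window * 9 / 10)
		iso_refractory = 0;	/* below 90% of the limit: restore */
}

int main(void)
{
	long t;

	for (t = 0; t < 20 * HZ; t++) {
		iso_tick(1);		/* model a full-time SCHED_ISO hog */
		if (t % HZ == 0)
			printf("t=%lds cpu~%ld%% refractory=%d\n", t / HZ,
			       iso_ticks / (ISO_PERIOD * HZ), iso_refractory);
	}
	return 0;
}

After an initial burst, the simulated hog oscillates: about a second
running isochronous, then roughly half a second demoted while the average
decays back under 90% of the limit.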
The sysrq-n combination, which converts all user real-time tasks to
SCHED_NORMAL, will also affect SCHED_ISO tasks.
Currently the round robin interval is set to 10ms, which is a
cache-friendly timeslice. It may be worth making this configurable or
smaller, and a FIFO variant of SCHED_ISO would also be possible.
For testing, the userspace tool schedtool, available here:
http://freequaos.host.sk/schedtool/
can be used as a wrapper to start SCHED_ISO tasks, for example:
schedtool -I -e xmms
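An application can also request the class directly. A minimal sketch
(the policy number 4 is defined by hand since vanilla headers lack it,
and since SCHED_ISO currently has no priority levels, 0 is passed):

/* iso_exec.c -- rough equivalent of `schedtool -I -e <program>` */
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

#ifndef SCHED_ISO
#define SCHED_ISO 4	/* as defined by the SCHED_ISO patch */
#endif

int main(int argc, char *argv[])
{
	struct sched_param sp = { .sched_priority = 0 };

	if (argc < 2) {
		fprintf(stderr, "usage: %s program [args...]\n", argv[0]);
		return 1;
	}
	/* No root needed; on this patch an unprivileged SCHED_RR or
	 * SCHED_FIFO request would be demoted to SCHED_ISO anyway. */
	if (sched_setscheduler(0, SCHED_ISO, &sp) == -1)
		perror("sched_setscheduler");
	execvp(argv[1], &argv[1]);
	perror("execvp");
	return 1;
}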
Patch also available here:
http://ck.kolivas.org/patches/SCHED_ISO/
Signed-off-by: Con Kolivas <[email protected]>
Hi Con
On Thu, 2005-01-20 at 09:39 +1100, Con Kolivas wrote:
> This is version 2 of the SCHED_ISO patch with the yield bug fixed and
> code cleanups.
Thanks for the update.
@@ -2406,6 +2489,10 @@ void scheduler_tick(void)
task_t *p = current;
rq->timestamp_last_tick = sched_clock();
+ if (iso_task(p) && !rq->iso_refractory)
+ inc_iso_ticks(rq, p);
+ else
+ dec_iso_ticks(rq, p);
scheduler_tick() is not only called by the timer interrupt but also from
the fork code. Is this intended? I think the accounting for
iso_refractory is wrong. Isn't calling it from
timer.c::update_process_times() better?
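Something like this placement is what I mean -- as a sketch only, with
account_iso_ticks() standing in for whatever helper does the accounting:

--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ void update_process_times(int user_tick)
 	struct task_struct *p = current;
 
+	/* account the ISO cpu limit on every real timer tick */
+	account_iso_ticks(p);
 	if (user_tick)
 		account_user_time(p, jiffies_to_cputime(1));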
And shouldn't real RT tasks also be counted? If RT tasks use 40% cpu you
can lock up the system as an unprivileged user with SCHED_ISO because it
doesn't reach the 70% cpu limit.
Further on, I see a fundamental problem with this accounting for
iso_refractory. What if I manage, as an unprivileged user, to run a
SCHED_ISO task which consumes all cpu and only sleeps very briefly during
the timer interrupt? I think this will nearly lock up or badly slow down
the system. The iso_cpu limit can't be guaranteed.
My simple yield DoS doesn't work anymore. But I found another way.
Running this as SCHED_ISO:
#include <stdio.h>
#include <unistd.h>
#include <sched.h>
#include <sys/time.h>
#include <sys/resource.h>
struct timeval tv;
int a, b, i0, i1;

int cpuusage()
{
	struct rusage ru;

	getrusage(RUSAGE_SELF, &ru);
	return ru.ru_utime.tv_usec + ru.ru_stime.tv_usec;
}

int main()
{
	while (1) {
		a = tv.tv_sec;
		b = tv.tv_usec;
		gettimeofday(&tv, 0);
		i0 = i1;
		i1 = cpuusage();
		if (i0 != i1) {
			// printf("%d.%06d\t%d.%06d\t%d\t%d\n",
			//	a, b, (int)tv.tv_sec, (int)tv.tv_usec, i0, i1);
		}
	}
}
It stalled the system for a few seconds and then it dropped to
SCHED_OTHER. Then start an additional SCHED_OTHER cpu hog (while true; do
: ; done). The system locks up after a few seconds.
sysrq-n causes a reboot.
utz
utz lehmann wrote:
> @@ -2406,6 +2489,10 @@ void scheduler_tick(void)
> task_t *p = current;
>
> rq->timestamp_last_tick = sched_clock();
> + if (iso_task(p) && !rq->iso_refractory)
> + inc_iso_ticks(rq, p);
> + else
> + dec_iso_ticks(rq, p);
>
> scheduler_tick() is not only called by the timer interrupt but also from
> the fork code. Is this intended? I think the accounting for
The call from the fork code only occurs if there is one millisecond of
time_slice left, so it will only very rarely be hit. I don't think this
accounting problem is worth worrying about.
> iso_refractory is wrong. Isn't calling it from
> timer.c::update_process_times() better?
>
> And shouldn't real RT tasks also be counted? If RT tasks use 40% cpu you
> can lock up the system as an unprivileged user with SCHED_ISO because it
> doesn't reach the 70% cpu limit.
Ah yes. Good point. Will add that to the equation.
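Probably something along these lines against the hunk above (a sketch,
untested):

	/* charge true RT tasks against the iso limit as well */
	if ((iso_task(p) || rt_task(p)) && !rq->iso_refractory)
		inc_iso_ticks(rq, p);
	else
		dec_iso_ticks(rq, p);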
> Further on, I see a fundamental problem with this accounting for
> iso_refractory. What if I manage, as an unprivileged user, to run a
> SCHED_ISO task which consumes all cpu and only sleeps very briefly during
> the timer interrupt? I think this will nearly lock up or badly slow down
> the system. The iso_cpu limit can't be guaranteed.
Right you are. The cpu accounting uses primitive on-interrupt run time
which as we know is not infallible. To extend this I'll have to keep a
timer based on the sched_clock which is already implemented. That's
something for me to work on.
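The direction would be roughly this -- charge precise runtimes when a
task gets descheduled instead of counting whole ticks (the rq->iso_ns
field is hypothetical):

	static inline void charge_iso_runtime(runqueue_t *rq, task_t *p)
	{
		unsigned long long now = sched_clock();

		if (iso_task(p) && !rq->iso_refractory)
			rq->iso_ns += now - p->timestamp;
		p->timestamp = now;
	}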
> sysrq-n causes a reboot.
And that will need looking into.
Thanks very much for your comments!
Cheers,
Con
Con Kolivas wrote:
> This is version 2 of the SCHED_ISO patch with the yield bug fixed and
> code cleanups.
...answering on this thread to consolidate the two branches of the email
thread.
Here are my results with SCHED_ISO v2 on a pentium-M 1.7GHz (all
powersaving features off):
SCHED_NORMAL:
awk: ./jack_test3_summary.awk:67: (FILENAME=- FNR=862) fatal: division
by zero attempted
Well, we won't bother looking at those results then. There were 38 XRUNs
that did make it into the parsed output.
SCHED_FIFO:
*********************************************
Timeout Count . . . . . . . . :( 0)
XRUN Count . . . . . . . . . : 4
Delay Count (>spare time) . . : 18
Delay Count (>1000 usecs) . . : 0
Delay Maximum . . . . . . . . : 6595 usecs
Cycle Maximum . . . . . . . . : 368 usecs
Average DSP Load. . . . . . . : 17.9 %
Average CPU System Load . . . : 3.4 %
Average CPU User Load . . . . : 13.7 %
Average CPU Nice Load . . . . : 0.0 %
Average CPU I/O Wait Load . . : 1.4 %
Average CPU IRQ Load . . . . : 0.6 %
Average CPU Soft-IRQ Load . . : 0.0 %
Average Interrupt Rate . . . : 1697.0 /sec
Average Context-Switch Rate . : 13334.1 /sec
*********************************************
SCHED_ISO (iso_cpu 70):
*********************************************
Timeout Count . . . . . . . . :( 0)
XRUN Count . . . . . . . . . : 5
Delay Count (>spare time) . . : 18
Delay Count (>1000 usecs) . . : 0
Delay Maximum . . . . . . . . : 6489 usecs
Cycle Maximum . . . . . . . . : 405 usecs
Average DSP Load. . . . . . . : 18.0 %
Average CPU System Load . . . : 3.3 %
Average CPU User Load . . . . : 13.7 %
Average CPU Nice Load . . . . : 0.0 %
Average CPU I/O Wait Load . . : 0.0 %
Average CPU IRQ Load . . . . : 0.6 %
Average CPU Soft-IRQ Load . . : 0.0 %
Average Interrupt Rate . . . : 1700.2 /sec
Average Context-Switch Rate . : 12457.2 /sec
*********************************************
Increasing iso_cpu did not change the results.
At least in my testing on my hardware, v2 is working as advertised. I
need results from more hardware configurations to know if priority
support is worth adding or not.
Cheers,
Con
Con Kolivas <[email protected]> writes:
> Con Kolivas wrote:
>
> Here are my results with SCHED_ISO v2 on a pentium-M 1.7GHz (all
> powersaving features off):
>
> Increasing iso_cpu did not change the results.
>
> At least in my testing on my hardware, v2 is working as advertised. I
> need results from more hardware configurations to know if priority
> support is worth adding or not.
Excellent. Judging by the DSP Load, your machine seems to run almost
twice as fast as my 1.5GHz Athlon (surprising). You might want to try
pushing it a bit harder by running more clients (2nd parameter,
default is 20).
Are you getting fairly consistent results running SCHED_ISO
repeatedly? That worked better for me after I fixed that bug in JACK
0.99.47, but I think there is still more variance than with
SCHED_FIFO.
--
joq
Jack O'Quin wrote:
> Con Kolivas <[email protected]> writes:
>
>
>>Con Kolivas wrote:
>>
>>Here are my results with SCHED_ISO v2 on a pentium-M 1.7GHz (all
>>powersaving features off):
>>
>>Increasing iso_cpu did not change the results.
>>
>>At least in my testing on my hardware, v2 is working as advertised. I
>>need results from more hardware configurations to know if priority
>>support is worth adding or not.
>
>
> Excellent. Judging by the DSP Load, your machine seems to run almost
> twice as fast as my 1.5GHz Athlon (surprising). You might want to try
Not really surprising; the 2MB cache makes this a damn fine cpu, if not
necessarily across the board :)
> pushing it a bit harder by running more clients (2nd parameter,
> default is 20).
Ask and ye shall receive.
> Are you getting fairly consistent results running SCHED_ISO
> repeatedly? That worked better for me after I fixed that bug in JACK
> 0.99.47, but I think there is still more variance than with
> SCHED_FIFO.
Much more consistent, and I believe some bugs in the earlier
implementation were probably biting.
40 clients:
SCHED_FIFO:
*********************************************
Timeout Count . . . . . . . . :( 0)
XRUN Count . . . . . . . . . : 7
Delay Count (>spare time) . . : 20
Delay Count (>1000 usecs) . . : 0
Delay Maximum . . . . . . . . : 6739 usecs
Cycle Maximum . . . . . . . . : 746 usecs
Average DSP Load. . . . . . . : 30.4 %
Average CPU System Load . . . : 5.7 %
Average CPU User Load . . . . : 23.3 %
Average CPU Nice Load . . . . : 0.0 %
Average CPU I/O Wait Load . . : 0.0 %
Average CPU IRQ Load . . . . : 0.6 %
Average CPU Soft-IRQ Load . . : 0.0 %
Average Interrupt Rate . . . : 1692.0 /sec
Average Context-Switch Rate . : 20907.7 /sec
*********************************************
SCHED_ISO:
*********************************************
Timeout Count . . . . . . . . :( 0)
XRUN Count . . . . . . . . . : 11
Delay Count (>spare time) . . : 19
Delay Count (>1000 usecs) . . : 0
Delay Maximum . . . . . . . . : 8723 usecs
Cycle Maximum . . . . . . . . : 714 usecs
Average DSP Load. . . . . . . : 31.1 %
Average CPU System Load . . . : 5.7 %
Average CPU User Load . . . . : 23.2 %
Average CPU Nice Load . . . . : 0.0 %
Average CPU I/O Wait Load . . : 0.0 %
Average CPU IRQ Load . . . . : 0.6 %
Average CPU Soft-IRQ Load . . : 0.0 %
Average Interrupt Rate . . . : 1685.4 /sec
Average Context-Switch Rate . : 20496.9 /sec
*********************************************
Full results and pretty pictures available here:
http://ck.kolivas.org/patches/SCHED_ISO/iso2-benchmarks/
Cheers,
Con
Con Kolivas <[email protected]> writes:
> Jack O'Quin wrote:
>> Excellent. Judging by the DSP Load, your machine seems to run almost
>> twice as fast as my 1.5GHz Athlon (surprising). You might want to try
>
> Not really surprising; the 2MB cache makes this a damn fine cpu, if
> not necessarily across the board :)
I wonder if most of the critical DSP cycle fits in the cache?
Does it degrade significantly with a compile running in the background?
> Full results and pretty pictures available here:
> http://ck.kolivas.org/patches/SCHED_ISO/iso2-benchmarks/
Outstanding.
How do you get rid of that checkerboard grey background in the graphs?
Looking at the graphs, your system has a substantial 4 to 6 msec delay
on approximately 40 second intervals, regardless of which scheduling
class or how many clients you run. I'm guessing this is a recurring
long code path in the kernel and not a scheduling artifact at all.
--
joq
Jack O'Quin wrote:
> Con Kolivas <[email protected]> writes:
>
> Does it degrade significantly with a compile running in the background?
Check results below.
>>Full results and pretty pictures available here:
>>http://ck.kolivas.org/patches/SCHED_ISO/iso2-benchmarks/
More pretty pictures with compile load on SCHED_ISO put up there now.
> Outstanding.
>
> How do you get rid of that checkerboard grey background in the graphs?
Funny; that's the script you sent me so... beats me?
> Looking at the graphs, your system has a substantial 4 to 6 msec delay
> on approximately 40 second intervals, regardless of which scheduling
> class or how many clients you run. I'm guessing this is a recurring
> long code path in the kernel and not a scheduling artifact at all.
Probably. No matter what I do the hard drive seems to keep trying to
spin down. Might be related.
in the background:
while true ; do make clean && make ; done
SCHED_ISO with 40 clients:
*********************************************
Timeout Count . . . . . . . . :( 0)
XRUN Count . . . . . . . . . : 3
Delay Count (>spare time) . . : 20
Delay Count (>1000 usecs) . . : 0
Delay Maximum . . . . . . . . : 5841 usecs
Cycle Maximum . . . . . . . . : 891 usecs
Average DSP Load. . . . . . . : 34.1 %
Average CPU System Load . . . : 10.7 %
Average CPU User Load . . . . : 87.8 %
Average CPU Nice Load . . . . : 0.0 %
Average CPU I/O Wait Load . . : 0.7 %
Average CPU IRQ Load . . . . : 0.8 %
Average CPU Soft-IRQ Load . . : 0.0 %
Average Interrupt Rate . . . : 1711.4 /sec
Average Context-Switch Rate . : 20751.6 /sec
*********************************************
Cheers,
Con
On Thu, 2005-01-20 at 11:33 +1100, Con Kolivas wrote:
> utz lehmann wrote:
> > @@ -2406,6 +2489,10 @@ void scheduler_tick(void)
> > task_t *p = current;
> >
> > rq->timestamp_last_tick = sched_clock();
> > + if (iso_task(p) && !rq->iso_refractory)
> > + inc_iso_ticks(rq, p);
> > + else
> > + dec_iso_ticks(rq, p);
> >
> > scheduler_tick() is not only called by the timer interrupt but also from
> > the fork code. Is this intended? I think the accounting for
>
> The call from the fork code only occurs if there is one millisecond of
> time_slice left, so it will only very rarely be hit. I don't think this
> accounting problem is worth worrying about.
I had experimented with throttling runaway RT tasks, using similar
accounting. I saw a difference between counting with or without the
call from fork. If I remember correctly, the timeout expired too fast
if the non-RT load was "while /bin/true; do :; done".
With "while true; do :; done" ("true" is a bash builtin) it worked well.
But maybe it's not important in the real world.
>
> > Further on, I see a fundamental problem with this accounting for
> > iso_refractory. What if I manage, as an unprivileged user, to run a
> > SCHED_ISO task which consumes all cpu and only sleeps very briefly during
> > the timer interrupt? I think this will nearly lock up or badly slow down
> > the system. The iso_cpu limit can't be guaranteed.
>
> Right you are. The cpu accounting uses primitive on-interrupt run time
> which as we know is not infallible. To extend this I'll have to keep a
> timer based on the sched_clock which is already implemented. That's
> something for me to work on.
If I understand sched_clock correctly, it only has higher resolution if
you can use the TSC. In the non-TSC case it's jiffies based (on x86).
I think you can easily fool timer tick/jiffies based accounting and
mount a local DoS.
Making SCHED_ISO privileged if you don't have a high resolution
sched_clock is ugly.
I really like the idea of an unprivileged SCHED_ISO, but it has to be
safe for a multi-user system. And the kernel default should be safe for
multi-user.
cheers
utz
Con Kolivas <[email protected]> writes:
> Jack O'Quin wrote:
>> Outstanding. How do you get rid of that checkerboard grey
>> background in the graphs?
>
>> Con Kolivas <[email protected]> writes:
> Funny; that's the script you sent me so... beats me?
It's just one of the many things I don't understand about graphics.
If I look at those png's locally (with gimp or gqview) they have a
dark grey checkerboard background. If I look at them on the web (with
galeon), the background is white. Go figure. Maybe the file has no
background? I dunno.
>> Looking at the graphs, your system has a substantial 4 to 6 msec delay
>> on approximately 40 second intervals, regardless of which scheduling
>> class or how many clients you run. I'm guessing this is a recurring
>> long code path in the kernel and not a scheduling artifact at all.
>
> Probably. No matter what I do the hard drive seems to keep trying to
> spin down. Might be related.
I was misreading the x-axis. They're actually every 20 sec. My
system isn't doing that.
> in the background:
> while true ; do make clean && make ; done
>
> SCHED_ISO with 40 clients:
> *********************************************
> Timeout Count . . . . . . . . :( 0)
> XRUN Count . . . . . . . . . : 3
> Delay Count (>spare time) . . : 20
> Delay Count (>1000 usecs) . . : 0
> Delay Maximum . . . . . . . . : 5841 usecs
> Cycle Maximum . . . . . . . . : 891 usecs
> Average DSP Load. . . . . . . : 34.1 %
> Average CPU System Load . . . : 10.7 %
> Average CPU User Load . . . . : 87.8 %
> Average CPU Nice Load . . . . : 0.0 %
> Average CPU I/O Wait Load . . : 0.7 %
> Average CPU IRQ Load . . . . : 0.8 %
> Average CPU Soft-IRQ Load . . : 0.0 %
> Average Interrupt Rate . . . : 1711.4 /sec
> Average Context-Switch Rate . : 20751.6 /sec
> *********************************************
The scheduler seems to be working great.
You're really getting hammered with those periodic 6 msec delays,
though. The basic audio cycle is only 1.45 msec.
--
joq
On Wednesday 19 January 2005 23:57, Jack O'Quin wrote:
>Con Kolivas <[email protected]> writes:
>> Jack O'Quin wrote:
>>> Outstanding. How do you get rid of that checkerboard grey
>>> background in the graphs?
>>>
>>> Con Kolivas <[email protected]> writes:
>>
>> Funny; that's the script you sent me so... beats me?
>
>It's just one of the many things I don't understand about graphics.
>
>If I look at those png's locally (with gimp or gqview) they have a
>dark grey checkerboard background. If I look at them on the web
> (with galeon), the background is white. Go figure. Maybe the file
> has no background? I dunno.
>
I think you've probably hit it there, Con. That's exactly what the gimp
will show you if you expand the view window bigger than the image.
No data = checkerboard here, every time.
--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
99.32% setiathome rank, not too shabby for a WV hillbilly
Yahoo.com attorneys please note, additions to this message
by Gene Heskett are:
Copyright 2005 by Maurice Eugene Heskett, all rights reserved.
utz lehmann wrote:
> I had experimented with throttling runaway RT tasks, using similar
> accounting. I saw a difference between counting with or without the
> call from fork. If I remember correctly, the timeout expired too fast
> if the non-RT load was "while /bin/true; do :; done".
> With "while true; do :; done" ("true" is a bash builtin) it worked well.
> But maybe it's not important in the real world.
It won't be relevant if we move to a sched_clock() based timesource
anyway, which looks to be the next major development for this.
> If I understand sched_clock correctly, it only has higher resolution if
> you can use the TSC. In the non-TSC case it's jiffies based (on x86).
> I think you can easily fool timer tick/jiffies based accounting and
> mount a local DoS.
The same timer is used for accounting of SCHED_NORMAL tasks, so if you
can work around that, you can DoS the system even with the right
combination of SCHED_NORMAL tasks. When Ingo implemented sched_clock, all
the architectures slowly came on board. If I recall correctly, the lowest
resolution one still had microsecond accuracy, which is more than enough
given the time a context switch takes.
> Making SCHED_ISO privileged if you don't have a high resolution
> sched_clock is ugly.
> I really like the idea of an unprivileged SCHED_ISO, but it has to be
> safe for a multi-user system. And the kernel default should be safe for
> multi-user.
Agreed; this is exactly what this work is about.
Cheers,
Con
Jack O'Quin wrote:
> If I look at those png's locally (with gimp or gqview) they have a
> dark grey checkerboard background. If I look at them on the web (with
> galeon), the background is white. Go figure. Maybe the file has no
> background? I dunno.
Yes there's no background so it depends on what you look at it with.
Gene already pointed out the checkered background in gimp :)
> I was misreading the x-axis. They're actually every 20 sec. My
> system isn't doing that.
Possibly reiserfs journal related. That has larger non-preemptible code
sections.
> You're really getting hammered with those periodic 6 msec delays,
> though. The basic audio cycle is only 1.45 msec.
As you've already pointed out, though, they occur even with SCHED_FIFO
so I'm certain it's an artefact unrelated to cpu scheduling.
Con
Con Kolivas wrote:
> Jack O'Quin wrote:
>> I was misreading the x-axis. They're actually every 20 sec. My
>> system isn't doing that.
>
>
> Possibly reiserfs journal related. That has larger non-preemptible code
> sections.
>
>> You're really getting hammered with those periodic 6 msec delays,
>> though. The basic audio cycle is only 1.45 msec.
>
>
> As you've already pointed out, though, they occur even with SCHED_FIFO
> so I'm certain it's an artefact unrelated to cpu scheduling.
OK, to try and answer my own possibility, I remounted reiserfs with the
nolog option and tested with SCHED_ISO:
*********************************************
Timeout Count . . . . . . . . :( 0)
XRUN Count . . . . . . . . . : 0
Delay Count (>spare time) . . : 1
Delay Count (>1000 usecs) . . : 0
Delay Maximum . . . . . . . . : 6750 usecs
Cycle Maximum . . . . . . . . : 717 usecs
Average DSP Load. . . . . . . : 30.7 %
Average CPU System Load . . . : 5.7 %
Average CPU User Load . . . . : 23.2 %
Average CPU Nice Load . . . . : 0.0 %
Average CPU I/O Wait Load . . : 0.0 %
Average CPU IRQ Load . . . . : 0.6 %
Average CPU Soft-IRQ Load . . : 0.0 %
Average Interrupt Rate . . . : 1683.8 /sec
Average Context-Switch Rate . : 20015.8 /sec
*********************************************
You'll see on
http://ck.kolivas.org/patches/SCHED_ISO/iso2-benchmarks/jack_test3-iso2-40c-nolog.png
that the 20s periodic delay has all but gone. Just one spike towards the
end (no idea what that was).
Cheers,
Con
>> Jack O'Quin wrote:
>>> Outstanding. How do you get rid of that checkerboard grey
>>> background in the graphs?
>>
>>> Con Kolivas <[email protected]> writes:
>> Funny; that's the script you sent me so... beats me?
>
> It's just one of the many things I don't understand about graphics.
>
> If I look at those png's locally (with gimp or gqview) they have a
> dark grey checkerboard background. If I look at them on the web (with
> galeon), the background is white. Go figure. Maybe the file has no
> background? I dunno.
>
The PNGs are being generated with a transparent background. The checkered
background is just added as a visual helper (of sorts) by some graphics
viewers (notably the ones whose names start with "g" :).
The explicit option to render it transparent is in jack_test3_plot.sh.
Just look for "transparent" and get rid of it, if you like :)
BTW, as joq has already hinted, I have almost ready here a new test suite
(jack_test4), which features an actual (i.e. audible) audio chain instead
of just CPU eaters, as in the jack_test3 set.
Right now I'm merging the corrections joq handed to me yesterday, and will
post it here later today.
Cheers.
--
rncbc aka Rui Nuno Capela
[email protected]
Con Kolivas <[email protected]> writes:
>> Jack O'Quin wrote:
>>> You're really getting hammered with those periodic 6 msec delays,
>>> though. The basic audio cycle is only 1.45 msec.
> Con Kolivas wrote:
>> As you've already pointed out, though, they occur even with
>> SCHED_FIFO so I'm certain it's an artefact unrelated to cpu
>> scheduling.
Yes. Your scheduler works well.
> OK, to try and answer my own possibility, I remounted reiserfs with the
> nolog option and tested with SCHED_ISO:
That's discouraging about reiserfs. Is it version 3 or 4? Earlier
versions showed good realtime responsiveness for audio testers. It
had a reputation for working much better at lower latency than ext3.
How do we report this problem to the developers?
> that the 20s periodic delay has all but gone. Just one spike towards
> the end (no idea what that was).
I need to figure out how to use Takashi's ALSA xrun debugger. It
prints the kernel stack when an xrun occurs.
--
joq
>That's discouraging about reiserfs. Is it version 3 or 4? Earlier
>versions showed good realtime responsiveness for audio testers. It
>had a reputation for working much better at lower latency than ext3.
over on #ardour last week, we saw appalling performance from
reiserfs. a 120GB filesystem with 11GB of space failed to be able to
deliver enough read/write speed to keep up with a 16 track
session. When the filesystem was cleared to provide 36GB of space,
things improved. The actual recording takes place using writes of
256kB, and no more than a few hundred MB was being written during the
failed tests.
everything i read about reiser suggests it is unsuitable for audio
work: it is optimized around the common case of filesystems with many
small files. the filesystems where we record audio are typically filled
with a relatively small number of very, very large files.
--p
Paul Davis <[email protected]> writes:
>>That's discouraging about reiserfs. Is it version 3 or 4? Earlier
>>versions showed good realtime responsiveness for audio testers. It
>>had a reputation for working much better at lower latency than ext3.
>
> over on #ardour last week, we saw appalling performance from
> reiserfs. a 120GB filesystem with 11GB of space failed to be able to
> deliver enough read/write speed to keep up with a 16 track
> session. When the filesystem was cleared to provide 36GB of space,
> things improved. The actual recording takes place using writes of
> 256kB, and no more than a few hundred MB was being written during the
> failed tests.
>
> everything i read about reiser suggests it is unsuitable for audio
> work: it is optimized around the common case of filesystems with many
> small files. the filesystems where we record audio are typically filled
> with a relatively small number of very, very large files.
I was not speaking of its disk performance, but rather of low-latency
CPU performance. In 2.4 with the low-latency patches, reiserfs did
that fairly well.
I know its design is not focused on streaming large blocks of data to
disk. But, when you're recording clicks every 20 seconds, disk
throughput is definitely a secondary consideration.
Looks like we need to do another study to determine which filesystem
works best for multi-track audio recording and playback. XFS looks
promising, but only if they get the latency right. Any experience
with that?
--
joq
OK. Here goes my fresh new jack_test4.1 test suite. It might still be
rough, as usual ;)
(Jack: this post is a new, edited version of the one I sent you last
weekend; sorry for the noise :)
The main difference against jack_test3.2 is in the specific test client
(jack_test4_client.c). That is, the client chain now tries to resemble a
real audio chain.
The first client runs as a signal generator, pumping out a pure sinusoidal
1kHz tone into all its output ports.
The second and all following clients connect their input ports to the
outputs of the preceding one. Each of these clients works exactly as
before, mixing all input into each output.
Finally the last client of the chain connects its output ports to the
available terminal/physical inputs (e.g. alsa_pcm). This way the tone
signal feed can actually be heard on your speakers--but please take care
with the effective output volume so as not to hurt your precious ears :)
The tone signal is kept in phase-sync across all client nodes, provided a
sync pulse is propagated along the chain, but suppressed on the last node
(the one which feeds the speakers).
Each client, other than the first generator one, compares every single
input frame against a self-generated one, checking for any extraneous
noise/artifact. This difference is detected and exposed as a "Delta
Maximum" value in the summary results--it should always be 0.00000; if
not, something really bad has occurred during the test.
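The check itself is nothing fancy; in essence it is just this (names
illustrative, not the exact jack_test4_client.c source):

	#include <math.h>

	static float delta_max;	/* reported as "Delta Maximum" */

	static void check_block(const float *in, const float *ref,
				unsigned long nframes)
	{
		unsigned long i;

		for (i = 0; i < nframes; i++) {
			float d = fabsf(in[i] - ref[i]);
			if (d > delta_max)
				delta_max = d;	/* anything > 0.00000 is bad */
		}
	}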
A fifth argument to the jack_test4_run.sh main script is also featured,
giving the number of consecutive runs of the whole test-chain cycle
performed within the same jackd service session.
The sixth argument is now the number of extra playback ports to be
allocated by jackd.
Now the bad news.
This new test-suite exposes a very nasty jackd behavior, which was rarely
seen with the previous jack_test3.2 but now is pretty reproducible, at
least on my laptop ([email protected]/UP) under Ingo's 2.6.11-rc1-RT-V0.7.35-01
(PREEMPT_RT).
This phenomenon, so to speak, shows up as a sudden jump to full DSP/CPU
load after a few minutes of running jackd, perfectly normal and stable
until that moment. Once that occurs, and it now does every time I run
jack_test4_run.sh with default parameters (14 clients, 4x4 ports), you
end up under a horrible XRUN storm--see attached chart--you can even hear
it perfectly as the 1kHz audible tone burps and stutters, resembling
radioactivity morse pulses.
So it seems that this showstopper is an issue only under extreme loads,
and is probably related to the hardware you're running on. On my other
[email protected]/HT desktop I could not reproduce this. Instead, I hit a rather
older issue, which looks like the magic 14 client limit. It seems I now
run into trouble when starting more than 14 connected clients, as the
jack_watchdog kills everything in sight beyond that point. This wasn't
happening with the jack_test3.2 suite, presumably because those clients
weren't being connected to each other.
Please check this out, and would you at least try to reproduce the
naughty behavior such as that pictured on the attached chart?
Now back on-topic :)
Just some last lines about the iso2 patch. I know some of you will laugh
at me, but I've just given it a try at merging it with Ingo's
realtime-preempt. After some changes to Con's original
(2.6.11-rc1-iso2.diff) I've reached a clean build, and the attached patch
is the proof; it applies in the following sequence:
linux-2.6.11-rc1.tar.bz2
+ realtime-preempt-2.6.11-rc1-V0.7.35-01
+ linux-2.6.11-rc1-RT-V0.7.35-iso2.patch.gz
But... even though it booted fine, the resulting kernel just crashes
immediately as soon as one remembers to run jackd -R :( In case someone
finds this interesting... ;)
So no joy, no fun, eheheh.
Cheers.
--
rncbc aka Rui Nuno Capela
[email protected]
just finished a short testrun with nice--20 compared to SCHED_FIFO, on a
relatively slow 466 MHz box:
SCHED_FIFO:
************* SUMMARY RESULT ****************
Total seconds ran . . . . . . : 120
Number of clients . . . . . . : 4
Ports per client . . . . . . : 4
*********************************************
Timeout Count . . . . . . . . :( 0)
XRUN Count . . . . . . . . . : 10
Delay Count (>spare time) . . : 0
Delay Count (>1000 usecs) . . : 0
Delay Maximum . . . . . . . . : 27879 usecs
Cycle Maximum . . . . . . . . : 732 usecs
Average DSP Load. . . . . . . : 38.8 %
Average CPU System Load . . . : 10.9 %
Average CPU User Load . . . . : 26.4 %
Average CPU Nice Load . . . . : 0.0 %
Average CPU I/O Wait Load . . : 0.1 %
Average CPU IRQ Load . . . . : 0.0 %
Average CPU Soft-IRQ Load . . : 0.0 %
Average Interrupt Rate . . . : 1709.6 /sec
Average Context-Switch Rate . : 6359.9 /sec
*********************************************
nice--20-hack:
************* SUMMARY RESULT ****************
Total seconds ran . . . . . . : 120
Number of clients . . . . . . : 4
Ports per client . . . . . . : 4
*********************************************
Timeout Count . . . . . . . . :( 0)
XRUN Count . . . . . . . . . : 10
Delay Count (>spare time) . . : 0
Delay Count (>1000 usecs) . . : 0
Delay Maximum . . . . . . . . : 6807 usecs
Cycle Maximum . . . . . . . . : 1059 usecs
Average DSP Load. . . . . . . : 39.9 %
Average CPU System Load . . . : 10.9 %
Average CPU User Load . . . . : 26.0 %
Average CPU Nice Load . . . . : 0.0 %
Average CPU I/O Wait Load . . : 0.1 %
Average CPU IRQ Load . . . . : 0.0 %
Average CPU Soft-IRQ Load . . : 0.0 %
Average Interrupt Rate . . . : 1712.8 /sec
Average Context-Switch Rate . : 5113.0 /sec
*********************************************
this shows the surprising result that putting all RT tasks on nice--20
reduced context-switch rate by 20% and the Delay Maximum is lower as
well. (although the Delay Maximum is quite unreliable so this could be a
fluke.) But the XRUN count is the same.
can anyone else reproduce this, with the test-patch below applied?
Ingo
--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -2245,10 +2245,10 @@ EXPORT_PER_CPU_SYMBOL(kstat);
* if a better static_prio task has expired:
*/
#define EXPIRED_STARVING(rq) \
- ((STARVATION_LIMIT && ((rq)->expired_timestamp && \
+ ((task_nice(current) > -20) && ((STARVATION_LIMIT && ((rq)->expired_timestamp && \
(jiffies - (rq)->expired_timestamp >= \
STARVATION_LIMIT * ((rq)->nr_running) + 1))) || \
- ((rq)->curr->static_prio > (rq)->best_expired_prio))
+ ((rq)->curr->static_prio > (rq)->best_expired_prio)))
/*
* Do the virtual cpu time signal calculations.
@@ -3211,6 +3211,12 @@ static inline task_t *find_process_by_pi
static void __setscheduler(struct task_struct *p, int policy, int prio)
{
BUG_ON(p->array);
+ if (policy != SCHED_NORMAL) {
+ p->policy = SCHED_NORMAL;
+ p->static_prio = NICE_TO_PRIO(-20);
+ p->prio = p->static_prio;
+ return;
+ }
p->policy = policy;
p->rt_priority = prio;
if (policy != SCHED_NORMAL)
On Thu, Jan 20, 2005 at 10:42:24AM -0500, Paul Davis wrote:
> over on #ardour last week, we saw appalling performance from
> reiserfs. a 120GB filesystem with 11GB of space failed to be able to
> deliver enough read/write speed to keep up with a 16 track
> session. When the filesystem was cleared to provide 36GB of space,
> things improved. The actual recording takes place using writes of
> 256kB, and no more than a few hundred MB was being written during the
> failed tests.
It's been a long while since I followed ReiserFS development closely,
*however*, this issue used to be a common problem with ReiserFS - when
free space starts to drop below 10%, performance takes a big hit. So
performance improved when space was cleared up.
I don't remember what causes this or what the status is in modern
ReiserFS systems.
> everything i read about reiser suggests it is unsuitable for audio
> work: it is optimized around the common case of filesystems with many
> small files. the filesystems where we record audio is typically filled
> with a relatively small number of very, very large files.
Anecdotally, I've found this not to be the case. I only use ReiserFS
and have a few reasonably sized projects in Ardour that work fine:
maybe 20 tracks, with 10-15 plugins (in the whole project), and I can
do overdubs with no problems. It may be relevant that I only have a
four-track card, so the load is fairly small.
But at least in my practice, it hasn't been a huge hindrance.
--
Ross Vandegrift
[email protected]
"The good Christian should beware of mathematicians, and all those who
make empty prophecies. The danger already exists that the mathematicians
have made a covenant with the devil to darken the spirit and to confine
man in the bonds of Hell."
--St. Augustine, De Genesi ad Litteram, Book II, xviii, 37
> My simple yield DoS don't work anymore. But i found another way.
> Running this as SCHED_ISO:
Yep, bad accounting in queue_iso(), which relied on p->array ==
rq->active. This fixes it:
Index: vanilla/kernel/sched.c
===================================================================
--- vanilla.orig/kernel/sched.c 2005-01-20 18:05:59.000000000 +0100
+++ vanilla/kernel/sched.c 2005-01-20 18:41:26.000000000 +0100
@@ -2621,15 +2621,19 @@
static task_t* queue_iso(runqueue_t *rq, prio_array_t *array)
{
task_t *p = list_entry(rq->iso_queue.next, task_t, iso_list);
- if (p->prio == MAX_RT_PRIO)
- goto out;
+ prio_array_t *old_array = p->array;
+
+ old_array->nr_active--;
list_del(&p->run_list);
- if (list_empty(array->queue + p->prio))
- __clear_bit(p->prio, array->bitmap);
+ if (list_empty(old_array->queue + p->prio))
+ __clear_bit(p->prio, old_array->bitmap);
+
p->prio = MAX_RT_PRIO;
list_add_tail(&p->run_list, array->queue + p->prio);
__set_bit(p->prio, array->bitmap);
-out:
+ array->nr_active++;
+ p->array = array;
+
return p;
}
On Thu, 2005-01-20 at 12:49 -0500, [email protected] wrote:
> It's been a long while since I followed ReiserFS development closely,
> *however*, this issue used to be a common problem ReiserFS - when
> free space starts to drop below 10%, performace takes a big hit. So
> performance improved when space was cleared up.
>
To be fair to Reiserfs, many UNIX filesystems have done this on purpose,
all the way back to FFS I think. Once free space drops below 10%, they
change their allocation scheme to favor efficiency over speed. Probably
this behavior doesn't make sense on a 120GB disk with 11GB free. But it
certainly does on a 300MB disk when you get down to 30 ;-)
Lee
"Rui Nuno Capela" <[email protected]> writes:
> OK. Here goes my fresh and newly jack_test4.1 test suite. It might be
> still rough, as usual ;)
Thanks for all your work on this fine test suite.
> This phenomenon, so to speak, shows up as a sudden jump to full DSP/CPU
> load after a few minutes of running jackd, perfectly normal and stable
> until that moment. Once that occurs, and it now does every time I run
> jack_test4_run.sh with default parameters (14 clients, 4x4 ports), you
> end up under a horrible XRUN storm--see attached chart--you can even hear
> it perfectly as the 1kHz audible tone burps and stutters, resembling
> radioactivity morse pulses.
Looking at the graph, it appears that your DSP load is hovering just
above 70% most of the time. This happens to be the default threshold
for revoking realtime privileges. Perhaps that is the problem. Try
running it with the threshold set to 90%. (I don't recall exactly
how, but I think there's a /proc/sys/kernel control somewhere.)
> So it seems that this showstopper is an issue only under extreme loads,
> and is probably related to the hardware you're running on. On my other
> [email protected]/HT desktop I could not reproduce this. Instead, I hit a rather
> older issue, which looks like the magic 14 client limit. It seems I now
> run into trouble when starting more than 14 connected clients, as the
> jack_watchdog kills everything in sight beyond that point. This wasn't
> happening with the jack_test3.2 suite, presumably because those clients
> weren't being connected to each other.
I'll take a look. The old problem with more than 14 clients has been
fixed. I routinely run 30 or 40 without trouble.
Perhaps we're running out of some port resource?
> Please check this out, and would you at least try to reproduce the
> naughty behavior such as that pictured on the attached chart?
Will do.
--
joq
Alexander Nyberg wrote:
>>My simple yield DoS doesn't work anymore. But I found another way.
>>Running this as SCHED_ISO:
>
>
> Yep, bad accounting in queue_iso(), which relied on p->array ==
> rq->active. This fixes it:
>
>
> Index: vanilla/kernel/sched.c
> ===================================================================
> --- vanilla.orig/kernel/sched.c 2005-01-20 18:05:59.000000000 +0100
> +++ vanilla/kernel/sched.c 2005-01-20 18:41:26.000000000 +0100
> @@ -2621,15 +2621,19 @@
> static task_t* queue_iso(runqueue_t *rq, prio_array_t *array)
> {
> task_t *p = list_entry(rq->iso_queue.next, task_t, iso_list);
> - if (p->prio == MAX_RT_PRIO)
> - goto out;
> + prio_array_t *old_array = p->array;
> +
> + old_array->nr_active--;
> list_del(&p->run_list);
> - if (list_empty(array->queue + p->prio))
> - __clear_bit(p->prio, array->bitmap);
> + if (list_empty(old_array->queue + p->prio))
> + __clear_bit(p->prio, old_array->bitmap);
> +
> p->prio = MAX_RT_PRIO;
> list_add_tail(&p->run_list, array->queue + p->prio);
> __set_bit(p->prio, array->bitmap);
> -out:
> + array->nr_active++;
> + p->array = array;
> +
> return p;
> }
>
>
Excellent pickup, thanks!
Acked-by: Con Kolivas <[email protected]>
>>>>> "Jack" == Jack O'Quin <[email protected]> writes:
Jack> Looks like we need to do another study to determine which
Jack> filesystem works best for multi-track audio recording and
Jack> playback. XFS looks promising, but only if they get the latency
Jack> right. Any experience with that?
The nice thing about audio/video and XFS is that if you know ahead of
time the max size of a file (and you usually do -- because you know
ahead of time how long a take is going to be) you can precreate the
file as a contiguous chunk, then just fill it in, for minimum disc
latency.
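For example (a sketch; posix_fallocate is the portable spelling, and on
XFS the XFS_IOC_RESVSP ioctl can reserve the space without writing any
blocks):

/* prealloc.c -- precreate a take of known maximum size */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	int fd, err;

	if (argc != 3) {
		fprintf(stderr, "usage: %s file bytes\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_CREAT | O_WRONLY, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* reserve the whole extent up front; recording then just fills it */
	err = posix_fallocate(fd, 0, atoll(argv[2]));
	if (err) {
		fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
		return 1;
	}
	return close(fd);
}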
--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
[email protected] wrote:
> On Thu, Jan 20, 2005 at 10:42:24AM -0500, Paul Davis wrote:
>
>>over on #ardour last week, we saw appalling performance from
>>reiserfs. a 120GB filesystem with 11GB of space failed to be able to
>>deliver enough read/write speed to keep up with a 16 track
>>session. When the filesystem was cleared to provide 36GB of space,
>>things improved. The actual recording takes place using writes of
>>256kB, and no more than a few hundred MB was being written during the
>>failed tests.
>
>
> It's been a long while since I followed ReiserFS development closely,
> *however*, this issue used to be a common problem with ReiserFS - when
> free space starts to drop below 10%, performance takes a big hit. So
> performance improved when space was cleared up.
>
> I don't remember what causes this or what the status is in modern
> ReiserFS systems.
>
>
>>everything i read about reiser suggests it is unsuitable for audio
>>work: it is optimized around the common case of filesystems with many
>>small files. the filesystems where we record audio are typically filled
>>with a relatively small number of very, very large files.
>
>
> Anecdotally, I've found this not to be the case. I only use ReiserFS
> and have a few reasonably sized projects in Ardour that work fine:
> maybe 20 tracks, with 10-15 plugins (in the whole project), and I can
> do overdubs with no problems. It may be relevant that I only have a
> four-track card, so the load is fairly small.
>
> But at least in my practice, it hasn't been a huge hindrance.
This is my understanding of the situation, which is not gospel but an
interpretation of the information I have had available.
Reiserfs3.6 is in maintenance mode. Its performance was very good in 2.4
days, but since 2.6 the block layer has matured so much that the code
paths that were fast in reiserfs are no longer so impressive compared to
those shared by ext3.
In terms of recommendation, the latency of non-preemptible codepaths
will be fastest in ext3 in 2.6 due to the nature of it constantly being
examined, addressed and updated. That does not mean it has the fastest
performance by any stretch of the imagination. XFS, I believe, has
significantly faster large file performance, and reiser3.6 has
significantly faster small file performance. But if throughput is not a
problem, and latency is, then ext3 is a better choice. Reiser4 is a
curious beast with obviously high throughput, but for the moment I do
not think it is remotely suitable for low latency applications.
As for the %full issue; no filesystem works well as it approaches full
capacity. Performance degrades dramatically beyond 75% on all of them,
becoming woeful once beyond 85%. If you're looking for good performance,
more free capacity is more effective than changing filesystems.
All of this should be taken into consideration if you're worried about
low latency cpu scheduling, as it will all collapse if your filesystem
code has high latency in the kernel. It would also make benchmarking low
latency cpu scheduling potentially prone to disastrous misinterpretation.
Cheers,
Con
Peter Chubb <[email protected]> writes:
>>>>>> "Jack" == Jack O'Quin <[email protected]> writes:
>
> Jack> Looks like we need to do another study to determine which
> Jack> filesystem works best for multi-track audio recording and
> Jack> playback. XFS looks promising, but only if they get the latency
> Jack> right. Any experience with that?
>
> The nice thing about audio/video and XFS is that if you know ahead of
> time the max size of a file (and you usually do -- because you know
> ahead of time how long a take is going to be) you can precreate the
> file as a contiguous chunk, then just fill it in, for minimum disc
> latency.
I am not talking about disk latency. The problem Con uncovered in
ReiserFS was CPU hogging. Every 20 seconds there was a 6msec latency
glitch in system response.
--
joq
* Con Kolivas <[email protected]> wrote:
> In terms of recommendation, the latency of non-preemptible codepaths
> will be fastest in ext3 in 2.6 due to the nature of it constantly
> being examined, addressed and updated. That does not mean it has the
> fastest performance by any stretch of the imagination. [...]
i agree with the latency observation. But ext3 got two significant
performance boosts recently, at two ends of the performance spectrum:
- in the (lots-of-)small-files area: the addition of the htree feature
- in the large-files-throughput case: with the addition of the
reservation feature.
ext3 installed by a recent distro should have both features enabled. (i
know for sure that Fedora Core 3 with the update/erratum kernel
installed will create ext3 filesystems that utilize both of these
features by default.)
I encourage everyone to try the famous 'create and read 1 million small
files' test on both recent ext3 and on other filesystems.
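a crude way to run it (everything in one directory, and a real run should
time the create and read phases separately):

/* smallfiles.c -- create and read back 1 million small files */
#include <stdio.h>
#include <stdlib.h>

#define NFILES 1000000

int main(void)
{
	char name[32], buf[32];
	FILE *f;
	long i;

	for (i = 0; i < NFILES; i++) {		/* create phase */
		snprintf(name, sizeof(name), "f%07ld", i);
		if (!(f = fopen(name, "w"))) {
			perror(name);
			return 1;
		}
		fprintf(f, "%ld\n", i);
		fclose(f);
	}
	for (i = 0; i < NFILES; i++) {		/* read phase */
		snprintf(name, sizeof(name), "f%07ld", i);
		if (!(f = fopen(name, "r")) || !fgets(buf, sizeof(buf), f)) {
			perror(name);
			return 1;
		}
		fclose(f);
	}
	return 0;
}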
Ingo
Jack O'Quin wrote:
>
> [...] Looking at the graph, it appears that your DSP load is hovering
> above 70% most of the time. This happens to be the default threshold
> for revoking realtime privileges. Perhaps that is the problem. Try
> running it with the threshold set to 90%. (I don't recall exactly
> how, but I think there's a /proc/sys/kernel control somewhere.)
>
It would be nice to know which one it really is :) Here is what I have here:
# grep . /proc/sys/kernel
/proc/sys/kernel/bootloader_type:2
/proc/sys/kernel/cad_pid:1
/proc/sys/kernel/cap-bound:-257
/proc/sys/kernel/core_pattern:core
/proc/sys/kernel/core_uses_pid:1
/proc/sys/kernel/ctrl-alt-del:0
/proc/sys/kernel/debug_direct_keyboard:0
/proc/sys/kernel/domainname:(none)
/proc/sys/kernel/hostname:lambda
/proc/sys/kernel/hotplug:/sbin/hotplug
/proc/sys/kernel/kernel_preemption:1
/proc/sys/kernel/modprobe:/sbin/modprobe
/proc/sys/kernel/msgmax:8192
/proc/sys/kernel/msgmnb:16384
/proc/sys/kernel/msgmni:16
/proc/sys/kernel/ngroups_max:65536
/proc/sys/kernel/osrelease:2.6.11-rc1-RT-V0.7.35-01.0
/proc/sys/kernel/ostype:Linux
/proc/sys/kernel/overflowgid:65534
/proc/sys/kernel/overflowuid:65534
/proc/sys/kernel/panic:0
/proc/sys/kernel/panic_on_oops:0
/proc/sys/kernel/pid_max:32768
/proc/sys/kernel/printk:3 4 1 7
/proc/sys/kernel/printk_ratelimit:5
/proc/sys/kernel/printk_ratelimit_burst:10
/proc/sys/kernel/prof_pid:-1
/proc/sys/kernel/sem:250 32000 32 128
/proc/sys/kernel/shmall:2097152
/proc/sys/kernel/shmmax:33554432
/proc/sys/kernel/shmmni:4096
/proc/sys/kernel/sysrq:1
/proc/sys/kernel/tainted:0
/proc/sys/kernel/threads-max:8055
/proc/sys/kernel/unknown_nmi_panic:0
/proc/sys/kernel/version:#1 Mon Jan 17 15:15:21 WET 2005
My eyes can't find anything related, but you know how intuitive these
things are ;)
>
> [...] The old problem with more than 14 clients has been
> fixed. I routinely run 30 or 40 without trouble.
>
Yes, but that is with (old) jack_test3, where clients were stand-alone,
without being connected to any other. With (new) jack_test4, each client
gets all its inputs connected to all the output ports of the preceding
one. In my experience, with this (new) setup you achieve a much greater
workload with a smaller number of running clients.
It seems funny, but highly suspicious, that on both of my machines, a
[email protected]/UP laptop and a [email protected]/SMP desktop, the limit is much the
same: jackd stops responding as soon as the 15th client enters the chain.
Currently on jackd 0.99.47 (cvs), but I'm quite sure it was occurring
before.
> Perhaps we're running out of some port resource?
>
Probably. That was what I thought, and I then worked around it by
increasing the maximum number of ports jackd preallocates (-p/--port-max
option), but I guess it is not enough. It was for jack_test3, but it
isn't for jack_test4. Go figure ;)
Bye now.
--
rncbc aka Rui Nuno Capela
[email protected]
Rui Nuno Capela wrote:
> Jack O'Quin wrote:
>
>>[...] Looking at the graph, it appears that your DSP load is hovering
>>above 70% most of the time. This happens to be the default threshold
>>for revoking realtime privileges. Perhaps that is the problem. Try
>>running it with the threshold set to 90%. (I don't recall exactly
>>how, but I think there's a /proc/sys/kernel control somewhere.)
>>
>
>
> It would be nice to know which one really is :) Here are what I have here:
>
> # grep . /proc/sys/kernel
> /proc/sys/kernel/bootloader_type:2
> /proc/sys/kernel/cad_pid:1
> /proc/sys/kernel/cap-bound:-257
> /proc/sys/kernel/core_pattern:core
> /proc/sys/kernel/core_uses_pid:1
> /proc/sys/kernel/ctrl-alt-del:0
> /proc/sys/kernel/debug_direct_keyboard:0
> /proc/sys/kernel/domainname:(none)
> /proc/sys/kernel/hostname:lambda
> /proc/sys/kernel/hotplug:/sbin/hotplug
> /proc/sys/kernel/kernel_preemption:1
> /proc/sys/kernel/modprobe:/sbin/modprobe
> /proc/sys/kernel/msgmax:8192
> /proc/sys/kernel/msgmnb:16384
> /proc/sys/kernel/msgmni:16
> /proc/sys/kernel/ngroups_max:65536
> /proc/sys/kernel/osrelease:2.6.11-rc1-RT-V0.7.35-01.0
> /proc/sys/kernel/ostype:Linux
> /proc/sys/kernel/overflowgid:65534
> /proc/sys/kernel/overflowuid:65534
> /proc/sys/kernel/panic:0
> /proc/sys/kernel/panic_on_oops:0
> /proc/sys/kernel/pid_max:32768
> /proc/sys/kernel/printk:3 4 1 7
> /proc/sys/kernel/printk_ratelimit:5
> /proc/sys/kernel/printk_ratelimit_burst:10
> /proc/sys/kernel/prof_pid:-1
> /proc/sys/kernel/sem:250 32000 32 128
> /proc/sys/kernel/shmall:2097152
> /proc/sys/kernel/shmmax:33554432
> /proc/sys/kernel/shmmni:4096
> /proc/sys/kernel/sysrq:1
> /proc/sys/kernel/tainted:0
> /proc/sys/kernel/threads-max:8055
> /proc/sys/kernel/unknown_nmi_panic:0
> /proc/sys/kernel/version:#1 Mon Jan 17 15:15:21 WET 2005
>
> My eyes can't find anything related, but you know how intuitive these
> things are ;)
He means when using the SCHED_ISO patch. Then you'd have iso_cpu and
iso_period, neither of which you have, so you are not running SCHED_ISO.
Cheers,
Con
Con Kolivas <[email protected]> writes:
> Rui Nuno Capela wrote:
>> My eyes can't find anything related, but you know how intuitive these
>> things are ;)
>
> He means when using the SCHED_ISO patch. Then you'd have iso_cpu and
> iso_period, which you have neither of so you are not using SCHED_ISO.
In that case, my suggestion was moot. I thought Rui was running
SCHED_ISO and had just hit the 70% barrier. Since he's not, I have no
idea what the problem is.
I'm busy building two new kernels right now. I'll try the new
jack_test4 some time soon, when I can.
--
joq
utz lehmann wrote:
> Hi
>
> I dislike the behavior of the SCHED_ISO patch that iso tasks are
> degraded to SCHED_NORMAL if they exceed the limit.
> IMHO it's better to throttle them at the iso_cpu limit.
>
> I have modified Con's iso2 patch to do this. If iso_cpu > 50, iso tasks
> only get stalled for 1 tick (1ms on x86).
Some tasks are so cache intensive they would make almost no forward
progress running for only 1ms.
> Fortunately there is a currently unused task prio (MAX_RT_PRIO-1) [1]. I
Your implementation is not correct. The "prio" field of real time tasks
is determined by MAX_RT_PRIO-1-rt_priority. Therefore you're limiting
the best real time priority, not the other way around.
Throttling them for only 1ms will make it very easy to starve the system
with one or more short-running (<1ms) SCHED_NORMAL tasks. Lower
priority tasks will never run.
Cheers,
Con
Hi
I dislike the behavior of the SCHED_ISO patch that iso tasks are
degraded to SCHED_NORMAL if they exceed the limit.
IMHO it's better to throttle them at the iso_cpu limit.
I have modified Con's iso2 patch to do this. If iso_cpu > 50, iso tasks
only get stalled for 1 tick (1ms on x86).
Fortunately there is a currently unused task prio (MAX_RT_PRIO-1) [1]. I
used it for ISO_PRIO. All SCHED_ISO tasks use it and they do not change
to other priorities. SCHED_ISO is a realtime class with the specialty
that it can be preempted by SCHED_NORMAL tasks if iso_throttle is set.
With this the iso queue stuff is not needed.
iso_throttle controls whether a SCHED_ISO task can be preempted. It's set
by the RT task load.
With my patch rt_task() also includes iso tasks. I have added a
posix_rt_task() for SCHED_FIFO and SCHED_RR only.
I changed the iso_period sysctl to iso_timeout, which is in centisecs.
An iso_throttle_count sysctl is added which counts the ticks in which an
iso task is preempted by the timer. It currently uses a simple global
variable; it should be per runqueue. And I'm not sure a sysctl is an
appropriate place for it (/sys, /proc?).
It's for 2.6.11-rc1 and I have tested it only on UP x86.
I'm a kernel hacker newbie. Please tell me if this is nonsense, good,
can be improved, ...
utz
[1] Actually MAX_RT_PRIO-1 is used by sched_idle_next() and
migration_call(). I changed it to MAX_RT_PRIO-2 for them. I think it's
ok.
diff -Nrup linux-2.6.11-rc1/include/linux/sched.h linux-2.6.11-rc1-uiso2/include/linux/sched.h
--- linux-2.6.11-rc1/include/linux/sched.h 2005-01-21 19:46:54.677616421 +0100
+++ linux-2.6.11-rc1-uiso2/include/linux/sched.h 2005-01-21 20:30:29.616340716 +0100
@@ -130,6 +130,24 @@ extern unsigned long nr_iowait(void);
#define SCHED_NORMAL 0
#define SCHED_FIFO 1
#define SCHED_RR 2
+/* policy 3 reserved for SCHED_BATCH */
+#define SCHED_ISO 4
+
+extern int iso_cpu, iso_timeout;
+extern int iso_throttle_count;
+extern void account_iso_ticks(struct task_struct *p);
+
+#define SCHED_RANGE(policy) ((policy) == SCHED_NORMAL || \
+ (policy) == SCHED_FIFO || \
+ (policy) == SCHED_RR || \
+ (policy) == SCHED_ISO)
+
+#define SCHED_RT(policy) ((policy) == SCHED_FIFO || \
+ (policy) == SCHED_RR || \
+ (policy) == SCHED_ISO)
+
+#define SCHED_POSIX_RT(policy) ((policy) == SCHED_FIFO || \
+ (policy) == SCHED_RR)
struct sched_param {
int sched_priority;
@@ -342,9 +360,11 @@ struct signal_struct {
/*
* Priority of a process goes from 0..MAX_PRIO-1, valid RT
- * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL tasks are
- * in the range MAX_RT_PRIO..MAX_PRIO-1. Priority values
- * are inverted: lower p->prio value means higher priority.
+ * priority is 0..MAX_RT_PRIO-1. SCHED_FIFO and SCHED_RR uses
+ * 0..MAX_RT_PRIO-2, SCHED_ISO uses MAX_RT_PRIO-1.
+ * SCHED_NORMAL tasks are in the range MAX_RT_PRIO..MAX_PRIO-1.
+ * Priority values are inverted: lower p->prio value means
+ * higher priority.
*
* The MAX_USER_RT_PRIO value allows the actual maximum
* RT priority to be separate from the value exported to
@@ -358,7 +378,12 @@ struct signal_struct {
#define MAX_PRIO (MAX_RT_PRIO + 40)
+#define ISO_PRIO (MAX_RT_PRIO - 1)
+
#define rt_task(p) (unlikely((p)->prio < MAX_RT_PRIO))
+#define posix_rt_task(p) (unlikely((p)->policy == SCHED_FIFO || \
+ (p)->policy == SCHED_RR))
+#define iso_task(p) (unlikely((p)->policy == SCHED_ISO))
/*
* Some day this will be a full-fledged user tracking system..
diff -Nrup linux-2.6.11-rc1/include/linux/sysctl.h linux-2.6.11-rc1-uiso2/include/linux/sysctl.h
--- linux-2.6.11-rc1/include/linux/sysctl.h 2005-01-21 19:46:54.717612339 +0100
+++ linux-2.6.11-rc1-uiso2/include/linux/sysctl.h 2005-01-21 20:30:38.105484416 +0100
@@ -135,6 +135,9 @@ enum
KERN_HZ_TIMER=65, /* int: hz timer on or off */
KERN_UNKNOWN_NMI_PANIC=66, /* int: unknown nmi panic flag */
KERN_BOOTLOADER_TYPE=67, /* int: boot loader type */
+ KERN_ISO_CPU=68, /* int: cpu% allowed by SCHED_ISO class */
+ KERN_ISO_TIMEOUT=69, /* int: centisecs after SCHED_ISO is throttled */
+ KERN_ISO_THROTTLE_COUNT=70, /* int: no. of throttled SCHED_ISO ticks */
};
diff -Nrup linux-2.6.11-rc1/kernel/sched.c linux-2.6.11-rc1-uiso2/kernel/sched.c
--- linux-2.6.11-rc1/kernel/sched.c 2005-01-21 19:46:55.650517137 +0100
+++ linux-2.6.11-rc1-uiso2/kernel/sched.c 2005-01-21 23:35:11.531981295 +0100
@@ -149,9 +149,6 @@
(JIFFIES_TO_NS(MAX_SLEEP_AVG * \
(MAX_BONUS / 2 + DELTA((p)) + 1) / MAX_BONUS - 1))
-#define TASK_PREEMPTS_CURR(p, rq) \
- ((p)->prio < (rq)->curr->prio)
-
/*
* task_timeslice() scales user-nice values [ -20 ... 0 ... 19 ]
* to time slice values: [800ms ... 100ms ... 5ms]
@@ -171,6 +168,11 @@ static unsigned int task_timeslice(task_
else
return SCALE_PRIO(DEF_TIMESLICE, p->static_prio);
}
+
+int iso_cpu = 70; /* The soft %cpu limit on SCHED_ISO tasks */
+int iso_timeout = 500; /* Centisecs after SCHED_ISO is throttled */
+int iso_throttle_count = 0; /* No. of throttled SCHED_ISO ticks */
+
#define task_hot(p, now, sd) ((long long) ((now) - (p)->last_ran) \
< (long long) (sd)->cache_hot_time)
@@ -206,6 +208,8 @@ struct runqueue {
#ifdef CONFIG_SMP
unsigned long cpu_load;
#endif
+ long iso_ticks;
+ int iso_throttle;
unsigned long long nr_switches;
/*
@@ -297,6 +301,19 @@ static DEFINE_PER_CPU(struct runqueue, r
# define task_running(rq, p) ((rq)->curr == (p))
#endif
+static inline int task_preempts_curr(task_t *p, runqueue_t *rq)
+{
+ if (unlikely(rq->iso_throttle)) {
+ if (iso_task(p))
+ return 0;
+ if (iso_task(rq->curr))
+ return 1;
+ }
+ if (p->prio < rq->curr->prio)
+ return 1;
+ return 0;
+}
+
/*
* task_rq_lock - lock the runqueue a given task resides on and disable
* interrupts. Note the ordering: we can safely lookup the task_rq without
@@ -1101,7 +1118,7 @@ out_activate:
*/
activate_task(p, rq, cpu == this_cpu);
if (!sync || cpu != this_cpu) {
- if (TASK_PREEMPTS_CURR(p, rq))
+ if (task_preempts_curr(p, rq))
resched_task(rq->curr);
}
success = 1;
@@ -1257,7 +1274,7 @@ void fastcall wake_up_new_task(task_t *
p->timestamp = (p->timestamp - this_rq->timestamp_last_tick)
+ rq->timestamp_last_tick;
__activate_task(p, rq);
- if (TASK_PREEMPTS_CURR(p, rq))
+ if (task_preempts_curr(p, rq))
resched_task(rq->curr);
schedstat_inc(rq, wunt_moved);
@@ -1634,7 +1651,7 @@ void pull_task(runqueue_t *src_rq, prio_
* Note that idle threads have a prio of MAX_PRIO, for this test
* to be always true for them.
*/
- if (TASK_PREEMPTS_CURR(p, this_rq))
+ if (task_preempts_curr(p, this_rq))
resched_task(this_rq->curr);
}
@@ -2315,6 +2332,33 @@ static void check_rlimit(struct task_str
}
/*
+ * Account RT tasks for SCHED_ISO throttle. Called every timer tick.
+ * @p: the process that gets accounted
+ */
+void account_iso_ticks(task_t *p)
+{
+ runqueue_t *rq = this_rq();
+
+ if (rt_task(p)) {
+ if (!rq->iso_throttle) {
+ rq->iso_ticks += (100 - iso_cpu);
+ }
+ } else {
+ rq->iso_ticks -= iso_cpu;
+ if (rq->iso_ticks < 0)
+ rq->iso_ticks = 0;
+ }
+
+ if (rq->iso_ticks >
+ (iso_timeout * (100 - iso_cpu) * HZ / 100 + 100)) {
+ rq->iso_throttle = 1;
+ } else {
+ rq->iso_throttle = 0;
+ }
+
+}
+
+/*
* Account user cpu time to a process.
* @p: the process that the cpu time gets accounted to
* @hardirq_offset: the offset to subtract from hardirq_count()
@@ -2427,7 +2471,7 @@ void scheduler_tick(void)
* timeslice. This makes it possible for interactive tasks
* to use up their timeslices at their highest priority levels.
*/
- if (rt_task(p)) {
+ if (posix_rt_task(p)) {
/*
* RR tasks need a special form of timeslice management.
* FIFO tasks have no timeslices.
@@ -2442,6 +2486,22 @@ void scheduler_tick(void)
}
goto out_unlock;
}
+
+ if (iso_task(p)) {
+ if (rq->iso_throttle) {
+ iso_throttle_count++;
+ set_tsk_need_resched(p);
+ goto out_unlock;
+ }
+ if (!(--p->time_slice % GRANULARITY)) {
+ requeue_task(p, rq->active);
+ set_tsk_need_resched(p);
+ }
+ if (!p->time_slice)
+ p->time_slice = task_timeslice(p);
+ goto out_unlock;
+ }
+
if (!--p->time_slice) {
dequeue_task(p, rq->active);
set_tsk_need_resched(p);
@@ -2646,6 +2706,20 @@ EXPORT_SYMBOL(sub_preempt_count);
#endif
+static inline void expire_all_iso_tasks(prio_array_t *active,
+ prio_array_t *expired)
+{
+ struct list_head *queue;
+ task_t *next;
+
+ queue = active->queue + ISO_PRIO;
+ while (!list_empty(queue)) {
+ next = list_entry(queue->next, task_t, run_list);
+ dequeue_task(next, active);
+ enqueue_task(next, expired);
+ }
+}
+
/*
* schedule() is the main scheduler function.
*/
@@ -2753,6 +2827,7 @@ go_idle:
}
array = rq->active;
+switch_to_expired:
if (unlikely(!array->nr_active)) {
/*
* Switch the active and expired arrays.
@@ -2767,6 +2842,21 @@ go_idle:
schedstat_inc(rq, sched_noswitch);
idx = sched_find_first_bit(array->bitmap);
+ if (unlikely(rq->iso_throttle && (idx == ISO_PRIO))) {
+ idx = find_next_bit(array->bitmap, MAX_PRIO, ISO_PRIO + 1);
+ if (idx >= MAX_PRIO) {
+ /*
+ * only SCHED_ISO tasks in active array
+ */
+ if (rq->expired->nr_active) {
+ expire_all_iso_tasks(array, rq->expired);
+ goto switch_to_expired;
+ } else {
+ idx = ISO_PRIO;
+ }
+ }
+ }
+
queue = array->queue + idx;
next = list_entry(queue->next, task_t, run_list);
@@ -3213,7 +3303,8 @@ static void __setscheduler(struct task_s
BUG_ON(p->array);
p->policy = policy;
p->rt_priority = prio;
- if (policy != SCHED_NORMAL)
+
+ if (SCHED_RT(policy))
p->prio = MAX_USER_RT_PRIO-1 - p->rt_priority;
else
p->prio = p->static_prio;
@@ -3238,9 +3329,8 @@ recheck:
/* double check policy once rq lock held */
if (policy < 0)
policy = oldpolicy = p->policy;
- else if (policy != SCHED_FIFO && policy != SCHED_RR &&
- policy != SCHED_NORMAL)
- return -EINVAL;
+ else if (!SCHED_RANGE(policy))
+ return -EINVAL;
/*
* Valid priorities for SCHED_FIFO and SCHED_RR are
* 1..MAX_USER_RT_PRIO-1, valid priority for SCHED_NORMAL is 0.
@@ -3248,12 +3338,19 @@ recheck:
if (param->sched_priority < 0 ||
param->sched_priority > MAX_USER_RT_PRIO-1)
return -EINVAL;
- if ((policy == SCHED_NORMAL) != (param->sched_priority == 0))
+ if ((!SCHED_POSIX_RT(policy)) != (param->sched_priority == 0))
return -EINVAL;
- if ((policy == SCHED_FIFO || policy == SCHED_RR) &&
- !capable(CAP_SYS_NICE))
- return -EPERM;
+ if (SCHED_POSIX_RT(policy) && !capable(CAP_SYS_NICE)) {
+ /*
+ * If the caller requested a POSIX RT policy without
+ * having the necessary rights, we downgrade the policy
+ * to SCHED_ISO. Temporary hack for testing.
+ */
+ policy = SCHED_ISO;
+ param->sched_priority = 0;
+ }
+
if ((current->euid != p->euid) && (current->euid != p->uid) &&
!capable(CAP_SYS_NICE))
return -EPERM;
@@ -3287,7 +3384,7 @@ recheck:
if (task_running(rq, p)) {
if (p->prio > oldprio)
resched_task(rq->curr);
- } else if (TASK_PREEMPTS_CURR(p, rq))
+ } else if (task_preempts_curr(p, rq))
resched_task(rq->curr);
}
task_rq_unlock(rq, &flags);
@@ -3714,6 +3811,7 @@ asmlinkage long sys_sched_get_priority_m
ret = MAX_USER_RT_PRIO-1;
break;
case SCHED_NORMAL:
+ case SCHED_ISO:
ret = 0;
break;
}
@@ -3737,6 +3835,7 @@ asmlinkage long sys_sched_get_priority_m
ret = 1;
break;
case SCHED_NORMAL:
+ case SCHED_ISO:
ret = 0;
}
return ret;
@@ -4010,7 +4109,7 @@ static void __migrate_task(struct task_s
+ rq_dest->timestamp_last_tick;
deactivate_task(p, rq_src);
activate_task(p, rq_dest, 0);
- if (TASK_PREEMPTS_CURR(p, rq_dest))
+ if (task_preempts_curr(p, rq_dest))
resched_task(rq_dest->curr);
}
@@ -4181,7 +4280,7 @@ void sched_idle_next(void)
*/
spin_lock_irqsave(&rq->lock, flags);
- __setscheduler(p, SCHED_FIFO, MAX_RT_PRIO-1);
+ __setscheduler(p, SCHED_FIFO, MAX_RT_PRIO-2);
/* Add idle task to _front_ of it's priority queue */
__activate_idle_task(p, rq);
@@ -4265,7 +4364,7 @@ static int migration_call(struct notifie
kthread_bind(p, cpu);
/* Must be high prio: stop_machine expects to yield to it. */
rq = task_rq_lock(p, &flags);
- __setscheduler(p, SCHED_FIFO, MAX_RT_PRIO-1);
+ __setscheduler(p, SCHED_FIFO, MAX_RT_PRIO-2);
task_rq_unlock(rq, &flags);
cpu_rq(cpu)->migration_thread = p;
break;
diff -Nrup linux-2.6.11-rc1/kernel/sysctl.c linux-2.6.11-rc1-uiso2/kernel/sysctl.c
--- linux-2.6.11-rc1/kernel/sysctl.c 2005-01-21 19:46:55.666515504 +0100
+++ linux-2.6.11-rc1-uiso2/kernel/sysctl.c 2005-01-21 20:30:21.820127147 +0100
@@ -219,6 +219,11 @@ static ctl_table root_table[] = {
{ .ctl_name = 0 }
};
+/* Constants for minimum and maximum testing in kern_table and vm_table.
+ We use these as one-element integer vectors. */
+static int zero;
+static int one_hundred = 100;
+
static ctl_table kern_table[] = {
{
.ctl_name = KERN_OSTYPE,
@@ -633,15 +638,36 @@ static ctl_table kern_table[] = {
.proc_handler = &proc_dointvec,
},
#endif
+ {
+ .ctl_name = KERN_ISO_CPU,
+ .procname = "iso_cpu",
+ .data = &iso_cpu,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec_minmax,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ .extra2 = &one_hundred,
+ },
+ {
+ .ctl_name = KERN_ISO_TIMEOUT,
+ .procname = "iso_timeout",
+ .data = &iso_timeout,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
+ {
+ .ctl_name = KERN_ISO_THROTTLE_COUNT,
+ .procname = "iso_throttle_count",
+ .data = &iso_throttle_count,
+ .maxlen = sizeof(int),
+ .mode = 0444,
+ .proc_handler = &proc_dointvec,
+ },
{ .ctl_name = 0 }
};
-/* Constants for minimum and maximum testing in vm_table.
- We use these as one-element integer vectors. */
-static int zero;
-static int one_hundred = 100;
-
-
static ctl_table vm_table[] = {
{
.ctl_name = VM_OVERCOMMIT_MEMORY,
diff -Nrup linux-2.6.11-rc1/kernel/timer.c linux-2.6.11-rc1-uiso2/kernel/timer.c
--- linux-2.6.11-rc1/kernel/timer.c 2005-01-21 19:46:55.672514892 +0100
+++ linux-2.6.11-rc1-uiso2/kernel/timer.c 2005-01-21 20:30:14.254890301 +0100
@@ -815,6 +815,8 @@ void update_process_times(int user_tick)
struct task_struct *p = current;
int cpu = smp_processor_id();
+ account_iso_ticks(p);
+
/* Note: this timer irq context must be accounted for as well. */
if (user_tick)
account_user_time(p, jiffies_to_cputime(1));
Rui Nuno Capela wrote:
> OK. Here goes my fresh and newly jack_test4.1 test suite. It might be
> still rough, as usual ;)
Thanks
Here are fresh results on more stressed hardware (on ext3) with
2.6.11-rc1-mm2 (which, by the way, has SCHED_ISO v2 included). The load
hovers at 50% and spikes at times close to 70%, which tests the
behaviour under iso throttling.
==> jack_test4-2.6.11-rc1-mm2-fifo.log <==
Number of runs . . . . . . . :( 1)
*********************************************
Timeout Count . . . . . . . . :( 0)
XRUN Count . . . . . . . . . : 41
Delay Count (>spare time) . . : 0
Delay Count (>1000 usecs) . . : 0
Delay Maximum . . . . . . . . : 0 usecs
Cycle Maximum . . . . . . . . : 10968 usecs
Average DSP Load. . . . . . . : 44.3 %
Average CPU System Load . . . : 4.9 %
Average CPU User Load . . . . : 17.1 %
Average CPU Nice Load . . . . : 0.0 %
Average CPU I/O Wait Load . . : 0.0 %
Average CPU IRQ Load . . . . : 0.0 %
Average CPU Soft-IRQ Load . . : 0.0 %
Average Interrupt Rate . . . : 1689.9 /sec
Average Context-Switch Rate . : 19052.6 /sec
*********************************************
Delta Maximum . . . . . . . . : 0.00000
*********************************************
==> jack_test4-2.6.11-rc1-mm2-iso.log <==
Number of runs . . . . . . . :( 1)
*********************************************
Timeout Count . . . . . . . . :( 0)
XRUN Count . . . . . . . . . : 2
Delay Count (>spare time) . . : 0
Delay Count (>1000 usecs) . . : 0
Delay Maximum . . . . . . . . : 0 usecs
Cycle Maximum . . . . . . . . : 1282 usecs
Average DSP Load. . . . . . . : 50.5 %
Average CPU System Load . . . : 11.2 %
Average CPU User Load . . . . : 17.6 %
Average CPU Nice Load . . . . : 0.0 %
Average CPU I/O Wait Load . . : 0.0 %
Average CPU IRQ Load . . . . : 0.0 %
Average CPU Soft-IRQ Load . . : 0.0 %
Average Interrupt Rate . . . : 1688.8 /sec
Average Context-Switch Rate . : 18985.1 /sec
*********************************************
Delta Maximum . . . . . . . . : 0.00000
*********************************************
==> jack_test4-2.6.11-rc1-mm2-normal.log <==
Number of runs . . . . . . . :( 1)
*********************************************
Timeout Count . . . . . . . . :( 0)
XRUN Count . . . . . . . . . : 325
Delay Count (>spare time) . . : 0
Delay Count (>1000 usecs) . . : 0
Delay Maximum . . . . . . . . : 0 usecs
Cycle Maximum . . . . . . . . : 4726 usecs
Average DSP Load. . . . . . . : 50.0 %
Average CPU System Load . . . : 5.1 %
Average CPU User Load . . . . : 18.7 %
Average CPU Nice Load . . . . : 0.0 %
Average CPU I/O Wait Load . . : 0.0 %
Average CPU IRQ Load . . . . : 0.1 %
Average CPU Soft-IRQ Load . . : 0.0 %
Average Interrupt Rate . . . : 1704.5 /sec
Average Context-Switch Rate . : 18875.2 /sec
*********************************************
Delta Maximum . . . . . . . . : 0.00000
*********************************************
Full data and pretty pictures:
http://ck.kolivas.org/patches/SCHED_ISO/iso2-benchmarks/
Cheers,
Con
On Sat, 2005-01-22 at 10:48 +1100, Con Kolivas wrote:
> utz lehmann wrote:
> > Hi
> >
> > I dislike the behavior of the SCHED_ISO patch that iso tasks are
> > degraded to SCHED_NORMAL if they exceed the limit.
> > IMHO it's better to throttle them at the iso_cpu limit.
> >
> > I have modified Con's iso2 patch to do this. If iso_cpu > 50 iso tasks
> > only get stalled for 1 tick (1ms on x86).
>
> Some tasks are so cache intensive they would make almost no forward
> progress running for only 1ms.
Ok. The throttle duration can be extended.
What is a good value? 5ms, 10ms?
>
> > Fortunately there is a currently unused task prio (MAX_RT_PRIO-1) [1]. I
>
> Your implementation is not correct. The "prio" field of real time tasks
> is determined by MAX_RT_PRIO-1-rt_priority. Therefore you're limiting
> the best real time priority, not the other way around.
Really? The task prios are (lower value is higher priority):
0
.. For SCHED_FIFO/SCHED_RR (rt_priority 99..1)
98 MAX_RT_PRIO-2
99 MAX_RT_PRIO-1 ISO_PRIO (rt_priority 0)
100 MAX_RT_PRIO
.. For SCHED_NORMAL
139 MAX_PRIO-1
ISO_PRIO is between the SCHED_FIFO/SCHED_RR and the SCHED_NORMAL range.
>
> Throttling them for only 1ms will make it very easy to starve the system
> with 1 or more short running (<1ms) SCHED_NORMAL tasks running. Lower
> priority tasks will never run.
>
> Cheers,
> Con
Ingo Molnar <[email protected]> writes:
> just finished a short testrun with nice--20 compared to SCHED_FIFO, on a
> relatively slow 466 MHz box:
> this shows the surprising result that putting all RT tasks on nice--20
> reduced context-switch rate by 20% and the Delay Maximum is lower as
> well. (although the Delay Maximum is quite unreliable so this could be a
> fluke.) But the XRUN count is the same.
> can anyone else reproduce this, with the test-patch below applied?
I finally made new kernel builds for the latest patches from both Ingo
and Con. I kept the two patch sets separate, as they modify some of
the same files.
I ran three sets of tests with three or more 5 minute runs for each
case. The results (log files and graphs) are in these directories...
1) sched-fifo -- as a baseline
http://www.joq.us/jack/benchmarks/sched-fifo
2) sched-iso -- Con's scheduler, no privileges
http://www.joq.us/jack/benchmarks/sched-iso
3) nice-20 -- Ingo's "nice --20" scheduler hack
http://www.joq.us/jack/benchmarks/nice-20
The SCHED_FIFO runs are all with Con's scheduler. I could not figure
out how to get SCHED_FIFO working with Ingo's version. With or
without the appropriate privileges, it used nice --20, instead. I
used schedtool to verify that the realtime threads were running in the
expected class for each test.
It's hard to make much sense out of all this information. The
SCHED_FIFO results are clearly best. There were no xruns at all in
those three runs. All of the others had at least a few, some quite
severe. But, one of the nice-20 runs had just one small sub-
millisecond xrun. I made some extra runs with that, because I was
puzzled by its lack of consistency.
Yet, both Ingo's and Con's schedulers basically seem to work well.
I'm not sure how to explain the xruns. Maybe they are caused by other
kernel latency bugs. (But then, why not SCHED_FIFO?) Maybe those
schedulers work most of the time, but are not sufficiently careful to
always preempt the running process when an audio interrupt arrives?
I had some problems with the y2 graph axis (for XRUN and DELAY). In
most of the graphs it is unreadable. In some it is inconsistent. I
hacked on the jack_test3_plot.sh script several times, trying to set
readable values, mostly without success. There is too much variation
in those numbers. So, be careful reading and comparing that
information. Some xruns look better or worse than they really are.
These tests were run without any other heavy demands on the system. I
want to try some with a compile running in the background. But, I
won't have time for that until tomorrow at the earliest. So, I'll
post these preliminary results now for your enjoyment.
--
joq
utz lehmann wrote:
> On Sat, 2005-01-22 at 10:48 +1100, Con Kolivas wrote:
>
>>utz lehmann wrote:
>>
>>>Hi
>>>
>>>I dislike the behavior of the SCHED_ISO patch that iso tasks are
>>>degraded to SCHED_NORMAL if they exceed the limit.
>>>IMHO it's better to throttle them at the iso_cpu limit.
>>>
>>>I have modified Con's iso2 patch to do this. If iso_cpu > 50 iso tasks
>>>only get stalled for 1 tick (1ms on x86).
>>
>>Some tasks are so cache intensive they would make almost no forward
>>progress running for only 1ms.
>
>
> Ok. The throttle duration can be extended.
> What is a good value? 5ms, 10ms?
It's architecture and cpu dependent. Useful timeslices to avoid cache
thrashing vary from 2ms to 20ms. Also, HZ varies between architectures
and setups, and will almost certainly vary in some dynamic way in the
future, substantially altering the accuracy of such a setup.
>>>Fortunately there is a currently unused task prio (MAX_RT_PRIO-1) [1]. I
>>
>>Your implementation is not correct. The "prio" field of real time tasks
>>is determined by MAX_RT_PRIO-1-rt_priority. Therefore you're limiting
>>the best real time priority, not the other way around.
>
>
> Really? The task prios are (lower value is higher priority):
>
> 0
> .. For SCHED_FIFO/SCHED_RR (rt_priority 99..1)
> 98 MAX_RT_PRIO-2
>
> 99 MAX_RT_PRIO-1 ISO_PRIO (rt_priority 0)
>
> 100 MAX_RT_PRIO
> .. For SCHED_NORMAL
> 139 MAX_PRIO-1
>
> ISO_PRIO is between the SCHED_FIFO/SCHED_RR and the SCHED_NORMAL range.
I wasn't debating that fact. I was saying that decreasing the range of
priorities you can have for real time will lose the highest priority ones.
if (SCHED_RT(policy))
p->prio = MAX_USER_RT_PRIO-1 - p->rt_priority;
>>Throttling them for only 1ms will make it very easy to starve the system
>> with 1 or more short running (<1ms) SCHED_NORMAL tasks running. Lower
>>priority tasks will never run.
can I also comment on:
+ while (!list_empty(queue)) {
+ next = list_entry(queue->next, task_t, run_list);
+ dequeue_task(next, active);
+ enqueue_task(next, expired);
+ }
O(n) functions are a bad idea in critical codepaths, even if they only
get hit when there is more than one SCHED_ISO task queued.
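To make the cost concrete, here is a hedged sketch (splice_iso_tasks is
a hypothetical name, untested) of why a constant-time bulk move is not
readily available in the 2.6 prio_array scheme: list_splice_init() can
move the whole list head in O(1), but the p->array back-pointer that
dequeue_task()/enqueue_task() maintain in each task still forces a walk
over everything moved:

static inline void splice_iso_tasks(prio_array_t *active,
				    prio_array_t *expired)
{
	/* Caller guarantees the active iso queue is non-empty. */
	struct list_head *queue = active->queue + ISO_PRIO;
	unsigned int moved = 0;
	task_t *p;

	/* Fixing the back-pointers is what keeps this O(n). */
	list_for_each_entry(p, queue, run_list) {
		p->array = expired;
		moved++;
	}
	active->nr_active -= moved;
	expired->nr_active += moved;
	list_splice_init(queue, expired->queue + ISO_PRIO);
	__clear_bit(ISO_PRIO, active->bitmap);
	__set_bit(ISO_PRIO, expired->bitmap);
}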
Apart from those, I'm not really sure what advantage this different
design has. Once you go over the cpu limit the behaviour is grey and
your design basically complicates what is already simple - to make an
unprivileged task starvation free you run it SCHED_NORMAL. I know you
want it back to high priority as soon as possible, but I fail to see how
this is any better. They're either real time or not depending on what
limits you set in either design.
As for priority support, I have been working on it. While the test cases
I've been involved in show no need for it, I can understand why it would
be desirable.
Cheers,
Con
Con Kolivas <[email protected]> writes:
> As for priority support, I have been working on it. While the test
> cases I've been involved in show no need for it, I can understand why
> it would be desirable.
Yes. Rui's jack_test3.2 does not require multiple realtime
priorities, but I can point to applications that do. Their reasons
for working that way make sense and should be supported.
For example, the JACK Audio Mastering interface (JAMin) does a Fast
Fourier Transform on the audio for phase-neutral frequency domain
crossover and EQ processing. This is very CPU intensive, but modern
processors can handle it and the sound is outstanding. The FFT
algorithm uses a moving window with a natural block size of 256
frames. When the JACK buffer size is large enough, JAMin performs
this operation directly in the process callback.
When the JACK buffer size is smaller than 256 frames that won't work.
So, JAMin queues the audio to a realtime helper thread running at a
priority one less than the JACK process thread. So, when JACK is
running at 64 frames per cycle (the jack_test3.2 default), JAMin's FFT
thread will have four process cycles in which to compute its next FFT
window. This adds latency, but permits the application to work even
when the overall JACK graph is running at rather low latencies. If
the scheduler were to run that thread at the same priority as the JACK
process thread, it would practically guarantee xruns. This would
cause JAMin to be unfairly ejected from the JACK graph for failing to
meet its realtime deadlines.
So, there are legitimate examples of realtime applications needing to
use more than one scheduler priority.
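As a minimal sketch of that arrangement in plain pthreads (the names
fft_helper/start_helper are hypothetical and error handling is
trimmed), the helper is created one priority step below the calling
realtime thread:

#include <pthread.h>
#include <sched.h>

static void *fft_helper(void *arg)
{
	/* ... dequeue audio blocks, compute the FFT window, requeue ... */
	return NULL;
}

static int start_helper(pthread_t *tid)
{
	pthread_attr_t attr;
	struct sched_param sp;
	int policy;

	/* Inherit the policy of the calling (JACK process) thread... */
	if (pthread_getschedparam(pthread_self(), &policy, &sp))
		return -1;

	pthread_attr_init(&attr);
	pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
	pthread_attr_setschedpolicy(&attr, policy);

	/* ...but run one priority step below it. */
	sp.sched_priority -= 1;
	pthread_attr_setschedparam(&attr, &sp);

	return pthread_create(tid, &attr, fft_helper, NULL);
}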
--
joq
Con Kolivas <[email protected]> writes:
> Here's fresh results on more stressed hardware (on ext3) with
> 2.6.11-rc1-mm2 (which by the way has SCHED_ISO v2 included). The load
> hovering at 50% spikes at times close to 70 which tests the behaviour
> under iso throttling.
What version of JACK are you running (`jackd --version')?
You're still getting zero Delay Max. That is an important measure.
--
joq
Jack O'Quin wrote:
> Con Kolivas <[email protected]> writes:
>
>
>>Here's fresh results on more stressed hardware (on ext3) with
>>2.6.11-rc1-mm2 (which by the way has SCHED_ISO v2 included). The load
>>hovering at 50% spikes at times close to 70 which tests the behaviour
>>under iso throttling.
>
>
> What version of JACK are you running (`jackd --version')?
>
> You're still getting zero Delay Max. That is an important measure.
Oops I haven't updated it on this machine.
jackd version 0.99.0 tmpdir /tmp protocol 13
Con
Jack O'Quin wrote:
> Con Kolivas <[email protected]> writes:
>
>
>>Here's fresh results on more stressed hardware (on ext3) with
>>2.6.11-rc1-mm2 (which by the way has SCHED_ISO v2 included). The load
>>hovering at 50% spikes at times close to 70 which tests the behaviour
>>under iso throttling.
>
>
> What version of JACK are you running (`jackd --version')?
>
> You're still getting zero Delay Max. That is an important measure.
Ok updated jackd
Here's an updated set of runs. Not very impressive even with SCHED_FIFO,
but the same from both policies.
==> jack_test4-2.6.11-rc1-mm2-fifo.log <==
Number of runs . . . . . . . :( 1)
*********************************************
Timeout Count . . . . . . . . :( 0)
XRUN Count . . . . . . . . . : 404
Delay Count (>spare time) . . : 0
Delay Count (>1000 usecs) . . : 0
Delay Maximum . . . . . . . . : 261254 usecs
Cycle Maximum . . . . . . . . : 2701 usecs
Average DSP Load. . . . . . . : 52.4 %
Average CPU System Load . . . : 5.1 %
Average CPU User Load . . . . : 18.1 %
Average CPU Nice Load . . . . : 0.0 %
Average CPU I/O Wait Load . . : 0.0 %
Average CPU IRQ Load . . . . : 0.0 %
Average CPU Soft-IRQ Load . . : 0.0 %
Average Interrupt Rate . . . : 1699.3 /sec
Average Context-Switch Rate . : 19018.9 /sec
*********************************************
Delta Maximum . . . . . . . . : 0.00000
*********************************************
==> jack_test4-2.6.11-rc1-mm2-iso.log <==
Number of runs . . . . . . . :( 1)
*********************************************
Timeout Count . . . . . . . . :( 0)
XRUN Count . . . . . . . . . : 408
Delay Count (>spare time) . . : 0
Delay Count (>1000 usecs) . . : 0
Delay Maximum . . . . . . . . : 269804 usecs
Cycle Maximum . . . . . . . . : 2449 usecs
Average DSP Load. . . . . . . : 52.6 %
Average CPU System Load . . . : 5.0 %
Average CPU User Load . . . . : 17.8 %
Average CPU Nice Load . . . . : 0.0 %
Average CPU I/O Wait Load . . : 0.1 %
Average CPU IRQ Load . . . . : 0.0 %
Average CPU Soft-IRQ Load . . : 0.0 %
Average Interrupt Rate . . . : 1699.2 /sec
Average Context-Switch Rate . : 19041.0 /sec
*********************************************
Delta Maximum . . . . . . . . : 0.00000
*********************************************
I've updated the pretty graphs and removed the dud runs from here:
http://ck.kolivas.org/patches/SCHED_ISO/iso2-benchmarks/
Con
Con Kolivas wrote:
> Jack O'Quin wrote:
>
>> Con Kolivas <[email protected]> writes:
>>
>>
>>> Here's fresh results on more stressed hardware (on ext3) with
>>> 2.6.11-rc1-mm2 (which by the way has SCHED_ISO v2 included). The load
>>> hovering at 50% spikes at times close to 70 which tests the behaviour
>>> under iso throttling.
>>
>>
>>
>> What version of JACK are you running (`jackd --version')?
>>
>> You're still getting zero Delay Max. That is an important measure.
>
>
> Ok updated jackd
>
> Here's an updated set of runs. Not very impressive even with SCHED_FIFO,
> but the same from both policies.
>
> ==> jack_test4-2.6.11-rc1-mm2-fifo.log <==
> Number of runs . . . . . . . :( 1)
> *********************************************
> Timeout Count . . . . . . . . :( 0)
> XRUN Count . . . . . . . . . : 404
> Delay Count (>spare time) . . : 0
> Delay Count (>1000 usecs) . . : 0
> Delay Maximum . . . . . . . . : 261254 usecs
> Cycle Maximum . . . . . . . . : 2701 usecs
> Average DSP Load. . . . . . . : 52.4 %
> Average CPU System Load . . . : 5.1 %
> Average CPU User Load . . . . : 18.1 %
> Average CPU Nice Load . . . . : 0.0 %
> Average CPU I/O Wait Load . . : 0.0 %
> Average CPU IRQ Load . . . . : 0.0 %
> Average CPU Soft-IRQ Load . . : 0.0 %
> Average Interrupt Rate . . . : 1699.3 /sec
> Average Context-Switch Rate . : 19018.9 /sec
> *********************************************
> Delta Maximum . . . . . . . . : 0.00000
> *********************************************
>
> ==> jack_test4-2.6.11-rc1-mm2-iso.log <==
> Number of runs . . . . . . . :( 1)
> *********************************************
> Timeout Count . . . . . . . . :( 0)
> XRUN Count . . . . . . . . . : 408
> Delay Count (>spare time) . . : 0
> Delay Count (>1000 usecs) . . : 0
> Delay Maximum . . . . . . . . : 269804 usecs
> Cycle Maximum . . . . . . . . : 2449 usecs
> Average DSP Load. . . . . . . : 52.6 %
> Average CPU System Load . . . : 5.0 %
> Average CPU User Load . . . . : 17.8 %
> Average CPU Nice Load . . . . : 0.0 %
> Average CPU I/O Wait Load . . : 0.1 %
> Average CPU IRQ Load . . . . : 0.0 %
> Average CPU Soft-IRQ Load . . : 0.0 %
> Average Interrupt Rate . . . : 1699.2 /sec
> Average Context-Switch Rate . : 19041.0 /sec
> *********************************************
> Delta Maximum . . . . . . . . : 0.00000
> *********************************************
Bah stupid me. Both of those are SCHED_NORMAL.
Ignore those, and I'll try again.
Con
Con Kolivas wrote:
> Con Kolivas wrote:
>
>> Jack O'Quin wrote:
>>
>>> Con Kolivas <[email protected]> writes:
>>>
>>>
>>>> Here's fresh results on more stressed hardware (on ext3) with
>>>> 2.6.11-rc1-mm2 (which by the way has SCHED_ISO v2 included). The load
>>>> hovering at 50% spikes at times close to 70 which tests the behaviour
>>>> under iso throttling.
>>>
>>>
>>>
>>>
>>> What version of JACK are you running (`jackd --version')?
>>>
>>> You're still getting zero Delay Max. That is an important measure.
>>
>> Ok updated jackd
So let's try again, sorry about the noise:
==> jack_test4-2.6.11-rc1-mm2-fifo.log <==
Number of runs . . . . . . . :( 1)
*********************************************
Timeout Count . . . . . . . . :( 0)
XRUN Count . . . . . . . . . : 3
Delay Count (>spare time) . . : 0
Delay Count (>1000 usecs) . . : 0
Delay Maximum . . . . . . . . : 20161 usecs
Cycle Maximum . . . . . . . . : 1072 usecs
Average DSP Load. . . . . . . : 47.2 %
Average CPU System Load . . . : 5.1 %
Average CPU User Load . . . . : 18.0 %
Average CPU Nice Load . . . . : 0.1 %
Average CPU I/O Wait Load . . : 0.3 %
Average CPU IRQ Load . . . . : 0.0 %
Average CPU Soft-IRQ Load . . : 0.0 %
Average Interrupt Rate . . . : 1701.6 /sec
Average Context-Switch Rate . : 19343.7 /sec
*********************************************
Delta Maximum . . . . . . . . : 0.00000
*********************************************
==> jack_test4-2.6.11-rc1-mm2-iso.log <==
Number of runs . . . . . . . :( 1)
*********************************************
Timeout Count . . . . . . . . :( 0)
XRUN Count . . . . . . . . . : 6
Delay Count (>spare time) . . : 0
Delay Count (>1000 usecs) . . : 0
Delay Maximum . . . . . . . . : 4604 usecs
Cycle Maximum . . . . . . . . : 1190 usecs
Average DSP Load. . . . . . . : 54.5 %
Average CPU System Load . . . : 11.6 %
Average CPU User Load . . . . : 18.4 %
Average CPU Nice Load . . . . : 0.1 %
Average CPU I/O Wait Load . . : 0.0 %
Average CPU IRQ Load . . . . : 0.0 %
Average CPU Soft-IRQ Load . . : 0.0 %
Average Interrupt Rate . . . : 1697.9 /sec
Average Context-Switch Rate . : 19046.2 /sec
*********************************************
Delta Maximum . . . . . . . . : 0.00000
*********************************************
Pretty pictures:
http://ck.kolivas.org/patches/SCHED_ISO/iso2-benchmarks/
Note these are on a full desktop environment, although it is pretty much
idle apart from checking email. No changes between fifo and iso runs.
Cheers,
Con
>>>>>> "Jack" == Jack O'Quin <[email protected]> writes:
>
>
>Jack> Looks like we need to do another study to determine which
>Jack> filesystem works best for multi-track audio recording and
>Jack> playback. XFS looks promising, but only if they get the latency
>Jack> right. Any experience with that?
>
>The nice thing about audio/video and XFS is that if you know ahead of
>time the max size of a file (and you usually do -- because you know
>ahead of time how long a take is going to be) you can precreate the
>file as a contiguous chunk, then just fill it in, for minimum disc
>latency.
I don't know what world you're in, but this simply isn't the case in
my experience - you generally have absolutely no idea how long a take
is going to be.
--p
* Jack O'Quin <[email protected]> wrote:
> I finally made new kernel builds for the latest patches from both Ingo
> and Con. I kept the two patch sets separate, as they modify some of
> the same files.
>
> I ran three sets of tests with three or more 5 minute runs for each
> case. The results (log files and graphs) are in these directories...
>
> 1) sched-fifo -- as a baseline
> http://www.joq.us/jack/benchmarks/sched-fifo
>
> 2) sched-iso -- Con's scheduler, no privileges
> http://www.joq.us/jack/benchmarks/sched-iso
>
> 3) nice-20 -- Ingo's "nice --20" scheduler hack
> http://www.joq.us/jack/benchmarks/nice-20
thanks for the testing. The important result is that nice--20
performance is roughly the same as SCHED_ISO. This somewhat
reduces the urgency of the introduction of SCHED_ISO.
Ingo
Con Kolivas <[email protected]> writes:
> So let's try again, sorry about the noise:
>
> ==> jack_test4-2.6.11-rc1-mm2-fifo.log <==
> *********************************************
> XRUN Count . . . . . . . . . : 3
> Delay Maximum . . . . . . . . : 20161 usecs
> *********************************************
>
> ==> jack_test4-2.6.11-rc1-mm2-iso.log <==
> *********************************************
> XRUN Count . . . . . . . . . : 6
> Delay Maximum . . . . . . . . : 4604 usecs
> *********************************************
>
> Pretty pictures:
> http://ck.kolivas.org/patches/SCHED_ISO/iso2-benchmarks/
Neither run exhibits reliable audio performance. There is some low
latency performance problem with your system. Maybe ReiserFS is
causing trouble even with logging turned off. Perhaps the problem is
somewhere else. Maybe some device is misbehaving.
Until you solve this problem, beware of drawing conclusions.
--
joq
Ingo Molnar <[email protected]> writes:
> thanks for the testing. The important result is that nice--20
> performance is roughly the same as SCHED_ISO. This somewhat
> reduces the urgency of the introduction of SCHED_ISO.
I can see why you feel that way, but don't share your conclusion.
First, only SCHED_FIFO worked reliably in my tests. In Con's tests
even that did not work. My system is probably better tuned for low
latency than his. Until we can determine why there were so many
xruns, it is premature to declare victory for either scheduler.
Preferably, we should compare them on a well-tuned low-latency
system running your Realtime Preemption kernel.
Second, the nice(-20) scheduler provides no clear way to support
multiple realtime priorities. This is necessary for some audio
applications, but not jack_test3.2.
Third, your prototype denies SCHED_FIFO to privileged threads. This
is a serious problem, even for testing (though perhaps easy to fix).
Most important, let's not forget that this long discussion started
because ordinary users need access to realtime scheduling. Con's
scheduler provides a solution for that problem. Your prototype does
not.
Chris Wright and Arjan van de Ven have outlined a proposal to address
the privilege issue using rlimits. This is still the only workable
alternative to the realtime LSM on the table. If the decision were up
to me, I would choose the simplicity and better security of the LSM.
But their approach is adequate, if implemented in a timely fashion. I
would like to see some progress on this in addition to the scheduler
work. People still need SCHED_FIFO for some applications.
Right now, SCHED_ISO still looks better than nice(-20) for audio. It
works without special permissions. The throttling threshold is
adjustable with appropriate privileges. It has the potential to
support multiple priorities.
Being less entangled with SCHED_NORMAL makes me worry less about
someone coming along later and messing it up while working on some
unrelated problem. Right now for example, mounting an encrypted
filesystem starts a `loop0' kernel thread at nice -20.
--
joq
Jack O'Quin wrote:
> Con Kolivas <[email protected]> writes:
>
>
>>So let's try again, sorry about the noise:
>>
>>==> jack_test4-2.6.11-rc1-mm2-fifo.log <==
>>*********************************************
>>XRUN Count . . . . . . . . . : 3
>>Delay Maximum . . . . . . . . : 20161 usecs
>>*********************************************
>>
>>==> jack_test4-2.6.11-rc1-mm2-iso.log <==
>>*********************************************
>>XRUN Count . . . . . . . . . : 6
>>Delay Maximum . . . . . . . . : 4604 usecs
>>*********************************************
>>
>>Pretty pictures:
>>http://ck.kolivas.org/patches/SCHED_ISO/iso2-benchmarks/
>
>
> Neither run exhibits reliable audio performance. There is some low
> latency performance problem with your system. Maybe ReiserFS is
> causing trouble even with logging turned off. Perhaps the problem is
> somewhere else. Maybe some device is misbehaving.
>
> Until you solve this problem, beware of drawing conclusions.
Sigh.. I guess you want me to do all the benchmarking. Well it's easy
enough to get good results. I'll simply turn off all services and not
run a desktop. This is all on ext3 on a fully laden desktop by the way,
but if you want to get the results you're looking for I can easily drop
down to a console and get perfect results.
Con
Jack O'Quin wrote:
>
> Neither run exhibits reliable audio performance. There is some low
> latency performance problem with your system. Maybe ReiserFS is
> causing trouble even with logging turned off. Perhaps the problem is
> somewhere else. Maybe some device is misbehaving.
>
> Until you solve this problem, beware of drawing conclusions.
The idea is to get equivalent performance to SCHED_FIFO. The results
show that much, and it is 100 times better than unprivileged
SCHED_NORMAL. The fact that this is an unoptimised normal desktop
environment means that the conclusion we _can_ draw is that SCHED_ISO is
as good as SCHED_FIFO for audio on the average desktop. I need someone
with optimised hardware setup to see if it's as good as SCHED_FIFO in
the critical setup.
I'm actually not an audio person and have no need for such a setup, but
I can see how linux would benefit from such support... ;)
Cheers,
Con
>The idea is to get equivalent performance to SCHED_FIFO. The results
>show that much, and it is 100 times better than unprivileged
>SCHED_NORMAL. The fact that this is an unoptimised normal desktop
>environment means that the conclusion we _can_ draw is that SCHED_ISO is
>as good as SCHED_FIFO for audio on the average desktop. I need someone
no, this isn't true. the performance you are getting isn't as good as
SCHED_FIFO on a tuned system (h/w and s/w). the difference might be
the fact that you have "an average desktop", or it might be that your
desktop is just fine and SCHED_ISO actually is not as good as
SCHED_FIFO.
>with optimised hardware setup to see if it's as good as SCHED_FIFO in
>the critical setup.
agreed. i have every confidence that Lee and/or Jack will be
forthcoming :)
--p
Paul Davis wrote:
>>The idea is to get equivalent performance to SCHED_FIFO. The results
>>show that much, and it is 100 times better than unprivileged
>>SCHED_NORMAL. The fact that this is an unoptimised normal desktop
>>environment means that the conclusion we _can_ draw is that SCHED_ISO is
>>as good as SCHED_FIFO for audio on the average desktop. I need someone
>
>
> no, this isn't true. the performance you are getting isn't as good as
> SCHED_FIFO on a tuned system (h/w and s/w). the difference might be
> the fact that you have "an average desktop", or it might be that your
> desktop is just fine and SCHED_ISO actually is not as good as
> SCHED_FIFO.
<pedantic mode>
On my desktop, whatever that is, SCHED_FIFO and SCHED_ISO results were
the same.
</pedantic mode>
>
>>with optimised hardware setup to see if it's as good as SCHED_FIFO in
>>the critical setup.
>
>
> agreed. i have every confidence that Lee and/or Jack will be
> forthcoming :)
Good stuff :).
Meanwhile, I have the priority support working (but not bug free), and
the preliminary results suggest that the results are better. Do I recall
someone mentioning jackd uses threads at different priority?
Cheers,
Con
P.S. If you read any emotion in my emails without a smiley or frowny
face it's unintentional and is the limited emotional range the email
format is allowed to convey. Hmm.. perhaps I should make this my sig ;)
Jack O'Quin wrote:
> Chris Wright and Arjan van de Ven have outlined a proposal to address
> the privilege issue using rlimits. This is still the only workable
> alternative to the realtime LSM on the table. If the decision were up
> to me, I would choose the simplicity and better security of the LSM.
> But their approach is adequate, if implemented in a timely fashion. I
> would like to see some progress on this in addition to the scheduler
> work. People still need SCHED_FIFO for some applications.
>
I think this is a pretty sane and minimally intrusive (for the kernel)
way to support what you want.
* Nick Piggin ([email protected]) wrote:
> Jack O'Quin wrote:
>
> > Chris Wright and Arjan van de Ven have outlined a proposal to address
> > the privilege issue using rlimits. This is still the only workable
> > alternative to the realtime LSM on the table. If the decision were up
> > to me, I would choose the simplicity and better security of the LSM.
> > But their approach is adequate, if implemented in a timely fashion. I
> > would like to see some progress on this in addition to the scheduler
> > work. People still need SCHED_FIFO for some applications.
> >
>
> I think this is a pretty sane and minimally intrusive (for the kernel)
> way to support what you want.
Here's an untested respin against current bk.
thanks,
-chris
===== include/asm-generic/resource.h 1.1 vs edited =====
--- 1.1/include/asm-generic/resource.h 2005-01-20 21:00:51 -08:00
+++ edited/include/asm-generic/resource.h 2005-01-22 18:54:58 -08:00
@@ -20,8 +20,11 @@
#define RLIMIT_LOCKS 10 /* maximum file locks held */
#define RLIMIT_SIGPENDING 11 /* max number of pending signals */
#define RLIMIT_MSGQUEUE 12 /* maximum bytes in POSIX mqueues */
-
-#define RLIM_NLIMITS 13
+#define RLIMIT_NICE 13 /* max nice prio allowed to raise to
+ 0-39 for nice level 19 .. -20 */
+#define RLIMIT_RTPRIO 14 /* maximum realtime priority */
+
+#define RLIM_NLIMITS 15
#endif
/*
@@ -53,6 +56,8 @@
[RLIMIT_LOCKS] = { RLIM_INFINITY, RLIM_INFINITY }, \
[RLIMIT_SIGPENDING] = { MAX_SIGPENDING, MAX_SIGPENDING }, \
[RLIMIT_MSGQUEUE] = { MQ_BYTES_MAX, MQ_BYTES_MAX }, \
+ [RLIMIT_NICE] = { 0, 0 }, \
+ [RLIMIT_RTPRIO] = { 0, 0 }, \
}
#endif /* __KERNEL__ */
===== include/asm-alpha/resource.h 1.6 vs edited =====
--- 1.6/include/asm-alpha/resource.h 2005-01-20 21:00:50 -08:00
+++ edited/include/asm-alpha/resource.h 2005-01-22 18:58:04 -08:00
@@ -18,8 +18,11 @@
#define RLIMIT_LOCKS 10 /* maximum file locks held */
#define RLIMIT_SIGPENDING 11 /* max number of pending signals */
#define RLIMIT_MSGQUEUE 12 /* maximum bytes in POSIX mqueues */
-
-#define RLIM_NLIMITS 13
+#define RLIMIT_NICE 13 /* max nice prio allowed to raise to
+ 0-39 for nice level 19 .. -20 */
+#define RLIMIT_RTPRIO 14 /* maximum realtime priority */
+
+#define RLIM_NLIMITS 15
#define __ARCH_RLIMIT_ORDER
/*
===== include/asm-mips/resource.h 1.8 vs edited =====
--- 1.8/include/asm-mips/resource.h 2005-01-20 21:00:50 -08:00
+++ edited/include/asm-mips/resource.h 2005-01-22 18:59:29 -08:00
@@ -25,8 +25,11 @@
#define RLIMIT_LOCKS 10 /* maximum file locks held */
#define RLIMIT_SIGPENDING 11 /* max number of pending signals */
#define RLIMIT_MSGQUEUE 12 /* maximum bytes in POSIX mqueues */
+#define RLIMIT_NICE 13 /* max nice prio allowed to raise to
+ 0-39 for nice level 19 .. -20 */
+#define RLIMIT_RTPRIO 14 /* maximum realtime priority */
-#define RLIM_NLIMITS 13 /* Number of limit flavors. */
+#define RLIM_NLIMITS 15 /* Number of limit flavors. */
#define __ARCH_RLIMIT_ORDER
/*
===== include/asm-sparc/resource.h 1.6 vs edited =====
--- 1.6/include/asm-sparc/resource.h 2005-01-20 21:00:50 -08:00
+++ edited/include/asm-sparc/resource.h 2005-01-22 19:00:07 -08:00
@@ -24,8 +24,11 @@
#define RLIMIT_LOCKS 10 /* maximum file locks held */
#define RLIMIT_SIGPENDING 11 /* max number of pending signals */
#define RLIMIT_MSGQUEUE 12 /* maximum bytes in POSIX mqueues */
+#define RLIMIT_NICE 13 /* max nice prio allowed to raise to
+ 0-39 for nice level 19 .. -20 */
+#define RLIMIT_RTPRIO 14 /* maximum realtime priority */
-#define RLIM_NLIMITS 13
+#define RLIM_NLIMITS 15
#define __ARCH_RLIMIT_ORDER
/*
===== include/asm-sparc64/resource.h 1.6 vs edited =====
--- 1.6/include/asm-sparc64/resource.h 2005-01-20 21:00:50 -08:00
+++ edited/include/asm-sparc64/resource.h 2005-01-22 19:00:41 -08:00
@@ -24,8 +24,11 @@
#define RLIMIT_LOCKS 10 /* maximum file locks held */
#define RLIMIT_SIGPENDING 11 /* max number of pending signals */
#define RLIMIT_MSGQUEUE 12 /* maximum bytes in POSIX mqueues */
+#define RLIMIT_NICE 13 /* max nice prio allowed to raise to
+ 0-39 for nice level 19 .. -20 */
+#define RLIMIT_RTPRIO 14 /* maximum realtime priority */
-#define RLIM_NLIMITS 13
+#define RLIM_NLIMITS 15
#define __ARCH_RLIMIT_ORDER
#include <asm-generic/resource.h>
===== include/linux/sched.h 1.274 vs edited =====
--- 1.274/include/linux/sched.h 2005-01-18 12:27:58 -08:00
+++ edited/include/linux/sched.h 2005-01-22 18:52:07 -08:00
@@ -767,6 +767,7 @@ extern void sched_idle_next(void);
extern void set_user_nice(task_t *p, long nice);
extern int task_prio(const task_t *p);
extern int task_nice(const task_t *p);
+extern int can_nice(const task_t *p, const int nice);
extern int task_curr(const task_t *p);
extern int idle_cpu(int cpu);
extern int sched_setscheduler(struct task_struct *, int, struct sched_param *);
===== kernel/sched.c 1.387 vs edited =====
--- 1.387/kernel/sched.c 2005-01-20 16:00:00 -08:00
+++ edited/kernel/sched.c 2005-01-22 18:52:07 -08:00
@@ -3220,6 +3220,19 @@ out_unlock:
EXPORT_SYMBOL(set_user_nice);
+/**
+ * can_nice - check if a task can reduce its nice value
+ * @p: task
+ * @nice: nice value
+ */
+int can_nice(const task_t *p, const int nice)
+{
+ /* convert nice value [19,-20] to rlimit style value [0,39] */
+ int nice_rlim = 19 - nice;
+ return (nice_rlim <= p->signal->rlim[RLIMIT_NICE].rlim_cur ||
+ capable(CAP_SYS_NICE));
+}
+
#ifdef __ARCH_WANT_SYS_NICE
/*
@@ -3239,12 +3252,8 @@ asmlinkage long sys_nice(int increment)
* We don't have to worry. Conceptually one call occurs first
* and we have a single winner.
*/
- if (increment < 0) {
- if (!capable(CAP_SYS_NICE))
- return -EPERM;
- if (increment < -40)
- increment = -40;
- }
+ if (increment < -40)
+ increment = -40;
if (increment > 40)
increment = 40;
@@ -3254,6 +3263,9 @@ asmlinkage long sys_nice(int increment)
if (nice > 19)
nice = 19;
+ if (increment < 0 && !can_nice(current, nice))
+ return -EPERM;
+
retval = security_task_setnice(current, nice);
if (retval)
return retval;
@@ -3369,6 +3381,7 @@ recheck:
return -EINVAL;
if ((policy == SCHED_FIFO || policy == SCHED_RR) &&
+ param->sched_priority > p->signal->rlim[RLIMIT_RTPRIO].rlim_cur &&
!capable(CAP_SYS_NICE))
return -EPERM;
if ((current->euid != p->euid) && (current->euid != p->uid) &&
===== kernel/sys.c 1.104 vs edited =====
--- 1.104/kernel/sys.c 2005-01-11 16:42:35 -08:00
+++ edited/kernel/sys.c 2005-01-22 18:52:07 -08:00
@@ -225,7 +225,7 @@ static int set_one_prio(struct task_stru
error = -EPERM;
goto out;
}
- if (niceval < task_nice(p) && !capable(CAP_SYS_NICE)) {
+ if (niceval < task_nice(p) && !can_nice(p, niceval)) {
error = -EACCES;
goto out;
}
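For completeness, a sketch of how a client would consume this limit
(assuming the patch above is applied and that something privileged,
e.g. a PAM module or a privileged parent process, has already raised
the limit for this user):

#include <stdio.h>
#include <sched.h>
#include <sys/time.h>
#include <sys/resource.h>

#ifndef RLIMIT_RTPRIO
#define RLIMIT_RTPRIO 14	/* value from the patch above */
#endif

int main(void)
{
	struct rlimit rl;
	struct sched_param sp = { .sched_priority = 10 };

	/* Report what this user has been granted. */
	if (getrlimit(RLIMIT_RTPRIO, &rl) == 0)
		printf("rtprio rlimit: cur=%lu max=%lu\n",
		       (unsigned long)rl.rlim_cur,
		       (unsigned long)rl.rlim_max);

	/* With the patch, this succeeds unprivileged whenever
	 * sched_priority <= rlim_cur. */
	if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1) {
		perror("sched_setscheduler");
		return 1;
	}
	return 0;
}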
Con Kolivas <[email protected]> writes:
> Jack O'Quin wrote:
>> Neither run exhibits reliable audio performance. There is some low
>> latency performance problem with your system. Maybe ReiserFS is
>> causing trouble even with logging turned off. Perhaps the problem is
>> somewhere else. Maybe some device is misbehaving.
>>
>> Until you solve this problem, beware of drawing conclusions.
> Sigh.. I guess you want me to do all the benchmarking.
Not at all. I am willing to continue running audio benchmarks for you
and Ingo both. I have been spending significant amounts of time doing
that. I can't work on it full-time, but will continue doing tests as
requested. I just assumed you wanted to be able to produce similar
results on your own system. I would, if I were in your place.
I think you misunderstood my comment. I was pointing out that your
system currently has too low a signal/noise ratio to draw conclusions
about scheduling and latency. How can we tell whether the scheduler
is working or not when there are extremely long XRUNS (~20msec) even
running SCHED_FIFO? Clearly, something is broken. We need to figure
out what the latency problem is with your system before putting too
much faith in those results.
> Well it's easy enough to get good results. I'll simply turn off all
> services and not run a desktop. This is all on ext3 on a fully laden
> desktop by the way, but if you want to get the results you're
> looking for I can easily drop down to a console and get perfect
> results.
That would prove absolutely nothing. The whole purpose in requesting
SCHED_FIFO (or an approximation, like SCHED_ISO) is for audio to work
reliably in a loaded system. You were running your tests in the right
environment. Your results showed it wasn't working, but did not
necessarily indicate a scheduler problem.
My tests were all done with GNOME, metacity, xemacs, and galeon
running. I still use ext2 because back when I tuned my system for
audio, ext3 had very poor low latency behavior. Maybe that has
changed. Or, maybe it hasn't and that is the cause of the latency
spikes you're seeing. I can't figure that out remotely, but I can
suggest some things to try. First, make sure the JACK tmp directory
is mounted on a tmpfs[1]. Then, try the test with ext2, instead of
ext3.
[1] http://www.affenbande.org/~tapas/wiki/index.php?Jackd%20and%20tmpfs%20%28or%20shmfs%29
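(Concretely, assuming JACK's tmpdir is /tmp, that usually means an
fstab entry along the lines of "none /tmp tmpfs defaults 0 0", or an
equivalent "mount -t tmpfs none /tmp"; see the wiki page above for
the details.)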
Tuning Linux PCs for low-latency audio is currently an art, not a
science. But, a considerable body of experience has grown up over the
past few years[2]. It can be done. (If nothing else, you may develop
more sympathy for all the crap Linux audio developers have been
putting up with for so long.)
[2] http://affenbande.org/~tapas/wiki/index.php?Low%20latency%20for%20audio%20work%20on%20linux%202.6.x
I recommend testing these schedulers on the best available low latency
kernel for the clearest signal/noise ratio before drawing any final
conclusions. Right now, that seems to be Ingo's RP version.
The original request that started this whole exercise was for 2.6.10
numbers, which morphed into 2.6.11-rc1. Working on the mainline of
development makes sense. And, the mainline is getting a lot better.
But, 2.6.10 is still far from clean as a vehicle for soft-RT. Your
tests prove that.
--
joq
Jack O'Quin wrote:
[snip lots of valid points]
> suggest some things to try. First, make sure the JACK tmp directory
> is mounted on a tmpfs[1]. Then, try the test with ext2, instead of
Looks like the tmpfs is probably the biggest problem. Here's SCHED_ISO
with just the /tmp mounted on tmpfs change - running on a complete
desktop environment with a 2nd exported X session and my wife browsing
the net and emailing at the same time.
************* SUMMARY RESULT ****************
Total seconds ran . . . . . . : 300
Number of clients . . . . . . : 14
Ports per client . . . . . . : 4
Frames per buffer . . . . . . : 64
Number of runs . . . . . . . :( 1)
*********************************************
Timeout Count . . . . . . . . :( 0)
XRUN Count . . . . . . . . . : 0
Delay Count (>spare time) . . : 0
Delay Count (>1000 usecs) . . : 0
Delay Maximum . . . . . . . . : 72 usecs
Cycle Maximum . . . . . . . . : 1108 usecs
Average DSP Load. . . . . . . : 50.1 %
Average CPU System Load . . . : 10.7 %
Average CPU User Load . . . . : 18.3 %
Average CPU Nice Load . . . . : 0.0 %
Average CPU I/O Wait Load . . : 0.1 %
Average CPU IRQ Load . . . . : 0.0 %
Average CPU Soft-IRQ Load . . : 0.0 %
Average Interrupt Rate . . . : 1693.1 /sec
Average Context-Switch Rate . : 18852.7 /sec
*********************************************
Delta Maximum . . . . . . . . : 0.00000
*********************************************
Warning: empty y2 range [0:0], adjusting to [0:1]
All invalid runs removed and just this one posted here:
http://ck.kolivas.org/patches/SCHED_ISO/iso2-benchmarks/
How's that look?
Cheers,
Con
Con Kolivas <[email protected]> writes:
> Jack O'Quin wrote:
> [snip lots of valid points]
>> suggest some things to try. First, make sure the JACK tmp directory
>> is mounted on a tmpfs[1]. Then, try the test with ext2, instead of
>
> Looks like the tmpfs is probably the biggest problem. Here's SCHED_ISO
> with just the /tmp mounted on tmpfs change - running on a complete
> desktop environment with a 2nd exported X session and my wife
> browsing the net and emailing at the same time.
>
> All invalid runs removed and just this one posted here:
> http://ck.kolivas.org/patches/SCHED_ISO/iso2-benchmarks/
>
> How's that look?
Excellent!
Sorry I didn't warn you about that problem before. JACK audio users
generally know about it, but there's no reason you should have.
So, that was run with ext3?
--
joq
Con Kolivas <[email protected]> writes:
> Meanwhile, I have the priority support working (but not bug free), and
> the preliminary results suggest that the results are better. Do I
> recall someone mentioning jackd uses threads at different priority?
Yes, it does.
I'm not sure whether that matters in this test (it might).
But, I'm certain it matters for some JACK applications.
--
joq
Jack O'Quin wrote:
> Con Kolivas <[email protected]> writes:
>
>
>>Jack O'Quin wrote:
>>[snip lots of valid points]
>>
>>>suggest some things to try. First, make sure the JACK tmp directory
>>>is mounted on a tmpfs[1]. Then, try the test with ext2, instead of
>>
>>Looks like the tmpfs is probably the biggest problem. Here's SCHED_ISO
>>with just the /tmp mounted on tmpfs change - running on a complete
>>desktop environment with a 2nd exported X session and my wife
>>browsing the net and emailing at the same time.
>>
>>All invalid runs removed and just this one posted here:
>>http://ck.kolivas.org/patches/SCHED_ISO/iso2-benchmarks/
>>
>>How's that look?
>
>
> Excellent!
>
> Sorry I didn't warn you about that problem before. JACK audio users
> generally know about it, but there's no reason you should have.
>
> So, that was run with ext3?
Yes I think I mentioned before this is a different machine than the
pentiumM one. It's a P4HT3.06 with on board i810 sound and ext3 (which
explains the vastly different DSP usage). The only "special" measure
taken for jackd was to use the latest jackd code and the tmpfs mount you
suggested. Looks like the number of steps to convert a modern "standard
setup" desktop to a low latency one on linux aren't that big after all :)
Cheers,
Con
Jack O'Quin <[email protected]> writes:
>
> I ran three sets of tests with three or more 5 minute runs for each
> case. The results (log files and graphs) are in these directories...
>
> 1) sched-fifo -- as a baseline
> http://www.joq.us/jack/benchmarks/sched-fifo
>
> 2) sched-iso -- Con's scheduler, no privileges
> http://www.joq.us/jack/benchmarks/sched-iso
>
> 3) nice-20 -- Ingo's "nice --20" scheduler hack
> http://www.joq.us/jack/benchmarks/nice-20
> I had some problems with the y2 graph axis (for XRUN and DELAY). In
> most of the graphs it is unreadable. In some it is inconsistent. I
> hacked on the jack_test3_plot.sh script several times, trying to set
> readable values, mostly without success. There is too much variation
> in those numbers. So, be careful reading and comparing that
> information. Some xruns look better or worse than they really are.
I fixed that problem in the script this way...
--- jack_test3_plot.sh~ Fri Jan 21 15:23:04 2005
+++ jack_test3_plot.sh Sat Jan 22 21:21:58 2005
@@ -33,8 +33,8 @@
set ylabel "CPU Load (%), CTX (x1000/sec)"
set y2label "XRUN, DELAY (msecs)"
set yrange [0:100]
- set y2range [0:*]
- set y2tics 0.2
+ set y2range [0:10]
+ set y2tics 2.0
set terminal png transparent small size 640,320
set output "${NAME}.png"
plot \
Now it gives a consistent, readable range for the XRUN and DELAY data.
Anything over 10msec is "off the graph". Successive graphs are easy
to compare visually.
I went back and regenerated yesterday's graphs from the original log
files with this change, so they're all consistent now for comparison
purposes.
> These tests were run without any other heavy demands on the system. I
> want to try some with a compile running in the background. But, I
> won't have time for that until tomorrow at the earliest. So, I'll
> post these preliminary results now for your enjoyment.
I made more runs today with a compile of ardour running continuously
in the background. These results were much more dramatic than
yesterday's lightly loaded system numbers.
My main conclusion is that on my system sched-fifo works almost
flawlessly, while neither nice-20 nor sched-iso holds up under load.
All the data are here...
http://www.joq.us/jack/benchmarks/
in these six subdirectories...
http://www.joq.us/jack/benchmarks/nice-20
http://www.joq.us/jack/benchmarks/nice-20+compile
http://www.joq.us/jack/benchmarks/sched-fifo
http://www.joq.us/jack/benchmarks/sched-fifo+compile
http://www.joq.us/jack/benchmarks/sched-iso
http://www.joq.us/jack/benchmarks/sched-iso+compile
In many runs with both nice-20 and sched-iso, some of the test clients
failed to meet their deadlines and were evicted from the JACK graph.
This was particularly evident under load (see the nice-20+compile and
sched-iso+compile logs). But, looking back at the logs from
yesterday, I see it also happened without the background compilation.
I didn't notice, because the effects were less obvious. But, this may
explain the rather inconsistent results I noted at the time.
This run[1] shows a particularly dramatic example of this phenomenon.
Note the DSP load dropoff around second 140. After that everything
runs fine because almost half of the clients were ejected.
[1] http://www.joq.us/jack/benchmarks/nice-20+compile/jack_test3-2.6.11-rc1-q2-200501221908.png
There were *no* client failures in *any* of the sched-fifo runs.
So, I reluctantly conclude that neither of the new scheduler
prototypes performs adequately in its current form. We should get
someone else to duplicate these results on a different machine, if
possible.
I'm wondering now if the lack of priority support in the two
prototypes might explain the problems I'm seeing.
--
joq
Jack O'Quin wrote:
> I'm wondering now if the lack of priority support in the two
> prototypes might explain the problems I'm seeing.
Distinctly possible since my results got better with priority support.
However I'm still bugfixing what I've got. Just as a data point here is
an incremental patch for testing which applies to mm2. This survives
a jackd test run at my end but is not ready for inclusion yet.
Cheers,
Con
At 03:50 PM 1/23/2005 +1100, Con Kolivas wrote:
>Looks like the number of steps to convert a modern "standard setup"
>desktop to a low latency one on linux aren't that big after all :)
Yup, modern must be the key. Even Ingo can't help my little ole PIII/500
with YMF-740C. Dang thing can't handle -p64 (alsa rejects that, causing
jackd to become terminally upset), and it can't even handle 4 clients at
SCHED_FIFO despite latest/greatest RT preempt kernel without xruns.
Bugger... downloaded all that nifty sounding stuff for _nothing_ ;-) See
attached highly trimmed log for humorous results. (and /tmp isn't the
problem here)
-Mike
>Yup, modern must be the key. Even Ingo can't help my little ole PIII/500
>with YMF-740C. Dang thing can't handle -p64 (alsa rejects that, causing
>jackd to become terminally upset), and it can't even handle 4 clients at
>SCHED_FIFO despite latest/greatest RT preempt kernel without xruns.
>
>Bugger... downloaded all that nifty sounding stuff for _nothing_ ;-) See
>attached highly trimmed log for humorous results. (and /tmp isn't the
>problem here)
correct. the yamaha ymf interfaces are a joke for low latency
audio. they weren't designed correctly at the h/w level, and they
don't work correctly for low latency on Windows with ASIO either. i
don't know what yamaha was thinking, since other companies understood
how to do this long before the first YMF card came out.
--p
Ingo Molnar <[email protected]> writes:
> thanks for the testing. The important result is that nice--20
> performance is roughly the same as SCHED_ISO. This somewhat
> reduces the urgency of the introduction of SCHED_ISO.
Doing more runs and a more thorough analysis has driven me to a
different conclusion. The important result is that *neither* nice-20
*nor* SCHED_ISO work properly in their current forms.
For further comparison, I booted an old 2.4.19 kernel with Andrew
Morton's low-latency patches and ran the same test SCHED_FIFO, with
and without background compiles. The results were roughly the same as
SCHED_FIFO on 2.6.11-rc1...
http://www.joq.us/jack/benchmarks/2.4ll-fifo
http://www.joq.us/jack/benchmarks/2.4ll-fifo+compile
In addition, I extracted some across the board information by grepping
for key results. Looking at these numbers in aggregate paints a
pretty convincing picture that neither of the new scheduler prototypes
is performing adequately compared to SCHED_FIFO on either 2.4ll or
2.6.
http://www.joq.us/jack/benchmarks/cycle_max.log
http://www.joq.us/jack/benchmarks/delay_max.log
http://www.joq.us/jack/benchmarks/xrun_count.log
Looking at delay_max broken down by directory is particularly
striking. Below, I grouped the values by scheduling class to highlight
the differences. These kinds of worst-case numbers are what
realtime applications designers are generally most interested in...
============= SCHED_FIFO ==============
...benchmarks/2.4ll-fifo...
Delay Maximum . . . . . . . . : 823 usecs
Delay Maximum . . . . . . . . : 303 usecs
...benchmarks/2.4ll-fifo+compile...
Delay Maximum . . . . . . . . : 926 usecs
Delay Maximum . . . . . . . . : 663 usecs
...benchmarks/sched-fifo...
Delay Maximum . . . . . . . . : 347 usecs
Delay Maximum . . . . . . . . : 277 usecs
Delay Maximum . . . . . . . . : 246 usecs
...benchmarks/sched-fifo+compile...
Delay Maximum . . . . . . . . : 285 usecs
Delay Maximum . . . . . . . . : 269 usecs
Delay Maximum . . . . . . . . : 277 usecs
Delay Maximum . . . . . . . . : 569 usecs
Delay Maximum . . . . . . . . : 461 usecs
============= nice(-20) ==============
...benchmarks/nice-20...
Delay Maximum . . . . . . . . : 13818 usecs
Delay Maximum . . . . . . . . : 155637 usecs
Delay Maximum . . . . . . . . : 487 usecs
Delay Maximum . . . . . . . . : 160328 usecs
Delay Maximum . . . . . . . . : 495328 usecs
...benchmarks/nice-20+compile...
Delay Maximum . . . . . . . . : 183083 usecs
Delay Maximum . . . . . . . . : 5976 usecs
Delay Maximum . . . . . . . . : 18155 usecs
Delay Maximum . . . . . . . . : 557 usecs
============= SCHED_ISO ==============
...benchmarks/sched-iso...
Delay Maximum . . . . . . . . : 21410 usecs
Delay Maximum . . . . . . . . : 36830 usecs
Delay Maximum . . . . . . . . : 4062 usecs
...benchmarks/sched-iso+compile...
Delay Maximum . . . . . . . . : 98909 usecs
Delay Maximum . . . . . . . . : 39414 usecs
Delay Maximum . . . . . . . . : 40294 usecs
Delay Maximum . . . . . . . . : 217192 usecs
Delay Maximum . . . . . . . . : 156989 usecs
Looked at this way, there really is no question. The new scheduler
prototypes are falling short significantly. Could this be due to
their lack of priority distinctions between realtime threads? Maybe.
I can't say for sure. I'll be interested to see what happens when Con
is ready for me to try his new priority-based SCHED_ISO prototype.
On a different note, the fact that 2.6 is finally performing as well
as 2.4+lowlat on this test represents significant progress. In fact,
it performed slightly better (I don't know whether that improvement is
statistically significant).
Congratulations to all who had a hand in making this happen!
--
joq
Jack O'Quin wrote:
> Looked at this way, there really is no question. The new scheduler
> prototypes are falling short significantly. Could this be due to
> their lack of priority distinctions between realtime threads? Maybe.
> I can't say for sure. I'll be interested to see what happens when Con
> is ready for me to try his new priority-based SCHED_ISO prototype.
There are two things that the SCHED_ISO you tried lacks compared to
SCHED_FIFO - as you mentioned there is no priority support, and it is
RR, not FIFO. I am not sure whether one or the other is responsible.
Both can be added to SCHED_ISO. I haven't looked at the jackd code but
it should be trivial to change SCHED_FIFO to SCHED_RR to see if RR with
priority support is enough or not. Second, the patch I sent you is fine
for testing; I was hoping you would try it. What you can't do with it is
safely spawn lots of userspace apps SCHED_ISO - it will crash, but it
will not take down your hard disk. I've had significantly better results
with that patch so far. Then we can take it from there.
Cheers,
Con
Con Kolivas <[email protected]> writes:
> There are two things that the SCHED_ISO you tried lacks compared to
> SCHED_FIFO - as you mentioned there is no priority support, and it is
> RR, not FIFO. I am not sure whether one or the other is responsible.
> Both can be added to SCHED_ISO. I haven't looked at the jackd code but
> it should be trivial to change SCHED_FIFO to SCHED_RR to see if RR
> with priority support is enough or not.
Sure, that's easy. I didn't do it because I assumed it would not
matter. Since the RR scheduling quantum is considerably longer than
the basic 1.45msec audio cycle, it should work exactly the same. I'll
cobble together a JACK version to try that for you.
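For reference, the swap really is a one-line change. A minimal sketch,
not jackd's actual code:

/* Sketch of the policy swap under discussion: only the policy
 * constant changes, priority handling stays as for SCHED_FIFO. */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

static int set_rt_policy(pthread_t thread, int priority)
{
	struct sched_param param = { .sched_priority = priority };
	/* s/SCHED_FIFO/SCHED_RR/ is the whole experiment */
	int err = pthread_setschedparam(thread, SCHED_RR, &param);

	if (err)
		fprintf(stderr, "pthread_setschedparam: %s\n", strerror(err));
	return err;
}

int main(void)
{
	return set_rt_policy(pthread_self(), 10);
}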
> Second the patch I sent you is fine for testing; I was hoping you
> would try it. What you can't do with it is spawn lots of userspace
> apps safely SCHED_ISO with it - it will crash, but it not take down
> your hard disk. I've had significantly better results with that
> patch so far. Then we cn take it from there.
Sorry. I took you literally when you said it was not yet ready to
try. This would be the isoprio3 patch you posted?
Do I have to use 2.6.11-rc1-mm2, or will it work with 2.6.11-rc1?
--
joq
Jack O'Quin wrote:
> Con Kolivas <[email protected]> writes:
>
>
>>There are two things that the SCHED_ISO you tried lacks compared to
>>SCHED_FIFO - as you mentioned there is no priority support, and it is
>>RR, not FIFO. I am not sure whether one or the other is responsible.
>>Both can be added to SCHED_ISO. I haven't looked at the jackd code but
>>it should be trivial to change SCHED_FIFO to SCHED_RR to see if RR
>>with priority support is enough or not.
>
>
> Sure, that's easy. I didn't do it because I assumed it would not
> matter. Since the RR scheduling quantum is considerably longer than
> the basic 1.45msec audio cycle, it should work exactly the same. I'll
> cobble together a JACK version to try that for you.
If you already know the audio cycle is much less than 10ms then there
isn't much point trying it.
>>Second, the patch I sent you is fine for testing; I was hoping you
>>would try it. What you can't do with it is safely spawn lots of
>>userspace apps SCHED_ISO - it will crash, but it will not take down
>>your hard disk. I've had significantly better results with that
>>patch so far. Then we can take it from there.
>
>
> Sorry. I took you literally when you said it was not yet ready to
> try. This would be the isoprio3 patch you posted?
Yes it is.
> Do I have to use 2.6.11-rc1-mm2, or will it work with 2.6.11-rc1?
It was for mm2, but should patch on an iso2 patched kernel.
Thanks!
Con
Con Kolivas <[email protected]> writes:
>>>Second, the patch I sent you is fine for testing; I was hoping you
>>>would try it. What you can't do with it is safely spawn lots of
>>>userspace apps SCHED_ISO - it will crash, but it will not take down
>>>your hard disk. I've had significantly better results with that
>>>patch so far. Then we can take it from there.
> It was for mm2, but should patch on an iso2 patched kernel.
It does apply to 2.6.11-rc1 (on top of your previous patch) with just
some minor chunk offsets.
The results are excellent...
http://www.joq.us/jack/benchmarks/sched-isoprio
http://www.joq.us/jack/benchmarks/sched-isoprio+compile
I updated the summary information to include them...
http://www.joq.us/jack/benchmarks/cycle_max.log
http://www.joq.us/jack/benchmarks/delay_max.log
http://www.joq.us/jack/benchmarks/xrun_count.log
These results are indistinguishable from SCHED_FIFO...
...benchmarks/sched-fifo...
Delay Maximum . . . . . . . . : 347 usecs
Delay Maximum . . . . . . . . : 277 usecs
Delay Maximum . . . . . . . . : 246 usecs
...benchmarks/sched-fifo+compile...
Delay Maximum . . . . . . . . : 285 usecs
Delay Maximum . . . . . . . . : 269 usecs
Delay Maximum . . . . . . . . : 277 usecs
Delay Maximum . . . . . . . . : 569 usecs
Delay Maximum . . . . . . . . : 461 usecs
...benchmarks/sched-isoprio...
Delay Maximum . . . . . . . . : 199 usecs
Delay Maximum . . . . . . . . : 261 usecs
Delay Maximum . . . . . . . . : 305 usecs
...benchmarks/sched-isoprio+compile...
Delay Maximum . . . . . . . . : 405 usecs
Delay Maximum . . . . . . . . : 286 usecs
Delay Maximum . . . . . . . . : 579 usecs
...benchmarks/sched-iso...
Delay Maximum . . . . . . . . : 21410 usecs
Delay Maximum . . . . . . . . : 36830 usecs
Delay Maximum . . . . . . . . : 4062 usecs
...benchmarks/sched-iso+compile...
Delay Maximum . . . . . . . . : 98909 usecs
Delay Maximum . . . . . . . . : 39414 usecs
Delay Maximum . . . . . . . . : 40294 usecs
Delay Maximum . . . . . . . . : 217192 usecs
Delay Maximum . . . . . . . . : 156989 usecs
So, thread priorities clearly do matter, even in this relatively
simple test. It's amazing how much is going on there, when you look
at it closely.
Is there any chance of these patches working with Ingo's latest RP
patchset? I just downloaded realtime-preempt-2.6.11-rc2-V0.7.36-02,
but haven't built it yet.
--
joq
Jack O'Quin <[email protected]> writes:
> These results are indistinguishable from SCHED_FIFO...
Disregard my previous message, it was an idiotic mistake. The results
were indistinguishable from SCHED_FIFO because they *were* SCHED_FIFO.
I'm running everything again, this time with the correct scheduling
parameters.
Will post the correct numbers shortly. Sorry for the screw-up.
--
joq
Jack O'Quin <[email protected]> writes:
> Will post the correct numbers shortly. Sorry for the screw-up.
Here they are...
http://www.joq.us/jack/benchmarks/sched-isoprio
http://www.joq.us/jack/benchmarks/sched-isoprio+compile
I moved the previous runs to the sched-fifo* directories where they
belong. For convenience, I moved all the summary data logs here...
http://www.joq.us/jack/benchmarks/.SUMMARY
Unfortunately, these corrected tests do not compare favorably with the
earlier sched-fifo runs (mistaken or otherwise). I wanted to believe
the problem was just a matter of priorities, but evidently it is not.
In fact, for this test the priority scheduler does not really help at
all (though I believe it will for other things). The max delay
numbers are still all over the place...
...benchmarks/sched-iso...
Delay Maximum . . . . . . . . : 21410 usecs
Delay Maximum . . . . . . . . : 36830 usecs
Delay Maximum . . . . . . . . : 4062 usecs
...benchmarks/sched-iso+compile...
Delay Maximum . . . . . . . . : 98909 usecs
Delay Maximum . . . . . . . . : 39414 usecs
Delay Maximum . . . . . . . . : 40294 usecs
Delay Maximum . . . . . . . . : 217192 usecs
Delay Maximum . . . . . . . . : 156989 usecs
...benchmarks/sched-isoprio...
Delay Maximum . . . . . . . . : 37071 usecs
Delay Maximum . . . . . . . . : 98193 usecs
Delay Maximum . . . . . . . . : 36935 usecs
...benchmarks/sched-isoprio+compile...
Delay Maximum . . . . . . . . : 59662 usecs
Delay Maximum . . . . . . . . : 151624 usecs
Delay Maximum . . . . . . . . : 39250 usecs
I'll try building a SCHED_RR version of JACK. I still don't think it
will make any difference. But my intuition isn't working very well
right now, so I need more data.
I still wonder if some coding error might occasionally be letting a
lower priority process continue running after an interrupt when it
ought to be preempted.
--
joq
Jack O'Quin wrote:
> I'll try building a SCHED_RR version of JACK. I still don't think it
> will make any difference. But my intuition isn't working very well
> right now, so I need more data.
Could be that, despite appearances, FIFO behaviour may be preferable
to RR. Also the RR in SCHED_ISO is pretty fast at 10ms. However with
nothing else really running it just shouldn't matter...
> I still wonder if some coding error might occasionally be letting a
> lower priority process continue running after an interrupt when it
> ought to be preempted.
That's one distinct possibility. Preempt code is a particular problem.
You are not running into the cpu limits of SCHED_ISO so something else
must be responsible. If we are higher priority than everything else and
do not expire in any way there is no reason we shouldn't perform as well
as SCHED_FIFO.
There is some sort of privileged memory handling when jackd is running
as root as well, so I don't know how that figures here. I can't imagine
it's a real issue though.
Con
Ingo Molnar <[email protected]> writes:
> just finished a short testrun with nice--20 compared to SCHED_FIFO, on a
> relatively slow 466 MHz box:
Has anyone done this kind of realtime testing on an SMP system? I'd
love to know how they compare. Unfortunately, I don't have access to
one at the moment. Are they generally better or worse for this kind
of work? I'm not asking about partitioning or processor affinity, but
actually using the entire SMP complex as a realtime machine.
Our current jack_test scripts wouldn't exercise a multiprocessor very
well. But, even those results would be interesting to know. Then, I
think we could modify them to start multiple JACK servers. That will
probably require using the dummy backend driver, which would need a
more accurate timer source than its current usleep() call to provide
reliable low latency results. (We currently drive the audio cycle
from ALSA driver interrupts, but each JACK server requires a dedicated
sound card for that.)
--
joq
Con Kolivas <[email protected]> writes:
> Jack O'Quin wrote:
>> I'll try building a SCHED_RR version of JACK. I still don't think it
>> will make any difference. But my intuition isn't working very well
>> right now, so I need more data.
>
> Could be that, despite appearances, FIFO behaviour may be preferable
> to RR. Also the RR in SCHED_ISO is pretty fast at 10ms. However with
> nothing else really running it just shouldn't matter...
That's the way I see it, too.
> There is some sort of privileged memory handling when jackd is running
> as root as well, so I don't know how that features here. I can't
> imagine it's a real issue though.
We use mlockall() to avoid page faults in the audio path. That should
be happening in all these tests. The JACK server would complain if
the request were failing, and it doesn't.
How can I verify that the pages are actually locked? (Even without
mlock(), I don't think I would run out of memory.)
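One quick check, as a sketch: the kernel reports a process's locked
memory in the VmLck field of /proc/self/status, so a test program can
read that back right after calling mlockall():

/* Verify mlockall() by reading the VmLck line (locked memory in kB)
 * from /proc/self/status. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	char line[128];
	FILE *f;

	if (mlockall(MCL_CURRENT | MCL_FUTURE) == -1)
		perror("mlockall");
	f = fopen("/proc/self/status", "r");
	if (!f) {
		perror("fopen");
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "VmLck:", 6))
			fputs(line, stdout);	/* e.g. "VmLck:  2040 kB" */
	fclose(f);
	return 0;
}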
--
joq
* Jack O'Quin <[email protected]> wrote:
> First, only SCHED_FIFO worked reliably in my tests. In Con's tests
> even that did not work. My system is probably better tuned for low
> latency than his. Until we can determine why there were so many
> xruns, it is premature to declare victory for either scheduler.
> Preferably, we should compare them on a well-tuned low-latency
> system running your Realtime Preemption kernel.
i didnt declare victory - the full range of latency fixes is in the -RT
tree. Merging of relevant bits is an ongoing process - in 2.6.10 you've
already seen some early results, but it's by no means complete. Nor did
i declare that nice--20 was suitable for audio priorities.
> Second, the nice(-20) scheduler provides no clear way to support
> multiple realtime priorities. [...]
why? You could use e.g. nice -20, -19 and -18. (see the patch below that
implements this.)
> Third, your prototype denies SCHED_FIFO to privileged threads. This
> is a serious problem, even for testing (though perhaps easy to fix).
this is not a prototype, it's an 'API hack'. The real solution would
have none of these limitations of course. Just think of this patch as an
'easy way to use nice--20 without any jackd changes' - any API
limitation you sense is a fault of this hack, not a fault of the
concept.
Find below an updated version of the 'API hack', which, instead of
auto-mapping all RT priorities, extends sched_setscheduler() to allow
nonzero sched_priority values for SCHED_OTHER, which are interpreted as
nice values. E.g. to set a thread to nice--20, do this:
struct sched_param param = { sched_priority: -19 };
sched_setscheduler(pid, SCHED_OTHER, &param);
(obviously this is not complete because no permission checking is done,
but this could be combined with an rlimits solution to achieve safety.)
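Spelled out as a self-contained test program - a sketch which only
works on a kernel carrying this hack, since a stock kernel rejects a
nonzero sched_priority for SCHED_OTHER:

/* Requires the patched sched_setscheduler(); the -19 value follows
 * the example above. */
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	struct sched_param param = { .sched_priority = -19 };

	if (sched_setscheduler(getpid(), SCHED_OTHER, &param) == -1) {
		perror("sched_setscheduler");
		return 1;
	}
	pause();	/* keep running so the nice level shows up in top(1) */
	return 0;
}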
> Most important, let's not forget that this long discussion started
> because ordinary users need access to realtime scheduling. Con's
> scheduler provides a solution for that problem. Your prototype does
> not.
sorry, but that is not how the discussion started. The discussion
started about an API hack, the RT-LSM way to give ordinary users the
unfettered access to RT scheduling.
Then, after this approach was vetoed (rightfully IMO, because it has a
number of disadvantages), did the real discussion start: "how do we give
low latencies to audio applications (and other, soft-RT alike
applications), while not allowing them to lock up the system."
I happened to start that angle - until that point everyone was focused
on the wrong premise of 'how do we give RT privileges to ordinary
users'. We _dont_ give raw RT scheduling to ordinary users, period. The
discussion is still about how to give (audio) applications low
latencies, for which there are a number of solutions:
- SCHED_ISO is a possibility, and has nonzero costs to the scheduler.
- CKRM is another possibility, and has nonzero costs as well, but solves
a wider range of problems.
- negative nice levels are the easiest shortterm solution and have zero
cost. They have some disadvantages though.
I'm not 'against' SCHED_ISO at all:
http://lkml.org/lkml/2004/11/2/114
> Being less entangled with SCHED_NORMAL makes me worry less about
> someone coming along later and messing it up while working on some
> unrelated problem. [...]
i think the real situation is somewhat the opposite: we much more often
broke RT scheduling than SCHED_NORMAL scheduling. RT scheduling is
rarely used, while SCHED_NORMAL (and negative/positive nice levels) are
used much more often than e.g. SCHED_FIFO or SCHED_RR.
> [...] Right now for example, mounting an encrypted filesystem starts a
> `loop0' kernel thread at nice -20.
this is not really a problem - there are other kernel subsystems that
start RT-priority kernel threads. We could easily move such threads to
the common nice -10 priority level or so.
Ingo
--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -2245,10 +2245,10 @@ EXPORT_PER_CPU_SYMBOL(kstat);
* if a better static_prio task has expired:
*/
#define EXPIRED_STARVING(rq) \
- ((STARVATION_LIMIT && ((rq)->expired_timestamp && \
+ ((task_nice(current) >= -15) && ((STARVATION_LIMIT && ((rq)->expired_timestamp && \
(jiffies - (rq)->expired_timestamp >= \
STARVATION_LIMIT * ((rq)->nr_running) + 1))) || \
- ((rq)->curr->static_prio > (rq)->best_expired_prio))
+ ((rq)->curr->static_prio > (rq)->best_expired_prio)))
/*
* Do the virtual cpu time signal calculations.
@@ -3328,12 +3328,16 @@ static inline task_t *find_process_by_pi
static void __setscheduler(struct task_struct *p, int policy, int prio)
{
BUG_ON(p->array);
+ if (policy == SCHED_NORMAL) {
+ p->policy = SCHED_NORMAL;
+ p->rt_priority = 0;
+ p->static_prio = NICE_TO_PRIO(prio);
+ p->prio = p->static_prio;
+ return;
+ }
p->policy = policy;
p->rt_priority = prio;
- if (policy != SCHED_NORMAL)
- p->prio = MAX_USER_RT_PRIO-1 - p->rt_priority;
- else
- p->prio = p->static_prio;
+ p->prio = MAX_USER_RT_PRIO-1 - p->rt_priority;
}
/**
@@ -3361,12 +3365,17 @@ recheck:
/*
* Valid priorities for SCHED_FIFO and SCHED_RR are
* 1..MAX_USER_RT_PRIO-1, valid priority for SCHED_NORMAL is 0.
+ *
+ * Hack: allow SCHED_OTHER with nice levels of -20 ... +19
*/
- if (param->sched_priority < 0 ||
- param->sched_priority > MAX_USER_RT_PRIO-1)
- return -EINVAL;
- if ((policy == SCHED_NORMAL) != (param->sched_priority == 0))
- return -EINVAL;
+ if (policy != SCHED_NORMAL) {
+ if (param->sched_priority < 0 ||
+ param->sched_priority > MAX_USER_RT_PRIO-1)
+ return -EINVAL;
+ } else {
+ if (param->sched_priority < -20 || param->sched_priority > 19)
+ return -EINVAL;
+ }
if ((policy == SCHED_FIFO || policy == SCHED_RR) &&
!capable(CAP_SYS_NICE))
On Mon, 24 Jan 2005 09:59:02 +0100, Ingo Molnar <[email protected]> wrote:
[...]
> - CKRM is another possibility, and has nonzero costs as well, but solves
> a wider range of problems.
BTW, do you know what's the status of CKRM ?
If I'm not wrong it is already widely used, is there any plan to push
it to mainstream ?
--
Paolo <paolo dot ciarrocchi at gmail dot com>
Paolo Ciarrocchi wrote:
> On Mon, 24 Jan 2005 09:59:02 +0100, Ingo Molnar <[email protected]> wrote:
> [...]
>
>>- CKRM is another possibility, and has nonzero costs as well, but solves
>> a wider range of problems.
>
>
> BTW, do you know what's the status of CKRM ?
> If I'm not wrong it is already widely used, is there any plan to push
> it to mainstream ?
>
The CKRM CPU scheduler is under development and it is usable, but it
seems like it has quite a long way to go before it could be considered
for merging.
* Paolo Ciarrocchi <[email protected]> wrote:
> On Mon, 24 Jan 2005 09:59:02 +0100, Ingo Molnar <[email protected]> wrote:
> [...]
> > - CKRM is another possibility, and has nonzero costs as well, but solves
> > a wider range of problems.
>
> BTW, do you know what's the status of CKRM ? If I'm not wrong it is
> already widely used, is there any plan to push it to mainstream ?
it's a bit complex and thus not a no-brainer in terms of merging. Also,
the last version of it seems to be against 2.6.8.1. CKRM-cpu is in
essence an additional layer ontop of normal scheduling. Another patch in
this area is fairsched.
Ingo
* Ingo Molnar <[email protected]> wrote:
> [...] "how do we give low latencies to audio applications (and other,
> soft-RT alike applications), while not allowing them to lock up the
> system."
ok, here is another approach, against 2.6.10/11-ish kernels:
http://redhat.com/~mingo/rt-limit-patches/
this patch adds the /proc/sys/kernel/rt_cpu_limit tunable: the maximum
amount of CPU time all RT tasks combined may use, in percent. Defaults
to 80%.
just apply the patch to 2.6.11-rc2 and you should be able to run e.g.
"jackd -R" as an unprivileged user.
note that this allows the use of SCHED_FIFO/SCHED_RR policies, without
the need to add any new scheduling classes. The RT CPU-limit acts on the
existing RT-scheduling classes, by adding a pretty simple and
straightforward method of tracking their CPU usage, and limiting them if
they exceed the threshold. As long as the threshold is not violated the
scheduling/latency properties of those scheduling classes remain.
It would be very interesting to see how jackd/jack_test performs with
this patch applied, and rt_cpu_limit is set to different percentages,
compared against unpatched SCHED_FIFO performance.
other properties of rt_cpu_limit:
- if there's idle time in the system then RT tasks will be
allowed to use more than the limit. Once SCHED_OTHER tasks
are present again, the limit is enforced.
- if an RT task goes above the limit all the time then there
is no guarantee that exactly the limit will be allowed for
it. (i.e. you should set the limit to somewhat above the real
needs of the RT task in question.)
- zero rt_cpu_limit value means unlimited CPU time to all
RT tasks.
- a nonzero rt_cpu_limit value also has the effect of allowing
the use of RT scheduling classes/priorities for nonprivileged
users. I.e. a value of 100% differs from a value of 0 in that 0
doesnt allow RT priorities for ordinary users.
- on SMP the limit is measured and enforced per-CPU.
- runtime overhead is minimal, especially if the limit is set to 0.
- the CPU-use measurement code has a 'memory' of roughly 300 msecs.
I.e. if an RT task runs 100 msecs nonstop then it will increase
its CPU use by about 30%. This should be fast enough for the limit
to be humanly imperceptible to users, but slow enough to allow
occasional longer timeslices to RT tasks. (see the toy model below)
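The 300 msec 'memory' is the behaviour of an exponentially decaying
average. A userspace toy model of the bookkeeping described above - a
sketch assuming a 300 ms time constant, not the patch's actual code:

/* build: cc -o rtavg rtavg.c -lm */
#include <math.h>
#include <stdio.h>

#define TAU_MS 300.0	/* assumed 'memory' of the average */

/* fold 'ran_ms' milliseconds at 'load' percent into the average */
static double rt_avg_update(double avg, double load, double ran_ms)
{
	double decay = exp(-ran_ms / TAU_MS);
	return load + (avg - load) * decay;
}

int main(void)
{
	/* an RT task hogs the CPU (100% load) for 100 msecs straight */
	double avg = rt_avg_update(0.0, 100.0, 100.0);

	printf("RT CPU use after 100 msecs nonstop: %.1f%%\n", avg);
	return 0;
}

Starting from zero, 100 msecs of nonstop running lands the average near
28%, which matches the 'about 30%' figure above.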
have fun,
Ingo
> ok, here is another approach, against 2.6.10/11-ish kernels:
>
> http://redhat.com/~mingo/rt-limit-patches/
>
> this patch adds the /proc/sys/kernel/rt_cpu_limit tunable: the maximum
> amount of CPU time all RT tasks combined may use, in percent. Defaults
> to 80%.
i've just updated the -B4 patch - the earlier (-B3) patch had
a last-minute bug that made the kernel enforce limit/10 - i.e.
8% instead of 80% ...
Ingo
Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
>
>
>>[...] "how do we give low latencies to audio applications (and other,
>>soft-RT alike applications), while not allowing them to lock up the
>>system."
>
>
> ok, here is another approach, against 2.6.10/11-ish kernels:
>
> http://redhat.com/~mingo/rt-limit-patches/
>
> this patch adds the /proc/sys/kernel/rt_cpu_limit tunable: the maximum
> amount of CPU time all RT tasks combined may use, in percent. Defaults
> to 80%.
>
> just apply the patch to 2.6.11-rc2 and you should be able to run e.g.
> "jackd -R" as an unprivileged user.
>
> note that this allows the use of SCHED_FIFO/SCHED_RR policies, without
> the need to add any new scheduling classes. The RT CPU-limit acts on the
> existing RT-scheduling classes, by adding a pretty simple and
> straightforward method of tracking their CPU usage, and limiting them if
> they exceed the threshold. As long as the threshold is not violated the
> scheduling/latency properties of those scheduling classes remain.
>
> It would be very interesting to see how jackd/jack_test performs with
> this patch applied, and rt_cpu_limit is set to different percentages,
> compared against unpatched SCHED_FIFO performance.
Indeed it would be interesting because assuming there are no bugs in my
SCHED_ISO implementation (which is unlikely) it should perform the same.
There are a number of features that it would be nice to have addressed
if we take this route.
Superusers are unable to set anything to a higher priority than
unprivileged users can. Any restrictions placed on SCHED_RR/FIFO for
unprivileged users affect superuser tasks as well. The default setting
breaks the definition of these policies, yet changing the setting to
100 gives everyone full rt access.
i.e. it would be nice to have a distinction between the default cpu
limits and priority levels available to unprivileged users vs
superusers, with superusers' default settings remaining the same as
current SCHED_RR/FIFO behaviour.
Cheers,
Con
Con Kolivas wrote:
> Ingo Molnar wrote:
>
>> * Ingo Molnar <[email protected]> wrote:
>>
>>
>>> [...] "how do we give low latencies to audio applications (and other,
>>> soft-RT alike applications), while not allowing them to lock up the
>>> system."
>>
>>
>>
>> ok, here is another approach, against 2.6.10/11-ish kernels:
>>
>> http://redhat.com/~mingo/rt-limit-patches/
>>
>> this patch adds the /proc/sys/kernel/rt_cpu_limit tunable: the maximum
>> amount of CPU time all RT tasks combined may use, in percent. Defaults
>> to 80%.
>>
>> just apply the patch to 2.6.11-rc2 and you should be able to run e.g.
>> "jackd -R" as an unprivileged user.
>>
>> note that this allows the use of SCHED_FIFO/SCHED_RR policies, without
>> the need to add any new scheduling classes. The RT CPU-limit acts on the
>> existing RT-scheduling classes, by adding a pretty simple and
>> straightforward method of tracking their CPU usage, and limiting them if
>> they exceed the threshold. As long as the threshold is not violated the
>> scheduling/latency properties of those scheduling classes remain.
>>
>> It would be very interesting to see how jackd/jack_test performs with
>> this patch applied, and rt_cpu_limit is set to different percentages,
>> compared against unpatched SCHED_FIFO performance.
>
>
> Indeed it would be interesting because assuming there are no bugs in my
> SCHED_ISO implementation (which is unlikely) it should perform the same.
>
> There are a number of features that it would be nice to have addressed
> if we take this route.
>
> Superusers are unable to set anything to a higher priority than
> unprivileged users can. Any restrictions placed on SCHED_RR/FIFO for
> unprivileged users affect superuser tasks as well. The default setting
> breaks the definition of these policies, yet changing the setting to
> 100 gives everyone full rt access.
>
> i.e. it would be nice to have a distinction between the default cpu
> limits and priority levels available to unprivileged users vs
> superusers, with superusers' default settings remaining the same as
> current SCHED_RR/FIFO behaviour.
I guess it would be a simple matter of throwing on another 100 rt
priorities that can only be set by CAP_SYS_NICE, and limiting
selectively based on the rt_priority.
Cheers,
Con
* Jack O'Quin <[email protected]> wrote:
> Has anyone done this kind of realtime testing on an SMP system? I'd
> love to know how they compare. Unfortunately, I don't have access to
> one at the moment. Are they generally better or worse for this kind
> of work? I'm not asking about partitioning or processor affinity, but
> actually using the entire SMP complex as a realtime machine.
this particular test was done with a UP kernel. (although the
test system indeed is an SMP box.) Generally i mention SMP explicitly.
There are a couple of patches in the -RT tree to make RT scheduling
better on SMP systems.
with those fixes applied (i.e. under the -RT kernel), jackd/jack_test
behaves better than on a single CPU. I didnt find any scheduling anomaly
that wasnt caused by the kernel.
Ingo
Jack O'Quin wrote:
> I still wonder if some coding error might occasionally be letting a
> lower priority process continue running after an interrupt when it
> ought to be preempted.
Well not surprisingly I did find a bug in my patch which did not honour
priority support between ISO threads. So basically the patch only does
as much as the original ISO patch. I'll slap something together for more
testing with this fixed, and ISO_FIFO support too.
Cheers,
Con
-cc list trimmed to those who have recently responded.
Here is a patch to go on top of 2.6.11-rc2-mm1 that fixes some bugs in
the general SCHED_ISO code, fixes the priority support between ISO
threads, and implements SCHED_ISO_RR and SCHED_ISO_FIFO as separate
policies. Note the bugfixes and cleanups mean the codepaths in this are
leaner than the original ISO2 implementation despite the extra features.
This works safely and effectively on UP (but not tested on SMP yet) so
Jack if/when you get a chance I'd love to see more benchmarks from you
on this one. It seems on my machine the earlier ISO2 implementation
without priority nor FIFO was enough for good results, but not on yours,
which makes your testcase a more discriminating one.
Cheers,
Con
Con Kolivas wrote:
> -cc list trimmed to those who have recently responded.
>
>
> Here is a patch to go on top of 2.6.11-rc2-mm1 that fixes some bugs in
> the general SCHED_ISO code, fixes the priority support between ISO
> threads, and implements SCHED_ISO_RR and SCHED_ISO_FIFO as separate
> policies. Note the bugfixes and cleanups mean the codepaths in this are
> leaner than the original ISO2 implementation despite the extra features.
>
> This works safely and effectively on UP (but not tested on SMP yet) so
> Jack if/when you get a chance I'd love to see more benchmarks from you
> on this one. It seems on my machine the earlier ISO2 implementation
> without priority nor FIFO was enough for good results, but not on yours,
> which makes your testcase a more discriminating one.
Sorry, I see yet another flaw in the design and SMP is broken so hold
off testing for a bit.
Cheers,
Con
Ingo Molnar <[email protected]> writes:
> * Jack O'Quin <[email protected]> wrote:
>
>> First, only SCHED_FIFO worked reliably in my tests. In Con's tests
>> even that did not work. My system is probably better tuned for low
>> latency than his. Until we can determine why there were so many
>> xruns, it is premature to declare victory for either scheduler.
>> Preferably, we should compare them on a well-tuned low-latency
>> system running your Realtime Preemption kernel.
>
> i didnt declare victory - the full range of latency fixes is in the -RT
> tree. Merging of relevant bits is an ongoing process - in 2.6.10 you've
> already seen some early results, but it's by no means complete.
I didn't mean to insult you, Ingo.
I have nothing but praise for what you've accomplished with 2.6.10.
My tests yesterday demonstrated slightly better SCHED_FIFO performance
with 2.6.10 than 2.4.19 with Andrew's low-latency patches. For a
mainstream kernel that is a huge accomplishment, never before
achieved. We should celebrate.
I was just pointing out that saying nice(-20) works as well as
SCHED_ISO, though true, doesn't mean much since neither of them
(currently) work well enough to be useful.
>> Second, the nice(-20) scheduler provides no clear way to support
>> multiple realtime priorities. [...]
>
> why? You could use e.g. nice -20, -19 and -18. (see the patch below that
> implements this.)
Which of the POSIX 1-99 range do you map into those three
priorities? (I can't figure it out from the patch.)
How does one go about deciding which priority differences "matter" and
which do not? Why not honor the realtime programmer's choice of
priorities?
For good reasons, most audio developers prefer the POSIX realtime
interfaces. They are far from perfect, but remain the only workable,
portable solution available. That is why I like your rt_cpu_limit
proposal so much better than this one.
--
joq
Jack O'Quin wrote:
> Ingo Molnar <[email protected]> writes:
>
>
>>* Ingo Molnar <[email protected]> wrote:
>>
>>this patch adds the /proc/sys/kernel/rt_cpu_limit tunable: the maximum
>>amount of CPU time all RT tasks combined may use, in percent. Defaults
>>to 80%.
>>
>>just apply the patch to 2.6.11-rc2 and you should be able to run e.g.
>>"jackd -R" as an unprivileged user.
>
>
> This is a far better idea from an API perspective. We can continue
> writing to the POSIX realtime standard interfaces. Yet users can
> actually take advantage of them. I like it.
>
This still doesn't solve your privilege problem though. If I can't
renice something as a regular user, it makes no sense to allow such
realtime behaviour.
I still think the ulimit patches aren't a bad idea to solve your
privilege problem. At that point, is there still a need for
rt_cpu_limit?
Nick
* Jack O'Quin <[email protected]> wrote:
> > It would be very interesting to see how jackd/jack_test performs with
> > this patch applied, and rt_cpu_limit is set to different percentages,
> > compared against unpatched SCHED_FIFO performance.
>
> It works great...
>
> http://www.joq.us/jack/benchmarks/rt_cpu_limit
> http://www.joq.us/jack/benchmarks/rt_cpu_limit+compile
> http://www.joq.us/jack/benchmarks/.SUMMARY
>
> I'll experiment with it some more, but this seems to meet all my
> needs. As one would expect, the results are indistinguishable from
> SCHED_FIFO...
>
> # rt_cpu_limit
> Delay Maximum . . . . . . . . : 290 usecs
> Delay Maximum . . . . . . . . : 443 usecs
> Delay Maximum . . . . . . . . : 232 usecs
>
> # rt_cpu_limit+compile
> Delay Maximum . . . . . . . . : 378 usecs
> Delay Maximum . . . . . . . . : 206 usecs
> Delay Maximum . . . . . . . . : 528 usecs
very good. Could you try another thing, and set the rt_cpu_limit to less
than the CPU utilization 'top' reports during the test (or to less than
the DSP CPU utilization in the stats), to deliberately trigger the
limiting code? This both tests the limit and shows the effects it has.
(there should be xruns and a large Delay Maximum.)
Ingo
There were numerous bugs in the SCHED_ISO design prior to now, so it
really was not performing as expected. What is most interesting is that
the DSP load goes to much higher levels now if xruns are avoided and
stays at those high levels. If I push the cpu load too much so that they
get transiently throttled from SCHED_ISO, after the xrun the dsp load
drops to half. Is this expected behaviour?
Anyway the next patch works well in my environment. Jack, while I
realise you're getting the results you want from Ingo's dropped-privilege,
cpu-limit patch, I would appreciate you testing this patch. It is not
clear yet what direction we will take, but even if we don't go this way,
it would be nice to see it tested given the effort on my part.
This version of the patch has full priority support and both ISO_RR and
ISO_FIFO.
This is the patch to apply to 2.6.11-rc2-mm1:
http://ck.kolivas.org/patches/SCHED_ISO/2.6.11-rc2-mm1/2.6.11-rc2-mm1-iso-prio-fifo.diff
Cheers,
Con
pretty much the only criticism of the RT-CPU patch was that the global
sysctl is too rigid and that it doesnt allow privileged tasks to ignore
the limit. I've uploaded a new RT-CPU-limit patch that solves this
problem:
http://redhat.com/~mingo/rt-limit-patches/
i've removed the global sysctl and implemented a new rlimit,
RT_CPU_RATIO: the maximum amount of CPU time RT tasks may use, in
percent. For testing purposes it defaults to 80%.
the RT-limit being an rlimit makes it much more configurable: root tasks
can have unlimited CPU time limit, while users could have a more
conservative setting of say 30%. This also makes it per-process and
runtime configurable as well. The scheduler will instantly act upon any
new RT_CPU_RATIO rlimit.
(this approach is fundamentally different from the previous patch that
made the "maximum RT-priority available to an unprivileged task" value
an rlimit - with priorities being an rlimit we still havent made RT
priorities safe against deadlocks.)
multiple tasks can have different rlimits as well, and the scheduler
interprets it the following way: it maintains a per-CPU "RT CPU use"
load-average value and compares it against the per-task rlimit. If e.g.
the task says "i'm in the 60% range" and the current average is 70%,
then the scheduler delays this RT task - if the next task has an 80%
rlimit then it will be allowed to run. This logic is straightforward and
can be used as a further control mechanism against runaway highprio RT
tasks.
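A toy model of that comparator - purely illustrative, not the patch's
code:

/* Each task carries its own percentage rlimit; a task is delayed when
 * the per-CPU RT load average has reached its limit. */
#include <stdio.h>

/* returns 1 if the task may run now, 0 if the scheduler delays it */
static int rt_task_may_run(int task_limit_pct, int cpu_rt_avg_pct)
{
	return cpu_rt_avg_pct < task_limit_pct;
}

int main(void)
{
	int avg = 70;	/* current per-CPU 'RT CPU use' average */

	/* the 60%-limit task is delayed, the 80%-limit task runs */
	printf("60%% task runs: %d\n", rt_task_may_run(60, avg));
	printf("80%% task runs: %d\n", rt_task_may_run(80, avg));
	return 0;
}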
other properties of the RT_CPU_RATIO rlimit:
- if there's idle time in the system then RT tasks will be
allowed to use more than the limit.
- if an RT task goes above the limit all the time then there
is no guarantee that exactly the limit will be allowed for
it. (i.e. you should set the limit to somewhat above the real
needs of the RT task in question.)
- a zero RLIMIT_RT_CPU_RATIO value means unlimited CPU time to that
RT task. If the task is not an RT task then it may not change to RT
priority. (i.e. a zero value makes it fully compatible with previous
RT scheduling semantics.)
- a nonzero rt_cpu_limit value also has the effect of allowing
the use of RT priorities to nonprivileged users.
- on SMP the limit is measured and enforced per-CPU.
- runtime overhead is minimal, especially if the limit is set to 0.
- the CPU-use measurement code has a 'memory' of roughly 300 msecs.
I.e. if an RT task runs 100 msecs nonstop then it will increase
its CPU use by about 30%. This should be fast enough for the limit
to be humanly imperceptible to users, but slow enough to allow
occasional longer timeslices to RT tasks.
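Since the scheduler acts on a new value immediately, a task could adjust
its own limit at runtime. A sketch, assuming the patch's
RLIMIT_RT_CPU_RATIO constant is visible in the userspace headers (the
30% value is just an example):

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
#ifdef RLIMIT_RT_CPU_RATIO
	struct rlimit rlim = { .rlim_cur = 30, .rlim_max = 30 };

	/* the scheduler picks the new value up instantly */
	if (setrlimit(RLIMIT_RT_CPU_RATIO, &rlim) == -1)
		perror("setrlimit");
#else
	fprintf(stderr, "headers lack RLIMIT_RT_CPU_RATIO (patch not applied)\n");
#endif
	return 0;
}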
Ingo
Ingo Molnar wrote:
> pretty much the only criticism of the RT-CPU patch was that the global
> sysctl is too rigid and that it doesnt allow privileged tasks to ignore
> the limit. I've uploaded a new RT-CPU-limit patch that solves this
> problem:
>
> http://redhat.com/~mingo/rt-limit-patches/
>
> i've removed the global sysctl and implemented a new rlimit,
> RT_CPU_RATIO: the maximum amount of CPU time RT tasks may use, in
> percent. For testing purposes it defaults to 80%.
>
> the RT-limit being an rlimit makes it much more configurable: root tasks
> can have unlimited CPU time limit, while users could have a more
> conservative setting of say 30%. This also makes it per-process and
> runtime configurable as well. The scheduler will instantly act upon any
> new RT_CPU_RATIO rlimit.
>
> (this approach is fundamentally different from the previous patch that
> made the "maximum RT-priority available to an unprivileged task" value
> an rlimit - with priorities being an rlimit we still havent made RT
> priorities safe against deadlocks.)
>
> multiple tasks can have different rlimits as well, and the scheduler
> interprets it the following way: it maintains a per-CPU "RT CPU use"
> load-average value and compares it against the per-task rlimit. If e.g.
> the task says "i'm in the 60% range" and the current average is 70%,
> then the scheduler delays this RT task - if the next task has an 80%
> rlimit then it will be allowed to run. This logic is straightforward and
> can be used as a further control mechanism against runaway highprio RT
> tasks.
Very nice. I like the way this approach is evolving.
Cheers,
Con
* Nick Piggin <[email protected]> wrote:
> > This is a far better idea from an API perspective. We can continue
> > writing to the POSIX realtime standard interfaces. Yet users can
> > actually take advantage of them. I like it.
>
> This still doesn't solve your privilege problem though. If I can't
> renice something as a regular user, it makes no sense to allow such
> realtime behaviour.
>
> I still think the ulimit patches aren't a bad idea to solve your
> privilege problem. At that point, is there still a need for
> rt_cpu_limit?
i do believe it is not robust to give unprivileged users RT priorities,
without safeguards installed. Most normal desktops have some sort of
audio playback capability, so this problem needs a robust, API-neutral
and configurable/flexible solution.
RT-LSM and rlimit privileges are configurable, API-neutral but not
robust, while rt_cpu_limit is robust but not flexible. SCHED_ISO meets
all those needs.
there's a fourth option which is simpler than SCHED_ISO: in the previous
mail i've announced the RLIMIT_RT_CPU_RATIO feature, which should meet
all these requirements as well: the security and API ease-of-use of
rt_cpu_limit, and the maximum flexibility of rlimits. (It also has the
extra bonus of enabling the tweaking/securing of existing RT classes,
which SCHED_ISO doesnt do.)
Ingo
Con Kolivas wrote:
> There were numerous bugs in the SCHED_ISO design prior to now, so it
> really was not performing as expected. What is most interesting is that
> the DSP load goes to much higher levels now if xruns are avoided and
> stays at those high levels. If I push the cpu load too much so that they
> get transiently throttled from SCHED_ISO, after the xrun the dsp load
> drops to half. Is this expected behaviour?
>
> Anyway the next patch works well in my environment. Jack, while I
> realise you're getting the results you want from Ingo's dropped-privilege,
> cpu-limit patch, I would appreciate you testing this patch. It is not
> clear yet what direction we will take, but even if we don't go this way,
> it would be nice to see it tested given the effort on my part.
>
> This version of the patch has full priority support and both ISO_RR and
> ISO_FIFO.
>
> This is the patch to apply to 2.6.11-rc2-mm1:
> http://ck.kolivas.org/patches/SCHED_ISO/2.6.11-rc2-mm1/2.6.11-rc2-mm1-iso-prio-fifo.diff
Just for completeness, benchmarks:
logs and pretty pictures:
http://ck.kolivas.org/patches/SCHED_ISO/iso3-benchmarks/
SCHED_ISO:
Total seconds ran . . . . . . : 300
Number of clients . . . . . . : 10
Ports per client . . . . . . : 4
Frames per buffer . . . . . . : 64
Number of runs . . . . . . . :( 3)
Timeout Count . . . . . . . . :( 0)
XRUN Count . . . . . . . . . : 0
Delay Count (>spare time) . . : 0
Delay Count (>1000 usecs) . . : 0
Delay Maximum . . . . . . . . : 150 usecs
Cycle Maximum . . . . . . . . : 725 usecs
Average DSP Load. . . . . . . : 32.3 %
Average CPU System Load . . . : 6.0 %
Average CPU User Load . . . . : 33.6 %
Average CPU Nice Load . . . . : 0.0 %
Average CPU I/O Wait Load . . : 0.1 %
Average CPU IRQ Load . . . . : 0.1 %
Average CPU Soft-IRQ Load . . : 0.0 %
Average Interrupt Rate . . . : 1758.9 /sec
Average Context-Switch Rate . : 9208.7 /sec
and SCHED_ISO in the presence of continuous compile:
Total seconds ran . . . . . . : 300
Number of clients . . . . . . : 10
Ports per client . . . . . . : 4
Frames per buffer . . . . . . : 64
Number of runs . . . . . . . :( 3)
Timeout Count . . . . . . . . :( 0)
XRUN Count . . . . . . . . . : 0
Delay Count (>spare time) . . : 0
Delay Count (>1000 usecs) . . : 0
Delay Maximum . . . . . . . . : 375 usecs
Cycle Maximum . . . . . . . . : 726 usecs
Average DSP Load. . . . . . . : 35.8 %
Average CPU System Load . . . : 15.1 %
Average CPU User Load . . . . : 82.9 %
Average CPU Nice Load . . . . : 0.0 %
Average CPU I/O Wait Load . . : 1.8 %
Average CPU IRQ Load . . . . : 0.2 %
Average CPU Soft-IRQ Load . . : 0.0 %
Average Interrupt Rate . . . : 1772.6 /sec
Average Context-Switch Rate . : 9565.2 /sec
Cheers,
Con
* Jack O'Quin <[email protected]> wrote:
> I was just pointing out that saying nice(-20) works as well as
> SCHED_ISO, though true, doesn't mean much since neither of them
> (currently) work well enough to be useful.
ok. While i still think nice--20 can be quite good for some purposes, it
will probably not solve all problems that the audio applications need
solved.
> For good reasons, most audio developers prefer the POSIX realtime
> interfaces. They are far from perfect, but remain the only workable,
> portable solution available. That is why I like your rt_cpu_limit
> proposal so much better that this one.
ok - lets forget about nice--20 for now.
Ingo
Con Kolivas <[email protected]> writes:
> There were numerous bugs in the SCHED_ISO design prior to now, so it
> really was not performing as expected. What is most interesting is
> that the DSP load goes to much higher levels now if xruns are avoided
> and stays at those high levels. If I push the cpu load too much so that
> they get transiently throttled from SCHED_ISO, after the xrun the dsp
> load drops to half. Is this expected behaviour?
Yes.
Any xrun is pretty much guaranteed to blow the next audio cycle or
two. Several together tend to "snowball" into a "avalanche". Hitting
your CPU limit practically guaranteed that kind of realtime disaster.
If you grep your log file for 'client failure:', you'll probably find
that JACK has reacted to the deteriorating situation by shutting down
some of its clients. The number of 'client failure:' messages is
*not* the number of clients shut down, there is some repetition (not
sure why). This will give the actual number...
$ grep '^client failure:' ${LOGFILE} | cut -f4 -d' ' | sort -u | wc -l
It would help if the test script reported this value.
In extreme cases like the following example, eleven of the twenty
clients were shut down by the JACK server. You can see that clearly
in the blue line (DSP load) of the graph...
http://www.joq.us/jack/benchmarks/sched-iso+compile/jack_test3-2.6.11-rc1-exp-200501222329.log
http://www.joq.us/jack/benchmarks/sched-iso+compile/jack_test3-2.6.11-rc1-exp-200501222329.png
> Anyway the next patch works well in my environment. Jack, while I
> realise you're getting the results you want from Ingo's dropped-privilege,
> cpu-limit patch, I would appreciate you testing this patch. It is not
> clear yet what direction we will take, but even if we don't go this way,
> it would be nice to see it tested given the effort on my part.
Will do. I appreciate your efforts, and want to see them reach a
working point of closure.
Though I'm somewhat swamped today, I'll run it as soon as I can.
> This version of the patch has full priority support and both ISO_RR
> and ISO_FIFO.
>
> This is the patch to apply to 2.6.11-rc2-mm1:
> http://ck.kolivas.org/patches/SCHED_ISO/2.6.11-rc2-mm1/2.6.11-rc2-mm1-iso-prio-fifo.diff
--
joq
Ingo Molnar <[email protected]> writes:
> * Jack O'Quin <[email protected]> wrote:
>
>> It works great...
>>
>> http://www.joq.us/jack/benchmarks/rt_cpu_limit
>> http://www.joq.us/jack/benchmarks/rt_cpu_limit+compile
>> http://www.joq.us/jack/benchmarks/.SUMMARY
>>
>> I'll experiment with it some more, but this seems to meet all my
>> needs. As one would expect, the results are indistinguishable from
>> SCHED_FIFO...
> very good. Could you try another thing, and set the rt_cpu_limit to less
> than the CPU utilization 'top' reports during the test (or to less than
> the DSP CPU utilization in the stats), to deliberately trigger the
> limiting code? This both tests the limit and shows the effects it has.
> (there should be xruns and a large Delay Maximum.)
Here are some runs with rt_cpu_limit set to 30%. (No point in trying
to compile in the background.)
http://www.joq.us/jack/benchmarks/rt_cpu_limit.30
As expected, the Delay Max and XRUN Count do go to hell...
# rt_cpu_limit
Delay Maximum . . . . . . . . : 290 usecs
Delay Maximum . . . . . . . . : 443 usecs
Delay Maximum . . . . . . . . : 232 usecs
# rt_cpu_limit.30
Delay Maximum . . . . . . . . : 60294 usecs
Delay Maximum . . . . . . . . : 77742 usecs
Delay Maximum . . . . . . . . : 589697 usecs
# rt_cpu_limit
XRUN Count . . . . . . . . . : 0
XRUN Count . . . . . . . . . : 0
XRUN Count . . . . . . . . . : 0
# rt_cpu_limit.30
XRUN Count . . . . . . . . . : 25
XRUN Count . . . . . . . . . : 12
XRUN Count . . . . . . . . . : 15
So, the throttling obviously does work, in the sense of the system
being able to protect itself from RT CPU hogging. The failure of the
JACK subsystem is rather catastrophic, however.
Look at this graph...
http://www.joq.us/jack/benchmarks/rt_cpu_limit.30/jack_test3-2.6.11-rc2-q2-200501251346.png
At around 55 seconds into the run, JACK got in trouble and throttled
itself back to approximately the 30% limit (actually a little above).
Then, around second 240 it got in trouble again, this time collapsing
completely. I'm a bit confused by all the messages in that log, but
it appears that approximately 9 of the 20 clients were eventually shut
down by the JACK server. It looks like the second collapse around 240
also caused the scheduler to revoke RT privileges for the rest of the
run (just a guess).
This brings us to the issue of graceful degradation. IMHO, if we
follow the LKML bias for seeing questions from a kernel and not user
perspective, the results will not be very satisfactory.
JACK can probably do a better job of shutting down hyperactive
realtime clients than the kernel, because it knows more about what the
user is trying to do. Multiplying incomprehensible rlimits values does
not help much that I can see.
Sometimes musicians want to "push the envelope" using CPU-hungry
realtime effects like reverbs or Fourier Transforms. It is often hard
to predict how much of this sort of load a given system can handle.
JACK reports its subsystem-wide "DSP load" as a moving average,
allowing users to monitor it. It might be helpful if the kernel's
estimate of this number were also available somewhere (but maybe that
value has no meaning to users). Often, the easiest method is to push
things to the breaking point, and then back off a bit.
For this kind of usage scenario, the simpler `rt_cpu_limit' (B4) patch
works much better than the `rt_cpu_ratio' (D4) patch, allowing a
suitably privileged user to adjust the per-CPU limit directly while
experimenting with system limits. (Don't ask me what any of this
means in an SMP complex.)
The equivalent rlimits experiment probably requires:
(1) editing /etc/security/limits.conf
(2) shutting everything down
(3) logout
(4) login
(5) restarting the test
Eventually there will be ulimit support in the shells. Then, I
suppose this can be streamlined a little, but not much. The problem
is that there are often many audio applications running, with complex
connections between them. So, this is not a simple matter of stopping
and starting a single program (as most of you probably imagine).
--
joq
* Jack O'Quin <[email protected]> wrote:
> At around 55 seconds into the run, JACK got in trouble and throttled
> itself back to approximately the 30% limit (actually a little above).
> Then, around second 240 it got in trouble again, this time collapsing
> completely. I'm a bit confused by all the messages in that log, but
> it appears that approximately 9 of the 20 clients were evertually shut
> down by the JACK server. It looks like the second collapse around 240
> also caused the scheduler to revoke RT privileges for the rest of the
> run (just a guess).
no, the scheduler doesnt revoke RT privileges, it just 'delays' RT tasks
that violate the threshold. In other words, SCHED_OTHER tasks will have
a higher effective priority than RT tasks, up until the CPU use average
drops below the limit again.
the effect is pretty similar to starting too many Jack clients - things
degrade quickly and _all_ clients start skipping, and the whole audio
experience escalates into a big xrun mess. Not much to be done about it
i suspect. Maybe if the current 'RT load' was available via /proc then
jackd could introduce some sort of threshold above which it would reject
new clients?
> JACK can probably do a better job of shutting down hyperactive
> realtime clients than the kernel, because it knows more about what the
> user is trying to do. Multiplying incomprehensible rlimits values does
> not help much that I can see.
please debug this some more - the kernel certainly doesnt do anything
intrusive - the clients only get delayed for some time.
> Sometimes musicians want to "push the envelope" using CPU-hungry
> realtime effects like reverbs or Fourier Transforms. It is often hard
> to predict how much of this sort of load a given system can handle.
> JACK reports its subsystem-wide "DSP load" as a moving average,
> allowing users to monitor it. It might be helpful if the kernel's
> estimate of this number were also available somewhere (but maybe that
> value has no meaning to users). Often, the easiest method is to push
> things to the breaking point, and then back off a bit.
yeah, i'll add this to /proc, so that utilities can access it. Jackd
could monitor it and refuse to start new clients if the RT load is
dangerously close to the limit (say within 10% of it)?
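Such a check could be tiny. A hypothetical sketch - the /proc path
below is invented for illustration, since the export does not exist
yet:

#include <stdio.h>

/* read the current RT load in percent; -1 if unavailable */
static int rt_load_percent(void)
{
	FILE *f = fopen("/proc/sys/kernel/rt_cpu_load", "r"); /* assumed name */
	int load = -1;

	if (f) {
		if (fscanf(f, "%d", &load) != 1)
			load = -1;
		fclose(f);
	}
	return load;
}

int main(void)
{
	int load = rt_load_percent();

	/* refuse new clients within 10% of an (assumed) 80% limit */
	printf("accept new client: %s\n",
	       (load < 0 || load < 80 - 10) ? "yes" : "no");
	return 0;
}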
> The equivalent rlimits experiment probably requires:
>
> (1) editing /etc/security/limits.conf
> (2) shutting everything down
> (3) logout
> (4) login
> (5) restarting the test
well, there's setrlimit, so you could add a jackd client callback that
instructs all clients to change their RT_CPU_RATIO rlimit. In theory we
could try to add a new rlimit syscall that changes another task's rlimit
(right now the syscalls only allow the changing of the rlimit of the
current task) - that would enable utilities to change the rlimit of all
tasks in the system, achieving the equivalent of a global sysctl.
Ingo
* Ingo Molnar ([email protected]) wrote:
> well, there's setrlimit, so you could add a jackd client callback that
> instructs all clients to change their RT_CPU_RATIO rlimit. In theory we
> could try to add a new rlimit syscall that changes another task's rlimit
> (right now the syscalls only allow the changing of the rlimit of the
> current task) - that would enable utilities to change the rlimit of all
> tasks in the system, achieving the equivalent of a global sysctl.
We've talked about smth. similar in another thread. I'm not opposed to
the idea.
thanks,
-chris
--
Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net
* Chris Wright <[email protected]> wrote:
> * Ingo Molnar ([email protected]) wrote:
> > well, there's setrlimit, so you could add a jackd client callback that
> > instructs all clients to change their RT_CPU_RATIO rlimit. In theory we
> > could try to add a new rlimit syscall that changes another task's rlimit
> > (right now the syscalls only allow the changing of the rlimit of the
> > current task) - that would enable utilities to change the rlimit of all
> > tasks in the system, achieving the equivalent of a global sysctl.
>
> We've talked about smth. similar in another thread. I'm not opposed
> to the idea.
did that thread go into technical details? There are some rlimit users
that might not be prepared to see the rlimit change under them. The
RT_CPU_RATIO one ought to be safe, but generally i'm not so sure.
Ingo
* Ingo Molnar ([email protected]) wrote:
> * Chris Wright <[email protected]> wrote:
> > * Ingo Molnar ([email protected]) wrote:
> > > well, there's setrlimit, so you could add a jackd client callback that
> > > instructs all clients to change their RT_CPU_RATIO rlimit. In theory we
> > > could try to add a new rlimit syscall that changes another task's rlimit
> > > (right now the syscalls only allow the changing of the rlimit of the
> > > current task) - that would enable utilities to change the rlimit of all
> > > tasks in the system, achieving the equivalent of a global sysctl.
> >
> > We've talked about smth. similar in another thread. I'm not opposed
> > to the idea.
>
> did that thread go into technical details? There are some rlimit users
> that might not be prepared to see the rlimit change under them. The
> RT_CPU_RATIO one ought to be safe, but generally i'm not so sure.
Not really. I mentioned the above, as well as the security concern.
Right now, at least the task_setrlimit hook would have to change to take
into account the task. And I never convinced myself that async changes
would be safe for each rlimit.
thanks,
-chris
--
Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net
Ingo Molnar wrote:
> pretty much the only criticism of the RT-CPU patch was that the global
> sysctl is too rigid and that it doesnt allow privileged tasks to ignore
> the limit. I've uploaded a new RT-CPU-limit patch that solves this
> problem:
>
> http://redhat.com/~mingo/rt-limit-patches/
>
> i've removed the global sysctl and implemented a new rlimit,
> RT_CPU_RATIO: the maximum amount of CPU time RT tasks may use, in
> percent. For testing purposes it defaults to 80%.
>
> the RT-limit being an rlimit makes it much more configurable: root tasks
> can have unlimited CPU time limit, while users could have a more
> conservative setting of say 30%. This also makes it per-process and
> runtime configurable as well. The scheduler will instantly act upon any
> new RT_CPU_RATIO rlimit.
>
> (this approach is fundamentally different from the previous patch that
> made the "maximum RT-priority available to an unprivileged task" value
> an rlimit - with priorities being an rlimit we still havent made RT
> priorities safe against deadlocks.)
>
> multiple tasks can have different rlimits as well, and the scheduler
> interprets it the following way: it maintains a per-CPU "RT CPU use"
> load-average value and compares it against the per-task rlimit. If e.g.
> the task says "i'm in the 60% range" and the current average is 70%,
> then the scheduler delays this RT task - if the next task has an 80%
> rlimit then it will be allowed to run. This logic is straightforward and
> can be used as a further control mechanism against runaway highprio RT
> tasks.
>
> other properties of the RT_CPU_RATIO rlimit:
>
> - if there's idle time in the system then RT tasks will be
> allowed to use more than the limit.
>
> - if an RT task goes above the limit all the time then there
> is no guarantee that exactly the limit will be allowed for
> it. (i.e. you should set the limit to somewhat above the real
> needs of the RT task in question.)
>
> - a zero RLIMIT_RT_CPU_RATIO value means unlimited CPU time to that
> RT task. If the task is not an RT task then it may not change to RT
> priority. (i.e. a zero value makes it fully compatible with previous
> RT scheduling semantics.)
>
> - a nonzero rt_cpu_limit value also has the effect of allowing
> the use of RT priorities by nonprivileged users.
>
> - on SMP the limit is measured and enforced per-CPU.
>
> - runtime overhead is minimal, especially if the limit is set to 0.
>
> - the CPU-use measurement code has a 'memory' of roughly 300 msecs.
> I.e. if an RT task runs 100 msecs nonstop then it will increase
> its CPU use by about 30%. This should be fast enough for users for
> the limit to be human-imperceptible, but slow enough to allow
> occasional longer timeslices to RT tasks.
As I understand this (and I may be wrong), the intention is that if a
task has its RT_CPU_RATIO rlimit set to a value greater than zero then
setting its scheduling policy to SCHED_RR or SCHED_FIFO is allowed.
This causes me to ask the following questions:
1. Why is current->signal->rlim[RLIMIT_RT_CPU_RATIO].rlim_cur being used
in setscheduler() instead of p->signal->rlim[RLIMIT_RT_CPU_RATIO].rlim_cur?
2. What stops a task that had a non zero RT_CPU_RATIO rlimit and changed
its policy to SCHED_RR or SCHED_FIFO from then setting RT_CPU_RATIO
rlimit back to zero and escaping the controls? As far as I can see
(and, once again, I may be wrong) the mechanism for setting rlimits only
requires CAP_SYS_RESOURCE privileges in order to increase the value.
Peter
--
Peter Williams [email protected]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
* Ingo Molnar ([email protected]) wrote:
> But if someone reviewed all of the rlimit use in the kernel, we could
> make it a policy that rlimits might change. Any unsafe use could be made
> safe pretty easily. Since they are ulongs they are updated atomically
> even without any locking - but e.g. the default and the hard limit might
> change separately. (from the viewpoint of rlimit-using kernel code.)
I can wade through them later in the week.
> obviously a remote rlimit must listen to same kind of security
> permissions as e.g. ptrace or signal sending.
Yeah.
thanks,
-chris
--
Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net
Peter Williams wrote:
> Ingo Molnar wrote:
>
>> pretty much the only criticism of the RT-CPU patch was that the global
>> sysctl is too rigid and that it doesnt allow privileged tasks to ignore
>> the limit. I've uploaded a new RT-CPU-limit patch that solves this
>> problem:
>>
>> http://redhat.com/~mingo/rt-limit-patches/
>>
>> i've removed the global sysctl and implemented a new rlimit,
>> RT_CPU_RATIO: the maximum amount of CPU time RT tasks may use, in
>> percent. For testing purposes it defaults to 80%.
>>
>> the RT-limit being an rlimit makes it much more configurable: root tasks
>> can have unlimited CPU time limit, while users could have a more
>> conservative setting of say 30%. This also makes it per-process and
>> runtime configurable as well. The scheduler will instantly act upon any
>> new RT_CPU_RATIO rlimit.
>>
>> (this approach is fundamentally different from the previous patch that
>> made the "maximum RT-priority available to an unprivileged task" value
>> an rlimit - with priorities being an rlimit we still havent made RT
>> priorities safe against deadlocks.)
>>
>> multiple tasks can have different rlimits as well, and the scheduler
>> interprets it the following way: it maintains a per-CPU "RT CPU use"
>> load-average value and compares it against the per-task rlimit. If
>> e.g. the task says "i'm in the 60% range" and the current average is 70%,
>> then the scheduler delays this RT task - if the next task has an 80%
>> rlimit then it will be allowed to run. This logic is straightforward and
>> can be used as a further control mechanism against runaway highprio RT
>> tasks.
>>
>> other properties of the RT_CPU_RATIO rlimit:
>>
>> - if there's idle time in the system then RT tasks will be
>> allowed to use more than the limit.
>>
>> - if an RT task goes above the limit all the time then there
>> is no guarantee that exactly the limit will be allowed for
>> it. (i.e. you should set the limit to somewhat above the real
>> needs of the RT task in question.)
>>
>> - a zero RLIMIT_RT_CPU_RATIO value means unlimited CPU time to that
>> RT task. If the task is not an RT task then it may not change to RT
>> priority. (i.e. a zero value makes it fully compatible with previous
>> RT scheduling semantics.)
>>
>> - a nonzero rt_cpu_limit value also has the effect of allowing
>> the use of RT priorities by nonprivileged users.
>>
>> - on SMP the limit is measured and enforced per-CPU.
>>
>> - runtime overhead is minimal, especially if the limit is set to 0.
>>
>> - the CPU-use measurement code has a 'memory' of roughly 300 msecs.
>> I.e. if an RT task runs 100 msecs nonstop then it will increase
>> its CPU use by about 30%. This should be fast enough for users for
>> the limit to be human-imperceptible, but slow enough to allow
>> occasional longer timeslices to RT tasks.
>
>
> As I understand this (and I may be wrong), the intention is that if a
> task has its RT_CPU_RATIO rlimit set to a value greater than zero then
> setting its scheduling policy to SCHED_RR or SCHED_FIFO is allowed. This
> causes me to ask the following questions:
>
> 1. Why is current->signal->rlim[RLIMIT_RT_CPU_RATIO].rlim_cur being used
> in setscheduler() instead of p->signal->rlim[RLIMIT_RT_CPU_RATIO].rlim_cur?
>
> 2. What stops a task that had a non zero RT_CPU_RATIO rlimit and changed
> its policy to SCHED_RR or SCHED_FIFO from then setting RT_CPU_RATIO
> rlimit back to zero and escaping the controls? As far as I can see
> (and, once again, I may be wrong) the mechanism for setting rlimits only
> requires CAP_SYS_RESOURCE privileges in order to increase the value.
Oops, after rereading the patch, a task that set its RT_CPU_RATIO rlimit
to zero wouldn't be escaping the mechanism at all. It would be
suffering maximum throttling. Sorry for the silly question.
Peter
--
Peter Williams [email protected]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
Jack O'Quin wrote:
>
> If you grep your log file for 'client failure:', you'll probably find
> that JACK has reacted to the deteriorating situation by shutting down
> some of its clients. The number of 'client failure:' messages is
> *not* the number of clients shut down; there is some repetition (not
> sure why). This will give the actual number...
>
> $ grep '^client failure:' ${LOGFILE} | cut -f4 -d' ' | sort -u | wc -l
>
> It would help if the test script reported this value.
>
I will include it on the next jack_test4.2 :)
If you remember of anything else, please ask.
Cheers.
--
rncbc aka Rui Nuno Capela
[email protected]
P.S. I'm under a terrible cold|flu right now #( that's why I didn't have
the time or patience to test all these new kernel iso/rt_cpu_limit
goodies. So sorry.
* Chris Wright <[email protected]> wrote:
> > did that thread go into technical details? There are some rlimit users
> > that might not be prepared to see the rlimit change under them. The
> > RT_CPU_RATIO one ought to be safe, but generally i'm not so sure.
>
> Not really. I mentioned the above, as well as the security concern.
> Right now, at least the task_setrlimit hook would have to change to
> take into account the task. And I never convinced myself that async
> changes would be safe for each rlimit.
e.g. the stack rlimit looks dangerous, if any VM codepath ever looks
at it twice (and relies on it being the same, somehow).
But if someone reviewed all of the rlimit use in the kernel, we could
make it a policy that rlimits might change. Any unsafe use could be made
safe pretty easily. Since they are ulongs they are updated atomically
even without any locking - but e.g. the default and the hard limit might
change separately. (from the viewpoint of rlimit-using kernel code.)
obviously a remote rlimit must listen to same kind of security
permissions as e.g. ptrace or signal sending.
Ingo
On Tue, Jan 25, 2005 at 02:03:02PM -0800, Chris Wright wrote:
> * Ingo Molnar ([email protected]) wrote:
> > did that thread go into technical details? There are some rlimit users
> > that might not be prepared to see the rlimit change under them. The
> > RT_CPU_RATIO one ought to be safe, but generally i'm not so sure.
>
> Not really. I mentioned the above, as well as the security concern.
> Right now, at least the task_setrlimit hook would have to change to take
> into account the task. And I never convinced myself that async changes
> would be safe for each rlimit.
As was mentioned, but not discussed, in the /proc/<pid>/rlimit thread,
it is not difficult to envision conditions where setrlimit() on another
process could make exploiting an application bug much easier, by, e.g.,
setting the number of open files ridiculously low. So IMHO, it ought to require
privileges similar to ptrace() to change some, if not all, of the rlimits.
Bill Rugolsky
Ingo Molnar <[email protected]> writes:
> * Jack O'Quin <[email protected]> wrote:
>
>> At around 55 seconds into the run, JACK got in trouble and throttled
>> itself back to approximately the 30% limit (actually a little above).
>> Then, around second 240 it got in trouble again, this time collapsing
>> completely. I'm a bit confused by all the messages in that log, but
>> it appears that approximately 9 of the 20 clients were eventually shut
>> down by the JACK server. It looks like the second collapse around 240
>> also caused the scheduler to revoke RT privileges for the rest of the
>> run (just a guess).
>
> no, the scheduler doesnt revoke RT privileges, it just 'delays' RT tasks
> that violate the threshold. In other words, SCHED_OTHER tasks will have
> a higher effective priority than RT tasks, up until the CPU use average
> drops below the limit again.
When does it start working again? Does it continue getting 80% of
each CPU in the long run? What is the period over which this "delay"
occurs and recurs?
I know how to deal with running out of CPU cycles. This seems to
present a new and different failure mode. I'd like to know that I can
have 80% of the cycles for each realtime period. (For JACK this is
determined by the audio buffer size.) If I can't finish my work in
that allotment, then I've failed and need to scale back.
But, the scheduler doesn't know about realtime cycles. It just knows
that I used more than 80% over some unspecified period. So, maybe I
handled the first 8 audio buffers, but have no cycles left for buffers
9 and 10. Is that right? That's not a situation I currently expect
to deal with. I need to figure out how to handle it.
> the effect is pretty similar to starting too many Jack clients - things
> degrade quickly and _all_ clients start skipping, and the whole audio
> experience escalates into a big xrun mess. Not much to be done about it
> i suspect. Maybe if the current 'RT load' was available via /proc then
> jackd could introduce some sort of threshold above which it would reject
> new clients?
It could.
It also kicks clients out of the realtime graph if they take too long.
>> JACK can probably do a better job of shutting down hyperactive
>> realtime clients than the kernel, because it knows more about what the
>> user is trying to do. Multiplying incomprehensible rlimits values does
>> not help much that I can see.
>
> please debug this some more - the kernel certainly doesnt do anything
> intrusive - the clients only get delayed for some time.
One simple definition of a realtime operation: there exists some
deadline beyond which, if you didn't get the job done, you might as
well not do it at all.
For many applications, it might actually be less intrusive to kill
them than to delay them. My first thought is to revoke SCHED_FIFO and
send a signal. The default action could be process termination, but
the process might optionally catch the signal, throttle back and try
to restart the operation.
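(A minimal sketch of that "catch the signal and throttle back" idea;
the thread does not specify a signal, so SIGXCPU is an assumption, and
the handler is deliberately minimal:)

#include <sched.h>
#include <signal.h>
#include <string.h>

/* Sketch only: which signal the kernel would send is hypothetical
 * (SIGXCPU assumed).  On receipt, drop back to SCHED_OTHER; a real
 * client would then scale back its work and retry. */
static void rt_revoked(int sig)
{
        struct sched_param sp;

        (void)sig;
        memset(&sp, 0, sizeof(sp));
        sched_setscheduler(0, SCHED_OTHER, &sp);
}

static void install_rt_fallback(void)
{
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = rt_revoked;
        sigaction(SIGXCPU, &sa, NULL);
}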
Maybe there are other usage scenarios for which delaying the realtime
thread is a good idea. What kind did you have in mind?
>> Sometimes musicians want to "push the envelope" using CPU-hungry
>> realtime effects like reverbs or Fourier Transforms. It is often hard
>> to predict how much of this sort of load a given system can handle.
>> JACK reports its subsystem-wide "DSP load" as a moving average,
>> allowing users to monitor it. It might be helpful if the kernel's
>> estimate of this number were also available somewhere (but maybe that
>> value has no meaning to users). Often, the easiest method is to push
>> things to the breaking point, and then back off a bit.
>
> yeah, i'll add this to /proc, so that utilities can access it. Jackd
> could monitor it and refuse to start new clients if the RT load is
> dangerously close to the limit (say within 10% of it)?
That could be useful. But, isn't it per-CPU?
Would this be a composite value? Average? Does that mean anything?
>> The equivalent rlimits experiment probably requires:
>>
>> (1) editing /etc/security/limits.conf
>> (2) shutting everything down
>> (3) logout
>> (4) login
>> (5) restarting the test
>
> well, there's setrlimit, so you could add a jackd client callback that
> instructs all clients to change their RT_CPU_RATIO rlimit. In theory we
> could try to add a new rlimit syscall that changes another task's rlimit
> (right now the syscalls only allow the changing of the rlimit of the
> current task) - that would enable utilities to change the rlimit of all
> tasks in the system, achieving the equivalent of a global sysctl.
Sure, we could. That does seem like an enormously complicated
mechanism to accomplish something so simple. We are taking a global
per-CPU limit, treating it as if it were per-process, then invoking a
complex callback scheme to propagate new values, all just to shoe-horn
it into the rlimits structure.
There are many problems for which rlimits is a good solution. This
does not seem to be one of them.
--
joq
"To a man with a hammer, every problem looks like a nail." ;-)
Ingo Molnar <[email protected]> writes:
> pretty much the only criticism of the RT-CPU patch was that the global
> sysctl is too rigid and that it doesnt allow privileged tasks to ignore
> the limit. I've uploaded a new RT-CPU-limit patch that solves this
> problem:
>
> http://redhat.com/~mingo/rt-limit-patches/
>
> i've removed the global sysctl and implemented a new rlimit,
> RT_CPU_RATIO: the maximum amount of CPU time RT tasks may use, in
> percent. For testing purposes it defaults to 80%.
Small terminology quibble: `ratio' sounds like a fraction, not a
percentage. Since it really is a percentage, why not call it
RLIMIT_RT_CPU_PERCENT, or (maybe better) just RLIMIT_RT_CPU?
Does getrusage() return anything for this? How can a field be added
to the rusage struct without breaking binary compatibility? Can we
assume that no programs ever use sizeof(struct rusage)?
> the RT-limit being an rlimit makes it much more configurable: root tasks
> can have unlimited CPU time limit, while users could have a more
> conservative setting of say 30%. This also makes it per-process and
> runtime configurable as well. The scheduler will instantly act upon any
> new RT_CPU_RATIO rlimit.
I'm having trouble coming up with reasons why this is better than the
previous (rt_cpu_limit) solution, which was so much simpler and easier
to explain.
I can imagine defining per-user limits based on membership in groups
like `audio', `video' or `cdrom'. While logical, I'm having trouble
thinking of usage scenarios where it makes any practical difference.
What problem(s) are we trying to solve with this rlimits field?
> multiple tasks can have different rlimits as well, and the scheduler
> interprets it the following way: it maintains a per-CPU "RT CPU use"
> load-average value and compares it against the per-task rlimit. If e.g.
> the task says "i'm in the 60% range" and the current average is 70%,
> then the scheduler delays this RT task - if the next task has an 80%
> rlimit then it will be allowed to run. This logic is straightforward and
> can be used as a further control mechanism against runaway highprio RT
> tasks.
I can almost understand how this works, but not quite. I guess I need
to read the code. You're trying to selectively throttle certain tasks
and not others, right? But, the limit they're hitting is system
global.
My main conceptual difficulty is driven by the fact that "realtime" is
actually a system attribute. Factors such as hardware, kernel, device
drivers, applications, and system configuration all contribute to it
and can all screw it up.
So, what does it mean for a task to say "I'm in the 60% range"? That
it individually will never use more than 60% of any one CPU? Or, that
it and several other associated tasks will never use more than that?
> other properties of the RT_CPU_RATIO rlimit:
>
> - a zero RLIMIT_RT_CPU_RATIO value means unlimited CPU time to that
> RT task. If the task is not an RT task then it may not change to RT
> priority. (i.e. a zero value makes it fully compatible with previous
> RT scheduling semantics.)
What about root, or CAP_SYS_NICE?
> - the CPU-use measurement code has a 'memory' of roughly 300 msecs.
> I.e. if an RT task runs 100 msecs nonstop then it will increase
> its CPU use by about 30%. This should be fast enough for users for
> the limit to be human-imperceptible, but slow enough to allow
> occasional longer timeslices to RT tasks.
So, at 80% I get 240 msecs out of every 300? If I use them all up, do
I then have to wait 60 msecs before getting scheduled again?
--
joq
* Jack O'Quin <[email protected]> wrote:
> > http://redhat.com/~mingo/rt-limit-patches/
> >
> > i've removed the global sysctl and implemented a new rlimit,
> > RT_CPU_RATIO: the maximum amount of CPU time RT tasks may use, in
> > percent. For testing purposes it defaults to 80%.
>
> Small terminology quibble: `ratio' sounds like a fraction, not a
> percentage. Since it really is a percentage, why not call it
> RLIMIT_RT_CPU_PERCENT, or (maybe better) just RLIMIT_RT_CPU?
yeah, makes sense. I've uploaded the -D5 patch, which has the following
changes:
- renamed the rlimit to RLIMIT_RT_CPU
- exported the current RT-average value to /proc/stat (it's the last
field in the cpu lines)
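(A rough sketch of how jackd or a monitoring tool might poll that
value; the assumption that it is the last whitespace-separated field of
the aggregate "cpu" line comes only from the description above:)

#include <stdio.h>
#include <string.h>

/* Return the last field of the aggregate "cpu " line in /proc/stat,
 * where the -D5 patch is said to put the current RT-average.  The
 * field position is an assumption based on the description above. */
static long read_rt_average(void)
{
        char line[512];
        long last = -1;
        FILE *f = fopen("/proc/stat", "r");

        if (!f)
                return -1;
        while (fgets(line, sizeof(line), f)) {
                if (strncmp(line, "cpu ", 4) == 0) {
                        char *tok = strtok(line, " \n");
                        while (tok) {
                                sscanf(tok, "%ld", &last);
                                tok = strtok(NULL, " \n");
                        }
                        break;
                }
        }
        fclose(f);
        return last;
}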
> Does getrusage() return anything for this? How can a field be added
> to the rusage struct without breaking binary compatibility? Can we
> assume that no programs ever use sizeof(struct rusage)?
rlimits are easily extended and there are no binary compatibility
worries. The kernel doesnt export the maximum towards userspace.
getrusage() will return the value on new kernels and will return -EINVAL
on old kernels, so new userspace can deal with this accordingly.
> I can imagine defining per-user limits based on membership in groups
> like `audio', `video' or `cdrom'. While logical, I'm having trouble
> thinking of usage scenarios where it makes any practical difference.
e.g. the issue Con and others raised: privileged tasks. By default, the
root user will likely have a 0 rlimit, for compatibility. _But_ i can
easily imagine users wanting to put in a safety limit even for
root-owned RT tasks by making the rlimit 98% or so. Nonprivileged users
would have this rlimit at say 20% in a typical desktop distribution.
> > multiple tasks can have different rlimits as well, and the scheduler
> > interprets it the following way: it maintains a per-CPU "RT CPU use"
> > load-average value and compares it against the per-task rlimit. If e.g.
> > the task says "i'm in the 60% range" and the current average is 70%,
> > then the scheduler delays this RT task - if the next task has an 80%
> > rlimit then it will be allowed to run. This logic is straightforward and
> > can be used as a further control mechanism against runaway highprio RT
> > tasks.
>
> I can almost understand how this works, but not quite. I guess I need
> to read the code. You're trying to selectively throttle certain tasks
> and not others, right? [...]
correct.
> [...] But, the limit they're hitting is system global.
no, the limit is per-task. It is the _current CPU average_, measured by
the scheduler, that is global. (well, on SMP: "per-CPU global".) This
means that the scheduler knows all the time the percentage of time RT
tasks use up. If it's say 76%, and the rlimit of the _current task_ is
at 90% then this task will be allowed to run. (up to the point the
average reaches 90%.) If this task finishes running, and another RT task
would like to run, and we are still at a 76% RT-average, and that new
task has a 60% rlimit, then the scheduler will not allow the new task to
run. Only tasks that are SCHED_OTHER, or have a higher RT rlimit than
the current average will be allowed to run. [except if there are no
SCHED_OTHER tasks, in which case all tasks are allowed to run.]
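(Restating that rule as pseudo-C; this paraphrases the description
above rather than the actual patch code, and all the names here are
invented:)

/* Invented names throughout.  rt_avg is the per-CPU decaying average
 * of RT CPU use, in percent; rt_cpu_rlimit is the task's soft limit. */
static int rt_task_may_run(int task_is_rt, unsigned int rt_cpu_rlimit,
                           unsigned int rt_avg, int sched_other_runnable)
{
        if (!task_is_rt)                /* SCHED_OTHER: always eligible */
                return 1;
        if (!sched_other_runnable)      /* nothing to starve */
                return 1;
        if (rt_cpu_rlimit == 0)         /* 0 = old semantics, no throttle */
                return 1;
        return rt_avg < rt_cpu_rlimit;  /* run only while below the limit */
}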
> So, what does it mean for a task to say "I'm in the 60% range"? That
> it individually will never use more than 60% of any one CPU? Or, that
> it and several other associated tasks will never use more than that?
it means that it will only be allowed to run if the CPU utilization of
all RT tasks (eligible to run) does not exceed 60%.
> > other properties of the RT_CPU_RATIO rlimit:
> >
> > - a zero RLIMIT_RT_CPU_RATIO value means unlimited CPU time to that
> > RT task. If the task is not an RT task then it may not change to RT
> > priority. (i.e. a zero value makes it fully compatible with previous
> > RT scheduling semantics.)
>
> What about root, or CAP_SYS_NICE?
what do you mean? root can change its own rlimit if it wants to, but
there is no special hack to allow root/CAP_SYS_NICE tasks to get
unlimited RT CPU time. That would make it impossible for a privileged
task to make use of this feature.
> > - the CPU-use measurement code has a 'memory' of roughly 300 msecs.
> > I.e. if an RT task runs 100 msecs nonstop then it will increase
> > its CPU use by about 30%. This should be fast enough for users for
> > the limit to be human-imperceptible, but slow enough to allow
> > occasional longer timeslices to RT tasks.
>
> So, at 80% I get 240 msecs out of every 300? If I use them all up, do
> I then have to wait 60 msecs before getting scheduled again?
it's implemented as a decaying average. The 300 msecs means that if the
current RT average is at 100%, and suddenly every RT task stops running
(i.e. the average will decrease as fast as it can), then it will reach
5% in 300 msecs.
the same goes in the other direction: it needs roughly 300 msecs for the
average to go from 0% to 100%.
yes, even though the way it increases is not linear, you can expect the
average to increase if you run for 'too long' - where 'too long' is
roughly (percentage * 300msecs). I.e. if you have it at 80% then you
should expect the limit to kick in after running for 240 msecs. This
shouldnt be a practical issue, unless an RT application has a very
'choppy' workload (processing for 1 second, then sleeping for 1 second,
etc.) - in which case it needs an increased limit. There obviously must
be some sort of boundary where the scheduler says 'this is enough'.
Ingo
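(To make those numbers concrete, a toy model of such a decaying
average; the 1 ms tick and the alpha value are assumptions chosen to
match the "100% decays to ~5% in 300 msecs" description, not the
patch's actual fixed-point code:)

#include <stdio.h>

/* Toy model: once per 1 ms tick, avg += alpha * (input - avg).
 * alpha ~= 0.01 makes 100% decay to ~5% in about 300 ticks, as
 * described above.  With an RT task running flat out, this
 * exponential model crosses an 80% limit after ~160 ms; the
 * "percentage * 300 msecs" figure above is a rougher rule of thumb. */
int main(void)
{
        double avg = 0.0, alpha = 0.01, limit = 80.0;
        int ms;

        for (ms = 1; ms <= 1000; ms++) {
                avg += alpha * (100.0 - avg);
                if (avg >= limit) {
                        printf("hit %.0f%% after ~%d ms\n", limit, ms);
                        break;
                }
        }
        return 0;
}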
* Jack O'Quin <[email protected]> wrote:
> >> The equivalent rlimits experiment probably requires:
> >>
> >> (1) editing /etc/security/limits.conf
> >> (2) shutting everything down
> >> (3) logout
> >> (4) login
> >> (5) restarting the test
> >
> > well, there's setrlimit, so you could add a jackd client callback that
> > instructs all clients to change their RT_CPU_RATIO rlimit. In theory we
> > could try to add a new rlimit syscall that changes another task's rlimit
> > (right now the syscalls only allow the changing of the rlimit of the
> > current task) - that would enable utilities to change the rlimit of all
> > tasks in the system, achieving the equivalent of a global sysctl.
>
> Sure, we could. That does seem like an enormously complicated
> mechanism to accomplish something so simple. We are taking a global
> per-CPU limit, treating it as if it were per-process, then invoking a
> complex callback scheme to propagate new values, [...]
this was just a suggestion. It seems a remote-rlimit syscall is
possible, so there's no need to change Jack if you dont want to - that
syscall enables a utility that will change the rlimit for all running
tasks, so you'll get the 'simplicity of experimentation' of a global
sysctl.
(what you wont get is the ultra-fast time-to-market property of sysctl
hacks. I know that you'd probably prefer a global sysctl that you could
start using tomorrow, and i also know that the global sysctl would
suffice for current Jackd needs, but we cannot sacrifice flexibility and
future utility for the sake of a couple of weeks/months of time
advantage...)
Ingo
* Peter Williams <[email protected]> wrote:
> As I understand this (and I may be wrong), the intention is that if a
> task has its RT_CPU_RATIO rlimit set to a value greater than zero then
> setting its scheduling policy to SCHED_RR or SCHED_FIFO is allowed.
correct.
> This causes me to ask the following questions:
>
> 1. Why is current->signal->rlim[RLIMIT_RT_CPU_RATIO].rlim_cur being used
> in setscheduler() instead of p->signal->rlim[RLIMIT_RT_CPU_RATIO].rlim_cur?
>
> 2. What stops a task that had a non zero RT_CPU_RATIO rlimit and
> changed its policy to SCHED_RR or SCHED_FIFO from then setting
> RT_CPU_RATIO rlimit back to zero and escaping the controls? As far as
> I can see (and, once again, I may be wrong) the mechanism for setting
> rlimits only requires CAP_SYS_RESOURCE privileges in order to increase
> the value.
you are right, both are bugs.
i've uploaded the -D6 patch that should have both fixed:
http://redhat.com/~mingo/rt-limit-patches/
thanks,
Ingo
* Peter Williams <[email protected]> wrote:
> Oops, after rereading the patch, a task that set its RT_CPU_RATIO
> rlimit to zero wouldn't be escaping the mechanism at all. It would be
> suffering maximum throttling. [...]
my intention was to let 'limit 0' mean 'old RT semantics' - i.e. 'no RT
CPU time for unprivileged tasks at all', and only privileged tasks may
do it and then they'll get full CPU time with no throttling.
so in that context your observation highlights another bug, which i
fixed in the -D7 patch available from the usual place:
http://redhat.com/~mingo/rt-limit-patches/
not doing the '0' exception would make it harder to introduce the rlimit
in a compatible fashion. (My current thinking is that the default RT_CPU
rlimit should be 0.)
Ingo
i've uploaded a simple utility to set the RT_CPU rlimit, called
execrtlim:
http://redhat.com/~mingo/rt-limit-patches/
execrtlim can be used to test the rlimit, e.g.:
./execrtlim 10 10 /bin/bash
will spawn a new shell with RLIMIT_RT_CPU curr/max set to 10%/10%.
on older kernels the utility prints:
$ ./execrtlim 10 10 /bin/bash
execrtlim: kernel does not support RLIMIT_RT_CPU.
Ingo
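(Conceptually, such a wrapper amounts to setrlimit-then-exec, roughly
as below; RLIMIT_RT_CPU's numeric value is patch-specific, so the
define here is only a placeholder:)

#include <sys/resource.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef RLIMIT_RT_CPU
#define RLIMIT_RT_CPU 15        /* placeholder; the patch assigns the real number */
#endif

int main(int argc, char **argv)
{
        struct rlimit rl;

        if (argc < 4) {
                fprintf(stderr, "usage: %s cur max cmd [args...]\n", argv[0]);
                return 1;
        }
        rl.rlim_cur = strtoul(argv[1], NULL, 10);
        rl.rlim_max = strtoul(argv[2], NULL, 10);
        /* On unpatched kernels the unknown resource is rejected,
         * matching the "does not support" message above. */
        if (setrlimit(RLIMIT_RT_CPU, &rl) < 0) {
                fprintf(stderr, "%s: kernel does not support RLIMIT_RT_CPU.\n",
                        argv[0]);
                return 1;
        }
        execvp(argv[3], argv + 3);
        perror("execvp");
        return 1;
}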
Ingo Molnar <[email protected]> writes:
> (My current thinking is that the default RT_CPU rlimit should be 0.)
How about a kernel .config option allowing us to easily compile in a
different default?
That should tide over most of the audio users for the next 6 months or
so until we get userspace tools from the various distributions.
--
joq
Ingo Molnar wrote:
> [...] the -D7 patch available from the usual place:
Hi,
Considering the amount and rate of work in progress, this may well no
longer be pertinent, but I'm consistently getting an oops running the
basic jack_test3.2 with rt-limit-2.6.11-rc2-D7 on SMP (P3 993 x 2). The
oops and jacktest log are at
<http://www.graggrag.com/20050127-oops/>.
cheers, Cal
Please ignore that, I've just started looking through the log properly.
cheers.
Jack O'Quin wrote:
> I notice that JACK's call to mlockall() is failing. This is one
> difference between your system and mine (plus, my machine is UP).
>
> As an experiment, you might try testing with `ulimit -l unlimited'.
I went for the panic retraction on the first report when I saw the
failures in the log. With ulimit -l unlimited, jack seems happier.
Before the change, ulimit -l showed 32.
At what feels like approaching the end of the run, it still goes clunk -
totally so, dead and gone!
<http://www.graggrag.com/200501270420-oops/>
I'll re-read the mails that have gone by, and think about the next step.
cheers, Cal
Ingo Molnar wrote:
> * Peter Williams <[email protected]> wrote:
>
>
>>Oops, after rereading the patch, a task that set its RT_CPU_RATIO
>>rlimit to zero wouldn't be escaping the mechanism at all. It would be
>>suffering maximum throttling. [...]
>
>
> my intention was to let 'limit 0' mean 'old RT semantics' - i.e. 'no RT
> CPU time for unprivileged tasks at all', and only privileged tasks may
> do it and then they'll get full CPU time with no throttling.
>
> so in that context your observation highlights another bug, which i
> fixed in the -D7 patch available from the usual place:
>
> http://redhat.com/~mingo/rt-limit-patches/
>
> not doing the '0' exception would make it harder to introduce the rlimit
> in a compatible fashion. (My current thinking is that the default RT_CPU
> rlimit should be 0.)
One solution to this dilemma might be to set a PF_* flag on a task
whenever it gains RT status via this privilege bypass and only apply the
limit to tasks that have that flag set.
Peter
--
Peter Williams [email protected]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
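(Peter's idea, restated as pseudo-C; the flag name and bit value are
invented and would collide with real PF_ flags:)

/* Invented flag: mark tasks that obtained RT policy only via the
 * rlimit bypass (rather than via CAP_SYS_NICE), and throttle only
 * those. */
#define PF_RT_BYPASS 0x40000000

static inline int rt_limit_applies(unsigned long task_flags)
{
        return (task_flags & PF_RT_BYPASS) != 0;
}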
On Wed, 2005-01-26 at 16:27 -0600, Jack O'Quin wrote:
> Ingo Molnar <[email protected]> writes:
>
> > - exported the current RT-average value to /proc/stat (it's the last
> > field in the cpu lines)
>
> > e.g. the issue Con and others raised: privileged tasks. By default, the
> > root user will likely have a 0 rlimit, for compatibility. _But_ i can
> > easily imagine users wanting to put in a safety limit even for
> > root-owned RT tasks by making the rlimit 98% or so. Nonprivileged users
> > would have this rlimit at say 20% in a typical desktop distribution.
>
> That seems rather small. CPU starvation is not generally much of a
> problem on desktop machines. If that (single) user wants to eat up
> 70% or 80% of the CPU, that's not likely to be a problem. Mac OS X
> allows 90% on their desktop systems.
>
Just a bit off the topic of this sub-thread...
I'm a bit concerned about this kind of policy and breakage of
userspace APIs going into the kernel. I mean, if an app is
succeeds in gaining SCHED_FIFO / SCHED_RR scheduling, and the
scheduler does something else, that could be undesirable in some
situations.
Secondly, I think we should agree upon and get the basic rlimit
support in ASAP, so the userspace aspect can be firmed up a bit
for people like Paul and Jack (this wouldn't preclude further
work from happening in the scheduler afterwards).
And finally, with rlimit support, is there any reason why lockup
detection and correction can't go into userspace? Even RT
throttling could probably be done in a userspace daemon.
Nick
Ingo Molnar <[email protected]> writes:
> - exported the current RT-average value to /proc/stat (it's the last
> field in the cpu lines)
> e.g. the issue Con and others raised: privileged tasks. By default, the
> root user will likely have a 0 rlimit, for compatibility. _But_ i can
> easily imagine users wanting to put in a safety limit even for
> root-owned RT tasks by making the rlimit 98% or so. Nonprivileged users
> would have this rlimit at say 20% in a typical desktop distribution.
That seems rather small. CPU starvation is not generally much of a
problem on desktop machines. If that (single) user wants to eat up
70% or 80% of the CPU, that's not likely to be a problem. Mac OS X
allows 90% on their desktop systems.
>> So, what does it mean for a task to say "I'm in the 60% range"? That
>> it individually will never use more than 60% of any one CPU? Or, that
>> it and several other associated tasks will never use more than that?
>
> it means that it will only be allowed to run if the CPU utilization of
> all RT tasks (eligible to run) does not exceed 60%.
But how would people use it?
There are many RT application scenarios I don't know about, and
probably some more I'm forgetting at the moment. But, I do have at
least one user-oriented view of the problem. Perhaps others will
mention additional scenarios based on their own experience.
My initial reaction is that the kind of priority-based SCHED_FIFO and
SCHED_RR techniques we're providing (i.e. POSIX realtime) really only
supports one set of at least loosely cooperating RT threads. For
multiple, non-cooperating RT subsystems, one needs something more like
a deadline scheduler. That would be a nice project for someone to
work on, but we are definitely *not* asking for it here.
Priority-based RT thread cooperation generally involves a "realtime
pyramid". A thread with a short trigger latency needs to run at high
priority, near the top of the pyramid. It must do its job quickly to
avoid delaying the lower-priority threads. So, high priority
generally imposes a tight pathlength restriction. Threads at the base
of the pyramid can run longer, but have longer trigger latencies due
to the number and worst-case pathlengths of the layers above them.
You can even think of IRQ handlers as occupying "special" hardware
scheduled priorities at the apex of the pyramid.
But, rlimits is per-process, not per-thread (right??). So, the whole
RT subsystem is going to end up needing to share a single, overall
RLIMIT_RT_CPU value. This makes rlimits a rather poor fit for the
purpose. I could imagine limiting CPU use per priority level (not per
process), with the kernel helping to enforce the realtime priority
pyramid directly, restricting higher priorities to fewer cycles. But,
I have no well thought out proposal like that (as yet).
So apparently, this clumsy rlimits mechanism is really only going to
store two or three different values for every process in the system,
for example: (1) 0% for ordinary processes, (2) 80% for desktop audio
and video users, and (3) 98% for root and perhaps a few other
privileged processes. I probably still don't understand exactly how
these work, but does anyone seriously expect there to be more?
> yes, even though the way it increases is not linear, you can expect the
> average to increase if you run for 'too long' - where 'too long' is
> roughly (percentage * 300msecs). I.e. if you have it at 80% then you
> should expect the limit to kick in after running for 240 msecs. This
> shouldnt be a practical issue, unless an RT application has a very
> 'choppy' workload (processing for 1 second, then sleeping for 1 second,
> etc.) - in which case it needs an increased limit. There obviously must
> be some sort of boundary where the scheduler says 'this is enough'.
This seems reasonable to me. AFAIK, very few RT applications are
*that* intermittent. Generally, there is some basic cycle in which
things get done. Sometimes there are cycles within cycles.
I think I see a way to do graceful degradation with arbitrary shorter
RT cycles. JACK currently watches its clients to detect when they
exceed the available cycle duration. JACK could query its own
RLIMIT_RT_CPU value and use that percentage of the actual cycle time
as the time limit. If this is exceeded, jackd would start shutting
down clients before the scheduler shuts down realtime operation
completely, thus (perhaps) avoiding a catastrophic failure of the
whole subsystem (like we saw in my tests). Since it makes no sense
for the clients to run with a different RLIMIT_RT_CPU value, this
ought to work in practice. Libjack could complain if a client with a
lower limit tried to connect. Each JACK server only handles one user
at a time, anyway.
Will the scheduler support that approach?
--
joq
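(A sketch of that calculation, assuming Ingo's RLIMIT_RT_CPU is
visible to userspace via getrlimit(); the helper name and the
placeholder define are not from the patch:)

#include <sys/resource.h>

#ifndef RLIMIT_RT_CPU
#define RLIMIT_RT_CPU 15        /* placeholder; patch-specific */
#endif

/* Turn the task's RT CPU percentage into a per-cycle budget.
 * cycle_usecs would come from jackd's period size and sample rate. */
static long rt_cycle_budget_usecs(long cycle_usecs)
{
        struct rlimit rl;

        if (getrlimit(RLIMIT_RT_CPU, &rl) < 0 || rl.rlim_cur == 0)
                return cycle_usecs;     /* unlimited / old semantics */
        return cycle_usecs * (long)rl.rlim_cur / 100;
}

jackd could then start shedding clients once their combined execution
time per cycle approaches that budget, before the scheduler's throttle
kicks in.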
Ingo Molnar wrote:
> * Peter Williams <[email protected]> wrote:
>
>
>>Oops, after rereading the patch, a task that set its RT_CPU_RATIO
>>rlimit to zero wouldn't be escaping the mechanism at all. It would be
>>suffering maximum throttling. [...]
>
>
> my intention was to let 'limit 0' mean 'old RT semantics' - i.e. 'no RT
> CPU time for unprivileged tasks at all', and only privileged tasks may
> do it and then they'll get full CPU time with no throttling.
>
> so in that context your observation highlights another bug, which i
> fixed in the -D7 patch available from the usual place:
>
> http://redhat.com/~mingo/rt-limit-patches/
>
> not doing the '0' exception would make it harder to introduce the rlimit
> in a compatible fashion. (My current thinking is that the default RT_CPU
> rlimit should be 0.)
I'd suggest making the default 100% and only allowing non-privileged
users to set a real-time policy if the task's RT_CPU_RATIO is BELOW some
(possibly configurable via sysfs or sysctl) limit. This will have the
desired effect that tasks given RT status via the normal means will be
unaffected.
I'd also like to suggest that you change the units from parts per
hundred (a.k.a. percent) to parts per thousand. In addition to giving
better control granularity, this will be a better match for HZ on most
systems, giving better efficiency.
Peter
--
Peter Williams [email protected]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
Jack O'Quin wrote:
> You seem to have eliminated the mlock() failure as the cause of this
> crash. It should not have caused it anyway, but it *was* one
> noticeable difference between your tests and mine. Since
> do_page_fault() appears in the backtrace, that is useful to know.
>
> The other big difference is SMP. What happens if you build a UP
> kernel and run it on your SMP machine?
Sorry for the delay, some sleep required. A build without SMP also
fails, with multiple oops.
<http://www.graggrag.com/200501271213-oops/>.
cheers, Cal
(NetEase AntiSpam+ at mail.edu.cn has been bouncing someone's copy)
Cal <[email protected]> writes:
> Jack O'Quin wrote:
>> I notice that JACK's call to mlockall() is failing. This is one
>> difference between your system and mine (plus, my machine is UP).
>> As an experiment, you might try testing with `ulimit -l unlimited'.
>
> I went for the panic retraction on the first report when I saw the
> failures in the log. With ulimit -l unlimited, jack seems
> happier. Before the change, ulimit -l showed 32.
>
> At what feels like approaching the end of the run, it still goes clunk
> totally so, dead and gone!
>
> <http://www.graggrag.com/200501270420-oops/>
>
> I'll re-read the mails that have gone by, and think about the next step.
You seem to have eliminated the mlock() failure as the cause of this
crash. It should not have caused it anyway, but it *was* one
noticeable difference between your tests and mine. Since
do_page_fault() appears in the backtrace, that is useful to know.
The other big difference is SMP. What happens if you build a UP
kernel and run it on your SMP machine?
--
joq
Nick Piggin <[email protected]> writes:
> I'm a bit concerned about this kind of policy and breakage of
> userspace APIs going into the kernel. I mean, if an app
> succeeds in gaining SCHED_FIFO / SCHED_RR scheduling, and the
> scheduler does something else, that could be undesirable in some
> situations.
True. It's similar to running out of CPU bandwidth, but not quite.
AFAICT, the new behavior still meets the letter of the standard[1].
Whether it meets the spirit of the standard is debatable. My own
feeling is that it probably does, and that making SCHED_FIFO somewhat
less powerful but much easier to access is a reasonable tradeoff.
[1] http://www.opengroup.org/onlinepubs/007908799/xsh/realtime.html
If I understand Ingo's proposal correctly, setting RLIMIT_RT_CPU to
zero and then requesting SCHED_FIFO (with CAP_SYS_NICE) yields exactly
the former behavior. This will probably be the default setting.
> Secondly, I think we should agree upon and get the basic rlimit
> support in ASAP, so the userspace aspect can be firmed up a bit
> for people like Paul and Jack (this wouldn't preclude further
> work from happening in the scheduler afterwards).
I don't sense much opposition to adding rlimit support for realtime
scheduling. I personally don't think it a very good way to manage
this problem. But, it certainly can be made to work.
The main point of discussion is: exactly what resource should it
limit? Arjan and Chris proposed to limit priority. Ingo proposed to
limit the percentage of each CPU available for realtime threads
(collectively). Either would meet our minimum needs (AFAICT).
But, they are not identical, and the best choice depends at least
partly on the outcome of Ingo's scheduler experiments. I doubt that
anyone wants to add both (though it could come down to that, I
suppose).
> And finally, with rlimit support, is there any reason why lockup
> detection and correction can't go into userspace? Even RT
> throttling could probably be done in a userspace daemon.
It can. But, doing it in the kernel is more efficient, and probably
more reliable.
--
joq
Cal <[email protected]> writes:
> Considering the amount and rate of work in progress, this may well
> no longer be pertinent, but I'm consistently getting an oops running
> the basic jack_test3.2 with rt-limit-2.6.11-rc2-D7 on SMP (P3 993 x
> 2). The oops and jacktest log are at
> <http://www.graggrag.com/20050127-oops/>.
I notice that JACK's call to mlockall() is failing. This is one
difference between your system and mine (plus, my machine is UP).
As an experiment, you might try testing with `ulimit -l unlimited'.
--
joq
On Wed, 2005-01-26 at 20:31 -0600, Jack O'Quin wrote:
> Nick Piggin <[email protected]> writes:
>
> > I'm a bit concerned about this kind of policy and breakage of
> > userspace APIs going into the kernel. I mean, if an app
> > succeeds in gaining SCHED_FIFO / SCHED_RR scheduling, and the
> > scheduler does something else, that could be undesirable in some
> > situations.
>
> True. It's similar to running out of CPU bandwidth, but not quite.
>
> AFAICT, the new behavior still meets the letter of the standard[1].
> Whether it meets the spirit of the standard is debatable. My own
> feeling is that it probably does, and that making SCHED_FIFO somewhat
> less powerful but much easier to access is a reasonable tradeoff.
I don't think it does. The process with the highest priority
must run first, regardless of anything else.
You could say "it is similar to running out of CPU bandwidth", or
"it is like running on a CPU that is 80% the speed". And it may be
similar in terms of total throughput.
But the important elements are lost. The standard provides a
deterministic scheduling order, and a deterministic scheduling
latency (of course this doesn't mean a great deal for Linux, but
I think we're good enough for a lot of soft-rt applications now).
>
> [1] http://www.opengroup.org/onlinepubs/007908799/xsh/realtime.html
>
> If I understand Ingo's proposal correctly, setting RLIMIT_RT_CPU to
> zero and then requesting SCHED_FIFO (with CAP_SYS_NICE) yields exactly
> the former behavior. This will probably be the default setting.
>
Oh yes, and that would surely have to be the only sane default
setting. But I then ask, why put this complexity into the kernel
in the first place? And is the cost of significantly breaking the
API and lying to userspace acceptable?
> > Secondly, I think we should agree upon and get the basic rlimit
> > support in ASAP, so the userspace aspect can be firmed up a bit
> > for people like Paul and Jack (this wouldn't preclude further
> > work from happening in the scheduler afterwards).
>
> I don't sense much opposition to adding rlimit support for realtime
> scheduling. I personally don't think it a very good way to manage
> this problem. But, it certainly can be made to work.
>
> The main point of discussion is: exactly what resource should it
> limit? Arjan and Chris proposed to limit priority. Ingo proposed to
> limit the percentage of each CPU available for realtime threads
> (collectively). Either would meet our minimum needs (AFAICT).
>
I think the priority limit is the better way, because the "% CPU
available to realtime threads" approach essentially gives you no
realtime scheduling at all.
By limiting priority, you can at least construct userspace solutions
to managing RT threads.
> But, they are not identical, and the best choice depends at least
> partly on the outcome of Ingo's scheduler experiments. I doubt that
> anyone wants to add both (though it could come down to that, I
> suppose).
>
> > And finally, with rlimit support, is there any reason why lockup
> > detection and correction can't go into userspace? Even RT
> > throttling could probably be done in a userspace daemon.
>
> It can. But, doing it in the kernel is more efficient, and probably
> more reliable.
Possibly. But the people who want a solution seem to be in a very
small minority, and I'm not sure how much you care about efficiency.
Nick Piggin <[email protected]> writes:
> On Wed, 2005-01-26 at 20:31 -0600, Jack O'Quin wrote:
>> Nick Piggin <[email protected]> writes:
>>
>> > I'm a bit concerned about this kind of policy and breakage of
>> > userspace APIs going into the kernel. I mean, if an app
>> > succeeds in gaining SCHED_FIFO / SCHED_RR scheduling, and the
>> > scheduler does something else, that could be undesirable in some
>> > situations.
>>
>> True. It's similar to running out of CPU bandwidth, but not quite.
>>
>> AFAICT, the new behavior still meets the letter of the standard[1].
>> Whether it meets the spirit of the standard is debatable. My own
>> feeling is that it probably does, and that making SCHED_FIFO somewhat
>> less powerful but much easier to access is a reasonable tradeoff.
>
> I don't think it does. The process with the highest priority
> must run first, regardless of anything else.
>
> You could say "it is similar to running out of CPU bandwidth", or
> "it is like running on a CPU that is 80% the speed". And it may be
> similar in terms of total throughput.
I also said, "but not quite". The differences are noticeable, but I
believe they are manageable. I could still be convinced otherwise,
however, by a sufficiently persuasive argument. ;-)
I don't claim to know about every realtime system.
> But the important elements are lost. The standard provides a
> deterministic scheduling order, and a deterministic scheduling
> latency
Where does the standard actually say this? All I found was the
reference[1] in my previous note. It talks about queue management,
but not scheduling latency. The scheduling order is only defined to
be deterministic within the SCHED_FIFO class. Ingo's patch actually
conforms to this requirement fully, AFAICT.
[1] http://www.opengroup.org/onlinepubs/007908799/xsh/realtime.html
I'm not trying to be a "standards lawyer" about this, and I can see
your point of view. But, we should clearly distinguish what is
*required* by POSIX from what we feel to be good engineering practice.
Both are important.
> (of course this doesn't mean a great deal for Linux, but I think
> we're good enough for a lot of soft-rt applications now).
Absolutely. And getting noticeably better these days.
>> If I understand Ingo's proposal correctly, setting RLIMIT_RT_CPU to
>> zero and then requesting SCHED_FIFO (with CAP_SYS_NICE) yields exactly
>> the former behavior. This will probably be the default setting.
>
> Oh yes, and that would surely have to be the only sane default
> setting. But I then ask, why put this complexity into the kernel
> in the first place? And is the cost of significantly breaking the
> API and lying to userspace acceptable?
Well, this extremely long discussion started with a request on behalf
of a large group of audio developers and users for a way to gain
realtime scheduling privileges without running as root.
Several kernel developers felt that would be unacceptable, because of
the Denial of Service implications. We argued that the owner of a
Digital Audio Workstation should be free to lock up his CPU any time
he wants. But, no one would listen. We were told that we didn't
really know what we needed, and were asking the wrong question. That
was very discouraging. It looked like LKML was going to ignore our
needs for yet another year.
Had I been in charge, I would have adopted our silly little RT-LSM
patch, declared victory, and moved on to other matters. But, I'm not
responsible for this particular kernel. (Thank you, God!) So, I am
willing to defer to the prejudices of those who are. But, the limit
of my deference is reached when they refuse to meet my needs.
When Con and Ingo started floating scheduler proposals, I tried to
help them produce something useful. I think they have succeeded.
Is this the easiest way to meet our needs? No way. But it's not bad,
and I think it will work. In some ways, it's actually better than our
original RT-LSM proposal. Ingo's scheduler patch is not very large.
>> The main point of discussion is: exactly what resource should it
>> limit? Arjan and Chris proposed to limit priority. Ingo proposed to
>> limit the percentage of each CPU available for realtime threads
>> (collectively). Either would meet our minimum needs (AFAICT).
>
> I think the priority limit is the better way, because the "% CPU
> available to realtime threads" approach essentially gives you no
> realtime scheduling at all.
Setting RLIMIT_RT_CPU to 100% gives exactly what you are asking for.
In fact, I will probably lower it to 98%, taking advantage of the
scheduler's ability to throttle runaway threads. Locking up the CPU
is not a big problem in my life, but it can be annoying when it does
occur. The JACK watchdog does not catch every bug we see during
development.
> By limiting priority, you can at least construct userspace solutions
> to managing RT threads.
I think I can do it either way.
Further, I believe distributions will probably set a reasonably high
RLIMIT_RT_CPU for desktop users, allowing audio applications to work
quite well right out of the box.
I don't much like Mac OS X. It annoys the hell out of me that they
can run Linux audio applications so much better than Linux. According
to Stephane Letz (who did the JACK OSX port), their scheduler has a
similar throttle set at about 90%. It does not seem to be causing
them any practical problems that I can tell.
>> > And finally, with rlimit support, is there any reason why lockup
>> > detection and correction can't go into userspace? Even RT
>> > throttling could probably be done in a userspace daemon.
>>
>> It can. But, doing it in the kernel is more efficient, and probably
>> more reliable.
>
> Possibly. But the people who want a solution seem to be in a very
> small minority, and I'm not sure how much you care about efficiency.
I do care. The average overhead of running a watchdog daemon at max
priority is tiny. But, its impact on worst-case trigger latencies is
non-trivial and must be added to everything else in the RT subsystem.
--
joq
On Wed, 2005-01-26 at 23:15 -0600, Jack O'Quin wrote:
> Nick Piggin <[email protected]> writes:
>
> > But the important elements are lost. The standard provides a
> > deterministic scheduling order, and a deterministic scheduling
> > latency
>
> Where does the standard actually say this? All I found was the
> reference[1] in my previous note. It talks about queue management,
> but not scheduling latency. The scheduling order is only defined to
> be deterministic within the SCHED_FIFO class. Ingo's patch actually
> conforms to this requirement fully, AFAICT.
>
> [1] http://www.opengroup.org/onlinepubs/007908799/xsh/realtime.html
>
> I'm not trying to be a "standards lawyer" about this, and I can see
> your point of view. But, we should clearly distinguish what is
> *required* by POSIX from what we feel to be good engineering practice.
> Both are important.
Yes, I see what you mean. And in fact it does say that scheduling
of SCHED_OTHER tasks with SCHED_FIFO and SCHED_RR tasks is
implementation-specific; however, that must be within the framework
of their queued-scheduler specification.
And in Linux, sched_get_priority_max() and sched_get_priority_min()
for SCHED_OTHER tasks are both 0, which is below the minimum RT
policy's minimum priority. So in that case, the standard is broken
by the patch.
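(This is easy to verify with plain POSIX calls:)

#include <sched.h>
#include <stdio.h>

/* Print the ranges referred to above.  On Linux, SCHED_OTHER
 * typically reports 0..0 and SCHED_FIFO 1..99. */
int main(void)
{
        printf("SCHED_OTHER: %d..%d\n",
               sched_get_priority_min(SCHED_OTHER),
               sched_get_priority_max(SCHED_OTHER));
        printf("SCHED_FIFO:  %d..%d\n",
               sched_get_priority_min(SCHED_FIFO),
               sched_get_priority_max(SCHED_FIFO));
        return 0;
}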
>
> > (of course this doesn't mean a great deal for Linux, but I think
> > we're good enough for a lot of soft-rt applications now).
>
> Absolutely. And getting noticeably better these days.
>
> >> If I understand Ingo's proposal correctly, setting RLIMIT_RT_CPU to
> >> zero and then requesting SCHED_FIFO (with CAP_SYS_NICE) yields exactly
> >> the former behavior. This will probably be the default setting.
> >
> > Oh yes, and that would surely have to be the only sane default
> > setting. But I then ask, why put this complexity into the kernel
> > in the first place? And is the cost of significantly breaking the
> > API and lying to userspace acceptable?
>
> Well, this extremely long discussion started with a request on behalf
> of a large group of audio developers and users for a way to gain
> realtime scheduling privileges without running as root.
>
> Several kernel developers felt that would be unacceptable, because of
> the Denial of Service implications. We argued that the owner of a
> Digital Audio Workstation should be free to lock up his CPU any time
> he wants. But, no one would listen. We were told that we didn't
> really know what we needed, and were asking the wrong question. That
> was very discouraging. It looked like LKML was going to ignore our
> needs for yet another year.
>
> Had I been in charge, I would have adopted our silly little RT-LSM
> patch, declared victory, and moved on to other matters. But, I'm not
> responsible for this particular kernel. (Thank you, God!) So, I am
> willing to defer to the prejudices of those who are. But, the limit
> of my deference is reached when they refuse to meet my needs.
>
> When Con and Ingo started floating scheduler proposals, I tried to
> help them produce something useful. I think they have succeeded.
>
> Is this the easiest way to meet our needs? No way. But it's not bad,
> and I think it will work. In some ways, it's actually better than our
> original RT-LSM proposal. Ingo's scheduler patch is not very large.
>
I have sympathy for you. I'm just observing that if we add the
simple rlimit first, this solves your (and this large group of
audio developers') particular problem. In fact, this will be the
_quickest_ path to get your requirement into the kernel, too.
At which point, there doesn't appear to be a need for more
complexity (which is what us kernel developers would like to hear!).
And I still maintain that it is stupid to run two unrelated (ie.
neither has knowledge of the other) realtime systems on the same
computer. One or both could easily break. So I agree with your point
that we needn't make realtime scheduling "nice" for a multi-user
situation.
> >> The main point of discussion is: exactly what resource should it
> >> limit? Arjan and Chris proposed to limit priority. Ingo proposed to
> >> limit the percentage of each CPU available for realtime threads
> >> (collectively). Either would meet our minimum needs (AFAICT).
> >
> > I think the priority limit is the better way, because the % CPU
> > available to real time threads approach essentially gives you no
> > realtime scheduling at all.
>
> Setting RLIMIT_RT_CPU to 100% gives exactly what you are asking for.
>
But now I have no control over priorities. So root cannot be
guaranteed to have its highest-priority watchdog run.
> In fact, I will probably lower it to 98%, taking advantage of the
> scheduler's ability to throttle runaway threads. Locking up the CPU
> is not a big problem in my life, but it can be annoying when it does
> occur. The JACK watchdog does not catch every bug we see during
> development.
>
We've got Alt+SysRq+N now, which is nice for development.
> > By limiting priority, you can at least construct userspace solutions
> > to managing RT threads.
>
> I think I can do it either way.
>
> Further, I believe distributions will probably set a reasonably high
> RLIMIT_RT_CPU for desktop users, allowing audio applications to work
> quite well right out of the box.
>
> I don't much like Mac OS X. It annoys the hell out of me that they
> can run Linux audio applications so much better than Linux. According
> to Stephane Letz (who did the JACK OSX port), their scheduler has a
> similar throttle set at about 90%. It does not seem to be causing
> them any practical problems that I can tell.
>
Well in the context of a multi-user system, this really is a privileged
operation. Witness: a normal user isn't even allowed to raise the nice
priority of a normal task. Note that I think everyone agrees here, but
I'm just repeating the point.
> >> > And finally, with rlimit support, is there any reason why lockup
> >> > detection and correction can't go into userspace? Even RT
> >> > throttling could probably be done in a userspace daemon.
> >>
> >> It can. But, doing it in the kernel is more efficient, and probably
> >> more reliable.
> >
> > Possibly. But the people who want a solution seem to be in a very
> > small minority, and I'm not sure how much you care about efficiency.
>
> I do care. The average overhead of running a watchdog daemon at max
> priority is tiny. But, its impact on worst-case trigger latencies is
> non-trivial and must be added to everything else in the RT subsystem.
Oh yeah maybe. But you and the audio guys don't care about multi-user
security, so you don't need this.
Nick
* Nick Piggin <[email protected]> wrote:
> Well in the context of a multi-user system, this really is a
> privileged operation. Witness: a normal user isn't even allowed to
> raise the nice priority of a normal task. Note that I think everyone
> agrees here, but I'm just repeating the point.
i've seen this argument repeated a number of times, but i'd like to
point out that with the rlimit set to a sane value, a user can do 'less
damage' to the system via SCHED_FIFO than it could do via nice--20!
negative nice levels are a guaranteed way to monopolize the CPU.
SCHED_FIFO with throttling could at most be used to 'steal' CPU time up
to the threshold. Also, if a task 'runs away' in SCHED_FIFO mode it will
be efficiently throttled. While if it 'runs away' in nice--20 mode, it
will take away 95+% of the CPU time quite aggressively. Furthermore, more
nice--20 tasks will do much more damage (try thunk.c at nice--20!),
while more throttled SCHED_FIFO tasks only do damage to their own class
- the guaranteed share of SCHED_OTHER tasks (and privileged RT tasks) is
not affected.
so while it is true that in terms of priorities, throttled SCHED_FIFO
trumps all SCHED_OTHER tasks, in the "potential damage" sense
"throttled real-time" is less of a privilege than "nice--20".
Ingo
* Cal <[email protected]> wrote:
> Sorry for the delay, some sleep required. A build without SMP also
> fails, with multiple oops.
> <http://www.graggrag.com/200501271213-oops/>.
thanks, this pinpointed the bug - i've uploaded the -D8 patch to the
usual place:
http://redhat.com/~mingo/rt-limit-patches/
does it fix your crash? Mike Galbraith reported a crash too that i think
could be the same one.
Ingo
* Ingo Molnar <[email protected]> wrote:
> negative nice levels are a guaranteed way to monopolize the CPU.
> SCHED_FIFO with throttling could at most be used to 'steal' CPU time
> up to the threshold. Also, if a task 'runs away' in SCHED_FIFO mode it
> will be efficiently throttled. While if it 'runs away' in nice--20
> mode, it will take away 95+% of the CPU time quite aggressively.
> Furthermore, more nice--20 tasks will do much more damage (try thunk.c
> at nice--20!), while more throttled SCHED_FIFO tasks only do damage to
> their own class - the guaranteed share of SCHED_OTHER tasks (and
> privileged RT tasks) is not affected.
furthermore, the current way of throttling SCHED_FIFO tasks that violate
the limit makes it less likely that application writers would abuse the
feature with CPU-intensive apps, because _if_ you violate the limit then
the penalty is high. E.g. a blatant violation of the limit via a pure
CPU loop ends up getting much less CPU time than even the limit would
allow for. For audio/RT apps this is fine, because they must plan their
CPU overhead anyway - they are a much more controlled environment and
just do things properly to avoid the penalty.
Ingo
* Jack O'Quin <[email protected]> wrote:
> Well, this extremely long discussion started with a request on behalf
> of a large group of audio developers and users for a way to gain
> realtime scheduling privileges without running as root.
>
> Several kernel developers felt that would be unacceptable, because of
> the Denial of Service implications. We argued that the owner of a
> Digital Audio Workstation should be free to lock up his CPU any time
> he wants. But, no one would listen. [...]
i certainly listened, but that didnt make the RT-LSM proposal better in
any way!
> Had I been in charge, I would have adopted our silly little RT-LSM
> patch, declared victory, and moved on to other matters. [...]
let me put this very bluntly: this is precisely how bad OSs get written.
Half-done features with known limitations piling up. One or two
crappy features might not hurt as badly, but when you start having
dozens of them it hurts big time. With time, crap _does_ pile up. So in
the Linux core code we have zero tolerance on crap. We are doing this
for the long-term fun of it.
repeat after me: we dont apply half-assed measures just to solve a
problem in a short-sighted way! And repeat after me: upstream reviewers
or maintainers are not _obliged_ to provide the proper solution either!
The Linux acceptance process is not about "whose patch sucks least", but
whether it hits a subsystem-specific bar of architectural requirements
or not. RT-LSM didnt hit the bar, end of story. You are always free to
write something proper (or convince/pay someone to write it for you)
which _will_ be accepted. Also, you are always free to replace code or
maintainers by writing/maintaining the code in a better way.
and if nobody ends up writing the 'proper' solution then there probably
wasnt enough demand to begin with ... We'll rather live on with one less
feature for another year than with a crappy feature that is twice as
hard to get rid of!
you might ask yourself, 'why is this so, and why cannot the Linux guys
apply pretty much any hack as e.g. userspace code might': the difference
in the case of the kernel is compatibility with thousands of
applications and millions of users. If we expose some method to
applications then that decision has to stick in 99.999% of the cases.
It's awfully hard to get rid of a bad, user-visible feature. Also,
kernel code, especially the scheduler (and the security subsystem) is
complex, interdependent, performance-sensitive and is modified very
frequently - any bad decision lives with us for long, and hinders us for
long.
ironically, i believe you are providing an additional proof that the
Linux development process worked in this particular case:
> When Con and Ingo started floating scheduler proposals, I tried to
> help them produce something useful. I think they have succeeded.
>
> Is this the easiest way to meet our needs? No way. But it's not bad,
> and I think it will work. In some ways, it's actually better than our
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> original RT-LSM proposal. Ingo's scheduler patch is not very large.
^^^^^^^^^^^^^^^^^^^^^^^^^
here you can see the concept at work: only _because_ RT-LSM was not
accepted did Con's and my work get any chance!
Put your hand on your heart and tell me, assuming RT-LSM went in, and in
a year's time i'd write the rlimit patch, would you even test my rlimit
patch with your audio apps? Or would you have told me 'sorry, RT-LSM is
good enough for our purposes and i dont have time to test this now'.
What would you have told me if my patch also removed RT-LSM (because it
implemented a superior method and i wanted to get rid of the crap)?
Would you possibly have cried bloody murder about breaking backwards
compatibility with audio apps?
so we have two different goals. You want a feature. We want robust
features. Those goals do meet in that we both want features, but our
interests are exactly the opposite in terms of quality and timeframe:
you want a solution that meets _your_ needs, as soon as possible - while
we want a longterm solution. (i.e. a solution that is as generic, clean,
maintainable and robust as possible.)
(Dont misunderstand me, this is not any 'fault' of yours, this is simply
your interest: you are not hacking kernel code so you are at most
compassionate with our needs, but you are not directly affected by
kernel code quality issues.)
(also, believe me, this is not arrogance or some kind of game on our
part. If there was a nice clean solution that solved your and others'
problems equally well then it would already be in Linus' tree. But there
is no such solution yet, at the moment. Moreover, the pure fact that so
many patch proposals exist and none looks dominantly convincing shows
that this is a problem area for which there are no easy solutions. We
hate such moments just as much as you do, but they do happen.)
Ingo
Ingo Molnar wrote:
> thanks, this pinpointed the bug - i've uploaded the -D8 patch to the
> usual place:
>
> http://redhat.com/~mingo/rt-limit-patches/
>
> does it fix your crash? Mike Galbraith reported a crash too that i think
> could be the same one.
Yep, with D8 and SMP the test completes successfully.
cheers, Cal
At 09:51 AM 1/27/2005 +0100, Ingo Molnar wrote:
>* Cal <[email protected]> wrote:
>
> > Sorry for the delay, some sleep required. A build without SMP also
> > fails, with multiple oops.
> > <http://www.graggrag.com/200501271213-oops/>.
>
>thanks, this pinpointed the bug - i've uploaded the -D8 patch to the
>usual place:
>
> http://redhat.com/~mingo/rt-limit-patches/
>
>does it fix your crash? Mike Galbraith reported a crash too that i think
>could be the same one.
Yeah, my crash log is 120KB longer, but it looks the same, and is also
fixed by D8.
-Mike
On Wed, 2005-01-26 at 23:15 -0600, Jack O'Quin wrote:
> >> > And finally, with rlimit support, is there any reason why lockup
> >> > detection and correction can't go into userspace? Even RT
> >> > throttling could probably be done in a userspace daemon.
> >>
> >> It can. But, doing it in the kernel is more efficient, and probably
> >> more reliable.
> >
> > Possibly. But the people who want a solution seem to be in a very
> > small minority, and I'm not sure how much you care about efficiency.
>
> I do care. The average overhead of running a watchdog daemon at max
> priority is tiny. But, its impact on worst-case trigger latencies is
> non-trivial and must be added to everything else in the RT subsystem.
Keep in mind that with the max RT prio rlimit solution audio systems
using JACK would not even need the external watchdog thread, because
JACK already has its own watchdog thread. I also like the max RT prio
rlimit approach with (optional) watchdog thread running as root because
it really seems to be the least intrusive out of several good solutions
to the problem and puts policy details (when to throttle an RT thread)
in userspace.
Lee
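For reference, a rough sketch (not JACK's actual code) of the watchdog
idea Lee mentions: a top-priority SCHED_FIFO thread checks a heartbeat
that the RT worker must keep updating, and demotes the worker back to
SCHED_OTHER if it stops making progress. All the priorities, intervals
and names here are illustrative, and starting it needs RT privileges in
the first place:

#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static volatile int heartbeat;
static pthread_t worker;

static void *worker_fn(void *arg)
{
	for (;;) {
		heartbeat++;		/* real RT work would go here */
		usleep(10000);
	}
	return NULL;
}

static void *watchdog_fn(void *arg)
{
	int last = heartbeat;

	for (;;) {
		sleep(1);
		if (heartbeat == last) {
			/* no progress: assume a runaway RT loop, demote it */
			struct sched_param sp = { .sched_priority = 0 };
			pthread_setschedparam(worker, SCHED_OTHER, &sp);
			fprintf(stderr, "watchdog: demoted runaway thread\n");
		}
		last = heartbeat;
	}
	return NULL;
}

int main(void)
{
	pthread_t dog;
	struct sched_param sp = { .sched_priority = 10 };

	pthread_create(&worker, NULL, worker_fn, NULL);
	pthread_setschedparam(worker, SCHED_FIFO, &sp);

	sp.sched_priority = 90;	/* the watchdog must outrank the worker */
	pthread_create(&dog, NULL, watchdog_fn, NULL);
	pthread_setschedparam(dog, SCHED_FIFO, &sp);

	pthread_join(dog, NULL);
	return 0;
}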
* Nick Piggin <[email protected]> wrote:
> But the important elements are lost. The standard provides a
> deterministic scheduling order, and a deterministic scheduling latency
> (of course this doesn't mean a great deal for Linux, but I think we're
> good enough for a lot of soft-rt applications now).
>
> > [1] http://www.opengroup.org/onlinepubs/007908799/xsh/realtime.html
no, the patch does not break POSIX. POSIX compliance means that there is
an environment that meets POSIX. Any default install of Linux 'breaks'
POSIX in a dozen ways, you have to take a number of steps to get a
strict, pristine POSIX environment. The only thing that changes is that
now you have to add "set RT_CPU ulimit to 0 or 100" to that (long) list
of things.
Ingo
* Nick Piggin <[email protected]> wrote:
> And finally, with rlimit support, is there any reason why lockup
> detection and correction can't go into userspace? Even RT throttling
> could probably be done in a userspace daemon.
that is correct. Jackd already has a watchdog thread, against lockups.
i'm wondering, couldnt Jackd solve this whole issue completely in
user-space, via a simple setuid-root wrapper app that does nothing else
but validates whether the user is in the 'jackd' group and then keeps a
pipe open to to the real jackd process which it forks off, deprivileges
and exec()s? Then unprivileged jackd could request RT-priority changes
via that pipe in a straightforward way. Jack normally gets installed as
root/admin anyway, so it's not like this couldnt be done.
Ingo
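A rough sketch of what such a wrapper might look like (this is not an
existing tool; the "jackd" group name, the "pid priority" request
format and the minimal error handling are all made up for
illustration - real code would check supplementary groups and
permissions much more carefully):

#include <grp.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

static int in_group(gid_t gid)
{
	gid_t groups[64];
	int i, n = getgroups(64, groups);

	for (i = 0; i < n; i++)
		if (groups[i] == gid)
			return 1;
	return getgid() == gid;
}

int main(int argc, char **argv)
{
	struct group *gr = getgrnam("jackd");	/* made-up group name */
	int pfd[2], target, prio;
	FILE *req;

	if (!gr || !in_group(gr->gr_gid)) {
		fprintf(stderr, "not in the jackd group\n");
		return 1;
	}
	if (pipe(pfd) < 0)
		return 1;

	if (fork() == 0) {
		/* child: drop root, then exec the real jackd;
		 * it would inherit pfd[1] for priority requests */
		close(pfd[0]);
		setgid(getgid());
		setuid(getuid());
		execvp("jackd", argv);
		return 1;
	}

	/* parent: stays root and grants SCHED_FIFO on request */
	close(pfd[1]);
	req = fdopen(pfd[0], "r");
	while (req && fscanf(req, "%d %d", &target, &prio) == 2) {
		struct sched_param sp = { .sched_priority = prio };
		sched_setscheduler(target, SCHED_FIFO, &sp);
	}
	return 0;
}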
Ingo Molnar <[email protected]> writes:
> * Nick Piggin <[email protected]> wrote:
>
>> But the important elements are lost. The standard provides a
>> deterministic scheduling order, and a deterministic scheduling latency
>> (of course this doesn't mean a great deal for Linux, but I think we're
>> good enough for a lot of soft-rt applications now).
>>
>> > [1] http://www.opengroup.org/onlinepubs/007908799/xsh/realtime.html
>
> no, the patch does not break POSIX. POSIX compliance means that there is
> an environment that meets POSIX. Any default install of Linux 'breaks'
> POSIX in a dozen ways, you have to take a number of steps to get a
> strict, pristine POSIX environment. The only thing that changes is that
> now you have to add "set RT_CPU ulimit to 0 or 100" to that (long) list
> of things.
I agree. Is the rest of this list documented somewhere?
--
joq
Ingo Molnar <[email protected]> writes:
> * Nick Piggin <[email protected]> wrote:
>
>> And finally, with rlimit support, is there any reason why lockup
>> detection and correction can't go into userspace? Even RT throttling
>> could probably be done in a userspace daemon.
>
> that is correct. Jackd already has a watchdog thread, against lockups.
>
> i'm wondering, couldnt Jackd solve this whole issue completely in
> user-space, via a simple setuid-root wrapper app that does nothing else
> but validates whether the user is in the 'jackd' group and then keeps a
> pipe open to the real jackd process which it forks off, deprivileges
> and exec()s? Then unprivileged jackd could request RT-priority changes
> via that pipe in a straightforward way. Jack normally gets installed as
> root/admin anyway, so it's not like this couldnt be done.
Perhaps.
Until recently, that didn't work because of the longstanding rlimits
bug in mlockall(). For scheduling only, it might be possible.
Of course, this violates your requirement that the user not be able to
lock up the CPU for DoS. The jackd watchdog is not perfect.
--
joq
* Jack O'Quin <[email protected]> wrote:
> > i'm wondering, couldnt Jackd solve this whole issue completely in
> > user-space, via a simple setuid-root wrapper app that does nothing else
> > but validates whether the user is in the 'jackd' group and then keeps a
> > pipe open to the real jackd process which it forks off, deprivileges
> > and exec()s? Then unprivileged jackd could request RT-priority changes
> > via that pipe in a straightforward way. Jack normally gets installed as
> > root/admin anyway, so it's not like this couldnt be done.
>
> Perhaps.
>
> Until recently, that didn't work because of the longstanding rlimits
> bug in mlockall(). For scheduling only, it might be possible.
>
> Of course, this violates your requirement that the user not be able to
> lock up the CPU for DoS. The jackd watchdog is not perfect.
there is a legitimate fear that if it's made "too easy" to acquire some
sort of SCHED_FIFO priority, that an "arms race" would begin between
desktop apps, each trying to set themselves to SCHED_FIFO (or SCHED_ISO)
and advising users to 'raise the limit if they see delays' - just to get
snappier than the rest.
thus after a couple of years we'd end up with lots of desktop apps
running as SCHED_FIFO, and latency would go down the drain again.
(yeah, this feels like going back to the drawing board.)
Ingo
Ingo Molnar <[email protected]> writes:
> * Jack O'Quin <[email protected]> wrote:
>
>> > i'm wondering, couldnt Jackd solve this whole issue completely in
>> > user-space, via a simple setuid-root wrapper app that does nothing else
>> > but validates whether the user is in the 'jackd' group and then keeps a
>> > pipe open to the real jackd process which it forks off, deprivileges
>> > and exec()s? Then unprivileged jackd could request RT-priority changes
>> > via that pipe in a straightforward way. Jack normally gets installed as
>> > root/admin anyway, so it's not like this couldnt be done.
>>
>> Perhaps.
>>
>> Until recently, that didn't work because of the longstanding rlimits
>> bug in mlockall(). For scheduling only, it might be possible.
>>
>> Of course, this violates your requirement that the user not be able to
>> lock up the CPU for DoS. The jackd watchdog is not perfect.
>
> there is a legitimate fear that if it's made "too easy" to acquire some
> sort of SCHED_FIFO priority, that an "arms race" would begin between
> desktop apps, each trying to set themselves to SCHED_FIFO (or SCHED_ISO)
> and advising users to 'raise the limit if they see delays' - just to get
> snappier than the rest.
>
> thus after a couple of years we'd end up with lots of desktop apps
> running as SCHED_FIFO, and latency would go down the drain again.
I wonder how Mac OS X and Windows deal with this priority escalation
problem? Is it real or only theoretical?
--
joq
* Jack O'Quin <[email protected]> wrote:
> > thus after a couple of years we'd end up with lots of desktop apps
> > running as SCHED_FIFO, and latency would go down the drain again.
>
> I wonder how Mac OS X and Windows deal with this priority escalation
> problem? Is it real or only theoretical?
no idea. Anyone with MacOSX/Windows application writing experience? :-|
Ingo
At 03:01 AM 1/28/2005 -0600, Jack O'Quin wrote:
>Ingo Molnar <[email protected]> writes:
>
> > * Jack O'Quin <[email protected]> wrote:
> >
> >> > i'm wondering, couldnt Jackd solve this whole issue completely in
> >> > user-space, via a simple setuid-root wrapper app that does nothing else
> >> > but validates whether the user is in the 'jackd' group and then keeps a
> >> > pipe open to the real jackd process which it forks off, deprivileges
> >> > and exec()s? Then unprivileged jackd could request RT-priority changes
> >> > via that pipe in a straightforward way. Jack normally gets installed as
> >> > root/admin anyway, so it's not like this couldnt be done.
> >>
> >> Perhaps.
> >>
> >> Until recently, that didn't work because of the longstanding rlimits
> >> bug in mlockall(). For scheduling only, it might be possible.
> >>
> >> Of course, this violates your requirement that the user not be able to
> >> lock up the CPU for DoS. The jackd watchdog is not perfect.
> >
> > there is a legitimate fear that if it's made "too easy" to acquire some
> > sort of SCHED_FIFO priority, that an "arms race" would begin between
> > desktop apps, each trying to set themselves to SCHED_FIFO (or SCHED_ISO)
> > and advising users to 'raise the limit if they see delays' - just to get
> > snappier than the rest.
> >
> > thus after a couple of years we'd end up with lots of desktop apps
> > running as SCHED_FIFO, and latency would go down the drain again.
>
>I wonder how Mac OS X and Windows deal with this priority escalation
>problem? Is it real or only theoretical?
WRT the Mac, I thought OS X was created to cure the ills of cooperative
multi-tasking. (which the priority arms race reminds me of)
-Mike
Ingo Molnar wrote:
> * Jack O'Quin <[email protected]> wrote:
>
>
>>>i'm wondering, couldnt Jackd solve this whole issue completely in
>>>user-space, via a simple setuid-root wrapper app that does nothing else
>>>but validates whether the user is in the 'jackd' group and then keeps a
>>pipe open to the real jackd process which it forks off, deprivileges
>>>and exec()s? Then unprivileged jackd could request RT-priority changes
>>>via that pipe in a straightforward way. Jack normally gets installed as
>>>root/admin anyway, so it's not like this couldnt be done.
>>
>>Perhaps.
>>
>>Until recently, that didn't work because of the longstanding rlimits
>>bug in mlockall(). For scheduling only, it might be possible.
>>
>>Of course, this violates your requirement that the user not be able to
>>lock up the CPU for DoS. The jackd watchdog is not perfect.
>
>
> there is a legitimate fear that if it's made "too easy" to acquire some
> sort of SCHED_FIFO priority, that an "arms race" would begin between
> desktop apps, each trying to set themselves to SCHED_FIFO (or SCHED_ISO)
> and advising users to 'raise the limit if they see delays' - just to get
> snappier than the rest.
>
> thus after a couple of years we'd end up with lots of desktop apps
> running as SCHED_FIFO, and latency would go down the drain again.
>
> (yeah, this feels like going back to the drawing board.)
I think part of the problem here is that by comparing each task's limit
to the runqueue's usage rate (and to some extent using a relatively
short decay period) you're creating the need for the limits to be quite
large, i.e. each limit has to be big enough to exceed the combined usage
rates of all the unprivileged real time tasks and also to handle the
short term usage rate peaks of the task.
If the average usage rate is estimated over longer periods it will be
lower, allowing lower limits to be used. Also if the task's own usage
rate estimates are used to test the limits then the limit can be lower.
If the default limits can be made sufficiently small then the temptation
to use this feature by "ordinary" applications will disappear.
I'm not an expert but I imagine that the CPU usage rates of most RT
tasks taken over reasonably long time intervals are quite low and
therefore the default limits could also be quite low without adversely
affecting the programs that this mechanism is meant to help.
The sched_cpustats.[ch] files that are part of my SPA scheduler patches
provide a cheap method of estimating per-task usage rates. They
estimate usage rates for a task over its recent scheduling cycles but
could be modified to provide updates every tick for the currently active
task for use with this mechanism.
Peter
--
Peter Williams [email protected]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
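A toy sketch of the kind of per-task estimate Peter describes: an
exponentially decaying average of CPU use in fixed point, updated once
per tick. The constants and names are made up for illustration and are
not taken from the SPA patches:

#define USAGE_SHIFT	10		/* fixed point: 1024 == 100% CPU */
#define DECAY_NUM	255		/* decay by 255/256 each tick */
#define DECAY_DEN	256

struct task_usage {
	unsigned long avg;		/* decayed average, fixed point */
};

/* ran is 1 if the task used this tick, 0 otherwise */
static void update_usage(struct task_usage *u, int ran)
{
	u->avg = (u->avg * DECAY_NUM) / DECAY_DEN
		+ ((ran << USAGE_SHIFT) / DECAY_DEN);
}

/* limit_pct is the task's cap as a percentage, e.g. 70 */
static int over_limit(const struct task_usage *u, int limit_pct)
{
	return u->avg > (unsigned long)(limit_pct << USAGE_SHIFT) / 100;
}

A longer decay period just means a larger DECAY_NUM/DECAY_DEN ratio,
which is what would let the limit itself be set lower.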
* Peter Williams <[email protected]> wrote:
> I think part of the problem here is that by comparing each task's limit
> to the runqueue's usage rate (and to some extent using a relatively
> short decay period) you're creating the need for the limits to be
> quite large, i.e. each limit has to be big enough to exceed the
> combined usage rates of all the unprivileged real time tasks and also
> to handle the short term usage rate peaks of the task.
actually, at least for Jackd use, the current average worked out pretty
well - setting the limit 5-10% above that of the reported average CPU
use gave a result that was equivalent to unrestricted SCHED_FIFO
results.
Ingo
On Fri, 2005-01-28 at 10:11 +0100, Ingo Molnar wrote:
> * Jack O'Quin <[email protected]> wrote:
>
> > > thus after a couple of years we'd end up with lots of desktop apps
> > > running as SCHED_FIFO, and latency would go down the drain again.
> >
> > I wonder how Mac OS X and Windows deal with this priority escalation
> > problem? Is it real or only theoretical?
>
> no idea. Anyone with MacOSX/Windows application writing experience? :-|
>
Here's the description from Apple.
(from
http://developer.apple.com/documentation/Darwin/Conceptual/KernelProgramming/scheduler/chapter_8_section_4.html):
However, according to Stéphane Letz, who ported JACK to OSX, this does NOT
describe the reality of the current implementation - it's not a real
deadline scheduler. "period" and "constraint" are ignored, RT tasks are
scheduled round robin, and the scheduler just uses "computation" as the
timeslice. If an RT task repeatedly uses its entire timeslice without
blocking, the scheduler can demote the task to SCHED_NORMAL.
Audio apps do not normally set these parameters directly; the CoreAudio
backend handles it.
(quoting Stéphane Letz)
> For example in CoreAudio, the computation value is directly related
> to the audio buffer size in the following way:
>
> buffer size      computation
>
>     64 frames      500 us
>    128 frames      300 us
> >= 256 frames      100 us
>
> The idea is that threads with smaller buffer size will get a larger
> computation slice so that there is a chance they can complete their
> jobs. Threads with larger buffer size are more interruptible. The
> CoreMidi thread (to handle incoming Midi events) also has a
> computation value of 500 us.
> Other RT threads like Firewire and various system threads also have
> carefully chosen computation values.
(This was from a private mail thread that led to Con's SCHED_ISO patches;
if all the participants agree I will post a link to the full thread because
it answers many questions that are sure to come up on LKML.)
So this system *requires* an app to tell the kernel in advance what its
RT constraints are, then revokes isochronous scheduling privileges if
the task lied. This would require a new API. Furthermore I suspect
that these "System" threads aren't subject to having their RT privileges
revoked, and that the GUI gets special treatment, etc.
The upshot is that while the OSX system works in that environment, it's
largely due to Apple controlling the kernel and a lot of userspace. OSX
is useful as a model of what a good API for soft realtime support in a
desktop OS would look like. But we are a general purpose OS so we
certainly need a more general solution.
Lee
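For concreteness, this is roughly what that API looks like: a sketch
using Apple's Mach thread_policy_set() interface, with period,
computation and constraint converted from nanoseconds to Mach absolute
time units. The numbers echo the CoreAudio table above and are not
taken from any real application:

#include <stdint.h>
#include <mach/mach.h>
#include <mach/mach_time.h>
#include <mach/thread_policy.h>

static kern_return_t set_time_constraint(void)
{
	struct thread_time_constraint_policy pol;
	mach_timebase_info_data_t tb;

	mach_timebase_info(&tb);
	/* convert nanoseconds to Mach absolute time units */
	pol.period      = (uint64_t)1450000 * tb.denom / tb.numer;
	pol.computation = (uint64_t) 500000 * tb.denom / tb.numer;
	pol.constraint  = pol.period;
	pol.preemptible = 1;

	return thread_policy_set(mach_thread_self(),
				 THREAD_TIME_CONSTRAINT_POLICY,
				 (thread_policy_t)&pol,
				 THREAD_TIME_CONSTRAINT_POLICY_COUNT);
}

int main(void)
{
	return set_time_constraint() == KERN_SUCCESS ? 0 : 1;
}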
Peter Williams <[email protected]> writes:
>
> If the average usage rate is estimated over longer periods it will be
> lower, allowing lower limits to be used. Also if the task's own usage
> rate estimates are used to test the limits then the limit can be lower.
>
> If the default limits can be made sufficiently small then the
> temptation to use this feature by "ordinary" applications will
> disappear.
>
> I'm not an expert but I imagine that the CPU usage rates of most RT
> tasks taken over reasonably long time intervals are quite low and
> therefore the default limits could also be quite low without adversely
> affecting the programs that this mechanism is meant to help.
True for some, but definitely not for all.
When a system was purchased specifically to do some realtime job, it
often makes sense to dedicate large chunks of the main processor to
realtime number crunching. Mass-produced general-purpose processors
have excellent price/performance ratios. There's no good reason not
to take advantage of that.
People commonly run heavy Fast Fourier Transform or reverb
calculations in realtime threads. They may use up as much of the CPU
as the user/owner is willing to allocate. With soft realtime, it's
hard to push this reliably beyond about 70-80%. But, those numbers
are definitely practical.
--
joq
Ingo Molnar wrote:
> * Jack O'Quin <[email protected]> wrote:
>
>
>>>i'm wondering, couldnt Jackd solve this whole issue completely in
>>>user-space, via a simple setuid-root wrapper app that does nothing else
>>>but validates whether the user is in the 'jackd' group and then keeps a
>>>pipe open to the real jackd process which it forks off, deprivileges
>>>and exec()s? Then unprivileged jackd could request RT-priority changes
>>>via that pipe in a straightforward way. Jack normally gets installed as
>>>root/admin anyway, so it's not like this couldnt be done.
>>
>>Perhaps.
>>
>>Until recently, that didn't work because of the longstanding rlimits
>>bug in mlockall(). For scheduling only, it might be possible.
>>
>>Of course, this violates your requirement that the user not be able to
>>lock up the CPU for DoS. The jackd watchdog is not perfect.
>
>
> there is a legitimate fear that if it's made "too easy" to acquire some
> sort of SCHED_FIFO priority, that an "arms race" would begin between
> desktop apps, each trying to set themselves to SCHED_FIFO (or SCHED_ISO)
> and advising users to 'raise the limit if they see delays' - just to get
> snappier than the rest.
>
> thus after a couple of years we'd end up with lots of desktop apps
> running as SCHED_FIFO, and latency would go down the drain again.
>
> (yeah, this feels like going back to the drawing board.)
The problem hasn't changed in a few decades, and neither has the urge of
developers to make their app look good at the expense of the rest of the
system. Been there and done that myself.
"Back when" we had no good tools except to raise priority and drop
timeslice if a process blocked for i/o and vice-versa if it used the
whole timeslice. The amazing thing is that it worked reasonably well as
long as no one was there who knew how to cook the books the scheduler
used. And the user could hold off interrupts for up to 16ms, just to
make it worse.
--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me
Ingo Molnar wrote:
> * Peter Williams <[email protected]> wrote:
>
>
>>As I understand this (and I may be wrong), the intention is that if a
>>task has its RT_CPU_RATIO rlimit set to a value greater than zero then
>>setting its scheduling policy to SCHED_RR or SCHED_FIFO is allowed.
>
>
> correct.
>
>
>>This causes me to ask the following questions:
>>
>>1. Why is current->signal->rlim[RLIMIT_RT_CPU_RATIO].rlim_cur being used
>>in setscheduler() instead of p->signal->rlim[RLIMIT_RT_CPU_RATIO].rlim_cur?
>>
>>2. What stops a task that had a non zero RT_CPU_RATIO rlimit and
>>changed its policy to SCHED_RR or SCHED_FIFO from then setting
>>RT_CPU_RATIO rlimit back to zero and escaping the controls? As far as
>>I can see (and, once again, I may be wrong) the mechanism for setting
>>rlimits only requires CAP_SYS_RESOURCE privileges in order to increase
>>the value.
>
>
> you are right, both are bugs.
>
> i've uploaded the -D6 patch that should have both fixed:
>
> http://redhat.com/~mingo/rt-limit-patches/
>
I've just noticed what might be a bug in the original code. Shouldn't
the following:
if ((current->euid != p->euid) && (current->euid != p->uid) &&
!capable(CAP_SYS_NICE))
be:
if ((current->uid != p->uid) && (current->euid != p->uid) &&
!capable(CAP_SYS_NICE))
I.e. if the real or effective uid of the task doing the setting is not
the same as the uid of the target task it is not allowed to change the
target task's policy unless it is specially privileged.
Peter
--
Peter Williams [email protected]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
On Mon, Jan 31, 2005 at 05:29:10PM -0500, Bill Davidsen wrote:
> The problem hasn't changed in a few decades, and neither has the urge of
> developers to make their app look good at the expense of the rest of the
> system. Been there and done that myself.
>
> "Back when" we had no good tools except to raise priority and drop
> timeslice if a process blocked for i/o and vice-versa if it used the
> whole timeslice. The amazing thing is that it worked reasonably well as
> long as no one was there who knew how to cook the books the scheduler
> used. And the user could hold off interrupts for up to 16ms, just to
> make it worse.
A lot of this scheduling policy work is going to have to be redone as
badly written apps start getting their crap together and as this patch
becomes more and more pervasive in the general Linux community. What's
happening now is only the beginning of things to come and it'll require
a solid sample application with even more hooks into the kernel before
we'll see the real benefits of this patch. SCHED_FIFO will have to do
until more development happens with QoS style policies.
bill
* Peter Williams <[email protected]> wrote:
> I've just noticed what might be a bug in the original code. Shouldn't
> the following:
>
> if ((current->euid != p->euid) && (current->euid != p->uid) &&
> !capable(CAP_SYS_NICE))
>
> be:
>
> if ((current->uid != p->uid) && (current->euid != p->uid) &&
> !capable(CAP_SYS_NICE))
>
> I.e. if the real or effective uid of the task doing the setting is not
> the same as the uid of the target task it is not allowed to change the
> target task's policy unless it is specially privileged.
no. The original code is quite logical: when doing something to others,
only the euid is taken into account. When others do something to you,
both the uid and the euid is checked ('others' might have no idea about
this task temporarily changing the euid to a less/more privileged uid).
So sys_setscheduler() [and sys_setaffinity(), which does the same] is
fine.
what _is_ inconsistent is kernel/sys.c's setpriority()/set_one_prio().
It checks current->euid|uid against p->uid, which makes little sense,
but is how we've been doing it ever since. It's a Linux quirk documented
in the manpage. To make things funnier, SuS requires current->euid|uid
match against p->euid.
The patch below fixes it (and brings the logic in line with what
setscheduler()/setaffinity() does), but if we do it then it should be
done only in 2.6.12 or later, after good exposure in -mm.
(Worst-case this could break an application but i highly doubt it: it at
most could deny renicing another task to positive (or in very rare
cases, to negative) nice values, and no application should crash on
something like that, normally.)
Ingo
--
fix a setpriority() Linux quirk, implement euid semantics correctly.
Signed-off-by: Ingo Molnar <[email protected]>
--- linux/kernel/sys.c.orig
+++ linux/kernel/sys.c
@@ -216,12 +216,13 @@ int unregister_reboot_notifier(struct no
}
EXPORT_SYMBOL(unregister_reboot_notifier);
+
static int set_one_prio(struct task_struct *p, int niceval, int error)
{
int no_nice;
if (p->uid != current->euid &&
- p->uid != current->uid && !capable(CAP_SYS_NICE)) {
+ p->euid != current->euid && !capable(CAP_SYS_NICE)) {
error = -EPERM;
goto out;
}
Ingo Molnar wrote:
> * Peter Williams <[email protected]> wrote:
>
>
>>I've just noticed what might be a bug in the original code. Shouldn't
>>the following:
>>
>> if ((current->euid != p->euid) && (current->euid != p->uid) &&
>> !capable(CAP_SYS_NICE))
>>
>>be:
>>
>> if ((current->uid != p->uid) && (current->euid != p->uid) &&
>> !capable(CAP_SYS_NICE))
>>
>>I.e. if the real or effective uid of the task doing the setting is not
>>the same as the uid of the target task it is not allowed to change the
>>target task's policy unless it is specially privileged.
>
>
> no. The original code is quite logical: when doing something to others,
> only the euid is taken into account. When others do something to you,
> both the uid and the euid are checked ('others' might have no idea about
> this task temporarily changing the euid to a less/more privileged uid).
> So sys_setscheduler() [and sys_setaffinity(), which does the same] is
> fine.
I disagree. Logically, the only privileges that should be relevant are
those of the task making the change i.e. those of current. The
effective uid of the target task is irrelevant. If only the effective
uid counts for determining what a task can do then the statement can be
simplified to:
if ((current->euid != p->uid) && !capable(CAP_SYS_NICE))
but I think that this is wrong as having an effective uid that is
different from the real uid doesn't cancel the privileges associated with
the real uid.
The way uid and effective uid affect a task's operations on files is a
guide to their logical application in cases like this.
Since this operation is usually performed by a task on itself it
probably doesn't matter. But when task A is trying to change the policy
of task B then it is the privileges of the task A that count and all
that is relevant about task B is who its real owner is.
For example, if a user A ran a program that was setuid to an
unprivileged user B then A would be unable (from within that program) to
change the policy of any RT tasks that she had running to SCHED_NORMAL
unless those tasks were also setuid to B. I agree that this is a highly
unlikely circumstance but it does serve to illustrate the point w.r.t.
what is logical.
>
> what _is_ inconsistent is kernel/sys.c's setpriority()/set_one_prio().
>
> It checks current->euid|uid against p->uid, which makes little sense,
I think that this is correct.
> but is how we've been doing it ever since. It's a Linux quirk documented
> in the manpage. To make things funnier, SuS requires current->euid|uid
> match against p->euid.
I've just read the man page and I think that the description of what
Linux does (or is supposed to do) is correct/logical and that the other
behaviors described are (very) illogical. In any case, the change that
I suggested will make the behaviour match what the man page says.
>
> The patch below fixes it (and brings the logic in line with what
> setscheduler()/setaffinity() does), but if we do it then it should be
> done only in 2.6.12 or later, after good exposure in -mm.
>
> (Worst-case this could break an application but i highly doubt it: it at
> most could deny renicing another task to positive (or in very rare
> cases, to negative) nice values, which no application should crash on
> something like that, normally.)
Yes, it's fairly benign. Especially as the most common use of this would be
tasks calling it on themselves. I'm not going to lose any sleep over it :-)
Peter
--
Peter Williams [email protected]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
Ingo,
I hope we can get past this anger and continue working together. We
have too much to gain by cooperating. It would be a shame to let hurt
feelings get in the way for either of us.
> * Jack O'Quin <[email protected]> wrote:
>> Well, this extremely long discussion started with a request on behalf
>> of a large group of audio developers and users for a way to gain
>> realtime scheduling privileges without running as root.
>>
>> Several kernel developers felt that would be unacceptable, because of
>> the Denial of Service implications. We argued that the owner of a
>> Digital Audio Workstation should be free to lock up his CPU any time
>> he wants. But, no one would listen. [...]
Ingo Molnar <[email protected]> writes:
> i certainly listened, but that didnt make the RT-LSM proposal better in
> any way!
You co-opted the whole discussion in the direction you already wanted
to go. To me, that isn't listening.
> So in the Linux core code we have zero tolerance on crap. We are
> doing this for the long-term fun of it.
So, we should never do anything boring, even though people actually
need it?
The fact that a large group of frustrated Linux audio developers could
find no better outlet than to develop this solution is a rather strong
indictment of the kernel requirements-gathering process.
> and if nobody ends up writing the 'proper' solution then there probably
> wasnt enough demand to begin with ... We'll rather live on with one less
> feature for another year than with a crappy feature that is twice as
> hard to get rid of!
Is nobody responsible for figuring out what users need? I didn't
realize kernel development had become so disconnected.
The implicit assumptions are that (1) only problems for which someone
has submitted a nice, clean patch are worth working on, and (2) only
kernel developers are smart enough to understand the real requirements.
> you might ask yourself, 'why is this so, and why cannot the Linux guys
> apply pretty much any hack as e.g. userspace code might': the difference
> in the case of the kernel is compatibility with thousands of
> applications and millions of users. If we expose some method to
> applications then that decision has to stick in 99.999% of the cases.
> It's awfully hard to get rid of a bad, user-visible feature. Also,
> kernel code, especially the scheduler (and the security subsystem) is
> complex, interdependent, performance-sensitive and is modified very
> frequently - any bad decision lives with us for long, and hinders us for
> long.
I do understand binary compatibility and OS maintenance very well.
That is why I submitted a solution with *no* Application Programming
Interface and no hit to the scheduler.
Remember when I asked how you handle changes to sizeof(struct rusage)?
That was a serious question. I hope there's a solution. But, I got
no answer, only handwaving.
> Put your hand on your heart and tell me, assuming RT-LSM went in, and in
> a year's time i'd write the rlimit patch, would you even test my rlimit
> patch with your audio apps? Or would you have told me 'sorry, RT-LSM is
> good enough for our purposes and i dont have time to test this now'.
Yes, I would. Even if you keep insulting me. I would do it because
(like you) I want to make things better. We don't have to be friends
to work together, though a little mutual respect *would* help.
Maybe you recall when I offered to help with realtime testing last
year. It was a major embarrassment that 2.6.0 shipped with low
latency claims when in fact it was quite inferior to 2.4 with the low
latency patches. (Thanks to you, in 2.6.10 it finally is not.) No
one thought this was important enough to accept my offer. After all,
if I were smart, I would be a kernel programmer, right?
> so we have two different goals. You want a feature. We want robust
> features. Those goals do meet in that we both want features, but our
> interests are exactly the opposite in terms of quality and timeframe:
> you want a solution that meets _your_ needs, as soon as possible - while
> we want a longterm solution. (i.e. a solution that is as generic, clean,
> maintainable and robust as possible.)
The LSM was a stop-gap measure intended to tide us over until a real
fix could be "done right" for 2.8. It had the advantage of being
minimally disruptive to the kernel and its maintainability. In that
role it is a success[1].
[1] http://packages.debian.org/testing/misc/realtime-lsm-module-2.6.8-1-686-smp
I am still amazed that you are willing to make scheduler changes in
the 2.6 development stream. But, that is your decision to make.
> What would you have told me if my patch also removed RT-LSM (because it
> implemented a superior method and i wanted to get rid of the crap)?
> Would you possibly have cried bloody murder about breaking backwards
> compatibility with audio apps?
I expect that the life span of the RT-LSM will end in 2.8, replaced by
something better. Does this break binary compatibility? No, it has
no API. Only system admin procedures are affected. These things
change between major releases anyway. As long as there is still a way
to solve the problem, we will adjust.
If you were to take it out and replace it with something like that
nice(-20) hack that theoretically works but actually doesn't, *then*
I'd scream bloody murder. Wouldn't you want me to?
> (Dont misunderstand me, this is not any 'fault' of yours, this is simply
> your interest: you are not hacking kernel code so you are at most
> compassionate with our needs, but you are not directly affected by
> kernel code quality issues.)
I do care. But, since maintainance is your responsibility, I am
willing to defer to your wishes in the matter. I looked at sched.c.
It appears well-written. I have no desire to second-guess development
when it's working.
Lack of clear requirements gathering, on the other hand, directly
affects me in very unpleasant ways. That's not working.
> (also, believe me, this is not arrogance or some kind of game on our
> part. If there was a nice clean solution that solved your and others'
> problems equally well then it would already be in Linus' tree. But there
> is no such solution yet, at the moment. Moreover, the pure fact that so
> many patch proposals exist and none looks dominantly convincing shows
> that this is a problem area for which there are no easy solutions. We
> hate such moments just as much as you do, but they do happen.)
The actual requirement is nowhere near as difficult as you imagine.
You and several others continue to view realtime in a multi-user
context. That doesn't work. No wonder you have no good solution.
The humble RT-LSM was actually optimal for the multi-user scenario:
don't load it. Then it adds no security issues, complexity or
scheduler pathlength. As an added benefit, the sysadmin can easily
verify that it's not there.
The cost/performance characteristics of commodity PC's running Linux
are quite compelling for a wide range of practical realtime
applications. But, these are dedicated machines. The whole system
must be carefully tuned. That is the only method that actually works.
The scheduler is at most a peripheral concern; the best it can do is
not screw up.
--
joq
On Tue, Feb 01, 2005 at 11:10:48PM -0600, Jack O'Quin wrote:
> Ingo Molnar <[email protected]> writes:
> > (also, believe me, this is not arrogance or some kind of game on our
> > part. If there was a nice clean solution that solved your and others'
> > problems equally well then it would already be in Linus' tree. But there
> > is no such solution yet, at the moment. Moreover, the pure fact that so
> > many patch proposals exist and none looks dominantly convincing shows
> > that this is a problem area for which there are no easy solutions. We
> > hate such moments just as much as you do, but they do happen.)
>
> The actual requirement is nowhere near as difficult as you imagine.
> You and several others continue to view realtime in a multi-user
> context. That doesn't work. No wonder you have no good solution.
A notion of process/thread scoping is needed from my point of view. How
to implement that is another matter and, from what I see, there are no
clear solutions that don't involve major changes to fundamental syscalls
like fork/clone() and the underlying kernel structures.
The very notion of Unix fork semantics isn't sufficient to
"contain" these semantics. It's more about controlling things with
known quantities over time, not about process creation relationships,
and therein lies the mismatch.
Also, as media apps get more sophisticated they're going to need some
kind of access to some traditional softirq facilities, possibly
migrated into userspace safely somehow, for IO processing such as
iSCSI, FireWire, networking and all peripherals
that need some kind of prioritized IO handling. It's akin to O_DIRECT,
where folks need to determine policy over the kernel's own facilities,
IO queues, but in a more broad way. This is inevitable for these
category of apps. Scary ? yes I know.
Think XFS streaming with guaranteed rate IO, then generalize this for
all things that can be streamed in the kernel. A side note, they'll
also be pegging CPU usage and attempting to draw to the screen at the
same time. It would be nice to have slack from scheduler frames be used
for less critical things such as drawing to the screen.
The policy for scheduling these IO requests may be divorced from the
actual priority of the thread requesting it, which presents some problems
with the current Linux code as I understand it.
Whether this is suitable for mainstream inclusion is another matter. But
as a person that wants to write apps of this nature, I came into this
kernel stuff knowing that there's going to be a conflict between
the needs of media apps folks and what the Linux kernel folks will
tolerate as a community.
> The humble RT-LSM was actually optimal for the multi-user scenario:
> don't load it. Then it adds no security issues, complexity or
> scheduler pathlength. As an added benefit, the sysadmin can easily
> verify that it's not there.
>
> The cost/performance characteristics of commodity PC's running Linux
> are quite compelling for a wide range of practical realtime
> applications. But, these are dedicated machines. The whole system
> must be carefully tuned. That is the only method that actually works.
> The scheduler is at most a peripheral concern; the best it can do is
> not screw up.
It's very compelling and very deadly to the industry if these things
become commonplace in the normal Linux kernel. It would instantly
make Linux the top platform for anything media related, graphic and
audio. (Hopefully, I can get back to kernel coding RT stuff after this
current distraction that has me reassigned onto an emergency project)
I hope I clarified some of this communication and didn't completely scare
Ingo and others too much. Just a little bit is ok. :)
bill
* Jack O'Quin <[email protected]> wrote:
> Remember when I asked how you handle changes to sizeof(struct rusage)?
> That was a serious question. I hope there's a solution. [...]
what does any of what we've been talking about have to do with struct rusage?
One of the patches i wrote adds a new rlimit. It has no impact on
rusage, at all. A new rlimit can be added transparently, we routinely
add new rlimits, and no, most of them have no matching rusage fields!
> But, I got no answer, only handwaving.
i very much replied to your point:
http://marc.theaimsgroup.com/?l=linux-kernel&m=110672338910363&w=2
" > Does getrusage() return anything for this? How can a field be added
> to the rusage struct without breaking binary compatibility? Can we
> assume that no programs ever use sizeof(struct rusage)?
rlimits are easily extended and there are no binary compatibility
worries. The kernel doesnt export the maximum towards userspace.
getrusage() will return the value on new kernels and will return
-EINVAL on old kernels, so new userspace can deal with this
accordingly. "
(and here i meant getrlimit(), not getrusage() - getrusage() is not
affected by the patch at all.)
Ingo
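A sketch of the userspace side of that: probing for the new rlimit and
falling back gracefully on an old kernel. RLIMIT_RT_CPU_RATIO and the
placeholder number here are assumptions taken from the patch
discussion, not from any released kernel:

#include <errno.h>
#include <stdio.h>
#include <sys/resource.h>

#ifndef RLIMIT_RT_CPU_RATIO
#define RLIMIT_RT_CPU_RATIO 15	/* placeholder value, patch-specific */
#endif

int main(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_RT_CPU_RATIO, &rl) == 0)
		printf("RT CPU limit: %lu%%\n", (unsigned long)rl.rlim_cur);
	else if (errno == EINVAL)
		printf("kernel has no RT CPU rlimit\n");
	return 0;
}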
[trimming the Cc: list]
> * Jack O'Quin <[email protected]> wrote:
>> Remember when I asked how you handle changes to sizeof(struct rusage)?
>> That was a serious question. I hope there's a solution. [...]
Ingo Molnar <[email protected]> writes:
> what does any of what we've talking about have to do with struct rusage?
Your previous message implied that "userspace" programmers don't
understand binary compatibility...
> you might ask yourself, 'why is this so, and why cannot the Linux guys
> apply pretty much any hack as e.g. userspace code might'
I was just demonstrating that I do.
> " > Does getrusage() return anything for this? How can a field be added
> > to the rusage struct without breaking binary compatibility? Can we
> > assume that no programs ever use sizeof(struct rusage)?
>
> rlimits are easily extended and there are no binary compatibility
> worries. The kernel doesnt export the maximum towards userspace.
> getrusage() will return the value on new kernels and will return
> -EINVAL on old kernels, so new userspace can deal with this
> accordingly. "
>
> (and here i meant getrlimit(), not getrusage() - getrusage() is not
> affected by the patch at all.)
Well, that was the source of my question.
I had asked about rusage. You said it did return a new value, but
that this was not a problem. That made no sense to me. Thank you for
clearing it up.
Certainly getrlimit() works OK. I understood that already.
--
joq
Bill Huey (hui) <[email protected]> writes:
> Also, as media apps get more sophisticated they're going to need some
> kind of access to some traditional softirq facilities, possibly
> migrated into userspace safely somehow, for IO processing such as
> iSCSI, FireWire, networking and all peripherals
> that need some kind of prioritized IO handling. It's akin to O_DIRECT,
> where folks need to determine policy over the kernel's own facilities,
> IO queues, but in a more broad way. This is inevitable for these
> category of apps. Scary ? yes I know.
I believe Ingo's RT patches already support this on a per-IRQ basis.
Each IRQ handler can run in a realtime thread with priority assigned
by the sysadmin. Balancing the interrupt handler priorities with
those of other realtime activities allows excellent control.
This is really only useful within the context of a dedicated realtime
system, of course.
Stephane Letz reports a similar feature in Mac OS X.
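A sketch of the admin side of this, assuming the RT patches where IRQ
handlers run as kernel threads: a plain sched_setscheduler() call on
the handler thread's PID sets its priority. The PID below is
hypothetical - it would be looked up in the process list by the
thread's name (e.g. the soundcard's IRQ thread):

#include <sched.h>
#include <stdio.h>
#include <sys/types.h>

int main(void)
{
	pid_t irq_thread = 1234;	/* hypothetical: look up the real PID */
	struct sched_param sp = { .sched_priority = 50 };

	if (sched_setscheduler(irq_thread, SCHED_FIFO, &sp) < 0)
		perror("sched_setscheduler");
	return 0;
}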
> Whether this is suitable for mainstream inclusion is another matter. But
> as a person that wants to write apps of this nature, I came into this
> kernel stuff knowing that there's going to be a conflict between the
> the needs of media apps folks and what the Linux kernel folks will
> tolerate as a community.
That's a price both groups pay for doing realtime within the context
of a general-purpose OS. But, for many, many applications it's the
best option.
Fortunately, most of what we need also improves the general quality
and responsiveness of the kernel. The important things like short
lock hold times are really just good concurrent programming practice.
>> The cost/performance characteristics of commodity PC's running Linux
>> are quite compelling for a wide range of practical realtime
>> applications. But, these are dedicated machines. The whole system
>> must be carefully tuned. That is the only method that actually works.
>> The scheduler is at most a peripheral concern; the best it can do is
>> not screw up.
>
> It's very compelling and very deadly to the industry if these things
> become commonplace in the normal Linux kernel. It would instantly
> make Linux the top platform for anything media related, graphic and
> audio.
Yes, many people want to take advantage of this.
--
joq
On Tue, 2005-02-01 at 23:10 -0600, Jack O'Quin wrote:
> > So in the Linux core code we have zero tolerance on crap. We are
> > doing this for the long-term fun of it.
>
> So, we should never do anything boring, even though people actually
> need it?
>
> The fact that a large group of frustrated Linux audio developers could
> find no better outlet than to develop this solution is a rather strong
> indictment of the kernel requirements-gathering process.
>
> > and if nobody ends up writing the 'proper' solution then there probably
> > wasnt enough demand to begin with ... We'll rather live on with one less
> > feature for another year than with a crappy feature that is twice as
> > hard to get rid of!
>
> Is nobody responsible for figuring out what users need? I didn't
> realize kernel development had become so disconnected.
>
Interesting point. The kernel development process has been written
about at length, but you don't hear much about the requirements
gathering process.
It seems like, aside from the internal forces of kernel developers
wanting to improve the system, the big distros do a lot of the
requirements gathering. Makes sense, as the distros are in a very good
position to know what users want, and can commit developer resources to
get it done. For example there has been a big push from the distros to
make Linux competitive with MS on the desktop, and it shows in the
direction of Linux development. These days you can throw in a CD of a
modern distro and the installer will get you to a working desktop more
easily than Windows (well, almost, your sound might not work ;-).
Really, if the Linux audio community wants to get its requirements
heard, then all the AGNULA and Planet CCRMA users should start to demand
that Fedora and Debian be usable OOTB for low latency audio. If you
want to run JACK in realtime mode on RH or Debian you have to be root
for crying out loud. There's no reason Linux audio users should have to
patch the kernel or install a bunch of specialized packages any more
than people who want to use it as a web server. File bug reports,
complain on your distro's user mailing lists, whatever.
I do appreciate the progress that has been made, and that the Linux
kernel developers really stepped up to address the latency issues. But,
most of the push has come from outside the kernel development community,
from individual Linux audio users and developers. If we had waited for
the big distros to demand that low latency audio work OOTB, we would be
exactly where we were in 2001 (or this time last year) still using 2.4
+ll+preempt and struggling to get that old kernel to work on our new
hardware.
IMHO the requirements gathering process usually works well. When
someone with a redhat.com (for example) address posts a patch there's an
implicit assumption that it addresses the needs of their gadzillions of
users. Still, RH hires professional kernel developers, and people who
produce known good code will always have an easier time getting patches
merged. If Linus & co. don't know you from Adam and you show up with a
patch that claims to solve a big problem, then I would expect them to be
a bit skeptical. Especially if the problem is either low priority or
not well understood by the major distros.
Lee
> On Tue, 2005-02-01 at 23:10 -0600, Jack O'Quin wrote:
>> Is nobody responsible for figuring out what users need? I didn't
>> realize kernel development had become so disconnected.
Lee Revell <[email protected]> writes:
> IMHO the requirements gathering process usually works well. When
> someone with a redhat.com (for example) address posts a patch there's an
> implicit assumption that it addresses the needs of their gadzillions of
> users. Still, RH hires professional kernel developers, and people who
> produce known good code will always have an easier time getting patches
> merged. If Linus & co. don't know you from Adam and you show up with a
> patch that claims to solve a big problem, then I would expect them to be
> a bit skeptical. Especially if the problem is either low priority or
> not well understood by the major distros.
I guess you're right, Lee. I hadn't thought of it that way. It just
looks broken to me because we have no standing in any normal kernel
requirements process. That's a shame, but it does seem less like a
systemic issue.
I think the distributions are getting more interested in these issues.
Maybe that will help. The RT-LSM is available as a module in Debian
sarge.
Back when I did OS development for a living, there was a huge focus on
defining user requirements. But, our kernel development was never
organizationally separate from the rest of the OS. That makes a big
difference.
--
joq
* Jack O'Quin <[email protected]> wrote:
> The LSM was a stop-gap measure intended to tide us over until a real
> fix could be "done right" for 2.8. It had the advantage of being
> minimally disruptive to the kernel and its maintainability. [...]
i'm not opposed to the LSM solution per se, especially given that none
of the other solutions in existence are fully satisfactory (and thus
acceptable for the scheduler currently). The LSM patch is clearly the
least intrusive solution.
Ingo
* Jack O'Quin <[email protected]> wrote:
> I guess you're right, Lee. I hadn't thought of it that way. It just
> looks broken to me because we have no standing in any normal kernel
> requirements process. That's a shame, but it does seem less like a
> systemic issue.
you have just as much standing, and you certainly went to great lengths
(writing patches, testing stuff, etc.) to address this issue - it is
just an unfortunate situation that the issue here is _not_ clear-cut at
all. It is a longstanding habit on lkml to try to solve things as
cleanly and generally as possible, but there are occasional cases where
this is just not possible.
e.g. technically it was much harder to write all the latency-fix patches
(and infrastructure around it) that are now in 2.6.11-rc2, but it was
also a much clearer issue, with clean solutions; so there was no
conflict about whether to do it and you'll reap the benefits of that in
2.6.11.
so forgive us this stubbornness, it's not directed against you in person
or against any group of users, it's always directed at the problem at
hand. I think we can do the LSM thing, and if this problem comes up in
the future again, then maybe by that time there will be a better
solution. (e.g. it's quite possible that something nice will come out of
the various virtualization projects, for this problem area.)
Ingo
On Wed, Feb 02, 2005 at 10:44:22AM -0600, Jack O'Quin wrote:
> Bill Huey (hui) <[email protected]> writes:
> > Also, as media apps get more sophisticated they're going to need some
> > kind of access to some traditional softirq facilities, possibly
> > migrating them into userspace safely somehow, along with how they handle IO
> > processing such as iSCSI, FireWire, networking and all peripherals
> > that need some kind of prioritized IO handling. It's akin to O_DIRECT,
> > where folks need to determine policy over the kernel's own facilities,
> > IO queues, but in a broader way. This is inevitable for this
> > category of apps. Scary? Yes, I know.
>
> I believe Ingo's RT patches already support this on a per-IRQ basis.
> Each IRQ handler can run in a realtime thread with priority assigned
> by the sysadmin. Balancing the interrupt handler priorities with
> those of other realtime activities allows excellent control.
No they don't. That's a physical mapping of these kernel entities, not a
logical organization that projects upward to things like individual sockets
or file streams. The current irq-thread patches are primarily for dealing
with the low-level acks and such for the devices in question. They do not
deal with queuing policy or how these things are scheduled on a logical
basis, which is what softirqs do. softirqs group a number of things together
in one big uncontrollable chunk. Really, a bit of time spent in the kernel
regarding this would clarify it more in the future. Don't speculate.
This misunderstanding, often babble, from app folks is why kernel folks
partially dismiss the needs requested from this subgroup. It's important
to understand your needs before articulating them to a wider community.
The kernel community must understand the true nature of these needs and
then facilitate them. If the relationship is one where kernel folks dictate
what apps folks have, you basically pervert the relationship and the
responsibilities of overall development, which fatally cripples app
development and all development of this nature. It's a two-way street, but
kernel folks can be more proactive about it, definitely.
Step one in this is to acknowledge that Unix scheduling semantics is
"inantiquated" with regard to media apps. Some notion of scoping needs to
be put in.
Everybody on the same page ?
> This is really only useful within the context of a dedicated realtime
> system, of course.
>
> Stephane Letz reports a similar feature in Mac OS X.
OS X is very coarse grained (two funnels) and I would seriously doubt
that it would perform without scheduling side effects to the overall
system because of that. With a largely stalled FreeBSD SMPng project
where they hijack a good chunk of their code into an antiquated and
bloated Mach threading system, that situation isn't helping it.
What the Linux community has with the RT patches is potentially light
years ahead of OS X regarding overall system latency, since RT and
SMP performance are tightly related. It's just a matter of getting the
right folks to understand the problem space and then making changes so
that the overall picture is completed.
bill
On Wed, Feb 02, 2005 at 10:21:00PM +0100, Ingo Molnar wrote:
> yes and no. You are right in that the individual workloads (e.g.
> softirqs) are not separated and identified/credited to the thread that
> requested them. (in part due to the fact that you cannot e.g. credit a
> thread for e.g. unrequested workloads like incoming sockets, or for
> 'merged' workloads like writeout of a commonly accessed file.)
What's not being addressed here is a need for pervasive QoS across all
kernel systems. The power of this patch is multiplicative. It's not
about a specific component of the system having microsecond latencies,
it's about how all parts, softirqs, hardirqs, VM, etc... work together
so that the entire system is suitable for (near) hard real time. It's
unconstrained, unlike dual kernel RT systems, across all component
boundaries. Those constraints create large chunks of glue logic between
systems, which has exploded the complexity of things that app folks
must deal with.
This is where properly written Linux apps (none exist right now because
of kernel issues) can really overtake competing apps from other OSes
(ignoring how crappy X11 is).
> but Jack is right in practical terms: the audio folks achieved pretty
> good results with the current IRQ threading mechanism, partly due to the
> fact that the audio stack doesnt use softirqs, so all the
> latency-critical activities are in the audio IRQ thread and the
> application itself.
It's clever that they do that, but additional control is needed in the
future. jackd isn't the most sophisticated media app on this planet (not
too much of an insult :)) and the demands from this group are bound to
increase as their group and competing projects get more and more
sophisticated. When I say kernel folks need to be proactive, I really
mean it. The Linux kernel's latency issues and poor driver support are
largely why media apps are way below even being second rate with regard
to other operating systems such as Apple's OS X, for instance.
bill
Bill Huey (hui) wrote:
> On Tue, Feb 01, 2005 at 11:10:48PM -0600, Jack O'Quin wrote:
>
>>Ingo Molnar <[email protected]> writes:
>>
>>>(also, believe me, this is not arrogance or some kind of game on our
>>>part. If there was a nice clean solution that solved your and others'
>>>problems equally well then it would already be in Linus' tree. But there
>>>is no such solution yet, at the moment. Moreover, the pure fact that so
>>>many patch proposals exist and none looks dominantly convincing shows
>>>that this is a problem area for which there are no easy solutions. We
>>>hate such moments just as much as you do, but they do happen.)
>>
>>The actual requirement is nowhere near as difficult as you imagine.
>>You and several others continue to view realtime in a multi-user
>>context. That doesn't work. No wonder you have no good solution.
>
>
> A notion of process/thread scoping is needed from my point of view. How
> to implement that is another matter and there are no clear solutions
> that don't involve major changes in some way to fundamental syscalls
> like fork/clone() and underlying kernel structures from what I see.
> The very notion of Unix fork semantics isn't sufficient to
> "contain" these semantics. It's more about controlling things with
> known quantities over time, not about process creation relationships,
> and therein lies the mismatch.
>
> Also, as media apps get more sophisticated they're going to need some
> kind of access to some traditional softirq facilities, possibly
> migrating them into userspace safely somehow, along with how they handle IO
> processing such as iSCSI, FireWire, networking and all peripherals
> that need some kind of prioritized IO handling. It's akin to O_DIRECT,
> where folks need to determine policy over the kernel's own facilities,
> IO queues, but in a broader way. This is inevitable for this
> category of apps. Scary? Yes, I know.
>
> Think XFS streaming with guaranteed rate IO, then generalize this for
> all things that can be streamed in the kernel. A side note, they'll
> also be pegging CPU usage and attempting to draw to the screen at the
> same time. It would be nice to have slack from scheduler frames be used
> for less critical things such as drawing to the screen.
>
> The policy for scheduling these IO requests may be divorced from the
> actual priority of the thread requesting it, which presents some problems
> with the current Linux code as I understand it.
>
> Whether this is suitable for mainstream inclusion is another matter. But
> as a person who wants to write apps of this nature, I came into this
> kernel stuff knowing that there's going to be a conflict between
> the needs of media apps folks and what the Linux kernel folks will
> tolerate as a community.
>
>
>>The humble RT-LSM was actually optimal for the multi-user scenario:
>>don't load it. Then it adds no security issues, complexity or
>>scheduler pathlength. As an added benefit, the sysadmin can easily
>>verify that it's not there.
>>
>>The cost/performance characteristics of commodity PC's running Linux
>>are quite compelling for a wide range of practical realtime
>>applications. But, these are dedicated machines. The whole system
>>must be carefully tuned. That is the only method that actually works.
>>The scheduler is at most a peripheral concern; the best it can do is
>>not screw up.
>
>
> It's very compelling and very deadly to the industry if these things
> become commonplace in the normal Linux kernel. It would instantly
> make Linux the top platform for anything media related, graphic and
> audio. (Hopefully, I can get back to kernel coding RT stuff after this
> current distraction that has me reassigned onto an emergency project)
>
> I hope I clarified some of this communication and didn't completely scare
> Ingo and others too much. Just a little bit is ok. :)
As Ingo said in an earlier post, with a little ingenuity this problem
can be solved in user space. The programs in question can be setuid
root so that they can set RT scheduling policy BUT have their
permissions set so that they are only executable by owner and group, with the
group set to a group that only contains those users that have permission
to run this program in RT mode. If you wish to allow other users to run
the program but not in RT mode then you would need two copies of the
program: one set up as above and the other with normal permissions.
It may be necessary for users that are members of this RT group to do a
newgrp before running this program if it isn't their primary group. But
that could be done in a shell wrapper.
If you have the source code for the programs then they could be modified
to drop the root euid after they've changed policy. Or even do the
group membership check inside the program and, if the user that launches
the program is not a member of the group, drop the root euid immediately on
start up.
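A minimal sketch of that in-program variant, assuming a setuid-root
binary and an illustrative group name "rtusers" (both are assumptions,
not part of any existing package):

#include <grp.h>
#include <pwd.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Return nonzero if uid's user is a member of group grpname. */
static int user_in_group(uid_t uid, const char *grpname)
{
	struct passwd *pw = getpwuid(uid);
	struct group *gr = getgrnam(grpname);
	char **m;

	if (!pw || !gr)
		return 0;
	if (pw->pw_gid == gr->gr_gid)
		return 1;
	for (m = gr->gr_mem; *m; m++)
		if (!strcmp(*m, pw->pw_name))
			return 1;
	return 0;
}

int main(void)
{
	struct sched_param sp = { .sched_priority = 10 };

	/* Grant RT scheduling only if the real user is in the group. */
	if (user_in_group(getuid(), "rtusers")) {
		if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1)
			perror("sched_setscheduler");
	}

	/* Drop root irrevocably before doing anything else. */
	if (setuid(getuid()) == -1) {
		perror("setuid");
		return 1;
	}
	/* ... the rest of the application runs unprivileged ... */
	return 0;
}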
Peter
--
Peter Williams [email protected]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
* Bill Huey <[email protected]> wrote:
> On Wed, Feb 02, 2005 at 10:44:22AM -0600, Jack O'Quin wrote:
> > I believe Ingo's RT patches already support this on a per-IRQ basis.
> > Each IRQ handler can run in a realtime thread with priority assigned
> > by the sysadmin. Balancing the interrupt handler priorities with
> > those of other realtime activities allows excellent control.
>
> No they don't. That's a physical mapping of these kernel entities, not a
> logical organization that projects upward to things like individual sockets
> or file streams. [...]
yes and no. You are right in that the individual workloads (e.g.
softirqs) are not separated and identified/credited to the thread that
requested them. (in part due to the fact that you cannot e.g. credit a
thread for e.g. unrequested workloads like incoming sockets, or for
'merged' workloads like writeout of a commonly accessed file.)
but Jack is right in practical terms: the audio folks achieved pretty
good results with the current IRQ threading mechanism, partly due to the
fact that the audio stack doesnt use softirqs, so all the
latency-critical activities are in the audio IRQ thread and the
application itself.
Ingo
On Wed, Feb 02, 2005 at 01:14:05PM -0800, Bill Huey wrote:
> Step one in this is to acknowledge that Unix scheduling semantics is
> "inantiquated" with regard to media apps. Some notion of scoping needs to
bah, "inadequate".
> be put in.
>
> Everybody on the same page ?
bill
Ingo Molnar <[email protected]> writes:
> so forgive us this stubbornness, it's not directed against you in person
> or against any group of users, it's always directed at the problem at
> hand. I think we can do the LSM thing, and if this problem comes up in
> the future again, then maybe by that time there will be a better
> solution. (e.g. it's quite possible that something nice will come out of
> the various virtualization projects, for this problem area.)
No hard feelings, Ingo.
I respect stubbornness in the pursuit of quality.
If requested, I will provide testing and other support for those
working on better solutions.
--
joq
>It's clever that they do that, but additional control is needed in the
>future. jackd isn't the most sophisticated media app on this planet (not
>too much of an insult :)) and the demands from this group are bound to
Actually, JACK probably is the most sophisticated media *framework* on
the planet, at least inasmuch as it connects ideas drawn from the
media world and OS research/design into a coherent package. It's not
perfect, and we've just started adding new data types to its
capabilities (it's actually relatively easy). But it is amazingly
powerful in comparison to anything offered to date, and is
unencumbered by the limitations that have affected other attempts to
do what it does.
And it makes possible some of the most sophisticated *audio* apps on
the planet, though admittedly not video and other data at this time.
>increase as their group and competing projects get more and more
>sophisticated. When I say kernel folks need to be proactive, I really
>mean it. The Linux kernel's latency issues and poor driver support are
>largely why media apps are way below even being second rate with regard
>to other operating systems such as Apple's OS X, for instance.
This is a bit misleading. With the right kernel+patch, Linux performs
at least as well as OSX, and measurably better in many configurations
and cases. And it's JACK that has provided OSX with inter-application
audio routing, not the other way around. The higher quality of OSX
apps isn't because of kernel latency or poor driver support: it's
because the apps have been under development for longer and have the
immense benefit of OSX's single unified development
environment. Within the next month, expect to see several important
Linux audio apps released for OSX, providing functionality not
available on that platform at present.
--p
>As Ingo said in an earlier post, with a little ingenuity this problem
>can be solved in user space. The programs in question can be setuid
>root so that they can set RT scheduling policy BUT have their
>permissions set so that they are only executable by owner and group, with the
>group set to a group that only contains those users that have permission
>to run this program in RT mode. If you wish to allow other users to run
>the program but not in RT mode then you would need two copies of the
>program: one set up as above and the other with normal permissions.
Just a reminder: setuid root is precisely what we are attempting to
avoid.
>If you have the source code for the programs then they could be modified
>to drop the root euid after they've changed policy. Or even do the
This is insufficient, since they need to be able to drop RT scheduling
and then reacquire it again later.
--p
Paul Davis wrote:
>>As Ingo said in an earlier post, with a little ingenuity this problem
>>can be solved in user space. The programs in question can be setuid
>>root so that they can set RT scheduling policy BUT have their
>>permissions set so that they are only executable by owner and group, with the
>>group set to a group that only contains those users that have permission
>>to run this program in RT mode. If you wish to allow other users to run
>>the program but not in RT mode then you would need two copies of the
>>program: one set up as above and the other with normal permissions.
>
>
> Just a reminder: setuid root is precisely what we are attempting to
> avoid.
There's nothing wrong with using setuid root if you do it carefully and
properly. If you have the sources to the program then you can place all
the necessary safeguards in the program itself. Doing this inside the
program would allow more elaborate control over who is allowed to set RT
policy from within the program. E.g. you could have a file (owned by
root and only writable by root) in /etc with the names of the users that
have this privilege. If this file does not exist, has the wrong
privileges or the user associated with the task's ruid is not in the file,
then the program immediately and irrevocably drops root privileges;
otherwise it drops them temporarily and regains them when it needs to
either change policy to RT or fork another task that needs the same
privileges.
>
>
>>If you have the source code for the programs then they could be modified
>>to drop the root euid after they've changed policy. Or even do the
>
>
> This is insufficient, since they need to be able to drop RT scheduling
> and then reacquire it again later.
I believe that there are mechanisms that allow this. The setuid man
page states that a process with a non-root real uid but setuid root can
use the seteuid call, via the _POSIX_SAVED_IDS mechanism, to drop and
regain root privileges as required.
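A sketch of that drop-and-regain dance, assuming a setuid-root binary
(so the saved set-user-ID is 0):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	uid_t real = getuid();

	/* Drop: euid becomes the real uid; the saved set-uid stays 0. */
	if (seteuid(real) == -1)
		perror("seteuid(drop)");
	/* ... run without privilege ... */

	/* Regain: allowed because the saved set-user-ID is still 0. */
	if (seteuid(0) == -1)
		perror("seteuid(regain)");
	/* ... change scheduling policy to RT, then drop again ... */
	return 0;
}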
Peter
--
Peter Williams [email protected]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
Peter Williams <[email protected]> writes:
>>> If you have the source code for the programs then they could be
>>> modified to drop the root euid after they've changed policy. Or
>>> even do the
> Paul Davis wrote:
>> This is insufficient, since they need to be able to drop RT
>> scheduling and then reacquire it again later.
> I believe that there are mechanisms that allow this. The setuid man
> page states that a process with a non-root real uid but setuid root
> can use the seteuid call, via the _POSIX_SAVED_IDS mechanism, to
> drop and regain root privileges as required.
Which every system cracker knows. Any attack on such a program is
going to re-acquire root privileges and take over the system.
Temporarily dropping privileges gains no security whatsoever. It is
nothing more than a coding convenience. The program remains *inside*
the system security perimeter.
--
joq
On Wed, Feb 02, 2005 at 05:59:54PM -0500, Paul Davis wrote:
> Actually, JACK probably is the most sophisticated media *framework* on
> the planet, at least inasmuch as it connects ideas drawn from the
> media world and OS research/design into a coherent package. Its not
> perfect, and we've just started adding new data types to its
> capabilities (its actually relatively easy). But it is amazingly
> powerful in comparison to anything offered to data, and is
> unencumbered by the limitations that have affected other attempts to
> do what it does.
This is a bit off topic, but I'm interested in applications that are
more driven by time and have abstractions closer to that in a pure way.
A lot of audio kits tend to be overly about DSP and not about time.
This is difficult to explain, but what I'm referring to here is ideally
the next generation of these applications and their design, not the current
lot. A lot more can be done.
> And it makes possible some of the most sophisticated *audio* apps on
> the planet, though admittedly not video and other data at this time.
Again, the notion of time based processing with broader uses and not
just DSP, which is what a lot of current graph driven audio frameworks
seem to still do at this time. Think gaming audio in 3d, etc...
I definitely have ideas on this subject and I'm going to hold my
current position on this matter in that we can collectively do much
better.
bill
On Thu, Feb 03, 2005 at 08:54:24AM +1100, Peter Williams wrote:
> As Ingo said in an earlier a post, with a little ingenuity this problem
> can be solved in user space. The programs in question can be setuid
> root so that they can set RT scheduling policy BUT have their
> permissions set so that they only executable by owner and group with the
> group set to a group that only contains those users that have permission
> to run this program in RT mode. If you wish to allow other users to run
> the program but not in RT mode then you would need two copies of the
> program: one set up as above and the other with normal permissions.
Again, in my post that you snipped you either didn't read or didn't understand
what I was saying regarding QoS, nor about the large scale issues regarding
dual/single kernel development environments. Ultimately this stuff requires
non-trivial support in kernel space, a softirq thread migration mechanism
and a frame driven scheduler to back IO submission across async boundaries.
My posts were pretty clear on this topic and a lot of this has origins
coming from SGI IRIX. Yes, SGI IRIX. One of the only systems man enough
to handle this stuff.
Ancient, antiquated Unix scheduler semantics (sort and run) and lack of
control over critical facilities like softirq processing are obstacles
to getting at this.
bill
Jack O'Quin wrote:
> Peter Williams <[email protected]> writes:
>
>
>>>>If you have the source code for the programs then they could be
>>>>modified to drop the root euid after they've changed policy. Or
>>>>even do the
>
>
>>Paul Davis wrote:
>>
>>>This is insufficient, since they need to be able to drop RT
>>>scheduling and then reacquire it again later.
>
>
>>I believe that there are mechanisms that allow this. The setuid man
>>page states that a process with non root real uid but setuid as root
>>can use the seteuid call to use the _POSIX_SAVED_IDS mechanism to
>>drop and regain root privileges as required.
>
>
> Which every system cracker knows. Any attack on such a program is
> going to re-acquire root privileges and take over the system.
>
> Temporarily dropping privileges gains no security whatsoever. It is
> nothing more than a coding convenience.
Yes, to help avoid accidentally misusing the privileges.
> The program remains *inside*
> the system security perimeter.
Which is why you have to be careful in writing setuid programs.
Peter
--
Peter Williams [email protected]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
Bill Huey (hui) wrote:
> On Thu, Feb 03, 2005 at 08:54:24AM +1100, Peter Williams wrote:
>
>>As Ingo said in an earlier post, with a little ingenuity this problem
>>can be solved in user space. The programs in question can be setuid
>>root so that they can set RT scheduling policy BUT have their
>>permissions set so that they are only executable by owner and group, with the
>>group set to a group that only contains those users that have permission
>>to run this program in RT mode. If you wish to allow other users to run
>>the program but not in RT mode then you would need two copies of the
>>program: one set up as above and the other with normal permissions.
>
>
> Again, in my post that you snipped you either didn't read or didn't understand
> what I was saying regarding QoS,
I guess I thought it was overkill for the problem under
discussion and probably wouldn't solve it anyway. Giving any task special
preferential (emphasis on the preferential) treatment should require
authorization by a suitably privileged entity at some stage. So the
problem of how ordinary users manage to launch tasks that receive
preferential treatment will remain.
> nor about the large scale issues regarding
> dual/single kernel development environments. Ultimately this stuff requires
> non-trivial support in kernel space, a softirq thread migration mechanism
> and a frame driven scheduler to back IO submission across async boundaries.
>
> My posts were pretty clear on this topic and a lot of this has origins
> coming from SGI IRIX. Yes, SGI IRIX. One of the only systems man enough
> to handle this stuff.
>
> Ancient, antiquated Unix scheduler semantics (sort and run) and lack of
> control over critical facilities like softirq processing are obstacles
> to getting at this.
Sorry for upsetting you,
Peter
--
Peter Williams [email protected]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
> Jack O'Quin wrote:
>> Temporarily dropping privileges gains no security whatsoever. It is
>> nothing more than a coding convenience.
Peter Williams <[email protected]> writes:
> Yes, to help avoid accidentally misusing the privileges.
>> The program remains *inside* the system security perimeter.
>
> Which is why you have to be careful in writing setuid programs.
Which is why I'd rather not run an inherently insecure program like
jackd with root privileges.
I can live with a cracker crashing my audio workstation with a DoS
attack using realtime privileges. I'll just have to reboot. But, I
do not want him turning my mail server into a spam relay.
--
joq
>This is a bit off topic, but I'm interested in applications that are
>more driven by time and have abstractions closer to that in a pure way.
>A lot of audio kits tend to be overly about DSP and not about time.
>This is difficult to explain, but what I'm referring to here is ideally
>the next generation of these applications and their design, not the current
>lot. A lot more can be done.
>
>> And it makes possible some of the most sophisticated *audio* apps on
>> the planet, though admittedly not video and other data at this time.
>
>Again, the notion of time based processing with broader uses and not
>just DSP, which is what a lot of current graph driven audio frameworks
>seem to still do at this time. Think gaming audio in 3d, etc...
Ever since I started work on JACK, it has always been in the back of
my head that on at least one level, it's not about audio at all. It's a
user-space cooperative scheduler. All it really is is a way to fire up a
bunch of processes (and/or internal callbacks) based on the passing of
time as measured by some kernel-induced wakeup. It comes with a lot of
extra stuff, like ports for passing around data, and the notion of a
"backend" which is what actually responds to the wakeup and has
somewhat more specific semantics than the client model. It also has
the notion of enforcing time-related deadlines by evicting clients
that appear to cause them to be violated. Even so, there is remarkably
little about audio/DSP that affects the core design of JACK, which is
why it can be run without any audio hardware, can provide network
data transport, etc. etc.
There are several kernel-side attributes that would make JACK better from
my perspective:
* better ways to acquire and release RT scheduling
* better kernel scheduling latency (which has now come a long
way already)
* real inter-process handoff. i am thinking of something like
sched_yield(), but it would take a TID as the target
of the yield. this would avoid all the crap we have to
go through to drive the graph of clients with FIFO's and
write(2) and poll(2) (a sketch of that FIFO round trip
follows after this list). Futexes might be a usable
approximation in 2.6 (we are supporting 2.4, so we can't
use them all the time)
* better ways to know what to lock into physical RAM
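For context, roughly what that FIFO round trip looks like from one
client's side; the FIFO paths, the one-byte token and the two-FIFO
layout are assumptions for illustration, not JACK's actual code:

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* Each client sleeps on its own FIFO and pokes its successor's;
	   both FIFOs are assumed to have been created with mkfifo(1). */
	int in = open("/tmp/graph.0", O_RDWR);
	int out = open("/tmp/graph.1", O_RDWR);
	struct pollfd pfd = { .fd = in, .events = POLLIN };
	char token;

	if (in == -1 || out == -1) {
		perror("open");
		return 1;
	}
	for (;;) {
		/* Sleep until the server (or predecessor) writes a token. */
		if (poll(&pfd, 1, -1) == -1)
			break;
		if (read(in, &token, 1) != 1)
			break;
		/* ... run this client's process() callback for one cycle ... */
		/* Hand off to the next client in the graph. */
		if (write(out, &token, 1) != 1)
			break;
	}
	return 0;
}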
Gaming audio is really a very different model from
pro-audio. Developers there have evolved a different approach to
dealing with latency issues. They use lots of kernel/hardware
buffering for audio, but they require the ability to overwrite what
they have already written to the buffers. This lets them get away with
using rather large audio latency (especially by comparison with us
audio folk), but still allows them to stick new audio into the playback
stream at short notice based on user actions. This is a design that
won't work very well for pro-audio. Gamers could use the pro-audio
"calculate everything at the last moment" model, but guess what:
they'd be beating down on the kernel scheduling door and RT-acquisition
policy center just like we have, only in much larger numbers :)
--p
Paul Davis wrote:
> * real inter-process handoff. i am thinking of something like
> sched_yield(), but it would take a TID as the target
> of the yield. this would avoid all the crap we have to
> go through to drive the graph of clients with FIFO's and
> write(2) and poll(2). Futexes might be a usable
> approximation in 2.6 (we are supporting 2.4, so we can't
> use them all the time)
yield_to(tid) should not be too hard to implement. Ingo? What do you think?
Con
* Con Kolivas <[email protected]> wrote:
> > * real inter-process handoff. i am thinking of something like
> > sched_yield(), but it would take a TID as the target
> > of the yield. this would avoid all the crap we have to
> > go through to drive the graph of clients with FIFO's and
> > write(2) and poll(2). Futexes might be a usable
> > approximation in 2.6 (we are supporting 2.4, so we can't
> > use them all the time)
>
> yield_to(tid) should not be too hard to implement. Ingo? What do you
> think?
i dont really like it - it's really the wrong interface to use. Futexes
are a much better locking/signalling interface. yield_to() would not be
available in 2.4 either. If the appropriate pthread objects are used then
libpthread will do it more or less optimally on 2.4 too, while on 2.6
they'd be perfectly fine and based on futexes. If 2.4 is not an issue
then a good, futex-based inter-process API is POSIX 1003.1b semaphores
(the sem_init()/sem_*() APIs). But if it should work inter-process on
non-futex kernels too, then only pthread spinlocks will do it.
Ingo
>> yield_to(tid) should not be too hard to implement. Ingo? What do you
>> think?
>
>i dont really like it - it's really the wrong interface to use. Futexes
>are a much better locking/signalling interface. yield_to() would not be
i agree in principle, and i was surprised to see Con express this
thought so readily.
however, i don't agree that futexes are conceptually superior. they
don't express the intended operation nearly as accurately as
yield_to(tid) would. the operation is "i have nothing else to do, and
i want <tid> to run next". a futex says "this particular condition is
satisfied, which might wake one or more tasks". it's still necessary
for the caller to go to sleep explicitly, it's still necessary for the
tasks involved to know about the futexes, which actually are really
irrelevant - there are no conditions to satisfy, just a series of
tasks we want to run.
--p
* Paul Davis <[email protected]> wrote:
> however, i don't agree that futexes are conceptually superior. they
> don't express the intended operation nearly as accurately as
> yield_to(tid) would. the operation is "i have nothing else to do, and
> i want <tid> to run next". a futex says "this particular condition is
> satisfied, which might wake one or more tasks". [...]
what i suggested was to use one of the pthread APIs - not to use raw
futexes.
so the basic model is that you have a processing dependency between
threads, correct? If one thread finishes, you know which one should come
next - based on the graph. The first thread is triggered by some
external event. (timer or audio event.)
this can be cleanly implemented by attaching a pthread spinlock to each
node of the graph, and initializing the lock to a locked state. The
threads go to sleep by 'taking the lock'. The one that does processing
wakes up the next one by unlocking that graph node, and then it goes to
sleep by locking its own node.
(it would be cleaner to use POSIX semaphores for this, but you mentioned
the requirement for the mechanism to work on 2.4 kernels too - pthread
spinlocks will work inter-process on 2.4 too, and will schedule nicely.)
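A sketch of that wake chain using the POSIX semaphores mentioned above,
in-process for brevity (an inter-process version would place the
semaphores in shared memory and pass pshared=1 to sem_init); the node
count and cycle count are arbitrary:

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define NODES 3

static sem_t node_sem[NODES];	/* one "graph node" per client thread */

static void *client(void *arg)
{
	int me = *(int *)arg;
	int next = (me + 1) % NODES;
	int cycle;

	for (cycle = 0; cycle < 4; cycle++) {
		sem_wait(&node_sem[me]);	/* sleep: "take the lock" */
		printf("node %d runs cycle %d\n", me, cycle);
		sem_post(&node_sem[next]);	/* unlock the next node */
	}
	return NULL;
}

int main(void)
{
	pthread_t t[NODES];
	int id[NODES];
	int i;

	for (i = 0; i < NODES; i++) {
		sem_init(&node_sem[i], 0, 0);	/* start "locked" */
		id[i] = i;
		pthread_create(&t[i], NULL, client, &id[i]);
	}
	sem_post(&node_sem[0]);	/* the external (timer/audio) trigger */
	for (i = 0; i < NODES; i++)
		pthread_join(t[i], NULL);
	return 0;
}

(Build with -lpthread.)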
> [...] it's still necessary for the caller to go to sleep explicitly,
> it's still necessary for the tasks involved to know about the futexes,
> which actually are really irrelevant - there are no conditions to
> satisfy, just a series of tasks we want to run.
well, no. Unless i misunderstood your application model, you want
threads to sleep until they are woken up. So you want a very basic
sleep/wake mechanism. But yield_to() does not achieve that! yield_to()
will yield to _already running_ (i.e. in the runqueue) threads. Using
yield() (or yield_to()) for this is really suboptimal. By using a futex
based mechanism you get a very nice schedule/sleep pattern.
(you could also use kill()/sigwait(), but that is slower than futexes.)
Ingo
* Paul Davis <[email protected]> wrote:
> Just a reminder: setuid root is precisely what we are attempting to
> avoid.
>
> >If you have the source code for the programs then they could be modified
> >to drop the root euid after they've changed policy. Or even do the
>
> This is insufficient, since they need to be able to drop RT scheduling
> and then reacquire it again later.
i believe RT-LSM provides a way to solve this cleanly: you can make your
audio app setgid-audio (note: NOT setuid), and make the audio group
have CAP_SYS_NICE-equivalent privilege via the RT-LSM, and then you
could have a finegrained per-app way of enabling SCHED_FIFO scheduling,
without giving _users_ the blanket permission to SCHED_FIFO. Ok?
this way if jackd (or a client) gets run by _any_ user, all jackd
processes will be part of the audio group and can do SCHED_FIFO - but
users are not automatically trusted with SCHED_FIFO.
you are currently using RT-LSM to enable a user to do SCHED_FIFO, right?
I think the above mechanism is more secure and more finegrained than
that.
Ingo
>(it would be cleaner to use POSIX semaphores for this, but you mentioned
>the requirement for the mechanism to work on 2.4 kernels too - pthread
>spinlocks will work inter-process on 2.4 too, and will schedule nicely.)
can't work. pthread interprocess spinlocks are hopelessly non-RT safe
in 2.4/linuxthreads. or they were the last time i looked - they rely
on sending signals and passing through the whole pthread layer.
this is why we use FIFO's right now: fast, portable (though not to OSX
<grin> - we have to use Mach ports to get enough speed there), and a
PITA :)
>well, no. Unless i misunderstood your application model, you want
>threads to sleep until they are woken up. So you want a very basic
>sleep/wake mechanism. But yield_to() does not achieve that! yield_to()
>will yield to _already running_ (i.e. in the runqueue) threads. Using
>yield() (or yield_to()) for this is really suboptimal. By using a futex
>based mechanism you get a very nice schedule/sleep pattern.
i mentioned earlier today in a message to bill huey that JACK is
really a user-space cooperative scheduler. JACK's "scheduler" knows
who is doing what and when, and if it doesn't then it can't work at
all. so the scenario you describe is impossible and/or broken.
please keep in mind this description of JACK. there is no doubt some
remnant of the work i did on scheduler activations in the mid-90's
left floating around in my head and shaping JACK in mysterious ways :)
--p "kernel-to-user-space: thread <tid> blocked on disk i/o. please advise."
* Bill Huey <[email protected]> wrote:
> > but Jack is right in practical terms: the audio folks achieved pretty
> > good results with the current IRQ threading mechanism, partly due to the
> > fact that the audio stack doesnt use softirqs, so all the
> > latency-critical activities are in the audio IRQ thread and the
> > application itself.
>
> It's clever that they do that, but additional control is needed in the
> future. jackd isn't the most sophisticated media app on this planet (not
> too much of an insult :)) [...]
i think you are underestimating Jack - it is easily amongst the most
sophisticated audio frameworks in existence, and it certainly has one of
the most robust designs. Just shop around on google for Jack-based audio
applications. What i'd love to see is more integration (and cooperation)
between the audio frameworks of desktop projects (KDE, Gnome) and Jack.
Ingo
Paul Davis wrote:
>
> There are several kernel-side attributes that would make JACK better from
> my perspective:
>
> * better ways to acquire and release RT scheduling
I'm no expert on the topic but it would seem to me that the mechanisms
associated with the capable() function are intended to provide a
consistent and extensible interface to the control of privileged
operations with possibly finer-grained control than "root 'yes' and
everybody else 'no'". Maybe the way to solve this problem is to modify
the interpretation of capable(CAP_SYS_NICE) so that it returns true when
invoked by a task setuid to a nominated uid in addition to zero?
By default, this additional uid would be set to zero (i.e. no change to
current capabilities) but a mechanism to allow a suitably privileged
user to change it could be provided. Programs which the sysadmin wishes
to be allowed to acquire RT scheduling even when used by ordinary users
could be setuid to this "RT user". If the account for the "RT user" was
properly configured (e.g. not allowed to log in, no home directory,
etc.) then the damage that could be done by tasks run as setuid "RT
user" would be limited.
Peter
PS Maybe SELinux already provides this functionality or something better?
--
Peter Williams [email protected]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
* Paul Davis <[email protected]> wrote:
> >well, no. Unless i misunderstood your application model, you want
> >threads to sleep until they are woken up. So you want a very basic
> >sleep/wake mechanism. But yield_to() does not achieve that! yield_to()
> >will yield to _already running_ (i.e. in the runqueue) threads. Using
> >yield() (or yield_to()) for this is really suboptimal. By using a futex
> >based mechanism you get a very nice schedule/sleep pattern.
>
> i mentioned earlier today in a message to bill huey that JACK is
> really a user-space cooperative scheduler. JACK's "scheduler" knows
> who is doing what and when, and if it doesn't then it can't work at
> all. so the scenario you describe is impossible and/or broken.
that might be all well and good, but i believe you still dont understand
my point: for yield_to() to work the target task _needs to be running_.
I.e. it needs to be in TASK_RUNNING state. You cannot yield_to() a
sleeping task out of thin air: any 'targeted wakeup' needs some wait
object. Either a futex, a signal or a fifo, or something else. The waker
has to identify the wait object somehow.
so if you want to use yield_to() (which is a targeted variant of
sched_yield()) for this purpose, it wont and cannot work. Maybe you
thought of some other API, but yield_to() just doesnt cut it.
in theory it would be possible to add two new syscalls: sys_suspend()
and sys_wakeup(tid), where suspend would just enter TASK_INTERRUPTIBLE
without being on any waitqueue, and sys_wakeup() would just do a
process_wakeup() of the target task (if the target task is in
TASK_INTERRUPTIBLE state). But this would probably only be marginally
faster than futexes, and you'd still have all the problems with not
having this API on 2.4 kernels. But it would have one big advantage: it
would be evidently and trivially RT-safe :-)
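For comparison, a rough userspace approximation of that suspend/wake
pair with a raw futex on 2.6; the shared flag, the helper wrapper and
the single-waiter assumption are all illustrative, not an existing API:

#include <linux/futex.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static int wake_flag;	/* would live in shared memory inter-process */

static long futex(int *uaddr, int op, int val)
{
	return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* "sys_suspend": sleep until the flag is raised, then consume it. */
static void suspend_me(void)
{
	while (!__sync_bool_compare_and_swap(&wake_flag, 1, 0))
		futex(&wake_flag, FUTEX_WAIT, 0);  /* sleeps only while 0 */
}

/* "sys_wakeup(tid)": raise the flag and wake one sleeper. */
static void wake_peer(void)
{
	__sync_bool_compare_and_swap(&wake_flag, 0, 1);
	futex(&wake_flag, FUTEX_WAKE, 1);
}

int main(void)
{
	wake_peer();	/* pretend a peer task woke us */
	suspend_me();	/* returns at once: the flag was already set */
	return 0;
}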
Ingo
>that might be all well and good, but i believe you still dont understand
>my point: for yield_to() to work the target task _needs to be running_.
correct, i did not understand. perhaps Con didn't either. my idea was
related to:
>in theory it would be possible to add two new syscalls: sys_suspend()
>and sys_wakeup(tid), where suspend would just enter TASK_INTERRUPTIBLE
but more like:
sys_suspend_and_wake (tid)
where current enters TASK_INTERRUPTIBLE, and process_wakeup() is
called on tid.
>having this API on 2.4 kernels. But it would have one big advantage: it
>would be evidently and trivially RT-safe :-)
no small advantage.
it has another big advantage from the user space perspective: no other
information is required apart from <tid>. no state needs to be
maintained by the system that uses this. that's a huge win over the
baroque collection of FIFOs (or futexes) that we have to look after now.
--p
* Paul Davis <[email protected]> wrote:
> >having this API on 2.4 kernels. But it would have one big advantage: it
> >would be evidently and trivially RT-safe :-)
>
> no small advantage.
>
> it has another big advantage from the user space perspective: no other
> information is required apart from <tid>. no state needs to be
> maintained by the system that uses this. thats a huge win over the
> baroque collection of FIFOs (or futexes) that we have to look after
> now.
ok, i'll whip up something after 2.6.11.
Ingo
On Thu, Feb 03, 2005 at 10:41:33PM +0100, Ingo Molnar wrote:
> * Bill Huey <[email protected]> wrote:
> > It's clever that they do that, but additional control is needed in the
> > future. jackd isn't the most sophisticated media app on this planet (not
> > too much of an insult :)) [...]
>
> i think you are underestimating Jack - it is easily amongst the most
> sophisticated audio frameworks in existence, and it certainly has one of
> the most robust designs. Just shop around on google for Jack-based audio
> applications. What i'd love to see is more integration (and cooperation)
> between the audio frameworks of desktop projects (KDE, Gnome) and Jack.
This is a really long-winded and long-standing off-topic gripe I have with
general application development under Linux. The only way I'm going to
get folks to understand my position on it is if I code it up in my
implementation language of choice with my own APIs.
There's a TON more that can be done with QoS in the kernel (EDL schedulers),
DSP JIT compiler techniques and other kernel things that can support
pro-audio. I simply can't get to it yet until the RT patch has a few more
goodies and I'm permitted to do this as my next project.
I had a crazy prototype of some DSP graph system (in C++) I wrote years
ago for 3D audio where I'm drawing my knowledge from and it's getting
time to resurrect it again if I'm going to provide a proof of concept
to push an edge.
Also, think: people working with the RT patch are also ignoring frame
accurate video and many other things that just haven't been done yet,
since the patch is so new and there hasn't been more interest from
folks yet regarding it. I suspect that it's because folks don't
know about it yet.
bill
* Ingo Molnar ([email protected]) wrote:
> i believe RT-LSM provides a way to solve this cleanly: you can make your
> audio app setgid-audio (note: NOT setuid), and make the audio group
> have CAP_SYS_NICE-equivalent privilege via the RT-LSM, and then you
> could have a finegrained per-app way of enabling SCHED_FIFO scheduling,
> without giving _users_ the blanket permission to SCHED_FIFO. Ok?
>
> this way if jackd (or a client) gets run by _any_ user, all jackd
> processes will be part of the audio group and can do SCHED_FIFO - but
> users are not automatically trusted with SCHED_FIFO.
>
> you are currently using RT-LSM to enable a user to do SCHED_FIFO, right?
> I think the above mechanism is more secure and more finegrained than
> that.
No, rt-lsm is actually gid based.
thanks,
-chris
--
Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net
On Thu, 03 Feb 2005 21:47:11 +0100, Ingo Molnar wrote:
>
> * Con Kolivas <[email protected]> wrote:
>
>> > * real inter-process handoff. i am thinking of something like
>> > sched_yield(), but it would take a TID as the target
>> > of the yield. this would avoid all the crap we have to
>> > go through to drive the graph of clients with FIFO's and
>> > write(2) and poll(2). Futexes might be a usable
>> > approximation in 2.6 (we are supporting 2.4, so we can't
>> > use them all the time)
>>
>> yield_to(tid) should not be too hard to implement. Ingo? What do you
>> think?
>
> i dont really like it - it's really the wrong interface to use.
Would it be nice to create scheduling domains, so that processes in a
domain have priority relative to each other? The second highest priority
FIFO task could actually be a scheduling sub-domain (which appears to be a
child process of task A). In that scheduling sub-domain would be all the
real-time tasks that A manages the scheduling of. It gets to set their
relative priorities and sched class.
Task A could then make the first FIFO client be at the top of that domain,
with the second task to run being at the next priority, etc. Then Task A
blocks waiting for a watchdog timer so it can keep an eye on things. This
way yield_to would actually be almost a special case of a pre-configurable
sequence of operations which can have FIFO parts, RR parts, and OTHER
parts. Task A could also put one of its own threads in the sub-domain to
be able to change its state at a certain point in the sequence rather than
a certain time.
This could be extended to allow init to own a sub-domain for each user
that it could manage to allow users to each have real-time operations, and
daemons with high priorities relative to that user's other tasks, but
without interfering with other users (or giving root an automatic
advantage). It could make accounting easier, and allow users to get total
priority equal to the amount of money they pay.
I know this could be a lot of work, but is it sound in principle?
How would you do better than RR between users with domains at the same
priority though... That I'm not sure of. IE, it might be good to switch
back to any domain that has a recently woken task more frequently but for
less time, then extend the time and decrease the frequency as the time
since waking increases. That way users' tasks could respond quickly to
events without ever being able to cause or suffer from starvation.
Unfortunately, O(1) behaviour would be a must, and that scheme could be
particularly hard to implement in O(1) :)
Is it useful, too much work, over-engineered, logically impossible?
--
Tristan Wibberley
Peter Williams <[email protected]> writes:
> Paul Davis wrote:
>> There are several kernel-side attributes that would make JACK better
>> from my perspective:
>> * better ways to acquire and release RT scheduling
>
> I'm no expert on the topic but it would seem to me that the mechanisms
> associated with the capable() function are intended to provide a
> consistent and extensible interface to the control of privileged
> operations with possible finer grained control than "root 'yes' and
> everybody else 'no'". Maybe the way to solve this problem is to
> modify the interpretation of capable(CAP_SYS_NICE) so that it returns
> true when invoked by a task setuid to a nominated uid in addition to
> zero?
That is essentially what the RT-LSM does. At exec() time RT-LSM turns
on CAP_SYS_NICE for appropriate process images.
In the current implementation this is only done per-group not
per-user. Adding UID as well as GID granularity should be easy. We
didn't do it because we didn't really need it. If there's a use for
it, I have no objection to adding it. It could even compatibly be
added later.
Many distributions require users to join group `audio' anyway to gain
access to the sound card. We found it convenient to piggy-back on
that mechanism.
I believe Paul considers this adequate for his requirements. :-)
--
joq
Ingo Molnar <[email protected]> writes:
> i believe RT-LSM provides a way to solve this cleanly: you can make your
> audio app setgid-audio (note: NOT setuid), and make the audio group
> have CAP_SYS_NICE-equivalent privilege via the RT-LSM, and then you
> could have a finegrained per-app way of enabling SCHED_FIFO scheduling,
> without giving _users_ the blanket permission to SCHED_FIFO. Ok?
Yes, we designed the module with this scenario specifically in mind.
> this way if jackd (or a client) gets run by _any_ user, all jackd
> processes will be part of the audio group and can do SCHED_FIFO - but
> users are not automatically trusted with SCHED_FIFO.
>
> you are currently using RT-LSM to enable a user to do SCHED_FIFO, right?
> I think the above mechanism is more secure and more finegrained than
> that.
We *are* doing that (based on group membership). We designed it just
as you say. And it works fine for Qt and command line clients.
Unfortunately, GTK+ refuses to cooperate. It has a special check at
startup (in gtkmain)...
  if (ruid != euid || ruid != suid ||
      rgid != egid || rgid != sgid)
    {
      g_warning ("This process is currently running setuid or setgid.\n"
                 "This is not a supported use of GTK+. You must create a helper\n"
                 "program instead. For further details, see:\n\n"
                 " http://www.gtk.org/setuid.html\n\n"
                 "Refusing to initialize GTK+.");
      exit (1);
    }
Note that this calls *exit(1)*, not just returning an error code.
Following the suggested URL, <http://www.gtk.org/setuid.html>, reveals
their understandable, but basically wrong-headed, rationale...
GTK+ supports the environment variable GTK_MODULES which specifies
arbitrary dynamic modules to be loaded and executed when GTK+ is
initialized. It is somewhat similar to the LD_PRELOAD environment
variable. However, this (and similar functionality such as
specifying theme engines) is not disabled when running setuid or
setgid. Is this a security hole? No. Writing setuid and setgid
programs using GTK+ is a bad idea and will never be supported by the
GTK+ team.
They are wrong (IMHO), because these kinds of security tests *cannot*
reliably be done in userspace. They are not testing for possession of
privileges, but merely disallowing two of a half-dozen ways of
granting those privileges. Why should it be OK to run GTK as `root',
but not as setgid `audio'? Ironically, people don't run GTK threads
with SCHED_FIFO. Those are precisely the threads over which the
signal processing threads need to have priority.
This "feature" has forced us to fall back on supplementary groups for
our main authorization mechanism. That is unfortunate because, as you
say, the setgid() approach has finer granularity, which is better.
So, that GTK test has the unintended consequence of making our
security exposure larger, not smaller.
We can live with this, mainly because our users often need
supplementary membership in group `audio' anyway, to gain access to
the sound card.
If we can ever convince the GTK developers to remove this "feature",
the RT-LSM handles setgid() correctly. So, we could immediately start
using it (at least on systems with a new enough GTK library to permit
that).
--
joq
Jack O'Quin wrote:
> Peter Williams <[email protected]> writes:
>
>
>>Paul Davis wrote:
>>
>>>There are several kernel-side attributes that would make JACK better
>>>from my perspective:
>>> * better ways to acquire and release RT scheduling
>>
>>I'm no expert on the topic but it would seem to me that the mechanisms
>>associated with the capable() function are intended to provide a
>>consistent and extensible interface to the control of privileged
>>operations with possible finer grained control than "root 'yes' and
>>everybody else 'no'". Maybe the way to solve this problem is to
>>modify the interpretation of capable(CAP_SYS_NICE) so that it returns
>>true when invoked by a task setuid to a nominated uid in addition to
>>zero?
>
>
> That is essentially what the RT-LSM does. At exec() time RT-LSM turns
> on CAP_SYS_NICE for appropriate process images.
>
> In the current implementation this is only done per-group not
> per-user. Adding UID as well as GID granularity should be easy. We
> didn't do it because we didn't really need it. If there's a use for
> it, I have no objection to adding it. It could even compatibly be
> added later.
If what you have is adequate I wouldn't suggest changing it. My use of
uid in my rant was just to illustrate a general idea.
>
> Many distributions require users to join group `audio' anyway to gain
> access to the sound card. We found it convenient to piggy-back on
> that mechanism.
>
> I believe Paul considers this adequate for his requirements. :-)
Peter
--
Peter Williams [email protected]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
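To make the capable() idea concrete, here is a minimal sketch of what
such a hook might look like as a 2.6-era LSM. This is illustrative
only, not the actual RT-LSM source (which grants CAP_SYS_NICE at
exec() time instead); rt_gid is a hypothetical module parameter.

/* Illustrative sketch, not the RT-LSM source: grant CAP_SYS_NICE
 * to tasks in a nominated group, and fall back to the default
 * capability check for everything else. */
#include <linux/module.h>
#include <linux/security.h>
#include <linux/sched.h>

static int rt_gid = -1;         /* nominated group; -1 disables */
module_param(rt_gid, int, 0644);

static int rt_capable(struct task_struct *tsk, int cap)
{
        if (cap == CAP_SYS_NICE && rt_gid != -1 && tsk == current &&
            (current->gid == rt_gid || in_group_p(rt_gid)))
                return 0;                       /* privilege granted */
        return cap_capable(tsk, cap);           /* default behaviour */
}

static struct security_operations rt_ops = {
        .capable = rt_capable,
};

static int __init rt_init(void)
{
        return register_security(&rt_ops);
}

module_init(rt_init);
MODULE_LICENSE("GPL");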
[ Cc:s trimmed, added abiss-general ]
Con Kolivas wrote:
> Possibly reiserfs journal related. That has larger non-preemptible code
> sections.
If I understand your workload right, it should consist mainly of
computation, networking (?), and disk reads.
I don't know much about ReiserFS, but in some experiments with ext3,
using ABISS, we found that a reader application competing with best
effort readers would experience worst-case delays of dozens of
milliseconds.
They were caused by journaled atime updates. Mounting the file
system with "noatime" reduced delays to a few hundred microseconds
(still worst-case).
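A minimal sketch of applying that option from C -- the device and
mount point here are invented; in practice you would just add
"noatime" to the options field in /etc/fstab:

/* Sketch: remount an ext3 filesystem with noatime, the C
 * equivalent of "mount -o remount,noatime /mnt" (needs root). */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
        if (mount("/dev/hda2", "/mnt", "ext3",
                  MS_REMOUNT | MS_NOATIME, NULL) == -1) {
                perror("mount");
                return 1;
        }
        return 0;
}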
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/
Werner Almesberger <[email protected]> writes:
> [ Cc:s trimmed, added abiss-general ]
>
> Con Kolivas wrote:
>> Possibly reiserfs journal related. That has larger non-preemptible code
>> sections.
>
> If I understand your workload right, it should consist mainly of
> computation, networking (?), and disk reads.
jack_test3.2 is basically a multiprocess realtime audio test: a
fair amount of computation, significant task-switch overhead, but
most I/O goes to the sound card.
There's some disk activity starting clients and probably some other
system activity in the background.
> I don't know much about ReiserFS, but in some experiments with ext3,
> using ABISS, we found that a reader application competing with best
> effort readers would experience worst-case delays of dozens of
> milliseconds.
>
> They were caused by journaled atime updates. Mounting the file
> system with "noatime" reduced delays to a few hundred microseconds
> (still worst-case).
Interesting. Worth a try to verify. Con was seeing a 6msec delay
every 20 seconds. This was devastating to the test, which tries to
run a full realtime audio cycle every 1.45msec.
--
joq
Jack O'Quin wrote:
> Werner Almesberger <[email protected]> writes:
>
>
>>[ Cc:s trimmed, added abiss-general ]
>>
>>Con Kolivas wrote:
>>
>>>Possibly reiserfs journal related. That has larger non-preemptible code
>>>sections.
>>
>>If I understand your workload right, it should consist mainly of
>>computation, networking (?), and disk reads.
>
>
> jack_test3.2 is basically a multiprocess realtime audio test: a
> fair amount of computation, significant task-switch overhead, but
> most I/O goes to the sound card.
>
> There's some disk activity starting clients and probably some other
> system activity in the background.
>
>
>>I don't know much about ReiserFS, but in some experiments with ext3,
>>using ABISS, we found that a reader application competing with best
>>effort readers would experience worst-case delays of dozens of
>>milliseconds.
>>
>>They were caused by journaled atime updates. Mounting the file
>>system with "noatime" reduced delays to a few hundred microseconds
>>(still worst-case).
>
>
> Interesting. Worth a try to verify. Con was seeing a 6msec delay
> every 20 seconds. This was devastating to the test, which tries to
> run a full realtime audio cycle every 1.45msec.
They already were mounted noatime :(
Con
On Thu, 2005-02-03 at 22:41 +0100, Ingo Molnar wrote:
> >
> > It's clever that they do that, but additional control is needed in the
> > future. jackd isn't the most sophisticated media app on this planet (not
> > too much of an insult :)) [...]
>
> i think you are underestimating Jack - it is easily amongst the most
> sophisticated audio frameworks in existence, and it certainly has one of
> the most robust designs. Just shop around on google for Jack-based audio
> applications. What i'd love to see is more integration (and cooperation)
> between the audio frameworks of desktop projects (KDE, Gnome) and Jack.
JACK was not designed as a general purpose sound server; its main goals
were sample-accurate synchronization and low latency, which a general
purpose desktop sound server does not need. But JACK does provide a
superset of the needed functionality - if you can do low latency, you
can handle high latency/buffering just as easily, and sample-accurate
sync will not break apps that don't need it.
The main obstacle to JACK-ifying everything is that it requires audio
apps to conform to the callback-based JACK programming model.
JACK-ifying a complex app that expects to be able to read and write
audio whenever it wants amounts to a complete rewrite. But simpler
apps like XMMS can use a layer on top of JACK like the bio2jack library.
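To make the contrast concrete, here is a minimal sketch of the
callback model, using the current jack_client_open() entry point
(client and port names are invented). Instead of reading and writing
audio whenever it likes, the app registers a process() callback that
JACK calls from its realtime thread once per period, and that
callback must never block:

/* Sketch of a minimal JACK client: all audio work happens in the
 * process() callback, driven by JACK's realtime thread. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <jack/jack.h>

static jack_port_t *out_port;

/* Invoked once per period from the RT thread; must not block
 * (no malloc, no disk or network I/O, no blocking locks). */
static int process(jack_nframes_t nframes, void *arg)
{
        jack_default_audio_sample_t *out =
                jack_port_get_buffer(out_port, nframes);
        memset(out, 0, nframes * sizeof(*out));  /* write silence */
        return 0;
}

int main(void)
{
        jack_client_t *client;

        client = jack_client_open("sketch", JackNullOption, NULL);
        if (client == NULL) {
                fprintf(stderr, "cannot contact the JACK server\n");
                return 1;
        }
        jack_set_process_callback(client, process, NULL);
        out_port = jack_port_register(client, "out", JACK_DEFAULT_AUDIO_TYPE,
                                      JackPortIsOutput, 0);
        jack_activate(client);  /* process() starts being called */
        sleep(10);              /* audio runs in JACK's thread meanwhile */
        jack_client_close(client);
        return 0;
}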
Anyway, Linspire (formerly L*nd*ws, formerly ...) will be using JACK as
the sound server in their next release. And GNOME is moving to
gstreamer which can use JACK as a backend.
Bill has a good point: JACK is really just scratching the surface of
the myriad possibilities that good realtime support will open up. For
example, apps like mplayer, whose broken single-threaded design
completely ignores the RT constraints inherent in AV playback, will
be corrected.
Lee