2001-12-20 21:08:45

by Davide Libenzi

[permalink] [raw]
Subject: [RFC] Scheduler issue 1, RT tasks ...


I'd like to have some comments about the RT task implementation in an SMP
system, because POSIX is not clear about how the priority rules apply to
multiprocessor systems.
The Balanced Multi Queue Scheduler ( BMQS, http://www.xmailserver.org/linux-patches/mss-2.html )
I'm working on tries to keep the CPU schedulers as independent as
possible, and currently implements two kinds of RT tasks: local ones and
global ones.
Local RT tasks apply the POSIX priority rules inside the local CPU; that
means an RT task running on CPU0 cannot preempt another task ( be it
normal or RT ) on CPU1. This keeps scheduler interlocking very low
because of the very fast path in reschedule_idle() ( no multi-lock
acquisition, CPU queue loops, etc... ).
Global RT tasks, which live in a separate run queue, have the ability to
preempt a remote CPU, and this can lead ( in the unfortunate case that the
last CPU that ran the RT task is running another RT task ) to a higher
cost in reschedule_idle().
The check for a global RT task selection is done very quickly, before
checking the local queue :

	if (!list_empty(&runqueue_head(RT_QID)))
		goto rt_queue_select;
rt_queue_select_back:

and this does not affect the scheduler latency at all.
On the contrary, having a separate queue for global RT tasks can
improve it under high run queue loads.
The local/global RT task selection is done with setscheduler(), with a new
( OR'ed ) flag SCHED_RTGLOBAL, which means the default is local RT
tasks.
I'd like to have comments on this before jumping to the next Scheduler
issue ( balancing mode ).




- Davide



2001-12-20 22:26:09

by George Anzinger

[permalink] [raw]
Subject: Re: [RFC] Scheduler issue 1, RT tasks ...

Davide Libenzi wrote:
>
> I'd like to have some comments about RT tasks implementation in a SMP
> system because POSIX it's not clear about how the priority rules apply to
> multiprocessor systems.
> The Balanced Multi Queue Scheduler ( BMQS, http://www.xmailserver.org/linux-patches/mss-2.html )
> i'm working on tries to keep CPU schedulers the more independent as
> possible and currently implements two kind of RT tasks, local one and
> global ones.
> Local RT tasks apply POSIX priority rules inside the local CPU, that means
> that an RT task running on CPU0 cannot preempt another task ( being it
> normal or RT ) on CPU1. This keeps schedulers interlocking very low
> because of the very fast path in reschedule_idle() ( no multi lock
> acquisition, CPU queue loops, etc...).
> Global RT tasks, that live in a separate run queue, have the ability to
> preempt remote CPU and this can lead ( in the unfortunate case that the
> last CPU running the RT task is running another RT task ) to an higher
> cost in reschedule_idle().
> The check for a global RT task selection is done in a very fast way before
> checking the local queue :
>
> if (!list_empty(&runqueue_head(RT_QID)))
> goto rt_queue_select;
> rt_queue_select_back:
>
> and this does not affect the scheduler latency at all.
> On the contrary, by having a separate queue for global RT tasks, can
> improve it in high run queue load cases.
> The local/global RT task selection is done with setscheduler() with a new
> ( or'ed ) flag SCHED_RTGLOBAL, and this means that the default is RT task
> local.
> I'd like to have comments on this before jumping to the next Scheduler
> issue ( balancing mode ).
>
My understanding of the POSIX standard is that the highest priority
task(s) are to get the CPU(s) using the standard calls. If you want to
deviate from this, I think the standard allows extensions, but they IMHO
should be requested, not the default, so I would turn your flag around
to force LOCAL, not GLOBAL.

--
George [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Real time sched: http://sourceforge.net/projects/rtsched/

2001-12-20 22:37:18

by Momchil Velikov

[permalink] [raw]
Subject: Re: [RFC] Scheduler issue 1, RT tasks ...

>>>>> "George" == george anzinger <[email protected]> writes:

George> Davide Libenzi wrote:
>> Local RT tasks apply POSIX priority rules inside the local CPU, that means
>> that an RT task running on CPU0 cannot preempt another task ( being it
>> normal or RT ) on CPU1.
[...]
>> Global RT tasks, that live in a separate run queue, have the ability to
>> preempt remote CPU and this can lead.
[...]
>> The local/global RT task selection is done with setscheduler() with a new
>> ( or'ed ) flag SCHED_RTGLOBAL, and this means that the default is RT task
>> local.

George> My understanding of the POSIX standard is the the highest priority
George> task(s) are to get the cpu(s) using the standard calls. If you want to
George> deviate from this I think the standard allows extensions, but they IMHO
George> should be requested, not the default, so I would turn your flag around
George> to force LOCAL, not GLOBAL.

I'd like to second that. IMHO, RT task scheduling should trade
throughput for latency, and if someone wants priority inversion, let
him request it explicitly.

Regards,
-velco

2001-12-20 22:33:28

by Davide Libenzi

[permalink] [raw]
Subject: Re: [RFC] Scheduler issue 1, RT tasks ...

On Thu, 20 Dec 2001, george anzinger wrote:

> Davide Libenzi wrote:
> >
> > I'd like to have some comments about RT tasks implementation in a SMP
> > system because POSIX it's not clear about how the priority rules apply to
> > multiprocessor systems.
> > The Balanced Multi Queue Scheduler ( BMQS, http://www.xmailserver.org/linux-patches/mss-2.html )
> > i'm working on tries to keep CPU schedulers the more independent as
> > possible and currently implements two kind of RT tasks, local one and
> > global ones.
> > Local RT tasks apply POSIX priority rules inside the local CPU, that means
> > that an RT task running on CPU0 cannot preempt another task ( being it
> > normal or RT ) on CPU1. This keeps schedulers interlocking very low
> > because of the very fast path in reschedule_idle() ( no multi lock
> > acquisition, CPU queue loops, etc...).
> > Global RT tasks, that live in a separate run queue, have the ability to
> > preempt remote CPU and this can lead ( in the unfortunate case that the
> > last CPU running the RT task is running another RT task ) to an higher
> > cost in reschedule_idle().
> > The check for a global RT task selection is done in a very fast way before
> > checking the local queue :
> >
> > if (!list_empty(&runqueue_head(RT_QID)))
> > goto rt_queue_select;
> > rt_queue_select_back:
> >
> > and this does not affect the scheduler latency at all.
> > On the contrary, by having a separate queue for global RT tasks, can
> > improve it in high run queue load cases.
> > The local/global RT task selection is done with setscheduler() with a new
> > ( or'ed ) flag SCHED_RTGLOBAL, and this means that the default is RT task
> > local.
> > I'd like to have comments on this before jumping to the next Scheduler
> > issue ( balancing mode ).
> >
> My understanding of the POSIX standard is the the highest priority
> task(s) are to get the cpu(s) using the standard calls. If you want to
> deviate from this I think the standard allows extensions, but they IMHO
> should be requested, not the default, so I would turn your flag around
> to force LOCAL, not GLOBAL.

So, you're basically saying that for better standards compliance it's
better to have a global preemption policy by default, and to have users
request RT task localization explicitly. That's fine with me.




- Davide


2001-12-20 22:55:20

by Davide Libenzi

[permalink] [raw]
Subject: Re: [RFC] Scheduler issue 1, RT tasks ...

On 21 Dec 2001, Momchil Velikov wrote:

> >>>>> "George" == george anzinger <[email protected]> writes:
>
> George> Davide Libenzi wrote:
> >> Local RT tasks apply POSIX priority rules inside the local CPU, that means
> >> that an RT task running on CPU0 cannot preempt another task ( being it
> >> normal or RT ) on CPU1.
> [...]
> >> Global RT tasks, that live in a separate run queue, have the ability to
> >> preempt remote CPU and this can lead.
> [...]
> >> The local/global RT task selection is done with setscheduler() with a new
> >> ( or'ed ) flag SCHED_RTGLOBAL, and this means that the default is RT task
> >> local.
>
> George> My understanding of the POSIX standard is the the highest priority
> George> task(s) are to get the cpu(s) using the standard calls. If you want to
> George> deviate from this I think the standard allows extensions, but they IMHO
> George> should be requested, not the default, so I would turn your flag around
> George> to force LOCAL, not GLOBAL.
>
> I'd like to second that, IMHO the RT task scheduling should trade
> throughput for latency, and if someone wants priority inversion, let
> him explicitly request it.

Not a great performance loss anyway. There is zero performance loss if the
CPU that last ran the woken-up RT task is not running another RT task
( very probable ). If the last CPU of the woken-up task is running
another RT task, a CPU discovery loop ( like the current scheduler's )
must be triggered. Not a big deal anyway.




- Davide


2001-12-21 17:03:39

by Mike Kravetz

[permalink] [raw]
Subject: Re: [RFC] Scheduler issue 1, RT tasks ...

On Thu, Dec 20, 2001 at 02:57:55PM -0800, Davide Libenzi wrote:
> On 21 Dec 2001, Momchil Velikov wrote:
> >
> > I'd like to second that, IMHO the RT task scheduling should trade
> > throughput for latency, and if someone wants priority inversion, let
> > him explicitly request it.
>
> No a great performance loss anyway. It's zero performance loss if the CPU
> that has ran the woke up RT task for the last time is not running another
> RT task ( very probable ). If the last CPU of the woke up task is running
> another RT task a CPU discovery loop ( like the current scheduler ) must
> be triggered. Not a great deal anyway.

Some time back, I asked if anyone had any RT benchmarks and got
little response. Performance (latency) degradation for RT tasks
while implementing new schedulers was my concern. Does anyone
have ideas about how we should measure/benchmark this? My
'solution' at the time was to take a scheduler heavy benchmark
like reflex, and simply make all the tasks RT. This wasn't very
'real world', but at least it did allow me to compare scheduler
overhead in the RT paths of various scheduler implementations.

--
Mike

2001-12-21 17:18:01

by Davide Libenzi

[permalink] [raw]
Subject: Re: [RFC] Scheduler issue 1, RT tasks ...

On Fri, 21 Dec 2001, Mike Kravetz wrote:

> On Thu, Dec 20, 2001 at 02:57:55PM -0800, Davide Libenzi wrote:
> > On 21 Dec 2001, Momchil Velikov wrote:
> > >
> > > I'd like to second that, IMHO the RT task scheduling should trade
> > > throughput for latency, and if someone wants priority inversion, let
> > > him explicitly request it.
> >
> > No a great performance loss anyway. It's zero performance loss if the CPU
> > that has ran the woke up RT task for the last time is not running another
> > RT task ( very probable ). If the last CPU of the woke up task is running
> > another RT task a CPU discovery loop ( like the current scheduler ) must
> > be triggered. Not a great deal anyway.
>
> Some time back, I asked if anyone had any RT benchmarks and got
> little response. Performance (latency) degradation for RT tasks
> while implementing new schedulers was my concern. Does anyone
> have ideas about how we should measure/benchmark this? My
> 'solution' at the time was to take a scheduler heavy benchmark
> like reflex, and simply make all the tasks RT. This wasn't very
> 'real world', but at least it did allow me to compare scheduler
> overhead in the RT paths of various scheduler implementations.

Mike, a better real-world test would be to have a variable system run
queue load, wake up an RT task, and measure the latency of the RT task
under various loads.
This can easily be implemented with cpuhog ( which loads the run queue )
plus LatSched ( the scheduler latency sampler ), which will measure the
exact latency in CPU cycles.




- Davide


2001-12-21 17:35:01

by Mike Kravetz

[permalink] [raw]
Subject: Re: [RFC] Scheduler issue 1, RT tasks ...

On Fri, Dec 21, 2001 at 09:19:04AM -0800, Davide Libenzi wrote:
> On Fri, 21 Dec 2001, Mike Kravetz wrote:
>
> > Some time back, I asked if anyone had any RT benchmarks and got
> > little response. Performance (latency) degradation for RT tasks
> > while implementing new schedulers was my concern. Does anyone
> > have ideas about how we should measure/benchmark this? My
> > 'solution' at the time was to take a scheduler heavy benchmark
> > like reflex, and simply make all the tasks RT. This wasn't very
> > 'real world', but at least it did allow me to compare scheduler
> > overhead in the RT paths of various scheduler implementations.
>
> Mike, a better real world test would be to have a variable system runqueue
> load with the wakeup of an rt task and measuring the latency of the rt
> task under various loads.
> This can be easily implemented with cpuhog ( that load the runqueue ) plus
> the LatSched ( scheduler latency sampler ) that will measure the exact
> latency in CPU cycles.

Right! Any ideas on the variable system run queue load? Should those
other tasks be RT or OTHER? A mix? I would suspect that we would
want multiple RT tasks on the run queue, or at least in the system
(otherwise why worry about global semantics?).

However, I would feel better about this if someone had a real world
load involving RT tasks on a SMP system. At least then we could try
to simulate a load someone cares about.

--
Mike

2001-12-21 18:27:47

by Davide Libenzi

[permalink] [raw]
Subject: Re: [RFC] Scheduler issue 1, RT tasks ...

On Fri, 21 Dec 2001, Mike Kravetz wrote:

> On Fri, Dec 21, 2001 at 09:19:04AM -0800, Davide Libenzi wrote:
> > On Fri, 21 Dec 2001, Mike Kravetz wrote:
> >
> > > Some time back, I asked if anyone had any RT benchmarks and got
> > > little response. Performance (latency) degradation for RT tasks
> > > while implementing new schedulers was my concern. Does anyone
> > > have ideas about how we should measure/benchmark this? My
> > > 'solution' at the time was to take a scheduler heavy benchmark
> > > like reflex, and simply make all the tasks RT. This wasn't very
> > > 'real world', but at least it did allow me to compare scheduler
> > > overhead in the RT paths of various scheduler implementations.
> >
> > Mike, a better real world test would be to have a variable system runqueue
> > load with the wakeup of an rt task and measuring the latency of the rt
> > task under various loads.
> > This can be easily implemented with cpuhog ( that load the runqueue ) plus
> > the LatSched ( scheduler latency sampler ) that will measure the exact
> > latency in CPU cycles.
>
> Right! Any ideas on variable system runqueue load? Should those
> other tasks be RT or OTHER? a mix? I would suspect that we would
> want multiple RT tasks on the runqueue or at least in the system
> (otherwise why worry about global semantics?).
>
> However, I would feel better about this if someone had a real world
> load involving RT tasks on a SMP system. At least then we could try
> to simulate a load someone cares about.

In my tests I cap the run queue load at 8 ( per CPU ) for now, since
higher values are somewhat unusual.
A good plot should also have a third dimension: the number of real
time tasks running.
I guess I have to take a better look at the gnuplot docs for 3D graphs :)



- Davide


2001-12-24 00:26:24

by Victor Yodaiken

[permalink] [raw]
Subject: Re: [RFC] Scheduler issue 1, RT tasks ...

On Thu, Dec 20, 2001 at 02:36:07PM -0800, Davide Libenzi wrote:
> > My understanding of the POSIX standard is the the highest priority
> > task(s) are to get the cpu(s) using the standard calls. If you want to
> > deviate from this I think the standard allows extensions, but they IMHO
> > should be requested, not the default, so I would turn your flag around
> > to force LOCAL, not GLOBAL.
>
> So, you're basically saying that for a better standard compliancy it's
> better to have global preemption policy by default. And having users to
> request rt tasks localization explicitly. It's fine for me.

Can you please cite the passages in the standard you have in mind?

>
>
>
>
> - Davide
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2001-12-24 00:25:14

by Victor Yodaiken

[permalink] [raw]
Subject: Re: [RFC] Scheduler issue 1, RT tasks ...



Run an "RT" task that is scheduled every millisecond (or a period of your
choice):

	while (1) {
		read cycle timer
		clock_nanosleep( /* time period, using absolute time */ )
		read cycle timer - what was the actual delay? track the worst case
	}

Run this
a) on an unstressed system
b) under stress
c) while a timed non-RT benchmark runs, to figure out the "RT"
overhead.


On Fri, Dec 21, 2001 at 09:00:15AM -0800, Mike Kravetz wrote:
> On Thu, Dec 20, 2001 at 02:57:55PM -0800, Davide Libenzi wrote:
> > On 21 Dec 2001, Momchil Velikov wrote:
> > >
> > > I'd like to second that, IMHO the RT task scheduling should trade
> > > throughput for latency, and if someone wants priority inversion, let
> > > him explicitly request it.
> >
> > No a great performance loss anyway. It's zero performance loss if the CPU
> > that has ran the woke up RT task for the last time is not running another
> > RT task ( very probable ). If the last CPU of the woke up task is running
> > another RT task a CPU discovery loop ( like the current scheduler ) must
> > be triggered. Not a great deal anyway.
>
> Some time back, I asked if anyone had any RT benchmarks and got
> little response. Performance (latency) degradation for RT tasks
> while implementing new schedulers was my concern. Does anyone
> have ideas about how we should measure/benchmark this? My
> 'solution' at the time was to take a scheduler heavy benchmark
> like reflex, and simply make all the tasks RT. This wasn't very
> 'real world', but at least it did allow me to compare scheduler
> overhead in the RT paths of various scheduler implementations.
>
> --
> Mike

2001-12-24 01:18:24

by Davide Libenzi

[permalink] [raw]
Subject: Re: [RFC] Scheduler issue 1, RT tasks ...

On Sun, 23 Dec 2001, Victor Yodaiken wrote:

> On Thu, Dec 20, 2001 at 02:36:07PM -0800, Davide Libenzi wrote:
> > > My understanding of the POSIX standard is the the highest priority
> > > task(s) are to get the cpu(s) using the standard calls. If you want to
> > > deviate from this I think the standard allows extensions, but they IMHO
> > > should be requested, not the default, so I would turn your flag around
> > > to force LOCAL, not GLOBAL.
> >
> > So, you're basically saying that for a better standard compliancy it's
> > better to have global preemption policy by default. And having users to
> > request rt tasks localization explicitly. It's fine for me.
>
> Can you please cite the passages in the standard you have in mind?

POSIX 1003. The doubt was whether ( since the POSIX standard does not talk
about SMP ) the real-time priorities apply per CPU or to the entire system.
This is because the scheduler I'm working on has two kinds of RT tasks,
local and global ones. Local RT tasks cannot preempt remote CPUs, so if,
for example, an RT task is woken up and its last CPU is running another RT
task with higher priority, the freshly woken task will wait even if other
CPUs are running tasks with lower priority. Global RT tasks will force
remote preemption in case the last CPU that ran the woken-up RT task is
running another higher-priority RT task. Global RT tasks have their own
queue and lock, like CPUs do. My old default was local RT tasks, with
global forced by a setscheduler() flag SCHED_RTGLOBAL, while George
suggested that it's better to default to global and have the local
behavior forced by a SCHED_RTLOCAL flag. I have already changed the code
to default to global.




- Davide


2001-12-24 01:28:16

by Davide Libenzi

[permalink] [raw]
Subject: Re: [RFC] Scheduler issue 1, RT tasks ...

On Sun, 23 Dec 2001, Victor Yodaiken wrote:

>
>
> Run an "RT" task that is scheduled every millisecond (or a period of your
> choice):
>
>	while (1) {
>		read cycle timer
>		clock_nanosleep( /* time period, using absolute time */ )
>		read cycle timer - what was the actual delay? track the worst case
>	}
>
> Run this
> a) on an unstressed system
> b) under stress
> c) while a timed non-RT benchmark runs, to figure out the "RT"
> overhead.

I've coded a test app that uses the LatSched latency patch ( which uses
rdtsc ).
It basically 1) sets the current process priority to RT, 2) does an
ioctl() to activate the scheduler latency sampler, 3) sleeps for 1-2
seconds, 4) does an ioctl() to stop the sampler, and 5) picks the sample
with pid == getpid().
In this way I get the net RT task scheduler latency. Yes, it does not
capture the real latency, which includes ancillary kernel paths, but my
code does not affect those paths, and they would add noise to the
measurement.




- Davide


2001-12-24 05:40:59

by Victor Yodaiken

[permalink] [raw]
Subject: Re: [RFC] Scheduler issue 1, RT tasks ...

On Sun, Dec 23, 2001 at 05:31:11PM -0800, Davide Libenzi wrote:
> On Sun, 23 Dec 2001, Victor Yodaiken wrote:
>
> >
> >
> > Run an "RT" task that is scheduled every millisecond (or a period of your
> > choice):
> >
> >	while (1) {
> >		read cycle timer
> >		clock_nanosleep( /* time period, using absolute time */ )
> >		read cycle timer - what was the actual delay? track the worst case
> >	}
> >
> > Run this
> > a) on an unstressed system
> > b) under stress
> > c) while a timed non-RT benchmark runs, to figure out the "RT"
> > overhead.
>
> I've coded a test app that uses the LatSched latency patch ( that uses
> rdtsc ).
> It basically does 1) set the current process priority to RT 2) an ioctl()
> to activate the scheduler latency sampler 3) sleep for 1-2 secs 4) ioctl()
> to stop the sampler 5) peek the sample with pid == getpid().
> In this way i get the net RT task scheduler latency. Yes it does not get
> the real one that includes accessories kernel paths but my code does not
> affect these ones. And they add noise to the measure.


Seems to me that you are not testing what apps see. Internal benchmarks
are useful only for figuring out how to remove bottlenecks that
affect actual user apps - in my humble opinion, of course.
The nice thing about my benchmark is that it actually tests something
useful - how well you can do periodic tasks. BTW, on RTLinux we get
under 100 microseconds even on a 50MHz PPC860 - 17us on an 800MHz K7.
I'd be happy to see some decent numbers in Linux, but you gotta
measure something more applied.

>
>
>
>
> - Davide
>

2001-12-24 18:50:35

by Davide Libenzi

[permalink] [raw]
Subject: Re: [RFC] Scheduler issue 1, RT tasks ...

On Sun, 23 Dec 2001, Victor Yodaiken wrote:

> On Sun, Dec 23, 2001 at 05:31:11PM -0800, Davide Libenzi wrote:
> > On Sun, 23 Dec 2001, Victor Yodaiken wrote:
> >
> > >
> > >
> > > Run an "RT" task that is scheduled every millisecond (or a period of your
> > > choice):
> > >
> > >	while (1) {
> > >		read cycle timer
> > >		clock_nanosleep( /* time period, using absolute time */ )
> > >		read cycle timer - what was the actual delay? track the worst case
> > >	}
> > >
> > > Run this
> > > a) on an unstressed system
> > > b) under stress
> > > c) while a timed non-RT benchmark runs, to figure out the "RT"
> > > overhead.
> >
> > I've coded a test app that uses the LatSched latency patch ( that uses
> > rdtsc ).
> > It basically does 1) set the current process priority to RT 2) an ioctl()
> > to activate the scheduler latency sampler 3) sleep for 1-2 secs 4) ioctl()
> > to stop the sampler 5) peek the sample with pid == getpid().
> > In this way i get the net RT task scheduler latency. Yes it does not get
> > the real one that includes accessories kernel paths but my code does not
> > affect these ones. And they add noise to the measure.
>
>
> Seems to me that you are not testing what apps see. Internal benchmarks
> are useful only for figuring out how to remove bottlenecks that
> effect actual user apps - in my humble opinion of course.
> The nice thing about my benchmark is that it actually tests something
> useful - how well you can do periodic tasks. BTW, on RTLinux we get
> under 100 microseconds on even 50Mhzx PPC860 - 17us on a 800Mhz K7.
> I'd be happy to see some decent numbers in Linux, but you gotta
> measure something more applied.

I know what you're saying, but my goal now is to fix the scheduler, not
the overall RT latency ( at least not the part that does not depend on
the scheduler ). Just take, for example, your 17us on your 800MHz
machine: on my dual PIII 733MHz, with a run queue length of 4, the
scheduler latency ( with the std scheduler ) is about 0.9us ( the real
one, not lat_ctx ). That means the scheduler's share of your 17us is
about 5%, and the remaining 95% is due to "external" kernel paths. With a
run queue length of 16 ( std scheduler ) the latency peaks at ~2.4us,
raising the scheduler's share to ~14-15%.
I've coded this simple app :

http://www.xmailserver.org/linux-patches/lnxsched.html#RtLats

and I use it with cpuhog ( hi-tech software available at the same link )
to load the run queue. I'm going to plot the measured latency versus the
run queue length. Thanks to OSDLAB I'll have an 8-way machine to run some
tests on these big SMPs. I'll also code the simple app you're proposing,
but the real problem is how to load the system. The cpuhog load is a run
queue load and is "neutral", meaning it is the same on all systems.
Loading the system with other kinds of load can introduce a
device-driver/hardware dependency into the measurement ( more or less run
time with IRQs disabled, for example ).





- Davide



2001-12-27 03:08:32

by Victor Yodaiken

[permalink] [raw]
Subject: Re: [RFC] Scheduler issue 1, RT tasks ...

On Mon, Dec 24, 2001 at 10:52:46AM -0800, Davide Libenzi wrote:
> I know what you're saying but my goal now is to fix the scheduler not the
> overall RT latency ( at least not the one that does not depend on the

my bias is to fix the cause of the problem, but go ahead.


> scheduler ). Just take for example your 17us for your 800MHz machine, in
> my dual PIII 733 MHz with an rqlen of 4 the scheduler latency ( with that
> std scheduler ) is about 0.9us ( real one, not lat_ctx ). That means the
> the scheduler responsibility in your 17us is about 5%, and the remaining
> 95% is due "external" kernel paths. With an rqlen of 16 ( std scheduler )

No: we've measured. The time in our system, which does not follow any
Linux kernel paths, is dominated by motherboard bus delays.

> the latency peaks up to ~2.4us going to ~14-15% of scheduler responsibility.
> I've coded this simple app :
>
> http://www.xmailserver.org/linux-patches/lnxsched.html#RtLats
>
> and i use it with the cpuhog ( hi-tech software that is available inside
> the same link ) to load the run queue. I'm going to plot the measured
> latency versus the runqueue length. Thanks to OSDLAB i'll have an 8 way
> machine to make some test on these big SMPs. I'll code even the simple
> app you're proposing but the real problem is how to load the system. The
> cpuhog load is a runqueue load and is "neutral", that means that is the
> same on all the systems. Loading the system with other kind of loads can
> introduce a device-driver/hw dependency on the measure ( much or less run
> time with irq disabled for example ).

Try
ping -f localhost&
ping -f onsamelocalnet &
dd if=/dev/hda1 of=/dev/null &
make clean; make bzImage;


as a simple start

>
>
>
>
>
> - Davide
>
>
>

2001-12-27 03:49:26

by Victor Yodaiken

[permalink] [raw]
Subject: Re: [RFC] Scheduler issue 1, RT tasks ...

On Sun, Dec 23, 2001 at 05:20:26PM -0800, Davide Libenzi wrote:
> On Sun, 23 Dec 2001, Victor Yodaiken wrote:
>
> > On Thu, Dec 20, 2001 at 02:36:07PM -0800, Davide Libenzi wrote:
> > > > My understanding of the POSIX standard is the the highest priority
> > > > task(s) are to get the cpu(s) using the standard calls. If you want to
> > > > deviate from this I think the standard allows extensions, but they IMHO
> > > > should be requested, not the default, so I would turn your flag around
> > > > to force LOCAL, not GLOBAL.
> > >
> > > So, you're basically saying that for a better standard compliancy it's
> > > better to have global preemption policy by default. And having users to
> > > request rt tasks localization explicitly. It's fine for me.
> >
> > Can you please cite the passages in the standard you have in mind?
>
> POSIX 1003. The doubt was if ( since the POSIX standard does not talk
> about SMP ) the real time priorities apply to CPU or to the entire system.

Right, that was my question. George says, in your words, "for better
standards compliancy ..." and I want to know why you guys think that.

> This because the scheduler i'm working on has two kind of RT tasks, local
> and global ones. Local RT tasks cannot preempt remote CPU so if, for
> example, one RT task is woke up and its last CPU is running another RT
> task with higher priority, the fresly woke up task will wait even if other
> CPUs are running tasks wil lower priority. Global RT task will force
> remote preemption in case the last CPU that ran the woke up RT task is
> running another higher priority RT task. Global RT tasks have their own
> queue and lock like CPUs. My old default was local RT task that was
> forced by a setscheduler() flag SCHED_RTGLOBAL while George suggested that
> it's better to have default global and to have this behavior forced by a
> SCHED_RTLOCAL flag. I already changed the code to default to global.



>
>
>
>
> - Davide
>
>

2001-12-27 17:39:08

by Davide Libenzi

[permalink] [raw]
Subject: Re: [RFC] Scheduler issue 1, RT tasks ...

On Wed, 26 Dec 2001, Victor Yodaiken wrote:

> On Mon, Dec 24, 2001 at 10:52:46AM -0800, Davide Libenzi wrote:
> > I know what you're saying but my goal now is to fix the scheduler not the
> > overall RT latency ( at least not the one that does not depend on the
>
> my bias is to fix the cause of the problem, but go ahead.
>
>
> > scheduler ). Just take for example your 17us for your 800MHz machine, in
> > my dual PIII 733 MHz with an rqlen of 4 the scheduler latency ( with that
> > std scheduler ) is about 0.9us ( real one, not lat_ctx ). That means the
> > the scheduler responsibility in your 17us is about 5%, and the remaining
> > 95% is due "external" kernel paths. With an rqlen of 16 ( std scheduler )
>
> No: we've measured. The time in our system, which does not follow any
> Linux kernel paths, is dominated by motherboard bus delays.

17us of bus delay?!
UP or SMP?
Under what kind of bus load?


> > the latency peaks up to ~2.4us going to ~14-15% of scheduler responsibility.
> > I've coded this simple app :
> >
> > http://www.xmailserver.org/linux-patches/lnxsched.html#RtLats
> >
> > and i use it with the cpuhog ( hi-tech software that is available inside
> > the same link ) to load the run queue. I'm going to plot the measured
> > latency versus the runqueue length. Thanks to OSDLAB i'll have an 8 way
> > machine to make some test on these big SMPs. I'll code even the simple
> > app you're proposing but the real problem is how to load the system. The
> > cpuhog load is a runqueue load and is "neutral", that means that is the
> > same on all the systems. Loading the system with other kind of loads can
> > introduce a device-driver/hw dependency on the measure ( much or less run
> > time with irq disabled for example ).
>
> Try
> ping -f localhost&
> ping -f onsamelocalnet &
> dd if=/dev/hda1 of=/dev/null &
> make clean; make bzImage;
>
>
> as a simple start

Below is the skeleton of a test app, but I need a high-res timer
patch to be able to sleep 2-5ms




- Davide





/*
* rtttest by Davide Libenzi ( linux kernel scheduler rt latency sampler )
* Version 0.16 - Copyright (C) 2001 Davide Libenzi
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
*
* Davide Libenzi <[email protected]>
*
*
* The purpose of this tool is to measure the scheduler latency for
* real time tasks using the "latsched" kernel patch.
* Build:
*
* gcc -o rtttest rtttest.c -lrt
*
* Use:
*
* rtttest [--test-stime s] [--sleep-mstime ms] [--pause-mstime ms] [--priority p]
* [--sched-fifo] [--sched-rr] [-- cmdpath [arg] ...]
*
* --test-stime = Set the test time in seconds
* --sleep-mstime = Set the sleep time in milliseconds
* --pause-mstime = Set the pause time in milliseconds
* --priority = Set the real time task priority ( 1..99 )
* --sched-fifo = Set the real time task policy to FIFO
* --sched-rr = Set the real time task policy to RR
* -- = Separate the optional command to be executed during the test time
* cmdpath = Command to be executed
* arg = Command arguments
*
*/


#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <time.h>
#include <sched.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/timex.h>


#define STD_SLEEP_TIME 4
#define PAUSE_SLEEP_TIME 200
#define STD_TEST_TIME 8


static volatile int stop_test = 0;


void sig_int(int sig)
{
++stop_test;
signal(sig, sig_int);
}


int main(int argc, char *argv[]) {
int ii, icmd = -1, pausetime = PAUSE_SLEEP_TIME, testtime = STD_TEST_TIME,
policy = SCHED_FIFO, priority = 1, sleeptime = STD_SLEEP_TIME, numsamples;
pid_t expid = -1;
cycles_t cys, cye, cylat = 0, mscycles;
cycles_t *samples;
struct sched_param sp;
struct timespec ts1, ts2;

for (ii = 1; ii < argc; ii++) {
if (strcmp(argv[ii], "--test-stime") == 0) {
if (++ii < argc)
testtime = atoi(argv[ii]);
continue;
}
if (strcmp(argv[ii], "--sleep-mstime") == 0) {
if (++ii < argc)
sleeptime = atoi(argv[ii]);
continue;
}
if (strcmp(argv[ii], "--pause-mstime") == 0) {
if (++ii < argc)
pausetime = atoi(argv[ii]);
continue;
}
if (strcmp(argv[ii], "--priority") == 0) {
if (++ii < argc)
priority = atoi(argv[ii]);
continue;
}
if (strcmp(argv[ii], "--sched-fifo") == 0) {
policy = SCHED_FIFO;
continue;
}
if (strcmp(argv[ii], "--sched-rr") == 0) {
policy = SCHED_RR;
continue;
}
if (strcmp(argv[ii], "--") == 0) {
icmd = ++ii;
break;
}
}

numsamples = (testtime * 1000) / pausetime + 1;
if (!(samples = (cycles_t *) malloc(numsamples * sizeof(cycles_t)))) {
perror("malloc");
return 1;
}

if (icmd > 0 && icmd < argc) {
expid = fork();
if (expid == -1) {
perror("fork");
return 5;
} else if (expid == 0) {
setpgid(0, getpid());
execv(argv[icmd], &argv[icmd]);
exit(0);
}
}

memset(&sp, 0, sizeof(sp));
sp.sched_priority = priority;

if (sched_setscheduler(0, policy, &sp)) {
perror("sched_setscheduler");
if (expid > 0 && kill(-expid, SIGKILL))
perror("SIGKILL");
return 4;
}

signal(SIGINT, sig_int);

clock_getres(CLOCK_REALTIME, &ts1);
fprintf(stderr, "timeres=%ld\n", ts1.tv_nsec / 1000);

clock_gettime(CLOCK_REALTIME, &ts1);
cys = get_cycles();
sleep(1);
clock_gettime(CLOCK_REALTIME, &ts2);
cye = get_cycles();
mscycles = (cye - cys) / ((ts2.tv_sec - ts1.tv_sec) * 1000 + (ts2.tv_nsec - ts1.tv_nsec) / 1000000);

for (ii = 0; ii < numsamples && !stop_test; ii++) {
ts1.tv_sec = 0;
ts1.tv_nsec = sleeptime * 1000000;

cys = get_cycles();
clock_nanosleep(CLOCK_REALTIME, 0, &ts1, &ts2);
cye = get_cycles();

samples[ii] = (cye - cys) / mscycles;
if (samples[ii] > cylat)
cylat = samples[ii];

usleep(pausetime * 1000);
}

numsamples = ii;

memset(&sp, 0, sizeof(sp));
sp.sched_priority = 0;

if (sched_setscheduler(0, SCHED_OTHER, &sp)) {
perror("sched_setscheduler");
if (expid > 0 && kill(-expid, SIGKILL))
perror("SIGKILL");
return 6;
}

if (expid > 0 && kill(-expid, SIGKILL))
perror("SIGKILL");

/* dump the collected per-sleep latencies ( ms ) */
for (ii = 0; ii < numsamples; ii++)
fprintf(stdout, "sample=%llu\n", samples[ii]);

fprintf(stdout, "maxlat=%llu\n", cylat);

return 0;

}





2001-12-27 17:45:59

by Davide Libenzi

[permalink] [raw]
Subject: Re: [RFC] Scheduler issue 1, RT tasks ...

On Wed, 26 Dec 2001, Victor Yodaiken wrote:

> On Sun, Dec 23, 2001 at 05:20:26PM -0800, Davide Libenzi wrote:
> > On Sun, 23 Dec 2001, Victor Yodaiken wrote:
> >
> > > On Thu, Dec 20, 2001 at 02:36:07PM -0800, Davide Libenzi wrote:
> > > > > My understanding of the POSIX standard is the the highest priority
> > > > > task(s) are to get the cpu(s) using the standard calls. If you want to
> > > > > deviate from this I think the standard allows extensions, but they IMHO
> > > > > should be requested, not the default, so I would turn your flag around
> > > > > to force LOCAL, not GLOBAL.
> > > >
> > > > So, you're basically saying that for a better standard compliancy it's
> > > > better to have global preemption policy by default. And having users to
> > > > request rt tasks localization explicitly. It's fine for me.
> > >
> > > Can you please cite the passaaages in the standrd you have in mind?
> >
> > POSIX 1003. The doubt was if ( since the POSIX standard does not talk
> > about SMP ) the real time priorities apply to CPU or to the entire system.
>
> Right, that was my question. George says, in your words, "for better
> standards compliancy ..." and I want to know why you guys think that.

The thought was that someone who needs RT tasks probably needs very low
latency, and so the idea was that applying global preemption decisions
would lead to better compliance. But I'll be happy to hear that this is
false anyway ...




- Davide


2001-12-28 00:13:03

by Victor Yodaiken

[permalink] [raw]
Subject: Re: [RFC] Scheduler issue 1, RT tasks ...

On Thu, Dec 27, 2001 at 09:41:33AM -0800, Davide Libenzi wrote:
> > No: we've measured. The time in our system, which does not follow any
> > Linux kernel paths, is dominated by motherboard bus delays.
>
> 17us of bus delay ?!
> UP or SMP ?
> Under which kind of bus load ?

Try
cli
read cycle timer
inb from some isa port
read cycle timer
repeat for a while
sti
print worst case and weep

2001-12-28 00:46:27

by Davide Libenzi

[permalink] [raw]
Subject: Re: [RFC] Scheduler issue 1, RT tasks ...

On Thu, 27 Dec 2001, Victor Yodaiken wrote:

> On Thu, Dec 27, 2001 at 09:41:33AM -0800, Davide Libenzi wrote:
> > > No: we've measured. The time in our system, which does not follow any
> > > Linux kernel paths, is dominated by motherboard bus delays.
> >
> > 17us of bus delay ?!
> > UP or SMP ?
> > Under which kind of bus load ?
>
> Try
> cli
> read cycle timer
> inb from some isa port
> read cycle timer
> repeat for a while
> sti
> print worst case and weep

No need to test, I've a positive guess from ISA :)



- Davide


2001-12-28 09:52:56

by Martin Knoblauch

[permalink] [raw]
Subject: Re: [RFC] Scheduler issue 1, RT tasks ...

> Re: [RFC] Scheduler issue 1, RT tasks ...
>
> >
> > Right, that was my question. George says, in your words, "for better
>
> > standards compliancy ..." and I want to know why you guys think
> that.
>
> The thought was that if someone need RT tasks he probably need a very
> low
> latency and so the idea that by applying global preemption decisions
> would
> lead to a better compliancy. But i'll be happy to ear that this is
> false
> anyway ...
>

without wanting to start an RT flame-fest, what do people really want
when they talk about RT in this [Linux] context:

- very low latency
- deterministic latency ("never to exceed")
- both
- something completely different

Thanks
Martin
--
+-----------------------------------------------------+
|Martin Knoblauch |
|-----------------------------------------------------|
|http://www.knobisoft.de/cats |
|-----------------------------------------------------|
|e-mail: [email protected] |
+-----------------------------------------------------+

2001-12-29 09:13:25

by George Anzinger

[permalink] [raw]
Subject: Re: [RFC] Scheduler issue 1, RT tasks ...

Martin Knoblauch wrote:
>
> > Re: [RFC] Scheduler issue 1, RT tasks ...
> >
> > >
> > > Right, that was my question. George says, in your words, "for better
> >
> > > standards compliancy ..." and I want to know why you guys think
> > that.
> >
> > The thought was that if someone need RT tasks he probably need a very
> > low
> > latency and so the idea that by applying global preemption decisions
> > would
> > lead to a better compliancy. But i'll be happy to ear that this is
> > false
> > anyway ...
> >
>
> without wanting to start a RT flame-fest, what do people really want
> when they talk about RT in this [Linux] context:
>
> - very low latency
> - deterministic latency ("never to exceed")
> - both
> - something completely different
>
All of the above from time to time and user to user. That is, some
folks want one or more of the above, some folks want more, some less.
What is really up? Well they have a job to do that requires certain
things. Different jobs require different capabilities. It is hard to
say that any given system will do a reasonably complex job without
testing. For example we may have the required latency but find the
system fails because, to get the latency, we preempted another task that
was (and so still is) in the middle of updating something we need to
complete the job.

On the other hand, some things clearly are in the way of doing some real
time tasks in a timely fashion. Among these things are long context
switch latency, high kernel overhead, and low resolution time keeping/
alarms. So we talk (argue? posture?) most about these. At the same
time all the other bullet items of *nix systems are, at least some
times, important.

Why Linux? The same reasons it is used any where else. Among these
reasons is the desire to have to know and support only one system. Thus
the drive to extend it to the more responsive end of the spectrum
without losing other capabilities. And, of course, the standards issue
is in here. Standards compliance is important from an investment point
of view. It allows the user to move his costly (far more than the
hardware) software investment from one kernel/ system to another with
little or no rework.
--
George [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Real time sched: http://sourceforge.net/projects/rtsched/

2001-12-29 19:02:39

by Dieter Nützel

[permalink] [raw]
Subject: Re: [RFC] Scheduler issue 1, RT tasks ...

Martin Knoblauch wrote:
>
> > Re: [RFC] Scheduler issue 1, RT tasks ...
> >
> > >
> > > Right, that was my question. George says, in your words, "for better
> >
> > > standards compliancy ..." and I want to know why you guys think
> > that.
> >
> > The thought was that if someone need RT tasks he probably need a very
> > low latency and so the idea that by applying global preemption decisions
> > would lead to a better compliancy. But i'll be happy to ear that this is
> > false anyway ...
> >
>
> without wanting to start a RT flame-fest, what do people really want
> when they talk about RT in this [Linux] context:
>
> - very low latency
> - deterministic latency ("never to exceed")
> - both
> - something completely different
>
> All of the above from time to time and user to user. That is, some
> folks want one or more of the above, some folks want more, some less.
> What is really up? Well they have a job to do that requires certain
> things. Different jobs require different capabilities. It is hard to
> say that any given system will do a reasonably complex job with out
> testing. For example we may have the required latency but find the
> system fails because, to get the latency, we preempted another task that
> was (and so still is) in the middle of updating something we need to
> complete the job.

So George what direction should I try for some tests?
2.4.17 plus your and Robert's preempt plus lock-break?
Add your high-res-timers, rtscheduler or both?
Do they apply against 2.4.17/2.4.18-pre1?
A combination of the above plus Davide's BMQS?

I ask because my MP3/Ogg-Vorbis hiccup during dbench isn't solved anyway.
Running 2.4.17 + preempt + lock-break + 10_vm-21 (AA).
Some wisdom?

Thank you for all your work and
Happy New Year

-Dieter
--
Dieter Nützel
Graduate Student, Computer Science

University of Hamburg
Department of Computer Science
@home: [email protected]

2001-12-29 21:04:00

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC] Scheduler issue 1, RT tasks ...

Dieter Nützel wrote:
>
> I ask because my MP3/Ogg-Vorbis hiccup during dbench isn't solved anyway.
> Running 2.4.17 + preempt + lock-break + 10_vm-21 (AA).
> Some wisdom?

Please test this elevator patch. I'll be putting it out more formally
in a day or two. Much more testing is needed yet, but for me, the
time to read a 16 megabyte file whilst running dbench 160 falls from
three minutes thirty seconds to seven seconds. (This is a VM thing,
not an elevator thing).



--- linux-2.4.18-pre1/drivers/block/elevator.c Thu Jul 19 20:59:41 2001
+++ linux-akpm/drivers/block/elevator.c Sat Dec 29 00:52:05 2001
@@ -82,6 +82,7 @@ int elevator_linus_merge(request_queue_t
{
struct list_head *entry = &q->queue_head;
unsigned int count = bh->b_size >> 9, ret = ELEVATOR_NO_MERGE;
+ const int max_bomb_segments = q->elevator.max_bomb_segments;

while ((entry = entry->prev) != head) {
struct request *__rq = blkdev_entry_to_request(entry);
@@ -116,6 +117,56 @@ int elevator_linus_merge(request_queue_t
}
}

+ /*
+ * If we failed to merge a read anywhere in the request
+ * queue, we really don't want to place it at the end
+ * of the list, behind lots of writes. So place it near
+ * the front.
+ *
+ * We don't want to place it in front of _all_ writes: that
+ * would create lots of seeking, and isn't tunable.
+ * We try to avoid promoting this read in front of existing
+ * reads.
+ *
+ * max_bomb_sectors becomes the maximum number of write
+ * requests which we allow to remain in place in front of
+ * a newly introduced read. We weight things a little bit,
+ * so large writes are more expensive than small ones, but it's
+ * requests which count, not sectors.
+ */
+ if (max_bomb_segments && rw == READ && ret == ELEVATOR_NO_MERGE) {
+ int cur_latency = 0;
+ struct request * const cur_request = *req;
+
+ entry = head->next;
+ while (entry != &q->queue_head) {
+ struct request *__rq;
+
+ if (entry == &q->queue_head)
+ BUG();
+ if (entry == q->queue_head.next &&
+ q->head_active && !q->plugged)
+ BUG();
+ __rq = blkdev_entry_to_request(entry);
+
+ if (__rq == cur_request) {
+ /*
+ * This is where the old algorithm placed it.
+ * There's no point pushing it further back,
+ * so leave it here, in sorted order.
+ */
+ break;
+ }
+ if (__rq->cmd == WRITE) {
+ cur_latency += 1 + __rq->nr_sectors / 64;
+ if (cur_latency >= max_bomb_segments) {
+ *req = __rq;
+ break;
+ }
+ }
+ entry = entry->next;
+ }
+ }
return ret;
}

@@ -188,7 +239,7 @@ int blkelvget_ioctl(elevator_t * elevato
output.queue_ID = elevator->queue_ID;
output.read_latency = elevator->read_latency;
output.write_latency = elevator->write_latency;
- output.max_bomb_segments = 0;
+ output.max_bomb_segments = elevator->max_bomb_segments;

if (copy_to_user(arg, &output, sizeof(blkelv_ioctl_arg_t)))
return -EFAULT;
@@ -207,9 +258,12 @@ int blkelvset_ioctl(elevator_t * elevato
return -EINVAL;
if (input.write_latency < 0)
return -EINVAL;
+ if (input.max_bomb_segments < 0)
+ return -EINVAL;

elevator->read_latency = input.read_latency;
elevator->write_latency = input.write_latency;
+ elevator->max_bomb_segments = input.max_bomb_segments;
return 0;
}

--- linux-2.4.18-pre1/include/linux/elevator.h Thu Feb 15 16:58:34 2001
+++ linux-akpm/include/linux/elevator.h Sat Dec 29 12:57:33 2001
@@ -3,10 +3,11 @@

typedef void (elevator_fn) (struct request *, elevator_t *,
struct list_head *,
- struct list_head *, int);
+ struct list_head *);

-typedef int (elevator_merge_fn) (request_queue_t *, struct request **, struct list_head *,
- struct buffer_head *, int, int);
+typedef int (elevator_merge_fn)(request_queue_t *, struct request **,
+ struct list_head *, struct buffer_head *bh,
+ int rw, int max_sectors);

typedef void (elevator_merge_cleanup_fn) (request_queue_t *, struct request *, int);

@@ -16,6 +17,7 @@ struct elevator_s
{
int read_latency;
int write_latency;
+ int max_bomb_segments;

elevator_merge_fn *elevator_merge_fn;
elevator_merge_cleanup_fn *elevator_merge_cleanup_fn;
@@ -24,13 +26,13 @@ struct elevator_s
unsigned int queue_ID;
};

-int elevator_noop_merge(request_queue_t *, struct request **, struct list_head *, struct buffer_head *, int, int);
-void elevator_noop_merge_cleanup(request_queue_t *, struct request *, int);
-void elevator_noop_merge_req(struct request *, struct request *);
-
-int elevator_linus_merge(request_queue_t *, struct request **, struct list_head *, struct buffer_head *, int, int);
-void elevator_linus_merge_cleanup(request_queue_t *, struct request *, int);
-void elevator_linus_merge_req(struct request *, struct request *);
+elevator_merge_fn elevator_noop_merge;
+elevator_merge_cleanup_fn elevator_noop_merge_cleanup;
+elevator_merge_req_fn elevator_noop_merge_req;
+
+elevator_merge_fn elevator_linus_merge;
+elevator_merge_cleanup_fn elevator_linus_merge_cleanup;
+elevator_merge_req_fn elevator_linus_merge_req;

typedef struct blkelv_ioctl_arg_s {
int queue_ID;
@@ -54,22 +56,6 @@ extern void elevator_init(elevator_t *,
#define ELEVATOR_FRONT_MERGE 1
#define ELEVATOR_BACK_MERGE 2

-/*
- * This is used in the elevator algorithm. We don't prioritise reads
- * over writes any more --- although reads are more time-critical than
- * writes, by treating them equally we increase filesystem throughput.
- * This turns out to give better overall performance. -- sct
- */
-#define IN_ORDER(s1,s2) \
- ((((s1)->rq_dev == (s2)->rq_dev && \
- (s1)->sector < (s2)->sector)) || \
- (s1)->rq_dev < (s2)->rq_dev)
-
-#define BHRQ_IN_ORDER(bh, rq) \
- ((((bh)->b_rdev == (rq)->rq_dev && \
- (bh)->b_rsector < (rq)->sector)) || \
- (bh)->b_rdev < (rq)->rq_dev)
-
static inline int elevator_request_latency(elevator_t * elevator, int rw)
{
int latency;
@@ -85,7 +71,7 @@ static inline int elevator_request_laten
((elevator_t) { \
0, /* read_latency */ \
0, /* write_latency */ \
- \
+ 0, /* max_bomb_segments */ \
elevator_noop_merge, /* elevator_merge_fn */ \
elevator_noop_merge_cleanup, /* elevator_merge_cleanup_fn */ \
elevator_noop_merge_req, /* elevator_merge_req_fn */ \
@@ -95,7 +81,7 @@ static inline int elevator_request_laten
((elevator_t) { \
8192, /* read passovers */ \
16384, /* write passovers */ \
- \
+ 6, /* max_bomb_segments */ \
elevator_linus_merge, /* elevator_merge_fn */ \
elevator_linus_merge_cleanup, /* elevator_merge_cleanup_fn */ \
elevator_linus_merge_req, /* elevator_merge_req_fn */ \

2001-12-29 22:23:04

by Davide Libenzi

[permalink] [raw]
Subject: Re: [RFC] Scheduler issue 1, RT tasks ...

On Sat, 29 Dec 2001, Dieter Nützel wrote:

> Martin Knoblauch wrote:
> >
> > > Re: [RFC] Scheduler issue 1, RT tasks ...
> > >
> > > >
> > > > Right, that was my question. George says, in your words, "for better
> > >
> > > > standards compliancy ..." and I want to know why you guys think
> > > that.
> > >
> > > The thought was that if someone need RT tasks he probably need a very
> > > low latency and so the idea that by applying global preemption decisions
> > > would lead to a better compliancy. But i'll be happy to ear that this is
> > > false anyway ...
> > >
> >
> > without wanting to start a RT flame-fest, what do people really want
> > when they talk about RT in this [Linux] context:
> >
> > - very low latency
> > - deterministic latency ("never to exceed")
> > - both
> > - something completely different
> >
> > All of the above from time to time and user to user. That is, some
> > folks want one or more of the above, some folks want more, some less.
> > What is really up? Well they have a job to do that requires certain
> > things. Different jobs require different capabilities. It is hard to
> > say that any given system will do a reasonably complex job with out
> > testing. For example we may have the required latency but find the
> > system fails because, to get the latency, we preempted another task that
> > was (and so still is) in the middle of updating something we need to
> > complete the job.
>
> So George what direction should I try for some tests?
> 2.4.17 plus your and Robert's preempt plus lock-break?
> Add your high-res-timers, rtscheduler or both?
> Do they apply against 2.4.17/2.4.18-pre1?
> A combination of the above plus Davide's BMQS?
>
> I ask because my MP3/Ogg-Vorbis hiccup during dbench isn't solved anyway.
> Running 2.4.17 + preempt + lock-break + 10_vm-21 (AA).
> Some wisdom?

A bad scheduler can make the latency increase, but in your case I don't
think it could increase that much ( in percent ). By copying a huge
file around you can experience spots of 1-2 secs of machine freeze, and
this is definitely not the scheduler. The damage that a bad scheduler can
do is directly proportional to the context switch rate anyway.




- Davide


2001-12-30 10:01:44

by George Anzinger

[permalink] [raw]
Subject: Re: [RFC] Scheduler issue 1, RT tasks ...

Dieter Nützel wrote:
>
> Martin Knoblauch wrote:
> >
> > > Re: [RFC] Scheduler issue 1, RT tasks ...
> > >
> > > >
> > > > Right, that was my question. George says, in your words, "for better
> > >
> > > > standards compliancy ..." and I want to know why you guys think
> > > that.
> > >
> > > The thought was that if someone need RT tasks he probably need a very
> > > low latency and so the idea that by applying global preemption decisions
> > > would lead to a better compliancy. But i'll be happy to ear that this is
> > > false anyway ...
> > >
> >
> > without wanting to start a RT flame-fest, what do people really want
> > when they talk about RT in this [Linux] context:
> >
> > - very low latency
> > - deterministic latency ("never to exceed")
> > - both
> > - something completely different
> >
> > All of the above from time to time and user to user. That is, some
> > folks want one or more of the above, some folks want more, some less.
> > What is really up? Well they have a job to do that requires certain
> > things. Different jobs require different capabilities. It is hard to
> > say that any given system will do a reasonably complex job with out
> > testing. For example we may have the required latency but find the
> > system fails because, to get the latency, we preempted another task that
> > was (and so still is) in the middle of updating something we need to
> > complete the job.
>
> So George what direction should I try for some tests?
> 2.4.17 plus your and Robert's preempt plus lock-break?
> Add your high-res-timers, rtscheduler or both?
> Do they apply against 2.4.17/2.4.18-pre1?
> A combination of the above plus Davide's BMQS?

I would guess you want preempt plus lock-break at least. rtsched may
give a small improvement if you run any real time (i.e. not SCHED_OTHER)
tasks (and the improvement should be in both real time and non-real time
preemption) but, in general, the scheduler is nowhere near the problem
that the long-held locks are, so I really don't expect to see much
improvement here. If you have a lot of tasks on the system (not active,
just there) you may see the "recalculate" hit with the standard scheduler,
which is much improved with rtsched (it does not include tasks not in
the run list in the recalculate).

As for high-res-timers, I just put out a 2.4.13 version which should
work on 2.4.17 (there are rejects in the patch, but all in non-i386
code). I have one report, however, of asm errors which seem to depend
on the compiler (or asm) version. I will look into this and put up a
2.4.17 version early next week. Testing wise, I don't think this will
be visible because you most likely are not using POSIX timers. There is
a change in the timer list structure, but that should be in the noise
also. In short, the high-res-timers project provides new capability,
not improved performance with existing capability.
>
> I ask because my MP3/Ogg-Vorbis hiccup during dbench isn't solved anyway.
> Running 2.4.17 + preempt + lock-break + 10_vm-21 (AA).
> Some wisdom?

Try the preempt-stats patch and collect data during the hiccup. It
should point the finger at the problem. Let us know what you find.
Robert has been very good at fixing things like this with his lock-break
stuff, but we/he need to know who the bad guy is.
>
> Thank you for all your work and
> Happy New Year
>
> -Dieter
> --
> Dieter Nützel
> Graduate Student, Computer Science
>
> University of Hamburg
> Department of Computer Science
> @home: [email protected]

--
George [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Real time sched: http://sourceforge.net/projects/rtsched/

2001-12-30 19:55:12

by Dieter Nützel

[permalink] [raw]
Subject: Re: [RFC] Scheduler issue 1, RT tasks ...

On Sunday, 29. December 2001 21:00, you wrote:
>Dieter Nützel wrote:
> >
> > I ask because my MP3/Ogg-Vorbis hiccup during dbench isn't solved anyway.
> > Running 2.4.17 + preempt + lock-break + 10_vm-21 (AA).
> > Some wisdom?
>
> Please test this elevator patch. I'll be putting it out more formally
> in a day or two. Much more testing is needed yet, but for me, the
> time to read a 16 megabyte file whilst running dbench 160 falls from
> three minutes thirty seconds to seven seconds. (This is a VM thing,
> not an elevator thing).

Andrew or anybody else,

can you please send me a copy directly?
The version I've extracted from the list is somewhat broken.
I am not on LKML 'cause it is too much traffic for such a poor little boy like
me...;-)

Thanks,
Dieter

2001-12-31 13:57:30

by George Anzinger

[permalink] [raw]
Subject: Re: [RFC] Scheduler issue 1, RT tasks ...

Dieter Nützel wrote:
>
> On Sunday, 29. December 2001 21:00, you wrote:
> >Dieter Nützel wrote:
> > >
> > > I ask because my MP3/Ogg-Vorbis hiccup during dbench isn't solved anyway.
> > > Running 2.4.17 + preempt + lock-break + 10_vm-21 (AA).
> > > Some wisdom?
> >
> > Please test this elevator patch. I'll be putting it out more formally
> > in a day or two. Much more testing is needed yet, but for me, the
> > time to read a 16 megabyte file whilst running dbench 160 falls from
> > three minutes thirty seconds to seven seconds. (This is a VM thing,
> > not an elevator thing).
>
> Andrew or anybody else,
>
> can you please send me a copy directly?
> The version I've extracted from the list is some what broken.
> I am not on LKML 'cause it is to much traffic for such a poor little boy like
> me...;-)
>
Andrew,

I think the problem is that the mailer(s) insert new lines. Is this
right Dieter? It is certainly a problem for me.

Best to mail as an attachment.
--
George [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Real time sched: http://sourceforge.net/projects/rtsched/

2002-01-01 18:55:45

by Dieter Nützel

[permalink] [raw]
Subject: Re: [RFC] Scheduler issue 1, RT tasks ...

On Monday, 31. December 2001 14:56, george anzinger wrote:
> Dieter Nützel wrote:
> > On Sunday, 29. December 2001 21:00, you wrote:
> > >Dieter Nützel wrote:
> > > > I ask because my MP3/Ogg-Vorbis hiccup during dbench isn't solved
> > > > anyway. Running 2.4.17 + preempt + lock-break + 10_vm-21 (AA).
> > > > Some wisdom?
> > >
> > > Please test this elevator patch. I'll be putting it out more formally
> > > in a day or two. Much more testing is needed yet, but for me, the
> > > time to read a 16 megabyte file whilst running dbench 160 falls from
> > > three minutes thirty seconds to seven seconds. (This is a VM thing,
> > > not an elevator thing).
> >
> > Andrew or anybody else,
> >
> > can you please send me a copy directly?
> > The version I've extracted from the list is some what broken.
> > I am not on LKML 'cause it is to much traffic for such a poor little boy
> > like me...;-)
>
> Andrew,
>
> I think the problem is that the mailer(s) insert new lines. Is this
> right Dieter? It is certainly a problem for me.

Yes.

> Best to mail as an attachment.

Yes.

But I applied it by hand and got the best results I ever had!
GREAT work, Andrew!
This should go in, soon.

2.4.17
preempt-kernel-rml-2.4.17-1.patch
lock-break-rml-2.4.17-2.patch
00_nanosleep-5
10_vm-21 (Andrea)
bootmem-2.4.17-pre6
elevator-fix (Andrew)
O-inode-attrs.patch (ReiserFS)
linux-2.4.17rc2-KLMN+exp_trunc+3fixes.patch (ReiserFS)

Happy New Year and best wishes!

-Dieter

BTW Below are my first results. More to come (analysis of latency).

2.4.17-preempt + 10_vm-21 + elevator
dbench/dbench> time ./dbench 32
32 clients started
................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................+.................................................................+...................+.......+....................+.........................................................................................................+.........+....................+................+..................+.....+...................................................................................................................................................................................................................................................................+..............++...........+.+.+..++++.+++.+.+++.++++********************************
Throughput 49.7707 MB/sec (NB=62.2133 MB/sec 497.707 MBit/sec)
13.800u 51.810s 1:25.89 76.3% 0+0k 0+0io 939pf+0w

2.4.17-preempt + 10_vm-21 + elevator + MP3 playback
dbench/dbench> time ./dbench 32
32 clients started
..............................................................................................................................................................................................................................................................................................................................................+.................................++....+.+.+.......+.............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................+......................++.....+..+++.++..++.++++.+....++++.+++++********************************
Throughput 48.6323 MB/sec (NB=60.7904 MB/sec 486.323 MBit/sec)
14.690u 52.920s 1:27.87 76.9% 0+0k 0+0io 939pf+0w