This one took a lot of detective work to track down: the scheduler starvation
issue seen rarely, but reproducibly, with certain applications. Thanks to Mike
Galbraith for significantly narrowing my search path.
In O15 I mentioned that preventing parents from preempting their children
prevented starvation of applications where they would be busy on wait. Long
story to describe how, but I discovered the problem inducing starvation in
O15 was the same, but with wakers and their wakee. The wakee would preempt
the waker, and the waker could make no progress until it got rescheduled...
however, if the wakee had better priority than the waker, it kept preempting it until
its own priority dropped enough for the waker to get scheduled. Because in
O15 the priority decayed more slowly we were able to see this happening in these
"busy on wait" applications... and they're not as rare as we'd like. In fact
this wakee preemption is going on at a mild level all the time even in the
vanilla scheduler. I've experimented with ways to improve the
performance/feel of these applications but I found it was to the detriment of
most other apps, so this patch simply makes them run without inducing
starvation at usable performance. I'm not convinced the scheduler should have
a workaround; rather, the apps shouldn't busy on wait.
Those who experienced starvation could you please test this patch.
Changes:
Waker is now kept track of.
Only user tasks have the bonus ceiling from uninterruptible sleep.
Preemption of tasks at the same level with twice as much timeslice has been
dropped as this is not necessary with timeslice granularity (may improve
performance of cpu intensive tasks).
Preemption of user tasks is limited to those in the interactive range; cpu
intensive non interactive tasks can run out their full timeslice (may also
improve cpu intensive performance)
Tasks cannot preempt their own waker.
Cleanups etc.
This patch applies onto 2.6.0-test3-mm2 (or O15int)
and is available at http://kernel.kolivas.org/2.5
Con
Con Kolivas wrote:
> Preemption of tasks at the same level with twice as much timeslice has been
> dropped as this is not necessary with timeslice granularity (may improve
> performance of cpu intensive tasks).
Does this situation happen where two tasks at different nice levels have
dynamic priority adjustments which make them effectively have the same
priority?
> Preemption of user tasks is limited to those in the interactive range; cpu
> intensive non interactive tasks can run out their full timeslice (may also
> improve cpu intensive performance)
What can cause preemption of a task that has not used up its timeslice?
I assume a device interrupt could do this, but... there's a question I
asked earlier which I haven't read the answer to yet, so I'm going to guess:
A hardware timer interrupt happens at timeslice granularity. If the
interrupt occurs, but the timeslice is not expired, then NORMALLY, the
ISR would just return right back to the running task, but sometimes, it
might decide to end the timeslice early and run some other task.
Right?
So, what might cause the scheduler to decide to preempt a task which has
not used up its timeslice?
On Fri, 15 Aug 2003, Timothy Miller wrote:
>
>
> Con Kolivas wrote:
>
> > Preemption of tasks at the same level with twice as much timeslice has been
> > dropped as this is not necessary with timeslice granularity (may improve
> > performance of cpu intensive tasks).
>
> Does this situation happen where two tasks at different nice levels have
> dynamic priority adjustments which make them effectively have the same
> priority?
>
> > Preemption of user tasks is limited to those in the interactive range; cpu
> > intensive non interactive tasks can run out their full timeslice (may also
> > improve cpu intensive performance)
>
> What can cause preemption of a task that has not used up its timeslice?
> I assume a device interrupt could do this, but... there's a question I
> asked earlier which I haven't read the answer to yet, so I'm going to guess:
>
> A hardware timer interrupt happens at timeslice granularity. If the
> interrupt occurs, but the timeslice is not expired, then NORMALLY, the
> ISR would just return right back to the running task, but sometimes, it
> might decide to end the time-slice early and run some other task.
>
> Right?
>
Never. However, since the time-slice is 'time', the very instant that
the hardware interrupt handler executes its "iret", the hardware timer may
interrupt and the CPU gets taken away from the task.
Suppose that the preemption timer ticked at 1 HZ intervals. Suppose
that an awful interrupt service routine (one that loops inside) took 1
second to be serviced. What would happen if a task was 3/4 of a second
into its time-slice and then a hardware interrupt occurred?
The CPU would be taken away at 3/4 second, given to the bad ISR, then
the CPU would not be returned until (1) the one-second execution time
had occurred, and (2), all other higher priority tasks had gotten their
time-slices. Each of those higher-priority tasks, could further get
interrupted by the rogue ISR. The result may be that you get the CPU
back next Tuesday.
> So, what might cause the scheduler to decide to preempt a task which has
> not used up its timeslice?
>
The scheduler does not preempt a task until its time has expired.
However time is a constantly-expiring thing so interrupts can
eat up a process's time.
The usual way for a process (task) to lose its allocated CPU time-
slice is to perform some I/O. When waiting for I/O, the kernel may
give the CPU to somebody else.
If the scheduler worked on task-CPU time, rather than hardware-clock
"wall time", maybe it would be more "fair" during periods of high
interrupt activity. However, since interrupts occur anytime, they
tend to attack all competing processes equally, therefore becoming
"fair" unless it's one task that's generating all that interrupt
activity, like network I/O, or some kinds of screen-interactivity.
Cheers,
Dick Johnson
Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
Note 96.31% of all statistics are fiction.
On Fri, 2003-08-15 at 17:49, Con Kolivas wrote:
> In O15 I mentioned that preventing parents from preempting their children
> prevented starvation of applications where they would be busy on wait. Long
> story to describe how, but I discovered the problem inducing starvation in
> O15 was the same, but with wakers and their wakee. The wakee would preempt
> the waker, and the waker could make no progress until it got rescheduled...
> however, if the wakee had better priority than the waker, it kept preempting it until
> its own priority dropped enough for the waker to get scheduled. Because in
> O15 the priority decayed more slowly we were able to see this happening in these
> "busy on wait" applications... and they're not as rare as we'd like. In fact
> this wakee preemption is going on at a mild level all the time even in the
> vanilla scheduler. I've experimented with ways to improve the
> performance/feel of these applications but I found it was to the detriment of
> most other apps, so this patch simply makes them run without inducing
> starvation at usable performance. I'm not convinced the scheduler should have
> a workaround; rather, the apps shouldn't busy on wait.
Well, I'm sorry to say there's something really wrong here with O16int.
2.6.0-test3-mm2 plus O16int takes exactly twice as long to boot as
vanilla 2.6.0-test3-mm2, and that's a lot of time. Let me explain.
I've made timings starting at the moment GRUB passes control to the
kernel, and ending when mingetty displays the login prompt. The time it takes
for the kernel to boot until "init" is called is roughly the same on
both kernels (a millisecond up or down).
2.6.0-test3-mm2: 16.30 seconds
2.6.0-test3-mm2 + O16int: 33.47 seconds
There must be something really damaged here, as it's more than twice the
time. During boot, things that are almost immediate, like applying
"iptables" on a RHL9 box, take ages when using an O16int-based kernel.
Is anyone experiencing those extreme delays? Is this a new kind of
starvation? Because using exactly the same machine, Linux distribution,
disk partition, etc... but simply by changing kernels, almost everything
on boot takes twice as long to get done.
Also, logging to KDE takes ages compared with vanilla 2.6.0-test3-mm2.
Any ideas?
On 2003-08-15 19:00:00 Felipe Alfaro Solana wrote:
[...]
> Is anyone experiencing those extreme delays? Is this a new kind of
> starvation? Because using exactly the same machine, Linux distribution,
> disk partition, etc... but simply by changing kernels, almost
> everything on boot takes twice as long to get done.
Ditto, with handcrafted simple scripts in my rc.d. About twice as long.
But that's not all... Game-test can't be run. I waited 4 minutes for it
to start the intro sequences (normally 10 seconds), clicked to load a
saved game, got never-ending "sound repeats". Patiently logged in on a
text console and patiently killed the rogue processes. It took about five
minutes of starvation to squeeze in keystrokes with the few seconds of
interactivity now and then.
Rotating the world in Blender quickly "jerks", but now the pauses are
extended to more than 10 seconds (previously 1/2 to 2 seconds).
Ringing the bell and shouting "Unclean, unclean!"
Mats Johannesson
On Sat, Aug 16, 2003 at 01:49:06AM +1000, Con Kolivas wrote:
> Changes:
> Waker is now kept track of.
>
> Only user tasks have the bonus ceiling from uninterruptible sleep.
>
> Preemption of tasks at the same level with twice as much timeslice has been
> dropped as this is not necessary with timeslice granularity (may improve
> performance of cpu intensive tasks).
>
> Preemption of user tasks is limited to those in the interactive range; cpu
> intensive non interactive tasks can run out their full timeslice (may also
> improve cpu intensive performance)
>
> Tasks cannot preempt their own waker.
>
> Cleanups etc.
Con, given the problems reported, maybe each of these should be in a
different changeset (OXXint, etc...).
Seeing the large number of changes compared to previous releases gave me
pause on this patch. So I waited a few minutes, and a problem report
came in quickly.
Is there any way you can provide each change in a separate patch, so we can
narrow down the problem patch? (I have a feeling that most of your changes will do a
lot of good)
Mike
Con Kolivas wrote:
> In O15 I mentioned that preventing parents from preempting their children
> prevented starvation of applications where they would be busy on wait. Long
> story to describe how, but I discovered the problem inducing starvation in
> O15 was the same, but with wakers and their wakee. The wakee would preempt
> the waker, and the waker could make no progress until it got rescheduled...
I'm not clear on the effect of this.
There's a specific threading technique that I haven't programmed, but
would like to. Whether it will work well depends on the kernel
scheduler.
The idea is that I run many cooperative tasks using user-space
switching (these are very light weight tasks, no stack, more like
state machines). It is not so different from a classic poll() loop.
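To make that concrete, here is a minimal sketch of such a loop; the names
(struct machine, run_loop) are invented for illustration and error handling
is omitted:

#include <poll.h>
#include <stddef.h>

/* Each state machine owns a descriptor and a step function that runs to
 * completion without blocking. */
struct machine {
    int fd;                          /* descriptor this machine waits on */
    void (*step)(struct machine *);  /* advance the state machine        */
};

static void run_loop(struct machine *m, size_t n)   /* assumes n <= 64 */
{
    struct pollfd pfd[64];
    size_t i;

    for (;;) {
        for (i = 0; i < n; i++) {
            pfd[i].fd = m[i].fd;
            pfd[i].events = POLLIN;
        }
        if (poll(pfd, n, -1) < 0)
            continue;
        for (i = 0; i < n; i++)
            if (pfd[i].revents & POLLIN)
                m[i].step(&m[i]);    /* never blocks */
    }
}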
If one of these calls a system call, like read() or fsync(), I want
the program to make progress with its other state machines while the
first is blocked doing I/O. For read(), I can use async I/O, but aio
doesn't provide async variants of all the system calls which can block,
such as open(), stat() and readdir() - and I use stat() more than read().
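For comparison, a hedged sketch of the read() case that async I/O does already
cover (POSIX AIO shown here; the wrapper names are invented). The point is that
nothing analogous exists for stat() or open():

#include <aio.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/* Queue an asynchronous read; returns immediately without blocking. */
static int submit_async_read(int fd, void *buf, size_t len, struct aiocb *cb)
{
    memset(cb, 0, sizeof(*cb));
    cb->aio_fildes = fd;
    cb->aio_buf    = buf;
    cb->aio_nbytes = len;
    cb->aio_offset = 0;
    return aio_read(cb);
}

/* Poll for completion later, without blocking the state-machine loop. */
static int async_read_done(struct aiocb *cb, ssize_t *result)
{
    if (aio_error(cb) == EINPROGRESS)
        return 0;
    *result = aio_return(cb);
    return 1;
}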
It isn't reasonable to make a kernel thread per userspace state
machine: I want less preemption than that implies, and more control
over locking contexts between the state machines. Also, each
kernel thread uses a relatively large amount of space, while the states are
quite small and it is reasonable to have a large number of them.
The usual solution is to spawn a few kernel threads, where that number
is determined empirically according to how much blocking the program
seems to do. For example see nfsd. I would do this anyway, to take
advantage of SMP/HT, but I dislike the fact that the optimal number of
kernel threads is impossible to plan, as it varies according to what
the program is doing.
Also, I do not like multiple kernel threads to end up on the same CPU,
with each one handling many state machines, as many of the machines
don't block in system calls, and so the kernel threads will appear to
be mostly CPU hogs, to the kernel scheduler.
That would mean when one kernel thread uses its timeslice, another
takes over and makes a state machine pause for the length of a CPU hog
timeslice, which isn't always appropriate.
One solution is for each kernel thread (one per CPU) to maintain a
shadow thread which normally sleeps. Whenever I'm about to call a
system call which may block, I'd wake the shadow thread. If the
system call doesn't block, it'll return without the shadow thread
running, and I can put it back to sleep again. If a shadow thread
does manage to run, it will make itself an active thread and spawn a
shadow thread for itself. When there's an excess of active threads,
they turn themselves into shadow threads and sleep, or kill
themselves.
Of course I would maintain a pool of sleeping shadow threads, with
watermarks, not creating and destroying them at high speed.
This way, I create and destroy active kernel threads according to
the number of userspace state machines which are really blocked in
system calls. It seems like a good idea to me.
I think it's been done before, under the name "scheduler activations",
on some other kernel.
There is but one little caveat. :)
(And this is before starting to code it :)
When a kernel thread wakes another, it's desirable that the first
thread continues running, and the woken thread does not run
immediately (not even on another CPU), so that if the waker doesn't
block in the system call, the wakee can be put back to sleep before it runs.
I'm wondering - this is my question - does the current scheduler have
predictable behaviour in this regard?
Cheers,
-- Jamie
On Sat, Aug 16, 2003 at 12:03:12AM +0100, Jamie Lokier wrote:
> I think it's been done before, under the name "scheduler activations",
> on some other kernel.
>
Wouldn't futexes help with this?
Mike Fedyk wrote:
> On Sat, Aug 16, 2003 at 12:03:12AM +0100, Jamie Lokier wrote:
> > I think it's been done before, under the name "scheduler activations",
> > on some other kernel.
> >
>
> Wouldn't futexes help with this?
Futexes are great for the waking up part, not so great for putting
another task to sleep :)
I see two ways to use a futex.
1. Active task changes a synchronisation word.
2. Active task FUTEX_WAKEs the shadow task before syscall.
3. Syscall.
4. Active task restores synchronisation word.
..time passes..
5. Shadow task runs.
6. Looks at synchronisation word, which says "go back to sleep".
7. Shadow task sleeps with FUTEX_WAIT.
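Concretely, the handshake above might look something like this sketch, using
the raw futex syscall (sync_word, guarded_open and shadow_main are invented
names; error handling omitted):

#include <fcntl.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static int sync_word;   /* 1 = active task may be blocked, 0 = "go back to sleep" */

static long futex(int *uaddr, int op, int val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* Active task, around a potentially blocking syscall: */
static int guarded_open(const char *path, int flags)
{
    int fd;

    sync_word = 1;                          /* step 1 */
    futex(&sync_word, FUTEX_WAKE, 1);       /* step 2 */
    fd = open(path, flags);                 /* step 3 */
    sync_word = 0;                          /* step 4 */
    return fd;
}

/* Shadow task: */
static void *shadow_main(void *unused)
{
    for (;;) {
        /* steps 5-7: if the word says "go back to sleep", wait until it
         * changes away from 0; otherwise the active task is presumably
         * blocked, so take over its work here. */
        while (sync_word == 0)
            futex(&sync_word, FUTEX_WAIT, 0);
        /* ... become the active task, spawn a new shadow ... */
    }
    return NULL;
}

If the shadow gets the CPU between steps 2 and 4 even though the syscall never
blocked, that is the unnecessary scheduling discussed below.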
This isn't bad, except that a shadow task either runs every time we do a
potentially blocking system call from the active task, _or_ is often left
ready to run.
If it's just often ready to run, that's not a problem. If it always
runs immediately, that's two unnecessary context switches per system
call; quite an overhead, and I might as well hand off system calls to
helper threads in that case :)
Next way is the same, except that control is always handed to the
shadow task and the active task, when the system call is finished,
queues the current state machine for the shadow task to pick it up and
then sleeps. Effectively the active and shadow tasks swap roles on
each system call.
This may or may not be better, depending on whether we've reduced the
average number of context switches to 1 or increased it to 1 :)
It'd wreck the kernel scheduler's interactivity heuristics, too :):)
The first futex method would be quite efficient if a variant of
FUTEX_WAIT was clever enough not to need to be scheduled just to go
back to sleep when the word has the "go back to sleep" value.
Third way is just to use SIGCONT and SIGSTOP. Not the fastest, but
perhaps faster than the futex-induced context switches. It'd need to
be measured.
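A hedged sketch of that third way, using a forked shadow process (shadow_pid,
spawn_shadow and guarded_stat are invented names; with threads rather than
processes a thread-directed stop mechanism would be needed instead):

#include <signal.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static pid_t shadow_pid;

static void shadow_body(void)
{
    raise(SIGSTOP);             /* start out stopped */
    for (;;) {
        /* We only get here if we were CONTinued and actually ran, i.e.
         * the waker is presumably blocked: do its work. */
        /* ... run state machines ... */
        raise(SIGSTOP);         /* park again when done */
    }
}

static void spawn_shadow(void)
{
    shadow_pid = fork();
    if (shadow_pid == 0) {
        shadow_body();
        _exit(0);
    }
}

/* Around each potentially blocking syscall: */
static int guarded_stat(const char *path, struct stat *st)
{
    int ret;

    kill(shadow_pid, SIGCONT);  /* let the shadow run if we block  */
    ret = stat(path, st);
    kill(shadow_pid, SIGSTOP);  /* we didn't block: park it again  */
    return ret;
}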
None of these will work well if "wakee" tasks are able to run
immediately after being woken, before "waker" tasks get a chance to
either block or put the wakees back to sleep.
-- Jamie
On Sat, 16 Aug 2003 05:00, Felipe Alfaro Solana wrote:
> Well, I'm sorry to say there's something really wrong here with O16int.
> 2.6.0-test3-mm2 plus O16int takes exactly twice as long to boot as
Easy fix. Sorry about that. Here is an O16.1int patch. Backs out the selective
preemption by only interactive tasks.
Con
On Sat, 16 Aug 2003 04:26, Timothy Miller wrote:
> Con Kolivas wrote:
> > Preemption of tasks at the same level with twice as much timeslice has
> > been dropped as this is not necessary with timeslice granularity (may
> > improve performance of cpu intensive tasks).
>
> Does this situation happen where two tasks at different nice levels have
> dynamic priority adjustments which make them effectively have the same
> priority?
Yes it does. Preemption and order of scheduling are determined entirely by the
dynamic priority.
> > Preemption of user tasks is limited to those in the interactive range;
> > cpu intensive non interactive tasks can run out their full timeslice (may
> > also improve cpu intensive performance)
>
> What can cause preemption of a task that has not used up its timeslice?
Any task of better (dynamic) priority will preempt it.
> I assume a device interrupt could do this, but... there's a question I
> asked earlier which I haven't read the answer to yet, so I'm going to
> guess:
>
> A hardware timer interrupt happens at timeslice granularity. If the
> interrupt occurs, but the timeslice is not expired, then NORMALLY, the
> ISR would just return right back to the running task, but sometimes, it
> might decided to end the timeslice early and run some other task.
>
> Right?
No, the timeslice granularity is a hard cut off where a task gets rescheduled
and put at the back of the queue again. If there is no other task of equal or
better priority it will just start again.
> So, what might cause the scheduler to decide to preempt a task which has
> not used up its timeslice?
Better dynamic priority.
Con
At 01:54 AM 8/16/2003 +0100, Jamie Lokier wrote:
[...]
>None of these will work well if "wakee" tasks are able to run
>immediately after being woken, before "waker" tasks get a chance to
>either block or put the wakees back to sleep.
Sounds like another scheduler class (SCHED_NOPREEMPT) would be required.
-Mike
On Sat, 16 Aug 2003 01:49, Con Kolivas wrote:
> Tasks cannot preempt their own waker.
Looks like I can do this with a lot less code. Will post an update to this
which doesn't significantly change functionality but culls a lot of code.
Thanks Mike.
Con
Mike Galbraith wrote:
> At 01:54 AM 8/16/2003 +0100, Jamie Lokier wrote:
> [...]
>
> >None of these will work well if "wakee" tasks are able to run
> >immediately after being woken, before "waker" tasks get a chance to
> >either block or put the wakees back to sleep.
>
> Sounds like another scheduler class (SCHED_NOPREEMPT) would be required.
If something special were to be added, it should be a way for a task
to say "If I call schedule() and block, don't do a schedule, just
continue my timeslice in task X".
The point of the mechanism is to submit system calls in an
asynchronous fashion, after all. A proper task scheduling is
inappropriate when all we'd like to do is initiate the syscall and
continue processing, just as if it were an async I/O request.
The interesting part is what to do when the original task (the one
that went to sleep) wakes up.
-- Jamie
On Sat, Aug 16, 2003 at 08:14:36AM +0200, Mike Galbraith wrote:
> At 01:54 AM 8/16/2003 +0100, Jamie Lokier wrote:
> [...]
>
> >None of these will work well if "wakee" tasks are able to run
> >immediately after being woken, before "waker" tasks get a chance to
> >either block or put the wakees back to sleep.
> Sounds like another scheduler class (SCHED_NOPREEMPT) would be required.
Sounds more like a new futex feature: Wakeup after time expired.
It would be very easy to do, since we have a timeout handy
anyway (look at kernel/futex.c:do_futex()).
But it would require using a kernel timer, and we might add a
cancel of this wakeup returning some sane error if the wakeup
has already happened.
Then this "blocking" could be modeled as "blocking too long",
which is how this kind of thing is handled more sanely anyway.
The only thing that is disturbing about blocking is blocking
"too long". Blocking itself is ok, since it frees CPU time for
other processes the programmer is completely unaware of.
Regards
Ingo Oeser
Ingo Oeser wrote:
> Sounds more like a new futex feature: Wakeup after time expired.
>
> Then this "blocking" could be modeled as "blocking too long",
> which is how this kind of thing is handled more sanely anyway.
Nice idea, but it isn't quite right. If my active task takes a second
in the system call, but doesn't block, I probably don't want the other
task to run at all. There are occasions when a timer-initiated switch
like that would be useful, though.
The real problem is that if my active task blocks immediately,
e.g. because I call open() and it issues a disk I/O, I want to
continue handling the next work item as soon as the CPU is free. Not
after 1ms or 10ms of CPU idle time.
The ideal is something like async I/O, but that only works for a
subset of I/O operations, and it isn't practical to extend it to the
more complex I/O operations efficiently. (Inefficiently, yes, but I
may as well use my own helper threads if that's what aio would use).
This is a way to get equivalent efficiency to async I/O, that is
equivalent in terms of fully utilising the CPU(s) without lots of
threading contexts. Therefore it should achieve that, otherwise it
just isn't worth doing at all.
> The only thing that is disturbing about blocking is blocking
> "too long". Blocking itself is ok, since it frees CPU time for
> other processes the programmer is completely unaware of.
No. If there are other processes, the scheduler is perfectly able to
give time to them already. It's wrong to give other processes
_additional_ time, over what they would already get.
Secondly, blocking is just a waste of time when there are no other
processes: it leaves the CPU idle, reducing throughput and gaining nothing
(except a lower electricity bill, perhaps).
Thirdly, you're saying that I can have async system calls but I must
expect to give up some CPU time, either to the idle task or to someone
else.
In which case, I (like any other programmer) will choose to use
multiple threads instead and do it less efficiently, but at least my
program runs faster than it would with your suggestion - albeit with
less predictable response latencies :)
Conclusion - I like your idea, though it should work more like this:
- Initiate wakeup of shadow task on timeout _or_ this task blocking.
- Cancel wakeup of shadow task.
I don't know whether a modified futex or something else is right for it.
-- Jamie
At 03:18 PM 8/16/2003 +0100, Jamie Lokier wrote:
>Mike Galbraith wrote:
> > At 01:54 AM 8/16/2003 +0100, Jamie Lokier wrote:
> > [...]
> >
> > >None of these will work well if "wakee" tasks are able to run
> > >immediately after being woken, before "waker" tasks get a chance to
> > >either block or put the wakees back to sleep.
> >
> > Sounds like another scheduler class (SCHED_NOPREEMPT) would be required.
>
>If something special were to be added, it should be a way for a task
>to say "If I call schedule() and block, don't do a schedule, just
>continue my timeslice in task X".
>
>The point of the mechanism is to submit system calls in an
>asynchronous fashion, after all. A proper task scheduling is
>inappropriate when all we'd like to do is initiate the syscall and
>continue processing, just as if it were an async I/O request.
Ok, so you'd want a class where you could register an "exception handler"
prior to submitting a system call, and any subsequent schedule would be
treated as an exception? (they'd have to be nestable exceptions too
right?... <imagines stack explosions> egad:)
>The interesting part is what to do when the original task (the one
>that went to sleep) wakes up.
Yeah.
-Mike
Mike Galbraith wrote:
> >The point of the mechanism is to submit system calls in an
> >asynchronous fashion, after all. A proper task scheduling is
> >inappropriate when all we'd like to do is initiate the syscall and
> >continue processing, just as if it were an async I/O request.
>
> Ok, so you'd want a class where you could register an "exception handler"
> prior to submitting a system call, and any subsequent schedule would be
> treated as an exception? (they'd have to be nestable exceptions too
> right?... <imagines stack explosions> egad:)
Well, apart from not resembling exceptions, and no they don't nest :)
You may be wondering what happens when I do five stat() calls, all of
which should be asynchronous (topical: to get the best out of the
elevator).
Nested? Not quite. At each stat() call that blocks for I/O, its
shadow task becomes active; that creates its own shadow task (pulling
a kernel task from userspace's cache of them), then continues to
perform the next item of work, which is the next stat().
The result is five kernel threads, each blocked on I/O inside a stat()
call, exactly as desired. A sixth kernel thread, the only one running
of my program, is continuing the work of the program.
Soon, each of the I/O bound threads unblocks, returns to userspace,
stores its result, queues the next work of this state machine, adds
this kernel task to userspace's cache, and goes to sleep.
As you can see, this achieves asynchronous system calls which are too
complex for aio(*), best use of the I/O elevator, and 100% CPU
utilisation doing useful calculations.
Other user/kernel scheduler couplings are possible, but what I'm
describing doesn't ask for much(**). Just the right behaviour from
the kernel's scheduling heuristic: namely, waker not preempted by
wakee. Seems to be the way it's going anyway.
-- Jamie
(*) Performing a complex operation like open() or link()
asynchronously requires a kernel context for each operation in
progress, as it isn't practical to recode those as state machines.
In a sense, this sequence is close to an optimal way
to dispatch these I/O operations concurrently.
Jamie Lokier wrote:
>Mike Galbraith wrote:
>
>>>The point of the mechanism is to submit system calls in an
>>>asynchronous fashion, after all. A proper task scheduling is
>>>inappropriate when all we'd like to do is initiate the syscall and
>>>continue processing, just as if it were an async I/O request.
>>>
>>Ok, so you'd want a class where you could register an "exception handler"
>>prior to submitting a system call, and any subsequent schedule would be
>>treated as an exception? (they'd have to be nestable exceptions too
>>right?... <imagines stack explosions> egad:)
>>
>
>Well, apart from not resembling exceptions, and no they don't nest :)
>
Is it clear that this is a win over having a regular thread to
perform the system call for you? It's obviously a lot more complicated.
I _think_ what you describe is almost exactly what KSE or scheduler
activations in FreeBSD 5 does. I haven't yet seen a test where they
significantly beat regular threads. Although I'm not sure if FreeBSD
uses them for asynchronous syscalls, or just user-space thread
scheduling.
At 07:55 AM 8/17/2003 +0100, Jamie Lokier wrote:
>Mike Galbraith wrote:
> > >The point of the mechanism is to submit system calls in an
> > >asynchronous fashion, after all. A proper task scheduling is
> > >inappropriate when all we'd like to do is initiate the syscall and
> > >continue processing, just as if it were an async I/O request.
> >
> > Ok, so you'd want a class where you could register an "exception handler"
> > prior to submitting a system call, and any subsequent schedule would be
> > treated as an exception? (they'd have to be nestable exceptions too
> > right?... <imagines stack explosions> egad:)
>
>Well, apart from not resembling exceptions, and no they don't nest :)
(ok, I only misunderstood _almost_ everything:)
>You may be wondering what happens when I do five stat() calls, all of
>which should be asynchronous (topical: to get the best out of the
>elevator).
>
>Nested? Not quite. At each stat() call that blocks for I/O, its
>shadow task becomes active; that creates its own shadow task (pulling
>a kernel task from userspace's cache of them), then continues to
>perform the next item of work, which is the next stat().
>
>The result is five kernel threads, each blocked on I/O inside a stat()
>call, exactly as desired. A sixth kernel thread, the only one running
>of my program, is continuing the work of the program.
Oh. You just want to dispatch N syscalls from one entry to the kernel?
>Soon, each of the I/O bound threads unblocks, returns to userspace,
>stores its result, queues the next work of this state machine, adds
>this kernel task to userspace's cache, and goes to sleep.
>
>As you can see, this achieves asynchronous system calls which are too
>complex for aio(*), best use of the I/O elevator, and 100% CPU
>utilisation doing useful calculations.
>
>Other user/kernel scheduler couplings are possible, but what I'm
>describing doesn't ask for much(**). Just the right behaviour from
>the kernel's scheduling heuristic: namely, waker not preempted by
>wakee. Seems to be the way it's going anyway.
If that's all you need, a SCHED_NOPREEMPT (synchronous wakeups) class
should do the trick. I thought you wanted a huge truckload more than that.
-Mike
On Sun, 2003-08-17 at 13:12, Jamie Lokier wrote:
> > Oh. You just want to dispatch N syscalls from one entry to the kernel?
>
> No, not at all. I want to schedule cooperative state machines in
> userspace, in the classical select-loop style, but without idling the
> CPU when there's unpredictable blocking on disk I/O.
eg you want AIO stat().....
Mike Galbraith wrote:
> >You may be wondering what happens when I do five stat() calls, all of
> >which should be asynchronous (topical: to get the best out of the
> >elevator).
> >
> >Nested? Not quite. At each stat() call that blocks for I/O, its
> >shadow task becomes active; that creates its own shadow task (pulling
> >a kernel task from userspace's cache of them), then continues to
> >perform the next item of work, which is the next stat().
> >
> >The result is five kernel threads, each blocked on I/O inside a stat()
> >call, exactly as desired. A sixth kernel thread, the only one running
> >of my program, is continuing the work of the program.
>
> Oh. You just want to dispatch N syscalls from one entry to the kernel?
No, not at all. I want to schedule cooperative state machines in
userspace, in the classical select-loop style, but without idling the
CPU when there's unpredictable blocking on disk I/O.
The modern way is to use a few worker threads per CPU, but they
introduce latency problems and you still have to keep adapting the
number of threads to the type of workload. (See my response to Nick
Piggin and Ingo Oeser).
> >Soon, each of the I/O bound threads unblocks, returns to userspace,
> >stores its result, queues the next work of this state machine, adds
> >this kernel task to userspace's cache, and goes to sleep.
> >
> >As you can see, this achieves asynchronous system calls which are too
> >complex for aio(*), best use of the I/O elevator, and 100% CPU
> >utilisation doing useful calculations.
> >
> >Other user/kernel scheduler couplings are possible, but what I'm
> >describing doesn't ask for much(**). Just the right behaviour from
> >the kernel's scheduling heuristic: namely, waker not preempted by
> >wakee. Seems to be the way it's going anyway.
>
> If that's all you need, a SCHED_NOPREEMPT (synchronous wakeups) class
> should do the trick. I thought you wanted a huge truckload more than that.
Heh. It looks like that may not be needed, with Con's latest "wakee
doesn't preempt waker" patch. That's why this thread is a followup to
that one.
There are other efficiency concerns: sending SIGCONT and SIGSTOP
before and after each potentially-blocking syscall is not the fastest
thing in the world to do. Also it doesn't help with blocking due to
vm paging, but that can be worked around in other ways.
SCHED_NOPREEMPT is not right even in principle. An active task wakes
its shadow task, and the shadow task should not run unless the active
task blocks before putting the shadow task back to sleep. The wakeup
_is_ a synchronous wakeup, yet we don't want the shadow task to run.
-- Jamie
Arjan van de Ven wrote:
> > > Oh. You just want to dispatch N syscalls from one entry to the kernel?
> >
> > No, not at all. I want to schedule cooperative state machines in
> > userspace, in the classical select-loop style, but without idling the
> > CPU when there's unpredictable blocking on disk I/O.
>
> eg you want AIO stat().....
Plus AIO link(), unlink(), rename(), open() (especially open),
msync(), mlock(), mmap(), sendfile(), readdir(), readlink(), ioctl()
and a few more.
These are too complicated to re-implement using explicit state
machines in the kernel. That would require rewriting huge amounts of
code in all the individual filesystems. You could never really trust
all the blocking corner cases were removed - fetching metadata,
reading directory blocks, allocating memory etc.
That's why the kernel would implement AIO stat() et al. by creating a
set of task contexts to handle a queue of these requests.
Even that's at least as complicated as doing it in userspace.
Also, with the current AIO interface there will always be operations
that it doesn't support, while the scheduler method works with
anything that blocks, including VM paging.
That said, a generalised "async syscall" interface would be very nice.
It could automatically dispatch things which AIO supports using AIO,
and everything else using task contexts.
-- Jamie
At 06:12 PM 8/17/2003 +0100, Jamie Lokier wrote:
>Mike Galbraith wrote:
> > >You may be wondering what happens when I do five stat() calls, all of
> > >which should be asynchronous (topical: to get the best out of the
> > >elevator).
> > >
> > >Nested? Not quite. At each stat() call that blocks for I/O, its
> > >shadow task becomes active; that creates its own shadow task (pulling
> > >a kernel task from userspace's cache of them), then continues to
> > >perform the next item of work, which is the next stat().
> > >
> > >The result is five kernel threads, each blocked on I/O inside a stat()
> > >call, exactly as desired. A sixth kernel thread, the only one running
> > >of my program, is continuing the work of the program.
> >
> > Oh. You just want to dispatch N syscalls from one entry to the kernel?
>
>No, not at all. I want to schedule cooperative state machines in
>userspace, in the classical select-loop style, but without idling the
>CPU when there's unpredictable blocking on disk I/O.
>
>The modern way is to use a few worker threads per CPU, but they
>introduce latency problems and you still have to keep adapting the
>number of threads to the type of workload. (See my response to Nick
>Piggin and Ingo Oeser).
>
> > >Soon, each of the I/O bound threads unblocks, returns to userspace,
> > >stores its result, queues the next work of this state machine, adds
> > >this kernel task to userspace's cache, and goes to sleep.
> > >
> > >As you can see, this achieves asynchronous system calls which are too
> > >complex for aio(*), best use of the I/O elevator, and 100% CPU
> > >utilisation doing useful calculations.
> > >
> > >Other user/kernel scheduler couplings are possible, but what I'm
> > >describing doesn't ask for much(**). Just the right behaviour from
> > >the kernel's scheduling heuristic: namely, waker not preempted by
> > >wakee. Seems to be the way it's going anyway.
> >
> > If that's all you need, a SCHED_NOPREEMPT (synchronous wakeups) class
> > should do the trick. I thought you wanted a huge truckload more than that.
>
>Heh. It looks like that may not be needed, with Con's latest "wakee
>doesn't preempt waker" patch. That's why this thread is a followup to
>that one.
I think you'll need a truckload of something :) Maybe not the truckload of
nastiness I was imagining, but simply disabling preempt on wakeup ain't
gonna cut it. The synchronous wakeup I mentioned does that: it only marks
the freshly enqueued task as runnable, but there will be no preempt, no
priority recalculation, no migration, nada. However,...
>There are other efficiency concerns: sending SIGCONT and SIGSTOP
>before and after each potentially-blocking syscall is not the fastest
>thing in the world to do. Also it doesn't help with blocking due to
>vm paging, but that can be worked around in other ways.
>
>SCHED_NOPREEMPT is not right even in principle. An active task wakes
>its shadow task, and the shadow task should not run unless the active
>task blocks before putting the shadow task back to sleep. The wakeup
>_is_ a synchronous wakeup, yet we don't want the shadow task to run.
...once the shadow task is enqueued and runnable, there's nothing to
prevent the worker thread from exhausting its slice before it can put its
shadow back to sleep. This, and the "continue my slice in some other thread"
thing, is what made me think you'd have to deal with a schedule happening to
your worker thread with some kind of handler, and do all kinds of evil
things within... basically overloading the entire scheduler for that class.
-Mike
Mike Galbraith wrote:
> This, and the "continue my slice in some other thread" thing, is what
> made me think you'd have to deal with a schedule happening to your
> worker thread with some kind of handler, and do all kinds of evil
> things within... basically overloading the entire scheduler for that class.
Ew. Very nasty.
-- Jamie
Mike Galbraith wrote:
> ...once the shadow task is enqueued and runnable, there's nothing to
> prevent the worker thread from exhausting its slice before it can put its
> shadow back to sleep.
This is why the shadow needs to check whether the active task is
runnable when the shadow returns from FUTEX_WAIT. Reading
/proc/pid/stat alas, but hey that's what we have.
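A hedged sketch of that check (task_is_runnable is an invented name); the state
letter in /proc/<pid>/stat is the field after the parenthesised command name,
with 'R' meaning runnable:

#include <stdio.h>
#include <string.h>
#include <sys/types.h>

static int task_is_runnable(pid_t pid)
{
    char path[64], buf[256];
    FILE *f;
    char *p;
    int runnable = 0;

    snprintf(path, sizeof(path), "/proc/%d/stat", (int)pid);
    f = fopen(path, "r");
    if (!f)
        return 0;
    if (fgets(buf, sizeof(buf), f)) {
        /* "pid (comm) state ...": comm may contain spaces, so find the
         * last ')'; the state letter follows two characters later. */
        p = strrchr(buf, ')');
        if (p && p[1] == ' ' && p[2] != '\0')
            runnable = (p[2] == 'R');
    }
    fclose(f);
    return runnable;
}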
-- Jamie
Con, you're probably wondering what this thread has to do with you :)
It's because of the "wakee doesn't preempt waker" heuristic you
mentioned. I would like to know if it had the properties described
below, near where I mention futexes. Thanks :)
Nick Piggin wrote:
> Is it clear that this is a win over having a regular thread to
> perform the system call for you? Its obviously a lot more complicated.
See below. It's about predictable latency (in the absence of
contention) and maximising throughput at the same time.
Ingo Oeser wrote:
> On Sat, Aug 16, 2003 at 10:39:01PM +0100, Jamie Lokier wrote:
> > Ingo Oeser wrote:
> > Nice idea, but it isn't quite right. If my active task takes a second
> > in the system call, but doesn't block, I probably don't want the other
> > task to run at all. There are occasions when a timer-initiated switch
> > like that would be useful, though.
>
> Yes, I thought you are trying to solve a realtime latency
> problem. Instead you seem to be trying to solve a throughput
> problem.
No, I am trying to solve both problems together. Throughput is easily
maximised by simply creating enough worker threads that the CPU is
never idle. This is done by many server applications these days.
That causes unpredictable latencies, though. If the userspace state
machines hardly ever block, for example because most of the filesystem
metadata they use is in cache, the worker threads will be scheduled by
the kernel as CPU hogs.
This means, and this is only an approximation, that each worker thread
will run for most or all of its timeslice, and then be preempted for a
_long time_ as each of the others runs for its portion of timeslice.
Most of the little state machines will be able to do their work
quickly, and satisfy desired low latencies so long as the data they
need is not stalled behind a slow I/O, but some of the little state
machines will just seem to pause for a long time _even when they are
not stalled by I/O requirements_.
This gives good throughput, but makes certain latency guarantees,
namely "if I do not actually wait for I/O I should not spontaneously
delay for a long time", very difficult to guarantee.
> > The real problem is that if my active task blocks immediately,
> > e.g. because I call open() and it issues a disk I/O, I want to
> > continue handling the next work item as soon as the CPU is free. Not
> > after 1ms or 10ms of CPU idle time.
>
> So you are always alone on the machine? You basically talk about
> completely occupying the CPU, if your application has any work
> to do, without allowing any throughput to other maybe busy machines.
No, it is intended for server and interactive applications on machines
which _may_ have other tasks running.
I'm saying two things about CPU utilisation: 1. the CPU should never
be idle if my program has work to do that is not waiting on I/O;
2. it's ok to give some CPU up to other applications, _if there are
any running_, but it's not ok to give other applications _more_ CPU
than their fair share.
> > The ideal is something like async I/O, but that only works for a
> > subset of I/O operations, and it isn't practical to extend it to the
> > more complex I/O operations efficiently. (Inefficiently, yes, but I
> > may as well use my own helper threads if that's what aio would use).
>
> Yes, the big problem of vectorizing syscalls is error handling
> and errors following because of errors.
Vectorizing doesn't help. In the example of 5 stat() calls, those
calls could easily be due to 5 different service state machines, each
responding to a different user request. There's no easy way to work
out that they could have been submitted as a single vector.
This thread is about making system calls asynchronous; vector
syscalls are orthogonal to that, and best left for another time.
> This could be done today with having CPUs dedicated to thread
> sets. For processes we even have a solution to your problem: SIGSTOP
> and SIGCONT. You fork a shadow process for your running process and
> SIGSTOP it. If you think you'll block, you SIGCONT your shadow and
> the shadow blocks on your work indicator for you. If the
> maybe-blocking worker returns, it SIGSTOPs the shadow again.
Brilliant! That's exactly what I had in mind. :)
It is not perfect, because of the large number of additional kill()
syscalls, and it doesn't help with blocking due to VM paging, but it's
a fine solution in principle.
There's a scheduling heuristic problem if SIGCONT were to run the
other thread immediately, as the shadow task is likely to be classed
as "interactive". However, Con's "wakee doesn't preempt waker"
strategy may or may not prevent this.
What makes more sense to me is to wake up the shadow task, using a
futex, and leave the shadow task in the woken state all the time.
When it eventually runs, it checks whether its active partner is
currently running, and if so goes back to sleep, waiting on the futex.
This is nice because very few extra syscalls are needed: an occasional
FUTEX_WAIT and an occasional FUTEX_WAKE, but these aren't needed on
every potentially-blocking syscall.
The problem of course is that a nearly-always-runnable task will tend
to preempt the one which is running, so often that it become
inefficient. It would be great if Con's "wakee doesn't preempt waker"
heuristic prevents this, and makes the shadow task behave as we want.
> Now the semantics work even better. You use SCHED_RR for
> scheduling and the normal task has a higher priority than the
> shadow. Now your task will BY DEFINITION always be scheduled
> until it blocks (strong priority scheduling here!).
>
> It could be that I'm confusing this with SCHED_FIFO, but you get
> the idea ;-)
Yes, one of SCHED_RR or SCHED_FIFO would do it perfectly :)
Unfortunately that's not acceptable in a multi-user environment,
although SOFTRR _might_ work for some of the applications using this
technique.
In general, though, I hope the "wakee doesn't preempt waker" scheduler
heuristic will allow it to work, and still be fair in the presence of
other applications.
> Maybe we need a yield_this(tid) for this kind of work.
Maybe. I like to think the old Hierarchical Fair Scheduler patches
had the right idea. You could just use fixed priorities _within_
a node in the tree, and it would work and be fair with other processes.
-- Jamie
On Sun, Aug 17, 2003 at 09:02:53PM +0100, Jamie Lokier wrote:
> Vectorizing doesn't help. In the example of 5 stat() calls, those
> calls could easily be due to 5 different service state machines, each
> responding to a different user request. There's no easy way to work
> out that they could have been submitted as a single vector.
Well, it's pretty much orthogonal to "making everything async", but it
does have the advantage of batching and hence reducing the number of
system call traps that need to be made to get a given amount of work
done. It's unfortunate there aren't more users of the vectored APIs.
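For reference, a hedged example of one such vectored call, readv(), which fills
several buffers with a single trap (the wrapper name is invented):

#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Read a header and a body with one system call instead of two. */
static ssize_t read_header_and_body(int fd, void *hdr, size_t hlen,
                                    void *body, size_t blen)
{
    struct iovec iov[2];

    iov[0].iov_base = hdr;
    iov[0].iov_len  = hlen;
    iov[1].iov_base = body;
    iov[1].iov_len  = blen;
    return readv(fd, iov, 2);
}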
-- wli
* Con Kolivas <[email protected]> [15-08-2003 22:21]:
[snip]
> Those who experienced starvation could you please test this patch.
O16.1int on top of 2.6.0-test3-mm1 is an improvement over O15int but
it still has some issues. Sound skips and general unresponsiveness
occur under relatively light load. For example, scrolling a PDF in
acrobat sometimes results in 2-3 second skips in sound. Also, the PDF
continues to scroll long after I have left the scroll button. While
scrolling, if I try to switch to Firebird or some other relatively
heavy app, there is a noticeable delay before it comes up. Sometimes
even an xterm running mutt takes a second to show while a PDF is
scrolling.
I was doing a `make htmldocs` while scrolling the pdf. That is all. In
O15int, the same behaviour would occur even if there was _nothing_
else running (except the browser window and xmms of course). That is
the only improvement I have noticed.
I am attaching some numbers that were requested (vmstat, top and
readprofile outputs). These are generated from a script that was
posted here a few days ago. I am attaching the script so it is clear
what was done.
While this script was running, I was basically scrolling a PDF and one
long skip in sound was heard while doing so. I also kept switching
between application windows.
Hope this helps.
Regards,
- Apurva
--
Engineers motto: cheap, good, fast: choose any two
Thanks for the report.
On Mon, 18 Aug 2003 20:08, Apurva Mehta wrote:
> * Con Kolivas <[email protected]> [15-08-2003 22:21]:
> [snip]
>
> > Those who experienced starvation could you please test this patch.
>
> O16.1int on top of 2.6.0-test3-mm1 is an improvement over O15int but
> it still has some issues. Sound skips and general unresponsiveness
> occur under relatively light load. For example, scrolling a PDF in
> acrobat sometimes results in 2-3 second skips in sound. Also, the PDF
> continues to scroll long after I have left the scroll button. While
> scrolling, if I try to switch to Firebird or some other relatively
> heavy app, there is a noticeable delay before it comes up. Sometimes
> even an xterm running mutt takes a second to show while a PDF is
> scrolling.
Ah well acrobat reader does some very strange things, even stranger when it's
a plugin in mozilla. A kernel profile while running it (like yours) shows mad
virtual memory activity, not just rescheduling, so apart from starving acrobat
reader forcibly (or mozilla when acroread is the plugin) the scheduler can't
help it an awful lot. Try opening pdfs in another pdf viewer and you'll see
what I mean.
> I was doing a `make htmldocs` while scrolling the pdf. That is all. In
> O15int, the same behaviour would occur even if there was _nothing_
> else running (except the browser window and xmms of course). That is
> the only improvement I have noticed.
>
> I am attaching some numbers that were requested (vmstat, top and
> readprofile outputs). These are generated from a script that was
> posted here a few days ago. I am attaching the script so it is clear
> what was done.
>
> While this script was running, I was basically scrolling a PDF and one
> long skip in sound was heard while doing so. I also kept switching
> between application windows.
The (priority inversion) starvation issue is being actively attended to in a
different way at the moment but acroread is a different beast again, and we
shall see.
> Hope this helps.
Of course; any report contributes to the pool of information; thank you.
Con
Hi all,
On Sun, Aug 17, 2003 at 09:02:53PM +0100, Jamie Lokier wrote:
> It's because of the "wakee doesn't preempt waker" heuristic you
> mentioned.
Which could have strange effects, if not put in a scheduler class
unreachable by the normal user or made at least a compile-out option.
That's why I didn't even consider solving it in the kernel
scheduler and basically suggested rolling the scheduling of this
special case on your own ;-)
> Ingo Oeser wrote:
> > Yes, I thought you are trying to solve a realtime latency
> > problem. Instead you seem to be trying to solve a throughput
> > problem.
>
> No, I am trying to solve both problems together.
So you are on a quest for the holy grail of the real time people.
Do them a favor and write at least a paper about how you solved
this and I'll have even more good arguments (pro Linux) for my
new employer.
> Most of the little state machines will be able to do their work
> quickly, and satisfy desired low latencies so long as the data they
> need is not stalled behind a slow I/O, but some of the little state
> machines will just seem to pause for a long time _even when they are
> not stalled by I/O requirements_.
Now I completely understand your problem. Does it also arise if
your application is alone on the machine? If it doesn't, then you
are just hit by the bad programming of other people.
> This gives good throughput, but makes certain latency guarantees,
> namely "if I do not actually wait for I/O I should not spontaneously
> delay for a long time", very difficult to guarantee.
You can't guarantee that if you are not higher priority AND in
the realtime scheduler class (which you would like to avoid, as seen
below). There are other applications, which
might want to run, even while you are busy.
stupid example:
Consider syslogd logging loads of your debugging messages:
if you debug your application like this, you generate the most
debugging messages exactly when you are really busy.
> No, it is intended for server and interactive applications on machines
> which _may_ have other tasks running.
>
> I'm saying two things about CPU utilisation: 1. the CPU should never
> be idle if my program has work to do that is not waiting on I/O;
> 2. it's ok to give some CPU up to other applications, _if there are
> any running_, but it's not ok to give other applications _more_ CPU
> than their fair share.
1. and 2. seem to contradict in practice :-(
[SIGCONT/SIGSTOP as wakeup/wait]
> It is not perfect, because of the large number of additional kill()
> syscalls, and it doesn't help with blocking due to VM paging, but it's
> a fine solution in principle.
It helps a lot, since you want these signals to have SIG_DFL
handlers and thus they'll never reach user space. This could be
made a fast path, if your usage pattern gets its lobby.
And if the VM pages, you had better not interfere, since you do not have
enough knowledge about the overall system state to interfere. If
it becomes a problem, better fix the VM for low latency. The
audio people will kiss your feet for that.
> There's a scheduling heuristic problem if SIGCONT were to run the
> other thread immediately, as the shadow task is likely to be classed
> as "interactive". However, Con's "wakee doesn't preempt waker"
> strategy may or may not prevent this.
This doesn't matter, since the thread will immediately block if
there is no work to do. But I would still prefer a timeout here
to determine 'blocks too long' with this solution, to avoid even
the context switch when it's not necessary.
> What makes more sense to me is to wake up the shadow task, using a
> futex, and leave the shadow task in the woken state all the time.
> When it eventually runs, it checks whether its active partner is
> currently running, and if so goes back to sleep, waiting on the futex.
That's my 'work indicator'. My problem with this solution is that
this activation is nonsense if you go back to sleep most of the time.
A timer will prevent exactly this, but will limit throughput, as
you added.
> Yes, one of SCHED_RR or SCHED_FIFO would do it perfectly :)
>
> Unfortunately that's not acceptable in a multi-user environment,
> although SOFTRR _might_ work for some of the applications using this
> technique.
>
> In general, though, I hope the "wakee doesn't preempt waker" scheduler
> heuristic will allow it to work, and still be fair in the presence of
> other applications.
>
> > Maybe we need a yield_this(tid) for this kind of work.
>
> Maybe. I like to think the old Hierarchical Fair Scheduler patches
> had the right idea. You could just use fixed priorities _within_
> a node in the tree, and it would work and be fair with other processes.
That is the perfect solution to your problem. You tell the kernel
your priorities within the application and forget all the hacks
we talked about.
Problem: This scheduler is more complex.
Solution: A distinction between the group and the members of
a group, where both can be set through syscalls, would simplify
it a lot and provide most of the things we want.
A group can be either a thread pool or a process pool, with
explicitly set members.
Unfortunately I have neither the time, nor the problem, nor the
machines to implement this.
Con? Ingo?
Regards
Ingo Oeser
* Con Kolivas <[email protected]> [18-08-2003 17:04]:
> Thanks for report.
>
> On Mon, 18 Aug 2003 20:08, Apurva Mehta wrote:
> > * Con Kolivas <[email protected]> [15-08-2003 22:21]:
> > [snip]
> >
> > > Those who experienced starvation could you please test this patch.
> >
> > O16.1int on top of 2.6.0-test3-mm1 is an improvement over O15int but
> > it still has some issues. Sound skips and general unresponsiveness
> > occur under relatively light load. For example, scrolling a PDF in
> > acrobat sometimes results in 2-3 second skips in sound. Also, the PDF
> > continues to scroll long after I have left the scroll button. While
> > scrolling, if I try to switch to Firebird or some other relatively
> > heavy app, there is a noticeable delay before it comes up. Sometimes
> > even an xterm running mutt takes a second to show while a PDF is
> > scrolling.
>
> Ah well acrobat reader does some very strange things, even stranger
> when it's a plugin in mozilla. A kernel profile while running it
> (like yours) shows mad virtual memory activity, not just
> rescheduling, so apart from starving acrobat reader forcibly (or
> mozilla when acroread is the plugin) the scheduler can't help it an
> awful lot.
While acrobat's behaviour is the most pronounced, it is just one
example. Sound skips occur sometimes while switching tabs in firebird
as well. Inactive applications take longer to be redrawn. Your earlier
patches (O10 and O11int) were better in this respect.
I reported acrobat because it was the only app which I could provide
numbers for since its behaviour is consistent.
> Try opening pdfs in another pdf viewer and you'll see what I mean.
Hmm... you're right, things are a whole lot smoother in xpdf.
> > I was doing a `make htmldocs` while scrolling the pdf. That is all. In
> > O15int, the same behaviour would occur even if there was _nothing_
> > else running (except the browser window and xmms of course). That is
> > the only improvement I have noticed.
> >
> > I am attaching some numbers that were requested (vmstat, top and
> > readprofile outputs). These are generated from a script that was
> > posted here a few days ago. I am attaching the script so it is clear
> > what was done.
> >
> > While this script was running, I was basically scrolling a PDF and one
> > long skip in sound was heard while doing so. I also kept switching
> > between application windows.
>
> The (priority inversion) starvation issue is being actively attended to in a
> different way at the moment but acroread is a different beast again, and we
> shall see.
Yes, I have been following the threads.
> > Hope this helps.
>
> Of course; any report contributes to the pool of information; thank you.
You're most welcome! And thank you for putting so much work into the
scheduler.
Keep up the great work,
- Apurva
--
Engineer's motto: cheap, good, fast: choose any two
Ingo Oeser wrote:
> On Sun, Aug 17, 2003 at 09:02:53PM +0100, Jamie Lokier wrote:
> > It's because of the "wakee doesn't preempt waker" heuristic you
> > mentioned.
>
> Which could have strange effects if not put in a scheduler class
> unreachable by the normal user, or at least made a compile-out option.
Increasingly, a new type of futex with a hook into the scheduler looks
like the right answer to this question. But I prefer not to need a
kernel patch. Like most people with this dilemma, my programs are
likely to be run by folk who will not use non-mainstream kernels.
> That's why I didn't even consider solving it in the kernel
> scheduler and basically suggested rolling the scheduling of this
> special case on your own ;-)
Indeed. But how do you suggest rolling a scheduler on my own while
the kernel is blocking me? :)
> > Ingo Oeser wrote:
> > > Yes, I thought you are trying to solve a realtime latency
> > > problem. Instead you seem to be trying to solve a throughput
> > > problem.
> >
> > No, I am trying to solve both problems together.
>
> So you are on a quest for the holy grail of the real time people.
> Do them a favor and write at least a paper about how you solved
> this and I'll have even more good arguments (pro Linux) for my
> new employer.
Heh :) Am not really aiming for real-time, just reasonable response
times and no unnecessary idle time. My goal is web servers, file
servers, database servers and interactive apps. Real-time could be
another goal - does your employer pay for good papers? ;)
> > Most of the little state machines will be able to do their work
> > quickly, and satisfy desired low latencies so long as the data they
> > need is not stalled behind a slow I/O, but some of the little state
> > machines will just seem to pause for a long time _even when they are
> > not stalled by I/O requirements_.
>
> Now I completely understand your problem. Does it also arise if
> your application is alone on the machine? If it doesn't, then you
> are just being hit by other people's bad programming.
It _only_ arises if the application is alone. When there are others,
it's quite reasonable to expect to be preempted, if we are not doing real-time.
When the application is alone those unwanted long stalls occur when an
application has _self-contention_, due to running too many kernel
worker threads which preempt each other.
This is the main fundamental reason for not just running lots of
threads. (Secondary reasons are to reduce memory consumption and
reduce locking requirements).
> > This gives good throughput, but makes certain latency guarantees,
> > namely "if I do not actually wait for I/O I should not spontaneously
> > delay for a long time", very difficult to guarantee.
>
> You can't guarantee that if you are not higher priority AND in
> the realtime scheduler class (which you would like to avoid, as seen
> below). There are other applications which
> might want to run, even while you are busy.
Quite; it's ok to give up CPU under those circumstances. That is
fair, and not wasting anything.
> [SIGCONT/SIGSTOP as wakeup/wait]
> > It is not perfect, because of the large number of additional kill()
> > syscalls, and it doesn't help with blocking due to VM paging, but it's
> > a fine solution in principle.
>
> It helps a lot, since you want these signals to have SIG_DFL
> handlers and thus they'll never reach user space. This could be
> made a fast path, if your usage pattern gets its lobby.
Even this "fast path" is two extra syscalls per original syscall.
Syscalls are not so fast that we are happy to sprinkle them around.
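To make the cost concrete, the scheme boils down to something like
this (a sketch only, assuming the shadow is a separate process rather
than a thread sharing the address space):

#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Both signals keep their SIG_DFL behaviour, so nothing ever runs in
 * user space, but every hand-off still adds two kill()-class syscalls
 * on top of the syscall we were making anyway. */
static void shadow_wait(void)
{
        raise(SIGSTOP);             /* park until someone sends SIGCONT */
}

static void wake_shadow(pid_t shadow_pid)
{
        kill(shadow_pid, SIGCONT);  /* resume the stopped shadow        */
}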
> And if the VM pages, you had better not interfere, since you do not
> have enough knowledge about the overall system state to do so. If
> it becomes a problem, better to fix the VM for low latency. The
> audio people will kiss your feet for that.
There is nothing wrong with trying to do some other work during a VM
page-in. Folk do it all the time, it's called running multiple threads :)
> > There's a scheduling heuristic problem if SIGCONT were to run the
> > other thread immediately, as the shadow task is likely to be classed
> > as "interactive". However, Con's "wakee doesn't preempt waker"
> > strategy may or may not prevent this.
>
> This doesn't matter, since the thread will immediately block if
> there is no work to do.
It matters because of two unwanted context switches per original
syscall. That's overhead which would outweigh the gain in many benchmarks.
> But I would still prefer a timeout here
> to determine 'blocks too long' with this solution, to avoid even
> the context switch, when it's not necessary.
>
> > What makes more sense to me is to wake up the shadow task, using a
> > futex, and leave the shadow task in the woken state all the time.
> > When it eventually runs, it checks whether its active partner is
> > currently running, and if so goes back to sleep, waiting on the futex.
>
> That's my 'work indicator'. My problem with this solution is that
> this activation is nonsense if you go to sleep most of the time.
> A timer will prevent exactly this, but will limit throughput, as
> you added.
The point of wake-without-putting-back-to-sleep is to avoid most of
the extra syscalls and the context switches. Otherwise the overhead
of the mechanism makes it perform poorly in the cases where there is
very little blocking.
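For concreteness, a rough sketch of that idea (simplified: the shared
partner_running flag and do_pending_work() are stand-ins for whatever
the real application uses, and the futex word handling is kept
minimal):

#include <linux/futex.h>      /* FUTEX_WAIT, FUTEX_WAKE */
#include <sys/syscall.h>      /* SYS_futex */
#include <unistd.h>

static int futex_word;                 /* the shadow sleeps on this      */
static volatile int partner_running;   /* maintained by the active task  */

extern void do_pending_work(void);     /* stand-in for the real work     */

static long futex(int *uaddr, int op, int val)
{
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* Active side: wake the shadow once and never put it back to sleep. */
static void wake_shadow_once(void)
{
        futex(&futex_word, FUTEX_WAKE, 1);
}

/* Shadow side: it stays nominally woken; only when it finally gets the
 * CPU does it decide whether to wait again. */
static void shadow_loop(void)
{
        for (;;) {
                if (partner_running) {
                        /* Partner is making progress on its own, so go
                         * back to waiting on the futex. */
                        futex(&futex_word, FUTEX_WAIT, 0);
                        continue;
                }
                /* Partner appears blocked: take over the pending work. */
                do_pending_work();
        }
}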
-- Jamie
Con Kolivas wrote:
>>
>>A hardware timer interrupt happens at timeslice granularity. If the
>>interrupt occurs, but the timeslice is not expired, then NORMALLY, the
>>ISR would just return right back to the running task, but sometimes, it
>>might decided to end the timeslice early and run some other task.
>>
>>Right?
>
>
> No, the timeslice granularity is a hard cut-off where a task gets rescheduled
> and put at the back of the queue again. If there is no other task of equal or
> better priority, it will just start again.
Hmmm... I'm still having trouble making sense of this.
So, it seems that you're saying that all tasks, regardless of timeslice
length, are interrupted every 10ms (at 100hz). If another task exists
at a higher priority, then it gets run at that point. However, if there
is more than one task at a given priority level, then they will not
round-robin until the current task has used up all of its timeslice
(some integer multiple of 10ms).
Am I finally correct, or do I still have it wrong? :)
Timothy Miller wrote:
>
>
> Con Kolivas wrote:
>
>>>
>>> A hardware timer interrupt happens at timeslice granularity. If the
>>> interrupt occurs, but the timeslice is not expired, then NORMALLY, the
>>> ISR would just return right back to the running task, but sometimes, it
>>> might decided to end the timeslice early and run some other task.
>>>
>>> Right?
>>
>>
>>
>> No, the timeslice granularity is a hard cut off where a task gets
>> rescheduled and put at the back of the queue again. If there is no
>> other task of equal or better priority it will just start again.
>
>
>
> Hmmm... I'm still having trouble making sense of this.
>
> So, it seems that you're saying that all tasks, regardless of
> timeslice length, are interrupted every 10ms (at 100hz).
Interrupted - in that scheduler_tick is run, yes. They don't get switched.
> If another task exists at a higher priority, then it gets run at
> that point.
Well, loosely, yes. Actually, it happens if the task exists and is "running"
and has timeslice left. That only happens in scheduler_tick when the current
task has finished its timeslice and the priority arrays are about to be
switched. The required conditions for preemption can also occur when a task
is being woken up (after sleeping or newly forked).
> However, if there is more than one task at a given priority level,
> then they will not round-robin until the current task has used up all
> of its timeslice (some integer multiple of 10ms).
That's right, although I think Con or Ingo recently changed this in -mm.
>
>
> Am I finally correct, or do I still have it wrong? :)
>
Sounds right to me.
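Condensed, the tick path amounts to roughly the following (a
paraphrase, not the literal sched.c code; the interactivity credit and
the timeslice-granularity round-robin are left out):

/* Per-tick accounting in the 2.6 O(1) scheduler, simplified. */
if (!--p->time_slice) {                      /* slice fully used up       */
        dequeue_task(p, rq->active);
        p->time_slice = task_timeslice(p);   /* refill for its next turn  */
        set_tsk_need_resched(p);             /* reschedule on return      */
        enqueue_task(p, rq->expired);        /* back of the expired array */
}
/* Nothing here forces a switch mid-slice.  When rq->active finally runs
 * empty, schedule() swaps the active and expired arrays and the best
 * priority task from the old expired array runs next. */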
Nick Piggin wrote:
>> If another task exists at a higher priority, then it gets run at
>> that point.
>
>
>
> Well, loosely, yes. Actually, it happens if the task exists and is
> "running" and has timeslice left.
> That only happens in scheduler_tick when the current
> task has finished its timeslice and the priority arrays are about to be
> switched.
What only happens then?
I'm confused again. Are you talking about swapping the active and
expired arrays?
Of course, all bets are off if the current task actually uses up its
whole timeslice. Then it's not being preempted in quite the same way.
So, then, if there are no tasks left in the active array, naturally,
the highest priority task from what was once the expired array will be
run, and that may be of higher priority.
Is that what you're saying?
> The required conditions for preemption can also occur when a task
> is being woken up, (after sleeping or newly forked).
This is the case that I was thinking of. No swapping of queues. It's
just that a higher priority task was sleeping (or not existing) which
could cause the current task to be preempted before its timeslice ends.
Timothy Miller wrote:
>
>
> Nick Piggin wrote:
>
>>> If another task exists at a higher priority, then it gets run at
>>> that point.
>>
>>
>>
>>
>> Well, loosely, yes. Actually, it happens if the task exists and is
>> "running" and has timeslice left.
>
>
>> That only happens in scheduler_tick when the current
>> task has finished its timeslice and the priority arrays are about to be
>> switched.
>
>
> What only happens then?
The task being preempted.
>
> I'm confused again. Are you talking about swapping the active and
> expired arrays?
Yes.
>
> Of course, all bets are off if the current task actually uses up its
> whole timeslice. Then it's not being preempted in quite the same way.
OK ;)
>
> So, then, if there are no tasks left in the active array, naturally,
> the highest priority task from what was once the expired array will be
> run, and that may be of higher priority.
>
> Is that what you're saying?
Yep.
>
>
>> The required conditions for preemption can also occur when a task
>> is being woken up (after sleeping or newly forked).
>
>
> This is the case that I was thinking of. No swapping of queues. It's
> just that a higher priority task was sleeping (or not existing) which
> could cause the current task to be preempted before its timeslice ends.
Yep.
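In the wakeup path it comes down to a check along these lines
(paraphrased from the wakeup code, not quoted literally):

/* A task that just became runnable (after sleeping, or newly forked)
 * preempts the currently running task if its priority is better, even
 * though the current task's timeslice has not expired. */
if (p->prio < rq->curr->prio)          /* lower value == better priority */
        resched_task(rq->curr);        /* mark current for rescheduling  */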