LinuxLists.cc - Scheduling problem with 2.4?

2003-05-17 11:09:50

Subject: Scheduling problem with 2.4?

I have a problem with Emacs which crawls when used as a shell. If I
call M-x shell RET and then do something like
hexdump -v /dev/null|dd count=100k bs=1
then characters appear rather slowly at first, most of them in single
character packets, with an occasional 1024 packet in between. More
of them at the end. The rate of the packets does not exceed 100/sec.

Other operating systems do not exhibit this extreme case of
throttling. Emacs is using select calls AFAICS for dealing with
packets arriving, and it would appear that the delivering process
causes a context switch immediately when writing its single byte to
the pipe or is taken from the run queue while the select call is
pending. Whatever. In the ensuing time slice, Emacs processes the
single arriving byte and then enters the select call again.

Now that the pipe is not allowed to fill up in most cases (most: in
some, increasingly more at the later time, the pipe gets filled
alternately complete and with single bytes) before a context switch
occurs is already unfortunate with regard to the efficiency of
processing. What is quite more of a nuisance is the apparent
throttling to a single byte per clock tick: the CPU is idling most of
the time. It would be unfortunate enough if it were going full blast
at processing 1-byte packets, but most of the time it does nothing
but wait for the next tick.

Now I would want to

a) get sufficient information about what might go wrong inside of
Linux as compared to other operating systems.

b) find a good workaround for not triggering this kind of behavior.

I don't have sufficient kernel experience to debug this one, and it
definitely impacts the speed of operations severely, and in close
cooperation with the operating system. So any pointers would be
welcome.

--
David Kastrup, Kriemhildstr. 15, 44793 Bochum

2003-05-17 17:28:11

by Andrea Arcangeli

[permalink] [raw]

Subject: Re: Scheduling problem with 2.4?

On Sat, May 17, 2003 at 01:22:23PM +0200, David Kastrup wrote:
>
> I have a problem with Emacs which crawls when used as a shell. If I
> call M-x shell RET and then do something like
> hexdump -v /dev/null|dd count=100k bs=1

this certainly can't be it

andrea@dualathlon:~> hexdump -v /dev/null|dd count=100k bs=1
0+0 records in
0+0 records out
andrea@dualathlon:~>

how can that run slow, there's nothing to hexdump in /dev/null and
nothing to read for dd?

One related hint, since you're playing with >4k buffers with dd, make
sure to run a -aa kernel with the lowlatency fixes (latest is
2.4.21rc2aa1), or those large writes can loop in the kernel forbid
preemption (in this case maybe not if they block in a socket/pipe before
100k is all written)

overall often people notices improvement with preempt becuse they
compare it with buggy kernels (I know this has nothing to do with your
report, but it's related to the above)

Andrea

2003-05-17 19:23:56

by David.Kastrup

[permalink] [raw]

Subject: Re: Scheduling problem with 2.4?

Attachments:

(No filename) (825.00 B)
testio.el (1.52 kB)
(No filename) (901.00 B)
Download all attachments

2003-05-17 20:18:01

by Andrea Arcangeli

[permalink] [raw]

Subject: Re: Scheduling problem with 2.4?

On Sat, May 17, 2003 at 09:36:29PM +0200, David Kastrup wrote:
> Andrea Arcangeli <[email protected]> writes:
>
> > On Sat, May 17, 2003 at 01:22:23PM +0200, David Kastrup wrote:
> > >
> > > I have a problem with Emacs which crawls when used as a shell. If I
> > > call M-x shell RET and then do something like
> > > hexdump -v /dev/null|dd count=100k bs=1
> >
> > this certainly can't be it
> >
> > andrea@dualathlon:~> hexdump -v /dev/null|dd count=100k bs=1
> > 0+0 records in
> > 0+0 records out
>
> Argl. Substitute /dev/zero in the obvious place. My stupidity.
>
> Here is a test file for generating further stats from within Emacs.
> You can call it interactively from within Emacs with
> M-x load-library RET testio.el RET
> M-x make-test RET
>
> and you can use it from the command line (if you really must) with
> emacs -batch -q -no-site-file -l testio.el -eval "(progn(make-test t
> 'kill-emacs)(sit-for 100000))"
>

>
>
> It will then output the read stuff from the process followed by
> statistics of block sizes over each second it ran.
>
> The interactive session tends to be somewhat worse, but the command
> line version is bad enough for a demonstration.

this is within emacs:

1 blocks with size 95
1 blocks with size 146
1 blocks with size 175
1 blocks with size 189
1 blocks with size 199
2 blocks with size 231
2 blocks with size 232
1 blocks with size 233
1 blocks with size 238
1 blocks with size 241
1 blocks with size 242
1 blocks with size 243
1 blocks with size 244
1 blocks with size 245
1 blocks with size 247
1 blocks with size 252
2 blocks with size 257
1 blocks with size 260
2 blocks with size 261
2 blocks with size 262
1 blocks with size 263
1 blocks with size 266
1 blocks with size 271
1 blocks with size 276
1 blocks with size 282
1 blocks with size 283
1 blocks with size 289
1 blocks with size 296
1 blocks with size 297
1 blocks with size 299
1 blocks with size 300
2 blocks with size 302
1 blocks with size 304
1 blocks with size 306
2 blocks with size 307
1 blocks with size 308
1 blocks with size 310
1 blocks with size 311
3 blocks with size 312
1 blocks with size 313
2 blocks with size 315
1 blocks with size 316
1 blocks with size 317
1 blocks with size 322
1 blocks with size 324
1 blocks with size 330
1 blocks with size 345
1 blocks with size 347
2 blocks with size 355
2 blocks with size 359
2 blocks with size 363
1 blocks with size 368
1 blocks with size 369
2 blocks with size 373
1 blocks with size 374
2 blocks with size 375
1 blocks with size 383
1 blocks with size 384
2 blocks with size 385
1 blocks with size 386
2 blocks with size 387
1 blocks with size 389
1 blocks with size 391
1 blocks with size 392
2 blocks with size 393
3 blocks with size 394
1 blocks with size 397
2 blocks with size 398
4 blocks with size 400
1 blocks with size 402
1 blocks with size 405
1 blocks with size 409
1 blocks with size 414
1 blocks with size 418
1 blocks with size 420
4 blocks with size 423
1 blocks with size 424
1 blocks with size 428
1 blocks with size 429
1 blocks with size 430
1 blocks with size 432
1 blocks with size 433
1 blocks with size 436
1 blocks with size 437
1 blocks with size 443
2 blocks with size 445
1 blocks with size 448
2 blocks with size 451
1 blocks with size 453
1 blocks with size 455
1 blocks with size 472
2 blocks with size 485
1 blocks with size 490
1 blocks with size 498
1 blocks with size 504
1 blocks with size 505
1 blocks with size 510
1 blocks with size 512
2 blocks with size 533
2 blocks with size 537
1 blocks with size 540
1 blocks with size 545
1 blocks with size 551
1 blocks with size 557
2 blocks with size 562
1 blocks with size 571
1 blocks with size 572
1 blocks with size 573
1 blocks with size 578
1 blocks with size 579
1 blocks with size 582
1 blocks with size 588
2 blocks with size 596
1 blocks with size 598
1 blocks with size 599
1 blocks with size 600
1 blocks with size 602
2 blocks with size 610
1 blocks with size 616
1 blocks with size 620
2 blocks with size 623
1 blocks with size 628
1 blocks with size 629
1 blocks with size 630
2 blocks with size 638
1 blocks with size 639
2 blocks with size 640
1 blocks with size 641
1 blocks with size 642
1 blocks with size 644
1 blocks with size 648
1 blocks with size 650
1 blocks with size 651
1 blocks with size 652
1 blocks with size 654
1 blocks with size 656
1 blocks with size 657
1 blocks with size 659
1 blocks with size 662
1 blocks with size 665
1 blocks with size 667
2 blocks with size 670
1 blocks with size 672
1 blocks with size 679
1 blocks with size 687
1 blocks with size 720
1 blocks with size 735
1 blocks with size 746
1 blocks with size 754
1 blocks with size 775
1 blocks with size 834
1 blocks with size 844
1 blocks with size 894
1 blocks with size 896
1 blocks with size 906
1 blocks with size 928
1 blocks with size 980
2 blocks with size 987
1 blocks with size 991
1 blocks with size 998
1 blocks with size 1023
8 blocks with size 1024

from the shell prompt:

6 blocks with size 26
5 blocks with size 27
14 blocks with size 28
10 blocks with size 29
4 blocks with size 30
2 blocks with size 31
2 blocks with size 32
2 blocks with size 33
4 blocks with size 34
1 blocks with size 35
1 blocks with size 36
2 blocks with size 38
2 blocks with size 39
1 blocks with size 41
1 blocks with size 45
1 blocks with size 46
3 blocks with size 48
1 blocks with size 49
1 blocks with size 52
3 blocks with size 55
1 blocks with size 57
1 blocks with size 58
2 blocks with size 60
1 blocks with size 61
1 blocks with size 62
1 blocks with size 63
1 blocks with size 65
1 blocks with size 71
2 blocks with size 75
1 blocks with size 79
1 blocks with size 86
1 blocks with size 120
1 blocks with size 128
2 blocks with size 147
1 blocks with size 190
1 blocks with size 233
1 blocks with size 241
1 blocks with size 250
1 blocks with size 269
1 blocks with size 273
1 blocks with size 407
1 blocks with size 564
1 blocks with size 614
2 blocks with size 626
1 blocks with size 667
1 blocks with size 680
1 blocks with size 709
1 blocks with size 711
1 blocks with size 712
1 blocks with size 719
1 blocks with size 722
1 blocks with size 729
1 blocks with size 732
2 blocks with size 737
1 blocks with size 759
1 blocks with size 772
1 blocks with size 1014
5 blocks with size 1023
31 blocks with size 1024

my emacs is:

GNU Emacs 21.2.1
Copyright (C) 2001 Free Software Foundation, Inc.
GNU Emacs comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of Emacs
under the terms of the GNU General Public License.
For more information about these matters, see the file named COPYING.

>
> It is obvious that during most of the time, the pipe only receives
> single characters.
>
> Again, I am pretty sure that Emacs is at fault too, but I don't
> understand the implications of what scheduling behavior causes the
> pipes to remain mostly empty, with an occasional full filling. It
> would be better if Emacs would not be context-switched into the
> moment something appears in the pipe, but if the writer to the pipe
> would keep the opportunity to fill'er'up until it is either preempted
> or needs to wait. If there was some possibility to force this
> behavior from within Emacs, this would be good to know.

I don't see any slowdown here.

As said the 100k bs will lead to just 1 syscall for lots of I/O, this
made me think the lowlatency fixes can be related. So I would suggest
you to try to reproduce with my 2.4 tree or with 2.5 + -preempt enabled.

Just for the record, this is the patch I'm talking about (you can apply
it easily to other 2.4, this should really be merged into mainline too):

http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/kernels/v2.4/2.4.21rc2aa1/9998_lowlatency-fixes-12

Andrea

2003-05-17 20:31:40

by David.Kastrup

[permalink] [raw]

Subject: Re: Scheduling problem with 2.4?

Andrea Arcangeli <[email protected]> writes:

> > The interactive session tends to be somewhat worse, but the command
> > line version is bad enough for a demonstration.
>
> this is within emacs:
>
> 1 blocks with size 95
> 1 blocks with size 146
> 1 blocks with size 175

[...]

> 1 blocks with size 980
> 2 blocks with size 987
> 1 blocks with size 991
> 1 blocks with size 998
> 1 blocks with size 1023
> 8 blocks with size 1024
>
> from the shell prompt:
>

[...]

Wel, those are excellent results and as it should be.

> my emacs is:
>
> GNU Emacs 21.2.1

Same here.

I am talking about the kernel coming with RedHat 9, uname -a gives
Linux lola.goethe.zz 2.4.20-8 #1 Thu Mar 13 17:54:28 EST 2003 i686 i686 i386 GNU/Linux

Unfortunately, this kernel is here to stay for quite a while, and I
would want to find a way to let Emacs cooperate with it better without
telling people to recompile the kernel or wait a year.

> > Again, I am pretty sure that Emacs is at fault too, but I don't
> > understand the implications of what scheduling behavior causes the
> > pipes to remain mostly empty, with an occasional full filling. It
> > would be better if Emacs would not be context-switched into the
> > moment something appears in the pipe, but if the writer to the
> > pipe would keep the opportunity to fill'er'up until it is either
> > preempted or needs to wait. If there was some possibility to
> > force this behavior from within Emacs, this would be good to know.
>
> I don't see any slowdown here.

Which kernel? Oh, and of course: on a SMP machine the problem would
not be visible since then the context switch to Emacs does not stop
the other process from stuffing into the pipe.

> http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/kernels/v2.4/2.4.21rc2aa1/9998_lowlatency-fixes-12

Taking notice, but will need some time to apply. And I really would
want to find a way to avoid triggering this problem.

--
David Kastrup, Kriemhildstr. 15, 44793 Bochum

2003-05-17 21:40:56

by Andrea Arcangeli

[permalink] [raw]

Subject: Re: Scheduling problem with 2.4?

On Sat, May 17, 2003 at 10:44:16PM +0200, David Kastrup wrote:
> Which kernel? Oh, and of course: on a SMP machine the problem would

I run it on smp. but I understood what's going on.

--- testio.el.orig 2003-05-17 22:22:24.000000000 +0200
+++ testio.el 2003-05-17 23:39:49.000000000 +0200
@@ -19,7 +19,7 @@
(setq test-pattern nil test-start (float-time))
(let ((process (start-process
"test" (and (bufferp printer) printer) "sh"
- "-c" "od -v /dev/zero|dd bs=1 count=100k")))
+ "-c" "od -v /dev/zero| dd bs=100k count=1")))
(set-process-filter process `(lambda (process string)
(test-filter process string
',printer)))

this is what I thought you were really doing with this script, not the
other way around ;).

The reason you get the 1 char reads in your version is because dd will
only write 1 char at time to emacs. This has nothing to do with emacs,
this is about dd writing 1 char at time. If you don't write 1 char at
time, emacs won't read 1 char at time.

The only thing you could do to decrease the cxt switch flood is to put a
buffer with timeout in between, but by default having a buffer with
timeout in the kernel would be an overkill overhead and depending on the
timeout it could be visible for normal shells. So the kernel behaviour
is right and it isn't going to change.

If one important app of yours is dumb and writes flood of data 1 char at
time then just write a buffer in userspace yourself, and in 99% of cases
a dd bs=4k will just work perfectly (you'll need a timeout-stdout-flush
version one only if the pipe keeps staying open i.e. if the app doesn't
quit).

Andrea

2003-05-17 21:41:51

by Barry K. Nathan

[permalink] [raw]

Subject: Re: Scheduling problem with 2.4?

On Sat, May 17, 2003 at 10:44:16PM +0200, David Kastrup wrote:
> I am talking about the kernel coming with RedHat 9, uname -a gives
> Linux lola.goethe.zz 2.4.20-8 #1 Thu Mar 13 17:54:28 EST 2003 i686 i686 i386 GNU/Linux
>
> Unfortunately, this kernel is here to stay for quite a while, and I
> would want to find a way to let Emacs cooperate with it better without
> telling people to recompile the kernel or wait a year.

They should be upgrading to the kernel here, at least, for security
reasons:
https://rhn.redhat.com/errata/RHSA-2003-172.html

I have no idea what effect, if any, it would have on your performance
problems though.

-Barry K. Nathan <[email protected]>

2003-05-17 22:24:22

by David.Kastrup

[permalink] [raw]

Subject: Re: Scheduling problem with 2.4?

Andrea Arcangeli <[email protected]> writes:

> On Sat, May 17, 2003 at 10:44:16PM +0200, David Kastrup wrote:
> > Which kernel? Oh, and of course: on a SMP machine the problem would
>
> I run it on smp.

Then the results are worthless. The problem is that the write to the
pipe causes an immediate context switch, keeping the pipe from
gathering useful amounts of information.

> but I understood what's going on.
>
> --- testio.el.orig 2003-05-17 22:22:24.000000000 +0200
> +++ testio.el 2003-05-17 23:39:49.000000000 +0200
> @@ -19,7 +19,7 @@
> (setq test-pattern nil test-start (float-time))
> (let ((process (start-process
> "test" (and (bufferp printer) printer) "sh"
> - "-c" "od -v /dev/zero|dd bs=1 count=100k")))
> + "-c" "od -v /dev/zero| dd bs=100k count=1")))
> (set-process-filter process `(lambda (process string)
> (test-filter process string
> ',printer)))

That utterly and entirely defeats the purpose of my demonstration.
Emacs crawls as a terminal/shell/whatever exactly because programs
writing to a pty do not write in chunks of 100k.

> The reason you get the 1 char reads in your version is because dd will
> only write 1 char at time to emacs.

Of course it does. I told it to do so. But there is no necessity to
do an immediate context switch: it would be completely sufficient if
Emacs (which is waiting on select) were put in the run queue and
scheduled when the time slice of dd was up. Performance gets better
at some time only because Emacs spends some time with other business
(like garbage collection) and gets preempted outside of the select
system call, and only _then_ can dd fill the pipe in a whiff.

> This has nothing to do with emacs, this is about dd writing 1 char
> at time. If you don't write 1 char at time, emacs won't read 1 char
> at time.

But if I am doing process communication with other processes, the I/O
_will_ arrive in small portions, and when the generating processes are
running on the same CPU instead of being I/O bound, I don't stand a
chance of working efficiently, namely utilizing the pipes, if I do a
quasi-synchronous context switch by force even if the time slice has
been hardly touched.

Pipes that never fill up cause overhead. I think it amusing that a
pipeline on a single processor system will yield _vastly_ increased
performance the moment the receiving end works as _slowly_ as to get
preempted. Only then will the sender be able to fill the pipe,
causing a large performance gain.

> The only thing you could do to decrease the cxt switch flood is to put a
> buffer with timeout in between, but by default having a buffer with
> timeout in the kernel would be an overkill overhead and depending on the
> timeout it could be visible for normal shells. So the kernel behaviour
> is right and it isn't going to change.

Visible for normal shells? Hardly. We are talking about letting the
output generating process live out its time slice (10ms) or being
scheduled because of waiting for something else. Do you think you
will negatively notice that the first letter of the prompt appears
10ms later together with the complete rest of the prompt, instead of
the letters appearing one by one, but the first immediately?

> If one important app of yours is dumb and writes flood of data 1
> char at time

All interactive apps, which includes almost anything run in a pty,
write data in small chunks. If you context switch them away from the
pipe the moment they have placed a small chunk in, performance will
suffer.

Wasn't the idea behind the anticipative I/O scheduler something like
that? I remember vaguely.

--
David Kastrup, Kriemhildstr. 15, 44793 Bochum

2003-05-17 23:37:59

by Andrea Arcangeli

[permalink] [raw]

Subject: Re: Scheduling problem with 2.4?

On Sun, May 18, 2003 at 12:37:01AM +0200, David Kastrup wrote:
> Of course it does. I told it to do so. But there is no necessity to
> do an immediate context switch: it would be completely sufficient if
> Emacs (which is waiting on select) were put in the run queue and
> scheduled when the time slice of dd was up. Performance gets better

the switch happens because emacs has higher dynamic priority, as it was
sleeping for the longest time. without these special cases for the
interactive tasks we couldn't use these long timeslices without making
the system not responsive.

> (like garbage collection) and gets preempted outside of the select
> system call, and only _then_ can dd fill the pipe in a whiff.
>
> > This has nothing to do with emacs, this is about dd writing 1 char
> > at time. If you don't write 1 char at time, emacs won't read 1 char
> > at time.
>
> But if I am doing process communication with other processes, the I/O
> _will_ arrive in small portions, and when the generating processes are
> running on the same CPU instead of being I/O bound, I don't stand a
> chance of working efficiently, namely utilizing the pipes, if I do a

writing 1 byte per syscall isn't very efficient in the first place (no
matter if the cxt switch happens or not).

I see what you mean, but I still don't think it is a problem. If
bandwidth matters you will have to use large writes and reads anyways,
if bandwidth doesn't matter the number of ctx switches doesn't matter
either and latency usually is way more important with small messages.

you're applying small messages to a "throughput" test, this is why you
have a problem. If you really did interprocess communication (and not a
throughput benchmark) you would probably want the smallest delay in the
delivery of the signal/message.

Andrea

2003-05-18 00:03:41

by David Schwartz

[permalink] [raw]

Subject: RE: Scheduling problem with 2.4?

> I see what you mean, but I still don't think it is a problem. If
> bandwidth matters you will have to use large writes and reads anyways,
> if bandwidth doesn't matter the number of ctx switches doesn't matter
> either and latency usually is way more important with small messages.

> Andrea

This is the danger of pre-emption based upon dynamic priorities. You can
get cases where two processes each are permitted to make a very small amount
of progress in alternation. This can happen just as well with large writes
as small ones, the amount of data is irrelevent, it's the amount of CPU time
that's important, or to put it another way, it's how far a process can get
without suffering a context switch.

I suggest that a process be permitted to use up at least some portion of
its timeslice exempt from any pre-emption based solely on dynamic
priorities.

DS

2003-05-18 00:11:34

by David.Kastrup

[permalink] [raw]

Subject: Re: Scheduling problem with 2.4?

Andrea Arcangeli <[email protected]> writes:

> On Sun, May 18, 2003 at 12:37:01AM +0200, David Kastrup wrote:
> > Of course it does. I told it to do so. But there is no necessity to
> > do an immediate context switch: it would be completely sufficient if
> > Emacs (which is waiting on select) were put in the run queue and
> > scheduled when the time slice of dd was up. Performance gets better
>
> the switch happens because emacs has higher dynamic priority, as it
> was sleeping for the longest time. without these special cases for
> the interactive tasks we couldn't use these long timeslices without
> making the system not responsive.

Then we will need a smaller timeslice for making this decision. If I
have a sending process able to write 1000000 characters per second,
it is wasteful if such a process is scheduled away after writing a
single line (most processes running inside of a shell will work
line-buffered, won't they?).

> > But if I am doing process communication with other processes, the I/O
> > _will_ arrive in small portions, and when the generating processes are
> > running on the same CPU instead of being I/O bound, I don't stand a
> > chance of working efficiently, namely utilizing the pipes, if I do a
>
> writing 1 byte per syscall isn't very efficient in the first place
> (no matter if the cxt switch happens or not).

Sure, but we are more typically talking about writing a line at a
time, which is easily by a factor 50 smaller than the pipe capacity
for typical output.

> I see what you mean, but I still don't think it is a problem. If
> bandwidth matters you will have to use large writes and reads
> anyways, if bandwidth doesn't matter the number of ctx switches
> doesn't matter either and latency usually is way more important with
> small messages.
>
> you're applying small messages to a "throughput" test, this is why
> you have a problem. If you really did interprocess communication
> (and not a throughput benchmark) you would probably want the
> smallest delay in the delivery of the signal/message.

Not when I could have my pipe filled within a fraction of a
millisecond _without_ idling (starting up the reading process the
moment that the writing process has taken more than its due of time
or is in itself waiting is, of course, perfectly sensible).

Xterm:
time dd if=/dev/zero bs=16k count=16|od -v
real 0m8.656s
user 0m0.240s
sys 0m0.090s

time dd if=/dev/zero bs=16k count=16|od -v|dd obs=16k
real 0m3.794s
user 0m0.240s
sys 0m0.060s

Do you really think that I would have appreciated the "smaller
latency" of few milliseconds at best in the first case? Note that
this is exclusively a uni-processor problem: the fast context switch
will starve the writing process from CPU time and make sure that even
if it could crank out hundreds of writes in millisecond, it will only
get a single one of them placed into the pipe, ever, unless the
receiving end gets preempted before reaching select, at which time the
writer can finally stuff the pipe completely.

This all-or-nothing utilization of a pipe or pty is not a sane
tradeoff.

--
David Kastrup, Kriemhildstr. 15, 44793 Bochum

2003-05-18 00:53:32

by Andrea Arcangeli

[permalink] [raw]

Subject: Re: Scheduling problem with 2.4?

On Sat, May 17, 2003 at 05:16:33PM -0700, David Schwartz wrote:
>
> > I see what you mean, but I still don't think it is a problem. If
> > bandwidth matters you will have to use large writes and reads anyways,
> > if bandwidth doesn't matter the number of ctx switches doesn't matter
> > either and latency usually is way more important with small messages.
>
> > Andrea
>
> This is the danger of pre-emption based upon dynamic priorities. You can
> get cases where two processes each are permitted to make a very small amount
> of progress in alternation. This can happen just as well with large writes
> as small ones, the amount of data is irrelevent, it's the amount of CPU time
> that's important, or to put it another way, it's how far a process can get
> without suffering a context switch.
>
> I suggest that a process be permitted to use up at least some portion of
> its timeslice exempt from any pre-emption based solely on dynamic
> priorities.

that's the issue yes. but then when a multithreaded app sends a signal
to another process it can take up to this "x" timeslice portion before
the signal will run (I mean in UP). Same goes for mouse clicks etc..
1msec for mouse clicks should not be noticeable though. And over all I
don't see a real big issue in the current code.

To try it probably the simpler way to add a need_resched_timeout
along to need_resched, and to honour the need_resched only when the
timeout triggers, immediate need_resched will set the timeout = jiffies
so it'll trigger immediatly, the timer interrupt will check it. The
reschedule_idle on a non idle cpu will be the only one to set the
timeout. Effectively I'm curious to see what will happen. Not all archs
would need to check against it (the entry.S code is the main reader of
need_resched), it'll be an hint only and idle() for sure must not check
it at all. this would guarantee minimal timeslice reserved up to 1/HZ by
setting the timeout to jiffies + 2 (jiffies + 1 would return a mean of
1/HZ/2 but the worst case would be ~0, in practice even + 1 would be
enough) Does anybody think this could have a value? If yes I can make a
quick hack to see what happens.

Andrea

2003-05-18 08:37:57

by Mike Galbraith

[permalink] [raw]

Subject: RE: Scheduling problem with 2.4?

At 05:16 PM 5/17/2003 -0700, David Schwartz wrote:

> > I see what you mean, but I still don't think it is a problem. If
> > bandwidth matters you will have to use large writes and reads anyways,
> > if bandwidth doesn't matter the number of ctx switches doesn't matter
> > either and latency usually is way more important with small messages.
>
> > Andrea
>
> This is the danger of pre-emption based upon dynamic priorities.
> You can
>get cases where two processes each are permitted to make a very small amount
>of progress in alternation. This can happen just as well with large writes
>as small ones, the amount of data is irrelevent, it's the amount of CPU time
>that's important, or to put it another way, it's how far a process can get
>without suffering a context switch.
>
> I suggest that a process be permitted to use up at least some
> portion of
>its timeslice exempt from any pre-emption based solely on dynamic
>priorities.

Cool.

Thank you for the spiffy idea. I implemented this in my (hack/chop) mm5
tree in about 30 seconds, and it works just fine. Very simple
time_after(jiffies, p->last_run + MIN_TIMESLICE) checks in wake_up_cpu()
plus the obvious dinky change to schedule(), and it worked
instantly. Hmm.. I wonder if this might help some of the other scheduler
(seemingly) bad behavior as well.

Is there any down-side to not preempting quite as often? It seems like
there should be a bandwidth gain.

-Mike

2003-05-18 09:29:00

by David.Kastrup

[permalink] [raw]

Subject: Re: Scheduling problem with 2.4?

Andrea Arcangeli <[email protected]> writes:

> On Sat, May 17, 2003 at 05:16:33PM -0700, David Schwartz wrote:
> >
> > This is the danger of pre-emption based upon dynamic
> > priorities. You can get cases where two processes each are
> > permitted to make a very small amount of progress in
> > alternation. This can happen just as well with large writes as
> > small ones, the amount of data is irrelevent, it's the amount of
> > CPU time that's important, or to put it another way, it's how far
> > a process can get without suffering a context switch.
> >
> > I suggest that a process be permitted to use up at least some
> > portion of its timeslice exempt from any pre-emption based solely
> > on dynamic priorities.
>
> that's the issue yes. but then when a multithreaded app sends a
> signal to another process it can take up to this "x" timeslice
> portion before the signal will run (I mean in UP). Same goes for
> mouse clicks etc.. 1msec for mouse clicks should not be noticeable
> though. And over all I don't see a real big issue in the current
> code.

Well, the problem that we have is excessive synchronous context
switching, so the solution might be to simply throttle on that. It's
no problem the first few times round, but after some time the
scheduler should recognize the pattern and clamp down. So I'd propose
that we give a process that yields synchronously while it could be
ready to run an accumulating priority boost that gets wasted pretty
fast when the process does a full wait (aka the pipe is full) or is
preempted.

That is, have the scheduler penalize/throttle application-visible
context switches between runnable processes. On particular if the
processes are dependent on one another, like in the case of the pipe.
We can proud ourselves to optimize the kernel to make a context switch
in 1us, but if each context switch occurs 2ms of avoidable overhead in
our application, the bottom line to the user will remain ugly.

> To try it probably the simpler way to add a need_resched_timeout
> along to need_resched, and to honour the need_resched only when the
> timeout triggers, immediate need_resched will set the timeout = jiffies
> so it'll trigger immediatly, the timer interrupt will check it. The
> reschedule_idle on a non idle cpu will be the only one to set the
> timeout. Effectively I'm curious to see what will happen. Not all archs
> would need to check against it (the entry.S code is the main reader of
> need_resched), it'll be an hint only and idle() for sure must not check
> it at all. this would guarantee minimal timeslice reserved up to 1/HZ by
> setting the timeout to jiffies + 2 (jiffies + 1 would return a mean of
> 1/HZ/2 but the worst case would be ~0, in practice even + 1 would be
> enough) Does anybody think this could have a value? If yes I can make a
> quick hack to see what happens.

We probably need not immediate action. A good test case is an xterm,
I guess, and I noticed that od has an option for limiting its length.

So I propose that comparing the output of

time od -vN 100k /dev/zero

to that of

time od -vN 100k /dev/zero|dd obs=16k

in an xterm should provide some idea about _typical_ overhead occured
in an _application_ due to excessive context switching.

If you want to get really nasty, you can compare to
time od -vN 100k /dev/zero|dd obs=1

Note that in all of these cases, it is by far the xterm that is
wasting the lion's share of the processing time (and so you need to
take a look at the _real_ time expended, since xterm will not be
measured in the time command), the kernel people and the generating
processes will wash their hands off any guilt: "_we_ do our work
efficiently and give xterm all the time of the world to be able to
empty the pipe, and let it take all the time it needs to without
pushing it". Only that xterm could work much more efficiently if one
pushed it by piling things into the pipe.

Another silly exercise is the following:

time od -v /dev/zero|dd obs=1

in an xterm. Keep it running. Now in a different term, enter

while true;do :;done

The above xterm will freeze while the interactive shell still has its
bonus, and then will start working again, at a much higher speed than
when it has an idle CPU at its disposal.

Kill off the busy loop, and the xterm gets slow again.

Come to think of it, I have witnessed the phenomenon of "slow start
xterms" that get ragingly more efficient after several screenfulls
when they _can't_ keep up on a busy system for years on a large
variety of Unices with single CPU. It is some of those quirks one
does not think about too much.

So I repeat: I should think a fast accumulating scheduling malus for a
synchronous context switch to a process woken by an event from a still
runnable process should be appropriate. If anybody can think of a
scheme that magically converges to a good tradeoff between average
fill level of a pipe and estimated processing overhead by the
application at the receiving end, of course this would be appreciated.

The aim would be that running an additional

while true;do :;done

should not usually be able to speed up the _real_ time performance of
an unrelated task.

--
David Kastrup, Kriemhildstr. 15, 44793 Bochum

2003-05-18 17:33:31

by David Schwartz

[permalink] [raw]

Subject: RE: Scheduling problem with 2.4?

> Is there any down-side to not preempting quite as often? It seems like
> there should be a bandwidth gain.
>
> -Mike

The theoretical down-side is that interactivity might suffer a bit because
a process isn't scheduled quite as quickly. Yes, the less-often you preempt
a process, the faster the system will go in the sense of work done per unit
time. But those pesky users want their characters to echo quickly and the
mouse pointer to track their physical motions.

Obviously, we must preempt when a process with a higher static priority
becomes ready to run. However, preempting based on dynamic priorities has
permitted time slices to be even longer, permitting a reduction in context
switches without sacrificing interactivity.

I still believe, however, that a process should be 'guaranteed' some slice
of time every time it's scheduled unless circumstances make it impossible to
allow the process to continue running. IMO, the pendulum has swung too far
in favor of interactivity. Obviously, if the process faults, blocks, or a
process with higher static priority becomes ready to run, then we must
terminate the process' time slice early.

DS

2003-05-18 22:58:37

by Andrea Arcangeli

[permalink] [raw]

Subject: Re: Scheduling problem with 2.4?

On Sun, May 18, 2003 at 10:55:17AM +0200, Mike Galbraith wrote:
> At 05:16 PM 5/17/2003 -0700, David Schwartz wrote:
>
> >> I see what you mean, but I still don't think it is a problem. If
> >> bandwidth matters you will have to use large writes and reads anyways,
> >> if bandwidth doesn't matter the number of ctx switches doesn't matter
> >> either and latency usually is way more important with small messages.
> >
> >> Andrea
> >
> > This is the danger of pre-emption based upon dynamic priorities.
> >You can
> >get cases where two processes each are permitted to make a very small
> >amount
> >of progress in alternation. This can happen just as well with large writes
> >as small ones, the amount of data is irrelevent, it's the amount of CPU
> >time
> >that's important, or to put it another way, it's how far a process can get
> >without suffering a context switch.
> >
> > I suggest that a process be permitted to use up at least some
> >portion of
> >its timeslice exempt from any pre-emption based solely on dynamic
> >priorities.
>
> Cool.
>
> Thank you for the spiffy idea. I implemented this in my (hack/chop) mm5
> tree in about 30 seconds, and it works just fine. Very simple
> time_after(jiffies, p->last_run + MIN_TIMESLICE) checks in wake_up_cpu()

If I understand well, what you did is different (in functionalty) from
what I described (and what I described in the last email certainly takes
more than 30 seconds no matter how smart you implement it ;). I mean,
you lose the whole "wakeup" information, yeah that will fix
it too like deleting the need_resched =1 after the check on the
curr->prio enterely, but while it's so simple you you don't only
guarantee the miniumum timeslice, but you let the current task running
even after it expired the minimum timeslice. That will most certainly
hurt interactivity way too much, I don't think it's an option, unless
you want to trim significantly the timeslice length too. The only reason
we can take these long timeslices are the interactivity hints, we always
had those in linux, all versions. If you start to ignore it, things
should not do too well, even throughput can decrease in a multithread
environment due the slower delivery of events.

Delaying a interprocess message for 1msec (or even 10msec) [i.e. 1/HZ]
may not be noticeable, but delaying it for a whole timeslice will
certainly be noticeable, that's an order of magnitude bigger delay.

Actually besides the way I described yesterday (that would require arch
asm changes) to experiment the "miniumum timeslice guarantee", it could
also be implemented by moving the timeslice-min_timeslice in a
rest_timeslice field if jiffies - last_run < MIN_TIMESLICE and if
rest_timeslice is zero, and trimming the timeslice down to
min_timeslice. Then the next schedule would put rest_timeslice back in
timeslice. This is on the lines of what you implemented but it also
guaranteees that the higher dyn-prio task will be scheduled after this
min_timeslice (to still provide the ~same interactive behaviour, which
is a must IMHO ;) This will be a bit more difficult of the need_resched
secondary field, but it's probably conceptually cleaner, since it
restricts the algorithm in the scheduler keeping the asm fast paths
small and simple.

> plus the obvious dinky change to schedule(), and it worked
> instantly. Hmm.. I wonder if this might help some of the other scheduler
> (seemingly) bad behavior as well.
>
> Is there any down-side to not preempting quite as often? It seems like
> there should be a bandwidth gain.

it's certainly on the right trick for this issue, in the sense there
will be the wanted bandwidth gain in the 1 byte write case to a pty ;)
(like dropping the need_resched = 1, that even goes down to 5 sec and a
1 liner patch), however I don't think the 30 sec modification alone is
just a generic solution. delaying the reschedule to min_timeslice may be
accpetable instead.

Andrea

2003-05-18 23:05:53

by Andrea Arcangeli

[permalink] [raw]

Subject: Re: Scheduling problem with 2.4?

On Sun, May 18, 2003 at 10:46:24AM -0700, David Schwartz wrote:
>
> > Is there any down-side to not preempting quite as often? It seems like
> > there should be a bandwidth gain.
> >
> > -Mike
>
> The theoretical down-side is that interactivity might suffer a bit because
> a process isn't scheduled quite as quickly. Yes, the less-often you preempt
> a process, the faster the system will go in the sense of work done per unit
> time. But those pesky users want their characters to echo quickly and the
> mouse pointer to track their physical motions.
>
> Obviously, we must preempt when a process with a higher static priority

the static priority is the same for all tasks in the system unless you
use nice. I believe linux should do well without forcing the user to set
the GUI at high prio etc..

> becomes ready to run. However, preempting based on dynamic priorities
> has
> permitted time slices to be even longer, permitting a reduction in context

the dyn prio doesn't change the timeslice size, it only chooses if to
wakeup a task immediatly or not, timeslices are calculated in function
of the static priority, not the dyn prio. Those interactive tasks uses
nearly zero of their timeslice anyways, the size of the timeslice has
little meaning for interactive tasks, what matters is "when" they run
and that's controlled by the dyn prio.

> switches without sacrificing interactivity.
>
> I still believe, however, that a process should be 'guaranteed' some slice
> of time every time it's scheduled unless circumstances make it impossible to
> allow the process to continue running. IMO, the pendulum has swung too far
> in favor of interactivity. Obviously, if the process faults, blocks, or a
> process with higher static priority becomes ready to run, then we must

if the static priority is much higher the immediate switch already
happens. But if we choose to guarantee a min_timeslice to tasks to avoid
the ctx switch flood for your testcase, then also the higher static prio
tasks will have to wait for this min_timeslice. The issue is no
different with higher static prio, or you will complain next time that
you get a ctx switch flood from a dd bs=1 writing into a pty connected
to an xterm running at -20 prio.

> terminate the process' time slice early.
>
> DS
>
>

Andrea

2003-05-19 02:46:12

by David Schwartz

[permalink] [raw]

Subject: RE: Scheduling problem with 2.4?

> > Obviously, we must preempt when a process with a higher
> > static priority

> the static priority is the same for all tasks in the system unless you
> use nice. I believe linux should do well without forcing the user to set
> the GUI at high prio etc..

Exactly. Pre-emption based on dynamic priorities is an "if it helps"
feature and not one required by the semantics. On the other hand,
pre-emption based on static priorities is pretty much required and not doing
that would certainly violate the principle of least astonishment.

> > becomes ready to run. However, preempting based on dynamic priorities
> > has
> > permitted time slices to be even longer, permitting a reduction
> > in context

> the dyn prio doesn't change the timeslice size, it only chooses if to
> wakeup a task immediatly or not, timeslices are calculated in function
> of the static priority, not the dyn prio. Those interactive tasks uses
> nearly zero of their timeslice anyways, the size of the timeslice has
> little meaning for interactive tasks, what matters is "when" they run
> and that's controlled by the dyn prio.

You say "the dyn prio doesn't change the timeslice size", but that's not
true. If a process with low dynamic priority writes to a FIFO, its timeslice
will be cut short, whereas one with a higher dynamic priorit would enjoy a
longer timeslice. Yes, priority doesn't affect the *intended* timeslice, but
it can affect the effective timeslice.

> > I still believe, however, that a process should be
> > 'guaranteed' some slice
> > of time every time it's scheduled unless circumstances make it
> > impossible to
> > allow the process to continue running. IMO, the pendulum has
> > swung too far
> > in favor of interactivity. Obviously, if the process faults,
> > blocks, or a
> > process with higher static priority becomes ready to run, then we must

> if the static priority is much higher the immediate switch already
> happens.

As it certainly should. The maximum amount of time we could justify
delaying a task with a higher static priority in the name of better 'bulk
data' performance is so small that it's probably not feasible to do it at
all.

> But if we choose to guarantee a min_timeslice to tasks to avoid
> the ctx switch flood for your testcase, then also the higher static prio
> tasks will have to wait for this min_timeslice.

No. I don't think that's appropriate. Higher statis priority is an explicit
decision made by users for the specific reason of getting the faster
response. Under no circumstances should an NTP process that received a UDP
packet be prevented from getting that packet as quickly as possible.

> The issue is no
> different with higher static prio, or you will complain next time that
> you get a ctx switch flood from a dd bs=1 writing into a pty connected
> to an xterm running at -20 prio.

With static priorities, you can always tell people "Don't do that then, you
got what you asked for. The system doesn't know what you 'really want'."
With dynamic priorities, however, this is not the case.

I'm not sure what the best solution is. But I think it's along the lines of
deferring a deschedule based purely on dynamic priorities if the task has
used only a small fraction of its timeslice. Perhaps, for example, a normal
timeslice should be four jiffies. If a reschule would normally occur during
the first jiffy, and it's based purely on dynamic priorities, we set a flag
which we check when the first jiffy ends.

Now maybe in some cases where the static priorities only differ slightly,
we should defer even based on static priorities. Perhaps immediate
pre-emption during the beginning of a timeslice should only occur when the
winning process has an elevated priority (and not just because the losing
process has a reduced one).

I don't really know. But I think what we're doing now is fundamentally
wrong because processes should be allowed to use up their timeslices when
possible.

DS

2003-05-19 03:45:07

by Mike Galbraith

[permalink] [raw]

Subject: RE: Scheduling problem with 2.4?

At 10:55 AM 5/18/2003 +0200, Mike Galbraith wrote:
>At 05:16 PM 5/17/2003 -0700, David Schwartz wrote:
>
>> I suggest that a process be permitted to use up at least some
>> portion of
>>its timeslice exempt from any pre-emption based solely on dynamic
>>priorities.
>
>
>Is there any down-side to not preempting quite as often? It seems like
>there should be a bandwidth gain.

(my reply was supposed to be off-list, but..)

The answer appears to be yes, there are down-sides. Instead of the
expected throughput gain, I got a loss :-/ which makes sense from the
cache side I suppose. While it was instant gratification for the pipe
testcase, as a general solution it leaves something to be desired... at
least when implemented in it's simplest form Oh well.

(hmm.. maybe I should go stare at the cache decay stuff some more. later)

-Mike

2003-05-19 06:58:47

by Mike Galbraith

[permalink] [raw]

Subject: Re: Scheduling problem with 2.4?

At 01:11 AM 5/19/2003 +0200, Andrea Arcangeli wrote:
>On Sun, May 18, 2003 at 10:55:17AM +0200, Mike Galbraith wrote:
> > At 05:16 PM 5/17/2003 -0700, David Schwartz wrote:
> >
> > >> I see what you mean, but I still don't think it is a problem. If
> > >> bandwidth matters you will have to use large writes and reads anyways,
> > >> if bandwidth doesn't matter the number of ctx switches doesn't matter
> > >> either and latency usually is way more important with small messages.
> > >
> > >> Andrea
> > >
> > > This is the danger of pre-emption based upon dynamic priorities.
> > >You can
> > >get cases where two processes each are permitted to make a very small
> > >amount
> > >of progress in alternation. This can happen just as well with large writes
> > >as small ones, the amount of data is irrelevent, it's the amount of CPU
> > >time
> > >that's important, or to put it another way, it's how far a process can get
> > >without suffering a context switch.
> > >
> > > I suggest that a process be permitted to use up at least some
> > >portion of
> > >its timeslice exempt from any pre-emption based solely on dynamic
> > >priorities.
> >
> > Cool.
> >
> > Thank you for the spiffy idea. I implemented this in my (hack/chop) mm5
> > tree in about 30 seconds, and it works just fine. Very simple
> > time_after(jiffies, p->last_run + MIN_TIMESLICE) checks in wake_up_cpu()
>
>If I understand well, what you did is different (in functionalty) from
>what I described (and what I described in the last email certainly takes
>more than 30 seconds no matter how smart you implement it ;). I mean,

The p->last_run is a typo, it's effectively current->last_run, with
last_run also being updated upon task switch so you can see how long he's
had the cpu. That can be done much quicker than 30 seconds by someone who
can type with fingers instead of thumbs ;-)

>you lose the whole "wakeup" information, yeah that will fix
>it too like deleting the need_resched =1 after the check on the
>curr->prio enterely, but while it's so simple you you don't only
>guarantee the miniumum timeslice, but you let the current task running
>even after it expired the minimum timeslice. That will most certainly
>hurt interactivity way too much, I don't think it's an option, unless
>you want to trim significantly the timeslice length too. The only reason
>we can take these long timeslices are the interactivity hints, we always
>had those in linux, all versions. If you start to ignore it, things
>should not do too well, even throughput can decrease in a multithread
>environment due the slower delivery of events.

Throughput did decrease. I was thinking that there was likely to be
another event before my ignoring the preempt could matter much. As
mentioned privately, I guaranteed that no task could hold the cpu for more
than 50ms max, so on the next schedule he'd get the cpu (if prio high
enough) but the test results weren't encouraging.

>Delaying a interprocess message for 1msec (or even 10msec) [i.e. 1/HZ]
>may not be noticeable, but delaying it for a whole timeslice will
>certainly be noticeable, that's an order of magnitude bigger delay.
>
>Actually besides the way I described yesterday (that would require arch
>asm changes) to experiment the "miniumum timeslice guarantee", it could
>also be implemented by moving the timeslice-min_timeslice in a
>rest_timeslice field if jiffies - last_run < MIN_TIMESLICE and if
>rest_timeslice is zero, and trimming the timeslice down to
>min_timeslice. Then the next schedule would put rest_timeslice back in
>timeslice. This is on the lines of what you implemented but it also
>guaranteees that the higher dyn-prio task will be scheduled after this
>min_timeslice (to still provide the ~same interactive behaviour, which
>is a must IMHO ;) This will be a bit more difficult of the need_resched
>secondary field, but it's probably conceptually cleaner, since it
>restricts the algorithm in the scheduler keeping the asm fast paths
>small and simple.

I'd really like to try a deferred preempt. I don't know enough (yet
anyway;) to not create spectacular explosions with that idea though. I'll
go ponder your idea instead.

-Mike