Hi,
Here goes -rc6. I've decided to delay 2.4.21 a bit and try Andrew's fix
for the IO stalls/deadlocks.
Please test it.
Summary of changes from v2.4.21-rc5 to v2.4.21-rc6
==================================================
<[email protected]>:
o IDE config.in correctness
Andi Kleen <[email protected]>:
o x86-64 fix for the ioport problem
Andrew Morton <[email protected]>:
o Fix IO stalls and deadlocks
Marcelo Tosatti <[email protected]>:
o Add missing via82xxx PCI ID
o Backout erroneous fsync on last opener at close()
o Changed EXTRAVERSION to -rc6
On Thu, 29 May 2003 10:55, Marcelo Tosatti wrote:
> Here goes -rc6. I've decided to delay 2.4.21 a bit and try Andrew's fix
> for the IO stalls/deadlocks.
Good for you. Well done Marcelo!
> Please test it.
Yes everyone who gets these stalls please test it also!
> Andrew Morton <[email protected]>:
> o Fix IO stalls and deadlocks
For those interested these are patches 1 and 2 from akpm's proposed fixes in
the looong thread discussing this problem.
Con
On Thu, May 29, 2003 at 11:22:20AM +1000, Con Kolivas wrote:
> On Thu, 29 May 2003 10:55, Marcelo Tosatti wrote:
> > Andrew Morton <[email protected]>:
> > o Fix IO stalls and deadlocks
>
> For those interested these are patches 1 and 2 from akpm's proposed fixes in
> the looong thread discussing this problem.
Are you sure? I'm no C programmer, but it looks to me like all three
patches are in 21-rc6.
And I still see the stalls, although it's much reduced. :( I just had mutt
freeze cold on me though for ~15 sec when it tried to open my debian-devel
mbox (rather large file) while brag was beating on the drive.
<whimper>
--
Marc Wilson | You have had a long-term stimulation relative to
[email protected] | business.
Hi Marc.
> I just had mutt freeze cold on me though for ~15 sec when
> it tried to open my debian-devel mbox (rather large file)
> while brag was beating on the drive.
>
> <whimper>
I used to get the same effect when I asked pine to open the
Linux-Kernel mailbox on my system. I long since cured that by
having procmail split Linux-Kernel mail into multiple mailboxes,
one for each calendar week.
The basic problem there is that any mail client needs to know
just how many messages are in a particular folder to handle that
folder, and the only way to do this is to count them all. That's
what takes the time when one opens a large folder.
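The weekly split described above can be sketched like this (a hypothetical Python stand-in for the actual procmail recipe; the folder naming is made up):

```python
import mailbox
from email.utils import parsedate_to_datetime

def split_by_week(src_path, dest_dir):
    """Distribute each message of a big mbox into per-calendar-week mboxes,
    so no single folder ever grows large enough to be slow to open."""
    src = mailbox.mbox(src_path)
    for msg in src:
        try:
            year, week, _ = parsedate_to_datetime(msg["Date"]).isocalendar()
            name = "lkml-%04d-W%02d" % (year, week)
        except (TypeError, ValueError):
            name = "lkml-undated"        # missing or unparsable Date header
        dest = mailbox.mbox("%s/%s" % (dest_dir, name))
        dest.add(msg)
        dest.close()                     # close() flushes the appended message
    src.close()
```

split_by_week("big.mbox", "weekly/") would leave one small mbox per ISO week, which a client like pine or mutt can then count quickly.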
Best wishes from Riley.
---
* Nothing as pretty as a smile, nothing as ugly as a frown.
---
Outgoing mail is certified Virus Free.
Checked by AVG anti-virus system (http://www.grisoft.com).
Version: 6.0.484 / Virus Database: 282 - Release Date: 27-May-2003
On Thu, May 29, 2003 at 06:34:48AM +0100, Riley Williams wrote:
> The basic problem there is that any mail client needs to know
> just how many messages are in a particular folder to handle that
> folder, and the only way to do this is to count them all. That's
> what takes the time when one opens a large folder.
No, the basic problem there is that the kernel is deadlocking. Read the
VERY long thread for the details.
I think I have enough on the ball to be able to tell the difference between
mutt opening a folder and counting messages, with a counter and percentage
indicator advancing, and mutt sitting there deadlocked with the HD activity
light stuck on and all the rest of X stuck tight.
And it just happened again, so -rc6 is no sure fix. What did y'all who
reported that the problem had gone away do, patch -rc4 with the akpm patches?
^_^
--
Marc Wilson | Fortune favors the lucky.
[email protected] |
Hi Marc.
>> The basic problem there is that any mail client needs to know
>> just how many messages are in a particular folder to handle that
>> folder, and the only way to do this is to count them all. That's
>> what takes the time when one opens a large folder.
> No, the basic problem there is that the kernel is deadlocking.
> Read the VERY long thread for the details.
>
> I think I have enough on the ball to be able to tell the difference
> between mutt opening a folder and counting messages, with a counter
> and percentage indicator advancing, and mutt sitting there
> deadlocked with the HD activity light stuck on and all the rest of
> X stuck tight.
I thought I was on the ball when a similar situation happened to me.
What I observed was that the counters and percentage indicators were
NOT advancing for about 30 seconds, and both would then jump up by
about 70 messages and the relevant percent rather than counting
smoothly through. It was only when I noticed those jumps that I went
back to basics and analysed the folder rather than the kernel.
However, I apologise profusely for assuming that my experience in what
to me appear to be similar circumstances to yours could have any sort
of bearing on the problem you are seeing.
> And it just happened again, so -rc6 is no sure fix. What did y'all
> that reported the problem had gone away do, patch -rc4 with the
> akpm patches?
In my case, I fixed the problem by splitting the relevant folder up,
as stated in my previous message. However, such a solution apparently
doesn't work for you, so I'm unable to help any further.
Best wishes from Riley.
Hi !
On Wed, May 28, 2003 at 10:57:35PM -0700, Marc Wilson wrote:
> No, the basic problem there is that the kernel is deadlocking. Read the
> VERY long thread for the details.
I didn't follow this thread, what's its subject, please ?
> I think I have enough on the ball to be able to tell the difference between
> mutt opening a folder and counting messages, with a counter and percentage
> indicator advancing, and mutt sitting there deadlocked with the HD activity
> light stuck on and all the rest of X stuck tight.
even on -rc3, I don't observe this behaviour. I tried from a cold cache, and
mutt took a little less than 3 seconds to open LKML's May folder (35 MB), and
progressed very smoothly. Since it's on my Alpha file server, I can't test
with X. But the I/O bandwidth and scheduler frequency (1024 HZ) may have an
impact.
Cheers,
Willy
On Thu, May 29, 2003 at 10:38:04AM +0200, Willy Tarreau wrote:
> Hi !
>
> On Wed, May 28, 2003 at 10:57:35PM -0700, Marc Wilson wrote:
> > No, the basic problem there is that the kernel is deadlocking. Read the
> > VERY long thread for the details.
>
> I didn't follow this thread, what's its subject, please ?
Hmmm never mind, I easily found it (yes, VERY long) !
Cheers,
Willy
On Thu, 29 May 2003 10:55, Marcelo Tosatti wrote:
> Hi,
>
> Here goes -rc6. I've decided to delay 2.4.21 a bit and try Andrew's fix
> for the IO stalls/deadlocks.
>
> Please test it.
>
>
> Andrew Morton <[email protected]>:
> o Fix IO stalls and deadlocks
As this is only patches 1 and 2 from akpm's suggested changes, I was wondering
if my report got lost in the huge thread, so I've included it here:
Ok patch combination final score for me is as follows in the presence of a
large continuous write:
1 No change
2 No change
3 improvement++; minor hangs with reads
1+2 improvement+++; minor pauses with switching applications
1+2+3 improvement++++; no pauses
Applications may start up slowly; that's fine. The mouse cursor keeps spinning
and responding at all times with 1+2+3, though, which it hasn't done in 2.4 for
a year or so.
Is there a reason the 3rd patch was omitted?
Con
On Wed, 28 May 2003 21:55:39 -0300 (BRT)
Marcelo Tosatti <[email protected]> wrote:
> Here goes -rc6. I've decided to delay 2.4.21 a bit and try Andrew's
> fix for the IO stalls/deadlocks.
While others may be dubious about the efficacy of this patch, I've been
running -rc6 on my laptop now since sometime last night and have seen
nothing odd.
In case anybody cares, I'm using both IDE and an IEEE 1394 bus (for a large
external drive [which implies SCSI]), and I do a _lot_ of big work with
BK, so I was seeing the problem within hours previously.
-g
On Thu, 29 May 2003, Georg Nikodym wrote:
> On Wed, 28 May 2003 21:55:39 -0300 (BRT)
> Marcelo Tosatti <[email protected]> wrote:
>
> > Here goes -rc6. I've decided to delay 2.4.21 a bit and try Andrew's
> > fix for the IO stalls/deadlocks.
>
> While others may be dubious about the efficacy of this patch, I've been
> running -rc6 on my laptop now since sometime last night and have seen
> nothing odd.
>
> In case anybody cares, I'm using both ide and a ieee1394 (for a large
> external drive [which implies scsi]) and I do a _lot_ of big work with
> BK so I was seeing the problem within hours previously.
Great!
-rc7 will have to be released due to some problems :(
On Thu, 29 May 2003 21:11, Marcelo Tosatti wrote:
> On Thu, 29 May 2003, Georg Nikodym wrote:
> > On Wed, 28 May 2003 21:55:39 -0300 (BRT)
> >
> > Marcelo Tosatti <[email protected]> wrote:
> > > Here goes -rc6. I've decided to delay 2.4.21 a bit and try Andrew's
> > > fix for the IO stalls/deadlocks.
> >
> > While others may be dubious about the efficacy of this patch, I've been
> > running -rc6 on my laptop now since sometime last night and have seen
> > nothing odd.
> >
> > In case anybody cares, I'm using both ide and a ieee1394 (for a large
> > external drive [which implies scsi]) and I do a _lot_ of big work with
> > BK so I was seeing the problem within hours previously.
>
> Great!
>
> -rc7 will have to be released due to some problems :(
hmm, it seems the IDE modules and others are broken. I'm looking for the reason why...
here are depmod errors and my .config file:
make[1]: Nothing to be done for `modules_install'.
make[1]: Leaving directory
`/home/users/dzimi/rpm/BUILD/linux-2.4.20/arch/i386/lib'
cd /lib/modules/2.4.21-rc6; \
mkdir -p pcmcia; \
find kernel -path '*/pcmcia/*' -name '*.o' | xargs -i -r ln -sf ../{} pcmcia
if [ -r System.map ]; then /sbin/depmod -ae -F System.map 2.4.21-rc6; fi
depmod: *** Unresolved symbols in /lib/modules/2.4.21-rc6/kernel/drivers/ide/ide-disk.o
depmod: proc_ide_read_geometry
depmod: ide_remove_proc_entries
depmod: *** Unresolved symbols in /lib/modules/2.4.21-rc6/kernel/drivers/ide/ide-floppy.o
depmod: proc_ide_read_geometry
depmod: ide_remove_proc_entries
depmod: *** Unresolved symbols in /lib/modules/2.4.21-rc6/kernel/drivers/ide/ide-probe.o
depmod: do_ide_request
depmod: ide_add_generic_settings
depmod: create_proc_ide_interfaces
depmod: *** Unresolved symbols in /lib/modules/2.4.21-rc6/kernel/drivers/ide/ide-tape.o
depmod: ide_remove_proc_entries
depmod: *** Unresolved symbols in /lib/modules/2.4.21-rc6/kernel/drivers/ide/ide.o
depmod: ide_release_dma
depmod: ide_add_proc_entries
depmod: cmd640_vlb
depmod: ide_probe_for_cmd640x
depmod: ide_scan_pcibus
depmod: proc_ide_read_capacity
depmod: proc_ide_create
depmod: ide_remove_proc_entries
depmod: destroy_proc_ide_drives
depmod: proc_ide_destroy
depmod: create_proc_ide_interfaces
depmod: *** Unresolved symbols in /lib/modules/2.4.21-rc6/kernel/drivers/net/wan/comx.o
depmod: proc_get_inode
depmod: *** Unresolved symbols in /lib/modules/2.4.21-rc6/kernel/net/atm/common.o
depmod: free_atm_vcc_sk
depmod: atm_init_aal34
depmod: alloc_atm_vcc_sk
depmod: atm_init_aal0
depmod: atm_devs
depmod: *** Unresolved symbols in /lib/modules/2.4.21-rc6/kernel/net/atm/pvc.o
depmod: atm_getsockopt
depmod: atm_recvmsg
depmod: atm_release
depmod: atm_ioctl
depmod: atm_create
depmod: atm_sendmsg
depmod: atm_poll
depmod: atm_connect
depmod: atm_proc_init
depmod: atm_setsockopt
depmod: *** Unresolved symbols in /lib/modules/2.4.21-rc6/kernel/net/atm/resources.o
depmod: atm_proc_dev_deregister
depmod: atm_proc_dev_register
depmod: *** Unresolved symbols in /lib/modules/2.4.21-rc6/kernel/net/atm/signaling.o
depmod: nodev_vccs
depmod: atm_devs
depmod: *** Unresolved symbols in /lib/modules/2.4.21-rc6/kernel/net/atm/svc.o
depmod: atm_getsockopt
depmod: atm_recvmsg
depmod: free_atm_vcc_sk
depmod: atm_ioctl
depmod: atm_create
depmod: atm_sendmsg
depmod: atm_poll
depmod: atm_connect
depmod: atm_release_vcc_sk
depmod: atm_setsockopt
My .config (it's a distro config, I'm the PLD kernel packager) is included.
Ok, I'm going to fix those trivial (?) problems. Back around 2.2.x I
made some hacks on ksyms.c; was that correct?
--
Krzysiek Taraszka ([email protected])
http://cyborg.kernel.pl/~dzimi/
On Thu, 29 May 2003 21:56, Krzysiek Taraszka wrote:
> On Thu, 29 May 2003 21:11, Marcelo Tosatti wrote:
> > On Thu, 29 May 2003, Georg Nikodym wrote:
> > > On Wed, 28 May 2003 21:55:39 -0300 (BRT)
> > >
> > > Marcelo Tosatti <[email protected]> wrote:
> > > > Here goes -rc6. I've decided to delay 2.4.21 a bit and try Andrew's
> > > > fix for the IO stalls/deadlocks.
> > >
> > > While others may be dubious about the efficacy of this patch, I've been
> > > running -rc6 on my laptop now since sometime last night and have seen
> > > nothing odd.
> > >
> > > In case anybody cares, I'm using both ide and a ieee1394 (for a large
> > > external drive [which implies scsi]) and I do a _lot_ of big work with
> > > BK so I was seeing the problem within hours previously.
> >
> > Great!
> >
> > -rc7 will have to be released due to some problems :(
>
> hmm, it seems the IDE modules and others are broken. I'm looking for the reason why
hmm, for the IDE subsystem, ide-proc.o wasn't built for CONFIG_BLK_DEV_IDE=m...
Is anyone going to fix it, or shall I prepare and send my own patch here?
--
Krzysiek Taraszka ([email protected])
http://cyborg.kernel.pl/~dzimi/
I tried 2.4.21-rc6 as I was told it might fix the mouse stalling on
heavy disk IO problem, and I would like to report that it DOES fix the
stalls for the most part. Even certain compiles/benchmarks/stress tests
that could stall my PC for seconds now affect the mouse for mere fractions
of one second, and situations that used to cause short stalls are now a
thing of the past.
2.4.21-rc6 is the best kernel I have tried to date, and I have tried many
on my quest to get a smooth mouse.
I don't subscribe to lkml, so if you have questions please CC me
personally.
Hope this input is helpful.
Daniel
morfic
On Fri, May 30, 2003 at 02:08:51PM -0500, Daniel Goller wrote:
> i tried 2.4.21-rc6 as i was told it might fix the mouse stalling on
> heavy disk IO problem and i would like to report that it DOES fix them
> for the most part, even certain compiles/benchmarks/stress tests that
> could stall my pc for seconds now affect the mouse for mere fractions of
> one second, situations that used to cause short stalls are now a thing
> of the past
>
> 2.4.21-rc6 is the best kernel i have tried to date and i have tried many
> on my quest to get a smooth mouse
There are reports that 2.4.18 also "fixed" the problems with the mouse. Can
you verify?
On Fri, 2003-05-30 at 15:52, Mike Fedyk wrote:
> On Fri, May 30, 2003 at 02:08:51PM -0500, Daniel Goller wrote:
> > i tried 2.4.21-rc6 as i was told it might fix the mouse stalling on
> > heavy disk IO problem and i would like to report that it DOES fix them
> > for the most part, even certain compiles/benchmarks/stress tests that
> > could stall my pc for seconds now affect the mouse for mere fractions of
> > one second, situations that used to cause short stalls are now a thing
> > of the past
> >
> > 2.4.21-rc6 is the best kernel i have tried to date and i have tried many
> > on my quest to get a smooth mouse
>
> There are reports that 2.4.18 also "fixed" the problems with the mouse. Can
> you verify?
Sorry, I never ran a 2.4.18 kernel, so I can't comment on that.
On Saturday 31 May 2003 15:06, Daniel Goller wrote:
> On Fri, 2003-05-30 at 15:52, Mike Fedyk wrote:
> > On Fri, May 30, 2003 at 02:08:51PM -0500, Daniel Goller
wrote:
> > > i tried 2.4.21-rc6 as i was told it might fix the
> > > mouse stalling on heavy disk IO problem and i would
> > > like to report that it DOES fix them for the most
> > > part, even certain compiles/benchmarks/stress tests
> > > that could stall my pc for seconds now affect the
> > > mouse for mere fractions of one second, situations
> > > that used to cause short stalls are now a thing of
> > > the past
> > >
> > > 2.4.21-rc6 is the best kernel i have tried to date
> > > and i have tried many on my quest to get a smooth
> > > mouse
> >
> > There are reports that 2.4.18 also "fixed" the problems
> > with the mouse. Can you verify?
>
Yes, it performs similarly to -rc6 but not nearly as well as 2.5.70.
On 2.5.70 the mouse is really smooth all the time; scrolling of large
pages in Opera is fairly smooth most of the time as well, even with
large disk I/O loads such as the script I posted earlier.
Regards
Michael
On Sat, 2003-05-31 at 06:12, Michael Frank wrote:
> On Saturday 31 May 2003 15:06, Daniel Goller wrote:
> > On Fri, 2003-05-30 at 15:52, Mike Fedyk wrote:
> > > On Fri, May 30, 2003 at 02:08:51PM -0500, Daniel Goller
> wrote:
> > > > i tried 2.4.21-rc6 as i was told it might fix the
> > > > mouse stalling on heavy disk IO problem and i would
> > > > like to report that it DOES fix them for the most
> > > > part, even certain compiles/benchmarks/stress tests
> > > > that could stall my pc for seconds now affect the
> > > > mouse for mere fractions of one second, situations
> > > > that used to cause short stalls are now a thing of
> > > > the past
> > > >
> > > > 2.4.21-rc6 is the best kernel i have tried to date
> > > > and i have tried many on my quest to get a smooth
> > > > mouse
> > >
> > > There are reports that 2.4.18 also "fixed" the problems
> > > with the mouse. Can you verify?
> >
>
> Yes, it performs similar to -rc6 but not nearly as good as
> 2.5.70.
>
> On 2.5.70 the mouse is really smooth all the time, scrollong
> of large pages in opera is fairly smooth most the time also
> with large disk io loads such as the script i posted
> earlier.
>
> Regards
> Michael
>
Unfortunately the Radeon DRI is broken in 2.5.70, so I haven't tried that
much. I need to see if someone has already suggested a fix for this unused
int (it seems unused to me, after a *quick* look through the file). I guess
I will have to subscribe to lkml now.
On Wed, 28 May 2003, Marc Wilson wrote:
> On Thu, May 29, 2003 at 06:34:48AM +0100, Riley Williams wrote:
> > The basic problem there is that any mail client needs to know
> > just how many messages are in a particular folder to handle that
> > folder, and the only way to do this is to count them all. That's
> > what takes the time when one opens a large folder.
>
> No, the basic problem there is that the kernel is deadlocking. Read the
> VERY long thread for the details.
>
> I think I have enough on the ball to be able to tell the difference between
> mutt opening a folder and counting messages, with a counter and percentage
> indicator advancing, and mutt sitting there deadlocked with the HD activity
> light stuck on and all the rest of X stuck tight.
>
> And it just happened again, so -rc6 is no sure fix. What did y'all that
> reported the problem had gone away do, patch -rc4 with the akpm patches?
> ^_^
Ok, so you can reproduce the hangs reliably EVEN with -rc6, Marc?
On Tuesday 03 June 2003 18:02, Marcelo Tosatti wrote:
Hi Marcelo,
> Ok, so you can reproduce the hangs reliably EVEN with -rc6, Marc?
well, even if you mean Marc Wilson, I also have to say something (as I
wrote in my previous email some days ago):
The pauses/stops are _a lot_ less than w/o the fix, but they are _not_ gone.
Tested with 2.4.21-rc6.
ciao, Marc
On Wednesday 04 June 2003 00:02, Marcelo Tosatti wrote:
> On Wed, 28 May 2003, Marc Wilson wrote:
> > On Thu, May 29, 2003 at 06:34:48AM +0100, Riley Williams wrote:
> > > The basic problem there is that any mail client needs to know
> > > just how many messages are in a particular folder to handle that
> > > folder, and the only way to do this is to count them all. That's
> > > what takes the time when one opens a large folder.
> >
> > No, the basic problem there is that the kernel is deadlocking. Read the
> > VERY long thread for the details.
> >
> > I think I have enough on the ball to be able to tell the difference
> > between mutt opening a folder and counting messages, with a counter and
> > percentage indicator advancing, and mutt sitting there deadlocked with
> > the HD activity light stuck on and all the rest of X stuck tight.
> >
> > And it just happened again, so -rc6 is no sure fix. What did y'all that
> > reported the problem had gone away do, patch -rc4 with the akpm patches?
> > ^_^
>
> Ok, so you can reproduce the hangs reliably EVEN with -rc6, Marc?
-rc6 is better - comparable to 2.4.18 in what I have seen with my script.
After the long obscure problems since 2.4.19x, -rc6 could use serious
stress-testing.
User level testing is not sufficient here - it's just like playing roulette.
By serious stress-testing I mean:
Everyone testing comes up with one dedicated "tough test"
which _must_ be reproducible (program, script), along his line of
expertise/application.
Two or more of these independent tests are run in combination.
This method should increase the coverage drastically.
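A minimal harness for that idea might look like this (just a sketch; the command strings are placeholders for each tester's own reproducible load):

```python
import subprocess

def run_combined(commands):
    """Launch every reproducible "tough test" at once and collect the
    exit status of each, so the independent loads overlap in time."""
    procs = [(cmd, subprocess.Popen(cmd, shell=True)) for cmd in commands]
    return {cmd: proc.wait() for cmd, proc in procs}

# Hypothetical combination: a big streaming write plus a kernel build.
# results = run_combined([
#     "dd if=/dev/zero of=/tmp/largefile bs=16384 count=131072",
#     "make -j4 bzImage",
# ])
```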
Regards
Michael
On Wed, Jun 04, 2003 at 12:30:27AM +0800, Michael Frank wrote:
> On Wednesday 04 June 2003 00:02, Marcelo Tosatti wrote:
> -rc6 is better - comparable to 2.4.18 in what I have seen with my script.
>
> After the long obscure problems since 2.4.19x, -rc6 could use serious
> stress-testing.
>
> User level testing is not sufficient here - it's just like playing roulette.
>
> By serious stress-testing I mean:
>
> Everone testing comes up with one dedicated "tough test"
> which _must_ be reproducible (program, script) along his line of
> expertise/application.
>
> Two or more of these independent tests are run in combination.
Agreed, and I'm willing to run test scripts on my system, which has these
hangs (long ones with 2.4.19-pre1 to 2.4.21-rc5 and only short ones with
2.4.21-rc6). But at the moment I have neither the time nor enough knowledge
to write a test to reproduce it.
So if someone comes up with a suitable test script, I'm happy to try it
and use it on different kernel versions.
Bye,
Matthias
--
[email protected]
Rechenzentrum Universitaet Karlsruhe
Abteilung Netze
On Tuesday 03 June 2003 18:30, Michael Frank wrote:
Hi Michael,
> > Ok, so you can reproduce the hangs reliably EVEN with -rc6, Marc?
> -rc6 is better - comparable to 2.4.18 in what I have seen with my script.
> After the long obscure problems since 2.4.19x, -rc6 could use serious
> stress-testing.
> User level testing is not sufficient here - it's just like playing
> roulette.
> By serious stress-testing I mean:
> Everone testing comes up with one dedicated "tough test"
> which _must_ be reproducible (program, script) along his line of
> expertise/application.
well, very easy one:
dd if=/dev/zero of=/home/largefile bs=16384 count=131072
then use your mouse, your apps, switch between them, use them, _w/o_ pauses,
delays, stops or anything of the kind. If _that_ works flawlessly for everyone,
then it is fixed; if not, it _needs_ to be fixed.
ciao, Marc
On Tuesday 03 June 2003 18:59, Marc-Christian Petersen wrote:
Hi again,
> well, very easy one:
> dd if=/dev/zero of=/home/largefile bs=16384 count=131072
> then use your mouse, your apps, switch between them, use them, _w/o_
> pauses, delay, stops or kinda that. If _that_ will work flawlessly for
> everyone, then it is fixed, if not, it _needs_ to be fixed.
I forgot to mention: if you have more than 2GB of free memory (the above will
create a 2GB file), the test is useless.
Have less memory free, so the machine will swap; it doesn't matter if it's the
same disk or another or whatever!
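To put a number on those pauses while the dd run is in progress, a simple oversleep probe works (a sketch; the tick and threshold values are arbitrary choices):

```python
import time

def probe_stalls(duration=10.0, tick=0.05, threshold=0.25):
    """Sleep in short ticks and record every oversleep above threshold.

    On a responsive box each sleep(tick) returns after roughly tick
    seconds; during a stall the process is held off the CPU and the
    measured gap jumps, so the list of gaps approximates the stalls."""
    stalls = []
    deadline = time.time() + duration
    while time.time() < deadline:
        t0 = time.time()
        time.sleep(tick)
        overshoot = time.time() - t0 - tick
        if overshoot > threshold:
            stalls.append(overshoot)
    return stalls
```

Run it in one terminal while the dd command above runs in another; an empty list means no pause longer than a quarter of a second was observed.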
ciao, Marc
On Wednesday 04 June 2003 00:59, Marc-Christian Petersen wrote:
> On Tuesday 03 June 2003 18:30, Michael Frank wrote:
> well, very easy one:
>
> dd if=/dev/zero of=/home/largefile bs=16384 count=131072
Got that already - more flexible:
http://www.ussg.iu.edu/hypermail/linux/kernel/0305.3/1291.html
Breaks anything >= 2.4.19 and < -rc6 in no time.
We need more - any ideas?
Regards
Michael
Good Evening,
On Tue, 2003-06-03 at 18:03, Marc-Christian Petersen wrote:
> On Tuesday 03 June 2003 18:59, Marc-Christian Petersen wrote:
>
> Hi again,
>
> > well, very easy one:
> > dd if=/dev/zero of=/home/largefile bs=16384 count=131072
> > then use your mouse, your apps, switch between them, use them, _w/o_
> > pauses, delay, stops or kinda that. If _that_ will work flawlessly for
> > everyone, then it is fixed, if not, it _needs_ to be fixed.
> I forgot to mention. If you have more than 2GB free memory (the above one will
> create a 2GB file), the test is useless.
>
> Have less memory free, so the machine will swap, doesn't matter if the same
> disk or another or whatever!
Would it count if I said I run 2.4.21-rc6-ac1 with 768MB RAM and ended up
using about 250MB of swap, and when I suspended VMware today and closed a
few gnome-terminals, Galeon and Evolution, the mouse cursor would not
move, then jumped halfway across the screen after a second, then 'stuck'
again before making another jump?
I thought it sounded a little like what you are describing. If more
details are required, let me know and I will try to collect what is
asked for.
Regards,
--
Anders Karlsson <[email protected]>
Trudheim Technology Limited
Marcelo Tosatti <[email protected]>, on Wed May 28, 2003 [09:55:39 PM] said:
>
> Hi,
>
> Here goes -rc6. I've decided to delay 2.4.21 a bit and try Andrew's fix
> for the IO stalls/deadlocks.
>
> Please test it.
>
Hi;
It seems that if I run 'make menuconfig' and the only change
I make is to switch the processor type from its default to
486, "CONFIG_X86_TSC=y" remains in the .config, which results
in a kernel that won't boot on a 486.
Running 'make oldconfig' seems to fix it up, though...
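A quick way to spot the stale symbol Paul describes is to scan the generated .config (a hypothetical helper, not part of the kernel build; CONFIG_M486 is the 486 processor-type symbol):

```python
def has_stale_tsc(config_path):
    """Return True when the 486 processor type is selected while
    CONFIG_X86_TSC=y is still present (a 486 has no TSC)."""
    opts = {}
    with open(config_path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, value = line.split("=", 1)
                opts[key] = value
    return opts.get("CONFIG_M486") == "y" and opts.get("CONFIG_X86_TSC") == "y"
```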
Paul
[email protected]
On Tue, 2003-06-03 15:45:37 -0400, Paul <[email protected]>
wrote in message <[email protected]>:
> Marcelo Tosatti <[email protected]>, on Wed May 28, 2003 [09:55:39 PM] said:
> > Here goes -rc6. I've decided to delay 2.4.21 a bit and try Andrew's fix
> > for the IO stalls/deadlocks.
> >
> > Please test it.
> >
> Hi;
>
> It seems if I run 'make menuconfig', and the only change
> I make is to change the processor type from its default to
> 486, "CONFIG_X86_TSC=y", remains in the .config, which results
> in a kernel that wont boot on a 486.
> Running 'make oldconfig' seems to fix it up, though...
Yeah, that's a bug hitting i80386 also; I had sent a patch for that to
LKML some time ago. There's simply a CONFIG_X86_TSC=n missing in the
case of i386 and i486.
MfG, JBG
--
Jan-Benedict Glaw [email protected] . +49-172-7608481
"A Free Opinion in a Free Mind | Against censorship | Against war
for a Free State full of Free Citizens" | on the Internet! | in Iraq!
ret = do_actions((curr | FREE_SPEECH) & ~(IRAQ_WAR_2 | DRM | TCPA));
On 06.03, Anders Karlsson wrote:
> Good Evening,
>
> On Tue, 2003-06-03 at 18:03, Marc-Christian Petersen wrote:
> > On Tuesday 03 June 2003 18:59, Marc-Christian Petersen wrote:
> >
> > Hi again,
> >
> > > well, very easy one:
> > > dd if=/dev/zero of=/home/largefile bs=16384 count=131072
> > > then use your mouse, your apps, switch between them, use them, _w/o_
> > > pauses, delay, stops or kinda that. If _that_ will work flawlessly for
> > > everyone, then it is fixed, if not, it _needs_ to be fixed.
> > I forgot to mention. If you have more than 2GB free memory (the above one will
> > create a 2GB file), the test is useless.
> >
> > Have less memory free, so the machine will swap, doesn't matter if the same
> > disk or another or whatever!
>
> Would it count if I said I run 2.4.21-rc6-ac1 and had 768MB RAM, ended
> up using about 250MB swap and when I today suspended VMware and closed a
> few gnome-terminals, Galeon and Evolution, the mouse cursor would not
> move, then jump half way across the screen after a second, then 'stick'
> again before doing another jump.
>
One vote in the opposite sense (I know, nobody uses plain -rc6 ???).
I am using a -jam kernel (-aa with some additional patches), and I did
not notice anything. Dual PII box with 900 MB; as buffers were filling
memory, no stalls. Just a very small (less than half a second) jump in the
cursor under GNOME when memory got full, and then smooth again.
I use pointer-focus and was rapidly moving the pointer from window to
window to change focus, and response was OK. Launching an aterm was instant.
--
J.A. Magallon <[email protected]> \ Software is like sex:
werewolf.able.es \ It's better when it's free
Mandrake Linux release 9.2 (Cooker) for i586
Linux 2.4.21-rc6-jam1 (gcc 3.2.3 (Mandrake Linux 9.2 3.2.3-1mdk))
On Tuesday 03 June 2003 23:12, J.A. Magallon wrote:
Hi J.A.,
> One vote in the opposite sense (I know, nobody uses plain rc6 ???)
> I am using a -jam kernel (-aa with some additional patches), and I did
> not notice anything. Dual PII box with 900 Mb, as buffers were filling
> memory, no stalls. Just a very small (less than half a second) jump in the
> cursor under gnome when the memory got full, and then smooth again.
> I use pointer-focus and was rapidly moving the pointer from window to
> window to change focus and response was ok. Launching an aterm was instant.
once again for you ;-)
-aa is using the low latency elevator! Pauses/stops are much less with it.
ciao, Marc
On Tue, Jun 03, 2003 at 01:02:45PM -0300, Marcelo Tosatti wrote:
> Ok, so you can reproduce the hangs reliably EVEN with -rc6, Marc?
Yes, with -rc6, and this:
rei $ dd if=/dev/zero of=/home/mwilson/largefile bs=16384 count=131072
The mouse starts skipping soon after the box starts swapping. It
eventually catches up, but then when I start up another application, it
starts again.
I have the test running as I type this e-mail in mutt (with vim as the
editor), and there are noticeable pauses where I'm typing but nothing is
happening on the screen.
It's *much* better than it was with my prior kernel (-rc2), but it's most
definitely still there.
Anyone got any other test they want me to make on the box?
--
Marc Wilson | You're a card which will have to be dealt with.
[email protected] |
On Thu, May 29, 2003 at 04:11:12PM -0300, Marcelo Tosatti wrote:
>
>
> On Thu, 29 May 2003, Georg Nikodym wrote:
>
> > On Wed, 28 May 2003 21:55:39 -0300 (BRT)
> > Marcelo Tosatti <[email protected]> wrote:
> >
> > > Here goes -rc6. I've decided to delay 2.4.21 a bit and try Andrew's
> > > fix for the IO stalls/deadlocks.
> >
> > While others may be dubious about the efficacy of this patch, I've been
> > running -rc6 on my laptop now since sometime last night and have seen
> > nothing odd.
> >
> > In case anybody cares, I'm using both ide and a ieee1394 (for a large
> > external drive [which implies scsi]) and I do a _lot_ of big work with
> > BK so I was seeing the problem within hours previously.
>
> Great!
are you really sure that it is the right fix?
I mean, the batching has a basic problem (I was discussing it with Jens
two days ago and he said it's already addressed in 2.5; I wonder if that
could also have an influence on the fact that 2.5 is so much better in
fairness).
The issue with batching in 2.4 is that it is blocking at 0 and waking
at batch_requests, but it's not blocking new get_request calls from eating
requests on the way back from 0 to batch_requests. I mean, there are two
directions: when we move from batch_requests to 0, get_request should
return requests; on the way back from 0 to batch_requests, get_request
should block (and it doesn't in 2.4, and that is the problem).
>
> -rc7 will have to be released due to some problems :(
Andrea
On Wednesday 04 June 2003 12:22, Andrea Arcangeli wrote:
Hi Andrea,
> are you really sure that it is the right fix?
> I mean, the batching has a basic problem (I was discussing it with Jens
> two days ago and he said he's already addressed in 2.5, I wonder if that
> could also have an influence on the fact 2.5 is so much better in
> fairness)
> the issue with batching in 2.4, is that it is blocking at 0 and waking
> at batch_requests. But it's not blocking new get_request to eat requests
> in the way back from 0 to batch_requests. I mean, there are two
> directions, when we move from batch_requests to 0 get_requests should
> return requests. in the way back from 0 to batch_requests the
> get_request should block (and it doesn't in 2.4, that is the problem)
do you see a chance to fix this up in 2.4?
ciao, Marc
On Wed, Jun 04, 2003 at 12:35:07PM +0200, Marc-Christian Petersen wrote:
> On Wednesday 04 June 2003 12:22, Andrea Arcangeli wrote:
>
> Hi Andrea,
>
> > are you really sure that it is the right fix?
> > I mean, the batching has a basic problem (I was discussing it with Jens
> > two days ago and he said he's already addressed in 2.5, I wonder if that
> > could also have an influence on the fact 2.5 is so much better in
> > fairness)
> > the issue with batching in 2.4, is that it is blocking at 0 and waking
> > at batch_requests. But it's not blocking new get_request to eat requests
> > in the way back from 0 to batch_requests. I mean, there are two
> > directions, when we move from batch_requests to 0 get_requests should
> > return requests. in the way back from 0 to batch_requests the
> > get_request should block (and it doesn't in 2.4, that is the problem)
> do you see a chance to fix this up in 2.4?
sure, it's just a matter of adding a bit to the blkdev structure.
However, I'm not 100% sure that it is the real thing that could make the
difference, but overall the exclusive-wakeup FIFO in theory should
provide an even higher degree of fairness, so at the very least
"fix" 2 from Andrew makes very little sense to me; it seems just a
hack meant to hide a real problem in the algorithm.
I mean, going wake-all (LIFO, btw) rather than wake-one FIFO should,
if anything, make things worse, unless it is hiding some other issue.
As for 1 and 3, they have been in my tree for ages.
BTW, Chris recently spotted a nearly-impossible-to-trigger SMP-only race
in the fix-pausing patch [great spotting, Chris]; to trigger it would
need an intersection of two races at the same time. It'll be fixed in
my next tree, but nobody has ever reproduced it and you can certainly
ignore it in practice, so it can't explain any issue.
Andrea
On Wed, Jun 04 2003, Marc-Christian Petersen wrote:
> On Wednesday 04 June 2003 12:22, Andrea Arcangeli wrote:
>
> Hi Andrea,
>
> > are you really sure that it is the right fix?
> > I mean, the batching has a basic problem (I was discussing it with Jens
> > two days ago and he said he's already addressed in 2.5, I wonder if that
> > could also have an influence on the fact 2.5 is so much better in
> > fairness)
> > the issue with batching in 2.4, is that it is blocking at 0 and waking
> > at batch_requests. But it's not blocking new get_request to eat requests
> > in the way back from 0 to batch_requests. I mean, there are two
> > directions, when we move from batch_requests to 0 get_requests should
> > return requests. in the way back from 0 to batch_requests the
> > get_request should block (and it doesn't in 2.4, that is the problem)
> do you see a chance to fix this up in 2.4?
Nick posted a patch to do so the other day and asked people to test.
--
Jens Axboe
On Wed, Jun 04, 2003 at 12:46:33PM +0200, Marc-Christian Petersen wrote:
> On Wednesday 04 June 2003 12:42, Jens Axboe wrote:
>
> Hi Jens,
>
> > > > the issue with batching in 2.4, is that it is blocking at 0 and waking
> > > > at batch_requests. But it's not blocking new get_request to eat
> > > > requests in the way back from 0 to batch_requests. I mean, there are
> > > > two directions, when we move from batch_requests to 0 get_requests
> > > > should return requests. in the way back from 0 to batch_requests the
> > > > get_request should block (and it doesn't in 2.4, that is the problem)
> > > do you see a chance to fix this up in 2.4?
> > Nick posted a patch to do so the other day and asked people to test.
> Silly mcp. His mail was CC'ed to me :( ... F*ck huge inbox.
I was probably not CC'ed; I'll search for the email (and I was
travelling the last few days, so I didn't read every single l-k email
yet, sorry ;)
Andrea
On Wednesday 04 June 2003 12:42, Jens Axboe wrote:
Hi Jens,
> > > the issue with batching in 2.4, is that it is blocking at 0 and waking
> > > at batch_requests. But it's not blocking new get_request to eat
> > > requests in the way back from 0 to batch_requests. I mean, there are
> > > two directions, when we move from batch_requests to 0 get_requests
> > > should return requests. in the way back from 0 to batch_requests the
> > > get_request should block (and it doesn't in 2.4, that is the problem)
> > do you see a chance to fix this up in 2.4?
> Nick posted a patch to do so the other day and asked people to test.
Silly mcp. His mail was CC'ed to me :( ... F*ck huge inbox.
ciao, Marc
On Wednesday 04 June 2003 12:43, Andrea Arcangeli wrote:
Hi Andrea,
> sure, it's just a matter of adding a bit to the blkdev structure.
> However I'm not 100% sure that it is the real thing that could make the
> difference, but overall the exclusive wakeup FIFO in theory should
> provide even an higher degree of fariness, so at the very least the
> "fix" 2 from Andrew makes very little sense to me, and it seems just an
> hack meant to hide a real problem in the algorithm.
well, at least it reduces pauses/stops ;)
> As for 1 and 3 they were just included in my tree for ages.
err, 1 yes, but I don't see that 3 is in your tree. Well, ok, a bit different.
But hey, your 1+3 still have pauses ;)
> BTW, Chris recently spotted a nearly impossible to trigger SMP-only race
> in the fix pausing patch [great spotting Chris] (to trigger it would
Cool Chris!
> need an intersection of two races at the same time), it'll be fixed in
> my next tree, however nobody ever reproduced it and you certainly can
> ignore it in practice so it can't explain any issue.
Good to know. Thanks.
ciao, Marc
--- linux-2.4/include/linux/blkdev.h.orig 2003-06-02 21:59:06.000000000 +1000
+++ linux-2.4/include/linux/blkdev.h 2003-06-02 22:39:57.000000000 +1000
@@ -118,13 +118,21 @@ struct request_queue
/*
* Boolean that indicates whether this queue is plugged or not.
*/
- char plugged;
+ int plugged:1;
/*
* Boolean that indicates whether current_request is active or
* not.
*/
- char head_active;
+ int head_active:1;
+
+ /*
+ * Booleans that indicate whether the queue's free requests have
+ * been exhausted and is waiting to drop below the batch_requests
+ * threshold
+ */
+ int read_full:1;
+ int write_full:1;
unsigned long bounce_pfn;
@@ -140,6 +148,30 @@ struct request_queue
wait_queue_head_t wait_for_requests[2];
};
+static inline void set_queue_full(request_queue_t *q, int rw)
+{
+ if (rw == READ)
+ q->read_full = 1;
+ else
+ q->write_full = 1;
+}
+
+static inline void clear_queue_full(request_queue_t *q, int rw)
+{
+ if (rw == READ)
+ q->read_full = 0;
+ else
+ q->write_full = 0;
+}
+
+static inline int queue_full(request_queue_t *q, int rw)
+{
+ if (rw == READ)
+ return q->read_full;
+ else
+ return q->write_full;
+}
+
extern unsigned long blk_max_low_pfn, blk_max_pfn;
#define BLK_BOUNCE_HIGH (blk_max_low_pfn << PAGE_SHIFT)
--- linux-2.4/drivers/block/ll_rw_blk.c.orig 2003-06-02 21:56:54.000000000 +1000
+++ linux-2.4/drivers/block/ll_rw_blk.c 2003-06-02 22:17:13.000000000 +1000
@@ -513,7 +513,10 @@ static struct request *get_request(reque
struct request *rq = NULL;
struct request_list *rl = q->rq + rw;
- if (!list_empty(&rl->free)) {
+ if (list_empty(&rl->free))
+ set_queue_full(q, rw);
+
+ if (!queue_full(q, rw)) {
rq = blkdev_free_rq(&rl->free);
list_del(&rq->queue);
rl->count--;
@@ -594,7 +597,7 @@ static struct request *__get_request_wai
add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
do {
set_current_state(TASK_UNINTERRUPTIBLE);
- if (q->rq[rw].count == 0)
+ if (queue_full(q, rw))
schedule();
spin_lock_irq(&io_request_lock);
rq = get_request(q, rw);
@@ -829,9 +832,14 @@ void blkdev_release_request(struct reque
*/
if (q) {
list_add(&req->queue, &q->rq[rw].free);
- if (++q->rq[rw].count >= q->batch_requests &&
- waitqueue_active(&q->wait_for_requests[rw]))
- wake_up(&q->wait_for_requests[rw]);
+ q->rq[rw].count++;
+ if (q->rq[rw].count >= q->batch_requests) {
+ if (q->rq[rw].count == q->batch_requests)
+ clear_queue_full(q, rw);
+
+ if (waitqueue_active(&q->wait_for_requests[rw]))
+ wake_up(&q->wait_for_requests[rw]);
+ }
}
}
On Wed, Jun 04 2003, Nick Piggin wrote:
> Andrea Arcangeli wrote:
>
> >On Wed, Jun 04, 2003 at 12:46:33PM +0200, Marc-Christian Petersen wrote:
> >
> >>On Wednesday 04 June 2003 12:42, Jens Axboe wrote:
> >>
> >>Hi Jens,
> >>
> >>
> >>>>>the issue with batching in 2.4, is that it is blocking at 0 and waking
> >>>>>at batch_requests. But it's not blocking new get_request to eat
> >>>>>requests in the way back from 0 to batch_requests. I mean, there are
> >>>>>two directions, when we move from batch_requests to 0 get_requests
> >>>>>should return requests. in the way back from 0 to batch_requests the
> >>>>>get_request should block (and it doesn't in 2.4, that is the problem)
> >>>>>
> >>>>do you see a chance to fix this up in 2.4?
> >>>>
> >>>Nick posted a patch to do so the other day and asked people to test.
> >>>
> >>Silly mcp. His mail was CC'ed to me :( ... F*ck huge inbox.
> >>
> >
> >I was probably not CC'ed, I'll search for the email (and I was
> >travelling the last few days so I didn't read every single l-k email yet
> >sorry ;)
> >
> >
> The patch I sent is actually against 2.4.20, contrary to my
> babbling. Reports I have had say it helps, but maybe not as
> much as Andrew's fixes. Then Matthias Mueller ported my patch
> to 2.4.21-rc6, which includes Andrew's fixes.
>
> It seems that they might be fixing two different problems.
> It looks promising though.
It is a different problem, I think; yours will fix the starvation of
writers (and of readers, though writer starvation is much, much easier
to trigger) where someone will repeatedly get cheated by the request
allocator.
The other problem is still not clear to anyone. I doubt this patch would
make any difference (apart from a psychological one) in this case,
since you have a single writer and maybe a reader or two. The single
writer cannot starve anyone else.
--
Jens Axboe
On Wed, Jun 04, 2003 at 02:00:53PM +0200, Jens Axboe wrote:
> since you have a single writer and maybe a reader or two. The single
> writer cannot starve anyone else.
unless you're changing an atime and you've to mark_buffer_dirty or
similar (balance_dirty will write stuff the same way from cp and the
reader then).
Maybe we can get some stack trace with kgdb to be sure where the reader
is blocking.
Andrea
Jens Axboe wrote:
>On Wed, Jun 04 2003, Nick Piggin wrote:
>
>>Andrea Arcangeli wrote:
>>
>>
>>>On Wed, Jun 04, 2003 at 12:46:33PM +0200, Marc-Christian Petersen wrote:
>>>
>>>
>>>>On Wednesday 04 June 2003 12:42, Jens Axboe wrote:
>>>>
>>>>Hi Jens,
>>>>
>>>>
>>>>
>>>>>>>the issue with batching in 2.4, is that it is blocking at 0 and waking
>>>>>>>at batch_requests. But it's not blocking new get_request to eat
>>>>>>>requests in the way back from 0 to batch_requests. I mean, there are
>>>>>>>two directions, when we move from batch_requests to 0 get_requests
>>>>>>>should return requests. in the way back from 0 to batch_requests the
>>>>>>>get_request should block (and it doesn't in 2.4, that is the problem)
>>>>>>>
>>>>>>>
>>>>>>do you see a chance to fix this up in 2.4?
>>>>>>
>>>>>>
>>>>>Nick posted a patch to do so the other day and asked people to test.
>>>>>
>>>>>
>>>>Silly mcp. His mail was CC'ed to me :( ... F*ck huge inbox.
>>>>
>>>>
>>>I was probably not CC'ed, I'll search for the email (and I was
>>>travelling the last few days so I didn't read every single l-k email yet
>>>sorry ;)
>>>
>>>
>>>
>>The patch I sent is actually against 2.4.20, contrary to my
>>babbling. Reports I have had say it helps, but maybe not as
>>much as Andrew's fixes. Then Matthias Mueller ported my patch
>>to 2.4.21-rc6, which includes Andrew's fixes.
>>
>>It seems that they might be fixing two different problems.
>>It looks promising though.
>>
>
>It is a different problem, I think; yours will fix the starvation of
>writers (and of readers, though writer starvation is much, much easier
>to trigger) where someone will repeatedly get cheated by the request
>allocator.
>
>The other problem is still not clear to anyone. I doubt this patch would
>make any difference (apart from a psychological one) in this case,
>since you have a single writer and maybe a reader or two. The single
>writer cannot starve anyone else.
>
You are right about what the patch does. It wouldn't surprise me
if there are still other problems, but it could be that the reader
has to write some swap or other dirty buffers when trying to get
memory itself.
I have had 3 or so reports all saying similar things, but it could
be psychological I guess.
On Wed, Jun 04 2003, Andrea Arcangeli wrote:
> On Wed, Jun 04, 2003 at 02:00:53PM +0200, Jens Axboe wrote:
> > since you have a single writer and maybe a reader or two. The single
> > writer cannot starve anyone else.
>
> unless you're changing an atime and you've to mark_buffer_dirty or
> similar (balance_dirty will write stuff the same way from cp and the
> reader then).
Yes you are right, could be.
But the whole thing still smells fishy. Read starvation causing mouse
stalls, hmm.
--
Jens Axboe
In article <[email protected]>,
Nick Piggin <[email protected]> wrote:
>- char plugged;
>+ int plugged:1;
This is dangerous:
#include <stdio.h>

struct foo {
	int bla:1;
};

int main(void)
{
	struct foo f;
	f.bla = 1;
	printf("%d\n", f.bla);
	return 0;
}
$ ./a.out
-1
If you want to put "0" and "1" in a 1-bit field, use "unsigned int bla:1".
Mike.
--
.. somehow I have a feeling the hurting hasn't even begun yet
-- Bill, "The Terrible Thunderlizards"
On Wed, Jun 04, 2003 at 12:30:27AM +0800, Michael Frank wrote:
...
> >
> > Ok, so you can reproduce the hangs reliably EVEN with -rc6, Marc?
>
> -rc6 is better - comparable to 2.4.18 in what I have seen with my script.
I've run 2.4.20 for a long time, and have been seriously plagued with
the I/O stalls.
On a file server (details below) here I upgraded to 2.4.21-rc6
yesterday.
The I/O stalls have *almost* gone away.
Best of all, we still have our data intact ;)
Server data:
~130 GB data on a ~150 GB ext3fs with >1 million files
Software RAID-0+1 on four IDE disks
Two promise controllers 1x20262 1x20269
1x Intel eepro100, 1x Intel e1000
dual PIII, half a gig of memory
NFS server (mainly v3, many different clients)
>
> After the long obscure problems since 2.4.19x, -rc6 could use serious
> stress-testing.
This server rarely has load below 1, but frequently above 15. It may
run some compilers and linkers locally, but most of the load comes from
NFS serving.
So far it's been running for 28 hours with that kind of load. Nothing
suspicious in the dmesg yet.
I will of course let you all know if it falls on its knees.
So far it's all thumbs-up from me! There may still be an occasional
stall here and there, but compared to 2.4.20 this is heaven (it really
was unbelievably annoying having your emacs stall for 10 seconds every
30 seconds when someone was linking on the cluster) :)
A big *thank*you* to Marcelo for deciding to include a fix for the I/O
stalls!
--
................................................................
: [email protected] : And I see the elder races, :
:.........................: putrid forms of man :
: Jakob Østergaard : See him rise and claim the earth, :
: OZ9ABN : his downfall is at hand. :
:.........................:............{Konkhra}...............:
On Thu, 29 May 2003, Krzysiek Taraszka wrote:
> Dnia czw 29. maja 2003 21:56, Krzysiek Taraszka napisał:
> > Dnia czw 29. maja 2003 21:11, Marcelo Tosatti napisał:
> > > On Thu, 29 May 2003, Georg Nikodym wrote:
> > > > On Wed, 28 May 2003 21:55:39 -0300 (BRT)
> > > >
> > > > Marcelo Tosatti <[email protected]> wrote:
> > > > > Here goes -rc6. I've decided to delay 2.4.21 a bit and try Andrew's
> > > > > fix for the IO stalls/deadlocks.
> > > >
> > > > While others may be dubious about the efficacy of this patch, I've been
> > > > running -rc6 on my laptop now since sometime last night and have seen
> > > > nothing odd.
> > > >
> > > > In case anybody cares, I'm using both ide and a ieee1394 (for a large
> > > > external drive [which implies scsi]) and I do a _lot_ of big work with
> > > > BK so I was seeing the problem within hours previously.
> > >
> > > Great!
> > >
> > > -rc7 will have to be released due to some problems :(
> >
> > hmm, seems the IDE modules and others are broken. I'm looking for the
> > reason why.
>
> hmm, for the IDE subsystem, ide-proc.o wasn't built for CONFIG_BLK_DEV_IDE=m ...
> is anyone going to fix it? or shall I prepare and send my own patch here?
Feel free to send your own patch, please :)
On Wednesday 04 June 2003 08:20, Jens Axboe wrote:
> On Wed, Jun 04 2003, Andrea Arcangeli wrote:
> > On Wed, Jun 04, 2003 at 02:00:53PM +0200, Jens Axboe wrote:
> > > since you have a single writer and maybe a reader or two. The single
> > > writer cannot starve anyone else.
> >
> > unless you're changing an atime and you've to mark_buffer_dirty or
> > similar (balance_dirty will write stuff the same way from cp and the
> > reader then).
>
> Yes you are right, could be.
>
> But the whole thing still smells fishy. Read starvation causing mouse
> stalls, hmm.
If reads from swap get starved, you can have interactive dropouts in just
about anything.
My desktop is usually pretty deep into swap. I upgrade to machines with four
times as much memory, but that usually means the graphics resolution goes up
and it just lets me keep more windows open on more desktops. (Currently
six.)
My record was driving the system so deep into swapping frenzy it was still
swapping when I came back from lunch. Really. This was under 2.4.4, though.
On RH 9/2.4.20-? my record is a little under five minutes of "frozen
thrashing on swap" before I got control of the system back. That's just a
"go for a soda" break. And at least the mouse cursor never froze for more
than a couple seconds at a time during that, even if the desktop was ignoring
me... :)
Haven't tried 2.5 on anything but servers yet, but it's on my to-do list...
Rob
(I am the VM subsystem's worst nightmare. Bwahaha.)
Dnia Wednesday 04 of June 2003 20:17, Marcelo Tosatti napisał:
> On Thu, 29 May 2003, Krzysiek Taraszka wrote:
> > Dnia czw 29. maja 2003 21:56, Krzysiek Taraszka napisał:
> > > Dnia czw 29. maja 2003 21:11, Marcelo Tosatti napisał:
> > > > On Thu, 29 May 2003, Georg Nikodym wrote:
> > > > > On Wed, 28 May 2003 21:55:39 -0300 (BRT)
> > > > >
> > > > > Marcelo Tosatti <[email protected]> wrote:
> > > > > > Here goes -rc6. I've decided to delay 2.4.21 a bit and try
> > > > > > Andrew's fix for the IO stalls/deadlocks.
> > > > >
> > > > > While others may be dubious about the efficacy of this patch, I've
> > > > > been running -rc6 on my laptop now since sometime last night and
> > > > > have seen nothing odd.
> > > > >
> > > > > In case anybody cares, I'm using both ide and a ieee1394 (for a
> > > > > large external drive [which implies scsi]) and I do a _lot_ of big
> > > > > work with BK so I was seeing the problem within hours previously.
> > > >
> > > > Great!
> > > >
> > > > -rc7 will have to be released due to some problems :(
> > >
> > > hmm, seems the IDE modules and others are broken. I'm looking for the
> > > reason why.
> >
> > hmm, for the IDE subsystem, ide-proc.o wasn't built for
> > CONFIG_BLK_DEV_IDE=m ... is anyone going to fix it? or shall I prepare
> > and send my own patch here?
>
> Feel free to send your own patch, please :)
Hm, I sent it a few days ago (a reply to Andrzej Krzysztofowicz's post, sth with
-rc3 in the subject :)) with other fixes, but without the cmd640 fixes.
Alan made almost the same changes, but his -ac tree still has a broken
cmd640 modular driver (cmd640_vlb is still unresolved).
I tried to hack on it, but I dropped it ... maybe tomorrow I'll get back to that
code, or someone else can fix it (maybe Alan?)
--
Krzysiek Taraszka ([email protected])
http://cyborg.kernel.pl/~dzimi/
Hi!
> > Ok, so you can reproduce the hangs reliably EVEN with -rc6, Marc?
> well, even if you mean Marc Wilson, I also have to say something (as I've
> written in my previous email some days ago)
>
> The pauses/stops are _a lot_ less than w/o the fix but they are _not_ gone.
> Tested with 2.4.21-rc6.
If hangs are not worse than 2.4.20, then I'd go ahead with release....
Pavel
--
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]
On Mer, 2003-06-04 at 22:41, Krzysiek Taraszka wrote:
> -rc3 in subject :)) with another fixes but without cmd640 fixes.
> Alan made almoust the same changes but him ac tree still have got broken
> cmd640 modular driver (cmd640_vlb still is unresolved).
> I tried hack it .. but I droped it ... maybe tomorrow i back to that code ...
> or someone goes to fix it (maybe Alan ?)
cmd640_vlb is gone from the core code in the -ac tree, so that surprises
me. Adrian Bunk sent me some more patches to look at. I'm not 100%
convinced by them, but there are a few cases left and some of his stuff
certainly fixes real problems.
On Thursday 05 June 2003 05:54, Pavel Machek wrote:
> Hi!
>
> > > Ok, so you can reproduce the hangs reliably EVEN with -rc6, Marc?
> >
> > well, even if you mean Marc Wilson, I also have to say something (as I've
> > written in my previous email some days ago)
> >
> > The pauses/stops are _a lot_ less than w/o the fix but they are _not_
> > gone. Tested with 2.4.21-rc6.
>
> If hangs are not worse than 2.4.20, then I'd go ahead with release....
>
>
I have had -rc6 running on a P4 for a few days, doing the test script,
compiles and Opera, and found it to be comparable to 2.4.18.
It also does well on slower machines with about 1/4 of the CPU and disk
bandwidth.
IMHO, interactivity is reasonable (again just IMHO), and others
may disagree.
--
Powered by linux-2.5.70-mm3
My current linux related activities in rough order of priority:
- Testing of 2.4/2.5 kernel interactivity
- Testing of Swsusp for 2.4
- Testing of Opera 7.11 emphasizing interactivity
- Research of NFS i/o errors during transfer 2.4>2.5
- Learning 2.5 series kernel debugging with kgdb - it's in the -mm tree
- Studying 2.5 series serial and ide drivers, ACPI, S3
* Input and feedback is always welcome *
Ok, there are lots of different problems here, and I've spent a little
while trying to get some numbers with the __get_request_wait stats patch
I posted before. This is all on ext2, since I wanted to rule out
interactions with the journal flavors.
Basically a dbench 90 run on ext2 rc6 vanilla kernels can generate
latencies of over 2700 jiffies in __get_request_wait, with an average
latency over 250 jiffies.
No, most desktop workloads aren't dbench 90, but between balance_dirty()
and the way we send stuff to disk during memory allocations, just about
any process can get stuck submitting dirty buffers even if you've just
got one process doing a dd if=/dev/zero of=foo.
So, for the moment I'm going to pretend people seeing stalls in X are
stuck in atime updates or memory allocations, or reading proc or some
other silly spot.
For the SMP corner cases, I've merged Andrea's fix-pausing patch into
rc7, along with an altered form of Nick Piggin's queue_full patch to try
and fix the latency problems.
The major difference from Nick's patch is that once the queue is marked
full, I don't clear the full flag until the wait queue is empty. This
means new io can't steal available requests until every existing waiter
has been granted a request.
The latency results are better, with average time spent in
__get_request_wait being around 28 jiffies, and a max of 170 jiffies.
The cost is throughput; further benchmarking needs to be done, but I
wanted to get this out for review and testing. It should at least help
us decide if the request allocation code really is causing our problems.
The patch below also includes the __get_request_wait latency stats. If
people try this and still see stalls, please run elvtune /dev/xxxx and
send along the resulting console output.
I haven't yet compared this to Andrea's elevator latency code, but the
stat patch was originally developed on top of his 2.4.21pre3aa1, where
the average wait was 97 jiffies and the max was 318.
Anyway, less talk, more code. Treat this with care, it has only been
lightly tested. Thanks to Andrea and Nick whose patches this is largely
based on:
--- 1.9/drivers/block/blkpg.c Sat Mar 30 06:58:05 2002
+++ edited/drivers/block/blkpg.c Mon Jun 9 12:17:24 2003
@@ -261,6 +261,7 @@
return blkpg_ioctl(dev, (struct blkpg_ioctl_arg *) arg);
case BLKELVGET:
+ blk_print_stats(dev);
return blkelvget_ioctl(&blk_get_queue(dev)->elevator,
(blkelv_ioctl_arg_t *) arg);
case BLKELVSET:
--- 1.45/drivers/block/ll_rw_blk.c Wed May 28 03:50:02 2003
+++ edited/drivers/block/ll_rw_blk.c Mon Jun 9 17:13:16 2003
@@ -429,6 +429,8 @@
q->rq[READ].count = 0;
q->rq[WRITE].count = 0;
q->nr_requests = 0;
+ q->read_full = 0;
+ q->write_full = 0;
si_meminfo(&si);
megs = si.totalram >> (20 - PAGE_SHIFT);
@@ -442,6 +444,56 @@
spin_lock_init(&q->queue_lock);
}
+void blk_print_stats(kdev_t dev)
+{
+ request_queue_t *q;
+ unsigned long avg_wait;
+ unsigned long min_wait;
+ unsigned long high_wait;
+ unsigned long *d;
+
+ q = blk_get_queue(dev);
+ if (!q)
+ return;
+
+ min_wait = q->min_wait;
+ if (min_wait == ~0UL)
+ min_wait = 0;
+ if (q->num_wait)
+ avg_wait = q->total_wait / q->num_wait;
+ else
+ avg_wait = 0;
+ printk("device %s: num_req %lu, total jiffies waited %lu\n",
+ kdevname(dev), q->num_req, q->total_wait);
+ printk("\t%lu forced to wait\n", q->num_wait);
+ printk("\t%lu min wait, %lu max wait\n", min_wait, q->max_wait);
+ printk("\t%lu average wait\n", avg_wait);
+ d = q->deviation;
+ printk("\t%lu < 100, %lu < 200, %lu < 300, %lu < 400, %lu < 500\n",
+ d[0], d[1], d[2], d[3], d[4]);
+ high_wait = d[0] + d[1] + d[2] + d[3] + d[4];
+ high_wait = q->num_wait - high_wait;
+ printk("\t%lu waits longer than 500 jiffies\n", high_wait);
+}
+
+static void reset_stats(request_queue_t *q)
+{
+ q->max_wait = 0;
+ q->min_wait = ~0UL;
+ q->total_wait = 0;
+ q->num_req = 0;
+ q->num_wait = 0;
+ memset(q->deviation, 0, sizeof(q->deviation));
+}
+void blk_reset_stats(kdev_t dev)
+{
+ request_queue_t *q;
+ q = blk_get_queue(dev);
+ if (!q)
+ return;
+ printk("reset latency stats on device %s\n", kdevname(dev));
+ reset_stats(q);
+}
static int __make_request(request_queue_t * q, int rw, struct buffer_head * bh);
/**
@@ -491,6 +543,9 @@
q->plug_tq.routine = &generic_unplug_device;
q->plug_tq.data = q;
q->plugged = 0;
+
+ reset_stats(q);
+
/*
* These booleans describe the queue properties. We set the
* default (and most common) values here. Other drivers can
@@ -508,7 +563,7 @@
* Get a free request. io_request_lock must be held and interrupts
* disabled on the way in. Returns NULL if there are no free requests.
*/
-static struct request *get_request(request_queue_t *q, int rw)
+static struct request *__get_request(request_queue_t *q, int rw)
{
struct request *rq = NULL;
struct request_list *rl = q->rq + rw;
@@ -521,10 +576,17 @@
rq->cmd = rw;
rq->special = NULL;
rq->q = q;
- }
+ } else
+ set_queue_full(q, rw);
return rq;
}
+static struct request *get_request(request_queue_t *q, int rw)
+{
+ if (queue_full(q, rw))
+ return NULL;
+ return __get_request(q, rw);
+}
/*
* Here's the request allocation design:
@@ -588,23 +650,57 @@
static struct request *__get_request_wait(request_queue_t *q, int rw)
{
register struct request *rq;
+ int waited = 0;
+ unsigned long wait_start = jiffies;
+ unsigned long time_waited;
DECLARE_WAITQUEUE(wait, current);
- add_wait_queue(&q->wait_for_requests[rw], &wait);
+ add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
+
do {
set_current_state(TASK_UNINTERRUPTIBLE);
- generic_unplug_device(q);
- if (q->rq[rw].count == 0)
- schedule();
spin_lock_irq(&io_request_lock);
- rq = get_request(q, rw);
+ if ((!waited && queue_full(q, rw)) || q->rq[rw].count == 0) {
+ __generic_unplug_device(q);
+ spin_unlock_irq(&io_request_lock);
+ schedule();
+ spin_lock_irq(&io_request_lock);
+ waited = 1;
+ }
+ rq = __get_request(q, rw);
spin_unlock_irq(&io_request_lock);
} while (rq == NULL);
remove_wait_queue(&q->wait_for_requests[rw], &wait);
current->state = TASK_RUNNING;
+
+ if (!waitqueue_active(&q->wait_for_requests[rw]))
+ clear_queue_full(q, rw);
+
+ time_waited = jiffies - wait_start;
+ if (time_waited > q->max_wait)
+ q->max_wait = time_waited;
+ if (time_waited && time_waited < q->min_wait)
+ q->min_wait = time_waited;
+ q->total_wait += time_waited;
+ q->num_wait++;
+ if (time_waited < 500) {
+ q->deviation[time_waited/100]++;
+ }
+
return rq;
}
+static void get_request_wait_wakeup(request_queue_t *q, int rw)
+{
+ /*
+ * avoid losing an unplug if a second __get_request_wait did the
+ * generic_unplug_device while our __get_request_wait was running
+ * w/o the queue_lock held and w/ our request out of the queue.
+ */
+ if (waitqueue_active(&q->wait_for_requests[rw]))
+ wake_up(&q->wait_for_requests[rw]);
+}
+
/* RO fail safe mechanism */
static long ro_bits[MAX_BLKDEV][8];
@@ -829,8 +925,14 @@
*/
if (q) {
list_add(&req->queue, &q->rq[rw].free);
- if (++q->rq[rw].count >= q->batch_requests)
- wake_up(&q->wait_for_requests[rw]);
+ q->rq[rw].count++;
+ if (q->rq[rw].count >= q->batch_requests) {
+ smp_mb();
+ if (waitqueue_active(&q->wait_for_requests[rw]))
+ wake_up(&q->wait_for_requests[rw]);
+ else
+ clear_queue_full(q, rw);
+ }
}
}
@@ -948,7 +1050,6 @@
*/
max_sectors = get_max_sectors(bh->b_rdev);
-again:
req = NULL;
head = &q->queue_head;
/*
@@ -957,6 +1058,7 @@
*/
spin_lock_irq(&io_request_lock);
+again:
insert_here = head->prev;
if (list_empty(head)) {
q->plug_device_fn(q, bh->b_rdev); /* is atomic */
@@ -1042,6 +1144,9 @@
if (req == NULL) {
spin_unlock_irq(&io_request_lock);
freereq = __get_request_wait(q, rw);
+ head = &q->queue_head;
+ spin_lock_irq(&io_request_lock);
+ get_request_wait_wakeup(q, rw);
goto again;
}
}
@@ -1063,6 +1168,7 @@
req->rq_dev = bh->b_rdev;
req->start_time = jiffies;
req_new_io(req, 0, count);
+ q->num_req++;
blk_started_io(count);
add_request(q, req, insert_here);
out:
@@ -1196,8 +1302,15 @@
bh->b_rdev = bh->b_dev;
bh->b_rsector = bh->b_blocknr * count;
+ get_bh(bh);
generic_make_request(rw, bh);
+ /* fix race condition with wait_on_buffer() */
+ smp_mb(); /* spin_unlock may have inclusive semantics */
+ if (waitqueue_active(&bh->b_wait))
+ wake_up(&bh->b_wait);
+
+ put_bh(bh);
switch (rw) {
case WRITE:
kstat.pgpgout += count;
--- 1.83/fs/buffer.c Wed May 14 12:51:00 2003
+++ edited/fs/buffer.c Mon Jun 9 13:55:22 2003
@@ -153,10 +153,23 @@
get_bh(bh);
add_wait_queue(&bh->b_wait, &wait);
do {
- run_task_queue(&tq_disk);
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
if (!buffer_locked(bh))
break;
+ /*
+ * We must read tq_disk in TQ_ACTIVE after the
+ * add_wait_queue effect is visible to other cpus.
+ * We could unplug some line above it wouldn't matter
+ * but we can't do that right after add_wait_queue
+ * without an smp_mb() in between because spin_unlock
+ * has inclusive semantics.
+ * Doing it here is the most efficient place so we
+ * don't do a suprious unplug if we get a racy
+ * wakeup that make buffer_locked to return 0, and
+ * doing it here avoids an explicit smp_mb() we
+ * rely on the implicit one in set_task_state.
+ */
+ run_task_queue(&tq_disk);
schedule();
} while (buffer_locked(bh));
tsk->state = TASK_RUNNING;
@@ -1507,6 +1520,9 @@
/* Done - end_buffer_io_async will unlock */
SetPageUptodate(page);
+
+ wakeup_page_waiters(page);
+
return 0;
out:
@@ -1538,6 +1554,7 @@
} while (bh != head);
if (need_unlock)
UnlockPage(page);
+ wakeup_page_waiters(page);
return err;
}
@@ -1765,6 +1782,8 @@
else
submit_bh(READ, bh);
}
+
+ wakeup_page_waiters(page);
return 0;
}
@@ -2378,6 +2397,7 @@
submit_bh(rw, bh);
bh = next;
} while (bh != head);
+ wakeup_page_waiters(page);
return 0;
}
--- 1.49/fs/super.c Wed Dec 18 21:34:24 2002
+++ edited/fs/super.c Mon Jun 9 12:17:24 2003
@@ -726,6 +726,7 @@
if (!fs_type->read_super(s, data, flags & MS_VERBOSE ? 1 : 0))
goto Einval;
s->s_flags |= MS_ACTIVE;
+ blk_reset_stats(dev);
path_release(&nd);
return s;
--- 1.45/fs/reiserfs/inode.c Thu May 22 16:35:02 2003
+++ edited/fs/reiserfs/inode.c Mon Jun 9 12:17:24 2003
@@ -2048,6 +2048,7 @@
*/
if (nr) {
submit_bh_for_writepage(arr, nr) ;
+ wakeup_page_waiters(page);
} else {
UnlockPage(page) ;
}
--- 1.23/include/linux/blkdev.h Fri Nov 29 17:03:01 2002
+++ edited/include/linux/blkdev.h Mon Jun 9 17:31:18 2003
@@ -126,6 +126,14 @@
*/
char head_active;
+ /*
+ * Booleans that indicate whether the queue's free requests have
+ * been exhausted; waiters sleep until the free count climbs back
+ * past the batch_requests threshold
+ */
+ char read_full;
+ char write_full;
+
unsigned long bounce_pfn;
/*
@@ -138,8 +146,17 @@
* Tasks wait here for free read and write requests
*/
wait_queue_head_t wait_for_requests[2];
+ unsigned long max_wait;
+ unsigned long min_wait;
+ unsigned long total_wait;
+ unsigned long num_req;
+ unsigned long num_wait;
+ unsigned long deviation[5];
};
+void blk_reset_stats(kdev_t dev);
+void blk_print_stats(kdev_t dev);
+
#define blk_queue_plugged(q) (q)->plugged
#define blk_fs_request(rq) ((rq)->cmd == READ || (rq)->cmd == WRITE)
#define blk_queue_empty(q) list_empty(&(q)->queue_head)
@@ -156,6 +173,33 @@
}
}
+static inline void set_queue_full(request_queue_t *q, int rw)
+{
+ wmb();
+ if (rw == READ)
+ q->read_full = 1;
+ else
+ q->write_full = 1;
+}
+
+static inline void clear_queue_full(request_queue_t *q, int rw)
+{
+ wmb();
+ if (rw == READ)
+ q->read_full = 0;
+ else
+ q->write_full = 0;
+}
+
+static inline int queue_full(request_queue_t *q, int rw)
+{
+ rmb();
+ if (rw == READ)
+ return q->read_full;
+ else
+ return q->write_full;
+}
+
extern unsigned long blk_max_low_pfn, blk_max_pfn;
#define BLK_BOUNCE_HIGH (blk_max_low_pfn << PAGE_SHIFT)
@@ -217,6 +261,7 @@
extern void generic_make_request(int rw, struct buffer_head * bh);
extern inline request_queue_t *blk_get_queue(kdev_t dev);
extern void blkdev_release_request(struct request *);
+extern void blk_print_stats(kdev_t dev);
/*
* Access functions for manipulating queue properties
--- 1.19/include/linux/pagemap.h Sun Aug 25 15:32:11 2002
+++ edited/include/linux/pagemap.h Mon Jun 9 14:47:11 2003
@@ -97,6 +97,8 @@
___wait_on_page(page);
}
+extern void FASTCALL(wakeup_page_waiters(struct page * page));
+
/*
* Returns locked page at given index in given cache, creating it if needed.
*/
--- 1.68/kernel/ksyms.c Fri May 23 17:40:47 2003
+++ edited/kernel/ksyms.c Mon Jun 9 12:17:24 2003
@@ -295,6 +295,7 @@
EXPORT_SYMBOL(filemap_fdatawait);
EXPORT_SYMBOL(lock_page);
EXPORT_SYMBOL(unlock_page);
+EXPORT_SYMBOL(wakeup_page_waiters);
/* device registration */
EXPORT_SYMBOL(register_chrdev);
--- 1.77/mm/filemap.c Thu Apr 24 11:05:10 2003
+++ edited/mm/filemap.c Mon Jun 9 12:17:24 2003
@@ -812,6 +812,20 @@
return &wait[hash];
}
+/*
+ * This must be called after every submit_bh with end_io
+ * callbacks that would result in the blkdev layer waking
+ * up the page after a queue unplug.
+ */
+void wakeup_page_waiters(struct page * page)
+{
+ wait_queue_head_t * head;
+
+ head = page_waitqueue(page);
+ if (waitqueue_active(head))
+ wake_up(head);
+}
+
/*
* Wait for a page to get unlocked.
*
Hi,
On Mon, Jun 09, 2003 at 05:39:23PM -0400, Chris Mason wrote:
> Ok, there are lots of different problems here, and I've spent a little
> while trying to get some numbers with the __get_request_wait stats patch
> I posted before. This is all on ext2, since I wanted to rule out
> interactions with the journal flavors.
>
> Basically a dbench 90 run on ext2 rc6 vanilla kernels can generate
> latencies of over 2700 jiffies in __get_request_wait, with an average
> latency over 250 jiffies.
>
> No, most desktop workloads aren't dbench 90, but between balance_dirty()
> and the way we send stuff to disk during memory allocations, just about
> any process can get stuck submitting dirty buffers even if you've just
> got one process doing a dd if=/dev/zero of=foo.
>
> So, for the moment I'm going to pretend people seeing stalls in X are
> stuck in atime updates or memory allocations, or reading proc or some
> other silly spot.
>
> For the SMP corner cases, I've merged Andrea's fix-pausing patch into
> rc7, along with an altered form of Nick Piggin's queue_full patch to try
> and fix the latency problems.
>
> The major difference from Nick's patch is that once the queue is marked
> full, I don't clear the full flag until the wait queue is empty. This
> means new io can't steal available requests until every existing waiter
> has been granted a request.
>
> The latency results are better, with average time spent in
> __get_request_wait being around 28 jiffies, and a max of 170 jiffies.
> The cost is throughput; further benchmarking needs to be done, but I
> wanted to get this out for review and testing. It should at least help
> us decide if the request allocation code really is causing our problems.
>
> The patch below also includes the __get_request_wait latency stats. If
> people try this and still see stalls, please run elvtune /dev/xxxx and
> send along the resulting console output.
>
> I haven't yet compared this to Andrea's elevator latency code, but the
> stat patch was originally developed on top of his 2.4.21pre3aa1, where
> the average wait was 97 jiffies and the max was 318.
>
> Anyway, less talk, more code. Treat this with care, it has only been
> lightly tested. Thanks to Andrea and Nick whose patches this is largely
> based on:
I spent last Saturday working on this too. This is the status of my
current patches; it would be interesting to compare them. They're not
very well tested yet though.
They would obsolete the old fix-pausing and the old elevator-lowlatency
patches (I was going to release a new tree today, but I delayed it so I
could fix uml first [tested with skas and w/o skas]).
Those back out the rc7 interactivity changes (the only one that wasn't
in my tree was the add_wait_queue_exclusive, which IMHO had better stay
for scalability reasons).
Of course I would be very interested to know if those two patches (or
Chris's, since you also retained the exclusive wakeup) are still greatly
improved by removing the _exclusive wakeups and going wake-all (in
theory they shouldn't be).
Andrea
Chris Mason wrote:
>Ok, there are lots of different problems here, and I've spent a little
>while trying to get some numbers with the __get_request_wait stats patch
>I posted before. This is all on ext2, since I wanted to rule out
>interactions with the journal flavors.
>
>Basically a dbench 90 run on ext2 rc6 vanilla kernels can generate
>latencies of over 2700 jiffies in __get_request_wait, with an average
>latency over 250 jiffies.
>
>No, most desktop workloads aren't dbench 90, but between balance_dirty()
>and the way we send stuff to disk during memory allocations, just about
>any process can get stuck submitting dirty buffers even if you've just
>got one process doing a dd if=/dev/zero of=foo.
>
>So, for the moment I'm going to pretend people seeing stalls in X are
>stuck in atime updates or memory allocations, or reading proc or some
>other silly spot.
>
>For the SMP corner cases, I've merged Andrea's fix-pausing patch into
>rc7, along with an altered form of Nick Piggin's queue_full patch to try
>and fix the latency problems.
>
>The major difference from Nick's patch is that once the queue is marked
>full, I don't clear the full flag until the wait queue is empty. This
>means new io can't steal available requests until every existing waiter
>has been granted a request.
>
Yes, this is probably a good idea.
>
>The latency results are better, with average time spent in
>__get_request_wait being around 28 jiffies, and a max of 170 jiffies.
>The cost is throughput; further benchmarking needs to be done, but I
>wanted to get this out for review and testing. It should at least help
>us decide if the request allocation code really is causing our problems.
>
Well the latency numbers are good - is this with dbench 90?
snip
>
>+static inline void set_queue_full(request_queue_t *q, int rw)
>+{
>+ wmb();
>+ if (rw == READ)
>+ q->read_full = 1;
>+ else
>+ q->write_full = 1;
>+}
>+
>+static inline void clear_queue_full(request_queue_t *q, int rw)
>+{
>+ wmb();
>+ if (rw == READ)
>+ q->read_full = 0;
>+ else
>+ q->write_full = 0;
>+}
>+
>+static inline int queue_full(request_queue_t *q, int rw)
>+{
>+ rmb();
>+ if (rw == READ)
>+ return q->read_full;
>+ else
>+ return q->write_full;
>+}
>+
>
I don't think you need the barriers here, do you?
On Mon, 2003-06-09 at 18:19, Andrea Arcangeli wrote:
> > Anyway, less talk, more code. Treat this with care, it has only been
> > lightly tested. Thanks to Andrea and Nick whose patches this is largely
> > based on:
>
> I spent last Saturday working on this too. This is the status of my
> current patches; it would be interesting to compare them. They're not
> very well tested yet though.
>
I'll try to get some numbers in the morning.
> They would obsolete the old fix-pausing and the old elevator-lowlatency
> patches (I was going to release a new tree today, but I delayed it so I
> could fix uml first [tested with skas and w/o skas]).
>
> Those back out the rc7 interactivity changes (the only one that wasn't
> in my tree was the add_wait_queue_exclusive, which IMHO had better stay
> for scalability reasons).
>
I didn't test without _exclusive for the final iteration of my patch, but
in all the early ones using _exclusive improved latencies. I think
people are reporting otherwise because they have hit the sweet spot for
number of procs going after the requests. With _exclusive they have a
higher chance of getting starved by a new process coming in, without the
_exclusive, each waiter has a fighting chance of getting to the free
request on their own. Hopefully we can do better with the _exclusive,
it does seem to scale much better.
Aside from the io in flight calculations, the major difference between
our patches is in __get_request_wait. Once a process waits once, that
call to __get_request_wait ignores q->full in my code.
I found the q->full checks did help, but as you increased the number of
concurrent readers/writers things broke down to the old high latencies.
By delaying the point where q->full was cleared, I could make the
latency benefit last for a higher number of procs. Finally I gave up
and left it set until all the waiters were gone, which seems to have the
most consistent results.
The interesting part was it didn't seem to change the hit in
throughput. The cost was about the same between the original patch and
my final one, but I need to test more.
-chris
On Mon, 2003-06-09 at 19:51, Nick Piggin wrote:
> >
> >The latency results are better, with average time spent in
> >__get_request_wait being around 28 jiffies, and a max of 170 jiffies.
> >The cost is throughput; further benchmarking needs to be done, but I
> >wanted to get this out for review and testing. It should at least help
> >us decide if the request allocation code really is causing our problems.
> >
>
> Well the latency numbers are good - is this with dbench 90?
>
Yes, that number was dbench 90, but dbench 50,90, and 120 gave about the
same stats with the final patch.
> snip
> >+
> >+static inline int queue_full(request_queue_t *q, int rw)
> >+{
> >+ rmb();
> >+ if (rw == READ)
> >+ return q->read_full;
> >+ else
> >+ return q->write_full;
> >+}
> >+
> >
>
> I don't think you need the barriers here, do you?
>
I put the barriers in early on when almost all the calls were done
outside spin locks, the current flavor of the patch only does one
clear_queue_full without the io_request_lock held. It should be enough
to toss a barrier in just that one spot. But I wanted to leave them in
so I could move things around until the final version (if there ever is
one ;-)
-chris
Chris Mason wrote:
>On Mon, 2003-06-09 at 19:51, Nick Piggin wrote:
>
>
>>>The latency results are better, with average time spent in
>>>__get_request_wait being around 28 jiffies, and a max of 170 jiffies.
>>>The cost is throughput; further benchmarking needs to be done, but I
>>>wanted to get this out for review and testing. It should at least help
>>>us decide if the request allocation code really is causing our problems.
>>>
>>>
>>Well the latency numbers are good - is this with dbench 90?
>>
>>
>
>Yes, that number was dbench 90, but dbench 50,90, and 120 gave about the
>same stats with the final patch.
>
Great.
>
>>snip
>>
>
>>>+
>>>+static inline int queue_full(request_queue_t *q, int rw)
>>>+{
>>>+ rmb();
>>>+ if (rw == READ)
>>>+ return q->read_full;
>>>+ else
>>>+ return q->write_full;
>>>+}
>>>+
>>>
>>>
>>I don't think you need the barriers here, do you?
>>
>>
>
>I put the barriers in early on when almost all the calls were done
>outside spin locks, the current flavor of the patch only does one
>clear_queue_full without the io_request_lock held. It should be enough
>to toss a barrier in just that one spot. But I wanted to leave them in
>so I could move things around until the final version (if there ever is
>one ;-)
>
Yeah I see.
From: [email protected]
[mailto:[email protected]]On Behalf Of Nick Piggin
> Chris Mason wrote:
> >The major difference from Nick's patch is that once the queue is marked
> >full, I don't clear the full flag until the wait queue is empty. This
> >means new io can't steal available requests until every existing waiter
> >has been granted a request.
> Yes, this is probably a good idea.
Err... wouldn't this subvert the spirit, if not the warrant, of real time
scheduling and time-critical applications?
After all we *do* want to all-but-starve lower priority tasks of IO in the
presence of higher priority tasks. A select few applications absolutely
need to be pampered (think ProTools audio mixing suite on the Mac etc.) and
any solution that doesn't take this into account will have to be re-done by
the people who want to bring these kinds of tasks to Linux.
I am not most familiar with this body of code, but wouldn't the people
trying to do audio sampling and gaming get really frosted if they had to
wait for a list of lower priority IO events to completely drain before they
could get back to work? It would certainly produce really bad encoding of
live data streams (etc).
From a purely queue-theory standpoint, I'm not even sure why this queue can
become "full". Shouldn't the bounding case come about primarily from lack of
resources (can't allocate a queue entry or a data block) out where the users
can see and cope with the problem, before all the expensive blocking and
waiting?
Still from a pure-theory standpoint, it would be "better" to make the wait
queues priority queues and leave their sizes unbounded.
In practice it is expensive to maintain a fully "proper" priority queue for
a queue of non-trivial size. Then again, IO isn't cheap over the domain of
time anyway.
The solution proposed, by limiting the queue size sort-of turns the
scheduler's wakeup behavior into that priority queue sorting mechanism.
That in turn would (it seems to me) lead to some degenerate behaviors just
outside the zone of midline stability. In short several very-high-priority
tasks could completely starve out the system if they can consistently submit
enough requests to fill the queue.
[That is: consider a bunch of tasks sleeping in the scheduler because they
are waiting for the queue to empty. When they are all woken up, they will
actually be scheduled in priority order. So higher priority tasks get first
crack at the "empty" queue. If there are "enough" such tasks (which are IO
bound on this device) they will keep getting serviced, and then keep going
back to sleep on the full queue. (And god help you if they are runaways
8-). The high priority tasks constantly butt in line (because the scheduler
is now the keeper of the IO queue) and the lower priority tasks could wait
forever.]
{please note; I write some fairly massively-threaded applications, it would
only take one such application running at a high priority to produce "a
substantial number" of high priority processes submitting IO requests, so
the scenario, while not common, is potentially real.}
(so just off the top of my head...)
I would think that the best theoretical solution would be a priority heap.
(ignoring heap storage requirements for a moment) you keep the highest
priority items in the front of the heap and any time a heap reorg passes a
node by you jack that nodes priority by one. For an extremely busy queue
nothing is starved, but the incline remains high enough to make sure that
the truly desperate priorities (of which there should be few in a real world
system) will "never" wait behind some dd(1) of vanishingly close to no
import.
Clearly doing a full heap with only pointers is ugly almost beyond
comprehension, and doing a heap in an array would tend to be impractical for
a large list under variable conditions. A red-black tree gets too expensive
if you use them that many times throughout a system. (and so on)
While several possible sort-of-heapish or sort-of-priority-queueish data
structures come to mind, I don't have a replacement concept that I can really
promote just now...
I would say that at a MINIMUM there needs to be some threshold of priority
for requests that get to go on a "full list" no matter what. There really
"ought to be" a way for requests from higher priority tasks to get closer
to the front of the list. There "should be" a priority floor where tasks
with lower priorities get their requests queued up with the current
first-come-first-served mentality (as we don't need to spend a lot of time
thinking about things that have been nice(d) into the noise floor). And
then there should be a promotion mechanism to prevent complete starvation.
Anything simpler and it is safer from a system stability standpoint to keep
with the current high-latency-on-occasion simple queue solution.
Rob.
On Mon, 2003-06-09 at 21:48, Robert White wrote:
> From: [email protected]
> [mailto:[email protected]]On Behalf Of Nick Piggin
>
> > Chris Mason wrote:
>
> > >The major difference from Nick's patch is that once the queue is marked
> > >full, I don't clear the full flag until the wait queue is empty. This
> > >means new io can't steal available requests until every existing waiter
> > >has been granted a request.
>
> > Yes, this is probably a good idea.
>
>
> Err... wouldn't this subvert the spirit, if not the warrant, of real time
> scheduling and time-critical applications?
>
[ lots of interesting points ]
Heh, I didn't really make my goals for the patch clear. They go:
1) quantify the stalls people are seeing with real numbers so we can
point at a section of code causing bad performance.
2) Provide a somewhat obvious patch that makes the current
__get_request_wait call significantly more fair, in hopes of either
blaming it for the stalls or removing it from the list of candidates
3) fix the stalls
Most of your suggestions are 2.5 discussion material, where real
experimental work is going on. The 2.4 io request wait queue isn't
working on priorities, the current one tries to be fair to everyone and
provide good throughput to everyone at the same time. It's failing on
at least one of those, and until we can fix that I don't even want to
think about more complex issues.
Current users of the vanilla 2.4 tree will hopefully benefit from a
lower latency io request wait queue. The next best thing to real time is
a consistently small wait, which is what my patch is trying for.
-chris
Robert White wrote:
>From: [email protected]
>[mailto:[email protected]]On Behalf Of Nick Piggin
>
>
>>Chris Mason wrote:
>>
>
>>>The major difference from Nick's patch is that once the queue is marked
>>>full, I don't clear the full flag until the wait queue is empty. This
>>>means new io can't steal available requests until every existing waiter
>>>has been granted a request.
>>>
>
>>Yes, this is probably a good idea.
>>
>
>
>Err... wouldn't this subvert the spirit, if not the warrant, of real time
>scheduling and time-critical applications?
>
No, my patch (plus Chris' modification) changes request allocation
from an overloaded queue from semi-random (a timing-dependent mixture
of LIFO and FIFO) to FIFO.
As Chris has shown, the old behaviour can starve a task for 2.7s (and
theoretically forever) when it would be woken in < 200ms under
similar conditions with the FIFO scheme.
>
>After all we *do* want to all-but-starve lower priority tasks of IO in the
>presence of higher priority tasks. A select few applications absolutely
>need to be pampered (think ProTools audio mixing suite on the Mac etc.) and
>any solution that doesn't take this into account will have to be re-done by
>the people who want to bring these kinds of tasks to Linux.
>
>I am not most familiar with this body of code, but wouldn't the people
>trying to do audio sampling and gaming get really frosted if they had to
>wait for a list of lower priority IO events to completely drain before they
>could get back to work? It would certainly produce really bad encoding of
>live data streams (etc).
>
>
Actually, there is no priority other than time (ie. FIFO), and
seek distance in the IO subsystem. I guess this is why your
arguments fall down ;)
>From a purely queue-theory standpoint, I'm not even sure why this queue can
>become "full". Shouldn't the bounding case come about primarily from lack of
>resources (can't allocate a queue entry or a data block) out where the users
>can see and cope with the problem, before all the expensive blocking and
>waiting?
>
In practice, the problems of having a memory size limited queue
outweigh the benefits.
>
>Still from a pure-theory standpoint, it would be "better" to make the wait
>queues priority queues and leave their sizes unbounded.
>
>In practice it is expensive to maintain a fully "proper" priority queue for
>a queue of non-trivial size. Then again, IO isn't cheap over the domain of
>time anyway.
>
If IO priorities were implemented, you still have the problem of
starvation. It would be better to simply have a per process limit
on request allocation, and implement the priority scheduling in
the io scheduler.
I think you would find that most processes do just fine with
just a couple of requests each, though.
>
>
>The solution proposed, by limiting the queue size sort-of turns the
>scheduler's wakeup behavior into that priority queue sorting mechanism.
>That in turn would (it seems to me) lead to some degenerate behaviors just
>outside the zone of midline stability. In short several very-high-priority
>tasks could completely starve out the system if they can consistently submit
>enough request to fill the queue.
>
>[That is: consider a bunch of tasks sleeping in the scheduler because they
>are waiting for the queue to empty. When they are all woken up, they will
>actually be scheduled in priority order. So higher priority tasks get first
>crack at the "empty" queue. If there are "enough" such tasks (which are IO
>bound on this device) they will keep getting serviced, and then keep going
>back to sleep on the full queue. (And god help you if they are runaways
>8-). The high priority tasks constantly butt in line (because the scheduler
>is now the keeper of the IO queue) and the lower priority tasks could wait
>forever.]
>
No, they will be woken up one at a time as requests
become freed, and in FIFO order. It might be possible
for a higher (CPU) priority task to be woken up
before the previous has a chance to run, but this
scheme is no worse than before (the solution here is
per process request limits, but this is 2.4).
From: Nick Piggin [mailto:[email protected]]
Sent: Monday, June 09, 2003 8:23 PM
>
> Actually, there is no priority other than time (ie. FIFO), and
> seek distance in the IO subsystem. I guess this is why your
> arguments fall down ;)
I'll buy that for the most part, though one of the differences I read
elsewhere in the thread was the choice between add_wait_queue() and
add_wait_queue_exclusive(). You will, however, note that one of the factors
that is playing in this patch is process priority.
(If I understand correctly) the wait queue in question becomes your FIFOing
agent; it is a kind of pre-queue on the actual IO queue, once you reach a
"full" condition.
In the latter case [add_wait_queue_exclusive()] you are strictly FIFO over
the set of processes, where the moment-of-order is determined by insertion
into the wait queue.
In the former case [add_wait_queue()] when the queue is woken up all the
waiters will be marked executable on the scheduler, and the scheduler will
then (at least tend to) sort the submissions into task priority order. So
the higher priority tasks will get to butt into line. Worse, the FIFO is
essentially lost to the vagaries of the scheduler so without the _exclusive
you have no FIFO at all.
I think that is the reason that Chris was saying the
add_wait_queue_exclusive() mode "does seem to scale much better."
So your "original new" batching agent is really order-of-arrival that
becomes anti-sorted by process priority. Which can lead to scheduler
induced starvation (and the observed "improvements" by using the strict FIFO
created by a_w_q_exclusive). The problem is that you get a little communist
about the FIFO-ness when you use a_w_q_exclusive() and that can *SERIOUSLY*
harm a task that must approach real-time behavior.
One solution would be to stick with the add_wait_queue() process-priority
influenced never-really-FIFO, but every time a process/task wakes up, and it
then doesn't get its request onto the queue, add a small fixed increment to
its priority before going back into the wait. This gives you both the
process-priority mechanism and a fairness metric.
Something like (in pure pseudo-code since I don't have my references here):
    int priority_delta = 0;
    while (try_enqueuing_io_request() == queue_full) {
        if (current->priority < priority_max) {
            current->priority += priority_increment;
            priority_delta += priority_increment;
        }
        wait_on_queue();
    }
    current->priority -= priority_delta;
(and still, of course, only wake the wait queue when the "full" queue
reaches empty.)
What that gets you is democratic entry into the io request queue when it is
non-full. It gets you seniority-based (plutocratic?) access to the io queue
as your request "ages" in the full pool. If the pool gets so large that all
the requests are making their tasks reach priority_max then you "degrade" to
the fairness of the scheduler, which is an arbitrary but workable metric.
You get all that, but you preserve (or invent) a relationship that lets the
task priority automagically factor in "for free" so that relative starvation
(which is a good thing for deliberately asymmetric task priorities, and
matches user expectations) can be achieved without ever having absolute
starvation.
Further if priority_max isn't priority_system_max you get the
real-time-trumps-all behavior that something like a live audio stream
encoder may need (for any priority >= priority_max).
Rob.
From: Chris Mason [mailto:[email protected]]
Sent: Monday, June 09, 2003 7:13 PM
> 2) Provide a somewhat obvious patch that makes the current
> __get_request_wait call significantly more fair, in hopes of either
> blaming it for the stalls or removing it from the list of candidates
Without the a_w_q_exclusive() on add_wait_queue the FIFO effect is lost when
all the members of the wait queue compete for their timeslice in the
scheduler. For all intents and purposes the fairness goes up some (you stop
having the one guy sorted to the un-happy end of the disk) but low priority
tasks will still always end up stalled on the dirty end of the stick.
Basically each new round at the queue-empty moment is a mob rush for the
door.
With the a_w_q_exclusive(), you get past fair and well into anti-optimal.
Your FIFO becomes essentially mandatory with no regard for anything but the
order things hit the wait queue. (Particularly on an SMP machine, however)
"new requestors" may/will jump to the head of the line because they were
never *in* the wait queue. So you have only achieved "fairness" with
respect to requests that come in to a io queue that was full-at-the-time of
the initial entry into the driver. This becomes exactly like the experience
of waiting patiently on line to get off the highway and watching all the
rude people driving by you only to cut over and nose into the queue just at
the exit sign.
So you need the _exclusive if you want any kind of predictable fairness
(without getting into anything obscure) but it is still only "fair" for
those that were unfortunate enough to end up on the wait queue originally.
There is a small window for tasks to butt in freely.
> 3) fix the stalls
Without the _exclusive() you can't have fixed the stalls, you can only have
moved the locus-of-blame to the scheduler which may (or may not) have some
way to compensate and "fake fairness" built in by coincidence.
The thing I suggest in my other email, where you use the non-exclusive
version of the routine but temporarily bump the process priority each time a
request gets foisted off on the wait_queue instead of the IO queue, actually
has semantic fairness built in. This basically builds a fairness elevator
that functions both over time-in-queue and original process priority (when
built into your basic patch).
It's also quite space/time efficient and fairly clear to reader and
implementer alike.
Rob.
On Mon, 2003-06-09 at 18:19, Andrea Arcangeli wrote:
> I spent last Saturday working on this too. This is the status of my
> current patches; it would be interesting to compare them. They're not
> very well tested yet though.
>
> They would obsolete the old fix-pausing and the old elevator-lowlatency
> patches (I was going to release a new tree today, but I delayed it so I
> could fix uml first [tested with skas and w/o skas]).
>
> Those back out the rc7 interactivity changes (the only one that wasn't
> in my tree was the add_wait_queue_exclusive, which IMHO had better stay
> for scalability reasons).
>
> Of course I would be very interested to know if those two patches (or
> Chris's, since you also retained the exclusive wakeup) are still greatly
> improved by removing the _exclusive wakeups and going wake-all (in
> theory they shouldn't be).
Ok, I merged these into rc7 along with the __get_request_wait stats
patch. All numbers below were on ext2...I'm calling your patches -aa,
even though it's just a small part of the real -aa ;-) After a dbench 50
run, the -aa __get_request_wait latencies look like this:
device 08:01: num_req 6029, total jiffies waited 213475
844 forced to wait
2 min wait, 806 max wait
252 average wait
357 < 100, 29 < 200, 110 < 300, 111 < 400, 82 < 500
155 waits longer than 500 jiffies
I changed my patch to have q->nr_requests at 1024 like yours, and reran
the dbench 50:
device 08:01: num_req 11122, total jiffies waited 121573
8782 forced to wait
1 min wait, 237 max wait
13 average wait
8654 < 100, 126 < 200, 2 < 300, 0 < 400, 0 < 500
0 waits longer than 500 jiffies
So, I had 5000 more requests for the same workload, and 8000 of my
requests were forced to wait (compared to 844 of yours). But the total
number of jiffies spent waiting on my patch was lower, as were the
average and max waits. Increasing the number of requests with my patch
made the system feel slower, even though the __get_request_wait latency
numbers didn't change.
On this dbench run, you got a throughput of 118mb/s and I got 90mb/s.
The __get_request_wait latency numbers were reliable across runs, but I
might as well have thrown a dart to pick throughput numbers. So, next
tests were done with iozone.
On aa after iozone -s 100M -i 0 -t 20 (20 procs each doing streaming
writes to a private 100M file)
device 08:01: num_req 167133, total jiffies waited 872566
6424 forced to wait
4 min wait, 507 max wait
135 average wait
2619 < 100, 2020 < 200, 1433 < 300, 325 < 400, 26 < 500
1 waits longer than 500 jiffies
And the iozone throughput numbers looked like so (again -aa patches)
Children see throughput for 20 initial writers = 13824.22 KB/sec
Parent sees throughput for 20 initial writers = 6811.29 KB/sec
Min throughput per process = 451.99 KB/sec
Max throughput per process = 904.14 KB/sec
Avg throughput per process = 691.21 KB/sec
Min xfer = 51136.00 KB
The avg throughput per process with vanilla rc7 is 3MB/s; the best I've
been able to do with nr_requests at higher levels was 1.3MB/s. With
smaller numbers of iozone threads (10 and lower so far) I can match rc7
speeds, but not with 20 procs.
Anyway, my latency numbers for iozone -s 100M -i 0 -t 20:
device 08:01: num_req 146049, total jiffies waited 434025
130670 forced to wait
1 min wait, 65 max wait
3 average wait
130671 < 100, 0 < 200, 0 < 300, 0 < 400, 0 < 500
0 waits longer than 500 jiffies
And the iozone reported throughput:
Children see throughput for 20 initial writers = 19828.92 KB/sec
Parent sees throughput for 20 initial writers = 7003.36 KB/sec
Min throughput per process = 526.61 KB/sec
Max throughput per process = 1353.45 KB/sec
Avg throughput per process = 991.45 KB/sec
Min xfer = 39968.00 KB
The patch I was working on today was almost the same as the one I posted
yesterday, the only difference being the hunk below and changes to
nr_requests (256 balanced nicely on my box, all numbers above were at
1024).
This hunk against my patch from yesterday just avoids an unplug in
__get_request_wait if there are still available requests. A process
might be waiting in __get_request_wait just because the queue was full,
which has little to do with the queue needing an unplug. He'll get
woken up later by get_request_wait_wakeup if nobody else manages to wake
him (I think).
diff -u edited/drivers/block/ll_rw_blk.c edited/drivers/block/ll_rw_blk.c
--- edited/drivers/block/ll_rw_blk.c Mon Jun 9 17:13:16 2003
+++ edited/drivers/block/ll_rw_blk.c Tue Jun 10 16:46:50 2003
@@ -661,7 +661,8 @@
 		set_current_state(TASK_UNINTERRUPTIBLE);
 		spin_lock_irq(&io_request_lock);
 		if ((!waited && queue_full(q, rw)) || q->rq[rw].count == 0) {
-			__generic_unplug_device(q);
+			if (q->rq[rw].count == 0)
+				__generic_unplug_device(q);
 			spin_unlock_irq(&io_request_lock);
 			schedule();
 			spin_lock_irq(&io_request_lock);
On Tue, Jun 10, 2003 at 07:13:45PM -0400, Chris Mason wrote:
> On Mon, 2003-06-09 at 18:19, Andrea Arcangeli wrote:
> The avg throughput per process with vanilla rc7 is 3MB/s, the best I've
> been able to do was with nr_requests at higher levels was 1.3MB/s. With
> smaller of iozone threads (10 and lower so far) I can match rc7 speeds,
> but not with 20 procs.
at least with my patches, I also made this change:
-#define ELV_LINUS_SEEK_COST 16
+#define ELV_LINUS_SEEK_COST 1
#define ELEVATOR_NOOP \
((elevator_t) { \
@@ -93,8 +93,8 @@ static inline int elevator_request_laten
#define ELEVATOR_LINUS \
((elevator_t) { \
- 2048, /* read passovers */ \
- 8192, /* write passovers */ \
+ 128, /* read passovers */ \
+ 512, /* write passovers */ \
\
you didn't change the I/O scheduler at all compared to mainline, so
there can be quite a lot of difference in the bandwidth average per
process between my patches and mainline and your patches (unless you run
elvtune or unless you backed out the above).
Anyway, the 130671 < 100, 0 < 200, 0 < 300, 0 < 400, 0 < 500 from your
patch sounds perfectly fair, and that's unrelated to the I/O scheduler
and the size of the request queue. I believe the most interesting
difference is the blocking of tasks until the waitqueue is empty (i.e.
clearing the waitqueue-full bit only when nobody is waiting). That is
the right thing to do of course; that was a bug in my patch that I
merged by mistake from Nick's original patch, and that I'm going to fix
immediately of course.
Andrea
On Mon, Jun 09, 2003 at 05:39:23PM -0400, Chris Mason wrote:
> + if (!waitqueue_active(&q->wait_for_requests[rw]))
> + clear_queue_full(q, rw);
you've an smp race above, the smp safe implementation is this:
if (!waitqueue_active(&q->wait_for_requests[rw])) {
clear_queue_full(q, rw);
mb();
if (unlikely(waitqueue_active(&q->wait_for_requests[rw])))
wake_up(&q->wait_for_requests[rw]);
}
I'm also unsure what the "waited" logic does, it doesn't seem necessary.
Andrea
Robert White wrote:
>From: Nick Piggin [mailto:[email protected]]
>Sent: Monday, June 09, 2003 8:23 PM
>
>>Actually, there is no priority other than time (ie. FIFO), and
>>seek distance in the IO subsystem. I guess this is why your
>>arguments fall down ;)
>>
>
>I'll buy that for the most part, though one of the differences I read
>elsewhere in the thread was the choice between add_wait_queue() and
>add_wait_queue_exclusive(). You will, however, note that one of the factors
>that is playing in this patch is process priority.
>
>(If I understand correctly) The wait queue in question becomes your FIFOing
>agent, it is kind of a pre-queue on the actual IO queue, once you reach a
>"full" condition.
>
Right.
>
>In the later case [add_wait_queue_exclusive()] you are strictly FIFO over
>the set of processes, where the moment-of-order is determined by insertion
>into the wait queue.
>
>In the former case [add_wait_queue()] when the queue is woken up all the
>waiters will be marked executable on the scheduler, and the scheduler will
>then (at least tend to) sort the submissions into task priority order. So
>the higher priority tasks will get to butt into line. Worse, the FIFO is
>essentially lost to the vagaries of the scheduler so without the _exclusive
>you have no FIFO at all.
>
>I think that is the reason that Chris was saying the
>add_wait_queue_exclusive() mode "does seem to scale much better."
>
Yep
>
>So your "original new" batching agent is really order-of-arrival that
>becomes anti-sorted by process priority. Which can lead to scheduler
>induced starvation (and the observed "improvements" by using the strict FIFO
>created by a_w_q_exclusive). The problem is that you get a little communist
>about the FIFO-ness when you use a_w_q_exclusive() and that can *SERIOUSLY*
>harm a task that must approach real-time behavior.
>
I think it had better be FIFO for now. If it's not, you're
making the worst case latency worse. It requires a lot of
careful testing to get something like that working right.
You have some good ideas, and quite possibly they would be
worth implementing, but the behaviour of the code is quite
complex, especially when you take into account its effect
on the io scheduler.
On Tue, 2003-06-10 at 20:16, Andrea Arcangeli wrote:
> On Tue, Jun 10, 2003 at 07:13:45PM -0400, Chris Mason wrote:
> > On Mon, 2003-06-09 at 18:19, Andrea Arcangeli wrote:
> > The avg throughput per process with vanilla rc7 is 3MB/s, the best I've
> > been able to do was with nr_requests at higher levels was 1.3MB/s. With
> > smaller of iozone threads (10 and lower so far) I can match rc7 speeds,
> > but not with 20 procs.
>
> at least with my patches, I also made this change:
>
> -#define ELV_LINUS_SEEK_COST 16
> +#define ELV_LINUS_SEEK_COST 1
>
> #define ELEVATOR_NOOP \
> ((elevator_t) { \
> @@ -93,8 +93,8 @@ static inline int elevator_request_laten
>
> #define ELEVATOR_LINUS \
> ((elevator_t) { \
> - 2048, /* read passovers */ \
> - 8192, /* write passovers */ \
> + 128, /* read passovers */ \
> + 512, /* write passovers */ \
> \
>
Right, I had forgotten to elvtune these in before my runs. It shouldn't
change the __get_request_wait numbers, except for changes in the
percentage of merged requests leading to a different number of requests
overall (which my numbers did show).
> you didn't change the I/O scheduler at all compared to mainline, so
> there can be quite a lot of difference in the bandwidth average per
> process between my patches and mainline and your patches (unless you run
> elvtune or unless you backed out the above).
>
> Anyways the 130671 < 100, 0 < 200, 0 < 300, 0 < 400, 0 < 500 from your
> patch sounds perfectly fair and that's unrelated to I/O scheduler and
> size of runqueue. I believe the most interesting difference is the
> blocking of tasks until the waitqueue is empty (i.e. clearing the
> waitqueue-full bit only when nobody is waiting). That is the right thing
> to do of course, that was a bug in my patch I merged by mistake from
> Nick's original patch, and that I'm going to fix immediatly of course.
Ok, increasing q->nr_requests also changes the throughput in high-merge
workloads.
Basically if we have 20 procs doing streaming buffered io, the buffers
end up mixed together on the dirty list. So assuming we hit the hard
dirty limit and all 20 procs are running write_some_buffers() the only
way we'll be able to efficiently merge the end result is if we can get
in 20 * 32 requests before unplugging.
This is because write_some_buffers grabs 32 buffers at a time, and each
caller has to wait fairly in __get_request_wait. With only 128 requests
in the request queue, the disk is unplugged before any of the 20 procs
has submitted all of their 32 buffers.
It might make sense to change write_some_buffers to work in smaller
units, 32 seems like a lot of times to wait in __get_request_wait just
for an atime update.
-chris
Andrea Arcangeli wrote:
>On Mon, Jun 09, 2003 at 05:39:23PM -0400, Chris Mason wrote:
>
>>+ if (!waitqueue_active(&q->wait_for_requests[rw]))
>>+ clear_queue_full(q, rw);
>>
>
>you've an smp race above, the smp safe implementation is this:
>
> if (!waitqueue_active(&q->wait_for_requests[rw])) {
> clear_queue_full(q, rw);
> mb();
> if (unlikely(waitqueue_active(&q->wait_for_requests[rw])))
> wake_up(&q->wait_for_requests[rw]);
> }
>
>I'm also unsure what the "waited" logic does, it doesn't seem necessary.
>
When a task is woken up, it is quite likely that the
queue is still marked full.
On Tue, 2003-06-10 at 20:33, Andrea Arcangeli wrote:
> On Mon, Jun 09, 2003 at 05:39:23PM -0400, Chris Mason wrote:
> > + if (!waitqueue_active(&q->wait_for_requests[rw]))
> > + clear_queue_full(q, rw);
>
> you've an smp race above, the smp safe implementation is this:
>
clear_queue_full has a wmb() in my patch, and queue_full has an rmb(); I
thought that covered these cases? I'd rather remove those though, since
the spot you point out is the only place it's done outside the
io_request_lock.
> if (!waitqueue_active(&q->wait_for_requests[rw])) {
> clear_queue_full(q, rw);
> mb();
> if (unlikely(waitqueue_active(&q->wait_for_requests[rw])))
> wake_up(&q->wait_for_requests[rw]);
> }
>
I don't think we need the extra wake_up (this is in __get_request_wait,
right?), since it gets done by get_request_wait_wakeup()
> I'm also unsure what the "waited" logic does, it doesn't seem necessary.
Once a process has waited once, it is allowed to ignore the q->full
flag. This way existing waiters can make progress even when q->full is
set. Without the waited check, q->full would never get cleared, because
the last waiter couldn't proceed until the last waiter was gone. I had
to change __get_request for the same reason.
-chris
On Tue, 2003-06-10 at 19:04, Robert White wrote:
> From: Chris Mason [mailto:[email protected]]
> Sent: Monday, June 09, 2003 7:13 PM
>
> > 2) Provide a somewhat obvious patch that makes the current
> > __get_request_wait call significantly more fair, in hopes of either
> > blaming it for the stalls or removing it from the list of candidates
>
> Without the a_w_q_exclusive() on add_wait_queue the FIFO effect is lost when
> all the members of the wait queue compete for their timeslice in the
> scheduler. For all intents and purposes the fairness goes up some (you stop
> having the one guy sorted to the un-happy end of the disk) but low priority
> tasks will still always end up stalled on the dirty end of the stick.
> Basically each new round at the queue-empty moment is a mob rush for the
> door.
>
> With the a_w_q_exclusive(), you get past fair and well into anti-optimal.
> Your FIFO becomes essentially mandatory with no regard for anything but the
> order things hit the wait queue. (Particularly on an SMP machine, however)
> "new requestors" may/will jump to the head of the line because they were
> never *in* the wait queue.
The patches flying around force new io into the wait queue any time
someone else is already waiting, nobody is allowed to jump to the head
of the line.
The rest of your ideas are interesting, we just can't smush them into
2.4. Please consider doing some experiments on the 2.5 io schedulers
and making suggestions, it's a critical area.
-chris
On Tue, Jun 10, 2003 at 08:54:00PM -0400, Chris Mason wrote:
> On Tue, 2003-06-10 at 20:33, Andrea Arcangeli wrote:
> > On Mon, Jun 09, 2003 at 05:39:23PM -0400, Chris Mason wrote:
> > > + if (!waitqueue_active(&q->wait_for_requests[rw]))
> > > + clear_queue_full(q, rw);
> >
> > you've an smp race above, the smp safe implementation is this:
> >
>
> clear_queue_full has a wmb() in my patch, and queue_full has a rmb(), I
> thought that covered these cases? I'd rather remove those though, since
> the spot you point out is the only place done outside the
> io_request_lock.
>
> > if (!waitqueue_active(&q->wait_for_requests[rw])) {
> > clear_queue_full(q, rw);
> > mb();
> > if (unlikely(waitqueue_active(&q->wait_for_requests[rw])))
> > wake_up(&q->wait_for_requests[rw]);
> > }
> >
> I don't think we need the extra wake_up (this is in __get_request_wait,
> right?), since it gets done by get_request_wait_wakeup()
there's no get_request_wait_wakeup in blkdev_release_request. I put the
construct in both places though (I have the clear_queue_full open-coded
as q->full = 0).
And I don't think any of your barriers are needed at all; I mean, we
only need to be careful to clear it correctly, we don't need to be
careful to set or read it when it transitions from 0 to 1. And the
above seems enough to me to get the clearing right.
> > I'm also unsure what the "waited" logic does, it doesn't seem necessary.
>
> Once a process waits once, they are allowed to ignore the q->full flag.
> This way existing waiters can make progress even when q->full is set.
> Without the waited check, q->full will never get cleared because the
> last writer wouldn't proceed until the last writer was gone. I had to
> make __get_request for the same reason.
__get_request makes perfect sense of course and it's needed; that is
not the issue. My point about the waited check is that the last writer
has to get the wakeup (and the wakeup has nothing to do with the waited
check, since waited == 0), and after the wakeup it will get the request
and it won't re-run the loop, so I don't see why waited is needed.
Furthermore, even if for whatever reason it doesn't get the request, it
will re-set full to 1 and it'll still be the first to get the wakeup,
and it will definitely get another wakeup if no request was available.
Andrea
On Wed, Jun 11, 2003 at 10:48:23AM +1000, Nick Piggin wrote:
>
>
> Andrea Arcangeli wrote:
>
> >On Mon, Jun 09, 2003 at 05:39:23PM -0400, Chris Mason wrote:
> >
> >>+ if (!waitqueue_active(&q->wait_for_requests[rw]))
> >>+ clear_queue_full(q, rw);
> >>
> >
> >you've an smp race above, the smp safe implementation is this:
> >
> > if (!waitqueue_active(&q->wait_for_requests[rw])) {
> > clear_queue_full(q, rw);
> > mb();
> > if (unlikely(waitqueue_active(&q->wait_for_requests[rw])))
> > wake_up(&q->wait_for_requests[rw]);
> > }
> >
> >I'm also unsure what the "waited" logic does, it doesn't seem necessary.
> >
>
> When a task is woken up, it is quite likely that the
> queue is still marked full.
but we don't care if it's marked full, see __get_request. If we cared
about full it would deadlock anyways (no matter the waited logic)
Andrea
On Tue, 2003-06-10 at 21:06, Andrea Arcangeli wrote:
> And I don't think any of your barriers is needed at all, I mean, we only
> need to be careful to clear it right, we don't need to be careful to set
> or read it right when it transits from 0 to 1. And the above seems
> enough to me to get right the clearing.
>
The current form of the patch has way too many barriers. When I first
added them the patch was really different, I left them in because it
seems to be easier to remember to rip them out than add them back ;-)
> > > I'm also unsure what the "waited" logic does, it doesn't seem necessary.
> >
> > Once a process waits once, they are allowed to ignore the q->full flag.
> > This way existing waiters can make progress even when q->full is set.
> > Without the waited check, q->full will never get cleared because the
> > last writer wouldn't proceed until the last writer was gone. I had to
> > make __get_request for the same reason.
>
> __get_request makes perfect sense of course and it's needed, this is not
> the issue, my point about the waited check is that the last writer has
> to get the wakeup (and the wakeup has nothing to do with the waited
> check since waited == 0), and after the wakeup it will get the request
> and it won't re-run the loop, so I don't see why waited is needed.
> Furthmore even if for whatever reason it doesn't get the request, it
> will re-set full to 1 and it'll be still the first to get the wakeup,
> and it will definitely get another wakeup if none request was available.
Ok, I see your point, we don't strictly need the waited check. I had
added it as an optimization at first, so that those who waited once were
not penalized by further queue_full checks.
-chris
On Tue, Jun 10, 2003 at 09:57:11PM -0400, Chris Mason wrote:
> Ok, I see your point, we don't strictly need the waited check. I had
> added it as an optimization at first, so that those who waited once were
> not penalized by further queue_full checks.
I could taste the feeling of not penalizing while reading the code, but
that's just a feeling; in reality, if they blocked it means they set
full themselves and there was no request, so they want to go to sleep
no matter what ->full says ;)
Andrea
On Tue, 2003-06-10 at 22:10, Andrea Arcangeli wrote:
> On Tue, Jun 10, 2003 at 09:57:11PM -0400, Chris Mason wrote:
> > Ok, I see your point, we don't strictly need the waited check. I had
> > added it as an optimization at first, so that those who waited once were
> > not penalized by further queue_full checks.
>
> I could taste the feeling of not penalizing while reading the code but
> that's just a feeling, in reality if they blocked it means they set full
> by themself and there was no request so they want to go to sleep no
> matter ->full or not ;)
You're completely right, as the patch changed I didn't realize waited
wasn't needed anymore ;-)
Are you adding the hunk from yesterday to avoid unplugs when q->rq.count
!= 0?
-chris
Ok here's an updated patch, it changes the barriers around, updates
comments, and gets rid of the waited check in __get_request_wait. It is
still a combined patch with fix_pausing, queue_full and latency stats,
mostly because I want to make really sure any testers are using all
three.
So, if someone who saw io stalls in 2.4.21-rc could give this a try, I'd
be grateful. If you still see stalls with this applied, run elvtune
/dev/xxx and send along the resulting console output.
-chris
On Wed, Jun 11, 2003 at 01:42:41PM -0400, Chris Mason wrote:
> + if (q->rq[rw].count >= q->batch_requests) {
> + smp_mb();
> + if (waitqueue_active(&q->wait_for_requests[rw]))
> + wake_up(&q->wait_for_requests[rw]);
in my tree I also changed this to:
wake_up_nr(&q->wait_for_requests[rw], q->rq[rw].count);
otherwise only one waiter will eat the requests, while multiple waiters
can eat requests in parallel instead because we freed not just 1 request
but many of them.
I wonder if my above change is really the right way to implement the
removal of the _exclusive line that went in rc6. However, with your
patch the wake_up_nr (or the ~equivalent removal of the _exclusive
wakeup in rc6) should mostly improve cpu parallelism in smp and while
waiting for I/O; the amount of stuff in the I/O queue and the overall
fairness shouldn't change very significantly with this new completely
fair FIFO request allocator.
Andrea
On Wed, 2003-06-11 at 14:12, Andrea Arcangeli wrote:
> On Wed, Jun 11, 2003 at 01:42:41PM -0400, Chris Mason wrote:
> > + if (q->rq[rw].count >= q->batch_requests) {
> > + smp_mb();
> > + if (waitqueue_active(&q->wait_for_requests[rw]))
> > + wake_up(&q->wait_for_requests[rw]);
>
> in my tree I also changed this to:
>
> wake_up_nr(&q->wait_for_requests[rw], q->rq[rw].count);
>
> otherwise only one waiter will eat the requests, while multiple waiters
> can eat requests in parallel instead because we freed not just 1 request
> but many of them.
I tried a few variations of this yesterday and they all led to horrible
latencies, but I couldn't really explain why. I had a bunch of other
stuff in at the time to try and improve throughput though, so I'll try
it again.
I think part of the problem is the cascading wakeups from
get_request_wait_wakeup(). So if we wake up 32 procs, they in turn wake
up another 32, etc.
-chris
On Wed, Jun 11, 2003 at 02:27:13PM -0400, Chris Mason wrote:
> On Wed, 2003-06-11 at 14:12, Andrea Arcangeli wrote:
> > On Wed, Jun 11, 2003 at 01:42:41PM -0400, Chris Mason wrote:
> > > + if (q->rq[rw].count >= q->batch_requests) {
> > > + smp_mb();
> > > + if (waitqueue_active(&q->wait_for_requests[rw]))
> > > + wake_up(&q->wait_for_requests[rw]);
> >
> > in my tree I also changed this to:
> >
> > wake_up_nr(&q->wait_for_requests[rw], q->rq[rw].count);
> >
> > otherwise only one waiter will eat the requests, while multiple waiters
> > can eat requests in parallel instead because we freed not just 1 request
> > but many of them.
>
> I tried a few variations of this yesterday and they all led to horrible
> latencies, but I couldn't really explain why. I had a bunch of other
the I/O latency in theory shouldn't change, we're not reordering the
queue at all; they'll go to sleep immediately again if __get_request
returns null.
> stuff in at the time to try and improve throughput though, so I'll try
> it again.
>
> I think part of the problem is the cascading wakeups from
> get_request_wait_wakeup(). So if we wakeup 32 procs they in turn wakeup
> another 32, etc.
so maybe it's enough to wakeup count / 2 to account for the double
wakeup? that will avoid some overscheduling indeed.
Andrea
Andrea Arcangeli wrote:
>On Wed, Jun 11, 2003 at 02:27:13PM -0400, Chris Mason wrote:
>
>>On Wed, 2003-06-11 at 14:12, Andrea Arcangeli wrote:
>>
>>>On Wed, Jun 11, 2003 at 01:42:41PM -0400, Chris Mason wrote:
>>>
>>>>+ if (q->rq[rw].count >= q->batch_requests) {
>>>>+ smp_mb();
>>>>+ if (waitqueue_active(&q->wait_for_requests[rw]))
>>>>+ wake_up(&q->wait_for_requests[rw]);
>>>>
>>>in my tree I also changed this to:
>>>
>>> wake_up_nr(&q->wait_for_requests[rw], q->rq[rw].count);
>>>
>>>otherwise only one waiter will eat the requests, while multiple waiters
>>>can eat requests in parallel instead because we freed not just 1 request
>>>but many of them.
>>>
>>I tried a few variations of this yesterday and they all led to horrible
>>latencies, but I couldn't really explain why. I had a bunch of other
>>
>
>the I/O latency in theory shouldn't change, we're not reordering the
>queue at all, they'll go to sleep immediatly again if __get_request
>returns null.
>
And go to the end of the queue?
>
>>stuff in at the time to try and improve throughput though, so I'll try
>>it again.
>>
>>I think part of the problem is the cascading wakeups from
>>get_request_wait_wakeup(). So if we wakeup 32 procs they in turn wakeup
>>another 32, etc.
>>
>
>so maybe it's enough to wakeup count / 2 to account for the double
>wakeup? that will avoid some overscheduling indeed.
>
>
Andrea, this isn't needed because when the queue falls below
the batch limit, every retiring request will do a wake up and
another request will be put on (as long as the waitqueue is
active).
On Wed, 2003-06-11 at 21:04, Nick Piggin wrote:
> Andrea Arcangeli wrote:
>
> >On Wed, Jun 11, 2003 at 02:27:13PM -0400, Chris Mason wrote:
> >
> >>On Wed, 2003-06-11 at 14:12, Andrea Arcangeli wrote:
> >>
> >>>On Wed, Jun 11, 2003 at 01:42:41PM -0400, Chris Mason wrote:
> >>>
> >>>>+ if (q->rq[rw].count >= q->batch_requests) {
> >>>>+ smp_mb();
> >>>>+ if (waitqueue_active(&q->wait_for_requests[rw]))
> >>>>+ wake_up(&q->wait_for_requests[rw]);
> >>>>
> >>>in my tree I also changed this to:
> >>>
> >>> wake_up_nr(&q->wait_for_requests[rw], q->rq[rw].count);
> >>>
> >>>otherwise only one waiter will eat the requests, while multiple waiters
> >>>can eat requests in parallel instead because we freed not just 1 request
> >>>but many of them.
> >>>
> >>I tried a few variations of this yesterday and they all led to horrible
> >>latencies, but I couldn't really explain why. I had a bunch of other
> >>
> >
> >the I/O latency in theory shouldn't change, we're not reordering the
> >queue at all, they'll go to sleep immediatly again if __get_request
> >returns null.
> >
>
> And go to the end of the queue?
>
This got dragged into private mail for a few messages, but we figured
out the problem comes down to scheduling fairness with wake_up_nr().
32 procs might get woken, but when the first of those procs gets a
request, he'll wake another, and so on. But there's no promise that
getting woken fairly means you'll get scheduled fairly, so you might
not get scheduled in for quite a while, perhaps even after newer
waiters have gone onto the wait queue and gotten woken up.
The real problem is get_request_wait_wakeup, andrea is working on a few
changes to that.
-chris
On Thu, Jun 12, 2003 at 11:04:42AM +1000, Nick Piggin wrote:
>
>
> Andrea Arcangeli wrote:
>
> >On Wed, Jun 11, 2003 at 02:27:13PM -0400, Chris Mason wrote:
> >
> >>On Wed, 2003-06-11 at 14:12, Andrea Arcangeli wrote:
> >>
> >>>On Wed, Jun 11, 2003 at 01:42:41PM -0400, Chris Mason wrote:
> >>>
> >>>>+ if (q->rq[rw].count >= q->batch_requests) {
> >>>>+ smp_mb();
> >>>>+ if (waitqueue_active(&q->wait_for_requests[rw]))
> >>>>+ wake_up(&q->wait_for_requests[rw]);
> >>>>
> >>>in my tree I also changed this to:
> >>>
> >>> wake_up_nr(&q->wait_for_requests[rw],
> >>> q->rq[rw].count);
> >>>
> >>>otherwise only one waiter will eat the requests, while multiple waiters
> >>>can eat requests in parallel instead because we freed not just 1 request
> >>>but many of them.
> >>>
> >>I tried a few variations of this yesterday and they all led to horrible
> >>latencies, but I couldn't really explain why. I had a bunch of other
> >>
> >
> >the I/O latency in theory shouldn't change, we're not reordering the
> >queue at all, they'll go to sleep immediatly again if __get_request
> >returns null.
> >
>
> And go to the end of the queue?
they stay in the queue, so they don't go to the end.
but as Chris found, since we have get_request_wait_wakeup, even waking
free-requests/2 isn't enough, since that will generate free-requests *
1.5 wakeups, where the last free-requests/2 (implicitly generated by
get_request_wait_wakeup) will become runnable and race with the other
tasks later woken by another request release.
I sort of fixed that by remembering an old suggestion from Andrew:
static void get_request_wait_wakeup(request_queue_t *q, int rw)
{
	/*
	 * avoid losing an unplug if a second __get_request_wait did the
	 * generic_unplug_device while our __get_request_wait was running
	 * w/o the queue_lock held and w/ our request out of the queue.
	 */
	if (waitqueue_active(&q->wait_for_requests))
		run_task_queue(&tq_disk);
}
this will prevent get_request_wait_wakeup from messing up the wakeups,
so we can wake_up_nr(rq.count) safely.
then there's the last issue raised by Chris: if requests get released
faster than the tasks can run, we can still end up with imperfect
fairness. My solution to that is to change wake_up to have nr_exclusive
not obey the try_to_wake_up retval. That should guarantee exact FIFO,
but it's a minor issue because requests shouldn't be released in a
systematic flood. So I'm leaving it open for now; the ones already
addressed should be the major issues.
> >>stuff in at the time to try and improve throughput though, so I'll try
> >>it again.
> >>
> >>I think part of the problem is the cascading wakeups from
> >>get_request_wait_wakeup(). So if we wakeup 32 procs they in turn wakeup
> >>another 32, etc.
> >>
> >
> >so maybe it's enough to wakeup count / 2 to account for the double
> >wakeup? that will avoid some overscheduling indeed.
> >
> >
>
> Andrea, this isn't needed because when the queue falls below
actually the problem is that it isn't enough, not that it isn't needed.
I had to stop get_request_wait_wakeup from messing up the wakeups, so
now I can go back to doing /2.
> the batch limit, every retiring request will do a wake up and
> another request will be put on (as long as the waitqueue is
> active).
this was my argument for doing /2, but that will lead to count * 1.5
wakeups, where the last count/2 will race with further wakeups,
screwing the FIFO ordering. As said, that's completely fixed now; the
last issue is the one where a flood of request releases outpaces the
tasks becoming runnable, but it's hopefully a minor issue (I'm not
going to change how wake_up_nr works right now, maybe later).
Andrea
On Thu, Jun 12, 2003 at 03:29:51AM +0200, Andrea Arcangeli wrote:
> static void get_request_wait_wakeup(request_queue_t *q, int rw)
> {
> /*
> * avoid losing an unplug if a second __get_request_wait did the
> * generic_unplug_device while our __get_request_wait was
> * running
> * w/o the queue_lock held and w/ our request out of the queue.
> */
> if (waitqueue_active(&q->wait_for_requests))
> run_task_queue(&tq_disk);
btw, that was the old version, Chris did it right
s/run_task_queue(&tq_disk)/__generic_unplug_device(q)/
Andrea
On Wed, 2003-06-11 at 21:29, Andrea Arcangeli wrote:
> this will avoid get_request_wait_wakeup to mess the wakeup, so we can
> wakep_nr(rq.count) safely.
>
> then there's the last issue raised by Chris, that is if we get request
> released faster than the tasks can run, still we can generate a not
> perfect fairness. My solution to that is to change wake_up to have a
> nr_exclusive not obeying to the try_to_wakeup retval. that should
> guarantee exact FIFO then, but it's a minor issue because the requests
> shouldn't be released systematically in a flood. So I'm leaving it
> opened for now, the others already addressed should be the major ones.
I think the only time we really need to wakeup more than one waiter is
when we hit the q->batch_request mark. After that, each new request
that is freed can be matched with a single waiter, and we know that any
previously finished requests have probably already been matched to their
own waiter.
-chris
Chris Mason wrote:
>On Wed, 2003-06-11 at 21:29, Andrea Arcangeli wrote:
>
>
>>this will avoid get_request_wait_wakeup to mess the wakeup, so we can
>>wakep_nr(rq.count) safely.
>>
>>then there's the last issue raised by Chris, that is if we get request
>>released faster than the tasks can run, still we can generate a not
>>perfect fairness. My solution to that is to change wake_up to have a
>>nr_exclusive not obeying to the try_to_wakeup retval. that should
>>guarantee exact FIFO then, but it's a minor issue because the requests
>>shouldn't be released systematically in a flood. So I'm leaving it
>>opened for now, the others already addressed should be the major ones.
>>
>
>I think the only time we really need to wakeup more than one waiter is
>when we hit the q->batch_request mark. After that, each new request
>that is freed can be matched with a single waiter, and we know that any
>previously finished requests have probably already been matched to their
>own waiter.
>
>
Nope. Not even then. Each retiring request should submit
a wake up, and the process will submit another request.
So the number of requests will be held at the batch_request
mark until there are no more waiters.
Now that raises the question, why have batch_requests anymore?
It no longer does anything.
On Thu, Jun 12, 2003 at 12:41:58PM +1000, Nick Piggin wrote:
>
>
> Chris Mason wrote:
>
> >On Wed, 2003-06-11 at 21:29, Andrea Arcangeli wrote:
> >
> >
> >>this will avoid get_request_wait_wakeup to mess the wakeup, so we can
> >>wakep_nr(rq.count) safely.
> >>
> >>then there's the last issue raised by Chris, that is if we get request
> >>released faster than the tasks can run, still we can generate a not
> >>perfect fairness. My solution to that is to change wake_up to have a
> >>nr_exclusive not obeying to the try_to_wakeup retval. that should
> >>guarantee exact FIFO then, but it's a minor issue because the requests
> >>shouldn't be released systematically in a flood. So I'm leaving it
> >>opened for now, the others already addressed should be the major ones.
> >>
> >
> >I think the only time we really need to wakeup more than one waiter is
> >when we hit the q->batch_request mark. After that, each new request
> >that is freed can be matched with a single waiter, and we know that any
> >previously finished requests have probably already been matched to their
> >own waiter.
> >
> >
> Nope. Not even then. Each retiring request should submit
> a wake up, and the process will submit another request.
> So the number of requests will be held at the batch_request
> mark until no more waiters.
>
> Now that begs the question, why have batch_requests anymore?
> It no longer does anything.
it does nothing w/ _exclusive and w/o the wake_up_nr, that's why I added
the wake_up_nr.
Andrea
Andrea Arcangeli wrote:
>On Thu, Jun 12, 2003 at 12:41:58PM +1000, Nick Piggin wrote:
>
>>
>>Chris Mason wrote:
>>
>>
>>>On Wed, 2003-06-11 at 21:29, Andrea Arcangeli wrote:
>>>
>>>
>>>
>>>>this will avoid get_request_wait_wakeup to mess the wakeup, so we can
>>>>wakep_nr(rq.count) safely.
>>>>
>>>>then there's the last issue raised by Chris, that is if we get request
>>>>released faster than the tasks can run, still we can generate a not
>>>>perfect fairness. My solution to that is to change wake_up to have a
>>>>nr_exclusive not obeying to the try_to_wakeup retval. that should
>>>>guarantee exact FIFO then, but it's a minor issue because the requests
>>>>shouldn't be released systematically in a flood. So I'm leaving it
>>>>opened for now, the others already addressed should be the major ones.
>>>>
>>>>
>>>I think the only time we really need to wakeup more than one waiter is
>>>when we hit the q->batch_request mark. After that, each new request
>>>that is freed can be matched with a single waiter, and we know that any
>>>previously finished requests have probably already been matched to their
>>>own waiter.
>>>
>>>
>>>
>>Nope. Not even then. Each retiring request should submit
>>a wake up, and the process will submit another request.
>>So the number of requests will be held at the batch_request
>>mark until no more waiters.
>>
>>Now that begs the question, why have batch_requests anymore?
>>It no longer does anything.
>>
>
>it does nothing w/ _exclusive and w/o the wake_up_nr, that's why I added
>the wake_up_nr.
>
>
That is pretty pointless as well. You might as well just start
waking up at the queue full limit, and wake one at a time.
The purpose for batch_requests was I think for devices with a
very small request size, to reduce context switches.
>Andrea
>
>
>
Nick Piggin wrote:
>
>
> Andrea Arcangeli wrote:
>
>> On Thu, Jun 12, 2003 at 12:41:58PM +1000, Nick Piggin wrote:
>>
>>>
>>> Chris Mason wrote:
>>>
>>>
>>>> On Wed, 2003-06-11 at 21:29, Andrea Arcangeli wrote:
>>>>
>>>>
>>>>
>>>>> this will avoid get_request_wait_wakeup to mess the wakeup, so we can
>>>>> wakep_nr(rq.count) safely.
>>>>>
>>>>> then there's the last issue raised by Chris, that is if we get
>>>>> request
>>>>> released faster than the tasks can run, still we can generate a not
>>>>> perfect fairness. My solution to that is to change wake_up to have a
>>>>> nr_exclusive not obeying to the try_to_wakeup retval. that should
>>>>> guarantee exact FIFO then, but it's a minor issue because the
>>>>> requests
>>>>> shouldn't be released systematically in a flood. So I'm leaving it
>>>>> opened for now, the others already addressed should be the major
>>>>> ones.
>>>>>
>>>>>
>>>> I think the only time we really need to wakeup more than one waiter is
>>>> when we hit the q->batch_request mark. After that, each new request
>>>> that is freed can be matched with a single waiter, and we know that
>>>> any
>>>> previously finished requests have probably already been matched to
>>>> their
>>>> own waiter.
>>>>
>>>>
>>>>
>>> Nope. Not even then. Each retiring request should submit
>>> a wake up, and the process will submit another request.
>>> So the number of requests will be held at the batch_request
>>> mark until no more waiters.
>>>
>>> Now that begs the question, why have batch_requests anymore?
>>> It no longer does anything.
>>>
>>
>> it does nothing w/ _exclusive and w/o the wake_up_nr, that's why I added
>> the wake_up_nr.
>>
>>
> That is pretty pointless as well. You might as well just start
> waking up at the queue full limit, and wake one at a time.
>
> The purpose for batch_requests was I think for devices with a
> very small request size, to reduce context switches.
I guess you could fix this by having a "last woken" flag, and
allow that process to allocate requests without blocking from
the batch limit until the queue full limit. That is how
batch_requests is supposed to work.
Nick Piggin wrote:
>
> I guess you could fix this by having a "last woken" flag, and
> allow that process to allocate requests without blocking from
> the batch limit until the queue full limit. That is how
> batch_requests is supposed to work.
s/flag/pid maybe?
On Thu, Jun 12, 2003 at 12:49:46PM +1000, Nick Piggin wrote:
>
>
> Andrea Arcangeli wrote:
>
> >On Thu, Jun 12, 2003 at 12:41:58PM +1000, Nick Piggin wrote:
> >
> >>
> >>Chris Mason wrote:
> >>
> >>
> >>>On Wed, 2003-06-11 at 21:29, Andrea Arcangeli wrote:
> >>>
> >>>
> >>>
> >>>>this will avoid get_request_wait_wakeup to mess the wakeup, so we can
> >>>>wakep_nr(rq.count) safely.
> >>>>
> >>>>then there's the last issue raised by Chris, that is if we get request
> >>>>released faster than the tasks can run, still we can generate a not
> >>>>perfect fairness. My solution to that is to change wake_up to have a
> >>>>nr_exclusive not obeying to the try_to_wakeup retval. that should
> >>>>guarantee exact FIFO then, but it's a minor issue because the requests
> >>>>shouldn't be released systematically in a flood. So I'm leaving it
> >>>>opened for now, the others already addressed should be the major ones.
> >>>>
> >>>>
> >>>I think the only time we really need to wakeup more than one waiter is
> >>>when we hit the q->batch_request mark. After that, each new request
> >>>that is freed can be matched with a single waiter, and we know that any
> >>>previously finished requests have probably already been matched to their
> >>>own waiter.
> >>>
> >>>
> >>>
> >>Nope. Not even then. Each retiring request should submit
> >>a wake up, and the process will submit another request.
> >>So the number of requests will be held at the batch_request
> >>mark until no more waiters.
> >>
> >>Now that begs the question, why have batch_requests anymore?
> >>It no longer does anything.
> >>
> >
> >it does nothing w/ _exclusive and w/o the wake_up_nr, that's why I added
> >the wake_up_nr.
> >
> >
> That is pretty pointless as well. You might as well just start
> waking up at the queue full limit, and wake one at a time.
>
> The purpose for batch_requests was I think for devices with a
> very small request size, to reduce context switches.
batch_requests, at least in my tree, matters only when each request is 512 bytes and you've got some thousands of them composing a 4M queue or so. To maximize cpu cache usage etc., I try not to wake up a task every 512 bytes written, but only every 32*512 bytes written or so. Of course w/o the wake_up_nr that I added, that wasn't really working w/ the _exclusive wakeup.
if you check my tree you'll see that for sequential I/O with 512k in each request (not 512 bytes!) batch_requests is already a noop.
Andrea
On Thu, Jun 12, 2003 at 12:51:30PM +1000, Nick Piggin wrote:
> I guess you could fix this by having a "last woken" flag, and
> allow that process to allocate requests without blocking from
> the batch limit until the queue full limit. That is how
> batch_requests is supposed to work.
I see what you mean. I did care about the case of each request belonging to a different task, but of course this doesn't work if there's just one task. In that case there will be a single wakeup for each request, so the task won't be able to eat all the requests and it'll keep hanging on the ->full bitflag. So yes, the ->full bit partly disabled batch_sectors in the presence of only one task. With multiple tasks and the wake_up_nr, batch_sectors will still work. However I don't care about that right now ;), it's a minor issue I guess; single-task I/O normally doesn't seek heavily, so more likely it will run into the oversized queue before being able to take advantage of batch_sectors.
Andrea
Andrea Arcangeli wrote:
>On Thu, Jun 12, 2003 at 12:49:46PM +1000, Nick Piggin wrote:
>
>>
>>Andrea Arcangeli wrote:
>>
>>>it does nothing w/ _exclusive and w/o the wake_up_nr, that's why I added
>>>the wake_up_nr.
>>>
>>>
>>>
>>That is pretty pointless as well. You might as well just start
>>waking up at the queue full limit, and wake one at a time.
>>
>>The purpose for batch_requests was I think for devices with a
>>very small request size, to reduce context switches.
>>
>
>batch_requests at least in my tree matters only when each request is
>512btyes and you've some thousand of them to compose a 4M queue or so.
>To maximize cpu cache usage etc.. I try to wakeup a task every 512bytes
>written, but every 32*512bytes written or so. Of course w/o the
>wake_up_nr that I added, that wasn't really working w/ the _exlusive
>wakeup.
>
>if you check my tree you'll see that for sequential I/O with 512k in
>each request (not 512bytes!) batch_requests is already a noop.
>
You are waking up multiple tasks which will each submit
1 request. You want to be waking up 1 task which will
submit multiple requests - that is how you will save
context switches, cpu cache, etc, and that task's requests
will have a much better chance of being merged, or at
least serviced as a nice batch than unrelated tasks.
On Thu, Jun 12, 2003 at 01:04:27PM +1000, Nick Piggin wrote:
>
>
> Andrea Arcangeli wrote:
>
> >On Thu, Jun 12, 2003 at 12:49:46PM +1000, Nick Piggin wrote:
> >
> >>
> >>Andrea Arcangeli wrote:
> >>
> >>>it does nothing w/ _exclusive and w/o the wake_up_nr, that's why I added
> >>>the wake_up_nr.
> >>>
> >>>
> >>>
> >>That is pretty pointless as well. You might as well just start
> >>waking up at the queue full limit, and wake one at a time.
> >>
> >>The purpose for batch_requests was I think for devices with a
> >>very small request size, to reduce context switches.
> >>
> >
> >batch_requests at least in my tree matters only when each request is
> >512btyes and you've some thousand of them to compose a 4M queue or so.
> >To maximize cpu cache usage etc.. I try to wakeup a task every 512bytes
> >written, but every 32*512bytes written or so. Of course w/o the
> >wake_up_nr that I added, that wasn't really working w/ the _exlusive
> >wakeup.
> >
> >if you check my tree you'll see that for sequential I/O with 512k in
> >each request (not 512bytes!) batch_requests is already a noop.
> >
>
>
> You are waking up multiple tasks which will each submit
> 1 request. You want to be waking up 1 task which will
> submit multiple requests - that is how you will save
> context switches, cpu cache, etc, and that task's requests
> will have a much better chance of being merged, or at
> least serviced as a nice batch than unrelated tasks.
for fairness reasons, if there are multiple tasks I want to wake them all and let the others be able to eat requests before the first allocates all the batch_sectors. So the current code is fine and batch_sectors still works with multiple tasks queued in the waitqueue; it still makes sense to wake more than one of them at the same time to improve cpu utilization (even though they're different tasks; for instance we take the waitqueue spinlocks less frequently etc.).
What we disabled is only the batch_sectors in function of a single task. So if for example there's just one task, we could let it go, but it's quite a special case; if for example there were two tasks, we wouldn't want to let them go ahead (unless we could distribute exactly count/2 requests to each task w/o reentering __get_request_wait, which is unlikely). So the current code looks ok to me, with the wake_up_nr to take advantage of the batch_sectors across different tasks, still w/o penalizing fairness.
Andrea
Andrea Arcangeli wrote:
>On Thu, Jun 12, 2003 at 01:04:27PM +1000, Nick Piggin wrote:
>
>>
>>Andrea Arcangeli wrote:
>>
>>
>>>On Thu, Jun 12, 2003 at 12:49:46PM +1000, Nick Piggin wrote:
>>>
>>>
>>>>Andrea Arcangeli wrote:
>>>>
>>>>
>>>>>it does nothing w/ _exclusive and w/o the wake_up_nr, that's why I added
>>>>>the wake_up_nr.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>That is pretty pointless as well. You might as well just start
>>>>waking up at the queue full limit, and wake one at a time.
>>>>
>>>>The purpose for batch_requests was I think for devices with a
>>>>very small request size, to reduce context switches.
>>>>
>>>>
>>>batch_requests at least in my tree matters only when each request is
>>>512btyes and you've some thousand of them to compose a 4M queue or so.
>>>To maximize cpu cache usage etc.. I try to wakeup a task every 512bytes
>>>written, but every 32*512bytes written or so. Of course w/o the
>>>wake_up_nr that I added, that wasn't really working w/ the _exlusive
>>>wakeup.
>>>
>>>if you check my tree you'll see that for sequential I/O with 512k in
>>>each request (not 512bytes!) batch_requests is already a noop.
>>>
>>>
>>
>>You are waking up multiple tasks which will each submit
>>1 request. You want to be waking up 1 task which will
>>submit multiple requests - that is how you will save
>>context switches, cpu cache, etc, and that task's requests
>>will have a much better chance of being merged, or at
>>least serviced as a nice batch than unrelated tasks.
>>
>
>for fairness reasons if there are multiple tasks, I want to wake them
>all and let the others be able to eat requests before the first
>allocates all the batch_sectors. So the current code is fine and
>batch_sectors still works fine with multiple tasks queued in the
>waitqueue, it still makes sense to wake more than one of them at the
>same time to improve cpu utilization (regardless they're different
>tasks, for istance we take less frequently the waitqueue spinlocks
>etc..).
>
It's no less fair this way; tasks will still be woken in fifo order. They will just be given the chance to submit a batch of requests.
I think the cpu utilization gain of waking a number of tasks
at once would be outweighed by advantage of waking 1 task
and not putting it to sleep again for a number of requests.
You obviously are not claiming concurrency improvements, as
your method would also increase contention on the io lock
(or the queue lock in 2.5).
Then you have the cache gains of running each task for a
longer period of time. You also get possible IO scheduling
improvements.
Consider 8 requests, batch_requests at 4, 10 tasks writing
to different areas of disk.
Your method still only allows each task to have 1 request in
the elevator at once. Mine allows each to have a run of 4
requests in the elevator.
On Thu, Jun 12, 2003 at 01:20:44PM +1000, Nick Piggin wrote:
> Its no less fair this way, tasks will still be woken in fifo
> order. They will just be given the chance to submit a batch
> of requests.
If you change the behaviour with queued_task_nr > batch_requests it is less fair, period. Anything else I don't care about right now because it is a minor cpu improvement anyways.
I'm not talking about performance, I'm talking about latency and fairness only. This is the whole point of the ->full logic.
> I think the cpu utilization gain of waking a number of tasks
> at once would be outweighed by advantage of waking 1 task
> and not putting it to sleep again for a number of requests.
> You obviously are not claiming concurrency improvements, as
> your method would also increase contention on the io lock
> (or the queue lock in 2.5).
I'm claiming that with queued_task_nr > batch_requests the batch_requests logic still has a chance to save some cpu; this is the only reason I didn't nuke it completely as you suggested a few emails ago.
> Then you have the cache gains of running each task for a
> longer period of time. You also get possible IO scheduling
> improvements.
>
> Consider 8 requests, batch_requests at 4, 10 tasks writing
> to different areas of disk.
>
> Your method still only allows each task to have 1 request in
> the elevator at once. Mine allows each to have a run of 4
> requests in the elevator.
I definitely want 1 request per task in the elevator at once, or we may as well drop your ->full and go back to being unfair. The whole point of ->full is to get total fairness, across the tasks in the wait queue, and for tasks outside the queue calling get_request too. Since not all tasks will fit in the I/O queue, providing a very fair FIFO in wait_for_request is fundamental to providing any sort of latency guarantee IMHO (the fact that removing the _exclusive wakeup, which mixes things up and probably has the side effect of being more fair, made that much difference to mainline users kind of confirms that).
Andrea
Andrea Arcangeli wrote:
>On Thu, Jun 12, 2003 at 01:20:44PM +1000, Nick Piggin wrote:
>
>>Its no less fair this way, tasks will still be woken in fifo
>>order. They will just be given the chance to submit a batch
>>of requests.
>>
>
>If you change the behaviour with queued_task_nr > batch_requests it is
>less fair period. Whatever else thing I don't care about right now
>because it is a minor cpu improvement anyways.
>
>I'm not talking about performance, I'm talking about latency and
>fariness only. This is the whole point of the ->full logic.
>
I say each task getting 8 requests at a time is as fair
as each getting 1 request at a time. Yes, you may get a
worse latency, but _fairness_ is the same.
>
>>I think the cpu utilization gain of waking a number of tasks
>>at once would be outweighed by advantage of waking 1 task
>>and not putting it to sleep again for a number of requests.
>>You obviously are not claiming concurrency improvements, as
>>your method would also increase contention on the io lock
>>(or the queue lock in 2.5).
>>
>
>I'm claiming that with queued_task_nr > batch_requests the
>batch_requests logic still has a chance to save some cpu, this is the
>only reason I didn't nuke it completely as you suggested some email ago.
>
Well I'm not so sure that your method will do a great deal
of good. On SMP you would increase contention on the spinlock.
IMO it would be better to serialise them on the waitqueue
instead of a spinlock seeing as they are already sleeping.
>
>>Then you have the cache gains of running each task for a
>>longer period of time. You also get possible IO scheduling
>>improvements.
>>
>>Consider 8 requests, batch_requests at 4, 10 tasks writing
>>to different areas of disk.
>>
>>Your method still only allows each task to have 1 request in
>>the elevator at once. Mine allows each to have a run of 4
>>requests in the elevator.
>>
>
>I definitely want 1 request in the elevator at once or we can as well
>drop your ->full and return to be unfair. The whole point of ->full is
>to get the total fariness, across the tasks in the queue queue, and for
>tasks outside the queue calling get_request too. Since not all tasks
>will fit in the I/O queue, providing a very fair FIFO in the
>wait_for_request is fundamental to provide any sort of latency
>guarantee IMHO (the fact an _exclusive wakeup removal that mixes stuff
>and probably has the side effect of being more fair, made that much
>difference to mainline users kind of confirms that).
>
>
I think we'll just have to agree to disagree here. This sort of thing came up in our CFQ discussion as well. It's not that I think your way is without merits, but I think in an overload situation it's a better aim to attempt to keep throughput up rather than to provide the lowest possible latency.
On Thu, Jun 12, 2003 at 01:48:04PM +1000, Nick Piggin wrote:
>
>
> Andrea Arcangeli wrote:
>
> >On Thu, Jun 12, 2003 at 01:20:44PM +1000, Nick Piggin wrote:
> >
> >>Its no less fair this way, tasks will still be woken in fifo
> >>order. They will just be given the chance to submit a batch
> >>of requests.
> >>
> >
> >If you change the behaviour with queued_task_nr > batch_requests it is
> >less fair period. Whatever else thing I don't care about right now
> >because it is a minor cpu improvement anyways.
> >
> >I'm not talking about performance, I'm talking about latency and
> >fariness only. This is the whole point of the ->full logic.
> >
>
> I say each task getting 8 requests at a time is as fair
> as each getting 1 request at a time. Yes, you may get a
> worse latency, but _fairness_ is the same.
It is the worse latency that is the problem, of course. Fairness in this case isn't affected (assuming you would implement the batch_sectors fairly), but latency would definitely be affected.
> Well I'm not so sure that your method will do a great deal
> of good. On SMP you would increase contention on the spinlock.
> IMO it would be better to serialise them on the waitqueue
> instead of a spinlock seeing as they are already sleeping.
I think the worst part is the cacheline bouncing, but that might happen anyways under load. At least it certainly makes sense for UP, or if you're lucky and the tasks run serially (possible if all cpus are running).
> I think we'll just have to agree to disagree here. This
> sort of thing came up in our CFQ discussion as well. Its
> not that I think your way is without merits, but I think
> in an overload situtation its a better aim to attempt to
> keep throughput up rather than attempt to provide the
> lowest possible latency.
Those changes (like the CFQ I/O scheduler in 2.5) are being developed mostly due to the latency complaints we get as feedback on l-k. That's why I care about latency first here. But we have to care about throughput too of course. This isn't CFQ; it just tries to provide new requests to different tasks with the minimal possible latency, which in turn also guarantees fairness. That sounds like a good default to me, especially when I see the removal of the _exclusive wakeup in mainline taken as a major fairness/latency improvement.
Andrea
Andrea Arcangeli wrote:
>On Thu, Jun 12, 2003 at 01:48:04PM +1000, Nick Piggin wrote:
>
>>
>>Andrea Arcangeli wrote:
>>
>>
>>>On Thu, Jun 12, 2003 at 01:20:44PM +1000, Nick Piggin wrote:
>>>
>>>
>>>>Its no less fair this way, tasks will still be woken in fifo
>>>>order. They will just be given the chance to submit a batch
>>>>of requests.
>>>>
>>>>
>>>If you change the behaviour with queued_task_nr > batch_requests it is
>>>less fair period. Whatever else thing I don't care about right now
>>>because it is a minor cpu improvement anyways.
>>>
>>>I'm not talking about performance, I'm talking about latency and
>>>fariness only. This is the whole point of the ->full logic.
>>>
>>>
>>I say each task getting 8 requests at a time is as fair
>>as each getting 1 request at a time. Yes, you may get a
>>worse latency, but _fairness_ is the same.
>>
>
>It is the worse latency that is the problem of course. Fariness in this
>case isn't affected (assuming you would write the batch_sectors fair),
>but latency would definitely be affected.
>
Yep.
>
>>Well I'm not so sure that your method will do a great deal
>>of good. On SMP you would increase contention on the spinlock.
>>IMO it would be better to serialise them on the waitqueue
>>instead of a spinlock seeing as they are already sleeping.
>>
>
>I think the worse part is the cacheline bouncing, but that might happen
>anyways under load. at least certainly it makes sense for UP or if
>you're lucky and the tasks run serially (possible if all cpus are
>running).
>
>
>>I think we'll just have to agree to disagree here. This
>>sort of thing came up in our CFQ discussion as well. Its
>>not that I think your way is without merits, but I think
>>in an overload situtation its a better aim to attempt to
>>keep throughput up rather than attempt to provide the
>>lowest possible latency.
>>
>
>Those changes (like the CFQ I/O scheduler in 2.5) are being developed
>mostly due the latency complains we get as feedback on l-k. That's why
>I care about latency first here. But we've to care about throughput too
>of course. This isn't CFQ, it just tries to provide new requests to
>different tasks with the minimal possible latency which in turn also
>guarantees fariness. That sounds a good default to me, especially when I
>see the removal of the _exclusive wakeup in mainline taken as a major
>fariness/latency improvement.
>
Throughput vs latency is always difficult I guess. In this
case, I think when there are few waiters, then latency should
not be much worse. When there are a lot of waiters it is
probably not an interactive load to start with and throughput
is more important.
Anyway, the ideas you are following are interesting and
worthwhile, so we'll each try our own thing :)
On Wed, 2003-06-11 at 22:41, Nick Piggin wrote:
> >I think the only time we really need to wakeup more than one waiter is
> >when we hit the q->batch_request mark. After that, each new request
> >that is freed can be matched with a single waiter, and we know that any
> >previously finished requests have probably already been matched to their
> >own waiter.
> >
> >
> Nope. Not even then. Each retiring request should submit
> a wake up, and the process will submit another request.
> So the number of requests will be held at the batch_request
> mark until no more waiters.
>
> Now that begs the question, why have batch_requests anymore?
> It no longer does anything.
>
We've got many flavors of the patch discussed in this thread, so this needs a little qualification. When get_request_wait_wakeup wakes one of the waiters (as in the patch I sent yesterday), you want to make sure that after you wake the first waiter there is a request available for the process he is going to wake up, and so on for each other waiter.
I did a quick test of this yesterday, and under the 20 proc iozone test, turning off batch_requests more than doubled the number of context switches hit during the run; I'm assuming this was from wakeups that failed to find requests.
I'm doing a few tests with Andrea's new get_request_wait_wakeup ideas
and wake_up_nr.
-chris
On Wed, 2003-06-11 at 23:20, Nick Piggin wrote:
>
> I think the cpu utilization gain of waking a number of tasks
> at once would be outweighed by advantage of waking 1 task
> and not putting it to sleep again for a number of requests.
> You obviously are not claiming concurrency improvements, as
> your method would also increase contention on the io lock
> (or the queue lock in 2.5).
I've been trying variations on this for a few days, none have been
thrilling but the end result is better dbench and iozone throughput
overall. For the 20 writer iozone test, rc7 got an average throughput
of 3MB/s, and yesterday's latency patch got 500k/s or so. Ouch.
This gets us up to 1.2MB/s. I'm keeping yesterday's
get_request_wait_wake, which wakes up a waiter instead of unplugging.
The basic idea here is that after a process is woken up and grabs a
request, he becomes the batch owner. Batch owners get to ignore the
q->full flag for either 1/5 second or 32 requests, whichever comes
first. The timer part is an attempt at preventing memory pressure
writers (who go 1 req at a time) from holding onto batch ownership for
too long. Latency stats after dbench 50:
device 08:01: num_req 120077, total jiffies waited 663231
65538 forced to wait
1 min wait, 175 max wait
10 average wait
65296 < 100, 242 < 200, 0 < 300, 0 < 400, 0 < 500
0 waits longer than 500 jiffies
Good latency system wide comes from fair waiting, but it also comes from
how fast we can run write_some_buffers(), since that is the unit of
throttling. Hopefully this patch decreases the time it takes for
write_some_buffers over the past latency patches, or gives someone else
a better idea ;-)
Attached is an incremental over yesterday's io-stalls-5.diff.
-chris
Chris Mason wrote:
>On Wed, 2003-06-11 at 23:20, Nick Piggin wrote:
>
>
>>I think the cpu utilization gain of waking a number of tasks
>>at once would be outweighed by advantage of waking 1 task
>>and not putting it to sleep again for a number of requests.
>>You obviously are not claiming concurrency improvements, as
>>your method would also increase contention on the io lock
>>(or the queue lock in 2.5).
>>
>
>I've been trying variations on this for a few days, none have been
>thrilling but the end result is better dbench and iozone throughput
>overall. For the 20 writer iozone test, rc7 got an average throughput
>of 3MB/s, and yesterdays latency patch got 500k/s or so. Ouch.
>
>This gets us up to 1.2MB/s. I'm keeping yesterday's
>get_request_wait_wake, which wakes up a waiter instead of unplugging.
>
>The basic idea here is that after a process is woken up and grabs a
>request, he becomes the batch owner. Batch owners get to ignore the
>q->full flag for either 1/5 second or 32 requests, whichever comes
>first. The timer part is an attempt at preventing memory pressure
>writers (who go 1 req at a time) from holding onto batch ownership for
>too long. Latency stats after dbench 50:
>
Yeah, I get ~50% more throughput and up to 20% better CPU
efficiency on tiobench 256 for sequential and random
writers by doing something similar.
Hello all,
[ short version, the patch attached should fix io latencies in 2.4.21.
Please review and/or give it a try ]
My last set of patches was directed at reducing the latencies in
__get_request_wait, which really helped reduce stalls when you had lots
of io to one device and balance_dirty() was causing pauses while you
tried to do io to other devices.
But, a streaming write could still starve reads to the same device,
mostly because the read would have to send down any huge merged writes
that were before it in the queue.
Andrea's kernel has a fix for this too, he limits the total number of
sectors that can be in the request queue at any given time. But, his
patches change blk_finished_io, both in the arguments it takes and the
side effects of calling it. I don't think we can merge his current form
without breaking external drivers.
So, I added a can_throttle flag to the queue struct; drivers can enable it if they are going to call the new blk_started_sectors and blk_finished_sectors funcs any time they call blk_{started,finished}_io, and these do all the -aa style sector throttling.
There were a few other small changes to Andrea's patch, he wasn't
setting q->full when get_request decided there were too many sectors in
flight. This resulted in large latencies in __get_request_wait. He was
also unconditionally clearing q->full in blkdev_release_request, my code
only clears q->full when all the waiters are gone.
I changed generic_unplug_device to zero the elevator_sequence field of
the last request on the queue. This means there won't be any merges
with requests pending once an unplug is done, and helps limit the number
of sectors that need to be sent down during the run_task_queue(&tq_disk)
in wait_on_buffer.
I lowered the -aa default limit on sectors in flight from 4MB to 2MB.
We probably want an elvtune for it, large arrays with writeback cache
should be able to tolerate larger values.
There's still a little work left to do: this patch enables sector throttling for scsi and IDE; cciss, DAC960 and cpqarray need modification too (99% done already in -aa). No sense in doing that until after the bulk of the patch is reviewed though.
As before, most of the code here is from Andrea and Nick, I've just
wrapped a lot of duct tape around it and done some tweaking. The
primary pieces are:
fix-pausing (andrea, corner cases where wakeups are missed)
elevator-low-latency (andrea, limit sectors in flight)
queue_full (Nick, fairness in __get_request_wait)
I've removed my latency stats for __get_request_wait in hopes of making
it a better merging candidate.
-chris
On Wed, Jun 25, 2003 at 03:03:43PM -0400, Chris Mason wrote:
> Hello all,
>
> [ short version, the patch attached should fix io latencies in 2.4.21.
> Please review and/or give it a try ]
>
> My last set of patches was directed at reducing the latencies in
> __get_request_wait, which really helped reduce stalls when you had lots
> of io to one device and balance_dirty() was causing pauses while you
> tried to do io to other devices.
>
> But, a streaming write could still starve reads to the same device,
> mostly because the read would have to send down any huge merged writes
> that were before it in the queue.
>
> Andrea's kernel has a fix for this too, he limits the total number of
> sectors that can be in the request queue at any given time. But, his
> patches change blk_finished_io, both in the arguments it takes and the
> side effects of calling it. I don't think we can merge his current form
> without breaking external drivers.
>
> So, I added a can_throttle flag to the queue struct, drivers can enable
> it if they are going to call the new blk_started_sectors and
> blk_finished_sectors funcs any time they call blk_{started,finished}_io,
> and these do all the -aa style sector throttling.
>
> There were a few other small changes to Andrea's patch, he wasn't
> setting q->full when get_request decided there were too many sectors in
"wasn't" really belongs in the past: I'm doing it in 2.4.21rc8aa1 and
in my latest status.
> flight. This resulted in large latencies in __get_request_wait. He was
> also unconditionally clearing q->full in blkdev_release_request, my code
> only clears q->full when all the waiters are gone.
my current code including the older 2.4.21rc8aa1 does that too, merged
from your previous patches.
> I changed generic_unplug_device to zero the elevator_sequence field of
> the last request on the queue. This means there won't be any merges
> with requests pending once an unplug is done, and helps limit the number
> of sectors that need to be sent down during the run_task_queue(&tq_disk)
> in wait_on_buffer.
this sounds like a hack; forbidding merges is pretty bad for
performance in general. Of course most of the merging happens in between
the unplugs, but during heavy I/O with frequent unplugs from many
readers this may hurt performance. And as you said, this mostly has the
advantage of limiting the size of the queue, like I enforce in my tree
with elevator-lowlatency. I doubt this is right.
> I lowered the -aa default limit on sectors in flight from 4MB to 2MB.
I got a few complaints about performance slowdowns; originally it was
2MB, so I increased it to 4MB, which should be enough for most hardware.
> We probably want an elvtune for it, large arrays with writeback cache
> should be able to tolerate larger values.
Yes, it largely depends on the speed of the device.
> There's still a little work left to do, this patch enables sector
> throttling for scsi and IDE. cciss, DAC960 and cpqarray need
> modification too (99% done already in -aa). No sense in doing that
> until after the bulk of the patch is reviewed though.
>
> As before, most of the code here is from Andrea and Nick, I've just
> wrapped a lot of duct tape around it and done some tweaking. The
> primary pieces are:
>
> fix-pausing (andrea, corner cases where wakeups are missed)
> elevator-low-latency (andrea, limit sectors in flight)
> queue_full (Nick, fairness in __get_request_wait)
>
> I've removed my latency stats for __get_request_wait in hopes of making
> it a better merging candidate.
this is very similar to my status in -aa, except for the hack that
forbids merging (which I think is wrong), the fact that you're missing
the wake_up_nr that I added to give meaning to the batching again, and
the fact that you don't avoid the unplugs in get_request_wait_wakeup
until the queue is empty. I mean this:
+static void get_request_wait_wakeup(request_queue_t *q, int rw)
+{
+	/*
+	 * avoid losing an unplug if a second __get_request_wait did the
+	 * generic_unplug_device while our __get_request_wait was running
+	 * w/o the queue_lock held and w/ our request out of the queue.
+	 */
+	if (q->rq[rw].count == 0 && waitqueue_active(&q->wait_for_requests[rw]))
+		__generic_unplug_device(q);
+}
+
you said last week the above is racy and that it even hung your box,
could you elaborate? The above is in 2.4.21rc8aa1 and it works fine so
far (though the race in wait_for_request in particular has never been
known to be reproducible)
thanks,
Andrea
On Wed, 2003-06-25 at 15:25, Andrea Arcangeli wrote:
> > There were a few other small changes to Andrea's patch, he wasn't
> > setting q->full when get_request decided there were too many sectors in
>
> "wasn't" really belongs in the past: I'm doing it in 2.4.21rc8aa1 and
> in my latest status.
>
Hmm, I thought I grabbed the patch from rc8aa1, clearly not though,
sorry about that.
> > I changed generic_unplug_device to zero the elevator_sequence field of
> > the last request on the queue. This means there won't be any merges
> > with requests pending once an unplug is done, and helps limit the number
> > of sectors that need to be sent down during the run_task_queue(&tq_disk)
> > in wait_on_buffer.
>
> this sounds like a hack; forbidding merges is pretty bad for
> performance in general. Of course most of the merging happens in between
> the unplugs, but during heavy I/O with frequent unplugs from many
> readers this may hurt performance. And as you said, this mostly has the
> advantage of limiting the size of the queue, like I enforce in my tree
> with elevator-lowlatency. I doubt this is right.
>
Well, I would hit sysrq-t when I noticed read stalls, and the reader was
frequently in run_task_queue. I kept the hunk because it made a
noticeable difference. I agree there's a throughput tradeoff here, my
goal for the patch was to find the major places I could improve latency
and change them, then go back later and decide if each one was worth it.
Your elevator-lowlatency patch doesn't enforce sector limits for merged
requests, so a merger could constantly come in and steal space in the
sector limit from other waiters. This led to high latency in
__get_request_wait. That hunk for generic_unplug_device solves both of
those problems.
> > I lowered the -aa default limit on sectors in flight from 4MB to 2MB.
>
> I got a few complaints about performance slowdowns; originally it was
> 2MB, so I increased it to 4MB, which should be enough for most hardware.
>
I've no preference really. I didn't notice a throughput difference but
my scsi drives only have 2MB of cache.
> this is very similar to my status in -aa, except for the hack that
> forbids merging (which I think is wrong), the fact that you're missing
> the wake_up_nr that I added to give meaning to the batching again, and
> the fact that you don't avoid the unplugs in get_request_wait_wakeup
> until the queue is empty. I mean this:
>
> +static void get_request_wait_wakeup(request_queue_t *q, int rw)
> +{
> +	/*
> +	 * avoid losing an unplug if a second __get_request_wait did the
> +	 * generic_unplug_device while our __get_request_wait was running
> +	 * w/o the queue_lock held and w/ our request out of the queue.
> +	 */
> +	if (q->rq[rw].count == 0 && waitqueue_active(&q->wait_for_requests[rw]))
> +		__generic_unplug_device(q);
> +}
> +
>
> you said last week the above is racy and that it even hung your box,
> could you elaborate? The above is in 2.4.21rc8aa1 and it works fine so
> far (though the race in wait_for_request in particular has never been
> known to be reproducible)
It caused hangs/stalls, but I didn't have the sector throttling code at
the time and it really changes the interaction of things. I think the
hang went a little like this:
Let's say all the pending io is done, but the wait queue isn't empty yet
because all the waiting tasks haven't yet been scheduled in. Also, we
have fewer than nr_requests processes waiting to start io, so when they
do all get scheduled in they won't generate an unplug.
q->rq.count = q->nr_requests, q->full = 1
new io comes in, sees q->full = 1, unplugs and sleeps. No io is done
because the queue is empty.
All the old waiters finally get scheduled in and grab their requests,
but get_request_wait_wakeup doesn't unplug because q->rq.count != 0.
If no additional io comes in, the queue never gets unplugged, and our
waiter never gets woken.
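That sequence can be replayed as a deterministic toy model (all names invented), showing that the q->rq.count == 0 test in get_request_wait_wakeup never fires:

```c
#include <assert.h>

/*
 * Toy replay of the hang: 4 free requests, 3 old waiters, 1 new
 * arrival asleep behind q->full. The old waiters each grab a request
 * and submit io (plugging the queue), but q->count never reaches 0,
 * so the unplug check in get_request_wait_wakeup never triggers and
 * the new arrival sleeps forever.
 */
struct toy_q {
    int count;    /* free requests */
    int full;
    int plugged;
    int waiters;  /* tasks asleep in __get_request_wait */
};

static int toy_hang_scenario(void)
{
    struct toy_q q = { 4, 1, 0, 1 };   /* 1 waiter: the new arrival */
    int i;

    for (i = 0; i < 3; i++) {          /* old waiters scheduled in */
        q.count--;                     /* grab a request... */
        q.plugged = 1;                 /* ...and queue io behind the plug */
        if (q.count == 0 && q.waiters) /* get_request_wait_wakeup's test */
            q.plugged = 0;             /* never reached: count stays > 0 */
    }
    return q.plugged;                  /* 1 == queue still plugged: hang */
}
```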
With the sector throttling on, we've got additional wakeups coming from
blk_finished_io (or blk_finished_sectors in my patch). I kept out the
wake_up_nr idea because I couldn't figure out how to keep
__get_request_wait fair with it in.
-chris
Chris Mason wrote:
>Hello all,
>
>[ short version, the patch attached should fix io latencies in 2.4.21.
>Please review and/or give it a try ]
>
>My last set of patches was directed at reducing the latencies in
>__get_request_wait, which really helped reduce stalls when you had lots
>of io to one device and balance_dirty() was causing pauses while you
>tried to do io to other devices.
>
>But, a streaming write could still starve reads to the same device,
>mostly because the read would have to send down any huge merged writes
>that were before it in the queue.
>
>Andrea's kernel has a fix for this too, he limits the total number of
>sectors that can be in the request queue at any given time. But, his
>patches change blk_finished_io, both in the arguments it takes and the
>side effects of calling it. I don't think we can merge his current form
>without breaking external drivers.
>
>So, I added a can_throttle flag to the queue struct, drivers can enable
>it if they are going to call the new blk_started_sectors and
>blk_finished_sectors funcs any time they call blk_{started,finished}_io,
>and these do all the -aa style sector throttling.
>
>There were a few other small changes to Andrea's patch, he wasn't
>setting q->full when get_request decided there were too many sectors in
>flight. This resulted in large latencies in __get_request_wait. He was
>also unconditionally clearing q->full in blkdev_release_request, my code
>only clears q->full when all the waiters are gone.
>
>I changed generic_unplug_device to zero the elevator_sequence field of
>the last request on the queue. This means there won't be any merges
>with requests pending once an unplug is done, and helps limit the number
>of sectors that need to be sent down during the run_task_queue(&tq_disk)
>in wait_on_buffer.
>
>I lowered the -aa default limit on sectors in flight from 4MB to 2MB.
>We probably want an elvtune for it, large arrays with writeback cache
>should be able to tolerate larger values.
>
>There's still a little work left to do, this patch enables sector
>throttling for scsi and IDE. cciss, DAC960 and cpqarray need
>modification too (99% done already in -aa). No sense in doing that
>until after the bulk of the patch is reviewed though.
>
>As before, most of the code here is from Andrea and Nick, I've just
>wrapped a lot of duct tape around it and done some tweaking. The
>primary pieces are:
>
>fix-pausing (andrea, corner cases where wakeups are missed)
>elevator-low-latency (andrea, limit sectors in flight)
>queue_full (Nick, fairness in __get_request_wait)
>
I am hoping to go a slightly different way in 2.5 pending
inclusion of process io contexts. If you had time to look
over my changes there (in current mm tree) it would be
appreciated, but they don't help your problem for 2.4.
I found that my queue full fairness for 2.4 didn't address
the batching issue well. It does guarantee the lowest possible
maximum latency for singular requests, but due to lowered
throughput this can cause worse "high level" latency.
I couldn't find a really good, comprehensive method of
allowing processes to batch without resorting to very
complex wakeup methods unless process io contexts are used.
The other possibility would be to keep a list of "batching"
processes which should achieve the same as io contexts.
An easier approach would be to just allow the last woken
process to submit a batch of requests. This wouldn't have
as good guaranteed fairness, but that's not to say it would
have starvation issues. I'll help you implement it if you
are interested.
On Thu, 2003-06-26 at 01:48, Nick Piggin wrote:
> I am hoping to go a slightly different way in 2.5 pending
> inclusion of process io contexts. If you had time to look
> over my changes there (in current mm tree) it would be
> appreciated, but they don't help your problem for 2.4.
>
> I found that my queue full fairness for 2.4 didn't address
> the batching issue well. It does guarantee the lowest possible
> maximum latency for singular requests, but due to lowered
> throughput this can cause worse "high level" latency.
>
> I couldn't find a really good, comprehensive method of
> allowing processes to batch without resorting to very
> complex wakeup methods unless process io contexts are used.
> The other possibility would be to keep a list of "batching"
> processes which should achieve the same as io contexts.
>
> An easier approach would be to just allow the last woken
> process to submit a batch of requests. This wouldn't have
> as good guaranteed fairness, but that's not to say it would
> have starvation issues. I'll help you implement it if you
> are interested.
One of the things I tried in this area was basically queue ownership.
When each process woke up, he was given strict ownership of the queue
and could submit up to N number of requests. One process waited for
ownership in a yield loop for a max limit of a certain number of
jiffies, all the others waited on the request queue.
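A stripped-down model of that ownership scheme (invented names, no locking and no jiffies-based expiry) might look like:

```c
#include <assert.h>

/*
 * Toy model of strict queue ownership: the first process to submit
 * while the queue is unowned becomes owner and may submit up to
 * TOY_BATCH requests; everyone else is told to keep waiting. The
 * timeout that guarded against a stalled owner is omitted here.
 */
#define TOY_BATCH 8

struct toy_owner_q {
    int owner;       /* pid of the current owner, 0 = unowned */
    int batch_left;  /* requests the owner may still submit */
};

static int toy_submit(struct toy_owner_q *q, int pid)
{
    if (q->owner == 0) {             /* claim ownership on wakeup */
        q->owner = pid;
        q->batch_left = TOY_BATCH;
    }
    if (q->owner != pid)
        return 0;                    /* not the owner: back to sleep */
    if (--q->batch_left == 0)
        q->owner = 0;                /* batch spent, release the queue */
    return 1;
}
```

The fairness problem is visible even in the toy: every other submitter eats a failed attempt (and in the real code, a full sleep/wake cycle) for each request the owner batches.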
It generally increased the latency in __get_request_wait by a multiple
of N. I didn't keep it because the current patch is already full of
subtle interactions, I didn't want to make things more confusing than
they already were ;-)
The real problem with this approach is that we're guessing about the
number of requests a given process wants to submit, and we're assuming
those requests are going to be highly mergable. If the higher levels
pass these hints down to the elevator, we should be able to do a better
job of giving both low latency and high throughput.
Between bios and the pdflush daemons, I think 2.5 is in pretty good
shape to do what we need. I'm not 100% sure we need batching when the
requests being submitted are not highly mergable, but I haven't put lots
of thought into that part yet.
Anyway for 2.4 I'm not sure there's much more we can do. I'd like to
add tunables to the current patch, so userland can control the max io in
flight and a simple toggle between throughput mode and latency mode on a
per device basis. It's not perfect but should tide us over until 2.6.
-chris
Chris Mason wrote:
>On Thu, 2003-06-26 at 01:48, Nick Piggin wrote:
>
>
>>I am hoping to go a slightly different way in 2.5 pending
>>inclusion of process io contexts. If you had time to look
>>over my changes there (in current mm tree) it would be
>>appreciated, but they don't help your problem for 2.4.
>>
>>I found that my queue full fairness for 2.4 didn't address
>>the batching issue well. It does guarantee the lowest possible
>>maximum latency for singular requests, but due to lowered
>>throughput this can cause worse "high level" latency.
>>
>>I couldn't find a really good, comprehensive method of
>>allowing processes to batch without resorting to very
>>complex wakeup methods unless process io contexts are used.
>>The other possibility would be to keep a list of "batching"
>>processes which should achieve the same as io contexts.
>>
>>An easier approach would be to just allow the last woken
>>process to submit a batch of requests. This wouldn't have
>>as good guaranteed fairness, but that's not to say it would
>>have starvation issues. I'll help you implement it if you
>>are interested.
>>
>
>One of the things I tried in this area was basically queue ownership.
>When each process woke up, he was given strict ownership of the queue
>and could submit up to N number of requests. One process waited for
>ownership in a yield loop for a max limit of a certain number of
>jiffies, all the others waited on the request queue.
>
Not sure exactly what you mean by one process waiting for ownership
in a yield loop, but why don't you simply allow the queue "owner" to
submit up to a maximum of N requests within a time limit. Once either
limit expires (or, rarely, another might become owner -) the process
would just be put to sleep by the normal queue_full mechanism.
>
>It generally increased the latency in __get_request_wait by a multiple
>of N. I didn't keep it because the current patch is already full of
>subtle interactions, I didn't want to make things more confusing than
>they already were ;-)
>
Yeah, something like that. I think that in a queue full situation,
the processes are wanting to submit more than 1 request though. So
the better throughput you can achieve by batching translates to
better effective throughput. Read my recent debate with Andrea
about this though - I couldn't convince him!
I have seen much better maximum latencies, 2-3 times the
throughput, and an order of magnitude less context switching on
many threaded tiobench write loads when using batching.
In short, measuring get_request latency won't give you the full
story.
>
>The real problem with this approach is that we're guessing about the
>number of requests a given process wants to submit, and we're assuming
>those requests are going to be highly mergable. If the higher levels
>pass these hints down to the elevator, we should be able to do a better
>job of giving both low latency and high throughput.
>
No, the numbers (batch # requests, batch time) are not highly scientific.
Simply when a process wakes up, we'll let them submit a small burst of
requests before they go back to sleep. Now in 2.5 (mm) we can cheat and
make this more effective, fair, and without possible missed wakes because
io contexts means that multiple processes can be batching at the same
time, and dynamically allocated requests means it doesn't matter if we
go a bit over the queue limit.
I think a decent solution for 2.4 would be to simply have the one queue
owner, but when he has allowed the queue to fall below the batch limit,
wake someone else and make them the owner. It can be a bit less fair, and
it doesn't work across queues, but they're less important cases.
>
>Between bios and the pdflush daemons, I think 2.5 is in pretty good
>shape to do what we need. I'm not 100% sure we need batching when the
>requests being submitted are not highly mergable, but I haven't put lots
>of thought into that part yet.
>
No, there are a couple of problems here.
First, good locality != sequential. I saw tiobench 256 random write
throughput _doubled_ because each process is writing within its own
file.
Second, mergeable doesn't mean anything if your request size only
grows to say 128KB (IDE). I saw tiobench 256 sequential writes on IDE
go from ~ 25% peak throughput to ~70% (4.85->14.11 from 20MB/s disk)
Third, context switch rate. In the latest IBM regression tests,
tiobench 64 on ext2, 8xSMP (so don't look at throughput!), average
cs/s was about 2500 with mainline (FIFO request allocation), and
140 in mm (batching allocation). So nearly 20x better. This might
not be due to batching alone, but I didn't see any other obvious
change in mm.
>
>Anyway for 2.4 I'm not sure there's much more we can do. I'd like to
>add tunables to the current patch, so userland can control the max io in
>flight and a simple toggle between throughput mode and latency mode on a
>per device basis. It's not perfect but should tide us over until 2.6.
>
>
The changes do seem to be a critical fix due to the starvation issue,
but I'm worried that they take a big step back in performance under
high disk load. I found my FIFO mechanism to be unacceptably slow for
2.5.
Nick Piggin wrote:
snip
>
> Yeah, something like that. I think that in a queue full situation,
> the processes are wanting to submit more than 1 request though. So
> the better throughput you can achieve by batching translates to
> better effective throughput. Read my recent debate with Andrea
^^^^^^^^^^
Err, latency
snip
>
> No, the numbers (batch # requests, batch time) are not highly scientific.
> Simply when a process wakes up, we'll let them submit a small burst of
> requests before they go back to sleep.
by this, I mean that it's not a big problem that we don't know how many
requests a process wants to submit.
snip
>
> The changes do seem to be a critical fix due to the starvation issue,
> but I'm worried that they take a big step back in performance under
> high disk load. I found my FIFO mechanism to be unacceptably slow for
> 2.5.
BTW, sorry for the lack of better benchmark numbers. I couldn't
find good ones lying around. I found uniprocessor tiobench to
be quite helpful at queue_nr_requests * 0.5, 2 threads to
measure different types of overloadedness.
Also, I didn't see much gain in read performance in my testing -
probably due to AS. I expect 2.4 and 2.5 non AS read performance
to show bigger improvements from batching (ie. regressions).
On Thu, 2003-06-26 at 09:04, Nick Piggin wrote:
> >One of the things I tried in this area was basically queue ownership.
> >When each process woke up, he was given strict ownership of the queue
> >and could submit up to N number of requests. One process waited for
> >ownership in a yield loop for a max limit of a certain number of
> >jiffies, all the others waited on the request queue.
> >
>
> Not sure exactly what you mean by one process waiting for ownership
> in a yield loop, but why don't you simply allow the queue "owner" to
> submit up to a maximum of N requests within a time limit. Once either
> limit expires (or, rarely, another might become owner -) the process
> would just be put to sleep by the normal queue_full mechanism.
>
You need some way to wakeup the queue after that time limit has expired,
in case the owner never submits another request. This can either be a
timer or a process in a yield loop. Given the very short expire time I
set (10 jiffies), I went for the yield loop.
> >
> >It generally increased the latency in __get_request_wait by a multiple
> >of N. I didn't keep it because the current patch is already full of
> >subtle interactions, I didn't want to make things more confusing than
> >they already were ;-)
> >
>
> Yeah, something like that. I think that in a queue full situation,
> the processes are wanting to submit more than 1 request though. So
> the better throughput you can achieve by batching translates to
> better effective throughput. Read my recent debate with Andrea
> about this though - I couldn't convince him!
>
Well, it depends ;-) I think we've got 3 basic kinds of procs during a
q->full condition:
1) wants to submit lots of somewhat contiguous io
2) wants to submit a single io
3) wants to submit lots of random io
From a throughput point of view, we only care about giving batch
ownership to #1. Giving batch ownership to #3 will help reduce context
switches, but if it helps throughput then the io wasn't really random
(you've got a good point about locality below, drive write caches make a
huge difference there).
The problem I see in 2.4 is the elevator can't tell any of these cases
apart, so any attempt at batch ownership is certain to be wrong at least
part of the time.
> I have seen much better maximum latencies, 2-3 times the
> throughput, and an order of magnitude less context switching on
> many threaded tiobench write loads when using batching.
>
> In short, measuring get_request latency won't give you the full
> story.
>
Very true. But get_request latency is the minimum amount of time a
single read is going to wait (in 2.4.x anyway), and that is what we need
to focus on when we're trying to fix interactive performance.
> >
> >The real problem with this approach is that we're guessing about the
> >number of requests a given process wants to submit, and we're assuming
> >those requests are going to be highly mergable. If the higher levels
> >pass these hints down to the elevator, we should be able to do a better
> >job of giving both low latency and high throughput.
> >
>
> No, the numbers (batch # requests, batch time) are not highly scientific.
> Simply when a process wakes up, we'll let them submit a small burst of
> requests before they go back to sleep. Now in 2.5 (mm) we can cheat and
> make this more effective, fair, and without possible missed wakes because
> io contexts means that multiple processes can be batching at the same
> time, and dynamically allocated requests means it doesn't matter if we
> go a bit over the queue limit.
>
I agree 2.5 has a lot more room for the contexts to be effective, and I
think they are a really good idea.
> I think a decent solution for 2.4 would be to simply have the one queue
> owner, but when he has allowed the queue to fall below the batch limit,
> wake someone else and make them the owner. It can be a bit less fair, and
> it doesn't work across queues, but they're less important cases.
>
> >
> >Between bios and the pdflush daemons, I think 2.5 is in pretty good
> >shape to do what we need. I'm not 100% sure we need batching when the
> >requests being submitted are not highly mergable, but I haven't put lots
> >of thought into that part yet.
> >
>
> No, there are a couple of problems here.
> First, good locality != sequential. I saw tiobench 256 random write
> throughput _doubled_ because each process is writing within its own
> file.
>
> Second, mergeable doesn't mean anything if your request size only
> grows to say 128KB (IDE). I saw tiobench 256 sequential writes on IDE
> go from ~ 25% peak throughput to ~70% (4.85->14.11 from 20MB/s disk)
Well, play around with raw io, my box writes at roughly disk speed with
128k synchronous requests (contiguous writes).
> Third, context switch rate. In the latest IBM regression tests,
> tiobench 64 on ext2, 8xSMP (so don't look at throughput!), average
> cs/s was about 2500 with mainline (FIFO request allocation), and
> 140 in mm (batching allocation). So nearly 20x better. This might
> not be due to batching alone, but I didn't see any other obvious
> change in mm.
>
Makes sense.
> >
> >Anyway for 2.4 I'm not sure there's much more we can do. I'd like to
> >add tunables to the current patch, so userland can control the max io in
> >flight and a simple toggle between throughput mode and latency mode on a
> >per device basis. It's not perfect but should tide us over until 2.6.
> >
> >
>
> The changes do seem to be a critical fix due to the starvation issue,
> but I'm worried that they take a big step back in performance under
> high disk load. I found my FIFO mechanism to be unacceptably slow for
> 2.5.
Me too, but I'm not sure how to fix it other than a userspace knob to
turn off the q->full checks for server workloads. Andrea's
elevator-lowlatency alone has pretty good throughput numbers, since it
still allows request stealing. Its get_request_wait latency numbers
aren't horrible either, it only suffers in a few corner cases.
But, if someone wants to play with this more, I've attached a quick
remerge of my batch ownership code. I made a read and write owner, so
that a reader doing a single request doesn't grab ownership and make all
the writes wait.
It does make throughput better overall, and it also makes latencies
worse overall. We'll probably get similar results just by disabling
q->full in io-stalls-7, but the batch ownership does a better job of
limiting get_request latencies at a fixed (although potentially large)
number.
lat-stat-5.diff goes on top of io-stalls-7.diff from yesterday
batch_owner.diff goes on top of lat-stat-5.diff.
-chris
Chris Mason wrote:
>On Thu, 2003-06-26 at 09:04, Nick Piggin wrote:
>
>
>>>One of the things I tried in this area was basically queue ownership.
>>>When each process woke up, he was given strict ownership of the queue
>>>and could submit up to N number of requests. One process waited for
>>>ownership in a yield loop for a max limit of a certain number of
>>>jiffies, all the others waited on the request queue.
>>>
>>>
>>Not sure exactly what you mean by one process waiting for ownership
>>in a yield loop, but why don't you simply allow the queue "owner" to
>>submit up to a maximum of N requests within a time limit. Once either
>>limit expires (or, rarely, another might become owner -) the process
>>would just be put to sleep by the normal queue_full mechanism.
>>
>>
>
>You need some way to wakeup the queue after that time limit has expired,
>in case the owner never submits another request. This can either be a
>timer or a process in a yield loop. Given the very short expire time I
>set (10 jiffies), I went for the yield loop.
>
>
>>>It generally increased the latency in __get_request_wait by a multiple
>>>of N. I didn't keep it because the current patch is already full of
>>>subtle interactions, I didn't want to make things more confusing than
>>>they already were ;-)
>>>
>>>
>>Yeah, something like that. I think that in a queue full situation,
>>the processes are wanting to submit more than 1 request though. So
>>the better throughput you can achieve by batching translates to
>>better effective throughput. Read my recent debate with Andrea
>>about this though - I couldn't convince him!
>>
>>
>
>Well, it depends ;-) I think we've got 3 basic kinds of procs during a
>q->full condition:
>
>1) wants to submit lots of somewhat contiguous io
>2) wants to submit a single io
>3) wants to submit lots of random io
>
>From a throughput point of view, we only care about giving batch
>ownership to #1. Giving batch ownership to #3 will help reduce context
>switches, but if it helps throughput then the io wasn't really random
>(you've got a good point about locality below, drive write caches make a
>huge difference there).
>
>The problem I see in 2.4 is the elevator can't tell any of these cases
>apart, so any attempt at batch ownership is certain to be wrong at least
>part of the time.
>
I think though, for fairness, if we allow one to submit a batch
of requests, we have to give that opportunity to the others.
And yeah, it does reduce context switches, and it does improve
throughput for "random" localised IO.
>
>
>>I have seen much better maximum latencies, 2-3 times the
>>throughput, and an order of magnitude less context switching on
>>many threaded tiobench write loads when using batching.
>>
>>In short, measuring get_request latency won't give you the full
>>story.
>>
>>
>
>Very true. But get_request latency is the minimum amount of time a
>single read is going to wait (in 2.4.x anyway), and that is what we need
>to focus on when we're trying to fix interactive performance.
>
The read situation is different to write. To fill the read queue,
you need queue_nr_requests / 2-3 (for readahead) reading processes
to fill the queue, more if the reads are random.
If this kernel is being used interactively, it's not our fault we
might not give quite as good interactive performance. I'm sure
the fileserver admin would rather take the tripled bandwidth ;)
That said, I think a lot of interactive programs will want to do
more than 1 request at a time anyway.
>
>>>The real problem with this approach is that we're guessing about the
>>>number of requests a given process wants to submit, and we're assuming
>>>those requests are going to be highly mergable. If the higher levels
>>>pass these hints down to the elevator, we should be able to do a better
>>>job of giving both low latency and high throughput.
>>>
>>>
>>No, the numbers (batch # requests, batch time) are not highly scientific.
>>Simply when a process wakes up, we'll let them submit a small burst of
>>requests before they go back to sleep. Now in 2.5 (mm) we can cheat and
>>make this more effective, fair, and without possible missed wakes because
>>io contexts means that multiple processes can be batching at the same
>>time, and dynamically allocated requests means it doesn't matter if we
>>go a bit over the queue limit.
>>
>>
>
>I agree 2.5 has a lot more room for the contexts to be effective, and I
>think they are a really good idea.
>
>
>>I think a decent solution for 2.4 would be to simply have the one queue
>>owner, but when the queue falls below the batch limit, wake
>>someone else and make them the owner. It can be a bit less fair, and
>>it doesn't work across queues, but they're less important cases.
>>
>>
>>>Between bios and the pdflush daemons, I think 2.5 is in pretty good
>>>shape to do what we need. I'm not 100% sure we need batching when the
>>>requests being submitted are not highly mergable, but I haven't put lots
>>>of thought into that part yet.
>>>
>>>
>>No, there are a couple of problems here.
>>First, good locality != sequential. I saw tiobench 256 random write
>>throughput _doubled_ because each process is writing within its own
>>file.
>>
>>Second, mergeable doesn't mean anything if your request size only
>>grows to say 128KB (IDE). I saw tiobench 256 sequential writes on IDE
>>go from ~ 25% peak throughput to ~70% (4.85->14.11 from 20MB/s disk)
>>
>
>Well, play around with raw io, my box writes at roughly disk speed with
>128k synchronous requests (contiguous writes).
>
Yeah, I'm not talking about request overhead - I think a 128K sized
request is just fine. But when there are 256 threads writing, with
FIFO method, 128 threads will each have 1 request in the queue. If
they are sequential writers, each request will probably be 128K.
That isn't enough to get good disk bandwidth. The elevator _has_ to
make a suboptimal decision.
With batching, say 8 processes have 16 sequential requests on the
queue each. The elevator can make good choices.
On Thu, 2003-06-26 at 21:21, Nick Piggin wrote:
> >Very true. But get_request latency is the minimum amount of time a
> >single read is going to wait (in 2.4.x anyway), and that is what we need
> >to focus on when we're trying to fix interactive performance.
> >
>
> The read situation is different to write. To fill the read queue,
> you need queue_nr_requests / 2-3 (for readahead) reading processes
> to fill the queue, more if the reads are random.
> If this kernel is being used interactively, it's not our fault we
> might not give quite as good interactive performance. I'm sure
> the fileserver admin would rather take the tripled bandwidth ;)
>
> That said, I think a lot of interactive programs will want to do
> more than 1 request at a time anyway.
>
My intuition agrees with yours, but if this is true then Andrea's old
elevator-lowlatency patch alone is enough, and we don't need q->full at
all. Users continued to complain of bad latencies even with his code
applied.
From a practical point of view his old code is the same as the batch
wakeup code for get_request latencies and provides good throughput.
There are a few cases where batch wakeup has shorter overall latencies,
but I don't think people were in those heavy workloads while they were
complaining of stalls in -aa.
> >>Second, mergeable doesn't mean anything if your request size only
> >>grows to say 128KB (IDE). I saw tiobench 256 sequential writes on IDE
> >>go from ~ 25% peak throughput to ~70% (4.85->14.11 from 20MB/s disk)
> >>
> >
> >Well, play around with raw io, my box writes at roughly disk speed with
> >128k synchronous requests (contiguous writes).
> >
>
> Yeah, I'm not talking about request overhead - I think a 128K sized
> request is just fine. But when there are 256 threads writing, with
> FIFO method, 128 threads will each have 1 request in the queue. If
> they are sequential writers, each request will probably be 128K.
> That isn't enough to get good disk bandwidth. The elevator _has_ to
> make a suboptimal decision.
>
> With batching, say 8 processes have 16 sequential requests on the
> queue each. The elevator can make good choices.
I agree here too, it just doesn't match the user reports we've been
getting in 2.4 ;-) If 2.5 can dynamically allocate requests now,
then you can get much better results with io contexts/dynamic wakeups,
but I can't see how to make it work in 2.4 without larger backports.
So, the way I see things, we've got a few choices.
1) do nothing. 2.6 isn't that far off.
2) add elevator-lowlatency without q->full. It solves 90% of the
problem
3) add q->full as well and make it the default. Great latencies, not so
good throughput. Add userland tunables so people can switch.
4) back port some larger chunk of 2.5 and find a better overall
solution.
I vote for #3, don't care much if q->full is on or off by default, as
long as we make an easy way for people to set it.
-chris
On Wed, 25 Jun 2003, Chris Mason wrote:
> I've no preference really. I didn't notice a throughput difference but
> my scsi drives only have 2MB of cache.
You shouldn't be using the drive's write cache in the first place!
The write cache, whether ATA or SCSI, cannot, as far as I know, be
used safely with any Linux file system (and my questions about whether
2.6 will finally change that have gone unanswered so far), because the
write reordering the cache can do can seriously damage file systems,
journalling or not.
Please conduct all your tests with write caches turned off, because
that's what matters in REAL systems; in that case, these latencies
become a REAL pain because writing is so much slower due to
all the seeks.
Optimizing for write-cached behaviour must not happen a single second
before:
1. the file systems know how to queue "ordered tags" in the right places
(write barrier to enforce proper ordering for on-disk consistency,
which I assume will make for a lot of ordered tags for writing to the
journal itself)
2. the device driver knows how to map "ordered tags" to flush or
whatever operations for drives that don't do tagged command queueing
(ATA mostly, or SCSI when TCQ is switched off).
All these "0-bytes in file" problems with XFS, ReiserFS, JFS, ext2 and
ext3 in data=writeback mode happen because the kernel doesn't care about
write ordering, and these broken files are a) occasionally hard to find,
b) another PITA.
I consider proper write ordering and enforcing thereof a MUST-FIX. This
is much more important than getting some extra latencies squished. It
must do the right thing in the first place, and then it can do the right
thing faster.
I am aware that you're not the only person responsible for the state
Linux is in, and I'd like to see the write barriers revived ASAP for at
least ext2/ext3/reiserfs, sym53c8xx, aic7xxx, tmscsim and IDE. I am
sorry I'm not able to offer any help on that; I'm not acquainted with
the kernel internals, and I can't donate money to have someone do it.
SCNR.
--
Matthias Andree
Chris Mason wrote:
>On Thu, 2003-06-26 at 21:21, Nick Piggin wrote:
>
>
>>>Very true. But get_request latency is the minimum amount of time a
>>>single read is going to wait (in 2.4.x anyway), and that is what we need
>>>to focus on when we're trying to fix interactive performance.
>>>
>>>
>>The read situation is different to write. To fill the read queue,
>>you need queue_nr_requests / 2-3 (for readahead) reading processes
>>to fill the queue, more if the reads are random.
>>If this kernel is being used interactively, it's not our fault we
>>might not give quite as good interactive performance. I'm sure
>>the fileserver admin would rather take the tripled bandwidth ;)
>>
>>That said, I think a lot of interactive programs will want to do
>>more than 1 request at a time anyway.
>>
>>
>
>My intuition agrees with yours, but if this is true then andrea's old
>elevator-lowlatency patch alone is enough, and we don't need q->full at
>all. Users continued to complain of bad latencies even with his code
>applied.
>
Didn't that still have the starvation issues in get_request that
my patch addressed, though? This batching is needed because of the
strict FIFO behaviour that my "q->full" change introduced.
>
>From a practical point of view his old code is the same as the batch
>wakeup code for get_request latencies and provides good throughput.
>There are a few cases where batch wakeup has shorter overall latencies,
>but I don't think people were in those heavy workloads while they were
>complaining of stalls in -aa.
>
>
>>>>Second, mergeable doesn't mean anything if your request size only
>>>>grows to say 128KB (IDE). I saw tiobench 256 sequential writes on IDE
>>>>go from ~ 25% peak throughput to ~70% (4.85->14.11 from 20MB/s disk)
>>>>
>>>>
>>>Well, play around with raw io, my box writes at roughly disk speed with
>>>128k synchronous requests (contiguous writes).
>>>
>>>
>>Yeah, I'm not talking about request overhead - I think a 128K sized
>>request is just fine. But when there are 256 threads writing, with
>>FIFO method, 128 threads will each have 1 request in the queue. If
>>they are sequential writers, each request will probably be 128K.
>>That isn't enough to get good disk bandwidth. The elevator _has_ to
>>make a suboptimal decision.
>>
>>With batching, say 8 processes have 16 sequential requests on the
>>queue each. The elevator can make good choices.
>>
>
>I agree here too, it just doesn't match the user reports we've been
>getting in 2.4 ;-) If 2.5 can dynamically allocate requests now and
>then you can get much better results with io contexts/dynamic wakeups,
>but I can't see how to make it work in 2.4 without larger backports.
>
>So, the way I see things, we've got a few choices.
>
>1) do nothing. 2.6 isn't that far off.
>
>2) add elevator-lowlatency without q->full. It solves 90% of the
>problem
>
>3) add q->full as well and make it the default. Great latencies, not so
>good throughput. Add userland tunables so people can switch.
>
>4) back port some larger chunk of 2.5 and find a better overall
>solution.
>
>I vote for #3, don't care much if q->full is on or off by default, as
>long as we make an easy way for people to set it.
>
5) include the "q->full" starvation fix; add the concept of a
queue owner, the batching process.
I'm a bit busy at the moment and so I won't test this, unfortunately.
I would prefer that if something like #5 doesn't get in, then nothing
be done for .22 unless it's backed up by a few decent benchmarks. But
it's not my call anyway.
Cheers,
Nick
On Fri, 2003-06-27 at 05:45, Nick Piggin wrote:
> Chris Mason wrote:
> >>>
> >>The read situation is different to write. To fill the read queue,
> >>you need queue_nr_requests / 2-3 (for readahead) reading processes
> >>to fill the queue, more if the reads are random.
> >>If this kernel is being used interactively, it's not our fault we
> >>might not give quite as good interactive performance. I'm sure
> >>the fileserver admin would rather take the tripled bandwidth ;)
> >>
> >>That said, I think a lot of interactive programs will want to do
> >>more than 1 request at a time anyway.
> >>
> >>
> >
> >My intuition agrees with yours, but if this is true then andrea's old
> >elevator-lowlatency patch alone is enough, and we don't need q->full at
> >all. Users continued to complain of bad latencies even with his code
> >applied.
> >
>
> Didn't that still have the starvation issues in get_request that
> my patch addressed, though? This batching is needed because of the
> strict FIFO behaviour that my "q->full" change introduced.
>
Sure, but even though the batch wakeup code didn't have starvation
issues, the overall get_request latency was still high. The end result
was basically the same, without q->full we've got a higher max wait and
a lower average wait. With batch wakeup we've got a higher average
(300-400 jiffies) and a lower max (800-900 jiffies).
Especially for things like directory listings, where 2.4 generally does
io a few blocks at a time, the get_request latency is a big part of the
latency an interactive user sees.
> >So, the way I see things, we've got a few choices.
> >
> >1) do nothing. 2.6 isn't that far off.
> >
> >2) add elevator-lowlatency without q->full. It solves 90% of the
> >problem
> >
> >3) add q->full as well and make it the default. Great latencies, not so
> >good throughput. Add userland tunables so people can switch.
> >
> >4) back port some larger chunk of 2.5 and find a better overall
> >solution.
> >
> >I vote for #3, don't care much if q->full is on or off by default, as
> >long as we make an easy way for people to set it.
> >
>
> 5) include the "q->full" starvation fix; add the concept of a
> queue owner, the batching process.
>
I've tried two different approaches to #5. The first is just a
batch_owner where other procs are still allowed to grab requests and the
owner is allowed to ignore q->full. The end result was low latencies
but not much better throughput. With a small number of procs, you've
got a good chance bdflush is going to get ownership and the throughput
is pretty good. With more procs the probability of that goes down and
the throughput benefit goes away.
My second attempt was the batch wakeup patch from yesterday. Overall I
don't feel the latencies are significantly better with that patch than
with Andrea's elevator-lowlatency and q->full disabled.
> I'm a bit busy at the moment and so I won't test this, unfortunately.
> I would prefer that if something like #5 doesn't get in, then nothing
> be done for .22 unless it's backed up by a few decent benchmarks. But
> its not my call anyway.
>
Andrea's code without q->full is a good starting point regardless. The
throughput is good and the latencies are better overall. q->full is
simple enough that making it available via a tunable is pretty easy. I
really do wish I could make one patch that works well for both, but I've
honestly run out of ideas ;-)
-chris