I recently found and fixed a mysterious bug that could hang the
kernel with tasks in D state while the disk was idle. We could reproduce
very long hangs (several hours) with tasks in D state on reiserfs
after an hour or so of intensive load (not cerberus, see below
why), but it wasn't a reiserfs-specific problem; reiserfs just happens
to take the lock_super while doing the fsync_buffers_list, and this leads
kupdate to get stuck on the lock_super waiting for a wait_on_buffer to
return. With kupdate stuck, the background run_task_queue() no longer
runs every 5 seconds, and in turn, if there's a missing unplug
somewhere, the machine hangs in wait_on_buffer for an
indefinite/infinite time (a kind of deadlock, unless somebody triggers
a readpage or something else that unplugs the disk queue; usually logging in
with ssh fixed the problem). Significantly increasing the kupdate
interval could lead to the same indefinite hang on ext2
while running fsync.
For some time I didn't even consider the possibility of wait_on_buffer
being the problem; the kernel where we could reproduce it carries over
700 patches, so for a while I was looking everywhere but at the buggy
place. After ruling out various other bits
(scheduler fixes/compiler/fsync corruption fixes etc..) I finally
realized the problem is a longstanding locking design problem in
wait_on_buffer (then I found the same problem in wait_on_page, and
yesterday Chris found a similar problem in get_request_wait too; the
get_request_wait one is not exactly the same issue, but it's quite
similar and it can lead to exactly the same hangs).
Probably nobody noticed this yet because with ext2/ext3 these
hangs happen on all machines, but they resolve after a mean disk-idle
time of 2.5 seconds and only occur once in a while; normally people
would see mean delays of 2.5 seconds coming from the
datacenter and think it was normal I/O congestion, or the
elevator, or something during the fsync on ext2. Furthermore, as Chris
pointed out, with very intensive load bdflush would usually be running in
the background; this race can trigger only with mid writepage loads, when
bdflush/pdflush has no reason to run.
We also have the lowlatency fixes (they're fixes) inside submit_bh, so we
probably opened a larger window for the race to trigger than mainline has.
Chris also double-checked that the bug we were facing really was this race by
introducing a delay in submit_bh to make it reproducible in a reasonable
amount of time.
The race looks like this:

    CPU0                        CPU1
    -----------------           ------------------------
    reiserfs_writepage
    lock_buffer()
                                fsync_buffers_list() under lock_super()
                                wait_on_buffer()
                                run_task_queue(&tq_disk) -> noop
                                schedule() <- hang with lock_super acquired
    submit_bh()
    /* don't unplug here */
This example is reiserfs-specific, but any wait_on_buffer can hang
indefinitely against any concurrent ll_rw_block or submit_bh (even
on UP, since submit_bh is a blocking operation, in particular with the
lowlat fixes). There's no big kernel lock anymore serializing
wait_on_buffer/ll_rw_block. This locking design problem was introduced
with the removal of the BKL from wait_on_buffer/wait_on_page/ll_rw_block
during one of the 2.3 scalability efforts, so every 2.4 kernel out there
is affected by this race.
In short, the problem here is that wait_on_"something" has no clue
whether the locked "something" has just been inserted into the I/O queue
and is visible to the device, so it has no clue whether run_task_queue()
will be a noop or will actually affect the "something". And the writer
side that executes submit_bh rightfully won't unplug the queue (to allow
merging and boost performance until somebody actually asks for the I/O
to be completed ASAP).
I fixed the race by simply waking up any waiter after any
submit_bh/submit_bio that leaves stuff pending in the I/O queue. So if
the race triggers now, the wait_on_something will get a wakeup and in
turn will trigger the unplug again, closing the window for the race.
This fixes the problem in practice and seems the best fix at least for
2.4; I don't see any potential performance regression, so I don't feel
the need for anything more complicated than this, given the race
triggers only once every several hours and only under some special
workload. One could try to avoid loading the waitqueue-head cacheline
during submit_bh, but at least for 2.4 I don't think it's worth the
complexity, and it's an I/O path anyway so it's certainly not critical.
The problem noticed by Chris with get_request_wait is similar: the
unplugging ran before adding the task to the waitqueue, so the
unplug could free the requests and somebody else could allocate the
freed requests without unplugging the queue afterwards. I fixed it
simply by unplugging the queue just before schedule(). That was really
a more genuine bug than the other, subtler ones. With get_request_wait
the fix is so simple because we deal with entities that are always
guaranteed to be affected by the queue unplug (they're the I/O
requests); this isn't the case with a locked bh or locked/writeback
pages, and that was in fact the wrong assumption that allowed the other
races to trigger in the first place.
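The ordering fix can be sketched as kernel-style pseudocode (a sketch of
the idea, not the literal patch; the function names only approximate the
2.4 API):

```
/* buggy ordering in get_request_wait: */
generic_unplug_device(q);       /* frees requests...                   */
add_wait_queue(&q->wait_for_request, &wait);
                                /* ...but another task can steal them
                                   before we are on the waitqueue      */
schedule();                     /* can sleep forever, queue plugged again */

/* fixed ordering: register first, unplug just before sleeping */
add_wait_queue(&q->wait_for_request, &wait);
set_current_state(TASK_UNINTERRUPTIBLE);
generic_unplug_device(q);       /* just before schedule(), no window   */
schedule();
```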
While doing these fixes I noticed various other bits:
1) in general, the blk_run_queues()/run_task_queue()/sync_page should
always run just before schedule(); it's pointless to unplug anything if
we don't run schedule() (an ultra-micro-optimization)
2) in 2.4, the run_task_queue() in wait_on_buffer could have its
TQ_ACTIVE check executed inside the add_wait_queue critical section,
since spin_unlock has inclusive semantics (literally, speculative reads
can pass the spin_unlock even on x86/x86-64)
3) __blk_put_request/blkdev_release_request was increasing the count
and reading the waitqueue contents without even a barrier() for the asm
layer; it needs an smp_mb() in between to serialize against
get_request_wait, which runs locklessly
I did two patches, one for 2.4.20rc1 and one for 2.5.47 (sorry, no bk
tree here; I will try to make bitdropper.py available shortly so I can
access the new info encoded in proprietary form too), that address
all these races. However, 2.5 should be analysed further: I didn't
search too hard for all the possible places that could have this race in
2.5; I searched hard in 2.4 and only addressed the same problems
in 2.5. The only bit I think could be problematic in 2.4 is the nfs
speculative I/O, the reason nfs implements a sync_page in the first
place. That may have the same race; in fact I heard of some report of
nfs hung in wait_on_page, and I wonder if this could explain that too.
I assume the fs maintainers will take care of checking their filesystems
for missing wakeups of page waiters in 2.4 and 2.5 now that the problem
is well known.
http://www.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.20rc1/fix-pausing-1
http://www.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.5/2.5.47/fix-pausing-1
they're attached to this email too since they're tiny.
Andrea
Hi Andrea,
I've applied your patch to 2.4.19 final and 2.4.20 final and those "pausings"
still exist when I do a full kernel tree copy with "cp -a linux-2.4.19
linux-2.4.20". When I am in a screen session and try Ctrl-A-C to
create a new screen session, it comes up within 3-5 seconds ... If I want to
switch to another session with Ctrl-A-N it sometimes happens fast and
sometimes within 3-5 seconds ... I've put the above in a "while true; do ...;
done" loop. There is nothing else running or eating CPU and also nothing else
eating I/O. This does not happen with virgin 2.4.18.
This is with a Celeron Tualatin 1300MHz, i815 chipset, 512MB RAM, UDMA100 HDD,
DMA enabled.
/dev/hda:
multcount = 16 (on)
IO_support = 1 (32-bit)
unmaskirq = 0 (off)
using_dma = 1 (on)
keepsettings = 0 (off)
readonly = 0 (off)
readahead = 8 (on)
geometry = 7299/255/63, sectors = 117266688, start = 0
Model=MAXTOR 6L060J3, FwRev=A93.0500, SerialNo=663219752652
Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
RawCHS=16383/16/63, TrkSize=32256, SectSize=21298, ECCbytes=4
BuffType=DualPortCache, BuffSize=1819kB, MaxMultSect=16, MultSect=16
CurCHS=4047/16/255, CurSects=16511760, LBA=yes, LBAsects=117266688
IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 mdma2
UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5 udma6
AdvancedPM=no WriteCache=enabled
Drive conforms to: ATA/ATAPI-5 T13 1321D revision 1: 1 2 3 4 5
This also appears if unmaskirq is set to 1.
ciao, Marc
On Sat, Nov 16, 2002 at 05:24:17PM +0100, Marc-Christian Petersen wrote:
> Hi Andrea,
>
> I've applied your patch to 2.4.19 final and 2.4.20 final and those "pausings"
> still exist when I do a full kernel tree copy with "cp -a linux-2.4.19
> linux-2.4.20". When I am in a screen session and try Ctrl-A-C to
> create a new screen session, it comes up within 3-5 seconds ... If I want to
> switch to another session with Ctrl-A-N it sometimes happens fast and
> sometimes within 3-5 seconds ... I've put the above in a "while true; do ...;
> done" loop. There is nothing else running or eating CPU and also nothing else
> eating I/O. This does not happen with virgin 2.4.18.
Your pausing problem has little to do with the pausing fix; the problem
for you is read latency. You're not triggering the race condition
fixed by the pausing fix, so it can't make a difference. One of the
fundamental obstacles to read latency is the size of the I/O queue, a
factor that is worked around by the read-latency patch, which basically
bypasses the size of the queue, hiding the problem; in turn it can't fix
the write latency with O_SYNC or the read latency during true read aio etc...
Andrea
On Saturday 16 November 2002 17:59, Andrea Arcangeli wrote:
Hi Andrea,
> Your pausing problem has little to do with the pausing fix; the problem
> for you is read latency. You're not triggering the race condition
> fixed by the pausing fix, so it can't make a difference. One of the
> fundamental obstacles to read latency is the size of the I/O queue, a
> factor that is worked around by the read-latency patch, which basically
> bypasses the size of the queue, hiding the problem; in turn it can't fix
> the write latency with O_SYNC or the read latency during true read aio
> etc...
ok, after some further tests, I think this is _somewhat_ FS dependent. I tried
this with ext2, ext3 (no difference there) and also with ReiserFS, and what
can I say, those "pausings" caused by the write latency are a lot less
frequent with ReiserFS than with ext2|ext3, but they still occur.
Something bullshitty must have gone into 2.4.19/2.4.20 to cause those
ugly things, because 2.4.18 does not have that problem. This is still why I
don't use any kernels >2.4.18.
After changing elevator things like this:
root@codeman:[/] # elvtune /dev/hda
/dev/hda elevator ID 0
read_latency: 2048
write_latency: 1024
max_bomb_segments: 0
those "pausings" are less severe than before but are still there.
NOTE: the write latency is lower than the read latency (it's not a typo :)
ciao, Marc
On Sat, Nov 16, 2002 at 06:23:26PM +0100, Marc-Christian Petersen wrote:
> On Saturday 16 November 2002 17:59, Andrea Arcangeli wrote:
>
> Hi Andrea,
>
> > Your pausing problem has little to do with the pausing fix; the problem
> > for you is read latency. You're not triggering the race condition
> > fixed by the pausing fix, so it can't make a difference. One of the
> > fundamental obstacles to read latency is the size of the I/O queue, a
> > factor that is worked around by the read-latency patch, which basically
> > bypasses the size of the queue, hiding the problem; in turn it can't fix
> > the write latency with O_SYNC or the read latency during true read aio
> > etc...
> ok, after some further tests, I think this is _somewhat_ FS dependent. I tried
> this with ext2, ext3 (no difference there) and also with ReiserFS, and what
> can I say, those "pausings" caused by the write latency are a lot less
> frequent with ReiserFS than with ext2|ext3, but they still occur.
>
> Something bullshitty must have gone into 2.4.19/2.4.20 to cause those
> ugly things, because 2.4.18 does not have that problem. This is still why I
> don't use any kernels >2.4.18.
>
> After changing elevator things like this:
>
> root@codeman:[/] # elvtune /dev/hda
>
> /dev/hda elevator ID 0
> read_latency: 2048
> write_latency: 1024
> max_bomb_segments: 0
>
> those "pausings" are less severe than before but are still there.
> NOTE: the write latency is lower than the read latency (it's not a typo :)
you may want to try with this setting that helps with very slow devices:
echo 2 500 0 0 500 3000 3 1 0 > /proc/sys/vm/bdflush
or also with my current default tuned for high performance:
echo 50 500 0 0 500 3000 60 20 > /proc/sys/vm/bdflush
you may have too many dirty buffers around, so you end up running at disk
speed at every memory allocation; the first setting will decrease the
amount of dirty buffers dramatically. If you still have a significant
slowdown with the first setting above, it's most probably only the usual
elevator issue.
Andrea
On Saturday 16 November 2002 18:32, Andrea Arcangeli wrote:
Hi Andrea,
> you may want to try with this setting that helps with very slow devices:
> echo 2 500 0 0 500 3000 3 1 0 > /proc/sys/vm/bdflush
>
> or also with my current default tuned for high performance:
> echo 50 500 0 0 500 3000 60 20 > /proc/sys/vm/bdflush
I've tested both; no change, the "pausings" are still there.
> you may have too many dirty buffers around and you end running at disk
> speed at every memory allocation, the first setting will decrease the
> amount of dirty buffers dramatically, if you still have significant
> slowdown with the first setting above, it's most probably only the usual
> elevator issue.
Seems so.
So I have to use 2.4.18 until there is a proper fix for that.
ciao, Marc
On Sat, Nov 16, 2002 at 06:43:36PM +0100, Marc-Christian Petersen wrote:
> On Saturday 16 November 2002 18:32, Andrea Arcangeli wrote:
>
> Hi Andrea,
>
> > you may want to try with this setting that helps with very slow devices:
> > echo 2 500 0 0 500 3000 3 1 0 > /proc/sys/vm/bdflush
> >
> > or also with my current default tuned for high performance:
> > echo 50 500 0 0 500 3000 60 20 > /proc/sys/vm/bdflush
> I've tested both; no change, the "pausings" are still there.
>
> > you may have too many dirty buffers around, so you end up running at disk
> > speed at every memory allocation; the first setting will decrease the
> > amount of dirty buffers dramatically. If you still have a significant
> > slowdown with the first setting above, it's most probably only the usual
> > elevator issue.
> Seems so.
>
> So I have to use 2.4.18 until there is a proper fix for that.
just to make a quick test, can you try a hack like this, combined with a
setting of elvtune -r 128 -w 256, on top of 2.4.20rc1?
--- x/drivers/block/ll_rw_blk.c.~1~ Sat Nov 2 19:45:33 2002
+++ x/drivers/block/ll_rw_blk.c Sat Nov 16 19:44:20 2002
@@ -432,7 +432,7 @@ static void blk_init_free_list(request_q
si_meminfo(&si);
megs = si.totalram >> (20 - PAGE_SHIFT);
- nr_requests = 128;
+ nr_requests = 16;
if (megs < 32)
nr_requests /= 2;
blk_grow_request_list(q, nr_requests);
Andrea
Hi Andrea,
> just to make a quick test, can you try a hack like this, combined with a
> setting of elvtune -r 128 -w 256, on top of 2.4.20rc1?
>
> --- x/drivers/block/ll_rw_blk.c.~1~ Sat Nov 2 19:45:33 2002
> +++ x/drivers/block/ll_rw_blk.c Sat Nov 16 19:44:20 2002
> @@ -432,7 +432,7 @@ static void blk_init_free_list(request_q
>
> si_meminfo(&si);
> megs = si.totalram >> (20 - PAGE_SHIFT);
> - nr_requests = 128;
> + nr_requests = 16;
> if (megs < 32)
> nr_requests /= 2;
> blk_grow_request_list(q, nr_requests);
hehe, Andrea, it seems we both thought of the same thing ... :-) ... I am just
recompiling the kernel ... hang on.
ciao, Marc
On Saturday 16 November 2002 19:58, Marc-Christian Petersen wrote:
Hi Andrea,
> > just to make a quick test, can you try a hack like this, combined with a
> > setting of elvtune -r 128 -w 256, on top of 2.4.20rc1?
> >
> > --- x/drivers/block/ll_rw_blk.c.~1~ Sat Nov 2 19:45:33 2002
> > +++ x/drivers/block/ll_rw_blk.c Sat Nov 16 19:44:20 2002
> > @@ -432,7 +432,7 @@ static void blk_init_free_list(request_q
> >
> > si_meminfo(&si);
> > megs = si.totalram >> (20 - PAGE_SHIFT);
> > - nr_requests = 128;
> > + nr_requests = 16;
> > if (megs < 32)
> > nr_requests /= 2;
> > blk_grow_request_list(q, nr_requests);
>
> hehe, Andrea, it seems we both thought of the same thing ... :-) ... I am just
> recompiling the kernel ... hang on.
Andrea, this makes a difference! The pausings are much less frequent than
before, but still occur. Situation below.
Another thing I've noticed: while doing the "cp -a xyz abc" in
a loop, the available memory decreases a lot (silly caching of files).
Before copying:
---------------
MemTotal: 515992 kB
MemFree: 440876 kB
Buffers: 3808 kB
Cached: 24128 kB
While copying:
--------------
root@codeman:[/usr/src] # ./bla.sh
+ cd /usr/src
+ rm -rf linux-2.4.19-blaold linux-2.4.19-blanew
+ COUNT=0
+ echo 0
0
+ cp -a linux-2.4.19-vanilla linux-2.4.19-blaold
+ cp -a linux-2.4.19-vanilla linux-2.4.19-blanew
the above has not yet finished and memory looks like this:
MemTotal: 515992 kB
MemFree: 3348 kB
Buffers: 12244 kB
Cached: 451608 kB
Swap (500MB) is turned off. Pausings are almost none, whereas (without your
patch / elvtune) there were massive pauses.
If swap is turned on, swap usage grows a lot and then, once swap is used, we
have pauses (though still fewer than without your patch).
To free the used/cached memory I umount /usr/src.
I think your proposal is good. Anything else I should test/change? Any further
information/tests I can/should run?
Thanks a lot!
ciao, Marc
On Saturday 16 November 2002 22:55, Marc-Christian Petersen wrote:
Hi Andrea,
> Andrea, this makes a difference! The pausings are much less frequent than
> before, but still occur. Situation below.
So, after some time spent replacing 2.4.20-rc2 files with the following
files from 2.4.18 (which still compiles successfully)
drivers/block/ll_rw_blk.c
drivers/block/elevator.c
include/linux/elevator.h
those "pauses/stops" are _GONE_!
I've also tested all these situations on totally different machines. No
changes; they all behave the same.
I am really afraid that no one else (so it seems) noticed this but me! I
cannot believe that?! :-(
It is a really serious "bug" I cannot live with. Doing that high disk I/O
while even pings are ignored for the duration of the stall is an
absolutely unacceptable situation! *imho*
What do you think? :-)
ciao, Marc
> Andrea, this makes a difference! The pausings are much less frequent than
> before, but still occur. Situation below.
MP> So, after some time spent replacing 2.4.20-rc2 files with the following
MP> files from 2.4.18 (which still compiles successfully)
MP> drivers/block/ll_rw_blk.c
MP> drivers/block/elevator.c
MP> include/linux/elevator.h
MP> those "pauses/stops" are _GONE_!
What is really strange is that since I run the following in the
boot scripts:
echo "90 500 0 0 600000 600000 95 20 0" > /proc/sys/vm/bdflush
elvtune /dev/hda1 -r 2048 -w 131072
I never experience those stalls at all!
And that was surely not what I was expecting (cleaning buffers at 90% and
doing it synchronously at 95% gives little room for the first to complete
before new buffers get dirty; also, that elevator fiddling should force large
bursts. (Why am I doing this anyway? I wanted to configure the kernel et al.
so that as few disk reads/writes as possible occur.))