2005-02-15 01:49:59

by Peter Chubb

Subject: Repeatable hang with XFS under 2.6.11-rc4


Running Reaim-7 on a 4G ram disk with 4 processors on
Itanium... Every few runs, as the multiprocessing level increases, we
see 22 processes hung in sync(), all except one waiting in
sync_filesystems() and that one waiting in pagebuf_iowait().

There's lots of free memory, the ram-disk is not full, ...
Load average is low; nothing in the logs or on the console.

root@trixie:/proc# vmstat 2
procs -----------memory----------- ---swap-- -----io---- --system-- -----cpu----
 r  b swpd     free    buff  cache   si   so    bi    bo   in    cs us sy  id wa
 0  0    0 23027552 1091472 218496    0    0     1 42107   12     6  1 21  78  0
 0  0    0 23027552 1091472 218496    0    0     0     0 4110    10  0  0 100  0
 0  0    0 23027552 1091472 218496    0    0     0     0 4109     8  0  0 100  0
 0  0    0 23027488 1091472 218496    0    0     0    32 4114    15  0  0 100  0
 0  0    0 23027488 1091472 218496    0    0     0     0 4110     9  0  0 100  0
 0  0    0 23027488 1091472 218496    0    0     0     0 4109     9  0  0 100  0

root@trixie:/proc/fs/xfs# df /mnt/ram-disk
Filesystem     1K-blocks   Used Available Use% Mounted on
/dev/ram1        1038336 127800    910536  13% /mnt/ram-disk


--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*


2005-02-15 02:30:51

by Nathan Scott

Subject: Re: Repeatable hang with XFS under 2.6.11-rc4

Hi Peter,

On Tue, Feb 15, 2005 at 12:49:45PM +1100, Peter Chubb wrote:
> Running Reaim-7 on a 4G ram disk with 4 processors on
> Itanium... Every few runs, as the multiprocessing level increases, we
> see 22 processes hung in sync(), all except one waiting in
> sync_filesystems() and that one waiting in pagebuf_iowait().

That would indicate either XFS dropped the IO completion for a
metadata buffer, or the driver didn't pass it back to us. Hard
to say which; is this with default mkfs options? If so, try
using -ssize=4k at mkfs time, that'll get rid of some of the
more unusual IO patterns which XFS can send down. Also try a
blocksize the same as the pagesize (16K there I would guess).
If the behaviour changes, these'll give us pointers and help
isolate where the problem is.

cheers.

--
Nathan
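
The two mkfs experiments Nathan suggests could be tried roughly as follows. This is only a sketch under stated assumptions: the device name is taken from the df output in the original report, the -f (force overwrite) flag is an addition not mentioned in the thread, and treating the two suggestions as separate runs rather than one combined mkfs is a guess at the intent. The helper name print_mkfs_cmds is hypothetical. The script prints the commands instead of running them, because mkfs.xfs destroys the existing filesystem.

```shell
#!/bin/sh
# Dry run: print the mkfs.xfs invocations instead of executing them,
# since mkfs.xfs -f overwrites whatever is on the target device.

print_mkfs_cmds() {
    dev=$1        # target block device
    pagesize=$2   # page size in bytes

    # Variant 1: 4k sector size, everything else left at the defaults.
    echo "mkfs.xfs -f -s size=4k $dev"

    # Variant 2: filesystem block size equal to the page size.
    echo "mkfs.xfs -f -b size=$pagesize $dev"
}

# /dev/ram1 is the device from the df output in the report (assumption);
# getconf PAGESIZE reports the page size of the machine this runs on.
print_mkfs_cmds "${DEV:-/dev/ram1}" "$(getconf PAGESIZE)"
```

After unmounting /mnt/ram-disk, each printed command would be run by hand, followed by a remount and a rerun of the benchmark, noting whether the hang in sync() still reproduces.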