2003-11-03 23:46:30

by Shirley Shi

Subject: All filesystems hang under long periods of heavy load (read and write) on a filesystem

Does anyone know why all filesystems hang after long periods of heavy
load on one of them? Once the filesystems hang, any command that touches
a filesystem, such as 'ls' or 'cat', blocks forever until the machine is
power-cycled.

I kept running the following script to read and write data on a single
filesystem (ext2 or XFS), since we need to run some tests on the storage.
At the beginning the system ran well. But after running the script for a
long time, such as half a day, one day or two days, all filesystems
would hang, including the root filesystem, although I put no heavy load
on it. The file (M.1) I used for reading and writing is about 2.5GB.


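#!/bin/csh -f
# Copy the 2.5GB source file M.1 to M.2 .. M.115, delete the files, repeat.
@ total = 115
while (1)
    @ cc = 2
    while ($cc <= $total)
        dd bs=512k if=/data/M.1 of=/data/M.$cc
        echo "copying $cc of $total..."
        @ cc = $cc + 1
    end
    # the M.* glob matches every file, including M.1
    rm -f /data/M.*
end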


I tried RH8.0 with kernel 2.4.18, and kernel 2.4.21 with XFS and the
rmap15j patch. I see the same issue with both kernels. Basically I have
two filesystems configured: one for root, formatted as ext3, and another
for the data, formatted as ext2 or XFS. With either ext2 or XFS, I have
the same problem.

I also tried different dual-CPU machines, as follows, but saw the same
problem.

- A dual-CPU machine with an on-board 320 SCSI controller (running the
  AIC79XX.o driver) connected to a disk drive as the root system, and a
  MegaRAID 320-4X controller connected to several disk drives. I created
  two H/W logical RAID devices and built the data filesystem on S/W
  RAID0 (/dev/md0).
- An HP dual-CPU machine with the HP Smart Array 5i connected to a disk
  drive as the root system, and a 160 SCSI controller connected to 13
  disk drives configured as S/W RAID0 (/dev/md0) for the data filesystem.
- A dual-CPU machine with an on-board 320 SCSI controller (with the
  AIC79XX.o driver) connected to a disk drive as the root system, and an
  FC controller connected to RAID storage.

Any comments would be appreciated.

Thanks,

Shirley



2003-11-04 07:34:20

by bert hubert

Subject: Re: All filesystems hang under long periods of heavy load (read and write) on a filesystem

On Mon, Nov 03, 2003 at 03:46:22PM -0800, Shirley Shi wrote:
> Does anyone know why all filesystems hang after long periods of heavy
> load on one of them? Once the filesystems hang, any command that touches
> a filesystem, such as 'ls' or 'cat', blocks forever until the machine is
> power-cycled.

I suggest you figure out what your systems have in common; if this were
universal, people would've noticed by now. If you have such a hang again,
can you show us the output of 'dmesg' and 'ps aux'? If at all possible,
can you run ShowTasks from the magic SysRq menu?
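
A rough sketch of how the requested output could be saved before the
machine is power-cycled. The function name collect_hang_info, the
DUMP_TASKS switch and the default output directory are made up for
illustration. It assumes the output directory is still writable during the
hang (for example a RAM-backed mount), and that the kernel provides
/proc/sysrq-trigger; where it does not, Alt+SysRq+T on the console requests
the same task dump.

```shell
#!/bin/sh
# Sketch only: collect_hang_info, DUMP_TASKS and the default directory
# are illustrative names, not part of any existing tool.
collect_hang_info() {
    outdir=${1:-/dev/shm/hang-info}
    mkdir -p "$outdir" || return 1

    # Process list; tasks blocked on I/O show "D" in the STAT column.
    ps aux > "$outdir/ps.txt" 2>&1

    # Optionally ask the kernel to dump every task (SysRq "t", ShowTasks).
    # Needs root and magic SysRq support; opt in with DUMP_TASKS=1.
    if [ "${DUMP_TASKS:-0}" = 1 ] && [ -w /proc/sysrq-trigger ]; then
        echo t > /proc/sysrq-trigger
    fi

    # Kernel ring buffer, captured last so it includes any task dump.
    dmesg > "$outdir/dmesg.txt" 2>&1
    return 0
}
```

Run as root with, for example, `DUMP_TASKS=1 collect_hang_info /dev/shm/hang-info`
and attach ps.txt and dmesg.txt to the reply.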

Good luck!


--
http://www.PowerDNS.com Open source, database driven DNS Software
http://lartc.org Linux Advanced Routing & Traffic Control HOWTO

2003-11-04 09:29:23

by Bernd Petrovitsch

Subject: Re: All filesystems hang under long periods of heavy load (read and write) on a filesystem

On Tue, 2003-11-04 at 00:46, Shirley Shi wrote:
> filesystem (ext2 or XFS), since we need to run some tests on the storage.
> At the beginning the system ran well. But after running the script for a
> long time, such as half a day, one day or two days, all filesystems
> would hang, including the root filesystem, although I put no heavy load
> on it. The file (M.1) I used for reading and writing is about 2.5GB.

Wild guess: Perhaps a memory leak somewhere in the kernel and it shows
up after $BIG_NUMBER of operations?
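
One way to check this guess would be to log memory statistics at intervals
while the test runs and compare the snapshots afterwards. A minimal sketch,
assuming a Linux /proc filesystem; the function name log_mem_snapshot and
the default log path are invented for illustration, and /proc/slabinfo is
often readable by root only.

```shell
#!/bin/sh
# Sketch only: log_mem_snapshot and the log path are illustrative names.
log_mem_snapshot() {
    logfile=${1:-/tmp/mem-trend.log}
    {
        # Marker line so snapshots can be told apart later.
        echo "=== $(date '+%Y-%m-%d %H:%M:%S')"
        # Overall memory usage.
        [ -r /proc/meminfo ] && cat /proc/meminfo
        # Per-cache kernel allocations; skipped when not readable.
        [ -r /proc/slabinfo ] && cat /proc/slabinfo
        true
    } >> "$logfile" 2>&1
}
```

Called from a loop such as `while true; do log_mem_snapshot; sleep 600; done`,
a slab cache or memory figure that only ever grows between snapshots would
support the leak theory.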

Bernd
--
Firmix Software GmbH http://www.firmix.at/
mobil: +43 664 4416156 fax: +43 1 7890849-55
Embedded Linux Development and Services

2003-11-04 15:55:48

by Randy.Dunlap

Subject: Re: All filesystems hang under long periods of heavy load (read and write) on a filesystem

On Mon, 03 Nov 2003 15:46:22 -0800 Shirley Shi wrote:

| Does anyone know why all filesystems hang after long periods of heavy
| load on one of them? Once the filesystems hang, any command that touches
| a filesystem, such as 'ls' or 'cat', blocks forever until the machine is
| power-cycled.
|
| I kept running the following script to read and write data on a single
| filesystem (ext2 or XFS), since we need to run some tests on the storage.
| At the beginning the system ran well. But after running the script for a
| long time, such as half a day, one day or two days, all filesystems
| would hang, including the root filesystem, although I put no heavy load
| on it. The file (M.1) I used for reading and writing is about 2.5GB.
|
|
| @ total = 115
| while (1)
| @ cc = 2
| while ($cc <= $total)
| dd bs=512k if=/data/M.1 of=/data/M.$cc
| echo "copying $cc of $total..."
| @ cc = $cc + 1
| end
| rm -f /data/M.*
| end
|
|
| I tried RH8.0 with kernel 2.4.18, and kernel 2.4.21 with XFS and the
| rmap15j patch. I see the same issue with both kernels. Basically I have
| two filesystems configured: one for root, formatted as ext3, and another
| for the data, formatted as ext2 or XFS. With either ext2 or XFS, I have
| the same problem.

Can you try a recent kernel, like 2.4.23-pre8 or -pre9?

--
~Randy