Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id ; Wed, 18 Dec 2002 13:58:47 -0500
Received: (majordomo@vger.kernel.org) by vger.kernel.org id ; Wed, 18 Dec 2002 13:58:47 -0500
Received: from twister.ispgateway.de ([62.67.200.3]:33029 "HELO twister.ispgateway.de") by vger.kernel.org with SMTP id ; Wed, 18 Dec 2002 13:58:45 -0500
Message-ID: <3E00C738.1070506@mailsammler.de>
Date: Wed, 18 Dec 2002 20:06:32 +0100
From: Torben Frey
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.2.1) Gecko/20021130
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: linux-kernel@vger.kernel.org
Subject: Horrible drive performance under concurrent i/o jobs (dlh problem?)
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 2586
Lines: 55

Hi list readers (and hopefully writers),

after going crazy over our company's main server for more than a week now, this list is possibly my last hope - I am no kernel programmer, but I suspect a kernel problem. Reading through the list did not help me (although I thought it had for a while, see below).

We are running a 3ware Escalade 7850 RAID controller with 7 IBM Deskstar GXP 180 disks in RAID 5 mode, which gives a 1.11 TB array. There is one partition on it, /dev/sda1, formatted with ReiserFS format 3.6. The board is an MSI 6501 (K7D Master) with 1 GB RAM but only one processor.

The RAID ran smoothly as long as there was not much I/O - but when we tried to produce large amounts of data last week, read and write performance dropped to unacceptably low rates. The load of the machine climbed to 8, 9, 10..., and every disk access stopped processes (nedit, ls) from responding for a few seconds. An "rm" of many small files left the machine unable to even react to "reboot"; I had to reset it.

So I copied everything away to a software RAID and tried all the usual disk tuning (min-readahead, max-readahead, bdflush, elvtune). Nothing helped.

Last Sunday I then found a hint about a bug introduced in kernel 2.4.19-pre6 which could be worked around either with a "dlh" (disk latency hack) or by going back to 2.4.18. The latter is what I did (coming from 2.4.20). It seemed to help, in that I could copy about 350 GB back to the RAID in about 3-4 hours overnight. So I THOUGHT everything would be fine.

Since Tuesday morning my colleagues have been trying to produce large amounts of data again - and every concurrent I/O operation blocks all the others. We cannot work like that. Even when I am working all alone on the disk, creating a 1 GB file with

	time dd if=/dev/zero of=testfile bs=1G count=1

gives real times ranging from 14 seconds when I am very lucky up to 4 minutes usually. Watching "vmstat 1" shows that "bo" quickly drops from rates of 10,000-20,000 down to about 2,000-3,000 on the runs that take so long.

Can any of you please tell me how I can find out whether this is a kernel problem or a problem with the 3ware hardware? Is there a way to tell the difference? Or could it come from running an SMP kernel although I have only one CPU in the board?

I would be very happy about every answer, really!

Torben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
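
A minimal sketch of one way to narrow down "kernel vs. 3ware hardware", assuming a second disk that is not attached to the 3ware controller is available: run the identical timed write against the array and against the reference disk and compare. The mount points below (/mnt/raid, /tmp) are assumptions, not taken from the mail, and bs=1M count=1024 is used instead of bs=1G because dd with bs=1G tries to allocate a single 1 GB buffer in one piece, which is hard on a machine with 1 GB of RAM.

	#!/bin/sh
	# Sketch: compare streaming-write times on the 3ware array vs. a
	# disk on another controller. Both paths below are assumptions.
	RAID_FILE=/mnt/raid/testfile    # on /dev/sda1 (3ware array)
	REF_FILE=/tmp/testfile          # on a non-3ware disk

	for f in $RAID_FILE $REF_FILE; do
	    sync                        # flush dirty pages before timing
	    echo "writing 1 GB to $f"
	    # Include the final sync in the timed command so buffered
	    # writes cannot make either device look faster than it is.
	    time sh -c "dd if=/dev/zero of=$f bs=1M count=1024 && sync"
	    rm -f $f
	done

Running "vmstat 1" in a second terminal during each pass shows whether "bo" collapses only while the 3ware device is the target (pointing toward the controller or its driver) or on both devices (pointing toward kernel VM/elevator behaviour). This only narrows the search; it is not a definitive diagnosis.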