2002-12-18 18:58:47

by Torben Frey

Subject: Horrible drive performance under concurrent i/o jobs (dlh problem?)

Hi list readers (and hopefully writers),

after going crazy with our main server at the company for over a week
now, this list is possibly my last hope - I am no kernel programmer but
suspect this to be a kernel problem. Reading through the list did not
help me (although for a while I thought it had, see below).

We are running a 3ware Escalade 7850 RAID controller with 7 IBM Deskstar
GXP 180 disks in RAID 5 mode, so it presents a single 1.11TB disk.
There is one partition on it, /dev/sda1, formatted with ReiserFS format
3.6. The board is an MSI 6501 (K7D Master) with 1GB RAM but only one
processor.

We were running the RAID smoothly while there was not much I/O - but
when we tried to produce large amounts of data last week, read and write
performance went down to unacceptably low rates. The load of the machine
went up to 8, 9, 10... and every disk access stopped processes (nedit,
ls) from responding for a few seconds. An "rm" of many small files left
the machine unresponsive even to "reboot"; I had to reset it.

So I copied everything away to a software RAID and tried all the disk
tuning stuff (min-readahead, max-readahead, bdflush, elvtune). Nothing
helped. Last Sunday I then found a hint about a bug introduced in kernel
2.4.19-pre6 which could be fixed using a "dlh" (disk latency hack) - or
by going back to 2.4.18. The latter is what I did (coming from 2.4.20).

It seemed to help, in that I could copy about 350GB back to the
RAID in about 3-4 hrs overnight. So I THOUGHT everything would be fine.
Since Tuesday morning my colleagues have been trying to produce large
amounts of data again - and every concurrent I/O operation blocks all
the others. We cannot work with that.

When I am working all alone on the disk, creating a 1 GB file with

time dd if=/dev/zero of=testfile bs=1G count=1

results in real times ranging from 14 seconds when I am very lucky up to
4 minutes more typically.
Watching vmstat 1 shows me that "bo" quickly drops from rates in the 10
or 20 thousands down to about 2 or 3 thousand when the runs take that
long.
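A note on what that time actually measures: the dd wall time only covers
getting the data into the page cache plus whatever writeback the kernel
forces along the way, which is part of why the numbers swing so much
between runs. A minimal userspace sketch that also counts the final flush
- purely illustrative, not something run in this thread; the file name and
chunk size are arbitrary - would be something like:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

int main(void)
{
	const size_t chunk = 1 << 20;            /* 1 MiB per write() */
	const size_t total = 1024 * chunk;       /* 1 GiB overall */
	char *buf = calloc(1, chunk);
	struct timeval t0, t1;
	int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0 || !buf) {
		perror("setup");
		return 1;
	}
	gettimeofday(&t0, NULL);
	for (size_t done = 0; done < total; done += chunk) {
		if (write(fd, buf, chunk) != (ssize_t)chunk) {
			perror("write");
			return 1;
		}
	}
	fsync(fd);      /* wait until the data has actually reached the device */
	gettimeofday(&t1, NULL);
	printf("write+fsync: %.1f s\n",
	       (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6);
	close(fd);
	free(buf);
	return 0;
}

Timing with and without the fsync() separates "how fast can I dirty
memory" from "how fast does the array actually drain".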

Can any of you please tell me how I can find out whether this is a
kernel problem or a hardware problem with the 3ware device? Is there a
way to tell the difference? Or could it come from running an SMP kernel
although I have only one CPU on my board?

I would be very happy about any answer, really!

Torben



2002-12-18 21:02:49

by Con Kolivas

Subject: Re: Horrible drive performance under concurrent i/o jobs (dlh problem?)



>So I copied everything away to a software raid and tried all the disk
>tuning stuff (min-, max-readahead, bdflush, elvtune). Nothing helped.
>Last Sunday I then found a hint about a bug introduced in kernel
>2.4.19-pre6 which could be fixed using a "dlh", disk latency hack - or
>going back to 2.4.18. The latter is what I did (from 2.4.20)

I made the dlh (disk latency hack) and it is related to a problem of system
response under heavy IO load, NOT the actual IO throughput, so this sounds
unrelated. However, I have seen what you describe with ReiserFS and IDE RAID
at least, and had it fixed by applying AA's stuck-in-D fix, which ReiserFS is
more prone to for some complicated reason. Give that a go.

In

http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.20aa1/

it is patch 9980_fix-pausing-2

Regards,
Con

2002-12-18 22:10:57

by Torben Frey

Subject: Re: Horrible drive performance under concurrent i/o jobs (dlh problem?)

Hi Con,

thanks for your fast reply. Unfortunately, I cannot patch a vanilla
2.4.20 kernel using patch -p1. The first hunk fails, and the others are
applied with offsets or even fuzz. Even after applying the first hunk
manually, compiling fails.

Do I need the other patches, too? Or a special version of the kernel?

Greetings from Munich,
Torben

2002-12-18 22:33:07

by Torben Frey

Subject: Re: Horrible drive performance under concurrent i/o jobs (dlh problem?)

Ok, I found the complete aa1 patch, so I am compiling it right now (hoping
the machine at work comes up again, otherwise I am stuck until tomorrow
morning...).

I will let you know about the results, for sure!

Torben

2002-12-18 22:29:26

by Andrew Morton

Subject: Re: Horrible drive performance under concurrent i/o jobs (dlh problem?)

Torben Frey wrote:
>
> Hi Con,
>
> thanks for your fast reply. Unfortunately - I cannot patch a vanilla
> 2.4.20 kernel using patch -p1. The first hunk fails, the other ones are
> found with offsets or even fuzz. Although I applied the first hunk
> manually, compiling fails.
>
> Do I need the other patches, too? Or a special version of the kernel?
>

Here's a diff against base 2.4.20. It may be a little out of date
wrt Andrea's latest but it should tell us if we're looking in the
right place.

I doubt it though. This problem will be exceedingly rare on kernels
which do not have a voluntary scheduling point in submit_bh(). SMP and
preemptible kernels will hit it, but rarely.

So please try this patch. Also it would be interesting to know if
read activity against the device fixes the problem. Try doing a

cat /dev/sda1 > /dev/null

and see if that unjams things. If so then yes, it's a queue unplugging
problem.


drivers/block/ll_rw_blk.c |   25 ++++++++++++++++++++-----
fs/buffer.c               |   22 +++++++++++++++++++++-
fs/reiserfs/inode.c       |    1 +
include/linux/pagemap.h   |    2 ++
kernel/ksyms.c            |    1 +
mm/filemap.c              |   14 ++++++++++++++
6 files changed, 59 insertions(+), 6 deletions(-)

--- 24/drivers/block/ll_rw_blk.c~fix-pausing Wed Dec 18 14:32:06 2002
+++ 24-akpm/drivers/block/ll_rw_blk.c Wed Dec 18 14:32:06 2002
@@ -590,12 +590,20 @@ static struct request *__get_request_wai
register struct request *rq;
DECLARE_WAITQUEUE(wait, current);

- generic_unplug_device(q);
add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
do {
set_current_state(TASK_UNINTERRUPTIBLE);
- if (q->rq[rw].count == 0)
+ if (q->rq[rw].count == 0) {
+ /*
+ * All we care about is not to stall if any request
+ * has been released after we set TASK_UNINTERRUPTIBLE.
+ * This is the most efficient place to unplug the queue
+ * in case we hit the race and we can get the request
+ * without waiting.
+ */
+ generic_unplug_device(q);
schedule();
+ }
spin_lock_irq(&io_request_lock);
rq = get_request(q, rw);
spin_unlock_irq(&io_request_lock);
@@ -829,9 +837,11 @@ void blkdev_release_request(struct reque
*/
if (q) {
list_add(&req->queue, &q->rq[rw].free);
- if (++q->rq[rw].count >= q->batch_requests &&
- waitqueue_active(&q->wait_for_requests[rw]))
- wake_up(&q->wait_for_requests[rw]);
+ if (++q->rq[rw].count >= q->batch_requests) {
+ smp_mb();
+ if (waitqueue_active(&q->wait_for_requests[rw]))
+ wake_up(&q->wait_for_requests[rw]);
+ }
}
}

@@ -1200,6 +1210,11 @@ void submit_bh(int rw, struct buffer_hea

generic_make_request(rw, bh);

+ /* fix race condition with wait_on_buffer() */
+ smp_mb(); /* spin_unlock may have inclusive semantics */
+ if (waitqueue_active(&bh->b_wait))
+ wake_up(&bh->b_wait);
+
switch (rw) {
case WRITE:
kstat.pgpgout += count;
--- 24/fs/buffer.c~fix-pausing Wed Dec 18 14:32:06 2002
+++ 24-akpm/fs/buffer.c Wed Dec 18 14:32:06 2002
@@ -153,10 +153,23 @@ void __wait_on_buffer(struct buffer_head
get_bh(bh);
add_wait_queue(&bh->b_wait, &wait);
do {
- run_task_queue(&tq_disk);
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
if (!buffer_locked(bh))
break;
+ /*
+ * We must read tq_disk in TQ_ACTIVE after the
+ * add_wait_queue effect is visible to other cpus.
+ * We could unplug some line above and it wouldn't matter,
+ * but we can't do that right after add_wait_queue
+ * without an smp_mb() in between because spin_unlock
+ * has inclusive semantics.
+ * Doing it here is the most efficient place, so we
+ * don't do a spurious unplug if we get a racy
+ * wakeup that makes buffer_locked return 0, and
+ * doing it here avoids an explicit smp_mb(); we
+ * rely on the implicit one in set_task_state.
+ */
+ run_task_queue(&tq_disk);
schedule();
} while (buffer_locked(bh));
tsk->state = TASK_RUNNING;
@@ -1512,6 +1525,9 @@ static int __block_write_full_page(struc

/* Done - end_buffer_io_async will unlock */
SetPageUptodate(page);
+
+ wakeup_page_waiters(page);
+
return 0;

out:
@@ -1543,6 +1559,7 @@ out:
} while (bh != head);
if (need_unlock)
UnlockPage(page);
+ wakeup_page_waiters(page);
return err;
}

@@ -1770,6 +1787,8 @@ int block_read_full_page(struct page *pa
else
submit_bh(READ, bh);
}
+
+ wakeup_page_waiters(page);

return 0;
}
@@ -2383,6 +2402,7 @@ int brw_page(int rw, struct page *page,
submit_bh(rw, bh);
bh = next;
} while (bh != head);
+ wakeup_page_waiters(page);
return 0;
}

--- 24/fs/reiserfs/inode.c~fix-pausing Wed Dec 18 14:32:06 2002
+++ 24-akpm/fs/reiserfs/inode.c Wed Dec 18 14:32:06 2002
@@ -1993,6 +1993,7 @@ static int reiserfs_write_full_page(stru
*/
if (nr) {
submit_bh_for_writepage(arr, nr) ;
+ wakeup_page_waiters(page);
} else {
UnlockPage(page) ;
}
--- 24/include/linux/pagemap.h~fix-pausing Wed Dec 18 14:32:06 2002
+++ 24-akpm/include/linux/pagemap.h Wed Dec 18 14:32:06 2002
@@ -97,6 +97,8 @@ static inline void wait_on_page(struct p
___wait_on_page(page);
}

+extern void wakeup_page_waiters(struct page * page);
+
/*
* Returns locked page at given index in given cache, creating it if needed.
*/
--- 24/kernel/ksyms.c~fix-pausing Wed Dec 18 14:32:06 2002
+++ 24-akpm/kernel/ksyms.c Wed Dec 18 14:32:06 2002
@@ -293,6 +293,7 @@ EXPORT_SYMBOL(filemap_fdatasync);
EXPORT_SYMBOL(filemap_fdatawait);
EXPORT_SYMBOL(lock_page);
EXPORT_SYMBOL(unlock_page);
+EXPORT_SYMBOL(wakeup_page_waiters);

/* device registration */
EXPORT_SYMBOL(register_chrdev);
--- 24/mm/filemap.c~fix-pausing Wed Dec 18 14:32:06 2002
+++ 24-akpm/mm/filemap.c Wed Dec 18 14:32:06 2002
@@ -909,6 +909,20 @@ void lock_page(struct page *page)
}

/*
+ * This must be called after every submit_bh with end_io
+ * callbacks that would result into the blkdev layer waking
+ * up the page after a queue unplug.
+ */
+void wakeup_page_waiters(struct page * page)
+{
+ wait_queue_head_t * head;
+
+ head = page_waitqueue(page);
+ if (waitqueue_active(head))
+ wake_up(head);
+}
+
+/*
* a rather lightweight function, finding and getting a reference to a
* hashed page atomically.
*/

_

2002-12-18 23:23:22

by Torben Frey

Subject: Re: Horrible drive performance under concurrent i/o jobs (dlh problem?)

Hi Andrew, hi Con,

> Here's a diff against base 2.4.20. It may be a little out of date
> wrt Andrea's latest but it should tell us if we're looking in the
> right place.
Ok, I did not run the complete 2.4.20aa1 kernel yet since I am not sure
if it is intended to be used, but I applied your patch, Andrew (thanks
for mailing it). It still does not fix the problem. One job doing much
I/O starts with about 80% CPU but then drops down to about 30% in the
first 40 seconds. Load goes from 0.00 to 2.4 within that time.

And I can see bdflush and my process marked with "D" in the process list.

Catting the device to /dev/null only made it worse :-(

Creating a 1GB file using dd takes about 1 minute compared to 16 seconds
without other jobs running.

Do you think it could be a ReiserFS problem on a RAID? Do you know of
anything else I could try? Sorry, but my knowledge doesn't reach that far.

TIA,
Torben

2002-12-18 23:39:05

by Andrew Morton

Subject: Re: Horrible drive performance under concurrent i/o jobs (dlh problem?)

Torben Frey wrote:
>
> Hi Andrew, hi Con,
>
> > Here's a diff against base 2.4.20. It may be a little out of date
> > wrt Andrea's latest but it should tell us if we're looking in the
> > right place.
> Ok, I did not run the complete 2.4.20aa1 kernel yet since I am not sure
> if it is intended to be used, but I applied your patch, Andrew (thanks
> for mailing it). It still does not fix the problem. One job doing much
> I/O starts with about 80% CPU but then drops down to about 30% in the
> first 40 seconds. Load goes from 0.00 to 2.4 within that time.
>
> And I can see bdflush and my process marked with "D" in the process list.
>
> Catting the device to /dev/null only made it worse :-(
>
> Creating a 1GB file using dd takes about 1 minute compared to 16 seconds
> without other jobs running.

err, now hang on.

I thought you said that this simple dd sometimes takes 14 seconds,
and sometimes takes 4 minutes.

Please describe _exactly_ what activity is happening, and against
which disks, when this problem occurs. The "other jobs".

> Do you think it could be a ReiserFS problem on a RAID?

Doubtful. As far as reiserfs (and the block layer) is concerned,
your raid array is just a single disk (is this correct??)

> Do you know of anything else I could try?

Try a different filesystem

Try a dd to the blockdevice itself (or cat /dev/zero > /dev/sda1)

Run `vmstat 1' and send us the output which corresponds to
the poor throughput.

Try a different RAID mode.

Pull some disks out.

2002-12-19 14:21:56

by Torben Frey

Subject: Re: Horrible drive performance under concurrent i/o jobs (dlh problem?)

Ok, so now I have set up a backup software Raid 0 formatted with

mke2fs -b 4096 -j -R stride=16

and mounted that device. After starting to back up stuff from the 3ware
controller to the software RAID I soon had complaints from my colleagues
because the system load went up to 4 and they had the same bad
responsiveness as before. Of course I CTRL-C'ed my "cp -av" while
watching "vmstat 1" in another window - and this is what surprised me:
when I stopped the copy job, there were 22 more seconds during which data
was still being written to the backup software RAID. Is this a hint at
where the problem could be? I see the same "feature" when I write to my 3ware.

My kernel is 2.4.20 with Andrew's patch from last night.

Greetings,
Torben

0 1 3 25292 2188 72056 759644 0 0 4168 23732 1069 780 6 26 68
0 1 3 25292 2176 72056 759644 0 0 0 15728 523 147 1 10 89
2 0 4 25292 2292 72056 759548 0 0 30404 20084 820 1149 4 85 11
0 1 3 25292 2828 72048 759012 0 0 40716 23772 845 1307 2 62 36
0 1 2 25292 3208 72280 758372 0 0 0 16532 573 231 6 13 81
1 0 2 25292 3216 72276 758404 0 0 4880 23800 530 264 2 10 88
0 1 3 25292 2224 72224 759420 0 0 22596 15620 695 602 5 38 57
0 1 3 25292 3996 72204 757692 0 0 26704 23808 765 924 3 38 59
1 0 3 25292 3932 72208 757760 0 0 14380 23996 651 548 3 23 74
0 1 3 25292 2180 72296 759408 0 0 39024 15948 850 1089 3 56 41
1 0 3 25292 3180 72308 758416 0 0 39028 17568 957 1265 1 57 42
0 1 2 25292 3296 72260 758328 0 0 36976 24000 837 1194 5 47 49
1 0 2 25292 3212 72264 758428 0 0 6164 22000 594 330 2 15 83
0 1 3 25292 2212 72268 759412 0 0 44492 16116 878 1433 2 63 35
1 0 3 25292 2896 72056 758952 0 0 19180 24556 683 912 1 32 67
0 0 2 25292 3300 72068 758672 0 0 10564 24296 636 518 1 23 76
HERE WAS MY CTRL-C
1 0 2 25292 3292 72068 758672 0 0 0 15820 511 146 1 15 84
0 0 2 25292 3276 72068 758672 0 0 0 27720 607 341 0 46 54
0 0 2 25292 3276 72068 758672 0 0 0 15912 529 167 1 12 87
0 0 2 25292 3232 72112 758672 0 0 0 23880 537 199 5 7 88
0 0 2 25292 3232 72112 758672 0 0 0 15872 558 198 0 8 92
0 0 2 25292 3232 72112 758672 0 0 0 23740 517 168 4 6 90
0 0 2 25292 4620 72112 757320 0 0 0 23800 528 1044 8 12 80
0 0 2 25292 4620 72112 757320 0 0 0 16100 522 177 4 6 90
0 0 2 25292 4516 72216 757320 0 0 0 24268 558 192 2 5 93
0 0 2 25292 4516 72216 757320 0 0 0 23552 525 179 3 2 95
0 0 2 25292 4516 72216 757320 0 0 0 15872 521 137 0 9 91
0 0 2 25292 4516 72216 757320 0 0 0 31924 515 179 4 7 89
0 0 2 25292 4516 72216 757320 0 0 0 17368 526 144 2 2 96
0 0 1 25292 4488 72244 757320 0 0 0 25308 533 195 1 8 91
0 0 1 25292 4488 72244 757320 0 0 0 16368 504 145 3 12 85
0 0 1 25292 4488 72244 757320 0 0 0 25320 558 247 0 28 72
0 0 1 25292 4484 72244 757320 0 0 0 16376 529 187 3 6 91
0 0 1 25292 4484 72244 757320 0 0 0 24576 508 140 1 3 96
0 0 1 25292 4464 72264 757320 0 0 0 24356 523 197 4 5 91
0 0 1 25292 4464 72264 757320 0 0 0 16364 516 148 1 1 98
0 0 1 25292 4464 72264 757320 0 0 0 24552 507 138 2 1 97
0 0 1 25292 4464 72264 757320 0 0 0 21020 522 174 2 17 81
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 0 0 25292 4448 72280 757320 0 0 0 656 347 142 0 8 92
0 0 0 25292 4432 72296 757320 0 0 0 796 353 179 4 2 94

HERE THE WRITING OUT STOPPED, 22 seconds later!!!

0 0 0 25292 4432 72296 757320 0 0 0 0 176 141 1 1 98
0 0 0 25292 4432 72296 757320 0 0 0 0 194 184 2 1 97
0 0 0 25292 4432 72296 757320 0 0 0 0 176 137 1 1 98
0 0 0 25292 4432 72296 757320 0 0 0 0 179 175 2 1 97
0 0 0 25292 4292 72312 757328 116 0 124 32 188 165 1 1 98
0 0 0 25292 4292 72312 757328 0 0 0 0 248 310 2 12 86
0 0 0 25292 4292 72312 757328 0 0 0 0 178 148 1 2 97
0 0 0 25292 4292 72312 757328 0 0 0 0 184 144 0 1 99
0 0 0 25292 4292 72312 757328 0 0 0 0 193 185 2 3 95
0 0 0 25292 4272 72332 757328 0 0 0 640 360 200 1 2 97

2002-12-20 01:37:38

by Nuno Silva

Subject: Re: Horrible drive performance under concurrent i/o jobs (dlh problem?)

Hello!

Torben Frey wrote:

[..snip..]

> watching "vmstat 1" in another window - and this is what surprised me:
> when I stopped the copy job, there were 22 more seconds when data was
> still written to the backup software raid. Is this a hint where the
> problem could be? I have the same "feature" when I write to my 3ware.
>

[..snip..]

AFAIK, this is because you have some GB of memory (RAM) that is being
used as disk cache. It took 22 seconds for the cached writes to be
flushed to the device.

Whether 22 seconds is too much for that amount of cached disk writes is
another story :)
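One quick way to put a number on the backlog - this is only an
illustrative sketch, not something anyone in the thread ran - is to time
a sync() right after interrupting the copy. On Linux, sync() waits for
the writeback to finish before returning, so the elapsed time is roughly
the "extra 22 seconds" of writing that vmstat was showing:

#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

int main(void)
{
	struct timeval t0, t1;

	gettimeofday(&t0, NULL);
	sync();		/* flush all dirty buffers; Linux waits for completion */
	gettimeofday(&t1, NULL);
	printf("sync took %.1f s\n",
	       (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6);
	return 0;
}

The same thing can of course be done by running the sync command from a
shell prompt under time.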

Maybe 3ware controllers are slow with many disks? Try the same with only
3 disks to eliminate this variable.

Regards,
Nuno Silva

2002-12-20 14:22:47

by Roger Larsson

Subject: Re: Horrible drive performance under concurrent i/o jobs (dlh problem?)

On Thursday 19 December 2002 15:29, Torben Frey wrote:
> 2 0 4 25292 2292 72056 759548 0 0 30404 20084 820 1149 4 85 11
> 0 1 3 25292 2828 72048 759012 0 0 40716 23772 845 1307 2 62 36
> 0 1 2 25292 3208 72280 758372 0 0 0 16532 573 231 6 13 81
> 1 0 2 25292 3216 72276 758404 0 0 4880 23800 530 264 2 10 88
>

Hmm... No processes running, but still lots of CPU used.
Are you sure you are running the disks with DMA?

You should also try to take a profile of a run
(see "man readprofile" and linux/Documentation/kernel-parameters.txt;
booting with profile=2 should do).

/RogerL

--
Roger Larsson
Skellefteå
Sweden

2002-12-20 20:32:22

by Joseph D. Wagner

Subject: RE: Horrible drive performance under concurrent i/o jobs (dlh problem?)

I hope I'm not entering this discussion too late to be of help.

> We are running a 3ware Escalade 7850 Raid controller
> with 7 IBM Deskstar GXP 180 disks in Raid 5 mode, so
> it builds a 1.11TB disk.

Raid 0 is better in terms of speed.

> There's one partition on it, /dev/sda1, formatted
> with ReiserFS format 3.6.

Oh no! This is bad news, both in terms of speed and security.

Lumping everything into one partition makes it impossible to protect against
the SUID/GUID security vulnerability (security), and requires all reads and
writes to be funneled through one partition (speed).

At an absolute MINIMUM, your partitions should be divided into:
/boot 032 MB
/tmp 512 MB
swap 1.5 - 3.0 times the amount of RAM,
not to exceed a 2 GB per swap partition limit
/root 5 GB
/var 10 GB
/usr 20% of what is leftover
/home 50% of what is leftover,
or at least 32 MB per user
/ 30% of what is leftover

The above numbers are my recommendations for your 1.1 TB Raid Array ONLY.
Please don't flame me about how those numbers aren't right for everybody's
systems.

Additionally, in your case, I'd add a /data partition to store this huge
amount of rapidly generated data you're talking about. That way, if you
should need to reformat, remount, whatever, the partition with the data, you
won't have to take down or redo the whole system.

> Or could it come from running an SMP kernel
> although I have only one CPU in my board?

This is a bad idea. The SMP kernel includes a whole load of extra code
which is totally unnecessary in a Uniprocessor environment. All that extra
code will do is slow down your system and take up extra memory.

Switch to a Uniprocessor kernel, or see my next point.

> The Board is an MSI 6501 (K7D Master) with 1GB RAM
> but only one processor.

Upgrade to 4 GB of RAM (if possible) and add a second processor (if
possible).

From what you're telling me, you're going to need all the RAM you can get.

Further, SCSI really works better with two or more processors. SCSI is
designed to take advantage of multiple processors. If you're not running
multiple processors, you might as well be running IDE, IMHO.

> Watching vmstat 1 shows me that "bo" drops quickly
> down from rates in the 10 or 20 thousands to low rates
> of about 2 or 3 thousands when the runs take so long.

I'm willing to bet that your system is spending all of its time flushing the
buffers. When you're generating gigabytes of data, your buffers are going
to need to be huge.

I have no idea how to do this.

Hope this helped.

Joseph Wagner

2002-12-20 22:29:27

by David Lang

Subject: RE: Horrible drive performance under concurrent i/o jobs (dlh problem?)

On Fri, 20 Dec 2002, Joseph D. Wagner wrote:

> Date: Fri, 20 Dec 2002 14:40:30 -0600
> From: Joseph D. Wagner <[email protected]>
> To: 'Torben Frey' <[email protected]>, [email protected]
> Subject: RE: Horrible drive performance under concurrent i/o jobs (dlh
> problem?)
>
> I hope I'm not entering this discussion too late to be of help.
>
> > We are running a 3ware Escalade 7850 Raid controller
> > with 7 IBM Deskstar GXP 180 disks in Raid 5 mode, so
> > it builds a 1.11TB disk.
>
> Raid 0 is better in terms of speed.

IIRC raid0 is striping with no redundancy, good for speed, very bad for
reliability. Raid 1 or 1+0 are the fastest modes to use that give
redundancy for the data, but they cost you half your disk space. In
addition, everything I have seen and experienced about the 3ware cards
indicates that they are very fast for these modes, but much slower for
Raid 4 or 5.

> > There's one partition on it, /dev/sda1, formatted
> > with ReiserFS format 3.6.
>
> Oh no! This is bad news, both in terms of speed and security.
>
> Lumping everything into one partition makes it impossible to protect against
> the SUID/GUID security vulnerability (security), and requires all reads and
> writes to be funneled through one partition (speed).

Here I have to disagree with you.

For default system installs that don't mount things noexec or readonly
there is little if any security difference between one large partition and
lots of small partitions. If you do have the time to use the mount flags to
protect your system then it is an advantage; unfortunately, in the real
world overworked sysadmins seldom have the time to do this (and if it's
not done at build time it becomes almost impossible to do later because
things get into places that would violate your mount restrictions).

As for speed, as long as you are on the same spindles there is no
definite speed gain for having lots of partitions, and there is a
definite cost to having lots of partitions. If you think about it, if you
have separate partitions you KNOW that you will have to seek across a
large portion of the disk to get from /root to /var, where they may not be
separated that much if they are one filesystem.

> At an absolute MINIMUM, your partitions should be divided into:
> /boot 032 MB
> /tmp 512 MB
> swap 1.5 - 3.0 times the amount of RAM,
> not to exceed a 2 GB per swap partition limit
> /root 5 GB
> /var 10 GB
> /usr 20% of what is leftover
> /home 50% of what is leftover,
> or at least 32 MB per user
> / 30% of what is leftover

The other problem with lots of partitions is that you almost always get
into a situation where one partition fills up and you have to put in a
significant amount of work to repartition things.

> The above numbers are my recommendations for your 1.1 TB Raid Array ONLY.
> Please don't flame me about how those numbers aren't right for everybody's
> systems.
>
> Additionally, in your case, I'd add a /data partition to store this huge
> amount of rapidly generated data you're talking about. That way, if you
> should need to reformat, remount, whatever, the partition with the data, you
> won't have to take down or redo the whole system.
>
> > Or could it come from running an SMP kernel
> > although I have only one CPU in my board?
>
> This is a bad idea. The SMP kernel includes a whole load of extra code
> which is totally unnecessary in a Uniprocessor environment. All that extra
> code will do is slow down your system and take up extra memory.
>
> Switch to a Uniprocessor kernel, or see my next point.

Definitely. Also make sure you compile your own kernel optimized for the
CPU that you have; this can be a 30% improvement with no other changes.

> > The Board is an MSI 6501 (K7D Master) with 1GB RAM
> > but only one processor.
>
> Upgrade to 4 GB of RAM (if possible) and add a second processor (if
> possible).
>
> From what you're telling me, you're going to need all the RAM you can get.

I am a little more shy about going beyond 1G of RAM. Things are getting
better, but there are still lots of things that can only exist in the
lowest gig of RAM, and if you have lots of memory you run the risk of
filling it up and having the machine lock up. If you need the RAM add
it, but on my machines that don't _NEED_ that much RAM I put in 1G and use
960M of it (highmem disabled). On systems doing a lot of disk IO I do put
large amounts of cache on the SCSI controllers.

> Further, SCSI really works better with two or more processors. SCSI is
> designed to take advantage of multiple processors. If you're not running
> multiple processors, you might as well be running IDE, IMHO.

Here I will disagree slightly. I see very significant advantages to running
SCSI even on single CPU machines. It all depends on your workload.

> > Watching vmstat 1 shows me that "bo" drops quickly
> > down from rates in the 10 or 20 thousands to low rates
> > of about 2 or 3 thousands when the runs take so long.
>
> I'm willing to bet that your system is spending all of its time flushing the
> buffers. When you're generating gigabytes of data, your buffers are going
> to need to be huge.
>
> I have no idea how to do this.

You definitely need to experiment with all the other types of journaling
filesystems. The performance tradeoffs of the different options are not
well understood (a combination of a lack of benchmarks of the different
types and the tendency of benchmarks to show the best aspect of a
filesystem while missing the problem areas), and performance changes
drastically based on workloads and tuning parameters (journal size, flush
delays, etc.).

David Lang

2002-12-21 05:52:06

by Joseph D. Wagner

Subject: RE: Horrible drive performance under concurrent i/o jobs (dlh problem?)

>> Raid 0 is better in terms of speed.

> IIRC raid0 is striping with no redundancy,
> good for speed, very bad for reliability.

I left that out, figuring that since he's a system administrator he'd
already know that.

>>> There's one partition on it, /dev/sda1

>> Oh no! This is bad news, both in terms of speed
>> and security.
>>
>> Lumping everything into one partition makes it
>> impossible to protect against the SUID/GUID
>> security vulnerability (security), and requires
>> all reads and writes to be funneled through one
>> partition (speed).

> Here I have to disagree with you.
> for default system installs... [trimmed]

Oh yeah, I agree that default installs are woefully insufficient and
inadequate when it comes to security. Red Hat should really get its act
together. OpenBSD, for example, has only had two security holes in the
default installation over the past three years.

However, if he doesn't use separate partitions in the first place, he's
going to have to go back and repartition from scratch just to plug that
security hole.

On the other hand, if he does use separate partitions from the beginning,
all he has to do is change some mount options in the /etc/fstab file.

> As for speed, as long as you are on the same
> spindles there is no definite speed gain for
> having lots of partitions and there is a
> definite cost to having lots of partitions.
> If you think about it, if you have separate
> partitions you KNOW that you will have to seek
> across a large portion of the disk to get
> from /root to /var where they may not be
> separated that much if they are one filesystem.

Ok, now here's where you're just plain wrong.

SHORT ANSWER: Segregating partitions reduces seek time. Period.

LONG ANSWER: Reads and writes tend to be grouped within a partition. For
example, if you're starting a program, you're going to be doing a lot of
reads somewhere in the /usr partition. If the program uses temporary files,
you're going to do a lot of reads & writes in the /tmp partition. If you're
saving a file, you're going to be doing lots of writes to the /home
partition. Hence, since most disk accesses occur in groups within a
partition, preference should be given to reducing seek time WITHIN a
partition, rather than reducing seek time BETWEEN partitions.

>> At an absolute MINIMUM, your partitions should be divided into:
>> /boot 032 MB
>> /tmp 512 MB
>> swap 1.5 - 3.0 times the amount of RAM,
>> not to exceed a 2 GB per swap partition limit
>> /root 5 GB
>> /var 10 GB
>> /usr 20% of what is leftover
>> /home 50% of what is leftover,
>> or at least 32 MB per user
>> / 30% of what is leftover

> The other problem with lots of partitions is
> that you almost always get into a situation
> where one partition fills up and you have to go
> to a significant amount of work to repartition
> things.

That's why my recommended numbers are as large as they are, and that's why
the /usr, /home, and / partitions are percentages of space leftover instead
of fixed sizes. I see no situation where my recommended numbers are too low
for his system.

>> Switch to a Uniprocessor kernel, or see my next point.

> Definantly.

Finally. We agree on something. 8-)

> [trimmed]...you run the possibility of filling [RAM]
> up and having the machine lockup.

Oh come on! Linux is the most stable, reliable, wonderful operating system
in existence. It's the be-all, end-all of operating systems. How can it
"lockup"?!?

Hahahahhahahahhahahhahah!

But seriously, you are correct in saying that Linux is piss poor in
utilizing memory beyond 1 GB, but hopefully someday either 1) the Linux
goons will get their act together and write some better code, or 2) he'll
switch to Windows 2000/XP Pro which CAN efficiently and effectively utilize
memory beyond 1 GB.

>> Further, SCSI really works better with two or more
>> processors. SCSI is designed to take advantage of
>> multiple processors. If you're not running multiple
>> processors, you might as well be running IDE, IMHO.

> Here I will disagree slightly. I see very significant
> advantages running SCSI even on single CPU machines.
> it all depends on your workload.

But isn't that the point? His workload is so high that he needs the second
CPU to manage other processes while this program is generating vast amounts
of data?

Joseph Wagner

2002-12-23 01:33:31

by David Lang

Subject: RE: Horrible drive performance under concurrent i/o jobs (dlh problem?)

On Sat, 21 Dec 2002, Joseph D. Wagner wrote:

> > As for speed, as long as you are on the same
> > spindles there is no definite speed gain for
> > having lots of partitions and there is a
> > definite cost to having lots of partitions.
> > If you think about it, if you have separate
> > partitions you KNOW that you will have to seek
> > across a large portion of the disk to get
> > from /root to /var where they may not be
> > separated that much if they are one filesystem.
>
> Ok, now here's where you're just plain wrong.
>
> SHORT ANSWER: Segregating partitions reduces seek time. Period.
>
> LONG ANSWER: Reads and writes tend to be grouped within a partition. For
> example, if you're starting a program, you're going to be doing a lot of
> reads somewhere in the /usr partition. If the program uses temporary files,
> you're going to do a lot of reads & writes in the /tmp partition. If you're
> saving a file, you're going to be doing lots of writes to the /home
> partition. Hence, since most disk accesses occur in groups within a
> partition, preference should be giving to reducing seek time WITHIN a
> partition, rather than reducing seek time BETWEEN partitions.

With one partition you MAY have to seek across the disk to get from one
file to another (depends on the optimization of the filesystem).

With multiple partitions you WILL have to seek across the disk because
files on one partition are forced by your partitioning to be on a
separate part of the drive.

If all your access is to the same file it won't matter how you are
partitioned, but if you read a file from one filesystem, put intermediate
results in /tmp, then put the final result back on the first filesystem,
you will end up doing a LOT of seeking between the partitions.

I am not saying that a single partition is necessarily faster, but I am
saying that multiple partitions are not a definite win, and under some
conditions can be a significant loss.

It's like the filesystem type: you need to look at what you are doing and
plan accordingly.

David Lang

2002-12-23 13:29:26

by Denis Vlasenko

Subject: Re: Horrible drive performance under concurrent i/o jobs (dlh problem?)

On 18 December 2002 17:06, Torben Frey wrote:
> Hi list readers (and hopefully writers),
>
> after getting crazy with our main server in the company for over a
> week now, this list is possibly my last help - I am no kernel
> programmer but suspect it to be a kernel problem. Reading through
> the list did not help me (although I already thought so, see below).
>
> We are running a 3ware Escalade 7850 Raid controller with 7 IBM
> Deskstar GXP 180 disks in Raid 5 mode, so it builds a 1.11TB disk.
> There's one partition on it, /dev/sda1, formatted with Reiserfs
> format 3.6. The Board is an MSI 6501 (K7D Master) with 1GB RAM but
> only one processor.
>
> We were running the Raid smoothly while there was not much I/O - but
> when we tried to produce large amounts of data last week, read and
> write performance went down to unacceptably low rates. The load of
> the machine went high up to 8,9,10... and every disk access stopped
> processes from responding for a few seconds (nedit, ls). An "rm" of
> many small files made the machine not react to "reboot" anymore, I
> had to reset it.

Can you provide solid numbers (say MB/s) for single dd's of varying
size? For concurrent dd's? etc...

> When I am working all alone on the disk creating a 1 GB file by
> time dd if=/dev/zero of=testfile bs=1G count=1
> results in real times from 14 seconds when I am very lucky up to 4
> minutes usually.
> Watching vmstat 1 shows me that "bo" drops quickly down from rates in
> the 10 or 20 thousands to low rates of about 2 or 3 thousands when
> the runs take so long.

Yes, and provide us with vmstat, top, cat /proc/meminfo output
and the like!
--
vda

2002-12-23 21:53:22

by Krzysztof Halasa

Subject: Re: Horrible drive performance under concurrent i/o jobs (dlh problem?)

Looks like at least partially off-topic, but...

"Joseph D. Wagner" <[email protected]> writes:

> Lumping everything into one partition makes it impossible to protect against
> the SUID/GUID security vulnerability (security),

Why not? chmod ug-s /usr/bin/file... works as expected (?)
You worry about users creating suid files in /home? If they can do that,
your system is already broken (into).

> and requires all reads and
> writes to be funneled through one partition (speed).

But that's higher speed, statistically. Especially if it would otherwise
have to run between your /tmp at the start of the disk and /home at the
end. Not that I recommend putting /home on the / fs.

> At an absolute MINIMUM, your partitions should be divided into:
> /boot 032 MB

Only needed if your BIOS has problems booting from /.

> /tmp 512 MB

Some of my machines wouldn't be able to work with that small /tmp.

> swap 1.5 - 3.0 times the amount of RAM,
> not to exceed a 2 GB per swap partition limit

Why? I thought you need as much swap as you need - and no more.
One of my machines doesn't even have any swap at all.

> /root 5 GB

For /root/.ssh etc? You must be using many authorized_keys :-)

> /var 10 GB

I always preferred to have separate /var/spool/{mqueue,postfix/etc} and
/var/spool/lp (and so on). The size should just be appropriate.

> /usr 20% of what is leftover

Hmm... Why not check how much it actually needs?

> /home 50% of what is leftover,
> or at least 32 MB per user

Why not check how much the users actually need?

> / 30% of what is leftover

And that's for?

Basically, you need as much space on a particular fs as you need.
No more, no less.

If (in a particular case) I have a disk bigger than I need (they no longer
sell <= 20 GB IDE), the unused space might well be just unused space.

> The above numbers are my recommendations for your 1.1 TB Raid Array ONLY.
> Please don't flame me about how those numbers aren't right for everybody's
> systems.

I'm not trying to. However, they might be right for nearly no system
(except yours, possibly).

> > The Board is an MSI 6501 (K7D Master) with 1GB RAM
> > but only one processor.
>
> Upgrade to 4 GB of RAM (if possible) and add a second processor (if
> possible).

Why do you think adding extra RAM and CPU can increase disk performance?
Cache, yes, but we don't even know if the problem is there.
CPU helps with RAID5, yes, but only with software RAID5, not with a
hardware RAID controller.

Distributing load over multiple disks/controllers may help better.

Writing 1 GB of data might well take 14 seconds or more, but it should
not take 4 minutes on an idle and correctly configured system.

I don't know the controller in question, but chances are it's just too
slow and doesn't have enough bandwidth. Or _it_ might require more RAM.

Switching from RAID5 to RAID1 might help as well, but it's quite expensive.

> Further, SCSI really works better with two or more processors. SCSI is
> designed to take advantage of multiple processors. If you're not running
> multiple processors, you might as well be running IDE, IMHO.

Not true. SCSI works best with just a multi-request (i.e. multiprocess)
system. The CPU is way faster than any disk, and it can switch processes
fast while they are waiting for disk I/O.

ps xa|grep D
--
Krzysztof Halasa
Network Administrator

2002-12-24 09:10:48

by Roy Sigurd Karlsbakk

Subject: Re: Horrible drive performance under concurrent i/o jobs (dlh problem?)

> SHORT ANSWER: Segregating partitions reduces seek time. Period.
>
> LONG ANSWER: Reads and writes tend to be grouped within a partition. For
> example, if you're starting a program, you're going to be doing a lot of
> reads somewhere in the /usr partition. If the program uses temporary files,
> you're going to do a lot of reads & writes in the /tmp partition. If you're
> saving a file, you're going to be doing lots of writes to the /home
> partition. Hence, since most disk accesses occur in groups within a
> partition, preference should be given to reducing seek time WITHIN a
> partition, rather than reducing seek time BETWEEN partitions.

keep in mind that only around half of the access time is due to the seek
itself (and thus to the partition layout)! Taking an IBM 120GXP as an
example:

Average seek: 8.5ms
Full stroke seek: 15.0ms
Time to rotate the disk one round: 1/(7200/60)*1000 = 8.3ms

Then, the sector you're looking for will, on average, be half a round
away from where you are, giving a minimum average rotational latency of
8.3/2 = 4.15ms, or something like half the seek time. Considering this,
you may gain at most <= 50% by using smaller partitions.

btw, does anyone know the zone layout of IBM drives?

roy

2002-12-24 17:13:19

by jw schultz

Subject: Re: Horrible drive performance under concurrent i/o jobs (dlh problem?)

On Tue, Dec 24, 2002 at 10:18:52AM +0100, Roy Sigurd Karlsbakk wrote:
> >SHORT ANSWER: Segregating partitions reduces seek time. Period.
> >
> >LONG ANSWER: Reads and writes tend to be grouped within a partition. For
> >example, if you're starting a program, you're going to be doing a lot of
> >reads somewhere in the /usr partition. If the program uses temporary files,
> >you're going to do a lot of reads & writes in the /tmp partition. If you're
> >saving a file, you're going to be doing lots of writes to the /home
> >partition. Hence, since most disk accesses occur in groups within a
> >partition, preference should be given to reducing seek time WITHIN a
> >partition, rather than reducing seek time BETWEEN partitions.
>
> keep in mind that only around half of the seek time is because of the
> partition! Taking an IBM 120GXP as an example:
>
> Average seek: 8.5ms
> Full stroke seek: 15.0ms
> Time to rotate disk one round: 1/(7200/60)*1000 = 8.3ms

I'm afraid your math is off.

The rotational frequency should be 7200*60/sec which makes
for 2.31 us which would produce an average rotational
latency of 1.16us if such a condition even still applies.
My expectation is that the whole track is buffered starting
from the first sector that syncs thereby making the time
rotfreq + rotfreq/nsect or something similar. In any case
the rotational latency or frequency is orders of magnitude
smaller than the seek time, even between adjacent
tracks/cylinders.

If the stated average seek is 50% of full stroke and not
based on reality, then 76% of the cost of an average seek is
attributed to distance, and likewise 87% of the cost of a
full stroke. Based on that I'd say the seek distance is a much
bigger player than you are assuming. If it weren't, the
value of elevators would be much less.

>
> Then, the sector you're looking for, will, by average, be half a round
> away from where you are, and thus, giving the minimum average seek time
> 8.3/2 = 4.15ms or something like half the seek time. Considering this,
> you may gain a maximum <= 50% gain in using smaller partitions.
>
> btw. anyone that knows the zone layout on IBM drives?

Having chimed in, I'll also mention that having the
filesystems right-sized and small should produce better
locality of reference for multiple files and large files,
given the tendency of our filesystems to spread their
directories across the cylinders. One big filesystem is as
likely to have the assorted files spread from one end of the
disk to the other as you will get with several smaller ones.
Witness the discussions that introduced the Orlov allocator
to ext[23].

As for repartitioning when a filesystem outgrows its
partition, that is reason #1 for LVM. Care should be taken,
though, because LVM can also destroy locality through
discontinuous extent allocation.

--
________________________________________________________________
J.W. Schultz Pegasystems Technologies
email address: [email protected]

Remember Cernan and Schmitt

2002-12-24 20:52:50

by Jeremy Fitzhardinge

Subject: Re: Horrible drive performance under concurrent i/o jobs (dlh problem?)

On Tue, 2002-12-24 at 09:21, jw schultz wrote:
> I'm afraid your math is off.
>
> The rotational frequency should be 7200*60/sec which makes
> for 2.31 us which would produce an average rotational
> latency of 1.16us if such a condition even still applies.

No, I think you're just a bit off. A 7200RPM drive *does* revolve in
8.3ms. A 2.31us rotation time would be 432900 RPS or approx. 26,000,000
RPM. Which is pushing it.
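To spell the arithmetic out (these are just the numbers the thread is
already using, written as one line of math):

    T_{\mathrm{rot}} = \frac{60\ \mathrm{s/min}}{7200\ \mathrm{rev/min}} \approx 8.33\ \mathrm{ms},
    \qquad
    \overline{t}_{\mathrm{rotational\ latency}} = \frac{T_{\mathrm{rot}}}{2} \approx 4.17\ \mathrm{ms}

The 2.31us figure comes from multiplying 7200 rev/min by 60 instead of
dividing 60 seconds by 7200.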

The 8.3ms rotational latency is clearly visible as a wide band if you
graph access time against distance.
http://www.goop.org/~jeremy/seek-buf.eps is an example of a 7200RPM
Western Digital drive.

> My expectation is that the whole track is buffered starting
> from the first sector that syncs thereby making the time
> rotfreq + rotfreq/nsect or something similar. In any case
> the rotational latency or frequency is orders of magnitude
> smaller than the seek time, even between adjacent
> tracks/cylinders.

Track to track seek time is typically around 1ms. Rotational latency is
often the dominating factor in access time.

> If the the stated average seek is 50% of full stroke and not
> based on reality then 76% of the cost of an average seek is
> attributed to distance and likewise 87% of the cost of a
> full.

Well, average seek is a vague concept. If you assume that all seeks are
randomly distributed, then it might mean a half-stroke seek. But almost
all seeks are short, so the weighting means that short seek time is the
most important to optimise (which drive vendors do: they use techniques
like dumping more current into the coils for short seeks because they
know it is short; if they dumped the same current for a long seek, it
would burn things out). Also, there are only two cylinders between
which you can have the maximal seek, whereas every adjacent pair of
cylinders can have a minimal seek; in other words the physical nature of
the drive means that short seeks will dominate, even with random
seeking.

Rotational latency tends to dominate these days: a fast drive can do a
track to track seek in 1ms, and full-stroke seek in 9ms or so. That
means that a 1ms track-to-track seek can take longer than a full stroke
if you include the 8.3ms variation.

Elevators are important for keeping seeks short. This is partly to
reduce the seek time, but mostly because drives are optimised so that
the track-to-track skew is set up to minimise rotational latency. The
further you seek, the more likely you are to miss the rotlat deadline
and have to take a full rotation penalty.

J

2002-12-25 01:26:30

by Rik van Riel

Subject: Re: Horrible drive performance under concurrent i/o jobs (dlh problem?)

On Tue, 24 Dec 2002, jw schultz wrote:

> The rotational frequency should be 7200*60/sec which makes
> for 2.31 us which would produce an average rotational
> latency of 1.16us if such a condition even still applies.

That would be 432000 rotations per second, meaning that the
edge of a 3.5" disk would travel at almost 120 kilometers per
second and be stressed by some pretty impressive G forces,
which I'm too lazy to calculate.

Good thing a 7200 RPM disk only spins 120 times a second,
that's a lot safer in consumer applications. ;)

Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://guru.conectiva.com/
Current spamtrap: [email protected]

2002-12-25 01:54:53

by jw schultz

Subject: Re: Horrible drive performance under concurrent i/o jobs (dlh problem?)

On Tue, Dec 24, 2002 at 09:21:23AM -0800, jw schultz wrote:
> On Tue, Dec 24, 2002 at 10:18:52AM +0100, Roy Sigurd Karlsbakk wrote:
> > keep in mind that only around half of the seek time is because of the
> > partition! Taking an IBM 120GXP as an example:
> >
> > Average seek: 8.5ms
> > Full stroke seek: 15.0ms
> > Time to rotate disk one round: 1/(7200/60)*1000 = 8.3ms
>
> I'm afraid your math is off.
>
> The rotational frequency should be 7200*60/sec which makes
> for 2.31 us which would produce an average rotational
> latency of 1.16us if such a condition even still applies.
> My expectation is that the whole track is buffered starting
> from the first sector that syncs thereby making the time
> rotfreq + rotfreq/nsect or something similar. In any case
> the rotational latency or frequency is orders of magnitude
> smaller than the seek time, even between adjacent
> tracks/cylinders.
>
> If the stated average seek is 50% of full stroke and not
> based on reality then 76% of the cost of an average seek is
> attributed to distance and likewise 87% of the cost of a
> full. Based on that i'd say the seek distance is a much
> bigger player than you are assuming. If it weren't the
> value of elevators would be much less.

No. Your math is correct. Mine is upside down. Don't know
where that came from. Apologies for the bad smell.

--
________________________________________________________________
J.W. Schultz Pegasystems Technologies
email address: [email protected]

Remember Cernan and Schmitt

2002-12-25 03:33:01

by Barry K. Nathan

Subject: Re: Horrible drive performance under concurrent i/o jobs (dlh problem?)

On Tue, Dec 24, 2002 at 09:21:23AM -0800, jw schultz wrote:
> If the stated average seek is 50% of full stroke and not

no, stated average seek = one third (~33%) of full stroke

google for "average seek stroke" if you don't believe me :)

-Barry K. Nathan <[email protected]>

2002-12-27 12:56:47

by Torben Frey

Subject: Re: Horrible drive performance under concurrent i/o jobs (dlh problem?)

Hi Nuno,

> AFAIK, this is because you have some GB of memory (RAM) that is being
> used as disk cache. It took 22 seconds for the cached writes to be
> flushed to the device.

Yes, I guess you are right... it must be the flushing of my buffers; altogether it writes out
exactly the amount of data I would expect (the "22 seconds block" included).

Regards,
Torben

PS: Sorry for my long delay, but I was at home over Xmas and could not reply earlier.