From: "Griffiths, Richard A"
To: "'Andrew Morton'", mgross
Cc: "Griffiths, Richard A", "'Jens Axboe'",
    Linux Kernel Mailing List, lse-tech@lists.sourceforge.net
Subject: RE: ext3 performance bottleneck as the number of spindles gets large
Date: Mon, 24 Jun 2002 15:51:35 -0700

Andrew,

I ran your write-and-fsync program.  Here are the results:

#cntrlrs x #drives
  2x2    avg = 25.04 MB/s    aggregate = 100   MB/s
  2x4    avg =  9.17 MB/s    aggregate = 110   MB/s
  4x2    avg = 14.55 MB/s    aggregate = 116.4 MB/s
  4x6    avg =  4.94 MB/s    aggregate = 118.6 MB/s

Your program only addresses large I/O (1MB requests) against a fairly
large file (4GB).  We did that as well with Bonnie++ (2GB file, 1MB I/O
requests); the results without the fsync option ran about 94-100 MB/s.
Our concern was creating a more real-world mix of I/O: how well does the
system scale across a variety of I/O request sizes on files of various
sizes?  Where we saw the worst overall scaling was with 8K requests.

Richard

mgross wrote:
> > ...
> > And please tell us some more details regarding the performance
> > bottleneck.  I assume that you mean that the I/O rate per disk slows
> > as more disks are added to an adapter?  Or does the total throughput
> > through the adapter fall as more disks are added?
>
> No, the I/O block write throughput for the system goes down as drives
> are added under this workload.  We measure the system throughput, not
> the per-drive throughput, but one could infer the per-drive throughput
> by dividing.
>
> Running bonnie++ with 300MB files doing 8KB sequential writes, we get
> the following system-wide throughput as a function of the number of
> drives attached and the number of adapters.
>
> One adapter:
>   1 drive per adapter     127,702 KB/sec
>   2 drives per adapter     93,283 KB/sec
>   6 drives per adapter     85,626 KB/sec

127 megabytes/sec to a single disk?  Either that's a very fast disk, or
you're using very small bytes :)

> 2 adapters:
>   1 drive per adapter      92,095 KB/sec
>   2 drives per adapter    110,956 KB/sec
>   6 drives per adapter    106,883 KB/sec
>
> 4 adapters:
>   1 drive per adapter     121,125 KB/sec
>   2 drives per adapter    117,575 KB/sec
>   6 drives per adapter    116,570 KB/sec

Possibly what is happening here is that a significant amount of dirty
data is being left in memory and is escaping the measurement period.
When you run the test against more disks, the *total* amount of dirty
memory is increased, so the kernel is forced to perform more writeback
within the measurement period.  So with two filesystems you're actually
performing more I/O.

You need either to ensure that all I/O occurs *within the measurement
interval*, or to make the test write so much data (relative to main
memory size) that any leftover unwritten data is insignificant.
bonnie++ is too complex for this work.
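[A minimal sketch of such a timed write-plus-fsync test; this is
illustrative code, not akpm's actual write-and-fsync.c linked below, and
the 1MB buffer and megabyte-count argument are assumptions:]

	/*
	 * Sketch: write N megabytes, fsync, and stop the clock only after
	 * the fsync returns, so no dirty data escapes the measurement
	 * interval.  Illustrative only -- not the actual write-and-fsync.c.
	 */
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/time.h>
	#include <unistd.h>

	static char buf[1024 * 1024];	/* 1MB per write(), as in the tests above */

	int main(int argc, char **argv)
	{
		struct timeval t0, t1;
		double secs;
		long mb, i;
		int fd;

		if (argc < 2) {
			fprintf(stderr, "usage: %s file [megabytes]\n", argv[0]);
			exit(1);
		}
		mb = (argc > 2) ? atol(argv[2]) : 4000;

		fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
		if (fd < 0) {
			perror("open");
			exit(1);
		}

		memset(buf, 0, sizeof(buf));
		gettimeofday(&t0, NULL);
		for (i = 0; i < mb; i++) {
			if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
				perror("write");
				exit(1);
			}
		}
		if (fsync(fd)) {	/* flush all dirty pages before stopping the clock */
			perror("fsync");
			exit(1);
		}
		gettimeofday(&t1, NULL);

		secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
		printf("%ld MB in %.2f sec = %.2f MB/sec\n", mb, secs, mb / secs);
		close(fd);
		return 0;
	}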
Suggest you use http://www.zip.com.au/~akpm/linux/write-and-fsync.c
which will just write and fsync a file.  Time how long that takes.  Or
you could experiment with bonnie++'s fsync option.

My suggestion is to work with this workload:

	for i in /mnt/1 /mnt/2 /mnt/3 /mnt/4 ...
	do
		write-and-fsync $i/foo 4000 &
	done

which will write a 4-gig file to each disk.  This defeats any caching
effects and is a much simpler workload, which will allow you to test
one thing in isolation.

So anyway.  All this possibly explains the "negative scalability" in
the single-adapter case.  For four adapters with one disk on each,
120 megs/sec seems reasonable, assuming the sustained write bandwidth
of a single disk is 30 megs/sec.

For four adapters with six disks on each, you should be doing better.
Something does appear to be wrong there.
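[For concreteness, the arithmetic behind those last two paragraphs,
assuming the ~30 MB/sec sustained per-disk write bandwidth cited above
and no shared bottleneck:

	4 adapters x 1 drive each  =  4 disks x 30 MB/sec = ~120 MB/sec  (measured: 121,125 KB/sec)
	4 adapters x 6 drives each = 24 disks x 30 MB/sec = ~720 MB/sec  (measured: 116,570 KB/sec)

The 24-disk configuration delivers roughly a sixth of what the disks
alone could sustain, so some shared resource, or the writeback behavior
described above, is presumably capping the aggregate.]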