Date: Wed, 18 Jun 2003 16:22:42 -0700
From: Andrew Morton <akpm@digeo.com>
To: rwhron@earthlink.net
Cc: piggin@cyberone.com.au, linux-kernel@vger.kernel.org
Subject: Re: i/o benchmarks on 2.5.70* kernels
Message-Id: <20030618162242.115a68df.akpm@digeo.com>
In-Reply-To: <20030618225017.GA15635@rushmore>
References: <20030618225017.GA15635@rushmore>

rwhron@earthlink.net wrote:
>
> > tiobench on SMP results are not very good, lots of
> > fragmentation
>
> On uniprocessor with IDE disk, benchmarks look very
> different than SMP/SCSI.  Recent -mm on uniprocessor
> is frequently ahead of 2.5.70/2.5.71.
>
> All numbers below are uniprocessor/ide.

Yes, there's a huge difference between SMP and UP with ext3, and a
significant difference with ext2.

This is because the block allocator is borked: on SMP, with multiple
processes writing multiple files, the blocks get all interleaved.  It's
most notable in tiobench - it affects both read and write bandwidth
(read, because the file is already poorly laid out).

This became more serious with the Orlov allocator and its tendency to
put files closer together on disk.  But Orlov is a net win, and no way
can we revert its changes just because the block allocator is stupid.
The XFS allocator is smarter, and the effects are clear.

I'm still thinking about the block allocation problem.  I don't like
the ext2 prealloc stuff and would prefer not to graft it conceptually
into ext3.

I don't think the tiobench scenario is a serious problem in real life -
it's mainly a benchmarking problem.  But the block allocator problem
also shows up when multiple files are being slowly grown, and we need
to fix it for that real-world case.

> tiobench
> kernel         threads  MB/s    CPU%  avg lat(ms)  max lat(ms)     %>2s    %>10s  CPU eff
> 2.4.21-rc8aa1        8  15.49  47.53%       5.969      3460.66  0.00000  0.00000       33
> 2.5.70-mm7           8  14.86  48.31%       6.171      1345.95  0.00000  0.00000       31
> 2.5.70-mm6           8  14.78  47.73%       6.147      1292.34  0.00000  0.00000       31
> 2.5.70-mm1           8  14.56  45.80%       6.234      1370.54  0.00000  0.00000       32
> 2.5.70-mm3           8  14.45  45.42%       6.277      1481.31  0.00000  0.00000       32
> 2.5.70-mm5           8  14.41  45.67%       6.326      1477.68  0.00000  0.00000       32
> 2.5.69-ac1           8  10.60  15.37%       8.765       791.57  0.00000  0.00000       69
> 2.5.70               8  10.53  14.77%       8.800       866.26  0.00000  0.00000       71
> 2.5.71               8  10.21  12.78%       9.086      1126.91  0.00000  0.00000       80

-aa's huge readahead window seems to be as effective as anticipatory
scheduling here.
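Back on the block layout problem: the interleaving is easy enough to
see with a FIBMAP walk.  Below is a rough sketch (illustrative only -
the program name and structure are made up here; FIBMAP needs root,
and a zero mapping is treated as a hole):

/*
 * fragcheck.c - count discontiguous block runs in a file via FIBMAP.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <linux/fs.h>		/* FIBMAP, FIGETBSZ */

int main(int argc, char **argv)
{
	int fd, bsz, blk, prev = -1, frags = 0;
	struct stat st;
	long i, nblocks;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || fstat(fd, &st) < 0) {
		perror(argv[1]);
		return 1;
	}
	if (ioctl(fd, FIGETBSZ, &bsz) < 0) {
		perror("FIGETBSZ");
		return 1;
	}
	nblocks = (st.st_size + bsz - 1) / bsz;

	for (i = 0; i < nblocks; i++) {
		blk = i;			/* logical in, physical out */
		if (ioctl(fd, FIBMAP, &blk) < 0) {
			perror("FIBMAP");
			return 1;
		}
		if (blk == 0)
			continue;		/* hole */
		if (prev != -1 && blk != prev + 1)
			frags++;		/* discontiguity */
		prev = blk;
	}
	printf("%s: %ld blocks of %d bytes, %d discontiguities\n",
	       argv[1], nblocks, bsz, frags);
	close(fd);
	return 0;
}

Running it over the files which tiobench leaves behind after an SMP
run versus a UP run should make the difference obvious.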
> 2.5.70-mm6         128  13.24  37.65%      95.348     62124.61  2.36968  0.07667       35
> 2.5.70-mm5         128  13.05  36.58%      96.072     65599.84  2.24152  0.10757       36
> 2.5.70-mm3         128  13.01  37.47%      93.371    124939.92  2.02942  0.11635       35
> 2.5.70-mm7         128  12.65  39.38%     106.741     30835.10  1.99966  0.00152       32
> 2.5.70-mm1         128  12.57  33.23%      98.494    113042.54  2.30483  0.14649       38
> 2.4.21-rc8aa1      128   9.20  26.93%      95.763     80411.93  1.00212  0.14877       34
> 2.5.70             128   9.14  14.14%     152.896     47046.50  0.78164  0.01373       65
> 2.5.71             128   5.66   7.67%     209.858     36249.59  0.26436  0.00572       74

But as the number of threads is increased, -aa might have hit readahead
thrashing.

> dbench ext2 64 processes    Average     High      Low   (MB/sec)
> 2.4.21-rc8aa1                 42.13    42.51    41.68
> 2.4.21-rc3                    39.59    40.16    39.13
> 2.5.70                        31.47    40.51    18.76
> 2.5.70-mm3                    27.97    32.61    21.67
> 2.5.70-mm6                    27.41    36.07    21.19
> 2.5.70-mm7                    27.06    28.47    25.08
> 2.5.69-ac1                    26.45    37.44    20.19
> 2.5.70-mm5                    26.31    37.12    21.88
> 2.5.70-mm1                    18.59    24.52    14.06
> 2.5.71                         8.11    14.62     3.89

2.5's dbench throughput took a bullet ages ago, when I merged the below
fairness patch.  I don't have any plans to un-merge it.  The fact is,
allowing one writer to starve out all the others improves dbench
throughput but worsens latency in many other situations.  dbench can
get stuffed.

When a throttled writer is performing writeback and it encounters an
inode which is already under writeback, it is forced to wait on that
inode.  So the process sleeps until whoever is writing the inode out
finishes the writeout.

Which is OK - we want to throttle that process, and another process is
currently pumping data at the disk anyway.

But in one situation the delays are excessive: if one process is
performing a huge linear write, other processes end up waiting for a
very long time indeed.  It appears that this is because the writing
process just keeps on hogging the CPU - returning to userspace,
generating more dirty data, writing it out, sleeping in
get_request_wait(), etc.  All other throttled dirtiers get starved.

So remove the wait altogether if it is just a memory-cleansing
writeout.  The calling process will then throttle in
balance_dirty_pages()'s call to blk_congestion_wait().


diff -puN fs/fs-writeback.c~dont-wait-on-inode fs/fs-writeback.c
--- 25/fs/fs-writeback.c~dont-wait-on-inode	2003-02-03 23:47:06.000000000 -0800
+++ 25-akpm/fs/fs-writeback.c	2003-02-03 23:47:06.000000000 -0800
@@ -185,11 +185,14 @@
 static void
 __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 {
-	if (current_is_pdflush() && (inode->i_state & I_LOCK)) {
+	if ((wbc->sync_mode != WB_SYNC_ALL) && (inode->i_state & I_LOCK)) {
 		list_move(&inode->i_list, &inode->i_sb->s_dirty);
 		return;
 	}
 
+	/*
+	 * It's a data-integrity sync.  We must wait.
+	 */
 	while (inode->i_state & I_LOCK) {
 		__iget(inode);
 		spin_unlock(&inode_lock);
_
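For completeness, the starvation case which motivated the patch can be
demonstrated from userspace.  The reproducer below is a rough sketch
(the filenames and sizes are arbitrary): one process does a huge linear
write to a single file while another times small writes to files of its
own.  Without the patch the small writer's per-file latencies should
blow out to seconds; with it they should stay bounded by
balance_dirty_pages() throttling.

/*
 * starve.c - crude demo of one big dirtier starving a small one.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/time.h>
#include <sys/wait.h>

static double now(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
	static char buf[1 << 16];
	int i, fd;

	memset(buf, 0, sizeof(buf));

	if (fork() == 0) {
		/* bulk writer: one huge linear write to a single inode */
		fd = open("big", O_WRONLY|O_CREAT|O_TRUNC, 0644);
		if (fd < 0)
			exit(1);
		for (i = 0; i < 16384; i++)	/* ~1GB of dirty data */
			write(fd, buf, sizeof(buf));
		close(fd);
		exit(0);
	}

	/* small writer: time each modest write to its own file */
	for (i = 0; i < 100; i++) {
		char name[32];
		double t0 = now();

		sprintf(name, "small.%d", i);
		fd = open(name, O_WRONLY|O_CREAT|O_TRUNC, 0644);
		if (fd < 0)
			return 1;
		write(fd, buf, sizeof(buf));
		close(fd);
		printf("write %3d: %8.3f sec\n", i, now() - t0);
	}
	wait(NULL);
	return 0;
}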