From: Bill Fink <billfink@mindspring.com>
Subject: Re: [RFC PATCH] ext4: fix 50% disk write performance regression
Date: Mon, 30 Aug 2010 21:14:37 -0400
Message-ID: <20100830211437.765d117e.billfink@mindspring.com>
References: <20100829231126.8d8b2086.billfink@mindspring.com>
	<20100830174000.GA6647@thunk.org>
	<20100830164958.edb64c63.bill@wizard.sci.gsfc.nasa.gov>
	<20100831003710.GA4272@thunk.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Cc: Bill Fink <bill@wizard.sci.gsfc.nasa.gov>,
	"adilger@sun.com" <adilger@sun.com>,
	"linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>,
	"Fink, William E. (GSFC-6061)" <william.e.fink@nasa.gov>
To: Ted Ts'o <tytso@mit.edu>
In-Reply-To: <20100831003710.GA4272@thunk.org>
Sender: linux-ext4-owner@vger.kernel.org

On Mon, 30 Aug 2010, Ted Ts'o wrote:

> On Mon, Aug 30, 2010 at 04:49:58PM -0400, Bill Fink wrote:
> > > Thanks for reporting it.  I'm going to have to take a closer look at
> > > why this makes a difference.  I'm going to guess though that what's
> > > going on is that we're posting writes in such a way that they're no
> > > longer aligned or ending at the end of a RAID5 stripe, causing a
> > > read-modify-write pass.  That would easily explain the write
> > > performance regression.
> > 
> > I'm not sure I understand.  How could calling or not calling
> > ext4_num_dirty_pages() (unpatched versus patched 2.6.35 kernel)
> > affect the write alignment?
> 
> Suppose you have 8 disks, with stripe size of 16k.  Assuming that
> you're only using one parity disk (i.e., RAID 5) and no spare disks,
> that means the optimal I/O size is 7*16k == 112k.  If we do a write
> which is smaller than 112k, or which is not a multiple of 112k, then
> the RAID subsystem will need to do a read-modify-write to update the
> parity disk.  Furthermore, the write had better be aligned on a 112k
> byte boundary.  The block allocator will guarantee that block #0 is
> aligned on a 112k boundary, but writes also have to be the right size
> in order to avoid the read-modify-write.
> 
> If we end up doing very small writes, then it can end up being quite
> disastrous for write performance.
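
For concreteness, the arithmetic in the example above can be sketched
as follows.  This is only a simplified model of the rule as described
(a write avoids read-modify-write when it starts on a full-stripe
boundary and is a whole number of full stripes); the function names
are made up for illustration, and a real RAID controller with a write
cache may behave differently.

```python
KIB = 1024

def full_stripe_bytes(num_disks, chunk_bytes, parity_disks=1):
    """Data bytes in one full stripe: data disks times chunk size."""
    return (num_disks - parity_disks) * chunk_bytes

def needs_read_modify_write(offset, length, stripe_bytes):
    """True if the write is misaligned or not a whole number of stripes.

    Simplified model: any partial stripe forces the RAID layer to read
    old data or parity before it can update the parity.
    """
    return offset % stripe_bytes != 0 or length % stripe_bytes != 0

# The example from the quoted text: 8 disks, 16k chunks, one parity disk.
stripe = full_stripe_bytes(8, 16 * KIB)                    # 7 * 16k == 112k
print(stripe // KIB)                                       # 112
print(needs_read_modify_write(0, 112 * KIB, stripe))       # False
print(needs_read_modify_write(0, 64 * KIB, stripe))        # True (too small)
print(needs_read_modify_write(4 * KIB, 112 * KIB, stripe)) # True (misaligned)
```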

I understand how unaligned writes can be very bad for performance.
That makes perfect sense.  What I don't understand is how just
calling or not calling ext4_num_dirty_pages() can affect the
write alignment, and that's the only difference between the
unpatched and patched 2.6.35 kernels.  I thought the only thing
ext4_num_dirty_pages() does is to count the number of ext4 dirty
pages.  How can that counting affect the write alignment?  I
guess there must be some subtle side effect of ext4_num_dirty_pages()
that I'm not getting.

> > I was wondering if the locking being done in ext4_num_dirty_pages()
> > could somehow be affecting the performance.  I did notice from top
> > that in the patched 2.6.35 kernel, the I/O wait time was generally
> > in the 60-65% range, while in the unpatched 2.6.35 kernel, it was
> > at a higher 75-80% range.  However, I don't know if that's just a
> > result of the lower performance, or a possible clue to its cause.
> 
> I/O wait time would tend to imply that the raid controller is taking
> longer to do the write updates, which would tend to confirm that we're
> doing more read-modify-write cycles.  If we were hitting spinlock
> contention, this would show up as more system CPU time consumed.

OK.  There wasn't more CPU utilization.  In the bad case it was lower,
roughly in proportion to the reduced level of performance.
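
In case it helps anyone reproduce the comparison without eyeballing
top, here is a rough sketch of how the I/O wait and system CPU shares
could be computed from two samples of the aggregate "cpu" line in
/proc/stat.  The function name and the sample values are made up for
illustration; only the field order (user, nice, system, idle, iowait,
irq, softirq, steal) comes from the /proc/stat format.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")

def cpu_shares(before, after):
    """Percentage of CPU time per state between two 'cpu' lines.

    Each argument is the aggregate 'cpu ...' line from /proc/stat,
    sampled at two points in time during the test.
    """
    b = [int(x) for x in before.split()[1:1 + len(FIELDS)]]
    a = [int(x) for x in after.split()[1:1 + len(FIELDS)]]
    delta = [y - x for x, y in zip(b, a)]
    total = sum(delta)
    if total <= 0:
        raise ValueError("samples are identical or out of order")
    return {name: 100.0 * d / total for name, d in zip(FIELDS, delta)}

# Made-up sample lines, not measurements from this thread.
s0 = "cpu 100 0 50 800 50 0 0 0 0 0"
s1 = "cpu 105 0 60 815 120 0 0 0 0 0"
shares = cpu_shares(s0, s1)
print(shares["iowait"], shares["system"])
```

Spinlock contention would be expected to raise the "system" share,
while slower writes at the controller would raise "iowait".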

						-Thanks

						-Bill