2009-07-02 08:01:09

by Yanmin Zhang

[permalink] [raw]
Subject: fio mmap sequential read 30% regression

Comparing with 2.6.30's result, fio mmap sequential read has
about a 30% regression on one of my Stoakley machines with 1
JBOD (7 SAS disks) with kernel 2.6.31-rc1.

Every disk has 2 partitions and 4 1-GB files per partition. I start
10 processes per disk to do sequential mmap reads.
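
For reference, a job file roughly along these lines reproduces the
access pattern (the mount points, block size and runtime are
illustrative placeholders, not the exact job file I used):

; one disk: two partitions, 4 x 1 GB files each, 10 readers total
[global]
ioengine=mmap
rw=read
bs=4k
size=4g
nrfiles=4
numjobs=5
runtime=300
time_based

[d0p1]
directory=/mnt/d0p1

[d0p2]
directory=/mnt/d0p2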

Bisected it down to the patch below.

ef00e08e26dd5d84271ef706262506b82195e752 is first bad commit
commit ef00e08e26dd5d84271ef706262506b82195e752
Author: Linus Torvalds <[email protected]>
Date: Tue Jun 16 15:31:25 2009 -0700

readahead: clean up and simplify the code for filemap page fault readahead

This shouldn't really change behavior all that much, but the single rather
complex function with read-ahead inside a loop etc is broken up into more
manageable pieces.

The behaviour is also less subtle, with the read-ahead being done up-front
rather than inside some subtle loop and thus avoiding the now unnecessary
extra state variables (ie "did_readaround" is gone).

Fengguang: the code split in fact fixed a bug reported by Pavel Levshin:
the PGMAJFAULT accounting used to be bypassed when MADV_RANDOM is set, in
which case the original code will directly jump to no_cached_page reading.


The bisect is stable.

Yanmin


2009-07-02 13:42:18

by Fengguang Wu

[permalink] [raw]
Subject: Re: fio mmap sequential read 30% regression

On Thu, Jul 02, 2009 at 04:01:22PM +0800, Zhang, Yanmin wrote:
> Comparing with 2.6.30's result, fio mmap sequential read has
> about a 30% regression on one of my Stoakley machines with 1
> JBOD (7 SAS disks) with kernel 2.6.31-rc1.
>
> Every disk has 2 partitions and 4 1-GB files per partition. I start
> 10 processes per disk to do sequential mmap reads.
>
> Bisected it down to the patch below.
>
> ef00e08e26dd5d84271ef706262506b82195e752 is first bad commit
> commit ef00e08e26dd5d84271ef706262506b82195e752
> Author: Linus Torvalds <[email protected]>
> Date: Tue Jun 16 15:31:25 2009 -0700
>
> readahead: clean up and simplify the code for filemap page fault readahead
>
> This shouldn't really change behavior all that much, but the single rather
> complex function with read-ahead inside a loop etc is broken up into more
> manageable pieces.
>
> The behaviour is also less subtle, with the read-ahead being done up-front
> rather than inside some subtle loop and thus avoiding the now unnecessary
> extra state variables (ie "did_readaround" is gone).
>
> Fengguang: the code split in fact fixed a bug reported by Pavel Levshin:
> the PGMAJFAULT accounting used to be bypassed when MADV_RANDOM is set, in
> which case the original code will directly jump to no_cached_page reading.
>
>
> The bisect is stable.

Let me take care of this bug: it may well be caused by my other readahead patches.

Thanks,
Fengguang

2009-07-02 17:10:43

by Linus Torvalds

[permalink] [raw]
Subject: Re: fio mmap sequential read 30% regression



On Thu, 2 Jul 2009, Zhang, Yanmin wrote:
>
> Bisected it down to the patch below.
>
> ef00e08e26dd5d84271ef706262506b82195e752 ("readahead: clean up and
> simplify the code for filemap page fault readahead")
>
> The bisect is stable.

Hmm. That patch is the one patch in the whole series that should _not_
have changed any behavior, but it got forward-ported by something like 8
months, so maybe something changed in the meantime.

Also, I do think it changes the exact details of "ra->mmap_miss" (along
with the fault counters). The old code was really quite odd in the
statistics department, and would do some odd things wrt the miss counts
and statistics - because it had this re-try loop that essentially updated
the stats twice (once on the first try, and then once more when doing the
"retry_find" thing).

The patch gets rid of the hard-to-follow retry logic (it had that
"did_readaround" logic to determine if it was on the first loop or the
second one, and used those _sometimes_), and should _fix_ those stats, but
obviously all historical tuning had happened with the old stats. So it's
entirely possible that as part of "fixing" them, I actually broke what the
old logic tried to do.
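
As a toy illustration of why that matters (this is just a model of the
accounting, not mm/filemap.c): a miss counter that gets bumped on every
pass of a retry loop grows twice as fast as one bumped once per fault,
so any cutoff tuned against the old numbers - like the "lots of misses,
give up on readaround" check - fires at a different point with the new
ones:

#include <stdio.h>

/* Toy model only, not the real fault path: compare a counter updated
 * on each pass of a retry loop with one updated once per faulting
 * access, against an arbitrary cutoff. */
#define LOTSAMISS 100

int main(void)
{
	int per_pass = 0;	/* old style: first try + retry both count */
	int per_fault = 0;	/* new style: counted once per fault */
	int fault;

	for (fault = 1; fault <= 200; fault++) {
		per_pass += 2;
		per_fault += 1;

		if (per_pass > LOTSAMISS && per_fault <= LOTSAMISS) {
			printf("after %d faults: per-pass counter %d is past "
			       "the cutoff, per-fault counter %d is not\n",
			       fault, per_pass, per_fault);
			break;
		}
	}
	return 0;
}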

There's also one _intentional_ change: the new code doesn't take the lock
on the page before it starts async read-ahead. That can certainly change
timings a lot. It _should_ cause better behavior (more IO overlap), but
depending on read-ahead code and the exact behavior of your IO subsystem,
maybe it causes you problems.
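
Roughly, the ordering difference is this (simplified sketch, not the
actual diff):

	/* Old flow: find and lock the page first; async readahead for
	 * the next window was only kicked off with the page lock held. */
	page = find_lock_page(mapping, offset);
	if (page && PageReadahead(page))
		page_cache_async_readahead(mapping, ra, file, page, offset, 1);

	/* New flow: submit the async readahead up front, so that I/O can
	 * make progress while this fault then sleeps on the page lock. */
	page = find_get_page(mapping, offset);
	if (page) {
		do_async_mmap_readahead(vma, ra, file, page, offset);
		lock_page(page);
	}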

But it is also possible that the patch simply changes behavior in
unintended ways. Either originally, or due to being forward-ported with
all the other read-ahead changes. I'm not seeing it, though - the patch
still looks like it shouldn't change any semantics.

Linus