Date: Wed, 21 Nov 2012 15:46:19 +0800
From: Fengguang Wu <fengguang.wu@intel.com>
To: Jaegeuk Hanse <jaegeuk.hanse@gmail.com>
Cc: Claudio Freire <klaussfreire@gmail.com>,
        Andrew Morton <akpm@linux-foundation.org>,
        linux-kernel@vger.kernel.org,
        Linux Memory Management List <linux-mm@kvack.org>
Subject: Re: fadvise interferes with readahead
Message-ID: <20121121074619.GA7834@localhost>
References: <CAGTBQpaDR4+V5b1AwAVyuVLu5rkU=Wc1WeUdLu5ag=WOk5oJzQ@mail.gmail.com>
 <20121120080427.GA11019@localhost>
 <50AB8FAA.50100@gmail.com>
 <20121120151502.GD19467@localhost>
 <50AC79FD.8030202@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <50AC79FD.8030202@gmail.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 9851
Lines: 189

On Wed, Nov 21, 2012 at 02:51:41PM +0800, Jaegeuk Hanse wrote:
> On 11/20/2012 11:15 PM, Fengguang Wu wrote:
> >On Tue, Nov 20, 2012 at 10:11:54PM +0800, Jaegeuk Hanse wrote:
> >>On 11/20/2012 04:04 PM, Fengguang Wu wrote:
> >>>Hi Claudio,
> >>>
> >>>Thanks for the detailed problem description!
> >>Hi Fengguang,
> >>
> >>Another question, thanks in advance.
> >>
> >>What's the meaning of interleaved reads? If the first process
> >It's access patterns like
> >
> >         1, 1001, 2, 1002, 3, 1003, ...
> >
> >in which there are two (or more) mixed sequential read streams.
> >
> >>readahead from start ~ start + size - async_size, another process
> >>read start + size - async_size + 1, then what will happen? It seems
> >>that variable hit_readahead_marker is false, and the related code can't
> >>run. What am I missing?
> >Yes hit_readahead_marker will be false. However on reading 1002,
> >hit_readahead_marker()/count_history_pages() will find the previous
> >page 1001 already in page cache and trigger context readahead.
> 
> Hi Fengguang,
> 
> Thanks for your explanation. There is a comment in function
> ondemand_readahead: "Hit a marked page without valid readahead
> state". What's the meaning of "without valid readahead state"?

It normally happens in interleaved (or clustered random) reads. When
there are two read streams for one struct file, the one file_ra_state
won't be able to track state for the two streams. When the readahead
code is triggered for stream A, the file_ra_state may contain the
previous readahead window information for stream B. In this case
stream B's readahead state (ra->start, ra->size etc.) is invalid for
the current stream A that we are working on.
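
To make that concrete, here is a toy model in Python (not kernel code).
It assumes a fixed window size and a fixed async_size, and it keeps one
shared state object with the same field names as file_ra_state. The
names RaState, marker_hits and submit are made up for this sketch.
Whenever a read hits a marked page, it checks whether the recorded
window is the one that page belongs to.

```python
from dataclasses import dataclass


@dataclass
class RaState:
    """Toy stand-in for the per-file readahead state (one per open file)."""
    start: int = 0        # first page of the last submitted window
    size: int = 0         # number of pages in that window
    async_size: int = 0   # pages left in the window when the marker is hit


def marker_hits(pages, size=4, async_size=2):
    """Replay page reads; return (page, state_was_valid) per marker hit."""
    ra = RaState()
    cached, marked = set(), set()
    hits = []

    def submit(start):
        # Read one window, mark the page that should trigger the next
        # readahead, and remember the window in the single shared state.
        cached.update(range(start, start + size))
        marked.add(start + size - async_size)
        ra.start, ra.size, ra.async_size = start, size, async_size

    for page in pages:
        if page not in cached:
            submit(page)                      # cache miss: start a new window
        elif page in marked:
            marked.discard(page)
            # Valid only if the shared state describes this page's window.
            valid = page == ra.start + ra.size - ra.async_size
            hits.append((page, valid))
            nxt = page + 1
            while nxt in cached:              # find the next missing page
                nxt += 1
            submit(nxt)
    return hits


sequential = list(range(12))
interleaved = [p for i in range(8) for p in (i, 1000 + i)]
print(marker_hits(sequential))    # [(2, True), (6, True), (10, True)]
print(marker_hits(interleaved))   # [(2, False), (1002, False), (6, False), (1006, False)]
```

With a single stream, every marker hit finds its own window in the
state. With two interleaved streams, every marker hit finds the window
that was last recorded for the other stream.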

Thanks,
Fengguang

> >>>On Fri, Nov 09, 2012 at 04:30:32PM -0300, Claudio Freire wrote:
> >>>>Hi. First of all, I'm not subscribed to this list, so I'd suggest all
> >>>>replies copy me personally.
> >>>>
> >>>>I have been trying to implement some I/O pipelining in Postgres (ie:
> >>>>read the next data page asynchronously while working on the current
> >>>>page), and stumbled upon some puzzling behavior involving the
> >>>>interaction between fadvise and readahead.
> >>>>
> >>>>I'm running kernel 3.0.0 (debian testing), on a single-disk system
> >>>>which, though unsuitable for database workloads, is slow enough to let
> >>>>me experiment with these read-ahead issues.
> >>>>
> >>>>Typical random I/O performance is on the order of 150 r/s to
> >>>>200 r/s (ballpark 7200rpm I'd say), with throughput around 1.5MB/s.
> >>>>Sequential I/O can go up to 60MB/s, though it tends to be around 50.
> >>>>
> >>>>Now onto the problem. In order to parallelize I/O with computation,
> >>>>I've made postgres fadvise(willneed) the pages it will read next. How
> >>>>far ahead is configurable, and I've tested with a number of
> >>>>configurations.
> >>>>
> >>>>The prefetching logic is aware of the OS and pg-specific cache, so it
> >>>>will only fadvise a block once. fadvise calls will stay 1 (or a
> >>>>configurable N) real I/O ahead of read calls, and there's no fadvising
> >>>>of pages that won't be read eventually, in the same order. I checked
> >>>>with strace.
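
For readers following along, the prefetching scheme described above
could look roughly like the sketch below. It is not the actual Postgres
patch. The function name, the 8k block size constant and the `distance`
parameter are assumptions made for illustration. It only uses
os.posix_fadvise() and os.pread() from the Python standard library
(Unix only).

```python
import os

BLOCK_SIZE = 8192  # assumed: the 8k Postgres page size mentioned in the thread


def read_blocks_with_prefetch(fd, blocks, distance=1):
    """Yield (block_no, data) in order, advising `distance` blocks ahead.

    Each block is advised at most once, and only blocks that will
    actually be read later are advised.
    """
    advised = set()
    for i, block in enumerate(blocks):
        # Hint the kernel about the next `distance` blocks before reading.
        for ahead in blocks[i + 1 : i + 1 + distance]:
            if ahead not in advised:
                os.posix_fadvise(fd, ahead * BLOCK_SIZE, BLOCK_SIZE,
                                 os.POSIX_FADV_WILLNEED)
                advised.add(ahead)
        yield block, os.pread(fd, BLOCK_SIZE, block * BLOCK_SIZE)
```

A caller would open the data file with os.open(), pass the block numbers
in the order the scan will visit them, and consume the generator.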
> >>>>
> >>>>However, performance when fadvising drops considerably for a specific
> >>>>yet common access pattern:
> >>>>
> >>>>When a nested loop with two index scans happens, access is random
> >>>>locally, but eventually whole ranges of a file get read (in this
> >>>>random order). Think block "1 6 8 100 34 299 3 7 68 24" followed by "2
> >>>>4 5 101 298 301". Though random, there are ranges there that can be
> >>>>merged in one read-request.
> >>>>
> >>>>The kernel seems to do the merge by applying some form of readahead,
> >>>>not sure if it's context, ondemand or adaptive readahead on the 3.0.0
> >>>>kernel. Anyway, it seems to do readahead, as iostat says:
> >>>>
> >>>>Device:  rrqm/s  wrqm/s     r/s   w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
> >>>>sda        0.00    4.40  224.20  2.00   4.16   0.03     37.86      1.91   8.43     8.00    56.80   4.40  99.44
> >>>>
> >>>>(notice the avgrq-sz of 37.8)
> >>>>
> >>>>With fadvise calls, the thing looks a lot different:
> >>>>
> >>>>Device:  rrqm/s  wrqm/s     r/s   w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
> >>>>sda        0.00   18.00  226.80  1.00   1.80   0.07     16.81      4.00  17.52    17.23    82.40   4.39  99.92
> >>>FYI, there is a readahead tracing/stats patchset that can provide far
> >>>more accurate numbers about what's going on with readahead, which will
> >>>help eliminate a lot of the guesswork here.
> >>>
> >>>https://lwn.net/Articles/472798/
> >>>
> >>>>Notice the avgrq-sz of 16.8. Assuming it's 512-byte sectors, that's
> >>>>spot-on with a postgres page (8k). So, fadvise seems to carry out the
> >>>>requests verbatim, while read manages to merge at least two of them.
> >>>>
> >>>>The random nature of reads makes me think the scheduler is failing to
> >>>>merge the requests in both cases (rrqm/s = 0), because it only looks
> >>>>at successive requests (I'm only guessing here though).
> >>>I guess it's not a merging problem, but that the kernel readahead code
> >>>manages to submit larger IO requests in the first place.
> >>>
> >>>>Looking into the kernel code, it seems the problem could be related to
> >>>>how fadvise works in conjunction with readahead. fadvise seems to call
> >>>>the function in readahead.c that schedules the asynchronous I/O[0]. It
> >>>>doesn't seem subject to readahead logic itself[1], which in and of itself
> >>>>doesn't seem bad. But it does, I assume (not knowing the code that
> >>>>well), prevent readahead logic[2] from eventually seeing the pattern. It
> >>>>effectively disables readahead altogether.
> >>>You are right. If user space does fadvise() and the fadvised pages
> >>>cover all read() pages, the kernel readahead code will not run at all.
> >>>
> >>>So the title is actually a bit misleading. The kernel readahead won't
> >>>interfere with user space prefetching at all. ;)
> >>>
> >>>>This, I theorize, may be because after the fadvise call starts an
> >>>>async I/O on the page, further reads won't hit readahead code because
> >>>>of the page cache[3] (!PageUptodate I imagine). Whether this is
> >>>>desirable or not is not really obvious. In this particular case, doing
> >>>>fadvise calls in what would seem an optimum way, results in terribly
> >>>>worse performance. So I'd suggest it's not really that advisable.
> >>>Yes. The kernel readahead code by design will outperform simple
> >>>fadvise in the case of clustered random reads. Imagine the access
> >>>pattern 1, 3, 2, 6, 4, 9. fadvise will literally trigger 6 IOs, while
> >>>kernel readahead will likely trigger 3 IOs for 1, 3, 2-9, because on
> >>>the page miss for 2, it will detect the existence of history page 1
> >>>and do readahead properly. For hard disks, it's mainly the number of
> >>>IOs that matters. So even if kernel readahead loses some opportunities
> >>>to do async IO and possibly loads some extra pages that will never be
> >>>used, it still manages to perform much better.
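
The IO counts in that example can be reproduced with a small toy model,
sketched below in Python. It is only an illustration, not the kernel
algorithm. It assumes that a miss whose previous page is already cached
reads a fixed window of 8 pages (chosen to match the 2-9 range above),
whereas the real context readahead sizes its window from the history
pages it finds. Both function names are made up for this sketch.

```python
def ios_with_fadvise(pages):
    """Each advised page is submitted verbatim: one IO per distinct page."""
    return [(p, p) for p in dict.fromkeys(pages)]


def ios_with_context_readahead(pages, window=8):
    """Toy context readahead: return the (first, last) page of each IO."""
    cached = set()
    ios = []
    for p in pages:
        if p in cached:
            continue                      # page cache hit, no IO
        if p - 1 in cached:
            # History page found: treat as a sequential stream, read ahead.
            first, last = p, p + window - 1
        else:
            first, last = p, p            # looks random: read just this page
        ios.append((first, last))
        cached.update(range(first, last + 1))
    return ios


pattern = [1, 3, 2, 6, 4, 9]
print(len(ios_with_fadvise(pattern)))         # 6
print(ios_with_context_readahead(pattern))    # [(1, 1), (3, 3), (2, 9)]
```

The toy counts the 2-9 read as a single IO even though page 3 is already
cached at that point.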
> >>>
> >>>>The fix would lie in fadvise, I think. It should update readahead
> >>>>tracking structures. Alternatively, one could try to do it in
> >>>>do_generic_file_read, updating readahead on !PageUptodate or even on
> >>>>page cache hits. I really don't have the expertise or time to go
> >>>>modifying, building and testing the supposedly quite simple patch that
> >>>>would fix this. It's mostly about the testing, in fact. So if someone
> >>>>can comment or try by themselves, I guess it would really benefit
> >>>>those relying on fadvise to fix this behavior.
> >>>One possible solution is to try the context readahead at fadvise time
> >>>to check the existence of history pages and do readahead accordingly.
> >>>
> >>>However it will introduce *real interferences* between kernel
> >>>readahead and user prefetching. The original scheme is, once user
> >>>space starts its own informed prefetching, kernel readahead will
> >>>automatically stand out of the way.
> >>>
> >>>Thanks,
> >>>Fengguang
> >>>
> >>>>Additionally, I would welcome any suggestions for ways to mitigate
> >>>>this problem on current kernels, as I'd like to deploy the patch I'm
> >>>>working on with older kernels. Even if the latest kernel had this
> >>>>behavior fixed, I'd still welcome some workarounds.
> >>>>
> >>>>More details on the benchmarks I've run can be found in the postgresql
> >>>>dev ML archive[4].
> >>>>
> >>>>[0] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob;f=mm/fadvise.c#l95
> >>>>[1] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob;f=mm/readahead.c#l211
> >>>>[2] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob;f=mm/readahead.c#l398
> >>>>[3] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob;f=mm/filemap.c#l1081
> >>>>[4] http://archives.postgresql.org/pgsql-hackers/2012-10/msg01139.php
> >>>>--
> >>>>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >>>>the body of a message to majordomo@vger.kernel.org
> >>>>More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>>Please read the FAQ at  http://www.tux.org/lkml/
> >>>--
> >>>To unsubscribe, send a message with 'unsubscribe linux-mm' in
> >>>the body to majordomo@kvack.org.  For more info on Linux MM,
> >>>see: http://www.linux-mm.org/ .