Date: Wed, 21 Nov 2012 14:51:41 +0800
From: Jaegeuk Hanse
To: Fengguang Wu
Cc: Claudio Freire, Andrew Morton, linux-kernel@vger.kernel.org,
    Linux Memory Management List
Subject: Re: fadvise interferes with readahead

On 11/20/2012 11:15 PM, Fengguang Wu wrote:
> On Tue, Nov 20, 2012 at 10:11:54PM +0800, Jaegeuk Hanse wrote:
>> On 11/20/2012 04:04 PM, Fengguang Wu wrote:
>>> Hi Claudio,
>>>
>>> Thanks for the detailed problem description!
>>
>> Hi Fengguang,
>>
>> Another question, thanks in advance.
>>
>> What's the meaning of "interleaved reads"? If the first process
>
> It's access patterns like
>
>     1, 1001, 2, 1002, 3, 1003, ...
>
> in which two (or more) sequential read streams are mixed together.
>
>> does readahead from start to start + size - async_size, and another
>> process reads start + size - async_size + 1, then what will happen?
>> It seems that hit_readahead_marker is false, so the related code
>> can't run. What am I missing?
>
> Yes, hit_readahead_marker will be false. However, on reading 1002,
> hit_readahead_marker()/count_history_pages() will find the previous
> page 1001 already in the page cache and trigger context readahead.

Hi Fengguang,

Thanks for your explanation. The comment in ondemand_readahead() says
"Hit a marked page without valid readahead state". What does "without
valid readahead state" mean?

Regards,
Jaegeuk

> Thanks,
> Fengguang
>
>>> On Fri, Nov 09, 2012 at 04:30:32PM -0300, Claudio Freire wrote:
>>>> Hi. First of all, I'm not subscribed to this list, so I'd suggest
>>>> all replies copy me personally.
>>>>
>>>> I have been trying to implement some I/O pipelining in Postgres
>>>> (i.e. read the next data page asynchronously while working on the
>>>> current page), and stumbled upon some puzzling behavior involving
>>>> the interaction between fadvise and readahead.
>>>>
>>>> I'm running kernel 3.0.0 (Debian testing) on a single-disk system
>>>> which, though unsuitable for database workloads, is slow enough
>>>> to let me experiment with these readahead issues.
>>>>
>>>> Typical random I/O performance is on the order of 150 r/s to
>>>> 200 r/s (ballpark 7200 rpm, I'd say), with throughput around
>>>> 1.5 MB/s. Sequential I/O can go up to 60 MB/s, though it tends to
>>>> be around 50.
>>>>
>>>> Now onto the problem. In order to overlap I/O with computation,
>>>> I've made Postgres fadvise(WILLNEED) the pages it will read next.
>>>> How far ahead is configurable, and I've tested with a number of
>>>> configurations.
>>>>
>>>> The prefetching logic is aware of the OS and pg-specific caches,
>>>> so it will only fadvise a block once. fadvise calls will stay 1
>>>> (or a configurable N) real I/O ahead of read calls, and there's
>>>> no fadvising of pages that won't eventually be read, in the same
>>>> order. I checked with strace.
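A minimal sketch of the prefetch pattern Claudio describes: hint the
next block with POSIX_FADV_WILLNEED before doing real work on the
current one, so the disk read overlaps computation. The block size,
fd handling and read loop here are illustrative assumptions, not
Postgres source:

    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLCKSZ 8192  /* Postgres page size */

    /* Read the given blocks in order, hinting each block to the
     * kernel one real I/O before it is actually read. */
    static void read_blocks(int fd, const off_t *blocks, int nblocks)
    {
        char buf[BLCKSZ];

        for (int i = 0; i < nblocks; i++) {
            /* Start the next block's I/O asynchronously. */
            if (i + 1 < nblocks)
                posix_fadvise(fd, blocks[i + 1] * BLCKSZ, BLCKSZ,
                              POSIX_FADV_WILLNEED);

            if (pread(fd, buf, BLCKSZ, blocks[i] * BLCKSZ) != BLCKSZ) {
                perror("pread");
                exit(1);
            }
            /* ... process the page ... */
        }
    }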
>>>>
>>>> However, performance when fadvising drops considerably for a
>>>> specific yet common access pattern:
>>>>
>>>> When a nested loop with two index scans happens, access is random
>>>> locally, but eventually whole ranges of a file get read (in this
>>>> random order). Think block "1 6 8 100 34 299 3 7 68 24" followed
>>>> by "2 4 5 101 298 301". Though random, there are ranges there
>>>> that can be merged into one read request.
>>>>
>>>> The kernel seems to do the merge by applying some form of
>>>> readahead; I'm not sure if it's context, ondemand or adaptive
>>>> readahead on the 3.0.0 kernel. In any case, it seems to do
>>>> readahead, as iostat says:
>>>>
>>>> Device: rrqm/s wrqm/s    r/s  w/s rMB/s wMB/s avgrq-sz avgqu-sz  await r_await w_await svctm %util
>>>> sda       0.00   4.40 224.20 2.00  4.16  0.03    37.86     1.91   8.43    8.00   56.80  4.40 99.44
>>>>
>>>> (notice the avgrq-sz of 37.86)
>>>>
>>>> With fadvise calls, the picture looks a lot different:
>>>>
>>>> Device: rrqm/s wrqm/s    r/s  w/s rMB/s wMB/s avgrq-sz avgqu-sz  await r_await w_await svctm %util
>>>> sda       0.00  18.00 226.80 1.00  1.80  0.07    16.81     4.00  17.52   17.23   82.40  4.39 99.92
>>>
>>> FYI, there is a readahead tracing/stats patchset that can provide
>>> far more accurate numbers about what's going on with readahead,
>>> which will help eliminate a lot of the guesswork here:
>>>
>>> https://lwn.net/Articles/472798/
>>>
>>>> Notice the avgrq-sz of 16.81. Assuming 512-byte sectors, that's
>>>> spot-on with a Postgres page (8k). So fadvise seems to carry out
>>>> the requests verbatim, while read manages to merge at least two
>>>> of them.
>>>>
>>>> The random nature of the reads makes me think the scheduler is
>>>> failing to merge the requests in both cases (rrqm/s = 0), because
>>>> it only looks at successive requests (I'm only guessing here
>>>> though).
>>>
>>> I guess it's not a merging problem, but that the kernel readahead
>>> code manages to submit larger I/O requests in the first place.
>>>
>>>> Looking into the kernel code, it seems the problem could be
>>>> related to how fadvise works in conjunction with readahead.
>>>> fadvise calls the function in readahead.c that schedules the
>>>> asynchronous I/O[0]. It doesn't seem subject to the readahead
>>>> logic itself[1], which in and of itself doesn't seem bad. But it
>>>> does, I assume (not knowing the code that well), prevent the
>>>> readahead logic[2] from eventually seeing the pattern. It
>>>> effectively disables readahead altogether.
>>>
>>> You are right. If user space does fadvise() and the fadvised pages
>>> cover all read() pages, the kernel readahead code will not run at
>>> all.
>>>
>>> So the title is actually a bit misleading. The kernel readahead
>>> won't interfere with user space prefetching at all. ;)
>>>
>>>> This, I theorize, may be because after the fadvise call starts an
>>>> async I/O on the page, further reads won't hit the readahead code
>>>> because of the page cache[3] (!PageUptodate, I imagine). Whether
>>>> this is desirable or not is not really obvious. In this
>>>> particular case, doing fadvise calls in what would seem an
>>>> optimal way results in terribly worse performance. So I'd suggest
>>>> it's not really that advisable.
>>>
>>> Yes. The kernel readahead code by design will outperform simple
>>> fadvise in the case of clustered random reads. Imagine the access
>>> pattern 1, 3, 2, 6, 4, 9. fadvise will trigger 6 I/Os, literally.
>>> Kernel readahead, on the other hand, will likely trigger 3 I/Os,
>>> for 1, 3 and 2-9, because on the page miss for 2 it will detect
>>> the existence of history page 1 and do readahead accordingly. For
>>> hard disks, it's mainly the number of I/Os that matters. So even
>>> if kernel readahead loses some opportunities to do async I/O and
>>> possibly loads some extra pages that will never be used, it still
>>> manages to perform much better.
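To make the I/O-count argument concrete, here is a toy model of the
two strategies for the pattern above. It is not kernel code, and the
8-page readahead window is an assumption for illustration:

    #include <stdbool.h>
    #include <stdio.h>

    #define NPAGES    16
    #define RA_WINDOW 8   /* assumed readahead window, in pages */

    int main(void)
    {
        int pattern[] = {1, 3, 2, 6, 4, 9};
        int n = sizeof(pattern) / sizeof(pattern[0]);
        bool cached[NPAGES] = {false};
        int ra_ios = 0;

        /* Literal prefetching: one I/O per (distinct) page. */
        printf("fadvise:   %d I/Os\n", n);

        /* History-based readahead: on a page miss, if the preceding
         * page is already cached, read a whole window in one I/O. */
        for (int i = 0; i < n; i++) {
            int p = pattern[i];
            if (cached[p])
                continue;               /* cache hit, no I/O */
            ra_ios++;
            int end = (p > 0 && cached[p - 1]) ? p + RA_WINDOW : p + 1;
            for (int j = p; j < end && j < NPAGES; j++)
                cached[j] = true;
        }
        printf("readahead: %d I/Os\n", ra_ios);  /* 3: 1, 3, 2-9 */
        return 0;
    }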
>>>
>>>> The fix would lie in fadvise, I think. It should update the
>>>> readahead tracking structures. Alternatively, one could try to do
>>>> it in do_generic_file_read, updating readahead state on
>>>> !PageUptodate or even on page cache hits. I really don't have the
>>>> expertise or time to go modifying, building and testing the
>>>> supposedly quite simple patch that would fix this. It's mostly
>>>> about the testing, in fact. So if someone can comment or try it
>>>> themselves, I guess it would really benefit those relying on
>>>> fadvise to have this behavior fixed.
>>>
>>> One possible solution is to try the context readahead at fadvise
>>> time, to check for the existence of history pages and do readahead
>>> accordingly.
>>>
>>> However, that would introduce *real interference* between kernel
>>> readahead and user prefetching. The original scheme is: once user
>>> space starts its own informed prefetching, kernel readahead will
>>> automatically stay out of the way.
>>>
>>> Thanks,
>>> Fengguang
>>>
>>>> Additionally, I would welcome any suggestions for ways to
>>>> mitigate this problem on current kernels, as I'd like to deploy
>>>> the patch I'm working on with older kernels. Even if the latest
>>>> kernel had this behavior fixed, I'd still welcome some
>>>> workarounds.
>>>>
>>>> More details on the benchmarks I've run can be found in the
>>>> postgresql dev ML archive[4].
>>>>
>>>> [0] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob;f=mm/fadvise.c#l95
>>>> [1] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob;f=mm/readahead.c#l211
>>>> [2] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob;f=mm/readahead.c#l398
>>>> [3] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob;f=mm/filemap.c#l1081
>>>> [4] http://archives.postgresql.org/pgsql-hackers/2012-10/msg01139.php
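On the "ranges that can be merged into one read request" observation
above: since fadvise carries out hints verbatim, one userspace
mitigation is to sort and coalesce adjacent block numbers before
issuing WILLNEED, so each contiguous run reaches the block layer as a
single large hint. A sketch under those assumptions, not from the
thread and untested against its workload:

    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <stdlib.h>

    #define BLCKSZ 8192

    static int cmp_off(const void *a, const void *b)
    {
        off_t x = *(const off_t *)a, y = *(const off_t *)b;
        return (x > y) - (x < y);
    }

    /* Sort the pending block numbers, then issue one WILLNEED hint
     * per run of consecutive blocks instead of one per block. */
    static void fadvise_coalesced(int fd, off_t *blocks, int n)
    {
        qsort(blocks, n, sizeof(off_t), cmp_off);

        for (int i = 0; i < n; ) {
            int j = i;
            while (j + 1 < n && blocks[j + 1] == blocks[j] + 1)
                j++;                    /* extend the contiguous run */
            posix_fadvise(fd, blocks[i] * BLCKSZ,
                          (blocks[j] - blocks[i] + 1) * BLCKSZ,
                          POSIX_FADV_WILLNEED);
            i = j + 1;
        }
    }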