Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753397Ab2KTOMG (ORCPT ); Tue, 20 Nov 2012 09:12:06 -0500 Received: from mail-ie0-f174.google.com ([209.85.223.174]:37678 "EHLO mail-ie0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752314Ab2KTOME (ORCPT ); Tue, 20 Nov 2012 09:12:04 -0500 Message-ID: <50AB8FAA.50100@gmail.com> Date: Tue, 20 Nov 2012 22:11:54 +0800 From: Jaegeuk Hanse User-Agent: Mozilla/5.0 (X11; Linux i686; rv:16.0) Gecko/20121028 Thunderbird/16.0.2 MIME-Version: 1.0 To: Fengguang Wu CC: Claudio Freire , Andrew Morton , linux-kernel@vger.kernel.org, Linux Memory Management List Subject: Re: fadvise interferes with readahead References: <20121120080427.GA11019@localhost> In-Reply-To: <20121120080427.GA11019@localhost> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 8167 Lines: 165 On 11/20/2012 04:04 PM, Fengguang Wu wrote: > Hi Claudio, > > Thanks for the detailed problem description! Hi Fengguang, Another question, thanks in advance. What's the meaning of interleaved reads? If the first process readahead from start ~ start + size - async_size, another process read start + size - aysnc_size + 1, then what will happen? It seems that variable hit_readahead_marker is false, and related codes can't run, where I miss? Regards, Jaegeuk > > On Fri, Nov 09, 2012 at 04:30:32PM -0300, Claudio Freire wrote: >> Hi. First of all, I'm not subscribed to this list, so I'd suggest all >> replies copy me personally. >> >> I have been trying to implement some I/O pipelining in Postgres (ie: >> read the next data page asynchronously while working on the current >> page), and stumbled upon some puzzling behavior involving the >> interaction between fadvise and readahead. >> >> I'm running kernel 3.0.0 (debian testing), on a single-disk system >> which, though unsuitable for database workloads, is slow enough to let >> me experiment with these read-ahead issues. >> >> Typical random I/O performance is on the order of between 150 r/s to >> 200 r/s (ballpark 7200rpm I'd say), with thoughput around 1.5MB/s. >> Sequential I/O can go up to 60MB/s, though it tends to be around 50. >> >> Now onto the problem. In order to parallelize I/O with computation, >> I've made postgres fadvise(willneed) the pages it will read next. How >> far ahead is configurable, and I've tested with a number of >> configurations. >> >> The prefetching logic is aware of the OS and pg-specific cache, so it >> will only fadvise a block once. fadvise calls will stay 1 (or a >> configurable N) real I/O ahead of read calls, and there's no fadvising >> of pages that won't be read eventually, in the same order. I checked >> with strace. >> >> However, performance when fadvising drops considerably for a specific >> yet common access pattern: >> >> When a nested loop with two index scans happens, access is random >> locally, but eventually whole ranges of a file get read (in this >> random order). Think block "1 6 8 100 34 299 3 7 68 24" followed by "2 >> 4 5 101 298 301". Though random, there are ranges there that can be >> merged in one read-request. >> >> The kernel seems to do the merge by applying some form of readahead, >> not sure if it's context, ondemand or adaptive readahead on the 3.0.0 >> kernel. Anyway, it seems to do readahead, as iostat says: >> >> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s >> avgrq-sz avgqu-sz await r_await w_await svctm %util >> sda 0.00 4.40 224.20 2.00 4.16 0.03 >> 37.86 1.91 8.43 8.00 56.80 4.40 99.44 >> >> (notice the avgrq-sz of 37.8) >> >> With fadvise calls, the thing looks a lot different: >> >> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s >> avgrq-sz avgqu-sz await r_await w_await svctm %util >> sda 0.00 18.00 226.80 1.00 1.80 0.07 >> 16.81 4.00 17.52 17.23 82.40 4.39 99.92 > FYI, there is a readahead tracing/stats patchset that can provide far > more accurate numbers about what's going on with readahead, which will > help eliminate lots of the guess works here. > > https://lwn.net/Articles/472798/ > >> Notice the avgrq-sz of 16.8. Assuming it's 512-byte sectors, that's >> spot-on with a postgres page (8k). So, fadvise seems to carry out the >> requests verbatim, while read manages to merge at least two of them. >> >> The random nature of reads makes me think the scheduler is failing to >> merge the requests in both cases (rrqm/s = 0), because it only looks >> at successive requests (I'm only guessing here though). > I guess it's not a merging problem, but that the kernel readahead code > manages to submit larger IO requests in the first place. > >> Looking into the kernel code, it seems the problem could be related to >> how fadvise works in conjunction with readahead. fadvise seems to call >> the function in readahead.c that schedules the asynchornous I/O[0]. It >> doesn't seem subject to readahead logic itself[1], which in on itself >> doesn't seem bad. But it does, I assume (not knowing the code that >> well), prevent readahead logic[2] to eventually see the pattern. It >> effectively disables readahead altogether. > You are right. If user space does fadvise() and the fadvised pages > cover all read() pages, the kernel readahead code will not run at all. > > So the title is actually a bit misleading. The kernel readahead won't > interfere with user space prefetching at all. ;) > >> This, I theorize, may be because after the fadvise call starts an >> async I/O on the page, further reads won't hit readahead code because >> of the page cache[3] (!PageUptodate I imagine). Whether this is >> desirable or not is not really obvious. In this particular case, doing >> fadvise calls in what would seem an optimum way, results in terribly >> worse performance. So I'd suggest it's not really that advisable. > Yes. The kernel readahead code by design will outperform simple > fadvise in the case of clustered random reads. Imagine the access > pattern 1, 3, 2, 6, 4, 9. fadvise will trigger 6 IOs literally. While > kernel readahead will likely trigger 3 IOs for 1, 3, 2-9. Because on > the page miss for 2, it will detect the existence of history page 1 > and do readahead properly. For hard disks, it's mainly the number of > IOs that matters. So even if kernel readahead loses some opportunities > to do async IO and possibly loads some extra pages that will never be > used, it still manges to perform much better. > >> The fix would lay in fadvise, I think. It should update readahead >> tracking structures. Alternatively, one could try to do it in >> do_generic_file_read, updating readahead on !PageUptodate or even on >> page cache hits. I really don't have the expertise or time to go >> modifying, building and testing the supposedly quite simple patch that >> would fix this. It's mostly about the testing, in fact. So if someone >> can comment or try by themselves, I guess it would really benefit >> those relying on fadvise to fix this behavior. > One possible solution is to try the context readahead at fadvise time > to check the existence of history pages and do readahead accordingly. > > However it will introduce *real interferences* between kernel > readahead and user prefetching. The original scheme is, once user > space starts its own informed prefetching, kernel readahead will > automatically stand out of the way. > > Thanks, > Fengguang > >> Additionally, I would welcome any suggestions for ways to mitigate >> this problem on current kernels, as the patch I'm working I'd like to >> deploy with older kernels. Even if the latest kernel had this behavior >> fixed, I'd still welcome some workarounds. >> >> More details on the benchmarks I've run can be found in the postgresql >> dev ML archive[4]. >> >> [0] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob;f=mm/fadvise.c#l95 >> [1] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob;f=mm/readahead.c#l211 >> [2] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob;f=mm/readahead.c#l398 >> [3] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob;f=mm/filemap.c#l1081 >> [4] http://archives.postgresql.org/pgsql-hackers/2012-10/msg01139.php >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> Please read the FAQ at http://www.tux.org/lkml/ > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/