Date: Fri, 2 Apr 2010 08:38:30 +0200
From: Jens Axboe
To: Wu Fengguang
Cc: Linux Kernel
Subject: Re: [PATCH] readahead even for FMODE_RANDOM
Message-ID: <20100402063830.GR23510@kernel.dk>
References: <20100401183151.GO23510@kernel.dk> <20100402012338.GA6393@localhost>
In-Reply-To: <20100402012338.GA6393@localhost>

On Fri, Apr 02 2010, Wu Fengguang wrote:
> Hi Jens,
>
> On Fri, Apr 02, 2010 at 02:31:51AM +0800, Jens Axboe wrote:
> > Hi,
> >
> > I got a problem report with fio where larger block size random reads
> > were markedly slower with buffered IO than with O_DIRECT, and the
> > initial thought was that perhaps this was some fio oddity. The reporter
> > eventually discovered that turning off the fadvise hint made it work
> > fine. So I took a look, and it seems we never do readahead for
> > FMODE_RANDOM, even if the request size is larger than 1 page. That
> > seems like a bug: if an application is doing e.g. 16kb random reads,
> > you want to readahead the remaining 12kb of data. On devices where
> > smaller transfer sizes are slower than larger ones, this can make a
> > large difference.
> >
> > This patch makes us readahead even for FMODE_RANDOM, iff we'll be
> > reading more pages in that single read. I ran a quick test here, and it
> > appears to fix the problem (no difference with fadvise POSIX_FADV_RANDOM
> > being passed in or not).
> I guess the application is doing (at least partial) sequential reads,
> while at the same time telling the kernel with POSIX_FADV_RANDOM that
> it's doing random reads.
>
> If so, it's mainly the application's fault.

The application is doing large random reads. It's purely random, so the
POSIX_FADV_RANDOM hint is correct. However, thinking about it, it may be
that we later hit a random "block" that has now been cached due to this
read-ahead. Let me try to rule that out completely and see if the
difference is still there. The original reporter observed 4kb reads
hitting the driver, where 128kb was expected.

> However, the kernel can behave more smartly and less "dumb" :)
> It can inherit the current good behavior of "submit one single 16kb io
> request for a 16kb random read() syscall", while still being able to
> start _larger sized_ readahead if it's actually a sequential one.

Yeah, that's an ancient issue and pretty sad.

> > diff --git a/mm/readahead.c b/mm/readahead.c
> > index 337b20e..d4b201c 100644
> > --- a/mm/readahead.c
> > +++ b/mm/readahead.c
> > @@ -501,8 +501,11 @@ void page_cache_sync_readahead(struct address_space *mapping,
> >  	if (!ra->ra_pages)
> >  		return;
> >  
> > -	/* be dumb */
> > -	if (filp->f_mode & FMODE_RANDOM) {
> > +	/*
> > +	 * Be dumb for files marked as randomly accessed, but do readahead
> > +	 * inside the original request (req_size > 1).
> > +	 */
> > +	if ((filp->f_mode & FMODE_RANDOM) && req_size == 1) {
> >  		force_page_cache_readahead(mapping, filp, offset, req_size);
> >  		return;
> >  	}
>
> The patch only fixes the (req_size != 1) case that was exposed by your
> application. A complete fix would be:
>
> @@ -820,12 +825,6 @@ void page_cache_sync_readahead(struct ad
>  	if (!ra->ra_pages)
>  		return;
>  
> -	/* be dumb */
> -	if (filp->f_mode & FMODE_RANDOM) {
> -		force_page_cache_readahead(mapping, filp, offset, req_size);
> -		return;
> -	}
> -

Hmm, are we talking about the same thing?
I want to hit read-ahead for the remaining pages inside that random
read, i.e. ensure that read-ahead gets activated inside the window of
that random request.

>  	/* do read-ahead */
>  	ondemand_readahead(mapping, ra, filp, false, offset, req_size);
>  }
>
> And a more optimized patch would look like this. Note that only the
> last chunk is necessary for bug fixing, and only this chunk can be
> applied to vanilla 2.6.34-rc3.
>
> If there are no problems, I'll submit a patch with only the last chunk
> for 2.6.34, and submit the remaining chunks for 2.6.35.
>
> Thanks,
> Fengguang
> ---
> Subject: readahead: smarter readahead on POSIX_FADV_RANDOM
> From: Wu Fengguang
> Date: Fri Apr 02 08:52:42 CST 2010
>
> Sometimes user space applications will ask for POSIX_FADV_RANDOM
> while still doing some sequential reads.
>
> The kernel can behave a bit smarter in this case, by letting the
> readahead heuristics handle the POSIX_FADV_RANDOM case, but with a
> less aggressive assumption on sequential reads.

I'll try and give this a spin. On the laptop, I cannot reproduce the
problem of smaller < reqsize ios, so it's hard to say just now.
>
> CC: Jens Axboe
> Signed-off-by: Wu Fengguang
> ---
>  mm/readahead.c |   17 ++++++++---------
>  1 file changed, 8 insertions(+), 9 deletions(-)
>
> --- linux.orig/mm/readahead.c	2010-04-02 08:43:53.000000000 +0800
> +++ linux/mm/readahead.c	2010-04-02 09:00:51.000000000 +0800
> @@ -664,6 +664,7 @@ ondemand_readahead(struct address_space
>  	unsigned long max = max_sane_readahead(ra->ra_pages);
>  	unsigned long tt;	/* thrashing threshold */
>  	pgoff_t start;
> +	bool random_hint = (filp && (filp->f_mode & FMODE_RANDOM));
>  
>  	/*
>  	 * start of file
> @@ -671,7 +672,8 @@ ondemand_readahead(struct address_space
>  	if (!offset) {
>  		ra_set_pattern(ra, RA_PATTERN_INITIAL);
>  		ra->start = offset;
> -		if ((ra->ra_flags & READAHEAD_LSEEK) && req_size <= max) {
> +		if ((random_hint || (ra->ra_flags & READAHEAD_LSEEK)) &&
> +		    req_size <= max) {
>  			ra->size = req_size;
>  			ra->async_size = 0;
>  			goto readit;
> @@ -743,8 +745,11 @@ context_readahead:
>  	} else
>  		start = offset;
>  
> -	tt = count_history_pages(mapping, ra, offset,
> -				 READAHEAD_ASYNC_RATIO * max);
> +	if (unlikely(random_hint))
> +		tt = 0;
> +	else
> +		tt = count_history_pages(mapping, ra, offset,
> +					 READAHEAD_ASYNC_RATIO * max);
>  	/*
>  	 * no history pages cached, could be
>  	 *  - a random read
> @@ -820,12 +825,6 @@ void page_cache_sync_readahead(struct ad
>  	if (!ra->ra_pages)
>  		return;
>  
> -	/* be dumb */
> -	if (filp->f_mode & FMODE_RANDOM) {
> -		force_page_cache_readahead(mapping, filp, offset, req_size);
> -		return;
> -	}
> -
>  	/* do read-ahead */
>  	ondemand_readahead(mapping, ra, filp, false, offset, req_size);
>  }

-- 
Jens Axboe