LinuxLists.cc - [PATCH/RFC] Simplified Readahead

2004-09-24 22:53:42

On Mon, Sep 27 2004, Andrew Morton wrote:
> Steven Pratt <[email protected]> wrote:
> >
> > >yup. POSIX_FADV_WILLNEED should just populate pagecache and should launch
> > >asynchronous I/O.
> > >
> >
> > Well then this could cause problems (other than congestion) on both the
> > current and new code since both will effectively see 2 reads, the second
> > of which may appear to be a seek backwards thus confusing the code
> > slightly. Would it be best to just special case the POSIX_FADV_WILLNEED
> > and issue the I/O required (via do_page_cache_readahead) without
> > updating any of the window or current page offset information ?
>
> That's what we do at present. do_page_cache_readahead() and
> force_page_cache_readahead() are low-level functions which do not affect
> file->ra_state.
>
> Except whoops. POSIX_FADV_WILLNEED is using force_page_cache_readahead(),
> which bypasses the congested check. Wonder how that happened.
>
> <digs out the changeset>
>
> ChangeSet 1.1046.589.14 2003/08/01 10:02:32 [email protected]
> [PATCH] rework readahead for congested queues
>
> Since Jens changed the block layer to fail readahead if the queue has no
> requests free, a few changes suggest themselves.
>
> - It's a bit silly to go and allocate a bunch of pages, build BIOs for them,
> submit the IO only to have it fail, forcing us to free the pages again.
>
> So the patch changes do_page_cache_readahead() to peek at the queue's
> read_congested state. If the queue is read-congested we abandon the entire
> readahead up-front without doing all that work.
>
> - If the queue is not read-congested, we go ahead and do the readahead,
> after having set PF_READAHEAD.
>
> The backing_dev_info's read-congested threshold cuts in when 7/8ths of
> the queue's requests are in flight, so it is probable that the readahead
> abandonment code in __make_request will now almost never trigger.
>
> - The above changes make do_page_cache_readahead() "unreliable", in that it
> may do nothing at all.
>
> However there are some system calls:
>
> - fadvise(POSIX_FADV_WILLNEED)
> - madvise(MADV_WILLNEED)
> - sys_readahead()
>
> In which the user has an expectation that the kernel will actually
> perform the IO.
>
> So the patch creates a new "force_page_cache_readahead()" which will
> perform the IO regardless of the queue's congestion state.
>
> Arguably, this is the wrong thing to do: even though the application
> requested readahead it could be that the kernel _should_ abandon the user's
> request because the disk is so busy.
>
> I don't know. But for now, let's keep the above syscalls' behaviour
> unchanged. It is trivial to switch back to do_page_cache_readahead()
> later.
>
>
> So there we have it. The reason why normal readahead skips congested
> queues is because the block layer will drop the I/O on the floor *anyway*
> because it also skips congested queues for readahead I/O requests.
>
> And fadvise() was switched to force_page_cache_readahead() because that was
> the old behaviour.
>
> But PF_READAHEAD isn't there any more, and BIO_RW_AHEAD and BIO_RW_BLOCK
> are not used anywhere, so we broke that. Jens, do you remember what
> happened there?

Nothing changed from the block layer point of view (in 2.6 or earlier,
rw_ahead was always tossed away for congested queues). Why the
read-ahead algorithms dropped PF_READAHEAD or flagging bio_rw_ahead() I
don't know, I haven't worked on that.

--
Jens Axboe
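
The congestion behaviour described in the quoted changeset can be sketched
as a small standalone model. This is not kernel code: struct toy_queue and
the toy_* functions are hypothetical names, and treating "7/8ths of the
requests in flight" as an inclusive threshold is an assumption. It only
illustrates the difference between a readahead call that may do nothing and
one that always issues the I/O.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a block queue; not a kernel structure. */
struct toy_queue {
    unsigned int nr_requests; /* total request slots in the queue */
    unsigned int in_flight;   /* request slots currently in use */
};

/* Assumed rule: read-congested once 7/8ths of the requests are in flight. */
static bool toy_read_congested(const struct toy_queue *q)
{
    return q->in_flight * 8 >= q->nr_requests * 7;
}

/*
 * Models the "unreliable" path: abandon the whole readahead up front
 * when the queue is read-congested. Returns the number of pages issued.
 */
static size_t toy_do_readahead(const struct toy_queue *q, size_t nr_pages)
{
    if (toy_read_congested(q))
        return 0;
    return nr_pages;
}

/* Models the forced path: issue the I/O regardless of congestion. */
static size_t toy_force_readahead(const struct toy_queue *q, size_t nr_pages)
{
    (void)q; /* congestion state is deliberately ignored */
    return nr_pages;
}
```

With a 128-slot queue and 112 requests in flight, toy_do_readahead()
returns 0 while toy_force_readahead() still returns the full page count.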

2004-09-29 18:48:31

On a readahead-related note, I'm wondering how hard it would be to have
some tunables and/or hooks from the readahead state machine made
available to the filesystem? With the 2.4 readahead code it was basically
impossible for the filesystem to disable the readahead; I haven't looked
at the 2.6 readahead enough to determine whether we need that or not.

The real issue (reason for turning off RA in 2.4) is that within Lustre
there can be many DLM extent locks for a single file, so client A can
be writing to one part of the file, and client B can be reading from
another part of the same file. With the stock readahead it wouldn't
stay within the lock extent boundaries, and we couldn't turn it off
easily. Having some sort of FS method that says "don't do RA beyond
this offset" would be useful here.
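
A "don't do RA beyond this offset" limit could look roughly like the
following sketch. The function name and the idea of passing the extent end
as a page index are assumptions made for illustration; no such hook is
described in this thread as existing.

```c
#include <stdint.h>

/*
 * Hypothetical helper: clamp a readahead request starting at page `start`
 * so that it never reaches `extent_end`, the first page index NOT covered
 * by the lock extent. Returns the number of pages that may be read ahead.
 */
static uint64_t toy_clamp_readahead(uint64_t start, uint64_t nr_pages,
                                    uint64_t extent_end)
{
    if (start >= extent_end)
        return 0; /* already at or past the boundary */

    uint64_t room = extent_end - start;
    return nr_pages < room ? nr_pages : room;
}
```

For example, a 32-page request starting at page 100 with the extent ending
at page 116 would be trimmed to 16 pages.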

The other problem that Lustre had was that the stock readahead would
send out page reads in small chunks as the window grew instead of
sending out large requests that could be turned into large, efficient
network RPCs. So the desire would be to have some sort of tunable in
the readahead state (per fs or per file) that says "don't submit
another readahead until the window is growing by X pages".
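
One way to picture the "don't submit until the window has grown by X pages"
tunable is the sketch below. struct toy_ra_batch and toy_ra_submit() are
hypothetical and only model the batching decision, not any real readahead
state.

```c
/* Hypothetical per-file batching state; not a kernel structure. */
struct toy_ra_batch {
    unsigned long submitted_end; /* first page index not yet submitted */
    unsigned long min_batch;     /* the "X pages" tunable */
};

/*
 * The readahead logic wants the window to extend to `wanted_end`.
 * Returns how many pages to submit now, or 0 if the growth since the
 * last submission is still smaller than min_batch.
 */
static unsigned long toy_ra_submit(struct toy_ra_batch *b,
                                   unsigned long wanted_end)
{
    if (wanted_end <= b->submitted_end)
        return 0;

    unsigned long growth = wanted_end - b->submitted_end;
    if (growth < b->min_batch)
        return 0; /* keep accumulating until the request is large */

    b->submitted_end = wanted_end;
    return growth;
}
```

With min_batch set to 256, growth to page 64 and then to page 200 submits
nothing, and the first submission is a single 256-page request.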

As it is we've basically had to implement our own readahead code within
the filesystem in order to get correct behaviour and good performance.
This is of course not optimal from a code duplication point of view and
also we don't get any benefits from the algorithm improvements being
done here.

The other question is whether the new readahead code takes the latency
of completing read requests into account when determining the size of
the readahead window? Lustre generally runs with very fast disk and
network systems so the size of the readahead window has to be very large
in order to keep the pipeline full to avoid stalling on the client.
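
The link between latency and window size can be illustrated with a
bandwidth-delay calculation. This is only a back-of-the-envelope sketch
with hypothetical names; nothing in this thread says either readahead
implementation sizes its window this way.

```c
#include <stdint.h>

/*
 * Hypothetical estimate: the number of pages that must be in flight to
 * keep a pipe of `bytes_per_sec` busy when each read takes `latency_us`
 * microseconds to complete. The result is rounded up to whole pages.
 */
static uint64_t toy_window_pages(uint64_t bytes_per_sec, uint64_t latency_us,
                                 uint64_t page_size)
{
    uint64_t bytes_in_flight = bytes_per_sec * latency_us / 1000000u;
    return (bytes_in_flight + page_size - 1) / page_size;
}
```

At 400,000,000 bytes/sec with 10 ms request latency and 4096-byte pages,
this gives 4,000,000 bytes in flight, or 977 pages.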

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://members.shaw.ca/adilger/ http://members.shaw.ca/golinux/

2004-09-30 01:12:51

On Fri, 2004-10-01 at 14:02, Steven Pratt wrote:
> Ram Pai wrote:
>
> snip...
>
> >>>>>To summarize you noticed 3 problems:
> >>>>>
> >>>>>1. page cache hits not handled properly.
> >>>>>2. readahead thrashing not accounted.
> >>>>>3. read congestion not accounted.
> >>>>>
> >>>>>
> >
> >
> >I have enclosed 5 patches that address each of the issues.
> >
> >1 . Code is obtuse and hard to maintain
> >
> > The best I could do is update the comments to reflect the
> > current code. Hopefully that should help.
> >
> > attached patch 1_comment.patch takes care of that part to
> > some extent.
> >
> >
> >2. page cache hits not handled properly.
> >
> > I fixed this by decrementing the size of the next readahead window
> > by the number of pages hit in the page cache. Now it slowly
> > accomodates the page cache hits.
> >
> > attached patch 2_cachehits.patch takes care of this issue.
> >
> >3. queue congestion not handled.
> >
> > The fix is: call force_page_cache_readahead() if we are
> > populating pages in the current window.
> > And call do_page_cache_readahead() if we are populating
> > pages in the ahead window. However if do_page_cache_readahead()
> > returns with congestion, the readahead window is collapsed back
> > to size zero. This ensures that populating the ahead window
> > is attempted again the next time.
> >
> > attached patch 3_queuecongestion.patch handles this issue.
> >
> >4. page thrash handled ineffectively.
> >
> > The fix is: on page thrash detection shut down readahead.
> >
> > attached patch 4_pagethrash.patch handles this issue.
> >
> >5. slow read path is too slow.
> >
> > I could not figure out a way to read at least the requested
> > number of pages if readahead is shut down, without adding
> > the readsize parameter to page_cache_readahead(). So I had
> > to pick up some of your code in filemap.c to do that. Thanks!
> >
> > attached patch 5_fixedslowread.patch handles this issue.
> >
> >
> >Apart from this you have noticed other issues
> >
> >6. cache lookup done unnecessarily twice for pagecache_hits.
> >
> > I have not handled this issue currently. But it should be doable
> > if I introduce a flag, which notes when readahead is
> > shut down by pagecache hits, and hence attempts to look up
> > the page only once.
> >
> >
> >And you have other features in your patch which will be the real
> >differentiating factors.
> >
> >7. exponential expand and shrink of window sizes.
> >
> >8. overlapped read of current window and ahead window.
> >
> > ( I think both are desirable features )
> >
> >I did run some preliminary tests using your patch and the above patches
> >and found
> >
> >your patch was doing slightly better on iozone and sysbench.
> >however the above patches were doing slightly better with DSS workload.
> >
> >
>
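
The window adjustments listed in items 2, 3 and 7 above can be sketched as
plain arithmetic on a window size in pages. All names below are
hypothetical, and reading "exponential expand and shrink" as doubling and
halving is an assumption; the actual patches are not shown in this thread.

```c
/* Item 2: shrink the next window by the number of page cache hits. */
static unsigned long toy_next_after_hits(unsigned long next_size,
                                         unsigned long hits)
{
    return hits >= next_size ? 0 : next_size - hits;
}

/* Item 3: collapse the ahead window to zero if the queue was congested. */
static unsigned long toy_ahead_after_submit(unsigned long ahead_size,
                                            int congested)
{
    return congested ? 0 : ahead_size;
}

/* Item 7 (assumed doubling): grow the window, capped at max_size. */
static unsigned long toy_expand(unsigned long size, unsigned long max_size)
{
    unsigned long doubled = size ? size * 2 : 1;
    return doubled > max_size ? max_size : doubled;
}

/* Item 7 (assumed halving): shrink the window. */
static unsigned long toy_shrink(unsigned long size)
{
    return size / 2;
}
```

For instance, a 32-page next window with 8 cache hits becomes 24 pages,
and a congested submission leaves the ahead window at zero so that it is
repopulated on the next attempt.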
> Ok, I have re-run the Tiobench tests. On a single cpu ide based system
> your new patches have no noticeable effect on sequential read performance
> (a good thing); but on random I/O things went bad :-(.
>
> Here are the random read results for 16k io with 4GB fileset on 256MB
> mem, single cpu IDE
>
>            Stock        w/ patches
>
> Threads    MBs/sec      MBs/sec      %diff    diff
> ---------- ------------ ------------ -------- ------------
> 1          1.73         1.72         -0.58    -0.01
> 4          1.70         1.56         -8.24    -0.14
> 16         1.66         0.81         -51.20   -0.85
> 64         1.49         0.68         -54.36   -0.81
>
> As you can see somewhere after 4 threads the new patches cause performance to tank.
>
> With 512k ios the problem kicks in with less than 4 threads.
>
>            Stock        w/ patches
> Threads    MBs/sec      MBs/sec      %diff    diff
> ---------- ------------ ------------ -------- ------------
> 1          18.50        18.55        0.27     0.05
> 4          8.55         6.59         -22.92   -1.96
> 16         8.40         5.18         -38.33   -3.22
> 64         7.34         4.76         -35.15   -2.58
>
>
> Unfortunately this is the _good_ news. The bad news is that this is much worse on SCSI.
> We lose a few percent on sequential reads for all block sizes and random is just totally screwed.
>
> Here is the same 16k io requests size with 4GB fileset on 1GB memory on 8way system on single scsi disk.
>
>            stock        w/ patch
> Threads    MBs/sec      MBs/sec      %diff    diff
> ---------- ------------ ------------ -------- ------------
> 1          3.43         3.03         -11.66   -0.40
> 4          4.51         1.06         -76.50   -3.45
> 16         5.86         1.43         -75.60   -4.43
> 64         6.13         1.66         -72.92   -4.47
>
> 11% degrade even on 1 thread, 75% degrade for 4 threads and above! This is horribly broken.
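
The %diff column in the tables above is consistent with the change
relative to the stock kernel, (patched - stock) / stock * 100. A minimal
helper (hypothetical name) for checking a row:

```c
/* Percentage change of the patched throughput relative to stock. */
static double toy_pct_diff(double stock, double patched)
{
    return (patched - stock) / stock * 100.0;
}
```

For the 4-thread SCSI row, toy_pct_diff(4.51, 1.06) is about -76.50, which
matches the table.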
>
>
Sorry for the late response. Was out yesterday.

Yes something is broken horribly. Will look into what is broken.

RP