2006-05-27 15:51:28

by Wu Fengguang

Subject: [PATCH 00/32] Adaptive readahead V14

Andrew,

This is the 14th release of the adaptive readahead patchset.

Thanks to Andrew Morton, Nick Piggin and Peter Zijlstra, the
patchset has been reviewed and greatly improved.

It has been tested in a wide range of applications over the past
six months, and has been polished up considerably.


Performance benefits
====================

Besides file servers and desktops, it has recently been found to
benefit postgresql databases a lot.

I explained to pgsql users how the patch may help their db performance:
http://archives.postgresql.org/pgsql-performance/2006-04/msg00491.php
[QUOTE]
HOW IT WORKS

In adaptive readahead, the context based method may be of particular
interest to postgresql users. It works by peeking into the file cache
and checking whether any history pages are present or have been
accessed. In this way it can detect almost all forms of sequential /
semi-sequential read patterns, e.g.
- parallel / interleaved sequential scans on one file
- sequential reads across file open/close
- mixed sequential / random accesses
- sparse / skimming sequential reads

It also has methods to detect some less common cases:
- reading backward
- seeking all over, reading N pages at a time

(A rough sketch of the context based probing follows the lists.)
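
In (simplified, invented) code, the probing might look like this - a
minimal sketch of the idea only, not the patchset's actual probe_page()
/ query_page_cache_segment() logic; the helper name and the 'lookback'
parameter are made up for illustration:

#include <linux/fs.h>
#include <linux/pagemap.h>

static int looks_semi_sequential(struct address_space *mapping,
				 pgoff_t index, unsigned long lookback)
{
	pgoff_t i;

	/* Is any history page shortly before 'index' still cached? */
	for (i = 0; i < lookback && i < index; i++) {
		struct page *page = find_get_page(mapping, index - 1 - i);

		if (page) {
			page_cache_release(page);
			return 1;	/* part of a (semi-)sequential stream */
		}
	}
	return 0;	/* no history nearby: treat as a random read */
}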

WAYS TO BENEFIT FROM IT

As we know, postgresql relies on the kernel to do proper readahead.
The adaptive readahead might help performance in the following cases:
- concurrent sequential scans
- sequential scan on a fragmented table
(some DBs suffer from this problem; not sure about pgsql)
- index scan with clustered matches
- index scan on majority rows (in case the planner goes wrong)

And received positive responses:
[QUOTE from Michael Stone]
I've got one DB where the VACUUM ANALYZE generally takes 11M-12M ms;
with the patch the job took 1.7M ms. Another VACUUM that normally takes
between 300k-500k ms took 150k. Definitely a promising addition.

[QUOTE from Michael Stone]
>I'm thinking about it, we're already using a fixed read-ahead of 16MB
>using blockdev on the stock Redhat 2.6.9 kernel, it would be nice to
>not have to set this so we may try it.

FWIW, I never saw much performance difference from doing that. Wu's
patch, OTOH, gave a big boost.

[QUOTE: odbc-bench with Postgresql 7.4.11 on dual Opteron]
Base kernel:
Transactions per second: 92.384758
Transactions per second: 99.800896

After the readahead patch, vm.readahead_ratio = 100:
Transactions per second: 105.461952
Transactions per second: 105.458664

vm.readahead_ratio = 100 ; vm.readahead_hit_rate = 1:
Transactions per second: 113.055367
Transactions per second: 124.815910


Patches
=======

All 32 patches are bisect friendly.

The following 28 patches are only logically separated -
one should not remove any one of them and expect the others to compile cleanly:

[patch 01/32] readahead: kconfig options
[patch 02/32] radixtree: introduce radix_tree_scan_hole[_backward]()
[patch 03/32] mm: introduce probe_page()
[patch 04/32] mm: introduce PG_readahead
[patch 05/32] readahead: add look-ahead support to __do_page_cache_readahead()
[patch 06/32] readahead: delay page release in do_generic_mapping_read()
[patch 07/32] readahead: insert cond_resched() calls
[patch 08/32] readahead: {MIN,MAX}_RA_PAGES
[patch 09/32] readahead: events accounting
[patch 10/32] readahead: rescue_pages()
[patch 11/32] readahead: sysctl parameters
[patch 12/32] readahead: min/max sizes
[patch 13/32] readahead: state based method - aging accounting
[patch 14/32] readahead: state based method - routines
[patch 15/32] readahead: state based method
[patch 16/32] readahead: context based method
[patch 17/32] readahead: initial method - guiding sizes
[patch 18/32] readahead: initial method - thrashing guard size
[patch 19/32] readahead: initial method - expected read size
[patch 20/32] readahead: initial method - user recommended size
[patch 21/32] readahead: initial method
[patch 22/32] readahead: backward prefetching method
[patch 23/32] readahead: seeking reads method
[patch 24/32] readahead: thrashing recovery method
[patch 25/32] readahead: call scheme
[patch 26/32] readahead: laptop mode
[patch 27/32] readahead: loop case
[patch 28/32] readahead: nfsd case

The following 4 patches are for debugging purposes, and for -mm only:

[patch 29/32] readahead: turn on by default
[patch 30/32] readahead: debug radix tree new functions
[patch 31/32] readahead: debug traces showing accessed file names
[patch 32/32] readahead: debug traces showing read patterns

Diffstat
========

Documentation/sysctl/vm.txt | 37 +
block/ll_rw_blk.c | 34
drivers/block/loop.c | 6
fs/file_table.c | 7
fs/mpage.c | 4
fs/nfsd/vfs.c | 5
include/linux/backing-dev.h | 3
include/linux/fs.h | 74 +-
include/linux/mm.h | 31
include/linux/mmzone.h | 6
include/linux/page-flags.h | 6
include/linux/pagemap.h | 2
include/linux/radix-tree.h | 4
include/linux/sysctl.h | 2
include/linux/writeback.h | 6
kernel/sysctl.c | 28
lib/radix-tree.c | 71 +
mm/Kconfig | 62 +
mm/filemap.c | 112 ++-
mm/page-writeback.c | 2
mm/page_alloc.c | 18
mm/readahead.c | 1599 +++++++++++++++++++++++++++++++++++++++++++-
mm/swap.c | 2
mm/vmscan.c | 4
24 files changed, 2088 insertions(+), 37 deletions(-)

Changelog
=========

V14 2006-05-27
- remove __radix_tree_lookup_parent()
- implement radix_tree_scan_hole*() as dumb and safe ones
- break file_ra_state.cache_hits into u16s
- rationalize ra_dispatch() and move look-ahead/age stuff here
- move node_free_and_cold_pages() to page_alloc.c/nr_free_inactive_pages_node()
- fix a bug in query_page_cache_segment()
- adjust RA_FLAG_XXX to avoid conflict with ra_class_{new,old}
- random comments

V13 2006-05-26
- remove radix tree look-aside cache
- fix radix tree NULL dereference bug
- fix radix tree bugs on direct embedded data
- add comment on cold_page_refcnt()
- rename find_page() to probe_page()
- replace the non-atomic __SetPageReadahead()
- fix the risky rescue_pages()
- some cleanups recommended by Nick Piggin

V12 2006-05-24
- improve small files case
- allow pausing of events accounting
- disable sparse read-ahead by default
- a bug fix in radix_tree_cache_lookup_parent()
- more cleanups

V11 2006-03-19
- patchset rework
- add kconfig option to make the feature compile-time selectable
- improve radix tree scan functions
- fix bug of using smp_processor_id() in preemptible code
- avoid overflow in compute_thrashing_threshold()
- disable sparse read prefetching if (readahead_hit_rate == 1)
- make thrashing recovery a standalone function
- random cleanups

V10 2005-12-16
- remove delayed page activation
- remove live page protection
- revert mmap readaround to old behavior
- default to original readahead logic
- default to original readahead size
- merge comment fixes from Andreas Mohr
- merge radixtree cleanups from Christoph Lameter
- reduce sizeof(struct file_ra_state) by unnamed union
- stateful method cleanups
- account other read-ahead paths

V9 2005-12-3
- standalone mmap read-around code, a little smarter and more tunable
- make stateful method sensible of request size
- decouple readahead_ratio from live pages protection
- let readahead_ratio contribute to ra_size growth speed in stateful method
- account variance of ra_size

V8 2005-11-25

- balance zone aging only in page reclaim paths and do it right
- do the aging of slabs in the same way as zones
- add debug code to dump the detailed page reclaim steps
- undo exposing of struct radix_tree_node and uninline related functions
- work better with nfsd
- generalize accelerated context based read-ahead
- account smooth read-ahead aging based on page referenced/activate bits
- avoid divide error in compute_thrashing_threshold()
- more low latency efforts
- update some comments
- rebase debug actions on debugfs entries instead of magic readahead_ratio values

V7 2005-11-09

- new tunable parameters: readahead_hit_rate/readahead_live_chunk
- support sparse sequential accesses
- delay look-ahead if drive is spun down in laptop mode
- disable look-ahead for loopback file
- make mandatory thrashing protection more simple and robust
- attempt to improve responsiveness on large read-ahead size

V6 2005-11-01

- cancel look-ahead in laptop mode
- increase read-ahead limit to 0xFFFF pages

V5 2005-10-28

- rewrite context based method to make it clean and robust
- improved accuracy of stateful thrashing threshold estimation
- make page aging equal to the number of code pages scanned
- sort out the thrashing protection logic
- enhanced debug/accounting facilities

V4 2005-10-15

- detect and save live chunks on page reclaim
- support database workload
- support reading backward
- radix tree lookup look-aside cache

V3 2005-10-06

- major code reorganization and documentation
- stateful estimation of thrashing-threshold
- context method with accelerated grow up phase
- adaptive look-ahead
- early detection and rescue of pages in danger
- statistics data collection
- synchronized page aging between zones

V2 2005-09-15

- delayed page activation
- look-ahead: towards pipelined read-ahead

V1 2005-09-13

Initial release which features:
o stateless (for now)
o adapts to available memory / read speed
o free of thrashing (in theory)

And handles:
o large number of slow streams (FTP server)
o open/read/close access patterns (NFS server)
o multiple interleaved, sequential streams in one file
(multithread / multimedia / database)

Cheers,
Wu Fengguang
--
Dept. of Automation, University of Science and Technology of China


2006-05-27 17:29:54

by Michael Tokarev

Subject: Re: [PATCH 00/32] Adaptive readahead V14

Wu Fengguang wrote:
> Andrew,
>
> This is the 14th release of the adaptive readahead patchset.

A question I wanted to ask for quite some time already, but for
some reason didn't ask...

How does the new readahead logic work with media read errors?
Current linux behavior is questionable (it killed my dvd drive,
for example, due to too many retries to read a single bad block
on a CD-Rom). I think it should stop reading ahead when a read
error occurs, instead of re-trying, and only retry reading that
block (if at all) when, and only when, an application asks for
that block. I'm unsure when it should "resume reading ahead"
again (ie, setting ra to 0 on the first error and restoring it
once we try to read past the bad block, or setting it to 0 and
increasing it on subsequent reads one by one back to the original
value, or...) - but that's probably a different story; for now,
I think just setting ra to 0 on read error will be sufficient...
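
Something like this hypothetical sketch (the struct and both hooks
are made up for illustration; this is not a real kernel patch):

/* Collapse the readahead window on the first read error, then grow
 * it back one page per successful read, as described above. */
struct ra_backoff {
	unsigned long ra_pages;		/* current readahead window */
	unsigned long saved_ra_pages;	/* window before the error */
};

static void on_read_error(struct ra_backoff *ra)
{
	if (ra->ra_pages) {
		ra->saved_ra_pages = ra->ra_pages;
		ra->ra_pages = 0;	/* stop reading ahead entirely */
	}
}

static void on_read_success(struct ra_backoff *ra)
{
	if (ra->ra_pages < ra->saved_ra_pages)
		ra->ra_pages++;		/* ramp back toward the old value */
}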

Thanks.

/mjt

2006-05-28 12:08:16

by Wu Fengguang

Subject: Re: [PATCH 00/32] Adaptive readahead V14

On Sat, May 27, 2006 at 09:29:46PM +0400, Michael Tokarev wrote:
> How does the new readahead logic work with media read errors?
> Current linux behavior is questionable (it killed my dvd drive,
> for example, due to too many retries to read a single bad block
> on a CD-Rom). I think it should stop reading ahead when a read
> error occurs, instead of re-trying, and only retry reading that
> block (if at all) when, and only when, an application asks for
> that block. I'm unsure when it should "resume reading ahead"
> again (ie, setting ra to 0 on the first error and restoring it
> once we try to read past the bad block, or setting it to 0 and
> increasing it on subsequent reads one by one back to the original
> value, or...) - but that's probably a different story; for now,
> I think just setting ra to 0 on read error will be sufficient...

It's not quite reasonable for readahead to worry about media errors.
If the media fails, fix it. Or it will hurt reads sooner or later.

2006-05-28 19:23:38

by Michael Tokarev

Subject: Re: [PATCH 00/32] Adaptive readahead V14

Wu Fengguang wrote:
>
> It's not quite reasonable for readahead to worry about media errors.
> If the media fails, fix it. Or it will hurt reads sooner or later.

Well... In reality, it is just the opposite.

Suppose there's a CD-rom with a scratch/etc, one sector is unreadable.
In order to "fix" it, one has to read it and write it to another CD-rom,
or something.. or just ignore the error (if it's just a skip in a video
stream). Let's assume the unreadable block is number U.

But current behavior is just insane. An application requests block
number N, which is before U. Kernel tries to read-ahead blocks N..U.
Cdrom drive tries to read it, re-read it.. for some time. Finally,
when all the N..U-1 blocks are read, kernel returns block number N
(as requested) to an application, successfully.

Now an app requests block number N+1, and kernel tries to read
blocks N+1..U+1. Retrying again as in previous step.

And so on, up to when an app requests block number U-1. And when,
finally, it requests block U, it receives read error.

So, the kernel currently tries to re-read the same failing block as
many times as the current readahead value (256 (times?) by default).

This whole process already killed my cdrom drive (I posted about it
to LKML several months ago) - literally, the drive has fried, and
does not work anymore. Of course that problem was a bug in firmware
(or whatever) of the drive *too*, but.. main problem with that is
current readahead logic as described above.

With that logic, an app also becomes unkillable (at least for some
time) -- ie, even when I knew something's wrong and the CDrom should
not behave like it was, I wasn't able to stop it until I powered the
machine off (just unplugged the power cable) - but.. too late.

Yes, bad media is just that - a bad thing. But it's not a reason to
force power unplug to stop the process, and not a reason to burn a
drive (or anything else). And this is where readahead comes into
play - it IS the read-ahead logic that's responsible for the situation.

And there are a lot of scratched/whatever CD-Roms out there -
unreadable CDrom (or a floppy which is already ancient, or some
other media) - you can't just say to every user out there that
linux isn't compatible with all people's stuff and those people
should "fix" it before ever trying to insert it into their linux
machine...

Thanks.

/mjt

2006-05-29 03:01:55

by Wu Fengguang

Subject: Re: [PATCH 00/32] Adaptive readahead V14

On Sun, May 28, 2006 at 11:23:33PM +0400, Michael Tokarev wrote:
> Wu Fengguang wrote:
> >
> > It's not quite reasonable for readahead to worry about media errors.
> > If the media fails, fix it. Or it will hurt reads sooner or later.
>
> Well... In reality, it is just the opposite.
>
> Suppose there's a CD-rom with a scratch/etc, one sector is unreadable.
> In order to "fix" it, one has to read it and write it to another CD-rom,
> or something.. or just ignore the error (if it's just a skip in a video
> stream). Let's assume the unreadable block is number U.
>
> But current behavior is just insane. An application requests block
> number N, which is before U. Kernel tries to read-ahead blocks N..U.
> Cdrom drive tries to read it, re-read it.. for some time. Finally,
> when all the N..U-1 blocks are read, kernel returns block number N
> (as requested) to an application, successfully.
>
> Now an app requests block number N+1, and kernel tries to read
> blocks N+1..U+1. Retrying again as in previous step.
>
> And so on, up to when an app requests block number U-1. And when,
> finally, it requests block U, it receives read error.
>
> So, the kernel currently tries to re-read the same failing block as
> many times as the current readahead value (256 (times?) by default).

Good insight... But I'm not sure about it.

Jens, will a bad sector cause the _whole_ request to fail?
Or only the page that contains the bad sector?

> This whole process already killed my cdrom drive (I posted about it
> to LKML several months ago) - literally, the drive has fried, and
> does not work anymore. Of course that problem was a bug in firmware
> (or whatever) of the drive *too*, but.. main problem with that is
> current readahead logic as described above.
>
> With that logic, an app also becomes unkillable (at least for some
> time) -- ie, even when I knew something's wrong and the CDrom should
> not behave like it was, I wasn't able to stop it until I powered the
> machine off (just unplugged the power cable) - but.. too late.
>
> Yes, bad media is just that - a bad thing. But it's not a reason to
> force power unplug to stop the process, and not a reason to burn a
> drive (or anything else). And this is where readahead comes into
> play - it IS the read-ahead logic that's responsible for the situation.
>
> And there are a lot of scratched/whatever CD-Roms out there -
> unreadable CDrom (or a floppy which is already ancient, or some
> other media) - you can't just say to every user out there that
> linux isn't compatible with all people's stuff and those people
> should "fix" it before ever trying to insert it into their linux
> machine...
>
> Thanks.
>
> /mjt

2006-05-30 09:21:26

by Jens Axboe

Subject: Re: [PATCH 00/32] Adaptive readahead V14

On Mon, May 29 2006, Wu Fengguang wrote:
> On Sun, May 28, 2006 at 11:23:33PM +0400, Michael Tokarev wrote:
> > Wu Fengguang wrote:
> > >
> > > It's not quite reasonable for readahead to worry about media errors.
> > > If the media fails, fix it. Or it will hurt reads sooner or later.
> >
> > Well... In reality, it is just the opposite.
> >
> > Suppose there's a CD-rom with a scratch/etc, one sector is unreadable.
> > In order to "fix" it, one has to read it and write it to another CD-rom,
> > or something.. or just ignore the error (if it's just a skip in a video
> > stream). Let's assume the unreadable block is number U.
> >
> > But current behavior is just insane. An application requests block
> > number N, which is before U. Kernel tries to read-ahead blocks N..U.
> > Cdrom drive tries to read it, re-read it.. for some time. Finally,
> > when all the N..U-1 blocks are read, kernel returns block number N
> > (as requested) to an application, successfully.
> >
> > Now an app requests block number N+1, and kernel tries to read
> > blocks N+1..U+1. Retrying again as in previous step.
> >
> > And so on, up to when an app requests block number U-1. And when,
> > finally, it requests block U, it receives read error.
> >
> > So, the kernel currently tries to re-read the same failing block as
> > many times as the current readahead value (256 (times?) by default).
>
> Good insight... But I'm not sure about it.
>
> Jens, will a bad sector cause the _whole_ request to fail?
> Or only the page that contains the bad sector?

Depends entirely on the driver, and at that point we've typically lost the
fact that this is a read-ahead request and could just be tossed. In
fact, the entire request may consist of read-ahead as well as normal
read entries.

For ide-cd, it tends to only end the first part of the request on a
medium error. So you may see a lot of repeats :/

--
Jens Axboe

2006-05-30 11:32:15

by Wu Fengguang

Subject: Re: [PATCH 00/32] Adaptive readahead V14

On Tue, May 30, 2006 at 11:23:10AM +0200, Jens Axboe wrote:
> On Mon, May 29 2006, Wu Fengguang wrote:
> > On Sun, May 28, 2006 at 11:23:33PM +0400, Michael Tokarev wrote:
> > > Wu Fengguang wrote:
> > > >
> > > > It's not quite reasonable for readahead to worry about media errors.
> > > > If the media fails, fix it. Or it will hurt reads sooner or later.
> > >
> > > Well... In reality, it is just the opposite.
> > >
> > > Suppose there's a CD-rom with a scratch/etc, one sector is unreadable.
> > > In order to "fix" it, one has to read it and write it to another CD-rom,
> > > or something.. or just ignore the error (if it's just a skip in a video
> > > stream). Let's assume the unreadable block is number U.
> > >
> > > But current behavior is just insane. An application requests block
> > > number N, which is before U. Kernel tries to read-ahead blocks N..U.
> > > Cdrom drive tries to read it, re-read it.. for some time. Finally,
> > > when all the N..U-1 blocks are read, kernel returns block number N
> > > (as requested) to an application, successfully.
> > >
> > > Now an app requests block number N+1, and kernel tries to read
> > > blocks N+1..U+1. Retrying again as in previous step.
> > >
> > > And so on, up to when an app requests block number U-1. And when,
> > > finally, it requests block U, it receives read error.
> > >
> > > So, the kernel currently tries to re-read the same failing block as
> > > many times as the current readahead value (256 (times?) by default).
> >
> > Good insight... But I'm not sure about it.
> >
> > Jens, will a bad sector cause the _whole_ request to fail?
> > Or only the page that contains the bad sector?
>
> Depends entirely on the driver, and at that point we've typically lost the
> fact that this is a read-ahead request and could just be tossed. In
> fact, the entire request may consist of read-ahead as well as normal
> read entries.
>
> For ide-cd, it tends to only end the first part of the request on a
> medium error. So you may see a lot of repeats :/

Another question about it:
If the block layer issued a request, which happened to contain
R ranges of B bad blocks in total, e.g. 3 ranges of 9 bad blocks:
___b_____bb___________bbbbbb____
How many retries will this incur? 1, 3, 9, or something else?
If it is 3 or more, then we are even more out of luck :(

Will it be suitable to _automatically_ apply the following retracting
policy on I/O error? Please comment if there are better ways:

--- linux-2.6.17-rc4-mm3.orig/mm/filemap.c
+++ linux-2.6.17-rc4-mm3/mm/filemap.c
@@ -983,6 +983,7 @@ readpage:
 			}
 			unlock_page(page);
 			error = -EIO;
+			ra.ra_pages /= 2;
 			goto readpage_error;
 		}
 		unlock_page(page);
@@ -1535,6 +1536,7 @@ page_not_uptodate:
 	 * Things didn't work out. Return zero to tell the
 	 * mm layer so, possibly freeing the page cache page first.
 	 */
+	ra->ra_pages /= 2;
 	page_cache_release(page);
 	return NULL;
 }

2006-05-30 12:27:42

by Jens Axboe

Subject: Re: [PATCH 00/32] Adaptive readahead V14

On Tue, May 30 2006, Wu Fengguang wrote:
> On Tue, May 30, 2006 at 11:23:10AM +0200, Jens Axboe wrote:
> > On Mon, May 29 2006, Wu Fengguang wrote:
> > > On Sun, May 28, 2006 at 11:23:33PM +0400, Michael Tokarev wrote:
> > > > Wu Fengguang wrote:
> > > > >
> > > > > It's not quite reasonable for readahead to worry about media errors.
> > > > > If the media fails, fix it. Or it will hurt reads sooner or later.
> > > >
> > > > Well... In reality, it is just the opposite.
> > > >
> > > > Suppose there's a CD-rom with a scratch/etc, one sector is unreadable.
> > > > In order to "fix" it, one has to read it and write it to another CD-rom,
> > > > or something.. or just ignore the error (if it's just a skip in a video
> > > > stream). Let's assume the unreadable block is number U.
> > > >
> > > > But current behavior is just insane. An application requests block
> > > > number N, which is before U. Kernel tries to read-ahead blocks N..U.
> > > > Cdrom drive tries to read it, re-read it.. for some time. Finally,
> > > > when all the N..U-1 blocks are read, kernel returns block number N
> > > > (as requested) to an application, successfully.
> > > >
> > > > Now an app requests block number N+1, and kernel tries to read
> > > > blocks N+1..U+1. Retrying again as in previous step.
> > > >
> > > > And so on, up to when an app requests block number U-1. And when,
> > > > finally, it requests block U, it receives read error.
> > > >
> > > > So, the kernel currently tries to re-read the same failing block as
> > > > many times as the current readahead value (256 (times?) by default).
> > >
> > > Good insight... But I'm not sure about it.
> > >
> > > Jens, will a bad sector cause the _whole_ request to fail?
> > > Or only the page that contains the bad sector?
> >
> > Depends entirely on the driver, and at that point we've typically lost the
> > fact that this is a read-ahead request and could just be tossed. In
> > fact, the entire request may consist of read-ahead as well as normal
> > read entries.
> >
> > For ide-cd, it tends to only end the first part of the request on a
> > medium error. So you may see a lot of repeats :/
>
> Another question about it:
> If the block layer issued a request, which happened to contain
> R ranges of B bad blocks in total, e.g. 3 ranges of 9 bad blocks:
> ___b_____bb___________bbbbbb____
> How many retries will this incur? 1, 3, 9, or something else?
> If it is 3 or more, then we are even more out of luck :(

Again, this is driver specific. But for ide-cd, if it's using PIO the
right thing should happen, since we do each chunk individually. For dma
it looks much worse, since we only get an EIO back from the hardware for
the entire range. It won't do the right thing at all, only for the very
last chunk when we get past the last bbbbbb block.

> Will it be suitable to _automatically_ apply the following retracting
> policy on I/O error? Please comment if there are better ways:

Probably it should scale down even more aggressively. The real
problem is the drivers, of course; we should spend some time fixing
them up too.

--
Jens Axboe

2006-05-30 14:34:17

by Wu Fengguang

Subject: Re: [PATCH 00/32] Adaptive readahead V14

On Tue, May 30, 2006 at 02:29:34PM +0200, Jens Axboe wrote:
> On Tue, May 30 2006, Wu Fengguang wrote:
> > On Tue, May 30, 2006 at 11:23:10AM +0200, Jens Axboe wrote:
> > > On Mon, May 29 2006, Wu Fengguang wrote:
> > > > On Sun, May 28, 2006 at 11:23:33PM +0400, Michael Tokarev wrote:
> > > > > Wu Fengguang wrote:
> > > > > >
> > > > > > It's not quite reasonable for readahead to worry about media errors.
> > > > > > If the media fails, fix it. Or it will hurt reads sooner or later.
> > > > >
> > > > > Well... In reality, it is just the opposite.
> > > > >
> > > > > Suppose there's a CD-rom with a scratch/etc, one sector is unreadable.
> > > > > In order to "fix" it, one has to read it and write it to another CD-rom,
> > > > > or something.. or just ignore the error (if it's just a skip in a video
> > > > > stream). Let's assume the unreadable block is number U.
> > > > >
> > > > > But current behavior is just insane. An application requests block
> > > > > number N, which is before U. Kernel tries to read-ahead blocks N..U.
> > > > > Cdrom drive tries to read it, re-read it.. for some time. Finally,
> > > > > when all the N..U-1 blocks are read, kernel returns block number N
> > > > > (as requested) to an application, successfully.
> > > > >
> > > > > Now an app requests block number N+1, and kernel tries to read
> > > > > blocks N+1..U+1. Retrying again as in previous step.
> > > > >
> > > > > And so on, up to when an app requests block number U-1. And when,
> > > > > finally, it requests block U, it receives read error.
> > > > >
> > > > > So, the kernel currently tries to re-read the same failing block as
> > > > > many times as the current readahead value (256 (times?) by default).
> > > >
> > > > Good insight... But I'm not sure about it.
> > > >
> > > > Jens, will a bad sector cause the _whole_ request to fail?
> > > > Or only the page that contains the bad sector?
> > >
> > > Depends entirely on the driver, and at that point we've typically lost the
> > > fact that this is a read-ahead request and could just be tossed. In
> > > fact, the entire request may consist of read-ahead as well as normal
> > > read entries.
> > >
> > > For ide-cd, it tends to only end the first part of the request on a
> > > medium error. So you may see a lot of repeats :/
> >
> > Another question about it:
> > If the block layer issued a request, which happened to contain
> > R ranges of B bad blocks in total, e.g. 3 ranges of 9 bad blocks:
> > ___b_____bb___________bbbbbb____
> > How many retries will this incur? 1, 3, 9, or something else?
> > If it is 3 or more, then we are even more out of luck :(
>
> Again, this is driver specific. But for ide-cd, if it's using PIO the
> right thing should happen since we do each chunk individually. For dma
> it looks much worse, since we only get an EIO back from the hardware for
> the entire range. It won't do the right thing at all, only for the very
> last chunk when we get past the last bbbbbb block.
>
> > Will it be suitable to _automatically_ apply the following retracting
> > policy on I/O error? Please comment if there are better ways:
>
> Probably it should scale down even more aggressively. The real
> problem is the drivers, of course; we should spend some time fixing
> them up too.

nod, it's so frustrating...

Updated the patch; please comment if necessary.

With this patch, retries are reduced from, say, 256, to 5.
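
To see where the 5 comes from: the default 256-page window shrinks
by 4x on each failed attempt, i.e. 256 -> 64 -> 16 -> 4 -> 1 -> 0.
A standalone userspace demo of the arithmetic (not kernel code):

#include <stdio.h>

int main(void)
{
	unsigned long ra_pages = 256;	/* default max readahead window */
	int retries = 0;

	while (ra_pages) {
		retries++;		/* one more failed readahead attempt */
		ra_pages /= 4;		/* as in shrink_readahead_size_eio() */
	}
	printf("retries: %d\n", retries);	/* prints: retries: 5 */
	return 0;
}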

Wu
---

--- linux.orig/mm/filemap.c
+++ linux/mm/filemap.c
@@ -809,6 +809,32 @@ grab_cache_page_nowait(struct address_sp
 EXPORT_SYMBOL(grab_cache_page_nowait);
 
 /*
+ * CD/DVDs are error prone. When a medium error occurs, the driver may fail
+ * a _large_ part of the i/o request. Imagine the worst scenario:
+ *
+ *      ---R__________________________________________B__________
+ *         ^ reading here                             ^ bad block(assume 4k)
+ *
+ * read(R) => miss => readahead(R...B) => media error => frustrating retries
+ * => failing the whole request => read(R) => read(R+1) =>
+ * readahead(R+1...B+1) => bang => read(R+2) => read(R+3) =>
+ * readahead(R+3...B+2) => bang => read(R+3) => read(R+4) =>
+ * readahead(R+4...B+3) => bang => read(R+4) => read(R+5) => ......
+ *
+ * It is going insane. Fix it by quickly scaling down the readahead size.
+ */
+static void shrink_readahead_size_eio(struct file *filp,
+					struct file_ra_state *ra)
+{
+	if (!ra->ra_pages)
+		return;
+
+	ra->ra_pages /= 4;
+	printk(KERN_WARNING "Retracting readahead size of %s to %lu\n",
+			filp->f_dentry->d_iname, ra->ra_pages);
+}
+
+/*
  * This is a generic file read routine, and uses the
  * mapping->a_ops->readpage() function for the actual low-level
  * stuff.
@@ -983,6 +1009,7 @@ readpage:
 			}
 			unlock_page(page);
 			error = -EIO;
+			shrink_readahead_size_eio(filp, &ra);
 			goto readpage_error;
 		}
 		unlock_page(page);
@@ -1535,6 +1562,7 @@ page_not_uptodate:
 	 * Things didn't work out. Return zero to tell the
 	 * mm layer so, possibly freeing the page cache page first.
 	 */
+	shrink_readahead_size_eio(file, ra);
 	page_cache_release(page);
 	return NULL;
 }