[readahead.c has badly readable comments, on a standard
80-column display: many lines have a size just slightly
over 80 chars]
In the good old days we had tunable readahead.
Very good, especially for special purposes.
I recall the days when I tried to get something off
a bad SCSI disk, and the kernel would die in the retries
trying to read a bad block, while the data I needed was
not in the block but just before. Set readahead to zero
and all was fine.
Yesterday evening I was playing with my sddr09 driver,
reading SmartMedia cards, and found to my dismay that
the kernel wants to do a 128 block readahead.
Not only is that bad on a slow medium (one waits a
noticeable time for unwanted data), but worse:
setting the readahead no longer works.
[Indeed, it is very desirable to be able to set readahead
to zero. It is also desirable to be able to set it to a
small value. Today on 2.5.8 both are impossible, readahead.c
insists on a minimum readahead of 16 sectors.]
Andries
On Tue, 2002-04-16 at 07:54, [email protected] wrote:
> [readahead.c has badly readable comments, on a standard
> 80-column display: many lines have a size just slightly
> over 80 chars]
[other stuff snipped]
How is this? Now the longest comment line should end at column 80.
Patch is made against 2.5.8-dj1.
Steven
--- linux-2.5.8-dj1/mm/readahead.c.orig Tue Apr 16 09:25:36 2002
+++ linux-2.5.8-dj1/mm/readahead.c Tue Apr 16 09:54:31 2002
@@ -63,9 +63,10 @@
* Together, start and size represent the `readahead window'.
* next_size: The number of pages to read when we get the next readahead miss.
* prev_page: The page which the readahead algorithm most-recently inspected.
- * prev_page is mainly an optimisation: if page_cache_readahead sees
- * that it is again being called for a page which it just looked at,
- * it can return immediately without making any state changes.
+ * prev_page is mainly an optimisation: if page_cache_readahead
+ * sees that it is again being called for a page which it just
+ * looked at, it can return immediately without making any state
+ * changes.
* ahead_start,
* ahead_size: Together, these form the "ahead window".
*
@@ -95,30 +96,31 @@
* If readahead hits are more sparse (say, the application is only reading every
* second page) then the window will build more slowly.
*
- * On a readahead miss (the application seeked away) the readahead window is shrunk
- * by 25%. We don't want to drop it too aggressively, because it's a good assumption
- * that an application which has built a good readahead window will continue to
- * perform linear reads. Either at the new file position, or at the old one after
- * another seek.
- *
- * There is a special-case: if the first page which the application tries to read
- * happens to be the first page of the file, it is assumed that a linear read is
- * about to happen and the window is immediately set to half of the device maximum.
+ * On a readahead miss (the application seeked away) the readahead window is
+ * shrunk by 25%. We don't want to drop it too aggressively, because it's a
+ * good assumption that an application which has built a good readahead window
+ * will continue to perform linear reads. Either at the new file position,
+ * or at the old one after another seek.
+ *
+ * There is a special-case: if the first page which the application tries to
+ * read happens to be the first page of the file, it is assumed that a linear
+ * read is about to happen and the window is immediately set to half of the
+ * device maximum.
*
* A page request at (start + size) is not a miss at all - it's just a part of
* sequential file reading.
*
* This function is to be called for every page which is read, rather than when
- * it is time to perform readahead. This is so the readahead algorithm can centrally
- * work out the access patterns. This could be costly with many tiny read()s, so
- * we specifically optimise for that case with prev_page.
+ * it is time to perform readahead. This is so the readahead algorithm can
+ * centrally work out the access patterns. This could be costly with many tiny
+ * read()s, so we specifically optimise for that case with prev_page.
*/
/*
* do_page_cache_readahead actually reads a chunk of disk. It allocates all the
- * pages first, then submits them all for I/O. This avoids the very bad behaviour
- * which would occur if page allocations are causing VM writeback. We really don't
- * want to intermingle reads and writes like that.
+ * pages first, then submits them all for I/O. This avoids the very bad
+ * behaviour which would occur if page allocations are causing VM writeback.
+ * We really don't want to intermingle reads and writes like that.
*/
void do_page_cache_readahead(struct file *file,
unsigned long offset, unsigned long nr_to_read)
@@ -231,7 +233,8 @@
ra->next_size += 2;
} else {
/*
- * A miss - lseek, pread, etc. Shrink the readahead window by 25%.
+ * A miss - lseek, pread, etc.
+ * Shrink the readahead window by 25%.
*/
ra->next_size -= ra->next_size / 4;
if (ra->next_size < min)
[email protected] wrote:
>
> [readahead.c has badly readable comments, on a standard
> 80-column display: many lines have a size just slightly
> over 80 chars]
Sigh. At least it has comments. Agree with the 80-column
thing, but I find for the kernel coding style, 80 is just
5-10 columns too short, often.
> In the good old days we had tunable readahead.
> Very good, especially for special purposes.
readahead is tunable, but the window size is stored
at the request queue layer. If it has never been
set, or if the device doesn't have a request queue,
you get the defaults.
Do these cards not have a request queue? Suggestions
are sought.
> I recall the days when I tried to get something off
> a bad SCSI disk, and the kernel would die in the retries
> trying to read a bad block, while the data I needed was
> not in the block but just before. Set readahead to zero
> and all was fine.
Yes, but things should be OK as-is. If the readahead attempt
gets an I/O error, do_generic_file_read will notice the non-uptodate
page and will issue a single-page read. So everything up to
a page's distance from the bad block should be recoverable.
That's the theory; can't say that I've tested it.
If the driver is actually dying over the bad block, well, foo.
> Yesterday evening I was playing with my sddr09 driver,
> reading SmartMedia cards, and found to my dismay that
> the kernel wants to do a 128 block readahead.
> Not only is that bad on a slow medium (one waits a
> noticeable time for unwanted data), but worse:
> setting the readahead no longer works.
>
> [Indeed, it is very desirable to be able to set readahead
> to zero. It is also desirable to be able to set it to a
> small value. Today on 2.5.8 both are impossible, readahead.c
> insists on a minimum readahead of 16 sectors.]
Yup. Permitting a window size of zero is on my todo list,
but it would require that the device have a request queue.
Maybe the readahead size should be placed in struct blk_dev_struct,
and not in the request queue?
-
From: Andrew Morton <[email protected]>
>> In the good old days we had tunable readahead.
>> Very good, especially for special purposes.
> readahead is tunable, but the window size is stored
> at the request queue layer. If it has never been
> set, or if the device doesn't have a request queue,
> you get the defaults.
> Do these cards not have a request queue?
The kernel views them as SCSI disks.
So yes, I can do
blockdev --setra 0 /dev/sdc
Unfortunately that does not help in the least.
Indeed, the only user of the readahead info is
readahead.c: get_max_readahead() and it does
blk_ra_kbytes = blk_get_readahead(inode->i_dev) / 2;
if (blk_ra_kbytes < VM_MIN_READAHEAD)
blk_ra_kbytes = VM_MAX_READAHEAD;
We need to distinguish between undefined and explicitly zero.
Also, overriding the value explicitly given by the user
is a bad idea.
>> I recall the days when I tried to get something off
>> a bad SCSI disk, and the kernel would die in the retries
>> trying to read a bad block, while the data I needed was
>> not in the block but just before. Set readahead to zero
>> and all was fine.
> Yes, but things should be OK as-is. If the readahead attempt
> gets an I/O error, do_generic_file_read will notice the non-uptodate
> page and will issue a single-page read. So everything up to
> a page's distance from the bad block should be recoverable.
> That's the theory; can't say that I've tested it.
It is really important to be able to tell the kernel to read and
write only the blocks it has been asked to read and write and
not to touch anything else.
In my SCSI example you go easily past "an I/O error", but what
this driver would do is retry a few times, reset the device,
retry again, reset the scsi bus, and then the kernel would crash
or hang forever. Maybe things are better today, but one does
not want to depend on complicated subsystems recovering
from their errors. There must just not be any errors.
In my situation last night, entirely different things play a role.
This card has a mapping from logical to physical blocks, but a
logical block only has a corresponding physical block when it has
been written at least once. So readahead will ask for blocks that
do not exist yet. (The driver that I put on ftp now recognizes this
situation and returns an all zero block, instead of an error.)
There are other situations where reading something has side effects.
A very common side effect is time delay.
So, for some devices I want to be able to kill read-ahead, even
before the kernel looks at the partition table.
Fortunately, I think that 2.5 will include the code that moves
partition table reading out of the kernel, so this is
really possible.
> If the driver is actually dying over the bad block, well, foo.
> Yup. Permitting a window size of zero is on my todo list,
> but it would require that the device have a request queue.
It has.
Andries
[email protected] wrote:
>
> ...
> Do these cards not have a request queue?
>
> The kernel views them as SCSI disks.
> So yes, I can do
>
> blockdev --setra 0 /dev/sdc
>
> Unfortunately that does not help in the least.
> Indeed, the only user of the readahead info is
> readahead.c: get_max_readahead() and it does
>
> blk_ra_kbytes = blk_get_readahead(inode->i_dev) / 2;
> if (blk_ra_kbytes < VM_MIN_READAHEAD)
> blk_ra_kbytes = VM_MAX_READAHEAD;
>
> We need to distinguish between undefined and explicitly zero.
Yup. The below (untested) patch should fix it up. Assuming
that all drivers use blk_init_queue(), and heaven knows if
that's the case. If not, the readahead window will have to be
set from userspace for those devices.
> ...
>
> Yes, but things should be OK as-is. If the readahead attempt
> gets an I/O error, do_generic_file_read will notice the non-uptodate
> page and will issue a single-page read. So everything up to
> a page's distance from the bad block should be recoverable.
> That's the theory; can't say that I've tested it.
>
> It is really important to be able to tell the kernel to read and
> write only the blocks it has been asked to read and write and
> not to touch anything else.
That is really hard when there is a filesystem mounted.
Even with readahead set to zero, the kernel will go and
read PAGE_CACHE_SIZE chunks. That's not worth changing
to address this problem. As a last resort, the
user would need to perform a sector-by-sector read of
the dud device into a regular file and recover from that.
--- 2.5.8/mm/readahead.c~readahead-fixes Tue Apr 16 12:07:42 2002
+++ 2.5.8-akpm/mm/readahead.c Tue Apr 16 12:13:52 2002
@@ -25,9 +25,6 @@
* has a zero value of ra_sectors.
*/
-#define VM_MAX_READAHEAD 128 /* kbytes */
-#define VM_MIN_READAHEAD 16 /* kbytes (includes current page) */
-
/*
* Return max readahead size for this inode in number-of-pages.
*/
@@ -36,9 +33,6 @@ static int get_max_readahead(struct inod
unsigned blk_ra_kbytes = 0;
blk_ra_kbytes = blk_get_readahead(inode->i_dev) / 2;
- if (blk_ra_kbytes < VM_MIN_READAHEAD)
- blk_ra_kbytes = VM_MAX_READAHEAD;
-
return blk_ra_kbytes >> (PAGE_CACHE_SHIFT - 10);
}
@@ -96,10 +90,10 @@ static int get_min_readahead(struct inod
* second page) then the window will build more slowly.
*
* On a readahead miss (the application seeked away) the readahead window is shrunk
- * by 25%. We don't want to drop it too aggressively, because it's a good assumption
- * that an application which has built a good readahead window will continue to
- * perform linear reads. Either at the new file position, or at the old one after
- * another seek.
+ * by 25%. We don't want to drop it too aggressively, because it is a good
+ * assumption that an application which has built a good readahead window will
+ * continue to perform linear reads. Either at the new file position, or at the
+ * old one after another seek.
*
* There is a special-case: if the first page which the application tries to read
* happens to be the first page of the file, it is assumed that a linear read is
@@ -109,9 +103,9 @@ static int get_min_readahead(struct inod
* sequential file reading.
*
* This function is to be called for every page which is read, rather than when
- * it is time to perform readahead. This is so the readahead algorithm can centrally
- * work out the access patterns. This could be costly with many tiny read()s, so
- * we specifically optimise for that case with prev_page.
+ * it is time to perform readahead. This is so the readahead algorithm can
+ * centrally work out the access patterns. This could be costly with many tiny
+ * read()s, so we specifically optimise for that case with prev_page.
*/
/*
@@ -208,8 +202,10 @@ void page_cache_readahead(struct file *f
goto out;
}
- min = get_min_readahead(inode);
max = get_max_readahead(inode);
+ if (max == 0)
+ goto out; /* No readahead */
+ min = get_min_readahead(inode);
if (ra->next_size == 0 && offset == 0) {
/*
@@ -310,7 +306,8 @@ void page_cache_readaround(struct file *
{
unsigned long target;
unsigned long backward;
- const int min = get_min_readahead(file->f_dentry->d_inode->i_mapping->host) * 2;
+ const int min =
+ get_min_readahead(file->f_dentry->d_inode->i_mapping->host) * 2;
if (file->f_ra.next_size < min)
file->f_ra.next_size = min;
--- 2.5.8/include/linux/mm.h~readahead-fixes Tue Apr 16 12:11:39 2002
+++ 2.5.8-akpm/include/linux/mm.h Tue Apr 16 12:12:14 2002
@@ -539,6 +539,8 @@ extern int filemap_sync(struct vm_area_s
extern struct page *filemap_nopage(struct vm_area_struct *, unsigned long, int);
/* readahead.c */
+#define VM_MAX_READAHEAD 128 /* kbytes */
+#define VM_MIN_READAHEAD 16 /* kbytes (includes current page) */
void do_page_cache_readahead(struct file *file,
unsigned long offset, unsigned long nr_to_read);
void page_cache_readahead(struct file *file, unsigned long offset);
--- 2.5.8/drivers/block/ll_rw_blk.c~readahead-fixes Tue Apr 16 12:12:19 2002
+++ 2.5.8-akpm/drivers/block/ll_rw_blk.c Tue Apr 16 12:13:43 2002
@@ -851,7 +851,7 @@ int blk_init_queue(request_queue_t *q, r
q->plug_tq.data = q;
q->queue_flags = (1 << QUEUE_FLAG_CLUSTER);
q->queue_lock = lock;
- q->ra_sectors = 0; /* Use VM default */
+ q->ra_sectors = VM_MAX_READAHEAD << (PAGE_CACHE_SHIFT - 9);
blk_queue_segment_boundary(q, 0xffffffff);
-
On Tue, Apr 16 2002, Andrew Morton wrote:
> [email protected] wrote:
> >
> > ...
> > Do these cards not have a request queue?
> >
> > The kernel views them as SCSI disks.
> > So yes, I can do
> >
> > blockdev --setra 0 /dev/sdc
> >
> > Unfortunately that does not help in the least.
> > Indeed, the only user of the readahead info is
> > readahead.c: get_max_readahead() and it does
> >
> > blk_ra_kbytes = blk_get_readahead(inode->i_dev) / 2;
> > if (blk_ra_kbytes < VM_MIN_READAHEAD)
> > blk_ra_kbytes = VM_MAX_READAHEAD;
> >
> > We need to distinguish between undefined and explicitly zero.
>
> Yup. The below (untested) patch should fix it up. Assuming
> that all drivers use blk_init_queue(), and heaven knows if
> that's the case. If not, the readahead window will have to be
> set from userspace for those devices.
set it in blk_queue_make_request(), please.
--
Jens Axboe