1. Current code can't always detect sequential reading, in case
when read size is not PAGE_CACHE_SIZE aligned.
If application reads the file by 4096+512 chunks, we have:
1st read: first read detected, prev_page = 2.
2nd read: offset == 2, the read is considered random.
page_cache_readahead() should treat prev_page == offset as
sequential access. In this case it is better to ++offset,
because of blockable_page_cache_readahead(offset, size).
2. If application reads 4096 bytes with *ppos == 512, we have to
read 2 pages, but req_size == 1 in do_generic_mapping_read().
Usually it's not a problem. But in random read case it results
in unnecessary page cache misses.
~$ time dd conv=notrunc if=/tmp/GIG of=/tmp/dummy bs=$((4096+512))
2.6.11-clean: real=370.35 user=0.16 sys=14.66
2.6.11-patched: real=234.49 user=0.19 sys=12.41
Signed-off-by: Oleg Nesterov <[email protected]>
--- 2.6.11/mm/readahead.c~ 2005-02-27 21:57:43.000000000 +0300
+++ 2.6.11/mm/readahead.c 2005-03-01 23:00:21.000000000 +0300
@@ -443,26 +443,26 @@ page_cache_readahead(struct address_spac
struct file *filp, unsigned long offset,
unsigned long req_size)
{
- unsigned long max, newsize = req_size;
- int sequential = (offset == ra->prev_page + 1);
+ unsigned long max, newsize;
+ int sequential;
/*
* Here we detect the case where the application is performing
* sub-page sized reads. We avoid doing extra work and bogusly
* perturbing the readahead window expansion logic.
- * If size is zero, there is no read ahead window so we need one
*/
- if (offset == ra->prev_page && req_size == 1)
- goto out;
+ if (offset == ra->prev_page && --req_size)
+ ++offset;
+ sequential = (offset == ra->prev_page + 1);
ra->prev_page = offset;
+
max = get_max_readahead(ra);
newsize = min(req_size, max);
- if (newsize == 0 || (ra->flags & RA_FLAG_INCACHE)) {
- newsize = 1;
- goto out; /* No readahead or file already in cache */
- }
+ /* No readahead or file already in cache or sub-page sized read */
+ if (newsize == 0 || (ra->flags & RA_FLAG_INCACHE))
+ goto out;
ra->prev_page += newsize - 1;
@@ -527,7 +527,7 @@ page_cache_readahead(struct address_spac
}
out:
- return newsize;
+ return ra->prev_page + 1;
}
/*
--- 2.6.11/mm/filemap.c~ 2005-02-25 14:32:52.000000000 +0300
+++ 2.6.11/mm/filemap.c 2005-03-01 22:26:10.000000000 +0300
@@ -691,7 +691,7 @@ void do_generic_mapping_read(struct addr
unsigned long index;
unsigned long end_index;
unsigned long offset;
- unsigned long req_size;
+ unsigned long last_index;
unsigned long next_index;
unsigned long prev_index;
loff_t isize;
@@ -703,7 +703,7 @@ void do_generic_mapping_read(struct addr
index = *ppos >> PAGE_CACHE_SHIFT;
next_index = index;
prev_index = ra.prev_page;
- req_size = (desc->count + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+ last_index = (*ppos + desc->count + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT;
offset = *ppos & ~PAGE_CACHE_MASK;
isize = i_size_read(inode);
@@ -713,7 +713,7 @@ void do_generic_mapping_read(struct addr
end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
for (;;) {
struct page *page;
- unsigned long ret_size, nr, ret;
+ unsigned long nr, ret;
/* nr is the maximum number of bytes to copy from this page */
nr = PAGE_CACHE_SIZE;
@@ -728,12 +728,9 @@ void do_generic_mapping_read(struct addr
nr = nr - offset;
cond_resched();
- if (index == next_index && req_size) {
- ret_size = page_cache_readahead(mapping, &ra,
- filp, index, req_size);
- next_index += ret_size;
- req_size -= ret_size;
- }
+ if (index == next_index)
+ next_index = page_cache_readahead(mapping, &ra, filp,
+ index, last_index - index);
find_page:
page = find_get_page(mapping, index);
Oleg Nesterov <[email protected]> wrote:
>
> ~$ time dd conv=notrunc if=/tmp/GIG of=/tmp/dummy bs=$((4096+512))
>
> 2.6.11-clean: real=370.35 user=0.16 sys=14.66
> 2.6.11-patched: real=234.49 user=0.19 sys=12.41
whoa, nice. Ram, can you put this through the torture-test sometime?
Thanks.
On Wed, 2005-03-02 at 11:08, Oleg Nesterov wrote:
..snip...
> @@ -527,7 +527,7 @@ page_cache_readahead(struct address_spac
> }
>
> out:
> - return newsize;
> + return ra->prev_page + 1;
This change introduces one key behavioural change in
page_cache_readahead(). Instead of returning the number-of-pages
successfully read, it now returns the next-page-index which is yet to be
read. Was this essential?
At least, a comment towards this effect at the top of the function is
worth adding.
RP
Ram wrote:
>On Wed, 2005-03-02 at 11:08, Oleg Nesterov wrote:
>
>
>..snip...
>
>
>
>>@@ -527,7 +527,7 @@ page_cache_readahead(struct address_spac
>> }
>>
>> out:
>>- return newsize;
>>+ return ra->prev_page + 1;
>>
>>
>
>This change introduces one key behavioural change in
>page_cache_readahead(). Instead of returning the number-of-pages
>successfully read, it now returns the next-page-index which is yet to be
>read. Was this essential?
>
>
>
and unless filmap.c was changed accordingly this is broken.. need. to
look at this more.
Steve
>At least, a comment towards this effect at the top of the function is
>worth adding.
>
>RP
>
>
On Wed, 2005-03-09 at 16:03, Steven Pratt wrote:
> Ram wrote:
>
> >On Wed, 2005-03-02 at 11:08, Oleg Nesterov wrote:
> >
> >
> >..snip...
> >
> >
> >
> >>@@ -527,7 +527,7 @@ page_cache_readahead(struct address_spac
> >> }
> >>
> >> out:
> >>- return newsize;
> >>+ return ra->prev_page + 1;
> >>
> >>
> >
> >This change introduces one key behavioural change in
> >page_cache_readahead(). Instead of returning the number-of-pages
> >successfully read, it now returns the next-page-index which is yet to be
> >read. Was this essential?
> >
> >
> >
> and unless filmap.c was changed accordingly this is broken.. need. to
> look at this more.
Oleg has taken care of it in filemap.c whereever page_cache_readahead()
is called. So its not a bug as such. But I dont think it was necessary.
Doing so takes care of 1-line code reduction in filemap.c which is the
only plus point.
RP
> >
> >
>
Ram wrote:
>
> On Wed, 2005-03-02 at 11:08, Oleg Nesterov wrote:
> >
> > out:
> > - return newsize;
> > + return ra->prev_page + 1;
>
> This change introduces one key behavioural change in
> page_cache_readahead(). Instead of returning the number-of-pages
> successfully read, it now returns the next-page-index which is yet to be
> read. Was this essential?
The problem is that with this change:
+ if (offset == ra->prev_page && --req_size)
+ ++offset;
we can't just return newsize.
Because the caller of page_cache_readahead(offset, req_size) expects
that returned value is the number-of-pages successfully read from
this original offset.
Consider do_generic_mapping_read() reading two pages at offset 10,
with ra->prev_page == 10.
1st page, index == 10:
ret_size = page_cache_readahead(10, 2); // returns 1
next_index += ret_size; // next_index == 11
2nd page, index == 11:
if (index == next_index) // Yes
page_cache_readahead(11, 1); // BOGUS!
It can be solved without behavioural change of course, but it will be
more complex.
> At least, a comment towards this effect at the top of the function is
> worth adding.
Yes, it's my fault. I should have added comment in changelog at least.
Oleg.