Hi Andrew
I noticed fadvise(POSIX_FADV_DONTNEED) was invalidating more parts of the file
than I expected. Here is the patch.
Notes for change in fadvise.c/sys_fadvise64_64():
- It should be noticed that the 'end' param of invalidate_mapping_pages()
is inclusive;
- When 'offset' and/or 'offset+len' do no align to page boundary, we must
decide whether to abandon the partial page at the beginning/end of the range.
My patch assumes that the application is scanning forward,
which is the most common case.
So 'end_index' is set to the page just before the ending partial page.
- Manual pages for posix_fadvise() mentioned the case of 'len=0', which
is not handled by the code. Perhaps the handling of 'len=0' is useless,
so I leaved that alone.
Notes for change in truncate.c/invalidate_mapping_pages():
- Is there any reason that I din't know to free any page outside of [begin,end]?
The origin code will abandon useful trailing pages when there's hole
in the range, which may cause series of unecessary disk I/O in
streaming applications.
best regards, Wu Fengguang
diff -Naur linux-2.6.4-rc2/mm/fadvise.c linux-2.6.4-rc2_fadvise_fix/mm/fadvise.c
--- linux-2.6.4-rc2/mm/fadvise.c 2004-02-18 03:57:30.000000000 +0000
+++ linux-2.6.4-rc2_fadvise_fix/mm/fadvise.c 2004-03-08 12:20:06.000000000 +0000
@@ -66,8 +66,7 @@
if (!bdi_write_congested(mapping->backing_dev_info))
filemap_flush(mapping);
start_index = offset >> PAGE_CACHE_SHIFT;
- end_index = (offset + len + PAGE_CACHE_SIZE - 1) >>
- PAGE_CACHE_SHIFT;
+ end_index = ((offset + len) >> PAGE_CACHE_SHIFT) - 1;
invalidate_mapping_pages(mapping, start_index, end_index);
break;
default:
diff -Naur linux-2.6.4-rc2/mm/truncate.c linux-2.6.4-rc2_fadvise_fix/mm/truncate.c
--- linux-2.6.4-rc2/mm/truncate.c 2004-02-18 03:59:34.000000000 +0000
+++ linux-2.6.4-rc2_fadvise_fix/mm/truncate.c 2004-03-08 12:20:06.000000000 +0000
@@ -219,6 +219,8 @@
ret += invalidate_complete_page(mapping, page);
unlock:
unlock_page(page);
+ if (next > end)
+ break;
}
pagevec_release(&pvec);
cond_resched();
On Mon, Mar 08, 2004 at 12:03:22PM -0800, Andrew Morton wrote:
> WU Fengguang <[email protected]> wrote:
> >
> >
> > - When 'offset' and/or 'offset+len' do no align to page boundary, we must
> > decide whether to abandon the partial page at the beginning/end of the range.
> > My patch assumes that the application is scanning forward,
> > which is the most common case.
> > So 'end_index' is set to the page just before the ending partial page.
>
> If you're going to preserve the partial page at `end' (which seems
> reasonable) then you should also preserve the partial page at `start', don't
> you agree?
>
> - start_index = offset >> PAGE_CACHE_SHIFT;
> + start_index = (offset + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
>
In fact, it depends on the access pattern.
I would expect the normal usage of fadvise to be:
for() {
read() and consume() data in [offset, offset+LEN);
fadvise(fd, offset, LEN, POSIX_FADV_DONTNEED);
offset += LEN; /*scanning forward*/
}
In this case, the partial page at `start' should be freed.
Certainly there may be other patterns.
I wonder the best solution in kernel is to code in favor of the normal case,
and the best(safe and portable) practice in usermode code is to always align
'offset' and 'offset+len' to page boundaries.
And some comment in the manual page of posix_fadvise is recommended.
On Mon, Mar 08, 2004 at 12:03:22PM -0800, Andrew Morton wrote:
> WU Fengguang <[email protected]> wrote:
> >
> >
> > - When 'offset' and/or 'offset+len' do no align to page boundary,
> > we must
> > decide whether to abandon the partial page at the beginning/end
> > of the range.
> > My patch assumes that the application is scanning forward,
> > which is the most common case.
> > So 'end_index' is set to the page just before the ending partial
> > page.
>
> If you're going to preserve the partial page at `end' (which seems
> reasonable) then you should also preserve the partial page at `start',
> don't
> you agree?
>
> - start_index = offset >> PAGE_CACHE_SHIFT;
> + start_index = (offset + PAGE_CACHE_SIZE - 1) >>
> PAGE_CACHE_SHIFT;
>
I gave it a rethink today, and yes, we should preserve the starting
partial page as well. Doing so helps reducing unnecessary disk I/O for
any access pattern, and the kernel will behave more consistent and robust.
The only negative effect is, some useless pages are left to be freed at some
later time by the page replacement routines, and the cost of which is acceptable.
regards, Wu Fengguang