2010-01-20 22:04:29

by Chris Frost

[permalink] [raw]
Subject: [PATCH] fs: add fincore(2) (mincore(2) for file descriptors)

Add the fincore() system call. fincore() is mincore() for file descriptors.

The functionality of fincore() can be emulated with an mmap(), mincore(),
and munmap(), but this emulation requires more system calls and requires
page table modifications. fincore() can provide a significant performance
improvement for non-sequential in-core queries.

Signed-off-by: Chris Frost <[email protected]>
Signed-off-by: Steve VanDeBogart <[email protected]>
---

Notes on micro and macro performance and on a potential optimization:

For a microbenchmark that sequentially queries whether the pages of a large
file are in memory fincore is 7-11x faster than mmap+mincore+munmap
when querying one page a time (Pentium 4 running a 32 bit SMP kernel).
When querying 1024 pages at a time the two approaches perform comparably.
However, an application cannot always amortize calls; e.g., non-sequential
reads. These improvements are increased significantly by amortizing the
RCU work done by each find_get_page() call, but this optimization does
not affect our macrobenchmarks (more in third paragraph).

We introduced this system call while modifying SQLite and the GIMP to
request large prefetches for what would otherwise be non-sequential reads.
As a macrobenchmark, we see a 125s SQLite query (72s system time) reduced
to 75s (18s system time) by using fincore() instead of mincore(). This
speedup of course varies by benchmark and benchmarks size; we've seen
both minimal speedups and 1000x speedups. More on these benchmarks in the
publication _Reducing Seek Overhead with Application-Directed Prefetching_
in USENIX ATC 2009 and at http://libprefetch.cs.ucla.edu/.

In this patch find_get_page() is called for each page, which in turn
calls rcu_read_lock(), for each page. We have found that amortizing
these RCU calls, e.g., by introducing a variant of find_get_pages_contig()
that does not skip missing pages, can speedup the above microbenchmark
by 260x when querying many pages per system call. But we have not observed
noticeable improvements to our macrobenchmarks. I'd be happy to also post
this change or look further into it, but this seems like a reasonable
first patch, at least.

include/linux/syscalls.h | 2
fs/fincore.c | 135 +++++++++++++++++++++++++++
fs/Makefile | 2
arch/x86/ia32/ia32entry.S | 1
arch/x86/include/asm/unistd_32.h | 3
arch/x86/include/asm/unistd_64.h | 2
arch/x86/kernel/syscall_table_32.S | 1
include/asm-generic/unistd.h | 5 -
8 files changed, 148 insertions(+), 3 deletions(-)

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a990ace..1e8f00f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -534,6 +534,8 @@ asmlinkage long sys_munlockall(void);
asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
asmlinkage long sys_mincore(unsigned long start, size_t len,
unsigned char __user * vec);
+asmlinkage long sys_fincore(unsigned int fd, loff_t start, loff_t len,
+ unsigned char __user *vec);

asmlinkage long sys_pivot_root(const char __user *new_root,
const char __user *put_old);
diff --git a/fs/fincore.c b/fs/fincore.c
new file mode 100644
index 0000000..6b74cc4
--- /dev/null
+++ b/fs/fincore.c
@@ -0,0 +1,135 @@
+/*
+ * fs/fincore.c
+ *
+ * Copyright (C) 2009, 2010 Chris Frost, UC Regents
+ * Copyright (C) 2008 Steve VanDeBogart, UC Regents
+ */
+
+/*
+ * The fincore() system call.
+ */
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/pagemap.h>
+#include <linux/syscalls.h>
+#include <linux/uaccess.h>
+
+static unsigned char fincore_page(struct address_space *mapping, pgoff_t pgoff)
+{
+ unsigned char present = 0;
+ struct page *page = find_get_page(mapping, pgoff);
+ if (page) {
+ present = PageUptodate(page);
+ page_cache_release(page);
+ }
+
+ return present;
+}
+
+/*
+ * The fincore(2) system call.
+ *
+ * fincore() returns the memory residency status of the pages backing
+ * a file range specified by fd and [start, start + len).
+ * The status is returned in a vector of bytes. The least significant
+ * bit of each byte is 1 if the referenced page is in memory, otherwise
+ * it is zero.
+ *
+ * Because the status of a page can change after fincore() checks it
+ * but before it returns to the application, the returned vector may
+ * contain stale information. Only locked pages are guaranteed to
+ * remain in memory.
+ *
+ * return values:
+ * zero - success
+ * -EBADF - fd is an illegal file descriptor
+ * -EFAULT - vec points to an illegal address
+ * -EINVAL - start is not a multiple of PAGE_CACHE_SIZE or start + len
+ * is larger than the size of the file
+ * -EAGAIN - A kernel resource was temporarily unavailable.
+ */
+SYSCALL_DEFINE4(fincore, unsigned int, fd, loff_t, start, loff_t, len,
+ unsigned char __user *, vec)
+{
+ long retval;
+ loff_t pgoff = start >> PAGE_SHIFT;
+ loff_t pgend;
+ loff_t npages;
+ struct file *filp;
+ int fput_needed;
+ loff_t file_nbytes;
+ loff_t file_npages;
+ unsigned char *tmp = NULL;
+ unsigned char tmp_small[64];
+ unsigned tmp_count;
+ int i;
+
+ /* Check the start address: needs to be page-aligned.. */
+ if (start & ~PAGE_CACHE_MASK)
+ return -EINVAL;
+
+ npages = len >> PAGE_SHIFT;
+ npages += (len & ~PAGE_MASK) != 0;
+
+ pgend = pgoff + npages;
+
+ filp = fget_light(fd, &fput_needed);
+ if (!filp)
+ return -EBADF;
+
+ if (filp->f_dentry->d_inode->i_mode & S_IFBLK)
+ file_nbytes = filp->f_dentry->d_inode->i_bdev->bd_inode->i_size << 9;
+ else
+ file_nbytes = filp->f_dentry->d_inode->i_size;
+
+ file_npages = file_nbytes >> PAGE_SHIFT;
+ file_npages += (file_nbytes & ~PAGE_MASK) != 0;
+
+ if (pgoff >= file_npages || pgend > file_npages) {
+ retval = -EINVAL;
+ goto done;
+ }
+
+ if (!access_ok(VERIFY_WRITE, vec, npages)) {
+ retval = -EFAULT;
+ goto done;
+ }
+
+ /*
+ * Allocate buffer vector page.
+ * Optimize allocation for small values of npages because the
+ * __get_free_page() call doubles fincore(2) runtime when npages == 1.
+ */
+ if (npages <= sizeof(tmp_small)) {
+ tmp = tmp_small;
+ tmp_count = sizeof(tmp_small);
+ } else {
+ tmp = (void *) __get_free_page(GFP_USER);
+ if (!tmp) {
+ retval = -EAGAIN;
+ goto done;
+ }
+ tmp_count = PAGE_SIZE;
+ }
+
+ while (pgoff < pgend) {
+ /*
+ * Do at most tmp_count entries per iteration, due to
+ * the temporary buffer size.
+ */
+ for (i = 0; pgoff < pgend && i < tmp_count; pgoff++, i++)
+ tmp[i] = fincore_page(filp->f_mapping, pgoff);
+
+ if (copy_to_user(vec, tmp, i)) {
+ retval = -EFAULT;
+ break;
+ }
+ vec += i;
+ }
+ retval = 0;
+done:
+ if (tmp && tmp != tmp_small)
+ free_page((unsigned long) tmp);
+ fput_light(filp, fput_needed);
+ return retval;
+}
diff --git a/fs/Makefile b/fs/Makefile
index af6d047..a3ccd6b 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -11,7 +11,7 @@ obj-y := open.o read_write.o file_table.o super.o \
attr.o bad_inode.o file.o filesystems.o namespace.o \
seq_file.o xattr.o libfs.o fs-writeback.o \
pnode.o drop_caches.o splice.o sync.o utimes.o \
- stack.o fs_struct.o
+ stack.o fs_struct.o fincore.o

ifeq ($(CONFIG_BLOCK),y)
obj-y += buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
diff --git a/include/asm-generic/unistd.h b/include/asm-generic/unistd.h
index d76b66a..ce76dc8 100644
--- a/include/asm-generic/unistd.h
+++ b/include/asm-generic/unistd.h
@@ -623,8 +623,11 @@ __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
#define __NR_perf_event_open 241
__SYSCALL(__NR_perf_event_open, sys_perf_event_open)

+#define __NR_fincore 242
+__SYSCALL(__NR_fincore, sys_fincore)
+
#undef __NR_syscalls
-#define __NR_syscalls 242
+#define __NR_syscalls 243

/*
* All syscalls below here should go away really,
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 581b056..cbf96e6 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -841,4 +841,5 @@ ia32_sys_call_table:
.quad compat_sys_pwritev
.quad compat_sys_rt_tgsigqueueinfo /* 335 */
.quad sys_perf_event_open
+ .quad sys_fincore
ia32_syscall_end:
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 6fb3c20..088b235 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -342,10 +342,11 @@
#define __NR_pwritev 334
#define __NR_rt_tgsigqueueinfo 335
#define __NR_perf_event_open 336
+#define __NR_fincore 337

#ifdef __KERNEL__

-#define NR_syscalls 337
+#define NR_syscalls 338

#define __ARCH_WANT_IPC_PARSE_VERSION
#define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index 8d3ad0a..ebc04b5 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -661,6 +661,8 @@ __SYSCALL(__NR_pwritev, sys_pwritev)
__SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
#define __NR_perf_event_open 298
__SYSCALL(__NR_perf_event_open, sys_perf_event_open)
+#define __NR_fincore 299
+__SYSCALL(__NR_fincore, sys_fincore)

#ifndef __NO_STUBS
#define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index 0157cd2..1fdc8bc 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -336,3 +336,4 @@ ENTRY(sys_call_table)
.long sys_pwritev
.long sys_rt_tgsigqueueinfo /* 335 */
.long sys_perf_event_open
+ .long sys_fincore
--
1.5.4.3

--
Chris Frost
http://www.frostnet.net/chris/


2010-01-21 01:12:06

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] fs: add fincore(2) (mincore(2) for file descriptors)

Chris Frost <[email protected]> writes:


> For a microbenchmark that sequentially queries whether the pages of a large
> file are in memory fincore is 7-11x faster than mmap+mincore+munmap
> when querying one page a time (Pentium 4 running a 32 bit SMP kernel).

I haven't read your paper, but naively it was not fully clear
to me why the application can't simply prefetch everything and let
the kernel worry if it's already in memory or not?

Also I'm always wondering why people do these optimizations
only now when spinning storage is about to become obsolete @)
It seems a bit like the last steam engine train.

> In this patch find_get_page() is called for each page, which in turn
> calls rcu_read_lock(), for each page. We have found that amortizing

rcu_read_lock is normally a no-op (or rather just a compiler barrier)
Even on preemptive kernels it's quite cheap and always local. It doesn't
make too much sense to optimize around it.

Also it's custom to supply man page with new system calls.
Such independent documentation often flushes out a lot of semantic issues.

+SYSCALL_DEFINE4(fincore, unsigned int, fd, loff_t, start, loff_t, len,
+ unsigned char __user *, vec)

I doubt the loff_t actually work for 32bit processes on 64bit kernels
That typically needs a special compat stub that reassembles the 64bit values
from the two registers.

Also on 32bit you'll end with a 6 argument call, which can be problematic.

> + /*
> + * Allocate buffer vector page.
> + * Optimize allocation for small values of npages because the
> + * __get_free_page() call doubles fincore(2) runtime when npages == 1.
> + */

I suspect you could afford slightly more than 64 bytes on the stack.

> + if (npages <= sizeof(tmp_small)) {
> + tmp = tmp_small;
> + tmp_count = sizeof(tmp_small);
> + } else {
> + tmp = (void *) __get_free_page(GFP_USER);
> + if (!tmp) {
> + retval = -EAGAIN;
> + goto done;
> + }
> + tmp_count = PAGE_SIZE;

tmp_* are impressively bad variable names.

> + }
> +
> + while (pgoff < pgend) {
> + /*
> + * Do at most tmp_count entries per iteration, due to
> + * the temporary buffer size.
> + */
> + for (i = 0; pgoff < pgend && i < tmp_count; pgoff++, i++)
> + tmp[i] = fincore_page(filp->f_mapping, pgoff);

If you really care about speed you could probably do it much faster
with a radix gang lookup for a larger range. And of course
the get/put is not really needed, although avoiding that might
add too many special cases.

This loop needs a need_resched() somewhere, otherwise
someone could cause very large latencies in the kernel.

But even if you added that:

e.g. if I create a 1TB file and run it over the full range,
will I get a process that cannot be Ctrl-C'ed for a long time?

Perhaps some signal handling is needed too?

Still also would be undebuggable in that time. It might
be best to simply limit it to some reasonable upper limit.
Most system calls do that in some form.

> +
> + if (copy_to_user(vec, tmp, i)) {

When you used access_ok() earlier you could use __copy_to_user,
but since that's only a few instructions I would rather drop
the unnecessary access_ok() earlier.


-Andi
--
[email protected] -- Speaking for myself only.

2010-01-22 01:17:32

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] fs: add fincore(2) (mincore(2) for file descriptors)

On Wed, Jan 20, 2010 at 01:57:12PM -0800, Chris Frost wrote:
> Add the fincore() system call. fincore() is mincore() for file descriptors.
>
> The functionality of fincore() can be emulated with an mmap(), mincore(),
> and munmap(), but this emulation requires more system calls and requires
> page table modifications. fincore() can provide a significant performance
> improvement for non-sequential in-core queries.

FYI I have a seqfile based procfile that export cached file pages with
various states:

root /home/wfg# echo /sbin/init > /proc/filecache
root /home/wfg# cat /proc/filecache
# file /sbin/init
# flags R:referenced A:active M:mmap U:uptodate D:dirty W:writeback X:readahead P:private O:owner b:buffer d:dirty w:writeback
# idx len state refcnt
0 6 RAMU________ 2
6 1 _AMU________ 2
7 1 RAMU________ 2
8 2 ___U________ 1

It was first developed to provide information for prefetching.
Since then I've been using it as a generic page cache inspection tool.
It helped me debug vm/fs issues, eg. readahead, writeback and vmscan.
Though I'm not sure if the interface is acceptable to Linux.

Here is the code snippet if you are interested :)

/*
* Listing of cached page ranges of a file.
*
* Usage:
* echo 'file name' > /proc/filecache
* cat /proc/filecache
*/

unsigned long page_mask;
#define PG_MMAP PG_lru /* reuse any non-relevant flag */
#define PG_BUFFER PG_swapcache /* ditto */
#define PG_DIRTY PG_error /* ditto */
#define PG_WRITEBACK PG_buddy /* ditto */

/*
* Page state names, prefixed by their abbreviations.
*/
struct {
unsigned long mask;
const char *name;
int faked;
} page_flag [] = {
{1 << PG_referenced, "R:referenced", 0},
{1 << PG_active, "A:active", 0},
{1 << PG_MMAP, "M:mmap", 1},

{1 << PG_uptodate, "U:uptodate", 0},
{1 << PG_dirty, "D:dirty", 0},
{1 << PG_writeback, "W:writeback", 0},
{1 << PG_reclaim, "X:readahead", 0},

{1 << PG_private, "P:private", 0},
{1 << PG_owner_priv_1, "O:owner", 0},

{1 << PG_BUFFER, "b:buffer", 1},
{1 << PG_DIRTY, "d:dirty", 1},
{1 << PG_WRITEBACK, "w:writeback", 1},
};

static unsigned long page_flags(struct page* page)
{
unsigned long flags;
struct address_space *mapping = page_mapping(page);

flags = page->flags & page_mask;

if (page_mapped(page))
flags |= (1 << PG_MMAP);

if (page_has_buffers(page))
flags |= (1 << PG_BUFFER);

if (mapping) {
if (radix_tree_tag_get(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_WRITEBACK))
flags |= (1 << PG_WRITEBACK);

if (radix_tree_tag_get(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_DIRTY))
flags |= (1 << PG_DIRTY);
}

return flags;
}

static int pages_similiar(struct page* page0, struct page* page)
{
if (page_count(page0) != page_count(page))
return 0;

if (page_flags(page0) != page_flags(page))
return 0;

return 1;
}

static void show_range(struct seq_file *m, struct page* page, unsigned long len)
{
int i;
unsigned long flags;

if (!m || !page)
return;

seq_printf(m, "%lu\t%lu\t", page->index, len);

flags = page_flags(page);
for (i = 0; i < ARRAY_SIZE(page_flag); i++)
seq_putc(m, (flags & page_flag[i].mask) ?
page_flag[i].name[0] : '_');

seq_printf(m, "\t%d\n", page_count(page));
}

#define BATCH_LINES 100
static pgoff_t show_file_cache(struct seq_file *m,
struct address_space *mapping, pgoff_t start)
{
int i;
int lines = 0;
pgoff_t len = 0;
struct pagevec pvec;
struct page *page;
struct page *page0 = NULL;

for (;;) {
pagevec_init(&pvec, 0);
pvec.nr = radix_tree_gang_lookup(&mapping->page_tree,
(void **)pvec.pages, start + len, PAGEVEC_SIZE);

if (pvec.nr == 0) {
show_range(m, page0, len);
start = ULONG_MAX;
goto out;
}

if (!page0)
page0 = pvec.pages[0];

for (i = 0; i < pvec.nr; i++) {
page = pvec.pages[i];

if (page->index == start + len &&
pages_similiar(page0, page))
len++;
else {
show_range(m, page0, len);
page0 = page;
start = page->index;
len = 1;
if (++lines > BATCH_LINES)
goto out;
}
}
}

out:
return start;
}

Thanks,
Fengguang

2010-01-22 01:29:38

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH] fs: add fincore(2) (mincore(2) for file descriptors)

On Wed, Jan 20, 2010 at 01:57:12PM -0800, Chris Frost wrote:
> Add the fincore() system call. fincore() is mincore() for file descriptors.
>
> The functionality of fincore() can be emulated with an mmap(), mincore(),
> and munmap(), but this emulation requires more system calls and requires
> page table modifications. fincore() can provide a significant performance
> improvement for non-sequential in-core queries.
>
> Signed-off-by: Chris Frost <[email protected]>
> Signed-off-by: Steve VanDeBogart <[email protected]>
> ---
>
> Notes on micro and macro performance and on a potential optimization:
>
> For a microbenchmark that sequentially queries whether the pages of a large
> file are in memory fincore is 7-11x faster than mmap+mincore+munmap
> when querying one page a time (Pentium 4 running a 32 bit SMP kernel).
> When querying 1024 pages at a time the two approaches perform comparably.
> However, an application cannot always amortize calls; e.g., non-sequential
> reads. These improvements are increased significantly by amortizing the
> RCU work done by each find_get_page() call, but this optimization does
> not affect our macrobenchmarks (more in third paragraph).
>
> We introduced this system call while modifying SQLite and the GIMP to
> request large prefetches for what would otherwise be non-sequential reads.
> As a macrobenchmark, we see a 125s SQLite query (72s system time) reduced
> to 75s (18s system time) by using fincore() instead of mincore(). This
> speedup of course varies by benchmark and benchmarks size; we've seen
> both minimal speedups and 1000x speedups. More on these benchmarks in the
> publication _Reducing Seek Overhead with Application-Directed Prefetching_
> in USENIX ATC 2009 and at http://libprefetch.cs.ucla.edu/.
>
> In this patch find_get_page() is called for each page, which in turn
> calls rcu_read_lock(), for each page. We have found that amortizing
> these RCU calls, e.g., by introducing a variant of find_get_pages_contig()
> that does not skip missing pages, can speedup the above microbenchmark
> by 260x when querying many pages per system call. But we have not observed
> noticeable improvements to our macrobenchmarks. I'd be happy to also post
> this change or look further into it, but this seems like a reasonable
> first patch, at least.

Interesting. What RCU-related kernel config parameters were in effect
when you ran your microbenchmark tests? Having rcu_read_lock() be a
bottleneck is a bit unexpected. ;-)

Thanx, Paul

> include/linux/syscalls.h | 2
> fs/fincore.c | 135 +++++++++++++++++++++++++++
> fs/Makefile | 2
> arch/x86/ia32/ia32entry.S | 1
> arch/x86/include/asm/unistd_32.h | 3
> arch/x86/include/asm/unistd_64.h | 2
> arch/x86/kernel/syscall_table_32.S | 1
> include/asm-generic/unistd.h | 5 -
> 8 files changed, 148 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index a990ace..1e8f00f 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -534,6 +534,8 @@ asmlinkage long sys_munlockall(void);
> asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
> asmlinkage long sys_mincore(unsigned long start, size_t len,
> unsigned char __user * vec);
> +asmlinkage long sys_fincore(unsigned int fd, loff_t start, loff_t len,
> + unsigned char __user *vec);
>
> asmlinkage long sys_pivot_root(const char __user *new_root,
> const char __user *put_old);
> diff --git a/fs/fincore.c b/fs/fincore.c
> new file mode 100644
> index 0000000..6b74cc4
> --- /dev/null
> +++ b/fs/fincore.c
> @@ -0,0 +1,135 @@
> +/*
> + * fs/fincore.c
> + *
> + * Copyright (C) 2009, 2010 Chris Frost, UC Regents
> + * Copyright (C) 2008 Steve VanDeBogart, UC Regents
> + */
> +
> +/*
> + * The fincore() system call.
> + */
> +#include <linux/fs.h>
> +#include <linux/file.h>
> +#include <linux/pagemap.h>
> +#include <linux/syscalls.h>
> +#include <linux/uaccess.h>
> +
> +static unsigned char fincore_page(struct address_space *mapping, pgoff_t pgoff)
> +{
> + unsigned char present = 0;
> + struct page *page = find_get_page(mapping, pgoff);
> + if (page) {
> + present = PageUptodate(page);
> + page_cache_release(page);
> + }
> +
> + return present;
> +}
> +
> +/*
> + * The fincore(2) system call.
> + *
> + * fincore() returns the memory residency status of the pages backing
> + * a file range specified by fd and [start, start + len).
> + * The status is returned in a vector of bytes. The least significant
> + * bit of each byte is 1 if the referenced page is in memory, otherwise
> + * it is zero.
> + *
> + * Because the status of a page can change after fincore() checks it
> + * but before it returns to the application, the returned vector may
> + * contain stale information. Only locked pages are guaranteed to
> + * remain in memory.
> + *
> + * return values:
> + * zero - success
> + * -EBADF - fd is an illegal file descriptor
> + * -EFAULT - vec points to an illegal address
> + * -EINVAL - start is not a multiple of PAGE_CACHE_SIZE or start + len
> + * is larger than the size of the file
> + * -EAGAIN - A kernel resource was temporarily unavailable.
> + */
> +SYSCALL_DEFINE4(fincore, unsigned int, fd, loff_t, start, loff_t, len,
> + unsigned char __user *, vec)
> +{
> + long retval;
> + loff_t pgoff = start >> PAGE_SHIFT;
> + loff_t pgend;
> + loff_t npages;
> + struct file *filp;
> + int fput_needed;
> + loff_t file_nbytes;
> + loff_t file_npages;
> + unsigned char *tmp = NULL;
> + unsigned char tmp_small[64];
> + unsigned tmp_count;
> + int i;
> +
> + /* Check the start address: needs to be page-aligned.. */
> + if (start & ~PAGE_CACHE_MASK)
> + return -EINVAL;
> +
> + npages = len >> PAGE_SHIFT;
> + npages += (len & ~PAGE_MASK) != 0;
> +
> + pgend = pgoff + npages;
> +
> + filp = fget_light(fd, &fput_needed);
> + if (!filp)
> + return -EBADF;
> +
> + if (filp->f_dentry->d_inode->i_mode & S_IFBLK)
> + file_nbytes = filp->f_dentry->d_inode->i_bdev->bd_inode->i_size << 9;
> + else
> + file_nbytes = filp->f_dentry->d_inode->i_size;
> +
> + file_npages = file_nbytes >> PAGE_SHIFT;
> + file_npages += (file_nbytes & ~PAGE_MASK) != 0;
> +
> + if (pgoff >= file_npages || pgend > file_npages) {
> + retval = -EINVAL;
> + goto done;
> + }
> +
> + if (!access_ok(VERIFY_WRITE, vec, npages)) {
> + retval = -EFAULT;
> + goto done;
> + }
> +
> + /*
> + * Allocate buffer vector page.
> + * Optimize allocation for small values of npages because the
> + * __get_free_page() call doubles fincore(2) runtime when npages == 1.
> + */
> + if (npages <= sizeof(tmp_small)) {
> + tmp = tmp_small;
> + tmp_count = sizeof(tmp_small);
> + } else {
> + tmp = (void *) __get_free_page(GFP_USER);
> + if (!tmp) {
> + retval = -EAGAIN;
> + goto done;
> + }
> + tmp_count = PAGE_SIZE;
> + }
> +
> + while (pgoff < pgend) {
> + /*
> + * Do at most tmp_count entries per iteration, due to
> + * the temporary buffer size.
> + */
> + for (i = 0; pgoff < pgend && i < tmp_count; pgoff++, i++)
> + tmp[i] = fincore_page(filp->f_mapping, pgoff);
> +
> + if (copy_to_user(vec, tmp, i)) {
> + retval = -EFAULT;
> + break;
> + }
> + vec += i;
> + }
> + retval = 0;
> +done:
> + if (tmp && tmp != tmp_small)
> + free_page((unsigned long) tmp);
> + fput_light(filp, fput_needed);
> + return retval;
> +}
> diff --git a/fs/Makefile b/fs/Makefile
> index af6d047..a3ccd6b 100644
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -11,7 +11,7 @@ obj-y := open.o read_write.o file_table.o super.o \
> attr.o bad_inode.o file.o filesystems.o namespace.o \
> seq_file.o xattr.o libfs.o fs-writeback.o \
> pnode.o drop_caches.o splice.o sync.o utimes.o \
> - stack.o fs_struct.o
> + stack.o fs_struct.o fincore.o
>
> ifeq ($(CONFIG_BLOCK),y)
> obj-y += buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
> diff --git a/include/asm-generic/unistd.h b/include/asm-generic/unistd.h
> index d76b66a..ce76dc8 100644
> --- a/include/asm-generic/unistd.h
> +++ b/include/asm-generic/unistd.h
> @@ -623,8 +623,11 @@ __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
> #define __NR_perf_event_open 241
> __SYSCALL(__NR_perf_event_open, sys_perf_event_open)
>
> +#define __NR_fincore 242
> +__SYSCALL(__NR_fincore, sys_fincore)
> +
> #undef __NR_syscalls
> -#define __NR_syscalls 242
> +#define __NR_syscalls 243
>
> /*
> * All syscalls below here should go away really,
> diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
> index 581b056..cbf96e6 100644
> --- a/arch/x86/ia32/ia32entry.S
> +++ b/arch/x86/ia32/ia32entry.S
> @@ -841,4 +841,5 @@ ia32_sys_call_table:
> .quad compat_sys_pwritev
> .quad compat_sys_rt_tgsigqueueinfo /* 335 */
> .quad sys_perf_event_open
> + .quad sys_fincore
> ia32_syscall_end:
> diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
> index 6fb3c20..088b235 100644
> --- a/arch/x86/include/asm/unistd_32.h
> +++ b/arch/x86/include/asm/unistd_32.h
> @@ -342,10 +342,11 @@
> #define __NR_pwritev 334
> #define __NR_rt_tgsigqueueinfo 335
> #define __NR_perf_event_open 336
> +#define __NR_fincore 337
>
> #ifdef __KERNEL__
>
> -#define NR_syscalls 337
> +#define NR_syscalls 338
>
> #define __ARCH_WANT_IPC_PARSE_VERSION
> #define __ARCH_WANT_OLD_READDIR
> diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
> index 8d3ad0a..ebc04b5 100644
> --- a/arch/x86/include/asm/unistd_64.h
> +++ b/arch/x86/include/asm/unistd_64.h
> @@ -661,6 +661,8 @@ __SYSCALL(__NR_pwritev, sys_pwritev)
> __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
> #define __NR_perf_event_open 298
> __SYSCALL(__NR_perf_event_open, sys_perf_event_open)
> +#define __NR_fincore 299
> +__SYSCALL(__NR_fincore, sys_fincore)
>
> #ifndef __NO_STUBS
> #define __ARCH_WANT_OLD_READDIR
> diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
> index 0157cd2..1fdc8bc 100644
> --- a/arch/x86/kernel/syscall_table_32.S
> +++ b/arch/x86/kernel/syscall_table_32.S
> @@ -336,3 +336,4 @@ ENTRY(sys_call_table)
> .long sys_pwritev
> .long sys_rt_tgsigqueueinfo /* 335 */
> .long sys_perf_event_open
> + .long sys_fincore
> --
> 1.5.4.3
>
> --
> Chris Frost
> http://www.frostnet.net/chris/
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2010-01-26 22:14:00

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] fs: add fincore(2) (mincore(2) for file descriptors)

On Wed, 20 Jan 2010 13:57:12 -0800
Chris Frost <[email protected]> wrote:

> Add the fincore() system call. fincore() is mincore() for file descriptors.
>
> The functionality of fincore() can be emulated with an mmap(), mincore(),
> and munmap(), but this emulation requires more system calls and requires
> page table modifications. fincore() can provide a significant performance
> improvement for non-sequential in-core queries.
>
> Signed-off-by: Chris Frost <[email protected]>
> Signed-off-by: Steve VanDeBogart <[email protected]>
> ---
>
> Notes on micro and macro performance and on a potential optimization:
>
> For a microbenchmark that sequentially queries whether the pages of a large
> file are in memory fincore is 7-11x faster than mmap+mincore+munmap
> when querying one page a time (Pentium 4 running a 32 bit SMP kernel).
> When querying 1024 pages at a time the two approaches perform comparably.
> However, an application cannot always amortize calls; e.g., non-sequential
> reads. These improvements are increased significantly by amortizing the
> RCU work done by each find_get_page() call, but this optimization does
> not affect our macrobenchmarks (more in third paragraph).
>
> We introduced this system call while modifying SQLite and the GIMP to
> request large prefetches for what would otherwise be non-sequential reads.
> As a macrobenchmark, we see a 125s SQLite query (72s system time) reduced
> to 75s (18s system time) by using fincore() instead of mincore(). This
> speedup of course varies by benchmark and benchmarks size; we've seen
> both minimal speedups and 1000x speedups. More on these benchmarks in the
> publication _Reducing Seek Overhead with Application-Directed Prefetching_
> in USENIX ATC 2009 and at http://libprefetch.cs.ucla.edu/.
>
> In this patch find_get_page() is called for each page, which in turn
> calls rcu_read_lock(), for each page. We have found that amortizing
> these RCU calls, e.g., by introducing a variant of find_get_pages_contig()
> that does not skip missing pages, can speedup the above microbenchmark
> by 260x when querying many pages per system call. But we have not observed
> noticeable improvements to our macrobenchmarks. I'd be happy to also post
> this change or look further into it, but this seems like a reasonable
> first patch, at least.

I must say, the syscall appeals to my inner geek. Lot of applications
are leaving a lot of time on the floor due to bad disk access patterns.
A really smart library which uses this facility could help all over
the place.

Is it likely that these changes to SQLite and Gimp would be merged into
the upstream applications?

> ...
>
> new file mode 100644
> index 0000000..6b74cc4
> --- /dev/null
> +++ b/fs/fincore.c
> @@ -0,0 +1,135 @@
> +/*
> + * fs/fincore.c
> + *
> + * Copyright (C) 2009, 2010 Chris Frost, UC Regents
> + * Copyright (C) 2008 Steve VanDeBogart, UC Regents
> + */
> +
> +/*
> + * The fincore() system call.
> + */
> +#include <linux/fs.h>
> +#include <linux/file.h>
> +#include <linux/pagemap.h>
> +#include <linux/syscalls.h>
> +#include <linux/uaccess.h>
> +
> +static unsigned char fincore_page(struct address_space *mapping, pgoff_t pgoff)
> +{
> + unsigned char present = 0;
> + struct page *page = find_get_page(mapping, pgoff);
> + if (page) {
> + present = PageUptodate(page);
> + page_cache_release(page);
> + }
> +
> + return present;
> +}

What Andi said. This is crying out for a big radix-tree walk in the
inner loop.

> +/*
> + * The fincore(2) system call.
> + *
> + * fincore() returns the memory residency status of the pages backing
> + * a file range specified by fd and [start, start + len).
> + * The status is returned in a vector of bytes. The least significant
> + * bit of each byte is 1 if the referenced page is in memory, otherwise
> + * it is zero.
> + *
> + * Because the status of a page can change after fincore() checks it
> + * but before it returns to the application, the returned vector may
> + * contain stale information. Only locked pages are guaranteed to
> + * remain in memory.
> + *
> + * return values:
> + * zero - success
> + * -EBADF - fd is an illegal file descriptor
> + * -EFAULT - vec points to an illegal address
> + * -EINVAL - start is not a multiple of PAGE_CACHE_SIZE or start + len
> + * is larger than the size of the file
> + * -EAGAIN - A kernel resource was temporarily unavailable.
> + */
> +SYSCALL_DEFINE4(fincore, unsigned int, fd, loff_t, start, loff_t, len,
> + unsigned char __user *, vec)
> +{
> + long retval;
> + loff_t pgoff = start >> PAGE_SHIFT;
> + loff_t pgend;
> + loff_t npages;
> + struct file *filp;
> + int fput_needed;
> + loff_t file_nbytes;
> + loff_t file_npages;
> + unsigned char *tmp = NULL;
> + unsigned char tmp_small[64];
> + unsigned tmp_count;
> + int i;

pgoff, pgend and npages should be pgoff_t, I think. file_nbytes I
guess is OK, but maybe size_t. file_npages should be pgoff_t. And
death to tmp!

> + /* Check the start address: needs to be page-aligned.. */
> + if (start & ~PAGE_CACHE_MASK)
> + return -EINVAL;
> +
> + npages = len >> PAGE_SHIFT;
> + npages += (len & ~PAGE_MASK) != 0;
> +
> + pgend = pgoff + npages;
> +
> + filp = fget_light(fd, &fput_needed);
> + if (!filp)
> + return -EBADF;
> +
> + if (filp->f_dentry->d_inode->i_mode & S_IFBLK)
> + file_nbytes = filp->f_dentry->d_inode->i_bdev->bd_inode->i_size << 9;
> + else
> + file_nbytes = filp->f_dentry->d_inode->i_size;

I think

file_nbytes = i_size_read(filp->f_mapping->host->i_size);

is what you want here.


> + file_npages = file_nbytes >> PAGE_SHIFT;
> + file_npages += (file_nbytes & ~PAGE_MASK) != 0;

file_npages = (file_nbytes + PAGE_SIZE - 1) >> PAGE_SHIFT;

> + if (pgoff >= file_npages || pgend > file_npages) {
> + retval = -EINVAL;
> + goto done;
> + }

Should this return -EINVAL, or should it just return "0": nothing there?

Bear in mind that this code is racy against truncate (I think?), and
this is "by design". If that race does occur, we want to return
something graceful to userspace and I suggest that "nope, nothing
there" is a more graceful result that "erk, you screwed up". Because
the application _didn't_ screw up: the pages were there when the
syscall was first performed.

> + if (!access_ok(VERIFY_WRITE, vec, npages)) {
> + retval = -EFAULT;
> + goto done;
> + }

Yeah, just remove this. copy_to_user() will handle it.

> + /*
> + * Allocate buffer vector page.
> + * Optimize allocation for small values of npages because the
> + * __get_free_page() call doubles fincore(2) runtime when npages == 1.
> + */
> + if (npages <= sizeof(tmp_small)) {
> + tmp = tmp_small;
> + tmp_count = sizeof(tmp_small);
> + } else {
> + tmp = (void *) __get_free_page(GFP_USER);
> + if (!tmp) {
> + retval = -EAGAIN;
> + goto done;
> + }
> + tmp_count = PAGE_SIZE;
> + }
> +
> + while (pgoff < pgend) {
> + /*
> + * Do at most tmp_count entries per iteration, due to
> + * the temporary buffer size.
> + */
> + for (i = 0; pgoff < pgend && i < tmp_count; pgoff++, i++)
> + tmp[i] = fincore_page(filp->f_mapping, pgoff);
> +
> + if (copy_to_user(vec, tmp, i)) {
> + retval = -EFAULT;
> + break;
> + }
> + vec += i;
> + }
> + retval = 0;
> +done:
> + if (tmp && tmp != tmp_small)
> + free_page((unsigned long) tmp);
> + fput_light(filp, fput_needed);
> + return retval;
> +}

2010-01-27 18:14:46

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH] fs: add fincore(2) (mincore(2) for file descriptors)

Chris Frost wrote:
> We introduced this system call while modifying SQLite and the GIMP to
> request large prefetches for what would otherwise be non-sequential reads.
> As a macrobenchmark, we see a 125s SQLite query (72s system time) reduced
> to 75s (18s system time) by using fincore() instead of mincore(). This
> speedup of course varies by benchmark and benchmarks size; we've seen
> both minimal speedups and 1000x speedups. More on these benchmarks in the
> publication _Reducing Seek Overhead with Application-Directed Prefetching_
> in USENIX ATC 2009 and at http://libprefetch.cs.ucla.edu/.

My first thought was:

Why is calling fincore() and then issuing reads better than simply
calling readahead() on the same range? I.e. why is readahead() (or
POSIX_FADV_WILLNEED) unsuitable to give the same result? Or even
issuing lots of AIO requests.

I knew that I was missing something, so I read the paper ;-) I don't
fully understand it, but *think* that it says fincore() is used to
detect when the kernel is evicting pages faster than libprefetch had
planned for, implying memory pressure, so it adjusts its planning in
response.

Interesting idea, though I wonder if it wouldn't be even better to
have a direct way to ask the kernel "tell me when there is memory
pressure causing my file to be evicted".

-- Jamie

2010-01-28 07:42:49

by Steve VanDeBogart

[permalink] [raw]
Subject: Re: [PATCH] fs: add fincore(2) (mincore(2) for file descriptors)

On Tue, 26 Jan 2010, Andrew Morton wrote:

> On Wed, 20 Jan 2010 13:57:12 -0800
> Chris Frost <[email protected]> wrote:
>
>> In this patch find_get_page() is called for each page, which in turn
>> calls rcu_read_lock(), for each page. We have found that amortizing
>> these RCU calls, e.g., by introducing a variant of find_get_pages_contig()
>> that does not skip missing pages, can speedup the above microbenchmark
>> by 260x when querying many pages per system call. But we have not observed
>> noticeable improvements to our macrobenchmarks. I'd be happy to also post
>> this change or look further into it, but this seems like a reasonable
>> first patch, at least.
>
> I must say, the syscall appeals to my inner geek. Lot of applications
> are leaving a lot of time on the floor due to bad disk access patterns.
> A really smart library which uses this facility could help all over
> the place.
>
> Is it likely that these changes to SQLite and Gimp would be merged into
> the upstream applications?

Changes to the GIMP fit nicely into the code structure, so it's feasible
to push this kind of optimization upstream. The changes in SQLite are
a bit more focused on the benchmark, but a more general approach is not
conceptually difficult. SQLite may not want the added complexity, but
other database may be interested in the performance improvement.

Of course, these kernel changes are needed before any application can
optimize its IO as we did with libprefetch.

>> + if (pgoff >= file_npages || pgend > file_npages) {
>> + retval = -EINVAL;
>> + goto done;
>> + }
>
> Should this return -EINVAL, or should it just return "0": nothing there?
>
> Bear in mind that this code is racy against truncate (I think?), and
> this is "by design". If that race does occur, we want to return
> something graceful to userspace and I suggest that "nope, nothing
> there" is a more graceful result that "erk, you screwed up". Because
> the application _didn't_ screw up: the pages were there when the
> syscall was first performed.

That's a good point. Not in core seems like the right answer for
pgoff >= file_npages.

--
Steve

2010-01-28 08:23:44

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] fs: add fincore(2) (mincore(2) for file descriptors)

On Wed, 27 Jan 2010 23:42:35 -0800 (PST) Steve VanDeBogart <[email protected]> wrote:

> > Is it likely that these changes to SQLite and Gimp would be merged into
> > the upstream applications?
>
> Changes to the GIMP fit nicely into the code structure, so it's feasible
> to push this kind of optimization upstream. The changes in SQLite are
> a bit more focused on the benchmark, but a more general approach is not
> conceptually difficult. SQLite may not want the added complexity, but
> other database may be interested in the performance improvement.
>
> Of course, these kernel changes are needed before any application can
> optimize its IO as we did with libprefetch.

That didn't really answer my question.

If there's someone signed up and motivated to do the hard work of
getting these changes integrated into the upstream applications then
that makes us more interested. If, however it was some weekend
proof-of-concept hack which shortly dies an instadeath then... meh,
not so much.

2010-01-28 08:24:06

by Steve VanDeBogart

[permalink] [raw]
Subject: Re: [PATCH] fs: add fincore(2) (mincore(2) for file descriptors)

On Wed, 27 Jan 2010, Jamie Lokier wrote:

> Chris Frost wrote:
>> We introduced this system call while modifying SQLite and the GIMP to
>> request large prefetches for what would otherwise be non-sequential reads.
>> As a macrobenchmark, we see a 125s SQLite query (72s system time) reduced
>> to 75s (18s system time) by using fincore() instead of mincore(). This
>> speedup of course varies by benchmark and benchmarks size; we've seen
>> both minimal speedups and 1000x speedups. More on these benchmarks in the
>> publication _Reducing Seek Overhead with Application-Directed Prefetching_
>> in USENIX ATC 2009 and at http://libprefetch.cs.ucla.edu/.
>
> My first thought was:
>
> Why is calling fincore() and then issuing reads better than simply
> calling readahead() on the same range? I.e. why is readahead() (or
> POSIX_FADV_WILLNEED) unsuitable to give the same result? Or even
> issuing lots of AIO requests.

A stupid example can illustrate the difference. If you have X bytes
of RAM, and have a file 10 X in size, reading the entire thing in
before accessing it will only hurt performance. (The same situation
where an MRU replacement policy can perform better then a strict LRU
policy.) In other words, using fincore helps the prefetching library
figure out how much to prefetch to optimize performance.

> I knew that I was missing something, so I read the paper ;-) I don't
> fully understand it, but *think* that it says fincore() is used to
> detect when the kernel is evicting pages faster than libprefetch had
> planned for, implying memory pressure, so it adjusts its planning in
> response.

In some sense, yes. At core, Libprefetch uses fincore to detect how much
memory can be used for prefetching. We can ask proc how much is
free and how much is in buffers, but memory use is dynamic, so
libprefetch must monitor it and react appropriately.

> Interesting idea, though I wonder if it wouldn't be even better to
> have a direct way to ask the kernel "tell me when there is memory
> pressure causing my file to be evicted".

Such an interface sounds fairly specialized. It's possible that it would
be more efficient for this particular purpose, but useless for other,
related, purposes. For example, if you want to backup a file without
polluting the buffer cache: before accessing each page of the file,
fincore that page, send it to the backup device, then restore the page
to its previous state with fadvise. fincore is obviously modelled on
mincore, which has stood the test of time as an interface. Even if you
don't like the mincore/fincore interface, you have to admit it can't be
that bad because no replacement interface has been accepted.

--
Steve

2010-01-28 08:32:36

by Steve VanDeBogart

[permalink] [raw]
Subject: Re: [PATCH] fs: add fincore(2) (mincore(2) for file descriptors)

On Thu, 28 Jan 2010, Andrew Morton wrote:

> On Wed, 27 Jan 2010 23:42:35 -0800 (PST) Steve VanDeBogart <[email protected]> wrote:
>
>>> Is it likely that these changes to SQLite and Gimp would be merged into
>>> the upstream applications?
>>
>> Changes to the GIMP fit nicely into the code structure, so it's feasible
>> to push this kind of optimization upstream. The changes in SQLite are
>> a bit more focused on the benchmark, but a more general approach is not
>> conceptually difficult. SQLite may not want the added complexity, but
>> other database may be interested in the performance improvement.
>>
>> Of course, these kernel changes are needed before any application can
>> optimize its IO as we did with libprefetch.
>
> That didn't really answer my question.
>
> If there's someone signed up and motivated to do the hard work of
> getting these changes integrated into the upstream applications then
> that makes us more interested. If, however it was some weekend
> proof-of-concept hack which shortly dies an instadeath then... meh,
> not so much.

Sorry I misunderstood. The maintainer of GraphicsMagick has already
contacted us about making changes similar to our GIMP changes. So yes,
there is interest in really using these changes.

--
Steve

2010-01-28 23:54:47

by Andres Freund

[permalink] [raw]
Subject: Re: [PATCH] fs: add fincore(2) (mincore(2) for file descriptors)

On Thursday 28 January 2010 09:23:13 Andrew Morton wrote:
> On Wed, 27 Jan 2010 23:42:35 -0800 (PST) Steve VanDeBogart <vandebo-
[email protected]> wrote:
> > > Is it likely that these changes to SQLite and Gimp would be merged into
> > > the upstream applications?
> >
> > Changes to the GIMP fit nicely into the code structure, so it's feasible
> > to push this kind of optimization upstream. The changes in SQLite are
> > a bit more focused on the benchmark, but a more general approach is not
> > conceptually difficult. SQLite may not want the added complexity, but
> > other database may be interested in the performance improvement.
> >
> > Of course, these kernel changes are needed before any application can
> > optimize its IO as we did with libprefetch.
> That didn't really answer my question.
> If there's someone signed up and motivated to do the hard work of
> getting these changes integrated into the upstream applications then
> that makes us more interested. If, however it was some weekend
> proof-of-concept hack which shortly dies an instadeath then... meh,
> not so much.
There is somebody working on a POC contrib module for postgres to deliver
better cost estimates using mincore - as postgres doesnt use mmap itself
something like fincore would be rather nice there.

(While currently it would be contrib module the plan is to be able too hook it
into it without modifications to core pg code)

Andres

2010-02-16 18:21:45

by Chris Frost

[permalink] [raw]
Subject: Re: [PATCH] fs: add fincore(2) (mincore(2) for file descriptors)

Add the fincore() system call. fincore() is mincore() for file descriptors.

The functionality of fincore() can be emulated with an mmap(), mincore(),
and munmap(), but this emulation requires more system calls and requires
page table modifications. fincore() can provide a significant performance
improvement for non-sequential in-core queries.

CC: Andi Kleen <[email protected]>
CC: Wu Fengguang <[email protected]>
CC: Andrew Morton <[email protected]>
Signed-off-by: Chris Frost <[email protected]>
Signed-off-by: Steve VanDeBogart <[email protected]>
---

Andi, Wu, and Andrew,

Thanks for the feedback. I have incorporated most of your suggestions
into this patch. A man page for fincore(2) is attached.

A few questions about the suggestions:

* Early return when a signal is queued
Andi pointed out that it would be good to return early if a signal is
queued for the calling process. I see two options to do this:
1. Return -EINTR (the patch in this email takes this approach).
2. Return the results discovered so far.

And the tradeoffs:
1. Returning -EINTR abandons good work.
2. fincore needs to inform the caller of the number of valid vec entries.
Three approaches:
- Return the number of file bytes with valid page entries.
Issue: The return type is a long. For a 32 bit process, sizeof(long)
is significantly less than sizeof(loff_t).
- Return the number of pages with valid page entries (use type ssize_t).
Issue: I don't see any significant issues.
- Add an 'loff_t *result' parameter to the system call (a la _llseek).
Issue: does this push the number of arguments too high?

Given the above, I feel that returning the number of valid page entries is
the best approach. Feedback?

* Radix gang lookup
Andi and Andrew suggested using radix gang lookups for larger ranges.
My benchmarks show that my change to do this isn't a strict performance
win. Attached is my patch that changes the below code to do gang lookups.
Here are the microbenchmark results for querying a 1 GiB file (results
are times in seconds):
density | none in-core | many in-core |
pages/syscall | 1 | 8192 | 1 | 8192 |
--------------+------+-------+------+-------|
fincore | .055 | .0045 | .078 | .0250 |
fincore-batch | .059 | .0010 | .123 | .0170 |
"none in-core" means none of the file pages were in ram and "many in-core"
that the majority were in ram. the next row indicates how many pages
are queried in each fincore(2) call by the process (1 or 8192 pages).
These results show that fincore-batch is ~2-5x faster for large count
calls, but that it is slower for small count calls. Both classes of calls
are made by our macrobenchmarks. The macrobenchmarks (a SQLite query and
a GIMP image blur) show the batch version slows performance by 2-12%.

Perhaps my gang patch can be further optimized, or is incorrect? (Of note,
I did not avoid the gets and puts.) If not, fincore_pages() could be
changed to decide choose which of a single vs. gang lookup to do based on
a heuristic tradeoff value. Or it could stay simple and only make single
lookups.

* Kernel output buffer size
Andi pointed out that the optimization to avoid allocating a buffer page
for small queries could be changed to claim more than 64 bytes of on-stack
buffer space. I haven't seen increasing this space to significantly affect
my particular macrobenchmarks. I'm ok with leaving it at 64 bytes or
changing it; this number was picked fairly arbitrarily.

* Upper query size limit
Andi suggested imposing a reasonable upper limit on the number of pages
that can be queried in a single fincore call, because the process would
not be debuggable while the system call is executing. Andi, could you
cite an example system call for me to look at that takes this approach?

* System call argument feasibility across architectures
The fincore(2) parameters are now reordered so that they should fit in six
registers on all platforms. I intend for the libc wrapper to expose the
call as shown in the man page, rather than the system call's ordering.
I think this issue is now resolved, but wanted to mention it just in case
I missed anything; does this look good for all architectures?

include/linux/syscalls.h | 2
fs/fincore.c | 126 +++++++++++++++++++++++++++++++++++++
fs/Makefile | 2
include/asm-generic/unistd.h | 5 +
arch/x86/ia32/ia32entry.S | 1
arch/x86/include/asm/unistd_32.h | 3
arch/x86/include/asm/unistd_64.h | 2
arch/x86/kernel/syscall_table_32.S | 1
8 files changed, 139 insertions(+), 3 deletions(-)

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a990ace..814e4f5 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -534,6 +534,8 @@ asmlinkage long sys_munlockall(void);
asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
asmlinkage long sys_mincore(unsigned long start, size_t len,
unsigned char __user * vec);
+asmlinkage long sys_fincore(unsigned int fd, unsigned char __user *vec,
+ loff_t start, loff_t len);

asmlinkage long sys_pivot_root(const char __user *new_root,
const char __user *put_old);
diff --git a/fs/fincore.c b/fs/fincore.c
new file mode 100644
index 0000000..f329fe4
--- /dev/null
+++ b/fs/fincore.c
@@ -0,0 +1,126 @@
+/*
+ * fs/fincore.c
+ *
+ * Copyright (C) 2009, 2010 Chris Frost, UC Regents
+ * Copyright (C) 2008 Steve VanDeBogart, UC Regents
+ */
+
+/*
+ * The fincore() system call.
+ */
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/pagemap.h>
+#include <linux/syscalls.h>
+#include <linux/uaccess.h>
+
+static unsigned char fincore_page(struct address_space *mapping, pgoff_t pgoff)
+{
+ unsigned char present = 0;
+ struct page *page = find_get_page(mapping, pgoff);
+ if (page) {
+ present = PageUptodate(page);
+ page_cache_release(page);
+ }
+
+ return present;
+}
+
+/*
+ * The fincore(2) system call.
+ *
+ * fincore() returns the memory residency status of the pages backing
+ * a file range specified by fd and [start, start + len).
+ * The status is returned in a vector of bytes. The least significant
+ * bit of each byte is 1 if the referenced page is in memory, otherwise
+ * it is zero.
+ *
+ * Because the status of a page can change after fincore() checks it
+ * but before it returns to the application, the returned vector may
+ * contain stale information. Only locked pages are guaranteed to
+ * remain in memory.
+ *
+ * Note that the parameter order for this system calls differ from the order
+ * for the libc wrapper. This syscall order allows the parameters to fit
+ * in six registers on all architectures.
+ *
+ * return values:
+ * zero - success
+ * -EBADF - fd is an illegal file descriptor
+ * -EFAULT - vec points to an illegal address
+ * -EINVAL - start is not a multiple of PAGE_CACHE_SIZE
+ * -EAGAIN - A kernel resource was temporarily unavailable
+ * -EINTR - The call was interrupted by a signal
+ */
+SYSCALL_DEFINE4(fincore, unsigned int, fd, unsigned char __user *, vec,
+ loff_t, start, loff_t, len)
+{
+ long retval;
+ pgoff_t pgoff = start >> PAGE_SHIFT;
+ pgoff_t npages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ pgoff_t pgend = pgoff + npages;
+ struct file *filp;
+ int fput_needed;
+ loff_t file_nbytes;
+ pgoff_t file_npages;
+ unsigned char *kernel_vec = NULL;
+ unsigned char kernel_vec_small[64];
+ unsigned kernel_vec_count;
+ int i;
+
+ /* Check the start address: needs to be page-aligned.. */
+ if (start & ~PAGE_CACHE_MASK)
+ return -EINVAL;
+
+ filp = fget_light(fd, &fput_needed);
+ if (!filp)
+ return -EBADF;
+
+ file_nbytes = i_size_read(filp->f_mapping->host);
+
+ file_npages = (file_nbytes + PAGE_SIZE - 1) >> PAGE_SHIFT;
+
+ /*
+ * Allocate buffer vector page.
+ * Optimize allocation for small values of npages because the
+ * __get_free_page() call doubles fincore(2) runtime when npages == 1.
+ */
+ if (npages <= sizeof(kernel_vec_small)) {
+ kernel_vec = kernel_vec_small;
+ kernel_vec_count = sizeof(kernel_vec_small);
+ } else {
+ kernel_vec = (void *) __get_free_page(GFP_USER);
+ if (!kernel_vec) {
+ retval = -EAGAIN;
+ goto done;
+ }
+ kernel_vec_count = PAGE_SIZE;
+ }
+
+ while (pgoff < pgend) {
+ /*
+ * Do at most kernel_vec_count entries per iteration, due to
+ * the limited buffer size.
+ */
+ for (i = 0; pgoff < pgend && i < kernel_vec_count; pgoff++, i++)
+ kernel_vec[i] = fincore_page(filp->f_mapping, pgoff);
+
+ if (copy_to_user(vec, kernel_vec, i)) {
+ retval = -EFAULT;
+ break;
+ }
+ vec += i;
+
+ if (signal_pending(current)) {
+ retval = -EINTR;
+ break;
+ }
+ cond_resched();
+ }
+ retval = 0;
+done:
+ if (kernel_vec && kernel_vec != kernel_vec_small)
+ free_page((unsigned long) kernel_vec);
+ fput_light(filp, fput_needed);
+ return retval;
+}
diff --git a/fs/Makefile b/fs/Makefile
index af6d047..a3ccd6b 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -11,7 +11,7 @@ obj-y := open.o read_write.o file_table.o super.o \
attr.o bad_inode.o file.o filesystems.o namespace.o \
seq_file.o xattr.o libfs.o fs-writeback.o \
pnode.o drop_caches.o splice.o sync.o utimes.o \
- stack.o fs_struct.o
+ stack.o fs_struct.o fincore.o

ifeq ($(CONFIG_BLOCK),y)
obj-y += buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
diff --git a/include/asm-generic/unistd.h b/include/asm-generic/unistd.h
index d76b66a..ce76dc8 100644
--- a/include/asm-generic/unistd.h
+++ b/include/asm-generic/unistd.h
@@ -623,8 +623,11 @@ __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
#define __NR_perf_event_open 241
__SYSCALL(__NR_perf_event_open, sys_perf_event_open)

+#define __NR_fincore 242
+__SYSCALL(__NR_fincore, sys_fincore)
+
#undef __NR_syscalls
-#define __NR_syscalls 242
+#define __NR_syscalls 243

/*
* All syscalls below here should go away really,
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 581b056..cbf96e6 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -841,4 +841,5 @@ ia32_sys_call_table:
.quad compat_sys_pwritev
.quad compat_sys_rt_tgsigqueueinfo /* 335 */
.quad sys_perf_event_open
+ .quad sys_fincore
ia32_syscall_end:
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 6fb3c20..088b235 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -342,10 +342,11 @@
#define __NR_pwritev 334
#define __NR_rt_tgsigqueueinfo 335
#define __NR_perf_event_open 336
+#define __NR_fincore 337

#ifdef __KERNEL__

-#define NR_syscalls 337
+#define NR_syscalls 338

#define __ARCH_WANT_IPC_PARSE_VERSION
#define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index 8d3ad0a..ebc04b5 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -661,6 +661,8 @@ __SYSCALL(__NR_pwritev, sys_pwritev)
__SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
#define __NR_perf_event_open 298
__SYSCALL(__NR_perf_event_open, sys_perf_event_open)
+#define __NR_fincore 299
+__SYSCALL(__NR_fincore, sys_fincore)

#ifndef __NO_STUBS
#define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index 0157cd2..1fdc8bc 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -336,3 +336,4 @@ ENTRY(sys_call_table)
.long sys_pwritev
.long sys_rt_tgsigqueueinfo /* 335 */
.long sys_perf_event_open
+ .long sys_fincore
--
Chris Frost
http://www.frostnet.net/chris/


Attachments:
(No filename) (11.39 kB)
fincore.2 (4.04 kB)
batch.patch (2.43 kB)
Download all attachments

2010-02-21 03:02:41

by Andy Isaacson

[permalink] [raw]
Subject: Re: [PATCH] fs: add fincore(2) (mincore(2) for file descriptors)

On Tue, Feb 16, 2010 at 10:13:12AM -0800, Chris Frost wrote:
> Add the fincore() system call. fincore() is mincore() for file descriptors.
>
> The functionality of fincore() can be emulated with an mmap(), mincore(),
> and munmap(), but this emulation requires more system calls and requires
> page table modifications. fincore() can provide a significant performance
> improvement for non-sequential in-core queries.

In addition to being expensive, mmap/mincore/munmap perturb the VM's
eviction algorithm -- a page is less likely to be evicted if it's
mmapped when being considered for eviction.

I frequently see this happen when using mincore(1) from
http://bitbucket.org/radii/mincore/ -- "watch mincore -v *.big" while
*.big are being sequentially read results in a significant number of
pages remaining in-core, whereas if I only run mincore after the
sequential read is complete, the large files will be nearly-completely
out of core (except for the tail of the last file, of course).

It's very interesting to watch
% watch --interval=.5 mincore -v *

while an IO-intensive process is happening, such as mke2fs on a
filesystem image.

So, I support the addition of fincore(2) and would use it if it were
merged.

-andy

2010-02-21 03:25:39

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] fs: add fincore(2) (mincore(2) for file descriptors)

Andy and Chris,

On Sun, Feb 21, 2010 at 11:02:38AM +0800, Andy Isaacson wrote:
> On Tue, Feb 16, 2010 at 10:13:12AM -0800, Chris Frost wrote:
> > Add the fincore() system call. fincore() is mincore() for file descriptors.
> >
> > The functionality of fincore() can be emulated with an mmap(), mincore(),
> > and munmap(), but this emulation requires more system calls and requires
> > page table modifications. fincore() can provide a significant performance
> > improvement for non-sequential in-core queries.
>
> In addition to being expensive, mmap/mincore/munmap perturb the VM's
> eviction algorithm -- a page is less likely to be evicted if it's
> mmapped when being considered for eviction.
>
> I frequently see this happen when using mincore(1) from
> http://bitbucket.org/radii/mincore/ -- "watch mincore -v *.big" while
> *.big are being sequentially read results in a significant number of
> pages remaining in-core, whereas if I only run mincore after the
> sequential read is complete, the large files will be nearly-completely
> out of core (except for the tail of the last file, of course).
>
> It's very interesting to watch
> % watch --interval=.5 mincore -v *
>
> while an IO-intensive process is happening, such as mke2fs on a
> filesystem image.
>
> So, I support the addition of fincore(2) and would use it if it were
> merged.

I'd like to advocate the "pagecache object collections", a ftrace
based alternative:

http://lkml.org/lkml/2010/2/9/156

Which will provide much more information than fincore(). I'd really
appreciate it if you can join and use the general "pagecache object
collections" facility.

Thanks,
Fengguang

2010-02-23 16:39:30

by Andy Isaacson

[permalink] [raw]
Subject: Re: [PATCH] fs: add fincore(2) (mincore(2) for file descriptors)

On Sun, Feb 21, 2010 at 11:25:33AM +0800, Wu Fengguang wrote:
> Andy and Chris,
> On Sun, Feb 21, 2010 at 11:02:38AM +0800, Andy Isaacson wrote:
> > On Tue, Feb 16, 2010 at 10:13:12AM -0800, Chris Frost wrote:
> > > Add the fincore() system call. fincore() is mincore() for file descriptors.
> > >
> > > The functionality of fincore() can be emulated with an mmap(), mincore(),
> > > and munmap(), but this emulation requires more system calls and requires
> > > page table modifications. fincore() can provide a significant performance
> > > improvement for non-sequential in-core queries.
> >
> > In addition to being expensive, mmap/mincore/munmap perturb the VM's
> > eviction algorithm -- a page is less likely to be evicted if it's
> > mmapped when being considered for eviction.
> >
> > I frequently see this happen when using mincore(1) from
> > http://bitbucket.org/radii/mincore/ -- "watch mincore -v *.big" while
> > *.big are being sequentially read results in a significant number of
> > pages remaining in-core, whereas if I only run mincore after the
> > sequential read is complete, the large files will be nearly-completely
> > out of core (except for the tail of the last file, of course).
> >
> > It's very interesting to watch
> > % watch --interval=.5 mincore -v *
> >
> > while an IO-intensive process is happening, such as mke2fs on a
> > filesystem image.
> >
> > So, I support the addition of fincore(2) and would use it if it were
> > merged.
>
> I'd like to advocate the "pagecache object collections", a ftrace
> based alternative:
>
> http://lkml.org/lkml/2010/2/9/156
>
> Which will provide much more information than fincore(). I'd really
> appreciate it if you can join and use the general "pagecache object
> collections" facility.

1. The ftrace alternative appears to require root. That's a complete
non-starter for my use case.

2. I can imagine advocating that other UNIXes adopt fincore. It's
unrealistic to pretend that other UNIXes will adopt our trace/
infrastructure. (If anything we should have adopted DTrace.)

3. It appears to expose a significantly more complicated userland API.
(But this doesn't matter until (1) is addressed.) Also, it looks
like it'll be a lot more expensive for high-frequency queries. Note
that in the library-helper use case, the library implementation may
be limited by its exposed API from leaving filedescriptors open
across calls. Does ftrace really require the kernel to format data
to ASCII so that it can be fscanf()ed by userland? I hope that's
just a convenience and there's a binary output path.

4. How committed is the ftrace API and ABI? Is it guaranteed to
continue to be supported for the next 2 decades?

I'd much rather have the simple, supportible, explainable, performant
API that fits in well to the standard UNIX paradigm than to add
dependencies on Linux-specific APIs that appear to be in extreme flux.

My apologies if I've missed anything in the above, please let me know if
I'm wrong.

Thanks,
-andy