2014-06-02 05:26:00

by Naoya Horiguchi

[permalink] [raw]
Subject: [RFC][PATCH 0/3] mm: introduce fincore()

Due to the previous discussion[1], I learned that you people have discussed
this system call a few times (but not conclusion) and it can solve my problem
about pagecache scanning (see[2] for my motivation.) So I try it now.

The main patch of this patchset is patch 2, and this is based on Johannes's
previous version[3], so I CCed people who joined that discussion. While there
might be controversies about the format of data copied from kernel to userspace,
I take the Kirill's suggestion[4] which uses a flag to choose the data format,
which is extensible and flexible (you can cut off some info if you don't need it.)

And I added simple tests at patch 3, and patch 2 passes all the tests.

Any comments are welcomed.

[1] http://marc.info/?l=linux-kernel&m=140072606903894&w=2
[2] http://marc.info/?l=linux-mm&m=140072522603776&w=2
[3] http://thread.gmane.org/gmane.linux.kernel/1439212/focus=1441919
[4] http://marc.info/?l=linux-kernel&m=140075509611532&w=2

Thanks,
Naoya Horiguchi
---
Summary:

Naoya Horiguchi (3):
replace PAGECACHE_TAG_* definition with enumeration
mm: introduce fincore()
selftest: add test code for fincore()

arch/x86/syscalls/syscall_64.tbl | 1 +
include/linux/fs.h | 9 +-
include/linux/syscalls.h | 2 +
mm/Makefile | 2 +-
mm/fincore.c | 362 +++++++++++++++++++++
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/fincore/Makefile | 31 ++
.../selftests/fincore/create_hugetlbfs_file.c | 49 +++
tools/testing/selftests/fincore/fincore.c | 153 +++++++++
tools/testing/selftests/fincore/run_fincoretests | 355 ++++++++++++++++++++
10 files changed, 961 insertions(+), 4 deletions(-)


2014-06-02 05:26:06

by Naoya Horiguchi

[permalink] [raw]
Subject: [PATCH 2/3] mm: introduce fincore()

This patch provides a new system call fincore(2), which provides mincore()-
like information, i.e. page residency of a given file. But unlike mincore(),
fincore() can have a mode flag and it enables us to extract more detailed
information about page cache like pfn and page flag. This kind of information
is very helpful for example when applications want to know the file cache
status to control IO on their own way.

Detail about the data format being passed to userspace are explained in
inline comment, but generally in long entry format, we can choose which
information is extraced flexibly, so you don't have to waste memory by
extracting unnecessary information. And with FINCORE_SKIP_HOLE flag,
we can skip hole pages (not on memory,) which makes us avoid a flood of
meaningless zero entries when calling on extremely large (but only few
pages of it are loaded on memory) file.

Basic testset is added in a next patch on tools/testing/selftests/fincore/.

[1] http://thread.gmane.org/gmane.linux.kernel/1439212/focus=1441919

Signed-off-by: Naoya Horiguchi <[email protected]>
---
arch/x86/syscalls/syscall_64.tbl | 1 +
include/linux/syscalls.h | 2 +
mm/Makefile | 2 +-
mm/fincore.c | 362 +++++++++++++++++++++++++++++++++++++++
4 files changed, 366 insertions(+), 1 deletion(-)
create mode 100644 mm/fincore.c

diff --git v3.15-rc7.orig/arch/x86/syscalls/syscall_64.tbl v3.15-rc7/arch/x86/syscalls/syscall_64.tbl
index 04376ac3d9ef..0a6b6dd77708 100644
--- v3.15-rc7.orig/arch/x86/syscalls/syscall_64.tbl
+++ v3.15-rc7/arch/x86/syscalls/syscall_64.tbl
@@ -323,6 +323,7 @@
314 common sched_setattr sys_sched_setattr
315 common sched_getattr sys_sched_getattr
316 common renameat2 sys_renameat2
+317 common fincore sys_fincore

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git v3.15-rc7.orig/include/linux/syscalls.h v3.15-rc7/include/linux/syscalls.h
index a4a0588c5397..d625ec9cb73d 100644
--- v3.15-rc7.orig/include/linux/syscalls.h
+++ v3.15-rc7/include/linux/syscalls.h
@@ -866,4 +866,6 @@ asmlinkage long sys_process_vm_writev(pid_t pid,
asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
unsigned long idx1, unsigned long idx2);
asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
+asmlinkage long sys_fincore(int fd, loff_t start, long nr_pages,
+ int mode, unsigned char __user *vec);
#endif
diff --git v3.15-rc7.orig/mm/Makefile v3.15-rc7/mm/Makefile
index b484452dac57..55e1d13ddb76 100644
--- v3.15-rc7.orig/mm/Makefile
+++ v3.15-rc7/mm/Makefile
@@ -18,7 +18,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
mm_init.o mmu_context.o percpu.o slab_common.o \
compaction.o balloon_compaction.o vmacache.o \
interval_tree.o list_lru.o workingset.o \
- iov_iter.o $(mmu-y)
+ iov_iter.o fincore.o $(mmu-y)

obj-y += init-mm.o

diff --git v3.15-rc7.orig/mm/fincore.c v3.15-rc7/mm/fincore.c
new file mode 100644
index 000000000000..3fc3ef465471
--- /dev/null
+++ v3.15-rc7/mm/fincore.c
@@ -0,0 +1,362 @@
+/*
+ * fincore(2) system call
+ *
+ * Copyright (C) 2014 NEC Corporation, Naoya Horiguchi
+ */
+
+#include <linux/syscalls.h>
+#include <linux/pagemap.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/hugetlb.h>
+
+/*
+ * You can control how the buffer in userspace is filled with this mode
+ * parameters:
+ *
+ * - FINCORE_BMAP:
+ * The page status is returned in a vector of bytes.
+ * The least significant bit of each byte is 1 if the referenced page
+ * is in memory, otherwise it is zero.
+ *
+ * - FINCORE_PFN:
+ * stores pfn, using 8 bytes.
+ *
+ * - FINCORE_PAGEFLAGS:
+ * stores page flags, using 8 bytes. See definition of KPF_* for details.
+ *
+ * - FINCORE_PAGECACHE_TAGS:
+ * stores pagecache tags, using 8 bytes. See definition of PAGECACHE_TAG_*
+ * for details.
+ *
+ * - FINCORE_SKIP_HOLE: if this flag is set, fincore() doesn't store any
+ * information about hole. Instead each records per page has the entry
+ * of page offset (using 8 bytes.) This mode is useful if we handle
+ * large file and only few pages are on memory for the file.
+ *
+ * FINCORE_BMAP shouldn't be used combined with any other flags, and returnd
+ * data in this mode is like this:
+ *
+ * page offset 0 1 2 3 4
+ * +---+---+---+---+---+
+ * | 1 | 0 | 0 | 1 | 1 | ...
+ * +---+---+---+---+---+
+ * <->
+ * 1 byte
+ *
+ * For FINCORE_PFN, page data is formatted like this:
+ *
+ * page offset 0 1 2 3 4
+ * +-------+-------+-------+-------+-------+
+ * | pfn | pfn | pfn | pfn | pfn | ...
+ * +-------+-------+-------+-------+-------+
+ * <----->
+ * 8 byte
+ *
+ * We can use multiple flags among FINCORE_(PFN|PAGEFLAGS|PAGECACHE_TAGS).
+ * For example, when the mode is FINCORE_PFN|FINCORE_PAGEFLAGS, the per-page
+ * information is stored like this:
+ *
+ * page offset 0 page offset 1 page offset 2
+ * +-------+-------+-------+-------+-------+-------+
+ * | pfn | flags | pfn | flags | pfn | flags | ...
+ * +-------+-------+-------+-------+-------+-------+
+ * <-------------> <-------------> <------------->
+ * 16 bytes 16 bytes 16 bytes
+ *
+ * When FINCORE_SKIP_HOLE is set, we ignore holes and add page offset entry
+ * (8 bytes) instead. For example, the data format of mode
+ * FINCORE_PFN|FINCORE_SKIP_HOLE is like follows:
+ *
+ * +-------+-------+-------+-------+-------+-------+
+ * | pgoff | pfn | pgoff | pfn | pgoff | pfn | ...
+ * +-------+-------+-------+-------+-------+-------+
+ * <-------------> <-------------> <------------->
+ * 16 bytes 16 bytes 16 bytes
+ */
+#define FINCORE_BMAP 0x01 /* bytemap mode */
+#define FINCORE_PFN 0x02
+#define FINCORE_PAGE_FLAGS 0x04
+#define FINCORE_PAGECACHE_TAGS 0x08
+#define FINCORE_SKIP_HOLE 0x10
+
+#define FINCORE_MODE_MASK 0x1f
+#define FINCORE_LONGENTRY_MASK (FINCORE_PFN | FINCORE_PAGE_FLAGS | \
+ FINCORE_PAGECACHE_TAGS | FINCORE_SKIP_HOLE)
+
+struct fincore_control {
+ int mode;
+ int width; /* width of each entry (in bytes) */
+ unsigned char *buffer;
+ long buffer_size;
+ void *cursor; /* current position on the buffer */
+ pgoff_t pgstart; /* start point of page cache scan in each run
+ * of the while loop */
+ long nr_pages; /* number of pages to be copied to userspace
+ * (decreasing while scan proceeds) */
+ long scanned_offset; /* page offset of the lastest scanned page */
+ struct address_space *mapping;
+};
+
+/*
+ * TODO: doing radix_tree_tag_get() for each tag is not optimal, but no easy
+ * way without degrading finely tuned radix tree routines.
+ */
+static unsigned long get_pagecache_tags(struct radix_tree_root *root,
+ unsigned long index)
+{
+ int i;
+ unsigned long tags = 0;
+ for (i = 0; i < __NR_PAGECACHE_TAGS; i++)
+ if (radix_tree_tag_get(root, index, i))
+ tags |= 1 << i;
+ return tags;
+}
+
+#define store_entry(fc, type, data) ({ \
+ *(type *)fc->cursor = (type)data; \
+ fc->cursor += sizeof(type); \
+})
+
+/*
+ * Store page cache data to temporal buffer in the specified format depending
+ * on fincore mode.
+ */
+static void __do_fincore(struct fincore_control *fc, struct page *page,
+ unsigned long index)
+{
+ VM_BUG_ON(!page);
+ VM_BUG_ON((unsigned long)fc->cursor - (unsigned long)fc->buffer
+ >= fc->buffer_size);
+ if (fc->mode & FINCORE_BMAP)
+ store_entry(fc, unsigned char, PageUptodate(page));
+ else if (fc->mode & (FINCORE_LONGENTRY_MASK)) {
+ if (fc->mode & FINCORE_SKIP_HOLE)
+ store_entry(fc, unsigned long, index);
+ if (fc->mode & FINCORE_PFN)
+ store_entry(fc, unsigned long, page_to_pfn(page));
+ if (fc->mode & FINCORE_PAGE_FLAGS)
+ store_entry(fc, unsigned long, stable_page_flags(page));
+ if (fc->mode & FINCORE_PAGECACHE_TAGS)
+ store_entry(fc, unsigned long,
+ get_pagecache_tags(&fc->mapping->page_tree,
+ index));
+ }
+}
+
+/*
+ * Traverse page cache tree. It's assumed that temporal buffer are zeroed
+ * in advance. Due to this, we don't have to store zero entry explicitly
+ * one-by-one and we just set fc->cursor to the position of the next
+ * on-memory page.
+ *
+ * Return value is the number of pages whose data is stored in fc->buffer.
+ */
+static long do_fincore(struct fincore_control *fc, int nr_pages)
+{
+ pgoff_t pgend = fc->pgstart + nr_pages;
+ struct radix_tree_iter iter;
+ void **slot;
+ long nr = 0;
+
+ fc->cursor = fc->buffer;
+
+ rcu_read_lock();
+restart:
+ radix_tree_for_each_slot(slot, &fc->mapping->page_tree, &iter,
+ fc->pgstart) {
+ long jump;
+ struct page *page;
+
+ fc->scanned_offset = iter.index;
+ /* Handle holes */
+ jump = iter.index - fc->pgstart - nr;
+ if (jump) {
+ if (!(fc->mode & FINCORE_SKIP_HOLE)) {
+ if (iter.index < pgend) {
+ fc->cursor += jump * fc->width;
+ nr = iter.index - fc->pgstart;
+ } else {
+ /*
+ * Fill remaining buffer as hole. Next
+ * call should start at offset pgend.
+ */
+ nr = nr_pages;
+ fc->scanned_offset = pgend - 1;
+ break;
+ }
+ }
+ }
+repeat:
+ page = radix_tree_deref_slot(slot);
+ if (unlikely(!page))
+ /*
+ * No need to increment nr and fc->cursor, because next
+ * iteration should detect hole and update them there.
+ */
+ continue;
+ else if (radix_tree_exception(page)) {
+ if (radix_tree_deref_retry(page)) {
+ /*
+ * Transient condition which can only trigger
+ * when entry at index 0 moves out of or back
+ * to root: none yet gotten, safe to restart.
+ */
+ WARN_ON(iter.index);
+ goto restart;
+ }
+ __do_fincore(fc, page, iter.index);
+ } else {
+ if (!page_cache_get_speculative(page))
+ goto repeat;
+
+ /* Has the page moved? */
+ if (unlikely(page != *slot)) {
+ page_cache_release(page);
+ goto repeat;
+ }
+
+ __do_fincore(fc, page, iter.index);
+ page_cache_release(page);
+ }
+
+ if (++nr == nr_pages)
+ break;
+ }
+ rcu_read_unlock();
+
+ return nr;
+}
+
+static inline bool fincore_validate_mode(int mode)
+{
+ if (mode & ~FINCORE_MODE_MASK)
+ return false;
+ if (!(!!(mode & FINCORE_BMAP) ^ !!(mode & FINCORE_LONGENTRY_MASK)))
+ return false;
+ if ((mode & FINCORE_LONGENTRY_MASK) == FINCORE_SKIP_HOLE)
+ return false;
+ return true;
+}
+
+#define FINCORE_LOOP_STEP 256L
+
+/*
+ * The fincore(2) system call
+ *
+ * @fd: file descriptor of the target file
+ * @start: starting address offset of the target file (in byte).
+ * This should be aligned to page cache size.
+ * @nr_pages: the number of pages from which the page data is passed to
+ * userspace. If you don't set FINCORE_SKIP_HOLE, it's identical
+ * to the pages to be scanned.
+ * @mode fincore mode flags to determine the entry's format
+ * @vec pointer of the userspace buffer. The size must be equal to or
+ * larger than (@nr_pages * width), where width is the size of
+ * each entry.
+ *
+ * fincore() returns the memory residency status and additional info (like
+ * pfn and page flags) of the given file's pages.
+ *
+ * Depending on the fincore mode, caller can receive the different formatted
+ * information. See the comment on definition of FINCORE_* above.
+ *
+ * Because the status of a page can change after fincore() checks it
+ * but before it returns to the application, the returned vector may
+ * contain stale information.
+ *
+ * return values:
+ * -EBADF: @fd isn't a valid open file descriptor
+ * -EFAULT: @vec points to an illegal address
+ * -EINVAL: @start is not aligned to page cache size, or @mode is invalid
+ * non-negative value: fincore() is successfully done and the value means
+ * the number of valid entries stored on userspace buffer
+ */
+SYSCALL_DEFINE5(fincore, int, fd, loff_t, start, long, nr_pages,
+ int, mode, unsigned char __user *, vec)
+{
+ long ret;
+ long step;
+ long nr = 0;
+ int pc_shift = PAGE_CACHE_SHIFT;
+ struct fd f;
+
+ struct fincore_control fc = {
+ .mode = mode,
+ .width = sizeof(unsigned char),
+ };
+
+ if (!fincore_validate_mode(mode))
+ return -EINVAL;
+
+ f = fdget(fd);
+
+ if (is_file_hugepages(f.file))
+ pc_shift = huge_page_shift(hstate_file(f.file));
+
+ if (!IS_ALIGNED(start, 1 << pc_shift)) {
+ ret = -EINVAL;
+ goto fput;
+ }
+
+ /*
+ * TODO: support /dev/mem, /proc/pid/mem for system/process wide
+ * page survey, which would obsolete /proc/kpageflags, and
+ * /proc/pid/pagemap.
+ */
+ if (!S_ISREG(file_inode(f.file)->i_mode)) {
+ ret = -EBADF;
+ goto fput;
+ }
+
+ fc.pgstart = start >> pc_shift;
+ fc.nr_pages = nr_pages;
+ fc.mapping = f.file->f_mapping;
+ if (mode & FINCORE_LONGENTRY_MASK)
+ fc.width = ((mode & FINCORE_PFN ? 1 : 0) +
+ (mode & FINCORE_PAGE_FLAGS ? 1 : 0) +
+ (mode & FINCORE_PAGECACHE_TAGS ? 1 : 0) +
+ (mode & FINCORE_SKIP_HOLE ? 1 : 0)
+ ) * sizeof(unsigned long);
+ step = min(fc.nr_pages, FINCORE_LOOP_STEP);
+
+ fc.buffer_size = step * fc.width;
+ fc.buffer = kzalloc(fc.buffer_size, GFP_TEMPORARY);
+ if (!fc.buffer) {
+ ret = -ENOMEM;
+ goto fput;
+ }
+
+ while (true) {
+ ret = do_fincore(&fc, min(step, fc.nr_pages));
+ /* Reached the end of the file */
+ if (ret == 0) {
+ ret = nr;
+ break;
+ }
+ if (ret < 0)
+ break;
+ if (copy_to_user(vec + nr * fc.width,
+ fc.buffer, ret * fc.width)) {
+ ret = -EFAULT;
+ break;
+ }
+ fc.nr_pages -= ret;
+ fc.pgstart = fc.scanned_offset + 1;
+ nr += ret;
+ /* Completed scanning the requested numbers of pages */
+ if (fc.nr_pages == 0) {
+ ret = nr;
+ break;
+ }
+ /* Clear buffer for next do_fincore() */
+ memset(fc.buffer, 0, step * fc.width);
+ }
+
+ kfree(fc.buffer);
+fput:
+ fdput(f);
+ return ret;
+}
--
1.9.3

2014-06-02 05:26:18

by Naoya Horiguchi

[permalink] [raw]
Subject: [PATCH 3/3] selftest: add test code for fincore()

This patch adds simple test programs for fincore(), which contains the
following testcase:
- test_unaligned_start_address
- test_unaligned_start_address_hugetlb
- test_invalid_mode
- test_smallfile_bytemap
- test_smallfile_pfn
- test_smallfile_multientry
- test_largefile_pfn
- test_largefile_pfn_offset
- test_largefile_pfn_overrun
- test_tmpfs_pfn
- test_hugetlb_pfn
- test_largefile_pfn_skiphole
- test_smallfile_pfn_skiphole
-
Signed-off-by: Naoya Horiguchi <[email protected]>
---
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/fincore/Makefile | 31 ++
.../selftests/fincore/create_hugetlbfs_file.c | 49 +++
tools/testing/selftests/fincore/fincore.c | 153 +++++++++
tools/testing/selftests/fincore/run_fincoretests | 355 +++++++++++++++++++++
5 files changed, 589 insertions(+)
create mode 100644 tools/testing/selftests/fincore/Makefile
create mode 100644 tools/testing/selftests/fincore/create_hugetlbfs_file.c
create mode 100644 tools/testing/selftests/fincore/fincore.c
create mode 100644 tools/testing/selftests/fincore/run_fincoretests

diff --git v3.15-rc7.orig/tools/testing/selftests/Makefile v3.15-rc7/tools/testing/selftests/Makefile
index 32487ed18354..820813f571fb 100644
--- v3.15-rc7.orig/tools/testing/selftests/Makefile
+++ v3.15-rc7/tools/testing/selftests/Makefile
@@ -10,6 +10,7 @@ TARGETS += timers
TARGETS += vm
TARGETS += powerpc
TARGETS += user
+TARGETS += fincore

all:
for TARGET in $(TARGETS); do \
diff --git v3.15-rc7.orig/tools/testing/selftests/fincore/Makefile v3.15-rc7/tools/testing/selftests/fincore/Makefile
new file mode 100644
index 000000000000..ab4361c70da5
--- /dev/null
+++ v3.15-rc7/tools/testing/selftests/fincore/Makefile
@@ -0,0 +1,31 @@
+# Makefile for vm selftests
+
+uname_M := $(shell uname -m 2>/dev/null || echo not)
+ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
+ifeq ($(ARCH),i386)
+ ARCH := X86
+ CFLAGS := -DCONFIG_X86_32 -D__i386__
+endif
+ifeq ($(ARCH),x86_64)
+ ARCH := X86
+ CFLAGS := -DCONFIG_X86_64 -D__x86_64__
+endif
+
+CC = $(CROSS_COMPILE)gcc
+CFLAGS = -Wall
+CFLAGS += -I../../../../arch/x86/include/generated/
+CFLAGS += -I../../../../include/
+CFLAGS += -I../../../../usr/include/
+CFLAGS += -I../../../../arch/x86/include/
+
+BINARIES = fincore create_hugetlbfs_file
+
+all: $(BINARIES)
+%: %.c
+ $(CC) $(CFLAGS) -o $@ $^
+
+run_tests: all
+ @/bin/sh ./run_fincoretests || (echo "fincoretests: [FAIL]"; exit 1)
+
+clean:
+ $(RM) $(BINARIES)
diff --git v3.15-rc7.orig/tools/testing/selftests/fincore/create_hugetlbfs_file.c v3.15-rc7/tools/testing/selftests/fincore/create_hugetlbfs_file.c
new file mode 100644
index 000000000000..a46ccf0af5f2
--- /dev/null
+++ v3.15-rc7/tools/testing/selftests/fincore/create_hugetlbfs_file.c
@@ -0,0 +1,49 @@
+#define _GNU_SOURCE 1
+#include <stdio.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <sys/mman.h>
+#include <string.h>
+#include <unistd.h>
+#include <stdlib.h>
+
+#define err(x) (perror(x), exit(1))
+
+unsigned long default_hugepage_size(void)
+{
+ unsigned long hps = 0;
+ char *line = NULL;
+ size_t linelen = 0;
+ FILE *f = fopen("/proc/meminfo", "r");
+ if (!f)
+ err("open /proc/meminfo");
+ while (getline(&line, &linelen, f) > 0) {
+ if (sscanf(line, "Hugepagesize: %lu kB", &hps) == 1) {
+ hps <<= 10;
+ break;
+ }
+ }
+ free(line);
+ return hps;
+}
+
+int main(int argc, char **argv)
+{
+ int ret;
+ int fd;
+ char *p;
+ unsigned long hpsize = default_hugepage_size();
+ fd = open(argv[1], O_RDWR|O_CREAT);
+ if (fd == -1)
+ err("open");
+ p = mmap(NULL, 10 * hpsize, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
+ if (p == (void *)-1)
+ err("mmap");
+ memset(p, 'a', 3 * hpsize);
+ memset(p + 7 * hpsize, 'a', 3 * hpsize - 1);
+ ret = close(fd);
+ if (ret == -1)
+ err("close");
+ return 0;
+}
diff --git v3.15-rc7.orig/tools/testing/selftests/fincore/fincore.c v3.15-rc7/tools/testing/selftests/fincore/fincore.c
new file mode 100644
index 000000000000..b089fe5f6bdf
--- /dev/null
+++ v3.15-rc7/tools/testing/selftests/fincore/fincore.c
@@ -0,0 +1,153 @@
+/*
+ * fincore(2) test program
+ */
+
+#define _GNU_SOURCE 1
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <getopt.h>
+#include <assert.h>
+#include <fcntl.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+
+#define err(x) (perror(x), exit(1))
+
+#define FINCORE_BMAP 0x01
+#define FINCORE_PFN 0x02
+#define FINCORE_PAGE_FLAGS 0x04
+#define FINCORE_PAGECACHE_TAGS 0x08
+#define FINCORE_SKIP_HOLE 0x10
+
+#define FINCORE_MODE_MASK 0x1f
+#define FINCORE_LONGENTRY_MASK (FINCORE_PFN | FINCORE_PAGE_FLAGS | \
+ FINCORE_PAGECACHE_TAGS | FINCORE_SKIP_HOLE)
+
+static int sys_fincore(int fd, loff_t start, size_t len, int mode,
+ unsigned char *vec)
+{
+ return syscall(__NR_fincore, fd, start, len, mode, vec);
+}
+
+static int call_fincore(int fd, loff_t start, size_t len, int mode,
+ unsigned char *vec)
+{
+ return sys_fincore(fd, start, len, mode, vec);
+}
+
+void usage(char *str)
+{
+ printf(
+ "Usage: %s [-s start] [-l len] [-m mode] [-p pagesize] file\n"
+ " -s: start offset (in bytes)\n"
+ " -l: length to scan (in bytes)\n"
+ " -m: fincore mode\n"
+ " -p: set page size (for hugepage)\n"
+ " -h: show this message\n"
+ , str);
+ exit(EXIT_SUCCESS);
+}
+
+static void show_fincore_buffer(long start, long nr_pages, int records_per_page,
+ int mode, unsigned char *buf)
+{
+ int i, j;
+ unsigned char *curuc = (unsigned char *)buf;
+ unsigned long *curul = (unsigned long *)buf;
+
+ for (i = 0; i < nr_pages; i++) {
+ j = 0;
+ if (mode & FINCORE_BMAP)
+ printf("0x%lx\t%d", start + i, curuc[i + j]);
+ else if (mode & (FINCORE_LONGENTRY_MASK)) {
+ if (mode & FINCORE_SKIP_HOLE)
+ printf("0x%lx", curul[i * records_per_page + (j++)]);
+ else
+ printf("0x%lx", start + i);
+ if (mode & FINCORE_PFN)
+ printf("\t0x%lx", curul[i * records_per_page + (j++)]);
+ if (mode & FINCORE_PAGE_FLAGS)
+ printf("\t0x%lx", curul[i * records_per_page + (j++)]);
+ if (mode & FINCORE_PAGECACHE_TAGS)
+ printf("\t0x%lx", curul[i * records_per_page + (j++)]);
+ }
+ printf("\n");
+ }
+}
+
+int main(int argc, char *argv[])
+{
+ char c;
+ int fd;
+ int ret;
+ int mode = FINCORE_PFN;
+ int width = sizeof(unsigned char);
+ int records_per_page = 1;
+ long pagesize = sysconf(_SC_PAGESIZE);
+ long nr_pages;
+ unsigned long start = 0;
+ unsigned long len = 0;
+ unsigned char *buf;
+ struct stat stat;
+
+ while ((c = getopt(argc, argv, "s:l:m:p:h")) != -1) {
+ switch (c) {
+ case 's':
+ start = strtoul(optarg, NULL, 0);
+ break;
+ case 'l':
+ len = strtoul(optarg, NULL, 0);
+ break;
+ case 'm':
+ mode = strtoul(optarg, NULL, 0);
+ break;
+ case 'p':
+ pagesize = strtoul(optarg, NULL, 0);
+ break;
+ case 'h':
+ default:
+ usage(argv[0]);
+ }
+ }
+
+ fd = open(argv[optind], O_RDWR);
+ if (fd == -1)
+ err("open failed.");
+
+ /* scan to the end of file by default */
+ if (!len) {
+ ret = fstat(fd, &stat);
+ if (ret == -1)
+ err("fstat failed.");
+ len = stat.st_size - start;
+ }
+
+ if (mode & FINCORE_LONGENTRY_MASK) {
+ records_per_page = ((mode & FINCORE_PFN ? 1 : 0) +
+ (mode & FINCORE_PAGE_FLAGS ? 1 : 0) +
+ (mode & FINCORE_PAGECACHE_TAGS ? 1 : 0) +
+ (mode & FINCORE_SKIP_HOLE ? 1 : 0)
+ );
+ width = records_per_page * sizeof(unsigned long);
+ }
+
+ nr_pages = ((len + pagesize - 1) & (~(pagesize - 1))) / pagesize;
+ buf = malloc(nr_pages * width);
+ if (!buf)
+ err("malloc");
+ ret = call_fincore(fd, start, nr_pages, mode, buf);
+ if (ret < 0)
+ err("fincore");
+ /*
+ * print buffer to stdout, and parse it later for validation check.
+ * fincore() returns the number of entries written to the buffer.
+ */
+ show_fincore_buffer(start / pagesize, ret, records_per_page, mode, buf);
+ ret = close(fd);
+ if (ret < 0)
+ err("close");
+ return 0;
+}
diff --git v3.15-rc7.orig/tools/testing/selftests/fincore/run_fincoretests v3.15-rc7/tools/testing/selftests/fincore/run_fincoretests
new file mode 100644
index 000000000000..3775ac339860
--- /dev/null
+++ v3.15-rc7/tools/testing/selftests/fincore/run_fincoretests
@@ -0,0 +1,355 @@
+#!/bin/bash
+
+WDIR=./fincore_work
+mkdir $WDIR 2> /dev/null
+TMPF=`mktemp --tmpdir=$WDIR -d`
+
+#
+# common routines
+#
+abort() {
+ echo "Test abort"
+ exit 1
+}
+
+create_small_file() {
+ dd if=/dev/urandom of=$WDIR/smallfile bs=4096 count=4 > /dev/null 2>&1
+ dd if=/dev/urandom of=$WDIR/smallfile bs=4096 count=4 seek=8> /dev/null 2>&1
+ date >> $WDIR/smallfile
+ sync
+}
+
+create_large_file() {
+ dd if=/dev/urandom of=$WDIR/largefile bs=4096 count=384 > /dev/null 2>&1
+ dd if=/dev/urandom of=$WDIR/largefile bs=4096 count=384 seek=640> /dev/null 2>&1
+ sync
+}
+
+create_tmpfs_file() {
+ dd if=/dev/urandom of=/tmp/tmpfile bs=4096 count=4 > /dev/null 2>&1
+ dd if=/dev/urandom of=/tmp/tmpfile bs=4096 count=4 seek=8> /dev/null 2>&1
+ date >> /tmp/tmpfile
+ sync
+}
+
+create_hugetlb_file() {
+ if mount | grep $WDIR/hugepages > /dev/null ; then
+ echo "$WDIR/hugepages already mounted"
+ else
+ mkdir -p $WDIR/hugepages 2> /dev/null
+ mount -t hugetlbfs none $WDIR/hugepages
+ if [ $? -ne 0 ] ; then
+ echo "Failed to mount hugetlbfs" >&2
+ return 1
+ fi
+ fi
+ local hptotal=$(grep HugePages_Total: /proc/meminfo | tr -s ' ' | cut -f2 -d' ')
+ if [ "$hptotal" -lt 10 ] ; then
+ echo "Hugepage pool size need to be >= 10" >&2
+ return 1
+ fi
+ ./create_hugetlbfs_file $WDIR/hugepages/file
+ if [ $? -ne 0 ] ; then
+ echo "Failed to create hugetlb file" >&2
+ return 1
+ fi
+ return 0;
+}
+
+#
+# Testcases
+#
+test_smallfile_bytemap() {
+ local exist
+ local nonexist
+ create_small_file
+
+ ./fincore -m 0x1 $WDIR/smallfile > $TMPF/$FUNCNAME
+ exist=$(cat $TMPF/$FUNCNAME | cut -f 2 | grep 1 | wc -l)
+ nonexist=$(cat $TMPF/$FUNCNAME | cut -f 2 | grep 0 | wc -l)
+ if [ "$exist" -ne 9 ] ; then
+ echo "[FAIL] $FUNCNAME: Number of on-memory pages should be 9, but got $exist"
+ return 1
+ fi
+ if [ "$nonexist" -ne 4 ] ; then
+ echo "[FAIL] $FUNCNAME: Number of hole pages should be 4, but got $nonexist"
+ return 1
+ fi
+ echo "[PASS] $FUNCNAME"
+ return 0
+}
+
+test_smallfile_pfn() {
+ local exist
+ local nonexist
+ create_small_file
+
+ ./fincore -m 0x2 $WDIR/smallfile > $TMPF/$FUNCNAME
+ exist=$(cat $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+ nonexist=$(cat $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+ if [ "$exist" -ne 9 ] ; then
+ echo "[FAIL] $FUNCNAME: Number of on-memory pages should be 9, but got $exist"
+ return 1
+ fi
+ if [ "$nonexist" -ne 4 ] ; then
+ echo "[FAIL] $FUNCNAME: Number of hole pages should be 4, but got $nonexist"
+ return 1
+ fi
+ echo "[PASS] $FUNCNAME"
+ return 0
+}
+
+test_smallfile_multientry() {
+ local exist
+ local nonexist
+ create_small_file
+
+ ./fincore -m 0xe $WDIR/smallfile > $TMPF/$FUNCNAME
+ exist=$(cat $TMPF/$FUNCNAME | cut -f 2,3,4 | grep -vP "0x0\t0x0\t0x0" | wc -l)
+ nonexist=$(cat $TMPF/$FUNCNAME | cut -f 2,3,4 | grep -P "0x0\t0x0\t0x0" | wc -l)
+ if [ "$exist" -ne 9 ] ; then
+ echo "[FAIL] $FUNCNAME: Number of on-memory pages should be 9, but got $exist"
+ return 1
+ fi
+ if [ "$nonexist" -ne 4 ] ; then
+ echo "[FAIL] $FUNCNAME: Number of hole pages should be 4, but got $nonexist"
+ return 1
+ fi
+ echo "[PASS] $FUNCNAME"
+ return 0
+}
+
+test_smallfile_pfn_skiphole() {
+ local exist
+ local nonexist
+ create_small_file
+
+ ./fincore -m 0x12 $WDIR/smallfile > $TMPF/$FUNCNAME
+ exist=$(cat $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+ nonexist=$(cat $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+ if [ "$exist" -ne 9 ] ; then
+ echo "[FAIL] $FUNCNAME: Number of on-memory pages should be 9, but got $exist"
+ return 1
+ fi
+ if [ "$nonexist" -ne 0 ] ; then
+ echo "[FAIL] $FUNCNAME: Number of hole pages should be 0, but got $nonexist"
+ return 1
+ fi
+ echo "[PASS] $FUNCNAME"
+ return 0
+}
+
+# in-kernel function sys_fincore() repeat copy_to_user() per 256 entries,
+# so testing for large file is meaningful testcase.
+test_largefile_pfn() {
+ local exist
+ local nonexist
+ create_large_file
+
+ ./fincore -m 0x2 $WDIR/largefile > $TMPF/$FUNCNAME
+ exist=$(cat $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+ nonexist=$(cat $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+ if [ "$exist" -ne 768 ] ; then
+ echo "[FAIL] $FUNCNAME: Number of on-memory pages should be 768, but got $exist"
+ return 1
+ fi
+ if [ "$nonexist" -ne 256 ] ; then
+ echo "[FAIL] $FUNCNAME: Number of hole pages should be 256, but got $nonexist"
+ return 1
+ fi
+ echo "[PASS] $FUNCNAME"
+ return 0
+}
+
+test_largefile_pfn_offset() {
+ local exist
+ local nonexist
+ create_large_file
+
+ ./fincore -m 0x2 -s 0x80000 $WDIR/largefile > $TMPF/$FUNCNAME
+ exist=$(cat $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+ nonexist=$(cat $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+ if [ "$exist" -ne 640 ] ; then
+ echo "[FAIL] $FUNCNAME: Number of on-memory pages should be 640, but got $exist"
+ return 1
+ fi
+ if [ "$nonexist" -ne 256 ] ; then
+ echo "[FAIL] $FUNCNAME: Number of hole pages should be 256, but got $nonexist"
+ return 1
+ fi
+ echo "[PASS] $FUNCNAME"
+ return 0
+}
+
+test_largefile_pfn_overrun() {
+ local exist
+ local nonexist
+ create_large_file
+
+ ./fincore -m 0x2 -s 0x80000 -l 0x400000 $WDIR/largefile > $TMPF/$FUNCNAME
+ exist=$(cat $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+ nonexist=$(cat $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+ if [ "$exist" -ne 640 ] ; then
+ echo "[FAIL] $FUNCNAME: Number of on-memory pages should be 640, but got $exist"
+ return 1
+ fi
+ if [ "$nonexist" -ne 256 ] ; then
+ echo "[FAIL] $FUNCNAME: Number of hole pages should be 256, but got $nonexist"
+ return 1
+ fi
+ echo "[PASS] $FUNCNAME"
+ return 0
+}
+
+test_largefile_pfn_skiphole() {
+ local exist
+ local nonexist
+ create_large_file
+
+ ./fincore -m 0x12 -s 0x100000 -l 0x102000 $WDIR/largefile > $TMPF/$FUNCNAME
+ exist=$(cat $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+ nonexist=$(cat $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+ if [ "$exist" -ne 258 ] ; then
+ echo "[FAIL] $FUNCNAME: Number of on-memory pages should be 258, but got $exist"
+ return 1
+ fi
+ if [ "$nonexist" -ne 0 ] ; then
+ echo "[FAIL] $FUNCNAME: Number of hole pages should be 0, but got $nonexist"
+ return 1
+ fi
+ echo "[PASS] $FUNCNAME"
+ return 0
+}
+
+test_tmpfs_pfn() {
+ local exist
+ local nonexist
+ create_tmpfs_file
+
+ ./fincore -m 0x2 /tmp/tmpfile > $TMPF/$FUNCNAME
+ exist=$(cat $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+ nonexist=$(cat $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+ if [ "$exist" -ne 9 ] ; then
+ echo "[FAIL] $FUNCNAME: Number of on-memory pages should be 9, but got $exist"
+ return 1
+ fi
+ if [ "$nonexist" -ne 4 ] ; then
+ echo "[FAIL] $FUNCNAME: Number of hole pages should be 4, but got $nonexist"
+ return 1
+ fi
+ echo "[PASS] $FUNCNAME"
+ return 0
+}
+
+test_hugetlb_pfn() {
+ local exist
+ local nonexist
+ local exitcode=0
+ create_hugetlb_file
+ if [ $? -ne 0 ] ; then
+ echo "[FAIL] $FUNCNAME: fail to create a file on hugetlbfs"
+ return 1
+ fi
+ local hugepagesize=$[$(cat /proc/meminfo | grep Hugepagesize: | tr -s ' ' | cut -f2 -d' ') * 1024]
+ ./fincore -p $hugepagesize -m 0x2 $WDIR/hugepages/file > $TMPF/$FUNCNAME
+ exist=$(cat $TMPF/$FUNCNAME | cut -f 2 | grep -v 0x0 | wc -l)
+ nonexist=$(cat $TMPF/$FUNCNAME | cut -f 2 | grep 0x0 | wc -l)
+ rm -rf $WDIR/hugepages/file
+ umount $WDIR/hugepages
+ if [ "$exist" -ne 6 ] ; then
+ echo "[FAIL] $FUNCNAME: Number of on-memory pages should be 6, but got $exist"
+ return 1
+ fi
+ if [ "$nonexist" -ne 4 ] ; then
+ echo "[FAIL] $FUNCNAME: Number of hole pages should be 4, but got $nonexist"
+ return 1
+ fi
+ echo "[PASS] $FUNCNAME"
+}
+
+test_unaligned_start_address() {
+ create_small_file
+ ./fincore -m 0x2 -s 0x30 $WDIR/smallfile > $TMPF/$FUNCNAME
+ if [ $? -eq 0 ] ; then
+ echo "[FAIL] $FUNCNAME: fincore should fail for unaligned start address"
+ return 1
+ fi
+ echo "[PASS] $FUNCNAME"
+}
+
+test_unaligned_start_address_hugetlb() {
+ local exist
+ local nonexist
+ local exitcode=0
+ create_hugetlb_file
+ if [ $? -ne 0 ] ; then
+ echo "[FAIL] $FUNCNAME: fail to create a file on hugetlbfs"
+ return 1
+ fi
+ local hugepagesize=$[$(cat /proc/meminfo | grep Hugepagesize: | tr -s ' ' | cut -f2 -d' ') * 1024]
+ ./fincore -p $hugepagesize -m 0x2 -s 0x1000 $WDIR/hugepages/file > $TMPF/$FUNCNAME
+ if [ $? -eq 0 ] ; then
+ echo "[FAIL] $FUNCNAME: fincore should fail for page-unaligned start address"
+ return 1
+ fi
+ ./fincore -p $hugepagesize -m 0x2 -s $hugepagesize $WDIR/hugepages/file > $TMPF/$FUNCNAME
+ if [ $? -ne 0 ] ; then
+ echo "[FAIL] $FUNCNAME: fincore should pass for hugepage-aligned start address"
+ return 1
+ fi
+ echo "[PASS] $FUNCNAME"
+}
+
+test_invalid_mode() {
+ create_small_file
+ ./fincore -m 0x0 $WDIR/smallfile > $TMPF/$FUNCNAME
+ if [ $? -eq 0 ] ; then
+ echo "[FAIL] $FUNCNAME: mode == NULL is invalid mode"
+ return 1
+ fi
+ ./fincore -m 0x3 $WDIR/smallfile > $TMPF/$FUNCNAME
+ if [ $? -eq 0 ] ; then
+ echo "[FAIL] $FUNCNAME: mode == (FINCORE_BMAP|FINCORE_PFN) is invalid mode"
+ return 1
+ fi
+ ./fincore -m 0x11 $WDIR/smallfile > $TMPF/$FUNCNAME
+ if [ $? -eq 0 ] ; then
+ echo "[FAIL] $FUNCNAME: mode == (FINCORE_BMAP|FINCORE_SKIP_HOLE) is invalid mode"
+ return 1
+ fi
+ ./fincore -m 0x12 $WDIR/smallfile > $TMPF/$FUNCNAME
+ if [ $? -ne 0 ] ; then
+ echo "[FAIL] $FUNCNAME: mode == (FINCORE_PFN|FINCORE_SKIP_HOLE) is valid mode"
+ return 1
+ fi
+ ./fincore -m 0x10 $WDIR/smallfile > $TMPF/$FUNCNAME
+ if [ $? -eq 0 ] ; then
+ echo "[FAIL] $FUNCNAME: mode == (FINCORE_SKIP_HOLE) is invalid mode"
+ return 1
+ fi
+ ./fincore -m 0x1002 $WDIR/smallfile > $TMPF/$FUNCNAME
+ if [ $? -eq 0 ] ; then
+ echo "[FAIL] $FUNCNAME: mode == (Unknown|FINCORE_PFN) is invalid mode"
+ return 1
+ fi
+ echo "[PASS] $FUNCNAME"
+}
+
+test_unaligned_start_address || abort
+test_unaligned_start_address_hugetlb || abort
+test_invalid_mode || abort
+test_smallfile_bytemap || abort
+test_smallfile_pfn || abort
+test_smallfile_multientry || abort
+test_largefile_pfn || abort
+test_largefile_pfn_offset || abort
+test_largefile_pfn_overrun || abort
+test_tmpfs_pfn || abort
+test_hugetlb_pfn || abort
+test_largefile_pfn_skiphole || abort
+test_smallfile_pfn_skiphole || abort
+
+# cleanup
+rm -rf $WDIR
+
+exit 0
--
1.9.3

2014-06-02 05:25:59

by Naoya Horiguchi

[permalink] [raw]
Subject: [PATCH 1/3] replace PAGECACHE_TAG_* definition with enumeration

We need number of pagecache tags in later patches, this patch prepares it.

Signed-off-by: Naoya Horiguchi <[email protected]>
---
include/linux/fs.h | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)

diff --git v3.15-rc7.orig/include/linux/fs.h v3.15-rc7/include/linux/fs.h
index 878031227c57..5b489df9d964 100644
--- v3.15-rc7.orig/include/linux/fs.h
+++ v3.15-rc7/include/linux/fs.h
@@ -447,9 +447,12 @@ struct block_device {
* Radix-tree tags, for tagging dirty and writeback pages within the pagecache
* radix trees
*/
-#define PAGECACHE_TAG_DIRTY 0
-#define PAGECACHE_TAG_WRITEBACK 1
-#define PAGECACHE_TAG_TOWRITE 2
+enum {
+ PAGECACHE_TAG_DIRTY,
+ PAGECACHE_TAG_WRITEBACK,
+ PAGECACHE_TAG_TOWRITE,
+ __NR_PAGECACHE_TAGS,
+};

int mapping_tagged(struct address_space *mapping, int tag);

--
1.9.3

2014-06-02 06:43:12

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm: introduce fincore()

Please also provide a man page for the system call.

I'm also very unhappy about the crazy different interpretation of the
return value depending on flags, which probably becomes more obvious if
you try to document it.

That being said I think fincore is useful, but why not stick to the
same simple interface as mincore?

Subject: Re: [PATCH 2/3] mm: introduce fincore()

Hello Naoya,

As Christoph noted, it would be best to provide some good user
documentation for this proposed system call, to aid design review.

Also, as per Documentation/SubmitChecklist, patches that change the
kernel-userspace API/ABI should CC
[email protected] (see
https://www.kernel.org/doc/man-pages/linux-api-ml.html).

Thanks,

Michael


On Mon, Jun 2, 2014 at 7:24 AM, Naoya Horiguchi
<[email protected]> wrote:
> This patch provides a new system call fincore(2), which provides mincore()-
> like information, i.e. page residency of a given file. But unlike mincore(),
> fincore() can have a mode flag and it enables us to extract more detailed
> information about page cache like pfn and page flag. This kind of information
> is very helpful for example when applications want to know the file cache
> status to control IO on their own way.
>
> Detail about the data format being passed to userspace are explained in
> inline comment, but generally in long entry format, we can choose which
> information is extraced flexibly, so you don't have to waste memory by
> extracting unnecessary information. And with FINCORE_SKIP_HOLE flag,
> we can skip hole pages (not on memory,) which makes us avoid a flood of
> meaningless zero entries when calling on extremely large (but only few
> pages of it are loaded on memory) file.
>
> Basic testset is added in a next patch on tools/testing/selftests/fincore/.
>
> [1] http://thread.gmane.org/gmane.linux.kernel/1439212/focus=1441919
>
> Signed-off-by: Naoya Horiguchi <[email protected]>
> ---
> arch/x86/syscalls/syscall_64.tbl | 1 +
> include/linux/syscalls.h | 2 +
> mm/Makefile | 2 +-
> mm/fincore.c | 362 +++++++++++++++++++++++++++++++++++++++
> 4 files changed, 366 insertions(+), 1 deletion(-)
> create mode 100644 mm/fincore.c
>
> diff --git v3.15-rc7.orig/arch/x86/syscalls/syscall_64.tbl v3.15-rc7/arch/x86/syscalls/syscall_64.tbl
> index 04376ac3d9ef..0a6b6dd77708 100644
> --- v3.15-rc7.orig/arch/x86/syscalls/syscall_64.tbl
> +++ v3.15-rc7/arch/x86/syscalls/syscall_64.tbl
> @@ -323,6 +323,7 @@
> 314 common sched_setattr sys_sched_setattr
> 315 common sched_getattr sys_sched_getattr
> 316 common renameat2 sys_renameat2
> +317 common fincore sys_fincore
>
> #
> # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git v3.15-rc7.orig/include/linux/syscalls.h v3.15-rc7/include/linux/syscalls.h
> index a4a0588c5397..d625ec9cb73d 100644
> --- v3.15-rc7.orig/include/linux/syscalls.h
> +++ v3.15-rc7/include/linux/syscalls.h
> @@ -866,4 +866,6 @@ asmlinkage long sys_process_vm_writev(pid_t pid,
> asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
> unsigned long idx1, unsigned long idx2);
> asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
> +asmlinkage long sys_fincore(int fd, loff_t start, long nr_pages,
> + int mode, unsigned char __user *vec);
> #endif
> diff --git v3.15-rc7.orig/mm/Makefile v3.15-rc7/mm/Makefile
> index b484452dac57..55e1d13ddb76 100644
> --- v3.15-rc7.orig/mm/Makefile
> +++ v3.15-rc7/mm/Makefile
> @@ -18,7 +18,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
> mm_init.o mmu_context.o percpu.o slab_common.o \
> compaction.o balloon_compaction.o vmacache.o \
> interval_tree.o list_lru.o workingset.o \
> - iov_iter.o $(mmu-y)
> + iov_iter.o fincore.o $(mmu-y)
>
> obj-y += init-mm.o
>
> diff --git v3.15-rc7.orig/mm/fincore.c v3.15-rc7/mm/fincore.c
> new file mode 100644
> index 000000000000..3fc3ef465471
> --- /dev/null
> +++ v3.15-rc7/mm/fincore.c
> @@ -0,0 +1,362 @@
> +/*
> + * fincore(2) system call
> + *
> + * Copyright (C) 2014 NEC Corporation, Naoya Horiguchi
> + */
> +
> +#include <linux/syscalls.h>
> +#include <linux/pagemap.h>
> +#include <linux/file.h>
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +#include <linux/slab.h>
> +#include <linux/hugetlb.h>
> +
> +/*
> + * You can control how the buffer in userspace is filled with this mode
> + * parameters:
> + *
> + * - FINCORE_BMAP:
> + * The page status is returned in a vector of bytes.
> + * The least significant bit of each byte is 1 if the referenced page
> + * is in memory, otherwise it is zero.
> + *
> + * - FINCORE_PFN:
> + * stores pfn, using 8 bytes.
> + *
> + * - FINCORE_PAGEFLAGS:
> + * stores page flags, using 8 bytes. See definition of KPF_* for details.
> + *
> + * - FINCORE_PAGECACHE_TAGS:
> + * stores pagecache tags, using 8 bytes. See definition of PAGECACHE_TAG_*
> + * for details.
> + *
> + * - FINCORE_SKIP_HOLE: if this flag is set, fincore() doesn't store any
> + * information about hole. Instead each records per page has the entry
> + * of page offset (using 8 bytes.) This mode is useful if we handle
> + * large file and only few pages are on memory for the file.
> + *
> + * FINCORE_BMAP shouldn't be used combined with any other flags, and returnd
> + * data in this mode is like this:
> + *
> + * page offset 0 1 2 3 4
> + * +---+---+---+---+---+
> + * | 1 | 0 | 0 | 1 | 1 | ...
> + * +---+---+---+---+---+
> + * <->
> + * 1 byte
> + *
> + * For FINCORE_PFN, page data is formatted like this:
> + *
> + * page offset 0 1 2 3 4
> + * +-------+-------+-------+-------+-------+
> + * | pfn | pfn | pfn | pfn | pfn | ...
> + * +-------+-------+-------+-------+-------+
> + * <----->
> + * 8 byte
> + *
> + * We can use multiple flags among FINCORE_(PFN|PAGEFLAGS|PAGECACHE_TAGS).
> + * For example, when the mode is FINCORE_PFN|FINCORE_PAGEFLAGS, the per-page
> + * information is stored like this:
> + *
> + * page offset 0 page offset 1 page offset 2
> + * +-------+-------+-------+-------+-------+-------+
> + * | pfn | flags | pfn | flags | pfn | flags | ...
> + * +-------+-------+-------+-------+-------+-------+
> + * <-------------> <-------------> <------------->
> + * 16 bytes 16 bytes 16 bytes
> + *
> + * When FINCORE_SKIP_HOLE is set, we ignore holes and add page offset entry
> + * (8 bytes) instead. For example, the data format of mode
> + * FINCORE_PFN|FINCORE_SKIP_HOLE is like follows:
> + *
> + * +-------+-------+-------+-------+-------+-------+
> + * | pgoff | pfn | pgoff | pfn | pgoff | pfn | ...
> + * +-------+-------+-------+-------+-------+-------+
> + * <-------------> <-------------> <------------->
> + * 16 bytes 16 bytes 16 bytes
> + */
> +#define FINCORE_BMAP 0x01 /* bytemap mode */
> +#define FINCORE_PFN 0x02
> +#define FINCORE_PAGE_FLAGS 0x04
> +#define FINCORE_PAGECACHE_TAGS 0x08
> +#define FINCORE_SKIP_HOLE 0x10
> +
> +#define FINCORE_MODE_MASK 0x1f
> +#define FINCORE_LONGENTRY_MASK (FINCORE_PFN | FINCORE_PAGE_FLAGS | \
> + FINCORE_PAGECACHE_TAGS | FINCORE_SKIP_HOLE)
> +
> +struct fincore_control {
> + int mode;
> + int width; /* width of each entry (in bytes) */
> + unsigned char *buffer;
> + long buffer_size;
> + void *cursor; /* current position on the buffer */
> + pgoff_t pgstart; /* start point of page cache scan in each run
> + * of the while loop */
> + long nr_pages; /* number of pages to be copied to userspace
> + * (decreasing while scan proceeds) */
> + long scanned_offset; /* page offset of the lastest scanned page */
> + struct address_space *mapping;
> +};
> +
> +/*
> + * TODO: doing radix_tree_tag_get() for each tag is not optimal, but no easy
> + * way without degrading finely tuned radix tree routines.
> + */
> +static unsigned long get_pagecache_tags(struct radix_tree_root *root,
> + unsigned long index)
> +{
> + int i;
> + unsigned long tags = 0;
> + for (i = 0; i < __NR_PAGECACHE_TAGS; i++)
> + if (radix_tree_tag_get(root, index, i))
> + tags |= 1 << i;
> + return tags;
> +}
> +
> +#define store_entry(fc, type, data) ({ \
> + *(type *)fc->cursor = (type)data; \
> + fc->cursor += sizeof(type); \
> +})
> +
> +/*
> + * Store page cache data to temporal buffer in the specified format depending
> + * on fincore mode.
> + */
> +static void __do_fincore(struct fincore_control *fc, struct page *page,
> + unsigned long index)
> +{
> + VM_BUG_ON(!page);
> + VM_BUG_ON((unsigned long)fc->cursor - (unsigned long)fc->buffer
> + >= fc->buffer_size);
> + if (fc->mode & FINCORE_BMAP)
> + store_entry(fc, unsigned char, PageUptodate(page));
> + else if (fc->mode & (FINCORE_LONGENTRY_MASK)) {
> + if (fc->mode & FINCORE_SKIP_HOLE)
> + store_entry(fc, unsigned long, index);
> + if (fc->mode & FINCORE_PFN)
> + store_entry(fc, unsigned long, page_to_pfn(page));
> + if (fc->mode & FINCORE_PAGE_FLAGS)
> + store_entry(fc, unsigned long, stable_page_flags(page));
> + if (fc->mode & FINCORE_PAGECACHE_TAGS)
> + store_entry(fc, unsigned long,
> + get_pagecache_tags(&fc->mapping->page_tree,
> + index));
> + }
> +}
> +
> +/*
> + * Traverse page cache tree. It's assumed that temporal buffer are zeroed
> + * in advance. Due to this, we don't have to store zero entry explicitly
> + * one-by-one and we just set fc->cursor to the position of the next
> + * on-memory page.
> + *
> + * Return value is the number of pages whose data is stored in fc->buffer.
> + */
> +static long do_fincore(struct fincore_control *fc, int nr_pages)
> +{
> + pgoff_t pgend = fc->pgstart + nr_pages;
> + struct radix_tree_iter iter;
> + void **slot;
> + long nr = 0;
> +
> + fc->cursor = fc->buffer;
> +
> + rcu_read_lock();
> +restart:
> + radix_tree_for_each_slot(slot, &fc->mapping->page_tree, &iter,
> + fc->pgstart) {
> + long jump;
> + struct page *page;
> +
> + fc->scanned_offset = iter.index;
> + /* Handle holes */
> + jump = iter.index - fc->pgstart - nr;
> + if (jump) {
> + if (!(fc->mode & FINCORE_SKIP_HOLE)) {
> + if (iter.index < pgend) {
> + fc->cursor += jump * fc->width;
> + nr = iter.index - fc->pgstart;
> + } else {
> + /*
> + * Fill remaining buffer as hole. Next
> + * call should start at offset pgend.
> + */
> + nr = nr_pages;
> + fc->scanned_offset = pgend - 1;
> + break;
> + }
> + }
> + }
> +repeat:
> + page = radix_tree_deref_slot(slot);
> + if (unlikely(!page))
> + /*
> + * No need to increment nr and fc->cursor, because next
> + * iteration should detect hole and update them there.
> + */
> + continue;
> + else if (radix_tree_exception(page)) {
> + if (radix_tree_deref_retry(page)) {
> + /*
> + * Transient condition which can only trigger
> + * when entry at index 0 moves out of or back
> + * to root: none yet gotten, safe to restart.
> + */
> + WARN_ON(iter.index);
> + goto restart;
> + }
> + __do_fincore(fc, page, iter.index);
> + } else {
> + if (!page_cache_get_speculative(page))
> + goto repeat;
> +
> + /* Has the page moved? */
> + if (unlikely(page != *slot)) {
> + page_cache_release(page);
> + goto repeat;
> + }
> +
> + __do_fincore(fc, page, iter.index);
> + page_cache_release(page);
> + }
> +
> + if (++nr == nr_pages)
> + break;
> + }
> + rcu_read_unlock();
> +
> + return nr;
> +}
> +
> +static inline bool fincore_validate_mode(int mode)
> +{
> + if (mode & ~FINCORE_MODE_MASK)
> + return false;
> + if (!(!!(mode & FINCORE_BMAP) ^ !!(mode & FINCORE_LONGENTRY_MASK)))
> + return false;
> + if ((mode & FINCORE_LONGENTRY_MASK) == FINCORE_SKIP_HOLE)
> + return false;
> + return true;
> +}
> +
> +#define FINCORE_LOOP_STEP 256L
> +
> +/*
> + * The fincore(2) system call
> + *
> + * @fd: file descriptor of the target file
> + * @start: starting address offset of the target file (in byte).
> + * This should be aligned to page cache size.
> + * @nr_pages: the number of pages from which the page data is passed to
> + * userspace. If you don't set FINCORE_SKIP_HOLE, it's identical
> + * to the pages to be scanned.
> + * @mode fincore mode flags to determine the entry's format
> + * @vec pointer of the userspace buffer. The size must be equal to or
> + * larger than (@nr_pages * width), where width is the size of
> + * each entry.
> + *
> + * fincore() returns the memory residency status and additional info (like
> + * pfn and page flags) of the given file's pages.
> + *
> + * Depending on the fincore mode, caller can receive the different formatted
> + * information. See the comment on definition of FINCORE_* above.
> + *
> + * Because the status of a page can change after fincore() checks it
> + * but before it returns to the application, the returned vector may
> + * contain stale information.
> + *
> + * return values:
> + * -EBADF: @fd isn't a valid open file descriptor
> + * -EFAULT: @vec points to an illegal address
> + * -EINVAL: @start is not aligned to page cache size, or @mode is invalid
> + * non-negative value: fincore() is successfully done and the value means
> + * the number of valid entries stored on userspace buffer
> + */
> +SYSCALL_DEFINE5(fincore, int, fd, loff_t, start, long, nr_pages,
> + int, mode, unsigned char __user *, vec)
> +{
> + long ret;
> + long step;
> + long nr = 0;
> + int pc_shift = PAGE_CACHE_SHIFT;
> + struct fd f;
> +
> + struct fincore_control fc = {
> + .mode = mode,
> + .width = sizeof(unsigned char),
> + };
> +
> + if (!fincore_validate_mode(mode))
> + return -EINVAL;
> +
> + f = fdget(fd);
> +
> + if (is_file_hugepages(f.file))
> + pc_shift = huge_page_shift(hstate_file(f.file));
> +
> + if (!IS_ALIGNED(start, 1 << pc_shift)) {
> + ret = -EINVAL;
> + goto fput;
> + }
> +
> + /*
> + * TODO: support /dev/mem, /proc/pid/mem for system/process wide
> + * page survey, which would obsolete /proc/kpageflags, and
> + * /proc/pid/pagemap.
> + */
> + if (!S_ISREG(file_inode(f.file)->i_mode)) {
> + ret = -EBADF;
> + goto fput;
> + }
> +
> + fc.pgstart = start >> pc_shift;
> + fc.nr_pages = nr_pages;
> + fc.mapping = f.file->f_mapping;
> + if (mode & FINCORE_LONGENTRY_MASK)
> + fc.width = ((mode & FINCORE_PFN ? 1 : 0) +
> + (mode & FINCORE_PAGE_FLAGS ? 1 : 0) +
> + (mode & FINCORE_PAGECACHE_TAGS ? 1 : 0) +
> + (mode & FINCORE_SKIP_HOLE ? 1 : 0)
> + ) * sizeof(unsigned long);
> + step = min(fc.nr_pages, FINCORE_LOOP_STEP);
> +
> + fc.buffer_size = step * fc.width;
> + fc.buffer = kzalloc(fc.buffer_size, GFP_TEMPORARY);
> + if (!fc.buffer) {
> + ret = -ENOMEM;
> + goto fput;
> + }
> +
> + while (true) {
> + ret = do_fincore(&fc, min(step, fc.nr_pages));
> + /* Reached the end of the file */
> + if (ret == 0) {
> + ret = nr;
> + break;
> + }
> + if (ret < 0)
> + break;
> + if (copy_to_user(vec + nr * fc.width,
> + fc.buffer, ret * fc.width)) {
> + ret = -EFAULT;
> + break;
> + }
> + fc.nr_pages -= ret;
> + fc.pgstart = fc.scanned_offset + 1;
> + nr += ret;
> + /* Completed scanning the requested numbers of pages */
> + if (fc.nr_pages == 0) {
> + ret = nr;
> + break;
> + }
> + /* Clear buffer for next do_fincore() */
> + memset(fc.buffer, 0, step * fc.width);
> + }
> +
> + kfree(fc.buffer);
> +fput:
> + fdput(f);
> + return ret;
> +}
> --
> 1.9.3
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>



--
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/

2014-06-02 12:24:15

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm: introduce fincore()

On Mon, Jun 02, 2014 at 01:24:58AM -0400, Naoya Horiguchi wrote:
> This patch provides a new system call fincore(2), which provides mincore()-
> like information, i.e. page residency of a given file. But unlike mincore(),
> fincore() can have a mode flag and it enables us to extract more detailed
> information about page cache like pfn and page flag. This kind of information
> is very helpful for example when applications want to know the file cache
> status to control IO on their own way.
>
> Detail about the data format being passed to userspace are explained in
> inline comment, but generally in long entry format, we can choose which
> information is extraced flexibly, so you don't have to waste memory by
> extracting unnecessary information. And with FINCORE_SKIP_HOLE flag,
> we can skip hole pages (not on memory,) which makes us avoid a flood of
> meaningless zero entries when calling on extremely large (but only few
> pages of it are loaded on memory) file.
>
> Basic testset is added in a next patch on tools/testing/selftests/fincore/.
>
> [1] http://thread.gmane.org/gmane.linux.kernel/1439212/focus=1441919
>
> Signed-off-by: Naoya Horiguchi <[email protected]>

...

> diff --git v3.15-rc7.orig/mm/fincore.c v3.15-rc7/mm/fincore.c
> new file mode 100644
> index 000000000000..3fc3ef465471
> --- /dev/null
> +++ v3.15-rc7/mm/fincore.c
> @@ -0,0 +1,362 @@
> +/*
> + * fincore(2) system call
> + *
> + * Copyright (C) 2014 NEC Corporation, Naoya Horiguchi
> + */
> +
> +#include <linux/syscalls.h>
> +#include <linux/pagemap.h>
> +#include <linux/file.h>
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +#include <linux/slab.h>
> +#include <linux/hugetlb.h>
> +
> +/*
> + * You can control how the buffer in userspace is filled with this mode
> + * parameters:
> + *
> + * - FINCORE_BMAP:
> + * The page status is returned in a vector of bytes.
> + * The least significant bit of each byte is 1 if the referenced page
> + * is in memory, otherwise it is zero.

I'm okay with bytemap. Just wounder why not bitmap?

> + *
> + * - FINCORE_PFN:
> + * stores pfn, using 8 bytes.
> + *
> + * - FINCORE_PAGEFLAGS:
> + * stores page flags, using 8 bytes. See definition of KPF_* for details.
> + *
> + * - FINCORE_PAGECACHE_TAGS:
> + * stores pagecache tags, using 8 bytes. See definition of PAGECACHE_TAG_*
> + * for details.

Is it safe to expose this info to unprivilaged process (consider all three
flags above)?

> + * - FINCORE_SKIP_HOLE: if this flag is set, fincore() doesn't store any
> + * information about hole. Instead each records per page has the entry
> + * of page offset (using 8 bytes.) This mode is useful if we handle
> + * large file and only few pages are on memory for the file.

Hm.. It's probably overkill, but instead of filling userspace buffer we
could return file descriptor and define lseek(SEEK_HOLE). Just thinking.

> + *
> + * FINCORE_BMAP shouldn't be used combined with any other flags, and returnd
> + * data in this mode is like this:
> + *
> + * page offset 0 1 2 3 4
> + * +---+---+---+---+---+
> + * | 1 | 0 | 0 | 1 | 1 | ...
> + * +---+---+---+---+---+
> + * <->
> + * 1 byte
> + *
> + * For FINCORE_PFN, page data is formatted like this:
> + *
> + * page offset 0 1 2 3 4
> + * +-------+-------+-------+-------+-------+
> + * | pfn | pfn | pfn | pfn | pfn | ...
> + * +-------+-------+-------+-------+-------+
> + * <----->
> + * 8 byte
> + *
> + * We can use multiple flags among FINCORE_(PFN|PAGEFLAGS|PAGECACHE_TAGS).
> + * For example, when the mode is FINCORE_PFN|FINCORE_PAGEFLAGS, the per-page
> + * information is stored like this:
> + *
> + * page offset 0 page offset 1 page offset 2
> + * +-------+-------+-------+-------+-------+-------+
> + * | pfn | flags | pfn | flags | pfn | flags | ...
> + * +-------+-------+-------+-------+-------+-------+
> + * <-------------> <-------------> <------------->
> + * 16 bytes 16 bytes 16 bytes
> + *
> + * When FINCORE_SKIP_HOLE is set, we ignore holes and add page offset entry
> + * (8 bytes) instead. For example, the data format of mode
> + * FINCORE_PFN|FINCORE_SKIP_HOLE is like follows:
> + *
> + * +-------+-------+-------+-------+-------+-------+
> + * | pgoff | pfn | pgoff | pfn | pgoff | pfn | ...
> + * +-------+-------+-------+-------+-------+-------+
> + * <-------------> <-------------> <------------->
> + * 16 bytes 16 bytes 16 bytes
> + */
> +#define FINCORE_BMAP 0x01 /* bytemap mode */
> +#define FINCORE_PFN 0x02
> +#define FINCORE_PAGE_FLAGS 0x04
> +#define FINCORE_PAGECACHE_TAGS 0x08
> +#define FINCORE_SKIP_HOLE 0x10

FINCORE_SKIP_HOLE is greater then FINCORE_PFN but pgoff precedes pfn in
records. It's confusing. We need clear definition of record format.

What about rename FINCORE_SKIP_HOLE -> FINCORE_PGOFF, move it before
FINCORE_PFN. So FINCORE_PGOFF is less than FINCORE_PFN, which is less than
FINCORE_PAGE_FLAGS, which is less than FINCORE_PAGECACHE_TAGS. It matches
order in records:

FINCORE_PGOFF|FINCORE_PFN|FINCORE_PAGEFLAGS|FINCORE_PAGECACHE_TAGS

+-------+-------+-------+-------+-------+-------+-------+-------+
| pgoff | pfn | flags | tags | pgoff | pfn | flags | tags | ...
+-------+-------+-------+-------+-------+-------+-------+-------+
<-----------------------------> <------------------------------>
32 bytes 32 bytes

> +
> +#define FINCORE_MODE_MASK 0x1f
> +#define FINCORE_LONGENTRY_MASK (FINCORE_PFN | FINCORE_PAGE_FLAGS | \
> + FINCORE_PAGECACHE_TAGS | FINCORE_SKIP_HOLE)
> +
--
Kirill A. Shutemov

2014-06-02 16:11:40

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm: introduce fincore()

On 06/01/2014 10:24 PM, Naoya Horiguchi wrote:
> Detail about the data format being passed to userspace are explained in
> inline comment, but generally in long entry format, we can choose which
> information is extraced flexibly, so you don't have to waste memory by
> extracting unnecessary information. And with FINCORE_SKIP_HOLE flag,
> we can skip hole pages (not on memory,) which makes us avoid a flood of
> meaningless zero entries when calling on extremely large (but only few
> pages of it are loaded on memory) file.

Something similar could be useful for hugetlbfs too. For a 1GB page,
it's pretty silly to do 2^18 entries which essentially repeat the same
data in an interface like this.

2014-06-02 16:12:30

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 1/3] replace PAGECACHE_TAG_* definition with enumeration

On 06/01/2014 10:24 PM, Naoya Horiguchi wrote:
> -#define PAGECACHE_TAG_DIRTY 0
> -#define PAGECACHE_TAG_WRITEBACK 1
> -#define PAGECACHE_TAG_TOWRITE 2
> +enum {
> + PAGECACHE_TAG_DIRTY,
> + PAGECACHE_TAG_WRITEBACK,
> + PAGECACHE_TAG_TOWRITE,
> + __NR_PAGECACHE_TAGS,
> +};

Doesn't this end up exposing kernel-internal values out to a userspace
interface? Wouldn't that lock these values in to the ABI?

2014-06-02 16:48:06

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 1/3] replace PAGECACHE_TAG_* definition with enumeration

On 06/02/2014 09:37 AM, Naoya Horiguchi wrote:
> On Mon, Jun 02, 2014 at 09:12:25AM -0700, Dave Hansen wrote:
>> > On 06/01/2014 10:24 PM, Naoya Horiguchi wrote:
>>> > > -#define PAGECACHE_TAG_DIRTY 0
>>> > > -#define PAGECACHE_TAG_WRITEBACK 1
>>> > > -#define PAGECACHE_TAG_TOWRITE 2
>>> > > +enum {
>>> > > + PAGECACHE_TAG_DIRTY,
>>> > > + PAGECACHE_TAG_WRITEBACK,
>>> > > + PAGECACHE_TAG_TOWRITE,
>>> > > + __NR_PAGECACHE_TAGS,
>>> > > +};
>> >
>> > Doesn't this end up exposing kernel-internal values out to a userspace
>> > interface? Wouldn't that lock these values in to the ABI?
> Yes, that would. I hope these PAGECACHE_TAG_* stuff is very basic
> things and will never change drastically in the future (only added),
> so it's unlikely to bother people about ABI breakage things.

OK, so if I'm writing a userspace program, which header do I include
pull these values in to my program?

2014-06-02 18:20:05

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 1/3] replace PAGECACHE_TAG_* definition with enumeration

On 06/02/2014 10:14 AM, Naoya Horiguchi wrote:
> Yes, that's necessary to consider (but I haven't done, sorry),
> so I'm thinking of moving this definition to the new file
> include/uapi/linux/pagecache.h and let it be imported from the
> userspace programs. Is it fine?

Yep, although I'd probably also explicitly separate the definitions of
the user-exposed ones from the kernel-internal ones. We want to make
this hard to screw up.

I can see why we might want to expose dirty and writeback out to
userspace, especially since we already expose the aggregate, system-wide
view in /proc/meminfo. But, what about PAGECACHE_TAG_TOWRITE? I really
can't think of a good reason why userspace would ever care about it or
consider it different from PAGECACHE_TAG_DIRTY.

2014-06-02 21:17:00

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 1/3] replace PAGECACHE_TAG_* definition with enumeration

On Mon, 02 Jun 2014 09:12:25 -0700 Dave Hansen <[email protected]> wrote:

> On 06/01/2014 10:24 PM, Naoya Horiguchi wrote:
> > -#define PAGECACHE_TAG_DIRTY 0
> > -#define PAGECACHE_TAG_WRITEBACK 1
> > -#define PAGECACHE_TAG_TOWRITE 2
> > +enum {
> > + PAGECACHE_TAG_DIRTY,
> > + PAGECACHE_TAG_WRITEBACK,
> > + PAGECACHE_TAG_TOWRITE,
> > + __NR_PAGECACHE_TAGS,
> > +};
>
> Doesn't this end up exposing kernel-internal values out to a userspace
> interface? Wouldn't that lock these values in to the ABI?

Yes, we should be careful here. We should not do anything which
constrains future kernel code or which causes any form of
compatibility/migration issues.

I wonder if we can do something smart with the interface. For example
when userspace calls sys_fincore() it must explicitly ask for
PAGECACHE_TAG_DIRTY and if some future kernel doesn't implement
PAGECACHE_TAG_DIRTY, it can return -EINVAL.

Or maybe it can succeed, but tells userspace "you didn't get
PAGECACHE_TAG_DIRTY".

<thinking out loud>

So userspace sends a mask of bits which select what fields it wants.
The kernel returns a mask of bits which tell userspace what it actually
received.

Or something like that - you get the idea ;)