2012-02-12 00:23:32

by Andrea Righi

Subject: [RFC] [PATCH v5 0/3] fadvise: support POSIX_FADV_NOREUSE

There were some problems reported in the past about backup software (e.g.,
rsync) trashing the page cache when it touches a huge amount of pages (see for
example [1]).

This problem was mitigated by Minchan's patch [2] and a proper use of
fadvise() in the backup software. For example, the patch set [3] has been
proposed for inclusion in rsync.

However, other trashing problems remain: when the backup software reads all
the source files, some of them may be part of the actual working set of the
system. If a POSIX_FADV_DONTNEED is performed, _all_ pages are evicted from
the page cache: both the working set and the use-once pages touched only by
the backup.

A previous proposal [4] was to use the POSIX_FADV_NOREUSE hint, currently
implemented as a no-op, to provide a page invalidation policy less aggressive
than POSIX_FADV_DONTNEED (by moving active pages to the tail of the inactive
list, and dropping the pages from the inactive list).

However, as correctly pointed out by KOSAKI, this behavior is very different
from the POSIX definition:

  POSIX_FADV_NOREUSE
      Specifies that the application expects to access the specified data
      once and then not reuse it thereafter.

The new proposal is to implement POSIX_FADV_NOREUSE as a way to perform a real
drop-behind policy where applications can mark certain intervals of a file as
FADV_NOREUSE before accessing the data.

In this way the marked pages will never blow away the current working set. This
requirement can be satisfied by preventing lru activation of the FADV_NOREUSE
pages. Moreover, all the file cache pages in a FADV_NOREUSE range will be
immediately dropped after a read if they were not present in the cache
before. The purpose is to preserve as much as possible the previous state of
the file cache memory before reading data in ranges marked via FADV_NOREUSE.

The list of FADV_NOREUSE ranges is maintained in an interval tree [5] inside
the address_space structure (for this, a generic interval tree implementation
is also provided by PATCH 1/3). The intervals are dropped when the pages are
also dropped from the page cache or when a different caching hint is specified
on the same intervals.

Example:

How can this solve the "backup is trashing my page cache" issue?

The "backup" software can be changed as following:

- before:
- fd_src = open /path/very_large_file
- fd_dst = open /backup/very_large_file
- read from fd_src
- write to fd_dst
- close fd_src and fd_dst

- after (see the C sketch below):
- fd_src = open /path/very_large_file
- fd_dst = open /backup/very_large_file
- posix_fadvise(fd_src, start, end, POSIX_FADV_NOREUSE)
- read from fd_src (even multiple times)
- write to fd_dst
- posix_fadvise(fd_src, start, end, POSIX_FADV_NORMAL)
- posix_fadvise(fd_dst, start, end, POSIX_FADV_DONTNEED)
- close fd_src and fd_dst
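
For concreteness, here is a minimal userspace sketch of the "after" sequence
(hypothetical code, not taken from the rsync patch set [3]; error handling is
mostly omitted). Note that len=0 in posix_fadvise() means "up to the end of
the file":

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>

int main(void)
{
        static char buf[1 << 16];
        ssize_t n;
        int fd_src = open("/path/very_large_file", O_RDONLY);
        int fd_dst = open("/backup/very_large_file",
                          O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd_src < 0 || fd_dst < 0)
                exit(1);

        /* mark the whole source as use-once _before_ touching the data */
        posix_fadvise(fd_src, 0, 0, POSIX_FADV_NOREUSE);

        while ((n = read(fd_src, buf, sizeof(buf))) > 0)
                write(fd_dst, buf, (size_t)n);

        /* restore the normal caching policy on the source... */
        posix_fadvise(fd_src, 0, 0, POSIX_FADV_NORMAL);

        /* ...and drop the destination copy; fsync() first, so that the
         * dirty pages can actually be invalidated by DONTNEED */
        fsync(fd_dst);
        posix_fadvise(fd_dst, 0, 0, POSIX_FADV_DONTNEED);

        close(fd_src);
        close(fd_dst);
        return 0;
}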

Simple test case:

- read a chunk of 256MB from the middle of a large file (this represents the
working set)

- simulate a backup-like software reading the whole file using
POSIX_FADV_NORMAL, or POSIX_FADV_DONTNEED or POSIX_FADV_NOREUSE

- re-read the same 256MB chunk from the middle of the file and measure
the time and the bios submitted (perf stat -e block:block_bio_queue)

The test is done starting both with the working set in the inactive lru list
(the 256MB chunk read once at the beginning of the test) and the working set in
the active lru list (the 256MB chunk read twice at the beginning of the test).
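
The working-set part of the test can be reproduced with a small program along
the lines of the following sketch (hypothetical, the actual test tool is not
included in this post); the backup-like pass and the perf measurement are
driven externally:

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>

#define CHUNK_SIZE (256UL << 20) /* the 256MB working set */

int main(int argc, char **argv)
{
        static char buf[1 << 20];
        size_t done = 0;
        int fd;

        if (argc < 2)
                exit(1);
        fd = open(argv[1], O_RDONLY);
        if (fd < 0)
                exit(1);

        /* read CHUNK_SIZE bytes starting from the middle of the file;
         * run it once for the inactive lru case, twice for the active one */
        lseek(fd, lseek(fd, 0, SEEK_END) / 2, SEEK_SET);
        while (done < CHUNK_SIZE) {
                ssize_t n = read(fd, buf, sizeof(buf));
                if (n <= 0)
                        break;
                done += (size_t)n;
        }
        close(fd);
        return 0;
}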

Results:

inact_working_set = the working set is all in the inactive lru list
act_working_set = the working set is all in the active lru list

- block:block_bio_queue (events)
                      inact_working_set    act_working_set
POSIX_FADV_NORMAL                 2,052                  0
POSIX_FADV_DONTNEED                0(*)              2,048
POSIX_FADV_NOREUSE                    0                  0

- elapsed time (sec)
                      inact_working_set    act_working_set
POSIX_FADV_NORMAL                 1.013              0.070
POSIX_FADV_DONTNEED            0.070(*)              1.006
POSIX_FADV_NOREUSE                0.070              0.070

(*) With POSIX_FADV_DONTNEED I would expect to see the working set pages
dropped from the page cache when it starts in the inactive lru list.

Instead, this happens only when we start with the working set all in the active
list. IIUC this happens because when we re-read the pages the second time
they're moved to the per-cpu vector activate_page_pvecs; if the page vector is
not yet drained when we issue the POSIX_FADV_DONTNEED, the advice is ignored.

Anyway, for this particular test case the best solution to preserve the state
of the page cache is obviously the usage of POSIX_FADV_NOREUSE.

Regression test:

for i in `seq 1 10`; do
fio --name=fio --directory=. --rw=read --bs=4K --size=1G --numjobs=4 | grep READ:
done

- before:
READ: io=4096.0MB, aggrb=244737KB/s, minb=62652KB/s, maxb=63474KB/s, mint=16916msec, maxt=17138msec
READ: io=4096.0MB, aggrb=242263KB/s, minb=62019KB/s, maxb=64380KB/s, mint=16678msec, maxt=17313msec
READ: io=4096.0MB, aggrb=241913KB/s, minb=61929KB/s, maxb=62766KB/s, mint=17107msec, maxt=17338msec
READ: io=4096.0MB, aggrb=244337KB/s, minb=62550KB/s, maxb=63776KB/s, mint=16836msec, maxt=17166msec
READ: io=4096.0MB, aggrb=242768KB/s, minb=62148KB/s, maxb=62517KB/s, mint=17175msec, maxt=17277msec
READ: io=4096.0MB, aggrb=242796KB/s, minb=62155KB/s, maxb=63191KB/s, mint=16992msec, maxt=17275msec
READ: io=4096.0MB, aggrb=244352KB/s, minb=62554KB/s, maxb=63392KB/s, mint=16938msec, maxt=17165msec
READ: io=4096.0MB, aggrb=242011KB/s, minb=61954KB/s, maxb=62368KB/s, mint=17216msec, maxt=17331msec
READ: io=4096.0MB, aggrb=241676KB/s, minb=61869KB/s, maxb=63738KB/s, mint=16846msec, maxt=17355msec
READ: io=4096.0MB, aggrb=242319KB/s, minb=62033KB/s, maxb=63362KB/s, mint=16946msec, maxt=17309msec

avg aggrb = 242917KB/s, avg mint = 16965msec, avg maxt = 17267msec

- after:
READ: io=4096.0MB, aggrb=243968KB/s, minb=62455KB/s, maxb=63306KB/s, mint=16961msec, maxt=17192msec
READ: io=4096.0MB, aggrb=242979KB/s, minb=62202KB/s, maxb=63127KB/s, mint=17009msec, maxt=17262msec
READ: io=4096.0MB, aggrb=242473KB/s, minb=62073KB/s, maxb=62285KB/s, mint=17239msec, maxt=17298msec
READ: io=4096.0MB, aggrb=244494KB/s, minb=62590KB/s, maxb=63272KB/s, mint=16970msec, maxt=17155msec
READ: io=4096.0MB, aggrb=244352KB/s, minb=62554KB/s, maxb=63269KB/s, mint=16971msec, maxt=17165msec
READ: io=4096.0MB, aggrb=241969KB/s, minb=61944KB/s, maxb=63444KB/s, mint=16924msec, maxt=17334msec
READ: io=4096.0MB, aggrb=243303KB/s, minb=62285KB/s, maxb=62543KB/s, mint=17168msec, maxt=17239msec
READ: io=4096.0MB, aggrb=243232KB/s, minb=62267KB/s, maxb=63109KB/s, mint=17014msec, maxt=17244msec
READ: io=4096.0MB, aggrb=241969KB/s, minb=61944KB/s, maxb=62652KB/s, mint=17138msec, maxt=17334msec
READ: io=4096.0MB, aggrb=241649KB/s, minb=61862KB/s, maxb=62616KB/s, mint=17148msec, maxt=17357msec

avg aggrb = 243038KB/s, avg mint = 17054msec, avg maxt = 17258msec

No obvious performance regression was found according to this simple test.

Credits:
- Some of the routines to implement the generic interval tree have been taken
from the x86 PAT code, which uses interval trees to keep track of PAT ranges
(in the future it would be interesting to convert the x86 PAT code to
use the generic interval tree implementation as well).

- The idea to store the FADV_NOREUSE intervals into the address_space
structure has been inspired by John's POSIX_FADV_VOLATILE patch [6].

References:
[1] http://marc.info/?l=rsync&m=128885034930933&w=2
[2] https://lkml.org/lkml/2011/2/20/57
[3] http://lists.samba.org/archive/rsync/2010-November/025827.html
[4] http://thread.gmane.org/gmane.linux.kernel.mm/65493
[5] http://en.wikipedia.org/wiki/Interval_tree
[6] http://thread.gmane.org/gmane.linux.kernel/1218654

ChangeLog v4 -> v5:
- completely new redesign: implement the expected drop-behind policy
maintaining the list of FADV_NOREUSE ranges inside the file

[PATCH v5 1/3] kinterval: routines to manipulate generic intervals
[PATCH v5 2/3] mm: filemap: introduce mark_page_usedonce
[PATCH v5 3/3] fadvise: implement POSIX_FADV_NOREUSE

fs/inode.c | 3 +
include/linux/fs.h | 12 ++
include/linux/kinterval.h | 126 ++++++++++++
include/linux/swap.h | 1 +
lib/Makefile | 2 +-
lib/kinterval.c | 483 +++++++++++++++++++++++++++++++++++++++++++++
mm/fadvise.c | 18 ++-
mm/filemap.c | 95 +++++++++-
mm/swap.c | 24 +++
9 files changed, 760 insertions(+), 4 deletions(-)


2012-02-12 00:23:37

by Andrea Righi

Subject: [PATCH v5 1/3] kinterval: routines to manipulate generic intervals

Add a generic infrastructure to efficiently keep track of intervals.

An interval is represented by a triplet (start, end, type). The values
(start, end) define the bounds of the range. The type is a generic
property associated to the interval. The interpretation of the type is
left to the user.

Multiple intervals associated to the same object are stored in an
interval tree (augmented rbtree) [1], with tree ordered on starting
address. The tree cannot contain multiple entries of different
interval types which overlap; in case of overlapping intervals new
inserted intervals overwrite the old ones (completely or in part, in the
second case the old interval is shrunk or split accordingly).
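
For illustration, here is a minimal (hypothetical) usage sketch of the API
introduced by this patch; all locking is left to the caller:

#include <linux/kinterval.h>
#include <linux/gfp.h>

static DEFINE_KINTERVAL_TREE(range_tree);

static int kinterval_example(void)
{
        long type;
        int ret;

        /* mark the inclusive range [0, 4095] with type 1 */
        ret = kinterval_add(&range_tree, 0, 4095, 1, GFP_KERNEL);
        if (ret < 0)
                return ret;

        /* overlapping inserts overwrite: [2048, 8191] gets type 2 and
         * the old interval is shrunk to [0, 2047] */
        ret = kinterval_add(&range_tree, 2048, 8191, 2, GFP_KERNEL);
        if (ret < 0)
                return ret;

        type = kinterval_lookup(&range_tree, 1024);  /* returns 1 */
        type = kinterval_lookup(&range_tree, 4096);  /* returns 2 */
        type = kinterval_lookup(&range_tree, 16384); /* returns -ENOENT */
        (void)type;

        kinterval_clear(&range_tree);
        return 0;
}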

Reference:
[1] "Introduction to Algorithms" by Cormen, Leiserson, Rivest and Stein

Signed-off-by: Andrea Righi <[email protected]>
---
include/linux/kinterval.h | 126 ++++++++++++
lib/Makefile | 2 +-
lib/kinterval.c | 483 +++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 610 insertions(+), 1 deletions(-)
create mode 100644 include/linux/kinterval.h
create mode 100644 lib/kinterval.c

diff --git a/include/linux/kinterval.h b/include/linux/kinterval.h
new file mode 100644
index 0000000..8152265
--- /dev/null
+++ b/include/linux/kinterval.h
@@ -0,0 +1,126 @@
+/*
+ * kinterval.h - Routines for manipulating generic intervals
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ *
+ * Copyright (C) 2012 Andrea Righi <[email protected]>
+ */
+
+#ifndef _LINUX_KINTERVAL_H
+#define _LINUX_KINTERVAL_H
+
+#include <linux/types.h>
+#include <linux/rbtree.h>
+
+/**
+ * struct kinterval - define a range in an interval tree
+ * @start: address representing the start of the range.
+ * @end: address representing the end of the range.
+ * @subtree_max_end: augmented rbtree data to perform quick lookup of the
+ * overlapping ranges.
+ * @type: type of the interval (defined by the user).
+ * @rb: the rbtree node.
+ */
+struct kinterval {
+ u64 start;
+ u64 end;
+ u64 subtree_max_end;
+ unsigned long type;
+ struct rb_node rb;
+};
+
+/**
+ * DECLARE_KINTERVAL_TREE - macro to declare an interval tree
+ * @__name: name of the declared interval tree.
+ *
+ * The tree is an interval tree (augmented rbtree) with tree ordered
+ * on starting address. The tree cannot contain multiple entries for different
+ * ranges which overlap; in case of overlapping ranges new inserted intervals
+ * overwrite the old ones (completely or in part, in the second case the old
+ * interval is shrunk accordingly).
+ *
+ * NOTE: all locking issues are left to the caller.
+ *
+ * Reference:
+ * "Introduction to Algorithms" by Cormen, Leiserson, Rivest and Stein.
+ */
+#define DECLARE_KINTERVAL_TREE(__name) struct rb_root __name
+
+/**
+ * DEFINE_KINTERVAL_TREE - macro to define and initialize an interval tree
+ * @__name: name of the declared interval tree.
+ */
+#define DEFINE_KINTERVAL_TREE(__name) \
+ struct rb_root __name = RB_ROOT
+
+/**
+ * INIT_KINTERVAL_TREE_ROOT - macro to initialize an interval tree
+ * @__root: root of the declared interval tree.
+ */
+#define INIT_KINTERVAL_TREE_ROOT(__root) \
+ do { \
+ (__root)->rb_node = NULL; \
+ } while (0)
+
+/**
+ * kinterval_add - define a new range into the interval tree
+ * @root: the root of the tree.
+ * @start: start of the range to define.
+ * @end: end of the range to define.
+ * @type: attribute assigned to the range.
+ * @flags: type of memory to allocate (see kcalloc).
+ */
+int kinterval_add(struct rb_root *root, u64 start, u64 end,
+ long type, gfp_t flags);
+
+/**
+ * kinterval_del - erase a range from the interval tree
+ * @root: the root of the tree.
+ * @start: start of the range to erase.
+ * @end: end of the range to erase.
+ * @flags: type of memory to allocate (see kcalloc).
+ */
+int kinterval_del(struct rb_root *root, u64 start, u64 end, gfp_t flags);
+
+/**
+ * kinterval_lookup_range - return the attribute of a range
+ * @root: the root of the tree.
+ * @start: start of the range to lookup.
+ * @end: end of the range to lookup.
+ *
+ * NOTE: returns the type of the lowest match; if the range specified by the
+ * arguments overlaps multiple intervals, only the type of the first one
+ * (lowest) is returned.
+ */
+long kinterval_lookup_range(struct rb_root *root, u64 start, u64 end);
+
+/**
+ * kinterval_lookup - return the attribute of an address
+ * @root: the root of the tree.
+ * @addr: address to lookup.
+ */
+static inline long kinterval_lookup(struct rb_root *root, u64 addr)
+{
+ return kinterval_lookup_range(root, addr, addr);
+}
+
+/**
+ * kinterval_clear - erase all intervals defined in an interval tree
+ * @root: the root of the tree.
+ */
+void kinterval_clear(struct rb_root *root);
+
+#endif /* _LINUX_KINTERVAL_H */
diff --git a/lib/Makefile b/lib/Makefile
index 18515f0..9a648ba 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
idr.o int_sqrt.o extable.o prio_tree.o \
sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
proportions.o prio_heap.o ratelimit.o show_mem.o \
- is_single_threaded.o plist.o decompress.o
+ is_single_threaded.o plist.o decompress.o kinterval.o

lib-$(CONFIG_MMU) += ioremap.o
lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/kinterval.c b/lib/kinterval.c
new file mode 100644
index 0000000..2a9d463
--- /dev/null
+++ b/lib/kinterval.c
@@ -0,0 +1,483 @@
+/*
+ * kinterval.c - Routines for manipulating generic intervals
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ *
+ * Copyright (C) 2012 Andrea Righi <[email protected]>
+ */
+
+#include <linux/init.h>
+#include <linux/version.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/rbtree.h>
+#include <linux/kinterval.h>
+
+static struct kmem_cache *kinterval_cachep __read_mostly;
+
+static bool is_interval_overlapping(struct kinterval *node, u64 start, u64 end)
+{
+ return !(node->start > end || node->end < start);
+}
+
+static u64 get_subtree_max_end(struct rb_node *node)
+{
+ struct kinterval *range;
+
+ if (!node)
+ return 0;
+ range = rb_entry(node, struct kinterval, rb);
+
+ return range->subtree_max_end;
+}
+
+/* Update 'subtree_max_end' for a node, based on node and its children */
+static void kinterval_rb_augment_cb(struct rb_node *node, void *__unused)
+{
+ struct kinterval *range;
+ u64 max_end, child_max_end;
+
+ if (!node)
+ return;
+ range = rb_entry(node, struct kinterval, rb);
+ max_end = range->end;
+
+ child_max_end = get_subtree_max_end(node->rb_right);
+ if (child_max_end > max_end)
+ max_end = child_max_end;
+
+ child_max_end = get_subtree_max_end(node->rb_left);
+ if (child_max_end > max_end)
+ max_end = child_max_end;
+
+ range->subtree_max_end = max_end;
+}
+
+/*
+ * Find the lowest overlapping range from the tree.
+ *
+ * Return NULL if there is no overlap.
+ */
+static struct kinterval *
+kinterval_rb_lowest_match(struct rb_root *root, u64 start, u64 end)
+{
+ struct rb_node *node = root->rb_node;
+ struct kinterval *lower_range = NULL;
+
+ while (node) {
+ struct kinterval *range = rb_entry(node, struct kinterval, rb);
+
+ if (get_subtree_max_end(node->rb_left) > start) {
+ node = node->rb_left;
+ } else if (is_interval_overlapping(range, start, end)) {
+ lower_range = range;
+ break;
+ } else if (start >= range->start) {
+ node = node->rb_right;
+ } else {
+ break;
+ }
+ }
+ return lower_range;
+}
+
+static void
+kinterval_rb_insert(struct rb_root *root, struct kinterval *new)
+{
+ struct rb_node **node = &(root->rb_node);
+ struct rb_node *parent = NULL;
+
+ while (*node) {
+ struct kinterval *range = rb_entry(*node, struct kinterval, rb);
+
+ parent = *node;
+ if (new->start <= range->start)
+ node = &((*node)->rb_left);
+ else if (new->start > range->start)
+ node = &((*node)->rb_right);
+ }
+
+ rb_link_node(&new->rb, parent, node);
+ rb_insert_color(&new->rb, root);
+ rb_augment_insert(&new->rb, kinterval_rb_augment_cb, NULL);
+}
+
+/* Merge adjacent intervals */
+static void kinterval_rb_merge(struct rb_root *root)
+{
+ struct kinterval *next, *prev = NULL;
+ struct rb_node *node, *deepest;
+
+ node = rb_first(root);
+ while (node) {
+ next = rb_entry(node, struct kinterval, rb);
+ node = rb_next(&next->rb);
+
+ if (prev && prev->type == next->type &&
+ prev->end == (next->start - 1) &&
+ prev->end < next->start) {
+ prev->end = next->end;
+ deepest = rb_augment_erase_begin(&next->rb);
+ rb_erase(&next->rb, root);
+ rb_augment_erase_end(deepest,
+ kinterval_rb_augment_cb, NULL);
+ kmem_cache_free(kinterval_cachep, next);
+ } else {
+ prev = next;
+ }
+ }
+}
+
+static int kinterval_rb_check_add(struct rb_root *root,
+ struct kinterval *new, gfp_t flags)
+{
+ struct kinterval *old;
+ struct rb_node *node, *deepest;
+
+ node = rb_first(root);
+ while (node) {
+ old = rb_entry(node, struct kinterval, rb);
+ /* Check all the possible matches within the range */
+ if (old->start > new->end)
+ break;
+ node = rb_next(&old->rb);
+
+ if (!is_interval_overlapping(old, new->start, new->end))
+ continue;
+ /*
+ * Interval is overlapping another one, shrink the old interval
+ * accordingly.
+ */
+ if (new->start == old->start && new->end == old->end) {
+ /*
+ * Exact match, just update the type:
+ *
+ * old
+ * |___________________|
+ * new
+ * |___________________|
+ */
+ old->type = new->type;
+ kmem_cache_free(kinterval_cachep, new);
+ return 0;
+ } else if (new->start <= old->start && new->end >= old->end) {
+ /*
+ * New range completely overwrites the old one:
+ *
+ * old
+ * |________|
+ * new
+ * |___________________|
+ *
+ * Replace old with new.
+ */
+ deepest = rb_augment_erase_begin(&old->rb);
+ rb_erase(&old->rb, root);
+ rb_augment_erase_end(deepest, kinterval_rb_augment_cb,
+ NULL);
+ kmem_cache_free(kinterval_cachep, old);
+ } else if (new->start <= old->start && new->end <= old->end) {
+ /*
+ * Update the start of the interval:
+ *
+ * - before:
+ *
+ * old
+ * |_____________|
+ * new
+ * |___________|
+ *
+ * - after:
+ *
+ * new old
+ * |___________|_______|
+ */
+ rb_erase(&old->rb, root);
+ old->start = new->end + 1;
+ kinterval_rb_insert(root, old);
+ break;
+ } else if (new->start >= old->start && new->end >= old->end) {
+ /*
+ * Update the end of the interval:
+ *
+ * - before:
+ *
+ * old
+ * |_____________|
+ * new
+ * |___________|
+ *
+ * - after:
+ *
+ * old new
+ * |________|__________|
+ */
+ deepest = rb_augment_erase_begin(&old->rb);
+ rb_erase(&old->rb, root);
+ rb_augment_erase_end(deepest, kinterval_rb_augment_cb,
+ NULL);
+ old->end = new->start - 1;
+ old->subtree_max_end = old->end;
+ kinterval_rb_insert(root, old);
+ } else if (new->start >= old->start && new->end <= old->end) {
+ struct kinterval *prev;
+
+ if (new->type == old->type) {
+ /* Same type, just drop the new element */
+ kmem_cache_free(kinterval_cachep, new);
+ return 0;
+ }
+ /*
+ * Insert the new interval in the middle of another
+ * one.
+ *
+ * - before:
+ *
+ * old
+ * |___________________|
+ * new
+ * |_______|
+ *
+ * - after:
+ *
+ * prev new old
+ * |_____|_______|_____|
+ */
+ prev = kmem_cache_zalloc(kinterval_cachep, flags);
+ if (unlikely(!prev))
+ return -ENOMEM;
+
+ rb_erase(&old->rb, root);
+
+ prev->start = old->start;
+ old->start = new->end + 1;
+ prev->end = new->start - 1;
+ prev->type = old->type;
+
+ kinterval_rb_insert(root, old);
+
+ new->subtree_max_end = new->end;
+ kinterval_rb_insert(root, new);
+
+ prev->subtree_max_end = prev->end;
+ kinterval_rb_insert(root, prev);
+ return 0;
+ }
+ }
+ new->subtree_max_end = new->end;
+ kinterval_rb_insert(root, new);
+ kinterval_rb_merge(root);
+
+ return 0;
+}
+
+int kinterval_add(struct rb_root *root, u64 start, u64 end,
+ long type, gfp_t flags)
+{
+ struct kinterval *range;
+ int ret;
+
+ if (end < start)
+ return -EINVAL;
+ range = kmem_cache_zalloc(kinterval_cachep, flags);
+ if (unlikely(!range))
+ return -ENOMEM;
+ range->start = start;
+ range->end = end;
+ range->type = type;
+
+ ret = kinterval_rb_check_add(root, range, flags);
+ if (unlikely(ret < 0))
+ kmem_cache_free(kinterval_cachep, range);
+
+ return ret;
+}
+EXPORT_SYMBOL(kinterval_add);
+
+static int kinterval_rb_check_del(struct rb_root *root,
+ u64 start, u64 end, gfp_t flags)
+{
+ struct kinterval *old;
+ struct rb_node *node, *deepest;
+
+ node = rb_first(root);
+ while (node) {
+ old = rb_entry(node, struct kinterval, rb);
+ /* Check all the possible matches within the range */
+ if (old->start > end)
+ break;
+ node = rb_next(&old->rb);
+
+ if (!is_interval_overlapping(old, start, end))
+ continue;
+ if (start <= old->start && end >= old->end) {
+ /*
+ * Completely erase the old range:
+ *
+ * old
+ * |________|
+ * erase
+ * |___________________|
+ */
+ deepest = rb_augment_erase_begin(&old->rb);
+ rb_erase(&old->rb, root);
+ rb_augment_erase_end(deepest, kinterval_rb_augment_cb,
+ NULL);
+ kmem_cache_free(kinterval_cachep, old);
+ } else if (start <= old->start && end <= old->end) {
+ /*
+ * Trim the beginning of an interval:
+ *
+ * - before:
+ *
+ * old
+ * |_____________|
+ * erase
+ * |___________|
+ *
+ * - after:
+ *
+ * old
+ * |_______|
+ */
+ rb_erase(&old->rb, root);
+ old->start = end + 1;
+ kinterval_rb_insert(root, old);
+ break;
+ } else if (start >= old->start && end >= old->end) {
+ /*
+ * Trim the end of an interval:
+ *
+ * - before:
+ *
+ * old
+ * |_____________|
+ * erase
+ * |___________|
+ *
+ * - after:
+ *
+ * old
+ * |________|
+ */
+ deepest = rb_augment_erase_begin(&old->rb);
+ rb_erase(&old->rb, root);
+ rb_augment_erase_end(deepest, kinterval_rb_augment_cb,
+ NULL);
+ old->end = start - 1;
+ old->subtree_max_end = old->end;
+ kinterval_rb_insert(root, old);
+ } else if (start >= old->start && end <= old->end) {
+ struct kinterval *prev;
+
+ /*
+ * Trim the middle of an interval:
+ *
+ * - before:
+ *
+ * old
+ * |___________________|
+ * erase
+ * |_______|
+ *
+ * - after:
+ *
+ * prev old
+ * |_____| |_____|
+ */
+ prev = kmem_cache_zalloc(kinterval_cachep, flags);
+ if (unlikely(!prev))
+ return -ENOMEM;
+
+ rb_erase(&old->rb, root);
+
+ prev->start = old->start;
+ old->start = end + 1;
+ prev->end = start - 1;
+ prev->type = old->type;
+
+ kinterval_rb_insert(root, old);
+
+ prev->subtree_max_end = prev->end;
+ kinterval_rb_insert(root, prev);
+ break;
+ }
+ }
+ return 0;
+}
+
+int kinterval_del(struct rb_root *root, u64 start, u64 end, gfp_t flags)
+{
+ if (end < start)
+ return -EINVAL;
+ return kinterval_rb_check_del(root, start, end, flags);
+}
+EXPORT_SYMBOL(kinterval_del);
+
+void kinterval_clear(struct rb_root *root)
+{
+ struct kinterval *range;
+ struct rb_node *node;
+
+ node = rb_first(root);
+ while (node) {
+ range = rb_entry(node, struct kinterval, rb);
+#ifdef DEBUG
+ printk(KERN_INFO "start=%llu end=%llu type=%lu\n",
+ range->start, range->end, range->type);
+#endif
+ node = rb_next(&range->rb);
+ rb_erase(&range->rb, root);
+ kmem_cache_free(kinterval_cachep, range);
+ }
+}
+EXPORT_SYMBOL(kinterval_clear);
+
+long kinterval_lookup_range(struct rb_root *root, u64 start, u64 end)
+{
+ struct kinterval *range;
+
+ if (end < start)
+ return -EINVAL;
+ range = kinterval_rb_lowest_match(root, start, end);
+ return range ? range->type : -ENOENT;
+}
+EXPORT_SYMBOL(kinterval_lookup_range);
+
+static int __init kinterval_init(void)
+{
+ kinterval_cachep = kmem_cache_create("kinterval_cache",
+ sizeof(struct kinterval),
+ 0, 0, NULL);
+ if (unlikely(!kinterval_cachep)) {
+ printk(KERN_ERR "kinterval: failed to create slab cache\n");
+ return -ENOMEM;
+ }
+ return 0;
+}
+
+static void __exit kinterval_exit(void)
+{
+ kmem_cache_destroy(kinterval_cachep);
+}
+
+module_init(kinterval_init);
+module_exit(kinterval_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("Generic interval ranges");
+MODULE_AUTHOR("Andrea Righi <[email protected]>");
--
1.7.5.4

2012-02-12 00:23:41

by Andrea Righi

Subject: [PATCH v5 2/3] mm: filemap: introduce mark_page_usedonce

Introduce a helper function to drop a page from the page cache if it is
evictable, inactive and unreferenced.

This can be used to drop used-once pages from the file cache with
POSIX_FADV_NOREUSE.

Signed-off-by: Andrea Righi <[email protected]>
---
include/linux/swap.h | 1 +
mm/swap.c | 24 ++++++++++++++++++++++++
2 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 3e60228..2e5d1b8 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -222,6 +222,7 @@ extern void lru_add_page_tail(struct zone* zone,
struct page *page, struct page *page_tail);
extern void activate_page(struct page *);
extern void mark_page_accessed(struct page *);
+extern void mark_page_usedonce(struct page *page);
extern void lru_add_drain(void);
extern int lru_add_drain_all(void);
extern void rotate_reclaimable_page(struct page *page);
diff --git a/mm/swap.c b/mm/swap.c
index fff1ff7..2c19c92 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -352,6 +352,30 @@ void activate_page(struct page *page)
}
#endif

+/**
+ * mark_page_usedonce - handle used-once pages
+ * @page: the page set as used-once
+ *
+ * Drop a page from the page cache if it is evictable, inactive and
+ * unreferenced.
+ */
+void mark_page_usedonce(struct page *page)
+{
+ int ret;
+
+ if (!PageLRU(page))
+ return;
+ if (PageActive(page) || PageUnevictable(page) || PageReferenced(page))
+ return;
+ if (lock_page_killable(page))
+ return;
+ ret = invalidate_inode_page(page);
+ unlock_page(page);
+ if (!ret)
+ deactivate_page(page);
+}
+EXPORT_SYMBOL(mark_page_usedonce);
+
/*
* Mark a page as having seen activity.
*
--
1.7.5.4

2012-02-12 00:23:54

by Andrea Righi

Subject: [PATCH v5 3/3] fadvise: implement POSIX_FADV_NOREUSE

According to the POSIX standard the POSIX_FADV_NOREUSE hint means that
the application expects to access the specified data once and then not
reuse it thereafter.

It seems that the expected behavior is to implement a drop-behind
policy where the application can set certain intervals of a file as
FADV_NOREUSE _before_ accessing the data.

An interesting usage of this hint is to guarantee that pages marked as
FADV_NOREUSE will never blow away the pages of the current working set.

A possible solution to satisfy this requirement is to prevent lru
activation of the pages marked as FADV_NOREUSE, in other words, never
add pages marked as FADV_NOREUSE to the active lru list. Moreover, all
the file cache pages in a FADV_NOREUSE range can be immediately dropped
after a read if the page was not present in the file cache before.

In general, the purpose of this approach is to preserve as much as
possible the previous state of the file cache memory before reading data
in ranges marked by FADV_NOREUSE.

All the pages read before (pre-)setting them as FADV_NOREUSE should be
treated as normal, so they can be added to the active lru list as usual
if they're accessed multiple times.

Only after setting them as FADV_NOREUSE can we prevent them from being
promoted to the active lru list. If they are already in the active lru
list before calling FADV_NOREUSE we should keep them there, but once they
leave the active list they can't get back in anymore (except by
explicitly setting a different caching hint).

To achieve this goal we need to maintain the list of file ranges marked
as FADV_NOREUSE until the pages are dropped from the page cache, or the
inode is deleted, or they're explicitly marked to use a different cache
behavior (FADV_NORMAL | FADV_WILLNEED).

The list of FADV_NOREUSE ranges is maintained in the address_space
structure using an interval tree (kinterval).

Signed-off-by: Andrea Righi <[email protected]>
---
fs/inode.c | 3 ++
include/linux/fs.h | 12 ++++++
mm/fadvise.c | 18 +++++++++-
mm/filemap.c | 95 ++++++++++++++++++++++++++++++++++++++++++++++++++-
4 files changed, 125 insertions(+), 3 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index fb10d86..6375163 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -254,6 +254,7 @@ void __destroy_inode(struct inode *inode)
if (inode->i_default_acl && inode->i_default_acl != ACL_NOT_CACHED)
posix_acl_release(inode->i_default_acl);
#endif
+ filemap_clear_cache(&inode->i_data);
this_cpu_dec(nr_inodes);
}
EXPORT_SYMBOL(__destroy_inode);
@@ -360,6 +361,8 @@ void address_space_init_once(struct address_space *mapping)
spin_lock_init(&mapping->private_lock);
INIT_RAW_PRIO_TREE_ROOT(&mapping->i_mmap);
INIT_LIST_HEAD(&mapping->i_mmap_nonlinear);
+ INIT_KINTERVAL_TREE_ROOT(&mapping->nocache_tree);
+ rwlock_init(&mapping->nocache_lock);
}
EXPORT_SYMBOL(address_space_init_once);

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 386da09..624a73e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -9,6 +9,7 @@
#include <linux/limits.h>
#include <linux/ioctl.h>
#include <linux/blk_types.h>
+#include <linux/kinterval.h>
#include <linux/types.h>

/*
@@ -521,6 +522,11 @@ enum positive_aop_returns {
* helper code (eg buffer layer)
* to clear GFP_FS from alloc */

+enum filemap_cache_modes {
+ FILEMAP_CACHE_NORMAL, /* No special cache behavior */
+ FILEMAP_CACHE_ONCE, /* Pages will be used once */
+};
+
/*
* oh the beauties of C type declarations.
*/
@@ -655,6 +661,8 @@ struct address_space {
spinlock_t private_lock; /* for use by the address_space */
struct list_head private_list; /* ditto */
struct address_space *assoc_mapping; /* ditto */
+ struct rb_root nocache_tree; /* noreuse cache range tree */
+ rwlock_t nocache_lock; /* protect the nocache_tree */
} __attribute__((aligned(sizeof(long))));
/*
* On most architectures that alignment is already the case; but
@@ -2189,6 +2197,10 @@ extern int invalidate_inode_pages2(struct address_space *mapping);
extern int invalidate_inode_pages2_range(struct address_space *mapping,
pgoff_t start, pgoff_t end);
extern int write_inode_now(struct inode *, int);
+extern void filemap_clear_cache(struct address_space *mapping);
+extern int filemap_set_cache(struct address_space *mapping,
+ pgoff_t start, pgoff_t end, int mode);
+extern int filemap_get_cache(struct address_space *mapping, pgoff_t index);
extern int filemap_fdatawrite(struct address_space *);
extern int filemap_flush(struct address_space *);
extern int filemap_fdatawait(struct address_space *);
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 469491e..22b1aa8 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -80,6 +80,12 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
spin_lock(&file->f_lock);
file->f_mode &= ~FMODE_RANDOM;
spin_unlock(&file->f_lock);
+
+ start_index = offset >> PAGE_CACHE_SHIFT;
+ end_index = endbyte >> PAGE_CACHE_SHIFT;
+
+ ret = filemap_set_cache(mapping, start_index, end_index,
+ FILEMAP_CACHE_NORMAL);
break;
case POSIX_FADV_RANDOM:
spin_lock(&file->f_lock);
@@ -102,11 +108,16 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
start_index = offset >> PAGE_CACHE_SHIFT;
end_index = endbyte >> PAGE_CACHE_SHIFT;

+ ret = filemap_set_cache(mapping, start_index, end_index,
+ FILEMAP_CACHE_NORMAL);
+ if (ret < 0)
+ break;
+
/* Careful about overflow on the "+1" */
nrpages = end_index - start_index + 1;
if (!nrpages)
nrpages = ~0UL;
-
+
ret = force_page_cache_readahead(mapping, file,
start_index,
nrpages);
@@ -114,6 +125,11 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
ret = 0;
break;
case POSIX_FADV_NOREUSE:
+ start_index = offset >> PAGE_CACHE_SHIFT;
+ end_index = endbyte >> PAGE_CACHE_SHIFT;
+
+ ret = filemap_set_cache(mapping, start_index, end_index,
+ FILEMAP_CACHE_ONCE);
break;
case POSIX_FADV_DONTNEED:
if (!bdi_write_congested(mapping->backing_dev_info))
diff --git a/mm/filemap.c b/mm/filemap.c
index b662757..610502a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -27,6 +27,7 @@
#include <linux/writeback.h>
#include <linux/backing-dev.h>
#include <linux/pagevec.h>
+#include <linux/kinterval.h>
#include <linux/blkdev.h>
#include <linux/security.h>
#include <linux/syscalls.h>
@@ -187,6 +188,82 @@ static int sleep_on_page_killable(void *word)
}

/**
+ * filemap_clear_cache - clear all special cache settings
+ * @mapping: target address space structure
+ *
+ * Reset all the special file cache settings previously defined by
+ * filemap_set_cache().
+ */
+void filemap_clear_cache(struct address_space *mapping)
+{
+ write_lock(&mapping->nocache_lock);
+ kinterval_clear(&mapping->nocache_tree);
+ write_unlock(&mapping->nocache_lock);
+}
+
+/**
+ * filemap_set_cache - set special cache behavior for a range of file pages
+ * @mapping: target address space structure
+ * @start: offset in pages where the range starts
+ * @end: offset in pages where the range ends (inclusive)
+ * @mode: cache behavior configuration
+ *
+ * This can be used to define special cache behavior in advance, before
+ * accessing the data.
+ *
+ * At the moment the supported cache behaviors are the following (see also
+ * filemap_cache_modes):
+ *
+ * FILEMAP_CACHE_NORMAL: normal page cache behavior;
+ *
+ * FILEMAP_CACHE_ONCE: specifies that the pages will be accessed once and the
+ * caller does not expect to reuse them thereafter. This prevents them from
+ * being promoted to the active lru list.
+ */
+int filemap_set_cache(struct address_space *mapping,
+ pgoff_t start, pgoff_t end, int mode)
+{
+ int ret;
+
+ write_lock(&mapping->nocache_lock);
+ switch (mode) {
+ case FILEMAP_CACHE_NORMAL:
+ ret = kinterval_del(&mapping->nocache_tree,
+ start, end, GFP_KERNEL);
+ break;
+ case FILEMAP_CACHE_ONCE:
+ ret = kinterval_add(&mapping->nocache_tree,
+ start, end, mode, GFP_KERNEL);
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+ write_unlock(&mapping->nocache_lock);
+
+ return ret;
+}
+
+/**
+ * filemap_get_cache - get special cache behavior of a file page
+ * @mapping: target address space structure
+ * @index: index of the page inside the address space
+ *
+ * If no special cache behavior is defined, FILEMAP_CACHE_NORMAL is returned
+ * (meaning that no special page cache behavior is applied).
+ */
+int filemap_get_cache(struct address_space *mapping, pgoff_t index)
+{
+ int mode;
+
+ read_lock(&mapping->nocache_lock);
+ mode = kinterval_lookup(&mapping->nocache_tree, index);
+ read_unlock(&mapping->nocache_lock);
+
+ return mode < 0 ? FILEMAP_CACHE_NORMAL : mode;
+}
+
+/**
* __filemap_fdatawrite_range - start writeback on mapping dirty pages in range
* @mapping: address space structure to write
* @start: offset in bytes where the range starts
@@ -1181,8 +1258,22 @@ page_ok:
* When a sequential read accesses a page several times,
* only mark it as accessed the first time.
*/
- if (prev_index != index || offset != prev_offset)
- mark_page_accessed(page);
+ if (prev_index != index || offset != prev_offset) {
+ int mode;
+
+ mode = filemap_get_cache(mapping, index);
+ switch (mode) {
+ case FILEMAP_CACHE_NORMAL:
+ mark_page_accessed(page);
+ break;
+ case FILEMAP_CACHE_ONCE:
+ mark_page_usedonce(page);
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ break;
+ }
+ }
prev_index = index;

/*
--
1.7.5.4

2012-02-12 07:16:10

by Hillf Danton

Subject: Re: [RFC] [PATCH v5 0/3] fadvise: support POSIX_FADV_NOREUSE

Hello Andrea

On Sun, Feb 12, 2012 at 8:21 AM, Andrea Righi <[email protected]> wrote:
[...]
>  - Some of the routines to implement the generic interval tree have been taken
>   from the x86 PAT code, which uses interval trees to keep track of PAT ranges
>   (in the future it would be interesting to convert the x86 PAT code to
>   use the generic interval tree implementation as well).
>
Perhaps the tree implemented in this work could also be used in tracking
regions in mm/hugetlb.c.

Thanks
Hillf

2012-02-12 11:58:41

by Andrea Righi

Subject: Re: [RFC] [PATCH v5 0/3] fadvise: support POSIX_FADV_NOREUSE

On Sun, Feb 12, 2012 at 03:16:07PM +0800, Hillf Danton wrote:
> Hello Andrea
>
> On Sun, Feb 12, 2012 at 8:21 AM, Andrea Righi <[email protected]> wrote:
> [...]
> >  - Some of the routines to implement the generic interval tree have been taken
> >    from the x86 PAT code, which uses interval trees to keep track of PAT ranges
> >    (in the future it would be interesting to convert the x86 PAT code to
> >    use the generic interval tree implementation as well).
> >
> Perhaps the tree implemented in this work could also be used in tracking
> regions in mm/hugetlb.c.
>
> Thanks
> Hillf

Thanks, Hillf.

Yes, I quickly looked at the hugetlb code; it seems another potential
user of the interval tree. Currently all the hugetlb regions are stored in a
list, and the interval tree is a more efficient structure for lookups,
O(log(n)), so there are probably advantages in the presence of many
different disjoint intervals.

mmh... at the moment there's no way to map region_count() onto the
current kinterval API, but we can easily extend it to provide this
feature as well (count the overlap size of two intervals), as sketched below.
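
Something like the following rough sketch could do it (it would live in
lib/kinterval.c, where kinterval_rb_lowest_match() is visible; completely
untested, and it assumes the inclusive [start, end] bounds and the disjoint
intervals maintained by kinterval_add() in this version):

static u64 kinterval_count_range(struct rb_root *root, u64 start, u64 end)
{
        struct kinterval *range;
        struct rb_node *node;
        u64 count = 0;

        /* walk the overlapping nodes in ascending order and sum up the
         * size of each overlap with [start, end] */
        range = kinterval_rb_lowest_match(root, start, end);
        node = range ? &range->rb : NULL;
        while (node) {
                range = rb_entry(node, struct kinterval, rb);
                if (range->start > end)
                        break;
                /* intervals are inclusive here, hence the +1 */
                count += min(range->end, end) - max(range->start, start) + 1;
                node = rb_next(&range->rb);
        }
        return count;
}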

-Andrea

2012-02-12 17:54:58

by Aneesh Kumar K.V

Subject: Re: [RFC] [PATCH v5 0/3] fadvise: support POSIX_FADV_NOREUSE

On Sun, 12 Feb 2012 12:58:34 +0100, Andrea Righi <[email protected]> wrote:
> On Sun, Feb 12, 2012 at 03:16:07PM +0800, Hillf Danton wrote:
> > Hello Andrea
> >
> > On Sun, Feb 12, 2012 at 8:21 AM, Andrea Righi <[email protected]> wrote:
> > [...]
> > >  - Some of the routines to implement the generic interval tree has been taken
> > >   from the x86 PAT code, that uses interval trees to keep track of PAT ranges
> > >   (in the future it would be interesting to convert also the x86 PAT code to
> > >   use the generic interval tree implementation).
> > >
> > Perhaps the tree implemented in this work could also be used in tracking
> > regions in mm/hugetlb.c.
> >
> > Thanks
> > Hillf
>
> Thanks, Hillf.
>
> Yes, I quickly looked at the hugtlb code, it seems another potential
> user of the interval tree. Now all the hugetlb regions are stored in a
> list, the interval tree is a more efficient structure for lookups -
> O(log(n)), so there are probably advantages in presence of many
> different disjoint intervals.
>
> mmh... at the moment there's not a way to map region_count() with the
> current kinterval API, but we can easily extend it to provide also this
> feature (count the overlap size of two intervals).
>

I am also extending the hugetlb region list in the hugetlb cgroup
patchset I recently posted, to make sure we don't merge regions if the
associated private data doesn't match (in my case it is the pointer to
the hugetlb cgroup). Ref:
http://article.gmane.org/gmane.linux.kernel.mm/73829

The goal is to make sure that adding a region with a different private value
results in the layout below.


old
          |      hcg1      |
          ------------------

new
|              hcg2              |
---------------------------------

results in

| hcg2  | |      hcg1      | | hcg2 |
--------- __________________ --------


-aneesh

2012-02-13 00:49:10

by Andrea Righi

Subject: Re: [PATCH v5 1/3] kinterval: routines to manipulate generic intervals

On Sun, Feb 12, 2012 at 01:21:36AM +0100, Andrea Righi wrote:
> Add a generic infrastructure to efficiently keep track of intervals.
>
> An interval is represented by a triplet (start, end, type). The values
> (start, end) define the bounds of the range. The type is a generic
> property associated to the interval. The interpretation of the type is
> left to the user.
>
> Multiple intervals associated to the same object are stored in an
> interval tree (augmented rbtree) [1], with tree ordered on starting
> address. The tree cannot contain multiple entries of different
> interval types which overlap; in case of overlapping intervals new
> inserted intervals overwrite the old ones (completely or in part, in the
> second case the old interval is shrunk or split accordingly).
>
> Reference:
> [1] "Introduction to Algorithms" by Cormen, Leiserson, Rivest and Stein
>
> Signed-off-by: Andrea Righi <[email protected]>
> ---

For those who are interested, I've an extra patch for this part (must be
applied on top of the previous one): there's a bug fix and some
improvements.

I'll include the following fix in the next version of the
POSIX_FADV_NOREUSE patch set (sorry for this, but the whole patch set is
still totally experimental, especially the kinterval part). I also plan
to add proper documentation and a sample test case in the samples/ dir
if we think it'll be useful.

Thanks,
-Andrea

---
From: Andrea Righi <[email protected]>
Subject: [PATCH] kinterval: fix interval boundaries and optimize insertion/removal

Use the right notation [start, end) instead of [start, end] for interval
boundaries, so now an interval does not include its endpoint. This is
the natural way to define a memory range (i.e., 0-4096 = all bytes
between 0 and 4095 included) and it makes it easier to reuse this code
for other similar cases as well.

Moreover, optimize insertion and removal code to be actually O(log(n)).

Signed-off-by: Andrea Righi <[email protected]>
---
include/linux/kinterval.h | 2 +-
lib/kinterval.c | 135 ++++++++++++++++++++++++++++-----------------
2 files changed, 86 insertions(+), 51 deletions(-)

diff --git a/include/linux/kinterval.h b/include/linux/kinterval.h
index 8152265..7a505f4 100644
--- a/include/linux/kinterval.h
+++ b/include/linux/kinterval.h
@@ -114,7 +114,7 @@ long kinterval_lookup_range(struct rb_root *root, u64 start, u64 end);
*/
static inline long kinterval_lookup(struct rb_root *root, u64 addr)
{
- return kinterval_lookup_range(root, addr, addr);
+ return kinterval_lookup_range(root, addr, addr + 1);
}

/**
diff --git a/lib/kinterval.c b/lib/kinterval.c
index 2a9d463..28ee627 100644
--- a/lib/kinterval.c
+++ b/lib/kinterval.c
@@ -31,7 +31,7 @@ static struct kmem_cache *kinterval_cachep __read_mostly;

static bool is_interval_overlapping(struct kinterval *node, u64 start, u64 end)
{
- return !(node->start > end || node->end < start);
+ return !(node->start >= end || node->end <= start);
}

static u64 get_subtree_max_end(struct rb_node *node)
@@ -76,23 +76,65 @@ static struct kinterval *
kinterval_rb_lowest_match(struct rb_root *root, u64 start, u64 end)
{
struct rb_node *node = root->rb_node;
- struct kinterval *lower_range = NULL;
+ struct kinterval *lowest_match = NULL;

while (node) {
struct kinterval *range = rb_entry(node, struct kinterval, rb);

if (get_subtree_max_end(node->rb_left) > start) {
+ /* Lowest overlap if any must be on the left side */
node = node->rb_left;
} else if (is_interval_overlapping(range, start, end)) {
- lower_range = range;
+ lowest_match = range;
break;
} else if (start >= range->start) {
+ /* Lowest overlap if any must be on the right side */
node = node->rb_right;
} else {
break;
}
}
- return lower_range;
+ return lowest_match;
+}
+
+/*
+ * Merge two adjacent intervals; if they can be merged, next is removed from
+ * the tree.
+ */
+static void kinterval_rb_merge_node(struct rb_root *root,
+ struct kinterval *prev, struct kinterval *next)
+{
+ struct rb_node *deepest;
+
+ if (prev && prev->type == next->type && prev->end == next->start) {
+ prev->end = next->end;
+ deepest = rb_augment_erase_begin(&next->rb);
+ rb_erase(&next->rb, root);
+ rb_augment_erase_end(deepest,
+ kinterval_rb_augment_cb, NULL);
+ kmem_cache_free(kinterval_cachep, next);
+ }
+}
+
+/*
+ * Try to merge a new inserted interval with the previous and the next
+ * interval.
+ */
+static void kinterval_rb_merge(struct rb_root *root, struct kinterval *new)
+{
+ struct kinterval *next, *prev;
+ struct rb_node *node;
+
+ node = rb_prev(&new->rb);
+ prev = node ? rb_entry(node, struct kinterval, rb) : NULL;
+
+ node = rb_next(&new->rb);
+ next = node ? rb_entry(node, struct kinterval, rb) : NULL;
+
+ if (next)
+ kinterval_rb_merge_node(root, new, next);
+ if (prev)
+ kinterval_rb_merge_node(root, prev, new);
}

static void
@@ -114,32 +156,8 @@ kinterval_rb_insert(struct rb_root *root, struct kinterval *new)
rb_link_node(&new->rb, parent, node);
rb_insert_color(&new->rb, root);
rb_augment_insert(&new->rb, kinterval_rb_augment_cb, NULL);
-}

-/* Merge adjacent intervals */
-static void kinterval_rb_merge(struct rb_root *root)
-{
- struct kinterval *next, *prev = NULL;
- struct rb_node *node, *deepest;
-
- node = rb_first(root);
- while (node) {
- next = rb_entry(node, struct kinterval, rb);
- node = rb_next(&next->rb);
-
- if (prev && prev->type == next->type &&
- prev->end == (next->start - 1) &&
- prev->end < next->start) {
- prev->end = next->end;
- deepest = rb_augment_erase_begin(&next->rb);
- rb_erase(&next->rb, root);
- rb_augment_erase_end(deepest,
- kinterval_rb_augment_cb, NULL);
- kmem_cache_free(kinterval_cachep, next);
- } else {
- prev = next;
- }
- }
+ kinterval_rb_merge(root, new);
}

static int kinterval_rb_check_add(struct rb_root *root,
@@ -148,16 +166,17 @@ static int kinterval_rb_check_add(struct rb_root *root,
struct kinterval *old;
struct rb_node *node, *deepest;

- node = rb_first(root);
+ old = kinterval_rb_lowest_match(root, new->start, new->end);
+ node = old ? &old->rb : NULL;
+
while (node) {
old = rb_entry(node, struct kinterval, rb);
+ node = rb_next(&old->rb);
+
/* Check all the possible matches within the range */
- if (old->start > new->end)
+ if (old->start >= new->end)
break;
- node = rb_next(&old->rb);

- if (!is_interval_overlapping(old, new->start, new->end))
- continue;
/*
* Interval is overlapping another one, shrink the old interval
* accordingly.
@@ -206,8 +225,12 @@ static int kinterval_rb_check_add(struct rb_root *root,
* new old
* |___________|_______|
*/
+ deepest = rb_augment_erase_begin(&old->rb);
rb_erase(&old->rb, root);
+ rb_augment_erase_end(deepest, kinterval_rb_augment_cb,
+ NULL);
old->start = new->end + 1;
+ old->subtree_max_end = old->end;
kinterval_rb_insert(root, old);
break;
} else if (new->start >= old->start && new->end >= old->end) {
@@ -230,7 +253,7 @@ static int kinterval_rb_check_add(struct rb_root *root,
rb_erase(&old->rb, root);
rb_augment_erase_end(deepest, kinterval_rb_augment_cb,
NULL);
- old->end = new->start - 1;
+ old->end = new->start;
old->subtree_max_end = old->end;
kinterval_rb_insert(root, old);
} else if (new->start >= old->start && new->end <= old->end) {
@@ -261,13 +284,17 @@ static int kinterval_rb_check_add(struct rb_root *root,
if (unlikely(!prev))
return -ENOMEM;

+ deepest = rb_augment_erase_begin(&old->rb);
rb_erase(&old->rb, root);
+ rb_augment_erase_end(deepest, kinterval_rb_augment_cb,
+ NULL);

prev->start = old->start;
- old->start = new->end + 1;
- prev->end = new->start - 1;
+ old->start = new->end;
+ prev->end = new->start;
prev->type = old->type;

+ old->subtree_max_end = old->end;
kinterval_rb_insert(root, old);

new->subtree_max_end = new->end;
@@ -280,7 +307,6 @@ static int kinterval_rb_check_add(struct rb_root *root,
}
new->subtree_max_end = new->end;
kinterval_rb_insert(root, new);
- kinterval_rb_merge(root);

return 0;
}
@@ -291,7 +317,7 @@ int kinterval_add(struct rb_root *root, u64 start, u64 end,
struct kinterval *range;
int ret;

- if (end < start)
+ if (end <= start)
return -EINVAL;
range = kmem_cache_zalloc(kinterval_cachep, flags);
if (unlikely(!range))
@@ -314,16 +340,17 @@ static int kinterval_rb_check_del(struct rb_root *root,
struct kinterval *old;
struct rb_node *node, *deepest;

- node = rb_first(root);
+ old = kinterval_rb_lowest_match(root, start, end);
+ node = old ? &old->rb : NULL;
+
while (node) {
old = rb_entry(node, struct kinterval, rb);
+ node = rb_next(&old->rb);
+
/* Check all the possible matches within the range */
- if (old->start > end)
+ if (old->start >= end)
break;
- node = rb_next(&old->rb);

- if (!is_interval_overlapping(old, start, end))
- continue;
if (start <= old->start && end >= old->end) {
/*
* Completely erase the old range:
@@ -354,8 +381,12 @@ static int kinterval_rb_check_del(struct rb_root *root,
* old
* |_______|
*/
+ deepest = rb_augment_erase_begin(&old->rb);
rb_erase(&old->rb, root);
- old->start = end + 1;
+ rb_augment_erase_end(deepest, kinterval_rb_augment_cb,
+ NULL);
+ old->start = end;
+ old->subtree_max_end = old->end;
kinterval_rb_insert(root, old);
break;
} else if (start >= old->start && end >= old->end) {
@@ -378,7 +409,7 @@ static int kinterval_rb_check_del(struct rb_root *root,
rb_erase(&old->rb, root);
rb_augment_erase_end(deepest, kinterval_rb_augment_cb,
NULL);
- old->end = start - 1;
+ old->end = start;
old->subtree_max_end = old->end;
kinterval_rb_insert(root, old);
} else if (start >= old->start && end <= old->end) {
@@ -403,13 +434,17 @@ static int kinterval_rb_check_del(struct rb_root *root,
if (unlikely(!prev))
return -ENOMEM;

+ deepest = rb_augment_erase_begin(&old->rb);
rb_erase(&old->rb, root);
+ rb_augment_erase_end(deepest, kinterval_rb_augment_cb,
+ NULL);

prev->start = old->start;
- old->start = end + 1;
- prev->end = start - 1;
+ old->start = end;
+ prev->end = start;
prev->type = old->type;

+ old->subtree_max_end = old->end;
kinterval_rb_insert(root, old);

prev->subtree_max_end = prev->end;
@@ -422,7 +457,7 @@ static int kinterval_rb_check_del(struct rb_root *root,

int kinterval_del(struct rb_root *root, u64 start, u64 end, gfp_t flags)
{
- if (end < start)
+ if (end <= start)
return -EINVAL;
return kinterval_rb_check_del(root, start, end, flags);
}
@@ -451,7 +486,7 @@ long kinterval_lookup_range(struct rb_root *root, u64 start, u64 end)
{
struct kinterval *range;

- if (end < start)
+ if (end <= start)
return -EINVAL;
range = kinterval_rb_lowest_match(root, start, end);
return range ? range->type : -ENOENT;
--
1.7.5.4

2012-02-13 01:14:05

by Andrea Righi

Subject: Re: [RFC] [PATCH v5 0/3] fadvise: support POSIX_FADV_NOREUSE

On Sun, Feb 12, 2012 at 11:24:27PM +0530, Aneesh Kumar K.V wrote:
> On Sun, 12 Feb 2012 12:58:34 +0100, Andrea Righi <[email protected]> wrote:
> > On Sun, Feb 12, 2012 at 03:16:07PM +0800, Hillf Danton wrote:
> > > Hello Andrea
> > >
> > > On Sun, Feb 12, 2012 at 8:21 AM, Andrea Righi <[email protected]> wrote:
> > > [...]
> > > >  - Some of the routines to implement the generic interval tree have been taken
> > > >    from the x86 PAT code, which uses interval trees to keep track of PAT ranges
> > > >    (in the future it would be interesting to convert the x86 PAT code to
> > > >    use the generic interval tree implementation as well).
> > > >
> > > Perhaps the tree implemented in this work could also be used in tracking
> > > regions in mm/hugetlb.c.
> > >
> > > Thanks
> > > Hillf
> >
> > Thanks, Hillf.
> >
> > Yes, I quickly looked at the hugetlb code; it seems another potential
> > user of the interval tree. Currently all the hugetlb regions are stored in a
> > list, and the interval tree is a more efficient structure for lookups,
> > O(log(n)), so there are probably advantages in the presence of many
> > different disjoint intervals.
> >
> > mmh... at the moment there's no way to map region_count() onto the
> > current kinterval API, but we can easily extend it to provide this
> > feature as well (count the overlap size of two intervals).
> >
>
> I am also extending the hugetlb region list in the hugetlb cgroup
> patchset I recently posted, to make sure we don't merge regions if the
> associated private data doesn't match (in my case it is the pointer to
> the hugetlb cgroup). Ref:
> http://article.gmane.org/gmane.linux.kernel.mm/73829
>
> The goal is to make sure that adding a region with a different private value
> results in the layout below.
>
>
> old
>           |      hcg1      |
>           ------------------
>
> new
> |              hcg2              |
> ---------------------------------
>
> results in
>
> | hcg2  | |      hcg1      | | hcg2 |
> --------- __________________ --------
>
>
> -aneesh

Yep! This is different from my case, at least according to the current
implementation. Using kintervals the new range hcg2 will completely
overwrite hcg1, because the default behavior is that new ranges
overwrite old ranges.

To get the same result the new range (hcg2) should be defined inside
the old range (hcg1):

new
          |      hcg2      |
          ------------------

old
|              hcg1              |
---------------------------------

In this case the result will be:

| hcg1  | |      hcg2      | | hcg1 |
--------- __________________ --------

mmmh.. I'm wondering if there's a better way to make the intervals even
more generic, so that it would also be possible to specify a different
behavior in case of conflicts... (i.e., new wins vs. old wins).

Thanks,
-Andrea

2012-02-13 16:22:48

by KOSAKI Motohiro

Subject: Re: [PATCH v5 3/3] fadvise: implement POSIX_FADV_NOREUSE

> @@ -1181,8 +1258,22 @@ page_ok:
>                * When a sequential read accesses a page several times,
>                * only mark it as accessed the first time.
>                */
> -             if (prev_index != index || offset != prev_offset)
> -                     mark_page_accessed(page);
> +             if (prev_index != index || offset != prev_offset) {
> +                     int mode;
> +
> +                     mode = filemap_get_cache(mapping, index);
> +                     switch (mode) {
> +                     case FILEMAP_CACHE_NORMAL:
> +                             mark_page_accessed(page);
> +                             break;
> +                     case FILEMAP_CACHE_ONCE:
> +                             mark_page_usedonce(page);
> +                             break;
> +                     default:
> +                             WARN_ON_ONCE(1);
> +                             break;

Here is generic_file_read, right? Why don't you care about writes and page faults?

2012-02-13 18:00:35

by Andrea Righi

Subject: Re: [PATCH v5 3/3] fadvise: implement POSIX_FADV_NOREUSE

On Mon, Feb 13, 2012 at 11:22:26AM -0500, KOSAKI Motohiro wrote:
> > @@ -1181,8 +1258,22 @@ page_ok:
> >                * When a sequential read accesses a page several times,
> >                * only mark it as accessed the first time.
> >                */
> > -             if (prev_index != index || offset != prev_offset)
> > -                     mark_page_accessed(page);
> > +             if (prev_index != index || offset != prev_offset) {
> > +                     int mode;
> > +
> > +                     mode = filemap_get_cache(mapping, index);
> > +                     switch (mode) {
> > +                     case FILEMAP_CACHE_NORMAL:
> > +                             mark_page_accessed(page);
> > +                             break;
> > +                     case FILEMAP_CACHE_ONCE:
> > +                             mark_page_usedonce(page);
> > +                             break;
> > +                     default:
> > +                             WARN_ON_ONCE(1);
> > +                             break;
>
> Here is generic_file_read, right? Why don't you care about writes and page faults?

That's correct. I focused on my read test case and have not considered
writes and page faults at all yet. There's probably also another missing
piece: readahead.

About generic_file_read, the behavior that we may want to provide looks
quite clear to me. Instead, I don't know what the best behavior is for
NOREUSE writes... should we just avoid active lru list eligibility,
or also drop the pages after the write if they weren't present in the page
cache before the write? In the former case NOREUSE pages can still trash
pages in the inactive lru list; in the latter, writes will be slow
because we need to wait for the writeback. Ideas/suggestions?

For readahead pages I think we shouldn't touch anything. IIUC, when
readahead pages are loaded into the page cache for the first time they
are added to the inactive lru list and not marked as referenced.

If the readahead pages are then referenced by the application and
they're inside a NOREUSE range, they won't be marked as referenced and
will continue to live in the inactive list (and be dropped when the
inactive list is shrunk). If readahead pages are not inside a NOREUSE
range they will be treated as usual (marked as referenced, and moved to
the active list if they're accessed more than once).

Thanks,
-Andrea

2012-02-14 21:33:41

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC] [PATCH v5 0/3] fadvise: support POSIX_FADV_NOREUSE

On Sun, 12 Feb 2012 01:21:35 +0100
Andrea Righi <[email protected]> wrote:

> The new proposal is to implement POSIX_FADV_NOREUSE as a way to perform a real
> drop-behind policy where applications can mark certain intervals of a file as
> FADV_NOREUSE before accessing the data.

I think you and John need to talk to each other, please. The amount of
duplication here is extraordinary.

Both patchsets add fields to the address_space (and hence inode), which
is significant - we should convince ourselves that we're getting really
good returns from a feature which does this.



Regarding the use of fadvise(): I suppose it's a reasonable thing to do
in the long term - if the feature works well, popular data streaming
applications will eventually switch over. But I do think we should
explore interfaces which don't require modification of userspace source
code. Because there will always be unconverted applications, and the
feature becomes available immediately.

One such interface would be to toss the offending application into a
container which has a modified drop-behind policy. And here we need to
drag out the crystal ball: what *is* the best way of tuning application
pagecache behaviour? Will we gravitate towards containerization, or
will we gravitate towards finer-tuned fadvise/sync_page_range/etc
behaviour? Thus far it has been the latter, and I don't think that has
been a great success.

Finally, are the problems which prompted these patchsets already
solved? What happens if you take the offending streaming application
and toss it into a 16MB memcg? That *should* avoid perturbing other
things running on that machine.

And yes, a container-based approach is pretty crude, and one can
envision applications which only want modified reclaim policy for one
particular file. But I suspect an application-wide reclaim policy
solves 90% of the problems.

2012-02-14 22:07:07

by John Stultz

[permalink] [raw]
Subject: Re: [RFC] [PATCH v5 0/3] fadvise: support POSIX_FADV_NOREUSE

On Tue, 2012-02-14 at 13:33 -0800, Andrew Morton wrote:
> On Sun, 12 Feb 2012 01:21:35 +0100
> Andrea Righi <[email protected]> wrote:
>
> > The new proposal is to implement POSIX_FADV_NOREUSE as a way to perform a real
> > drop-behind policy where applications can mark certain intervals of a file as
> > FADV_NOREUSE before accessing the data.
>
> I think you and John need to talk to each other, please. The amount of
> duplication here is extraordinary.

Yea. Clearly there is much we can share. I'm still catching up from a
conference last week, so I've not had a chance to really look at this
yet, but it's on my queue for this week.

thanks
-john

2012-02-14 22:59:32

by Andrea Righi

[permalink] [raw]
Subject: Re: [RFC] [PATCH v5 0/3] fadvise: support POSIX_FADV_NOREUSE

On Tue, Feb 14, 2012 at 01:33:37PM -0800, Andrew Morton wrote:
> On Sun, 12 Feb 2012 01:21:35 +0100
> Andrea Righi <[email protected]> wrote:
>
> > The new proposal is to implement POSIX_FADV_NOREUSE as a way to perform a real
> > drop-behind policy where applications can mark certain intervals of a file as
> > FADV_NOREUSE before accessing the data.
>
> I think you and John need to talk to each other, please. The amount of
> duplication here is extraordinary.

Yes, definitely. I'm currently reviewing and testing John's patch set.
I was even considering applying my patch set on top of John's patches,
or at least proposing my tree-based approach to manage the list of
POSIX_FADV_VOLATILE ranges.

>
> Both patchsets add fields to the address_space (and hence inode), which
> is significant - we should convince ourselves that we're getting really
> good returns from a feature which does this.
>
>
>
> Regarding the use of fadvise(): I suppose it's a reasonable thing to do
> in the long term - if the feature works well, popular data streaming
> applications will eventually switch over. But I do think we should
> explore interfaces which don't require modification of userspace source
> code. Because there will always be unconverted applications, and the
> feature becomes available immediately.
>
> One such interface would be to toss the offending application into a
> container which has a modified drop-behind policy. And here we need to
> drag out the crystal ball: what *is* the best way of tuning application
> pagecache behaviour? Will we gravitate towards containerization, or
> will we gravitate towards finer-tuned fadvise/sync_page_range/etc
> behaviour? Thus far it has been the latter, and I don't think that has
> been a great success.
>
> Finally, are the problems which prompted these patchsets already
> solved? What happens if you take the offending streaming application
> and toss it into a 16MB memcg? That *should* avoid perturbing other
> things running on that machine.

Moving the streaming application into a 16MB memcg can be dangerous in
some cases... the application might start to do "bad" things, like
swapping (if the memcg can swap) or just fail due to OOMs.

>
> And yes, a container-based approach is pretty crude, and one can
> envision applications which only want modified reclaim policy for one
> particular file. But I suspect an application-wide reclaim policy
> solves 90% of the problems.

I really like the container-based approach. But for this we need
better file cache control in the memory cgroup; right now we have
accounting of file pages, but there's no way to limit them.

Thanks for your comments, Andrew.

-Andrea

2012-02-14 23:22:24

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC] [PATCH v5 0/3] fadvise: support POSIX_FADV_NOREUSE

On Tue, 14 Feb 2012 23:59:22 +0100
Andrea Righi <[email protected]> wrote:

> On Tue, Feb 14, 2012 at 01:33:37PM -0800, Andrew Morton wrote:
> > On Sun, 12 Feb 2012 01:21:35 +0100
> > Andrea Righi <[email protected]> wrote:
> >
> > > The new proposal is to implement POSIX_FADV_NOREUSE as a way to perform a real
> > > drop-behind policy where applications can mark certain intervals of a file as
> > > FADV_NOREUSE before accessing the data.
> >
> > I think you and John need to talk to each other, please. The amount of
> > duplication here is extraordinary.
>
> Yes, definitely. I'm currently reviewing and testing John's patch set.
> I was even considering applying my patch set on top of John's patches,
> or at least proposing my tree-based approach to manage the list of
> POSIX_FADV_VOLATILE ranges.

Cool.

> >
> > Both patchsets add fields to the address_space (and hence inode), which
> > is significant - we should convince ourselves that we're getting really
> > good returns from a feature which does this.
> >
> >
> >
> > Regarding the use of fadvise(): I suppose it's a reasonable thing to do
> > in the long term - if the feature works well, popular data streaming
> > applications will eventually switch over. But I do think we should
> > explore interfaces which don't require modification of userspace source
> > code. Because there will always be unconverted applications, and the
> > feature becomes available immediately.
> >
> > One such interface would be to toss the offending application into a
> > container which has a modified drop-behind policy. And here we need to
> > drag out the crystal ball: what *is* the best way of tuning application
> > pagecache behaviour? Will we gravitate towards containerization, or
> > will we gravitate towards finer-tuned fadvise/sync_page_range/etc
> > behaviour? Thus far it has been the latter, and I don't think that has
> > been a great success.
> >
> > Finally, are the problems which prompted these patchsets already
> > solved? What happens if you take the offending streaming application
> > and toss it into a 16MB memcg? That *should* avoid perturbing other
> > things running on that machine.
>
> Moving the streaming application into a 16MB memcg can be dangerous in
> some cases... the application might start to do "bad" things, like
> swapping (if the memcg can swap) or just fail due to OOMs.

Well OK, maybe there are problems with the current implementation. But
are they unfixable problems? Is the right approach to give up on ever
making containers useful for this application and to instead go off and
implement a new and separate feature?

> > And yes, a container-based approach is pretty crude, and one can
> > envision applications which only want modified reclaim policy for one
> > particular file. But I suspect an application-wide reclaim policy
> > solves 90% of the problems.
>
> I really like the container-based approach. But for this we need
> better file cache control in the memory cgroup; right now we have
> accounting of file pages, but there's no way to limit them.

Again, if/when memcg becomes sufficiently useful for this application,
we're left maintaining the obsolete POSIX_FADV_NOREUSE forever.

2012-02-15 01:35:33

by Andrea Righi

[permalink] [raw]
Subject: Re: [RFC] [PATCH v5 0/3] fadvise: support POSIX_FADV_NOREUSE

On Tue, Feb 14, 2012 at 03:22:20PM -0800, Andrew Morton wrote:
> On Tue, 14 Feb 2012 23:59:22 +0100
> Andrea Righi <[email protected]> wrote:
>
> > On Tue, Feb 14, 2012 at 01:33:37PM -0800, Andrew Morton wrote:
> > > On Sun, 12 Feb 2012 01:21:35 +0100
> > > Andrea Righi <[email protected]> wrote:
> > >
> > > > The new proposal is to implement POSIX_FADV_NOREUSE as a way to perform a real
> > > > drop-behind policy where applications can mark certain intervals of a file as
> > > > FADV_NOREUSE before accessing the data.
> > >
> > > I think you and John need to talk to each other, please. The amount of
> > > duplication here is extraordinary.
> >
> > Yes, definitely. I'm currently reviewing and testing John's patch set.
> > I was even considering applying my patch set on top of John's patches,
> > or at least proposing my tree-based approach to manage the list of
> > POSIX_FADV_VOLATILE ranges.
>
> Cool.
>
> > >
> > > Both patchsets add fields to the address_space (and hence inode), which
> > > is significant - we should convince ourselves that we're getting really
> > > good returns from a feature which does this.
> > >
> > >
> > >
> > > Regarding the use of fadvise(): I suppose it's a reasonable thing to do
> > > in the long term - if the feature works well, popular data streaming
> > > applications will eventually switch over. But I do think we should
> > > explore interfaces which don't require modification of userspace source
> > > code. Because there will always be unconverted applications, and the
> > > feature becomes available immediately.
> > >
> > > One such interface would be to toss the offending application into a
> > > container which has a modified drop-behind policy. And here we need to
> > > drag out the crystal ball: what *is* the best way of tuning application
> > > pagecache behaviour? Will we gravitate towards containerization, or
> > > will we gravitate towards finer-tuned fadvise/sync_page_range/etc
> > > behaviour? Thus far it has been the latter, and I don't think that has
> > > been a great success.
> > >
> > > Finally, are the problems which prompted these patchsets already
> > > solved? What happens if you take the offending streaming application
> > > and toss it into a 16MB memcg? That *should* avoid perturbing other
> > > things running on that machine.
> >
> > Moving the streaming application into a 16MB memcg can be dangerous in
> > some cases... the application might start to do "bad" things, like
> > swapping (if the memcg can swap) or just fail due to OOMs.
>
> Well OK, maybe there are problems with the current implementation. But
> are they unfixable problems? Is the right approach to give up on ever
> making containers useful for this application and to instead go off and
> implement a new and separate feature?
>
> > > And yes, a container-based approach is pretty crude, and one can
> > > envision applications which only want modified reclaim policy for one
> > > particular file. But I suspect an application-wide reclaim policy
> > > solves 90% of the problems.
> >
> > I really like the container-based approach. But for this we need
> > better file cache control in the memory cgroup; right now we have
> > accounting of file pages, but there's no way to limit them.
>
> Again, if/when memcg becomes sufficiently useful for this application,
> we're left maintaining the obsolete POSIX_FADV_NOREUSE forever.

Yes, totally agree. For the future, a memcg-based solution is probably
the best way to go.

This reminds me of the old per-memcg dirty memory discussion
(http://thread.gmane.org/gmane.linux.kernel.mm/67114), cc'ing Greg.

Maybe the generic feature that could solve both problems is better file
cache isolation in memcg.

Thanks,
-Andrea

2012-02-15 23:35:45

by Arun Sharma

[permalink] [raw]
Subject: Re: [PATCH v5 3/3] fadvise: implement POSIX_FADV_NOREUSE

On Sun, Feb 12, 2012 at 01:21:38AM +0100, Andrea Righi wrote:
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 386da09..624a73e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -9,6 +9,7 @@
> #include <linux/limits.h>
> #include <linux/ioctl.h>
> #include <linux/blk_types.h>
> +#include <linux/kinterval.h>
> #include <linux/types.h>

fs.h is an exported header file, whereas kinterval.h is not. So this
fails scripts/header_check.pl.

I used the workaround below.

-Arun

diff --git a/fs/inode.c b/fs/inode.c
index d27dbee..1335a5f 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -26,6 +26,7 @@
 #include <linux/ima.h>
 #include <linux/cred.h>
 #include <linux/buffer_head.h>  /* for inode_has_buffers */
+#include <linux/kinterval.h>
 #include "internal.h"

 /*
@@ -279,7 +280,7 @@ void address_space_init_once(struct address_space *mapping)
        spin_lock_init(&mapping->private_lock);
        INIT_RAW_PRIO_TREE_ROOT(&mapping->i_mmap);
        INIT_LIST_HEAD(&mapping->i_mmap_nonlinear);
-       INIT_KINTERVAL_TREE_ROOT(&mapping->nocache_tree);
+       INIT_KINTERVAL_TREE_ROOT((struct rb_root *) &mapping->nocache_tree);
        rwlock_init(&mapping->nocache_lock);
 }
 EXPORT_SYMBOL(address_space_init_once);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 74b6a97..b4e45e6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -9,7 +9,6 @@
 #include <linux/limits.h>
 #include <linux/ioctl.h>
 #include <linux/blk_types.h>
-#include <linux/kinterval.h>
 #include <linux/types.h>

 /*
@@ -656,7 +655,7 @@ struct address_space {
        spinlock_t              private_lock;   /* for use by the address_space */
        struct list_head        private_list;   /* ditto */
        struct address_space    *assoc_mapping; /* ditto */
-       struct rb_root          nocache_tree;   /* noreuse cache range tree */
+       void                    *nocache_tree;  /* noreuse cache range tree */
        rwlock_t                nocache_lock;   /* protect the nocache_tree */
 } __attribute__((aligned(sizeof(long))));
        /*

2012-02-15 23:47:39

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH v5 3/3] fadvise: implement POSIX_FADV_NOREUSE

On Wed, Feb 15, 2012 at 03:35:37PM -0800, Arun Sharma wrote:
> On Sun, Feb 12, 2012 at 01:21:38AM +0100, Andrea Righi wrote:
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 386da09..624a73e 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -9,6 +9,7 @@
> > #include <linux/limits.h>
> > #include <linux/ioctl.h>
> > #include <linux/blk_types.h>
> > +#include <linux/kinterval.h>
> > #include <linux/types.h>
>
> fs.h is an exported header file, whereas kinterval.h is not. So this
> fails scripts/header_check.pl.
>
> I used the workaround below.
>
> -Arun
>
> diff --git a/fs/inode.c b/fs/inode.c
> index d27dbee..1335a5f 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -26,6 +26,7 @@
>  #include <linux/ima.h>
>  #include <linux/cred.h>
>  #include <linux/buffer_head.h>  /* for inode_has_buffers */
> +#include <linux/kinterval.h>
>  #include "internal.h"
>
>  /*
> @@ -279,7 +280,7 @@ void address_space_init_once(struct address_space *mapping)
>         spin_lock_init(&mapping->private_lock);
>         INIT_RAW_PRIO_TREE_ROOT(&mapping->i_mmap);
>         INIT_LIST_HEAD(&mapping->i_mmap_nonlinear);
> -       INIT_KINTERVAL_TREE_ROOT(&mapping->nocache_tree);
> +       INIT_KINTERVAL_TREE_ROOT((struct rb_root *) &mapping->nocache_tree);
>         rwlock_init(&mapping->nocache_lock);
>  }
>  EXPORT_SYMBOL(address_space_init_once);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 74b6a97..b4e45e6 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -9,7 +9,6 @@
>  #include <linux/limits.h>
>  #include <linux/ioctl.h>
>  #include <linux/blk_types.h>
> -#include <linux/kinterval.h>
>  #include <linux/types.h>
>
>  /*
> @@ -656,7 +655,7 @@ struct address_space {
>         spinlock_t              private_lock;   /* for use by the address_space */
>         struct list_head        private_list;   /* ditto */
>         struct address_space    *assoc_mapping; /* ditto */
> -       struct rb_root          nocache_tree;   /* noreuse cache range tree */
> +       void                    *nocache_tree;  /* noreuse cache range tree */
>         rwlock_t                nocache_lock;   /* protect the nocache_tree */
>  } __attribute__((aligned(sizeof(long))));
>         /*

mmh.. wouldn't a forward declaration of rb_root in fs.h be better than
this?

Thanks,
-Andrea

2012-02-15 23:50:19

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC] [PATCH v5 0/3] fadvise: support POSIX_FADV_NOREUSE

On Wed, 15 Feb 2012 02:35:24 +0100
Andrea Righi <[email protected]> wrote:

> On Tue, Feb 14, 2012 at 03:22:20PM -0800, Andrew Morton wrote:
> > On Tue, 14 Feb 2012 23:59:22 +0100
> > Andrea Righi <[email protected]> wrote:
> >
> > > On Tue, Feb 14, 2012 at 01:33:37PM -0800, Andrew Morton wrote:
> > > > On Sun, 12 Feb 2012 01:21:35 +0100
> > > > Andrea Righi <[email protected]> wrote:
> > > > And yes, a container-based approach is pretty crude, and one can
> > > > envision applications which only want modified reclaim policy for one
> > > > particular file. But I suspect an application-wide reclaim policy
> > > > solves 90% of the problems.
> > >
> > > I really like the container-based approach. But for this we need
> > > better file cache control in the memory cgroup; right now we have
> > > accounting of file pages, but there's no way to limit them.
> >
> > Again, if/when memcg becomes sufficiently useful for this application,
> > we're left maintaining the obsolete POSIX_FADV_NOREUSE forever.
>
> Yes, totally agree. For the future, a memcg-based solution is probably
> the best way to go.
>
> This reminds me of the old per-memcg dirty memory discussion
> (http://thread.gmane.org/gmane.linux.kernel.mm/67114), cc'ing Greg.
>
> Maybe the generic feature that could solve both problems is better file
> cache isolation in memcg.
>

Can you think of an example interface for us?
I'd like to discuss this at the mm-summit if we have a chance.

Thanks,
-Kame

2012-02-15 23:59:55

by Arun Sharma

[permalink] [raw]
Subject: Re: [PATCH v5 3/3] fadvise: implement POSIX_FADV_NOREUSE



On 2/15/12 3:47 PM, Andrea Righi wrote:
>> index 74b6a97..b4e45e6 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -9,7 +9,6 @@
>>  #include <linux/limits.h>
>>  #include <linux/ioctl.h>
>>  #include <linux/blk_types.h>
>> -#include <linux/kinterval.h>
>>  #include <linux/types.h>
>>
>>  /*
>> @@ -656,7 +655,7 @@ struct address_space {
>>         spinlock_t              private_lock;   /* for use by the address_space */
>>         struct list_head        private_list;   /* ditto */
>>         struct address_space    *assoc_mapping; /* ditto */
>> -       struct rb_root          nocache_tree;   /* noreuse cache range tree */
>> +       void                    *nocache_tree;  /* noreuse cache range tree */
>>         rwlock_t                nocache_lock;   /* protect the nocache_tree */
>>  } __attribute__((aligned(sizeof(long))));
>>         /*
>
> mmh.. wouldn't a forward declaration of rb_root in fs.h be better than
> this?
>

A forward declaration would work if the type were struct rb_root *. But
the type in your patch is a struct, and the compiler can't figure out
its size.

include/linux/fs.h:659:17: error: field 'nocache_tree' has incomplete type

Did you mean forward declaring struct rb_node instead of rb_root?

If we go down this path, a few more places need fixups (I ignored the
compiler warnings about casting void * to struct rb_root *).

-Arun

2012-02-16 00:43:50

by Andrea Righi

[permalink] [raw]
Subject: Re: [RFC] [PATCH v5 0/3] fadvise: support POSIX_FADV_NOREUSE

On Thu, Feb 16, 2012 at 08:48:31AM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 15 Feb 2012 02:35:24 +0100
> Andrea Righi <[email protected]> wrote:
>
> > On Tue, Feb 14, 2012 at 03:22:20PM -0800, Andrew Morton wrote:
> > > On Tue, 14 Feb 2012 23:59:22 +0100
> > > Andrea Righi <[email protected]> wrote:
> > >
> > > > On Tue, Feb 14, 2012 at 01:33:37PM -0800, Andrew Morton wrote:
> > > > > On Sun, 12 Feb 2012 01:21:35 +0100
> > > > > Andrea Righi <[email protected]> wrote:
> > > > > And yes, a container-based approach is pretty crude, and one can
> > > > > envision applications which only want modified reclaim policy for one
> > > > > particular file. But I suspect an application-wide reclaim policy
> > > > > solves 90% of the problems.
> > > >
> > > > I really like the container-based approach. But for this we need
> > > > better file cache control in the memory cgroup; right now we have
> > > > accounting of file pages, but there's no way to limit them.
> > >
> > > Again, if/when memcg becomes sufficiently useful for this application,
> > > we're left maintaining the obsolete POSIX_FADV_NOREUSE forever.
> >
> > Yes, totally agree. For the future, a memcg-based solution is probably
> > the best way to go.
> >
> > This reminds me of the old per-memcg dirty memory discussion
> > (http://thread.gmane.org/gmane.linux.kernel.mm/67114), cc'ing Greg.
> >
> > Maybe the generic feature that could solve both problems is better file
> > cache isolation in memcg.
> >
>
> Can you think of an example interface for us?
> I'd like to discuss this at the mm-summit if we have a chance.
>
> Thanks,
> -Kame

Sure! I'll try to write down more detailed ideas.

For now the best interface that I can see is to add something like
memory.file.* in cgroupfs.

The NOREUSE-like policy that I was trying to implement via fadvise()
could probably be implemented by setting memory.file.limit_in_bytes=0
(or using a very small value).

A cgroup like this could use any amount of memory (according to the
other memory.* settings), but it should drop any file cache page as soon
as possible, if the page was not present in memory before. IOW, this
cgroup shouldn't disturb the state of the page cache for the other
cgroups.

Another interesting usage is to provide different levels of service.
For example, using different values of memory.file.limit_in_bytes would
make it possible to specify that the file cache pages of certain
cgroups are reclaimed before others. This would be a very nice feature
IMHO, also for those who want to provide different levels of service
per-user.
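
For example, a backup job could then be confined from userspace with
something like this (just a sketch: the cgroup path and the
memory.file.limit_in_bytes knob itself are hypothetical at this point):

/*
 * Hypothetical usage of the proposed knob: put the backup job in a
 * cgroup whose file cache pages are dropped as soon as possible.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        const char *knob =
                "/sys/fs/cgroup/memory/backup/memory.file.limit_in_bytes";
        int fd = open(knob, O_WRONLY);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        /* 0 (or a very small value): drop file cache pages aggressively */
        if (write(fd, "0\n", 2) != 2)
                perror("write");
        close(fd);
        return 0;
}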

Thoughts?

Thanks,
-Andrea

2012-02-16 00:56:16

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH v5 3/3] fadvise: implement POSIX_FADV_NOREUSE

On Wed, Feb 15, 2012 at 03:57:47PM -0800, Arun Sharma wrote:
>
>
> On 2/15/12 3:47 PM, Andrea Righi wrote:
> >>index 74b6a97..b4e45e6 100644
> >>--- a/include/linux/fs.h
> >>+++ b/include/linux/fs.h
> >> @@ -9,7 +9,6 @@
> >>  #include <linux/limits.h>
> >>  #include <linux/ioctl.h>
> >>  #include <linux/blk_types.h>
> >> -#include <linux/kinterval.h>
> >>  #include <linux/types.h>
> >>
> >>  /*
> >> @@ -656,7 +655,7 @@ struct address_space {
> >>         spinlock_t              private_lock;   /* for use by the address_space */
> >>         struct list_head        private_list;   /* ditto */
> >>         struct address_space    *assoc_mapping; /* ditto */
> >> -       struct rb_root          nocache_tree;   /* noreuse cache range tree */
> >> +       void                    *nocache_tree;  /* noreuse cache range tree */
> >>         rwlock_t                nocache_lock;   /* protect the nocache_tree */
> >>  } __attribute__((aligned(sizeof(long))));
> >>         /*
> >
> >mmh.. wouldn't a forward declaration of rb_root in fs.h be better than
> >this?
> >
>
> A forward declaration would work if the type were struct rb_root *.
> But the type in your patch is a struct, and the compiler can't figure
> out its size.
>
> include/linux/fs.h:659:17: error: field 'nocache_tree' has incomplete type
>
> Did you mean forward declaring struct rb_node instead of rb_root?
>
> If we go down this path, a few more places need fixups (I ignored the
> compiler warnings about casting void * to struct rb_root *).
>
> -Arun

Oh sorry, you're right! nocache_tree is not a pointer inside
address_space, so the compiler must know its size.

mmh... move the definition of the rb_root struct into linux/types.h?
Or simply use an rb_root pointer. The (void *) looks a bit scary and
too bug-prone.
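
Just to spell out the problem for the archives, a minimal sketch
(nothing kernel-specific):

/*
 * An incomplete type can be pointed to, but not embedded by value,
 * because the compiler doesn't know its size.
 */
struct rb_root;                         /* forward declaration only */

struct embeds_by_value {
        /* struct rb_root tree; */      /* would not compile: size unknown */
        struct rb_root *tree;           /* fine: pointer size is known */
};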

-Andrea

2012-02-16 02:11:15

by Arun Sharma

[permalink] [raw]
Subject: Re: [PATCH v5 3/3] fadvise: implement POSIX_FADV_NOREUSE

On 2/15/12 4:56 PM, Andrea Righi wrote:

> Oh sorry, you're right! nocache_tree is not a pointer inside
> address_space, so the compiler must know its size.
>
> mmh... move the definition of the rb_root struct into linux/types.h?
> Or simply use an rb_root pointer. The (void *) looks a bit scary and
> too bug-prone.

Either way is fine. I did some black box testing of the patch (comparing
noreuse vs dontneed) and it behaves as expected.

On a file copy, neither one pollutes the page cache. But if I run a
random read benchmark on the source file right before and afterwards,
page cache is warm with noreuse, but cold with dontneed. Copy
performance was unaffected.

I can't really comment on the implementation details since I haven't
reviewed it, but the functionality sounds useful.

-Arun

2012-02-16 10:39:52

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH v5 3/3] fadvise: implement POSIX_FADV_NOREUSE

On Wed, Feb 15, 2012 at 06:10:28PM -0800, Arun Sharma wrote:
> On 2/15/12 4:56 PM, Andrea Righi wrote:
>
> >Oh sorry, you're right! nocache_tree is not a pointer inside
> >address_space, so the compiler must know its size.
> >
> >mmh... move the definition of the rb_root struct into linux/types.h?
> >Or simply use an rb_root pointer. The (void *) looks a bit scary and
> >too bug-prone.
>
> Either way is fine. I did some black box testing of the patch
> (comparing noreuse vs dontneed) and it behaves as expected.
>
> On a file copy, neither one pollutes the page cache. But if I run a
> random read benchmark on the source file right before and
> afterwards, page cache is warm with noreuse, but cold with dontneed.
> Copy performance was unaffected.
>
> I can't really comment on the implementation details since I haven't
> reviewed it, but the functionality sounds useful.
>
> -Arun

Arun, thank you very much for your review and testing. We'll probably
move to a different, memcg-based solution, so I don't think I'll post
another version of this patch set as is. If I do, I'll apply one of the
workarounds for the rb_root field.

Thanks,
-Andrea

2012-02-16 18:43:39

by Arun Sharma

[permalink] [raw]
Subject: Re: [PATCH v5 3/3] fadvise: implement POSIX_FADV_NOREUSE

On 2/16/12 2:39 AM, Andrea Righi wrote:

> Arun, thank you very much for your review and testing. We'll probably
> move to a different, memcg-based solution, so I don't think I'll post
> another version of this patch set as is. If I do, I'll apply one of the
> workarounds for the rb_root field.

I'm not sure if the proposed memory.file.limit_in_bytes is the right
interface. Two problems:

* The user is now required to figure out what is the right amount of
page cache for the app (may change over time)

* If the app touches two sets of files, one with streaming access and
the other which benefits from the page cache (e.g., a mapper task in a
map reduce), memcg doesn't allow the user to specify the access pattern
per-fd (see the sketch below).
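
E.g. a per-fd mix like this can't be expressed with a single
cgroup-wide limit (untested sketch, made-up paths, error handling
omitted):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        int log = open("/data/input/huge-log", O_RDONLY);   /* streamed once */
        int idx = open("/data/index/hot-index", O_RDONLY);  /* reused a lot */

        /* only the streaming fd gets the drop-behind hint;
         * len == 0 means "to the end of the file" */
        posix_fadvise(log, 0, 0, POSIX_FADV_NOREUSE);
        posix_fadvise(idx, 0, 0, POSIX_FADV_WILLNEED);

        /* ... the mapper task reads both ... */

        close(log);
        close(idx);
        return 0;
}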

-Arun

2012-02-16 18:58:02

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH v5 3/3] fadvise: implement POSIX_FADV_NOREUSE

On Thu, Feb 16, 2012 at 10:43:00AM -0800, Arun Sharma wrote:
> On 2/16/12 2:39 AM, Andrea Righi wrote:
>
> >Arun, thank you very much for your review and testing. We'll probably
> >move to a different, memcg-based solution, so I don't think I'll post
> >another version of this patch set as is. If I do, I'll apply one of the
> >workarounds for the rb_root field.
>
> I'm not sure if the proposed memory.file.limit_in_bytes is the right
> interface. Two problems:
>
> * The user is now required to figure out what is the right amount of
> page cache for the app (may change over time)

Right.

>
> * If the app touches two sets of files, one with streaming access
> and the other which benefits from page cache (eg: a mapper task in a
> map reduce), memcg doesn't allow the user to specify the access
> pattern per-fd.

Yes, of course the memcg approach is probably too coarse-grained for
certain apps.

If we want to provide per-fd granularity, the fadvise() solution is way
better. However, the memcg solution could be enough to resolve most of
the common data streaming issues (like the "backup is trashing the page
cache" problem), and it doesn't require modifying the application
source code. This is an important advantage that we shouldn't ignore
IMHO, because it means the new feature will be available _immediately_
to any application.

Maybe we should try to push ...something... into the memcg code for the
short term, make it as generic as possible, and for the long term try
to reuse the same feature (totally or in part) in the per-fd approach
via fadvise().

Thanks,
-Andrea

2012-02-16 19:08:16

by Arun Sharma

[permalink] [raw]
Subject: Re: [PATCH v5 3/3] fadvise: implement POSIX_FADV_NOREUSE

On 2/16/12 10:57 AM, Andrea Righi wrote:

> Maybe we should try to push ...something... into the memcg code for the
> short term, make it as generic as possible, and for the long term try
> to reuse the same feature (totally or in part) in the per-fd approach
> via fadvise().

Yes - the two approaches are complementary and we should probably pursue
both.

There are a number of apps which are already using fadvise though:

https://issues.apache.org/jira/browse/MAPREDUCE-3289
http://highscalability.com/blog/2012/1/12/peregrine-a-map-reduce-framework-for-iterative-and-pipelined.html

and probably many other similar cases that are not open source.

Some of these apps may be better off using NOREUSE instead of DONTNEED,
since they may not have a clue on what else is going on in the system.

The way I think about it: NOREUSE is a statement about what my process
is doing and DONTNEED is a statement about the entire system.
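
Roughly (untested sketch, assuming NOREUSE behaves as proposed in this
thread; the only difference between the two variants is which hint is
issued, and when):

#include <fcntl.h>
#include <unistd.h>

static int stream_copy(int src, int dst, off_t size, int advice)
{
        char buf[8192];
        ssize_t n;

        /* NOREUSE up front: "this process reads the data once"; pages
         * cached by someone else beforehand stay warm. */
        if (advice == POSIX_FADV_NOREUSE)
                posix_fadvise(src, 0, size, POSIX_FADV_NOREUSE);

        while ((n = read(src, buf, sizeof(buf))) > 0)
                if (write(dst, buf, n) != n)
                        return -1;

        /* DONTNEED afterwards: "nobody needs these pages"; they are
         * dropped for the whole system, warm or not. */
        if (advice == POSIX_FADV_DONTNEED)
                posix_fadvise(src, 0, size, POSIX_FADV_DONTNEED);

        return n < 0 ? -1 : 0;
}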

-Arun

2012-02-27 02:35:18

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH v5 3/3] fadvise: implement POSIX_FADV_NOREUSE

On Sun, 12 Feb 2012 01:21:38 +0100
Andrea Righi <[email protected]> wrote:

> According to the POSIX standard the POSIX_FADV_NOREUSE hint means that
> the application expects to access the specified data once and then not
> reuse it thereafter.
>
> It seems that the expected behavior is to implement a drop-behind
> policy where the application can set certain intervals of a file as
> FADV_NOREUSE _before_ accessing the data.
>
> An interesting usage of this hint is to guarantee that pages marked as
> FADV_NOREUSE will never blow away the pages of the current working set.
>
> A possible solution to satisfy this requirement is to prevent lru
> activation of the pages marked as FADV_NOREUSE, in other words, never
> add pages marked as FADV_NOREUSE to the active lru list. Moreover, all
> the file cache pages in a FADV_NOREUSE range can be immediately dropped
> after a read if the page was not present in the file cache before.
>
> In general, the purpose of this approach is to preserve as much as
> possible the previous state of the file cache memory before reading data
> in ranges marked by FADV_NOREUSE.
>
> All the pages read before (pre-)setting them as FADV_NOREUSE should be
> treated as normal, so they can be added to the active lru list as usual
> if they're accessed multiple times.
>
> Only after setting them as FADV_NOREUSE can we prevent them from being
> promoted to the active lru list. If they are already in the active lru
> list before calling FADV_NOREUSE we should keep them there, but if they
> quit from the active list they can't get back anymore (except by
> explicitly setting a different caching hint).
>

From this part, it seems the behavior of the system call depends
heavily on the internal kernel implementation...


> To achieve this goal we need to maintain the list of file ranges marked
> as FADV_NOREUSE until the pages are dropped from the page cache, or the
> inode is deleted, or they're explicitly marked to use a different cache
> behavior (FADV_NORMAL | FADV_WILLNEED).
>
> The list of FADV_NOREUSE ranges is maintained in the address_space
> structure using an interval tree (kinterval).
>
> Signed-off-by: Andrea Righi <[email protected]>


Once an application sets a range of a file as FILEMAP_CACHE_ONCE,
the effects will last until the inode is dropped... right?
Won't this cause troubles which cannot be detected
(because the kinterval information is hidden)?

I'm not sure, but FADV_NOREUSE seems like a one-shot call and should
not have a very long-lasting effect (after the application exits).
Can't we tie the lifetime of the kinterval to the application/file
descriptor?

Thanks,
-Kame

> ---
> fs/inode.c | 3 ++
> include/linux/fs.h | 12 ++++++
> mm/fadvise.c | 18 +++++++++-
> mm/filemap.c | 95 ++++++++++++++++++++++++++++++++++++++++++++++++++-
> 4 files changed, 125 insertions(+), 3 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index fb10d86..6375163 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -254,6 +254,7 @@ void __destroy_inode(struct inode *inode)
>         if (inode->i_default_acl && inode->i_default_acl != ACL_NOT_CACHED)
>                 posix_acl_release(inode->i_default_acl);
>  #endif
> +        filemap_clear_cache(&inode->i_data);
>         this_cpu_dec(nr_inodes);
>  }
>  EXPORT_SYMBOL(__destroy_inode);
> @@ -360,6 +361,8 @@ void address_space_init_once(struct address_space *mapping)
>         spin_lock_init(&mapping->private_lock);
>         INIT_RAW_PRIO_TREE_ROOT(&mapping->i_mmap);
>         INIT_LIST_HEAD(&mapping->i_mmap_nonlinear);
> +        INIT_KINTERVAL_TREE_ROOT(&mapping->nocache_tree);
> +        rwlock_init(&mapping->nocache_lock);
>  }
>  EXPORT_SYMBOL(address_space_init_once);
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 386da09..624a73e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -9,6 +9,7 @@
>  #include <linux/limits.h>
>  #include <linux/ioctl.h>
>  #include <linux/blk_types.h>
> +#include <linux/kinterval.h>
>  #include <linux/types.h>
>
>  /*
> @@ -521,6 +522,11 @@ enum positive_aop_returns {
>                                          * helper code (eg buffer layer)
>                                          * to clear GFP_FS from alloc */
>
> +enum filemap_cache_modes {
> +        FILEMAP_CACHE_NORMAL,   /* No special cache behavior */
> +        FILEMAP_CACHE_ONCE,     /* Pages will be used once */
> +};
> +
>  /*
>   * oh the beauties of C type declarations.
>   */
> @@ -655,6 +661,8 @@ struct address_space {
>         spinlock_t              private_lock;   /* for use by the address_space */
>         struct list_head        private_list;   /* ditto */
>         struct address_space    *assoc_mapping; /* ditto */
> +       struct rb_root          nocache_tree;   /* noreuse cache range tree */
> +       rwlock_t                nocache_lock;   /* protect the nocache_tree */
>  } __attribute__((aligned(sizeof(long))));
>         /*
>          * On most architectures that alignment is already the case; but
> @@ -2189,6 +2197,10 @@ extern int invalidate_inode_pages2(struct address_space *mapping);
>  extern int invalidate_inode_pages2_range(struct address_space *mapping,
>                                           pgoff_t start, pgoff_t end);
>  extern int write_inode_now(struct inode *, int);
> +extern void filemap_clear_cache(struct address_space *mapping);
> +extern int filemap_set_cache(struct address_space *mapping,
> +                             pgoff_t start, pgoff_t end, int mode);
> +extern int filemap_get_cache(struct address_space *mapping, pgoff_t index);
>  extern int filemap_fdatawrite(struct address_space *);
>  extern int filemap_flush(struct address_space *);
>  extern int filemap_fdatawait(struct address_space *);
> diff --git a/mm/fadvise.c b/mm/fadvise.c
> index 469491e..22b1aa8 100644
> --- a/mm/fadvise.c
> +++ b/mm/fadvise.c
> @@ -80,6 +80,12 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
>                 spin_lock(&file->f_lock);
>                 file->f_mode &= ~FMODE_RANDOM;
>                 spin_unlock(&file->f_lock);
> +
> +               start_index = offset >> PAGE_CACHE_SHIFT;
> +               end_index = endbyte >> PAGE_CACHE_SHIFT;
> +
> +               ret = filemap_set_cache(mapping, start_index, end_index,
> +                                       FILEMAP_CACHE_NORMAL);
>                 break;
>         case POSIX_FADV_RANDOM:
>                 spin_lock(&file->f_lock);
> @@ -102,11 +108,16 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
>                 start_index = offset >> PAGE_CACHE_SHIFT;
>                 end_index = endbyte >> PAGE_CACHE_SHIFT;
>
> +               ret = filemap_set_cache(mapping, start_index, end_index,
> +                                       FILEMAP_CACHE_NORMAL);
> +               if (ret < 0)
> +                       break;
> +
>                 /* Careful about overflow on the "+1" */
>                 nrpages = end_index - start_index + 1;
>                 if (!nrpages)
>                         nrpages = ~0UL;
> -
> +
>                 ret = force_page_cache_readahead(mapping, file,
>                                                  start_index,
>                                                  nrpages);
> @@ -114,6 +125,11 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
>                         ret = 0;
>                 break;
>         case POSIX_FADV_NOREUSE:
> +               start_index = offset >> PAGE_CACHE_SHIFT;
> +               end_index = endbyte >> PAGE_CACHE_SHIFT;
> +
> +               ret = filemap_set_cache(mapping, start_index, end_index,
> +                                       FILEMAP_CACHE_ONCE);
>                 break;
>         case POSIX_FADV_DONTNEED:
>                 if (!bdi_write_congested(mapping->backing_dev_info))
> diff --git a/mm/filemap.c b/mm/filemap.c
> index b662757..610502a 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -27,6 +27,7 @@
>  #include <linux/writeback.h>
>  #include <linux/backing-dev.h>
>  #include <linux/pagevec.h>
> +#include <linux/kinterval.h>
>  #include <linux/blkdev.h>
>  #include <linux/security.h>
>  #include <linux/syscalls.h>
> @@ -187,6 +188,82 @@ static int sleep_on_page_killable(void *word)
>  }
>
>  /**
> + * filemap_clear_cache - clear all special cache settings
> + * @mapping: target address space structure
> + *
> + * Reset all the special file cache settings previously defined by
> + * filemap_set_cache().
> + */
> +void filemap_clear_cache(struct address_space *mapping)
> +{
> +        write_lock(&mapping->nocache_lock);
> +        kinterval_clear(&mapping->nocache_tree);
> +        write_unlock(&mapping->nocache_lock);
> +}
> +
> +/**
> + * filemap_set_cache - set special cache behavior for a range of file pages
> + * @mapping: target address space structure
> + * @start: offset in pages where the range starts
> + * @end: offset in pages where the range ends (inclusive)
> + * @mode: cache behavior configuration
> + *
> + * This can be used to define special cache behavior in advance, before
> + * accessing the data.
> + *
> + * At the moment the supported cache behaviors are the following (see also
> + * filemap_cache_modes):
> + *
> + * FILEMAP_CACHE_NORMAL: normal page cache behavior;
> + *
> + * FILEMAP_CACHE_ONCE: specifies that the pages will be accessed once and
> + * the caller doesn't expect to reuse them thereafter. This prevents them
> + * from being promoted to the active lru list.
> + */
> +int filemap_set_cache(struct address_space *mapping,
> +                      pgoff_t start, pgoff_t end, int mode)
> +{
> +        int ret;
> +
> +        write_lock(&mapping->nocache_lock);
> +        switch (mode) {
> +        case FILEMAP_CACHE_NORMAL:
> +                ret = kinterval_del(&mapping->nocache_tree,
> +                                    start, end, GFP_KERNEL);
> +                break;
> +        case FILEMAP_CACHE_ONCE:
> +                ret = kinterval_add(&mapping->nocache_tree,
> +                                    start, end, mode, GFP_KERNEL);
> +                break;
> +        default:
> +                ret = -EINVAL;
> +                break;
> +        }
> +        write_unlock(&mapping->nocache_lock);
> +
> +        return ret;
> +}
> +
> +/**
> + * filemap_get_cache - get special cache behavior of a file page
> + * @mapping: target address space structure
> + * @index: index of the page inside the address space
> + *
> + * If no special cache behavior is defined, FILEMAP_CACHE_NORMAL is
> + * returned (that means no special page cache behavior is applied).
> + */
> +int filemap_get_cache(struct address_space *mapping, pgoff_t index)
> +{
> +        int mode;
> +
> +        read_lock(&mapping->nocache_lock);
> +        mode = kinterval_lookup(&mapping->nocache_tree, index);
> +        read_unlock(&mapping->nocache_lock);
> +
> +        return mode < 0 ? FILEMAP_CACHE_NORMAL : mode;
> +}
> +
> +/**
>   * __filemap_fdatawrite_range - start writeback on mapping dirty pages in range
>   * @mapping: address space structure to write
>   * @start: offset in bytes where the range starts
> @@ -1181,8 +1258,22 @@ page_ok:
>                   * When a sequential read accesses a page several times,
>                   * only mark it as accessed the first time.
>                   */
> -                if (prev_index != index || offset != prev_offset)
> -                        mark_page_accessed(page);
> +                if (prev_index != index || offset != prev_offset) {
> +                        int mode;
> +
> +                        mode = filemap_get_cache(mapping, index);
> +                        switch (mode) {
> +                        case FILEMAP_CACHE_NORMAL:
> +                                mark_page_accessed(page);
> +                                break;
> +                        case FILEMAP_CACHE_ONCE:
> +                                mark_page_usedonce(page);
> +                                break;
> +                        default:
> +                                WARN_ON_ONCE(1);
> +                                break;
> +                        }
> +                }
>                 prev_index = index;
>
>                 /*
> --
> 1.7.5.4
>

2012-02-27 10:47:01

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH v5 3/3] fadvise: implement POSIX_FADV_NOREUSE

On Mon, Feb 27, 2012 at 11:33:38AM +0900, KAMEZAWA Hiroyuki wrote:
> On Sun, 12 Feb 2012 01:21:38 +0100
> Andrea Righi <[email protected]> wrote:
>
> > According to the POSIX standard the POSIX_FADV_NOREUSE hint means that
> > the application expects to access the specified data once and then not
> > reuse it thereafter.
> >
> > It seems that the expected behavior is to implement a drop-behind
> > policy where the application can set certain intervals of a file as
> > FADV_NOREUSE _before_ accessing the data.
> >
> > An interesting usage of this hint is to guarantee that pages marked as
> > FADV_NOREUSE will never blow away the pages of the current working set.
> >
> > A possible solution to satisfy this requirement is to prevent lru
> > activation of the pages marked as FADV_NOREUSE, in other words, never
> > add pages marked as FADV_NOREUSE to the active lru list. Moreover, all
> > the file cache pages in a FADV_NOREUSE range can be immediately dropped
> > after a read if the page was not present in the file cache before.
> >
> > In general, the purpose of this approach is to preserve as much as
> > possible the previous state of the file cache memory before reading data
> > in ranges marked by FADV_NOREUSE.
> >
> > All the pages read before (pre-)setting them as FADV_NOREUSE should be
> > treated as normal, so they can be added to the active lru list as usual
> > if they're accessed multiple times.
> >
> > Only after setting them as FADV_NOREUSE can we prevent them from being
> > promoted to the active lru list. If they are already in the active lru
> > list before calling FADV_NOREUSE we should keep them there, but if they
> > quit from the active list they can't get back anymore (except by
> > explicitly setting a different caching hint).
> >
>
> From this part, it seems the behavior of the system call depends
> heavily on the internal kernel implementation...

Yes. If a future kernel decides to remove the active/inactive lru
lists, we'll also need to change the implementation of FADV_NOREUSE.

However, I think that any feature that tries not to disturb the state
of the page cache will inevitably depend somewhat on the internal
kernel implementation...

Probably a more generic concept to document is that FADV_NOREUSE is
advice from the application to never consider the marked pages as part
of the working set (for any possible meaning/implementation of "working
set").

>
>
> > To achieve this goal we need to maintain the list of file ranges marked
> > as FADV_NOREUSE until the pages are dropped from the page cache, or the
> > inode is deleted, or they're explicitly marked to use a different cache
> > behavior (FADV_NORMAL | FADV_WILLNEED).
> >
> > The list of FADV_NOREUSE ranges is maintained in the address_space
> > structure using an interval tree (kinterval).
> >
> > Signed-off-by: Andrea Righi <[email protected]>
>
>
> Once an application sets a range of a file as FILEMAP_CACHE_ONCE,
> the effects will last until the inode is dropped... right?
> Won't this cause troubles which cannot be detected
> (because the kinterval information is hidden)?
>
> I'm not sure, but FADV_NOREUSE seems like a one-shot call and should
> not have a very long-lasting effect (after the application exits).
> Can't we tie the lifetime of the kinterval to the application/file
> descriptor?

Yes, I'm also concerned about this. Using FADV_NOREUSE may also affect
the page cache behavior of other applications.

I like the idea of tying the FADV_NOREUSE ranges to the file
descriptor. In addition to the shorter lifetime, it has the advantage
that the policy is applied only to the application that is actually
using this feature, and not to the other apps running in the system.
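
Something like this, roughly (all names invented, just to fix the
idea):

/*
 * Hypothetical per-fd variant: the NOREUSE ranges hang off the struct
 * file instead of the address_space, so the hint dies with the
 * descriptor. The f_nocache_* fields and the release hook are invented.
 */
static void file_clear_cache_hints(struct file *filp)
{
        write_lock(&filp->f_nocache_lock);
        kinterval_clear(&filp->f_nocache_tree);
        write_unlock(&filp->f_nocache_lock);
}

/* called from __fput(), once the last reference to the fd is dropped */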

I'll definitely consider this possibility if I post a new version of
this patch set.

Thanks!
-Andrea

2014-01-02 21:25:48

by Phillip Susi

[permalink] [raw]
Subject: Re: [RFC] [PATCH v5 0/3] fadvise: support POSIX_FADV_NOREUSE

What ever happened to this patch set? It looks like a great idea to
me and I'd *really* like to see this flag implemented.