2012-02-10 00:17:07

by John Stultz

[permalink] [raw]
Subject: [PATCH 1/2] [RFC] Range tree implementation

After Andrew suggested something like his mumbletree idea
to better store a list of ranges, I worked on a few different
approaches, and this is what I've finally managed to get working.

I suspect range-tree isn't a totally accurate name, but I
couldn't quite make out the difference between range trees
and interval trees, so I just picked one to call it. Do
let me know if you have a better name.

The idea of storing ranges in a tree is nice, but has a number
of complications. When adding a range, its possible that a
large range will consume and merge a number of smaller ranges.
When removing a range, its possible you may end up splitting an
existing range, causing one range to become two. This makes it
very difficult to provide generic list_head like behavior, as
the parent structures would need to be duplicated and removed,
and that has lots of memory ownership issues.

So, this is a much simplified and more list_head like
implementation. You can add a node to a tree, or remove a node
to a tree, but the generic implementation doesn't do the
merging or splitting for you. But it does provide helpers to
find overlapping and adjacent ranges.

I made the tree self-balancing via splaying as it seemed easier
to handle with the merging/splitting cases I originally tried to
make the generic code handle, but since I've dropped that, I
suspect it can be reworked to use a rbtree. I just wanted to get
this out for initial review.

Andrew also really wanted this range-tree implementation to be
resuable so we don't duplicate the file locking logic. I'm not
totally convinced that the requirements between the volatile
ranges and file locking are really equivelent, but this reduced
impelementation may make it possible.

Do let me know what you think or if you have other ideas for
better ways to do the same.

thanks
-john

CC: Andrew Morton <[email protected]>
CC: Android Kernel Team <[email protected]>
CC: Robert Love <[email protected]>
CC: Mel Gorman <[email protected]>
CC: Hugh Dickins <[email protected]>
CC: Dave Hansen <[email protected]>
CC: Rik van Riel <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
include/linux/rangetree.h | 35 +++++
lib/Makefile | 2 +-
lib/rangetree.c | 325 +++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 361 insertions(+), 1 deletions(-)
create mode 100644 include/linux/rangetree.h
create mode 100644 lib/rangetree.c

diff --git a/include/linux/rangetree.h b/include/linux/rangetree.h
new file mode 100644
index 0000000..998ebcc
--- /dev/null
+++ b/include/linux/rangetree.h
@@ -0,0 +1,35 @@
+#ifndef _LINUX_RANGETREE_H
+#define _LINUX_RANGETREE_H
+
+#include <linux/types.h>
+#include <linux/fs.h>
+
+struct range_tree_node {
+ struct range_tree_node *parent;
+ struct range_tree_node *left;
+ struct range_tree_node *right;
+ u64 start;
+ u64 end;
+};
+
+static inline void range_tree_node_init(struct range_tree_node *node)
+{
+ node->parent = NULL;
+ node->left = NULL;
+ node->right = NULL;
+ node->start = 0;
+ node->end = 0;
+}
+
+extern struct range_tree_node *range_tree_in_range(struct range_tree_node *root,
+ u64 start, u64 end);
+extern struct range_tree_node *range_tree_in_range_adjacent(
+ struct range_tree_node *root,
+ u64 start, u64 end);
+extern struct range_tree_node *range_tree_add(struct range_tree_node *root,
+ struct range_tree_node *node);
+extern struct range_tree_node *range_tree_remove(struct range_tree_node *root,
+ struct range_tree_node *node);
+#endif
+
+
diff --git a/lib/Makefile b/lib/Makefile
index 18515f0..f43ef0d 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
idr.o int_sqrt.o extable.o prio_tree.o \
sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
proportions.o prio_heap.o ratelimit.o show_mem.o \
- is_single_threaded.o plist.o decompress.o
+ is_single_threaded.o plist.o decompress.o rangetree.o

lib-$(CONFIG_MMU) += ioremap.o
lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/rangetree.c b/lib/rangetree.c
new file mode 100644
index 0000000..db20665
--- /dev/null
+++ b/lib/rangetree.c
@@ -0,0 +1,325 @@
+#include <linux/rangetree.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+
+/**
+ * rotate_right - Splay tree helper
+ * @node: node to be rotated right
+ * @root: tree root
+ *
+ * Returns the tree root after rotating the node right
+ */
+static struct range_tree_node *rotate_right(struct range_tree_node *node,
+ struct range_tree_node *root)
+{
+ struct range_tree_node *new_root, *new_right;
+
+ if (!node->left)
+ return root;
+
+ new_root = node->left;
+ new_right = node;
+
+ if (root == node)
+ root = new_root;
+
+ new_right->left = new_root->right;
+ new_root->parent = new_right->parent;
+ if (new_root->parent) {
+ if (new_root->parent->right == new_right)
+ new_root->parent->right = new_root;
+ else
+ new_root->parent->left = new_root;
+ }
+ new_right->parent = new_root;
+
+ new_root->right = new_right;
+
+ if (new_right->left)
+ new_right->left->parent = new_right;
+
+
+ /* Paranoid sanity checking */
+ if (new_root->left)
+ BUG_ON(new_root->left->parent != new_root);
+ if (new_root->right)
+ BUG_ON(new_root->right->parent != new_root);
+ if (new_right->left)
+ BUG_ON(new_right->left->parent != new_right);
+ if (new_right->right)
+ BUG_ON(new_right->right->parent != new_right);
+
+
+ return root;
+
+}
+
+/**
+ * rotate_left - Splay tree helper
+ * @node: node to be rotated left
+ * @root: tree root
+ *
+ * Returns the tree root after rotating the node left
+ */
+static struct range_tree_node *rotate_left(struct range_tree_node *node,
+ struct range_tree_node *root)
+{
+ struct range_tree_node *new_root, *new_left;
+
+ if (!node->right)
+ return root;
+
+ new_root = node->right;
+ new_left = node;
+
+ if (root == node)
+ root = new_root;
+
+ new_left->right = new_root->left;
+ if (new_left->right)
+ new_left->right->parent = new_left;
+ new_root->parent = new_left->parent;
+ if (new_root->parent) {
+ if (new_root->parent->left == new_left)
+ new_root->parent->left = new_root;
+ else
+ new_root->parent->right = new_root;
+ }
+ new_left->parent = new_root;
+ new_root->left = new_left;
+
+
+ /* Paranoid sanity checking */
+ if (new_root->left)
+ BUG_ON(new_root->left->parent != new_root);
+ if (new_root->right)
+ BUG_ON(new_root->right->parent != new_root);
+ if (new_left->left)
+ BUG_ON(new_left->left->parent != new_left);
+ if (new_left->right)
+ BUG_ON(new_left->right->parent != new_left);
+
+ return root;
+}
+
+/**
+ * splay_tree Splays a node to the top of a tree
+ * @root: root of the splay tree
+ * @node: node to be splayed to the root
+ *
+ * Returns the root of a tree after splaying
+ */
+static struct range_tree_node *splay_tree(struct range_tree_node *root,
+ struct range_tree_node *node)
+{
+restart:
+ if (root == node)
+ return root;
+
+ if (node->parent == root) {
+ if (root->left == node)
+ root = rotate_right(root, root);
+ else
+ root = rotate_left(root, root);
+ return root;
+ } else {
+ struct range_tree_node *parent, *grandparent;
+
+ parent = node->parent;
+ grandparent = parent->parent;
+
+ if ((node == parent->left) && (parent == grandparent->left)) {
+ root = rotate_right(grandparent, root);
+ root = rotate_right(parent, root);
+ } else if ((node == parent->right) &&
+ (parent == grandparent->right)) {
+ root = rotate_left(grandparent, root);
+ root = rotate_left(parent, root);
+ } else if ((node == parent->right) &&
+ (parent == grandparent->left)) {
+ root = rotate_left(parent, root);
+ root = rotate_right(grandparent, root);
+ } else if ((node == parent->left) &&
+ (parent == grandparent->right)) {
+ root = rotate_right(parent, root);
+ root = rotate_left(grandparent, root);
+ } else {
+ BUG_ON(1); /* Something is odd */
+ }
+ goto restart;
+ }
+}
+
+
+/**
+ * range_tree_in_range - Returns the first node that overlaps with the
+ * given range
+ * @root: range_tree root
+ * @start: range start
+ * @end: range end
+ *
+ */
+struct range_tree_node *range_tree_in_range(struct range_tree_node *root,
+ u64 start, u64 end)
+{
+ struct range_tree_node *candidate = root;
+
+ if (!candidate)
+ return 0;
+restart:
+ if (end < candidate->start) {
+ if (candidate->left) {
+ candidate = candidate->left;
+ goto restart;
+ }
+ } else if (start > candidate->end) {
+ if (candidate->right) {
+ candidate = candidate->right;
+ goto restart;
+ }
+ } else
+ return candidate;
+ return 0;
+}
+
+
+/**
+ * range_tree_in_range - Returns the first node that overlaps or is adjacent
+ * with the given range
+ * @root: range_tree root
+ * @start: range start
+ * @end: range end
+ *
+ */
+struct range_tree_node *range_tree_in_range_adjacent(
+ struct range_tree_node *root,
+ u64 start, u64 end)
+{
+ struct range_tree_node *candidate = root;
+
+ if (!candidate)
+ return 0;
+restart:
+ if (end + 1 < candidate->start) {
+ if (candidate->left) {
+ candidate = candidate->left;
+ goto restart;
+ }
+ } else if (start > candidate->end + 1) {
+ if (candidate->right) {
+ candidate = candidate->right;
+ goto restart;
+ }
+ } else
+ return candidate;
+ return 0;
+}
+
+/**
+ * range_tree_add - Add a node to a range tree
+ * @root: range tree to be added to
+ * @node: range_tree_node to be added
+ *
+ * Adds a node to the range tree.
+ */
+struct range_tree_node *range_tree_add(struct range_tree_node *root,
+ struct range_tree_node *node)
+{
+ struct range_tree_node *candidate;
+ /* make sure its not connected */
+ BUG_ON(node->parent || node->left || node->right);
+
+ if (!root)
+ return node;
+
+ candidate = root;
+restart:
+ if (node->start < candidate->start) {
+ if (candidate->left) {
+ candidate = candidate->left;
+ goto restart;
+ }
+ candidate->left = node;
+ node->parent = candidate;
+ } else if (node->start > candidate->start) {
+ if (candidate->right) {
+ candidate = candidate->right;
+ goto restart;
+ }
+ candidate->right = node;
+ node->parent = candidate;
+ }
+
+ root = splay_tree(root, node);
+ return root;
+}
+
+/**
+ * range_tree_merge - Helper to merge two range sub-trees
+ * @left: left subtree to be merged
+ * @right: right subtree to be merged
+ *
+ * Returns a merged range tree of two subtrees. left subtree
+ * must be all less then the right subtree.
+ */
+static struct range_tree_node *range_tree_merge(struct range_tree_node *left,
+ struct range_tree_node *right)
+{
+ struct range_tree_node *merge;
+
+ if (!left)
+ return right;
+ if (!right)
+ return left;
+
+ merge = left;
+ /* grab the right-most node on the left side */
+ while (merge->right)
+ merge = merge->right;
+ merge->right = right;
+ if (right)
+ right->parent = merge;
+
+ return left;
+}
+
+/**
+ * null_node: Helper that clears node data
+ * @node: Node to be cleared
+ */
+static void null_node(struct range_tree_node *node)
+{
+ node->left = node->right = node->parent = NULL;
+ node->start = node->end = 0;
+}
+
+/**
+ * range_tree_remove: Removes a given node from the tree
+ * @root: root of tree
+ * @node: Node to be removed
+ *
+ * Removes a node and splays the tree
+ */
+struct range_tree_node *range_tree_remove(struct range_tree_node *root,
+ struct range_tree_node *node)
+{
+ struct range_tree_node *subtree;
+
+ subtree = range_tree_merge(node->left, node->right);
+
+ if (subtree)
+ subtree->parent = node->parent;
+
+ if (node->parent && node->parent->left == node)
+ node->parent->left = subtree;
+ if (node->parent && node->parent->right == node)
+ node->parent->right = subtree;
+
+ null_node(node);
+ if (node == root)
+ return subtree;
+
+ if (subtree)
+ root = splay_tree(root, subtree);
+ return root;
+}
--
1.7.3.2.146.gca209


2012-02-10 00:18:06

by John Stultz

[permalink] [raw]
Subject: [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags

This patch provides new fadvise flags that can be used to mark
file pages as volatile, which will allow it to be discarded if the
kernel wants to reclaim memory.

This is useful for userspace to allocate things like caches, and lets
the kernel destructively (but safely) reclaim them when there's memory
pressure.

It's different from FADV_DONTNEED since the pages are not immediately
discarded; they are only discarded under pressure.

This is very much influenced by the Android Ashmem interface by
Robert Love so credits to him and the Android developers.
In many cases the code & logic come directly from the ashmem patch.
The intent of this patch is to allow for ashmem-like behavior, but
embeds the idea a little deeper into the VM code, instead of isolating
it into a specific driver.

I'm very much a newbie at the VM code, so At this point, I just want
to try to get some input on the patch, so if you have another idea
for using something other then fadvise, or other thoughts on how the
volatile ranges are stored, I'd be really interested in hearing them.
So let me know if you have any comments for feedback!

Also many thanks to Dave Hansen who helped design and develop the
initial version of this patch, and has provided continued review and
mentoring for me in the VM code.

v2:
After the valid critique that just dropping pages would poke holes
in volatile ranges, and instead we should zap an entire range if we
drop any of it, I changed the code to more closely mimic the ashmem
implementation, which zaps entire ranges via a shrinker using an lru
list that tracks which range has been marked volatile the longest.

v3:
Reworked to use range tree implementation.

Known issues:
* Not sure how to nicely error out if we get ENOMEM while splitting
a range.
* Lockdep doesn't like calling vmtruncate_range() from a shrinker.
Any help here on how to address this would be appreciated.

CC: Andrew Morton <[email protected]>
CC: Android Kernel Team <[email protected]>
CC: Robert Love <[email protected]>
CC: Mel Gorman <[email protected]>
CC: Hugh Dickins <[email protected]>
CC: Dave Hansen <[email protected]>
CC: Rik van Riel <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
fs/inode.c | 5 +
include/linux/fadvise.h | 6 +
include/linux/fs.h | 3 +
include/linux/volatile.h | 14 ++
mm/Makefile | 2 +-
mm/fadvise.c | 22 +++-
mm/volatile.c | 314 ++++++++++++++++++++++++++++++++++++++++++++++
7 files changed, 364 insertions(+), 2 deletions(-)
create mode 100644 include/linux/volatile.h
create mode 100644 mm/volatile.c

diff --git a/fs/inode.c b/fs/inode.c
index fb10d86..0675962 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -27,6 +27,7 @@
#include <linux/cred.h>
#include <linux/buffer_head.h> /* for inode_has_buffers */
#include <linux/ratelimit.h>
+#include <linux/volatile.h>
#include "internal.h"

/*
@@ -254,6 +255,7 @@ void __destroy_inode(struct inode *inode)
if (inode->i_default_acl && inode->i_default_acl != ACL_NOT_CACHED)
posix_acl_release(inode->i_default_acl);
#endif
+ mapping_clear_volatile_ranges(&inode->i_data);
this_cpu_dec(nr_inodes);
}
EXPORT_SYMBOL(__destroy_inode);
@@ -360,6 +362,9 @@ void address_space_init_once(struct address_space *mapping)
spin_lock_init(&mapping->private_lock);
INIT_RAW_PRIO_TREE_ROOT(&mapping->i_mmap);
INIT_LIST_HEAD(&mapping->i_mmap_nonlinear);
+ mapping->volatile_root = NULL;
+ mutex_init(&mapping->vlist_mutex);
+
}
EXPORT_SYMBOL(address_space_init_once);

diff --git a/include/linux/fadvise.h b/include/linux/fadvise.h
index e8e7471..988fb00 100644
--- a/include/linux/fadvise.h
+++ b/include/linux/fadvise.h
@@ -18,4 +18,10 @@
#define POSIX_FADV_NOREUSE 5 /* Data will be accessed once. */
#endif

+#define POSIX_FADV_VOLATILE 8 /* _can_ toss, but don't toss now */
+#define POSIX_FADV_NONVOLATILE 9 /* Remove VOLATILE flag */
+#define POSIX_FADV_ISVOLATILE 10 /* Returns volatile flag for region */
+
+
+
#endif /* FADVISE_H_INCLUDED */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 386da09..a784a9b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -10,6 +10,7 @@
#include <linux/ioctl.h>
#include <linux/blk_types.h>
#include <linux/types.h>
+#include <linux/rangetree.h>

/*
* It's silly to have NR_OPEN bigger than NR_FILE, but you can change
@@ -655,6 +656,8 @@ struct address_space {
spinlock_t private_lock; /* for use by the address_space */
struct list_head private_list; /* ditto */
struct address_space *assoc_mapping; /* ditto */
+ struct range_tree_node *volatile_root; /* volatile range list */
+ struct mutex vlist_mutex; /* protect volatile_list */
} __attribute__((aligned(sizeof(long))));
/*
* On most architectures that alignment is already the case; but
diff --git a/include/linux/volatile.h b/include/linux/volatile.h
new file mode 100644
index 0000000..5460d7b
--- /dev/null
+++ b/include/linux/volatile.h
@@ -0,0 +1,14 @@
+#ifndef _LINUX_VOLATILE_H
+#define _LINUX_VOLATILE_H
+
+#include <linux/fs.h>
+
+extern long mapping_range_volatile(struct address_space *mapping,
+ pgoff_t start_index, pgoff_t end_index);
+extern long mapping_range_nonvolatile(struct address_space *mapping,
+ pgoff_t start_index, pgoff_t end_index);
+extern long mapping_range_isvolatile(struct address_space *mapping,
+ pgoff_t start_index, pgoff_t end_index);
+extern void mapping_clear_volatile_ranges(struct address_space *mapping);
+
+#endif /* _LINUX_VOLATILE_H */
diff --git a/mm/Makefile b/mm/Makefile
index 50ec00e..7b6c7a8 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -13,7 +13,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
readahead.o swap.o truncate.o vmscan.o shmem.o \
prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
page_isolation.o mm_init.o mmu_context.o percpu.o \
- $(mmu-y)
+ volatile.o $(mmu-y)
obj-y += init-mm.o

ifdef CONFIG_NO_BOOTMEM
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 469491e0..732258b 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -17,6 +17,7 @@
#include <linux/fadvise.h>
#include <linux/writeback.h>
#include <linux/syscalls.h>
+#include <linux/volatile.h>

#include <asm/unistd.h>

@@ -106,7 +107,7 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
nrpages = end_index - start_index + 1;
if (!nrpages)
nrpages = ~0UL;
-
+
ret = force_page_cache_readahead(mapping, file,
start_index,
nrpages);
@@ -128,6 +129,25 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
invalidate_mapping_pages(mapping, start_index,
end_index);
break;
+ case POSIX_FADV_VOLATILE:
+ /* First and last PARTIAL page! */
+ start_index = offset >> PAGE_CACHE_SHIFT;
+ end_index = endbyte >> PAGE_CACHE_SHIFT;
+ ret = mapping_range_volatile(mapping, start_index, end_index);
+ break;
+ case POSIX_FADV_NONVOLATILE:
+ /* First and last PARTIAL page! */
+ start_index = offset >> PAGE_CACHE_SHIFT;
+ end_index = endbyte >> PAGE_CACHE_SHIFT;
+ ret = mapping_range_nonvolatile(mapping, start_index,
+ end_index);
+ break;
+ case POSIX_FADV_ISVOLATILE:
+ /* First and last PARTIAL page! */
+ start_index = offset >> PAGE_CACHE_SHIFT;
+ end_index = endbyte >> PAGE_CACHE_SHIFT;
+ ret = mapping_range_isvolatile(mapping, start_index, end_index);
+ break;
default:
ret = -EINVAL;
}
diff --git a/mm/volatile.c b/mm/volatile.c
new file mode 100644
index 0000000..7ac1afd
--- /dev/null
+++ b/mm/volatile.c
@@ -0,0 +1,314 @@
+/* mm/volatile.c
+ *
+ * Volatile page range managment.
+ * Copyright 2011 Linaro
+ *
+ * Based on mm/ashmem.c
+ * by Robert Love <[email protected]>
+ * Copyright (C) 2008 Google, Inc.
+ *
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/volatile.h>
+
+
+struct volatile_range {
+ struct list_head lru;
+ struct range_tree_node range_node;
+
+ unsigned int purged;
+ struct address_space *mapping;
+};
+
+
+/* LRU list of volatile page ranges */
+static LIST_HEAD(volatile_lru_list);
+static DEFINE_MUTEX(volatile_lru_mutex);
+
+/* Count of pages on our LRU list */
+static u64 lru_count;
+
+
+/* range helpers */
+
+static inline u64 range_size(struct volatile_range *range)
+{
+ return range->range_node.end - range->range_node.start + 1;
+}
+
+
+static inline void lru_add(struct volatile_range *range)
+{
+ mutex_lock(&volatile_lru_mutex);
+ list_add_tail(&range->lru, &volatile_lru_list);
+ lru_count += range_size(range);
+ mutex_unlock(&volatile_lru_mutex);
+}
+
+static inline void __lru_del(struct volatile_range *range)
+{
+ list_del(&range->lru);
+ lru_count -= range_size(range);
+}
+
+static inline void lru_del(struct volatile_range *range)
+{
+ mutex_lock(&volatile_lru_mutex);
+ __lru_del(range);
+ mutex_unlock(&volatile_lru_mutex);
+}
+
+#define range_on_lru(range) (!(range)->purged)
+
+
+static inline void volatile_range_shrink(struct volatile_range *range,
+ pgoff_t start_index, pgoff_t end_index)
+{
+ size_t pre = range_size(range);
+
+ range->range_node.start = start_index;
+ range->range_node.end = end_index;
+
+ if (range_on_lru(range)) {
+ mutex_lock(&volatile_lru_mutex);
+ lru_count -= pre - range_size(range);
+ mutex_unlock(&volatile_lru_mutex);
+ }
+}
+
+static struct volatile_range *vrange_alloc(void)
+{
+ struct volatile_range *new;
+
+ new = kzalloc(sizeof(struct volatile_range), GFP_KERNEL);
+ if (!new)
+ return 0;
+
+ range_tree_node_init(&new->range_node);
+ return new;
+}
+
+static void vrange_del(struct volatile_range *vrange)
+{
+ struct address_space *mapping;
+ mapping = vrange->mapping;
+
+ mapping->volatile_root =
+ range_tree_remove(mapping->volatile_root, &vrange->range_node);
+ if (range_on_lru(vrange))
+ lru_del(vrange);
+ kfree(vrange);
+}
+
+
+
+/*
+ * Mark a region as volatile, allowing dirty pages to be purged
+ * under memory pressure
+ */
+long mapping_range_volatile(struct address_space *mapping,
+ pgoff_t start_index, pgoff_t end_index)
+{
+ struct volatile_range *new;
+ struct range_tree_node *node;
+
+ u64 start, end;
+ int purged = 0;
+ start = (u64)start_index;
+ end = (u64)end_index;
+
+ new = vrange_alloc();
+ if (!new)
+ return -ENOMEM;
+
+ mutex_lock(&mapping->vlist_mutex);
+
+ node = range_tree_in_range_adjacent(mapping->volatile_root, start, end);
+ while (node) {
+ struct volatile_range *vrange;
+
+ /* Already entirely marked volatile, so we're done */
+ if (node->start < start && node->end > end) {
+ /* don't need the allocated value */
+ kfree(new);
+ return 0;
+ }
+
+ /* Grab containing volatile range */
+ vrange = container_of(node, struct volatile_range, range_node);
+
+ /* resize range */
+ start = min_t(u64, start, node->start);
+ end = max_t(u64, end, node->end);
+ purged |= vrange->purged;
+
+ vrange_del(vrange);
+
+ /* get the next possible overlap */
+ node = range_tree_in_range(mapping->volatile_root, start, end);
+ }
+
+ new->mapping = mapping;
+ new->range_node.start = start;
+ new->range_node.end = end;
+ new->purged = purged;
+ mapping->volatile_root = range_tree_add(mapping->volatile_root,
+ &new->range_node);
+ if (range_on_lru(new))
+ lru_add(new);
+ mutex_unlock(&mapping->vlist_mutex);
+
+ return 0;
+}
+
+/*
+ * Mark a region as nonvolatile, returns 1 if any pages in the region
+ * were purged.
+ */
+long mapping_range_nonvolatile(struct address_space *mapping,
+ pgoff_t start_index, pgoff_t end_index)
+{
+ struct volatile_range *new;
+ struct range_tree_node *node;
+ int ret = 0;
+ u64 start, end;
+ start = (u64)start_index;
+ end = (u64)end_index;
+
+ mutex_lock(&mapping->vlist_mutex);
+ node = range_tree_in_range(mapping->volatile_root, start, end);
+ while (node) {
+ struct volatile_range *vrange;
+ vrange = container_of(node, struct volatile_range, range_node);
+
+ ret |= vrange->purged;
+
+ if (start <= node->start && end >= node->end) {
+ vrange_del(vrange);
+ } else if (node->start >= start) {
+ volatile_range_shrink(vrange, end+1, node->end);
+ } else if (node->end <= end) {
+ volatile_range_shrink(vrange, node->start, start-1);
+ } else {
+ /* create new node */
+ new = vrange_alloc(); /* XXX ENOMEM HERE? */
+
+ new->mapping = mapping;
+ new->range_node.start = end + 1;
+ new->range_node.end = node->end;
+ volatile_range_shrink(vrange, node->start, start-1);
+ mapping->volatile_root =
+ range_tree_add(mapping->volatile_root,
+ &new->range_node);
+ if (range_on_lru(new))
+ lru_add(new);
+ break;
+ }
+ node = range_tree_in_range(mapping->volatile_root, start, end);
+ }
+ mutex_unlock(&mapping->vlist_mutex);
+
+ return ret;
+}
+
+/*
+ * Returns if a region has been marked volatile or not.
+ * Does not return if the region has been purged.
+ */
+long mapping_range_isvolatile(struct address_space *mapping,
+ pgoff_t start_index, pgoff_t end_index)
+{
+ long ret = 0;
+ u64 start, end;
+ start = (u64)start_index;
+ end = (u64)end_index;
+
+ mutex_lock(&mapping->vlist_mutex);
+ if (range_tree_in_range(mapping->volatile_root, start, end))
+ ret = 1;
+ mutex_unlock(&mapping->vlist_mutex);
+ return ret;
+}
+
+
+/*
+ * Cleans up any volatile ranges.
+ */
+void mapping_clear_volatile_ranges(struct address_space *mapping)
+{
+ struct volatile_range *tozap;
+
+ mutex_lock(&mapping->vlist_mutex);
+ while (mapping->volatile_root) {
+ tozap = container_of(mapping->volatile_root,
+ struct volatile_range, range_node);
+ vrange_del(tozap);
+ }
+ mutex_unlock(&mapping->vlist_mutex);
+}
+
+
+static int volatile_shrink(struct shrinker *ignored, struct shrink_control *sc)
+{
+ struct volatile_range *range, *next;
+ unsigned long nr_to_scan = sc->nr_to_scan;
+ const gfp_t gfp_mask = sc->gfp_mask;
+
+ /* We might recurse into filesystem code, so bail out if necessary */
+ if (nr_to_scan && !(gfp_mask & __GFP_FS))
+ return -1;
+ if (!nr_to_scan)
+ return lru_count;
+
+ mutex_lock(&volatile_lru_mutex);
+ list_for_each_entry_safe(range, next, &volatile_lru_list, lru) {
+ struct inode *inode = range->mapping->host;
+ loff_t start, end;
+
+
+ start = range->range_node.start * PAGE_SIZE;
+ end = (range->range_node.end + 1) * PAGE_SIZE - 1;
+
+ /*
+ * XXX - calling vmtruncate_range from a shrinker causes
+ * lockdep warnings. Revisit this!
+ */
+ vmtruncate_range(inode, start, end);
+ range->purged = 1;
+ __lru_del(range);
+
+ nr_to_scan -= range_size(range);
+ if (nr_to_scan <= 0)
+ break;
+ }
+ mutex_unlock(&volatile_lru_mutex);
+
+ return lru_count;
+}
+
+static struct shrinker volatile_shrinker = {
+ .shrink = volatile_shrink,
+ .seeks = DEFAULT_SEEKS * 4,
+};
+
+
+static int __init volatile_init(void)
+{
+ register_shrinker(&volatile_shrinker);
+ return 0;
+}
+
+arch_initcall(volatile_init);
--
1.7.3.2.146.gca209

2012-02-12 14:08:30

by Dmitry Adamushko

[permalink] [raw]
Subject: Re: [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags

On 10 February 2012 01:16, John Stultz <[email protected]> wrote:

[ ... ]

> +/*
> + * Mark a region as nonvolatile, returns 1 if any pages in the region
> + * were purged.
> + */
> +long mapping_range_nonvolatile(struct address_space *mapping,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? pgoff_t start_index, pgoff_t end_index)
> +{
> + ? ? ? struct volatile_range *new;
> + ? ? ? struct range_tree_node *node;
> + ? ? ? int ret ?= 0;
> + ? ? ? u64 start, end;
> + ? ? ? start = (u64)start_index;
> + ? ? ? end = (u64)end_index;
> +
> + ? ? ? mutex_lock(&mapping->vlist_mutex);
> + ? ? ? node = range_tree_in_range(mapping->volatile_root, start, end);
> + ? ? ? while (node) {
> + ? ? ? ? ? ? ? struct volatile_range *vrange;
> + ? ? ? ? ? ? ? vrange = container_of(node, struct volatile_range, range_node);
> +
> + ? ? ? ? ? ? ? ret |= vrange->purged;

again, racing with volatile_shrink() here, so we can return a stale state.

> +
> + ? ? ? ? ? ? ? if (start <= node->start && end >= node->end) {
> + ? ? ? ? ? ? ? ? ? ? ? vrange_del(vrange);
> + ? ? ? ? ? ? ? } else if (node->start >= start) {
> + ? ? ? ? ? ? ? ? ? ? ? volatile_range_shrink(vrange, end+1, node->end);
> + ? ? ? ? ? ? ? } else if (node->end <= end) {
> + ? ? ? ? ? ? ? ? ? ? ? volatile_range_shrink(vrange, node->start, start-1);
> + ? ? ? ? ? ? ? } else {
> + ? ? ? ? ? ? ? ? ? ? ? /* create new node */
> + ? ? ? ? ? ? ? ? ? ? ? new = vrange_alloc(); /* XXX ENOMEM HERE? */
> +
> + ? ? ? ? ? ? ? ? ? ? ? new->mapping = mapping;
> + ? ? ? ? ? ? ? ? ? ? ? new->range_node.start = end + 1;
> + ? ? ? ? ? ? ? ? ? ? ? new->range_node.end = node->end;

new->purged = vrange->purged ?

> + ? ? ? ? ? ? ? ? ? ? ? volatile_range_shrink(vrange, node->start, start-1);
> + ? ? ? ? ? ? ? ? ? ? ? mapping->volatile_root =
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? range_tree_add(mapping->volatile_root,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? &new->range_node);
> + ? ? ? ? ? ? ? ? ? ? ? if (range_on_lru(new))
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? lru_add(new);
> + ? ? ? ? ? ? ? ? ? ? ? break;
> + ? ? ? ? ? ? ? }
> + ? ? ? ? ? ? ? node = range_tree_in_range(mapping->volatile_root, start, end);
> + ? ? ? }
> + ? ? ? mutex_unlock(&mapping->vlist_mutex);
> +
> + ? ? ? return ret;
> +}
> +

Also, I have a question about mapping_range_volatile().

+long mapping_range_volatile(struct address_space *mapping,
+ pgoff_t start_index, pgoff_t end_index)
+{
+ struct volatile_range *new;
+ struct range_tree_node *node;
+
+ u64 start, end;
+ int purged = 0;
+ start = (u64)start_index;
+ end = (u64)end_index;
+
+ new = vrange_alloc();
+ if (!new)
+ return -ENOMEM;
+
+ mutex_lock(&mapping->vlist_mutex);
+
+ node = range_tree_in_range_adjacent(mapping->volatile_root, start, end);
+ while (node) {
+ struct volatile_range *vrange;
+
+ /* Already entirely marked volatile, so we're done */
+ if (node->start < start && node->end > end) {
+ /* don't need the allocated value */
+ kfree(new);
+ return 0;
+ }
+
+ /* Grab containing volatile range */
+ vrange = container_of(node, struct volatile_range, range_node);
+
+ /* resize range */
+ start = min_t(u64, start, node->start);
+ end = max_t(u64, end, node->end);
+ purged |= vrange->purged;
+
+ vrange_del(vrange);
+
+ /* get the next possible overlap */
+ node = range_tree_in_range(mapping->volatile_root, start, end);
+ }
+
+ new->mapping = mapping;
+ new->range_node.start = start;
+ new->range_node.end = end;
+ new->purged = purged;

I'm wondering whether this 'inheritance' is always desirable.

Say,

mapping_range_volatile(mapping, X, X + 1);
...
time goes by and volatile_shrink() has been called for this region.

now, a user does the following (is it considered bad user-behavior?)

mapping_range_volatile(mapping, Y = X - big_value, Z = X + big_value);

This new range will 'inherit' purged=1 from the old one and won't be
on the lru_list. Yet, it's much bigger than the old one and so many
pages are not really 'volatile'.


-- Dmitry

2012-02-14 05:17:07

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags

On Thu, Feb 09, 2012 at 04:16:33PM -0800, John Stultz wrote:
> This patch provides new fadvise flags that can be used to mark
> file pages as volatile, which will allow it to be discarded if the
> kernel wants to reclaim memory.
>
> This is useful for userspace to allocate things like caches, and lets
> the kernel destructively (but safely) reclaim them when there's memory
> pressure.
.....
> @@ -655,6 +656,8 @@ struct address_space {
> spinlock_t private_lock; /* for use by the address_space */
> struct list_head private_list; /* ditto */
> struct address_space *assoc_mapping; /* ditto */
> + struct range_tree_node *volatile_root; /* volatile range list */
> + struct mutex vlist_mutex; /* protect volatile_list */
> } __attribute__((aligned(sizeof(long))));

So you're adding roughly 32 bytes to every cached inode in the
system? This will increasing the memory footprint of the inode cache
by 2-5% (depending on the filesystem). Almost no-one will be using
this functionality on most inodes that are cached in the system, so
that seems like a pretty bad trade-off to me...

> +static int volatile_shrink(struct shrinker *ignored, struct shrink_control *sc)
> +{
> + struct volatile_range *range, *next;
> + unsigned long nr_to_scan = sc->nr_to_scan;
> + const gfp_t gfp_mask = sc->gfp_mask;
> +
> + /* We might recurse into filesystem code, so bail out if necessary */
> + if (nr_to_scan && !(gfp_mask & __GFP_FS))
> + return -1;
> + if (!nr_to_scan)
> + return lru_count;
> +
> + mutex_lock(&volatile_lru_mutex);
> + list_for_each_entry_safe(range, next, &volatile_lru_list, lru) {
> + struct inode *inode = range->mapping->host;
> + loff_t start, end;
> +
> +
> + start = range->range_node.start * PAGE_SIZE;
> + end = (range->range_node.end + 1) * PAGE_SIZE - 1;
> +
> + /*
> + * XXX - calling vmtruncate_range from a shrinker causes
> + * lockdep warnings. Revisit this!
> + */
> + vmtruncate_range(inode, start, end);

That function vmtruncate_range, I don't think it does what you think
it does.

Firstly, it's only implemented for shmfs/tmpfs, so this can't have
been tested for caching files on any real filesystem. If it's only
for shm/tmpfs, then the applications cwcan just as easily use their
own memory for caching their volatile data...

Secondly, vmtruncate_range() is actually a hole punching function,
not a page cache invalidation function. You should be using
invalidate_inode_pages2_range() to invalidate and tear down the page
cache. If you really want to punch holes in files, then you should
be using the fallocate syscall with direct application control, not
trying to hide it until memory pressure occurs via fadvise because
hole punching requires memory for the transactions necessary to run
extent freeing operations.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-02-14 05:56:20

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags

On Tue, 2012-02-14 at 16:16 +1100, Dave Chinner wrote:
> On Thu, Feb 09, 2012 at 04:16:33PM -0800, John Stultz wrote:
> > This patch provides new fadvise flags that can be used to mark
> > file pages as volatile, which will allow it to be discarded if the
> > kernel wants to reclaim memory.
> >
> > This is useful for userspace to allocate things like caches, and lets
> > the kernel destructively (but safely) reclaim them when there's memory
> > pressure.
> .....
> > @@ -655,6 +656,8 @@ struct address_space {
> > spinlock_t private_lock; /* for use by the address_space */
> > struct list_head private_list; /* ditto */
> > struct address_space *assoc_mapping; /* ditto */
> > + struct range_tree_node *volatile_root; /* volatile range list */
> > + struct mutex vlist_mutex; /* protect volatile_list */
> > } __attribute__((aligned(sizeof(long))));
>
> So you're adding roughly 32 bytes to every cached inode in the
> system? This will increasing the memory footprint of the inode cache
> by 2-5% (depending on the filesystem). Almost no-one will be using
> this functionality on most inodes that are cached in the system, so
> that seems like a pretty bad trade-off to me...

Yea. Bloating the address_space is a concern I'm aware of, but for the
initial passes I left it to see where folks would rather I keep it.
Pushing the mutex into a range_tree_root structure or something could
cut this down, but I still suspect it won't be loved. Another idea would
be to manage the mapping -> range tree separately via something like a
hash. Do you have any preferences or suggestions here?


> > +static int volatile_shrink(struct shrinker *ignored, struct shrink_control *sc)
> > +{
> > + struct volatile_range *range, *next;
> > + unsigned long nr_to_scan = sc->nr_to_scan;
> > + const gfp_t gfp_mask = sc->gfp_mask;
> > +
> > + /* We might recurse into filesystem code, so bail out if necessary */
> > + if (nr_to_scan && !(gfp_mask & __GFP_FS))
> > + return -1;
> > + if (!nr_to_scan)
> > + return lru_count;
> > +
> > + mutex_lock(&volatile_lru_mutex);
> > + list_for_each_entry_safe(range, next, &volatile_lru_list, lru) {
> > + struct inode *inode = range->mapping->host;
> > + loff_t start, end;
> > +
> > +
> > + start = range->range_node.start * PAGE_SIZE;
> > + end = (range->range_node.end + 1) * PAGE_SIZE - 1;
> > +
> > + /*
> > + * XXX - calling vmtruncate_range from a shrinker causes
> > + * lockdep warnings. Revisit this!
> > + */
> > + vmtruncate_range(inode, start, end);
>
> That function vmtruncate_range, I don't think it does what you think
> it does.
>
> Firstly, it's only implemented for shmfs/tmpfs, so this can't have
> been tested for caching files on any real filesystem. If it's only
> for shm/tmpfs, then the applications cwcan just as easily use their
> own memory for caching their volatile data...

Yep you're right, this started as being shm only, and has only been
tested on tmpfs mounts. In this verison, I had left the shm checks off
so that it could be possibly more generic, but I admittedly haven't
thought that through enough.

> Secondly, vmtruncate_range() is actually a hole punching function,
> not a page cache invalidation function. You should be using
> invalidate_inode_pages2_range() to invalidate and tear down the page
> cache. If you really want to punch holes in files, then you should
> be using the fallocate syscall with direct application control, not
> trying to hide it until memory pressure occurs via fadvise because
> hole punching requires memory for the transactions necessary to run
> extent freeing operations.

Thanks for the tip on invalidate_inode_pages2_range()! I'll look it over
and rework the patch using that.

Thanks so much for the review!
-john


2012-02-14 23:51:12

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags

On Mon, Feb 13, 2012 at 09:55:32PM -0800, John Stultz wrote:
> On Tue, 2012-02-14 at 16:16 +1100, Dave Chinner wrote:
> > On Thu, Feb 09, 2012 at 04:16:33PM -0800, John Stultz wrote:
> > > This patch provides new fadvise flags that can be used to mark
> > > file pages as volatile, which will allow it to be discarded if the
> > > kernel wants to reclaim memory.
> > >
> > > This is useful for userspace to allocate things like caches, and lets
> > > the kernel destructively (but safely) reclaim them when there's memory
> > > pressure.
> > .....
> > > @@ -655,6 +656,8 @@ struct address_space {
> > > spinlock_t private_lock; /* for use by the address_space */
> > > struct list_head private_list; /* ditto */
> > > struct address_space *assoc_mapping; /* ditto */
> > > + struct range_tree_node *volatile_root; /* volatile range list */
> > > + struct mutex vlist_mutex; /* protect volatile_list */
> > > } __attribute__((aligned(sizeof(long))));
> >
> > So you're adding roughly 32 bytes to every cached inode in the
> > system? This will increasing the memory footprint of the inode cache
> > by 2-5% (depending on the filesystem). Almost no-one will be using
> > this functionality on most inodes that are cached in the system, so
> > that seems like a pretty bad trade-off to me...
>
> Yea. Bloating the address_space is a concern I'm aware of, but for the
> initial passes I left it to see where folks would rather I keep it.
> Pushing the mutex into a range_tree_root structure or something could
> cut this down, but I still suspect it won't be loved. Another idea would
> be to manage the mapping -> range tree separately via something like a
> hash. Do you have any preferences or suggestions here?

Given that it is a single state bit per page (volatile/non volatile)
you could just use a radix tree tag for keeping the state. Changing
the state isn't a performance critical operation, and tagging large
ranges isn't that expensive (e.g. we do that in the writeback code),
so I'm not sure the overhead of a separate tree is necessary here....

That doesn't help with the reclaim side of things, but I would have
thought that such functioanlity would be better integrated into the
VM page cache/lru scanning code than adding a shrinker to shrink the
page cache additionally on top of what the VM has already done
before calling the shrinkers. I'm not sure what is best here,
though...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-02-15 00:29:20

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags

On Wed, 2012-02-15 at 10:51 +1100, Dave Chinner wrote:
> Given that it is a single state bit per page (volatile/non volatile)
> you could just use a radix tree tag for keeping the state. Changing
> the state isn't a performance critical operation, and tagging large
> ranges isn't that expensive (e.g. we do that in the writeback code),
> so I'm not sure the overhead of a separate tree is necessary here....

Hrm. I'll look into this.

> That doesn't help with the reclaim side of things, but I would have
> thought that such functioanlity would be better integrated into the
> VM page cache/lru scanning code than adding a shrinker to shrink the
> page cache additionally on top of what the VM has already done
> before calling the shrinkers. I'm not sure what is best here,
> though...

Yea. My previous version did eviction from shmem_writepage(), I believe
much as you suggest here, but the concern with that is that you could
have a larger volatile range that was for the most part recently in use,
but one idle page causes the entire thing to be evicted first. Using
least-recently-marked-volatile order seems more natural for the use case
(although Dimitry and others have already pointed out that the
inheritance from the coalescing of neighboring ranges results in a
similar issue).

But I'm open to other ideas and arguments.

Thanks again for the feedback!
-john

2012-02-15 01:38:06

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags

On Tue, 14 Feb 2012 16:29:10 -0800 John Stultz <[email protected]> wrote:

> But I'm open to other ideas and arguments.

I didn't notice the original patch, but found it at
https://lwn.net/Articles/468837/
and had a look.

My first comment is -ENODOC. A bit background always helps, so let me try to
construct that:

The goal is to allow applications to interact with the kernel's cache
management infrastructure. In particular an application can say "this
memory contains data that might be useful in the future, but can be
reconstructed if necessary, and it is cheaper to reconstruct it than to read
it back from disk, so don't bother writing it out".

The proposed mechanism - at a high level - is for user-space to be able to
say "This memory is volatile" and then later "this memory is no longer
volatile". If the content of the memory is still available the second
request succeeds. If not, it fails.. Well, actually it succeeds but reports
that some content has been lost. (not sure what happens then - can the app do
a binary search to find which pages it still has or something).

(technically we should probably include the cost to reconstruct the page,
which the kernel measures as 'seeks' but maybe that isn't necessary).

This is implemented by using files in a 'tmpfs' filesystem. These file
support three new flags to fadvise:

POSIX_FADV_VOLATILE - this marks a range of pages as 'volatile'. They may be
removed from the page cache as needed, even if they are not 'clean'.
POSIX_FADV_NONVOLATILE - this marks a range of pages as non-volatile.
If any pages in the range were previously volatile but have since been
removed, then a status is returned reporting this.
POSIX_FADV_ISVOLATILE - this does not actually give any advice to the kernel
but rather asks a question: Are any of these pages volatile?


Is this an accurate description?

My first thoughts are:
1/ is page granularity really needed? Would file granularity be sufficient?
2/ POSIX_FADV_ISVOLATILE is a warning sign to me - it doesn't actually
provide advice. Is this really needed? What for? Because it feels like
a wrong interface.
3/ Given that this is specific to one filesystem, is fadvise really an
appropriate interface?

(fleshing out the above documentation might be an excellent way to answer
these questions).

As a counter-point, this is my first thought of an implementation approach
(-ENOPATCH, sorry)

- new mount option for tmpfs e.g. 'volatile'. Any file in a filesystem
mounted with that option and which is not currently open by any process can
have blocks removed at any time. The file name must remain, and the file
size must not change.
- lseek can be used to determine if anything has been purged with 'SEEK_DATA'
and 'SEEK_HOLE'.

So you can only mark volatility on a whole-file granularity (hence the
question above).
'open' says "NONVOLATILE".
'close' says "VOLATILE".
'lseek' is used to check if anything was discarded.

This approach would allow multiple processes to share a cache (might this be
valuable?) as it doesn't become truly volatile until all processes close
their handles.


Just a few thoughts ... take or discard as you choose.

NeilBrown



Attachments:
signature.asc (828.00 B)

2012-02-17 03:43:59

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags

On Sun, 2012-02-12 at 13:48 +0100, Dmitry Adamushko wrote:
>
> On 10 February 2012 01:16, John Stultz <[email protected]> wrote:

> +static inline void volatile_range_shrink(struct
> volatile_range *range,
> + pgoff_t start_index, pgoff_t
> end_index)
> +{
> + size_t pre = range_size(range);
> +
> + range->range_node.start = start_index;
> + range->range_node.end = end_index;
> +
>
> I guess, here we get a whole range of races with volatile_shrink(),
> which may see inconsistent (in-the-middle-of-update) ranges
> (e.g. .start and .end).

We should be holding the vlist_mutex to avoid any such races. But you
also make clear that volatile_range_shrink() should really be called
volatile_range_resize(), since having two _shrink calls is terrible. My
apologies.


> + unsigned long nr_to_scan = sc->nr_to_scan;
> + const gfp_t gfp_mask = sc->gfp_mask;
> +
> + /* We might recurse into filesystem code, so bail out
> if necessary */
> + if (nr_to_scan && !(gfp_mask & __GFP_FS))
> + return -1;
> + if (!nr_to_scan)
> + return lru_count;
>
> So it's u64 -> int here, which is possibly 32 bits and signed. Can't
> it lead to inconsistent results on 32bit platforms?

Good point. Thanks for pointing that out.

> + start = range->range_node.start * PAGE_SIZE;
> + end = (range->range_node.end + 1) * PAGE_SIZE
> - 1;
>
> PAGE_CACHE_SHIFT was used in fadvise() to calculate .start and .end
> indexes, and here we use PAGE_SIZE to get back to 'normal' addresses.
> Isn't it inconsistent at the very least?

Fair enough.

>
> + nr_to_scan -= range_size(range);
>
> hmm, unsigned long -= u64
>
> + if (nr_to_scan <= 0)
>
> nr_to_scan is "unsigned long" :-))

Good catch.

Thanks for the feedback!
-john


2012-02-17 03:49:43

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags

On Sun, 2012-02-12 at 15:08 +0100, Dmitry Adamushko wrote:
> On 10 February 2012 01:16, John Stultz <[email protected]> wrote:
> Also, I have a question about mapping_range_volatile().
[snip]
> + new->mapping = mapping;
> + new->range_node.start = start;
> + new->range_node.end = end;
> + new->purged = purged;
>
> I'm wondering whether this 'inheritance' is always desirable.
>
> Say,
>
> mapping_range_volatile(mapping, X, X + 1);
> ...
> time goes by and volatile_shrink() has been called for this region.
>
> now, a user does the following (is it considered bad user-behavior?)
>
> mapping_range_volatile(mapping, Y = X - big_value, Z = X + big_value);
>
> This new range will 'inherit' purged=1 from the old one and won't be
> on the lru_list. Yet, it's much bigger than the old one and so many
> pages are not really 'volatile'.

Yea, I think this is a interesting point, and we can probably be a
little smarter then what is done here. We could only coalesce ranges
that haven't been purged, for instance. Although, the coalescing of
neighboring ranges in of itself is sort of questionable, as if they were
marked volatile independently, they may be marked nonvolatile
independently as well, so merging them together mucks up the lru
ordering.

Robert/Brian: Is there strong rational for the coalescing of neighboring
ranges in ashmem?

thanks
-john

2012-02-17 04:46:03

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags

On Wed, Feb 15, 2012 at 12:37:50PM +1100, NeilBrown wrote:
> On Tue, 14 Feb 2012 16:29:10 -0800 John Stultz <[email protected]> wrote:
>
> > But I'm open to other ideas and arguments.
>
> I didn't notice the original patch, but found it at
> https://lwn.net/Articles/468837/
> and had a look.
>
> My first comment is -ENODOC. A bit background always helps, so let me try to
> construct that:
>
> The goal is to allow applications to interact with the kernel's cache
> management infrastructure. In particular an application can say "this
> memory contains data that might be useful in the future, but can be
> reconstructed if necessary, and it is cheaper to reconstruct it than to read
> it back from disk, so don't bother writing it out".
>
> The proposed mechanism - at a high level - is for user-space to be able to
> say "This memory is volatile" and then later "this memory is no longer
> volatile". If the content of the memory is still available the second
> request succeeds. If not, it fails.. Well, actually it succeeds but reports
> that some content has been lost. (not sure what happens then - can the app do
> a binary search to find which pages it still has or something).
>
> (technically we should probably include the cost to reconstruct the page,
> which the kernel measures as 'seeks' but maybe that isn't necessary).
>
> This is implemented by using files in a 'tmpfs' filesystem. These file
> support three new flags to fadvise:
>
> POSIX_FADV_VOLATILE - this marks a range of pages as 'volatile'. They may be
> removed from the page cache as needed, even if they are not 'clean'.
> POSIX_FADV_NONVOLATILE - this marks a range of pages as non-volatile.
> If any pages in the range were previously volatile but have since been
> removed, then a status is returned reporting this.
> POSIX_FADV_ISVOLATILE - this does not actually give any advice to the kernel
> but rather asks a question: Are any of these pages volatile?

What about for files that aren't on tmpfs? the fadvise() interface
is not tmpfs specific, and given that everyone is talking about
volatility of page cache pages, I fail to see what is tmpfs specific
about this proposal.

So what are the semantics that are supposed to apply to a file that
is on a filesystem with stable storage that is cached in the page
cache?

If this is tmpfs specific behaviour that is required, then IMO
fadvise is not the correct interface to use here because fadvise is
supposed to be a generic interface to controlling the page cache
behaviour on any given file....

> As a counter-point, this is my first thought of an implementation approach
> (-ENOPATCH, sorry)
>
> - new mount option for tmpfs e.g. 'volatile'. Any file in a filesystem
> mounted with that option and which is not currently open by any process can
> have blocks removed at any time. The file name must remain, and the file
> size must not change.
> - lseek can be used to determine if anything has been purged with 'SEEK_DATA'
> and 'SEEK_HOLE'.
>
> So you can only mark volatility on a whole-file granularity (hence the
> question above).
> 'open' says "NONVOLATILE".
> 'close' says "VOLATILE".
> 'lseek' is used to check if anything was discarded.
>
> This approach would allow multiple processes to share a cache (might this be
> valuable?) as it doesn't become truly volatile until all processes close
> their handles.

If this functionality is only useful for tmpfs, then I'd much prefer
a tmpfs specific approach like this....

Cheers,

Dave.


--
Dave Chinner
[email protected]

2012-02-17 05:22:40

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags

On Wed, 2012-02-15 at 12:37 +1100, NeilBrown wrote:
> On Tue, 14 Feb 2012 16:29:10 -0800 John Stultz <[email protected]> wrote:
>
> > But I'm open to other ideas and arguments.
>
> I didn't notice the original patch, but found it at
> https://lwn.net/Articles/468837/
> and had a look.
>
> My first comment is -ENODOC. A bit background always helps, so let me try to
> construct that:

Apologies for not providing better documentation, and thanks for your
first pass below.

> The goal is to allow applications to interact with the kernel's cache
> management infrastructure. In particular an application can say "this
> memory contains data that might be useful in the future, but can be
> reconstructed if necessary, and it is cheaper to reconstruct it than to read
> it back from disk, so don't bother writing it out".

Or alternatively for tmpfs/ramfs, "this data can be reconstructed, so
purge it and free up memory". But as it currently stands, being fs
agnostic, for disk backed filesystems "don't bother writing it out" is
correct as well.

> The proposed mechanism - at a high level - is for user-space to be able to
> say "This memory is volatile" and then later "this memory is no longer
> volatile". If the content of the memory is still available the second
> request succeeds. If not, it fails.. Well, actually it succeeds but reports
> that some content has been lost. (not sure what happens then - can the app do
> a binary search to find which pages it still has or something).

The app should expect all was lost in that range.

> (technically we should probably include the cost to reconstruct the page,
> which the kernel measures as 'seeks' but maybe that isn't necessary).

Not sure I'm following this.

> This is implemented by using files in a 'tmpfs' filesystem. These file
> support three new flags to fadvise:
>
> POSIX_FADV_VOLATILE - this marks a range of pages as 'volatile'. They may be
> removed from the page cache as needed, even if they are not 'clean'.
> POSIX_FADV_NONVOLATILE - this marks a range of pages as non-volatile.
> If any pages in the range were previously volatile but have since been
> removed, then a status is returned reporting this.
> POSIX_FADV_ISVOLATILE - this does not actually give any advice to the kernel
> but rather asks a question: Are any of these pages volatile?
>
>
> Is this an accurate description?

Right now its not tmpfs specific, but otherwise this is pretty spot on.

> My first thoughts are:
> 1/ is page granularity really needed? Would file granularity be sufficient?

The current users of similar functionality via ashmem do seem to find
page granularity useful. You can share basically an unlinked tmpfs fd
between two applications and mark and unmark ranges of pages
"volatile" (unpinned in ashmem terms) as needed.

> 2/ POSIX_FADV_ISVOLATILE is a warning sign to me - it doesn't actually
> provide advice. Is this really needed? What for? Because it feels like
> a wrong interface.

It is more awkward, I agree. And the more I think about it, it seems
like its something we can drop, as it is likely only useful as a probe
before using a page, and using the POSIX_FADV_NONVOLAILE on the range to
be used would also provide the same behavior. So I'll drop it in the
next revision.

> 3/ Given that this is specific to one filesystem, is fadvise really an
> appropriate interface?
>
> (fleshing out the above documentation might be an excellent way to answer
> these questions).

So, the ashmem implementation is really tmpfs specific, but there's also
the expectation on android devices that there isn't swap, so its more
like ramfs. I'd like to think that this behavior makes some sense on
other filesystems, providing a way to cheaply throw out dirty data
without the cost of hitting the disk. However, the next time the file is
opened, that could cause some really strange inconsistent results, with
some recent pages written out and some stale pages. The vmtruncate would
punch a hole instead of leaving stale data, but that still would have to
hit the disk so its not free. So I'm not really sure if it makes sense
in a totally generic way. That said, it would be easy for now to return
errors if the fs isn't shmem based.

Really, I'm not married to any specific interface here. fadvise just
seemed the most logical to me. Given page granularity is needed, what
would be a filesystem specific interface that makes sense here?

> As a counter-point, this is my first thought of an implementation approach
> (-ENOPATCH, sorry)
>
> - new mount option for tmpfs e.g. 'volatile'. Any file in a filesystem
> mounted with that option and which is not currently open by any process can
> have blocks removed at any time. The file name must remain, and the file
> size must not change.
> - lseek can be used to determine if anything has been purged with 'SEEK_DATA'
> and 'SEEK_HOLE'.
>
> So you can only mark volatility on a whole-file granularity (hence the
> question above).
> 'open' says "NONVOLATILE".
> 'close' says "VOLATILE".
> 'lseek' is used to check if anything was discarded.
>
> This approach would allow multiple processes to share a cache (might this be
> valuable?) as it doesn't become truly volatile until all processes close
> their handles.

I do really appreciate the feedback, but I don't think the full file
semantics described here would work for what are essentially existing
users of ashmem.

thanks
-john

2012-02-17 05:26:17

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags

On Thu, 2012-02-16 at 19:43 -0800, John Stultz wrote:
> On Sun, 2012-02-12 at 13:48 +0100, Dmitry Adamushko wrote:
> >
> > On 10 February 2012 01:16, John Stultz <[email protected]> wrote:
>
> > +static inline void volatile_range_shrink(struct
> > volatile_range *range,
> > + pgoff_t start_index, pgoff_t
> > end_index)
> > +{
> > + size_t pre = range_size(range);
> > +
> > + range->range_node.start = start_index;
> > + range->range_node.end = end_index;
> > +
> >
> > I guess, here we get a whole range of races with volatile_shrink(),
> > which may see inconsistent (in-the-middle-of-update) ranges
> > (e.g. .start and .end).
>
> We should be holding the vlist_mutex to avoid any such races. But you
> also make clear that volatile_range_shrink() should really be called
> volatile_range_resize(), since having two _shrink calls is terrible. My
> apologies.

And sure enough in the shrinker we're not holding the vlist_mutex.
Thanks for pointing that out.
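
Something like this should do it (a sketch; the names follow the RFC
patch, and the WARN_ON is an illustrative addition, not code from it):

#include <linux/mutex.h>

static DEFINE_MUTEX(vlist_mutex);

static inline void volatile_range_resize(struct volatile_range *range,
					 pgoff_t start_index,
					 pgoff_t end_index)
{
	/* All resizes, including from the shrinker, must hold vlist_mutex
	 * so readers never see a half-updated (start, end) pair. */
	WARN_ON(!mutex_is_locked(&vlist_mutex));
	range->range_node.start = start_index;
	range->range_node.end = end_index;
}
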
-john

2012-02-17 05:27:26

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags

On Fri, 17 Feb 2012 15:45:57 +1100 Dave Chinner <[email protected]> wrote:

> On Wed, Feb 15, 2012 at 12:37:50PM +1100, NeilBrown wrote:
> > On Tue, 14 Feb 2012 16:29:10 -0800 John Stultz <[email protected]> wrote:
> >
> > > But I'm open to other ideas and arguments.
> >
> > I didn't notice the original patch, but found it at
> > https://lwn.net/Articles/468837/
> > and had a look.
> >
> > My first comment is -ENODOC. A bit background always helps, so let me try to
> > construct that:
> >
> > The goal is to allow applications to interact with the kernel's cache
> > management infrastructure. In particular an application can say "this
> > memory contains data that might be useful in the future, but can be
> > reconstructed if necessary, and it is cheaper to reconstruct it than to read
> > it back from disk, so don't bother writing it out".
> >
> > The proposed mechanism - at a high level - is for user-space to be able to
> > say "This memory is volatile" and then later "this memory is no longer
> > volatile". If the content of the memory is still available the second
> > request succeeds. If not, it fails. Well, actually it succeeds but reports
> > that some content has been lost. (not sure what happens then - can the app do
> > a binary search to find which pages it still has or something).
> >
> > (technically we should probably include the cost to reconstruct the page,
> > which the kernel measures as 'seeks' but maybe that isn't necessary).
> >
> > This is implemented by using files in a 'tmpfs' filesystem. These files
> > support three new flags to fadvise:
> >
> > POSIX_FADV_VOLATILE - this marks a range of pages as 'volatile'. They may be
> > removed from the page cache as needed, even if they are not 'clean'.
> > POSIX_FADV_NONVOLATILE - this marks a range of pages as non-volatile.
> > If any pages in the range were previously volatile but have since been
> > removed, then a status is returned reporting this.
> > POSIX_FADV_ISVOLATILE - this does not actually give any advice to the kernel
> > but rather asks a question: Are any of these pages volatile?
>
> What about for files that aren't on tmpfs? the fadvise() interface
> is not tmpfs specific, and given that everyone is talking about
> volatility of page cache pages, I fail to see what is tmpfs specific
> about this proposal.

It seems I was looking at an earlier version of the patch, which only seemed
to affect tmpfs files. I see now that the latest version can affect all
filesystems.

>
> So what are the semantics that are supposed to apply to a file that
> is on a filesystem with stable storage that is cached in the page
> cache?

This is my question too. Does this make any sense at all for a
storage-backed filesystem?

If I understand the current code (which is by no means certain), then
there is nothing concrete that stops volatile pages from being written back
to storage. Whether they are or not would be the result of a race between
the 'volatile_shrinker' purging them, and the VM cleaning them.

Given that the volatile_shrinker sets 'seeks = DEFAULT_SEEKS * 4', I would
guess that the VM would get to the pages before the shrinker, but that is
mostly just a guess.

Is this really what we want?

Certainly having this clarified in Documentation/volatile.txt would help a lot :-)

Thanks,
NeilBrown



2012-02-17 05:38:58

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags

On Fri, 2012-02-17 at 15:45 +1100, Dave Chinner wrote:
> On Wed, Feb 15, 2012 at 12:37:50PM +1100, NeilBrown wrote:
> > On Tue, 14 Feb 2012 16:29:10 -0800 John Stultz <[email protected]> wrote:
> >
> > > But I'm open to other ideas and arguments.
> >
> > I didn't notice the original patch, but found it at
> > https://lwn.net/Articles/468837/
> > and had a look.
> >
> > My first comment is -ENODOC. A bit background always helps, so let me try to
> > construct that:
> >
> > The goal is to allow applications to interact with the kernel's cache
> > management infrastructure. In particular an application can say "this
> > memory contains data that might be useful in the future, but can be
> > reconstructed if necessary, and it is cheaper to reconstruct it than to read
> > it back from disk, so don't bother writing it out".
> >
> > The proposed mechanism - at a high level - is for user-space to be able to
> > say "This memory is volatile" and then later "this memory is no longer
> > volatile". If the content of the memory is still available the second
> > request succeeds. If not, it fails. Well, actually it succeeds but reports
> > that some content has been lost. (not sure what happens then - can the app do
> > a binary search to find which pages it still has or something).
> >
> > (technically we should probably include the cost to reconstruct the page,
> > which the kernel measures as 'seeks' but maybe that isn't necessary).
> >
> > This is implemented by using files in a 'tmpfs' filesystem. These files
> > support three new flags to fadvise:
> >
> > POSIX_FADV_VOLATILE - this marks a range of pages as 'volatile'. They may be
> > removed from the page cache as needed, even if they are not 'clean'.
> > POSIX_FADV_NONVOLATILE - this marks a range of pages as non-volatile.
> > If any pages in the range were previously volatile but have since been
> > removed, then a status is returned reporting this.
> > POSIX_FADV_ISVOLATILE - this does not actually give any advice to the kernel
> > but rather asks a question: Are any of these pages volatile?
>
> What about for files that aren't on tmpfs? the fadvise() interface
> is not tmpfs specific, and given that everyone is talking about
> volatility of page cache pages, I fail to see what is tmpfs specific
> about this proposal.
>
> So what are the semantics that are supposed to apply to a file that
> is on a filesystem with stable storage that is cached in the page
> cache?

Indeed, this is probably the most awkward case. Currently we use
vmtruncate_range, which should punch a hole in the file. If I switch to
invalidate_inode_pages2_range(), then I think the dirty data is dropped
while the on-disk copy remains (I'm currently reading over that now).
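
Roughly, the two options look like this (a sketch using 2012-era kernel
interfaces, not code lifted from the patch):

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

static void purge_range(struct inode *inode, pgoff_t start, pgoff_t end)
{
	loff_t lstart = (loff_t)start << PAGE_CACHE_SHIFT;
	loff_t lend = ((loff_t)(end + 1) << PAGE_CACHE_SHIFT) - 1;

	/* Option A: punch a hole; no stale data on disk, but the
	 * truncation itself may cost disk I/O on a backed fs. */
	vmtruncate_range(inode, lstart, lend);

	/* Option B: drop the cached pages only; dirty data is lost and
	 * whatever was last written to disk becomes visible again. */
	/* invalidate_inode_pages2_range(inode->i_mapping, start, end); */
}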

> If this is tmpfs specific behaviour that is required, then IMO
> fadvise is not the correct interface to use here because fadvise is
> supposed to be a generic interface to controlling the page cache
> behaviour on any given file....
>
> > As a counter-point, this is my first thought of an implementation approach
> > (-ENOPATCH, sorry)
> >
> > - new mount option for tmpfs e.g. 'volatile'. Any file in a filesystem
> > mounted with that option and which is not currently open by any process can
> > have blocks removed at any time. The file name must remain, and the file
> > size must not change.
> > - lseek can be used to determine if anything has been purged with 'SEEK_DATA'
> > and 'SEEK_HOLE'.
> >
> > So you can only mark volatility on a whole-file granularity (hence the
> > question above).
> > 'open' says "NONVOLATILE".
> > 'close' says "VOLATILE".
> > 'lseek' is used to check if anything was discarded.
> >
> > This approach would allow multiple processes to share a cache (might this be
> > valuable?) as it doesn't become truly volatile until all processes close
> > their handles.
>
> If this functionality is only useful for tmpfs, then I'd much prefer
> a tmpfs specific approach like this....

The more I think about this, the closer it seems to map to file hole
punching; would fallocate be the right interface? After all,
FALLOC_FL_PUNCH_HOLE isn't supported by all filesystems either.

Maybe FALLOC_FL_VOLATILE and FALLOC_FL_NONVOLATILE?
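
Something like the following, where both flags and their values are
purely hypothetical placeholders (they exist in no released
<linux/falloc.h>):

#define _GNU_SOURCE
#include <fcntl.h>

#define FALLOC_FL_VOLATILE	0x10	/* hypothetical */
#define FALLOC_FL_NONVOLATILE	0x20	/* hypothetical */

static int mark_volatile(int fd, off_t off, off_t len)
{
	return fallocate(fd, FALLOC_FL_VOLATILE, off, len);
}

static int mark_nonvolatile(int fd, off_t off, off_t len)
{
	/* the return value could report whether the range was purged */
	return fallocate(fd, FALLOC_FL_NONVOLATILE, off, len);
}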

thanks
-john

2012-02-20 07:35:18

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags


Hi John,
thanks for your answers....

>
> > The proposed mechanism - at a high level - is for user-space to be able to
> > say "This memory is volatile" and then later "this memory is no longer
> > volatile". If the content of the memory is still available the second
> > request succeeds. If not, it fails. Well, actually it succeeds but reports
> > that some content has been lost. (not sure what happens then - can the app do
> > a binary search to find which pages it still has or something).
>
> The app should assume that everything in the range was lost.

So... the app has some idea of the real granularity of the cache, which
is several objects in one file, and marks them volatile as a whole - then
marks them non-volatile as a whole and if that fails it assumes that the
whole object is gone.
However the kernel doesn't really have any idea of the real granularity, so it
just removes individual pages until it has freed up enough.
It could have just corrupted a much bigger object, in which case the rest of
the object is of no value and may as well be freed, but it has no way to know
this, so it frees something else instead.

Is this a problem? If the typical granularity is a page or two then it is
unlikely to hurt. If it is hundreds of pages I think it would mean that we
don't make as good use of memory as we could (but it is all heuristics anyway
and we probably waste lots of opportunities already so maybe it doesn't
matter).

My gut feeling is that seeing the app has concrete knowledge about
granularity it should give it to the kernel somehow.

>
> > (technically we should probably include the cost to reconstruct the page,
> > which the kernel measures as 'seeks' but maybe that isn't necessary).
>
> Not sure I'm following this.

The shrinker in your code (and the original ashmem) contains:

.seeks = DEFAULT_SEEKS * 4,

This means that objects in this cache are 4 times as expensive to replace as
most other caches.
(the cost of replacing an entry in the cache is measured in 'seeks', and the
default is to assume that it takes 2 seeks to reload an object).

I don't really know what the practical importance of 'seeks' is. Maybe it is
close to meaningless, in which case you should probably use DEFAULT_SEEKS
like (almost) everyone else.
Maybe it is quite relevant, in which case maybe you should expose that
setting to user-space somehow.
Or maybe 'DEFAULT_SEEKS * 4' is perfect for all possible users of this
caching mechanism.

I guess my point is that any non-default value should be justified.
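
For reference, the knob lives in the shrinker registration; a sketch of
the 2012-era API with the scan body elided. DEFAULT_SEEKS is 2, so the
multiplier claims these pages cost four times as much to reconstruct as
a typical cache object:

#include <linux/shrinker.h>

static int volatile_shrink(struct shrinker *s, struct shrink_control *sc)
{
	/* ... walk the unpinned ranges and purge some of them ... */
	return 0;
}

static struct shrinker volatile_shrinker = {
	.shrink	= volatile_shrink,
	.seeks	= DEFAULT_SEEKS * 4,	/* the value being questioned */
};

/* registered at init time with register_shrinker(&volatile_shrinker) */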


>
> > This is implemented by using files in a 'tmpfs' filesystem. These files
> > support three new flags to fadvise:
> >
> > POSIX_FADV_VOLATILE - this marks a range of pages as 'volatile'. They may be
> > removed from the page cache as needed, even if they are not 'clean'.
> > POSIX_FADV_NONVOLATILE - this marks a range of pages as non-volatile.
> > If any pages in the range were previously volatile but have since been
> > removed, then a status is returned reporting this.
> > POSIX_FADV_ISVOLATILE - this does not actually give any advice to the kernel
> > but rather asks a question: Are any of these pages volatile?
> >
> >
> > Is this an accurate description?
>
> Right now it's not tmpfs specific, but otherwise this is pretty spot on.
>
> > My first thoughts are:
> > 1/ is page granularity really needed? Would file granularity be sufficient?
>
> The current users of similar functionality via ashmem do seem to find
> page granularity useful. You can share basically an unlinked tmpfs fd
> between two applications and mark and unmark ranges of pages
> "volatile" (unpinned in ashmem terms) as needed.

Sharing an unlinked cache between processes certainly seems like a valid case
that my model doesn't cover.
I feel uncomfortable about different processes being able to unpin each
other's pages. It means they need to negotiate with each other to ensure one
doesn't unpin a page that the other is using.

If this was a common use case, it would make a lot of sense for the kernel to
refcount the pinning so that a range only becomes really unpinned when no-one
has it pinned any more.
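
Roughly like this (all names hypothetical; range_tree_node is from the
range-tree patch earlier in this thread): a range only becomes
purgeable when its pin count drops to zero.

#include <linux/atomic.h>
#include <linux/rangetree.h>

struct pinned_range {
	struct range_tree_node	node;
	atomic_t		pin_count;
};

static void range_pin(struct pinned_range *r)
{
	atomic_inc(&r->pin_count);
}

static void range_unpin(struct pinned_range *r)
{
	/* the last unpinner makes the range purgeable */
	if (atomic_dec_and_test(&r->pin_count))
		mark_range_purgeable(r);	/* hypothetical helper */
}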

Do you know any more about these apps that share a cache file? Do they need
extra inter-locking (or are they completely hypothetical)?


>
> > 2/ POSIX_FADV_ISVOLATILE is a warning sign to me - it doesn't actually
> > provide advice. Is this really needed? What for? Because it feels like
> > a wrong interface.
>
> It is more awkward, I agree. And the more I think about it, the more it
> seems like something we can drop: it is likely only useful as a probe
> before using a page, and calling POSIX_FADV_NONVOLATILE on the range to
> be used would provide the same behavior. So I'll drop it in the
> next revision.

Good. That makes me feel happier.

>
> > 3/ Given that this is specific to one filesystem, is fadvise really an
> > appropriate interface?
> >
> > (fleshing out the above documentation might be an excellent way to answer
> > these questions).
>
> So, the ashmem implementation is really tmpfs specific, but there's also
> the expectation on Android devices that there isn't swap, so it's more
> like ramfs. I'd like to think that this behavior makes some sense on
> other filesystems, providing a way to cheaply throw out dirty data
> without the cost of hitting the disk. However, the next time the file is
> opened, that could cause some really strange, inconsistent results, with
> some recent pages written out and some stale pages. The vmtruncate would
> punch a hole instead of leaving stale data, but that still has to
> hit the disk, so it's not free. So I'm not really sure it makes sense
> in a totally generic way. That said, it would be easy for now to return
> errors if the fs isn't shmem based.

As I think I said somewhere, I cannot see how the functionality makes any
sense at all on a storage-backed filesystem, and what you have said about
inconsistent on-disk images only reinforces that.
I think it should definitely be ramfs only (maybe tmpfs as well?).


>
> Really, I'm not married to any specific interface here. fadvise just
> seemed the most logical to me. Given page granularity is needed, what
> would be a filesystem specific interface that makes sense here?

OK, let me try again.
This looks to me a bit like byte-range locking.
Locking can already have a filesystem-specific implementation, so this could
be implemented as a ramfs-specific locking protocol. This would be activated
by some mount option (or it could even be a different filesystem type -
ramcachefs).

1- a shared lock (F_RDLCK) pins the range in memory and prevents an exclusive
lock, or any purge of pages.
2- an exclusive lock (F_WRLCK) is used to create or re-create an object in
the cache.
3- when pages are purged a lock-range is created which marks the range as
purged and prevents any read lock from succeeding.
This lock-range is removed when a write-lock is taken out.

So initially all pages are marked by an internal 'purged' lock indicating that
they contain nothing.

Objects can be created by taking a write lock and writing data. Then
unlocking (or downgrading to a read lock) allows them to be accessed by
other processes.
Any process that wants to read an object first asks for a shared lock. If
this succeeds they are sure that the pages are still available (and that
no-one has an exclusive lock).
If the shared lock fails then at least one page doesn't exist - probably all
are gone. They can then optionally try to get a write lock. Once they get
that they can revalidate somehow, or refill the object.
When the last lock is removed, the locking code could keep the range
information but mark it as unlocked and put it on an LRU list.

So 4 sorts of ranges are defined and they cover the entire file:
shared locks - these might overlap
exclusive locks - these don't overlap
purge locks - mark ranges that have been purged or never written
pending locks - mark all remaining ranges.

When a shared or exclusive lock is released it becomes a pending lock.
When the shrinker fires it converts some number of pending locks to
purge locks and discards the pages wholly in them.
A shared lock can only be taken when there is a shared or pending lock there.
An exclusive lock can be taken when a purge or pending lock is present.
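
The userspace half of this would just be ordinary fcntl() byte-range
locks; a sketch (the purge/pending bookkeeping above is the new,
kernel-side part):

#include <sys/types.h>
#include <fcntl.h>

static int object_acquire(int fd, off_t off, off_t len)
{
	struct flock fl = {
		.l_type = F_RDLCK, .l_whence = SEEK_SET,
		.l_start = off, .l_len = len,
	};
	/* fails if any part of the range carries a purge lock */
	return fcntl(fd, F_SETLK, &fl);
}

static int object_refill(int fd, off_t off, off_t len)
{
	struct flock fl = {
		.l_type = F_WRLCK, .l_whence = SEEK_SET,
		.l_start = off, .l_len = len,
	};
	/* succeeds on purged/pending ranges; write the data, then
	 * downgrade to F_RDLCK to publish the object */
	return fcntl(fd, F_SETLK, &fl);
}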

For the most part this doesn't conflict with the more normal usage of byte
range locks. However it does mean that a process cannot place a range in a
state where some other process is allowed to write, but the kernel is not
allowed to purge the pages. I cannot tell if this might be a problem.
(it could probably be managed by some convention where locking the first byte
in an object gives read/write permission and locking the rest keeps it in
cache. One byte by itself will never be purged).

I'm not sure what should happen if you write without first getting a write
lock. I guess it should turn a purge lock into a pending lock, but leave
any other ranges unchanged.

NeilBrown



2012-02-20 23:26:03

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 2/2] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags

On 02/19/2012 11:34 PM, NeilBrown wrote:
> Is this a problem? If the typical granularity is a page or two then it is
> unlikely to hurt. If it is hundreds of pages I think it would mean that we
> don't make as good use of memory as we could (but it is all heuristics anyway
> and we probably waste lots of opportunities already so maybe it doesn't
> matter).
>
> My gut feeling is that seeing the app has concrete knowledge about
> granularity it should give it to the kernel somehow.

I dunno... The kernel works well with an awful lot of applications
without ever having any concept of the underlying objects.

I guess the worst-case scenario would be if we have a very large object,
most of its pages being accessed very often, but one never accessed.
The one page makes its way to the end of the LRU and gets discarded; now
the whole object is worthless, even though most of it sits at the beginning
of the LRU. If it is truly accessed often, then userspace should notice
and discard the entire object quickly. If it isn't accessed
often, then the majority of the object is moving down the LRU and won't
be wasted for long.

What else can we do? I guess whenever a range is set NONVOLATILE, we
could SetPageReferenced() on every page in the range to keep them
bunched up on the LRU.
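
Something like this, roughly (the helper name is made up; the calls are
2012-era page-cache API): on a NONVOLATILE request, touch each page in
the range so the object stays bunched together on the LRU.

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/page-flags.h>

static void mark_range_referenced(struct address_space *mapping,
				  pgoff_t start, pgoff_t end)
{
	pgoff_t index;

	for (index = start; index <= end; index++) {
		struct page *page = find_get_page(mapping, index);

		if (!page)
			continue;
		SetPageReferenced(page);
		page_cache_release(page);	/* drop find_get_page ref */
	}
}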