2012-06-01 18:30:09

by John Stultz

Subject: [PATCH 0/3] [RFC] Fallocate Volatile Ranges v2

Here's another update to the Fallocate Volatile Range code.

The biggish change is renaming the range-tree code to
interval-tree, as Jan Kara pointed out that term is more
accurate (although this is a naive implementation).

I also fixed a bad bug in the volatile range management,
and added an optimization so we don't run over the lru
to determine how many pages are unpurged.

Thanks to everyone for the review so far, please let me
know if you have any further thoughts or suggestions.

thanks
-john


CC: Andrew Morton <[email protected]>
CC: Android Kernel Team <[email protected]>
CC: Robert Love <[email protected]>
CC: Mel Gorman <[email protected]>
CC: Hugh Dickins <[email protected]>
CC: Dave Hansen <[email protected]>
CC: Rik van Riel <[email protected]>
CC: Dmitry Adamushko <[email protected]>
CC: Dave Chinner <[email protected]>
CC: Neil Brown <[email protected]>
CC: Andrea Righi <[email protected]>
CC: Aneesh Kumar K.V <[email protected]>
CC: Taras Glek <[email protected]>
CC: Mike Hommey <[email protected]>
CC: Jan Kara <[email protected]>

John Stultz (3):
[RFC] Interval tree implementation
[RFC] Add volatile range management code
[RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers

fs/open.c | 3 +-
include/linux/falloc.h | 7 +-
include/linux/intervaltree.h | 55 +++++
include/linux/volatile.h | 45 ++++
lib/Makefile | 2 +-
lib/intervaltree.c | 119 ++++++++++
mm/Makefile | 2 +-
mm/shmem.c | 107 +++++++++
mm/volatile.c | 509 ++++++++++++++++++++++++++++++++++++++++++
9 files changed, 843 insertions(+), 6 deletions(-)
create mode 100644 include/linux/intervaltree.h
create mode 100644 include/linux/volatile.h
create mode 100644 lib/intervaltree.c
create mode 100644 mm/volatile.c

--
1.7.3.2.146.gca209


2012-06-01 18:30:45

by John Stultz

Subject: [PATCH 2/3] [RFC] Add volatile range management code

This patch provides the volatile range management code
that filesystems can utilize when implementing
FALLOC_FL_MARK_VOLATILE.

It tracks a collection of page ranges against a mapping
stored in an interval-tree. This code handles coalescing
overlapping and adjacent ranges, as well as splitting
ranges when sub-chunks are removed.

The ranges can be marked purged or unpurged. And there is
a per-fs lru list that tracks all the unpurged ranges for
that fs.
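
As a rough sketch of how a filesystem is expected to call into this
(mirroring what the tmpfs patch later in this series does; the
example_* function names are made up):

  static DEFINE_VOLATILE_FS_HEAD(example_volatile_head);

  static long example_mark_volatile(struct address_space *mapping,
                                    pgoff_t start, pgoff_t end)
  {
          long ret;

          volatile_range_lock(&example_volatile_head);
          ret = volatile_range_add(&example_volatile_head, mapping,
                                   start, end);
          if (ret > 0) {
                  /* The new range coalesced with an already-purged
                   * range, so the data is gone: the filesystem should
                   * drop the pages now (tmpfs punches a hole here)
                   * and report success. */
                  ret = 0;
          }
          volatile_range_unlock(&example_volatile_head);
          return ret;
  }

  static long example_unmark_volatile(struct address_space *mapping,
                                      pgoff_t start, pgoff_t end)
  {
          long ret;

          volatile_range_lock(&example_volatile_head);
          /* returns 1 if any part of the range was purged */
          ret = volatile_range_remove(&example_volatile_head, mapping,
                                      start, end);
          volatile_range_unlock(&example_volatile_head);
          return ret;
  }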

v2:
* Fix bug in volatile_ranges_get_last_used returning bad
start,end values
* Rework for intervaltree renaming
* Optimize volatile_range_lru_size to avoid running through
lru list each time.

CC: Andrew Morton <[email protected]>
CC: Android Kernel Team <[email protected]>
CC: Robert Love <[email protected]>
CC: Mel Gorman <[email protected]>
CC: Hugh Dickins <[email protected]>
CC: Dave Hansen <[email protected]>
CC: Rik van Riel <[email protected]>
CC: Dmitry Adamushko <[email protected]>
CC: Dave Chinner <[email protected]>
CC: Neil Brown <[email protected]>
CC: Andrea Righi <[email protected]>
CC: Aneesh Kumar K.V <[email protected]>
CC: Taras Glek <[email protected]>
CC: Mike Hommey <[email protected]>
CC: Jan Kara <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
include/linux/volatile.h | 45 ++++
mm/Makefile | 2 +-
mm/volatile.c | 509 ++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 555 insertions(+), 1 deletions(-)
create mode 100644 include/linux/volatile.h
create mode 100644 mm/volatile.c

diff --git a/include/linux/volatile.h b/include/linux/volatile.h
new file mode 100644
index 0000000..66737a8
--- /dev/null
+++ b/include/linux/volatile.h
@@ -0,0 +1,45 @@
+#ifndef _LINUX_VOLATILE_H
+#define _LINUX_VOLATILE_H
+
+#include <linux/fs.h>
+
+struct volatile_fs_head {
+ struct mutex lock;
+ struct list_head lru_head;
+ s64 unpurged_page_count;
+};
+
+
+#define DEFINE_VOLATILE_FS_HEAD(name) struct volatile_fs_head name = { \
+ .lock = __MUTEX_INITIALIZER(name.lock), \
+ .lru_head = LIST_HEAD_INIT(name.lru_head), \
+ .unpurged_page_count = 0, \
+}
+
+
+static inline void volatile_range_lock(struct volatile_fs_head *head)
+{
+ mutex_lock(&head->lock);
+}
+
+static inline void volatile_range_unlock(struct volatile_fs_head *head)
+{
+ mutex_unlock(&head->lock);
+}
+
+extern long volatile_range_add(struct volatile_fs_head *head,
+ struct address_space *mapping,
+ pgoff_t start_index, pgoff_t end_index);
+extern long volatile_range_remove(struct volatile_fs_head *head,
+ struct address_space *mapping,
+ pgoff_t start_index, pgoff_t end_index);
+
+extern s64 volatile_range_lru_size(struct volatile_fs_head *head);
+
+extern void volatile_range_clear(struct volatile_fs_head *head,
+ struct address_space *mapping);
+
+extern s64 volatile_ranges_get_last_used(struct volatile_fs_head *head,
+ struct address_space **mapping,
+ loff_t *start, loff_t *end);
+#endif /* _LINUX_VOLATILE_H */
diff --git a/mm/Makefile b/mm/Makefile
index a156285..dc79eb8 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -16,7 +16,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
readahead.o swap.o truncate.o vmscan.o shmem.o \
prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
page_isolation.o mm_init.o mmu_context.o percpu.o \
- compaction.o $(mmu-y)
+ compaction.o volatile.o $(mmu-y)
obj-y += init-mm.o

ifdef CONFIG_NO_BOOTMEM
diff --git a/mm/volatile.c b/mm/volatile.c
new file mode 100644
index 0000000..f8da602
--- /dev/null
+++ b/mm/volatile.c
@@ -0,0 +1,509 @@
+/* mm/volatile.c
+ *
+ * Volatile page range management.
+ * Copyright 2011 Linaro
+ *
+ * Based on mm/ashmem.c
+ * by Robert Love <[email protected]>
+ * Copyright (C) 2008 Google, Inc.
+ *
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * The volatile range management is a helper layer on top of the interval tree
+ * code, which is used to help filesystems manage page ranges that are volatile.
+ *
+ * These ranges are stored in a per-mapping interval tree, holding both purged
+ * and unpurged ranges connected to that address_space. Unpurged ranges are also
+ * linked together in an lru list that is per-volatile-fs-head (basically
+ * per-filesystem).
+ *
+ * The goal behind volatile ranges is to allow applications to interact
+ * with the kernel's cache management infrastructure. In particular an
+ * application can say "this memory contains data that might be useful in
+ * the future, but can be reconstructed if necessary, so if the kernel
+ * needs, it can zap and reclaim this memory without having to swap it out".
+ *
+ * The proposed mechanism - at a high level - is for user-space to be able
+ * to say "This memory is volatile" and then later "this memory is no longer
+ * volatile". If the content of the memory is still available the second
+ * request succeeds. If not, the memory is marked non-volatile and an
+ * error is returned to denote that the contents have been lost.
+ *
+ * Credits to Neil Brown for the above description.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/pagemap.h>
+#include <linux/volatile.h>
+#include <linux/intervaltree.h>
+#include <linux/hash.h>
+#include <linux/shmem_fs.h>
+
+
+struct volatile_range {
+ struct list_head lru;
+ struct interval_tree_node interval_node;
+ unsigned int purged;
+ struct address_space *mapping;
+};
+
+
+/*
+ * To avoid bloating the address_space structure, we use
+ * a hash structure to map from address_space mappings to
+ * the interval_tree root that stores volatile ranges
+ */
+static DEFINE_MUTEX(hash_mutex);
+static struct hlist_head *mapping_hash;
+static long mapping_hash_shift = 8;
+struct mapping_hash_entry {
+ struct interval_tree_root root;
+ struct address_space *mapping;
+ struct hlist_node hnode;
+};
+
+
+static inline
+struct interval_tree_root *__mapping_to_root(struct address_space *mapping)
+{
+ struct hlist_node *elem;
+ struct mapping_hash_entry *entry;
+ struct interval_tree_root *ret = NULL;
+
+ hlist_for_each_entry_rcu(entry, elem,
+ &mapping_hash[hash_ptr(mapping, mapping_hash_shift)],
+ hnode)
+ if (entry->mapping == mapping)
+ ret = &entry->root;
+
+ return ret;
+}
+
+
+static inline
+struct interval_tree_root *mapping_to_root(struct address_space *mapping)
+{
+ struct interval_tree_root *ret;
+
+ mutex_lock(&hash_mutex);
+ ret = __mapping_to_root(mapping);
+ mutex_unlock(&hash_mutex);
+ return ret;
+}
+
+
+static inline
+struct interval_tree_root *mapping_allocate_root(struct address_space *mapping)
+{
+ struct mapping_hash_entry *entry;
+ struct interval_tree_root *dblchk;
+ struct interval_tree_root *ret = NULL;
+
+ entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+ if (!entry)
+ return NULL;
+
+ mutex_lock(&hash_mutex);
+ /* Since we dropped the lock, double check that no one has
+ * created the same hash entry.
+ */
+ dblchk = __mapping_to_root(mapping);
+ if (dblchk) {
+ kfree(entry);
+ ret = dblchk;
+ goto out;
+ }
+
+ INIT_HLIST_NODE(&entry->hnode);
+ entry->mapping = mapping;
+ interval_tree_init(&entry->root);
+
+ hlist_add_head_rcu(&entry->hnode,
+ &mapping_hash[hash_ptr(mapping, mapping_hash_shift)]);
+
+ ret = &entry->root;
+out:
+ mutex_unlock(&hash_mutex);
+ return ret;
+}
+
+
+static inline void mapping_free_root(struct interval_tree_root *root)
+{
+ struct mapping_hash_entry *entry;
+
+ mutex_lock(&hash_mutex);
+ entry = container_of(root, struct mapping_hash_entry, root);
+
+ hlist_del_rcu(&entry->hnode);
+ kfree(entry);
+ mutex_unlock(&hash_mutex);
+}
+
+
+/* volatile range helpers */
+static inline void vrange_resize(struct volatile_fs_head *head,
+ struct volatile_range *range,
+ pgoff_t start_index, pgoff_t end_index)
+{
+ s64 old_size, new_size;
+
+ old_size = range->interval_node.end - range->interval_node.start;
+ new_size = end_index-start_index;
+
+ if (!range->purged)
+ head->unpurged_page_count += new_size - old_size;
+
+ range->interval_node.start = start_index;
+ range->interval_node.end = end_index;
+}
+
+static struct volatile_range *vrange_alloc(void)
+{
+ struct volatile_range *new;
+
+ new = kzalloc(sizeof(struct volatile_range), GFP_KERNEL);
+ if (!new)
+ return NULL;
+ interval_tree_node_init(&new->interval_node);
+ return new;
+}
+
+static void vrange_del(struct volatile_fs_head *head,
+ struct interval_tree_root *root,
+ struct volatile_range *vrange)
+{
+ if (!vrange->purged) {
+ head->unpurged_page_count -=
+ vrange->interval_node.end - vrange->interval_node.start;
+ list_del(&vrange->lru);
+ }
+ interval_tree_remove(root, &vrange->interval_node);
+ kfree(vrange);
+}
+
+
+/**
+ * volatile_range_add: Marks a page interval as volatile
+ * @head: per-fs volatile head
+ * @mapping: address space whose range is being marked volatile
+ * @start_index: Starting page in range to be marked volatile
+ * @end_index: Ending page in range to be marked volatile
+ *
+ * Mark a region as volatile. Coalesces overlapping and neighboring regions.
+ *
+ * Must lock the volatile_fs_head before calling!
+ *
+ * Returns 1 if the range was coalesced with any purged ranges, 0 otherwise,
+ * or -ENOMEM if the new range could not be allocated.
+ */
+long volatile_range_add(struct volatile_fs_head *head,
+ struct address_space *mapping,
+ pgoff_t start_index, pgoff_t end_index)
+{
+ struct volatile_range *new;
+ struct interval_tree_node *node;
+ struct volatile_range *vrange;
+ struct interval_tree_root *root;
+ int purged = 0;
+ u64 start = (u64)start_index;
+ u64 end = (u64)end_index;
+
+ /* Make sure we're properly locked */
+ WARN_ON(!mutex_is_locked(&head->lock));
+
+ /*
+ * Because the lock might be held in a shrinker, release
+ * it during allocation.
+ */
+ mutex_unlock(&head->lock);
+ new = vrange_alloc();
+ mutex_lock(&head->lock);
+ if (!new)
+ return -ENOMEM;
+
+ root = mapping_to_root(mapping);
+ if (!root) {
+ mutex_unlock(&head->lock);
+ root = mapping_allocate_root(mapping);
+ mutex_lock(&head->lock);
+ if (!root) {
+ kfree(new);
+ return -ENOMEM;
+ }
+ }
+
+ /* First, find any existing intervals that overlap */
+ node = interval_tree_in_interval(root, start, end);
+ while (node) {
+ /* Already entirely marked volatile, so we're done */
+ if (node->start < start && node->end > end) {
+ /* don't need the allocated value */
+ kfree(new);
+ return purged;
+ }
+
+ /* Grab containing volatile range */
+ vrange = container_of(node, struct volatile_range,
+ interval_node);
+
+ /* Resize the new range to cover all overlapping ranges */
+ start = min_t(u64, start, node->start);
+ end = max_t(u64, end, node->end);
+
+ /* Inherit purged state from overlapping ranges */
+ purged |= vrange->purged;
+
+
+ node = interval_tree_next_in_interval(&vrange->interval_node,
+ start, end);
+ /* Delete the old range, as we consume it */
+ vrange_del(head, root, vrange);
+ }
+
+ /* Coalesce left-adjacent ranges */
+ node = interval_tree_in_interval(root, start-1, start);
+ if (node) {
+ vrange = container_of(node, struct volatile_range,
+ interval_node);
+ /* Only coalesce if both are either purged or unpurged */
+ if (vrange->purged == purged) {
+ /* resize new range */
+ start = min_t(u64, start, node->start);
+ end = max_t(u64, end, node->end);
+ /* delete old range */
+ vrange_del(head, root, vrange);
+ }
+ }
+
+ /* Coalesce right-adjacent ranges */
+ node = interval_tree_in_interval(root, end, end+1);
+ if (node) {
+ vrange = container_of(node, struct volatile_range,
+ interval_node);
+ /* Only coalesce if both are either purged or unpurged */
+ if (vrange->purged == purged) {
+ /* resize new range */
+ start = min_t(u64, start, node->start);
+ end = max_t(u64, end, node->end);
+ /* delete old range */
+ vrange_del(head, root, vrange);
+ }
+ }
+ /* Assign and store the new range in the range tree */
+ new->mapping = mapping;
+ new->interval_node.start = start;
+ new->interval_node.end = end;
+ new->purged = purged;
+ interval_tree_add(root, &new->interval_node);
+
+ /* Only add unpurged ranges to LRU */
+ if (!purged) {
+ head->unpurged_page_count += end - start;
+ list_add_tail(&new->lru, &head->lru_head);
+ }
+ return purged;
+}
+
+
+/**
+ * volatile_range_remove: Marks a page interval as nonvolatile
+ * @head: per-fs volatile head
+ * @mapping: address space whose range is being marked nonvolatile
+ * @start_index: Starting page in range to be marked nonvolatile
+ * @end_index: Ending page in range to be marked nonvolatile
+ *
+ * Mark a region as nonvolatile. And remove any contained pages
+ * from the volatile range tree.
+ *
+ * Must lock the volatile_fs_head before calling!
+ *
+ * Returns 1 if any portion of the range had already been purged, 0 otherwise,
+ * or -ENOMEM if memory could not be allocated.
+ */
+long volatile_range_remove(struct volatile_fs_head *head,
+ struct address_space *mapping,
+ pgoff_t start_index, pgoff_t end_index)
+{
+ struct volatile_range *new;
+ struct interval_tree_node *node;
+ struct interval_tree_root *root;
+ int ret = 0;
+ int used_new = 0;
+ u64 start = (u64)start_index;
+ u64 end = (u64)end_index;
+
+ /* Make sure we're properly locked */
+ WARN_ON(!mutex_is_locked(&head->lock));
+
+ /*
+ * Because the lock might be held in a shrinker, release
+ * it during allocation.
+ */
+ mutex_unlock(&head->lock);
+ new = vrange_alloc();
+ mutex_lock(&head->lock);
+ if (!new)
+ return -ENOMEM;
+
+ root = mapping_to_root(mapping);
+ if (!root)
+ goto out;
+
+
+ /* Find any overlapping ranges */
+ node = interval_tree_in_interval(root, start, end);
+ while (node) {
+ struct volatile_range *vrange;
+ vrange = container_of(node, struct volatile_range,
+ interval_node);
+
+ ret |= vrange->purged;
+
+ if (start <= node->start && end >= node->end) {
+ /* delete: volatile range is totally within range */
+ node = interval_tree_next_in_interval(
+ &vrange->interval_node,
+ start, end);
+ vrange_del(head, root, vrange);
+ } else if (node->start >= start) {
+ /* resize: volatile range right-overlaps range */
+ vrange_resize(head, vrange, end+1, node->end);
+ node = interval_tree_next_in_interval(
+ &vrange->interval_node,
+ start, end);
+
+ } else if (node->end <= end) {
+ /* resize: volatile range left-overlaps range */
+ vrange_resize(head, vrange, node->start, start-1);
+ node = interval_tree_next_in_interval(
+ &vrange->interval_node,
+ start, end);
+ } else {
+ /* split: range is totally within a volatile range */
+ used_new = 1; /* we only do this once */
+ new->mapping = mapping;
+ new->interval_node.start = end + 1;
+ new->interval_node.end = node->end;
+ new->purged = vrange->purged;
+ interval_tree_add(root, &new->interval_node);
+ if (!new->purged)
+ list_add_tail(&new->lru, &head->lru_head);
+ vrange_resize(head, vrange, node->start, start-1);
+
+ break;
+ }
+ }
+
+out:
+ if (!used_new)
+ kfree(new);
+
+ return ret;
+}
+
+/**
+ * volatile_range_lru_size: Returns the number of unpurged pages on the lru
+ * @head: per-fs volatile head
+ *
+ * Returns the number of unpurged pages on the LRU
+ *
+ * Must lock the volatile_fs_head before calling!
+ *
+ */
+s64 volatile_range_lru_size(struct volatile_fs_head *head)
+{
+ WARN_ON(!mutex_is_locked(&head->lock));
+ return head->unpurged_page_count;
+}
+
+
+/**
+ * volatile_ranges_get_last_used: Returns mapping and extent of the LRU unpurged range
+ * @head: per-fs volatile head
+ * @mapping: double pointer to the mapping whose range is being purged
+ * @start: Pointer to starting address of range being purged
+ * @end: Pointer to ending address of range being purged
+ *
+ * Returns the mapping, start and end values of the least recently used
+ * range. Marks the range as purged and removes it from the LRU.
+ *
+ * Must lock the volatile_fs_head before calling!
+ *
+ * Returns 1 if a range was returned.
+ * Returns 0 if no unpurged ranges were found.
+ */
+s64 volatile_ranges_get_last_used(struct volatile_fs_head *head,
+ struct address_space **mapping,
+ loff_t *start, loff_t *end)
+{
+ struct volatile_range *range;
+
+ WARN_ON(!mutex_is_locked(&head->lock));
+
+ if (list_empty(&head->lru_head))
+ return 0;
+
+ range = list_first_entry(&head->lru_head, struct volatile_range, lru);
+
+ *start = range->interval_node.start;
+ *end = range->interval_node.end;
+ *mapping = range->mapping;
+
+ head->unpurged_page_count -= *end - *start;
+ list_del(&range->lru);
+ range->purged = 1;
+
+ return 1;
+}
+
+
+/*
+ * Cleans up any volatile ranges.
+ */
+void volatile_range_clear(struct volatile_fs_head *head,
+ struct address_space *mapping)
+{
+ struct volatile_range *tozap;
+ struct interval_tree_root *root;
+
+ WARN_ON(!mutex_is_locked(&head->lock));
+
+ root = mapping_to_root(mapping);
+ if (!root)
+ return;
+
+ while (!interval_tree_empty(root)) {
+ struct interval_tree_node *tmp;
+ tmp = interval_tree_root_node(root);
+ tozap = container_of(tmp, struct volatile_range, interval_node);
+ vrange_del(head, root, tozap);
+ }
+ mapping_free_root(root);
+}
+
+
+static int __init volatile_init(void)
+{
+ int i, size;
+
+ size = 1U << mapping_hash_shift;
+ mapping_hash = kzalloc(sizeof(*mapping_hash)*size, GFP_KERNEL);
+ for (i = 0; i < size; i++)
+ INIT_HLIST_HEAD(&mapping_hash[i]);
+
+ return 0;
+}
+arch_initcall(volatile_init);
--
1.7.3.2.146.gca209

2012-06-01 18:30:57

by John Stultz

Subject: [PATCH 3/3] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers

This patch enables FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE
functionality for tmpfs making use of the volatile range
management code.

Conceptually, FALLOC_FL_MARK_VOLATILE is like a delayed
FALLOC_FL_PUNCH_HOLE. This allows applications that have
data caches that can be re-created to tell the kernel that
some memory contains data that is useful in the future, but
can be recreated if needed, so if the kernel needs, it can
zap the memory without having to swap it out.

In use, applications use FALLOC_FL_MARK_VOLATILE to mark
page ranges as volatile when they are not in use. Then later
if they want to reuse the data, they use
FALLOC_FL_UNMARK_VOLATILE, which will return an error if the
data has been purged.
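
For illustration, a rough userspace sketch of the intended usage (the
flag values come from this patch's falloc.h; the exact return value
conventions are still RFC, so treat a nonzero return from UNMARK as
"purged or error" here):

  #define _GNU_SOURCE
  #include <fcntl.h>

  #define FALLOC_FL_MARK_VOLATILE   0x04
  #define FALLOC_FL_UNMARK_VOLATILE 0x08

  /* Done with the cached data for now: the kernel may purge it
   * under memory pressure instead of swapping it out. */
  static int cache_release(int fd, off_t off, off_t len)
  {
          return fallocate(fd, FALLOC_FL_MARK_VOLATILE, off, len);
  }

  /* Want the data back: a nonzero return means the contents may
   * have been purged, so the cache must be regenerated. */
  static int cache_reacquire(int fd, off_t off, off_t len)
  {
          return fallocate(fd, FALLOC_FL_UNMARK_VOLATILE, off, len);
  }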

This is very much influenced by the Android Ashmem interface by
Robert Love so credits to him and the Android developers.
In many cases the code & logic come directly from the ashmem patch.
The intent of this patch is to allow for ashmem-like behavior, but
embeds the idea a little deeper into the VM code.

This is a reworked version of the fadvise volatile idea submitted
earlier to the list. Thanks to Dave Chinner for suggesting to
rework the idea in this fashion. Also thanks to Dmitry Adamushko
for continued review and bug reporting, and Dave Hansen for
help with the original design and mentoring me in the VM code.

CC: Andrew Morton <[email protected]>
CC: Android Kernel Team <[email protected]>
CC: Robert Love <[email protected]>
CC: Mel Gorman <[email protected]>
CC: Hugh Dickins <[email protected]>
CC: Dave Hansen <[email protected]>
CC: Rik van Riel <[email protected]>
CC: Dmitry Adamushko <[email protected]>
CC: Dave Chinner <[email protected]>
CC: Neil Brown <[email protected]>
CC: Andrea Righi <[email protected]>
CC: Aneesh Kumar K.V <[email protected]>
CC: Taras Glek <[email protected]>
CC: Mike Hommey <[email protected]>
CC: Jan Kara <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
fs/open.c | 3 +-
include/linux/falloc.h | 7 ++-
mm/shmem.c | 107 ++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 113 insertions(+), 4 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index d543012..448ed5a 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -223,7 +223,8 @@ int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
return -EINVAL;

/* Return error if mode is not supported */
- if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
+ if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
+ FALLOC_FL_MARK_VOLATILE | FALLOC_FL_UNMARK_VOLATILE))
return -EOPNOTSUPP;

/* Punch hole must have keep size set */
diff --git a/include/linux/falloc.h b/include/linux/falloc.h
index 73e0b62..3e47ad5 100644
--- a/include/linux/falloc.h
+++ b/include/linux/falloc.h
@@ -1,9 +1,10 @@
#ifndef _FALLOC_H_
#define _FALLOC_H_

-#define FALLOC_FL_KEEP_SIZE 0x01 /* default is extend size */
-#define FALLOC_FL_PUNCH_HOLE 0x02 /* de-allocates range */
-
+#define FALLOC_FL_KEEP_SIZE 0x01 /* default is extend size */
+#define FALLOC_FL_PUNCH_HOLE 0x02 /* de-allocates range */
+#define FALLOC_FL_MARK_VOLATILE 0x04 /* mark range volatile */
+#define FALLOC_FL_UNMARK_VOLATILE 0x08 /* mark range non-volatile */
#ifdef __KERNEL__

/*
diff --git a/mm/shmem.c b/mm/shmem.c
index d576b84..d28daa4 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -64,6 +64,7 @@ static struct vfsmount *shm_mnt;
#include <linux/highmem.h>
#include <linux/seq_file.h>
#include <linux/magic.h>
+#include <linux/volatile.h>

#include <asm/uaccess.h>
#include <asm/pgtable.h>
@@ -624,11 +625,109 @@ static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
return error;
}

+static DEFINE_VOLATILE_FS_HEAD(shmem_volatile_head);
+
+static int shmem_mark_volatile(struct inode *inode, loff_t offset, loff_t len)
+{
+ loff_t lstart, lend;
+ int ret;
+
+ lstart = offset >> PAGE_CACHE_SHIFT;
+ lend = (offset+len) >> PAGE_CACHE_SHIFT;
+
+ volatile_range_lock(&shmem_volatile_head);
+ ret = volatile_range_add(&shmem_volatile_head, &inode->i_data,
+ lstart, lend);
+ if (ret > 0) { /* immediately purge */
+ shmem_truncate_range(inode, lstart<<PAGE_CACHE_SHIFT,
+ (lend<<PAGE_CACHE_SHIFT)-1);
+ ret = 0;
+ }
+ volatile_range_unlock(&shmem_volatile_head);
+
+ return ret;
+}
+
+static int shmem_unmark_volatile(struct inode *inode, loff_t offset, loff_t len)
+{
+ loff_t lstart, lend;
+ int ret;
+
+ lstart = offset >> PAGE_CACHE_SHIFT;
+ lend = (offset+len) >> PAGE_CACHE_SHIFT;
+
+ volatile_range_lock(&shmem_volatile_head);
+ ret = volatile_range_remove(&shmem_volatile_head,
+ &inode->i_data,
+ lstart, lend);
+ volatile_range_unlock(&shmem_volatile_head);
+
+ return ret;
+}
+
+static void shmem_clear_volatile(struct inode *inode)
+{
+ volatile_range_lock(&shmem_volatile_head);
+ volatile_range_clear(&shmem_volatile_head, &inode->i_data);
+ volatile_range_unlock(&shmem_volatile_head);
+}
+
+static
+int shmem_volatile_shrink(struct shrinker *ignored, struct shrink_control *sc)
+{
+ s64 nr_to_scan = sc->nr_to_scan;
+ const gfp_t gfp_mask = sc->gfp_mask;
+ struct address_space *mapping;
+ loff_t start, end;
+ int ret;
+ s64 page_count;
+
+ if (nr_to_scan && !(gfp_mask & __GFP_FS))
+ return -1;
+
+ volatile_range_lock(&shmem_volatile_head);
+ page_count = volatile_range_lru_size(&shmem_volatile_head);
+ if (!nr_to_scan)
+ goto out;
+
+ do {
+ ret = volatile_ranges_get_last_used(&shmem_volatile_head,
+ &mapping, &start, &end);
+ if (ret) {
+ shmem_truncate_range(mapping->host,
+ start<<PAGE_CACHE_SHIFT,
+ (end<<PAGE_CACHE_SHIFT)-1);
+ nr_to_scan -= end-start;
+ page_count -= end-start;
+ }
+ } while (ret && (nr_to_scan > 0));
+
+out:
+ volatile_range_unlock(&shmem_volatile_head);
+
+ return page_count;
+}
+
+static struct shrinker shmem_volatile_shrinker = {
+ .shrink = shmem_volatile_shrink,
+ .seeks = DEFAULT_SEEKS,
+};
+
+static int __init shmem_shrinker_init(void)
+{
+ register_shrinker(&shmem_volatile_shrinker);
+ return 0;
+}
+arch_initcall(shmem_shrinker_init);
+
+
static void shmem_evict_inode(struct inode *inode)
{
struct shmem_inode_info *info = SHMEM_I(inode);
struct shmem_xattr *xattr, *nxattr;

+ shmem_clear_volatile(inode);
+
if (inode->i_mapping->a_ops == &shmem_aops) {
shmem_unacct_size(info->flags, inode->i_size);
inode->i_size = 0;
@@ -1789,6 +1888,14 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
/* No need to unmap again: hole-punching leaves COWed pages */
error = 0;
goto out;
+ } else if (mode & FALLOC_FL_MARK_VOLATILE) {
+ /* Mark pages volatile, sort of delayed hole punching */
+ error = shmem_mark_volatile(inode, offset, len);
+ goto out;
+ } else if (mode & FALLOC_FL_UNMARK_VOLATILE) {
+ /* Mark pages non-volatile, return error if pages were purged */
+ error = shmem_unmark_volatile(inode, offset, len);
+ goto out;
}

/* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */
--
1.7.3.2.146.gca209

2012-06-01 18:33:47

by John Stultz

Subject: [PATCH 1/3] [RFC] Interval tree implementation

After Andrew suggested something like his mumbletree idea
to better store a list of intervals, I worked on a few different
approaches, and this is what I've finally managed to get working.

The idea of storing intervals in a tree is nice, but has a number
of complications. When adding an interval, it's possible that a
large interval will consume and merge a number of smaller intervals.
When removing an interval, it's possible you may end up splitting an
existing interval, causing one interval to become two. This makes it
very difficult to provide generic list_head like behavior, as
the parent structures would need to be duplicated and removed,
and that has lots of memory ownership issues.

So, this is a much simplified and more list_head like
implementation. You can add a node to a tree, or remove a node
from a tree, but the generic implementation doesn't do the
merging or splitting for you. But it does provide helpers to
find overlapping and adjacent intervals.
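
To make the division of labor concrete, a caller that wants coalescing
has to drive it with the helpers itself, roughly like the sketch below
(this mirrors what the volatile range code in patch 2 does; the
function name is made up):

  /* Absorb every node that intersects [start, end] into the
   * caller-provided, pre-initialized node 'new'. */
  static void example_merge_interval(struct interval_tree_root *root,
                                     struct interval_tree_node *new,
                                     u64 start, u64 end)
  {
          struct interval_tree_node *node, *next;

          node = interval_tree_in_interval(root, start, end);
          while (node) {
                  /* grow the new interval to cover the existing one */
                  start = min_t(u64, start, node->start);
                  end = max_t(u64, end, node->end);

                  next = interval_tree_next_in_interval(node, start, end);
                  interval_tree_remove(root, node);  /* consume it */
                  node = next;
          }

          new->start = start;
          new->end = end;
          /* 'new' no longer intersects anything, so it can be added */
          interval_tree_add(root, new);
  }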

Andrew also really wanted this interval-tree implementation to be
reusable so we don't duplicate the file locking logic. I'm not
totally convinced that the requirements between the volatile
intervals and file locking are really equivalent, but this reduced
implementation may make it possible.

Changelog:
v2:
* Reworked code to use an rbtree instead of splaying

v3:
* Added range_tree_next_in_range() to avoid having to start
lookups from the root every time.
* Fixed some comments and return NULL instead of 0, as suggested
by Aneesh Kumar K.V

v6:
* Fixed range_tree_in_range() so that it finds the earliest range,
rather than the first. This allows the next_in_range() function
to properly cover all the ranges in the tree.
* Minor cleanups to simplify some of the functions

v7:
* Changed terminology from rangetree to intervaltree as suggested
by Jan Kara

CC: Andrew Morton <[email protected]>
CC: Android Kernel Team <[email protected]>
CC: Robert Love <[email protected]>
CC: Mel Gorman <[email protected]>
CC: Hugh Dickins <[email protected]>
CC: Dave Hansen <[email protected]>
CC: Rik van Riel <[email protected]>
CC: Dmitry Adamushko <[email protected]>
CC: Dave Chinner <[email protected]>
CC: Neil Brown <[email protected]>
CC: Andrea Righi <[email protected]>
CC: Aneesh Kumar K.V <[email protected]>
CC: Taras Glek <[email protected]>
CC: Mike Hommey <[email protected]>
CC: Jan Kara <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
include/linux/intervaltree.h | 55 +++++++++++++++++++
lib/Makefile | 2 +-
lib/intervaltree.c | 119 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 175 insertions(+), 1 deletions(-)
create mode 100644 include/linux/intervaltree.h
create mode 100644 lib/intervaltree.c

diff --git a/include/linux/intervaltree.h b/include/linux/intervaltree.h
new file mode 100644
index 0000000..cfaa174
--- /dev/null
+++ b/include/linux/intervaltree.h
@@ -0,0 +1,55 @@
+#ifndef _LINUX_INTERVALTREE_H
+#define _LINUX_INTERVALTREE_H
+
+#include <linux/types.h>
+#include <linux/rbtree.h>
+
+struct interval_tree_node {
+ struct rb_node rb;
+ u64 start;
+ u64 end;
+};
+
+struct interval_tree_root {
+ struct rb_root head;
+};
+
+static inline void interval_tree_init(struct interval_tree_root *root)
+{
+ root->head = RB_ROOT;
+}
+
+static inline void interval_tree_node_init(struct interval_tree_node *node)
+{
+ rb_init_node(&node->rb);
+ node->start = 0;
+ node->end = 0;
+}
+
+static inline int interval_tree_empty(struct interval_tree_root *root)
+{
+ return RB_EMPTY_ROOT(&root->head);
+}
+
+static inline
+struct interval_tree_node *interval_tree_root_node(
+ struct interval_tree_root *root)
+{
+ struct interval_tree_node *ret;
+ ret = container_of(root->head.rb_node, struct interval_tree_node, rb);
+ return ret;
+}
+
+extern struct interval_tree_node *interval_tree_in_interval(
+ struct interval_tree_root *root,
+ u64 start, u64 end);
+extern struct interval_tree_node *interval_tree_next_in_interval(
+ struct interval_tree_node *node,
+ u64 start, u64 end);
+extern void interval_tree_add(struct interval_tree_root *root,
+ struct interval_tree_node *node);
+extern void interval_tree_remove(struct interval_tree_root *root,
+ struct interval_tree_node *node);
+#endif
+
+
diff --git a/lib/Makefile b/lib/Makefile
index 8c31a0c..2bbad25 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
idr.o int_sqrt.o extable.o prio_tree.o \
sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
proportions.o prio_heap.o ratelimit.o show_mem.o \
- is_single_threaded.o plist.o decompress.o
+ is_single_threaded.o plist.o decompress.o intervaltree.o

lib-$(CONFIG_MMU) += ioremap.o
lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/intervaltree.c b/lib/intervaltree.c
new file mode 100644
index 0000000..47c52e0
--- /dev/null
+++ b/lib/intervaltree.c
@@ -0,0 +1,119 @@
+#include <linux/intervaltree.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+
+/* This code implements a naive interval tree, which stores a series of
+ * non-intersecting intervals.
+ * More complex interval trees can be read about here:
+ * http://en.wikipedia.org/wiki/Interval_tree
+ */
+
+
+/**
+ * interval_tree_in_interval - Returns the first node that intersects with the
+ * given interval
+ * @root: interval_tree root
+ * @start: interval start
+ * @end: interval end
+ *
+ */
+struct interval_tree_node *interval_tree_in_interval(
+ struct interval_tree_root *root,
+ u64 start, u64 end)
+{
+ struct rb_node *p = root->head.rb_node;
+ struct interval_tree_node *candidate, *match = NULL;
+
+ while (p) {
+ candidate = rb_entry(p, struct interval_tree_node, rb);
+ if (end < candidate->start)
+ p = p->rb_left;
+ else if (start > candidate->end)
+ p = p->rb_right;
+ else {
+ /* We found one, but try to find an earlier match */
+ match = candidate;
+ p = p->rb_left;
+ }
+ }
+
+ return match;
+}
+
+
+/**
+ * interval_tree_next_in_interval - Return the next node in an interval tree
+ * that intersects with a specified interval.
+ * @node: node to continue the search from
+ * @start: interval start
+ * @end: interval end
+ *
+ */
+struct interval_tree_node *interval_tree_next_in_interval(
+ struct interval_tree_node *node,
+ u64 start, u64 end)
+{
+ struct rb_node *next;
+ struct interval_tree_node *candidate;
+ if (!node)
+ return NULL;
+ next = rb_next(&node->rb);
+ if (!next)
+ return NULL;
+
+ candidate = container_of(next, struct interval_tree_node, rb);
+
+ if ((candidate->start > end) || (candidate->end < start))
+ return NULL;
+
+ return candidate;
+}
+
+/**
+ * interval_tree_add - Add a node to an interval tree
+ * @root: interval tree to be added to
+ * @node: interval_tree_node to be added
+ *
+ * Adds a node to the interval tree. Added interval should not intersect with
+ * existing intervals in the tree.
+ */
+void interval_tree_add(struct interval_tree_root *root,
+ struct interval_tree_node *node)
+{
+ struct rb_node **p = &root->head.rb_node;
+ struct rb_node *parent = NULL;
+ struct interval_tree_node *ptr;
+
+ WARN_ON_ONCE(!RB_EMPTY_NODE(&node->rb));
+
+ /* XXX might want to conditionalize this on debugging checks */
+ WARN_ON_ONCE(!!interval_tree_in_interval(root, node->start, node->end));
+
+ while (*p) {
+ parent = *p;
+ ptr = rb_entry(parent, struct interval_tree_node, rb);
+ if (node->start < ptr->start)
+ p = &(*p)->rb_left;
+ else
+ p = &(*p)->rb_right;
+ }
+ rb_link_node(&node->rb, parent, p);
+ rb_insert_color(&node->rb, &root->head);
+}
+
+
+/**
+ * interval_tree_remove: Removes a given node from the tree
+ * @root: root of tree
+ * @node: Node to be removed
+ *
+ * Removes a node from the interval tree
+ */
+void interval_tree_remove(struct interval_tree_root *root,
+ struct interval_tree_node *node)
+{
+ WARN_ON_ONCE(RB_EMPTY_NODE(&node->rb));
+
+ rb_erase(&node->rb, &root->head);
+ RB_CLEAR_NODE(&node->rb);
+}
--
1.7.3.2.146.gca209

2012-06-01 20:17:43

by KOSAKI Motohiro

Subject: Re: [PATCH 3/3] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers

Hi John,

(6/1/12 2:29 PM), John Stultz wrote:
> This patch enables FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE
> functionality for tmpfs making use of the volatile range
> management code.
>
> Conceptually, FALLOC_FL_MARK_VOLATILE is like a delayed
> FALLOC_FL_PUNCH_HOLE. This allows applications that have
> data caches that can be re-created to tell the kernel that
> some memory contains data that is useful in the future, but
> can be recreated if needed, so if the kernel needs, it can
> zap the memory without having to swap it out.
>
> In use, applications use FALLOC_FL_MARK_VOLATILE to mark
> page ranges as volatile when they are not in use. Then later
> if they wants to reuse the data, they use
> FALLOC_FL_UNMARK_VOLATILE, which will return an error if the
> data has been purged.
>
> This is very much influenced by the Android Ashmem interface by
> Robert Love so credits to him and the Android developers.
> In many cases the code & logic come directly from the ashmem patch.
> The intent of this patch is to allow for ashmem-like behavior, but
> embeds the idea a little deeper into the VM code.
>
> This is a reworked version of the fadvise volatile idea submitted
> earlier to the list. Thanks to Dave Chinner for suggesting to
> rework the idea in this fashion. Also thanks to Dmitry Adamushko
> for continued review and bug reporting, and Dave Hansen for
> help with the original design and mentoring me in the VM code.

I like this patch concept. This is cleaner than userland
notification quirk. But I don't like that you use a shrinker, because
after applying this patch the normal page reclaim path can still cause
swap out. This is undesirable.


(snip)

> +static
> +int shmem_volatile_shrink(struct shrinker *ignored, struct shrink_control *sc)
> +{
> + s64 nr_to_scan = sc->nr_to_scan;
> + const gfp_t gfp_mask = sc->gfp_mask;
> + struct address_space *mapping;
> + loff_t start, end;
> + int ret;
> + s64 page_count;
> +
> + if (nr_to_scan && !(gfp_mask & __GFP_FS))
> + return -1;
> +
> + volatile_range_lock(&shmem_volatile_head);
> + page_count = volatile_range_lru_size(&shmem_volatile_head);
> + if (!nr_to_scan)
> + goto out;
> +
> + do {
> + ret = volatile_ranges_get_last_used(&shmem_volatile_head,
> + &mapping, &start, &end);

Why drop last used region? Not recently used region is better?



> + if (ret) {
> + shmem_truncate_range(mapping->host,
> + start<<PAGE_CACHE_SHIFT,
> + (end<<PAGE_CACHE_SHIFT)-1);
> + nr_to_scan -= end-start;
> + page_count -= end-start;
> + };
> + } while (ret && (nr_to_scan > 0));
> +
> +out:
> + volatile_range_unlock(&shmem_volatile_head);
> +
> + return page_count;
> +}
> +

2012-06-01 21:04:05

by John Stultz

Subject: Re: [PATCH 3/3] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers

On 06/01/2012 01:17 PM, KOSAKI Motohiro wrote:
> Hi John,
>
> (6/1/12 2:29 PM), John Stultz wrote:
>> This patch enables FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE
>> functionality for tmpfs making use of the volatile range
>> management code.
>>
>> Conceptually, FALLOC_FL_MARK_VOLATILE is like a delayed
>> FALLOC_FL_PUNCH_HOLE. This allows applications that have
>> data caches that can be re-created to tell the kernel that
>> some memory contains data that is useful in the future, but
>> can be recreated if needed, so if the kernel needs, it can
>> zap the memory without having to swap it out.
>>
>> In use, applications use FALLOC_FL_MARK_VOLATILE to mark
>> page ranges as volatile when they are not in use. Then later
>> if they wants to reuse the data, they use
>> FALLOC_FL_UNMARK_VOLATILE, which will return an error if the
>> data has been purged.
>>
>> This is very much influenced by the Android Ashmem interface by
>> Robert Love so credits to him and the Android developers.
>> In many cases the code & logic come directly from the ashmem patch.
>> The intent of this patch is to allow for ashmem-like behavior, but
>> embeds the idea a little deeper into the VM code.
>>
>> This is a reworked version of the fadvise volatile idea submitted
>> earlier to the list. Thanks to Dave Chinner for suggesting to
>> rework the idea in this fashion. Also thanks to Dmitry Adamushko
>> for continued review and bug reporting, and Dave Hansen for
>> help with the original design and mentoring me in the VM code.
> I like this patch concept. This is cleaner than userland
> notification quirk. But I don't like you use shrinker. Because of,
> after applying this patch, normal page reclaim path can still make
> swap out. this is undesirable.
Any recommendations for alternative approaches? What should I be hooking
into in order to get notified that tmpfs should drop volatile pages?


>> +static
>> +int shmem_volatile_shrink(struct shrinker *ignored, struct shrink_control *sc)
>> +{
>> + s64 nr_to_scan = sc->nr_to_scan;
>> + const gfp_t gfp_mask = sc->gfp_mask;
>> + struct address_space *mapping;
>> + loff_t start, end;
>> + int ret;
>> + s64 page_count;
>> +
>> + if (nr_to_scan && !(gfp_mask & __GFP_FS))
>> + return -1;
>> +
>> + volatile_range_lock(&shmem_volatile_head);
>> + page_count = volatile_range_lru_size(&shmem_volatile_head);
>> + if (!nr_to_scan)
>> + goto out;
>> +
>> + do {
>> + ret = volatile_ranges_get_last_used(&shmem_volatile_head,
>> + &mapping, &start, &end);
> Why drop last used region? Not recently used region is better?
>
Sorry, that function name isn't very good. It does return the
least-recently-used range, or more specifically: the
least-recently-marked-volatile-range.

I'll improve that function name, but if I misunderstood you and you have
a different suggestion for the purging order, let me know.

thanks
-john

2012-06-01 21:37:22

by KOSAKI Motohiro

Subject: Re: [PATCH 3/3] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers

(6/1/12 5:03 PM), John Stultz wrote:
> On 06/01/2012 01:17 PM, KOSAKI Motohiro wrote:
>> Hi John,
>>
>> (6/1/12 2:29 PM), John Stultz wrote:
>>> This patch enables FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE
>>> functionality for tmpfs making use of the volatile range
>>> management code.
>>>
>>> Conceptually, FALLOC_FL_MARK_VOLATILE is like a delayed
>>> FALLOC_FL_PUNCH_HOLE. This allows applications that have
>>> data caches that can be re-created to tell the kernel that
>>> some memory contains data that is useful in the future, but
>>> can be recreated if needed, so if the kernel needs, it can
>>> zap the memory without having to swap it out.
>>>
>>> In use, applications use FALLOC_FL_MARK_VOLATILE to mark
>>> page ranges as volatile when they are not in use. Then later
>>> if they wants to reuse the data, they use
>>> FALLOC_FL_UNMARK_VOLATILE, which will return an error if the
>>> data has been purged.
>>>
>>> This is very much influenced by the Android Ashmem interface by
>>> Robert Love so credits to him and the Android developers.
>>> In many cases the code & logic come directly from the ashmem patch.
>>> The intent of this patch is to allow for ashmem-like behavior, but
>>> embeds the idea a little deeper into the VM code.
>>>
>>> This is a reworked version of the fadvise volatile idea submitted
>>> earlier to the list. Thanks to Dave Chinner for suggesting to
>>> rework the idea in this fashion. Also thanks to Dmitry Adamushko
>>> for continued review and bug reporting, and Dave Hansen for
>>> help with the original design and mentoring me in the VM code.
>> I like this patch concept. This is cleaner than userland
>> notification quirk. But I don't like you use shrinker. Because of,
>> after applying this patch, normal page reclaim path can still make
>> swap out. this is undesirable.
> Any recommendations for alternative approaches? What should I be hooking
> into in order to get notified that tmpfs should drop volatile pages?

I thought to modify shmem_write_page(). But other way is also ok to me.


>>> +static
>>> +int shmem_volatile_shrink(struct shrinker *ignored, struct shrink_control *sc)
>>> +{
>>> + s64 nr_to_scan = sc->nr_to_scan;
>>> + const gfp_t gfp_mask = sc->gfp_mask;
>>> + struct address_space *mapping;
>>> + loff_t start, end;
>>> + int ret;
>>> + s64 page_count;
>>> +
>>> + if (nr_to_scan && !(gfp_mask & __GFP_FS))
>>> + return -1;
>>> +
>>> + volatile_range_lock(&shmem_volatile_head);
>>> + page_count = volatile_range_lru_size(&shmem_volatile_head);
>>> + if (!nr_to_scan)
>>> + goto out;
>>> +
>>> + do {
>>> + ret = volatile_ranges_get_last_used(&shmem_volatile_head,
>>> + &mapping, &start, &end);
>> Why drop last used region? Not recently used region is better?
>>
> Sorry, that function name isn't very good. It does return the
> least-recently-used range, or more specifically: the
> least-recently-marked-volatile-range.

Ah, I misunderstood. thanks for correction.


> I'll improve that function name, but if I misunderstood you and you have
> a different suggestion for the purging order, let me know.

No, please just rename.

2012-06-01 21:45:37

by John Stultz

Subject: Re: [PATCH 3/3] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers

On 06/01/2012 02:37 PM, KOSAKI Motohiro wrote:
> (6/1/12 5:03 PM), John Stultz wrote:
>> On 06/01/2012 01:17 PM, KOSAKI Motohiro wrote:
>>> Hi John,
>>>
>>> (6/1/12 2:29 PM), John Stultz wrote:
>>>> This patch enables FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE
>>>> functionality for tmpfs making use of the volatile range
>>>> management code.
>>>>
>>>> Conceptually, FALLOC_FL_MARK_VOLATILE is like a delayed
>>>> FALLOC_FL_PUNCH_HOLE. This allows applications that have
>>>> data caches that can be re-created to tell the kernel that
>>>> some memory contains data that is useful in the future, but
>>>> can be recreated if needed, so if the kernel needs, it can
>>>> zap the memory without having to swap it out.
>>>>
>>>> In use, applications use FALLOC_FL_MARK_VOLATILE to mark
>>>> page ranges as volatile when they are not in use. Then later
>>>> if they wants to reuse the data, they use
>>>> FALLOC_FL_UNMARK_VOLATILE, which will return an error if the
>>>> data has been purged.
>>>>
>>>> This is very much influenced by the Android Ashmem interface by
>>>> Robert Love so credits to him and the Android developers.
>>>> In many cases the code & logic come directly from the ashmem patch.
>>>> The intent of this patch is to allow for ashmem-like behavior, but
>>>> embeds the idea a little deeper into the VM code.
>>>>
>>>> This is a reworked version of the fadvise volatile idea submitted
>>>> earlier to the list. Thanks to Dave Chinner for suggesting to
>>>> rework the idea in this fashion. Also thanks to Dmitry Adamushko
>>>> for continued review and bug reporting, and Dave Hansen for
>>>> help with the original design and mentoring me in the VM code.
>>> I like this patch concept. This is cleaner than userland
>>> notification quirk. But I don't like you use shrinker. Because of,
>>> after applying this patch, normal page reclaim path can still make
>>> swap out. this is undesirable.
>> Any recommendations for alternative approaches? What should I be hooking
>> into in order to get notified that tmpfs should drop volatile pages?
> I thought to modify shmem_write_page(). But other way is also ok to me.
So initially the patch used shmem_write_page(), purging ranges if a page
was to be swapped (and just dropping it instead). The problem there is
that if there's a large range that is very active, we might purge the
entire range just because it contains one rarely used page. This is why
the LRU list for unpurged volatile ranges is useful.

However, Dave Hansen just suggested to me on irc the idea of if we're
swapping any pages, we might want to just purge a volatile range
instead. This allows us to keep the unpurged LRU range list, but just
uses write_page as the flag for needing to free memory.

I'm taking a shot at implementing this now, but let me know if it sounds
good to you.

>>>> +static
>>>> +int shmem_volatile_shrink(struct shrinker *ignored, struct shrink_control *sc)
>>>> +{
>>>> + s64 nr_to_scan = sc->nr_to_scan;
>>>> + const gfp_t gfp_mask = sc->gfp_mask;
>>>> + struct address_space *mapping;
>>>> + loff_t start, end;
>>>> + int ret;
>>>> + s64 page_count;
>>>> +
>>>> + if (nr_to_scan && !(gfp_mask & __GFP_FS))
>>>> + return -1;
>>>> +
>>>> + volatile_range_lock(&shmem_volatile_head);
>>>> + page_count = volatile_range_lru_size(&shmem_volatile_head);
>>>> + if (!nr_to_scan)
>>>> + goto out;
>>>> +
>>>> + do {
>>>> + ret = volatile_ranges_get_last_used(&shmem_volatile_head,
>>>> + &mapping, &start, &end);
>>> Why drop last used region? Not recently used region is better?
>>>
>> Sorry, that function name isn't very good. It does return the
>> least-recently-used range, or more specifically: the
>> least-recently-marked-volatile-range.
> Ah, I misunderstood. thanks for correction.
>
>
>> I'll improve that function name, but if I misunderstood you and you have
>> a different suggestion for the purging order, let me know.
> No, please just rename.
Will do.

Thanks for the feedback!
-john

2012-06-01 22:34:58

by KOSAKI Motohiro

Subject: Re: [PATCH 3/3] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers

(6/1/12 5:44 PM), John Stultz wrote:
> On 06/01/2012 02:37 PM, KOSAKI Motohiro wrote:
>> (6/1/12 5:03 PM), John Stultz wrote:
>>> On 06/01/2012 01:17 PM, KOSAKI Motohiro wrote:
>>>> Hi John,
>>>>
>>>> (6/1/12 2:29 PM), John Stultz wrote:
>>>>> This patch enables FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE
>>>>> functionality for tmpfs making use of the volatile range
>>>>> management code.
>>>>>
>>>>> Conceptually, FALLOC_FL_MARK_VOLATILE is like a delayed
>>>>> FALLOC_FL_PUNCH_HOLE. This allows applications that have
>>>>> data caches that can be re-created to tell the kernel that
>>>>> some memory contains data that is useful in the future, but
>>>>> can be recreated if needed, so if the kernel needs, it can
>>>>> zap the memory without having to swap it out.
>>>>>
>>>>> In use, applications use FALLOC_FL_MARK_VOLATILE to mark
>>>>> page ranges as volatile when they are not in use. Then later
>>>>> if they wants to reuse the data, they use
>>>>> FALLOC_FL_UNMARK_VOLATILE, which will return an error if the
>>>>> data has been purged.
>>>>>
>>>>> This is very much influenced by the Android Ashmem interface by
>>>>> Robert Love so credits to him and the Android developers.
>>>>> In many cases the code & logic come directly from the ashmem patch.
>>>>> The intent of this patch is to allow for ashmem-like behavior, but
>>>>> embeds the idea a little deeper into the VM code.
>>>>>
>>>>> This is a reworked version of the fadvise volatile idea submitted
>>>>> earlier to the list. Thanks to Dave Chinner for suggesting to
>>>>> rework the idea in this fashion. Also thanks to Dmitry Adamushko
>>>>> for continued review and bug reporting, and Dave Hansen for
>>>>> help with the original design and mentoring me in the VM code.
>>>> I like this patch concept. This is cleaner than userland
>>>> notification quirk. But I don't like you use shrinker. Because of,
>>>> after applying this patch, normal page reclaim path can still make
>>>> swap out. this is undesirable.
>>> Any recommendations for alternative approaches? What should I be hooking
>>> into in order to get notified that tmpfs should drop volatile pages?
>> I thought to modify shmem_write_page(). But other way is also ok to me.
> So initially the patch used shmem_write_page(), purging ranges if a page
> was to be swapped (and just dropping it instead). The problem there is
> that if there's a large range that is very active, we might purge the
> entire range just because it contains one rarely used page. This is why
> the LRU list for unpurged volatile ranges is useful.

???
But, volatile marking order is not related to access frequency. Why do you
bother more inaccurate one? At least, pageout() should affect lru order
of volatile ranges?


> However, Dave Hansen just suggested to me on irc the idea of if we're
> swapping any pages, we might want to just purge a volatile range
> instead. This allows us to keep the unpurged LRU range list, but just
> uses write_page as the flag for needing to free memory.

Can you please elaborate more? I don't understand what's different
"just dropping it instead" and "just purge a volatile range instead".


> I'm taking a shot at implementing this now, but let me know if it sounds
> good to you.


2012-06-01 23:25:49

by John Stultz

Subject: Re: [PATCH 3/3] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers

On 06/01/2012 03:34 PM, KOSAKI Motohiro wrote:
> (6/1/12 5:44 PM), John Stultz wrote:
>> On 06/01/2012 02:37 PM, KOSAKI Motohiro wrote:
>>> (6/1/12 5:03 PM), John Stultz wrote:
>>>> On 06/01/2012 01:17 PM, KOSAKI Motohiro wrote:
>>>>> I like this patch concept. This is cleaner than userland
>>>>> notification quirk. But I don't like you use shrinker. Because of,
>>>>> after applying this patch, normal page reclaim path can still make
>>>>> swap out. this is undesirable.
>>>> Any recommendations for alternative approaches? What should I be hooking
>>>> into in order to get notified that tmpfs should drop volatile pages?
>>> I thought to modify shmem_write_page(). But other way is also ok to me.
>> So initially the patch used shmem_write_page(), purging ranges if a page
>> was to be swapped (and just dropping it instead). The problem there is
>> that if there's a large range that is very active, we might purge the
>> entire range just because it contains one rarely used page. This is why
>> the LRU list for unpurged volatile ranges is useful.
> ???
> But, volatile marking order is not related to access frequency.

Correct.

> Why do you
> bother more inaccurate one? At least, pageout() should affect lru order
> of volatile ranges?

Not sure I'm following you here.

The key point is we want volatile ranges to be purged in the order they
were marked volatile.
If we use the page lru via shmem_writeout to trigger range purging, we
wouldn't necessarily get this desired behavior.

That said, Dave's idea is to still use a volatile range LRU, but to free
it via shmem_writeout. This allows us to purge volatile pages before
swapping out pages. I'll be sending a modified patchset out shortly that
does this, hopefully it helps makes this idea clear.

>> However, Dave Hansen just suggested to me on irc the idea of if we're
>> swapping any pages, we might want to just purge a volatile range
>> instead. This allows us to keep the unpurged LRU range list, but just
>> uses write_page as the flag for needing to free memory.
> Can you please elaborate more? I don't understand what's different
> "just dropping it instead" and "just purge a volatile range instead".
So in the first implementation, on writeout we checked if the page was
in a volatile range, and if so we dropped the page (just unlocking the
page) and marked the range as purged instead of swapping the page out.
This was non-optimal since the entire range was marked purged, but other
volatile pages in that range would not be dropped until writeout was
called on them.

My next implementation purged the entire range (via
shmem_truncate_range) if we did a writeout on a page in that range. This
was better, but still left us open to purging recently marked volatile
ranges if only a single page in that range had not been accessed in a while.

That's when I added the LRU tracking at the volatile range level (which
reverted back to the behavior ashmem has always used), and have been
using that model sense.

Hopefully this clarifies things. My apologies if I don't always use the
correct terminology, as I'm still a newbie when it comes to VM code.

thanks
-john

2012-06-06 19:52:32

by KOSAKI Motohiro

Subject: Re: [PATCH 3/3] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers

>>>>>> I like this patch concept. This is cleaner than userland
>>>>>> notification quirk. But I don't like you use shrinker. Because of,
>>>>>> after applying this patch, normal page reclaim path can still make
>>>>>> swap out. this is undesirable.
>>>>> Any recommendations for alternative approaches? What should I be hooking
>>>>> into in order to get notified that tmpfs should drop volatile pages?
>>>> I thought to modify shmem_write_page(). But other way is also ok to me.
>>> So initially the patch used shmem_write_page(), purging ranges if a page
>>> was to be swapped (and just dropping it instead). The problem there is
>>> that if there's a large range that is very active, we might purge the
>>> entire range just because it contains one rarely used page. This is why
>>> the LRU list for unpurged volatile ranges is useful.
>> ???
>> But, volatile marking order is not related to access frequency.
>
> Correct.
>
>> Why do you
>> bother more inaccurate one? At least, pageout() should affect lru order
>> of volatile ranges?
>
> Not sure I'm following you here.
>
> The key point is we want volatile ranges to be purged in the order they
> were marked volatile.
> If we use the page lru via shmem_writeout to trigger range purging, we
> wouldn't necessarily get this desired behavior.

Ok, so can you please explain your ideal reclaim order? Your last mail
described old and new volatiled regions, but I'm not sure about the order of
regular tmpfs pages vs volatile pages vs regular file cache. That said, when
using shrink_slab(), we choose a random order to drop against page cache. I'm
not sure why you are sure it is ideal.

And, now I guess you think nobody touches a volatiled page, yes? Because
otherwise the volatile marking order is a silly choice. If yes, what happens
if anyone touches a page which was volatiled? A no-op? SIGBUS?



>
> That said, Dave's idea is to still use a volatile range LRU, but to free
> it via shmem_writeout. This allows us to purge volatile pages before
> swapping out pages. I'll be sending a modified patchset out shortly that
> does this, hopefully it helps makes this idea clear.
>
>>> However, Dave Hansen just suggested to me on irc the idea of if we're
>>> swapping any pages, we might want to just purge a volatile range
>>> instead. This allows us to keep the unpurged LRU range list, but just
>>> uses write_page as the flag for needing to free memory.
>> Can you please elaborate more? I don't understand what's different
>> "just dropping it instead" and "just purge a volatile range instead".
> So in the first implementation, on writeout we checked if the page was
> in a volatile range, and if so we dropped the page (just unlocking the
> page) and marked the range as purged instead of swapping the page out.
> This was non-optimal since the entire range was marked purged, but other
> volatile pages in that range would not be dropped until writeout was
> called on them.
>
> My next implementation purged the entire range (via
> shmem_truncate_range) if we did a writeout on a page in that range. This
> was better, but still left us open to purging recently marked volatile
> ranges if only a single page in that range had not been accessed in awhile.

Which workload didn't work? Usually, anon page reclaim only happens under
1) a tmpfs streaming I/O workload or 2) heavy VM pressure. So this scenario
does not seem so inaccurate to me.


> That's when I added the LRU tracking at the volatile range level (which
> reverted back to the behavior ashmem has always used), and have been
> using that model sense.
>
> Hopefully this clarifies things. My apologies if I don't always use the
> correct terminology, as I'm still a newbie when it comes to VM code.

I think your code is clean enough. But I'm still not sure about your background
design. Please help me understand it clearly.

BTW, why did you choose fallocate instead of fadvise? As far as I skimmed,
fallocate() is an operation on the disk layout, not on the cache. And why
did you choose fadvise() instead of madvise() in the initial version? A vma
hint might be more useful than fadvise() because it can be used for anonymous
pages too.

2012-06-06 23:56:56

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH 3/3] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers

On 06/06/2012 12:52 PM, KOSAKI Motohiro wrote:
>> The key point is we want volatile ranges to be purged in the order they
>> were marked volatile.
>> If we use the page lru via shmem_writeout to trigger range purging, we
>> wouldn't necessarily get this desired behavior.
> Ok, so can you please explain your ideal order to reclaim. your last mail
> described old and new volatiled region. but I'm not sure regular tmpfs pages
> vs volatile pages vs regular file cache order. That said, when using shrink_slab(),
> we choose random order to drop against page cache. I'm not sure why you sure
> it is ideal.

So I'm not totally sure it's ideal, but I can tell you what makes sense to
me. If there is a more ideal order, I'm open to suggestions.

So volatile ranges should be purged first-in-first-out. So the first
range marked volatile should be purged first. Since volatile ranges
might have different costs depending on what filesystem the file is
backed by, this LRU order is per-filesystem.

It seems that if we have tmpfs volatile ranges, we should purge them
before we swap out any regular tmpfs pages. That's why I'm purging any
available ranges on shmem_writepage before swapping, rather than using a
shrinker now (I'm hoping you saw the updated patchset I sent out Friday).

Does that make sense?

> And, now I guess you think nobody touch volatiled page, yes? because otherwise
> volatile marking order is silly choice. If yes, what's happen if anyone touch
> a patch which volatiled. no-op? SIGBUS?

So more of a no-op. If you read a page that has been marked volatile, it
may return the data that was there, or it may return an empty, zero-filled page.

I guess we could throw a signal to help developers avoid making
programming mistakes, but I'm not sure what the extra cost would be to
set up and tear that down each time. One important aspect of this is
that in order to make it attractive for an application to mark ranges as
volatile, it has to be very cheap to mark and unmark ranges.
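
As a usage sketch (not from the patch itself), the intended pattern for an
application-side cache would look roughly like this; the flag values are
placeholders for the ones defined in the patched falloc.h, and the
"nonzero return from unmark means purged" convention is an assumption here:

    #define _GNU_SOURCE
    #include <fcntl.h>

    #ifndef FALLOC_FL_MARK_VOLATILE
    #define FALLOC_FL_MARK_VOLATILE   0x100  /* placeholder value */
    #define FALLOC_FL_UNMARK_VOLATILE 0x200  /* placeholder value */
    #endif

    /* Cache region is unused: hint that the kernel may purge it. */
    static void cache_done_with(int fd, off_t off, off_t len)
    {
        fallocate(fd, FALLOC_FL_MARK_VOLATILE, off, len);
    }

    /* About to reuse the region: unmark it and see if it survived. */
    static int cache_reuse(int fd, off_t off, off_t len)
    {
        int ret = fallocate(fd, FALLOC_FL_UNMARK_VOLATILE, off, len);

        return ret > 0 ? 1 : 0;   /* 1 => purged, caller regenerates */
    }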



> Which worklord didn't work. Usually, anon pages reclaim are only
> happen when 1) tmpfs streaming io workload or 2) heavy vm pressure.
> So, this scenario are not so inaccurate to me.

So it was more of a theoretical issue in my discussions, but once it was
brought up, ashmem's global range lru made more sense.

I think the workload we're mostly concerned with here is heavy vm pressure.



>> That's when I added the LRU tracking at the volatile range level (which
>> reverted back to the behavior ashmem has always used), and have been
>> using that model sense.
>>
>> Hopefully this clarifies things. My apologies if I don't always use the
>> correct terminology, as I'm still a newbie when it comes to VM code.
> I think your code is enough clean. But I'm still not sure your background
> design. Please help me to understand clearly.
Hopefully the above helps. But let me know where you'd like more
clarification.


> btw, Why do you choice fallocate instead of fadvise? As far as I skimmed,
> fallocate() is an operation of a disk layout, not of a cache. And, why
> did you choice fadvise() instead of madvise() at initial version. vma
> hint might be useful than fadvise() because it can be used for anonymous
> pages too.
I actually started with madvise, but quickly moved to fadvise on feeling
that fd-based ranges made more sense. With ashmem, fds are
often shared, and coordinating volatile ranges on a shared fd made more
sense with an (fd, offset, len) tuple, rather than with an offset and length
on an mmapped region.

I moved to fallocate at Dave Chinner's request. In short, it allows
non-tmpfs filesystems to implement volatile range semantics, allowing
them to zap rather than write out dirty volatile pages. And since
volatile ranges are very similar to a delayed/cancelable hole-punch, it
made sense to use an interface similar to FALLOC_FL_PUNCH_HOLE.

You can read the details of DaveC's suggestion here:
https://lkml.org/lkml/2012/4/30/441

thanks
-john




2012-06-07 10:55:05

by Dmitry Adamushko

[permalink] [raw]
Subject: Re: [PATCH 3/3] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers

[ ... ]
>> Ok, so can you please explain your ideal order to reclaim. your last mail
>> described old and new volatiled region. but I'm not sure regular tmpfs pages
>> vs volatile pages vs regular file cache order. That said, when using shrink_slab(),
>> we choose random order to drop against page cache. I'm not sure why you sure
>> it is ideal.
>
> So I'm not totally sure its ideal, but I can tell you what make sense to
> me. If there is a more ideal order, I'm open to suggestions.
>
> So volatile ranges should be purged first-in-first-out. So the first
> range marked volatile should be purged first. Since volatile ranges
> might have different costs depending on what filesystem the file is
> backed by, this LRU order is per-filesystem.
>
> It seems that if we have tmpfs volatile ranges, we should purge them
> before we swap out any regular tmpfs pages. Thus why I'm purging any
> available ranges on shmem_writepage before swapping, rather then using a
> shrinker now (I'm hoping you saw the updated patchset I sent out friday).
>

so there are multiple sources of reclaimable memory, each coming with
its own cost of (a) giving a memory page back to the kernel and (b)
populating a new page with the content (if accessed later). Given that
the costs are different, I assume that the kernel tries to balance
between different sources with the goal of minimizing the overall
cost.

In this light, the placement of a new source (like 'volatile ranges')
of reclaimable memory does affect this balance (and hence, the overall
cost) one way or another. For instance,

"we should purge them before we swap out any regular tmpfs pages"

but maybe we should also purge them before we swap out some non-tmpfs
pages or drop some file-backed pages?

-- Dmitry

2012-06-07 23:42:22

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 3/3] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers

On 06/07/2012 03:55 AM, Dmitry Adamushko wrote:
> but maybe we should also purge them before we swap out some non-tmpfs
> pages or drop some file-backed pages?

Sure... I guess we could kick that from either direct reclaim or from
kswapd. But, then we're basically back to the places where
shrink_slab() is called.

I think that means that we think it's preferable to integrate this more
directly in the VM instead of sticking it off in the corner of tmpfs
only, or pretending it's a slab.

Dunno... The slab shrinker one isn't looking _so_ bad at the moment.

2012-06-08 03:03:32

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH 3/3] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers

On 06/07/2012 04:41 PM, Dave Hansen wrote:
> On 06/07/2012 03:55 AM, Dmitry Adamushko wrote:
>> but maybe we should also purge them before we swap out some non-tmpfs
>> pages or drop some file-backed pages?
>
> Sure... I guess we could kick that from either direct reclaim or from
> kswapd. But, then we're basically back to the places where
> shrink_slab() is called.
>
> I think that means that we think it's preferable to integrate this more
> directly in the VM instead of sticking it off in the corner of tmpfs
> only, or pretending it's a slab.
>
> Dunno... The slab shrinker one isn't looking _so_ bad at the moment.

Dave also pointed out to me on irc that on a system without swap,
shmem_writepage doesn't even get called, which kills the utility of
triggering volatile purging from writepage.

So I'm falling back to using a shrinker for now, but I think Dmitry's
point is an interesting one, and am interested in finding a better place
to trigger purging volatile ranges from the mm code. If anyone has any
suggestions, let me know, otherwise I'll go back to trying to better
grok the mm code.

thanks
-john

2012-06-08 04:50:24

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 3/3] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers

(6/7/12 11:03 PM), John Stultz wrote:
> On 06/07/2012 04:41 PM, Dave Hansen wrote:
>> On 06/07/2012 03:55 AM, Dmitry Adamushko wrote:
>>> but maybe we should also purge them before we swap out some non-tmpfs
>>> pages or drop some file-backed pages?
>>
>> Sure... I guess we could kick that from either direct reclaim or from
>> kswapd. But, then we're basically back to the places where
>> shrink_slab() is called.
>>
>> I think that means that we think it's preferable to integrate this more
>> directly in the VM instead of sticking it off in the corner of tmpfs
>> only, or pretending it's a slab.
>>
>> Dunno... The slab shrinker one isn't looking _so_ bad at the moment.
>
> Dave also pointed out to me on irc that on a system without swap,
>shmem_writepage doesn't even get called, which kills the utility of
>triggering volatile purging from writepage.

Ah, right you are. A swapless system never tries to reclaim anon pages. So
volatile pages are no longer effectively swap-backed, and the swap-backed lru is no longer a
suitable place for them.


> So I'm falling back to using a shrinker for now, but I think Dmitry's
>point is an interesting one, and am interested in finding a better
>place to trigger purging volatile ranges from the mm code. If anyone has any
>suggestions, let me know, otherwise I'll go back to trying to better grok the mm code.

I hate VM features that abuse shrink_slab(), because it was not designed as a generic callback;
it was designed for shrinking filesystem metadata. The VM therefore keeps a balance between
page scanning and slab scanning, and a lot of shrink_slab() misuse may break that balancing
logic, i.e. drop too many icache/dcache entries and hurt performance.

As long as the code impact is small, I'd prefer to connect with the VM reclaim code directly.

2012-06-08 06:39:59

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 3/3] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers

(6/6/12 7:56 PM), John Stultz wrote:
> On 06/06/2012 12:52 PM, KOSAKI Motohiro wrote:
>>> The key point is we want volatile ranges to be purged in the order they
>>> were marked volatile.
>>> If we use the page lru via shmem_writeout to trigger range purging, we
>>> wouldn't necessarily get this desired behavior.
>> Ok, so can you please explain your ideal order to reclaim. your last mail
>> described old and new volatiled region. but I'm not sure regular tmpfs pages
>> vs volatile pages vs regular file cache order. That said, when using shrink_slab(),
>> we choose random order to drop against page cache. I'm not sure why you sure
>> it is ideal.
>
> So I'm not totally sure its ideal, but I can tell you what make sense to
> me. If there is a more ideal order, I'm open to suggestions.
>
> So volatile ranges should be purged first-in-first-out. So the first
> range marked volatile should be purged first. Since volatile ranges
> might have different costs depending on what filesystem the file is
> backed by, this LRU order is per-filesystem.
>
> It seems that if we have tmpfs volatile ranges, we should purge them
> before we swap out any regular tmpfs pages. Thus why I'm purging any
> available ranges on shmem_writepage before swapping, rather then using a
> shrinker now (I'm hoping you saw the updated patchset I sent out friday).
>
> Does that make sense?
>
>> And, now I guess you think nobody touch volatiled page, yes? because otherwise
>> volatile marking order is silly choice. If yes, what's happen if anyone touch
>> a patch which volatiled. no-op? SIGBUS?
>
> So more of a noop. If you read a page that has been marked volatile, it
> may return the data that was there, or it may return an empty nulled page.
>
> I guess we could throw a signal to help avoid developers making
> programming mistakes, but I'm not sure what the extra cost would be to
> set up and tare that down each time. One important aspect of this is
> that in order to make it attractive for an application to mark ranges as
> volatile, it has to be very cheap to mark and unmark ranges.

OK, I agree we don't need to pay any extra cost.

>> Which worklord didn't work. Usually, anon pages reclaim are only
>> happen when 1) tmpfs streaming io workload or 2) heavy vm pressure.
>> So, this scenario are not so inaccurate to me.
>
> So it was more of a theoretical issue in my discussions, but once it was
> brought up, ashmems' global range lru made more sense.

No. Every global LRU is evil. Please don't introduce NUMA-unaware code for
a new feature. That's a legacy approach with poor performance.


> I think the workload we're mostly concerned with here is heavy vm pressure.

I don't accept that. But note that under a heavy workload, shrink_slab() behaves
quite stupidly.



>>> That's when I added the LRU tracking at the volatile range level (which
>>> reverted back to the behavior ashmem has always used), and have been
>>> using that model sense.
>>>
>>> Hopefully this clarifies things. My apologies if I don't always use the
>>> correct terminology, as I'm still a newbie when it comes to VM code.
>> I think your code is enough clean. But I'm still not sure your background
>> design. Please help me to understand clearly.
> Hopefully the above helps. But let me know where you'd like more
> clarification.
>
>
>> btw, Why do you choice fallocate instead of fadvise? As far as I skimmed,
>> fallocate() is an operation of a disk layout, not of a cache. And, why
>> did you choice fadvise() instead of madvise() at initial version. vma
>> hint might be useful than fadvise() because it can be used for anonymous
>> pages too.
> I actually started with madvise, but quickly moved to fadvise when
> feeling that the fd based ranges made more sense. With ashmem, fds are
> often shared, and coordinating volatile ranges on a shared fd made more
> sense on a (fd, offset, len) tuple, rather then on an offset and length
> on an mmapped region.
>
> I moved to fallocate at Dave Chinner's request. In short, it allows
> non-tmpfs filesystems to implement volatile range semantics allowing
> them to zap rather then writeout dirty volatile pages. And since the
> volatile ranges are very similar to a delayed/cancel-able hole-punch, it
> made sense to use a similar interface to FALLOC_FL_HOLE_PUNCH.
>
> You can read the details of DaveC's suggestion here:
> https://lkml.org/lkml/2012/4/30/441

Hmmm...

I'm sorry, I can't imagine how to integrate FALLOCATE_VOLATILE into regular
filesystems. Do you have any idea?

2012-06-09 03:45:23

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH 3/3] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers

On 06/07/2012 09:50 PM, KOSAKI Motohiro wrote:
> (6/7/12 11:03 PM), John Stultz wrote:
>
>> So I'm falling back to using a shrinker for now, but I think Dmitry's
>> point is an interesting one, and am interested in finding a better
>> place to trigger purging volatile ranges from the mm code. If anyone
>> has any
>> suggestions, let me know, otherwise I'll go back to trying to better
>> grok the mm code.
>
> I hate vm feature to abuse shrink_slab(). because of, it was not
> designed generic callback.
> it was designed for shrinking filesystem metadata. Therefore, vm
> keeping a balance between
> page scanning and slab scanning. then, a lot of shrink_slab misuse may
> lead to break balancing
> logic. i.e. drop icache/dcache too many and makes perfomance impact.
>
> As far as a code impact is small, I'm prefer to connect w/ vm reclaim
> code directly.

I can see your concern about misusing the shrinker code. Also, your
other email's point about the problem of having LRU range-purging
behavior on a NUMA system makes some sense too. Unfortunately I'm not
yet familiar enough with the reclaim core to sort out how best to track
and connect the volatile range purging in the VM's reclaim core.

So for now, I've moved the code back to using the shrinker (along with
fixing a few bugs along the way).
Thus, currently we manage the ranges like so:
[per-fs volatile range lru head] -> [volatile range] -> [volatile range] -> [volatile range]
with the per-fs shrinker zapping the volatile ranges from the lru.
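
A rough sketch of that arrangement (the names only approximate the RFC's
mm/volatile.c, and volatile_range_purge() is a hypothetical helper):

    struct volatile_range {
        struct interval_tree_node node;     /* offsets within the mapping */
        struct list_head lru;               /* position on the per-fs LRU */
        unsigned int purged;
    };

    struct volatile_fs_head {
        struct mutex lock;
        struct list_head lru;               /* first-marked range at the head */
        s64 unpurged_page_count;
    };

    /* Shrinker path: purge ranges oldest-first until enough is freed. */
    static long purge_volatile_ranges(struct volatile_fs_head *head,
                                      long nr_to_scan)
    {
        long freed = 0;

        mutex_lock(&head->lock);
        while (nr_to_scan > 0 && !list_empty(&head->lru)) {
            struct volatile_range *range =
                list_first_entry(&head->lru, struct volatile_range, lru);
            long n = volatile_range_purge(range);   /* zap pages + unlink */

            freed += n;
            nr_to_scan -= n;
        }
        mutex_unlock(&head->lock);
        return freed;
    }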

I *think* ideally, the pages in a volatile range should be similar to
non-dirty file-backed pages. There is a cost to restore them, but
freeing them is very cheap. The trick is that volatile ranges
introduce a new relationship between pages. Since the neighboring
virtual pages in a volatile range are in effect tied together, purging
one effectively ruins the value of keeping the others, regardless of
which zone they are in physically.

So maybe the right approach is to give up the per-fs volatile range lru and
try a variant of what DaveC and DaveH have suggested: let the page-based
lru reclamation handle the selection on a physical-page basis, but
then zap the entirety of the neighboring range if any one page is
reclaimed. In order to try to preserve the range-based LRU behavior,
activate all the pages in the range together when the range is marked
volatile. Since we assume ranges are un-touched while volatile, that
should preserve LRU purging behavior on single-node systems, and on
multi-node systems it will approximate it fairly closely.

My main concern with this approach is that marking and unmarking volatile
ranges needs to be fast, so I'm worried about the additional overhead of
activating each of the contained pages on mark_volatile.

The other question I have with this approach is that, if we're on a system
that doesn't have swap, it *seems* (I'm not totally sure I understand it
yet) that the tmpfs file pages will be skipped over when we call
shrink_lruvec. So it seems we may need to add a new lru_list enum and
nr[] entry (maybe LRU_VOLATILE?). It may then be that when we mark
a range as volatile, instead of just activating it, we move it to the
volatile lru, and then when we shrink from that list, we call back to
the filesystem to trigger the entire range purging.
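
For illustration, the LRU_VOLATILE idea would roughly mean extending the
existing lru_list enum (purely a sketch, not from any posted patch):

    enum lru_list {
        LRU_INACTIVE_ANON,
        LRU_ACTIVE_ANON,
        LRU_INACTIVE_FILE,
        LRU_ACTIVE_FILE,
        LRU_VOLATILE,        /* pages in marked-volatile ranges */
        LRU_UNEVICTABLE,
        NR_LRU_LISTS
    };

with a matching nr[] slot in the scan-count logic, and a shrink path that, on
reclaiming a page from LRU_VOLATILE, calls back into the filesystem to purge
that page's whole range.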

Does that sound reasonable? Any other suggested approaches? I'll think
some more about it this weekend and try to get a patch scratched out
early next week.

thanks
-john

2012-06-10 06:35:22

by Dmitry Adamushko

[permalink] [raw]
Subject: Re: [PATCH 3/3] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers

>
> So maybe the right appraoch give up the per-fs volatile range lru, and try a
> varient of what DaveC and DaveH have suggested: Letting the page based lru
> reclamation handle the selection on a physical page basis, but then zapping
> the entirety of the neighboring range if any one page is reclaimed. In
> order to try to preserve the range based LRU behavior, activate all the
> pages in the range together when the range is marked volatile. Since we
> assume ranges are un-touched when volatile, that should preserve LRU purging
> behavior on single node systems and on multi-node systems it will
> approximate fairly closely.
>
> My main concern with this approach is marking and unmarking volatile ranges
> needs to be fast, so I'm worried about the additional overhead of activating
> each of the containing pages on mark_volatile.

(for my education) just to be sure that I got it right. So what you suggest is

(1) to 'deactivate-page' for all the pages in the range upon
mark_volatile. Hence, the pages from the same volatile range are
placed in clusters within their original LRU lists [a] and so

(1.1) the standard per-page reclaim mechanism is more likely to
discard them together;
(1.2) they are also (LRU-style) ordered wrt other volatile ranges (clusters)

[a] it's LRU_INACTIVE_FILE for tmpfs, right? also, the pages can be
from different zones (otoh, at least on x86 HIGH_MEM is likely).

or

(2) somehow remove all the pages from the standard LRU lists (or do
something else) to make sure that that the normal per-page reclaim
procedure can't see them. Then we introduce LRU_VOLATILE (where we
keep whole volatile ranges, not pages) and find the appropriate place
to process it in the reclaim code.

Also, I had another idea (it looks quite hacky though). For (1) above,
we don't necessarily need to touch all the pages... what we can do is
as follows:
- take the first page of the range (or even create a (hacky-hacky) virtual one);
- we need to mark it somehow as belonging to the volatile-reclaim
(modifying page->mapping ?);
- we place it at the beginning of the corresponding LRU_INACTIVE_*
list (hm, more complex if different zones);
the idea here, is that the standard per-page reclaim code should see
this page before seeing any other page from its range
- once the per-page reclaim code encounters such a page (heh, should
be a low cost check though) - we call into volatile-reclaim...

now, this volatile-reclaim can even purge another volatile range,
because by placing "the page at the beginning of the corresponding
LRU_INACTIVE_* list" we broke LRU-like behavior for volatile ranges.

>
> The other question I have with this approach is if we're on a system that
> doesn't have swap, it *seems* (not totally sure I understand it yet) the
> tmpfs file pages will be skipped over when we call shrink_lruvec. ?So it
> seems we may need to add a new lru_list enum and nr[] entry (maybe
> LRU_VOLATILE?). ? So then it may be that when we mark a range as volatile,
> instead of just activating it, we move it to the volatile lru, and then when
> we shrink from that list, we call back to the filesystem to trigger the
> entire range purging.
>

Kind of what I meant with (2) above?

[ I was in a bit of a hurry while writing this, so I apologize for
possible confusion... I can elaborate on it in more detail later on ]

Thanks,

-- Dmitry

2012-06-10 21:48:55

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 3/3] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers

On 06/08/2012 11:45 PM, John Stultz wrote:

> I *think* ideally, the pages in a volatile range should be similar to
> non-dirty file-backed pages. There is a cost to restore them, but
> freeing them is very cheap. The trick is that volatile ranges introduces

Easier to mark them dirty.

> a new relationship between pages. Since the neighboring virtual pages in
> a volatile range are in effect tied together, purging one effectively
> ruins the value of keeping the others, regardless of which zone they are
> physically.

Then the volatile ->writepage function can zap the whole
object.

--
All rights reversed

2012-06-11 18:36:12

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH 3/3] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers

On 06/10/2012 02:47 PM, Rik van Riel wrote:
> On 06/08/2012 11:45 PM, John Stultz wrote:
>
>> I *think* ideally, the pages in a volatile range should be similar to
>> non-dirty file-backed pages. There is a cost to restore them, but
>> freeing them is very cheap. The trick is that volatile ranges introduces
>
> Easier to mark them dirty.
>
>> a new relationship between pages. Since the neighboring virtual pages in
>> a volatile range are in effect tied together, purging one effectively
>> ruins the value of keeping the others, regardless of which zone they are
>> physically.
>
> Then the volatile ->writepage function can zap the whole
> object.
>

What about the concern that if we don't have swap, we'll not call
writepage on tmpfs files?

thanks
-john




2012-06-12 01:24:08

by john stultz

[permalink] [raw]
Subject: Re: [PATCH 3/3] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers

On 06/11/2012 11:35 AM, John Stultz wrote:
> On 06/10/2012 02:47 PM, Rik van Riel wrote:
>> On 06/08/2012 11:45 PM, John Stultz wrote:
>>
>>> I *think* ideally, the pages in a volatile range should be similar to
>>> non-dirty file-backed pages. There is a cost to restore them, but
>>> freeing them is very cheap. The trick is that volatile ranges
>>> introduces
>>
>> Easier to mark them dirty.
>>
>>> a new relationship between pages. Since the neighboring virtual
>>> pages in
>>> a volatile range are in effect tied together, purging one effectively
>>> ruins the value of keeping the others, regardless of which zone they
>>> are
>>> physically.
>>
>> Then the volatile ->writepage function can zap the whole
>> object.
>>
>
> What about the concern that if we don't have swap, we'll not call
> writepage on tmpfs files?

So actually, a more concrete question might be: What is the value of the
active/inactive split of anonymous memory on a system without swap?

Basically I'm looking at trying to allow the writepage function to zap
the range as you suggest, but also changing the behavior when there is
no swap so that all anonymous pages stay active, unless they are
volatile. Then, in both cases with swap and without, we would still
shrink the inactive list, call writepage and zap the volatile ranges.
It's just that without swap, the only anonymous pages on the inactive list
would be volatile ones.

Does that make any sense?

Hopefully I will have a hackish patch tomorrow to demonstrate what I'm
describing.

thanks
-john

2012-06-12 07:16:53

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 3/3] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers

Please, Cced linux-mm.

On 06/09/2012 12:45 PM, John Stultz wrote:

> On 06/07/2012 09:50 PM, KOSAKI Motohiro wrote:
>> (6/7/12 11:03 PM), John Stultz wrote:
>>
>>> So I'm falling back to using a shrinker for now, but I think Dmitry's
>>> point is an interesting one, and am interested in finding a better
>>> place to trigger purging volatile ranges from the mm code. If anyone
>>> has any
>>> suggestions, let me know, otherwise I'll go back to trying to better
>>> grok the mm code.
>>
>> I hate vm feature to abuse shrink_slab(). because of, it was not
>> designed generic callback.
>> it was designed for shrinking filesystem metadata. Therefore, vm
>> keeping a balance between
>> page scanning and slab scanning. then, a lot of shrink_slab misuse may
>> lead to break balancing
>> logic. i.e. drop icache/dcache too many and makes perfomance impact.
>>
>> As far as a code impact is small, I'm prefer to connect w/ vm reclaim
>> code directly.
>
> I can see your concern about mis-using the shrinker code. Also your
> other email's point about the problem of having LRU range purging
> behavior on a NUMA system makes some sense too. Unfortunately I'm not
> yet familiar enough with the reclaim core to sort out how to best track
> and connect the volatile range purging in the vm's reclaim core yet.
>
> So for now, I've moved the code back to using the shrinker (along with
> fixing a few bugs along the way).
> Thus, currently we manage the ranges as so:
> [per fs volatile range lru head] -> [volatile range] -> [volatile
> range] -> [volatile range]
> With the per-fs shrinker zaping the volatile ranges from the lru.
>
> I *think* ideally, the pages in a volatile range should be similar to
> non-dirty file-backed pages. There is a cost to restore them, but
> freeing them is very cheap. The trick is that volatile ranges
> introduces a new relationship between pages. Since the neighboring
> virtual pages in a volatile range are in effect tied together, purging
> one effectively ruins the value of keeping the others, regardless of
> which zone they are physically.
>
> So maybe the right appraoch give up the per-fs volatile range lru, and
> try a varient of what DaveC and DaveH have suggested: Letting the page
> based lru reclamation handle the selection on a physical page basis, but
> then zapping the entirety of the neighboring range if any one page is
> reclaimed. In order to try to preserve the range based LRU behavior,
> activate all the pages in the range together when the range is marked


You mean deactivation for fast reclaiming, not activation when memory pressure happens?

> volatile. Since we assume ranges are un-touched when volatile, that
> should preserve LRU purging behavior on single node systems and on
> multi-node systems it will approximate fairly closely.
>
> My main concern with this approach is marking and unmarking volatile
> ranges needs to be fast, so I'm worried about the additional overhead of
> activating each of the containing pages on mark_volatile.


Yes. It could be a problem if the range is very large and already populated.
Why can't we make new hooks?

Just concept for showing my intention..

+int shrink_volatile_pages(struct zone *zone)
+{
+ int ret = 0;
+ if (zone_page_state(zone, NR_ZONE_VOLATILE))
+ ret = shmem_purge_one_volatile_range();
+ return ret;
+}
+
static void shrink_zone(struct zone *zone, struct scan_control *sc)
{
struct mem_cgroup *root = sc->target_mem_cgroup;
@@ -1827,6 +1835,18 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
.priority = sc->priority,
};
struct mem_cgroup *memcg;
+ int ret;
+
+ /*
+ * Before we dive into trouble maker, let's look at easy-
+ * reclaimable pages and avoid costly-reclaim if possible.
+ */
+ do {
+ ret = shrink_volatile_pages();
+ if (ret)
+ zone_watermark_ok(zone, sc->order, xxx);
+ return;
+ } while(ret)
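
(The hunk above is just a concept; the intended control flow, keeping the
hypothetical shmem_purge_one_volatile_range()/NR_ZONE_VOLATILE hooks and an
illustrative watermark choice, is roughly:)

    do {
        ret = shrink_volatile_pages(zone);
        /* stop early if purging volatile ranges was enough */
        if (ret && zone_watermark_ok(zone, sc->order,
                                     low_wmark_pages(zone), 0, 0))
            return;
    } while (ret);

    /* ... otherwise fall through to the normal shrink_lruvec() scan ... */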

Off-topic:

I want to drive low memory notification by level-triggering instead of by a raw vmstat trigger.
(It's a rather long thread: https://lkml.org/lkml/2012/5/1/97)

level 1: out of easy-reclaimable pages (NR_VOLATILE + NR_UNMAPPED_CLEAN_PAGE)
level 2 (more severe VM pressure than level 1): level 1 + reclaimable dirty pages

When the system is out of easy-reclaimable pages, that might be a good indication for
a low memory notification.


>
> The other question I have with this approach is if we're on a system
> that doesn't have swap, it *seems* (not totally sure I understand it
> yet) the tmpfs file pages will be skipped over when we call
> shrink_lruvec. So it seems we may need to add a new lru_list enum and
> nr[] entry (maybe LRU_VOLATILE?). So then it may be that when we mark
> a range as volatile, instead of just activating it, we move it to the
> volatile lru, and then when we shrink from that list, we call back to
> the filesystem to trigger the entire range purging.


Adding a new LRU might make fallocate(VOLATILE) very slow, so I hope we can avoid that if possible.

Off-topic:
But I'm not sure, because I might try to make a new easy-reclaimable LRU list for low memory notification.
That LRU list would contain non-mapped clean cache pages and volatile pages, if I decide to add it.
Both kinds of page share a common characteristic: recreating the page is less costly.
That's true for eMMC/SSD-like devices, at least.

>
> Does that sound reasonable? Any other suggested approaches? I'll think
> some more about it this weekend and try to get a patch scratched out
> early next week.
>
> thanks
> -john
>
>
>
>
>
>
>
>
>
>
>
>
>



--
Kind regards,
Minchan Kim

2012-06-12 16:03:29

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 3/3] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers

> Off-topic:
> But I'm not sure because I might try to make new easy-reclaimable LRU list for low memory notification.
> That LRU list would contain non-mapped clean cache page and volatile pages if I decide adding it.
> Both pages has a common characteristic that recreating page is less costly.
> It's true for eMMC/SSD like device, at least.

+1.

I like L2 inactive list.

2012-06-12 19:36:14

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH 3/3] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers

On 06/12/2012 12:16 AM, Minchan Kim wrote:
> Please, Cced linux-mm.
>
> On 06/09/2012 12:45 PM, John Stultz wrote:
>
>> On 06/07/2012 09:50 PM, KOSAKI Motohiro wrote:
>>> (6/7/12 11:03 PM), John Stultz wrote:
>>>
>>>> So I'm falling back to using a shrinker for now, but I think Dmitry's
>>>> point is an interesting one, and am interested in finding a better
>>>> place to trigger purging volatile ranges from the mm code. If anyone
>>>> has any
>>>> suggestions, let me know, otherwise I'll go back to trying to better
>>>> grok the mm code.
>>> I hate vm feature to abuse shrink_slab(). because of, it was not
>>> designed generic callback.
>>> it was designed for shrinking filesystem metadata. Therefore, vm
>>> keeping a balance between
>>> page scanning and slab scanning. then, a lot of shrink_slab misuse may
>>> lead to break balancing
>>> logic. i.e. drop icache/dcache too many and makes perfomance impact.
>>>
>>> As far as a code impact is small, I'm prefer to connect w/ vm reclaim
>>> code directly.
>> I can see your concern about mis-using the shrinker code. Also your
>> other email's point about the problem of having LRU range purging
>> behavior on a NUMA system makes some sense too. Unfortunately I'm not
>> yet familiar enough with the reclaim core to sort out how to best track
>> and connect the volatile range purging in the vm's reclaim core yet.
>>
>> So for now, I've moved the code back to using the shrinker (along with
>> fixing a few bugs along the way).
>> Thus, currently we manage the ranges as so:
>> [per fs volatile range lru head] -> [volatile range] -> [volatile
>> range] -> [volatile range]
>> With the per-fs shrinker zaping the volatile ranges from the lru.
>>
>> I *think* ideally, the pages in a volatile range should be similar to
>> non-dirty file-backed pages. There is a cost to restore them, but
>> freeing them is very cheap. The trick is that volatile ranges
>> introduces a new relationship between pages. Since the neighboring
>> virtual pages in a volatile range are in effect tied together, purging
>> one effectively ruins the value of keeping the others, regardless of
>> which zone they are physically.
>>
>> So maybe the right appraoch give up the per-fs volatile range lru, and
>> try a varient of what DaveC and DaveH have suggested: Letting the page
>> based lru reclamation handle the selection on a physical page basis, but
>> then zapping the entirety of the neighboring range if any one page is
>> reclaimed. In order to try to preserve the range based LRU behavior,
>> activate all the pages in the range together when the range is marked
>
> You mean deactivation for fast reclaiming, not activation when memory pressure happen?
Yes. Sorry for mixing up terms here. The point is moving all the pages together to the inactive list to preserve relative LRU behavior for purging ranges.



>> volatile. Since we assume ranges are un-touched when volatile, that
>> should preserve LRU purging behavior on single node systems and on
>> multi-node systems it will approximate fairly closely.
>>
>> My main concern with this approach is marking and unmarking volatile
>> ranges needs to be fast, so I'm worried about the additional overhead of
>> activating each of the containing pages on mark_volatile.
>
> Yes. it could be a problem if range is very large and populated already.
> Why can't we make new hooks?
>
> Just concept for showing my intention..
>
> +int shrink_volatile_pages(struct zone *zone)
> +{
> + int ret = 0;
> + if (zone_page_state(zone, NR_ZONE_VOLATILE))
> + ret = shmem_purge_one_volatile_range();
> + return ret;
> +}
> +
> static void shrink_zone(struct zone *zone, struct scan_control *sc)
> {
> struct mem_cgroup *root = sc->target_mem_cgroup;
> @@ -1827,6 +1835,18 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
> .priority = sc->priority,
> };
> struct mem_cgroup *memcg;
> + int ret;
> +
> + /*
> + * Before we dive into trouble maker, let's look at easy-
> + * reclaimable pages and avoid costly-reclaim if possible.
> + */
> + do {
> + ret = shrink_volatile_pages();
> + if (ret)
> + zone_watermark_ok(zone, sc->order, xxx);
> + return;
> + } while(ret)

Hmm. I'm confused.
This doesn't seem that different from the shrinker approach.
How does this resolve the numa-unawareness issue that Kosaki-san brought up?


>> The other question I have with this approach is if we're on a system
>> that doesn't have swap, it *seems* (not totally sure I understand it
>> yet) the tmpfs file pages will be skipped over when we call
>> shrink_lruvec. So it seems we may need to add a new lru_list enum and
>> nr[] entry (maybe LRU_VOLATILE?). So then it may be that when we mark
>> a range as volatile, instead of just activating it, we move it to the
>> volatile lru, and then when we shrink from that list, we call back to
>> the filesystem to trigger the entire range purging.
> Adding new LRU idea might make very slow fallocate(VOLATILE) so I hope we can avoid that if possible.

Indeed. This is a major concern. I'm currently prototyping it out so I
have a concrete sense of the performance cost.

thanks
-john

2012-06-13 00:10:27

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 3/3] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers

On 06/13/2012 04:35 AM, John Stultz wrote:

> On 06/12/2012 12:16 AM, Minchan Kim wrote:
>> Please, Cced linux-mm.
>>
>> On 06/09/2012 12:45 PM, John Stultz wrote:
>>
>>> On 06/07/2012 09:50 PM, KOSAKI Motohiro wrote:
>>>> (6/7/12 11:03 PM), John Stultz wrote:
>>>>
>>>>> So I'm falling back to using a shrinker for now, but I think Dmitry's
>>>>> point is an interesting one, and am interested in finding a better
>>>>> place to trigger purging volatile ranges from the mm code. If anyone
>>>>> has any
>>>>> suggestions, let me know, otherwise I'll go back to trying to better
>>>>> grok the mm code.
>>>> I hate vm feature to abuse shrink_slab(). because of, it was not
>>>> designed generic callback.
>>>> it was designed for shrinking filesystem metadata. Therefore, vm
>>>> keeping a balance between
>>>> page scanning and slab scanning. then, a lot of shrink_slab misuse may
>>>> lead to break balancing
>>>> logic. i.e. drop icache/dcache too many and makes perfomance impact.
>>>>
>>>> As far as a code impact is small, I'm prefer to connect w/ vm reclaim
>>>> code directly.
>>> I can see your concern about mis-using the shrinker code. Also your
>>> other email's point about the problem of having LRU range purging
>>> behavior on a NUMA system makes some sense too. Unfortunately I'm not
>>> yet familiar enough with the reclaim core to sort out how to best track
>>> and connect the volatile range purging in the vm's reclaim core yet.
>>>
>>> So for now, I've moved the code back to using the shrinker (along with
>>> fixing a few bugs along the way).
>>> Thus, currently we manage the ranges as so:
>>> [per fs volatile range lru head] -> [volatile range] -> [volatile
>>> range] -> [volatile range]
>>> With the per-fs shrinker zaping the volatile ranges from the lru.
>>>
>>> I *think* ideally, the pages in a volatile range should be similar to
>>> non-dirty file-backed pages. There is a cost to restore them, but
>>> freeing them is very cheap. The trick is that volatile ranges
>>> introduces a new relationship between pages. Since the neighboring
>>> virtual pages in a volatile range are in effect tied together, purging
>>> one effectively ruins the value of keeping the others, regardless of
>>> which zone they are physically.
>>>
>>> So maybe the right appraoch give up the per-fs volatile range lru, and
>>> try a varient of what DaveC and DaveH have suggested: Letting the page
>>> based lru reclamation handle the selection on a physical page basis, but
>>> then zapping the entirety of the neighboring range if any one page is
>>> reclaimed. In order to try to preserve the range based LRU behavior,
>>> activate all the pages in the range together when the range is marked
>>
>> You mean deactivation for fast reclaiming, not activation when memory
>> pressure happen?
> Yes. Sorry for mixing up terms here. The point is moving all the pages
> together to the inactive list to preserve relative LRU behavior for
> purging ranges.


No problem :)

>
>
>
>>> volatile. Since we assume ranges are un-touched when volatile, that
>>> should preserve LRU purging behavior on single node systems and on
>>> multi-node systems it will approximate fairly closely.
>>>
>>> My main concern with this approach is marking and unmarking volatile
>>> ranges needs to be fast, so I'm worried about the additional overhead of
>>> activating each of the containing pages on mark_volatile.
>>
>> Yes. it could be a problem if range is very large and populated already.
>> Why can't we make new hooks?
>>
>> Just concept for showing my intention..
>>
>> +int shrink_volatile_pages(struct zone *zone)
>> +{
>> + int ret = 0;
>> + if (zone_page_state(zone, NR_ZONE_VOLATILE))
>> + ret = shmem_purge_one_volatile_range();
>> + return ret;
>> +}
>> +
>> static void shrink_zone(struct zone *zone, struct scan_control *sc)
>> {
>> struct mem_cgroup *root = sc->target_mem_cgroup;
>> @@ -1827,6 +1835,18 @@ static void shrink_zone(struct zone *zone,
>> struct scan_control *sc)
>> .priority = sc->priority,
>> };
>> struct mem_cgroup *memcg;
>> + int ret;
>> +
>> + /*
>> + * Before we dive into trouble maker, let's look at easy-
>> + * reclaimable pages and avoid costly-reclaim if possible.
>> + */
>> + do {
>> + ret = shrink_volatile_pages();
>> + if (ret)
>> + zone_watermark_ok(zone, sc->order, xxx);
>> + return;
>> + } while(ret)
>
> Hmm. I'm confused.
> This doesn't seem that different from the shrinker approach.


The shrinker is called after shrink_list, which means normal pages can be reclaimed
before we reclaim volatile pages. We shouldn't do that.


> How does this resolve the numa-unawareness issue that Kosaki-san brought
> up?


Basically, I think your shrink function should be smarter.

When fallocate is called, we can get the mempolicy from shmem_inode_info and pass it to
the volatile_range so that the volatile_range can keep the NUMA information.

When shmem_purge_one_volatile_range is called, it receives zone information,
so shmem_purge_one_volatile_range should find a range that matches the NUMA policy and
the passed zone.

Assumption:
A range may include pages from the same node/zone if possible.

I am not familiar with the NUMA handling code, so KOSAKI/Rik can point out if I am wrong.
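
A tiny sketch of what that zone-aware selection could look like (all names
here are hypothetical, not from the posted patches):

    /* Record a preferred node when the range is marked volatile, and have
     * the zone-driven purge pick the oldest range matching that node.
     */
    struct volatile_range {
        struct list_head lru;     /* position on the per-fs LRU */
        pgoff_t start, end;
        int nid;                  /* node the range's pages mostly live on */
    };

    static struct volatile_range *pick_range_for_zone(struct list_head *lru_head,
                                                      struct zone *zone)
    {
        struct volatile_range *range;

        list_for_each_entry(range, lru_head, lru)
            if (range->nid == zone_to_nid(zone))
                return range;     /* oldest range on this node */
        return NULL;
    }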

>
>
>>> The other question I have with this approach is if we're on a system
>>> that doesn't have swap, it *seems* (not totally sure I understand it
>>> yet) the tmpfs file pages will be skipped over when we call
>>> shrink_lruvec. So it seems we may need to add a new lru_list enum and
>>> nr[] entry (maybe LRU_VOLATILE?). So then it may be that when we mark
>>> a range as volatile, instead of just activating it, we move it to the
>>> volatile lru, and then when we shrink from that list, we call back to
>>> the filesystem to trigger the entire range purging.
>> Adding new LRU idea might make very slow fallocate(VOLATILE) so I hope
>> we can avoid that if possible.
>
> Indeed. This is a major concern. I'm currently prototyping it out so I
> have a concrete sense of the performance cost.


If the performance loss isn't big, that would be an approach!

>
> thanks
> -john
>



--
Kind regards,
Minchan Kim

2012-06-13 01:21:43

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH 3/3] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers

On 06/12/2012 05:10 PM, Minchan Kim wrote:
> On 06/13/2012 04:35 AM, John Stultz wrote:
>
>> On 06/12/2012 12:16 AM, Minchan Kim wrote:
>>> Please, Cced linux-mm.
>>>
>>> On 06/09/2012 12:45 PM, John Stultz wrote:
>>>
>>>
>>>> volatile. Since we assume ranges are un-touched when volatile, that
>>>> should preserve LRU purging behavior on single node systems and on
>>>> multi-node systems it will approximate fairly closely.
>>>>
>>>> My main concern with this approach is marking and unmarking volatile
>>>> ranges needs to be fast, so I'm worried about the additional overhead of
>>>> activating each of the containing pages on mark_volatile.
>>> Yes. it could be a problem if range is very large and populated already.
>>> Why can't we make new hooks?
>>>
>>> Just concept for showing my intention..
>>>
>>> +int shrink_volatile_pages(struct zone *zone)
>>> +{
>>> + int ret = 0;
>>> + if (zone_page_state(zone, NR_ZONE_VOLATILE))
>>> + ret = shmem_purge_one_volatile_range();
>>> + return ret;
>>> +}
>>> +
>>> static void shrink_zone(struct zone *zone, struct scan_control *sc)
>>> {
>>> struct mem_cgroup *root = sc->target_mem_cgroup;
>>> @@ -1827,6 +1835,18 @@ static void shrink_zone(struct zone *zone,
>>> struct scan_control *sc)
>>> .priority = sc->priority,
>>> };
>>> struct mem_cgroup *memcg;
>>> + int ret;
>>> +
>>> + /*
>>> + * Before we dive into trouble maker, let's look at easy-
>>> + * reclaimable pages and avoid costly-reclaim if possible.
>>> + */
>>> + do {
>>> + ret = shrink_volatile_pages();
>>> + if (ret)
>>> + zone_watermark_ok(zone, sc->order, xxx);
>>> + return;
>>> + } while(ret)
>> Hmm. I'm confused.
>> This doesn't seem that different from the shrinker approach.
>
> Shrinker is called after shrink_list so it means normal pages can be reclaimed
> before we reclaim volatile pages. We shouldn't do that.


Ah. Ok. Maybe that's a reasonable compromise between the shrinker
approach and the more complex approach I just posted to lkml?
(Forgive me for forgetting to CC you and linux-mm with my latest post!)

>> How does this resolve the numa-unawareness issue that Kosaki-san brought
>> up?
> Basically, I think your shrink function should be more smart.
>
> when fallocate is called, we can get mem_policy from shmem_inode_info and pass it to
> volatile_range so that volatile_range can keep the information of NUMA.
Hrm.. That sounds reasonable. I'll look into the mem_policy bits and try
to learn more.

> When shmem_purge_one_volatile_range is called, it receives zone information.
> So shmem_purge_one_volatile_range should find a range matched with NUMA policy and
> passed zone.
>
> Assumption:
> A range may include same node/zone pages if possible.
>
> I am not familiar with NUMA handling code so KOSAKI/Rik can point out if I am wrong.
Right, the range may cross nodes/zones, but maybe that's not a huge deal?
The only bit I'd worry about is the lru scanning being non-constant as
we search for a range that matches the node we want to free from. I
guess we could have per-node/zone lrus.


>>>> The other question I have with this approach is if we're on a system
>>>> that doesn't have swap, it *seems* (not totally sure I understand it
>>>> yet) the tmpfs file pages will be skipped over when we call
>>>> shrink_lruvec. So it seems we may need to add a new lru_list enum and
>>>> nr[] entry (maybe LRU_VOLATILE?). So then it may be that when we mark
>>>> a range as volatile, instead of just activating it, we move it to the
>>>> volatile lru, and then when we shrink from that list, we call back to
>>>> the filesystem to trigger the entire range purging.
>>> Adding new LRU idea might make very slow fallocate(VOLATILE) so I hope
>>> we can avoid that if possible.
>> Indeed. This is a major concern. I'm currently prototyping it out so I
>> have a concrete sense of the performance cost.
> If performance loss isn't big, that would be a approach!
I've not had a chance yet to measure it, as I wanted to get my very
rough patches out for discussion first. But if folks don't nack it
outright I'll be providing some data there. The hard part is that range
creation would have a linear cost with the number of pages in the range,
which at some point will be a pain.

Thanks again for your input!
-john

2012-06-13 04:42:22

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 3/3] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers

On 06/13/2012 10:21 AM, John Stultz wrote:

> On 06/12/2012 05:10 PM, Minchan Kim wrote:
>> On 06/13/2012 04:35 AM, John Stultz wrote:
>>
>>> On 06/12/2012 12:16 AM, Minchan Kim wrote:
>>>> Please, Cced linux-mm.
>>>>
>>>> On 06/09/2012 12:45 PM, John Stultz wrote:
>>>>
>>>>
>>>>> volatile. Since we assume ranges are un-touched when volatile, that
>>>>> should preserve LRU purging behavior on single node systems and on
>>>>> multi-node systems it will approximate fairly closely.
>>>>>
>>>>> My main concern with this approach is marking and unmarking volatile
>>>>> ranges needs to be fast, so I'm worried about the additional
>>>>> overhead of
>>>>> activating each of the containing pages on mark_volatile.
>>>> Yes. it could be a problem if range is very large and populated
>>>> already.
>>>> Why can't we make new hooks?
>>>>
>>>> Just concept for showing my intention..
>>>>
>>>> +int shrink_volatile_pages(struct zone *zone)
>>>> +{
>>>> + int ret = 0;
>>>> + if (zone_page_state(zone, NR_ZONE_VOLATILE))
>>>> + ret = shmem_purge_one_volatile_range();
>>>> + return ret;
>>>> +}
>>>> +
>>>> static void shrink_zone(struct zone *zone, struct scan_control *sc)
>>>> {
>>>> struct mem_cgroup *root = sc->target_mem_cgroup;
>>>> @@ -1827,6 +1835,18 @@ static void shrink_zone(struct zone *zone,
>>>> struct scan_control *sc)
>>>> .priority = sc->priority,
>>>> };
>>>> struct mem_cgroup *memcg;
>>>> + int ret;
>>>> +
>>>> + /*
>>>> + * Before we dive into trouble maker, let's look at easy-
>>>> + * reclaimable pages and avoid costly-reclaim if possible.
>>>> + */
>>>> + do {
>>>> + ret = shrink_volatile_pages();
>>>> + if (ret)
>>>> + zone_watermark_ok(zone, sc->order, xxx);
>>>> + return;
>>>> + } while(ret)
>>> Hmm. I'm confused.
>>> This doesn't seem that different from the shrinker approach.
>>
>> Shrinker is called after shrink_list so it means normal pages can be
>> reclaimed
>> before we reclaim volatile pages. We shouldn't do that.
>
>
> Ah. Ok. Maybe that's a reasonable compromise between the shrinker
> approach and the more complex approach I just posted to lkml?
> (Forgive me for forgetting to CC you and linux-mm with my latest post!)


NP.

>
>>> How does this resolve the numa-unawareness issue that Kosaki-san brought
>>> up?
>> Basically, I think your shrink function should be more smart.
>>
>> when fallocate is called, we can get mem_policy from shmem_inode_info
>> and pass it to
>> volatile_range so that volatile_range can keep the information of NUMA.
> Hrm.. That sounds reasonable. I'll look into the mem_policy bits and try
> to learn more.
>
>> When shmem_purge_one_volatile_range is called, it receives zone
>> information.
>> So shmem_purge_one_volatile_range should find a range matched with
>> NUMA policy and
>> passed zone.
>>
>> Assumption:
>> A range may include same node/zone pages if possible.
>>
>> I am not familiar with NUMA handling code so KOSAKI/Rik can point out
>> if I am wrong.
> Right, the range may cross nodes/zones but maybe that's not a huge deal?
> The only bit I'd worry about is the lru scanning being non-constant as
> we searched for a range that matched the node we want to free from. I
> guess we could have per-node/zone lrus.


Good.

>
>
>>>>> The other question I have with this approach is if we're on a system
>>>>> that doesn't have swap, it *seems* (not totally sure I understand it
>>>>> yet) the tmpfs file pages will be skipped over when we call
>>>>> shrink_lruvec. So it seems we may need to add a new lru_list enum and
>>>>> nr[] entry (maybe LRU_VOLATILE?). So then it may be that when we
>>>>> mark
>>>>> a range as volatile, instead of just activating it, we move it to the
>>>>> volatile lru, and then when we shrink from that list, we call back to
>>>>> the filesystem to trigger the entire range purging.
>>>> Adding new LRU idea might make very slow fallocate(VOLATILE) so I hope
>>>> we can avoid that if possible.
>>> Indeed. This is a major concern. I'm currently prototyping it out so I
>>> have a concrete sense of the performance cost.
>> If performance loss isn't big, that would be a approach!
> I've not had a chance yet to measure it, as I wanted to get my very
> rough patches out for discussion first. But if folks don't nack it
> outright I'll be providing some data there. The hard part is that range
> creation would have a linear cost with the number of pages in the range,
> which at some point will be a pain.


That's right. So IMHO, my suggestion could be a solution.
I looked through your new patchset [5/6]. I understand your intention, but the code still has problems;
I haven't commented on them yet. Before a detailed review, I would like to hear opinions from others,
and I am curious whether you will decide to change the approach or not.
It can save our precious time. :)

>
> Thanks again for your input!
> -john


Thanks for your effort!

--
Kind regards,
Minchan Kim