2012-06-27 04:17:49

by John Stultz

Subject: [PATCH 0/5][RFC] Fallocate Volatile Ranges v5

After sending out my last iteration, I got very little feedback, so
I wanted to try sending this out again for comment.

This patchset has two parts:

The first three patches add the generic volatile range management
code, add tmpfs support for FALLOC_FL_MARK_VOLATILE (which uses a
shrinker to purge ranges), and convert ashmem to use
FALLOC_FL_MARK_VOLATILE, cutting the driver roughly in half.

Kosaki-san objected to using the shrinker, since it is not numa aware
and is only called after we shrink the normal lru lists. So the second
half of this patch set provides a different method that is not
shrinker based: we deactivate the pages in the volatile range, and
then when writepage is called on a volatile page, we purge the entire
range. Due to my unfamiliarity with the details of the VM, this second
part is less likely to be 100% correct or ideal, but it seems to work
properly in my testing.

Since last sending this out, I was able to do some further performance
analysis of the extra cost of deactivating all of the pages in the
range when marking a range volatile, and the overhead is significant.
Hopefully folks have some suggestions on how to reduce it.

Given that the shrinker approach is much faster, I've also been
looking at other alternatives. One idea, which I believe Minchan
suggested (although I may have misunderstood), was to provide a
separate list for purging volatile ranges before we do the lru
shrinking. So I've considered adding an "early_shrinker" list,
against which callbacks can be registered; these shrinkers would be
called prior to lru shrinking. They would still be numa-unaware,
but would allow volatile ranges to be zapped before we swap anything
out. I realize this isn't what Minchan recently proposed (basically
adding a new per-zone ERECLAIM LRU list), but my worry with the
ERECLAIM list idea is that we still have to touch the pages
individually when marking them volatile. If folks are curious about
this approach, I can post it as well, but I wanted to get further
review on my current approach before jumping off onto another
tangent.
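
To make that idea concrete, here's a purely hypothetical sketch of
what the registration side of such an "early_shrinker" list might
look like (none of these names exist in the tree or in this series,
and the locking is hand-waved):

#include <linux/list.h>
#include <linux/mutex.h>

/* Hypothetical sketch only -- not part of this series or of mainline */
struct early_shrinker {
	/* returns the number of pages purged; called before lru shrinking */
	long (*purge)(struct early_shrinker *es, unsigned long nr_to_purge);
	struct list_head list;
};

static LIST_HEAD(early_shrinker_list);
static DEFINE_MUTEX(early_shrinker_lock);

void register_early_shrinker(struct early_shrinker *es)
{
	mutex_lock(&early_shrinker_lock);
	list_add_tail(&es->list, &early_shrinker_list);
	mutex_unlock(&early_shrinker_lock);
}

/*
 * Would be called from the reclaim path before the lru lists are
 * shrunk.  A real implementation would need RCU or similar so this
 * can run in contexts that cannot take the registration mutex.
 */
static unsigned long run_early_shrinkers(unsigned long nr_to_purge)
{
	struct early_shrinker *es;
	unsigned long purged = 0;

	list_for_each_entry(es, &early_shrinker_list, list) {
		purged += es->purge(es, nr_to_purge - purged);
		if (purged >= nr_to_purge)
			break;
	}
	return purged;
}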


What's new in this iteration:
* At Michel Lespinasse's and Dmitry Adamushko's earlier suggestion,
I dropped the generic interval tree implementation in favor of the
prio_tree code, which seems to work fine for my needs.
* I added some pagevec batching in the activating/deactivating
paths, which helped improve the performance of the non-shrinker
method, but there's still a ways to go.
* Minor cleanups.


Thanks again for the feedback so far!

thanks
-john

(Also, given the number of revisions this patchset and its previous
attempts have been through, the CC list is getting crazy long. So
feel free to ping me privately and I'll drop you if you really don't
want to be flooded every week or so with these patches as I iterate.)

CC: Andrew Morton <[email protected]>
CC: Android Kernel Team <[email protected]>
CC: Robert Love <[email protected]>
CC: Mel Gorman <[email protected]>
CC: Hugh Dickins <[email protected]>
CC: Dave Hansen <[email protected]>
CC: Rik van Riel <[email protected]>
CC: Dmitry Adamushko <[email protected]>
CC: Dave Chinner <[email protected]>
CC: Neil Brown <[email protected]>
CC: Andrea Righi <[email protected]>
CC: Aneesh Kumar K.V <[email protected]>
CC: Taras Glek <[email protected]>
CC: Mike Hommey <[email protected]>
CC: Jan Kara <[email protected]>
CC: KOSAKI Motohiro <[email protected]>
CC: Michel Lespinasse <[email protected]>
CC: Minchan Kim <[email protected]>
CC: [email protected] <[email protected]>



John Stultz (5):
[RFC] Add volatile range management code
[RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers
[RFC] ashmem: Convert ashmem to use volatile ranges
[RFC][HACK] tmpfs: Purge volatile ranges on writepage instead of
using shrinker
[RFC][HACK] mm: Change memory management of anonymous pages on
swapless systems

drivers/staging/android/ashmem.c | 331 +--------------------------
fs/open.c | 3 +-
include/linux/falloc.h | 7 +-
include/linux/pagevec.h | 5 +-
include/linux/swap.h | 23 +-
include/linux/volatile.h | 40 ++++
mm/Makefile | 2 +-
mm/shmem.c | 123 +++++++++-
mm/swap.c | 13 +-
mm/vmscan.c | 9 -
mm/volatile.c | 467 ++++++++++++++++++++++++++++++++++++++
11 files changed, 673 insertions(+), 350 deletions(-)
create mode 100644 include/linux/volatile.h
create mode 100644 mm/volatile.c

--
1.7.9.5


2012-06-27 04:17:51

by John Stultz

Subject: [PATCH 3/5] [RFC] ashmem: Convert ashmem to use volatile ranges

This is a rework of my first-pass attempt at getting ashmem to
utilize the volatile range code, now using the fallocate interface.

In this implementation GET_PIN_STATUS is unimplemented, since adding
an ISVOLATILE check wasn't considered terribly useful in earlier
reviews. It would be trivial to re-add that functionality, but I
wanted to check with the Android developers to see how often
GET_PIN_STATUS is actually used.

Similarly, the ashmem PURGE_ALL_CACHES ioctl no longer functions, as
volatile range purging is no longer directly under its control.
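
For reference, nothing changes for ashmem clients; below is a minimal
sketch of the userspace pin/unpin calls that this patch now routes to
do_fallocate() inside the driver (the header path and return-value
comments are my reading of the code, so treat them as assumptions):

#include <sys/ioctl.h>
#include <linux/ashmem.h>	/* lives under drivers/staging/android/ in-tree */

/* Unpin: the region becomes reclaimable (now FALLOC_FL_MARK_VOLATILE) */
static int region_unpin(int fd, __u32 offset, __u32 len)
{
	struct ashmem_pin pin = { .offset = offset, .len = len };

	return ioctl(fd, ASHMEM_UNPIN, &pin);
}

/*
 * Pin: take the region back (now FALLOC_FL_UNMARK_VOLATILE).  A
 * positive return means the contents were purged while unpinned and
 * the client must regenerate them.
 */
static int region_pin(int fd, __u32 offset, __u32 len)
{
	struct ashmem_pin pin = { .offset = offset, .len = len };

	return ioctl(fd, ASHMEM_PIN, &pin);
}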

CC: Andrew Morton <[email protected]>
CC: Android Kernel Team <[email protected]>
CC: Robert Love <[email protected]>
CC: Mel Gorman <[email protected]>
CC: Hugh Dickins <[email protected]>
CC: Dave Hansen <[email protected]>
CC: Rik van Riel <[email protected]>
CC: Dmitry Adamushko <[email protected]>
CC: Dave Chinner <[email protected]>
CC: Neil Brown <[email protected]>
CC: Andrea Righi <[email protected]>
CC: Aneesh Kumar K.V <[email protected]>
CC: Taras Glek <[email protected]>
CC: Mike Hommey <[email protected]>
CC: Jan Kara <[email protected]>
CC: KOSAKI Motohiro <[email protected]>
CC: Michel Lespinasse <[email protected]>
CC: Minchan Kim <[email protected]>
CC: [email protected] <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
drivers/staging/android/ashmem.c | 331 ++------------------------------------
1 file changed, 9 insertions(+), 322 deletions(-)

diff --git a/drivers/staging/android/ashmem.c b/drivers/staging/android/ashmem.c
index e84dbec..9e8fe37 100644
--- a/drivers/staging/android/ashmem.c
+++ b/drivers/staging/android/ashmem.c
@@ -50,26 +50,6 @@ struct ashmem_area {
};

/*
- * ashmem_range - represents an interval of unpinned (evictable) pages
- * Lifecycle: From unpin to pin
- * Locking: Protected by `ashmem_mutex'
- */
-struct ashmem_range {
- struct list_head lru; /* entry in LRU list */
- struct list_head unpinned; /* entry in its area's unpinned list */
- struct ashmem_area *asma; /* associated area */
- size_t pgstart; /* starting page, inclusive */
- size_t pgend; /* ending page, inclusive */
- unsigned int purged; /* ASHMEM_NOT or ASHMEM_WAS_PURGED */
-};
-
-/* LRU list of unpinned pages, protected by ashmem_mutex */
-static LIST_HEAD(ashmem_lru_list);
-
-/* Count of pages on our LRU list, protected by ashmem_mutex */
-static unsigned long lru_count;
-
-/*
* ashmem_mutex - protects the list of and each individual ashmem_area
*
* Lock Ordering: ashmex_mutex -> i_mutex -> i_alloc_sem
@@ -77,102 +57,9 @@ static unsigned long lru_count;
static DEFINE_MUTEX(ashmem_mutex);

static struct kmem_cache *ashmem_area_cachep __read_mostly;
-static struct kmem_cache *ashmem_range_cachep __read_mostly;
-
-#define range_size(range) \
- ((range)->pgend - (range)->pgstart + 1)
-
-#define range_on_lru(range) \
- ((range)->purged == ASHMEM_NOT_PURGED)
-
-#define page_range_subsumes_range(range, start, end) \
- (((range)->pgstart >= (start)) && ((range)->pgend <= (end)))
-
-#define page_range_subsumed_by_range(range, start, end) \
- (((range)->pgstart <= (start)) && ((range)->pgend >= (end)))
-
-#define page_in_range(range, page) \
- (((range)->pgstart <= (page)) && ((range)->pgend >= (page)))
-
-#define page_range_in_range(range, start, end) \
- (page_in_range(range, start) || page_in_range(range, end) || \
- page_range_subsumes_range(range, start, end))
-
-#define range_before_page(range, page) \
- ((range)->pgend < (page))

#define PROT_MASK (PROT_EXEC | PROT_READ | PROT_WRITE)

-static inline void lru_add(struct ashmem_range *range)
-{
- list_add_tail(&range->lru, &ashmem_lru_list);
- lru_count += range_size(range);
-}
-
-static inline void lru_del(struct ashmem_range *range)
-{
- list_del(&range->lru);
- lru_count -= range_size(range);
-}
-
-/*
- * range_alloc - allocate and initialize a new ashmem_range structure
- *
- * 'asma' - associated ashmem_area
- * 'prev_range' - the previous ashmem_range in the sorted asma->unpinned list
- * 'purged' - initial purge value (ASMEM_NOT_PURGED or ASHMEM_WAS_PURGED)
- * 'start' - starting page, inclusive
- * 'end' - ending page, inclusive
- *
- * Caller must hold ashmem_mutex.
- */
-static int range_alloc(struct ashmem_area *asma,
- struct ashmem_range *prev_range, unsigned int purged,
- size_t start, size_t end)
-{
- struct ashmem_range *range;
-
- range = kmem_cache_zalloc(ashmem_range_cachep, GFP_KERNEL);
- if (unlikely(!range))
- return -ENOMEM;
-
- range->asma = asma;
- range->pgstart = start;
- range->pgend = end;
- range->purged = purged;
-
- list_add_tail(&range->unpinned, &prev_range->unpinned);
-
- if (range_on_lru(range))
- lru_add(range);
-
- return 0;
-}
-
-static void range_del(struct ashmem_range *range)
-{
- list_del(&range->unpinned);
- if (range_on_lru(range))
- lru_del(range);
- kmem_cache_free(ashmem_range_cachep, range);
-}
-
-/*
- * range_shrink - shrinks a range
- *
- * Caller must hold ashmem_mutex.
- */
-static inline void range_shrink(struct ashmem_range *range,
- size_t start, size_t end)
-{
- size_t pre = range_size(range);
-
- range->pgstart = start;
- range->pgend = end;
-
- if (range_on_lru(range))
- lru_count -= pre - range_size(range);
-}

static int ashmem_open(struct inode *inode, struct file *file)
{
@@ -198,12 +85,6 @@ static int ashmem_open(struct inode *inode, struct file *file)
static int ashmem_release(struct inode *ignored, struct file *file)
{
struct ashmem_area *asma = file->private_data;
- struct ashmem_range *range, *next;
-
- mutex_lock(&ashmem_mutex);
- list_for_each_entry_safe(range, next, &asma->unpinned_list, unpinned)
- range_del(range);
- mutex_unlock(&ashmem_mutex);

if (asma->file)
fput(asma->file);
@@ -337,56 +218,6 @@ out:
return ret;
}

-/*
- * ashmem_shrink - our cache shrinker, called from mm/vmscan.c :: shrink_slab
- *
- * 'nr_to_scan' is the number of objects (pages) to prune, or 0 to query how
- * many objects (pages) we have in total.
- *
- * 'gfp_mask' is the mask of the allocation that got us into this mess.
- *
- * Return value is the number of objects (pages) remaining, or -1 if we cannot
- * proceed without risk of deadlock (due to gfp_mask).
- *
- * We approximate LRU via least-recently-unpinned, jettisoning unpinned partial
- * chunks of ashmem regions LRU-wise one-at-a-time until we hit 'nr_to_scan'
- * pages freed.
- */
-static int ashmem_shrink(struct shrinker *s, struct shrink_control *sc)
-{
- struct ashmem_range *range, *next;
-
- /* We might recurse into filesystem code, so bail out if necessary */
- if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS))
- return -1;
- if (!sc->nr_to_scan)
- return lru_count;
-
- mutex_lock(&ashmem_mutex);
- list_for_each_entry_safe(range, next, &ashmem_lru_list, lru) {
- loff_t start = range->pgstart * PAGE_SIZE;
- loff_t end = (range->pgend + 1) * PAGE_SIZE;
-
- do_fallocate(range->asma->file,
- FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
- start, end - start);
- range->purged = ASHMEM_WAS_PURGED;
- lru_del(range);
-
- sc->nr_to_scan -= range_size(range);
- if (sc->nr_to_scan <= 0)
- break;
- }
- mutex_unlock(&ashmem_mutex);
-
- return lru_count;
-}
-
-static struct shrinker ashmem_shrinker = {
- .shrink = ashmem_shrink,
- .seeks = DEFAULT_SEEKS * 4,
-};
-
static int set_prot_mask(struct ashmem_area *asma, unsigned long prot)
{
int ret = 0;
@@ -459,136 +290,10 @@ static int get_name(struct ashmem_area *asma, void __user *name)
return ret;
}

-/*
- * ashmem_pin - pin the given ashmem region, returning whether it was
- * previously purged (ASHMEM_WAS_PURGED) or not (ASHMEM_NOT_PURGED).
- *
- * Caller must hold ashmem_mutex.
- */
-static int ashmem_pin(struct ashmem_area *asma, size_t pgstart, size_t pgend)
-{
- struct ashmem_range *range, *next;
- int ret = ASHMEM_NOT_PURGED;
-
- list_for_each_entry_safe(range, next, &asma->unpinned_list, unpinned) {
- /* moved past last applicable page; we can short circuit */
- if (range_before_page(range, pgstart))
- break;
-
- /*
- * The user can ask us to pin pages that span multiple ranges,
- * or to pin pages that aren't even unpinned, so this is messy.
- *
- * Four cases:
- * 1. The requested range subsumes an existing range, so we
- * just remove the entire matching range.
- * 2. The requested range overlaps the start of an existing
- * range, so we just update that range.
- * 3. The requested range overlaps the end of an existing
- * range, so we just update that range.
- * 4. The requested range punches a hole in an existing range,
- * so we have to update one side of the range and then
- * create a new range for the other side.
- */
- if (page_range_in_range(range, pgstart, pgend)) {
- ret |= range->purged;
-
- /* Case #1: Easy. Just nuke the whole thing. */
- if (page_range_subsumes_range(range, pgstart, pgend)) {
- range_del(range);
- continue;
- }
-
- /* Case #2: We overlap from the start, so adjust it */
- if (range->pgstart >= pgstart) {
- range_shrink(range, pgend + 1, range->pgend);
- continue;
- }
-
- /* Case #3: We overlap from the rear, so adjust it */
- if (range->pgend <= pgend) {
- range_shrink(range, range->pgstart, pgstart-1);
- continue;
- }
-
- /*
- * Case #4: We eat a chunk out of the middle. A bit
- * more complicated, we allocate a new range for the
- * second half and adjust the first chunk's endpoint.
- */
- range_alloc(asma, range, range->purged,
- pgend + 1, range->pgend);
- range_shrink(range, range->pgstart, pgstart - 1);
- break;
- }
- }
-
- return ret;
-}
-
-/*
- * ashmem_unpin - unpin the given range of pages. Returns zero on success.
- *
- * Caller must hold ashmem_mutex.
- */
-static int ashmem_unpin(struct ashmem_area *asma, size_t pgstart, size_t pgend)
-{
- struct ashmem_range *range, *next;
- unsigned int purged = ASHMEM_NOT_PURGED;
-
-restart:
- list_for_each_entry_safe(range, next, &asma->unpinned_list, unpinned) {
- /* short circuit: this is our insertion point */
- if (range_before_page(range, pgstart))
- break;
-
- /*
- * The user can ask us to unpin pages that are already entirely
- * or partially pinned. We handle those two cases here.
- */
- if (page_range_subsumed_by_range(range, pgstart, pgend))
- return 0;
- if (page_range_in_range(range, pgstart, pgend)) {
- pgstart = min_t(size_t, range->pgstart, pgstart),
- pgend = max_t(size_t, range->pgend, pgend);
- purged |= range->purged;
- range_del(range);
- goto restart;
- }
- }
-
- return range_alloc(asma, range, purged, pgstart, pgend);
-}
-
-/*
- * ashmem_get_pin_status - Returns ASHMEM_IS_UNPINNED if _any_ pages in the
- * given interval are unpinned and ASHMEM_IS_PINNED otherwise.
- *
- * Caller must hold ashmem_mutex.
- */
-static int ashmem_get_pin_status(struct ashmem_area *asma, size_t pgstart,
- size_t pgend)
-{
- struct ashmem_range *range;
- int ret = ASHMEM_IS_PINNED;
-
- list_for_each_entry(range, &asma->unpinned_list, unpinned) {
- if (range_before_page(range, pgstart))
- break;
- if (page_range_in_range(range, pgstart, pgend)) {
- ret = ASHMEM_IS_UNPINNED;
- break;
- }
- }
-
- return ret;
-}
-
static int ashmem_pin_unpin(struct ashmem_area *asma, unsigned long cmd,
void __user *p)
{
struct ashmem_pin pin;
- size_t pgstart, pgend;
int ret = -EINVAL;

if (unlikely(!asma->file))
@@ -610,20 +315,24 @@ static int ashmem_pin_unpin(struct ashmem_area *asma, unsigned long cmd,
if (unlikely(PAGE_ALIGN(asma->size) < pin.offset + pin.len))
return -EINVAL;

- pgstart = pin.offset / PAGE_SIZE;
- pgend = pgstart + (pin.len / PAGE_SIZE) - 1;

mutex_lock(&ashmem_mutex);

switch (cmd) {
case ASHMEM_PIN:
- ret = ashmem_pin(asma, pgstart, pgend);
+ ret = do_fallocate(asma->file, FALLOC_FL_UNMARK_VOLATILE,
+ pin.offset, pin.len);
break;
case ASHMEM_UNPIN:
- ret = ashmem_unpin(asma, pgstart, pgend);
+ ret = do_fallocate(asma->file, FALLOC_FL_MARK_VOLATILE,
+ pin.offset, pin.len);
break;
case ASHMEM_GET_PIN_STATUS:
- ret = ashmem_get_pin_status(asma, pgstart, pgend);
+ /*
+ * XXX - volatile ranges currently don't provide status,
+ * due to questionable utility
+ */
+ ret = -EINVAL;
break;
}

@@ -667,15 +376,6 @@ static long ashmem_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
break;
case ASHMEM_PURGE_ALL_CACHES:
ret = -EPERM;
- if (capable(CAP_SYS_ADMIN)) {
- struct shrink_control sc = {
- .gfp_mask = GFP_KERNEL,
- .nr_to_scan = 0,
- };
- ret = ashmem_shrink(&ashmem_shrinker, &sc);
- sc.nr_to_scan = ret;
- ashmem_shrink(&ashmem_shrinker, &sc);
- }
break;
}

@@ -711,22 +411,12 @@ static int __init ashmem_init(void)
return -ENOMEM;
}

- ashmem_range_cachep = kmem_cache_create("ashmem_range_cache",
- sizeof(struct ashmem_range),
- 0, 0, NULL);
- if (unlikely(!ashmem_range_cachep)) {
- printk(KERN_ERR "ashmem: failed to create slab cache\n");
- return -ENOMEM;
- }
-
ret = misc_register(&ashmem_misc);
if (unlikely(ret)) {
printk(KERN_ERR "ashmem: failed to register misc device!\n");
return ret;
}

- register_shrinker(&ashmem_shrinker);
-
printk(KERN_INFO "ashmem: initialized\n");

return 0;
@@ -736,13 +426,10 @@ static void __exit ashmem_exit(void)
{
int ret;

- unregister_shrinker(&ashmem_shrinker);
-
ret = misc_deregister(&ashmem_misc);
if (unlikely(ret))
printk(KERN_ERR "ashmem: failed to unregister misc device!\n");

- kmem_cache_destroy(ashmem_range_cachep);
kmem_cache_destroy(ashmem_area_cachep);

printk(KERN_INFO "ashmem: unloaded\n");
--
1.7.9.5

2012-06-27 04:18:17

by John Stultz

Subject: [PATCH 5/5] [RFC][HACK] mm: Change memory management of anonymous pages on swapless systems

Due to my newbie-ness, the following may not be precise, but
I think it conveys the intent of what I'm trying to do here.

Anonymous memory is tracked on two LRU lists: LRU_INACTIVE_ANON
and LRU_ACTIVE_ANON. This split is useful when we need to free
up pages and are trying to decide what to swap out.

However, on systems that do not have swap, this partition is less
clear. In many cases the code avoids aging active anonymous pages
onto the inactive list. However, in some cases pages do get moved
to the inactive list, but we never call writepage on them, as there
isn't anything to swap out.

This patch changes some of the active/inactive list management of
anonymous memory when there is no swap. In that case pages are
always added to the active lru. The intent is that since anonymous
pages cannot be swapped out, they should all be active.

The one exception is volatile pages, which can be moved to
the inactive lru by calling deactivate_page().

In addition, I've changed the logic so we also try to shrink the
inactive anonymous lru and call writepage. This should only be done
if there are volatile pages on the inactive lru.

This allows us to purge volatile pages in writepage when the system
does not have swap.

CC: Andrew Morton <[email protected]>
CC: Android Kernel Team <[email protected]>
CC: Robert Love <[email protected]>
CC: Mel Gorman <[email protected]>
CC: Hugh Dickins <[email protected]>
CC: Dave Hansen <[email protected]>
CC: Rik van Riel <[email protected]>
CC: Dmitry Adamushko <[email protected]>
CC: Dave Chinner <[email protected]>
CC: Neil Brown <[email protected]>
CC: Andrea Righi <[email protected]>
CC: Aneesh Kumar K.V <[email protected]>
CC: Taras Glek <[email protected]>
CC: Mike Hommey <[email protected]>
CC: Jan Kara <[email protected]>
CC: KOSAKI Motohiro <[email protected]>
CC: Michel Lespinasse <[email protected]>
CC: Minchan Kim <[email protected]>
CC: [email protected] <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
include/linux/pagevec.h | 5 +----
include/linux/swap.h | 23 +++++++++++++++--------
mm/swap.c | 13 ++++++++++++-
mm/vmscan.c | 9 ---------
4 files changed, 28 insertions(+), 22 deletions(-)

diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index 2aa12b8..e1312a5 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -22,6 +22,7 @@ struct pagevec {

void __pagevec_release(struct pagevec *pvec);
void __pagevec_lru_add(struct pagevec *pvec, enum lru_list lru);
+void __pagevec_lru_add_anon(struct pagevec *pvec);
unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
pgoff_t start, unsigned nr_pages);
unsigned pagevec_lookup_tag(struct pagevec *pvec,
@@ -64,10 +65,6 @@ static inline void pagevec_release(struct pagevec *pvec)
__pagevec_release(pvec);
}

-static inline void __pagevec_lru_add_anon(struct pagevec *pvec)
-{
- __pagevec_lru_add(pvec, LRU_INACTIVE_ANON);
-}

static inline void __pagevec_lru_add_active_anon(struct pagevec *pvec)
{
diff --git a/include/linux/swap.h b/include/linux/swap.h
index c84ec68..639936f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -238,14 +238,6 @@ extern void swap_setup(void);

extern void add_page_to_unevictable_list(struct page *page);

-/**
- * lru_cache_add: add a page to the page lists
- * @page: the page to add
- */
-static inline void lru_cache_add_anon(struct page *page)
-{
- __lru_cache_add(page, LRU_INACTIVE_ANON);
-}

static inline void lru_cache_add_file(struct page *page)
{
@@ -474,5 +466,20 @@ mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
}

#endif /* CONFIG_SWAP */
+
+/**
+ * lru_cache_add: add a page to the page lists
+ * @page: the page to add
+ */
+static inline void lru_cache_add_anon(struct page *page)
+{
+ int lru = LRU_INACTIVE_ANON;
+ if (!total_swap_pages)
+ lru = LRU_ACTIVE_ANON;
+
+ __lru_cache_add(page, lru);
+}
+
+
#endif /* __KERNEL__*/
#endif /* _LINUX_SWAP_H */
diff --git a/mm/swap.c b/mm/swap.c
index 4e7e2ec..f35df46 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -691,7 +691,7 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
SetPageLRU(page_tail);

if (page_evictable(page_tail, NULL)) {
- if (PageActive(page)) {
+ if (PageActive(page) || !total_swap_pages) {
SetPageActive(page_tail);
active = 1;
lru = LRU_ACTIVE_ANON;
@@ -755,6 +755,17 @@ void __pagevec_lru_add(struct pagevec *pvec, enum lru_list lru)
}
EXPORT_SYMBOL(__pagevec_lru_add);

+
+void __pagevec_lru_add_anon(struct pagevec *pvec)
+{
+ if (!total_swap_pages)
+ __pagevec_lru_add(pvec, LRU_ACTIVE_ANON);
+ else
+ __pagevec_lru_add(pvec, LRU_INACTIVE_ANON);
+}
+EXPORT_SYMBOL(__pagevec_lru_add_anon);
+
+
/**
* pagevec_lookup - gang pagecache lookup
* @pvec: Where the resulting pages are placed
diff --git a/mm/vmscan.c b/mm/vmscan.c
index eeb3bc9..52d8ad9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1597,15 +1597,6 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
if (!global_reclaim(sc))
force_scan = true;

- /* If we have no swap space, do not bother scanning anon pages. */
- if (!sc->may_swap || (nr_swap_pages <= 0)) {
- noswap = 1;
- fraction[0] = 0;
- fraction[1] = 1;
- denominator = 1;
- goto out;
- }
-
anon = get_lru_size(lruvec, LRU_ACTIVE_ANON) +
get_lru_size(lruvec, LRU_INACTIVE_ANON);
file = get_lru_size(lruvec, LRU_ACTIVE_FILE) +
--
1.7.9.5

2012-06-27 04:18:19

by John Stultz

Subject: [PATCH 2/5] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers

This patch enables FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE
functionality for tmpfs making use of the volatile range
management code.

Conceptually, FALLOC_FL_MARK_VOLATILE is like a delayed
FALLOC_FL_PUNCH_HOLE. It allows applications with re-creatable data
caches to tell the kernel that some memory contains data that may be
useful in the future but can be recreated if needed, so if the kernel
needs the memory, it can zap it without having to swap it out.

In use, applications call FALLOC_FL_MARK_VOLATILE to mark page
ranges as volatile when they are not in use. Later, if they want to
reuse the data, they call FALLOC_FL_UNMARK_VOLATILE, which will
return an error if the data has been purged.
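
As a rough illustration of the intended usage from userspace (just a
sketch, not part of this patch; the flag values are copied from the
falloc.h hunk below since no libc exports them yet, and how the
purged indication surfaces through the libc fallocate() wrapper is an
assumption on my part):

#define _GNU_SOURCE
#include <fcntl.h>
#include <errno.h>
#include <stdio.h>

#ifndef FALLOC_FL_MARK_VOLATILE
#define FALLOC_FL_MARK_VOLATILE		0x04	/* values from this patch */
#define FALLOC_FL_UNMARK_VOLATILE	0x08
#endif

/* Cache is idle: let the kernel purge this range instead of swapping it */
static int cache_make_volatile(int fd, off_t offset, off_t len)
{
	return fallocate(fd, FALLOC_FL_MARK_VOLATILE, offset, len);
}

/*
 * About to reuse the cache: take the range back.  A non-zero return
 * means the data was purged (or the call failed) and the cached
 * contents must be rebuilt.
 */
static int cache_make_stable(int fd, off_t offset, off_t len)
{
	int ret = fallocate(fd, FALLOC_FL_UNMARK_VOLATILE, offset, len);

	if (ret)
		fprintf(stderr, "range %lld+%lld lost (ret=%d errno=%d)\n",
			(long long)offset, (long long)len, ret, errno);
	return ret;
}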

This is very much influenced by the Android ashmem interface by
Robert Love, so credit to him and the Android developers. In many
cases the code and logic come directly from the ashmem patch. The
intent of this patch is to allow for ashmem-like behavior, but to
embed the idea a little deeper into the VM code.

This is a reworked version of the fadvise volatile idea submitted
earlier to the list. Thanks to Dave Chinner for suggesting to
rework the idea in this fashion. Also thanks to Dmitry Adamushko
for continued review and bug reporting, and Dave Hansen for
help with the original design and mentoring me in the VM code.

v3:
* Fix off by one issue when truncating page ranges
* Use Dave Hansen's suggestion to use shmem_writepage to trigger
range purging instead of using a shrinker.

v4:
* Revert the shrinker removal, since writepage won't get called
if we don't have swap.

v5:
* Cleanups

CC: Andrew Morton <[email protected]>
CC: Android Kernel Team <[email protected]>
CC: Robert Love <[email protected]>
CC: Mel Gorman <[email protected]>
CC: Hugh Dickins <[email protected]>
CC: Dave Hansen <[email protected]>
CC: Rik van Riel <[email protected]>
CC: Dmitry Adamushko <[email protected]>
CC: Dave Chinner <[email protected]>
CC: Neil Brown <[email protected]>
CC: Andrea Righi <[email protected]>
CC: Aneesh Kumar K.V <[email protected]>
CC: Taras Glek <[email protected]>
CC: Mike Hommey <[email protected]>
CC: Jan Kara <[email protected]>
CC: KOSAKI Motohiro <[email protected]>
CC: Michel Lespinasse <[email protected]>
CC: Minchan Kim <[email protected]>
CC: [email protected] <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
fs/open.c | 3 +-
include/linux/falloc.h | 7 +--
mm/shmem.c | 113 ++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 119 insertions(+), 4 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index d6c79a0..c0531c0 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -223,7 +223,8 @@ int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
return -EINVAL;

/* Return error if mode is not supported */
- if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
+ if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
+ FALLOC_FL_MARK_VOLATILE | FALLOC_FL_UNMARK_VOLATILE))
return -EOPNOTSUPP;

/* Punch hole must have keep size set */
diff --git a/include/linux/falloc.h b/include/linux/falloc.h
index 73e0b62..3e47ad5 100644
--- a/include/linux/falloc.h
+++ b/include/linux/falloc.h
@@ -1,9 +1,10 @@
#ifndef _FALLOC_H_
#define _FALLOC_H_

-#define FALLOC_FL_KEEP_SIZE 0x01 /* default is extend size */
-#define FALLOC_FL_PUNCH_HOLE 0x02 /* de-allocates range */
-
+#define FALLOC_FL_KEEP_SIZE 0x01 /* default is extend size */
+#define FALLOC_FL_PUNCH_HOLE 0x02 /* de-allocates range */
+#define FALLOC_FL_MARK_VOLATILE 0x04 /* mark range volatile */
+#define FALLOC_FL_UNMARK_VOLATILE 0x08 /* mark range non-volatile */
#ifdef __KERNEL__

/*
diff --git a/mm/shmem.c b/mm/shmem.c
index a15a466..d85d237 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -64,6 +64,7 @@ static struct vfsmount *shm_mnt;
#include <linux/highmem.h>
#include <linux/seq_file.h>
#include <linux/magic.h>
+#include <linux/volatile.h>

#include <asm/uaccess.h>
#include <asm/pgtable.h>
@@ -624,6 +625,103 @@ static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
return error;
}

+static DEFINE_VOLATILE_FS_HEAD(shmem_volatile_head);
+
+static int shmem_mark_volatile(struct inode *inode, loff_t offset, loff_t len)
+{
+ pgoff_t start, end;
+ int ret;
+
+ start = offset >> PAGE_CACHE_SHIFT;
+ end = (offset+len) >> PAGE_CACHE_SHIFT;
+
+ volatile_range_lock(&shmem_volatile_head);
+ ret = volatile_range_add(&shmem_volatile_head, &inode->i_data,
+ start, end);
+ if (ret > 0) { /* immediately purge */
+ shmem_truncate_range(inode,
+ ((loff_t) start << PAGE_CACHE_SHIFT),
+ ((loff_t) end << PAGE_CACHE_SHIFT)-1);
+ ret = 0;
+ }
+ volatile_range_unlock(&shmem_volatile_head);
+
+ return ret;
+}
+
+static int shmem_unmark_volatile(struct inode *inode, loff_t offset, loff_t len)
+{
+ pgoff_t start, end;
+ int ret;
+
+ start = offset >> PAGE_CACHE_SHIFT;
+ end = (offset+len) >> PAGE_CACHE_SHIFT;
+
+ volatile_range_lock(&shmem_volatile_head);
+ ret = volatile_range_remove(&shmem_volatile_head, &inode->i_data,
+ start, end);
+ volatile_range_unlock(&shmem_volatile_head);
+
+ return ret;
+}
+
+static void shmem_clear_volatile(struct inode *inode)
+{
+ volatile_range_lock(&shmem_volatile_head);
+ volatile_range_clear(&shmem_volatile_head, &inode->i_data);
+ volatile_range_unlock(&shmem_volatile_head);
+}
+
+static
+int shmem_volatile_shrink(struct shrinker *ignored, struct shrink_control *sc)
+{
+ s64 nr_to_scan = sc->nr_to_scan;
+ const gfp_t gfp_mask = sc->gfp_mask;
+ struct address_space *mapping;
+ pgoff_t start, end;
+ int ret;
+ s64 page_count;
+
+ if (nr_to_scan && !(gfp_mask & __GFP_FS))
+ return -1;
+
+ volatile_range_lock(&shmem_volatile_head);
+ page_count = volatile_range_lru_size(&shmem_volatile_head);
+ if (!nr_to_scan)
+ goto out;
+
+ do {
+ ret = volatile_ranges_pluck_lru(&shmem_volatile_head,
+ &mapping, &start, &end);
+ if (ret) {
+ shmem_truncate_range(mapping->host,
+ ((loff_t) start << PAGE_CACHE_SHIFT),
+ ((loff_t) end << PAGE_CACHE_SHIFT)-1);
+
+ nr_to_scan -= end-start;
+ page_count -= end-start;
+ };
+ } while (ret && (nr_to_scan > 0));
+
+out:
+ volatile_range_unlock(&shmem_volatile_head);
+
+ return page_count;
+}
+
+static struct shrinker shmem_volatile_shrinker = {
+ .shrink = shmem_volatile_shrink,
+ .seeks = DEFAULT_SEEKS,
+};
+
+static int __init shmem_shrinker_init(void)
+{
+ register_shrinker(&shmem_volatile_shrinker);
+ return 0;
+}
+arch_initcall(shmem_shrinker_init);
+
+
static void shmem_evict_inode(struct inode *inode)
{
struct shmem_inode_info *info = SHMEM_I(inode);
@@ -1806,6 +1904,14 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
/* No need to unmap again: hole-punching leaves COWed pages */
error = 0;
goto out;
+ } else if (mode & FALLOC_FL_MARK_VOLATILE) {
+ /* Mark pages volatile, sort of delayed hole punching */
+ error = shmem_mark_volatile(inode, offset, len);
+ goto out;
+ } else if (mode & FALLOC_FL_UNMARK_VOLATILE) {
+ /* Mark pages non-volatile, return error if pages were purged */
+ error = shmem_unmark_volatile(inode, offset, len);
+ goto out;
}

/* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */
@@ -1884,6 +1990,12 @@ out:
return error;
}

+static int shmem_release(struct inode *inode, struct file *file)
+{
+ shmem_clear_volatile(inode);
+ return 0;
+}
+
static int shmem_statfs(struct dentry *dentry, struct kstatfs *buf)
{
struct shmem_sb_info *sbinfo = SHMEM_SB(dentry->d_sb);
@@ -2795,6 +2907,7 @@ static const struct file_operations shmem_file_operations = {
.splice_read = shmem_file_splice_read,
.splice_write = generic_file_splice_write,
.fallocate = shmem_fallocate,
+ .release = shmem_release,
#endif
};

--
1.7.9.5

2012-06-27 04:18:16

by John Stultz

Subject: [PATCH 4/5] [RFC][HACK] tmpfs: Purge volatile ranges on writepage instead of using shrinker

In order to avoid using a shrinker, purge volatile ranges on writepage.
This requires deactivating/activating the page ranges together to
ensure relative LRU behavior in purging the volatile ranges.

This no longer requires the volatile range lru so remove that code.
Also add volatile range infrastructure to find a range that contains
a given page.

One concern with this approach is that it adds the overhead of having
to activate or deactivate each page in the range when we mark or
unmark a range as volatile. Since users of this interface are
volunteering memory for the kernel to possibly take, they may not
feel so generous if the cost of marking and unmarking ranges is high.

Cost of calling MARK_VOLATILE then UNMARK_VOLATILE on a 1meg chunk:
Before this patch: ~25us
With this patch: ~495us

So the 20x cost increase makes this approach less favorable. But
hopefully someone can suggest an idea to improve things?
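
For reference, a minimal microbenchmark for this kind of measurement
might look roughly like the following sketch (not the exact test used
for the numbers above; the tmpfs path and iteration count are
arbitrary):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define FALLOC_FL_MARK_VOLATILE		0x04	/* values from patch 2/5 */
#define FALLOC_FL_UNMARK_VOLATILE	0x08

#define CHUNK	(1024 * 1024)	/* 1 meg, as measured above */
#define ITERS	1000

int main(void)
{
	struct timespec a, b;
	char buf[4096];
	long ns;
	int fd, i;

	fd = open("/dev/shm/vrange-test", O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return 1;

	/* populate the pages so mark/unmark actually has work to do */
	memset(buf, 0xaa, sizeof(buf));
	for (i = 0; i < CHUNK / (int)sizeof(buf); i++)
		if (write(fd, buf, sizeof(buf)) != sizeof(buf))
			return 1;

	clock_gettime(CLOCK_MONOTONIC, &a);
	for (i = 0; i < ITERS; i++) {
		fallocate(fd, FALLOC_FL_MARK_VOLATILE, 0, CHUNK);
		fallocate(fd, FALLOC_FL_UNMARK_VOLATILE, 0, CHUNK);
	}
	clock_gettime(CLOCK_MONOTONIC, &b);

	ns = (b.tv_sec - a.tv_sec) * 1000000000L + (b.tv_nsec - a.tv_nsec);
	printf("avg mark+unmark of %d bytes: %ld ns\n", CHUNK, ns / ITERS);

	close(fd);
	return 0;
}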

NOTE: On systems without swap, the VM won't shrink anonymous memory,
so writepage is never called and volatile ranges won't be purged.
This issue will be addressed in a following patch.

I've only been able to do minimal testing, so there's probably much
still wrong with this patch. But hopefully it will provide a concrete
approach for discussion.

v2:
* Rework page activation/deactivation using pagevec to batch
operations. This improves performance 3x over the previous version,
but it is still ~20x slower than the lru shrinker method.

CC: Andrew Morton <[email protected]>
CC: Android Kernel Team <[email protected]>
CC: Robert Love <[email protected]>
CC: Mel Gorman <[email protected]>
CC: Hugh Dickins <[email protected]>
CC: Dave Hansen <[email protected]>
CC: Rik van Riel <[email protected]>
CC: Dmitry Adamushko <[email protected]>
CC: Dave Chinner <[email protected]>
CC: Neil Brown <[email protected]>
CC: Andrea Righi <[email protected]>
CC: Aneesh Kumar K.V <[email protected]>
CC: Taras Glek <[email protected]>
CC: Mike Hommey <[email protected]>
CC: Jan Kara <[email protected]>
CC: KOSAKI Motohiro <[email protected]>
CC: Michel Lespinasse <[email protected]>
CC: Minchan Kim <[email protected]>
CC: [email protected] <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
include/linux/volatile.h | 11 ++---
mm/shmem.c | 108 +++++++++++++++++++++++++---------------------
mm/volatile.c | 88 ++++++++++---------------------------
3 files changed, 84 insertions(+), 123 deletions(-)

diff --git a/include/linux/volatile.h b/include/linux/volatile.h
index 6f41b98..fcae047 100644
--- a/include/linux/volatile.h
+++ b/include/linux/volatile.h
@@ -5,15 +5,11 @@

struct volatile_fs_head {
struct mutex lock;
- struct list_head lru_head;
- s64 unpurged_page_count;
};


#define DEFINE_VOLATILE_FS_HEAD(name) struct volatile_fs_head name = { \
.lock = __MUTEX_INITIALIZER(name.lock), \
- .lru_head = LIST_HEAD_INIT(name.lru_head), \
- .unpurged_page_count = 0, \
}


@@ -34,12 +30,11 @@ extern long volatile_range_remove(struct volatile_fs_head *head,
struct address_space *mapping,
pgoff_t start_index, pgoff_t end_index);

-extern s64 volatile_range_lru_size(struct volatile_fs_head *head);
-
extern void volatile_range_clear(struct volatile_fs_head *head,
struct address_space *mapping);

-extern s64 volatile_ranges_pluck_lru(struct volatile_fs_head *head,
- struct address_space **mapping,
+int volatile_page_in_range(struct volatile_fs_head *head,
+ struct page *page,
pgoff_t *start, pgoff_t *end);
+
#endif /* _LINUX_VOLATILE_H */
diff --git a/mm/shmem.c b/mm/shmem.c
index d85d237..9c6b2cd 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -627,6 +627,41 @@ static int shmem_setattr(struct dentry *dentry, struct iattr *attr)

static DEFINE_VOLATILE_FS_HEAD(shmem_volatile_head);

+
+void modify_range(struct address_space *mapping,
+ pgoff_t start, pgoff_t end,
+ void(*activate_func)(struct page*))
+{
+ struct pagevec pvec;
+ pgoff_t index = start;
+ int i;
+
+
+ pagevec_init(&pvec, 0);
+ while (index <= end && pagevec_lookup(&pvec, mapping, index,
+ min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+ mem_cgroup_uncharge_start();
+ for (i = 0; i < pagevec_count(&pvec); i++) {
+ struct page *page = pvec.pages[i];
+
+ /* We rely upon deletion not changing page->index */
+ index = page->index;
+ if (index > end)
+ break;
+
+ activate_func(page);
+ }
+ pagevec_release(&pvec);
+ mem_cgroup_uncharge_end();
+ cond_resched();
+ index++;
+ }
+
+}
+
+
+
+
static int shmem_mark_volatile(struct inode *inode, loff_t offset, loff_t len)
{
pgoff_t start, end;
@@ -643,7 +678,11 @@ static int shmem_mark_volatile(struct inode *inode, loff_t offset, loff_t len)
((loff_t) start << PAGE_CACHE_SHIFT),
((loff_t) end << PAGE_CACHE_SHIFT)-1);
ret = 0;
+
}
+
+ modify_range(&inode->i_data, start, end, &deactivate_page);
+
volatile_range_unlock(&shmem_volatile_head);

return ret;
@@ -660,6 +699,9 @@ static int shmem_unmark_volatile(struct inode *inode, loff_t offset, loff_t len)
volatile_range_lock(&shmem_volatile_head);
ret = volatile_range_remove(&shmem_volatile_head, &inode->i_data,
start, end);
+
+ modify_range(&inode->i_data, start, end, &activate_page);
+
volatile_range_unlock(&shmem_volatile_head);

return ret;
@@ -672,55 +714,6 @@ static void shmem_clear_volatile(struct inode *inode)
volatile_range_unlock(&shmem_volatile_head);
}

-static
-int shmem_volatile_shrink(struct shrinker *ignored, struct shrink_control *sc)
-{
- s64 nr_to_scan = sc->nr_to_scan;
- const gfp_t gfp_mask = sc->gfp_mask;
- struct address_space *mapping;
- pgoff_t start, end;
- int ret;
- s64 page_count;
-
- if (nr_to_scan && !(gfp_mask & __GFP_FS))
- return -1;
-
- volatile_range_lock(&shmem_volatile_head);
- page_count = volatile_range_lru_size(&shmem_volatile_head);
- if (!nr_to_scan)
- goto out;
-
- do {
- ret = volatile_ranges_pluck_lru(&shmem_volatile_head,
- &mapping, &start, &end);
- if (ret) {
- shmem_truncate_range(mapping->host,
- ((loff_t) start << PAGE_CACHE_SHIFT),
- ((loff_t) end << PAGE_CACHE_SHIFT)-1);
-
- nr_to_scan -= end-start;
- page_count -= end-start;
- };
- } while (ret && (nr_to_scan > 0));
-
-out:
- volatile_range_unlock(&shmem_volatile_head);
-
- return page_count;
-}
-
-static struct shrinker shmem_volatile_shrinker = {
- .shrink = shmem_volatile_shrink,
- .seeks = DEFAULT_SEEKS,
-};
-
-static int __init shmem_shrinker_init(void)
-{
- register_shrinker(&shmem_volatile_shrinker);
- return 0;
-}
-arch_initcall(shmem_shrinker_init);
-

static void shmem_evict_inode(struct inode *inode)
{
@@ -884,15 +877,30 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
struct address_space *mapping;
struct inode *inode;
swp_entry_t swap;
- pgoff_t index;
+ pgoff_t index, start, end;

BUG_ON(!PageLocked(page));
mapping = page->mapping;
index = page->index;
inode = mapping->host;
info = SHMEM_I(inode);
+
if (info->flags & VM_LOCKED)
goto redirty;
+
+ volatile_range_lock(&shmem_volatile_head);
+ if (volatile_page_in_range(&shmem_volatile_head, page, &start, &end)) {
+ unlock_page(page);
+ volatile_range_unlock(&shmem_volatile_head);
+ shmem_truncate_range(inode,
+ ((loff_t) start << PAGE_CACHE_SHIFT),
+ ((loff_t) end << PAGE_CACHE_SHIFT)-1);
+
+ return 0;
+ }
+ volatile_range_unlock(&shmem_volatile_head);
+
+
if (!total_swap_pages)
goto redirty;

diff --git a/mm/volatile.c b/mm/volatile.c
index d05a767..862f2ae 100644
--- a/mm/volatile.c
+++ b/mm/volatile.c
@@ -53,7 +53,6 @@


struct volatile_range {
- struct list_head lru;
struct prio_tree_node node;
unsigned int purged;
struct address_space *mapping;
@@ -159,15 +158,8 @@ static inline void vrange_resize(struct volatile_fs_head *head,
struct volatile_range *vrange,
pgoff_t start_index, pgoff_t end_index)
{
- pgoff_t old_size, new_size;
-
- old_size = vrange->node.last - vrange->node.start;
- new_size = end_index-start_index;
-
- if (!vrange->purged)
- head->unpurged_page_count += new_size - old_size;
-
prio_tree_remove(root, &vrange->node);
+ INIT_PRIO_TREE_NODE(&vrange->node);
vrange->node.start = start_index;
vrange->node.last = end_index;
prio_tree_insert(root, &vrange->node);
@@ -189,15 +181,7 @@ static void vrange_add(struct volatile_fs_head *head,
struct prio_tree_root *root,
struct volatile_range *vrange)
{
-
prio_tree_insert(root, &vrange->node);
-
- /* Only add unpurged ranges to LRU */
- if (!vrange->purged) {
- head->unpurged_page_count += vrange->node.last - vrange->node.start;
- list_add_tail(&vrange->lru, &head->lru_head);
- }
-
}


@@ -206,10 +190,6 @@ static void vrange_del(struct volatile_fs_head *head,
struct prio_tree_root *root,
struct volatile_range *vrange)
{
- if (!vrange->purged) {
- head->unpurged_page_count -= vrange->node.last - vrange->node.start;
- list_del(&vrange->lru);
- }
prio_tree_remove(root, &vrange->node);
kfree(vrange);
}
@@ -416,62 +396,40 @@ out:
return ret;
}

-/**
- * volatile_range_lru_size: Returns the number of unpurged pages on the lru
- * @head: per-fs volatile head
- *
- * Returns the number of unpurged pages on the LRU
- *
- * Must lock the volatile_fs_head before calling!
- *
- */
-s64 volatile_range_lru_size(struct volatile_fs_head *head)
-{
- WARN_ON(!mutex_is_locked(&head->lock));
- return head->unpurged_page_count;
-}
-

-/**
- * volatile_ranges_pluck_lru: Returns mapping and size of lru unpurged range
- * @head: per-fs volatile head
- * @mapping: dbl pointer to mapping who's range is being purged
- * @start: Pointer to starting address of range being purged
- * @end: Pointer to ending address of range being purged
- *
- * Returns the mapping, start and end values of the least recently used
- * range. Marks the range as purged and removes it from the LRU.
- *
- * Must lock the volatile_fs_head before calling!
- *
- * Returns 1 on success if a range was returned
- * Return 0 if no ranges were found.
- */
-s64 volatile_ranges_pluck_lru(struct volatile_fs_head *head,
- struct address_space **mapping,
+int volatile_page_in_range(struct volatile_fs_head *head,
+ struct page *page,
pgoff_t *start, pgoff_t *end)
{
- struct volatile_range *range;
+ struct prio_tree_root *root;
+ struct prio_tree_node *node;
+ struct prio_tree_iter iter;
+ struct volatile_range *vrange;
+ int ret = 0;

WARN_ON(!mutex_is_locked(&head->lock));

- if (list_empty(&head->lru_head))
+ root = mapping_to_root(page->mapping);
+ if (!root)
return 0;

- range = list_first_entry(&head->lru_head, struct volatile_range, lru);
-
- *start = range->node.start;
- *end = range->node.last;
- *mapping = range->mapping;
-
- head->unpurged_page_count -= *end - *start;
- list_del(&range->lru);
- range->purged = 1;
+ prio_tree_iter_init(&iter, root, page->index, page->index);
+ node = prio_tree_next(&iter);
+ if (node) {
+ vrange = container_of(node, struct volatile_range, node);

- return 1;
+ if (!vrange->purged) {
+ *start = vrange->node.start;
+ *end = vrange->node.last;
+ vrange->purged = 1;
+ ret = 1;
+ }
+ }
+ return ret;
}


+
/*
* Cleans up any volatile ranges.
*/
--
1.7.9.5

2012-06-27 04:18:57

by John Stultz

Subject: [PATCH 1/5] [RFC] Add volatile range management code

This patch provides the volatile range management code
that filesystems can utilize when implementing
FALLOC_FL_MARK_VOLATILE.

It tracks a collection of page ranges against a mapping
stored in an interval-tree. This code handles coalescing
overlapping and adjacent ranges, as well as splitting
ranges when sub-chunks are removed.

The ranges can be marked purged or unpurged, and there is a per-fs
lru list that tracks all the unpurged ranges for that fs.
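
As a rough sketch of the intended calling convention (the function
and head names here are placeholders; the real user is the tmpfs
patch later in this series):

#include <linux/volatile.h>

static DEFINE_VOLATILE_FS_HEAD(example_volatile_head);

/* Page indexes are inclusive; adjacent/overlapping marks coalesce */
static long example_mark(struct address_space *mapping)
{
	long ret, purged = 0;

	volatile_range_lock(&example_volatile_head);
	/* 0-15 followed by 16-31 ends up as one range covering 0-31 */
	ret = volatile_range_add(&example_volatile_head, mapping, 0, 15);
	if (ret < 0)
		goto out;
	purged |= ret;
	ret = volatile_range_add(&example_volatile_head, mapping, 16, 31);
	if (ret < 0)
		goto out;
	/* non-zero means we coalesced with an already-purged range */
	ret = purged | ret;
out:
	volatile_range_unlock(&example_volatile_head);
	return ret;
}

static long example_unmark(struct address_space *mapping)
{
	long ret;

	volatile_range_lock(&example_volatile_head);
	/* unmarking 8-23 splits the 0-31 range into 0-7 and 24-31 */
	ret = volatile_range_remove(&example_volatile_head, mapping, 8, 23);
	volatile_range_unlock(&example_volatile_head);

	/* 1 if any part of 8-23 had been purged, 0 otherwise */
	return ret;
}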

v2:
* Fix bug in volatile_ranges_get_last_used returning bad
start,end values
* Rework for intervaltree renaming
* Optimize volatile_range_lru_size to avoid running through
lru list each time.

v3:
* Improve function name to make it clear what the
volatile_ranges_pluck_lru() code does.
* Drop volatile_range_lru_size and unpurged_page_count
management as it's now unused

v4:
* Re-add volatile_range_lru_size and unpurged_page_count
* Fix bug in range_remove where, when splitting ranges, we added
an overlapping range before resizing the existing range.

v5:
* Drop intervaltree for prio_tree usage per Michel &
Dmitry's suggestions.
* Cleanups

CC: Andrew Morton <[email protected]>
CC: Android Kernel Team <[email protected]>
CC: Robert Love <[email protected]>
CC: Mel Gorman <[email protected]>
CC: Hugh Dickins <[email protected]>
CC: Dave Hansen <[email protected]>
CC: Rik van Riel <[email protected]>
CC: Dmitry Adamushko <[email protected]>
CC: Dave Chinner <[email protected]>
CC: Neil Brown <[email protected]>
CC: Andrea Righi <[email protected]>
CC: Aneesh Kumar K.V <[email protected]>
CC: Taras Glek <[email protected]>
CC: Mike Hommey <[email protected]>
CC: Jan Kara <[email protected]>
CC: KOSAKI Motohiro <[email protected]>
CC: Michel Lespinasse <[email protected]>
CC: Minchan Kim <[email protected]>
CC: [email protected] <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
include/linux/volatile.h | 45 ++++
mm/Makefile | 2 +-
mm/volatile.c | 509 ++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 555 insertions(+), 1 deletion(-)
create mode 100644 include/linux/volatile.h
create mode 100644 mm/volatile.c

diff --git a/include/linux/volatile.h b/include/linux/volatile.h
new file mode 100644
index 0000000..6f41b98
--- /dev/null
+++ b/include/linux/volatile.h
@@ -0,0 +1,45 @@
+#ifndef _LINUX_VOLATILE_H
+#define _LINUX_VOLATILE_H
+
+#include <linux/fs.h>
+
+struct volatile_fs_head {
+ struct mutex lock;
+ struct list_head lru_head;
+ s64 unpurged_page_count;
+};
+
+
+#define DEFINE_VOLATILE_FS_HEAD(name) struct volatile_fs_head name = { \
+ .lock = __MUTEX_INITIALIZER(name.lock), \
+ .lru_head = LIST_HEAD_INIT(name.lru_head), \
+ .unpurged_page_count = 0, \
+}
+
+
+static inline void volatile_range_lock(struct volatile_fs_head *head)
+{
+ mutex_lock(&head->lock);
+}
+
+static inline void volatile_range_unlock(struct volatile_fs_head *head)
+{
+ mutex_unlock(&head->lock);
+}
+
+extern long volatile_range_add(struct volatile_fs_head *head,
+ struct address_space *mapping,
+ pgoff_t start_index, pgoff_t end_index);
+extern long volatile_range_remove(struct volatile_fs_head *head,
+ struct address_space *mapping,
+ pgoff_t start_index, pgoff_t end_index);
+
+extern s64 volatile_range_lru_size(struct volatile_fs_head *head);
+
+extern void volatile_range_clear(struct volatile_fs_head *head,
+ struct address_space *mapping);
+
+extern s64 volatile_ranges_pluck_lru(struct volatile_fs_head *head,
+ struct address_space **mapping,
+ pgoff_t *start, pgoff_t *end);
+#endif /* _LINUX_VOLATILE_H */
diff --git a/mm/Makefile b/mm/Makefile
index 2e2fbbe..3e3cd6f 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -16,7 +16,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
readahead.o swap.o truncate.o vmscan.o shmem.o \
prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
page_isolation.o mm_init.o mmu_context.o percpu.o \
- compaction.o $(mmu-y)
+ compaction.o volatile.o $(mmu-y)
obj-y += init-mm.o

ifdef CONFIG_NO_BOOTMEM
diff --git a/mm/volatile.c b/mm/volatile.c
new file mode 100644
index 0000000..d05a767
--- /dev/null
+++ b/mm/volatile.c
@@ -0,0 +1,509 @@
+/* mm/volatile.c
+ *
+ * Volatile page range management.
+ * Copyright 2011 Linaro
+ *
+ * Based on mm/ashmem.c
+ * by Robert Love <[email protected]>
+ * Copyright (C) 2008 Google, Inc.
+ *
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * The volatile range management is a helper layer on top of the range tree
+ * code, which is used to help filesystems manage page ranges that are volatile.
+ *
+ * These ranges are stored in a per-mapping range tree. Storing both purged and
+ * unpurged ranges connected to that address_space. Unpurged ranges are also
+ * linked together in an lru list that is per-volatile-fs-head (basically
+ * per-filesystem).
+ *
+ * The goal behind volatile ranges is to allow applications to interact
+ * with the kernel's cache management infrastructure. In particular an
+ * application can say "this memory contains data that might be useful in
+ * the future, but can be reconstructed if necessary, so if the kernel
+ * needs, it can zap and reclaim this memory without having to swap it out.
+ *
+ * The proposed mechanism - at a high level - is for user-space to be able
+ * to say "This memory is volatile" and then later "this memory is no longer
+ * volatile". If the content of the memory is still available the second
+ * request succeeds. If not, the memory is marked non-volatile and an
+ * error is returned to denote that the contents have been lost.
+ *
+ * Credits to Neil Brown for the above description.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/pagemap.h>
+#include <linux/volatile.h>
+#include <linux/rbtree.h>
+#include <linux/hash.h>
+#include <linux/shmem_fs.h>
+
+
+struct volatile_range {
+ struct list_head lru;
+ struct prio_tree_node node;
+ unsigned int purged;
+ struct address_space *mapping;
+};
+
+
+/*
+ * To avoid bloating the address_space structure, we use
+ * a hash structure to map from address_space mappings to
+ * the interval_tree root that stores volatile ranges
+ */
+static DEFINE_MUTEX(hash_mutex);
+static struct hlist_head *mapping_hash;
+static long mapping_hash_shift = 8;
+struct mapping_hash_entry {
+ struct prio_tree_root root;
+ struct address_space *mapping;
+ struct hlist_node hnode;
+};
+
+
+static inline
+struct prio_tree_root *__mapping_to_root(struct address_space *mapping)
+{
+ struct hlist_node *elem;
+ struct mapping_hash_entry *entry;
+ struct prio_tree_root *ret = NULL;
+
+ hlist_for_each_entry_rcu(entry, elem,
+ &mapping_hash[hash_ptr(mapping, mapping_hash_shift)],
+ hnode)
+ if (entry->mapping == mapping)
+ ret = &entry->root;
+
+ return ret;
+}
+
+
+static inline
+struct prio_tree_root *mapping_to_root(struct address_space *mapping)
+{
+ struct prio_tree_root *ret;
+
+ mutex_lock(&hash_mutex);
+ ret = __mapping_to_root(mapping);
+ mutex_unlock(&hash_mutex);
+ return ret;
+}
+
+
+static inline
+struct prio_tree_root *mapping_allocate_root(struct address_space *mapping)
+{
+ struct mapping_hash_entry *entry;
+ struct prio_tree_root *dblchk;
+ struct prio_tree_root *ret = NULL;
+
+ entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+ if (!entry)
+ return NULL;
+
+ mutex_lock(&hash_mutex);
+ /* Since we dropped the lock, double check that no one has
+ * created the same hash entry.
+ */
+ dblchk = __mapping_to_root(mapping);
+ if (dblchk) {
+ kfree(entry);
+ ret = dblchk;
+ goto out;
+ }
+
+ INIT_HLIST_NODE(&entry->hnode);
+ entry->mapping = mapping;
+ INIT_PRIO_TREE_ROOT(&entry->root);
+
+ hlist_add_head_rcu(&entry->hnode,
+ &mapping_hash[hash_ptr(mapping, mapping_hash_shift)]);
+
+ ret = &entry->root;
+out:
+ mutex_unlock(&hash_mutex);
+ return ret;
+}
+
+
+static inline void mapping_free_root(struct prio_tree_root *root)
+{
+ struct mapping_hash_entry *entry;
+
+ mutex_lock(&hash_mutex);
+ entry = container_of(root, struct mapping_hash_entry, root);
+
+ hlist_del_rcu(&entry->hnode);
+ kfree(entry);
+ mutex_unlock(&hash_mutex);
+}
+
+
+/* volatile range helpers */
+static inline void vrange_resize(struct volatile_fs_head *head,
+ struct prio_tree_root *root,
+ struct volatile_range *vrange,
+ pgoff_t start_index, pgoff_t end_index)
+{
+ pgoff_t old_size, new_size;
+
+ old_size = vrange->node.last - vrange->node.start;
+ new_size = end_index-start_index;
+
+ if (!vrange->purged)
+ head->unpurged_page_count += new_size - old_size;
+
+ prio_tree_remove(root, &vrange->node);
+ vrange->node.start = start_index;
+ vrange->node.last = end_index;
+ prio_tree_insert(root, &vrange->node);
+}
+
+static struct volatile_range *vrange_alloc(void)
+{
+ struct volatile_range *new;
+
+ new = kzalloc(sizeof(struct volatile_range), GFP_KERNEL);
+ if (!new)
+ return 0;
+ INIT_PRIO_TREE_NODE(&new->node);
+ return new;
+}
+
+
+static void vrange_add(struct volatile_fs_head *head,
+ struct prio_tree_root *root,
+ struct volatile_range *vrange)
+{
+
+ prio_tree_insert(root, &vrange->node);
+
+ /* Only add unpurged ranges to LRU */
+ if (!vrange->purged) {
+ head->unpurged_page_count += vrange->node.last - vrange->node.start;
+ list_add_tail(&vrange->lru, &head->lru_head);
+ }
+
+}
+
+
+
+static void vrange_del(struct volatile_fs_head *head,
+ struct prio_tree_root *root,
+ struct volatile_range *vrange)
+{
+ if (!vrange->purged) {
+ head->unpurged_page_count -= vrange->node.last - vrange->node.start;
+ list_del(&vrange->lru);
+ }
+ prio_tree_remove(root, &vrange->node);
+ kfree(vrange);
+}
+
+
+/**
+ * volatile_range_add: Marks a page interval as volatile
+ * @head: per-fs volatile head
+ * @mapping: address space whose range is being marked volatile
+ * @start_index: Starting page in range to be marked volatile
+ * @end_index: Ending page in range to be marked volatile
+ *
+ * Mark a region as volatile. Coalesces overlapping and neighboring regions.
+ *
+ * Must lock the volatile_fs_head before calling!
+ *
+ * Returns 1 if the range was coalesced with any purged ranges.
+ * Returns 0 on success.
+ */
+long volatile_range_add(struct volatile_fs_head *head,
+ struct address_space *mapping,
+ pgoff_t start, pgoff_t end)
+{
+ struct prio_tree_node *node;
+ struct prio_tree_iter iter;
+ struct volatile_range *new, *vrange;
+ struct prio_tree_root *root;
+ int purged = 0;
+
+ /* Make sure we're properly locked */
+ WARN_ON(!mutex_is_locked(&head->lock));
+
+ /*
+ * Because the lock might be held in a shrinker, release
+ * it during allocation.
+ */
+ mutex_unlock(&head->lock);
+ new = vrange_alloc();
+ mutex_lock(&head->lock);
+ if (!new)
+ return -ENOMEM;
+
+ root = mapping_to_root(mapping);
+ if (!root) {
+ mutex_unlock(&head->lock);
+ root = mapping_allocate_root(mapping);
+ mutex_lock(&head->lock);
+ if (!root) {
+ kfree(new);
+ return -ENOMEM;
+ }
+ }
+
+
+ /* First, find any existing intervals that overlap */
+ prio_tree_iter_init(&iter, root, start, end);
+ node = prio_tree_next(&iter);
+ while (node) {
+ vrange = container_of(node, struct volatile_range, node);
+
+ /* Already entirely marked volatile, so we're done */
+ if (vrange->node.start < start && vrange->node.last > end) {
+ /* don't need the allocated value */
+ kfree(new);
+ return purged;
+ }
+
+ /* Resize the new range to cover all overlapping ranges */
+ start = min_t(u64, start, vrange->node.start);
+ end = max_t(u64, end, vrange->node.last);
+
+ /* Inherit purged state from overlapping ranges */
+ purged |= vrange->purged;
+
+ /* See if there's a next range that overlaps */
+ node = prio_tree_next(&iter);
+
+ /* Delete the old range, as we consume it */
+ vrange_del(head, root, vrange);
+
+ }
+
+ /* Coalesce left-adjacent ranges */
+ prio_tree_iter_init(&iter, root, start-1, start);
+ node = prio_tree_next(&iter);
+ while (node) {
+ vrange = container_of(node, struct volatile_range, node);
+ node = prio_tree_next(&iter);
+ /* Only coalesce if both are either purged or unpurged */
+ if (vrange->purged == purged) {
+ /* resize new range */
+ start = min_t(u64, start, vrange->node.start);
+ end = max_t(u64, end, vrange->node.last);
+ /* delete old range */
+ vrange_del(head, root, vrange);
+ }
+ }
+
+ /* Coalesce right-adjacent ranges */
+ prio_tree_iter_init(&iter, root, end, end+1);
+ node = prio_tree_next(&iter);
+ while (node) {
+ vrange = container_of(node, struct volatile_range, node);
+ node = prio_tree_next(&iter);
+ /* Only coalesce if both are either purged or unpurged */
+ if (vrange->purged == purged) {
+ /* resize new range */
+ start = min_t(u64, start, vrange->node.start);
+ end = max_t(u64, end, vrange->node.last);
+ /* delete old range */
+ vrange_del(head, root, vrange);
+ }
+ }
+ /* Assign and store the new range in the range tree */
+ new->mapping = mapping;
+ new->node.start = start;
+ new->node.last = end;
+ new->purged = purged;
+ vrange_add(head, root, new);
+
+ return purged;
+}
+
+
+/**
+ * volatile_range_remove: Marks a page interval as nonvolatile
+ * @head: per-fs volatile head
+ * @mapping: address space whose range is being marked nonvolatile
+ * @start_index: Starting page in range to be marked nonvolatile
+ * @end_index: Ending page in range to be marked nonvolatile
+ *
+ * Mark a region as nonvolatile. And remove any contained pages
+ * from the volatile range tree.
+ *
+ * Must lock the volatile_fs_head before calling!
+ *
+ * Returns 1 if any portion of the range was purged.
+ * Returns 0 on success.
+ */
+long volatile_range_remove(struct volatile_fs_head *head,
+ struct address_space *mapping,
+ pgoff_t start, pgoff_t end)
+{
+ struct prio_tree_node *node;
+ struct prio_tree_iter iter;
+ struct volatile_range *new, *vrange;
+ struct prio_tree_root *root;
+ int ret = 0;
+ int used_new = 0;
+
+ /* Make sure we're properly locked */
+ WARN_ON(!mutex_is_locked(&head->lock));
+
+ /*
+ * Because the lock might be held in a shrinker, release
+ * it during allocation.
+ */
+ mutex_unlock(&head->lock);
+ new = vrange_alloc();
+ mutex_lock(&head->lock);
+ if (!new)
+ return -ENOMEM;
+
+ root = mapping_to_root(mapping);
+ if (!root)
+ goto out;
+
+
+ /* Find any overlapping ranges */
+ prio_tree_iter_init(&iter, root, start, end);
+ node = prio_tree_next(&iter);
+ while (node) {
+ vrange = container_of(node, struct volatile_range, node);
+ node = prio_tree_next(&iter);
+
+ ret |= vrange->purged;
+
+ if (start <= vrange->node.start && end >= vrange->node.last) {
+ /* delete: volatile range is totally within range */
+ vrange_del(head, root, vrange);
+ } else if (vrange->node.start >= start) {
+ /* resize: volatile range right-overlaps range */
+ vrange_resize(head, root, vrange, end+1, vrange->node.last);
+ } else if (vrange->node.last <= end) {
+ /* resize: volatile range left-overlaps range */
+ vrange_resize(head, root, vrange, vrange->node.start, start-1);
+ } else {
+ /* split: range is totally within a volatile range */
+ used_new = 1; /* we only do this once */
+ new->mapping = mapping;
+ new->node.start = end + 1;
+ new->node.last = vrange->node.last;
+ new->purged = vrange->purged;
+ vrange_resize(head, root, vrange, vrange->node.start, start-1);
+ vrange_add(head, root, new);
+ break;
+ }
+ }
+
+out:
+ if (!used_new)
+ kfree(new);
+
+ return ret;
+}
+
+/**
+ * volatile_range_lru_size: Returns the number of unpurged pages on the lru
+ * @head: per-fs volatile head
+ *
+ * Returns the number of unpurged pages on the LRU
+ *
+ * Must lock the volatile_fs_head before calling!
+ *
+ */
+s64 volatile_range_lru_size(struct volatile_fs_head *head)
+{
+ WARN_ON(!mutex_is_locked(&head->lock));
+ return head->unpurged_page_count;
+}
+
+
+/**
+ * volatile_ranges_pluck_lru: Returns mapping and bounds of the lru unpurged range
+ * @head: per-fs volatile head
+ * @mapping: double pointer to the mapping whose range is being purged
+ * @start: Pointer to the starting page offset of the range being purged
+ * @end: Pointer to the ending page offset of the range being purged
+ *
+ * Returns the mapping, start and end values of the least recently used
+ * range. Marks the range as purged and removes it from the LRU.
+ *
+ * Must lock the volatile_fs_head before calling!
+ *
+ * Returns 1 if a range was found and returned.
+ * Returns 0 if no ranges were found.
+ */
+s64 volatile_ranges_pluck_lru(struct volatile_fs_head *head,
+ struct address_space **mapping,
+ pgoff_t *start, pgoff_t *end)
+{
+ struct volatile_range *range;
+
+ WARN_ON(!mutex_is_locked(&head->lock));
+
+ if (list_empty(&head->lru_head))
+ return 0;
+
+ range = list_first_entry(&head->lru_head, struct volatile_range, lru);
+
+ *start = range->node.start;
+ *end = range->node.last;
+ *mapping = range->mapping;
+
+ head->unpurged_page_count -= *end - *start;
+ list_del(&range->lru);
+ range->purged = 1;
+
+ return 1;
+}
+
+
+/*
+ * Cleans up any volatile ranges.
+ */
+void volatile_range_clear(struct volatile_fs_head *head,
+ struct address_space *mapping)
+{
+ struct volatile_range *tozap;
+ struct prio_tree_root *root;
+
+ WARN_ON(!mutex_is_locked(&head->lock));
+
+ root = mapping_to_root(mapping);
+ if (!root)
+ return;
+
+ while (!prio_tree_empty(root)) {
+ tozap = container_of(root->prio_tree_node, struct volatile_range, node);
+ vrange_del(head, root, tozap);
+ }
+ mapping_free_root(root);
+}
+
+
+static int __init volatile_init(void)
+{
+ int i, size;
+
+ size = 1U << mapping_hash_shift;
+ mapping_hash = kzalloc(sizeof(mapping_hash)*size, GFP_KERNEL);
+ for (i = 0; i < size; i++)
+ INIT_HLIST_HEAD(&mapping_hash[i]);
+
+ return 0;
+}
+arch_initcall(volatile_init);
--
1.7.9.5

2012-08-09 09:46:39

by Michel Lespinasse

[permalink] [raw]
Subject: Re: [PATCH 1/5] [RFC] Add volatile range management code

On Fri, Jul 27, 2012 at 8:57 PM, John Stultz <[email protected]> wrote:
> v5:
> * Drop intervaltree for prio_tree usage per Michel &
> Dmitry's suggestions.

Actually, I believe the ranges you need to track are non-overlapping, correct ?

If that is the case, a simple rbtree, sorted by start-of-range
address, would work best.
(I am trying to remove prio_tree users... :)
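
For illustration, a minimal (untested) sketch of what that could look
like with the kernel rbtree API -- the names here are made up, it's just
the shape volatile_range might take once the prio_tree node is replaced
by an rb_node plus an inclusive [start, last] pair:

#include <linux/rbtree.h>
#include <linux/types.h>

struct volatile_range {
	struct rb_node rb;
	pgoff_t start;
	pgoff_t last;		/* inclusive */
	/* purged flag, lru linkage, mapping back-pointer, ... */
};

/* Insert a non-overlapping range, keyed by its start offset. */
static void vrange_rb_insert(struct rb_root *root, struct volatile_range *new)
{
	struct rb_node **p = &root->rb_node;
	struct rb_node *parent = NULL;

	while (*p) {
		struct volatile_range *vr;

		parent = *p;
		vr = rb_entry(parent, struct volatile_range, rb);
		/* ranges never overlap, so keying on start is enough */
		if (new->start < vr->start)
			p = &(*p)->rb_left;
		else
			p = &(*p)->rb_right;
	}
	rb_link_node(&new->rb, parent, p);
	rb_insert_color(&new->rb, root);
}

Since nothing overlaps, finding the neighbours for coalescing is then
just a matter of walking rb_prev()/rb_next() from the insertion point.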

> + /* First, find any existing intervals that overlap */
> + prio_tree_iter_init(&iter, root, start, end);

Note that prio tree iterations take intervals as [start; last] not [start; end[
So if you want to stick with prio trees, you would have to use end-1 here.

> + /* Coalesce left-adjacent ranges */
> + prio_tree_iter_init(&iter, root, start-1, start);

Same here; you probably want to use start-1 on both ends

> + node = prio_tree_next(&iter);
> + while (node) {

I'm confused, I don't think you ever expect more than one range to
match, do you ???

> + /* Coalesce right-adjacent ranges */
> + prio_tree_iter_init(&iter, root, end, end+1);

Same again, here you probably want end on both ends
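
Put differently -- and assuming the 'end' being passed in is exclusive,
which is how the arithmetic at these three call sites reads to me -- I'd
expect the queries to end up roughly as (untested sketch):

	/* overlap query: closed interval [start, end-1] */
	prio_tree_iter_init(&iter, root, start, end - 1);

	/* left-adjacent: only a range ending at start-1 can still match */
	prio_tree_iter_init(&iter, root, start - 1, start - 1);

	/* right-adjacent: only a range starting at end can still match */
	prio_tree_iter_init(&iter, root, end, end);

i.e. single-index queries for the adjacency checks, since any
overlapping ranges were already consumed by the first loop.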

This is far from a complete code review, but I just wanted to point
out a couple details that jumped to me first. I am afraid I am missing
some of the background about how the feature is to be used to really
dig into the rest of the changes at this point :/

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

2012-08-09 13:35:56

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH 1/5] [RFC] Add volatile range management code

On Thu, Aug 09, 2012 at 02:46:37AM -0700, Michel Lespinasse wrote:
> On Fri, Jul 27, 2012 at 8:57 PM, John Stultz <[email protected]> wrote:
> > v5:
> > * Drop intervaltree for prio_tree usage per Michel &
> > Dmitry's suggestions.
>
> Actually, I believe the ranges you need to track are non-overlapping, correct ?
>
> If that is the case, a simple rbtree, sorted by start-of-range
> address, would work best.
> (I am trying to remove prio_tree users... :)
>

John,

JFYI, if you want to try a possible rbtree-based implementation, as
suggested by Michel you could try this one:
https://github.com/arighi/kinterval

This implementation supports insertion, deletion and transparent merging
of adjacent ranges, as well as splitting ranges when chunks are removed
or different chunk types are added in the middle of an existing range;
so, if I'm not wrong, you should probably be able to use this code as
is, without any modification.

If you decide to go this way and/or need help to use it in your patch
set just let me know.

-Andrea

2012-08-09 19:14:23

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH 1/5] [RFC] Add volatile range management code

On 08/09/2012 02:46 AM, Michel Lespinasse wrote:
> On Fri, Jul 27, 2012 at 8:57 PM, John Stultz <[email protected]> wrote:
>> v5:
>> * Drop intervaltree for prio_tree usage per Michel &
>> Dmitry's suggestions.
> Actually, I believe the ranges you need to track are non-overlapping, correct ?
Correct. Any overlapping range is coalesced.

> If that is the case, a simple rbtree, sorted by start-of-range
> address, would work best.
> (I am trying to remove prio_tree users... :)

Sigh. Sure. Although I've blown with the wind on a number of different
approaches for storing the ranges. I'm not particularly passionate about
it, but the continual conflicting suggestions are a slight frustration. :)


>> + /* First, find any existing intervals that overlap */
>> + prio_tree_iter_init(&iter, root, start, end);
> Note that prio tree iterations take intervals as [start; last] not [start; end[
> So if you want to stick with prio trees, you would have to use end-1 here.
Thanks! I think I hit this off-by-one issue in my testing, but fixed it
on the backend w/ :

modify_range(&inode->i_data, start, end-1, &mark_nonvolatile_page);

Clearly fixing it at the start instead of papering over it is better.


>> + node = prio_tree_next(&iter);
>> + while (node) {
> I'm confused, I don't think you ever expect more than one range to
> match, do you ???

So yeah. If you already have two ranges (0-5), (10-15) and then add the
range (0-20), we need to coalesce the two existing ranges into the new one.
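
Tracing that through the first loop above, treating the numbers as
inclusive page offsets:

/*
 * tree before:  [0,5] [10,15]       request: start=0, end=20
 *
 * iteration 1: hits [0,5]   -> start = min(0,0)  = 0, end = max(20,5)  = 20, delete [0,5]
 * iteration 2: hits [10,15] -> start = min(0,10) = 0, end = max(20,15) = 20, delete [10,15]
 * no more overlaps          -> the single coalesced range [0,20] is inserted
 */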


> This is far from a complete code review, but I just wanted to point
> out a couple details that jumped to me first. I am afraid I am missing
> some of the background about how the feature is to be used to really
> dig into the rest of the changes at this point :/

Well, I really appreciate any feedback here.

thanks
-john

2012-08-09 19:36:04

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH 1/5] [RFC] Add volatile range management code

On 08/09/2012 06:35 AM, Andrea Righi wrote:
> On Thu, Aug 09, 2012 at 02:46:37AM -0700, Michel Lespinasse wrote:
>> On Fri, Jul 27, 2012 at 8:57 PM, John Stultz <[email protected]> wrote:
>>> v5:
>>> * Drop intervaltree for prio_tree usage per Michel &
>>> Dmitry's suggestions.
>> Actually, I believe the ranges you need to track are non-overlapping, correct ?
>>
>> If that is the case, a simple rbtree, sorted by start-of-range
>> address, would work best.
>> (I am trying to remove prio_tree users... :)
>>
> John,
>
> JFYI, if you want to try a possible rbtree-based implementation, as
> suggested by Michel you could try this one:
> https://github.com/arighi/kinterval
>
> This implementation supports insertion, deletion and transparent merging
> of adjacent ranges, as well as splitting ranges when chunks removed or
> different chunk types are added in the middle of an existing range; so
> if I'm not wrong probably you should be able to use this code as is,
> without any modification.
I do appreciate the suggestion, and considered this earlier when you
posted this before.

Unfortunately the transparent merging/splitting/etc. is actually not
useful for me, since I manage other data per-range. In the earlier
generic rangetree/intervaltree implementations I tried, I limited the
interface to basically add(), remove(), search(), and search_next(),
since when we coalesce intervals we need to free the data in the
structure referencing the interval being deleted (and similarly create
new structures to reference the new intervals created when we remove an
interval). So the coalescing/splitting logic can't be pushed into the
interval management code cleanly.
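
For context, the interface I've been working against is roughly this
thin (hypothetical signatures, not the exact ones from the dropped
patch, but it shows where the merge/split policy has to live):

#include <linux/rbtree.h>

/* bare non-overlapping interval store -- sketch only */
struct vinterval {
	struct rb_node rb;
	unsigned long start;
	unsigned long last;	/* inclusive */
};

void vinterval_add(struct rb_root *root, struct vinterval *node);
void vinterval_remove(struct rb_root *root, struct vinterval *node);

/* first interval overlapping [start, last], or NULL */
struct vinterval *vinterval_search(struct rb_root *root,
				   unsigned long start, unsigned long last);

/* next interval after 'node' overlapping [start, last], or NULL */
struct vinterval *vinterval_search_next(struct vinterval *node,
					unsigned long start, unsigned long last);

Everything that hangs off the range (the purged flag, the per-range lru
linkage, freeing or reallocating on coalesce) stays on the caller's side
of that line.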

So while I might be able to make use of your kinterval in a fairly
simple manner (only using add/del/lookup), I'm not sure it wins anything
over just using an rbtree. Especially since I'd have to do my own
coalesce/splitting logic anyway, it would actually be more expensive as
on add() it would still scan to check for overlapping ranges to merge.

I ended up dropping my generic intervaltree implementation because folks
objected that it was so trivial (basically just wrapping an rbtree) and
didn't handle some of the more complex intervaltree use cases (i.e.
allowing for overlapping intervals). The prio_tree seemed to match fairly
closely the interface I was using, but apparently it's on its way out as
well, so unless anyone further objects, I think I'll just fall back to a
simple rbtree implementation.

thanks
-john

2012-08-09 19:39:29

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH 1/5] [RFC] Add volatile range management code

On Thu, Aug 09, 2012 at 12:33:17PM -0700, John Stultz wrote:
> On 08/09/2012 06:35 AM, Andrea Righi wrote:
> >On Thu, Aug 09, 2012 at 02:46:37AM -0700, Michel Lespinasse wrote:
> >>On Fri, Jul 27, 2012 at 8:57 PM, John Stultz <[email protected]> wrote:
> >>>v5:
> >>>* Drop intervaltree for prio_tree usage per Michel &
> >>> Dmitry's suggestions.
> >>Actually, I believe the ranges you need to track are non-overlapping, correct ?
> >>
> >>If that is the case, a simple rbtree, sorted by start-of-range
> >>address, would work best.
> >>(I am trying to remove prio_tree users... :)
> >>
> >John,
> >
> >JFYI, if you want to try a possible rbtree-based implementation, as
> >suggested by Michel you could try this one:
> >https://github.com/arighi/kinterval
> >
> >This implementation supports insertion, deletion and transparent merging
> >of adjacent ranges, as well as splitting ranges when chunks removed or
> >different chunk types are added in the middle of an existing range; so
> >if I'm not wrong probably you should be able to use this code as is,
> >without any modification.
> I do appreciate the suggestion, and considered this earlier when you
> posted this before.
>
> Unfotunately the transparent merging/splitting/etc is actually not
> useful for me, since I manage other data per-range. The earlier
> generic rangetree/intervaltree implementations I tried limiting the
> interface to basically add(), remove(), search(), and search_next(),
> since when we coalesce intervals, we need to free the data in the
> structure referencing the interval being deleted (and similarly
> create new structures to reference new intervals created when we
> remove an interval). So the coalescing/splitting logic can't be
> pushed into the interval management code cleanly.
>
> So while I might be able to make use of your kinterval in a fairly
> simple manner (only using add/del/lookup), I'm not sure it wins
> anything over just using an rbtree. Especially since I'd have to do
> my own coalesce/splitting logic anyway, it would actually be more
> expensive as on add() it would still scan to check for overlapping
> ranges to merge.
>
> I ended up dropping my generic intervaltree implementation because
> folks objected that it was so trivial (basically just wrapping an
> rbtree) and didn't handle some of the more complex intervaltree use
> cases (ie: allowing for overlapping intervals). The priotree seemed
> to match fairly closely the interface I was using, but apparently
> its on its way out as well, so unless anyone further objects, I
> think I'll just fall back to a simple rbtree implementation.

OK, everything makes sense now, thanks for the clarifications, and sorry
for suggesting yet another range/interval tree implementation. :)

I'll look at your patch set in more detail and try to test/review it
closely.

-Andrea