2012-11-11 19:01:59

by Rafael Aquini

Subject: [PATCH v12 0/7] make balloon pages movable by compaction

Memory fragmentation introduced by ballooning might significantly reduce the
number of 2MB contiguous memory blocks that can be used within a guest, thus
imposing performance penalties associated with the reduced number of
transparent huge pages that could be used by the guest workload.

This patch set follows the main idea discussed at the 2012 LSF/MM session
"Ballooning for transparent huge pages" -- http://lwn.net/Articles/490114/ --
and introduces the required changes to the virtio_balloon driver, as well as
to the core compaction & migration bits, in order to make those subsystems
aware of ballooned pages and to allow memory balloon pages to become movable
within a guest, thus avoiding the aforementioned fragmentation issue.

The following numbers demonstrate the benefit of this patch set in allowing
compaction to be more effective on memory-ballooned guests.

Results for the STRESS-HIGHALLOC benchmark, from Mel Gorman's mmtests suite,
running on a 4GB RAM KVM guest that ballooned 512MB of RAM in 64MB chunks
every minute (inflating/deflating) while the test was running:

===BEGIN stress-highalloc

STRESS-HIGHALLOC
                               highalloc-3.7   highalloc-3.7
                                   rc4-clean       rc4-patch
Pass 1                        55.00 ( 0.00%)  62.00 ( 7.00%)
Pass 2                        54.00 ( 0.00%)  62.00 ( 8.00%)
while Rested                  75.00 ( 0.00%)  80.00 ( 5.00%)

MMTests Statistics: duration
                                    3.7         3.7
                              rc4-clean   rc4-patch
User                            1207.59     1207.46
System                          1300.55     1299.61
Elapsed                         2273.72     2157.06

MMTests Statistics: vmstat
                                    3.7         3.7
                              rc4-clean   rc4-patch
Page Ins                        3581516     2374368
Page Outs                      11148692    10410332
Swap Ins                             80          47
Swap Outs                          3641         476
Direct pages scanned              37978       33826
Kswapd pages scanned            1828245     1342869
Kswapd pages reclaimed          1710236     1304099
Direct pages reclaimed            32207       31005
Kswapd efficiency                   93%         97%
Kswapd velocity                 804.077     622.546
Direct efficiency                   84%         91%
Direct velocity                  16.703      15.682
Percentage direct scans              2%          2%
Page writes by reclaim            79252        9704
Page writes file                  75611        9228
Page writes anon                   3641         476
Page reclaim immediate            16764       11014
Page rescued immediate                0           0
Slabs scanned                   2171904     2152448
Direct inode steals                 385        2261
Kswapd inode steals              659137      609670
Kswapd skipped wait                   1          69
THP fault alloc                     546         631
THP collapse alloc                  361         339
THP splits                          259         263
THP fault fallback                   98          50
THP collapse fail                    20          17
Compaction stalls                   747         499
Compaction success                  244         145
Compaction failures                 503         354
Compaction pages moved           370888      474837
Compaction move failure           77378       65259

===END stress-highalloc

Rafael Aquini (7):
mm: adjust address_space_operations.migratepage() return code
mm: redefine address_space.assoc_mapping
mm: introduce a common interface for balloon pages mobility
mm: introduce compaction and migration for ballooned pages
virtio_balloon: introduce migration primitives to balloon pages
mm: introduce putback_movable_pages()
mm: add vm event counters for balloon pages compaction

 drivers/virtio/virtio_balloon.c    | 139 +++++++++++++++--
 fs/buffer.c                        |  12 +-
 fs/gfs2/glock.c                    |   2 +-
 fs/hugetlbfs/inode.c               |   4 +-
 fs/inode.c                         |   2 +-
 fs/nilfs2/page.c                   |   2 +-
 include/linux/balloon_compaction.h | 263 ++++++++++++++++++++++++++++++++
 include/linux/fs.h                 |   2 +-
 include/linux/migrate.h            |  19 +++
 include/linux/pagemap.h            |  16 ++
 include/linux/vm_event_item.h      |   7 +-
 mm/Kconfig                         |  15 ++
 mm/Makefile                        |   3 +-
 mm/balloon_compaction.c            | 304 +++++++++++++++++++++++++++++++++++++
 mm/compaction.c                    |  27 +++-
 mm/migrate.c                       |  86 ++++++++---
 mm/page_alloc.c                    |   2 +-
 mm/vmstat.c                        |   9 +-
 18 files changed, 862 insertions(+), 52 deletions(-)
create mode 100644 include/linux/balloon_compaction.h
create mode 100644 mm/balloon_compaction.c

Change log:
v12:
* Address last suggestions on sorting the barriers usage out (Mel Gorman);
* Fix reported build breakages for CONFIG_BALLOON_COMPACTION=n (Andrew Morton);
* Enhance commentary on the locking scheme used for balloon page compaction;
* Move all the 'balloon vm_event counter' changes to PATCH 07;
v11:
* Address AKPM's last review suggestions;
* Extend the balloon compaction common API and simplify its usage at driver;
* Minor nitpicks on code commentary;
v10:
* Adjust leak_balloon() wait_event logic to make a clear locking scheme (MST);
* Drop the RCU protection approach for dereferencing balloon's page->mapping;
* Minor nitpicks on code commentaries (MST);
v9:
* Adjust rcu_dereference usage to leverage page lock protection (Paul, Peter);
* Enhance doc on compaction interface introduced to balloon driver (Michael);
* Fix issue with isolated pages breaking leak_balloon() logic (Michael);
v8:
* introduce a common MM interface for balloon driver page compaction (Michael);
* remove the global state preventing multiple balloon device support (Michael);
* introduce RCU protection/synchronization to balloon page->mapping (Michael);
v7:
* fix a potential page leak case at 'putback_balloon_page' (Mel);
* adjust vm-events-counter patch and remove its drop-on-merge message (Rik);
* add 'putback_movable_pages' to avoid hacks on 'putback_lru_pages' (Minchan);
v6:
* rename 'is_balloon_page()' to 'movable_balloon_page()' (Rik);
v5:
* address Andrew Morton's review comments on the patch series;
* address a couple extra nitpick suggestions on PATCH 01 (Minchan);
v4:
* address Rusty Russell's review comments on PATCH 02;
* re-base virtio_balloon patch on 9c378abc5c0c6fc8e3acf5968924d274503819b3;
v3:
* address reviewers' nitpick suggestions on PATCH 01 (Mel, Minchan);
v2:
* address Mel Gorman's review comments on PATCH 01;

--
1.7.11.7


2012-11-11 19:02:01

by Rafael Aquini

Subject: [PATCH v12 1/7] mm: adjust address_space_operations.migratepage() return code

This patch introduces MIGRATEPAGE_SUCCESS as the default return code
for the address_space_operations.migratepage() method, and documents the
expected return codes for that method in failure cases.

Signed-off-by: Rafael Aquini <[email protected]>
---
 fs/hugetlbfs/inode.c    |  4 ++--
 include/linux/migrate.h |  7 +++++++
 mm/migrate.c            | 33 +++++++++++++++------------------
 3 files changed, 24 insertions(+), 20 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 14bc0c1..fed1cd5 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -582,11 +582,11 @@ static int hugetlbfs_migrate_page(struct address_space *mapping,
int rc;

rc = migrate_huge_page_move_mapping(mapping, newpage, page);
- if (rc)
+ if (rc != MIGRATEPAGE_SUCCESS)
return rc;
migrate_page_copy(newpage, page);

- return 0;
+ return MIGRATEPAGE_SUCCESS;
}

static int hugetlbfs_statfs(struct dentry *dentry, struct kstatfs *buf)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 9a5afea..fab15ae 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -7,6 +7,13 @@

typedef struct page *new_page_t(struct page *, unsigned long private, int **);

+/*
+ * Return values from address_space_operations.migratepage():
+ * - negative errno on page migration failure;
+ * - zero on page migration success;
+ */
+#define MIGRATEPAGE_SUCCESS 0
+
#ifdef CONFIG_MIGRATION

extern void putback_lru_pages(struct list_head *l);
diff --git a/mm/migrate.c b/mm/migrate.c
index 0c5ec37..6f408c7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -286,7 +286,7 @@ static int migrate_page_move_mapping(struct address_space *mapping,
expected_count += 1;
if (page_count(page) != expected_count)
return -EAGAIN;
- return 0;
+ return MIGRATEPAGE_SUCCESS;
}

spin_lock_irq(&mapping->tree_lock);
@@ -356,7 +356,7 @@ static int migrate_page_move_mapping(struct address_space *mapping,
}
spin_unlock_irq(&mapping->tree_lock);

- return 0;
+ return MIGRATEPAGE_SUCCESS;
}

/*
@@ -372,7 +372,7 @@ int migrate_huge_page_move_mapping(struct address_space *mapping,
if (!mapping) {
if (page_count(page) != 1)
return -EAGAIN;
- return 0;
+ return MIGRATEPAGE_SUCCESS;
}

spin_lock_irq(&mapping->tree_lock);
@@ -399,7 +399,7 @@ int migrate_huge_page_move_mapping(struct address_space *mapping,
page_unfreeze_refs(page, expected_count - 1);

spin_unlock_irq(&mapping->tree_lock);
- return 0;
+ return MIGRATEPAGE_SUCCESS;
}

/*
@@ -486,11 +486,11 @@ int migrate_page(struct address_space *mapping,

rc = migrate_page_move_mapping(mapping, newpage, page, NULL, mode);

- if (rc)
+ if (rc != MIGRATEPAGE_SUCCESS)
return rc;

migrate_page_copy(newpage, page);
- return 0;
+ return MIGRATEPAGE_SUCCESS;
}
EXPORT_SYMBOL(migrate_page);

@@ -513,7 +513,7 @@ int buffer_migrate_page(struct address_space *mapping,

rc = migrate_page_move_mapping(mapping, newpage, page, head, mode);

- if (rc)
+ if (rc != MIGRATEPAGE_SUCCESS)
return rc;

/*
@@ -549,7 +549,7 @@ int buffer_migrate_page(struct address_space *mapping,

} while (bh != head);

- return 0;
+ return MIGRATEPAGE_SUCCESS;
}
EXPORT_SYMBOL(buffer_migrate_page);
#endif
@@ -628,7 +628,7 @@ static int fallback_migrate_page(struct address_space *mapping,
*
* Return value:
* < 0 - error code
- * == 0 - success
+ * MIGRATEPAGE_SUCCESS - success
*/
static int move_to_new_page(struct page *newpage, struct page *page,
int remap_swapcache, enum migrate_mode mode)
@@ -665,7 +665,7 @@ static int move_to_new_page(struct page *newpage, struct page *page,
else
rc = fallback_migrate_page(mapping, newpage, page, mode);

- if (rc) {
+ if (rc != MIGRATEPAGE_SUCCESS) {
newpage->mapping = NULL;
} else {
if (remap_swapcache)
@@ -814,7 +814,7 @@ skip_unmap:
put_anon_vma(anon_vma);

uncharge:
- mem_cgroup_end_migration(mem, page, newpage, rc == 0);
+ mem_cgroup_end_migration(mem, page, newpage, rc == MIGRATEPAGE_SUCCESS);
unlock:
unlock_page(page);
out:
@@ -987,7 +987,7 @@ int migrate_pages(struct list_head *from,
case -EAGAIN:
retry++;
break;
- case 0:
+ case MIGRATEPAGE_SUCCESS:
break;
default:
/* Permanent failure */
@@ -996,15 +996,12 @@ int migrate_pages(struct list_head *from,
}
}
}
- rc = 0;
+ rc = nr_failed + retry;
out:
if (!swapwrite)
current->flags &= ~PF_SWAPWRITE;

- if (rc)
- return rc;
-
- return nr_failed + retry;
+ return rc;
}

int migrate_huge_page(struct page *hpage, new_page_t get_new_page,
@@ -1024,7 +1021,7 @@ int migrate_huge_page(struct page *hpage, new_page_t get_new_page,
/* try again */
cond_resched();
break;
- case 0:
+ case MIGRATEPAGE_SUCCESS:
goto out;
default:
rc = -EIO;
--
1.7.11.7

2012-11-11 19:02:17

by Rafael Aquini

Subject: [PATCH v12 3/7] mm: introduce a common interface for balloon pages mobility

Memory fragmentation introduced by ballooning might significantly reduce the
number of 2MB contiguous memory blocks that can be used within a guest, thus
imposing performance penalties associated with the reduced number of
transparent huge pages that could be used by the guest workload.

This patch introduces a common interface to help a balloon driver make its
page set movable by compaction, thus allowing the system to better leverage
compaction efforts on memory defragmentation.

Signed-off-by: Rafael Aquini <[email protected]>
Acked-by: Mel Gorman <[email protected]>
---
 include/linux/balloon_compaction.h | 256 +++++++++++++++++++++++++++++++
 include/linux/migrate.h            |  10 ++
 include/linux/pagemap.h            |  16 ++
 mm/Kconfig                         |  15 ++
 mm/Makefile                        |   3 +-
 mm/balloon_compaction.c            | 302 +++++++++++++++++++++++++++++++++++++
 6 files changed, 601 insertions(+), 1 deletion(-)
create mode 100644 include/linux/balloon_compaction.h
create mode 100644 mm/balloon_compaction.c

diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h
new file mode 100644
index 0000000..2e63d94
--- /dev/null
+++ b/include/linux/balloon_compaction.h
@@ -0,0 +1,256 @@
+/*
+ * include/linux/balloon_compaction.h
+ *
+ * Common interface definitions for making balloon pages movable by compaction.
+ *
+ * Although it is perfectly possible to migrate ballooned pages, they make a
+ * special corner case for compaction scans because balloon pages are not
+ * enlisted on any LRU list like the other pages we compact / migrate.
+ *
+ * As the page isolation scanning step a compaction thread performs is a
+ * lockless procedure (from a page standpoint), it might bring racy situations
+ * while performing balloon page compaction. In order to sort out these racy
+ * scenarios and safely perform balloon page compaction and migration, we must
+ * always ensure the following three simple rules:
+ *
+ * i. when updating a balloon's page ->mapping element, strictly do it under
+ * the following lock order, independently of the far superior
+ * locking scheme (lru_lock, balloon_lock):
+ * +-page_lock(page);
+ * +--spin_lock_irq(&b_dev_info->pages_lock);
+ * ... page->mapping updates here ...
+ *
+ * ii. before isolating or dequeueing a balloon page from the balloon device
+ * pages list, the page reference counter must be raised by one and the
+ * extra refcount must be dropped when the page is enqueued back into
+ * the balloon device page list, thus a balloon page keeps its reference
+ * counter raised only while it is under our special handling;
+ *
+ * iii. after the lockless scan step has selected a potential balloon page for
+ * isolation, re-test the page->mapping flags and the page ref counter
+ * under the proper page lock, to ensure isolating a valid balloon page
+ * (not yet isolated, nor under release procedure)
+ *
+ * The functions provided by this interface are meant to help cope with the
+ * aforementioned balloon page corner case, as well as to ensure the simple
+ * set of exposed rules is satisfied while we are dealing with balloon page
+ * compaction / migration.
+ *
+ * Copyright (C) 2012, Red Hat, Inc. Rafael Aquini <[email protected]>
+ */
+#ifndef _LINUX_BALLOON_COMPACTION_H
+#define _LINUX_BALLOON_COMPACTION_H
+#include <linux/pagemap.h>
+#include <linux/migrate.h>
+#include <linux/gfp.h>
+#include <linux/err.h>
+
+/*
+ * Balloon device information descriptor.
+ * This struct is used to allow the common balloon compaction interface
+ * procedures to find the proper balloon device holding memory pages they'll
+ * have to cope with for page compaction / migration, and it also serves the
+ * balloon driver as a page book-keeper for its registered balloon devices.
+ */
+struct balloon_dev_info {
+ void *balloon_device; /* balloon device descriptor */
+ struct address_space *mapping; /* balloon special page->mapping */
+ unsigned long isolated_pages; /* # of isolated pages for migration */
+ spinlock_t pages_lock; /* Protection to pages list */
+ struct list_head pages; /* Pages enqueued & handled to Host */
+};
+
+extern struct page *balloon_page_enqueue(struct balloon_dev_info *b_dev_info);
+extern struct page *balloon_page_dequeue(struct balloon_dev_info *b_dev_info);
+extern struct balloon_dev_info *balloon_devinfo_alloc(
+ void *balloon_dev_descriptor);
+
+static inline void balloon_devinfo_free(struct balloon_dev_info *b_dev_info)
+{
+ kfree(b_dev_info);
+}
+
+/*
+ * balloon_page_free - release a balloon page back to the page free lists
+ * @page: ballooned page to be set free
+ *
+ * This function must be used to properly set free an isolated/dequeued balloon
+ * page at the end of a successful page migration, or at the balloon driver's
+ * page release procedure.
+ */
+static inline void balloon_page_free(struct page *page)
+{
+ /*
+ * Balloon pages always get an extra refcount before being isolated
+ * and before being dequeued to help sort out fortuitous collisions
+ * between a thread attempting to isolate and another thread attempting
+ * to release the very same balloon page.
+ *
+ * Before we hand the page back to Buddy, let's drop its extra refcnt.
+ */
+ put_page(page);
+ __free_page(page);
+}
+
+#ifdef CONFIG_BALLOON_COMPACTION
+extern bool balloon_page_isolate(struct page *page);
+extern void balloon_page_putback(struct page *page);
+extern int balloon_page_migrate(struct page *newpage,
+ struct page *page, enum migrate_mode mode);
+extern struct address_space
+*balloon_mapping_alloc(struct balloon_dev_info *b_dev_info,
+ const struct address_space_operations *a_ops);
+
+static inline void balloon_mapping_free(struct address_space *balloon_mapping)
+{
+ kfree(balloon_mapping);
+}
+
+/*
+ * __is_movable_balloon_page - helper to perform @page mapping->flags tests
+ */
+static inline bool __is_movable_balloon_page(struct page *page)
+{
+ struct address_space *mapping = page->mapping;
+ return mapping_balloon(mapping);
+}
+
+/*
+ * balloon_page_movable - test page->mapping->flags to identify balloon pages
+ * that can be moved by compaction/migration.
+ *
+ * This function is used in core compaction's page isolation scheme;
+ * therefore most pages exposed to it are not enlisted as balloon pages, so,
+ * to avoid undesired side effects like racing against __free_pages(), we
+ * cannot afford to hold the page locked while testing page->mapping->flags
+ * here.
+ *
+ * As we might return false positives in the case of a balloon page being
+ * just released under us, the page->mapping->flags need to be re-tested
+ * later, under the proper page lock, by the functions that will be coping
+ * with the balloon page case.
+ */
+static inline bool balloon_page_movable(struct page *page)
+{
+ /*
+ * Before dereferencing and testing mapping->flags, let's make sure
+ * this is not a page that uses ->mapping in a different way.
+ */
+ if (!PageSlab(page) && !PageSwapCache(page) && !PageAnon(page) &&
+ !page_mapped(page))
+ return __is_movable_balloon_page(page);
+
+ return false;
+}
+
+/*
+ * balloon_page_insert - insert a page into the balloon's page list and make
+ * the page->mapping assignment accordingly.
+ * @page : page to be assigned as a 'balloon page'
+ * @mapping : allocated special 'balloon_mapping'
+ * @head : balloon's device page list head
+ *
+ * Caller must ensure the page is locked and the spin_lock protecting balloon
+ * pages list is held before inserting a page into the balloon device.
+ */
+static inline void balloon_page_insert(struct page *page,
+ struct address_space *mapping,
+ struct list_head *head)
+{
+ page->mapping = mapping;
+ list_add(&page->lru, head);
+}
+
+/*
+ * balloon_page_delete - delete a page from balloon's page list and clear
+ * the page->mapping assignment accordingly.
+ * @page : page to be released from balloon's page list
+ *
+ * Caller must ensure the page is locked and the spin_lock protecting balloon
+ * pages list is held before deleting a page from the balloon device.
+ */
+static inline void balloon_page_delete(struct page *page)
+{
+ page->mapping = NULL;
+ list_del(&page->lru);
+}
+
+/*
+ * balloon_page_device - get the b_dev_info descriptor for the balloon device
+ * that enqueues the given page.
+ */
+static inline struct balloon_dev_info *balloon_page_device(struct page *page)
+{
+ struct address_space *mapping = page->mapping;
+ if (likely(mapping))
+ return mapping->private_data;
+
+ return NULL;
+}
+
+static inline gfp_t balloon_mapping_gfp_mask(void)
+{
+ return GFP_HIGHUSER_MOVABLE;
+}
+
+static inline bool balloon_compaction_check(void)
+{
+ return true;
+}
+
+#else /* !CONFIG_BALLOON_COMPACTION */
+
+static inline void *balloon_mapping_alloc(void *balloon_device,
+ const struct address_space_operations *a_ops)
+{
+ return ERR_PTR(-EOPNOTSUPP);
+}
+
+static inline void balloon_mapping_free(struct address_space *balloon_mapping)
+{
+ return;
+}
+
+static inline void balloon_page_insert(struct page *page,
+ struct address_space *mapping,
+ struct list_head *head)
+{
+ list_add(&page->lru, head);
+}
+
+static inline void balloon_page_delete(struct page *page)
+{
+ list_del(&page->lru);
+}
+
+static inline bool balloon_page_movable(struct page *page)
+{
+ return false;
+}
+
+static inline bool balloon_page_isolate(struct page *page)
+{
+ return false;
+}
+
+static inline void balloon_page_putback(struct page *page)
+{
+ return;
+}
+
+static inline int balloon_page_migrate(struct page *newpage,
+ struct page *page, enum migrate_mode mode)
+{
+ return 0;
+}
+
+static inline gfp_t balloon_mapping_gfp_mask(void)
+{
+ return GFP_HIGHUSER;
+}
+
+static inline bool balloon_compaction_check(void)
+{
+ return false;
+}
+#endif /* CONFIG_BALLOON_COMPACTION */
+#endif /* _LINUX_BALLOON_COMPACTION_H */
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index fab15ae..4ce2ee9 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -11,8 +11,18 @@ typedef struct page *new_page_t(struct page *, unsigned long private, int **);
* Return values from address_space_operations.migratepage():
* - negative errno on page migration failure;
* - zero on page migration success;
+ *
+ * The balloon page migration introduces this special case where a 'distinct'
+ * return code is used to flag a successful page migration to unmap_and_move().
+ * This approach is necessary because page migration can race against balloon
+ * deflation procedure, and for such case we could introduce a nasty page leak
+ * if a successfully migrated balloon page gets released concurrently with
+ * migration's unmap_and_move() wrap-up steps.
*/
#define MIGRATEPAGE_SUCCESS 0
+#define MIGRATEPAGE_BALLOON_SUCCESS 1 /* special ret code for balloon page
+ * successful migration case.
+ */

#ifdef CONFIG_MIGRATION

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e42c762..6da609d 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -24,6 +24,7 @@ enum mapping_flags {
AS_ENOSPC = __GFP_BITS_SHIFT + 1, /* ENOSPC on async write */
AS_MM_ALL_LOCKS = __GFP_BITS_SHIFT + 2, /* under mm_take_all_locks() */
AS_UNEVICTABLE = __GFP_BITS_SHIFT + 3, /* e.g., ramdisk, SHM_LOCK */
+ AS_BALLOON_MAP = __GFP_BITS_SHIFT + 4, /* balloon page special map */
};

static inline void mapping_set_error(struct address_space *mapping, int error)
@@ -53,6 +54,21 @@ static inline int mapping_unevictable(struct address_space *mapping)
return !!mapping;
}

+static inline void mapping_set_balloon(struct address_space *mapping)
+{
+ set_bit(AS_BALLOON_MAP, &mapping->flags);
+}
+
+static inline void mapping_clear_balloon(struct address_space *mapping)
+{
+ clear_bit(AS_BALLOON_MAP, &mapping->flags);
+}
+
+static inline int mapping_balloon(struct address_space *mapping)
+{
+ return mapping && test_bit(AS_BALLOON_MAP, &mapping->flags);
+}
+
static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
{
return (__force gfp_t)mapping->flags & __GFP_BITS_MASK;
diff --git a/mm/Kconfig b/mm/Kconfig
index a3f8ddd..ae92dd5 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -188,6 +188,21 @@ config SPLIT_PTLOCK_CPUS
default "4"

#
+# support for memory balloon compaction
+config BALLOON_COMPACTION
+ bool "Allow for balloon memory compaction/migration"
+ def_bool y
+ depends on TRANSPARENT_HUGEPAGE && VIRTIO_BALLOON
+ help
+ Memory fragmentation introduced by ballooning might reduce
+ significantly the number of 2MB contiguous memory blocks that can be
+ used within a guest, thus imposing performance penalties associated
+ with the reduced number of transparent huge pages that could be used
+ by the guest workload. Allowing the compaction & migration for memory
+ pages enlisted as being part of memory balloon devices avoids the
+ scenario aforementioned and helps improving memory defragmentation.
+
+#
# support for memory compaction
config COMPACTION
bool "Allow for memory compaction"
diff --git a/mm/Makefile b/mm/Makefile
index 6b025f8..3a46287 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -16,7 +16,8 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
readahead.o swap.o truncate.o vmscan.o shmem.o \
util.o mmzone.o vmstat.o backing-dev.o \
mm_init.o mmu_context.o percpu.o slab_common.o \
- compaction.o interval_tree.o $(mmu-y)
+ compaction.o balloon_compaction.o \
+ interval_tree.o $(mmu-y)

obj-y += init-mm.o

diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
new file mode 100644
index 0000000..07dbc8e
--- /dev/null
+++ b/mm/balloon_compaction.c
@@ -0,0 +1,302 @@
+/*
+ * mm/balloon_compaction.c
+ *
+ * Common interface for making balloon pages movable by compaction.
+ *
+ * Copyright (C) 2012, Red Hat, Inc. Rafael Aquini <[email protected]>
+ */
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/export.h>
+#include <linux/balloon_compaction.h>
+
+/*
+ * balloon_devinfo_alloc - allocates a balloon device information descriptor.
+ * @balloon_dev_descriptor: pointer to reference the balloon device which
+ * this struct balloon_dev_info will be servicing.
+ *
+ * Driver must call it to properly allocate and initialize an instance of
+ * struct balloon_dev_info which will be used to reference a balloon device
+ * as well as to keep track of the balloon device page list.
+ */
+struct balloon_dev_info *balloon_devinfo_alloc(void *balloon_dev_descriptor)
+{
+ struct balloon_dev_info *b_dev_info;
+ b_dev_info = kmalloc(sizeof(*b_dev_info), GFP_KERNEL);
+ if (!b_dev_info)
+ return ERR_PTR(-ENOMEM);
+
+ b_dev_info->balloon_device = balloon_dev_descriptor;
+ b_dev_info->mapping = NULL;
+ b_dev_info->isolated_pages = 0;
+ spin_lock_init(&b_dev_info->pages_lock);
+ INIT_LIST_HEAD(&b_dev_info->pages);
+
+ return b_dev_info;
+}
+EXPORT_SYMBOL_GPL(balloon_devinfo_alloc);
+
+/*
+ * balloon_page_enqueue - allocates a new page and inserts it into the balloon
+ * page list.
+ * @b_dev_info: balloon device descriptor where we will insert a new page to
+ *
+ * Driver must call it to properly allocate a new enlisted balloon page
+ * before definitively removing it from the guest system.
+ * This function returns the page address for the recently enqueued page or
+ * NULL in the case we fail to allocate a new page this turn.
+ */
+struct page *balloon_page_enqueue(struct balloon_dev_info *b_dev_info)
+{
+ unsigned long flags;
+ struct page *page = alloc_page(balloon_mapping_gfp_mask() |
+ __GFP_NOMEMALLOC | __GFP_NORETRY);
+ if (!page)
+ return NULL;
+
+ /*
+ * Block others from accessing the 'page' when we get around to
+ * establishing additional references. We should be the only one
+ * holding a reference to the 'page' at this point.
+ */
+ BUG_ON(!trylock_page(page));
+ spin_lock_irqsave(&b_dev_info->pages_lock, flags);
+ balloon_page_insert(page, b_dev_info->mapping, &b_dev_info->pages);
+ spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);
+ unlock_page(page);
+ return page;
+}
+EXPORT_SYMBOL_GPL(balloon_page_enqueue);
+
+/*
+ * balloon_page_dequeue - removes a page from the balloon's page list and
+ * returns its address to allow the driver to release the page.
+ * @b_dev_info: balloon device descriptor where we will grab a page from.
+ *
+ * Driver must call it to properly de-allocate a previously enlisted balloon
+ * page before definitively releasing it back to the guest system.
+ * This function returns the page address for the recently dequeued page or
+ * NULL in the case we find the balloon's page list temporarily empty due to
+ * compaction's isolated pages.
+ */
+struct page *balloon_page_dequeue(struct balloon_dev_info *b_dev_info)
+{
+ struct page *page, *tmp;
+ unsigned long flags;
+ bool dequeued_page;
+
+ dequeued_page = false;
+ list_for_each_entry_safe(page, tmp, &b_dev_info->pages, lru) {
+ /*
+ * Block others from accessing the 'page' while we get around to
+ * establishing additional references and preparing the 'page'
+ * to be released by the balloon driver.
+ */
+ if (trylock_page(page)) {
+ spin_lock_irqsave(&b_dev_info->pages_lock, flags);
+ /*
+ * Raise the page refcount here to prevent any wrong
+ * attempt to isolate this page, in case of colliding
+ * with balloon_page_isolate() just after we release
+ * the page lock.
+ *
+ * balloon_page_free() will take care of dropping
+ * this extra refcount later.
+ */
+ get_page(page);
+ balloon_page_delete(page);
+ spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);
+ unlock_page(page);
+ dequeued_page = true;
+ break;
+ }
+ }
+
+ if (!dequeued_page) {
+ /*
+ * If we are unable to dequeue a balloon page because the page
+ * list is empty and there are no isolated pages, then something
+ * went off track and some balloon pages were lost.
+ * BUG() here, otherwise the balloon driver may get stuck in
+ * an infinite loop while attempting to release all its pages.
+ */
+ spin_lock_irqsave(&b_dev_info->pages_lock, flags);
+ if (unlikely(list_empty(&b_dev_info->pages) &&
+ !b_dev_info->isolated_pages))
+ BUG();
+ spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);
+ page = NULL;
+ }
+ return page;
+}
+EXPORT_SYMBOL_GPL(balloon_page_dequeue);
+
+#ifdef CONFIG_BALLOON_COMPACTION
+/*
+ * balloon_mapping_alloc - allocates a special ->mapping for ballooned pages.
+ * @b_dev_info: holds the balloon device information descriptor.
+ * @a_ops: balloon_mapping address_space_operations descriptor.
+ *
+ * Driver must call it to properly allocate and initialize an instance of
+ * struct address_space which will be used as the special page->mapping for
+ * balloon device enlisted page instances.
+ */
+struct address_space *balloon_mapping_alloc(struct balloon_dev_info *b_dev_info,
+ const struct address_space_operations *a_ops)
+{
+ struct address_space *mapping;
+
+ mapping = kmalloc(sizeof(*mapping), GFP_KERNEL);
+ if (!mapping)
+ return ERR_PTR(-ENOMEM);
+
+ /*
+ * Give a clean 'zeroed' status to all elements of this special
+ * balloon page->mapping struct address_space instance.
+ */
+ address_space_init_once(mapping);
+
+ /*
+ * Set mapping->flags appropriately, to allow balloon pages
+ * ->mapping identification.
+ */
+ mapping_set_balloon(mapping);
+ mapping_set_gfp_mask(mapping, balloon_mapping_gfp_mask());
+
+ /* balloon's page->mapping->a_ops callback descriptor */
+ mapping->a_ops = a_ops;
+
+ /*
+ * Establish a pointer reference back to the balloon device descriptor
+ * this particular page->mapping will be servicing.
+ * This is used by compaction / migration procedures to identify and
+ * access the balloon device pageset while isolating / migrating pages.
+ *
+ * As some balloon drivers can register multiple balloon devices
+ * for a single guest, this also helps compaction / migration to
+ * properly deal with multiple balloon pagesets, when required.
+ */
+ mapping->private_data = b_dev_info;
+ b_dev_info->mapping = mapping;
+
+ return mapping;
+}
+EXPORT_SYMBOL_GPL(balloon_mapping_alloc);
+
+static inline void __isolate_balloon_page(struct page *page)
+{
+ struct balloon_dev_info *b_dev_info = page->mapping->private_data;
+ unsigned long flags;
+ spin_lock_irqsave(&b_dev_info->pages_lock, flags);
+ list_del(&page->lru);
+ b_dev_info->isolated_pages++;
+ spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);
+}
+
+static inline void __putback_balloon_page(struct page *page)
+{
+ struct balloon_dev_info *b_dev_info = page->mapping->private_data;
+ unsigned long flags;
+ spin_lock_irqsave(&b_dev_info->pages_lock, flags);
+ list_add(&page->lru, &b_dev_info->pages);
+ b_dev_info->isolated_pages--;
+ spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);
+}
+
+static inline int __migrate_balloon_page(struct address_space *mapping,
+ struct page *newpage, struct page *page, enum migrate_mode mode)
+{
+ return page->mapping->a_ops->migratepage(mapping, newpage, page, mode);
+}
+
+/* __isolate_lru_page() counterpart for a ballooned page */
+bool balloon_page_isolate(struct page *page)
+{
+ /*
+ * Avoid burning cycles with pages that are yet under __free_pages(),
+ * or just got freed under us.
+ *
+ * In case we 'win' a race for a balloon page being freed under us and
+ * raise its refcount, preventing __free_pages() from doing its job,
+ * the put_page() at the end of this block will take care of
+ * releasing this page, thus avoiding a nasty leakage.
+ */
+ if (likely(get_page_unless_zero(page))) {
+ /*
+ * As balloon pages are not isolated from LRU lists, concurrent
+ * compaction threads can race against page migration functions
+ * as well as race against the balloon driver releasing a page.
+ *
+ * In order to avoid having an already isolated balloon page
+ * being (wrongly) re-isolated while it is under migration,
+ * or to avoid attempting to isolate pages being released by
+ * the balloon driver, let's be sure we have the page lock
+ * before proceeding with the balloon page isolation steps.
+ */
+ if (likely(trylock_page(page))) {
+ /*
+ * A ballooned page, by default, has just one refcount.
+ * Prevent concurrent compaction threads from isolating
+ * an already isolated balloon page by refcount check.
+ */
+ if (__is_movable_balloon_page(page) &&
+ page_count(page) == 2) {
+ __isolate_balloon_page(page);
+ unlock_page(page);
+ return true;
+ }
+ unlock_page(page);
+ }
+ put_page(page);
+ }
+ return false;
+}
+
+/* putback_lru_page() counterpart for a ballooned page */
+void balloon_page_putback(struct page *page)
+{
+ /*
+ * 'lock_page()' stabilizes the page and prevents races against
+ * concurrent isolation threads attempting to re-isolate it.
+ */
+ lock_page(page);
+
+ if (__is_movable_balloon_page(page)) {
+ __putback_balloon_page(page);
+ /* drop the extra ref count taken for page isolation */
+ put_page(page);
+ } else {
+ WARN_ON(1);
+ dump_page(page);
+ }
+ unlock_page(page);
+}
+
+/* move_to_new_page() counterpart for a ballooned page */
+int balloon_page_migrate(struct page *newpage,
+ struct page *page, enum migrate_mode mode)
+{
+ struct address_space *mapping;
+ int rc = -EAGAIN;
+
+ /*
+ * Block others from accessing the 'newpage' when we get around to
+ * establishing additional references. We should be the only one
+ * holding a reference to the 'newpage' at this point.
+ */
+ BUG_ON(!trylock_page(newpage));
+
+ if (WARN_ON(!__is_movable_balloon_page(page))) {
+ dump_page(page);
+ unlock_page(newpage);
+ return rc;
+ }
+
+ mapping = page->mapping;
+ if (mapping)
+ rc = __migrate_balloon_page(mapping, newpage, page, mode);
+
+ unlock_page(newpage);
+ return rc;
+}
+#endif /* CONFIG_BALLOON_COMPACTION */
--
1.7.11.7

2012-11-11 19:02:23

by Rafael Aquini

[permalink] [raw]
Subject: [PATCH v12 5/7] virtio_balloon: introduce migration primitives to balloon pages

Memory fragmentation introduced by ballooning might reduce significantly
the number of 2MB contiguous memory blocks that can be used within a guest,
thus imposing performance penalties associated with the reduced number of
transparent huge pages that could be used by the guest workload.

Besides making balloon pages movable at allocation time and introducing
the necessary primitives to perform balloon page migration/compaction,
this patch also introduces the following locking scheme, in order to
enhance the synchronization of accesses to elements of struct
virtio_balloon, thus protecting them against the concurrency
introduced by parallel memory migration threads.

- balloon_lock (mutex) : synchronizes the access demand to elements of
struct virtio_balloon and its queue operations;

Signed-off-by: Rafael Aquini <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
---
drivers/virtio/virtio_balloon.c | 139 ++++++++++++++++++++++++++++++++++++----
1 file changed, 127 insertions(+), 12 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 586395c..8f92ab7 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -27,6 +27,7 @@
#include <linux/delay.h>
#include <linux/slab.h>
#include <linux/module.h>
+#include <linux/balloon_compaction.h>

/*
* Balloon device works in 4K page units. So each page is pointed to by
@@ -34,6 +35,7 @@
* page units.
*/
#define VIRTIO_BALLOON_PAGES_PER_PAGE (PAGE_SIZE >> VIRTIO_BALLOON_PFN_SHIFT)
+#define VIRTIO_BALLOON_ARRAY_PFNS_MAX 256

struct virtio_balloon
{
@@ -52,15 +54,19 @@ struct virtio_balloon
/* Number of balloon pages we've told the Host we're not using. */
unsigned int num_pages;
/*
- * The pages we've told the Host we're not using.
+ * The pages we've told the Host we're not using are enqueued
+ * on the vb_dev_info->pages list.
* Each page on this list adds VIRTIO_BALLOON_PAGES_PER_PAGE
* to num_pages above.
*/
- struct list_head pages;
+ struct balloon_dev_info *vb_dev_info;
+
+ /* Synchronize access/update to the elements of struct virtio_balloon */
+ struct mutex balloon_lock;

/* The array of pfns we tell the Host about. */
unsigned int num_pfns;
- u32 pfns[256];
+ u32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];

/* Memory statistics */
int need_stats_update;
@@ -122,17 +128,20 @@ static void set_page_pfns(u32 pfns[], struct page *page)

static void fill_balloon(struct virtio_balloon *vb, size_t num)
{
+ struct balloon_dev_info *vb_dev_info = vb->vb_dev_info;
+
/* We can only do one array worth at a time. */
num = min(num, ARRAY_SIZE(vb->pfns));

+ mutex_lock(&vb->balloon_lock);
for (vb->num_pfns = 0; vb->num_pfns < num;
vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
- struct page *page = alloc_page(GFP_HIGHUSER | __GFP_NORETRY |
- __GFP_NOMEMALLOC | __GFP_NOWARN);
+ struct page *page = balloon_page_enqueue(vb_dev_info);
+
if (!page) {
dev_info_ratelimited(&vb->vdev->dev,
"Out of puff! Can't get %zu pages\n",
- num);
+ VIRTIO_BALLOON_PAGES_PER_PAGE);
/* Sleep for at least 1/5 of a second before retry. */
msleep(200);
break;
@@ -140,7 +149,6 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num)
set_page_pfns(vb->pfns + vb->num_pfns, page);
vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
totalram_pages--;
- list_add(&page->lru, &vb->pages);
}

/* Didn't get any? Oh well. */
@@ -148,6 +156,7 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num)
return;

tell_host(vb, vb->inflate_vq);
+ mutex_unlock(&vb->balloon_lock);
}

static void release_pages_by_pfn(const u32 pfns[], unsigned int num)
@@ -156,7 +165,7 @@ static void release_pages_by_pfn(const u32 pfns[], unsigned int num)

/* Find pfns pointing at start of each page, get pages and free them. */
for (i = 0; i < num; i += VIRTIO_BALLOON_PAGES_PER_PAGE) {
- __free_page(balloon_pfn_to_page(pfns[i]));
+ balloon_page_free(balloon_pfn_to_page(pfns[i]));
totalram_pages++;
}
}
@@ -164,14 +173,17 @@ static void release_pages_by_pfn(const u32 pfns[], unsigned int num)
static void leak_balloon(struct virtio_balloon *vb, size_t num)
{
struct page *page;
+ struct balloon_dev_info *vb_dev_info = vb->vb_dev_info;

/* We can only do one array worth at a time. */
num = min(num, ARRAY_SIZE(vb->pfns));

+ mutex_lock(&vb->balloon_lock);
for (vb->num_pfns = 0; vb->num_pfns < num;
vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
- page = list_first_entry(&vb->pages, struct page, lru);
- list_del(&page->lru);
+ page = balloon_page_dequeue(vb_dev_info);
+ if (!page)
+ break;
set_page_pfns(vb->pfns + vb->num_pfns, page);
vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
}
@@ -182,6 +194,7 @@ static void leak_balloon(struct virtio_balloon *vb, size_t num)
* is true, we *have* to do it in this order
*/
tell_host(vb, vb->deflate_vq);
+ mutex_unlock(&vb->balloon_lock);
release_pages_by_pfn(vb->pfns, vb->num_pfns);
}

@@ -338,9 +351,84 @@ static int init_vqs(struct virtio_balloon *vb)
return 0;
}

+static const struct address_space_operations virtio_balloon_aops;
+#ifdef CONFIG_BALLOON_COMPACTION
+/*
+ * virtballoon_migratepage - perform the balloon page migration on behalf of
+ * a compaction thread. (called under page lock)
+ * @mapping: the page->mapping which will be assigned to the new migrated page.
+ * @newpage: page that will replace the isolated page after migration finishes.
+ * @page : the isolated (old) page that is about to be migrated to newpage.
+ * @mode : compaction mode -- not used for balloon page migration.
+ *
+ * After a ballooned page gets isolated by compaction procedures, this is the
+ * function that performs the page migration on behalf of a compaction thread.
+ * The page migration for virtio balloon is done in a simple swap fashion which
+ * follows these two macro steps:
+ * 1) insert newpage into vb->pages list and update the host about it;
+ * 2) update the host about the old page removed from vb->pages list;
+ *
+ * This function performs the balloon page migration task.
+ * Called through balloon_mapping->a_ops->migratepage
+ */
+int virtballoon_migratepage(struct address_space *mapping,
+ struct page *newpage, struct page *page, enum migrate_mode mode)
+{
+ struct balloon_dev_info *vb_dev_info = balloon_page_device(page);
+ struct virtio_balloon *vb;
+ unsigned long flags;
+
+ BUG_ON(!vb_dev_info);
+
+ vb = vb_dev_info->balloon_device;
+
+ /*
+ * In order to avoid lock contention while migrating pages concurrently
+ * to leak_balloon() or fill_balloon() we just give up the balloon_lock
+ * this turn, as it is easier to retry the page migration later.
+ * This also prevents fill_balloon() from getting stuck in a mutex
+ * recursion in case it ends up triggering memory compaction
+ * while it is attempting to inflate the balloon.
+ */
+ if (!mutex_trylock(&vb->balloon_lock))
+ return -EAGAIN;
+
+ /* balloon's page migration 1st step -- inflate "newpage" */
+ spin_lock_irqsave(&vb_dev_info->pages_lock, flags);
+ balloon_page_insert(newpage, mapping, &vb_dev_info->pages);
+ vb_dev_info->isolated_pages--;
+ spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
+ vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+ set_page_pfns(vb->pfns, newpage);
+ tell_host(vb, vb->inflate_vq);
+
+ /*
+ * balloon's page migration 2nd step -- deflate "page"
+ *
+ * It's safe to delete page->lru here because this page is on
+ * an isolated migration list, and this step is expected to happen here
+ */
+ balloon_page_delete(page);
+ vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+ set_page_pfns(vb->pfns, page);
+ tell_host(vb, vb->deflate_vq);
+
+ mutex_unlock(&vb->balloon_lock);
+
+ return MIGRATEPAGE_BALLOON_SUCCESS;
+}
+
+/* define the balloon_mapping->a_ops callback to allow balloon page migration */
+static const struct address_space_operations virtio_balloon_aops = {
+ .migratepage = virtballoon_migratepage,
+};
+#endif /* CONFIG_BALLOON_COMPACTION */
+
static int virtballoon_probe(struct virtio_device *vdev)
{
struct virtio_balloon *vb;
+ struct address_space *vb_mapping;
+ struct balloon_dev_info *vb_devinfo;
int err;

vdev->priv = vb = kmalloc(sizeof(*vb), GFP_KERNEL);
@@ -349,16 +437,37 @@ static int virtballoon_probe(struct virtio_device *vdev)
goto out;
}

- INIT_LIST_HEAD(&vb->pages);
vb->num_pages = 0;
+ mutex_init(&vb->balloon_lock);
init_waitqueue_head(&vb->config_change);
init_waitqueue_head(&vb->acked);
vb->vdev = vdev;
vb->need_stats_update = 0;

+ vb_devinfo = balloon_devinfo_alloc(vb);
+ if (IS_ERR(vb_devinfo)) {
+ err = PTR_ERR(vb_devinfo);
+ goto out_free_vb;
+ }
+
+ vb_mapping = balloon_mapping_alloc(vb_devinfo,
+ (balloon_compaction_check()) ?
+ &virtio_balloon_aops : NULL);
+ if (IS_ERR(vb_mapping)) {
+ /*
+ * IS_ERR(vb_mapping) && PTR_ERR(vb_mapping) == -EOPNOTSUPP
+ * simply means !CONFIG_BALLOON_COMPACTION; any other error is fatal.
+ */
+ err = PTR_ERR(vb_mapping);
+ if (err != -EOPNOTSUPP)
+ goto out_free_vb_devinfo;
+ }
+
+ vb->vb_dev_info = vb_devinfo;
+
err = init_vqs(vb);
if (err)
- goto out_free_vb;
+ goto out_free_vb_mapping;

vb->thread = kthread_run(balloon, vb, "vballoon");
if (IS_ERR(vb->thread)) {
@@ -370,6 +479,10 @@ static int virtballoon_probe(struct virtio_device *vdev)

out_del_vqs:
vdev->config->del_vqs(vdev);
+out_free_vb_mapping:
+ balloon_mapping_free(vb_mapping);
+out_free_vb_devinfo:
+ balloon_devinfo_free(vb_devinfo);
out_free_vb:
kfree(vb);
out:
@@ -395,6 +508,8 @@ static void __devexit virtballoon_remove(struct virtio_device *vdev)

kthread_stop(vb->thread);
remove_common(vb);
+ balloon_mapping_free(vb->vb_dev_info->mapping);
+ balloon_devinfo_free(vb->vb_dev_info);
kfree(vb);
}

--
1.7.11.7

2012-11-11 19:02:20

by Rafael Aquini

[permalink] [raw]
Subject: [PATCH v12 4/7] mm: introduce compaction and migration for ballooned pages

Memory fragmentation introduced by ballooning might reduce significantly
the number of 2MB contiguous memory blocks that can be used within a guest,
thus imposing performance penalties associated with the reduced number of
transparent huge pages that could be used by the guest workload.

This patch introduces the helper functions as well as the necessary changes
to teach compaction and migration bits how to cope with pages which are
part of a guest memory balloon, in order to make them movable by memory
compaction procedures.

Signed-off-by: Rafael Aquini <[email protected]>
Acked-by: Mel Gorman <[email protected]>
---
mm/compaction.c | 21 +++++++++++++++++++--
mm/migrate.c | 34 ++++++++++++++++++++++++++++++++--
2 files changed, 51 insertions(+), 4 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 9eef558..76abd84 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -14,6 +14,7 @@
#include <linux/backing-dev.h>
#include <linux/sysctl.h>
#include <linux/sysfs.h>
+#include <linux/balloon_compaction.h>
#include "internal.h"

#if defined CONFIG_COMPACTION || defined CONFIG_CMA
@@ -565,9 +566,24 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
goto next_pageblock;
}

- /* Check may be lockless but that's ok as we recheck later */
- if (!PageLRU(page))
+ /*
+ * Check may be lockless but that's ok as we recheck later.
+ * It's possible to migrate LRU pages and balloon pages;
+ * skip any other type of page.
+ */
+ if (!PageLRU(page)) {
+ if (unlikely(balloon_page_movable(page))) {
+ if (locked && balloon_page_isolate(page)) {
+ /* Successfully isolated */
+ cc->finished_update_migrate = true;
+ list_add(&page->lru, migratelist);
+ cc->nr_migratepages++;
+ nr_isolated++;
+ goto check_compact_cluster;
+ }
+ }
continue;
+ }

/*
* PageLRU is set. lru_lock normally excludes isolation
@@ -621,6 +637,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
cc->nr_migratepages++;
nr_isolated++;

+check_compact_cluster:
/* Avoid isolating too much */
if (cc->nr_migratepages == COMPACT_CLUSTER_MAX) {
++low_pfn;
diff --git a/mm/migrate.c b/mm/migrate.c
index 6f408c7..a771751 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -35,6 +35,7 @@
#include <linux/hugetlb.h>
#include <linux/hugetlb_cgroup.h>
#include <linux/gfp.h>
+#include <linux/balloon_compaction.h>

#include <asm/tlbflush.h>

@@ -79,7 +80,10 @@ void putback_lru_pages(struct list_head *l)
list_del(&page->lru);
dec_zone_page_state(page, NR_ISOLATED_ANON +
page_is_file_cache(page));
- putback_lru_page(page);
+ if (unlikely(balloon_page_movable(page)))
+ balloon_page_putback(page);
+ else
+ putback_lru_page(page);
}
}

@@ -778,6 +782,18 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
}
}

+ if (unlikely(balloon_page_movable(page))) {
+ /*
+ * A ballooned page does not need any special attention from
+ * physical to virtual reverse mapping procedures.
+ * Skip any attempt to unmap PTEs or to remap swap cache,
+ * in order to avoid burning cycles at rmap level, and perform
+ * the page migration right away (protected by page lock).
+ */
+ rc = balloon_page_migrate(newpage, page, mode);
+ goto uncharge;
+ }
+
/*
* Corner case handling:
* 1. When a new swap-cache page is read into, it is added to the LRU
@@ -814,7 +830,9 @@ skip_unmap:
put_anon_vma(anon_vma);

uncharge:
- mem_cgroup_end_migration(mem, page, newpage, rc == MIGRATEPAGE_SUCCESS);
+ mem_cgroup_end_migration(mem, page, newpage,
+ (rc == MIGRATEPAGE_SUCCESS ||
+ rc == MIGRATEPAGE_BALLOON_SUCCESS));
unlock:
unlock_page(page);
out:
@@ -846,6 +864,18 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
goto out;

rc = __unmap_and_move(page, newpage, force, offlining, mode);
+
+ if (unlikely(rc == MIGRATEPAGE_BALLOON_SUCCESS)) {
+ /*
+ * A ballooned page has been migrated already.
+ * Now it's time to wrap up the counters,
+ * hand the page back to the buddy allocator and return.
+ */
+ dec_zone_page_state(page, NR_ISOLATED_ANON +
+ page_is_file_cache(page));
+ balloon_page_free(page);
+ return MIGRATEPAGE_SUCCESS;
+ }
out:
if (rc != -EAGAIN) {
/*
--
1.7.11.7

2012-11-11 19:02:53

by Rafael Aquini

[permalink] [raw]
Subject: [PATCH v12 7/7] mm: add vm event counters for balloon pages compaction

This patch introduces a new set of vm event counters to keep track of
balloon page compaction activity.

Signed-off-by: Rafael Aquini <[email protected]>
---
include/linux/balloon_compaction.h | 7 +++++++
include/linux/vm_event_item.h | 7 ++++++-
mm/balloon_compaction.c | 2 ++
mm/migrate.c | 1 +
mm/vmstat.c | 9 ++++++++-
5 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h
index 2e63d94..68893bc 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -197,8 +197,15 @@ static inline bool balloon_compaction_check(void)
return true;
}

+static inline void balloon_event_count(enum vm_event_item item)
+{
+ count_vm_event(item);
+}
#else /* !CONFIG_BALLOON_COMPACTION */

+/* A macro, to avoid generating references to the undefined COMPACTBALLOON* */
+#define balloon_event_count(item) do { } while (0)
+
static inline void *balloon_mapping_alloc(void *balloon_device,
const struct address_space_operations *a_ops)
{
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 3d31145..bd67c3f 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -41,7 +41,12 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
#ifdef CONFIG_COMPACTION
COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
-#endif
+#ifdef CONFIG_BALLOON_COMPACTION
+ COMPACTBALLOONISOLATED, /* isolated from balloon pagelist */
+ COMPACTBALLOONMIGRATED, /* balloon page successfully migrated */
+ COMPACTBALLOONRETURNED, /* putback to pagelist, not-migrated */
+#endif /* CONFIG_BALLOON_COMPACTION */
+#endif /* CONFIG_COMPACTION */
#ifdef CONFIG_HUGETLB_PAGE
HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
#endif
diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
index 07dbc8e..2c8ce49 100644
--- a/mm/balloon_compaction.c
+++ b/mm/balloon_compaction.c
@@ -242,6 +242,7 @@ bool balloon_page_isolate(struct page *page)
if (__is_movable_balloon_page(page) &&
page_count(page) == 2) {
__isolate_balloon_page(page);
+ balloon_event_count(COMPACTBALLOONISOLATED);
unlock_page(page);
return true;
}
@@ -265,6 +266,7 @@ void balloon_page_putback(struct page *page)
__putback_balloon_page(page);
/* drop the extra ref count taken for page isolation */
put_page(page);
+ balloon_event_count(COMPACTBALLOONRETURNED);
} else {
WARN_ON(1);
dump_page(page);
diff --git a/mm/migrate.c b/mm/migrate.c
index 107a281..ecae213 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -894,6 +894,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
dec_zone_page_state(page, NR_ISOLATED_ANON +
page_is_file_cache(page));
balloon_page_free(page);
+ balloon_event_count(COMPACTBALLOONMIGRATED);
return MIGRATEPAGE_SUCCESS;
}
out:
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c737057..18a76ea 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -781,7 +781,14 @@ const char * const vmstat_text[] = {
"compact_stall",
"compact_fail",
"compact_success",
-#endif
+
+#ifdef CONFIG_BALLOON_COMPACTION
+ "compact_balloon_isolated",
+ "compact_balloon_migrated",
+ "compact_balloon_returned",
+#endif /* CONFIG_BALLOON_COMPACTION */
+
+#endif /* CONFIG_COMPACTION */

#ifdef CONFIG_HUGETLB_PAGE
"htlb_buddy_alloc_success",
--
1.7.11.7

2012-11-11 19:03:15

by Rafael Aquini

[permalink] [raw]
Subject: [PATCH v12 6/7] mm: introduce putback_movable_pages()

The PATCH "mm: introduce compaction and migration for virtio ballooned pages"
hacks around putback_lru_pages() in order to allow ballooned pages to be
re-inserted into the balloon page list as if they were LRU pages.

As ballooned pages are not legitimate LRU pages, this patch introduces
putback_movable_pages() to properly cope with cases where the isolated
pageset contains both ballooned pages and LRU pages, thus fixing the
mentioned inelegant hack around putback_lru_pages().

Signed-off-by: Rafael Aquini <[email protected]>
---
include/linux/migrate.h | 2 ++
mm/compaction.c | 6 +++---
mm/migrate.c | 20 ++++++++++++++++++++
mm/page_alloc.c | 2 +-
4 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 4ce2ee9..42fafb4 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -27,6 +27,7 @@ typedef struct page *new_page_t(struct page *, unsigned long private, int **);
#ifdef CONFIG_MIGRATION

extern void putback_lru_pages(struct list_head *l);
+extern void putback_movable_pages(struct list_head *l);
extern int migrate_page(struct address_space *,
struct page *, struct page *, enum migrate_mode);
extern int migrate_pages(struct list_head *l, new_page_t x,
@@ -51,6 +52,7 @@ extern int migrate_misplaced_page(struct page *page, int node);
#else

static inline void putback_lru_pages(struct list_head *l) {}
+static inline void putback_movable_pages(struct list_head *l) {}
static inline int migrate_pages(struct list_head *l, new_page_t x,
unsigned long private, bool offlining,
enum migrate_mode mode) { return -ENOSYS; }
diff --git a/mm/compaction.c b/mm/compaction.c
index 76abd84..f268bd8 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -995,7 +995,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
switch (isolate_migratepages(zone, cc)) {
case ISOLATE_ABORT:
ret = COMPACT_PARTIAL;
- putback_lru_pages(&cc->migratepages);
+ putback_movable_pages(&cc->migratepages);
cc->nr_migratepages = 0;
goto out;
case ISOLATE_NONE:
@@ -1018,9 +1018,9 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
trace_mm_compaction_migratepages(nr_migrate - nr_remaining,
nr_remaining);

- /* Release LRU pages not migrated */
+ /* Release isolated pages not migrated */
if (err) {
- putback_lru_pages(&cc->migratepages);
+ putback_movable_pages(&cc->migratepages);
cc->nr_migratepages = 0;
if (err == -ENOMEM) {
ret = COMPACT_PARTIAL;
diff --git a/mm/migrate.c b/mm/migrate.c
index a771751..107a281 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -80,6 +80,26 @@ void putback_lru_pages(struct list_head *l)
list_del(&page->lru);
dec_zone_page_state(page, NR_ISOLATED_ANON +
page_is_file_cache(page));
+ putback_lru_page(page);
+ }
+}
+
+/*
+ * Put previously isolated pages back onto the appropriate lists
+ * from where they were once taken off for compaction/migration.
+ *
+ * This function shall be used instead of putback_lru_pages(),
+ * whenever the isolated pageset has been built by isolate_migratepages_range()
+ */
+void putback_movable_pages(struct list_head *l)
+{
+ struct page *page;
+ struct page *page2;
+
+ list_for_each_entry_safe(page, page2, l, lru) {
+ list_del(&page->lru);
+ dec_zone_page_state(page, NR_ISOLATED_ANON +
+ page_is_file_cache(page));
if (unlikely(balloon_page_movable(page)))
balloon_page_putback(page);
else
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8fd1919..6b990cb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5795,7 +5795,7 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
0, false, MIGRATE_SYNC);
}

- putback_lru_pages(&cc->migratepages);
+ putback_movable_pages(&cc->migratepages);
return ret > 0 ? 0 : ret;
}

--
1.7.11.7

2012-11-11 19:03:33

by Rafael Aquini

[permalink] [raw]
Subject: [PATCH v12 2/7] mm: redefine address_space.assoc_mapping

This patch overhauls struct address_space.assoc_mapping, renaming it to
address_space.private_data and redefining its type as void *.
This approach consistently names the .private_* elements of
struct address_space, and also allows an address_space to be associated
with other data structures through ->private_data.

Also, all users of the old ->assoc_mapping element are converted to
reflect its new name and type change (->private_data).

Signed-off-by: Rafael Aquini <[email protected]>
---
fs/buffer.c | 12 ++++++------
fs/gfs2/glock.c | 2 +-
fs/inode.c | 2 +-
fs/nilfs2/page.c | 2 +-
include/linux/fs.h | 2 +-
5 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index b5f0442..e0bad95 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -555,7 +555,7 @@ void emergency_thaw_all(void)
*/
int sync_mapping_buffers(struct address_space *mapping)
{
- struct address_space *buffer_mapping = mapping->assoc_mapping;
+ struct address_space *buffer_mapping = mapping->private_data;

if (buffer_mapping == NULL || list_empty(&mapping->private_list))
return 0;
@@ -588,10 +588,10 @@ void mark_buffer_dirty_inode(struct buffer_head *bh, struct inode *inode)
struct address_space *buffer_mapping = bh->b_page->mapping;

mark_buffer_dirty(bh);
- if (!mapping->assoc_mapping) {
- mapping->assoc_mapping = buffer_mapping;
+ if (!mapping->private_data) {
+ mapping->private_data = buffer_mapping;
} else {
- BUG_ON(mapping->assoc_mapping != buffer_mapping);
+ BUG_ON(mapping->private_data != buffer_mapping);
}
if (!bh->b_assoc_map) {
spin_lock(&buffer_mapping->private_lock);
@@ -788,7 +788,7 @@ void invalidate_inode_buffers(struct inode *inode)
if (inode_has_buffers(inode)) {
struct address_space *mapping = &inode->i_data;
struct list_head *list = &mapping->private_list;
- struct address_space *buffer_mapping = mapping->assoc_mapping;
+ struct address_space *buffer_mapping = mapping->private_data;

spin_lock(&buffer_mapping->private_lock);
while (!list_empty(list))
@@ -811,7 +811,7 @@ int remove_inode_buffers(struct inode *inode)
if (inode_has_buffers(inode)) {
struct address_space *mapping = &inode->i_data;
struct list_head *list = &mapping->private_list;
- struct address_space *buffer_mapping = mapping->assoc_mapping;
+ struct address_space *buffer_mapping = mapping->private_data;

spin_lock(&buffer_mapping->private_lock);
while (!list_empty(list)) {
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 6114571..904a808 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -766,7 +766,7 @@ int gfs2_glock_get(struct gfs2_sbd *sdp, u64 number,
mapping->host = s->s_bdev->bd_inode;
mapping->flags = 0;
mapping_set_gfp_mask(mapping, GFP_NOFS);
- mapping->assoc_mapping = NULL;
+ mapping->private_data = NULL;
mapping->backing_dev_info = s->s_bdi;
mapping->writeback_index = 0;
}
diff --git a/fs/inode.c b/fs/inode.c
index b03c719..4cac8e1 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -165,7 +165,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
mapping->host = inode;
mapping->flags = 0;
mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
- mapping->assoc_mapping = NULL;
+ mapping->private_data = NULL;
mapping->backing_dev_info = &default_backing_dev_info;
mapping->writeback_index = 0;

diff --git a/fs/nilfs2/page.c b/fs/nilfs2/page.c
index 3e7b2a0..07f76db 100644
--- a/fs/nilfs2/page.c
+++ b/fs/nilfs2/page.c
@@ -431,7 +431,7 @@ void nilfs_mapping_init(struct address_space *mapping, struct inode *inode,
mapping->host = inode;
mapping->flags = 0;
mapping_set_gfp_mask(mapping, GFP_NOFS);
- mapping->assoc_mapping = NULL;
+ mapping->private_data = NULL;
mapping->backing_dev_info = bdi;
mapping->a_ops = &empty_aops;
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b33cfc9..0982565 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -418,7 +418,7 @@ struct address_space {
struct backing_dev_info *backing_dev_info; /* device readahead, etc */
spinlock_t private_lock; /* for use by the address_space */
struct list_head private_list; /* ditto */
- struct address_space *assoc_mapping; /* ditto */
+ void *private_data; /* ditto */
} __attribute__((aligned(sizeof(long))));
/*
* On most architectures that alignment is already the case; but
--
1.7.11.7

2012-11-11 19:18:52

by Rafael Aquini

[permalink] [raw]
Subject: Re: [PATCH v12 0/7] make balloon pages movable by compaction

On Sun, Nov 11, 2012 at 05:01:13PM -0200, Rafael Aquini wrote:
> Change log:
> v12:
> * Address last suggestions on sorting the barriers usage out (Mel Gorman);
> * Fix reported build breakages for CONFIG_BALLOON_COMPACTION=n (Andrew Morton);
> * Enhance commentary on the locking scheme used for balloon page compaction;
> * Move all the 'balloon vm_event counter' changes to PATCH 07;

Andrew,

could you drop the earlier review (v11) and pick this submission (v12) instead,
please?


Aside: before submitting I rebased the series on -next-20121109, after reverting
the v11 commits, to make those clean-up hunks from PATCH 01 apply smoothly.

-- Rafael

2012-11-17 18:01:59

by Sasha Levin

[permalink] [raw]
Subject: Re: [PATCH v12 4/7] mm: introduce compaction and migration for ballooned pages

Hi guys,

On 11/11/2012 02:01 PM, Rafael Aquini wrote:
> Memory fragmentation introduced by ballooning might reduce significantly
> the number of 2MB contiguous memory blocks that can be used within a guest,
> thus imposing performance penalties associated with the reduced number of
> transparent huge pages that could be used by the guest workload.
>
> This patch introduces the helper functions as well as the necessary changes
> to teach compaction and migration bits how to cope with pages which are
> part of a guest memory balloon, in order to make them movable by memory
> compaction procedures.
>
> Signed-off-by: Rafael Aquini <[email protected]>
> Acked-by: Mel Gorman <[email protected]>
> ---

I'm getting the following while fuzzing using trinity inside a KVM tools guest,
on latest -next:

[ 1642.783728] BUG: unable to handle kernel NULL pointer dereference at 0000000000000194
[ 1642.785083] IP: [<ffffffff8122b354>] isolate_migratepages_range+0x344/0x7b0
[ 1642.786061] PGD 39e80067 PUD 39f6d067 PMD 0
[ 1642.786766] Oops: 0000 [#3] PREEMPT SMP DEBUG_PAGEALLOC
[ 1642.787587] CPU 0
[ 1642.787884] Pid: 8522, comm: trinity-child5 Tainted: G D W 3.7.0-rc5-next-20121115-sasha-00013-g37271d3 #154
[ 1642.789483] RIP: 0010:[<ffffffff8122b354>] [<ffffffff8122b354>] isolate_migratepages_range+0x344/0x7b0
[ 1642.790016] RSP: 0018:ffff880039d998d8 EFLAGS: 00010202
[ 1642.790016] RAX: 0000000000000054 RBX: ffffea0000fd5480 RCX: 00000000000001fa
[ 1642.790016] RDX: 0000000080490049 RSI: 0000000000000004 RDI: ffff880039d99a20
[ 1642.790016] RBP: ffff880039d99978 R08: 0000000000000001 R09: 0000000000000000
[ 1642.790016] R10: ffff88003bfcefc0 R11: 0000000000000001 R12: 000000000003f552
[ 1642.790016] R13: 0000000000000153 R14: 0000000000000000 R15: ffff88004ffd2000
[ 1642.790016] FS: 00007ff799de5700(0000) GS:ffff880013600000(0000) knlGS:0000000000000000
[ 1642.790016] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1642.790016] CR2: 0000000000000194 CR3: 00000000369ef000 CR4: 00000000000406f0
[ 1642.790016] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1642.790016] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1642.790016] Process trinity-child5 (pid: 8522, threadinfo ffff880039d98000, task ffff8800298e8000)
[ 1642.790016] Stack:
[ 1642.790016] 00000000000001fa ffff880039d99a78 ffff8800298e8000 ffff880039d99a30
[ 1642.790016] 0000000000000000 0000000000000000 0000000000000000 ffff880039d99fd8
[ 1642.790016] ffff880039d99a20 ffffea0000fd0000 ffff880039d99a60 000000000003f600
[ 1642.790016] Call Trace:
[ 1642.790016] [<ffffffff8122ae10>] ? isolate_freepages+0x1f0/0x1f0
[ 1642.790016] [<ffffffff8122bc3e>] compact_zone+0x3ce/0x490
[ 1642.790016] [<ffffffff81185afb>] ? __lock_acquired+0x3b/0x360
[ 1642.790016] [<ffffffff8122bfee>] compact_zone_order+0xde/0x120
[ 1642.790016] [<ffffffff819f4bb8>] ? do_raw_spin_unlock+0xc8/0xe0
[ 1642.790016] [<ffffffff8122c0ee>] try_to_compact_pages+0xbe/0x110
[ 1642.790016] [<ffffffff83b53b4d>] __alloc_pages_direct_compact+0xba/0x206
[ 1642.790016] [<ffffffff8118f059>] ? on_each_cpu_mask+0xd9/0x110
[ 1642.790016] [<ffffffff8120e3ef>] __alloc_pages_nodemask+0x92f/0xbc0
[ 1642.790016] [<ffffffff8125284c>] alloc_pages_vma+0xfc/0x150
[ 1642.790016] [<ffffffff812699c1>] do_huge_pmd_anonymous_page+0x191/0x2b0
[ 1642.790016] [<ffffffff81137394>] ? __rcu_read_unlock+0x44/0xb0
[ 1642.790016] [<ffffffff812339f9>] handle_mm_fault+0x1c9/0x350
[ 1642.790016] [<ffffffff81234068>] __get_user_pages+0x418/0x5f0
[ 1642.790016] [<ffffffff81235bdc>] ? do_mlock_pages+0x8c/0x160
[ 1642.790016] [<ffffffff81235b43>] __mlock_vma_pages_range+0xb3/0xc0
[ 1642.790016] [<ffffffff81235c39>] do_mlock_pages+0xe9/0x160
[ 1642.790016] [<ffffffff812366e0>] sys_mlockall+0x160/0x1a0
[ 1642.790016] [<ffffffff83c1abd8>] tracesys+0xe1/0xe6
[ 1642.790016] Code: a9 00 00 01 00 0f 85 6c 02 00 00 48 8b 43 08 a8 01 0f 85 60 02 00 00 8b 53 18 85 d2 0f 89 55 02 00 00 48 85 c0
0f 84 4c 02 00 00 <48> 8b 80 40 01 00 00 a9 00 00 00 20 0f 84 3a 02 00 00 45 84 f6
[ 1642.790016] RIP [<ffffffff8122b354>] isolate_migratepages_range+0x344/0x7b0
[ 1642.790016] RSP <ffff880039d998d8>
[ 1642.790016] CR2: 0000000000000194
[ 1643.398013] ---[ end trace 0ad6459aa62f5d72 ]---

My guess is that we see those because of a race during the check in
isolate_migratepages_range().


Thanks,
Sasha

2012-11-17 21:55:02

by Rafael Aquini

Subject: Re: [PATCH v12 4/7] mm: introduce compaction and migration for ballooned pages

On Sat, Nov 17, 2012 at 01:01:30PM -0500, Sasha Levin wrote:
>
> I'm getting the following while fuzzing using trinity inside a KVM tools guest,
> on latest -next:
>
> [ 1642.783728] BUG: unable to handle kernel NULL pointer dereference at 0000000000000194
> [ 1642.785083] IP: [<ffffffff8122b354>] isolate_migratepages_range+0x344/0x7b0
>
> My guess is that we see those because of a race during the check in
> isolate_migratepages_range().
>
>
> Thanks,
> Sasha

Sasha, could you share your .config and the steps you used with trinity, so I
can attempt to reproduce the issue you reported?

Thanks,
Rafael

2012-11-18 15:00:09

by Sasha Levin

Subject: Re: [PATCH v12 4/7] mm: introduce compaction and migration for ballooned pages

On Sat, Nov 17, 2012 at 4:54 PM, Rafael Aquini <[email protected]> wrote:
> On Sat, Nov 17, 2012 at 01:01:30PM -0500, Sasha Levin wrote:
>>
>> I'm getting the following while fuzzing using trinity inside a KVM tools guest,
>> on latest -next:
>>
>> [ 1642.783728] BUG: unable to handle kernel NULL pointer dereference at 0000000000000194
>> [ 1642.785083] IP: [<ffffffff8122b354>] isolate_migratepages_range+0x344/0x7b0
>>
>> My guess is that we see those because of a race during the check in
>> isolate_migratepages_range().
>>
>>
>> Thanks,
>> Sasha
>
> Sasha, could you share your .config and the steps you used with trinity, so I
> can attempt to reproduce the issue you reported?

Basically try running trinity (with ./trinity -m --quiet --dangerous
-l off) inside a disposable guest as root.

I manage to hit that every couple of hours.

Config attached.


Thanks,
Sasha

Attachments:
config-sasha (133.95 kB)

2012-11-20 14:15:01

by Rafael Aquini

Subject: Re: [PATCH v12 4/7] mm: introduce compaction and migration for ballooned pages

On Sun, Nov 18, 2012 at 09:59:47AM -0500, Sasha Levin wrote:
> On Sat, Nov 17, 2012 at 4:54 PM, Rafael Aquini <[email protected]> wrote:
> > On Sat, Nov 17, 2012 at 01:01:30PM -0500, Sasha Levin wrote:
> >>
> >> I'm getting the following while fuzzing using trinity inside a KVM tools guest,
> >> on latest -next:
> >>
> >> [ 1642.783728] BUG: unable to handle kernel NULL pointer dereference at 0000000000000194
> >> [ 1642.785083] IP: [<ffffffff8122b354>] isolate_migratepages_range+0x344/0x7b0
> >>
> >> My guess is that we see those because of a race during the check in
> >> isolate_migratepages_range().
> >>
> >>
> >> Thanks,
> >> Sasha
> >
> > Sasha, could you share your .config and the steps you used with trinity, so I
> > can attempt to reproduce the issue you reported?
>
> Basically try running trinity (with ./trinity -m --quiet --dangerous
> -l off) inside a disposable guest as root.
>
> I manage to hit that every couple of hours.
>
> Config attached.
>

Howdy Sasha,

After several hours of running trinity tests since last Sunday, on a traditional
KVM-QEMU guest as well as on a lkvm guest (both running next-20121115), I
couldn't hit the crash you reported a single time, (un)fortunately.

Also, the .config you gave me, applied on top of next-20121115, apparently
hasn't produced the same binary you've been running and hitting the mentioned
bug with.

Here's the RIP for your crash:
[ 1642.783728] BUG: unable to handle kernel NULL pointer dereference at
0000000000000194
[ 1642.785083] IP: [<ffffffff8122b354>] isolate_migratepages_range+0x344/0x7b0


And here's the symbol address for the next-20121115 with your .config I've been
running tests on:
[raquini@x61 linux]$ nm -n vmlinux | grep isolate_migratepages_range
ffffffff8122d890 T isolate_migratepages_range

Also, it seems quite clear I'm missing something from your tree, as applying the
RIP displacement (0x344) to my local isolate_migratepages_range symbol address
leads me to the _middle_ of an instruction opcode that does not dereference any
pointers at all.
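That base-plus-displacement lookup is just arithmetic, and can be sketched as a
couple of shell commands; this is illustrative only, assuming bash and a
vmlinux that matches the running kernel, with the addresses taken from this
thread:

```shell
# Map "isolate_migratepages_range+0x344" from the oops back to an address.
BASE=0xffffffff8122d890   # from: nm -n vmlinux | grep ' T isolate_migratepages_range$'
OFF=0x344                 # displacement reported in the oops RIP
ADDR=$(printf '0x%x' $((BASE + OFF)))
echo "$ADDR"
# The absolute address can then be mapped to a source line and disassembled:
#   addr2line -e vmlinux -f -i "$ADDR"
#   objdump -d vmlinux --start-address="$ADDR" --stop-address=$((ADDR + 0x40))
```

Note that both addresses must come from the exact same build; mismatched
binaries are precisely what went wrong in the comparison above.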

So, if you're consistently reproducing the same crash, please consider sharing
with us a disassembled dump of the isolate_migratepages_range() you're running,
along with the crash stack-dump.

Cheers!
-- Rafael

2012-11-21 01:18:36

by Sasha Levin

Subject: Re: [PATCH v12 4/7] mm: introduce compaction and migration for ballooned pages

On 11/20/2012 09:14 AM, Rafael Aquini wrote:
> On Sun, Nov 18, 2012 at 09:59:47AM -0500, Sasha Levin wrote:
>> [...]
>
> Howdy Sasha,
>
> After several hours of running trinity tests since last Sunday, on a traditional
> KVM-QEMU guest as well as on a lkvm guest (both running next-20121115), I
> couldn't hit the crash you reported a single time, (un)fortunately.

Odd... I can see it happening here every couple of hours.

> Also, the .config you gave me, applied on top of next-20121115, apparently
> hasn't produced the same binary you've been running and hitting the mentioned
> bug with.
>
> Here's the RIP for your crash:
> [ 1642.783728] BUG: unable to handle kernel NULL pointer dereference at
> 0000000000000194
> [ 1642.785083] IP: [<ffffffff8122b354>] isolate_migratepages_range+0x344/0x7b0
>
>
> And here's the symbol address for the next-20121115 with your .config I've been
> running tests on:
> [raquini@x61 linux]$ nm -n vmlinux | grep isolate_migratepages_range
> ffffffff8122d890 T isolate_migratepages_range
>
> Also, it seems quite clear I'm missing something from your tree, as applying the
> RIP displacement (0x344) to my local isolate_migratepages_range symbol address
> leads me to the _middle_ of an instruction opcode that does not dereference any
> pointers at all.

Yup, I carry another small fix to mpol (which is unrelated to this one).

> So, if you're consistently reproducing the same crash, please consider sharing
> with us a disassembled dump of the isolate_migratepages_range() you're running,
> along with the crash stack-dump.

Sure!

The call chain is:

isolate_migratepages_range
balloon_page_movable
__is_movable_balloon_page
mapping_balloon

mapping_balloon() crashes because it only checks that mapping is non-NULL (and
it is; it's usually a small garbage value like 0x50), and then dereferences it.

The relevant assembly is:

static inline int mapping_balloon(struct address_space *mapping)
{
return mapping && test_bit(AS_BALLOON_MAP, &mapping->flags);
17ab: 48 85 c0 test %rax,%rax
17ae: 0f 84 4c 02 00 00 je 1a00 <isolate_migratepages_range+0x590>
17b4: 48 8b 80 40 01 00 00 mov 0x140(%rax),%rax
17bb: a9 00 00 00 20 test $0x20000000,%eax
17c0: 0f 84 3a 02 00 00 je 1a00 <isolate_migratepages_range+0x590>

It dies on 17b4.
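To make that failure mode concrete, here is a hypothetical userspace sketch of
the check; the structure and the flag value only mirror the disassembly above,
not the real kernel headers:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-in for the kernel structure. */
struct address_space { unsigned long flags; };

#define AS_BALLOON_MAP_FLAG 0x20000000UL   /* the bit tested at 17bb above */

/* Mirrors the inline mapping_balloon() shown in the disassembly: the NULL
 * check does not protect against a garbage non-NULL value such as 0x50,
 * so the flags load that follows it can fault. */
static bool mapping_balloon(const struct address_space *mapping)
{
	return mapping && (mapping->flags & AS_BALLOON_MAP_FLAG);
}
```

Calling this with a bogus pointer like (struct address_space *)0x50 passes the
first operand of && and faults on the second, which is exactly the mov at 17b4.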

Let me know if you need anything else from me, I can also add debug code into the
kernel if it would help you...


Thanks,
Sasha

2012-11-22 18:38:08

by Rafael Aquini

Subject: Re: [PATCH v12 4/7] mm: introduce compaction and migration for ballooned pages

On Thu, Nov 22, 2012 at 09:19:15AM -0500, Sasha Levin wrote:
> And managed to reproduce it only once through last night, here is the dump I got
> before the oops:
>
> [ 2760.356820] page:ffffea0000d00e00 count:1 mapcount:-2147287036 mapping:00000000000004f4 index:0xd00e00000003
> [ 2760.362354] page flags: 0x350000000001800(private|private_2)
>

Thanks a lot for following up on this one, Sasha.


We're stumbling across a private page -- it seems something in your setup
exercises this particular case, and that's probably why I'm not seeing the same
here.

Regardless of whether it is a particular case or not, we shouldn't be poking at
that private page, so I figure the tests I'm doing at balloon_page_movable() are
incomplete and dumb.

Perhaps a better way to proceed here would be ensuring the NR_PAGEFLAGS
rightmost bits of page->flags are all cleared, as this is the state a page
coming from buddy to the balloon list will be in, and the state its flags will
be kept in for as long as it lives as a balloon page (we don't play with any
flags at the balloon level).


Here goes what I'll propose once you confirm it doesn't trigger your crash
anymore, as it simplifies the code and reduces the battery of tests at
balloon_page_movable() -- ballooned pages always have no flags set, a refcount
of 1 and a mapcount of 0.


Could you give this a try?

Thank you!

---
diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h
index e339dd9..44ad50f 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -101,6 +101,12 @@ static inline bool __is_movable_balloon_page(struct page *page)
 	return mapping_balloon(mapping);
 }
 
+#define PAGE_FLAGS_MASK ((1UL << NR_PAGEFLAGS) - 1)
+static inline bool __balloon_page_flags(struct page *page)
+{
+	return page->flags & PAGE_FLAGS_MASK ? false : true;
+}
+
 /*
  * balloon_page_movable - test page->mapping->flags to identify balloon pages
  * that can be moved by compaction/migration.
@@ -121,8 +127,8 @@ static inline bool balloon_page_movable(struct page *page)
 	 * Before dereferencing and testing mapping->flags, lets make sure
 	 * this is not a page that uses ->mapping in a different way
 	 */
-	if (!PageSlab(page) && !PageSwapCache(page) && !PageAnon(page) &&
-	    !page_mapped(page))
+	if (__balloon_page_flags(page) && !page_mapped(page) &&
+	    page_count(page) == 1)
 		return __is_movable_balloon_page(page);
 
 	return false;
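The new predicate is a pure bit test, so its logic can be exercised in
isolation; in this userspace sketch the NR_PAGEFLAGS value and the sample bits
are purely illustrative (the real values depend on the kernel configuration):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative value; in the kernel NR_PAGEFLAGS is config-dependent. */
#define NR_PAGEFLAGS 25
#define PAGE_FLAGS_MASK ((1UL << NR_PAGEFLAGS) - 1)

struct page { unsigned long flags; };

/* True only when every one of the NR_PAGEFLAGS low bits is clear, i.e.
 * the state a page is in when it comes from buddy to the balloon list. */
static bool __balloon_page_flags(struct page *page)
{
	return page->flags & PAGE_FLAGS_MASK ? false : true;
}
```

A page with any low flag bit set (like the private|private_2 page from the
dump) is now rejected before ->mapping is ever dereferenced.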

2012-11-22 19:03:56

by Rafael Aquini

Subject: Re: [PATCH v12 4/7] mm: introduce compaction and migration for ballooned pages

On Tue, Nov 20, 2012 at 08:18:04PM -0500, Sasha Levin wrote:
> [...]
> > Also, it seems quite clear I'm missing something from your tree, as applying the
> > RIP displacement (0x344) to my local isolate_migratepages_range symbol address
> > leads me to the _middle_ of an instruction opcode that does not dereference any
> > pointers at all.
>
> Yup, I carry another small fix to mpol (which is unrelated to this one).
>
> > So, if you're consistently reproducing the same crash, consider to share with us
> > a disassembled dump from the isolate_migratepages_range() you're running along
> > with the crash stack-dump, please.
>
> Sure!
>
> The call chain is:
>
> isolate_migratepages_range
> balloon_page_movable
> __is_movable_balloon_page
> mapping_balloon
>
> mapping_balloon() crashes because it only checks that mapping is non-NULL (and
> it is; it's usually a small garbage value like 0x50), and then dereferences it.
>
> The relevant assembly is:
>
> static inline int mapping_balloon(struct address_space *mapping)
> {
> return mapping && test_bit(AS_BALLOON_MAP, &mapping->flags);
> 17ab: 48 85 c0 test %rax,%rax
> 17ae: 0f 84 4c 02 00 00 je 1a00 <isolate_migratepages_range+0x590>
> 17b4: 48 8b 80 40 01 00 00 mov 0x140(%rax),%rax
> 17bb: a9 00 00 00 20 test $0x20000000,%eax
> 17c0: 0f 84 3a 02 00 00 je 1a00 <isolate_migratepages_range+0x590>
>
> It dies on 17b4.
>
> Let me know if you need anything else from me, I can also add debug code into the
> kernel if it would help you...
>

I'm still failing miserably at getting a reproducer, but I'm quite curious
about what kind of page is being left behind with this barbed wire at
->mapping.

This might be overkill on messages, but could you run a test with the following
patch? It might shed some light on this corner, as well as help me figure out a
cleverer/cleaner way to perform that test.

Thanks a lot for your attention here.

---
diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h
index e339dd9..d47ac78 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -122,9 +122,10 @@ static inline bool balloon_page_movable(struct page *page)
 	 * this is not a page that uses ->mapping in a different way
 	 */
 	if (!PageSlab(page) && !PageSwapCache(page) && !PageAnon(page) &&
-	    !page_mapped(page))
+	    !page_mapped(page)) {
+		dump_page(page);
 		return __is_movable_balloon_page(page);
-
+	}
 	return false;
 }

2012-11-22 19:32:28

by Sasha Levin

Subject: Re: [PATCH v12 4/7] mm: introduce compaction and migration for ballooned pages

On 11/21/2012 07:01 PM, Rafael Aquini wrote:
> [...]
>
> I'm still failing miserably at getting a reproducer, but I'm quite curious
> about what kind of page is being left behind with this barbed wire at
> ->mapping.
>
> This might be overkill on messages, but could you run a test with the
> following patch? It might shed some light on this corner, as well as help me
> figure out a cleverer/cleaner way to perform that test.
>
> Thanks a lot for your attention here.

I've modified it a bit to decrease the noise:

diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h
index e339dd9..7dfbce1 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -98,6 +98,8 @@ static inline bool __is_movable_balloon_page(struct page *page)
 	 */
 	struct address_space *mapping = ACCESS_ONCE(page->mapping);
 	smp_read_barrier_depends();
+	if (mapping && (unsigned long)mapping < 0x1000)
+		dump_page(page);
 	return mapping_balloon(mapping);
 }

I managed to reproduce it only once through last night; here is the dump I got
before the oops:

[ 2760.356820] page:ffffea0000d00e00 count:1 mapcount:-2147287036 mapping:00000000000004f4 index:0xd00e00000003
[ 2760.362354] page flags: 0x350000000001800(private|private_2)
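As an aside, those two flag names can be read straight off the low bits of the
dumped value; a small decoding sketch, where the bit positions follow the usual
enum pageflags layout of this era but are build-dependent:

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed bit positions (illustrative; check the build's enum pageflags). */
#define PG_locked    0
#define PG_private   11
#define PG_private_2 12

static bool flag_set(unsigned long flags, int bit)
{
	return (flags >> bit) & 1UL;
}
```

In the dumped value 0x350000000001800, the 0x1800 part is bits 11 and 12,
which is why dump_page() printed private|private_2.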


Thanks,
Sasha

2012-11-26 15:58:16

by Sasha Levin

Subject: Re: [PATCH v12 4/7] mm: introduce compaction and migration for ballooned pages

On Thu, Nov 22, 2012 at 10:10 AM, Rafael Aquini <[email protected]> wrote:
> On Thu, Nov 22, 2012 at 09:19:15AM -0500, Sasha Levin wrote:
>> And managed to reproduce it only once through last night, here is the dump I got
>> before the oops:
>>
>> [ 2760.356820] page:ffffea0000d00e00 count:1 mapcount:-2147287036 mapping:00000000000004f4 index:0xd00e00000003
>> [ 2760.362354] page flags: 0x350000000001800(private|private_2)
>>
>
> [...]
> Could you give this a try?
>
> Thank you!

Ran it for a while, no more BUGs, yay :)


Thanks,
Sasha