From: Martin Schwidefsky <[email protected]>
From: Hubertus Franke <[email protected]>
From: Himanshu Raj <[email protected]>
[patch 3/9] Guest page hinting: volatile page cache.
A new page state "volatile" is introduced that is used for clean,
uptodate page cache pages. The host can choose to discard volatile
pages as part of its vmscan operation instead of writing them to the
hosts paging device. This greatly reduces the i/o needed by the host
if it gets under memory pressure. The guest system doesn't notice
that a volatile page is gone until it tries to access the page or
tries to make it stable again. Then the guest needs to remove the
page from the cache. A guest access to a discarded page causes the
host to deliver a new kind of fault to the guest - the discard fault.
After the guest has removed the page it is reloaded from the backing
device.
The code added by this patch uses the volatile state for all page cache
pages, even for pages which are referenced by writable ptes. The host
needs to be able to check the dirty state of the pages. Since the host
doesn't know where the page table entries of the guest are located,
the volatile state as introduced by this patch is only usable on
architectures with per-page dirty bits (s390 only). For per-pte dirty
bit architectures some additional code is needed.
The interesting question is where to put the state transitions between
the volatile and the stable state. The simple solution is the make a
page stable whenever a lookup is done or a page reference is derived
from a page table entry. Attempts to make pages volatile are added at
strategic points. Now what are the conditions that prevent a page from
being made volatile? There are 10 conditions:
1) The page is reserved. Some sort of special page.
2) The page is marked dirty in the struct page. The page content is
more recent than the data on the backing device. The host cannot
access the linux internal dirty bit so the page needs to be stable.
3) The page is in writeback. The page content is needed for i/o.
4) The page is locked. Someone has exclusive access to the page.
5) The page is anonymous. Swap cache support needs additional code.
6) The page has no mapping. No backing, the page cannot be recreated.
7) The page is not uptodate.
8) The page has private information. try_to_release_page can fail,
e.g. in case the private information is journaling data. The discard
fault need to be able to remove the page.
9) The page is already discarded.
10) The page map count is not equal to the page reference count - 1.
The discard fault handler can remove the page cache reference and
all mappers of a page. It cannot remove the page reference for any
other user of the page.
The transitions to stable are done by find_get_pages() and its variants,
in get_user_pages if called with a pages parameter, by copy-on-write in
do_wp_page, and by the early copy-on-write in do_no_page. To make enough
pages discardable by the host an attempt to do the transition to volatile
state is done when a page gets unlocked (unlock_page), when writeback has
finished (test_clear_page_dirty), when the page reference counter is
decreased (put_page_testzero), and when the page map counter is increased
(page_add_file_rmap).
All page references acquired with find_get_page and friends can be used
to access the page frame content. A page reference grabbed from a page
table cannot be used to access the page content. A page_make_stable
needs to be done first and if that fails the page needs to get discarded.
That removes the page table entry as well.
Two new page flags are added. To guard against concurrent page state
updates the PG_state_change flag is used. It prevents that a transition
to the stable state can "overtake" a transition to volatile state.
If page_make_volatile has already done the 10 checks it will issue
the state change primitive. If in the meantime on of the conditions has
changed the user that requires the page in stable state will have to wait
until the make volatile operation has finished. The make volatile
operation does not wait for the PG_state_change if make stable has set
the bit. Instead the attempt to make the page volatile fails. The reason
is that test_clear_page_dirty is called from interrupt context and waiting
for the bit might dead-lock.
The second new page flag is the PG_discarded bit. This bit is set for
discarded pages. It prevents multiple removes of a page from the page
cache due to concurrent discard faults and/or normal page removals.
It also prevents the re-add of isolated pages to the lru list in vmscan
if the page has been discarded while it was not on the lru list.
Another noticable change is that the first few lines of code in
try_to_unmap_one that calculates the address from the page and the vma
is moved out of try_to_unmap_one to the callers. This is done to make
try_to_unmap_one usable for the removal of discarded pages in
page_unmap_all.
Signed-off-by: Martin Schwidefsky <[email protected]>
---
fs/buffer.c | 1
include/linux/mm.h | 8 +
include/linux/page-flags.h | 13 +++
include/linux/page-states.h | 55 +++++++++++++
include/linux/pagemap.h | 1
mm/Makefile | 1
mm/filemap.c | 73 +++++++++++++++++-
mm/memory.c | 44 ++++++++++
mm/page-discard.c | 178 ++++++++++++++++++++++++++++++++++++++++++++
mm/page-writeback.c | 3
mm/page_alloc.c | 3
mm/rmap.c | 68 ++++++++++++++--
mm/vmscan.c | 56 +++++++++----
13 files changed, 471 insertions(+), 33 deletions(-)
diff -urpN linux-2.6/fs/buffer.c linux-2.6-patched/fs/buffer.c
--- linux-2.6/fs/buffer.c 2006-09-01 12:49:30.000000000 +0200
+++ linux-2.6-patched/fs/buffer.c 2006-09-01 12:49:36.000000000 +0200
@@ -105,6 +105,7 @@ __clear_page_buffers(struct page *page)
{
ClearPagePrivate(page);
set_page_private(page, 0);
+ page_make_volatile(page, 2);
page_cache_release(page);
}
diff -urpN linux-2.6/include/linux/mm.h linux-2.6-patched/include/linux/mm.h
--- linux-2.6/include/linux/mm.h 2006-09-01 12:49:36.000000000 +0200
+++ linux-2.6-patched/include/linux/mm.h 2006-09-01 12:49:36.000000000 +0200
@@ -308,11 +308,17 @@ struct page {
/*
* Drop a ref, return true if the refcount fell to zero (the page has no users)
+ * put_page_testzero checks if the page can be made volatile if the page
+ * still has users and guest page hinting is enabled.
*/
static inline int put_page_testzero(struct page *page)
{
+ int ret;
VM_BUG_ON(atomic_read(&page->_count) == 0);
- return atomic_dec_and_test(&page->_count);
+ ret = atomic_dec_and_test(&page->_count);
+ if (!ret)
+ page_make_volatile(page, 1);
+ return ret;
}
/*
diff -urpN linux-2.6/include/linux/page-flags.h linux-2.6-patched/include/linux/page-flags.h
--- linux-2.6/include/linux/page-flags.h 2006-09-01 12:49:32.000000000 +0200
+++ linux-2.6-patched/include/linux/page-flags.h 2006-09-01 12:49:36.000000000 +0200
@@ -105,6 +105,9 @@
#define PG_uncached 31 /* Page has been mapped as uncached */
#endif
+#define PG_state_change 21 /* HV page state is changing. */
+#define PG_discarded 22 /* HV page has been discarded. */
+
/*
* Manipulation of page state flags
*/
@@ -254,6 +257,16 @@
#define SetPageReadahead(page) set_bit(PG_readahead, &(page)->flags)
#define TestClearPageReadahead(page) test_and_clear_bit(PG_readahead, &(page)->flags)
+#define PageStateChange(page) test_bit(PG_state_change, &(page)->flags)
+#define ClearPageStateChange(page) clear_bit(PG_state_change, &(page)->flags)
+#define TestSetPageStateChange(page) \
+ test_and_set_bit(PG_state_change, &(page)->flags)
+
+#define PageDiscarded(page) test_bit(PG_discarded, &(page)->flags)
+#define ClearPageDiscarded(page) clear_bit(PG_discarded, &(page)->flags)
+#define TestSetPageDiscarded(page) \
+ test_and_set_bit(PG_discarded, &(page)->flags)
+
struct page; /* forward declaration */
int test_clear_page_dirty(struct page *page);
diff -urpN linux-2.6/include/linux/pagemap.h linux-2.6-patched/include/linux/pagemap.h
--- linux-2.6/include/linux/pagemap.h 2006-09-01 12:49:32.000000000 +0200
+++ linux-2.6-patched/include/linux/pagemap.h 2006-09-01 12:49:36.000000000 +0200
@@ -114,6 +114,7 @@ int add_to_page_cache_lru(struct page *p
unsigned long index, gfp_t gfp_mask);
extern void remove_from_page_cache(struct page *page);
extern void __remove_from_page_cache(struct page *page);
+extern void __remove_from_page_cache_nocheck(struct page *page);
/*
* Return byte-offset into filesystem object for page.
diff -urpN linux-2.6/include/linux/page-states.h linux-2.6-patched/include/linux/page-states.h
--- linux-2.6/include/linux/page-states.h 2006-09-01 12:49:36.000000000 +0200
+++ linux-2.6-patched/include/linux/page-states.h 2006-09-01 12:49:36.000000000 +0200
@@ -16,17 +16,72 @@
#else
/* Guest page hinting architecture primitives:
+ * - page_host_discards:
+ * Indicates whether the host system discards guest pages or not.
* - page_set_unused:
* Indicates to the host that the page content is of no interest
* to the guest. The host can "forget" the page content and replace
* it with a page containing zeroes.
* - page_set_stable:
* Indicate to the host that the page content is needed by the guest.
+ * - page_set_volatile:
+ * Make the page discardable by the host. Instead of writing the
+ * page to the hosts swap device, the host can remove the page.
+ * A guest that accesses such a discarded page gets a special
+ * discard fault.
+ * - page_set_stable_if_present:
+ * The page state is set to stable if the page has not been discarded
+ * by the host. The check and the state change have to be done
+ * atomically.
+ * - page_discarded:
+ * Returns true if the page has been discarded by the host.
*/
+#define page_host_discards() (0)
#define page_set_unused(_page,_order) do { } while (0)
#define page_set_stable(_page,_order) do { } while (0)
+#define page_set_volatile(_page) do { } while (0)
+#define page_set_stable_if_present(_page) (1)
+#define page_discarded(_page) (0)
#endif
+/*
+ * Common guest page hinting functions. The compiler will turn them
+ * into nop's if page_host_discards as primitive is defined as (0):
+ * - page_make_stable:
+ * Tries to make a page stable. This operation can fail if the
+ * host has discarded a page. The function returns != 0 if the
+ * page could not be made stable.
+ * - page_make_volatile:
+ * Tries to make a page volatile. There are a number of conditions
+ * that prevent a page from becoming volatile. If at least one
+ * is true the function does nothing. See mm/page-states.c for
+ * details.
+ * - page_discard:
+ * Removes a discarded page from the system. The page is removed
+ * from the LRU list and the radix tree of its mapping.
+ * page_discard uses page_unmap_all to remove all page table
+ * entries for a page.
+ */
+extern void page_unmap_all(struct page *page);
+extern void page_discard(struct page *page);
+
+static inline int page_make_stable(struct page *page)
+{
+ extern int __page_make_stable(struct page *);
+ if (!page_host_discards())
+ return 1;
+ return __page_make_stable(page);
+}
+
+static inline void page_make_volatile(struct page *page, unsigned int offset)
+{
+ extern void __page_make_volatile(struct page *, unsigned int offset);
+ if (!page_host_discards())
+ return;
+ if (likely(!test_bit(PG_discarded, &page->flags)))
+ __page_make_volatile(page, offset);
+}
+
#endif /* _LINUX_PAGE_STATES_H */
diff -urpN linux-2.6/mm/filemap.c linux-2.6-patched/mm/filemap.c
--- linux-2.6/mm/filemap.c 2006-09-01 12:49:33.000000000 +0200
+++ linux-2.6-patched/mm/filemap.c 2006-09-01 12:49:36.000000000 +0200
@@ -118,7 +118,7 @@ extern u32 readahead_debug_level;
* sure the page is locked and that nobody else uses it - or that usage
* is safe. The caller must hold a write_lock on the mapping's tree_lock.
*/
-void __remove_from_page_cache(struct page *page)
+void inline __remove_from_page_cache_nocheck(struct page *page)
{
struct address_space *mapping = page->mapping;
@@ -127,6 +127,28 @@ void __remove_from_page_cache(struct pag
mapping->nrpages--;
__dec_zone_page_state(page, NR_FILE_PAGES);
}
+
+void __remove_from_page_cache(struct page *page)
+{
+ /*
+ * Check if the discard fault handler already removed
+ * the page from the page cache. If not set the discard
+ * bit in the page flags to prevent double page free if
+ * a discard fault is racing with normal page free.
+ */
+ if (page_host_discards() && TestSetPageDiscarded(page))
+ return;
+
+ __remove_from_page_cache_nocheck(page);
+
+ /*
+ * Check the hardware page state and clear the discard
+ * bit in the page flags only if the page is not
+ * discarded.
+ */
+ if (page_host_discards() && !page_discarded(page))
+ ClearPageDiscarded(page);
+}
EXPORT_SYMBOL(__remove_from_page_cache);
void remove_from_page_cache(struct page *page)
@@ -567,6 +589,7 @@ void fastcall unlock_page(struct page *p
if (!TestClearPageLocked(page))
BUG();
smp_mb__after_clear_bit();
+ page_make_volatile(page, 1);
wake_up_page(page, PG_locked);
}
EXPORT_SYMBOL(unlock_page);
@@ -671,6 +694,14 @@ struct page * find_get_page(struct addre
if (page)
page_cache_get(page);
read_unlock_irq(&mapping->tree_lock);
+ if (page && unlikely(!page_make_stable(page))) {
+ /*
+ * The page has been discarded by the host. Run the
+ * discard handler and return NULL.
+ */
+ page_discard(page);
+ page = NULL;
+ }
return page;
}
EXPORT_SYMBOL(find_get_page);
@@ -715,7 +746,15 @@ repeat:
page = radix_tree_lookup(&mapping->page_tree, offset);
if (page) {
page_cache_get(page);
- if (TestSetPageLocked(page)) {
+ if (unlikely(!page_make_stable(page))) {
+ /*
+ * The page has been discarded by the host. Run the
+ * discard handler and return NULL.
+ */
+ read_unlock_irq(&mapping->tree_lock);
+ page_discard(page);
+ return NULL;
+ } else if (TestSetPageLocked(page)) {
read_unlock_irq(&mapping->tree_lock);
__lock_page(page);
read_lock_irq(&mapping->tree_lock);
@@ -800,11 +839,24 @@ unsigned find_get_pages(struct address_s
unsigned int i;
unsigned int ret;
+repeat:
read_lock_irq(&mapping->tree_lock);
ret = radix_tree_gang_lookup(&mapping->page_tree,
(void **)pages, start, nr_pages);
- for (i = 0; i < ret; i++)
+ for (i = 0; i < ret; i++) {
page_cache_get(pages[i]);
+ if (likely(page_make_stable(pages[i])))
+ continue;
+ /*
+ * Make stable failed, we discard the page and retry the
+ * whole operation.
+ */
+ read_unlock_irq(&mapping->tree_lock);
+ page_discard(pages[i]);
+ while (i--)
+ page_cache_release(pages[i]);
+ goto repeat;
+ }
read_unlock_irq(&mapping->tree_lock);
return ret;
}
@@ -860,11 +912,24 @@ unsigned find_get_pages_tag(struct addre
unsigned int i;
unsigned int ret;
+repeat:
read_lock_irq(&mapping->tree_lock);
ret = radix_tree_gang_lookup_tag(&mapping->page_tree,
(void **)pages, *index, nr_pages, tag);
- for (i = 0; i < ret; i++)
+ for (i = 0; i < ret; i++) {
page_cache_get(pages[i]);
+ if (likely(page_make_stable(pages[i])))
+ continue;
+ /*
+ * Make stable failed, we discard the page and retry the
+ * whole operation.
+ */
+ read_unlock_irq(&mapping->tree_lock);
+ page_discard(pages[i]);
+ while (i--)
+ page_cache_release(pages[i]);
+ goto repeat;
+ }
if (ret)
*index = pages[ret - 1]->index + 1;
read_unlock_irq(&mapping->tree_lock);
diff -urpN linux-2.6/mm/Makefile linux-2.6-patched/mm/Makefile
--- linux-2.6/mm/Makefile 2006-09-01 12:49:33.000000000 +0200
+++ linux-2.6-patched/mm/Makefile 2006-09-01 12:50:14.000000000 +0200
@@ -28,3 +28,4 @@ obj-$(CONFIG_MEMORY_HOTPLUG) += memory_h
obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_SMP) += allocpercpu.o
+obj-$(CONFIG_PAGE_STATES) += page-discard.o
diff -urpN linux-2.6/mm/memory.c linux-2.6-patched/mm/memory.c
--- linux-2.6/mm/memory.c 2006-09-01 12:49:33.000000000 +0200
+++ linux-2.6-patched/mm/memory.c 2006-09-01 12:49:36.000000000 +0200
@@ -1058,6 +1058,7 @@ int get_user_pages(struct task_struct *t
if (write)
foll_flags |= FOLL_WRITE;
+retry:
cond_resched();
while (!(page = follow_page(vma, start, foll_flags))) {
int ret;
@@ -1087,6 +1088,22 @@ int get_user_pages(struct task_struct *t
BUG();
}
}
+ if (foll_flags & FOLL_GET) {
+ /*
+ * The pages are only made stable in case
+ * an additional reference is acquired. This
+ * includes the case of a non-null pages array.
+ * If no additional reference is taken it
+ * implies that the caller can deal with page
+ * faults in case the page is swapped out.
+ * In this case the caller can deal with
+ * discard faults as well.
+ */
+ if (unlikely(!page_make_stable(page))) {
+ page_discard(page);
+ goto retry;
+ }
+ }
if (pages) {
pages[i] = page;
@@ -1552,6 +1569,12 @@ static int do_wp_page(struct mm_struct *
* Ok, we need to copy. Oh, well..
*/
page_cache_get(old_page);
+ /*
+ * To copy the content of old_page it needs to be stable.
+ * page_cache_release on old_page will make it volatile again.
+ */
+ if (unlikely(!page_make_stable(old_page)))
+ goto discard;
gotten:
pte_unmap_unlock(page_table, ptl);
@@ -1613,6 +1636,10 @@ oom:
unwritable_page:
page_cache_release(old_page);
return VM_FAULT_SIGBUS;
+discard:
+ pte_unmap_unlock(page_table, ptl);
+ page_discard(old_page);
+ return VM_FAULT_MAJOR;
}
/*
@@ -2177,6 +2204,10 @@ retry:
if (unlikely(anon_vma_prepare(vma)))
goto oom;
+ if (unlikely(!page_make_stable(new_page))) {
+ page_discard(new_page);
+ goto retry;
+ }
page = alloc_page_vma(GFP_HIGHUSER, vma, address);
if (!page)
goto oom;
@@ -2321,7 +2352,9 @@ static inline int handle_pte_fault(struc
pte_t entry;
pte_t old_entry;
spinlock_t *ptl;
+ int rc;
+again:
old_entry = entry = *pte;
if (!pte_present(entry)) {
if (pte_none(entry)) {
@@ -2343,9 +2376,16 @@ static inline int handle_pte_fault(struc
if (unlikely(!pte_same(*pte, entry)))
goto unlock;
if (write_access) {
- if (!pte_write(entry))
- return do_wp_page(mm, vma, address,
+ if (!pte_write(entry)) {
+ rc = do_wp_page(mm, vma, address,
pte, pmd, ptl, entry);
+ if (page_host_discards() &&
+ unlikely(rc == VM_FAULT_MAJOR)) {
+ pte = pte_alloc_map(mm, pmd, address);
+ goto again;
+ }
+ return rc;
+ }
entry = pte_mkdirty(entry);
}
entry = pte_mkyoung(entry);
diff -urpN linux-2.6/mm/page_alloc.c linux-2.6-patched/mm/page_alloc.c
--- linux-2.6/mm/page_alloc.c 2006-09-01 12:49:36.000000000 +0200
+++ linux-2.6-patched/mm/page_alloc.c 2006-09-01 12:49:36.000000000 +0200
@@ -215,7 +215,8 @@ static void bad_page(struct page *page)
1 << PG_slab |
1 << PG_swapcache |
1 << PG_writeback |
- 1 << PG_buddy );
+ 1 << PG_buddy |
+ 1 << PG_discarded );
set_page_count(page, 0);
reset_page_mapcount(page);
page->mapping = NULL;
diff -urpN linux-2.6/mm/page-discard.c linux-2.6-patched/mm/page-discard.c
--- linux-2.6/mm/page-discard.c 1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6-patched/mm/page-discard.c 2006-09-01 12:49:36.000000000 +0200
@@ -0,0 +1,178 @@
+/*
+ * mm/page-discard.c
+ *
+ * (C) Copyright IBM Corp. 2005, 2006
+ *
+ * Guest page hinting functions.
+ *
+ * Authors: Martin Schwidefsky <[email protected]>
+ * Hubertus Franke <[email protected]>
+ * Himanshu Raj <[email protected]>
+ */
+
+#include <linux/mm.h>
+#include <linux/mm_inline.h>
+#include <linux/pagemap.h>
+#include <linux/rmap.h>
+#include <linux/module.h>
+#include <linux/spinlock.h>
+#include <linux/buffer_head.h>
+
+#include "internal.h"
+
+/*
+ * Check if there is something that prevents the page from changing
+ * its state to volatile.
+ */
+static inline int __page_discardable(struct page *page,unsigned int offset)
+{
+ /*
+ * There are several conditions that prevent a page from becoming
+ * volatile. The first check is for the page bits.
+ */
+ if (PageDirty(page) || PageReserved(page) || PageWriteback(page) ||
+ PageLocked(page) || PagePrivate(page) || PageDiscarded(page) ||
+ !PageUptodate(page) || !PageLRU(page) || PageAnon(page))
+ return 0;
+
+ /*
+ * If the page has been truncated there is no point in making
+ * it volatile. It will be freed soon. If the mapping ever had
+ * locked pages all pages of the mapping will stay stable.
+ */
+ if (!page_mapping(page))
+ return 0;
+
+ /*
+ * The last check is the critical one. We check the reference
+ * counter of the page against the number of mappings. The caller
+ * of make_volatile passes an offset, that is the number of extra
+ * references. For most calls that is 1 extra reference for the
+ * page-cache. In some cases the caller itself holds an additional
+ * reference, then the offset is 2. If the page map counter is equal
+ * to the page count minus the offset then there is no other
+ * (unknown) user of the page in the system and we can make the
+ * page volatile.
+ */
+ BUG_ON(page_mapcount(page) > page_count(page) - offset);
+ if (page_mapcount(page) != page_count(page) - offset)
+ return 0;
+
+ return 1;
+}
+
+/*
+ * Attempts to change the state of a page to volatile. If there is
+ * something preventing the state change the page stays in its current
+ * state.
+ */
+void __page_make_volatile(struct page *page, unsigned int offset)
+{
+ /*
+ * If we can't get the PG_state_change bit just give up. The
+ * worst that can happen is that the page will stay in stable
+ * state although it might be volatile.
+ */
+ preempt_disable();
+ if (!TestSetPageStateChange(page)) {
+ if (__page_discardable(page, offset))
+ page_set_volatile(page);
+ ClearPageStateChange(page);
+ }
+ preempt_enable();
+}
+EXPORT_SYMBOL(__page_make_volatile);
+
+/*
+ * Attempts to change the state of a page to stable. The host could
+ * have removed a volatile page, the page_set_stable_if_present call
+ * can fail.
+ *
+ * returns "0" on success and "1" on failure
+ */
+int __page_make_stable(struct page *page)
+{
+ /*
+ * Postpone state change to stable until PG_state_change bit is
+ * cleared. As long as PG_state_change is set another cpu is in
+ * page_make_volatile for this page. That makes sure
+ * that no caller of make_stable "overtakes" a make_volatile
+ * leaving the page in volatile where stable is required.
+ * The caller of make_stable need to make sure that no caller
+ * of make_volatile can make the page volatile right after
+ * make_stable has finished. That is done by requiring that
+ * page has been locked or that the page_count has been
+ * increased before make_stable is called. In both cases a
+ * subsequent call page_make_volatile will fail.
+ */
+ while (PageStateChange(page))
+ cpu_relax();
+ return page_set_stable_if_present(page);
+}
+EXPORT_SYMBOL(__page_make_stable);
+
+/**
+ * __page_discard() - remove a discarded page from the cache
+ *
+ * @page: the page
+ *
+ * The page passed to this function needs to be locked.
+ */
+static void __page_discard(struct page *page)
+{
+ struct address_space *mapping;
+ struct zone *zone;
+
+ /* Paranoia checks. */
+ BUG_ON(PageWriteback(page));
+ BUG_ON(PageDirty(page));
+ BUG_ON(PagePrivate(page));
+
+ /* Set the discarded bit early. */
+ if (TestSetPageDiscarded(page))
+ return;
+
+ /* Unmap the page from all page tables. */
+ page_unmap_all(page);
+
+ /*
+ * Remove the page from LRU if it is currently added.
+ * The users of isolate_lru_pages need to check the
+ * discarded bit before readding the page to the LRU.
+ */
+ zone = page_zone(page);
+ spin_lock_irq(&zone->lru_lock);
+ if (PageLRU(page)) {
+ /* Unlink page from lru. */
+ __ClearPageLRU(page);
+ del_page_from_lru(zone, page);
+ }
+ spin_unlock_irq(&zone->lru_lock);
+
+ /* We can't handle swap cache pages (yet). */
+ BUG_ON(PageSwapCache(page));
+
+ /* Remove page from page cache. */
+ mapping = page->mapping;
+ write_lock_irq(&mapping->tree_lock);
+ __remove_from_page_cache_nocheck(page);
+ write_unlock_irq(&mapping->tree_lock);
+ __put_page(page);
+}
+
+/**
+ * page_discard() - remove a discarded page from the cache
+ *
+ * @page: the page
+ *
+ * Before calling this function an additional page reference needs to
+ * be acquired. This reference is released by the function.
+ */
+void page_discard(struct page *page)
+{
+ lock_page(page);
+ __page_discard(page);
+ unlock_page(page);
+ page_cache_release(page);
+}
+EXPORT_SYMBOL(page_discard);
diff -urpN linux-2.6/mm/page-writeback.c linux-2.6-patched/mm/page-writeback.c
--- linux-2.6/mm/page-writeback.c 2006-09-01 12:49:33.000000000 +0200
+++ linux-2.6-patched/mm/page-writeback.c 2006-09-01 12:49:36.000000000 +0200
@@ -858,6 +858,7 @@ int test_clear_page_dirty(struct page *p
radix_tree_tag_clear(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_DIRTY);
+ page_make_volatile(page, 2);
write_unlock_irqrestore(&mapping->tree_lock, flags);
/*
* We can continue to use `mapping' here because the
@@ -924,6 +925,8 @@ int test_clear_page_writeback(struct pag
page_index(page),
PAGECACHE_TAG_WRITEBACK);
write_unlock_irqrestore(&mapping->tree_lock, flags);
+ if (ret)
+ page_make_volatile(page, 2);
} else {
ret = TestClearPageWriteback(page);
}
diff -urpN linux-2.6/mm/rmap.c linux-2.6-patched/mm/rmap.c
--- linux-2.6/mm/rmap.c 2006-09-01 12:49:33.000000000 +0200
+++ linux-2.6-patched/mm/rmap.c 2006-09-01 12:49:36.000000000 +0200
@@ -565,6 +565,7 @@ void page_add_file_rmap(struct page *pag
{
if (atomic_inc_and_test(&page->_mapcount))
__inc_zone_page_state(page, NR_FILE_MAPPED);
+ page_make_volatile(page, 1);
}
/**
@@ -606,19 +607,14 @@ void page_remove_rmap(struct page *page)
* repeatedly from either try_to_unmap_anon or try_to_unmap_file.
*/
static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
- int migration)
+ unsigned long address, int migration)
{
struct mm_struct *mm = vma->vm_mm;
- unsigned long address;
pte_t *pte;
pte_t pteval;
spinlock_t *ptl;
int ret = SWAP_AGAIN;
- address = vma_address(page, vma);
- if (address == -EFAULT)
- goto out;
-
pte = page_check_address(page, mm, address, &ptl);
if (!pte)
goto out;
@@ -788,6 +784,7 @@ static int try_to_unmap_anon(struct page
{
struct anon_vma *anon_vma;
struct vm_area_struct *vma;
+ unsigned long address;
int ret = SWAP_AGAIN;
anon_vma = page_lock_anon_vma(page);
@@ -795,7 +792,10 @@ static int try_to_unmap_anon(struct page
return ret;
list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
- ret = try_to_unmap_one(page, vma, migration);
+ address = vma_address(page, vma);
+ if (address == -EFAULT)
+ continue;
+ ret = try_to_unmap_one(page, vma, address, migration);
if (ret == SWAP_FAIL || !page_mapped(page))
break;
}
@@ -819,6 +819,7 @@ static int try_to_unmap_file(struct page
struct vm_area_struct *vma;
struct prio_tree_iter iter;
int ret = SWAP_AGAIN;
+ unsigned long address;
unsigned long cursor;
unsigned long max_nl_cursor = 0;
unsigned long max_nl_size = 0;
@@ -826,7 +827,10 @@ static int try_to_unmap_file(struct page
spin_lock(&mapping->i_mmap_lock);
vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
- ret = try_to_unmap_one(page, vma, migration);
+ address = vma_address(page, vma);
+ if (address == -EFAULT)
+ continue;
+ ret = try_to_unmap_one(page, vma, address, migration);
if (ret == SWAP_FAIL || !page_mapped(page))
goto out;
}
@@ -927,3 +931,51 @@ int try_to_unmap(struct page *page, int
return ret;
}
+#if defined(CONFIG_PAGE_STATES)
+
+/**
+ * page_unmap_all - removes all mappings of a page
+ *
+ * @page: the page which mapping in the vma should be struck down
+ *
+ * the caller needs to hold page lock
+ */
+void page_unmap_all(struct page* page)
+{
+ struct address_space *mapping = page_mapping(page);
+ pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+ struct vm_area_struct *vma;
+ struct prio_tree_iter iter;
+ unsigned long address;
+
+ BUG_ON(!PageLocked(page) || PageReserved(page) || PageAnon(page));
+
+ spin_lock(&mapping->i_mmap_lock);
+ vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
+ address = vma_address(page, vma);
+ if (address == -EFAULT)
+ continue;
+ BUG_ON(try_to_unmap_one(page, vma, address, 0) == SWAP_FAIL);
+ }
+
+ if (list_empty(&mapping->i_mmap_nonlinear))
+ goto out;
+
+ /*
+ * Remove the non-linear mappings of the page. This is
+ * awfully slow, but we have to find that discarded page..
+ */
+ list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
+ shared.vm_set.list) {
+ address = vma->vm_start;
+ while (address < vma->vm_end) {
+ BUG_ON(try_to_unmap_one(page, vma, address, 0) == SWAP_FAIL);
+ address += PAGE_SIZE;
+ }
+ }
+
+out:
+ spin_unlock(&mapping->i_mmap_lock);
+}
+
+#endif
diff -urpN linux-2.6/mm/vmscan.c linux-2.6-patched/mm/vmscan.c
--- linux-2.6/mm/vmscan.c 2006-09-01 12:49:33.000000000 +0200
+++ linux-2.6-patched/mm/vmscan.c 2006-09-01 12:49:36.000000000 +0200
@@ -681,13 +681,20 @@ static unsigned long shrink_inactive_lis
*/
while (!list_empty(&page_list)) {
page = lru_to_page(&page_list);
- VM_BUG_ON(PageLRU(page));
- SetPageLRU(page);
list_del(&page->lru);
- if (PageActive(page))
- add_page_to_active_list(zone, page);
- else
- add_page_to_inactive_list(zone, page);
+ /*
+ * Only readd the page to lru list if it has not
+ * been discarded.
+ */
+ if (!page_host_discards() ||
+ likely(!PageDiscarded(page))) {
+ VM_BUG_ON(PageLRU(page));
+ SetPageLRU(page);
+ if (PageActive(page))
+ add_page_to_active_list(zone, page);
+ else
+ add_page_to_inactive_list(zone, page);
+ }
if (!pagevec_add(&pvec, page)) {
spin_unlock_irq(&zone->lru_lock);
__pagevec_release(&pvec);
@@ -813,13 +820,20 @@ force_reclaim_mapped:
while (!list_empty(&l_inactive)) {
page = lru_to_page(&l_inactive);
prefetchw_prev_lru_page(page, &l_inactive, flags);
- VM_BUG_ON(PageLRU(page));
- SetPageLRU(page);
- VM_BUG_ON(!PageActive(page));
- ClearPageActive(page);
+ /*
+ * Only readd the page to lru list if it has not
+ * been discarded.
+ */
+ if (!page_host_discards() || likely(!PageDiscarded(page))) {
+ VM_BUG_ON(PageLRU(page));
+ SetPageLRU(page);
+ VM_BUG_ON(!PageActive(page));
+ ClearPageActive(page);
+ list_move(&page->lru, &zone->inactive_list);
+ pgmoved++;
+ } else
+ list_del(&page->lru);
- list_move(&page->lru, &zone->inactive_list);
- pgmoved++;
if (!pagevec_add(&pvec, page)) {
zone->nr_inactive += pgmoved;
spin_unlock_irq(&zone->lru_lock);
@@ -843,11 +857,19 @@ force_reclaim_mapped:
while (!list_empty(&l_active)) {
page = lru_to_page(&l_active);
prefetchw_prev_lru_page(page, &l_active, flags);
- VM_BUG_ON(PageLRU(page));
- SetPageLRU(page);
- VM_BUG_ON(!PageActive(page));
- list_move(&page->lru, &zone->active_list);
- pgmoved++;
+ /*
+ * Only readd the page to lru list if it has not
+ * been discarded.
+ */
+ if (!page_host_discards() || likely(!PageDiscarded(page))) {
+ VM_BUG_ON(PageLRU(page));
+ SetPageLRU(page);
+ VM_BUG_ON(!PageActive(page));
+ list_move(&page->lru, &zone->active_list);
+ pgmoved++;
+ } else
+ list_del(&page->lru);
+
if (!pagevec_add(&pvec, page)) {
zone->nr_active += pgmoved;
pgmoved = 0;
Are all of the un/likely()s in here really needed?
-- Dave
On Fri, 2006-09-01 at 13:09 +0200, Martin Schwidefsky wrote:
> +#define PG_state_change 21 /* HV page state is changing. */
> +#define PG_discarded 22 /* HV page has been discarded. */
We're already desperately short on page flags on 32-bit architectures.
It seems a wee bit silly to add two arch-generic flags for what is a
very specialized arch-specific feature at this point.
I know that there are 32-bit s390 kernels, but would this be a
reasonable feature to restrict to only 64-bit kernels? That might be a
decent compromise.
-- Dave
On Fri, 2006-09-01 at 07:54 -0700, Dave Hansen wrote:
> Are all of the un/likely()s in here really needed?
Well, no un/likely is really needed. But they do help the compiler to
generate better code. It IS unlikely that a page is discarded.
--
blue skies,
Martin.
Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
"Reality continues to ruin my life." - Calvin.
On Fri, 2006-09-01 at 07:57 -0700, Dave Hansen wrote:
> > +#define PG_state_change 21 /* HV page state is changing. */
> > +#define PG_discarded 22 /* HV page has been discarded. */
>
> We're already desperately short on page flags on 32-bit architectures.
> It seems a wee bit silly to add two arch-generic flags for what is a
> very specialized arch-specific feature at this point.
There are even three additional page flags if you apply the full set of
patches.
> I know that there are 32-bit s390 kernels, but would this be a
> reasonable feature to restrict to only 64-bit kernels? That might be a
> decent compromise.
Yes, it is definitly an option to make this a 64-bit only features. In
particular because the ESSA instruction that is used on s390 is only
available in zarch mode (=64 bit).
--
blue skies,
Martin.
Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
"Reality continues to ruin my life." - Calvin.
On Fri, 2006-09-01 at 17:29 +0200, Martin Schwidefsky wrote:
> On Fri, 2006-09-01 at 07:54 -0700, Dave Hansen wrote:
> > Are all of the un/likely()s in here really needed?
>
> Well, no un/likely is really needed. But they do help the compiler to
> generate better code. It IS unlikely that a page is discarded.
I know what they're _for_. ;)
I seem to recall people being poked in the past for sprinkling too many
of these things around their code. I just wanted to make sure that
there was some tangible reason for doing it, other than speculative
performance optimization.
-- Dave
Martin Schwidefsky wrote:
> On Fri, 2006-09-01 at 07:57 -0700, Dave Hansen wrote:
>>> +#define PG_state_change 21 /* HV page state is changing. */
>>> +#define PG_discarded 22 /* HV page has been discarded. */
>> We're already desperately short on page flags on 32-bit architectures.
>> It seems a wee bit silly to add two arch-generic flags for what is a
>> very specialized arch-specific feature at this point.
>
> There are even three additional page flags if you apply the full set of
> patches.
>
>> I know that there are 32-bit s390 kernels, but would this be a
>> reasonable feature to restrict to only 64-bit kernels? That might be a
>> decent compromise.
>
> Yes, it is definitly an option to make this a 64-bit only features. In
> particular because the ESSA instruction that is used on s390 is only
> available in zarch mode (=64 bit).
Wow. Well there are only 7 extra bits available in 64 bit mode (the
FIELDS area is larger on 64bit machines). Do we really, really need
three new bits. What are they being used for here.
-apw
On Fri, 2006-09-01 at 16:48 +0100, Andy Whitcroft wrote:
> >> I know that there are 32-bit s390 kernels, but would this be a
> >> reasonable feature to restrict to only 64-bit kernels? That might be a
> >> decent compromise.
> >
> > Yes, it is definitly an option to make this a 64-bit only features. In
> > particular because the ESSA instruction that is used on s390 is only
> > available in zarch mode (=64 bit).
>
> Wow. Well there are only 7 extra bits available in 64 bit mode (the
> FIELDS area is larger on 64bit machines). Do we really, really need
> three new bits. What are they being used for here.
I have though about alternatives for the page bits hard and long and I
could not find a way how to do it without the bits. Their use is:
1) The page-is-discarded (PG_discarded) bit is set for pages that have
been recognized as removed by the host. The page needs to be removed
from the page cache while there are still page references floating
around. To prevent multiple removals from the page cache the discarded
bit is needed.
2) The page-state-change (PG_state_change) bit is required to prevent
that an make_stable "overtakes" a make_volatile. In order to make a page
volatile a number of conditions are check. After this is done the state
change will be done. The critical section is the code that performs the
checks up to the instruction that does the state change. No make_stable
may be done in between. The granularity is per page, to use a global
lock like a spinlock would severly limit the scalability for large smp
systems.
3) The page-has-a-writable-mapping (PG_writable) bit is set when the
first writable pte for a page is established. The page needs to have a
different state if a writable pte exists compared to a read-only page.
The alternative without the page bit would be to do the state change
every time a writable pte is established or to search all ptes of a
given page. Both have performance implications.
Overall I fear that the code really needs three additional bits.
--
blue skies,
Martin.
Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
"Reality continues to ruin my life." - Calvin.
On Fri, 2006-09-01 at 18:04 +0200, Martin Schwidefsky wrote:
> 1) The page-is-discarded (PG_discarded) bit is set for pages that have
> been recognized as removed by the host. The page needs to be removed
> from the page cache while there are still page references floating
> around. To prevent multiple removals from the page cache the discarded
> bit is needed.
OK, so the page has data in it, and is in the page cache. The
hypervisor kills the page, gives the notification to the kernel that the
page has gone away, and the kernel marks PG_discarded. There still
might be active references to the page.
So, is the problem trying to communicate with the reference holders that
the page is no longer valid? How is this fundamentally different from
page truncating?
> 2) The page-state-change (PG_state_change) bit is required to prevent
> that an make_stable "overtakes" a make_volatile. In order to make a page
> volatile a number of conditions are check. After this is done the state
> change will be done. The critical section is the code that performs the
> checks up to the instruction that does the state change. No make_stable
> may be done in between. The granularity is per page, to use a global
> lock like a spinlock would severly limit the scalability for large smp
> systems.
How about doing it in the NUMA node? Or the mem_section? Or, even a
bit in the mem_map[] for the area guarding the 'struct page' itself?
Even a hashed table of locks based on the page address. You just need
something that allows _some_ level of concurrency. You certainly never
have a number of CPUs which is anywhere close to the number of 'struct
page's in the system.
Let me think more about (3).
-- Dave
On Fri, 2006-09-01 at 09:18 -0700, Dave Hansen wrote:
> > 1) The page-is-discarded (PG_discarded) bit is set for pages that have
> > been recognized as removed by the host. The page needs to be removed
> > from the page cache while there are still page references floating
> > around. To prevent multiple removals from the page cache the discarded
> > bit is needed.
>
> OK, so the page has data in it, and is in the page cache. The
> hypervisor kills the page, gives the notification to the kernel that the
> page has gone away, and the kernel marks PG_discarded. There still
> might be active references to the page.
No, the hypervisor does not give the notification immediatly. A discard
fault is delivered to the guest if it tries to access a page that has
been removed by the host. That is the fundamental difference between a
memory ballooner and the guest page hinting.
> So, is the problem trying to communicate with the reference holders that
> the page is no longer valid? How is this fundamentally different from
> page truncating?
Truncating is similar but the reaction is different. A truncated page is
gone and will not be recreated. A discarded page can be reloaded.
> > 2) The page-state-change (PG_state_change) bit is required to prevent
> > that an make_stable "overtakes" a make_volatile. In order to make a page
> > volatile a number of conditions are check. After this is done the state
> > change will be done. The critical section is the code that performs the
> > checks up to the instruction that does the state change. No make_stable
> > may be done in between. The granularity is per page, to use a global
> > lock like a spinlock would severly limit the scalability for large smp
> > systems.
>
> How about doing it in the NUMA node? Or the mem_section? Or, even a
> bit in the mem_map[] for the area guarding the 'struct page' itself?
> Even a hashed table of locks based on the page address. You just need
> something that allows _some_ level of concurrency. You certainly never
> have a number of CPUs which is anywhere close to the number of 'struct
> page's in the system.
NUMA node is not granular enough, mem_section is probably doable. I do
not understand the part about the bit in the mem_map[] area, a bit in
the page->flags is exactly that, isn't it?
--
blue skies,
Martin.
Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
"Reality continues to ruin my life." - Calvin.
On Fri, 2006-09-01 at 18:04 +0200, Martin Schwidefsky wrote:
> 3) The page-has-a-writable-mapping (PG_writable) bit is set when the
> first writable pte for a page is established. The page needs to have a
> different state if a writable pte exists compared to a read-only page.
> The alternative without the page bit would be to do the state change
> every time a writable pte is established or to search all ptes of a
> given page. Both have performance implications.
What are the performance implications? Do they completely erase any
performance gains that these patches might have given in the first
place? Has there been any evaluation of these other two alternatives?
As I understand it, carrying out this performance analysis would be very
difficult for most of the kernel community to perform.
Keeping a nice count of the number of writable PTEs sounds like
something that might be generally useful. Could we split
page->_mapcount to keep track of r/o and r/w ptes separately? Or,
perhaps a single bit in it can be utilized to replace PG_writable,
instead.
-- Dave
On Fri, 2006-09-01 at 18:25 +0200, Martin Schwidefsky wrote:
> On Fri, 2006-09-01 at 09:18 -0700, Dave Hansen wrote:
> > > 1) The page-is-discarded (PG_discarded) bit is set for pages that have
> > > been recognized as removed by the host. The page needs to be removed
> > > from the page cache while there are still page references floating
> > > around. To prevent multiple removals from the page cache the discarded
> > > bit is needed.
> >
> > OK, so the page has data in it, and is in the page cache. The
> > hypervisor kills the page, gives the notification to the kernel that the
> > page has gone away, and the kernel marks PG_discarded. There still
> > might be active references to the page.
>
> No, the hypervisor does not give the notification immediatly. A discard
> fault is delivered to the guest if it tries to access a page that has
> been removed by the host. That is the fundamental difference between a
> memory ballooner and the guest page hinting.
>
> > So, is the problem trying to communicate with the reference holders that
> > the page is no longer valid? How is this fundamentally different from
> > page truncating?
>
> Truncating is similar but the reaction is different. A truncated page is
> gone and will not be recreated. A discarded page can be reloaded.
Can you give me the sequence of events that occur so that we need to
set, then check PG_discarded? I'm not getting it.
1. there is good data in a page
...
50. ... and PG_discarded gets set
...
99. We check PG_discarded and ...
> > > 2) The page-state-change (PG_state_change) bit is required to prevent
> > > that an make_stable "overtakes" a make_volatile. In order to make a page
> > > volatile a number of conditions are check. After this is done the state
> > > change will be done. The critical section is the code that performs the
> > > checks up to the instruction that does the state change. No make_stable
> > > may be done in between. The granularity is per page, to use a global
> > > lock like a spinlock would severly limit the scalability for large smp
> > > systems.
> >
> > How about doing it in the NUMA node? Or the mem_section? Or, even a
> > bit in the mem_map[] for the area guarding the 'struct page' itself?
> > Even a hashed table of locks based on the page address. You just need
> > something that allows _some_ level of concurrency. You certainly never
> > have a number of CPUs which is anywhere close to the number of 'struct
> > page's in the system.
>
> NUMA node is not granular enough, mem_section is probably doable. I do
> not understand the part about the bit in the mem_map[] area, a bit in
> the page->flags is exactly that, isn't it?
No, I'm being tricky. There are struct pages for all memory, including
kernel memory. mem_map[] is in kernel memory. So, the memory for the
mem_map[] has struct pages, which themselves are in the mem_map[].
void lock_page_for_state_change(struct page *page)
{
struct pages_backing_page = virt_to_page(page);
lock_page(pages_backing_page)
}
We did this for a bit with sparsemem, I think. That is, until Andy came
up with something even more clever.
-- Dave
On Fri, 2006-09-01 at 09:37 -0700, Dave Hansen wrote:
> Can you give me the sequence of events that occur so that we need to
> set, then check PG_discarded? I'm not getting it.
>
> 1. there is good data in a page
> ...
> 50. ... and PG_discarded gets set
> ...
> 99. We check PG_discarded and ...
Ok, here we go:
0) there is good data in a page
1) the host scans for pages to reclaim and selects a page of a
particular guest
2) the host checks the page state and decides to either swap the page or
discard it
3) nothing happens for a long time
4) the guest comes around and tries to access the long gone page
5) the host gets a fault because the page is gone from the hosts page
table for the guest system
6) the host delivers a discard fault to the guest
7) the architecture dependent fault handler gets a page reference for
the discarded page (tricky for s390)
8) page_discard is called which locks the page and does a
TestSetPageDiscarded. If the bit has not been set yet the page is
removed from the page cache. There can still be page references around.
Concurrent to 5-8 another cpu could be just be removing the page from
page cache as well. Without the check for the discarded bit the page
would get removed twice. This does nasty things to reference counting,
mapping->nrpages, ...
> > NUMA node is not granular enough, mem_section is probably doable. I do
> > not understand the part about the bit in the mem_map[] area, a bit in
> > the page->flags is exactly that, isn't it?
>
> No, I'm being tricky. There are struct pages for all memory, including
> kernel memory. mem_map[] is in kernel memory. So, the memory for the
> mem_map[] has struct pages, which themselves are in the mem_map[].
>
> void lock_page_for_state_change(struct page *page)
> {
> struct pages_backing_page = virt_to_page(page);
> lock_page(pages_backing_page)
> }
>
> We did this for a bit with sparsemem, I think. That is, until Andy came
> up with something even more clever.
Ouch, I understand what you are trying to tell me. The struct page
entries that cover the mem_map array itself has free bits we could try
to cannibalize.
--
blue skies,
Martin.
Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
"Reality continues to ruin my life." - Calvin.
On Fri, 2006-09-01 at 09:29 -0700, Dave Hansen wrote:
> > 3) The page-has-a-writable-mapping (PG_writable) bit is set when the
> > first writable pte for a page is established. The page needs to have a
> > different state if a writable pte exists compared to a read-only page.
> > The alternative without the page bit would be to do the state change
> > every time a writable pte is established or to search all ptes of a
> > given page. Both have performance implications.
>
> What are the performance implications? Do they completely erase any
> performance gains that these patches might have given in the first
> place? Has there been any evaluation of these other two alternatives?
> As I understand it, carrying out this performance analysis would be very
> difficult for most of the kernel community to perform.
It seemed obvious to me that anything else than checking a bit is way to
expensive. I never implemented nor measured any of the alternatives. The
alternative to do the state change every time a writable pte is
established can be implemented without too much trouble. Perhaps I will
give it a try next week.
> Keeping a nice count of the number of writable PTEs sounds like
> something that might be generally useful. Could we split
> page->_mapcount to keep track of r/o and r/w ptes separately? Or,
> perhaps a single bit in it can be utilized to replace PG_writable,
> instead.
Yes, that would be really useful for the writable ptes. But I have the
feeling that the actual implementation of it will be tricky.
--
blue skies,
Martin.
Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
"Reality continues to ruin my life." - Calvin.
On Fri, 2006-09-01 at 19:02 +0200, Martin Schwidefsky wrote:
> Yes, that would be really useful for the writable ptes. But I have the
> feeling that the actual implementation of it will be tricky.
I'm confident that anyone able to produce nine patches like what I just
saw is capable of producing at least one more tricky one. ;)
-- Dave
On Fri, 2006-09-01 at 18:56 +0200, Martin Schwidefsky wrote:
> On Fri, 2006-09-01 at 09:37 -0700, Dave Hansen wrote:
> > Can you give me the sequence of events that occur so that we need to
> > set, then check PG_discarded? I'm not getting it.
> >
> > 1. there is good data in a page
> > ...
> > 50. ... and PG_discarded gets set
> > ...
> > 99. We check PG_discarded and ...
>
> Ok, here we go:
> 0) there is good data in a page
> 1) the host scans for pages to reclaim and selects a page of a
> particular guest
> 2) the host checks the page state and decides to either swap the page or
> discard it
> 3) nothing happens for a long time
> 4) the guest comes around and tries to access the long gone page
> 5) the host gets a fault because the page is gone from the hosts page
> table for the guest system
> 6) the host delivers a discard fault to the guest
> 7) the architecture dependent fault handler gets a page reference for
> the discarded page (tricky for s390)
> 8) page_discard is called which locks the page and does a
> TestSetPageDiscarded. If the bit has not been set yet the page is
> removed from the page cache. There can still be page references around.
>
> Concurrent to 5-8 another cpu could be just be removing the page from
> page cache as well. Without the check for the discarded bit the page
> would get removed twice. This does nasty things to reference counting,
> mapping->nrpages, ...
This feels like something that can be done with RCU. The
__page_discard() is the write operation, right? So, take an rcu write
lock inside of the page discard function, and read locks over the
current places where PG_discarded is set.
That should make sure that the discard operation itself can't be done
concurrently with one of the __remove_from*() operations. Once the
write lock has been acquired, you just check page->mapping to see if the
a __remove_from*() operation has occurred while you waited.
> Ouch, I understand what you are trying to tell me. The struct page
> entries that cover the mem_map array itself has free bits we could try
> to cannibalize.
Right. It is certainly ugly. I'd much rather have some kind of
spin_lock(&lock_array[hash_function(page)]);
thing.
-- Dave
On Fri, 2006-09-01 at 10:16 -0700, Dave Hansen wrote:
> This feels like something that can be done with RCU. The
> __page_discard() is the write operation, right? So, take an rcu write
> lock inside of the page discard function, and read locks over the
> current places where PG_discarded is set.
>
> That should make sure that the discard operation itself can't be done
> concurrently with one of the __remove_from*() operations. Once the
> write lock has been acquired, you just check page->mapping to see if the
> a __remove_from*() operation has occurred while you waited.
The problem of page discard vs. normal page remove is that the page can
be remove and discarded at the same time. Both sides are writers in the
sense that they want to remove the page from page cache. RCU doesn't not
help with that kind of race.
--
blue skies,
Martin.
Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
"Reality continues to ruin my life." - Calvin.
On Fri, 2006-09-01 at 19:42 +0200, Martin Schwidefsky wrote:
> The problem of page discard vs. normal page remove is that the page can
> be remove and discarded at the same time. Both sides are writers in the
> sense that they want to remove the page from page cache. RCU doesn't not
> help with that kind of race.
OK. It comes down to a race between
__remove_from_page_cache()/__delete_from_swap_cache()
and
__page_discard()
running on the same page at the same time. Right?
-- Dave
On Fri, 2006-09-01 at 11:03 -0700, Dave Hansen wrote:
> OK. It comes down to a race between
>
> __remove_from_page_cache()/__delete_from_swap_cache()
>
> and
>
> __page_discard()
>
> running on the same page at the same time. Right?
Yes.
--
blue skies,
Martin.
Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
"Reality continues to ruin my life." - Calvin.
On Fri, 2006-09-01 at 20:04 +0200, Martin Schwidefsky wrote:
> On Fri, 2006-09-01 at 11:03 -0700, Dave Hansen wrote:
> > OK. It comes down to a race between
> >
> > __remove_from_page_cache()/__delete_from_swap_cache()
> >
> > and
> >
> > __page_discard()
> >
> > running on the same page at the same time. Right?
>
> Yes.
OK, and there's no other workable solution to exclude each other from
running at the same time than a bit in page->flags?
It seems like that hashed lock (or lock in mem_map[]) we were talking
about earlier might be applicable here, too.
Do we ever discard pages other than ones that have just recently failed
to be made stable? There seems to be a lot of
if (!page_make_stable(page) {
...
page_discard(page);
}
Some of these call sites even have mapping->page_lock held when the
page_make_stable() occurs, so they would have _already_ excluded
__remove_from_page_cache(). At which call sites is it not feasible to
acquire mapping->page_lock?
-- Dave
On Fri, 2006-09-01 at 11:23 -0700, Dave Hansen wrote:
> OK, and there's no other workable solution to exclude each other from
> running at the same time than a bit in page->flags?
>
> It seems like that hashed lock (or lock in mem_map[]) we were talking
> about earlier might be applicable here, too.
The indication which page has already been removed from the page cache
by a discard fault is by definition per-page. The situation is different
compared to the one with PG_state_change which is used to protect
critical sections. After the cpu left the critical section the bit can
be clear again. The discard bit cannot be cleared until the page really
has been freed.
--
blue skies,
Martin.
Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
"Reality continues to ruin my life." - Calvin.
On Fri, 2006-09-01 at 20:31 +0200, Martin Schwidefsky wrote:
> On Fri, 2006-09-01 at 11:23 -0700, Dave Hansen wrote:
> > OK, and there's no other workable solution to exclude each other from
> > running at the same time than a bit in page->flags?
> >
> > It seems like that hashed lock (or lock in mem_map[]) we were talking
> > about earlier might be applicable here, too.
>
> The indication which page has already been removed from the page cache
> by a discard fault is by definition per-page.
Right. So having a single bit that was set and cleared wouldn't work
because it could get interpreted incorrectly for multiple pages. But,
what about a lock?
> The situation is different
> compared to the one with PG_state_change which is used to protect
> critical sections. After the cpu left the critical section the bit can
> be clear again. The discard bit cannot be cleared until the page really
> has been freed.
While something like the following wouldn't be scalable, it would
functionally work, right?
+static void __page_discard(struct page *page)
+{
+ spin_lock(discard_lock);
...
+ spin_unlock(discard_lock);
+}
+void __delete_from_swap_cache(struct page *page)
+{
+ spin_lock(discard_lock);
...
+ spin_unlock(discard_lock);
+}
+void __remove_from_page_cache(struct page *page)
+{
+ spin_lock(discard_lock);
...
+ spin_unlock(discard_lock);
+}
-- Dave
On Fri, 2006-09-01 at 11:41 -0700, Dave Hansen wrote:
> While something like the following wouldn't be scalable, it would
> functionally work, right?
>
> +static void __page_discard(struct page *page)
> +{
> + spin_lock(discard_lock);
> ...
> + spin_unlock(discard_lock);
> +}
>
> +void __delete_from_swap_cache(struct page *page)
> +{
> + spin_lock(discard_lock);
> ...
> + spin_unlock(discard_lock);
> +}
>
> +void __remove_from_page_cache(struct page *page)
> +{
> + spin_lock(discard_lock);
> ...
> + spin_unlock(discard_lock);
> +}
Any kind of locking won't work. You need the information that a page has
been discarded until the page has been freed. Only then the fact that
the page has been discarded may enter nirvana. Any kind of lock needs to
be freed again to allow the next discard fault to happen. Since you
don't when the last page reference is returned you cannot hold the lock
until the page is free.
--
blue skies,
Martin.
Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
"Reality continues to ruin my life." - Calvin.
--
VGER BF report: H 0
On Mon, 2006-09-04 at 13:21 +0200, Martin Schwidefsky wrote:
> Any kind of locking won't work. You need the information that a page has
> been discarded until the page has been freed. Only then the fact that
> the page has been discarded may enter nirvana. Any kind of lock needs to
> be freed again to allow the next discard fault to happen. Since you
> don't when the last page reference is returned you cannot hold the lock
> until the page is free.
First of all, you *CAN* sleep with the BKL held. ;)
Why doesn't the normal lock_page() help? It can sleep, too?
As far as simplifying the patches, I feel like some of the
page_make_stable() stuff should be done inside of page_cache_get().
Perhaps the API needs to be changed so that page_cache_get()s can fail.
There are also a ton of "mapping == page->mapping" tests all over.
Perhaps you need a page_still_in_mapping(page, mapping) call that also
checks the page's discard state.
I also have the feeling that every single page_host_discards() check
which is actually placed in the VM code shouldn't be there. The ones in
page_make_stable() and friends are OK, but the ones in
shrink_inactive_list() seem bogus to me. Looks like they should be
covered up in some _other_ function that checks PageDiscarded().
You could even put these things in (what are now) simple functions like
lru_to_page(). The logic would be along the lines of, whenever I am
looking into the LRU, I need to make sure this page is still actually
there.
As for the locking, imagine a seqlock (per-zone, node, section, hash,
anon_vma, mapping, whatever...). A write is taken any time that
PG_discard would have been set, and the page is placed in to a list so
that it can be found (the data structure isn't important now). All of
the places that currently check PG_discard would go and take a read on
the seqlock. If they fail to acquire it (what is normally now a loop),
they would go look in the list to see if the page they are interested in
is there. If it is, then they treat it as dicarded, otherwise they
proceed normally. So, the operation is normally very cheap (a
non-atomic read). It is very expensive _during_ a discard because of
the traversal of the list, but these should be rare.
The structure storing the page could be like this:
struct page_list {
struct list_head list;
struct page *page;
};
So that it doesn't require any extra space in the struct page, and
limits the overhead to only the people actually using the page discard
mechanism.
-- Dave
On Tue, 2006-09-05 at 11:27 -0700, Dave Hansen wrote:
> On Mon, 2006-09-04 at 13:21 +0200, Martin Schwidefsky wrote:
> > Any kind of locking won't work. You need the information that a page has
> > been discarded until the page has been freed. Only then the fact that
> > the page has been discarded may enter nirvana. Any kind of lock needs to
> > be freed again to allow the next discard fault to happen. Since you
> > don't when the last page reference is returned you cannot hold the lock
> > until the page is free.
>
> First of all, you *CAN* sleep with the BKL held. ;)
You can call schedule with the BKL held. But schedule will release the
BKL before the process is put to sleep. To use a lock as mechanism to
prevent double frees of discarded pages you'd have to hold it while
sleeping.
> Why doesn't the normal lock_page() help? It can sleep, too?
The PG_locked bit which is behind lock_page() is used for a different
purpose. Look at the way how e.g. find_lock_page works: the radix tree
lookup is made, a reference is acquired with page_cache_get() and then
__lock_page() is called. For the discard double page free prevention we
need to lock the page, then wait for all referenced to be returned. Only
then can we unlock the page again. These two uses are mutual exclusive.
> As far as simplifying the patches, I feel like some of the
> page_make_stable() stuff should be done inside of page_cache_get().
> Perhaps the API needs to be changed so that page_cache_get()s can fail.
Uhh, that is even more invasive than the things we do now.
page_cache_get() alias get_page() is just concerned about reference
counting. You definitly do not want to do state changes in
get_page/put_page. That would increase the number state changes by an
order of magnitude.
> There are also a ton of "mapping == page->mapping" tests all over.
> Perhaps you need a page_still_in_mapping(page, mapping) call that also
> checks the page's discard state.
Where does this help? All places that have "mapping == page->mapping"
either don't care about the page state or work on stable pages.
> I also have the feeling that every single page_host_discards() check
> which is actually placed in the VM code shouldn't be there. The ones in
> page_make_stable() and friends are OK, but the ones in
> shrink_inactive_list() seem bogus to me. Looks like they should be
> covered up in some _other_ function that checks PageDiscarded().
Hmm, we can reduce the number of page_host_discards() checks in the VM
code if PageDiscarded() and friends are defined conditionally on
CONFIG_PAGE_STATES. I'm not sure if all calls to page_host_discards()
can be removed but I will take a look at it. It would be a real
improvement to get rid of most of these calls.
> You could even put these things in (what are now) simple functions like
> lru_to_page(). The logic would be along the lines of, whenever I am
> looking into the LRU, I need to make sure this page is still actually
> there.
It would make the code simpler but the cost would be additional state
changes. For example whenever you free a page you would do a state
transition from volatile to stable shortly followed by a state
transition from stable to unused. These states changes have a cost that
we want to avoid.
> As for the locking, imagine a seqlock (per-zone, node, section, hash,
> anon_vma, mapping, whatever...). A write is taken any time that
> PG_discard would have been set, and the page is placed in to a list so
> that it can be found (the data structure isn't important now). All of
> the places that currently check PG_discard would go and take a read on
> the seqlock. If they fail to acquire it (what is normally now a loop),
> they would go look in the list to see if the page they are interested in
> is there. If it is, then they treat it as dicarded, otherwise they
> proceed normally. So, the operation is normally very cheap (a
> non-atomic read). It is very expensive _during_ a discard because of
> the traversal of the list, but these should be rare.
>
> The structure storing the page could be like this:
>
> struct page_list {
> struct list_head list;
> struct page *page;
> };
>
> So that it doesn't require any extra space in the struct page, and
> limits the overhead to only the people actually using the page discard
> mechanism.
I'm a little confused about this suggestion. If I understood it
correctly you want to block a number of pages and use a seqlock per
block to indicate if there is a discarded page in that block of pages.
On discard you do write_seqlock and force the read_seqlock side if the
lock could to be acquired to traverse a list before they can conclude
that a page is not discarded.
First question: where are the list_heads stored that are used to put
pages on the special list? They have to be allocated somewhere, per
discarded page. I do not think that you can do it lazy, that is after a
discard fault happened. You have to preallocate them and then you are
spending 16 byte per discardable page instead of 1 bit.
Second question: how do you deal with multiple discards for a block of
pages? The first sets the write_seqlock, how is the write_sequnlock
done? After a discarded page has been freed you need to check if there
is another discarded page in the block, or if a discard is in progress.
This sounds racy.
I do not think that the locking with a seqlock will work. But please
keep throwing ideas at me. The point about page_host_discards is valid.
--
blue skies,
Martin.
Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
"Reality continues to ruin my life." - Calvin.
Martin Schwidefsky wrote:
> The code added by this patch uses the volatile state for all page cache
> pages, even for pages which are referenced by writable ptes. The host
> needs to be able to check the dirty state of the pages. Since the host
> doesn't know where the page table entries of the guest are located,
> the volatile state as introduced by this patch is only usable on
> architectures with per-page dirty bits (s390 only). For per-pte dirty
> bit architectures some additional code is needed.
>
What do you mean by per-page dirty bits. Do you mean per-page dirty
bits in hardware (s390), in software (Linux), or maintained by the
hypervisor?
> The interesting question is where to put the state transitions between
> the volatile and the stable state. The simple solution is the make a
> page stable whenever a lookup is done or a page reference is derived
> from a page table entry. Attempts to make pages volatile are added at
> strategic points. Now what are the conditions that prevent a page from
> being made volatile? There are 10 conditions:
> 1) The page is reserved. Some sort of special page.
> 2) The page is marked dirty in the struct page. The page content is
> more recent than the data on the backing device. The host cannot
> access the linux internal dirty bit so the page needs to be stable.
> 3) The page is in writeback. The page content is needed for i/o.
> 4) The page is locked. Someone has exclusive access to the page.
>
I'll tend to agree from a heuristical performance point of view, but I
don't see any reason that a locked page can't be made volatile from a
correctness perspective.
> 5) The page is anonymous. Swap cache support needs additional code.
> 6) The page has no mapping. No backing, the page cannot be recreated.
> 7) The page is not uptodate.
> 8) The page has private information. try_to_release_page can fail,
> e.g. in case the private information is journaling data. The discard
> fault need to be able to remove the page.
> 9) The page is already discarded.
> 10) The page map count is not equal to the page reference count - 1.
> The discard fault handler can remove the page cache reference and
> all mappers of a page. It cannot remove the page reference for any
> other user of the page.
>
Does s390 use per physical page permission bits separate from the PTEs?
Because I don't see how you can generate a discard fault otherwise
unless you know where the page table entries of the guest are located,
which you already said you don't. Or perhaps I'm misunderstanding the
meaning of discard fault - I'm taking it to mean a fault which happens
on access to a volatile page that was discarded by the hypervisor - thus
requiring a refresh of all mapping PTEs?
Thanks,
Zach
On Wed, 2006-09-13 at 11:21 -0700, Zachary Amsden wrote:
> Martin Schwidefsky wrote:
> > The code added by this patch uses the volatile state for all page cache
> > pages, even for pages which are referenced by writable ptes. The host
> > needs to be able to check the dirty state of the pages. Since the host
> > doesn't know where the page table entries of the guest are located,
> > the volatile state as introduced by this patch is only usable on
> > architectures with per-page dirty bits (s390 only). For per-pte dirty
> > bit architectures some additional code is needed.
> >
>
> What do you mean by per-page dirty bits. Do you mean per-page dirty
> bits in hardware (s390), in software (Linux), or maintained by the
> hypervisor?
s390 has something called a storage key. There is one for each page. In
a virtualized system there is one for each virtual page. The dirty and
referenced bit for each page is kept in the storage key and not in the
pte. On s390 the pte has nothing to do with dirty/referenced.
If your hardware allows you to have dirty bits per page instead of per
pte then you can query the dirty bit from the host without needing
access to all the ptes.
> > The interesting question is where to put the state transitions between
> > the volatile and the stable state. The simple solution is the make a
> > page stable whenever a lookup is done or a page reference is derived
> > from a page table entry. Attempts to make pages volatile are added at
> > strategic points. Now what are the conditions that prevent a page from
> > being made volatile? There are 10 conditions:
> > 1) The page is reserved. Some sort of special page.
> > 2) The page is marked dirty in the struct page. The page content is
> > more recent than the data on the backing device. The host cannot
> > access the linux internal dirty bit so the page needs to be stable.
> > 3) The page is in writeback. The page content is needed for i/o.
> > 4) The page is locked. Someone has exclusive access to the page.
> >
>
> I'll tend to agree from a heuristical performance point of view, but I
> don't see any reason that a locked page can't be made volatile from a
> correctness perspective.
Interesting. Usually there is an additional page reference for a locked
page so it probably won't make a difference. I'm not sure if all "users"
of locked pages can deal with volatile pages, I'll have to check.
> > 5) The page is anonymous. Swap cache support needs additional code.
> > 6) The page has no mapping. No backing, the page cannot be recreated.
> > 7) The page is not uptodate.
> > 8) The page has private information. try_to_release_page can fail,
> > e.g. in case the private information is journaling data. The discard
> > fault need to be able to remove the page.
> > 9) The page is already discarded.
> > 10) The page map count is not equal to the page reference count - 1.
> > The discard fault handler can remove the page cache reference and
> > all mappers of a page. It cannot remove the page reference for any
> > other user of the page.
> >
>
> Does s390 use per physical page permission bits separate from the PTEs?
Yes.
> Because I don't see how you can generate a discard fault otherwise
> unless you know where the page table entries of the guest are located,
> which you already said you don't. Or perhaps I'm misunderstanding the
> meaning of discard fault - I'm taking it to mean a fault which happens
> on access to a volatile page that was discarded by the hypervisor - thus
> requiring a refresh of all mapping PTEs?
The discard fault happens on access to a volatile that has been
discarded. An important property of the s390 architecture comes into
play here: there are two page tables, a guest page table and a host page
table. What the guest perceives as its "physical" memory is in virtual
storage for the host. An address resolution has to walk two pages
tables, if a pte is invalid in either table you get a fault. A guest
fault if the invalid pte is in the guest table and a host fault if it is
in the host table. That gives s390 a simple method to implement
discarded pages: the hypervisor just unmaps the page from the host table
and changes the state of the guest page. I can see that you will have a
much harder time to implement this on i386.
--
blue skies,
Martin.
Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
"Reality continues to ruin my life." - Calvin.
Martin Schwidefsky wrote
> The discard fault happens on access to a volatile that has been
> discarded. An important property of the s390 architecture comes into
> play here: there are two page tables, a guest page table and a host page
> table. What the guest perceives as its "physical" memory is in virtual
> storage for the host. An address resolution has to walk two pages
> tables, if a pte is invalid in either table you get a fault. A guest
> fault if the invalid pte is in the guest table and a host fault if it is
>
Yes, I'm familiar with that trick. Wasn't sure if you had it in
hardware or not.
> in the host table. That gives s390 a simple method to implement
> discarded pages: the hypervisor just unmaps the page from the host table
> and changes the state of the guest page. I can see that you will have a
> much harder time to implement this on i386.
>
Nah, I think we'll do just fine.
Thanks for the info - based on this, I think we can probably use the
volatile page / swap cache changes as well for VMware, also pretty much
unchanged. Sorry to take so long to look at these patches, BTW - I was
on holiday for two weeks.
Zach
On Thu, 2006-09-14 at 02:23 -0700, Zachary Amsden wrote:
> Martin Schwidefsky wrote
> > The discard fault happens on access to a volatile that has been
> > discarded. An important property of the s390 architecture comes into
> > play here: there are two page tables, a guest page table and a host page
> > table. What the guest perceives as its "physical" memory is in virtual
> > storage for the host. An address resolution has to walk two pages
> > tables, if a pte is invalid in either table you get a fault. A guest
> > fault if the invalid pte is in the guest table and a host fault if it is
> >
>
> Yes, I'm familiar with that trick. Wasn't sure if you had it in
> hardware or not.
The mainframes have that trick in hardware for about 30 years ..
> > in the host table. That gives s390 a simple method to implement
> > discarded pages: the hypervisor just unmaps the page from the host table
> > and changes the state of the guest page. I can see that you will have a
> > much harder time to implement this on i386.
> >
>
> Nah, I think we'll do just fine.
I wonder which trick you use, since there is only one page table one
i386 I can only imagine that you are tracking all page tables of the
guest.
> Thanks for the info - based on this, I think we can probably use the
> volatile page / swap cache changes as well for VMware, also pretty much
> unchanged. Sorry to take so long to look at these patches, BTW - I was
> on holiday for two weeks.
I've been sitting on these patches for months now and you are worrying
about two weeks..
--
blue skies,
Martin.
Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
"Reality continues to ruin my life." - Calvin.
In-Reply-To: <1158309400.23993.9.camel@localhost>
On Fri, 15 Sep 2006 10:36:39 +0200, Martin Schwidefsky wrote:
> I wonder which trick you use, since there is only one page table one
> i386 I can only imagine that you are tracking all page tables of the
> guest.
AMD K8 with the SVM feature has host and guest page tables and
address-space identifiers for the guests so their global TLB flushes
can be limited to their own address space...
--
Chuck
On Fri, 2006-09-15 at 13:50 -0400, Chuck Ebbert wrote:
> > I wonder which trick you use, since there is only one page table one
> > i386 I can only imagine that you are tracking all page tables of the
> > guest.
>
> AMD K8 with the SVM feature has host and guest page tables and
> address-space identifiers for the guests so their global TLB flushes
> can be limited to their own address space...
Yeah, I know about the nested page tables on K8. But VMware will not
restrict themselves to K8 only, so they need a clever way to implement
shadow page tables on i386 chips without this feature.
--
blue skies,
Martin.
Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
"Reality continues to ruin my life." - Calvin.