From: Martin Schwidefsky <[email protected]>
From: Hubertus Franke <[email protected]>
From: Himanshu Raj
s390 uses the milli-coded ESSA instruction to set the page state. Each
page state combines one of four guest page states, called block usage
states, with one of three host page states, called block content states.
The guest states are:
- stable (S): there is essential content in the page
- unused (U): there is no useful content and any access to the page will
cause an addressing exception
- volatile (V): there is useful content in the page. The host system is
allowed to discard the content anytime, but has to deliver a discard
fault with the absolute address of the page if the guest tries to
access it.
- potential volatile (P): the page has useful content. The host system
is allowed to discard the content after it has checked the dirty bit
of the page. It has to deliver a discard fault with the absolute
address of the page if the guest tries to access it.
The host states are:
- resident: the page is present in real memory.
- preserved: the page is not present in real memory but the content is
preserved elsewhere by the machine, e.g. on the paging device.
- zero: the page is not present in real memory. The content of the page
is logically-zero.
There are 12 combinations of guest and host state, but currently only 8
of them are valid page states:
Sr: a stable, resident page.
Sp: a stable, preserved page.
Sz: a stable, logically zero page. A page filled with zeroes will be
allocated on first access.
Ur: an unused but resident page. The host could make it Uz anytime but
it doesn't have to.
Uz: an unused, logically zero page.
Vr: a volatile, resident page. The guest can access it normally.
Vz: a volatile, logically zero page. This is a discarded page. The host
will deliver a discard fault for any access to the page.
Pr: a potential volatile, resident page. The guest can access it normally.
The remaining 4 combinations can't occur:
Up: an unused, preserved page. If the host tries to get rid of a Ur page
it will remove it without writing the page content to disk and set
the page to Uz.
Vp: a volatile, preserved page. If the host picks a Vr page for eviction
it will discard it and set the page state to Vz.
Pp: a potential volatile, preserved page. There are two cases for page out:
1) if the page is dirty then the host will preserve the page and set
it to Sp or 2) if the page is clean then the host will discard it and
set the page state to Vz.
Pz: a potential volatile, logically zero page. The host system will always
use Vz instead of Pz.
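The list of valid and impossible combinations above can be written down as
a small lookup table. This is only an illustrative userspace sketch, not
part of the patch; the enum names, valid_state[], page_state_is_valid()
and count_valid_states() are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Guest-controlled block usage states (hypothetical names). */
enum usage_state {
	USAGE_STABLE, USAGE_UNUSED, USAGE_VOLATILE, USAGE_PVOLATILE,
	USAGE_COUNT
};

/* Host-controlled block content states (hypothetical names). */
enum content_state {
	CONTENT_RESIDENT, CONTENT_PRESERVED, CONTENT_ZERO,
	CONTENT_COUNT
};

/* Rows: S, U, V, P.  Columns: r, p, z. */
static const bool valid_state[USAGE_COUNT][CONTENT_COUNT] = {
	[USAGE_STABLE]    = { true, true,  true  },	/* Sr Sp Sz */
	[USAGE_UNUSED]    = { true, false, true  },	/* Ur -- Uz */
	[USAGE_VOLATILE]  = { true, false, true  },	/* Vr -- Vz */
	[USAGE_PVOLATILE] = { true, false, false },	/* Pr -- -- */
};

/* Returns true if the combination is one of the valid page states. */
static bool page_state_is_valid(enum usage_state u, enum content_state c)
{
	assert(u < USAGE_COUNT && c < CONTENT_COUNT);
	return valid_state[u][c];
}

/* Counts the entries marked valid in the table. */
static int count_valid_states(void)
{
	int n = 0;

	for (int u = 0; u < USAGE_COUNT; u++)
		for (int c = 0; c < CONTENT_COUNT; c++)
			n += valid_state[u][c];
	return n;
}
```

The table has 8 entries set, matching the 8 valid states listed above; the
remaining 4 entries are the Up, Vp, Pp and Pz combinations.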
The state transitions (a diagram would be nicer but that is too hard
to do in ascii art...):
{Ur,Sr,Vr,Pr}: a resident page will change its block usage state if the
guest requests it with page_set_{unused,stable,volatile}.
{Uz,Sz,Vz}: a logically zero page will change its block usage state if the
guest requests it with page_set_{unused,stable,volatile}. The
guest can't create the Pz state, the state will be Vz instead.
Ur -> Uz: the host system can remove an unused, resident page from memory
Sz -> Sr: on first access a stable, logically zero page will become resident
Sr -> Sp: the host system can swap a stable page to disk
Sp -> Sr: a guest access to a Sp page forces the host to retrieve it
Vr -> Vz: the host can discard a volatile page
Sp -> Uz: a page preserved by the host will be removed if the guest sets
the block usage state to unused.
Sp -> Vz: a page preserved by the host will be discarded if the guest sets
the block usage state to volatile.
Pr -> Sp: the host can move a page from Pr to Sp if it discovers that the
page is dirty while trying to discard the page. The page content is
written to the paging device.
Pr -> Vz: the host can discard a Pr page. The Pz state is replaced by the
Vz state.
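In lieu of a diagram, the transitions listed above can be modelled as two
small functions, one for guest requests and one for host reclaim. This is
a rough userspace sketch under assumptions, not the s390 implementation:
struct page_state, guest_set_usage() and host_evict() are hypothetical
names, and the handling of a potential-volatile request on an Sp page
(mapped to Vz here) is an assumption that the description does not spell
out.

```c
#include <assert.h>
#include <stdbool.h>

enum usage { U_STABLE, U_UNUSED, U_VOLATILE, U_PVOLATILE };
enum content { C_RESIDENT, C_PRESERVED, C_ZERO };

struct page_state {
	enum usage usage;
	enum content content;
};

/* Guest request; new_usage is the requested block usage state. */
static struct page_state guest_set_usage(struct page_state s,
					 enum usage new_usage)
{
	if (s.content == C_PRESERVED) {
		/* Sp is the only valid preserved state. */
		assert(s.usage == U_STABLE);
		if (new_usage != U_STABLE)
			s.content = C_ZERO;	/* Sp -> Uz, Sp -> Vz */
	}
	s.usage = new_usage;
	/* The guest can't create Pz, the state will be Vz instead. */
	if (s.content == C_ZERO && s.usage == U_PVOLATILE)
		s.usage = U_VOLATILE;
	return s;
}

/* Host reclaims the page frame; dirty models the hardware dirty bit. */
static struct page_state host_evict(struct page_state s, bool dirty)
{
	if (s.content != C_RESIDENT)
		return s;			/* nothing to reclaim */
	switch (s.usage) {
	case U_STABLE:
		s.content = C_PRESERVED;	/* Sr -> Sp */
		break;
	case U_UNUSED:
	case U_VOLATILE:
		s.content = C_ZERO;		/* Ur -> Uz, Vr -> Vz */
		break;
	case U_PVOLATILE:
		if (dirty) {
			s.usage = U_STABLE;	/* Pr -> Sp */
			s.content = C_PRESERVED;
		} else {
			s.usage = U_VOLATILE;	/* Pr -> Vz */
			s.content = C_ZERO;
		}
		break;
	}
	return s;
}
```

Guest accesses (Sz -> Sr, Sp -> Sr, and the discard fault on Vz) are left
out of the sketch to keep it short.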
There are some hazards the code has to deal with:
1) For potential volatile pages the transfer of the hardware dirty bit to
the software dirty bit needs to make sure that the page gets into the
stable state before the hardware dirty bit is cleared. Between the
page_test_dirty and the page_clear_dirty call a page_make_stable is
required.
2) Since the access of unused pages causes addressing exceptions we need
to take care with /dev/mem. The copy_{from,to}_user functions need to
be able to cope with addressing exceptions for the kernel address space.
3) The discard fault on a s390 machine delivers the absolute address of
the page that caused the fault instead of the virtual address. With the
virtual address we could have used the page table entry of the current
process to safely get a reference to the discarded page. We can get to
the struct page from the absolute page address but it is rather hard to
get to a proper page reference. The page that caused the fault could
already have been freed and reused for a different purpose. None of the
fields in the struct page would be reliable to use. The freeing of
discarded pages therefore has to be postponed until all pending discard
faults for this page have been dealt with. The discard fault handler
is called disabled for interrupts and tries to get a page reference
with get_page_unless_zero. A discarded page is only freed after all
cpus have been enabled for interrupts at least once since the detection
of the discarded page. This is done using the timer interrupts and the
cpu-idle notifier.
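The deferred freeing described in hazard 3 can be illustrated with a toy
userspace model that counts pages instead of linking them on lists. It is
a simplified sketch, not the code from the patch: struct discard_model,
model_queue_page() and model_tick() are invented names, and the nohz cpu
mask handling is deliberately left out.

```c
#include <assert.h>

#define NR_CPUS 4

struct discard_model {
	unsigned int online;		/* bitmask of online cpus */
	unsigned int signoff_mask;	/* cpus that still have to sign off */
	int per_cpu[NR_CPUS];		/* discarded pages queued per cpu */
	int gathered;			/* pages waiting for the next round */
	int signoff;			/* pages in the current signoff round */
	int freed;			/* pages that have been freed */
};

/* A discarded page is about to be freed on the given cpu: queue it. */
static void model_queue_page(struct discard_model *m, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	m->per_cpu[cpu]++;
}

/* Called from the timer tick or the idle notifier of one cpu. */
static void model_tick(struct discard_model *m, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);

	/* Stage 1: move the per-cpu pages to the global gather list. */
	m->gathered += m->per_cpu[cpu];
	m->per_cpu[cpu] = 0;

	/* This cpu has been in enabled code, so it signs off. */
	m->signoff_mask &= ~(1u << cpu);
	if (m->signoff_mask)
		return;

	/* Stage 3: every cpu signed off, free the pages of this round. */
	m->freed += m->signoff;

	/* Stage 2: start a new round with the gathered pages. */
	m->signoff = m->gathered;
	m->gathered = 0;
	if (m->signoff) {
		m->signoff_mask = m->online & ~(1u << cpu);
		if (!m->signoff_mask) {
			/* No other cpu online, nothing to wait for. */
			m->freed += m->signoff;
			m->signoff = 0;
		}
	}
}
```

With two online cpus, a page queued on cpu 0 is not freed by ticks on
cpu 0 alone; it is freed only once cpu 1 has ticked as well.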
Signed-off-by: Martin Schwidefsky <[email protected]>
---
arch/s390/Kconfig | 3
arch/s390/kernel/time.c | 11 ++
arch/s390/kernel/traps.c | 4
arch/s390/lib/uaccess_mvcos.c | 10 +
arch/s390/lib/uaccess_std.c | 7 -
arch/s390/mm/fault.c | 210 +++++++++++++++++++++++++++++++++++++++++
include/asm-s390/page-states.h | 117 ++++++++++++++++++++++
mm/rmap.c | 9 +
8 files changed, 364 insertions(+), 7 deletions(-)
Index: linux-2.6/arch/s390/Kconfig
===================================================================
--- linux-2.6.orig/arch/s390/Kconfig
+++ linux-2.6/arch/s390/Kconfig
@@ -411,6 +411,9 @@ config CMM_IUCV
Select this option to enable the special message interface to
the cooperative memory management.
+config PAGE_STATES
+ bool "Enable support for guest page hinting."
+
config VIRT_TIMER
bool "Virtual CPU timer support"
help
Index: linux-2.6/arch/s390/kernel/time.c
===================================================================
--- linux-2.6.orig/arch/s390/kernel/time.c
+++ linux-2.6/arch/s390/kernel/time.c
@@ -30,6 +30,7 @@
#include <linux/timex.h>
#include <linux/notifier.h>
#include <linux/clocksource.h>
+#include <linux/page-states.h>
#include <asm/uaccess.h>
#include <asm/delay.h>
@@ -222,6 +223,9 @@ static int nohz_idle_notify(struct notif
switch (action) {
case S390_CPU_IDLE:
stop_hz_timer();
+#ifdef CONFIG_PAGE_STATES
+ page_shrink_discard_list();
+#endif
break;
case S390_CPU_NOT_IDLE:
start_hz_timer();
@@ -270,6 +274,9 @@ void init_cpu_timer(void)
static void clock_comparator_interrupt(__u16 code)
{
+#ifdef CONFIG_PAGE_STATES
+ page_shrink_discard_list();
+#endif
/* set clock comparator for next tick */
set_clock_comparator(S390_lowcore.jiffy_timer + CPU_DEVIATION);
}
@@ -349,6 +356,10 @@ void __init time_init(void)
#ifdef CONFIG_VIRT_TIMER
vtime_init();
#endif
+
+#ifdef CONFIG_PAGE_STATES
+ page_discard_init();
+#endif
}
/*
Index: linux-2.6/arch/s390/kernel/traps.c
===================================================================
--- linux-2.6.orig/arch/s390/kernel/traps.c
+++ linux-2.6/arch/s390/kernel/traps.c
@@ -61,6 +61,7 @@ extern pgm_check_handler_t do_protection
extern pgm_check_handler_t do_dat_exception;
extern pgm_check_handler_t do_monitor_call;
extern pgm_check_handler_t do_asce_exception;
+extern pgm_check_handler_t do_discard_fault;
#define stack_pointer ({ void **sp; asm("la %0,0(15)" : "=&d" (sp)); sp; })
@@ -740,5 +741,8 @@ void __init trap_init(void)
pgm_check_table[0x1C] = &space_switch_exception;
pgm_check_table[0x1D] = &hfp_sqrt_exception;
pgm_check_table[0x40] = &do_monitor_call;
+#ifdef CONFIG_PAGE_STATES
+ pgm_check_table[0x1a] = &do_discard_fault;
+#endif
pfault_irq_init();
}
Index: linux-2.6/arch/s390/lib/uaccess_mvcos.c
===================================================================
--- linux-2.6.orig/arch/s390/lib/uaccess_mvcos.c
+++ linux-2.6/arch/s390/lib/uaccess_mvcos.c
@@ -36,7 +36,7 @@ static size_t copy_from_user_mvcos(size_
tmp1 = -4096UL;
asm volatile(
"0: .insn ss,0xc80000000000,0(%0,%2),0(%1),0\n"
- " jz 7f\n"
+ "10:jz 7f\n"
"1:"ALR" %0,%3\n"
" "SLR" %1,%3\n"
" "SLR" %2,%3\n"
@@ -47,7 +47,7 @@ static size_t copy_from_user_mvcos(size_
" "CLR" %0,%4\n" /* copy crosses next page boundary? */
" jnh 4f\n"
"3: .insn ss,0xc80000000000,0(%4,%2),0(%1),0\n"
- " "SLR" %0,%4\n"
+ "11:"SLR" %0,%4\n"
" "ALR" %2,%4\n"
"4:"LHI" %4,-1\n"
" "ALR" %4,%0\n" /* copy remaining size, subtract 1 */
@@ -62,6 +62,7 @@ static size_t copy_from_user_mvcos(size_
"7:"SLR" %0,%0\n"
"8: \n"
EX_TABLE(0b,2b) EX_TABLE(3b,4b)
+ EX_TABLE(10b,8b) EX_TABLE(11b,8b)
: "+a" (size), "+a" (ptr), "+a" (x), "+a" (tmp1), "=a" (tmp2)
: "d" (reg0) : "cc", "memory");
return size;
@@ -82,7 +83,7 @@ static size_t copy_to_user_mvcos(size_t
tmp1 = -4096UL;
asm volatile(
"0: .insn ss,0xc80000000000,0(%0,%1),0(%2),0\n"
- " jz 4f\n"
+ "6: jz 4f\n"
"1:"ALR" %0,%3\n"
" "SLR" %1,%3\n"
" "SLR" %2,%3\n"
@@ -93,11 +94,12 @@ static size_t copy_to_user_mvcos(size_t
" "CLR" %0,%4\n" /* copy crosses next page boundary? */
" jnh 5f\n"
"3: .insn ss,0xc80000000000,0(%4,%1),0(%2),0\n"
- " "SLR" %0,%4\n"
+ "7:"SLR" %0,%4\n"
" j 5f\n"
"4:"SLR" %0,%0\n"
"5: \n"
EX_TABLE(0b,2b) EX_TABLE(3b,5b)
+ EX_TABLE(6b,5b) EX_TABLE(7b,5b)
: "+a" (size), "+a" (ptr), "+a" (x), "+a" (tmp1), "=a" (tmp2)
: "d" (reg0) : "cc", "memory");
return size;
Index: linux-2.6/arch/s390/lib/uaccess_std.c
===================================================================
--- linux-2.6.orig/arch/s390/lib/uaccess_std.c
+++ linux-2.6/arch/s390/lib/uaccess_std.c
@@ -36,12 +36,12 @@ size_t copy_from_user_std(size_t size, c
tmp1 = -256UL;
asm volatile(
"0: mvcp 0(%0,%2),0(%1),%3\n"
- " jz 8f\n"
+ "10:jz 8f\n"
"1:"ALR" %0,%3\n"
" la %1,256(%1)\n"
" la %2,256(%2)\n"
"2: mvcp 0(%0,%2),0(%1),%3\n"
- " jnz 1b\n"
+ "11:jnz 1b\n"
" j 8f\n"
"3: la %4,255(%1)\n" /* %4 = ptr + 255 */
" "LHI" %3,-4096\n"
@@ -50,7 +50,7 @@ size_t copy_from_user_std(size_t size, c
" "CLR" %0,%4\n" /* copy crosses next page boundary? */
" jnh 5f\n"
"4: mvcp 0(%4,%2),0(%1),%3\n"
- " "SLR" %0,%4\n"
+ "12:"SLR" %0,%4\n"
" "ALR" %2,%4\n"
"5:"LHI" %4,-1\n"
" "ALR" %4,%0\n" /* copy remaining size, subtract 1 */
@@ -65,6 +65,7 @@ size_t copy_from_user_std(size_t size, c
"8:"SLR" %0,%0\n"
"9: \n"
EX_TABLE(0b,3b) EX_TABLE(2b,3b) EX_TABLE(4b,5b)
+ EX_TABLE(10b,9b) EX_TABLE(11b,9b) EX_TABLE(12b,9b)
: "+a" (size), "+a" (ptr), "+a" (x), "+a" (tmp1), "=a" (tmp2)
: : "cc", "memory");
return size;
Index: linux-2.6/arch/s390/mm/fault.c
===================================================================
--- linux-2.6.orig/arch/s390/mm/fault.c
+++ linux-2.6/arch/s390/mm/fault.c
@@ -19,6 +19,8 @@
#include <linux/ptrace.h>
#include <linux/mman.h>
#include <linux/mm.h>
+#include <linux/pagemap.h>
+#include <linux/cpu.h>
#include <linux/smp.h>
#include <linux/kdebug.h>
#include <linux/smp_lock.h>
@@ -28,11 +30,13 @@
#include <linux/hardirq.h>
#include <linux/kprobes.h>
#include <linux/uaccess.h>
+#include <linux/page-states.h>
#include <asm/system.h>
#include <asm/pgtable.h>
#include <asm/s390_ext.h>
#include <asm/mmu_context.h>
+#include <asm/io.h>
#ifndef CONFIG_64BIT
#define __FAIL_ADDR_MASK 0x7ffff000
@@ -615,4 +619,210 @@ void __init pfault_irq_init(void)
unregister_early_external_interrupt(0x2603, pfault_interrupt,
&ext_int_pfault);
}
+
+#endif
+
+#ifdef CONFIG_PAGE_STATES
+
+int cmma_flag;
+
+static inline int machine_has_essa(void)
+{
+ register unsigned long tmp asm("0") = 0;
+ register int rc asm("1") = 0;
+ asm volatile(
+ " .insn rrf,0xb9ab0000,%1,%1,0,0\n"
+ "0: la %0,1\n"
+ "1:\n"
+ EX_TABLE(0b,1b)
+ : "+&d" (rc), "+&d" (tmp));
+ return rc;
+}
+
+static int __init cmma(char *str)
+{
+ char *parm;
+
+ parm = strstrip(str);
+ if (strcmp(parm, "yes") == 0 || strcmp(parm, "on") == 0) {
+ cmma_flag = machine_has_essa();
+ return 1;
+ }
+ if (strcmp(parm, "no") == 0 || strcmp(parm, "off") == 0) {
+ cmma_flag = 0;
+ return 1;
+ }
+ return 0;
+}
+
+__setup("cmma=", cmma);
+
+static inline void fixup_user_copy(struct pt_regs *regs,
+ unsigned long address, unsigned short rx)
+{
+ const struct exception_table_entry *fixup;
+ unsigned long kaddr;
+
+ kaddr = (regs->gprs[rx >> 12] + (rx & 0xfff)) & __FAIL_ADDR_MASK;
+ if (virt_to_phys((void *) kaddr) != address)
+ return;
+
+ fixup = search_exception_tables(regs->psw.addr & PSW_ADDR_INSN);
+ if (fixup)
+ regs->psw.addr = fixup->fixup | PSW_ADDR_AMODE;
+ else
+ die("discard fault", regs, SIGSEGV);
+}
+
+/*
+ * Discarded pages with a page_count() of zero are placed on
+ * the page_discarded_list until all cpus have been at
+ * least once in enabled code. That closes the race of page
+ * free vs. discard faults.
+ */
+void do_discard_fault(struct pt_regs *regs, unsigned long error_code)
+{
+ unsigned long address;
+ struct page *page;
+
+ /*
+ * get the real address that caused the block validity
+ * exception.
+ */
+ address = S390_lowcore.trans_exc_code & __FAIL_ADDR_MASK;
+ page = pfn_to_page(address >> PAGE_SHIFT);
+
+ /*
+ * Check for the special case of a discard fault in
+ * copy_{from,to}_user. User copy is done using one of
+ * three special instructions: mvcp, mvcs or mvcos.
+ */
+ if (!(regs->psw.mask & PSW_MASK_PSTATE)) {
+ switch (*(unsigned char *) regs->psw.addr) {
+ case 0xda: /* mvcp */
+ fixup_user_copy(regs, address,
+ *(__u16 *)(regs->psw.addr + 2));
+ break;
+ case 0xdb: /* mvcs */
+ fixup_user_copy(regs, address,
+ *(__u16 *)(regs->psw.addr + 4));
+ break;
+ case 0xc8: /* mvcos */
+ if (regs->gprs[0] == 0x81)
+ fixup_user_copy(regs, address,
+ *(__u16*)(regs->psw.addr + 2));
+ else if (regs->gprs[0] == 0x810000)
+ fixup_user_copy(regs, address,
+ *(__u16*)(regs->psw.addr + 4));
+ break;
+ default:
+ break;
+ }
+ }
+
+ if (likely(get_page_unless_zero(page))) {
+ local_irq_enable();
+ page_discard(page);
+ }
+}
+
+static DEFINE_PER_CPU(struct list_head, page_discard_list);
+static struct list_head page_gather_list = LIST_HEAD_INIT(page_gather_list);
+static struct list_head page_signoff_list = LIST_HEAD_INIT(page_signoff_list);
+static cpumask_t page_signoff_cpumask = CPU_MASK_NONE;
+static DEFINE_SPINLOCK(page_discard_lock);
+
+/*
+ * page_free_discarded
+ *
+ * free_hot_cold_page calls this function if it is about to free a
+ * page that has PG_discarded set. Since there might be pending
+ * discard faults on other cpus on s390 we have to postpone the
+ * freeing of the page until each cpu has "signed-off" the page.
+ *
+ * returns 1 to stop free_hot_cold_page from freeing the page.
+ */
+int page_free_discarded(struct page *page)
+{
+ local_irq_disable();
+ list_add_tail(&page->lru, &__get_cpu_var(page_discard_list));
+ local_irq_enable();
+ return 1;
+}
+
+/*
+ * page_shrink_discard_list
+ *
+ * This function is called from the timer tick for an active cpu or
+ * from the idle notifier. It frees discarded pages in three stages.
+ * In the first stage it moves the pages from the per-cpu discard
+ * list to a global list. From the global list the pages are moved
+ * to the signoff list in a second step. The third step is to free
+ * the pages after all cpus acknowledged the signoff. That prevents
+ * a page from being freed while a cpu still has a pending discard
+ * fault for the page.
+ */
+void page_shrink_discard_list(void)
+{
+ struct list_head *cpu_list = &__get_cpu_var(page_discard_list);
+ struct list_head free_list = LIST_HEAD_INIT(free_list);
+ struct page *page, *next;
+ int cpu = smp_processor_id();
+
+ if (list_empty(cpu_list) && !cpu_isset(cpu, page_signoff_cpumask))
+ return;
+ spin_lock(&page_discard_lock);
+ if (!list_empty(cpu_list))
+ list_splice_init(cpu_list, &page_gather_list);
+ cpu_clear(cpu, page_signoff_cpumask);
+ if (cpus_empty(page_signoff_cpumask)) {
+ list_splice_init(&page_signoff_list, &free_list);
+ list_splice_init(&page_gather_list, &page_signoff_list);
+ if (!list_empty(&page_signoff_list)) {
+ /* Take care of the nohz race.. */
+ page_signoff_cpumask = cpu_online_map;
+ smp_wmb();
+ cpus_andnot(page_signoff_cpumask,
+ page_signoff_cpumask, nohz_cpu_mask);
+ cpu_clear(cpu, page_signoff_cpumask);
+ if (cpus_empty(page_signoff_cpumask))
+ list_splice_init(&page_signoff_list,
+ &free_list);
+ }
+ }
+ spin_unlock(&page_discard_lock);
+ list_for_each_entry_safe(page, next, &free_list, lru) {
+ ClearPageDiscarded(page);
+ free_cold_page(page);
+ }
+}
+
+static int page_discard_cpu_notify(struct notifier_block *self,
+ unsigned long action, void *hcpu)
+{
+ int cpu = (unsigned long) hcpu;
+
+ if (action == CPU_DEAD) {
+ local_irq_disable();
+ list_splice_init(&per_cpu(page_discard_list, cpu),
+ &__get_cpu_var(page_discard_list));
+ local_irq_enable();
+ }
+ return NOTIFY_OK;
+}
+
+static struct notifier_block page_discard_cpu_notifier = {
+ .notifier_call = page_discard_cpu_notify,
+};
+
+void __init page_discard_init(void)
+{
+ int i;
+
+ for_each_possible_cpu(i)
+ INIT_LIST_HEAD(&per_cpu(page_discard_list, i));
+ if (register_cpu_notifier(&page_discard_cpu_notifier))
+ panic("Couldn't register page discard cpu notifier");
+}
+
#endif
Index: linux-2.6/include/asm-s390/page-states.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-s390/page-states.h
@@ -0,0 +1,117 @@
+#ifndef _ASM_S390_PAGE_STATES_H
+#define _ASM_S390_PAGE_STATES_H
+
+#define ESSA_GET_STATE 0
+#define ESSA_SET_STABLE 1
+#define ESSA_SET_UNUSED 2
+#define ESSA_SET_VOLATILE 3
+#define ESSA_SET_PVOLATILE 4
+#define ESSA_SET_STABLE_MAKE_RESIDENT 5
+#define ESSA_SET_STABLE_IF_NOT_DISCARDED 6
+
+#define ESSA_USTATE_MASK 0x0c
+#define ESSA_USTATE_STABLE 0x00
+#define ESSA_USTATE_UNUSED 0x04
+#define ESSA_USTATE_PVOLATILE 0x08
+#define ESSA_USTATE_VOLATILE 0x0c
+
+#define ESSA_CSTATE_MASK 0x03
+#define ESSA_CSTATE_RESIDENT 0x00
+#define ESSA_CSTATE_PRESERVED 0x02
+#define ESSA_CSTATE_ZERO 0x03
+
+extern int cmma_flag;
+extern struct page *mem_map;
+
+/*
+ * ESSA <rc-reg>,<page-address-reg>,<command-immediate>
+ */
+#define page_essa(_page,_command) ({ \
+ int _rc; \
+ asm volatile(".insn rrf,0xb9ab0000,%0,%1,%2,0" \
+ : "=&d" (_rc) : "a" (((_page)-mem_map)<<PAGE_SHIFT), \
+ "i" (_command)); \
+ _rc; \
+})
+
+static inline int page_host_discards(void)
+{
+ return cmma_flag;
+}
+
+static inline int page_discarded(struct page *page)
+{
+ int state;
+
+ if (!cmma_flag)
+ return 0;
+ state = page_essa(page, ESSA_GET_STATE);
+ return (state & ESSA_USTATE_MASK) == ESSA_USTATE_VOLATILE &&
+ (state & ESSA_CSTATE_MASK) == ESSA_CSTATE_ZERO;
+}
+
+static inline void page_set_unused(struct page *page, int order)
+{
+ int i;
+
+ if (!cmma_flag)
+ return;
+ for (i = 0; i < (1 << order); i++)
+ page_essa(page + i, ESSA_SET_UNUSED);
+}
+
+static inline void page_set_stable(struct page *page, int order)
+{
+ int i;
+
+ if (!cmma_flag)
+ return;
+ for (i = 0; i < (1 << order); i++)
+ page_essa(page + i, ESSA_SET_STABLE);
+}
+
+static inline void page_set_volatile(struct page *page, int writable)
+{
+ if (!cmma_flag)
+ return;
+ if (writable)
+ page_essa(page, ESSA_SET_PVOLATILE);
+ else
+ page_essa(page, ESSA_SET_VOLATILE);
+}
+
+static inline int page_set_stable_if_present(struct page *page)
+{
+ int rc;
+
+ if (!cmma_flag || PageReserved(page))
+ return 1;
+
+ rc = page_essa(page, ESSA_SET_STABLE_IF_NOT_DISCARDED);
+ return (rc & ESSA_USTATE_MASK) != ESSA_USTATE_VOLATILE ||
+ (rc & ESSA_CSTATE_MASK) != ESSA_CSTATE_ZERO;
+}
+
+/*
+ * Page locking is done with the architecture page bit PG_arch_1.
+ */
+static inline int page_test_set_state_change(struct page *page)
+{
+ return test_and_set_bit(PG_arch_1, &page->flags);
+}
+
+static inline void page_clear_state_change(struct page *page)
+{
+ clear_bit(PG_arch_1, &page->flags);
+}
+
+static inline int page_state_change(struct page *page)
+{
+ return test_bit(PG_arch_1, &page->flags);
+}
+
+int page_free_discarded(struct page *page);
+void page_shrink_discard_list(void);
+void page_discard_init(void);
+
+#endif /* _ASM_S390_PAGE_STATES_H */
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -686,6 +686,15 @@ void page_remove_rmap(struct page *page,
* faster for those pages still in swapcache.
*/
if (page_test_dirty(page)) {
+ int stable = page_make_stable(page);
+ VM_BUG_ON(!stable);
+ /*
+ * We decremented the mapcount so we now have an
+ * extra reference for the page. That prevents
+ * page_make_volatile from making the page
+ * volatile again while the dirty bit is in
+ * transit.
+ */
page_clear_dirty(page);
set_page_dirty(page);
}
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
digraph gph {
Ur -> Sr [ label="page_set_stable" ];
Ur -> Vr [ label="page_set_volatile" ];
Ur -> Ur [ label="page_set_unused" ];
Sr -> Sr [ label="page_set_stable" ];
Sr -> Vr [ label="page_set_volatile" ];
Sr -> Ur [ label="page_set_unused" ];
Vr -> Sr [ label="page_set_stable" ];
Vr -> Vr [ label="page_set_volatile" ];
Vr -> Ur [ label="page_set_unused" ];
Uz -> Sz [ label="page_set_stable" ];
Uz -> Vz [ label="page_set_volatile" ];
Uz -> Uz [ label="page_set_unused" ];
Sz -> Sz [ label="page_set_stable" ];
Sz -> Vz [ label="page_set_volatile" ];
Sz -> Uz [ label="page_set_unused" ];
Vz -> Sz [ label="page_set_stable" ];
Vz -> Vz [ label="page_set_volatile" ];
Vz -> Uz [ label="page_set_unused" ];
Ur -> Uz [ label="host evict" ];
Sz -> Sr [ label="guest write" ];
Sr -> Sp [ label="host swap" ];
Sp -> Sr [ label="guest access" ];
Sp -> Uz [ label="guest discard" ];
Sp -> Vz [ label="page_set_volatile" ];
Pr -> Sp [ label="host discard dirty" ];
Pr -> Vz [ label="host discard clean" ];
}
On Wed, 2008-03-12 at 09:19 -0700, Jeremy Fitzhardinge wrote:
> Martin Schwidefsky wrote:
> > The state transitions (a diagram would be nicer but that is too hard
> > to do in ascii art...):
> > {Ur,Sr,Vr,Pr}: a resident page will change its block usage state if the
> > guest requests it with page_set_{unused,stable,volatile}.
> > {Uz,Sz,Vz}: a logically zero page will change its block usage state if the
> > guest requests it with page_set_{unused,stable,volatile}. The
> > guest can't create the Pz state, the state will be Vz instead.
> > Ur -> Uz: the host system can remove an unused, resident page from memory
> > Sz -> Sr: on first access a stable, logically zero page will become resident
> > Sr -> Sp: the host system can swap a stable page to disk
> > Sp -> Sr: a guest access to a Sp page forces the host to retrieve it
> > Vr -> Vz: the host can discard a volatile page
> > Sp -> Uz: a page preserved by the host will be removed if the guest sets
> > the block usage state to unused.
> > Sp -> Vz: a page preserved by the host will be discarded if the guest sets
> > the block usage state to volatile.
> > Pr -> Sp: the host can move a page from Pr to Sp if it discovers that the
> > page is dirty while trying to discard the page. The page content is
> > written to the paging device.
> > Pr -> Vz: the host can discard a Pr page. The Pz state is replaced by the
> > Vz state.
>
> I created the attached .dot graph based purely on this description. It
> looks reasonable, but I didn't see how a page enters a Pr state.
That is the first block of state transitions: {Ur,Sr,Vr,Pr}
You can go from any of the four states to any of the remaining three.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
digraph gph {
Ur -> Sr [ label="set stable" ];
Ur -> Vr [ label="set volatile" ];
Ur -> Ur [ label="set unused" ];
Ur -> Pr [ label="set stable_if_present" ];
Sr -> Sr [ label="set stable" ];
Sr -> Vr [ label="set volatile" ];
Sr -> Ur [ label="set unused" ];
Sr -> Pr [ label="set stable_if_present" ];
Vr -> Sr [ label="set stable" ];
Vr -> Vr [ label="set volatile" ];
Vr -> Ur [ label="set unused" ];
Vr -> Pr [ label="set stable_if_present" ];
Pr -> Sr [ label="set stable" ];
Pr -> Vr [ label="set volatile" ];
Pr -> Ur [ label="set unused" ];
Pr -> Pr [ label="set stable_if_present" ];
Uz -> Sz [ label="set stable" ];
Uz -> Vz [ label="set volatile" ];
Uz -> Uz [ label="set unused" ];
Sz -> Sz [ label="set stable" ];
Sz -> Vz [ label="set volatile" ];
Sz -> Uz [ label="set unused" ];
Vz -> Sz [ label="set stable" ];
Vz -> Vz [ label="set volatile" ];
Vz -> Uz [ label="set unused" ];
Ur -> Uz [ label="host evict" ];
Sz -> Sr [ label="guest write" ];
Sr -> Sp [ label="host swap" ];
Sp -> Sr [ label="guest access" ];
Sp -> Uz [ label="guest discard" ];
Sp -> Vz [ label="set volatile" ];
Pr -> Sp [ label="host discard dirty" ];
Pr -> Vz [ label="host discard clean" ];
}
On Wed, 2008-03-12 at 09:44 -0700, Jeremy Fitzhardinge wrote:
> Martin Schwidefsky wrote:
> > That is the first block of state transitions: {Ur,Sr,Vr,Pr}
> > You can go from any of the four states to any of the remaining three.
> >
>
> You only mention page_set_{unused,stable,volatile}. Is
> page_set_stable_if_present() the fourth. And shouldn't that be
> "stable_if_clean":
page_set_volatile has a "writable" argument. For writable==0 you get a
Vx page, for writable==1 you get a Px page.
With stable_if_clean you are referring to stable_if_present? If yes, the
answer is that this operation is used to get a page from Vx/Px back to
Sx, but only if the page has not been discarded. The operation will fail
if the page state is Vz/Pz. The dirty bit only matters for the host's
decision to discard the page, i.e. for the state transitions from Vr/Pr
to Vz.
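To make the "fails only if discarded" rule concrete, here is a small
userspace sketch that decodes an ESSA state value. The mask and state
constants are copied from the page-states.h in the patch; the helper
names essa_state_name(), essa_state_is_discarded() and
stable_if_present_succeeded() are hypothetical and only mirror the checks
done in page_discarded() and page_set_stable_if_present().

```c
#include <assert.h>
#include <stdbool.h>

/* Constants as defined in the patch's page-states.h. */
#define ESSA_USTATE_MASK	0x0c
#define ESSA_USTATE_STABLE	0x00
#define ESSA_USTATE_UNUSED	0x04
#define ESSA_USTATE_PVOLATILE	0x08
#define ESSA_USTATE_VOLATILE	0x0c

#define ESSA_CSTATE_MASK	0x03
#define ESSA_CSTATE_RESIDENT	0x00
#define ESSA_CSTATE_PRESERVED	0x02
#define ESSA_CSTATE_ZERO	0x03

/* Writes a two-letter name such as "Sr" or "Vz" into name[]. */
static void essa_state_name(int state, char name[3])
{
	/* Indexed by the usage bits shifted down: 0=S, 1=U, 2=P, 3=V. */
	static const char usage[] = { 'S', 'U', 'P', 'V' };

	assert(name);
	name[0] = usage[(state & ESSA_USTATE_MASK) >> 2];
	switch (state & ESSA_CSTATE_MASK) {
	case ESSA_CSTATE_RESIDENT:
		name[1] = 'r';
		break;
	case ESSA_CSTATE_PRESERVED:
		name[1] = 'p';
		break;
	case ESSA_CSTATE_ZERO:
		name[1] = 'z';
		break;
	default:
		name[1] = '?';	/* content code not defined in the header */
		break;
	}
	name[2] = '\0';
}

/* Same test as page_discarded(): volatile usage and zero content. */
static bool essa_state_is_discarded(int state)
{
	return (state & ESSA_USTATE_MASK) == ESSA_USTATE_VOLATILE &&
	       (state & ESSA_CSTATE_MASK) == ESSA_CSTATE_ZERO;
}

/* Mirrors the return value of page_set_stable_if_present(). */
static bool stable_if_present_succeeded(int rc)
{
	return !essa_state_is_discarded(rc);
}
```

A returned state of Vz is the only case treated as failure; any other
value means the page is still there.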
> - potential volatile (P): the page has useful content. The host system
> is allowed to discard the content after it has checked the dirty bit
> of the page. It has to deliver a discard fault with the absolute
> address of the page if the guest tries to access it.
>
>
> The use of "stable" in the function call and "volatile" in this
> description is a bit confusing. My understanding is that a page in this
> state is either stable or volatile depending on whether it's dirty, which
> makes sense, but it would be good to consistently refer to it in the
> same way.
Your understanding is good, but how can I make this less confusing? A Px
page that is dirty may not be discarded, which makes it basically stable.
The guest state is still potential volatile though, as it does not have a
state of Sx.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
digraph gph {
/* Guest state changes on resident pages */
Ur -> Sr [ label="set stable" ];
Ur -> Vr [ label="set volatile\n(w=0)" ];
Ur -> Pr [ label="set volatile\n(w=1)" ];
Sr -> Ur [ label="set unused" ];
Sr -> Vr [ label="set volatile\n(w=0)" ];
Sr -> Pr [ label="set volatile\n(w=1)" ];
Vr -> Sr [ label="set stable(_if_present)" ];
Vr -> Ur [ label="set unused" ];
Vr -> Pr [ label="set volatile\n(w=1)" ];
Pr -> Sr [ label="set stable(_if_present)" ];
Pr -> Vr [ label="set volatile\n(w=0)" ];
Pr -> Ur [ label="set unused" ];
/* Guest state changes on zero pages */
Uz -> Sz [ label="set stable" ];
Uz -> Vz [ label="set volatile" ];
Sz -> Vz [ label="set volatile" ];
Sz -> Uz [ label="set unused" ];
Vz -> Sz [ label="set stable" ];
Vz -> Uz [ label="set unused" ];
/* Guest state changes on host-swapped pages */
Sp -> Uz [ label="set unused" ];
Sp -> Vz [ label="set volatile" ];
/* Guest touches pages */
Sz -> Sr [ label="guest write" ];
Sp -> Sr [ label="guest access" ];
Vz -> Vr [ label="guest write" ];
/* Host actions */
Sr -> Sp [ label="host swap" ];
Ur -> Uz [ label="host discard" ];
Vr -> Vz [ label="host discard" ];
Pr -> Sp [ label="host discard\n(dirty)" ];
Pr -> Vz [ label="host discard\n(clean)" ];
}
Jeremy Fitzhardinge wrote:
>> With stable_if_clean you are referring to stable_if_present?
>
> No. I misunderstood and thought that stable_if_present sets the Px
> state. I'd overlooked the writable flag on page_set_volatile().
>
>> If yes the
>> answer is that this operation is used to get a page from Vx/Px back to
>> Sx but only if the page has not been discarded.
>
> So you mean it will change Vr/Pr to Sr but everything else will fail?
Well presumably Vp/Pr => Sp? Is it true that from the guest's
perspective, all of the 'p' states are identical to the 'r' states?
Do the host states even really need visibility to the guest at all? It
may be useful for the guest to be able to distinguish between Ur and Uz
but it doesn't seem necessary.
BTW Jeremy, the .dot was very useful!
Regards,
Anthony Liguori
digraph gph {
/* Guest state changes on resident pages */
Ur -> Sr [ label="set stable" ];
Ur -> Vr [ label="set volatile\n(w=0)" ];
Ur -> Pr [ label="set volatile\n(w=1)" ];
Sr -> Ur [ label="set unused" ];
Sr -> Vr [ label="set volatile\n(w=0)" ];
Sr -> Pr [ label="set volatile\n(w=1)" ];
Vr -> Sr [ label="set stable(_if_present)" ];
Vr -> Ur [ label="set unused" ];
Vr -> Pr [ label="set volatile\n(w=1)" ];
Pr -> Sr [ label="set stable(_if_present)" ];
Pr -> Vr [ label="set volatile\n(w=0)" ];
Pr -> Ur [ label="set unused" ];
/* Guest state changes on zero pages */
Uz -> Sz [ label="set stable" ];
Uz -> Vz [ label="set volatile" ];
Sz -> Vz [ label="set volatile" ];
Sz -> Uz [ label="set unused" ];
Vz -> Sz [ label="set stable" ];
Vz -> Uz [ label="set unused" ];
/* Guest state changes on host-swapped pages */
Sp -> Uz [ label="set unused" ];
Sp -> Vz [ label="set volatile" ];
/* Guest touches pages */
Sz -> Sr [ label="guest write" ];
Sp -> Sr [ label="guest access" ];
Vz -> Vr [ label="guest write" ];
/* Host actions */
Sr -> Sp [ label="host swap", style=dashed ];
Ur -> Uz [ label="host discard", style=dashed ];
Vr -> Vz [ label="host discard", style=dashed ];
Pr -> Sp [ label="host discard\n(dirty)", style=dashed ];
Pr -> Vz [ label="host discard\n(clean)", style=dashed ];
}
Jeremy Fitzhardinge wrote:
>>
>> Well presumably Vp/Pr => Sp? Is it true that from the guest's
>> perspective, all of the 'p' states are identical to the 'r' states?
>>
>
> Vp should never happen, since you'd never preserve a V page. And
> surely it would be Pr -> Sr, since the hypervisor wouldn't push the
> page to backing store when you change the client state.
You're right, I meant Vp/Pp but they are invalid states. I think one of
the things that keeps tripping me up is that the host can change both
the host and guest page states. My initial impression was that the host
handled the host state and the guest handled the guest state.
>> Do the host states even really need visibility to the guest at all?
>> It may be useful for the guest to be able to distinguish between Ur
>> and Uz but it doesn't seem necessary.
>
> Well, you implicitly see the hypervisor state. If you touch a [UV]z
> page then you get a fault telling you that the page has been taken
> away from you (I think). And it would definitely help with debugging
> (seems likely there's lots of scope for race conditions if you
> prematurely tell the hypervisor you don't need the page any more...).
I was thinking that it may be useful to know a Ur versus a Uz when
allocating memory. In this case, you'd rather allocate Ur pages versus
Uz to avoid the fault. I don't read s390 arch code well, is the host
state explicit to the guest?
>> BTW Jeremy, the .dot was very useful!
> Yes, there's no way I'd be able to get my head around this otherwise.
> BTW, here's an updated one with the host-driven events as dashed
> lines, and a couple of extra transitions I think should be in there
> (but waiting for Martin's confirmation).
Excellent!
Regards,
Anthony Liguori
> J
Anthony Liguori wrote:
>> Vp should never happen, since you'd never preserve a V page. And
>> surely it would be Pr -> Sr, since the hypervisor wouldn't push the
>> page to backing store when you change the client state.
>>
>
> You're right, I meant Vp/Pp but they are invalid states. I think one of
> the things that keeps tripping me up is that the host can change both
> the host and guest page states. My initial impression was that the host
> handled the host state and the guest handled the guest state.
>
Yes. And it seems to me that you get unfortunate outcomes if you have a
Pr->Vz->Vr transition.
>>> Do the host states even really need visibility to the guest at all?
>>> It may be useful for the guest to be able to distinguish between Ur
>>> and Uz but it doesn't seem necessary.
>>>
>> Well, you implicitly see the hypervisor state. If you touch a [UV]z
>> page then you get a fault telling you that the page has been taken
>> away from you (I think). And it would definitely help with debugging
>> (seems likely there's lots of scope for race conditions if you
>> prematurely tell the hypervisor you don't need the page any more...).
>>
>
> I was thinking that it may be useful to know a Ur versus a Uz when
> allocating memory. In this case, you'd rather allocate Ur pages versus
> Uz to avoid the fault. I don't read s390 arch code well, is the host
> state explicit to the guest?
>
Yes, reusing Ur pages might well be better, but who knows - they've
probably got an instruction which makes Uz cheap...
Stuff like this suggests that both parts of the state are packed
together, and are guest-visible:
+ return (state & ESSA_USTATE_MASK) == ESSA_USTATE_VOLATILE &&
+ (state & ESSA_CSTATE_MASK) == ESSA_CSTATE_ZERO;
J
On Wed, 2008-03-12 at 15:04 -0500, Anthony Liguori wrote:
> Jeremy Fitzhardinge wrote:
> >> With stable_if_clean you are referring to stable_if_present?
> >
> > No. I misunderstood and thought that stable_if_present sets the Px
> > state. I'd overlooked the writable flag on page_set_volatile().
> >
> >> If yes the
> >> answer is that this operation is used to get a page from Vx/Px back to
> >> Sx but only if the page has not been discarded.
> >
> > So you mean it will change Vr/Pr to Sr but everything else will fail?
In the extended version it would change Vp/Pp to Sr as well, but the
current z/VM code will discard the page if the host picks a Vr/Pr page
for swapping.
> Well presumably Vp/Pr => Sp? Is it true that from the guest's
> perspective, all of the 'p' states are identical to the 'r' states?
Basically yes. The guest doesn't care about the host state.
> Do the host states even really need visibility to the guest at all? It
> may be useful for the guest to be able to distinguish between Ur and Uz
> but it doesn't seem necessary.
It is very useful for debugging to have the host state in the guest as
well. There is one possible optimization: if the guest finds a Uz page
in the free list, it can make it Sz and doesn't have to clear it because
the host will provide an already empty page (not yet implemented
though).
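
A minimal sketch of that optimization, for illustration only. It is not
implemented in the patches, and struct free_page, prepare_zeroed_page()
and SKETCH_PAGE_SIZE are hypothetical names invented here. The idea is
that a free page whose host state is "zero" can go Uz -> Sz without the
guest clearing it, while a Ur page still holds stale data and has to be
cleared by the guest:

```c
#include <stdbool.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096   /* hypothetical, for the sketch only */

/* Host (block content) state of an unused page taken from the free list. */
enum content_state { CONTENT_RESIDENT, CONTENT_ZERO };

struct free_page {
	enum content_state content;  /* r or z, as reported for the page */
	unsigned char *data;         /* guest view of the page frame */
};

/*
 * Prepare a free page that the caller wants zero-filled.
 * Returns true if the guest had to clear the page itself.
 */
static bool prepare_zeroed_page(struct free_page *page)
{
	if (page->content == CONTENT_ZERO) {
		/* Uz -> Sz: the host provides an empty page on first access. */
		return false;
	}
	/* Ur -> Sr: the frame still holds old content, clear it here. */
	memset(page->data, 0, SKETCH_PAGE_SIZE);
	return true;
}
```

The saving is the skipped memset() in the Uz case; the guest state change
itself (set stable) is needed on both paths and is left out of the sketch.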
> BTW Jeremy, the .dot was very useful!
I've searched my disk and found the state diagrams we used for the
OLS paper. You may find these useful as well.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
On Wed, 2008-03-12 at 13:45 -0700, Jeremy Fitzhardinge wrote:
> Vp should never happen, since you'd never preserve a V page. And surely
> it would be Pr -> Sr, since the hypervisor wouldn't push the page to
> backing store when you change the client state.
Vp does not happen in the current implementation. But it actually may be
useful. z/VM has multiple layers of paging, the first goes to expanded
storage which is very fast. If you make the page Vz and the guest needs
it, you have to do a standard Linux I/O to retrieve the page. This
can be slower than a read and a write to expanded storage.
> > Do the host states even really need visibility to the guest at all? It
> > may be useful for the guest to be able to distinguish between Ur and Uz
> > but it doesn't seem necessary.
>
> Well, you implicitly see the hypervisor state. If you touch a [UV]z
> page then you get a fault telling you that the page has been taken away
> from you (I think). And it would definitely help with debugging (seems
> likely there's lots of scope for race conditions if you prematurely tell
> the hypervisor you don't need the page any more...).
You get an addressing exception if you touch a Uz page. This indicates a
BUG in the Linux code because this is a use after free. If the guest
touches a Vz page you get a discard fault.
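
As a rough sketch of the distinction (the enum and function names are
made up for illustration and are not taken from the patches), the outcome
of touching a page whose host state is "zero" depends only on the guest
state:

```c
/* Guest (block usage) states that can be combined with host state "zero". */
enum usage_state { USAGE_STABLE, USAGE_UNUSED, USAGE_VOLATILE };

enum access_result {
	ACCESS_OK,                    /* Sz: a zero-filled page is provided */
	ACCESS_DISCARD_FAULT,         /* Vz: content was discarded by the host */
	ACCESS_ADDRESSING_EXCEPTION,  /* Uz: use after free, a guest bug */
};

/* What the guest sees when it touches a logically-zero page. */
static enum access_result touch_zero_page(enum usage_state usage)
{
	switch (usage) {
	case USAGE_UNUSED:
		return ACCESS_ADDRESSING_EXCEPTION;
	case USAGE_VOLATILE:
		return ACCESS_DISCARD_FAULT;
	default:
		return ACCESS_OK;
	}
}
```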
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
On Wed, 2008-03-12 at 15:56 -0500, Anthony Liguori wrote:
> > Vp should never happen, since you'd never preserve a V page. And
> > surely it would be Pr -> Sr, since the hypervisor wouldn't push the
> > page to backing store when you change the client state.
>
> You're right, I meant Vp/Pp but they are invalid states. I think one of
> the things that keeps tripping me up is that the host can change both
> the host and guest page states. My initial impression was that the host
> handled the host state and the guest handled the guest state.
In principle only the guest changes the guest state and only the host
changes the host state. The simplified state diagram shows exceptions
for Pr->Sp and Pr->Vz.
> >> Do the host states even really need visibility to the guest at all?
> >> It may be useful for the guest to be able to distinguish between Ur
> >> and Uz but it doesn't seem necessary.
> >
> > Well, you implicitly see the hypervisor state. If you touch a [UV]z
> > page then you get a fault telling you that the page has been taken
> > away from you (I think). And it would definitely help with debugging
> > (seems likely there's lots of scope for race conditions if you
> > prematurely tell the hypervisor you don't need the page any more...).
>
> I was thinking that it may be useful to know a Ur versus a Uz when
> allocating memory. In this case, you'd rather allocate Ur pages versus
> Uz to avoid the fault. I don't read s390 arch code well, is the host
> state explicit to the guest?
This is the second optimization you might want to think about. The other
is to avoid the page clearing for Uz.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
On Wed, 2008-03-12 at 14:36 -0700, Jeremy Fitzhardinge wrote:
> Anthony Liguori wrote:
> >> Vp should never happen, since you'd never preserve a V page. And
> >> surely it would be Pr -> Sr, since the hypervisor wouldn't push the
> >> page to backing store when you change the client state.
> >>
> >
> > You're right, I meant Vp/Pp but they are invalid states. I think one of
> > the things that keeps tripping me up is that the host can change both
> > the host and guest page states. My initial impression was that the host
> > handled the host state and the guest handled the guest state.
> >
>
> Yes. And it seems to me that you get unfortunate outcomes if you have a
> Pr->Vz->Vr transition.
Vz->Vr cannot happen. This would be a bug in the host.
> > I was thinking that it may be useful to know a Ur versus a Uz when
> > allocating memory. In this case, you'd rather allocate Ur pages versus
> > Uz to avoid the fault. I don't read s390 arch code well, is the host
> > state explicit to the guest?
> >
>
> Yes, reusing Ur pages might well be better, but who knows - they've
> probably got an instruction which makes Uz cheap...
Yes, faulting in a Uz page is cheap on s390. Isn't it a lovely
architecture :-)
> Stuff like this suggests that both parts of the state are packed
> together, and are guest-visible:
>
> + return (state & ESSA_USTATE_MASK) == ESSA_USTATE_VOLATILE &&
> + (state & ESSA_CSTATE_MASK) == ESSA_CSTATE_ZERO;
>
Yes, the return value of the ESSA instruction has both the guest state
and the host state.
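
A small sketch of decoding such a packed value. Only the four names in
the quoted hunk (ESSA_USTATE_MASK, ESSA_USTATE_VOLATILE, ESSA_CSTATE_MASK,
ESSA_CSTATE_ZERO) come from the patch; every numeric value and every
other name below is an assumption chosen so the example is
self-contained, not the real encoding:

```c
#include <stdbool.h>

/* ASSUMED bit layout, for illustration only. */
#define ESSA_USTATE_MASK      0x0c  /* guest (block usage) state */
#define ESSA_USTATE_STABLE    0x00
#define ESSA_USTATE_UNUSED    0x04
#define ESSA_USTATE_PVOLATILE 0x08
#define ESSA_USTATE_VOLATILE  0x0c

#define ESSA_CSTATE_MASK      0x03  /* host (block content) state */
#define ESSA_CSTATE_RESIDENT  0x00
#define ESSA_CSTATE_PRESERVED 0x02
#define ESSA_CSTATE_ZERO      0x03

/* Vz: the page was volatile and the host has discarded it. */
static bool state_is_discarded(unsigned int state)
{
	return (state & ESSA_USTATE_MASK) == ESSA_USTATE_VOLATILE &&
	       (state & ESSA_CSTATE_MASK) == ESSA_CSTATE_ZERO;
}

/* Write the two-letter name used in this thread (e.g. "Pr") into name. */
static void state_name(unsigned int state, char name[3])
{
	static const char usage[] = { 'S', 'U', 'P', 'V' };
	static const char content[] = { 'r', '?', 'p', 'z' };

	name[0] = usage[(state & ESSA_USTATE_MASK) >> 2];
	name[1] = content[state & ESSA_CSTATE_MASK];
	name[2] = '\0';
}
```

Having both halves in one value is what makes the debugging use case
cheap: a single query can report "Vz" or "Ur" directly.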
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
Martin Schwidefsky wrote:
> Vz->Vr cannot happen. This would be a bug in the host.
>
Does that mean that Vz is effectively identical to Uz?
J
digraph gph {
/* Guest state changes on resident pages */
Ur -> Sr [ label="set stable" ];
Ur -> Vr [ label="set volatile\n(w=0)" ];
Ur -> Pr [ label="set volatile\n(w=1)" ];
Sr -> Ur [ label="set unused" ];
Sr -> Vr [ label="set volatile\n(w=0)" ];
Sr -> Pr [ label="set volatile\n(w=1)" ];
Vr -> Sr [ label="set stable(_if_present)" ];
Vr -> Ur [ label="set unused" ];
Vr -> Pr [ label="set volatile\n(w=1)" ];
Pr -> Sr [ label="set stable(_if_present)" ];
Pr -> Vr [ label="set volatile\n(w=0)" ];
Pr -> Ur [ label="set unused" ];
/* Guest state changes on zero pages */
Uz -> Sz [ label="set stable" ];
Uz -> Vz [ label="set volatile" ];
Sz -> Vz [ label="set volatile" ];
Sz -> Uz [ label="set unused" ];
Vz -> Sz [ label="set stable" ];
Vz -> Uz [ label="set unused" ];
/* Guest state changes on host-swapped pages */
Sp -> Uz [ label="set unused" ];
Sp -> Vz [ label="set volatile" ];
/* Guest touches pages */
Sz -> Sr [ label="guest write" ];
Sp -> Sr [ label="guest access" ];
/* Host actions */
Sr -> Sp [ label="host swap", style=dashed ];
Ur -> Uz [ label="host discard", style=dashed ];
Vr -> Vz [ label="host discard", style=dashed ];
Pr -> Sp [ label="host discard\n(dirty)", style=dashed ];
Pr -> Vz [ label="host discard\n(clean)", style=dashed ];
}
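
The dashed (host-driven) edges of the graph can also be written as a
small function. This is only a sketch of the graph as drawn here, with
hypothetical names, not code from the patches:

```c
#include <stdbool.h>

/* The eight valid combined page states. */
enum page_state { SR, SP, SZ, UR, UZ, VR, VZ, PR };

/*
 * Host reclaims the page frame behind a page in state s.
 * "dirty" only matters for Pr, where the host has to check the dirty
 * bit before it decides between preserving and discarding.
 * States without a dashed edge are returned unchanged.
 */
static enum page_state host_reclaim(enum page_state s, bool dirty)
{
	switch (s) {
	case SR:
		return SP;               /* host swap */
	case UR:
		return UZ;               /* host discard */
	case VR:
		return VZ;               /* host discard */
	case PR:
		return dirty ? SP : VZ;  /* host discard (dirty / clean) */
	default:
		return s;
	}
}
```

Note that the two Pr edges are the only ones where the host changes the
guest half of the state as well as the host half.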
On Thu, 2008-03-13 at 09:17 -0700, Jeremy Fitzhardinge wrote:
> Jeremy Fitzhardinge wrote:
> > Martin Schwidefsky wrote:
> >> Vz->Vr cannot happen. This would be a bug in the host.
> >>
> >
> > Does that mean that Vz is effectively identical to Uz?
>
> Hm, on further thought:
>
> If guest writes to Vz pages are disallowed, then the only way out of Vz
> is if the guest sets it to something else (Uz,Sz). If so, what's the
> point of using that state? Why not make:
>
> Vr -> Uz host discard
> Pr -> Uz host discard clean
> Sp -> Uz set volatile
> Uz -> Uz set volatile
Vz is the page discarded state. The difference to Uz is slim, both
states will cause a program check on access. Vz generates a discard
fault, Uz generates an addressing exception which is nice for debugging.
But I don't see a reason why an implementation that uses Uz instead of
Vz shouldn't work.
> But given how you've described V-state pages, I really would expect
> writes to a Vz to work, or alternatively, all writes to V-state pages to
> be disallowed. Are there any real uses for a writable Vr page?
You mean in the section that speaks about the guests states S/U/V/P ?
Always keep in mind that you can access a V/P page only until it gets
discarded. Then the useful content of the page frame is lost and any
read or write to the now Vz page will be answered with a discard fault.
A Vr page is read-only. If a page gets mapped for writing it needs to
get into the Pr state. This is the hint for the host to look at the
dirty bit before it discards a page.
So yes, there is no use for a writable Vr page.
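
In code form the rule is roughly the following. This is a sketch with
invented names; the writable flag corresponds to the w=0 / w=1 labels on
the "set volatile" edges in the state graph:

```c
#include <stdbool.h>

enum usage_state {
	USAGE_STABLE,
	USAGE_UNUSED,
	USAGE_VOLATILE,      /* V: no writable mapping exists */
	USAGE_POT_VOLATILE,  /* P: page may get dirty through a mapping */
};

/* Guest side: which state "set volatile" should ask for. */
static enum usage_state volatile_target(bool has_writable_mapping)
{
	return has_writable_mapping ? USAGE_POT_VOLATILE : USAGE_VOLATILE;
}

/* Host side: must the dirty bit be checked before discarding? */
static bool host_checks_dirty_bit(enum usage_state usage)
{
	return usage == USAGE_POT_VOLATILE;
}
```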
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
Martin Schwidefsky wrote:
> Vz is the page discarded state. The difference to Uz is slim, both
> states will cause a program check on access. Vz generates a discard
> fault, Uz generates an addressing exception which is nice for debugging.
>
How do you handle these different cases in Linux? Do you use Vr pages
in the pagecache, and then shoot down the pagecache entry if the host
steals the page?
The Uz access exception presumably just generates a normal oops.
(I should probably make time to read the rest of the series.)
>> But given how you've described V-state pages, I really would expect
>> writes to a Vz to work, or alternatively, all writes to V-state pages to
>> be disallowed. Are there any real uses for a writable Vr page?
>>
>
> You mean in the section that speaks about the guests states S/U/V/P ?
> Always keep in mind that you can access a V/P page only until it gets
> discarded. Then the useful content of the page frame is lost and any
> read or write to the now Vz page will be answered with a discard fault.
>
Presumably reads from a Vz page also generate a discard fault?
> A Vr page is read-only. If a page gets mapped for writing it needs to
> get into the Pr state. This is the hint for the host to look at the
> dirty bit before it discards a page.
> So yes, there is no use for a writable Vr page.
>
OK, thanks, that clears things up. I was assuming that Vr was
technically writable but that writes could be discarded at any time (ie,
allowing guests to merrily shoot themselves in the foot ;). Making it
forced RO is much more sensible.
J
On Thu, 2008-03-13 at 10:05 -0700, Jeremy Fitzhardinge wrote:
> Martin Schwidefsky wrote:
> > Vz is the page discarded state. The difference to Uz is slim, both
> > states will cause a program check on access. Vz generates a discard
> > fault, Uz generates an addressing exception which is nice for debugging.
> >
>
> How do you handle these different cases in Linux? Do you use Vr pages
> in the pagecache, and then shoot down the pagecache entry if the host
> steals the page?
The environment where we currently run all this is z/VM as the host and
Linux as the guest. We have two page tables on s390, a host page table
and a guest page table. If the host discards a page it simply removes
the entry for the page in the host page table. If the guest comes along
and accesses the page the host gets the fault and generates the
appropriate fault.
> The Uz access exception presumably just generates a normal oops.
Yes, the handler for an addressing exception will call die() for a
kernel check without a fixup.
> (I should probably make time to read the rest of the series.)
>
> >> But given how you've described V-state pages, I really would expect
> >> writes to a Vz to work, or alternatively, all writes to V-state pages to
> >> be disallowed. Are there any real uses for a writable Vr page?
> >>
> >
> > You mean in the section that speaks about the guests states S/U/V/P ?
> > Always keep in mind that you can access a V/P page only until it gets
> > discarded. Then the useful content of the page frame is lost and any
> > read or write to the now Vz page will be answered with a discard fault.
> >
>
> Presumably reads from a Vz page also generate a discard fault?
Yes.
> > A Vr page is read-only. If a page gets mapped for writing it needs to
> > get into the Pr state. This is the hint for the host to look at the
> > dirty bit before it discards a page.
> > So yes, there is no use for a writable Vr page.
> >
>
> OK, thanks, that clears things up. I was assuming that Vr was
> technically writable but that writes could be discarded at any time (ie,
> allowing guests to merrily shoot themselves in the foot ;). Making it
> forced RO is much more sensible.
Well, technically you could write to a Vr page via the kernel address
space. The thing is that the host can just discard the page although it
is dirty. The Vr state is used for page cache pages which do not have
any writable mapping.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.