2015-08-05 21:45:22

by Toshi Kani

Subject: [PATCH v3 0/10] x86/mm: Handle large PAT bit in pud/pmd interfaces

The PAT bit gets relocated to bit 12 when PUD/PMD mappings are used. This
bit 12, however, is not covered by PTE_FLAGS_MASK, which is currently used
for masking pfn and flags for all levels. This patch-set updates pud/pmd
interfaces and multiple functions to handle the large PAT bit properly.
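
To illustrate (a sketch, not part of the patches; the bit values match
pgtable_types.h, but the example entry is hypothetical):

/*
 * In a 4KB PTE, PAT is bit 7.  In a 2MB/1GB entry, bit 7 is PSE, so
 * PAT is relocated to bit 12, the first bit of a 4KB PTE's pfn field.
 */
#define _PAGE_PSE        (1ULL << 7)
#define _PAGE_PAT_LARGE  (1ULL << 12)

/* a 2MB mapping of pfn 0x40000 with the large PAT bit set */
pmdval_t pmd = (0x40000ULL << 12) | _PAGE_PSE | _PAGE_PAT_LARGE;

/*
 * PTE_PFN_MASK keeps the 4KB pfn field (bit 12 up to the physical
 * address limit), so the PAT bit leaks into the pfn:
 * (pmd & PTE_PFN_MASK) >> PAGE_SHIFT is 0x40001, not 0x40000.
 * PTE_FLAGS_MASK (~PTE_PFN_MASK) likewise drops the PAT bit from the
 * flags.  Masking with PMD_PAGE_MASK (bit 21 and up) instead yields
 * the correct pfn, which is what this series does.
 */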

Patches 1/10-2/10 make the changes necessary for patch 3/10 to use P?D_PAGE_MASK.

Patch 3/10 fixes pud/pmd interfaces to handle the PAT bit properly, and
patch 4/10 adds p?d_pgprot() interfaces for PUD/PMD.

Patch 5/10 fixes /sys/kernel/debug/kernel_page_tables to show the PAT bit
properly.

Patches 6/10-9/10 fix multiple functions to handle the large PAT bit properly.

Patch 10/10 fixes the same pgprot handling in try_preserve_large_page() by
leveraging the changes made in patch 8/10.

Note, the PAT bit is first enabled in 4.2-rc1 with WT mappings. The functions
fixed by patches 6/10-9/10 are not used with WT mappings yet. These fixes will
protect them from future use with the PAT bit set.

---
v3:
- Add patch 4/10 and 6/10-9/10 for multiple interfaces to handle the large
PAT bit.
- Add patch 10/10 to fix the same pgprot handling in
try_preserve_large_page().

v2:
- Change p?d_pfn() to handle the PAT bit. (Juergen Gross)
- Mask pfn and flags with P?D_PAGE_MASK. (Juergen Gross)
- Change p?d_page_vaddr() and p?d_page() to handle the PAT bit.

---
Toshi Kani (10):
1/10 x86/vdso32: Define PGTABLE_LEVELS to 32bit VDSO
2/10 x86/asm: Move PUD_PAGE macros to page_types.h
3/10 x86/asm: Fix pud/pmd interfaces to handle large PAT bit
4/10 x86/asm: Add pud_pgprot() and pmd_pgprot()
5/10 x86/mm: Fix page table dump to show PAT bit
6/10 x86/mm: Fix slow_virt_to_phys() to handle large PAT bit
7/10 x86/mm: Fix gup_huge_p?d() to handle large PAT bit
8/10 x86/mm: Fix try_preserve_large_page() to handle large PAT bit
9/10 x86/mm: Fix __split_large_page() to handle large PAT bit
10/10 x86/mm: Fix the same pgprot handling in try_preserve_large_page()

---
arch/x86/entry/vdso/vdso32/vclock_gettime.c | 2 +
arch/x86/include/asm/page_64_types.h | 3 --
arch/x86/include/asm/page_types.h | 3 ++
arch/x86/include/asm/pgtable.h | 18 ++++---
arch/x86/include/asm/pgtable_types.h | 40 +++++++++++++--
arch/x86/mm/dump_pagetables.c | 39 +++++++-------
arch/x86/mm/gup.c | 18 +++----
arch/x86/mm/pageattr.c | 79 ++++++++++++++++++-----------
8 files changed, 131 insertions(+), 71 deletions(-)


2015-08-05 21:45:23

by Toshi Kani

Subject: [PATCH v3 1/10] x86/vdso32: Define PGTABLE_LEVELS to 32bit VDSO

In case of CONFIG_X86_64, vdso32/vclock_gettime.c fakes a 32bit
kernel configuration by re-defining it to CONFIG_X86_32. However,
it does not re-define CONFIG_PGTABLE_LEVELS leaving it as 4 levels.
Fix it by re-defining CONFIG_PGTABLE_LEVELS to 2 as X86_PAE is not
set.

Signed-off-by: Toshi Kani <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
---
arch/x86/entry/vdso/vdso32/vclock_gettime.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/x86/entry/vdso/vdso32/vclock_gettime.c b/arch/x86/entry/vdso/vdso32/vclock_gettime.c
index 175cc72..87a86e0 100644
--- a/arch/x86/entry/vdso/vdso32/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vdso32/vclock_gettime.c
@@ -14,11 +14,13 @@
*/
#undef CONFIG_64BIT
#undef CONFIG_X86_64
+#undef CONFIG_PGTABLE_LEVELS
#undef CONFIG_ILLEGAL_POINTER_VALUE
#undef CONFIG_SPARSEMEM_VMEMMAP
#undef CONFIG_NR_CPUS

#define CONFIG_X86_32 1
+#define CONFIG_PGTABLE_LEVELS 2
#define CONFIG_PAGE_OFFSET 0
#define CONFIG_ILLEGAL_POINTER_VALUE 0
#define CONFIG_NR_CPUS 1

2015-08-05 21:47:56

by Toshi Kani

Subject: [PATCH v3 2/10] x86/asm: Move PUD_PAGE macros to page_types.h

PUD_SHIFT is defined according to the kernel configuration, which
allows it to be used by any kernel. However, PUD_PAGE_SIZE and
PUD_PAGE_MASK, which are calculated from PUD_SHIFT, are defined in
page_64_types.h, which allows them to be used by a 64-bit kernel
only.

Move PUD_PAGE_SIZE and PUD_PAGE_MASK to page_types.h so that they
can be used by any kernel as well.

Signed-off-by: Toshi Kani <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
---
arch/x86/include/asm/page_64_types.h | 3 ---
arch/x86/include/asm/page_types.h | 3 +++
2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index 4edd53b..4928cf0 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -26,9 +26,6 @@
#define MCE_STACK 4
#define N_EXCEPTION_STACKS 4 /* hw limit: 7 */

-#define PUD_PAGE_SIZE (_AC(1, UL) << PUD_SHIFT)
-#define PUD_PAGE_MASK (~(PUD_PAGE_SIZE-1))
-
/*
* Set __PAGE_OFFSET to the most negative possible address +
* PGDIR_SIZE*16 (pgd slot 272). The gap is to allow a space for a
diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
index c7c712f..c5b7fb2 100644
--- a/arch/x86/include/asm/page_types.h
+++ b/arch/x86/include/asm/page_types.h
@@ -20,6 +20,9 @@
#define PMD_PAGE_SIZE (_AC(1, UL) << PMD_SHIFT)
#define PMD_PAGE_MASK (~(PMD_PAGE_SIZE-1))

+#define PUD_PAGE_SIZE (_AC(1, UL) << PUD_SHIFT)
+#define PUD_PAGE_MASK (~(PUD_PAGE_SIZE-1))
+
#define HPAGE_SHIFT PMD_SHIFT
#define HPAGE_SIZE (_AC(1,UL) << HPAGE_SHIFT)
#define HPAGE_MASK (~(HPAGE_SIZE - 1))

2015-08-05 21:46:27

by Toshi Kani

Subject: [PATCH v3 3/10] x86/asm: Fix pud/pmd interfaces to handle large PAT bit

The PAT bit gets relocated to bit 12 when PUD and PMD mappings are
used. This bit 12, however, is not covered by PTE_FLAGS_MASK, which
is currently used for masking pfn and flags for all cases.

Fix pud/pmd interfaces to handle pfn and flags properly by using
P?D_PAGE_MASK when PUD/PMD mappings are used, i.e. PSE bit is set.

Suggested-by: Juergen Gross <[email protected]>
Signed-off-by: Toshi Kani <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Konrad Wilk <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
---
arch/x86/include/asm/pgtable.h | 14 +++++++-----
arch/x86/include/asm/pgtable_types.h | 40 +++++++++++++++++++++++++++++++---
2 files changed, 44 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 867da5b..0733ec7 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -142,12 +142,12 @@ static inline unsigned long pte_pfn(pte_t pte)

static inline unsigned long pmd_pfn(pmd_t pmd)
{
- return (pmd_val(pmd) & PTE_PFN_MASK) >> PAGE_SHIFT;
+ return (pmd_val(pmd) & pmd_pfn_mask(pmd)) >> PAGE_SHIFT;
}

static inline unsigned long pud_pfn(pud_t pud)
{
- return (pud_val(pud) & PTE_PFN_MASK) >> PAGE_SHIFT;
+ return (pud_val(pud) & pud_pfn_mask(pud)) >> PAGE_SHIFT;
}

#define pte_page(pte) pfn_to_page(pte_pfn(pte))
@@ -502,14 +502,15 @@ static inline int pmd_none(pmd_t pmd)

static inline unsigned long pmd_page_vaddr(pmd_t pmd)
{
- return (unsigned long)__va(pmd_val(pmd) & PTE_PFN_MASK);
+ return (unsigned long)__va(pmd_val(pmd) & pmd_pfn_mask(pmd));
}

/*
* Currently stuck as a macro due to indirect forward reference to
* linux/mmzone.h's __section_mem_map_addr() definition:
*/
-#define pmd_page(pmd) pfn_to_page((pmd_val(pmd) & PTE_PFN_MASK) >> PAGE_SHIFT)
+#define pmd_page(pmd) \
+ pfn_to_page((pmd_val(pmd) & pmd_pfn_mask(pmd)) >> PAGE_SHIFT)

/*
* the pmd page can be thought of an array like this: pmd_t[PTRS_PER_PMD]
@@ -570,14 +571,15 @@ static inline int pud_present(pud_t pud)

static inline unsigned long pud_page_vaddr(pud_t pud)
{
- return (unsigned long)__va((unsigned long)pud_val(pud) & PTE_PFN_MASK);
+ return (unsigned long)__va(pud_val(pud) & pud_pfn_mask(pud));
}

/*
* Currently stuck as a macro due to indirect forward reference to
* linux/mmzone.h's __section_mem_map_addr() definition:
*/
-#define pud_page(pud) pfn_to_page(pud_val(pud) >> PAGE_SHIFT)
+#define pud_page(pud) \
+ pfn_to_page((pud_val(pud) & pud_pfn_mask(pud)) >> PAGE_SHIFT)

/* Find an entry in the second-level page table.. */
static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 13f310b..dd5b0aa 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -209,10 +209,10 @@ enum page_cache_mode {

#include <linux/types.h>

-/* PTE_PFN_MASK extracts the PFN from a (pte|pmd|pud|pgd)val_t */
+/* Extracts the PFN from a (pte|pmd|pud|pgd)val_t of a 4KB page */
#define PTE_PFN_MASK ((pteval_t)PHYSICAL_PAGE_MASK)

-/* PTE_FLAGS_MASK extracts the flags from a (pte|pmd|pud|pgd)val_t */
+/* Extracts the flags from a (pte|pmd|pud|pgd)val_t of a 4KB page */
#define PTE_FLAGS_MASK (~PTE_PFN_MASK)

typedef struct pgprot { pgprotval_t pgprot; } pgprot_t;
@@ -276,14 +276,46 @@ static inline pmdval_t native_pmd_val(pmd_t pmd)
}
#endif

+static inline pudval_t pud_pfn_mask(pud_t pud)
+{
+ if (native_pud_val(pud) & _PAGE_PSE)
+ return PUD_PAGE_MASK & PHYSICAL_PAGE_MASK;
+ else
+ return PTE_PFN_MASK;
+}
+
+static inline pudval_t pud_flags_mask(pud_t pud)
+{
+ if (native_pud_val(pud) & _PAGE_PSE)
+ return ~(PUD_PAGE_MASK & (pudval_t)PHYSICAL_PAGE_MASK);
+ else
+ return ~PTE_PFN_MASK;
+}
+
static inline pudval_t pud_flags(pud_t pud)
{
- return native_pud_val(pud) & PTE_FLAGS_MASK;
+ return native_pud_val(pud) & pud_flags_mask(pud);
+}
+
+static inline pmdval_t pmd_pfn_mask(pmd_t pmd)
+{
+ if (native_pmd_val(pmd) & _PAGE_PSE)
+ return PMD_PAGE_MASK & PHYSICAL_PAGE_MASK;
+ else
+ return PTE_PFN_MASK;
+}
+
+static inline pmdval_t pmd_flags_mask(pmd_t pmd)
+{
+ if (native_pmd_val(pmd) & _PAGE_PSE)
+ return ~(PMD_PAGE_MASK & (pmdval_t)PHYSICAL_PAGE_MASK);
+ else
+ return ~PTE_PFN_MASK;
}

static inline pmdval_t pmd_flags(pmd_t pmd)
{
- return native_pmd_val(pmd) & PTE_FLAGS_MASK;
+ return native_pmd_val(pmd) & pmd_flags_mask(pmd);
}

static inline pte_t native_make_pte(pteval_t val)

2015-08-05 21:47:25

by Toshi Kani

Subject: [PATCH v3 4/10] x86/asm: Add pud_pgprot() and pmd_pgprot()

pte_pgprot() returns a pgprot_t value by calling pte_flags(). Now
that pud_flags() and pmd_flags() work differently from pte_flags(),
define pud_pgprot() and pmd_pgprot() for PUD/PMD. Also update
pte_pgprot() to remove the masking with PTE_FLAGS_MASK, which is
unnecessary since pte_flags() takes care of it.

Signed-off-by: Toshi Kani <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
---
arch/x86/include/asm/pgtable.h | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 0733ec7..59fc341 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -379,7 +379,9 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
return __pgprot(preservebits | addbits);
}

-#define pte_pgprot(x) __pgprot(pte_flags(x) & PTE_FLAGS_MASK)
+#define pte_pgprot(x) __pgprot(pte_flags(x))
+#define pmd_pgprot(x) __pgprot(pmd_flags(x))
+#define pud_pgprot(x) __pgprot(pud_flags(x))

#define canon_pgprot(p) __pgprot(massage_pgprot(p))

2015-08-05 21:45:37

by Toshi Kani

Subject: [PATCH v3 5/10] x86/mm: Fix page table dump to show PAT bit

/sys/kernel/debug/kernel_page_tables does not show the PAT bit for
PUD/PMD mappings. This is because walk_pud_level(), walk_pmd_level()
and note_page() mask the flags with PTE_FLAGS_MASK, which does not
cover their PAT bit, _PAGE_PAT_LARGE.

Fix it by replacing the use of PTE_FLAGS_MASK with p?d_flags(),
which masks the flags properly.

Also change the PAT bit to be shown as "PAT" for consistency with
the other bits.

Reported-by: Robert Elliott <[email protected]>
Signed-off-by: Toshi Kani <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Robert Elliott <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Andrew Morton <[email protected]>
---
arch/x86/mm/dump_pagetables.c | 39 +++++++++++++++++++++------------------
1 file changed, 21 insertions(+), 18 deletions(-)

diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index f0cedf3..71ab2d7 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -155,7 +155,7 @@ static void printk_prot(struct seq_file *m, pgprot_t prot, int level, bool dmsg)
pt_dump_cont_printf(m, dmsg, " ");
if ((level == 4 && pr & _PAGE_PAT) ||
((level == 3 || level == 2) && pr & _PAGE_PAT_LARGE))
- pt_dump_cont_printf(m, dmsg, "pat ");
+ pt_dump_cont_printf(m, dmsg, "PAT ");
else
pt_dump_cont_printf(m, dmsg, " ");
if (pr & _PAGE_GLOBAL)
@@ -198,8 +198,8 @@ static void note_page(struct seq_file *m, struct pg_state *st,
* we have now. "break" is either changing perms, levels or
* address space marker.
*/
- prot = pgprot_val(new_prot) & PTE_FLAGS_MASK;
- cur = pgprot_val(st->current_prot) & PTE_FLAGS_MASK;
+ prot = pgprot_val(new_prot);
+ cur = pgprot_val(st->current_prot);

if (!st->level) {
/* First entry */
@@ -269,13 +269,13 @@ static void walk_pte_level(struct seq_file *m, struct pg_state *st, pmd_t addr,
{
int i;
pte_t *start;
+ pgprotval_t prot;

start = (pte_t *) pmd_page_vaddr(addr);
for (i = 0; i < PTRS_PER_PTE; i++) {
- pgprot_t prot = pte_pgprot(*start);
-
+ prot = pte_flags(*start);
st->current_address = normalize_addr(P + i * PTE_LEVEL_MULT);
- note_page(m, st, prot, 4);
+ note_page(m, st, __pgprot(prot), 4);
start++;
}
}
@@ -287,18 +287,19 @@ static void walk_pmd_level(struct seq_file *m, struct pg_state *st, pud_t addr,
{
int i;
pmd_t *start;
+ pgprotval_t prot;

start = (pmd_t *) pud_page_vaddr(addr);
for (i = 0; i < PTRS_PER_PMD; i++) {
st->current_address = normalize_addr(P + i * PMD_LEVEL_MULT);
if (!pmd_none(*start)) {
- pgprotval_t prot = pmd_val(*start) & PTE_FLAGS_MASK;
-
- if (pmd_large(*start) || !pmd_present(*start))
+ if (pmd_large(*start) || !pmd_present(*start)) {
+ prot = pmd_flags(*start);
note_page(m, st, __pgprot(prot), 3);
- else
+ } else {
walk_pte_level(m, st, *start,
P + i * PMD_LEVEL_MULT);
+ }
} else
note_page(m, st, __pgprot(0), 3);
start++;
@@ -318,19 +319,20 @@ static void walk_pud_level(struct seq_file *m, struct pg_state *st, pgd_t addr,
{
int i;
pud_t *start;
+ pgprotval_t prot;

start = (pud_t *) pgd_page_vaddr(addr);

for (i = 0; i < PTRS_PER_PUD; i++) {
st->current_address = normalize_addr(P + i * PUD_LEVEL_MULT);
if (!pud_none(*start)) {
- pgprotval_t prot = pud_val(*start) & PTE_FLAGS_MASK;
-
- if (pud_large(*start) || !pud_present(*start))
+ if (pud_large(*start) || !pud_present(*start)) {
+ prot = pud_flags(*start);
note_page(m, st, __pgprot(prot), 2);
- else
+ } else {
walk_pmd_level(m, st, *start,
P + i * PUD_LEVEL_MULT);
+ }
} else
note_page(m, st, __pgprot(0), 2);

@@ -351,6 +353,7 @@ void ptdump_walk_pgd_level(struct seq_file *m, pgd_t *pgd)
#else
pgd_t *start = swapper_pg_dir;
#endif
+ pgprotval_t prot;
int i;
struct pg_state st = {};

@@ -362,13 +365,13 @@ void ptdump_walk_pgd_level(struct seq_file *m, pgd_t *pgd)
for (i = 0; i < PTRS_PER_PGD; i++) {
st.current_address = normalize_addr(i * PGD_LEVEL_MULT);
if (!pgd_none(*start)) {
- pgprotval_t prot = pgd_val(*start) & PTE_FLAGS_MASK;
-
- if (pgd_large(*start) || !pgd_present(*start))
+ if (pgd_large(*start) || !pgd_present(*start)) {
+ prot = pgd_flags(*start);
note_page(m, &st, __pgprot(prot), 1);
- else
+ } else {
walk_pud_level(m, &st, *start,
i * PGD_LEVEL_MULT);
+ }
} else
note_page(m, &st, __pgprot(0), 1);

2015-08-05 21:45:28

by Toshi Kani

Subject: [PATCH v3 6/10] x86/mm: Fix slow_virt_to_phys() to handle large PAT bit

slow_virt_to_phys() calls lookup_address() to obtain *pte and
its level. It then calls pte_pfn() to obtain the PFN for any
level. This does not return the correct PFN when the large
PAT bit is set because pte_pfn() does not mask the large PAT bit
properly for PUD/PMD.

Fix slow_virt_to_phys() to use pud_pfn() and pmd_pfn() according
to the level.
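
For example, for an address inside a 2MB mapping (hypothetical values;
a sketch of the fixed path):

/* virt_addr 0xffff880040123456, backed by a 2MB page at pfn 0x40000 */
phys_addr = pmd_pfn(*(pmd_t *)pte) << PAGE_SHIFT;   /* 0x40000000 */
offset    = virt_addr & ~PMD_PAGE_MASK;             /* 0x00123456 */
/* phys_addr | offset == 0x40123456 */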

Signed-off-by: Toshi Kani <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Andrew Morton <[email protected]>
---
arch/x86/mm/pageattr.c | 24 +++++++++++++++++-------
1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 727158c..ecc24e5 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -415,18 +415,28 @@ pmd_t *lookup_pmd_address(unsigned long address)
phys_addr_t slow_virt_to_phys(void *__virt_addr)
{
unsigned long virt_addr = (unsigned long)__virt_addr;
- phys_addr_t phys_addr;
- unsigned long offset;
+ unsigned long phys_addr, offset;
enum pg_level level;
- unsigned long pmask;
pte_t *pte;

pte = lookup_address(virt_addr, &level);
BUG_ON(!pte);
- pmask = page_level_mask(level);
- offset = virt_addr & ~pmask;
- phys_addr = (phys_addr_t)pte_pfn(*pte) << PAGE_SHIFT;
- return (phys_addr | offset);
+
+ switch (level) {
+ case PG_LEVEL_1G:
+ phys_addr = pud_pfn(*(pud_t *)pte) << PAGE_SHIFT;
+ offset = virt_addr & ~PUD_PAGE_MASK;
+ break;
+ case PG_LEVEL_2M:
+ phys_addr = pmd_pfn(*(pmd_t *)pte) << PAGE_SHIFT;
+ offset = virt_addr & ~PMD_PAGE_MASK;
+ break;
+ default:
+ phys_addr = pte_pfn(*pte) << PAGE_SHIFT;
+ offset = virt_addr & ~PAGE_MASK;
+ }
+
+ return (phys_addr_t)(phys_addr | offset);
}
EXPORT_SYMBOL_GPL(slow_virt_to_phys);

2015-08-05 21:46:55

by Toshi Kani

Subject: [PATCH v3 7/10] x86/mm: Fix gup_huge_p?d() to handle large PAT bit

gup_huge_pud() and gup_huge_pmd() cast *pud and *pmd to *pte,
and use pte_xxx() interfaces to obtain the flags and PFN.
However, the pte_xxx() interface does not handle the large
PAT bit properly for PUD/PMD.

Fix gup_huge_pud() and gup_huge_pmd() to use pud_xxx() and
pmd_xxx() interfaces according to their type.

Signed-off-by: Toshi Kani <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Andrew Morton <[email protected]>
---
arch/x86/mm/gup.c | 18 ++++++++----------
1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 81bf3d2..ae9a37b 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -118,21 +118,20 @@ static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
unsigned long end, int write, struct page **pages, int *nr)
{
unsigned long mask;
- pte_t pte = *(pte_t *)&pmd;
struct page *head, *page;
int refs;

mask = _PAGE_PRESENT|_PAGE_USER;
if (write)
mask |= _PAGE_RW;
- if ((pte_flags(pte) & mask) != mask)
+ if ((pmd_flags(pmd) & mask) != mask)
return 0;
/* hugepages are never "special" */
- VM_BUG_ON(pte_flags(pte) & _PAGE_SPECIAL);
- VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
+ VM_BUG_ON(pmd_flags(pmd) & _PAGE_SPECIAL);
+ VM_BUG_ON(!pfn_valid(pmd_pfn(pmd)));

refs = 0;
- head = pte_page(pte);
+ head = pmd_page(pmd);
page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
do {
VM_BUG_ON_PAGE(compound_head(page) != head, page);
@@ -195,21 +194,20 @@ static noinline int gup_huge_pud(pud_t pud, unsigned long addr,
unsigned long end, int write, struct page **pages, int *nr)
{
unsigned long mask;
- pte_t pte = *(pte_t *)&pud;
struct page *head, *page;
int refs;

mask = _PAGE_PRESENT|_PAGE_USER;
if (write)
mask |= _PAGE_RW;
- if ((pte_flags(pte) & mask) != mask)
+ if ((pud_flags(pud) & mask) != mask)
return 0;
/* hugepages are never "special" */
- VM_BUG_ON(pte_flags(pte) & _PAGE_SPECIAL);
- VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
+ VM_BUG_ON(pud_flags(pud) & _PAGE_SPECIAL);
+ VM_BUG_ON(!pfn_valid(pud_pfn(pud)));

refs = 0;
- head = pte_page(pte);
+ head = pud_page(pud);
page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
do {
VM_BUG_ON_PAGE(compound_head(page) != head, page);

2015-08-05 21:46:28

by Toshi Kani

Subject: [PATCH v3 8/10] x86/mm: Fix try_preserve_large_page() to handle large PAT bit

try_preserve_large_page() is called from __change_page_attr() to
change the map attribute by preserving the large page. This
function uses pte_pfn() and pte_pgprot() for PUD/PMD, which do not
handle the large PAT bit properly.

Fix try_preserve_large_page() to use the corresponding p?d_pfn() and
p?d_pgprot().

Signed-off-by: Toshi Kani <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Andrew Morton <[email protected]>
---
arch/x86/mm/pageattr.c | 24 ++++++++++++++----------
1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index ecc24e5..2724755 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -469,7 +469,7 @@ static int
try_preserve_large_page(pte_t *kpte, unsigned long address,
struct cpa_data *cpa)
{
- unsigned long nextpage_addr, numpages, pmask, psize, addr, pfn;
+ unsigned long nextpage_addr, numpages, pmask, psize, addr, pfn, old_pfn;
pte_t new_pte, old_pte, *tmp;
pgprot_t old_prot, new_prot, req_prot;
int i, do_split = 1;
@@ -489,17 +489,21 @@ try_preserve_large_page(pte_t *kpte, unsigned long address,

switch (level) {
case PG_LEVEL_2M:
-#ifdef CONFIG_X86_64
+ old_prot = pmd_pgprot(*(pmd_t *)kpte);
+ old_pfn = pmd_pfn(*(pmd_t *)kpte);
+ break;
case PG_LEVEL_1G:
-#endif
- psize = page_level_size(level);
- pmask = page_level_mask(level);
+ old_prot = pud_pgprot(*(pud_t *)kpte);
+ old_pfn = pud_pfn(*(pud_t *)kpte);
break;
default:
do_split = -EINVAL;
goto out_unlock;
}

+ psize = page_level_size(level);
+ pmask = page_level_mask(level);
+
/*
* Calculate the number of pages, which fit into this large
* page starting at address:
@@ -515,7 +519,7 @@ try_preserve_large_page(pte_t *kpte, unsigned long address,
* up accordingly.
*/
old_pte = *kpte;
- old_prot = req_prot = pgprot_large_2_4k(pte_pgprot(old_pte));
+ old_prot = req_prot = pgprot_large_2_4k(old_prot);

pgprot_val(req_prot) &= ~pgprot_val(cpa->mask_clr);
pgprot_val(req_prot) |= pgprot_val(cpa->mask_set);
@@ -541,10 +545,10 @@ try_preserve_large_page(pte_t *kpte, unsigned long address,
req_prot = canon_pgprot(req_prot);

/*
- * old_pte points to the large page base address. So we need
+ * old_pfn points to the large page base pfn. So we need
* to add the offset of the virtual address:
*/
- pfn = pte_pfn(old_pte) + ((address & (psize - 1)) >> PAGE_SHIFT);
+ pfn = old_pfn + ((address & (psize - 1)) >> PAGE_SHIFT);
cpa->pfn = pfn;

new_prot = static_protections(req_prot, address, pfn);
@@ -555,7 +559,7 @@ try_preserve_large_page(pte_t *kpte, unsigned long address,
* the pages in the range we try to preserve:
*/
addr = address & pmask;
- pfn = pte_pfn(old_pte);
+ pfn = old_pfn;
for (i = 0; i < (psize >> PAGE_SHIFT); i++, addr += PAGE_SIZE, pfn++) {
pgprot_t chk_prot = static_protections(req_prot, addr, pfn);

@@ -585,7 +589,7 @@ try_preserve_large_page(pte_t *kpte, unsigned long address,
* The address is aligned and the number of pages
* covers the full page.
*/
- new_pte = pfn_pte(pte_pfn(old_pte), new_prot);
+ new_pte = pfn_pte(old_pfn, new_prot);
__set_pmd_pte(kpte, address, new_pte);
cpa->flags |= CPA_FLUSHTLB;
do_split = 0;

2015-08-05 21:45:30

by Toshi Kani

Subject: [PATCH v3 9/10] x86/mm: Fix __split_large_page() to handle large PAT bit

__split_large_page() is called from __change_page_attr() to change
the map attribute by splitting the large page into smaller pages.
This function uses pte_pfn() and pte_pgprot() for PUD/PMD, which
do not handle the large PAT bit properly.

Fix __split_large_page() to use the corresponding p?d_pfn() and p?d_pgprot().

Signed-off-by: Toshi Kani <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Andrew Morton <[email protected]>
---
arch/x86/mm/pageattr.c | 31 +++++++++++++++++++------------
1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 2724755..d0e40ed 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -606,7 +606,7 @@ __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
struct page *base)
{
pte_t *pbase = (pte_t *)page_address(base);
- unsigned long pfn, pfninc = 1;
+ unsigned long ref_pfn, pfn, pfninc = 1;
unsigned int i, level;
pte_t *tmp;
pgprot_t ref_prot;
@@ -623,26 +623,33 @@ __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
}

paravirt_alloc_pte(&init_mm, page_to_pfn(base));
- ref_prot = pte_pgprot(pte_clrhuge(*kpte));

- /* promote PAT bit to correct position */
- if (level == PG_LEVEL_2M)
+ switch (level) {
+ case PG_LEVEL_2M:
+ ref_prot = pmd_pgprot(*(pmd_t *)kpte);
+ /* clear PSE and promote PAT bit to correct position */
ref_prot = pgprot_large_2_4k(ref_prot);
+ ref_pfn = pmd_pfn(*(pmd_t *)kpte);
+ break;

-#ifdef CONFIG_X86_64
- if (level == PG_LEVEL_1G) {
+ case PG_LEVEL_1G:
+ ref_prot = pud_pgprot(*(pud_t *)kpte);
+ ref_pfn = pud_pfn(*(pud_t *)kpte);
pfninc = PMD_PAGE_SIZE >> PAGE_SHIFT;
+
/*
- * Set the PSE flags only if the PRESENT flag is set
+ * Clear the PSE flags if the PRESENT flag is not set
* otherwise pmd_present/pmd_huge will return true
* even on a non present pmd.
*/
- if (pgprot_val(ref_prot) & _PAGE_PRESENT)
- pgprot_val(ref_prot) |= _PAGE_PSE;
- else
+ if (!(pgprot_val(ref_prot) & _PAGE_PRESENT))
pgprot_val(ref_prot) &= ~_PAGE_PSE;
+ break;
+
+ default:
+ spin_unlock(&pgd_lock);
+ return 1;
}
-#endif

/*
* Set the GLOBAL flags only if the PRESENT flag is set
@@ -658,7 +665,7 @@ __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
/*
* Get the target pfn from the original entry:
*/
- pfn = pte_pfn(*kpte);
+ pfn = ref_pfn;
for (i = 0; i < PTRS_PER_PTE; i++, pfn += pfninc)
set_pte(&pbase[i], pfn_pte(pfn, canon_pgprot(ref_prot)));

2015-08-05 21:45:54

by Toshi Kani

Subject: [PATCH v3 10/10] x86/mm: Fix the same pgprot handling in try_preserve_large_page()

try_preserve_large_page() checks if new_prot is the same as
old_prot. If so, it simply sets do_split to 0 and returns without
doing anything. However, old_prot is set as a 4KB pgprot value
while new_prot is a large page pgprot value.

Now that old_prot is initially set from p?d_pgprot() as a
large page pgprot value, fix it by not overwriting old_prot
with a 4KB pgprot value.
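
To illustrate the mismatch (a sketch; the bit positions are those of
_PAGE_PAT_LARGE and _PAGE_PAT in pgtable_types.h):

/* the same PAT attribute encoded in the two formats */
pgprotval_t large_fmt = _PAGE_PSE | _PAGE_PAT_LARGE;  /* PAT at bit 12 */
pgprotval_t fmt_4k =
	pgprot_val(pgprot_large_2_4k(__pgprot(large_fmt))); /* PAT at bit 7 */

/*
 * large_fmt != fmt_4k whenever the PAT bit is set, so comparing a
 * 4KB-format old_prot against a large-page-format new_prot can
 * never report "same".
 */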

Signed-off-by: Toshi Kani <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Andrew Morton <[email protected]>
---
arch/x86/mm/pageattr.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index d0e40ed..6f9b885 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -519,7 +519,7 @@ try_preserve_large_page(pte_t *kpte, unsigned long address,
* up accordingly.
*/
old_pte = *kpte;
- old_prot = req_prot = pgprot_large_2_4k(old_prot);
+ req_prot = pgprot_large_2_4k(old_prot);

pgprot_val(req_prot) &= ~pgprot_val(cpa->mask_clr);
pgprot_val(req_prot) |= pgprot_val(cpa->mask_set);

2015-08-20 19:47:29

by Thomas Gleixner

Subject: Re: [PATCH v3 1/10] x86/vdso32: Define PGTABLE_LEVELS to 32bit VDSO

On Wed, 5 Aug 2015, Toshi Kani wrote:

> In case of CONFIG_X86_64, vdso32/vclock_gettime.c fakes a 32bit
> kernel configuration by re-defining it to CONFIG_X86_32. However,
> it does not re-define CONFIG_PGTABLE_LEVELS leaving it as 4 levels.
> Fix it by re-defining CONFIG_PGTABLE_LEVELS to 2 as X86_PAE is not
> set.

You fail to explain WHY this is required. I have not yet spotted any
code in vclock_gettime.c which is affected by this.

Thanks,

tglx

2015-08-20 23:04:29

by Toshi Kani

Subject: Re: [PATCH v3 1/10] x86/vdso32: Define PGTABLE_LEVELS to 32bit VDSO

On 8/20/2015 1:46 PM, Thomas Gleixner wrote:
> On Wed, 5 Aug 2015, Toshi Kani wrote:
>
>> In case of CONFIG_X86_64, vdso32/vclock_gettime.c fakes a 32bit
>> kernel configuration by re-defining it to CONFIG_X86_32. However,
>> it does not re-define CONFIG_PGTABLE_LEVELS leaving it as 4 levels.
>> Fix it by re-defining CONFIG_PGTABLE_LEVELS to 2 as X86_PAE is not
>> set.
> You fail to explain WHY this is required. I have not yet spotted any
> code in vclock_gettime.c which is affected by this.

Sorry about that. Without this patch 01, applying patches 02 & 03 causes
the following compile errors in vclock_gettime.c. This is because it
includes pgtable_types.h (see below), which now requires PUD_SHIFT and
PMD_SHIFT to be defined properly. In the case of X86_32, pgtable_types.h
includes pgtable-nopud.h and pgtable-nopmd.h, which define these SHIFTs
when CONFIG_PGTABLE_LEVELS is set to 2 (or 3 if PAE is also defined).

In file included from ./arch/x86/include/asm/paravirt_types.h:44:0,
from ./arch/x86/include/asm/ptrace.h:71,
from ./arch/x86/include/asm/alternative.h:8,
from ./arch/x86/include/asm/bitops.h:16,
from include/linux/bitops.h:36,
from include/linux/kernel.h:10,
from include/linux/list.h:8,
from include/linux/preempt.h:10,
from include/linux/spinlock.h:50,
from include/linux/seqlock.h:35,
from include/linux/time.h:5,
from include/uapi/linux/timex.h:56,
from include/linux/timex.h:56,
from include/linux/clocksource.h:12,
from ./arch/x86/include/asm/vgtod.h:5,
from arch/x86/entry/vdso/vdso32/../vclock_gettime.c:15,
from arch/x86/entry/vdso/vdso32/vclock_gettime.c:30:
./arch/x86/include/asm/pgtable_types.h: In function 'pud_pfn_mask':
./arch/x86/include/asm/pgtable_types.h:282:23: error: 'PUD_SHIFT'
undeclared (first use in this function)
return PUD_PAGE_MASK & PHYSICAL_PAGE_MASK;
^
./arch/x86/include/asm/pgtable_types.h:282:23: note: each undeclared
identifier is reported only once for each function it appears in
./arch/x86/include/asm/pgtable_types.h: In function 'pud_flags_mask':
./arch/x86/include/asm/pgtable_types.h:290:25: error: 'PUD_SHIFT'
undeclared (first use in this function)
return ~(PUD_PAGE_MASK & (pudval_t)PHYSICAL_PAGE_MASK);
^
./arch/x86/include/asm/pgtable_types.h: In function 'pmd_pfn_mask':
./arch/x86/include/asm/pgtable_types.h:303:23: error: 'PMD_SHIFT'
undeclared (first use in this function)
return PMD_PAGE_MASK & PHYSICAL_PAGE_MASK;
^
./arch/x86/include/asm/pgtable_types.h: In function 'pmd_flags_mask':
./arch/x86/include/asm/pgtable_types.h:311:25: error: 'PMD_SHIFT'
undeclared (first use in this function)
return ~(PMD_PAGE_MASK & (pmdval_t)PHYSICAL_PAGE_MASK);
^
scripts/Makefile.build:258: recipe for target
'arch/x86/entry/vdso/vdso32/vclock_gettime.o' failed
make[3]: *** [arch/x86/entry/vdso/vdso32/vclock_gettime.o] Error 1
make[3]: *** Waiting for unfinished jobs....
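
For reference, the SHIFT definitions those headers supply look like
this (a sketch of the asm-generic headers, not part of this patchset):

/* <asm-generic/pgtable-nopud.h>: the pud level is folded into pgd */
#define PUD_SHIFT	PGDIR_SHIFT

/* <asm-generic/pgtable-nopmd.h>: the pmd level is folded into pud */
#define PMD_SHIFT	PUD_SHIFT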

Thanks,
-Toshi

2015-08-24 21:50:09

by Toshi Kani

Subject: Re: [PATCH v3 1/10] x86/vdso32: Define PGTABLE_LEVELS to 32bit VDSO

On Thu, 2015-08-20 at 17:04 -0600, Toshi Kani wrote:
> On 8/20/2015 1:46 PM, Thomas Gleixner wrote:
> > On Wed, 5 Aug 2015, Toshi Kani wrote:
> >
> > > In case of CONFIG_X86_64, vdso32/vclock_gettime.c fakes a 32bit
> > > kernel configuration by re-defining it to CONFIG_X86_32. However,
> > > it does not re-define CONFIG_PGTABLE_LEVELS leaving it as 4 levels.
> > > Fix it by re-defining CONFIG_PGTABLE_LEVELS to 2 as X86_PAE is not
> > > set.
> > You fail to explain WHY this is required. I have not yet spotted any
> > code in vclock_gettime.c which is affected by this.
>
> Sorry about that. Without this patch 01, applying patches 02 & 03 causes
> the following compile errors in vclock_gettime.c. This is because it
> includes pgtable_types.h (see below), which now requires PUD_SHIFT and
> PMD_SHIFT to be defined properly. In the case of X86_32, pgtable_types.h
> includes pgtable-nopud.h and pgtable-nopmd.h, which define these SHIFTs
> when CONFIG_PGTABLE_LEVELS is set to 2 (or 3 if PAE is also defined).
> :

Attached is patch 01/10 with updated descriptions. The rest of the patchset
still applies cleanly.

Please let me know if you have any further comments.
Thanks,
-Toshi

----
Subject: [PATCH v3 UPDATE 1/10] x86/vdso32: Define PGTABLE_LEVELS to 32bit VDSO

In case of CONFIG_X86_64, vdso32/vclock_gettime.c fakes a 32-bit
non-PAE kernel configuration by re-defining it to CONFIG_X86_32.
However, it does not re-define CONFIG_PGTABLE_LEVELS leaving it
as 4 levels.

This mismatch leads <asm/pgtable_types.h> to NOT include <asm-generic/
pgtable-nopud.h> and <asm-generic/pgtable-nopmd.h>, which will cause
compile errors when a later patch enhances <asm/pgtable_types.h> to
use PUD_SHIFT and PMD_SHIFT. These -nopud & -nopmd headers define
these SHIFTs for the 32-bit non-PAE kernel.

Fix it by re-defining CONFIG_PGTABLE_LEVELS to 2 levels.

Signed-off-by: Toshi Kani <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
---
arch/x86/entry/vdso/vdso32/vclock_gettime.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/x86/entry/vdso/vdso32/vclock_gettime.c b/arch/x86/entry/vdso/vdso32/vclock_gettime.c
index 175cc72..87a86e0 100644
--- a/arch/x86/entry/vdso/vdso32/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vdso32/vclock_gettime.c
@@ -14,11 +14,13 @@
*/
#undef CONFIG_64BIT
#undef CONFIG_X86_64
+#undef CONFIG_PGTABLE_LEVELS
#undef CONFIG_ILLEGAL_POINTER_VALUE
#undef CONFIG_SPARSEMEM_VMEMMAP
#undef CONFIG_NR_CPUS

#define CONFIG_X86_32 1
+#define CONFIG_PGTABLE_LEVELS 2
#define CONFIG_PAGE_OFFSET 0
#define CONFIG_ILLEGAL_POINTER_VALUE 0
#define CONFIG_NR_CPUS 1

2015-08-25 08:17:14

by Thomas Gleixner

Subject: Re: [PATCH v3 3/10] x86/asm: Fix pud/pmd interfaces to handle large PAT bit

On Wed, 5 Aug 2015, Toshi Kani wrote:

> The PAT bit gets relocated to bit 12 when PUD and PMD mappings are
> used. This bit 12, however, is not covered by PTE_FLAGS_MASK, which
> is currently used for masking pfn and flags for all cases.
>
> Fix pud/pmd interfaces to handle pfn and flags properly by using
> P?D_PAGE_MASK when PUD/PMD mappings are used, i.e. PSE bit is set.

Can you please split that into a patch introducing and describing the
new mask helper functions and a second one making use of it?

Thanks,

tglx

2015-08-25 14:17:58

by Toshi Kani

Subject: Re: [PATCH v3 3/10] x86/asm: Fix pud/pmd interfaces to handle large PAT bit

On Tue, 2015-08-25 at 10:16 +0200, Thomas Gleixner wrote:
> On Wed, 5 Aug 2015, Toshi Kani wrote:
>
> > The PAT bit gets relocated to bit 12 when PUD and PMD mappings are
> > used. This bit 12, however, is not covered by PTE_FLAGS_MASK, which
> > is currently used for masking pfn and flags for all cases.
> >
> > Fix pud/pmd interfaces to handle pfn and flags properly by using
> > P?D_PAGE_MASK when PUD/PMD mappings are used, i.e. PSE bit is set.
>
> Can you please split that into a patch introducing and describing the
> new mask helper functions and a second one making use of it?

Will do. I will send out the v4 patchset today with this update (and the patch
01 update).

Thanks,
-Toshi