2014-11-20 10:25:54

by Mel Gorman

Subject: [PATCH 0/10] Replace _PAGE_NUMA with PAGE_NONE protections v2

V1 failed while running under kvm-tools very quickly and a second report
indicated that it happens on bare metal as well. This version survived
an overnight run of trinity running under kvm-tools here but verification
from Sasha would be appreciated.

Changelog since V1
o ppc64 paranoia checks and clarifications (aneesh)
o Fix trinity regression (hopefully)
o Reduce unnecessary TLB flushes (mel)

Automatic NUMA balancing depends on being able to protect PTEs to trap a
fault and gather reference locality information. Very broadly speaking it
would mark PTEs as not present and use another bit to distinguish between
NUMA hinting faults and other types of faults. It was universally loved
by everybody and caused no problems whatsoever. That last sentence might
be a lie.
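
A minimal sketch of the distinction the new scheme relies on (illustrative
only, not a hunk from any of the patches): the hardware only ever sees an
ordinary PROT_NONE protection, and the VM tells a NUMA hinting fault apart
from a real PROT_NONE mapping by looking at the VMA. pte_protnone_numa() is
the helper introduced in patch 02; the wrapper name below is invented for
the example.

static inline bool pte_is_numa_hint_sketch(struct vm_area_struct *vma,
					   pte_t pte)
{
	/* Hardware view: nothing NUMA-special, just a protnone entry */
	if (!pte_protnone_numa(pte))
		return false;
	/*
	 * VM view: an accessible VMA means the protection was applied by
	 * the NUMA hinting scanner rather than by mprotect(PROT_NONE).
	 */
	return vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
}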

This series is very heavily based on patches from Linus and Aneesh to
replace the existing PTE/PMD NUMA helper functions with normal change
protections. I did alter and add parts of it but I consider them relatively
minor contributions. At their suggestion, acked-bys are in there but I've
no problem converting them to Signed-off-by if requested.

AFAIK, this has received no testing on ppc64 and I'm depending on Aneesh for
that. I tested trinity under kvm-tool and it passed, and I ran a few other basic
tests. In most cases I'm leaving out detail as it's not that interesting.

specjbb single JVM: There was negligible performance difference in the
benchmark itself for short and long runs. However, system activity
is higher and interrupts are much higher over time -- possibly due to
TLB flushes. Migrations are also higher. Overall, this is more
overhead but, considering the problems faced with the old approach,
I think we just have to suck it up and find another way of reducing
the overhead.

specjbb multi JVM: Negligible performance difference to the actual benchmark
but, like the single JVM case, the system overhead is noticeably
higher. Again, interrupts are a major factor.

autonumabench: This was all over the place and about all that can be
reasonably concluded is that it's different but not necessarily
better or worse.

autonumabench
3.18.0-rc4 3.18.0-rc4
vanilla protnone-v2r5
User NUMA01 32806.01 ( 0.00%) 20250.67 ( 38.27%)
User NUMA01_THEADLOCAL 23910.28 ( 0.00%) 22734.37 ( 4.92%)
User NUMA02 3176.85 ( 0.00%) 3082.68 ( 2.96%)
User NUMA02_SMT 1600.06 ( 0.00%) 1547.08 ( 3.31%)
System NUMA01 719.07 ( 0.00%) 1344.39 (-86.96%)
System NUMA01_THEADLOCAL 916.26 ( 0.00%) 180.90 ( 80.26%)
System NUMA02 20.92 ( 0.00%) 17.34 ( 17.11%)
System NUMA02_SMT 8.76 ( 0.00%) 7.24 ( 17.35%)
Elapsed NUMA01 728.27 ( 0.00%) 519.28 ( 28.70%)
Elapsed NUMA01_THEADLOCAL 589.15 ( 0.00%) 554.73 ( 5.84%)
Elapsed NUMA02 81.20 ( 0.00%) 81.72 ( -0.64%)
Elapsed NUMA02_SMT 80.49 ( 0.00%) 79.58 ( 1.13%)
CPU NUMA01 4603.00 ( 0.00%) 4158.00 ( 9.67%)
CPU NUMA01_THEADLOCAL 4213.00 ( 0.00%) 4130.00 ( 1.97%)
CPU NUMA02 3937.00 ( 0.00%) 3793.00 ( 3.66%)
CPU NUMA02_SMT 1998.00 ( 0.00%) 1952.00 ( 2.30%)


System CPU usage of NUMA01 is worse but it's an adverse workload on this
machine so I'm reluctant to conclude that it's a problem that matters. On
the other workloads that are sensible on this machine, system CPU usage
is great. Overall time to complete the benchmark is comparable.

3.18.0-rc4 3.18.0-rc4
vanilla protnone-v2r5
User 61493.38 47615.01
System 1665.17 1550.07
Elapsed 1480.79 1236.74

NUMA alloc hit 4739774 5328362
NUMA alloc miss 0 0
NUMA interleave hit 0 0
NUMA alloc local 4664980 5328351
NUMA base PTE updates 556489407 444119981
NUMA huge PMD updates 1086000 866680
NUMA page range updates 1112521407 887860141
NUMA hint faults 1538964 1242142
NUMA hint local faults 835871 814313
NUMA hint local percent 54 65
NUMA pages migrated 7329212 59883854

The NUMA pages migrated look terrible but when I looked at a graph of the
activity over time I saw that the massive spike in migration activity was
during NUMA01. This correlates with high system CPU usage and could be simply
down to bad luck but any modifications that affect that workload would be
related to scan rates and migrations, not the protection mechanism. For
all other workloads, migration activity was comparable.

Overall, headline performance figures are comparable but the overhead
is higher, mostly in interrupts. To some extent, higher overhead from
this approach was anticipated but not to this degree. It's going to be
necessary to reduce this overhead again with a separate series in the future. It's
still worth going ahead with this series though as it's likely to avoid
constant headaches with Xen and is probably easier to maintain.

arch/powerpc/include/asm/pgtable.h | 53 ++----------
arch/powerpc/include/asm/pte-common.h | 5 --
arch/powerpc/include/asm/pte-hash64.h | 6 --
arch/powerpc/kvm/book3s_hv_rm_mmu.c | 2 +-
arch/powerpc/mm/copro_fault.c | 8 +-
arch/powerpc/mm/fault.c | 25 ++----
arch/powerpc/mm/gup.c | 4 +-
arch/powerpc/mm/pgtable.c | 8 +-
arch/powerpc/mm/pgtable_64.c | 3 +-
arch/x86/include/asm/pgtable.h | 46 +++++-----
arch/x86/include/asm/pgtable_64.h | 5 --
arch/x86/include/asm/pgtable_types.h | 41 +--------
arch/x86/mm/gup.c | 4 +-
include/asm-generic/pgtable.h | 152 ++--------------------------------
include/linux/migrate.h | 4 -
include/linux/swapops.h | 2 +-
include/uapi/linux/mempolicy.h | 2 +-
mm/gup.c | 8 +-
mm/huge_memory.c | 50 ++++++-----
mm/memory.c | 18 ++--
mm/mempolicy.c | 2 +-
mm/migrate.c | 8 +-
mm/mprotect.c | 48 +++++------
mm/pgtable-generic.c | 2 -
24 files changed, 131 insertions(+), 375 deletions(-)

--
2.1.2


2014-11-20 10:19:59

by Mel Gorman

Subject: [PATCH 01/10] mm: numa: Do not dereference pmd outside of the lock during NUMA hinting fault

A transhuge NUMA hinting fault may find the page is migrating and should
wait until migration completes. The check is race-prone because the pmd
is dereferenced outside of the page lock and, while the race window is tiny,
it will be larger if the PMD is cleared while marking PMDs for hinting faults.
This patch closes the race.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/migrate.h | 4 ----
mm/huge_memory.c | 3 ++-
mm/migrate.c | 6 ------
3 files changed, 2 insertions(+), 11 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 01aad3e..a3edcdf 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -77,7 +77,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,

#ifdef CONFIG_NUMA_BALANCING
extern bool pmd_trans_migrating(pmd_t pmd);
-extern void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd);
extern int migrate_misplaced_page(struct page *page,
struct vm_area_struct *vma, int node);
extern bool migrate_ratelimited(int node);
@@ -86,9 +85,6 @@ static inline bool pmd_trans_migrating(pmd_t pmd)
{
return false;
}
-static inline void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd)
-{
-}
static inline int migrate_misplaced_page(struct page *page,
struct vm_area_struct *vma, int node)
{
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index de98415..38fa6cc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1284,8 +1284,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
* check_same as the page may no longer be mapped.
*/
if (unlikely(pmd_trans_migrating(*pmdp))) {
+ page = pmd_page(*pmdp);
spin_unlock(ptl);
- wait_migrate_huge_page(vma->anon_vma, pmdp);
+ wait_on_page_locked(page);
goto out;
}

diff --git a/mm/migrate.c b/mm/migrate.c
index 0143995..baaf80e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1698,12 +1698,6 @@ bool pmd_trans_migrating(pmd_t pmd)
return PageLocked(page);
}

-void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd)
-{
- struct page *page = pmd_page(*pmd);
- wait_on_page_locked(page);
-}
-
/*
* Attempt to migrate a misplaced page to the specified destination
* node. Caller is expected to have an elevated reference count on
--
2.1.2

2014-11-20 10:20:14

by Mel Gorman

Subject: [PATCH 10/10] mm: numa: Avoid unnecessary TLB flushes when setting NUMA hinting entries

If a PTE or PMD is already marked NUMA when scanning to mark entries
for NUMA hinting then it is not necessary to update the entry and
incur a TLB flush penalty. Avoid the overhead where possible.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/huge_memory.c | 14 ++++++++------
mm/mprotect.c | 4 ++++
2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6458229f..a7ea9b8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1524,12 +1524,14 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
return 0;
}

- ret = 1;
- entry = pmdp_get_and_clear(mm, addr, pmd);
- entry = pmd_modify(entry, newprot);
- ret = HPAGE_PMD_NR;
- set_pmd_at(mm, addr, pmd, entry);
- BUG_ON(pmd_write(entry));
+ if (!prot_numa || !pmd_protnone_numa(*pmd)) {
+ ret = 1;
+ entry = pmdp_get_and_clear(mm, addr, pmd);
+ entry = pmd_modify(entry, newprot);
+ ret = HPAGE_PMD_NR;
+ set_pmd_at(mm, addr, pmd, entry);
+ BUG_ON(pmd_write(entry));
+ }
spin_unlock(ptl);
}

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 33dfafb..eb890d0 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -86,6 +86,10 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
page = vm_normal_page(vma, addr, oldpte);
if (!page || PageKsm(page))
continue;
+
+ /* Avoid TLB flush if possible */
+ if (pte_protnone_numa(oldpte))
+ continue;
}

ptent = ptep_modify_prot_start(mm, addr, pte);
--
2.1.2

2014-11-20 10:20:11

by Mel Gorman

Subject: [PATCH 03/10] mm: Convert p[te|md]_numa users to p[te|md]_protnone_numa

Convert existing users of pte_numa and friends to the new helper. Note
that the kernel is broken after this patch is applied until the other
page table modifiers are also altered. This patch layout is to make
review easier.

Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Acked-by: Aneesh Kumar <[email protected]>
---
arch/powerpc/kvm/book3s_hv_rm_mmu.c | 2 +-
arch/powerpc/mm/fault.c | 5 -----
arch/powerpc/mm/gup.c | 4 ++--
arch/powerpc/mm/pgtable.c | 8 +++++++-
arch/powerpc/mm/pgtable_64.c | 3 ++-
arch/x86/mm/gup.c | 4 ++--
include/uapi/linux/mempolicy.h | 2 +-
mm/gup.c | 8 ++++----
mm/huge_memory.c | 16 +++++++--------
mm/memory.c | 4 ++--
mm/mprotect.c | 39 ++++++++++---------------------------
mm/pgtable-generic.c | 2 +-
12 files changed, 40 insertions(+), 57 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 084ad54..bbd8499 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -235,7 +235,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
pte_size = psize;
pte = lookup_linux_pte_and_update(pgdir, hva, writing,
&pte_size);
- if (pte_present(pte) && !pte_numa(pte)) {
+ if (pte_present(pte) && !pte_protnone_numa(pte)) {
if (writing && !pte_write(pte))
/* make the actual HPTE be read-only */
ptel = hpte_make_readonly(ptel);
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 08d659a..5007497 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -405,8 +405,6 @@ good_area:
* processors use the same I/D cache coherency mechanism
* as embedded.
*/
- if (error_code & DSISR_PROTFAULT)
- goto bad_area;
#endif /* CONFIG_PPC_STD_MMU */

/*
@@ -430,9 +428,6 @@ good_area:
flags |= FAULT_FLAG_WRITE;
/* a read */
} else {
- /* protection fault */
- if (error_code & 0x08000000)
- goto bad_area;
if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
goto bad_area;
}
diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
index d874668..d870d93 100644
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -39,7 +39,7 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
/*
* Similar to the PMD case, NUMA hinting must take slow path
*/
- if (pte_numa(pte))
+ if (pte_protnone_numa(pte))
return 0;

if ((pte_val(pte) & mask) != result)
@@ -85,7 +85,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
* slowpath for accounting purposes and so that they
* can be serialised against THP migration.
*/
- if (pmd_numa(pmd))
+ if (pmd_protnone_numa(pmd))
return 0;

if (!gup_hugepte((pte_t *)pmdp, PMD_SIZE, addr, next,
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index c90e602..b5d58d3 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -173,7 +173,13 @@ void set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
pte_t pte)
{
#ifdef CONFIG_DEBUG_VM
- WARN_ON(pte_val(*ptep) & _PAGE_PRESENT);
+ /*
+ * When handling numa faults, we already have the pte marked
+ * _PAGE_PRESENT, but we can be sure that it is not in hpte.
+ * Hence we can use set_pte_at for them.
+ */
+ WARN_ON((pte_val(*ptep) & (_PAGE_PRESENT | _PAGE_USER)) ==
+ (_PAGE_PRESENT | _PAGE_USER));
#endif
/* Note: mm->context.id might not yet have been assigned as
* this context might not have been activated yet when this
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index c8d709a..c721c5e 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -710,7 +710,8 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr,
pmd_t *pmdp, pmd_t pmd)
{
#ifdef CONFIG_DEBUG_VM
- WARN_ON(pmd_val(*pmdp) & _PAGE_PRESENT);
+ WARN_ON((pmd_val(*pmdp) & (_PAGE_PRESENT | _PAGE_USER)) ==
+ (_PAGE_PRESENT | _PAGE_USER));
assert_spin_locked(&mm->page_table_lock);
WARN_ON(!pmd_trans_huge(pmd));
#endif
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 207d9aef..47ce479 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -84,7 +84,7 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
struct page *page;

/* Similar to the PMD case, NUMA hinting must take slow path */
- if (pte_numa(pte)) {
+ if (pte_protnone_numa(pte)) {
pte_unmap(ptep);
return 0;
}
@@ -178,7 +178,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
* slowpath for accounting purposes and so that they
* can be serialised against THP migration.
*/
- if (pmd_numa(pmd))
+ if (pmd_protnone_numa(pmd))
return 0;
if (!gup_huge_pmd(pmd, addr, next, write, pages, nr))
return 0;
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 0d11c3d..e52379b 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -67,7 +67,7 @@ enum mpol_rebind_step {
#define MPOL_F_LOCAL (1 << 1) /* preferred local allocation */
#define MPOL_F_REBINDING (1 << 2) /* identify policies in rebinding */
#define MPOL_F_MOF (1 << 3) /* this policy wants migrate on fault */
-#define MPOL_F_MORON (1 << 4) /* Migrate On pte_numa Reference On Node */
+#define MPOL_F_MORON (1 << 4) /* Migrate On pte_protnone_numa Reference On Node */


#endif /* _UAPI_LINUX_MEMPOLICY_H */
diff --git a/mm/gup.c b/mm/gup.c
index cd62c8c..aec34cb 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -64,7 +64,7 @@ retry:
migration_entry_wait(mm, pmd, address);
goto retry;
}
- if ((flags & FOLL_NUMA) && pte_numa(pte))
+ if ((flags & FOLL_NUMA) && pte_protnone_numa(pte))
goto no_page;
if ((flags & FOLL_WRITE) && !pte_write(pte)) {
pte_unmap_unlock(ptep, ptl);
@@ -193,7 +193,7 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
}
return page;
}
- if ((flags & FOLL_NUMA) && pmd_numa(*pmd))
+ if ((flags & FOLL_NUMA) && pmd_protnone_numa(*pmd))
return no_page_table(vma, flags);
if (pmd_trans_huge(*pmd)) {
if (flags & FOLL_SPLIT) {
@@ -743,7 +743,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
* path
*/
if (!pte_present(pte) || pte_special(pte) ||
- pte_numa(pte) || (write && !pte_write(pte)))
+ pte_protnone_numa(pte) || (write && !pte_write(pte)))
goto pte_unmap;

VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
@@ -895,7 +895,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
* slowpath for accounting purposes and so that they
* can be serialised against THP migration.
*/
- if (pmd_numa(pmd))
+ if (pmd_protnone_numa(pmd))
return 0;

if (!gup_huge_pmd(pmd, pmdp, addr, next, write,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 38fa6cc..ec0b424 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1223,7 +1223,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
return ERR_PTR(-EFAULT);

/* Full NUMA hinting faults to serialise migration in fault paths */
- if ((flags & FOLL_NUMA) && pmd_numa(*pmd))
+ if ((flags & FOLL_NUMA) && pmd_protnone_numa(*pmd))
goto out;

page = pmd_page(*pmd);
@@ -1354,7 +1354,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,

/*
* Migrate the THP to the requested node, returns with page unlocked
- * and pmd_numa cleared.
+ * and access rights restored.
*/
spin_unlock(ptl);
migrated = migrate_misplaced_transhuge_page(mm, vma,
@@ -1369,7 +1369,7 @@ clear_pmdnuma:
BUG_ON(!PageLocked(page));
pmd = pmd_mknonnuma(pmd);
set_pmd_at(mm, haddr, pmdp, pmd);
- VM_BUG_ON(pmd_numa(*pmdp));
+ VM_BUG_ON(pmd_protnone_numa(*pmdp));
update_mmu_cache_pmd(vma, addr, pmdp);
unlock_page(page);
out_unlock:
@@ -1514,7 +1514,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
ret = 1;
if (!prot_numa) {
entry = pmdp_get_and_clear(mm, addr, pmd);
- if (pmd_numa(entry))
+ if (pmd_protnone_numa(entry))
entry = pmd_mknonnuma(entry);
entry = pmd_modify(entry, newprot);
ret = HPAGE_PMD_NR;
@@ -1530,7 +1530,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
* local vs remote hits on the zero page.
*/
if (!is_huge_zero_page(page) &&
- !pmd_numa(*pmd)) {
+ !pmd_protnone_numa(*pmd)) {
pmdp_set_numa(mm, addr, pmd);
ret = HPAGE_PMD_NR;
}
@@ -1797,9 +1797,9 @@ static int __split_huge_page_map(struct page *page,
pte_t *pte, entry;
BUG_ON(PageCompound(page+i));
/*
- * Note that pmd_numa is not transferred deliberately
- * to avoid any possibility that pte_numa leaks to
- * a PROT_NONE VMA by accident.
+ * Note that NUMA hinting access restrictions are not
+ * transferred to avoid any possibility of altering
+ * permissions across VMAs.
*/
entry = mk_pte(page + i, vma->vm_page_prot);
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
diff --git a/mm/memory.c b/mm/memory.c
index 3e50383..96ceb0a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3220,7 +3220,7 @@ static int handle_pte_fault(struct mm_struct *mm,
pte, pmd, flags, entry);
}

- if (pte_numa(entry))
+ if (pte_protnone_numa(entry))
return do_numa_page(mm, vma, address, entry, pte, pmd);

ptl = pte_lockptr(mm, pmd);
@@ -3298,7 +3298,7 @@ static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (pmd_trans_splitting(orig_pmd))
return 0;

- if (pmd_numa(orig_pmd))
+ if (pmd_protnone_numa(orig_pmd))
return do_huge_pmd_numa_page(mm, vma, address,
orig_pmd, pmd);

diff --git a/mm/mprotect.c b/mm/mprotect.c
index ace9345..e93ddac 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -75,36 +75,17 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
oldpte = *pte;
if (pte_present(oldpte)) {
pte_t ptent;
- bool updated = false;
-
- if (!prot_numa) {
- ptent = ptep_modify_prot_start(mm, addr, pte);
- if (pte_numa(ptent))
- ptent = pte_mknonnuma(ptent);
- ptent = pte_modify(ptent, newprot);
- /*
- * Avoid taking write faults for pages we
- * know to be dirty.
- */
- if (dirty_accountable && pte_dirty(ptent) &&
- (pte_soft_dirty(ptent) ||
- !(vma->vm_flags & VM_SOFTDIRTY)))
- ptent = pte_mkwrite(ptent);
- ptep_modify_prot_commit(mm, addr, pte, ptent);
- updated = true;
- } else {
- struct page *page;
-
- page = vm_normal_page(vma, addr, oldpte);
- if (page && !PageKsm(page)) {
- if (!pte_numa(oldpte)) {
- ptep_set_numa(mm, addr, pte);
- updated = true;
- }
- }
+ ptent = ptep_modify_prot_start(mm, addr, pte);
+ ptent = pte_modify(ptent, newprot);
+
+ /* Avoid taking write faults for known dirty pages */
+ if (dirty_accountable && pte_dirty(ptent) &&
+ (pte_soft_dirty(ptent) ||
+ !(vma->vm_flags & VM_SOFTDIRTY))) {
+ ptent = pte_mkwrite(ptent);
}
- if (updated)
- pages++;
+ ptep_modify_prot_commit(mm, addr, pte, ptent);
+ pages++;
} else if (IS_ENABLED(CONFIG_MIGRATION) && !pte_file(oldpte)) {
swp_entry_t entry = pte_to_swp_entry(oldpte);

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index dfb79e0..a2d8587 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -193,7 +193,7 @@ void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmdp)
{
pmd_t entry = *pmdp;
- if (pmd_numa(entry))
+ if (pmd_protnone_numa(entry))
entry = pmd_mknonnuma(entry);
set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
--
2.1.2

2014-11-20 10:20:09

by Mel Gorman

Subject: [PATCH 05/10] mm: Convert p[te|md]_mknonnuma and remaining page table manipulations

With PROT_NONE, the traditional page table manipulation functions are
sufficient.

Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Acked-by: Aneesh Kumar <[email protected]>
---
include/linux/huge_mm.h | 3 +--
mm/huge_memory.c | 33 +++++++--------------------------
mm/memory.c | 10 ++++++----
mm/mempolicy.c | 2 +-
mm/migrate.c | 2 +-
mm/mprotect.c | 2 +-
mm/pgtable-generic.c | 2 --
7 files changed, 17 insertions(+), 37 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ad9051b..554bbe3 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -31,8 +31,7 @@ extern int move_huge_pmd(struct vm_area_struct *vma,
unsigned long new_addr, unsigned long old_end,
pmd_t *old_pmd, pmd_t *new_pmd);
extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
- unsigned long addr, pgprot_t newprot,
- int prot_numa);
+ unsigned long addr, pgprot_t newprot);

enum transparent_hugepage_flag {
TRANSPARENT_HUGEPAGE_FLAG,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ec0b424..668f1a3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1367,9 +1367,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
goto out;
clear_pmdnuma:
BUG_ON(!PageLocked(page));
- pmd = pmd_mknonnuma(pmd);
+ pmd = pmd_modify(pmd, vma->vm_page_prot);
set_pmd_at(mm, haddr, pmdp, pmd);
- VM_BUG_ON(pmd_protnone_numa(*pmdp));
update_mmu_cache_pmd(vma, addr, pmdp);
unlock_page(page);
out_unlock:
@@ -1503,7 +1502,7 @@ out:
* - HPAGE_PMD_NR is protections changed and TLB flush necessary
*/
int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
- unsigned long addr, pgprot_t newprot, int prot_numa)
+ unsigned long addr, pgprot_t newprot)
{
struct mm_struct *mm = vma->vm_mm;
spinlock_t *ptl;
@@ -1512,29 +1511,11 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
pmd_t entry;
ret = 1;
- if (!prot_numa) {
- entry = pmdp_get_and_clear(mm, addr, pmd);
- if (pmd_protnone_numa(entry))
- entry = pmd_mknonnuma(entry);
- entry = pmd_modify(entry, newprot);
- ret = HPAGE_PMD_NR;
- set_pmd_at(mm, addr, pmd, entry);
- BUG_ON(pmd_write(entry));
- } else {
- struct page *page = pmd_page(*pmd);
-
- /*
- * Do not trap faults against the zero page. The
- * read-only data is likely to be read-cached on the
- * local CPU cache and it is less useful to know about
- * local vs remote hits on the zero page.
- */
- if (!is_huge_zero_page(page) &&
- !pmd_protnone_numa(*pmd)) {
- pmdp_set_numa(mm, addr, pmd);
- ret = HPAGE_PMD_NR;
- }
- }
+ entry = pmdp_get_and_clear(mm, addr, pmd);
+ entry = pmd_modify(entry, newprot);
+ ret = HPAGE_PMD_NR;
+ set_pmd_at(mm, addr, pmd, entry);
+ BUG_ON(pmd_write(entry));
spin_unlock(ptl);
}

diff --git a/mm/memory.c b/mm/memory.c
index 96ceb0a..900127b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3120,9 +3120,9 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
* validation through pte_unmap_same(). It's of NUMA type but
* the pfn may be screwed if the read is non atomic.
*
- * ptep_modify_prot_start is not called as this is clearing
- * the _PAGE_NUMA bit and it is not really expected that there
- * would be concurrent hardware modifications to the PTE.
+ * We can safely just do a "set_pte_at()", because the old
+ * page table entry is not accessible, so there would be no
+ * concurrent hardware modifications to the PTE.
*/
ptl = pte_lockptr(mm, pmd);
spin_lock(ptl);
@@ -3131,7 +3131,9 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
goto out;
}

- pte = pte_mknonnuma(pte);
+ /* Make it present again */
+ pte = pte_modify(pte, vma->vm_page_prot);
+ pte = pte_mkyoung(pte);
set_pte_at(mm, addr, ptep, pte);
update_mmu_cache(vma, addr, ptep);

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index e58725a..9d61dce 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -633,7 +633,7 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
{
int nr_updated;

- nr_updated = change_protection(vma, addr, end, vma->vm_page_prot, 0, 1);
+ nr_updated = change_protection(vma, addr, end, PAGE_NONE, 0, 1);
if (nr_updated)
count_vm_numa_events(NUMA_PTE_UPDATES, nr_updated);

diff --git a/mm/migrate.c b/mm/migrate.c
index baaf80e..e3282d4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1890,7 +1890,7 @@ out_fail:
out_dropref:
ptl = pmd_lock(mm, pmd);
if (pmd_same(*pmd, entry)) {
- entry = pmd_mknonnuma(entry);
+ entry = pmd_modify(entry, vma->vm_page_prot);
set_pmd_at(mm, mmun_start, pmd, entry);
update_mmu_cache_pmd(vma, address, &entry);
}
diff --git a/mm/mprotect.c b/mm/mprotect.c
index e93ddac..dc65c0f 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -141,7 +141,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
split_huge_page_pmd(vma, addr, pmd);
else {
int nr_ptes = change_huge_pmd(vma, pmd, addr,
- newprot, prot_numa);
+ newprot);

if (nr_ptes) {
if (nr_ptes == HPAGE_PMD_NR) {
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index a2d8587..c25f94b 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -193,8 +193,6 @@ void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmdp)
{
pmd_t entry = *pmdp;
- if (pmd_protnone_numa(entry))
- entry = pmd_mknonnuma(entry);
set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
}
--
2.1.2

2014-11-20 10:23:05

by Mel Gorman

Subject: [PATCH 09/10] mm: numa: Add paranoid check around pte_protnone_numa

pte_protnone_numa is only safe to use after VMA checks for PROT_NONE are
complete. Treating a real PROT_NONE PTE as a NUMA hinting fault is going
to result in strangeness so add a check for it. BUG_ON looks like overkill
but if this is hit then it's a serious bug that could result in corruption
so do not even try recovering. It would have been more comprehensive to
check VMA flags in pte_protnone_numa but it would have made the API ugly
just for a debugging check.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/huge_memory.c | 3 +++
mm/memory.c | 3 +++
2 files changed, 6 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3013eb8..6458229f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1274,6 +1274,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
bool migrated = false;
int flags = 0;

+ /* A PROT_NONE fault should not end up here */
+ BUG_ON(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)));
+
ptl = pmd_lock(mm, pmdp);
if (unlikely(!pmd_same(pmd, *pmdp)))
goto out_unlock;
diff --git a/mm/memory.c b/mm/memory.c
index a725c08..42d652d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3115,6 +3115,9 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
bool migrated = false;
int flags = 0;

+ /* A PROT_NONE fault should not end up here */
+ BUG_ON(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)));
+
/*
* The "pte" at this point cannot be used safely without
* validation through pte_unmap_same(). It's of NUMA type but
--
2.1.2

2014-11-20 10:20:06

by Mel Gorman

Subject: [PATCH 04/10] ppc64: Add paranoid warnings for unexpected DSISR_PROTFAULT

ppc64 should not be depending on DSISR_PROTFAULT and it is unexpected
if such protection faults are triggered. This patch adds warnings just in
case they are being accidentally depended upon.

Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Aneesh Kumar K.V <[email protected]>
---
arch/powerpc/mm/copro_fault.c | 8 ++++++--
arch/powerpc/mm/fault.c | 20 +++++++++-----------
2 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c
index 5a236f0..0450d68 100644
--- a/arch/powerpc/mm/copro_fault.c
+++ b/arch/powerpc/mm/copro_fault.c
@@ -64,10 +64,14 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned long ea,
if (!(vma->vm_flags & VM_WRITE))
goto out_unlock;
} else {
- if (dsisr & DSISR_PROTFAULT)
- goto out_unlock;
if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
goto out_unlock;
+ /*
+ * protfault should only happen due to us
+ * mapping a region readonly temporarily. PROT_NONE
+ * is also covered by the VMA check above.
+ */
+ WARN_ON_ONCE(dsisr & DSISR_PROTFAULT);
}

ret = 0;
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 5007497..9d6e0b3 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -396,17 +396,6 @@ good_area:
#endif /* CONFIG_8xx */

if (is_exec) {
-#ifdef CONFIG_PPC_STD_MMU
- /* Protection fault on exec go straight to failure on
- * Hash based MMUs as they either don't support per-page
- * execute permission, or if they do, it's handled already
- * at the hash level. This test would probably have to
- * be removed if we change the way this works to make hash
- * processors use the same I/D cache coherency mechanism
- * as embedded.
- */
-#endif /* CONFIG_PPC_STD_MMU */
-
/*
* Allow execution from readable areas if the MMU does not
* provide separate controls over reading and executing.
@@ -421,6 +410,14 @@ good_area:
(cpu_has_feature(CPU_FTR_NOEXECUTE) ||
!(vma->vm_flags & (VM_READ | VM_WRITE))))
goto bad_area;
+#ifdef CONFIG_PPC_STD_MMU
+ /*
+ * protfault should only happen due to us
+ * mapping a region readonly temporarily. PROT_NONE
+ * is also covered by the VMA check above.
+ */
+ WARN_ON_ONCE(error_code & DSISR_PROTFAULT);
+#endif /* CONFIG_PPC_STD_MMU */
/* a write */
} else if (is_write) {
if (!(vma->vm_flags & VM_WRITE))
@@ -430,6 +427,7 @@ good_area:
} else {
if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
goto bad_area;
+ WARN_ON_ONCE(error_code & DSISR_PROTFAULT);
}

/*
--
2.1.2

2014-11-20 10:23:36

by Mel Gorman

Subject: [PATCH 08/10] x86: mm: Restore original pte_special check

Commit b38af4721f59 ("x86,mm: fix pte_special versus pte_numa") adjusted
the pte_special check to take into account that a special pte had SPECIAL
and neither PRESENT nor PROTNONE. Now that NUMA hinting PTEs are no
longer modifying _PAGE_PRESENT it should be safe to restore the original
pte_special behaviour.

Signed-off-by: Mel Gorman <[email protected]>
---
arch/x86/include/asm/pgtable.h | 8 +-------
1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index f8799e0..5241332 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -131,13 +131,7 @@ static inline int pte_exec(pte_t pte)

static inline int pte_special(pte_t pte)
{
- /*
- * See CONFIG_NUMA_BALANCING pte_numa in include/asm-generic/pgtable.h.
- * On x86 we have _PAGE_BIT_NUMA == _PAGE_BIT_GLOBAL+1 ==
- * __PAGE_BIT_SOFTW1 == _PAGE_BIT_SPECIAL.
- */
- return (pte_flags(pte) & _PAGE_SPECIAL) &&
- (pte_flags(pte) & (_PAGE_PRESENT|_PAGE_PROTNONE));
+ return pte_flags(pte) & _PAGE_SPECIAL;
}

static inline unsigned long pte_pfn(pte_t pte)
--
2.1.2

2014-11-20 10:24:15

by Mel Gorman

Subject: [PATCH 07/10] mm: numa: Do not trap faults on the huge zero page

Faults on the huge zero page are pointless and there is a BUG_ON
to catch them at fault time. This patch reintroduces a check
that avoids marking the zero page PAGE_NONE.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/huge_mm.h | 3 ++-
mm/huge_memory.c | 13 ++++++++++++-
mm/memory.c | 1 -
mm/mprotect.c | 15 ++++++++++++++-
4 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 554bbe3..ad9051b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -31,7 +31,8 @@ extern int move_huge_pmd(struct vm_area_struct *vma,
unsigned long new_addr, unsigned long old_end,
pmd_t *old_pmd, pmd_t *new_pmd);
extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
- unsigned long addr, pgprot_t newprot);
+ unsigned long addr, pgprot_t newprot,
+ int prot_numa);

enum transparent_hugepage_flag {
TRANSPARENT_HUGEPAGE_FLAG,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 668f1a3..3013eb8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1502,7 +1502,7 @@ out:
* - HPAGE_PMD_NR is protections changed and TLB flush necessary
*/
int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
- unsigned long addr, pgprot_t newprot)
+ unsigned long addr, pgprot_t newprot, int prot_numa)
{
struct mm_struct *mm = vma->vm_mm;
spinlock_t *ptl;
@@ -1510,6 +1510,17 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,

if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
pmd_t entry;
+
+ /*
+ * Avoid trapping faults against the zero page. The read-only
+ * data is likely to be read-cached on the local CPU and
+ * local/remote hits to the zero page are not interesting.
+ */
+ if (prot_numa && is_huge_zero_pmd(*pmd)) {
+ spin_unlock(ptl);
+ return 0;
+ }
+
ret = 1;
entry = pmdp_get_and_clear(mm, addr, pmd);
entry = pmd_modify(entry, newprot);
diff --git a/mm/memory.c b/mm/memory.c
index 900127b..a725c08 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3142,7 +3142,6 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
pte_unmap_unlock(ptep, ptl);
return 0;
}
- BUG_ON(is_zero_pfn(page_to_pfn(page)));

/*
* Avoid grouping on DSO/COW pages in specific and RO pages
diff --git a/mm/mprotect.c b/mm/mprotect.c
index dc65c0f..33dfafb 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -75,6 +75,19 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
oldpte = *pte;
if (pte_present(oldpte)) {
pte_t ptent;
+
+ /*
+ * Avoid trapping faults against the zero or KSM
+ * pages. See similar comment in change_huge_pmd.
+ */
+ if (prot_numa) {
+ struct page *page;
+
+ page = vm_normal_page(vma, addr, oldpte);
+ if (!page || PageKsm(page))
+ continue;
+ }
+
ptent = ptep_modify_prot_start(mm, addr, pte);
ptent = pte_modify(ptent, newprot);

@@ -141,7 +154,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
split_huge_page_pmd(vma, addr, pmd);
else {
int nr_ptes = change_huge_pmd(vma, pmd, addr,
- newprot);
+ newprot, prot_numa);

if (nr_ptes) {
if (nr_ptes == HPAGE_PMD_NR) {
--
2.1.2

2014-11-20 10:25:06

by Mel Gorman

Subject: [PATCH 06/10] mm: Remove remaining references to NUMA hinting bits and helpers

This patch removes the NUMA PTE bits and associated helpers. As a side-effect
it increases the maximum possible swap space on x86-64.

One potential source of problems is races between the marking of PTEs
PROT_NONE, NUMA hinting faults and migration. It must be guaranteed that
a PTE being protected is not faulted in parallel, seen as pte_none and
allowed to corrupt memory. The base case is safe but transhuge has had
problems in the past due to a different migration mechanism and a dependence
on the page lock to serialise migrations, so it warrants a closer look.

task_work hinting update parallel fault
------------------------ --------------
change_pmd_range
change_huge_pmd
__pmd_trans_huge_lock
pmdp_get_and_clear
__handle_mm_fault
pmd_none
do_huge_pmd_anonymous_page
read? pmd_lock blocks until hinting complete, fail !pmd_none test
write? __do_huge_pmd_anonymous_page acquires pmd_lock, checks pmd_none
pmd_modify
set_pmd_at

task_work hinting update parallel migration
------------------------ ------------------
change_pmd_range
change_huge_pmd
__pmd_trans_huge_lock
pmdp_get_and_clear
__handle_mm_fault
do_huge_pmd_numa_page
migrate_misplaced_transhuge_page
pmd_lock waits for updates to complete, recheck pmd_same
pmd_modify
set_pmd_at

Both of those are safe and the case where a transhuge page is inserted
during a protection update is unchanged. The case where two processes try
migrating at the same time is unchanged by this series so should still be
ok. I could not find a case where we are accidentally depending on the
PTE not being cleared and flushed. If one is missed, it'll manifest as
corruption problems that start triggering shortly after this series is
merged and only happen when NUMA balancing is enabled.
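
A minimal sketch of the serialisation pattern both diagrams rely on
(illustrative only, not a hunk from this patch). pmd_lock() and pmd_same()
are the existing helpers already used by do_huge_pmd_numa_page() and
migrate_misplaced_transhuge_page(); the function name and return convention
are invented for the example.

/* Recheck a previously sampled pmd under the page table lock before using it */
static bool pmd_unchanged_under_lock_sketch(struct mm_struct *mm, pmd_t *pmdp,
					    pmd_t orig_pmd)
{
	/* Blocks until a parallel change_huge_pmd() releases the lock */
	spinlock_t *ptl = pmd_lock(mm, pmdp);
	bool same = pmd_same(orig_pmd, *pmdp);

	/*
	 * If the entry is unchanged the caller may modify it while ptl is
	 * held; if it changed underneath us the caller backs out and the
	 * fault is retried against the new entry.
	 */
	spin_unlock(ptl);
	return same;
}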

Signed-off-by: Mel Gorman <[email protected]>
---
arch/powerpc/include/asm/pgtable.h | 54 +-----------
arch/powerpc/include/asm/pte-common.h | 5 --
arch/powerpc/include/asm/pte-hash64.h | 6 --
arch/x86/include/asm/pgtable.h | 22 +----
arch/x86/include/asm/pgtable_64.h | 5 --
arch/x86/include/asm/pgtable_types.h | 41 +--------
include/asm-generic/pgtable.h | 155 ----------------------------------
include/linux/swapops.h | 2 +-
8 files changed, 7 insertions(+), 283 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
index 452c3b4..2e074e7 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -49,64 +49,12 @@ static inline int pmd_protnone_numa(pmd_t pmd)
{
return pte_protnone_numa(pmd_pte(pmd));
}
-
-static inline int pte_present(pte_t pte)
-{
- return pte_val(pte) & _PAGE_NUMA_MASK;
-}
-
-#define pte_present_nonuma pte_present_nonuma
-static inline int pte_present_nonuma(pte_t pte)
-{
- return pte_val(pte) & (_PAGE_PRESENT);
-}
-
-#define ptep_set_numa ptep_set_numa
-static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr,
- pte_t *ptep)
-{
- if ((pte_val(*ptep) & _PAGE_PRESENT) == 0)
- VM_BUG_ON(1);
-
- pte_update(mm, addr, ptep, _PAGE_PRESENT, _PAGE_NUMA, 0);
- return;
-}
-
-#define pmdp_set_numa pmdp_set_numa
-static inline void pmdp_set_numa(struct mm_struct *mm, unsigned long addr,
- pmd_t *pmdp)
-{
- if ((pmd_val(*pmdp) & _PAGE_PRESENT) == 0)
- VM_BUG_ON(1);
-
- pmd_hugepage_update(mm, addr, pmdp, _PAGE_PRESENT, _PAGE_NUMA);
- return;
-}
-
-/*
- * Generic NUMA pte helpers expect pteval_t and pmdval_t types to exist
- * which was inherited from x86. For the purposes of powerpc pte_basic_t and
- * pmd_t are equivalent
- */
-#define pteval_t pte_basic_t
-#define pmdval_t pmd_t
-static inline pteval_t ptenuma_flags(pte_t pte)
-{
- return pte_val(pte) & _PAGE_NUMA_MASK;
-}
-
-static inline pmdval_t pmdnuma_flags(pmd_t pmd)
-{
- return pmd_val(pmd) & _PAGE_NUMA_MASK;
-}
-
-# else
+#endif /* CONFIG_NUMA_BALANCING */

static inline int pte_present(pte_t pte)
{
return pte_val(pte) & _PAGE_PRESENT;
}
-#endif /* CONFIG_NUMA_BALANCING */

/* Conversion functions: convert a page and protection to a page entry,
* and a page entry and page directory to the page they refer to.
diff --git a/arch/powerpc/include/asm/pte-common.h b/arch/powerpc/include/asm/pte-common.h
index e040c35..8d1569c 100644
--- a/arch/powerpc/include/asm/pte-common.h
+++ b/arch/powerpc/include/asm/pte-common.h
@@ -98,11 +98,6 @@ extern unsigned long bad_call_to_PMD_PAGE_SIZE(void);
_PAGE_USER | _PAGE_ACCESSED | \
_PAGE_RW | _PAGE_HWWRITE | _PAGE_DIRTY | _PAGE_EXEC)

-#ifdef CONFIG_NUMA_BALANCING
-/* Mask of bits that distinguish present and numa ptes */
-#define _PAGE_NUMA_MASK (_PAGE_NUMA|_PAGE_PRESENT)
-#endif
-
/*
* We define 2 sets of base prot bits, one for basic pages (ie,
* cacheable kernel and user pages) and one for non cacheable
diff --git a/arch/powerpc/include/asm/pte-hash64.h b/arch/powerpc/include/asm/pte-hash64.h
index 2505d8e..55aea0c 100644
--- a/arch/powerpc/include/asm/pte-hash64.h
+++ b/arch/powerpc/include/asm/pte-hash64.h
@@ -27,12 +27,6 @@
#define _PAGE_RW 0x0200 /* software: user write access allowed */
#define _PAGE_BUSY 0x0800 /* software: PTE & hash are busy */

-/*
- * Used for tracking numa faults
- */
-#define _PAGE_NUMA 0x00000010 /* Gather numa placement stats */
-
-
/* No separate kernel read-only */
#define _PAGE_KERNEL_RW (_PAGE_RW | _PAGE_DIRTY) /* user access blocked by key */
#define _PAGE_KERNEL_RO _PAGE_KERNEL_RW
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 613cd00..f8799e0 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -299,7 +299,7 @@ static inline pmd_t pmd_mkwrite(pmd_t pmd)

static inline pmd_t pmd_mknotpresent(pmd_t pmd)
{
- return pmd_clear_flags(pmd, _PAGE_PRESENT);
+ return pmd_clear_flags(pmd, _PAGE_PRESENT | _PAGE_PROTNONE);
}

#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
@@ -457,13 +457,6 @@ static inline int pte_same(pte_t a, pte_t b)

static inline int pte_present(pte_t a)
{
- return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
- _PAGE_NUMA);
-}
-
-#define pte_present_nonuma pte_present_nonuma
-static inline int pte_present_nonuma(pte_t a)
-{
return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
}

@@ -473,7 +466,7 @@ static inline bool pte_accessible(struct mm_struct *mm, pte_t a)
if (pte_flags(a) & _PAGE_PRESENT)
return true;

- if ((pte_flags(a) & (_PAGE_PROTNONE | _PAGE_NUMA)) &&
+ if ((pte_flags(a) & _PAGE_PROTNONE) &&
mm_tlb_flush_pending(mm))
return true;

@@ -493,8 +486,7 @@ static inline int pmd_present(pmd_t pmd)
* the _PAGE_PSE flag will remain set at all times while the
* _PAGE_PRESENT bit is clear).
*/
- return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE |
- _PAGE_NUMA);
+ return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE);
}

#ifdef CONFIG_NUMA_BALANCING
@@ -569,11 +561,6 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)

static inline int pmd_bad(pmd_t pmd)
{
-#ifdef CONFIG_NUMA_BALANCING
- /* pmd_numa check */
- if ((pmd_flags(pmd) & (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA)
- return 0;
-#endif
return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE;
}

@@ -892,19 +879,16 @@ static inline void update_mmu_cache_pmd(struct vm_area_struct *vma,
#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
{
- VM_BUG_ON(pte_present_nonuma(pte));
return pte_set_flags(pte, _PAGE_SWP_SOFT_DIRTY);
}

static inline int pte_swp_soft_dirty(pte_t pte)
{
- VM_BUG_ON(pte_present_nonuma(pte));
return pte_flags(pte) & _PAGE_SWP_SOFT_DIRTY;
}

static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
{
- VM_BUG_ON(pte_present_nonuma(pte));
return pte_clear_flags(pte, _PAGE_SWP_SOFT_DIRTY);
}
#endif
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 4572b2f..06ffca8 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -146,12 +146,7 @@ static inline int pgd_large(pgd_t pgd) { return 0; }

/* Encode and de-code a swap entry */
#define SWP_TYPE_BITS (_PAGE_BIT_FILE - _PAGE_BIT_PRESENT - 1)
-#ifdef CONFIG_NUMA_BALANCING
-/* Automatic NUMA balancing needs to be distinguishable from swap entries */
-#define SWP_OFFSET_SHIFT (_PAGE_BIT_PROTNONE + 2)
-#else
#define SWP_OFFSET_SHIFT (_PAGE_BIT_PROTNONE + 1)
-#endif

#define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > SWP_TYPE_BITS)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 0778964..d299cdd 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -27,14 +27,6 @@
#define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */
#define _PAGE_BIT_NX 63 /* No execute: only valid after cpuid check */

-/*
- * Swap offsets on configurations that allow automatic NUMA balancing use the
- * bits after _PAGE_BIT_GLOBAL. To uniquely distinguish NUMA hinting PTEs from
- * swap entries, we use the first bit after _PAGE_BIT_GLOBAL and shrink the
- * maximum possible swap space from 16TB to 8TB.
- */
-#define _PAGE_BIT_NUMA (_PAGE_BIT_GLOBAL+1)
-
/* If _PAGE_BIT_PRESENT is clear, we use these: */
/* - if the user mapped it with PROT_NONE; pte_present gives true */
#define _PAGE_BIT_PROTNONE _PAGE_BIT_GLOBAL
@@ -78,21 +70,6 @@
#endif

/*
- * _PAGE_NUMA distinguishes between a numa hinting minor fault and a page
- * that is not present. The hinting fault gathers numa placement statistics
- * (see pte_numa()). The bit is always zero when the PTE is not present.
- *
- * The bit picked must be always zero when the pmd is present and not
- * present, so that we don't lose information when we set it while
- * atomically clearing the present bit.
- */
-#ifdef CONFIG_NUMA_BALANCING
-#define _PAGE_NUMA (_AT(pteval_t, 1) << _PAGE_BIT_NUMA)
-#else
-#define _PAGE_NUMA (_AT(pteval_t, 0))
-#endif
-
-/*
* Tracking soft dirty bit when a page goes to a swap is tricky.
* We need a bit which can be stored in pte _and_ not conflict
* with swap entry format. On x86 bits 6 and 7 are *not* involved
@@ -125,8 +102,8 @@
/* Set of bits not changed in pte_modify */
#define _PAGE_CHG_MASK (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \
_PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY | \
- _PAGE_SOFT_DIRTY | _PAGE_NUMA)
-#define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE | _PAGE_NUMA)
+ _PAGE_SOFT_DIRTY)
+#define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)

#define _PAGE_CACHE_MASK (_PAGE_PCD | _PAGE_PWT)
#define _PAGE_CACHE_WB (0)
@@ -324,20 +301,6 @@ static inline pteval_t pte_flags(pte_t pte)
return native_pte_val(pte) & PTE_FLAGS_MASK;
}

-#ifdef CONFIG_NUMA_BALANCING
-/* Set of bits that distinguishes present, prot_none and numa ptes */
-#define _PAGE_NUMA_MASK (_PAGE_NUMA|_PAGE_PROTNONE|_PAGE_PRESENT)
-static inline pteval_t ptenuma_flags(pte_t pte)
-{
- return pte_flags(pte) & _PAGE_NUMA_MASK;
-}
-
-static inline pmdval_t pmdnuma_flags(pmd_t pmd)
-{
- return pmd_flags(pmd) & _PAGE_NUMA_MASK;
-}
-#endif /* CONFIG_NUMA_BALANCING */
-
#define pgprot_val(x) ((x).pgprot)
#define __pgprot(x) ((pgprot_t) { (x) } )

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 7e74122..323e914 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -233,10 +233,6 @@ static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
# define pte_accessible(mm, pte) ((void)(pte), 1)
#endif

-#ifndef pte_present_nonuma
-#define pte_present_nonuma(pte) pte_present(pte)
-#endif
-
#ifndef flush_tlb_fix_spurious_fault
#define flush_tlb_fix_spurious_fault(vma, address) flush_tlb_page(vma, address)
#endif
@@ -696,157 +692,6 @@ static inline int pmd_protnone_numa(pmd_t pmd)
}
#endif /* CONFIG_NUMA_BALANCING */

-#ifdef CONFIG_NUMA_BALANCING
-/*
- * _PAGE_NUMA distinguishes between an unmapped page table entry, an entry that
- * is protected for PROT_NONE and a NUMA hinting fault entry. If the
- * architecture defines __PAGE_PROTNONE then it should take that into account
- * but those that do not can rely on the fact that the NUMA hinting scanner
- * skips inaccessible VMAs.
- *
- * pte/pmd_present() returns true if pte/pmd_numa returns true. Page
- * fault triggers on those regions if pte/pmd_numa returns true
- * (because _PAGE_PRESENT is not set).
- */
-#ifndef pte_numa
-static inline int pte_numa(pte_t pte)
-{
- return ptenuma_flags(pte) == _PAGE_NUMA;
-}
-#endif
-
-#ifndef pmd_numa
-static inline int pmd_numa(pmd_t pmd)
-{
- return pmdnuma_flags(pmd) == _PAGE_NUMA;
-}
-#endif
-
-/*
- * pte/pmd_mknuma sets the _PAGE_ACCESSED bitflag automatically
- * because they're called by the NUMA hinting minor page fault. If we
- * wouldn't set the _PAGE_ACCESSED bitflag here, the TLB miss handler
- * would be forced to set it later while filling the TLB after we
- * return to userland. That would trigger a second write to memory
- * that we optimize away by setting _PAGE_ACCESSED here.
- */
-#ifndef pte_mknonnuma
-static inline pte_t pte_mknonnuma(pte_t pte)
-{
- pteval_t val = pte_val(pte);
-
- val &= ~_PAGE_NUMA;
- val |= (_PAGE_PRESENT|_PAGE_ACCESSED);
- return __pte(val);
-}
-#endif
-
-#ifndef pmd_mknonnuma
-static inline pmd_t pmd_mknonnuma(pmd_t pmd)
-{
- pmdval_t val = pmd_val(pmd);
-
- val &= ~_PAGE_NUMA;
- val |= (_PAGE_PRESENT|_PAGE_ACCESSED);
-
- return __pmd(val);
-}
-#endif
-
-#ifndef pte_mknuma
-static inline pte_t pte_mknuma(pte_t pte)
-{
- pteval_t val = pte_val(pte);
-
- VM_BUG_ON(!(val & _PAGE_PRESENT));
-
- val &= ~_PAGE_PRESENT;
- val |= _PAGE_NUMA;
-
- return __pte(val);
-}
-#endif
-
-#ifndef ptep_set_numa
-static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr,
- pte_t *ptep)
-{
- pte_t ptent = *ptep;
-
- ptent = pte_mknuma(ptent);
- set_pte_at(mm, addr, ptep, ptent);
- return;
-}
-#endif
-
-#ifndef pmd_mknuma
-static inline pmd_t pmd_mknuma(pmd_t pmd)
-{
- pmdval_t val = pmd_val(pmd);
-
- val &= ~_PAGE_PRESENT;
- val |= _PAGE_NUMA;
-
- return __pmd(val);
-}
-#endif
-
-#ifndef pmdp_set_numa
-static inline void pmdp_set_numa(struct mm_struct *mm, unsigned long addr,
- pmd_t *pmdp)
-{
- pmd_t pmd = *pmdp;
-
- pmd = pmd_mknuma(pmd);
- set_pmd_at(mm, addr, pmdp, pmd);
- return;
-}
-#endif
-#else
-static inline int pmd_numa(pmd_t pmd)
-{
- return 0;
-}
-
-static inline int pte_numa(pte_t pte)
-{
- return 0;
-}
-
-static inline pte_t pte_mknonnuma(pte_t pte)
-{
- return pte;
-}
-
-static inline pmd_t pmd_mknonnuma(pmd_t pmd)
-{
- return pmd;
-}
-
-static inline pte_t pte_mknuma(pte_t pte)
-{
- return pte;
-}
-
-static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr,
- pte_t *ptep)
-{
- return;
-}
-
-
-static inline pmd_t pmd_mknuma(pmd_t pmd)
-{
- return pmd;
-}
-
-static inline void pmdp_set_numa(struct mm_struct *mm, unsigned long addr,
- pmd_t *pmdp)
-{
- return ;
-}
-#endif /* CONFIG_NUMA_BALANCING */
-
#endif /* CONFIG_MMU */

#endif /* !__ASSEMBLY__ */
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 6adfb7b..2b1fa56 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -54,7 +54,7 @@ static inline pgoff_t swp_offset(swp_entry_t entry)
/* check whether a pte points to a swap entry */
static inline int is_swap_pte(pte_t pte)
{
- return !pte_none(pte) && !pte_present_nonuma(pte) && !pte_file(pte);
+ return !pte_none(pte) && !pte_file(pte);
}
#endif

--
2.1.2

2014-11-20 10:25:52

by Mel Gorman

Subject: [PATCH 02/10] mm: Add p[te|md] protnone helpers for use by NUMA balancing

This is a preparatory patch that introduces protnone helpers for automatic
NUMA balancing.

Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Acked-by: Aneesh Kumar K.V <[email protected]>
---
arch/powerpc/include/asm/pgtable.h | 11 +++++++++++
arch/x86/include/asm/pgtable.h | 16 ++++++++++++++++
include/asm-generic/pgtable.h | 19 +++++++++++++++++++
3 files changed, 46 insertions(+)

diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
index 316f9a5..452c3b4 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -39,6 +39,17 @@ static inline int pte_none(pte_t pte) { return (pte_val(pte) & ~_PTE_NONE_MASK)
static inline pgprot_t pte_pgprot(pte_t pte) { return __pgprot(pte_val(pte) & PAGE_PROT_BITS); }

#ifdef CONFIG_NUMA_BALANCING
+static inline int pte_protnone_numa(pte_t pte)
+{
+ return (pte_val(pte) &
+ (_PAGE_PRESENT | _PAGE_USER)) == _PAGE_PRESENT;
+}
+
+static inline int pmd_protnone_numa(pmd_t pmd)
+{
+ return pte_protnone_numa(pmd_pte(pmd));
+}
+
static inline int pte_present(pte_t pte)
{
return pte_val(pte) & _PAGE_NUMA_MASK;
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index aa97a07..613cd00 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -497,6 +497,22 @@ static inline int pmd_present(pmd_t pmd)
_PAGE_NUMA);
}

+#ifdef CONFIG_NUMA_BALANCING
+/*
+ * These work without NUMA balancing but the kernel does not care. See the
+ * comment in include/asm-generic/pgtable.h
+ */
+static inline int pte_protnone_numa(pte_t pte)
+{
+ return pte_flags(pte) & _PAGE_PROTNONE;
+}
+
+static inline int pmd_protnone_numa(pmd_t pmd)
+{
+ return pmd_flags(pmd) & _PAGE_PROTNONE;
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
static inline int pmd_none(pmd_t pmd)
{
/* Only check low word on 32-bit platforms, since it might be
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 752e30d..7e74122 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -677,6 +677,25 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
#endif
}

+#ifndef CONFIG_NUMA_BALANCING
+/*
+ * Technically a PTE can be PROTNONE even when not doing NUMA balancing but
+ * the only case the kernel cares is for NUMA balancing. By default,
+ * implement the helper as "always no". Note that this does not check VMA
+ * protection bits meaning that it is up to the caller to distinguish between
+ * PROT_NONE protections and NUMA hinting fault protections.
+ */
+static inline int pte_protnone_numa(pte_t pte)
+{
+ return 0;
+}
+
+static inline int pmd_protnone_numa(pmd_t pmd)
+{
+ return 0;
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
#ifdef CONFIG_NUMA_BALANCING
/*
* _PAGE_NUMA distinguishes between an unmapped page table entry, an entry that
--
2.1.2

2014-11-20 10:39:46

by David Laight

Subject: RE: [PATCH 03/10] mm: Convert p[te|md]_numa users to p[te|md]_protnone_numa

From: Mel Gorman
> Convert existing users of pte_numa and friends to the new helper. Note
> that the kernel is broken after this patch is applied until the other
> page table modifiers are also altered. This patch layout is to make
> review easier.

Doesn't that break bisection?

David


2014-11-20 10:47:29

by Mel Gorman

Subject: Re: [PATCH 03/10] mm: Convert p[te|md]_numa users to p[te|md]_protnone_numa

On Thu, Nov 20, 2014 at 10:38:56AM +0000, David Laight wrote:
> From: Mel Gorman
> > Convert existing users of pte_numa and friends to the new helper. Note
> > that the kernel is broken after this patch is applied until the other
> > page table modifiers are also altered. This patch layout is to make
> > review easier.
>
> Doesn't that break bisection?
>

Yes, for automatic NUMA balancing at least. The patch structure is to
make the reviewers' job easier and, besides, bisecting within patches 2-6
is pointless. If desired, I can collapse patches 2-6 together for the
final submission.

--
Mel Gorman
SUSE Labs

2014-11-20 19:54:09

by Linus Torvalds

Subject: Re: [PATCH 02/10] mm: Add p[te|md] protnone helpers for use by NUMA balancing

On Thu, Nov 20, 2014 at 2:19 AM, Mel Gorman <[email protected]> wrote:
> This is a preparatory patch that introduces protnone helpers for automatic
> NUMA balancing.

Oh, I hadn't noticed that you had renamed these things. It was
probably already true in your V1 version.

I do *not* think that "pte_protnone_numa()" makes sense as a name. It
only confuses people to think that there is still/again something
NUMA-special about the PTE. The whole point of the protnone changes
was to make it really very very clear that from a hardware standpoint,
this is *exactly* about protnone, and nothing else.

The fact that we then use protnone PTE's for numa faults is a VM
internal issue, it should *not* show up in the architecture page table
helpers.

I'm not NAK'ing this name, but I really think it's a very important
part of the whole patch series - to stop the stupid confusion about
NUMA entries. As far as the page tables are concerned, this has
absolutely _zero_ to do with NUMA.

We made that mistake once. We're fixing it. Let the naming *show* that
it's fixed, and this is "pte_protnone()".

The places that use this for NUMA handling might have a comment or
something. But they'll be in the VM where this matters, not in the
architecture page table description files. The comment would be
something like "if the vma is accessible, but the PTE is marked
protnone, this is an autonuma entry".

Linus

2014-11-20 21:51:19

by Sasha Levin

Subject: Re: [PATCH 0/10] Replace _PAGE_NUMA with PAGE_NONE protections v2

On 11/20/2014 05:19 AM, Mel Gorman wrote:
> V1 failed while running under kvm-tools very quickly and a second report
> indicated that it happens on bare metal as well. This version survived
> an overnight run of trinity running under kvm-tools here but verification
> from Sasha would be appreciated.

Hi Mel,

I tried giving it a spin, but it won't apply at all on the latest -mm
tree:

$ git am -3 numa/*
Applying: mm: numa: Do not dereference pmd outside of the lock during NUMA hinting fault
Applying: mm: Add p[te|md] protnone helpers for use by NUMA balancing
Applying: mm: Convert p[te|md]_numa users to p[te|md]_protnone_numa
fatal: sha1 information is lacking or useless (mm/huge_memory.c).
Repository lacks necessary blobs to fall back on 3-way merge.
Cannot fall back to three-way merge.

Did I miss a prerequisite?


Thanks,
Sasha

2014-11-21 05:28:56

by Aneesh Kumar K.V

Subject: Re: [PATCH 03/10] mm: Convert p[te|md]_numa users to p[te|md]_protnone_numa

Mel Gorman <[email protected]> writes:

> Convert existing users of pte_numa and friends to the new helper. Note
> that the kernel is broken after this patch is applied until the other
> page table modifiers are also altered. This patch layout is to make
> review easier.
>

.....

> diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
> index c90e602..b5d58d3 100644
> --- a/arch/powerpc/mm/pgtable.c
> +++ b/arch/powerpc/mm/pgtable.c
> @@ -173,7 +173,13 @@ void set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
> pte_t pte)
> {
> #ifdef CONFIG_DEBUG_VM
> - WARN_ON(pte_val(*ptep) & _PAGE_PRESENT);
> + /*
> + * When handling numa faults, we already have the pte marked
> + * _PAGE_PRESENT, but we can be sure that it is not in hpte.
> + * Hence we can use set_pte_at for them.
> + */
> + WARN_ON((pte_val(*ptep) & (_PAGE_PRESENT | _PAGE_USER)) ==
> + (_PAGE_PRESENT | _PAGE_USER));
> #endif


This can be VM_WARN_ON with #ifdef removed.

> /* Note: mm->context.id might not yet have been assigned as
> * this context might not have been activated yet when this
> diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c

-aneesh

2014-11-21 09:35:14

by Mel Gorman

Subject: Re: [PATCH 02/10] mm: Add p[te|md] protnone helpers for use by NUMA balancing

On Thu, Nov 20, 2014 at 11:54:06AM -0800, Linus Torvalds wrote:
> On Thu, Nov 20, 2014 at 2:19 AM, Mel Gorman <[email protected]> wrote:
> > This is a preparatory patch that introduces protnone helpers for automatic
> > NUMA balancing.
>
> Oh, I hadn't noticed that you had renamed these things. It was
> probably already true in your V1 version.
>
> I do *not* think that "pte_protnone_numa()" makes sense as a name. It
> only confuses people to think that there is still/again something
> NUMA-special about the PTE. The whole point of the protnone changes
> was to make it really very very clear that from a hardware standpoint,
> this is *exactly* about protnone, and nothing else.
>
> The fact that we then use protnone PTE's for numa faults is a VM
> internal issue, it should *not* show up in the architecture page table
> helpers.
>
> I'm not NAK'ing this name, but I really think it's a very important
> part of the whole patch series - to stop the stupid confusion about
> NUMA entries. As far as the page tables are concerned, this has
> absolutely _zero_ to do with NUMA.
>
> We made that mistake once. We're fixing it. Let the naming *show* that
> it's fixed, and this is "pte_protnone()".
>
> The places that use this for NUMA handling might have a comment or
> something. But they'll be in the VM where this matters, not in the
> architecture page table description files. The comment would be
> something like "if the vma is accessible, but the PTE is marked
> protnone, this is a autonuma entry".
>

I feared that people would eventually make the mistake of thinking that
pte_protnone() would return true for PROT_NONE VMAs that do *not* have
the page table bit set. I'll use the old name as you suggest and expand
the comment. It'll be in v3.

--
Mel Gorman
SUSE Labs

2014-11-21 09:38:51

by Mel Gorman

Subject: Re: [PATCH 0/10] Replace _PAGE_NUMA with PAGE_NONE protections v2

On Thu, Nov 20, 2014 at 04:50:25PM -0500, Sasha Levin wrote:
> On 11/20/2014 05:19 AM, Mel Gorman wrote:
> > V1 failed while running under kvm-tools very quickly and a second report
> > indicated that it happens on bare metal as well. This version survived
> > an overnight run of trinity running under kvm-tools here but verification
> > from Sasha would be appreciated.
>
> Hi Mel,
>
> I tried giving it a spin, but it won't apply at all on the latest -mm
> tree:
>
> $ git am -3 numa/*
> Applying: mm: numa: Do not dereference pmd outside of the lock during NUMA hinting fault
> Applying: mm: Add p[te|md] protnone helpers for use by NUMA balancing
> Applying: mm: Convert p[te|md]_numa users to p[te|md]_protnone_numa
> fatal: sha1 information is lacking or useless (mm/huge_memory.c).
> Repository lacks necessary blobs to fall back on 3-way merge.
> Cannot fall back to three-way merge.
>
> Did I miss a prerequisite?
>

No. V2 was still against 3.18-rc4 as that was what I had vanilla kernel
test data for. V3 will be against latest mmotm.

--
Mel Gorman
SUSE Labs