2013-09-27 13:17:49

by Kirill A. Shutemov

Subject: [PATCHv4 00/10] split page table lock for PMD tables

Alex Thorlton noticed that some massively threaded workloads perform poorly
if THP is enabled. This patchset fixes that by introducing a split page
table lock for PMD tables. As of v4, hugetlbfs is converted to the new
locking as well.

This patchset is based on work by Naoya Horiguchi.

Please review and consider applying.

Changes:
v4:
- convert hugetlb to new locking;
v3:
- fix USE_SPLIT_PMD_PTLOCKS;
- fix warning in fs/proc/task_mmu.c;
v2:
- reuse CONFIG_SPLIT_PTLOCK_CPUS for PMD split lock;
- s/huge_pmd_lock/pmd_lock/g;
- assume pgtable_pmd_page_ctor() can fail;
- fix format line in task_mem() for VmPTE;

THP off, v3.12-rc2:
-------------------

Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):

1037072.835207 task-clock # 57.426 CPUs utilized ( +- 3.59% )
95,093 context-switches # 0.092 K/sec ( +- 3.93% )
140 cpu-migrations # 0.000 K/sec ( +- 5.28% )
10,000,550 page-faults # 0.010 M/sec ( +- 0.00% )
2,455,210,400,261 cycles # 2.367 GHz ( +- 3.62% ) [83.33%]
2,429,281,882,056 stalled-cycles-frontend # 98.94% frontend cycles idle ( +- 3.67% ) [83.33%]
1,975,960,019,659 stalled-cycles-backend # 80.48% backend cycles idle ( +- 3.88% ) [66.68%]
46,503,296,013 instructions # 0.02 insns per cycle
# 52.24 stalled cycles per insn ( +- 3.21% ) [83.34%]
9,278,997,542 branches # 8.947 M/sec ( +- 4.00% ) [83.34%]
89,881,640 branch-misses # 0.97% of all branches ( +- 1.17% ) [83.33%]

18.059261877 seconds time elapsed ( +- 2.65% )

THP on, v3.12-rc2:
------------------

Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):

3114745.395974 task-clock # 73.875 CPUs utilized ( +- 1.84% )
267,356 context-switches # 0.086 K/sec ( +- 1.84% )
99 cpu-migrations # 0.000 K/sec ( +- 1.40% )
58,313 page-faults # 0.019 K/sec ( +- 0.28% )
7,416,635,817,510 cycles # 2.381 GHz ( +- 1.83% ) [83.33%]
7,342,619,196,993 stalled-cycles-frontend # 99.00% frontend cycles idle ( +- 1.88% ) [83.33%]
6,267,671,641,967 stalled-cycles-backend # 84.51% backend cycles idle ( +- 2.03% ) [66.67%]
117,819,935,165 instructions # 0.02 insns per cycle
# 62.32 stalled cycles per insn ( +- 4.39% ) [83.34%]
28,899,314,777 branches # 9.278 M/sec ( +- 4.48% ) [83.34%]
71,787,032 branch-misses # 0.25% of all branches ( +- 1.03% ) [83.33%]

42.162306788 seconds time elapsed ( +- 1.73% )

HUGETLB, v3.12-rc2:
-------------------

Performance counter stats for './thp_memscale_hugetlbfs -c 80 -b 512M' (5 runs):

2588052.787264 task-clock # 54.400 CPUs utilized ( +- 3.69% )
246,831 context-switches # 0.095 K/sec ( +- 4.15% )
138 cpu-migrations # 0.000 K/sec ( +- 5.30% )
21,027 page-faults # 0.008 K/sec ( +- 0.01% )
6,166,666,307,263 cycles # 2.383 GHz ( +- 3.68% ) [83.33%]
6,086,008,929,407 stalled-cycles-frontend # 98.69% frontend cycles idle ( +- 3.77% ) [83.33%]
5,087,874,435,481 stalled-cycles-backend # 82.51% backend cycles idle ( +- 4.41% ) [66.67%]
133,782,831,249 instructions # 0.02 insns per cycle
# 45.49 stalled cycles per insn ( +- 4.30% ) [83.34%]
34,026,870,541 branches # 13.148 M/sec ( +- 4.24% ) [83.34%]
68,670,942 branch-misses # 0.20% of all branches ( +- 3.26% ) [83.33%]

47.574936948 seconds time elapsed ( +- 2.09% )

THP off, patched:
-----------------

Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):

943301.957892 task-clock # 56.256 CPUs utilized ( +- 3.01% )
86,218 context-switches # 0.091 K/sec ( +- 3.17% )
121 cpu-migrations # 0.000 K/sec ( +- 6.64% )
10,000,551 page-faults # 0.011 M/sec ( +- 0.00% )
2,230,462,457,654 cycles # 2.365 GHz ( +- 3.04% ) [83.32%]
2,204,616,385,805 stalled-cycles-frontend # 98.84% frontend cycles idle ( +- 3.09% ) [83.32%]
1,778,640,046,926 stalled-cycles-backend # 79.74% backend cycles idle ( +- 3.47% ) [66.69%]
45,995,472,617 instructions # 0.02 insns per cycle
# 47.93 stalled cycles per insn ( +- 2.51% ) [83.34%]
9,179,700,174 branches # 9.731 M/sec ( +- 3.04% ) [83.35%]
89,166,529 branch-misses # 0.97% of all branches ( +- 1.45% ) [83.33%]

16.768027318 seconds time elapsed ( +- 2.47% )

THP on, patched:
----------------

Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):

458793.837905 task-clock # 54.632 CPUs utilized ( +- 0.79% )
41,831 context-switches # 0.091 K/sec ( +- 0.97% )
98 cpu-migrations # 0.000 K/sec ( +- 1.66% )
57,829 page-faults # 0.126 K/sec ( +- 0.62% )
1,077,543,336,716 cycles # 2.349 GHz ( +- 0.81% ) [83.33%]
1,067,403,802,964 stalled-cycles-frontend # 99.06% frontend cycles idle ( +- 0.87% ) [83.33%]
864,764,616,143 stalled-cycles-backend # 80.25% backend cycles idle ( +- 0.73% ) [66.68%]
16,129,177,440 instructions # 0.01 insns per cycle
# 66.18 stalled cycles per insn ( +- 7.94% ) [83.35%]
3,618,938,569 branches # 7.888 M/sec ( +- 8.46% ) [83.36%]
33,242,032 branch-misses # 0.92% of all branches ( +- 2.02% ) [83.32%]

8.397885779 seconds time elapsed ( +- 0.18% )

HUGETLB, patched:
-----------------

Performance counter stats for './thp_memscale_hugetlbfs -c 80 -b 512M' (5 runs):

395353.076837 task-clock # 20.329 CPUs utilized ( +- 8.16% )
55,730 context-switches # 0.141 K/sec ( +- 5.31% )
138 cpu-migrations # 0.000 K/sec ( +- 4.24% )
21,027 page-faults # 0.053 K/sec ( +- 0.00% )
930,219,717,244 cycles # 2.353 GHz ( +- 8.21% ) [83.32%]
914,295,694,103 stalled-cycles-frontend # 98.29% frontend cycles idle ( +- 8.35% ) [83.33%]
704,137,950,187 stalled-cycles-backend # 75.70% backend cycles idle ( +- 9.16% ) [66.69%]
30,541,538,385 instructions # 0.03 insns per cycle
# 29.94 stalled cycles per insn ( +- 3.98% ) [83.35%]
8,415,376,631 branches # 21.286 M/sec ( +- 3.61% ) [83.36%]
32,645,478 branch-misses # 0.39% of all branches ( +- 3.41% ) [83.32%]

19.447481153 seconds time elapsed ( +- 2.00% )

Kirill A. Shutemov (10):
mm: rename USE_SPLIT_PTLOCKS to USE_SPLIT_PTE_PTLOCKS
mm: convert mm->nr_ptes to atomic_t
mm: introduce api for split page table lock for PMD level
mm, thp: change pmd_trans_huge_lock() to return taken lock
mm, thp: move ptl taking inside page_check_address_pmd()
mm, thp: do not access mm->pmd_huge_pte directly
mm, hugetlb: convert hugetlbfs to use split pmd lock
mm: convert the rest to new page table lock api
mm: implement split page table lock for PMD level
x86, mm: enable split page table lock for PMD level

arch/arm/mm/fault-armv.c | 6 +-
arch/s390/mm/pgtable.c | 12 +--
arch/sparc/mm/tlb.c | 12 +--
arch/x86/Kconfig | 4 +
arch/x86/include/asm/pgalloc.h | 11 ++-
arch/x86/xen/mmu.c | 6 +-
fs/proc/meminfo.c | 2 +-
fs/proc/task_mmu.c | 16 ++--
include/linux/huge_mm.h | 17 ++--
include/linux/hugetlb.h | 25 +++++
include/linux/mm.h | 52 ++++++++++-
include/linux/mm_types.h | 18 ++--
include/linux/swapops.h | 7 +-
kernel/fork.c | 6 +-
mm/Kconfig | 3 +
mm/huge_memory.c | 201 ++++++++++++++++++++++++-----------------
mm/hugetlb.c | 108 +++++++++++++---------
mm/memcontrol.c | 10 +-
mm/memory.c | 21 +++--
mm/mempolicy.c | 5 +-
mm/migrate.c | 14 +--
mm/mmap.c | 3 +-
mm/mprotect.c | 4 +-
mm/oom_kill.c | 6 +-
mm/pgtable-generic.c | 16 ++--
mm/rmap.c | 15 ++-
26 files changed, 379 insertions(+), 221 deletions(-)

--
1.8.4.rc3


2013-09-27 13:16:43

by Kirill A. Shutemov

Subject: [PATCHv4 01/10] mm: rename USE_SPLIT_PTLOCKS to USE_SPLIT_PTE_PTLOCKS

We're going to introduce a split page table lock for the PMD level.
Let's rename the existing split ptlock for the PTE level to avoid confusion.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Alex Thorlton <[email protected]>
---
arch/arm/mm/fault-armv.c | 6 +++---
arch/x86/xen/mmu.c | 6 +++---
include/linux/mm.h | 6 +++---
include/linux/mm_types.h | 8 ++++----
4 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/arm/mm/fault-armv.c b/arch/arm/mm/fault-armv.c
index 2a5907b5c8..ff379ac115 100644
--- a/arch/arm/mm/fault-armv.c
+++ b/arch/arm/mm/fault-armv.c
@@ -65,7 +65,7 @@ static int do_adjust_pte(struct vm_area_struct *vma, unsigned long address,
return ret;
}

-#if USE_SPLIT_PTLOCKS
+#if USE_SPLIT_PTE_PTLOCKS
/*
* If we are using split PTE locks, then we need to take the page
* lock here. Otherwise we are using shared mm->page_table_lock
@@ -84,10 +84,10 @@ static inline void do_pte_unlock(spinlock_t *ptl)
{
spin_unlock(ptl);
}
-#else /* !USE_SPLIT_PTLOCKS */
+#else /* !USE_SPLIT_PTE_PTLOCKS */
static inline void do_pte_lock(spinlock_t *ptl) {}
static inline void do_pte_unlock(spinlock_t *ptl) {}
-#endif /* USE_SPLIT_PTLOCKS */
+#endif /* USE_SPLIT_PTE_PTLOCKS */

static int adjust_pte(struct vm_area_struct *vma, unsigned long address,
unsigned long pfn)
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index fdc3ba28ca..455c873ce0 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -796,7 +796,7 @@ static spinlock_t *xen_pte_lock(struct page *page, struct mm_struct *mm)
{
spinlock_t *ptl = NULL;

-#if USE_SPLIT_PTLOCKS
+#if USE_SPLIT_PTE_PTLOCKS
ptl = __pte_lockptr(page);
spin_lock_nest_lock(ptl, &mm->page_table_lock);
#endif
@@ -1637,7 +1637,7 @@ static inline void xen_alloc_ptpage(struct mm_struct *mm, unsigned long pfn,

__set_pfn_prot(pfn, PAGE_KERNEL_RO);

- if (level == PT_PTE && USE_SPLIT_PTLOCKS)
+ if (level == PT_PTE && USE_SPLIT_PTE_PTLOCKS)
__pin_pagetable_pfn(MMUEXT_PIN_L1_TABLE, pfn);

xen_mc_issue(PARAVIRT_LAZY_MMU);
@@ -1671,7 +1671,7 @@ static inline void xen_release_ptpage(unsigned long pfn, unsigned level)
if (!PageHighMem(page)) {
xen_mc_batch();

- if (level == PT_PTE && USE_SPLIT_PTLOCKS)
+ if (level == PT_PTE && USE_SPLIT_PTE_PTLOCKS)
__pin_pagetable_pfn(MMUEXT_UNPIN_TABLE, pfn);

__set_pfn_prot(pfn, PAGE_KERNEL);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8b6e55ee88..6cf8ddb45b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1232,7 +1232,7 @@ static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long a
}
#endif /* CONFIG_MMU && !__ARCH_HAS_4LEVEL_HACK */

-#if USE_SPLIT_PTLOCKS
+#if USE_SPLIT_PTE_PTLOCKS
/*
* We tuck a spinlock to guard each pagetable page into its struct page,
* at page->private, with BUILD_BUG_ON to make sure that this will not
@@ -1245,14 +1245,14 @@ static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long a
} while (0)
#define pte_lock_deinit(page) ((page)->mapping = NULL)
#define pte_lockptr(mm, pmd) ({(void)(mm); __pte_lockptr(pmd_page(*(pmd)));})
-#else /* !USE_SPLIT_PTLOCKS */
+#else /* !USE_SPLIT_PTE_PTLOCKS */
/*
* We use mm->page_table_lock to guard all pagetable pages of the mm.
*/
#define pte_lock_init(page) do {} while (0)
#define pte_lock_deinit(page) do {} while (0)
#define pte_lockptr(mm, pmd) ({(void)(pmd); &(mm)->page_table_lock;})
-#endif /* USE_SPLIT_PTLOCKS */
+#endif /* USE_SPLIT_PTE_PTLOCKS */

static inline void pgtable_page_ctor(struct page *page)
{
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d9851eeb6e..84e0c56e1e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -23,7 +23,7 @@

struct address_space;

-#define USE_SPLIT_PTLOCKS (NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
+#define USE_SPLIT_PTE_PTLOCKS (NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)

/*
* Each physical page in the system has a struct page associated with
@@ -141,7 +141,7 @@ struct page {
* indicates order in the buddy
* system if PG_buddy is set.
*/
-#if USE_SPLIT_PTLOCKS
+#if USE_SPLIT_PTE_PTLOCKS
spinlock_t ptl;
#endif
struct kmem_cache *slab_cache; /* SL[AU]B: Pointer to slab */
@@ -309,14 +309,14 @@ enum {
NR_MM_COUNTERS
};

-#if USE_SPLIT_PTLOCKS && defined(CONFIG_MMU)
+#if USE_SPLIT_PTE_PTLOCKS && defined(CONFIG_MMU)
#define SPLIT_RSS_COUNTING
/* per-thread cached information, */
struct task_rss_stat {
int events; /* for synchronization threshold */
int count[NR_MM_COUNTERS];
};
-#endif /* USE_SPLIT_PTLOCKS */
+#endif /* USE_SPLIT_PTE_PTLOCKS */

struct mm_rss_stat {
atomic_long_t count[NR_MM_COUNTERS];
--
1.8.4.rc3

2013-09-27 13:16:54

by Kirill A. Shutemov

Subject: [PATCHv4 09/10] mm: implement split page table lock for PMD level

The basic idea is the same as at the PTE level: the lock is embedded into
the struct page of the table's page.

We can't use mm->pmd_huge_pte to store pgtables for THP, since we no longer
take mm->page_table_lock. Let's reuse page->lru of the table's page for
that instead.

pgtable_pmd_page_ctor() returns true if initialization is successful and
false otherwise. The current implementation never fails, but assuming that
the constructor can fail will help port it to -rt, where spinlock_t is
rather large and cannot be embedded into struct page -- dynamic allocation
would be required.
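
To illustrate the intended contract (a sketch only, not part of the patch;
the function name and GFP flags are illustrative), an architecture that
enables the split PMD lock allocates a PMD page roughly like this, calling
the constructor and handling its failure, and calling
pgtable_pmd_page_dtor() before freeing the page:

	static pmd_t *example_pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
	{
		struct page *page = alloc_pages(GFP_KERNEL | __GFP_ZERO, 0);

		if (!page)
			return NULL;
		if (!pgtable_pmd_page_ctor(page)) {
			/* may fail once -rt allocates the spinlock dynamically */
			__free_pages(page, 0);
			return NULL;
		}
		return (pmd_t *)page_address(page);
	}

The x86 conversion in patch 10 follows this pattern.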

Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Alex Thorlton <[email protected]>
---
include/linux/mm.h | 32 ++++++++++++++++++++++++++++++++
include/linux/mm_types.h | 8 +++++++-
kernel/fork.c | 4 ++--
mm/Kconfig | 3 +++
4 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b96aac9622..75735f6171 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1294,13 +1294,45 @@ static inline void pgtable_page_dtor(struct page *page)
((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd, address))? \
NULL: pte_offset_kernel(pmd, address))

+#if USE_SPLIT_PMD_PTLOCKS
+
+static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
+{
+ return &virt_to_page(pmd)->ptl;
+}
+
+static inline bool pgtable_pmd_page_ctor(struct page *page)
+{
+ spin_lock_init(&page->ptl);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ page->pmd_huge_pte = NULL;
+#endif
+ return true;
+}
+
+static inline void pgtable_pmd_page_dtor(struct page *page)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ VM_BUG_ON(page->pmd_huge_pte);
+#endif
+}
+
+#define pmd_huge_pte(mm, pmd) (virt_to_page(pmd)->pmd_huge_pte)
+
+#else
+
static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
{
return &mm->page_table_lock;
}

+static inline bool pgtable_pmd_page_ctor(struct page *page) { return true; }
+static inline void pgtable_pmd_page_dtor(struct page *page) {}
+
#define pmd_huge_pte(mm, pmd) ((mm)->pmd_huge_pte)

+#endif
+
static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
{
spinlock_t *ptl = pmd_lockptr(mm, pmd);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 99f19e850d..498a75b96f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -24,6 +24,9 @@
struct address_space;

#define USE_SPLIT_PTE_PTLOCKS (NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
+/* hugetlb hasn't converted to split locking yet */
+#define USE_SPLIT_PMD_PTLOCKS (USE_SPLIT_PTE_PTLOCKS && \
+ IS_ENABLED(CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK))

/*
* Each physical page in the system has a struct page associated with
@@ -130,6 +133,9 @@ struct page {

struct list_head list; /* slobs list of pages */
struct slab *slab_page; /* slab fields */
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && USE_SPLIT_PMD_PTLOCKS
+ pgtable_t pmd_huge_pte; /* protected by page->ptl */
+#endif
};

/* Remainder is not double word aligned */
@@ -406,7 +412,7 @@ struct mm_struct {
#ifdef CONFIG_MMU_NOTIFIER
struct mmu_notifier_mm *mmu_notifier_mm;
#endif
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
pgtable_t pmd_huge_pte; /* protected by page_table_lock */
#endif
#ifdef CONFIG_CPUMASK_OFFSTACK
diff --git a/kernel/fork.c b/kernel/fork.c
index afe70530db..943051284f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -560,7 +560,7 @@ static void check_mm(struct mm_struct *mm)
"mm:%p idx:%d val:%ld\n", mm, i, x);
}

-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
VM_BUG_ON(mm->pmd_huge_pte);
#endif
}
@@ -814,7 +814,7 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
memcpy(mm, oldmm, sizeof(*mm));
mm_init_cpumask(mm);

-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
mm->pmd_huge_pte = NULL;
#endif
#ifdef CONFIG_NUMA_BALANCING
diff --git a/mm/Kconfig b/mm/Kconfig
index 026771a9b0..89d56e31f9 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -214,6 +214,9 @@ config SPLIT_PTLOCK_CPUS
default "999999" if DEBUG_SPINLOCK || DEBUG_LOCK_ALLOC
default "4"

+config ARCH_ENABLE_SPLIT_PMD_PTLOCK
+ boolean
+
#
# support for memory balloon compaction
config BALLOON_COMPACTION
--
1.8.4.rc3

2013-09-27 13:16:57

by Kirill A. Shutemov

Subject: [PATCHv4 05/10] mm, thp: move ptl taking inside page_check_address_pmd()

With a split page table lock we can't know which lock we need to take
before we have found the relevant pmd.

Let's move taking the lock inside the function.
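
For illustration, the new calling convention looks like this (a sketch
mirroring the __split_huge_page_splitting() conversion below); on success
the relevant page table lock is already held and must be dropped by the
caller:

	spinlock_t *ptl;
	pmd_t *pmd;

	pmd = page_check_address_pmd(page, mm, address,
			PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
	if (pmd) {
		/* @page is mapped at @address; ptl is held here */
		spin_unlock(ptl);
	}
	/* on failure, NULL is returned and nothing is locked */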

Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Alex Thorlton <[email protected]>
---
include/linux/huge_mm.h | 3 ++-
mm/huge_memory.c | 43 +++++++++++++++++++++++++++----------------
mm/rmap.c | 13 +++++--------
3 files changed, 34 insertions(+), 25 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4aca0d8da1..91672e2dee 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -54,7 +54,8 @@ enum page_check_address_pmd_flag {
extern pmd_t *page_check_address_pmd(struct page *page,
struct mm_struct *mm,
unsigned long address,
- enum page_check_address_pmd_flag flag);
+ enum page_check_address_pmd_flag flag,
+ spinlock_t **ptl);

#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 59a1340f35..3a1f5c10b4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1500,23 +1500,33 @@ int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
return 0;
}

+/*
+ * This function checks whether a given @page is mapped onto the @address
+ * in the virtual space of @mm.
+ *
+ * If it is, the function returns the pmd and holds the page table lock,
+ * passing the lock back to the caller via @ptl.
+ * If it is not, it returns NULL without taking the page table lock.
+ */
pmd_t *page_check_address_pmd(struct page *page,
struct mm_struct *mm,
unsigned long address,
- enum page_check_address_pmd_flag flag)
+ enum page_check_address_pmd_flag flag,
+ spinlock_t **ptl)
{
- pmd_t *pmd, *ret = NULL;
+ pmd_t *pmd;

if (address & ~HPAGE_PMD_MASK)
- goto out;
+ return NULL;

pmd = mm_find_pmd(mm, address);
if (!pmd)
- goto out;
+ return NULL;
+ *ptl = pmd_lock(mm, pmd);
if (pmd_none(*pmd))
- goto out;
+ goto unlock;
if (pmd_page(*pmd) != page)
- goto out;
+ goto unlock;
/*
* split_vma() may create temporary aliased mappings. There is
* no risk as long as all huge pmd are found and have their
@@ -1526,14 +1536,15 @@ pmd_t *page_check_address_pmd(struct page *page,
*/
if (flag == PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG &&
pmd_trans_splitting(*pmd))
- goto out;
+ goto unlock;
if (pmd_trans_huge(*pmd)) {
VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG &&
!pmd_trans_splitting(*pmd));
- ret = pmd;
+ return pmd;
}
-out:
- return ret;
+unlock:
+ spin_unlock(*ptl);
+ return NULL;
}

static int __split_huge_page_splitting(struct page *page,
@@ -1541,6 +1552,7 @@ static int __split_huge_page_splitting(struct page *page,
unsigned long address)
{
struct mm_struct *mm = vma->vm_mm;
+ spinlock_t *ptl;
pmd_t *pmd;
int ret = 0;
/* For mmu_notifiers */
@@ -1548,9 +1560,8 @@ static int __split_huge_page_splitting(struct page *page,
const unsigned long mmun_end = address + HPAGE_PMD_SIZE;

mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
- spin_lock(&mm->page_table_lock);
pmd = page_check_address_pmd(page, mm, address,
- PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG);
+ PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
if (pmd) {
/*
* We can't temporarily set the pmd to null in order
@@ -1561,8 +1572,8 @@ static int __split_huge_page_splitting(struct page *page,
*/
pmdp_splitting_flush(vma, address, pmd);
ret = 1;
+ spin_unlock(ptl);
}
- spin_unlock(&mm->page_table_lock);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);

return ret;
@@ -1693,14 +1704,14 @@ static int __split_huge_page_map(struct page *page,
unsigned long address)
{
struct mm_struct *mm = vma->vm_mm;
+ spinlock_t *ptl;
pmd_t *pmd, _pmd;
int ret = 0, i;
pgtable_t pgtable;
unsigned long haddr;

- spin_lock(&mm->page_table_lock);
pmd = page_check_address_pmd(page, mm, address,
- PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
+ PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG, &ptl);
if (pmd) {
pgtable = pgtable_trans_huge_withdraw(mm, pmd);
pmd_populate(mm, &_pmd, pgtable);
@@ -1755,8 +1766,8 @@ static int __split_huge_page_map(struct page *page,
pmdp_invalidate(vma, address, pmd);
pmd_populate(mm, pmd, pgtable);
ret = 1;
+ spin_unlock(ptl);
}
- spin_unlock(&mm->page_table_lock);

return ret;
}
diff --git a/mm/rmap.c b/mm/rmap.c
index fd3ee7a54a..b59d741dcf 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -665,25 +665,23 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
unsigned long *vm_flags)
{
struct mm_struct *mm = vma->vm_mm;
+ spinlock_t *ptl;
int referenced = 0;

if (unlikely(PageTransHuge(page))) {
pmd_t *pmd;

- spin_lock(&mm->page_table_lock);
/*
* rmap might return false positives; we must filter
* these out using page_check_address_pmd().
*/
pmd = page_check_address_pmd(page, mm, address,
- PAGE_CHECK_ADDRESS_PMD_FLAG);
- if (!pmd) {
- spin_unlock(&mm->page_table_lock);
+ PAGE_CHECK_ADDRESS_PMD_FLAG, &ptl);
+ if (!pmd)
goto out;
- }

if (vma->vm_flags & VM_LOCKED) {
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
*mapcount = 0; /* break early from loop */
*vm_flags |= VM_LOCKED;
goto out;
@@ -692,10 +690,9 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
/* go ahead even if the pmd is pmd_trans_splitting() */
if (pmdp_clear_flush_young_notify(vma, address, pmd))
referenced++;
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
} else {
pte_t *pte;
- spinlock_t *ptl;

/*
* rmap might return false positives; we must filter
--
1.8.4.rc3

2013-09-27 13:17:00

by Kirill A. Shutemov

Subject: [PATCHv4 08/10] mm: convert the rest to new page table lock api

Only trivial cases are left. Let's convert them all at once.
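
The conversion pattern is the same everywhere (sketch): take the lock via
pmd_lock() instead of mm->page_table_lock and release whatever it returned:

	-	spin_lock(&mm->page_table_lock);
	+	ptl = pmd_lock(mm, pmd);
		/* ... operate on *pmd ... */
	-	spin_unlock(&mm->page_table_lock);
	+	spin_unlock(ptl);

with a local "spinlock_t *ptl" declared where needed.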

Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Alex Thorlton <[email protected]>
---
mm/huge_memory.c | 108 ++++++++++++++++++++++++++++-----------------------
mm/memory.c | 17 ++++----
mm/migrate.c | 7 ++--
mm/mprotect.c | 4 +-
mm/pgtable-generic.c | 4 +-
5 files changed, 77 insertions(+), 63 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3a1f5c10b4..0d85512ad4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -709,6 +709,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
struct page *page)
{
pgtable_t pgtable;
+ spinlock_t *ptl;

VM_BUG_ON(!PageCompound(page));
pgtable = pte_alloc_one(mm, haddr);
@@ -723,9 +724,9 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
*/
__SetPageUptodate(page);

- spin_lock(&mm->page_table_lock);
+ ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_none(*pmd))) {
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
mem_cgroup_uncharge_page(page);
put_page(page);
pte_free(mm, pgtable);
@@ -738,7 +739,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
set_pmd_at(mm, haddr, pmd, entry);
add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
atomic_inc(&mm->nr_ptes);
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
}

return 0;
@@ -766,6 +767,7 @@ static inline struct page *alloc_hugepage(int defrag)
}
#endif

+/* Caller must hold page table lock. */
static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd,
struct page *zero_page)
@@ -797,6 +799,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_OOM;
if (!(flags & FAULT_FLAG_WRITE) &&
transparent_hugepage_use_zero_page()) {
+ spinlock_t *ptl;
pgtable_t pgtable;
struct page *zero_page;
bool set;
@@ -809,10 +812,10 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
count_vm_event(THP_FAULT_FALLBACK);
return VM_FAULT_FALLBACK;
}
- spin_lock(&mm->page_table_lock);
+ ptl = pmd_lock(mm, pmd);
set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
zero_page);
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
if (!set) {
pte_free(mm, pgtable);
put_huge_zero_page();
@@ -845,6 +848,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
struct vm_area_struct *vma)
{
+ spinlock_t *dst_ptl, *src_ptl;
struct page *src_page;
pmd_t pmd;
pgtable_t pgtable;
@@ -855,8 +859,9 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
if (unlikely(!pgtable))
goto out;

- spin_lock(&dst_mm->page_table_lock);
- spin_lock_nested(&src_mm->page_table_lock, SINGLE_DEPTH_NESTING);
+ dst_ptl = pmd_lock(dst_mm, dst_pmd);
+ src_ptl = pmd_lockptr(src_mm, src_pmd);
+ spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);

ret = -EAGAIN;
pmd = *src_pmd;
@@ -865,7 +870,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
goto out_unlock;
}
/*
- * mm->page_table_lock is enough to be sure that huge zero pmd is not
+ * When page table lock is held, the huge zero pmd should not be
* under splitting since we don't split the page itself, only pmd to
* a page table.
*/
@@ -886,8 +891,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
}
if (unlikely(pmd_trans_splitting(pmd))) {
/* split huge page running from under us */
- spin_unlock(&src_mm->page_table_lock);
- spin_unlock(&dst_mm->page_table_lock);
+ spin_unlock(src_ptl);
+ spin_unlock(dst_ptl);
pte_free(dst_mm, pgtable);

wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
@@ -907,8 +912,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,

ret = 0;
out_unlock:
- spin_unlock(&src_mm->page_table_lock);
- spin_unlock(&dst_mm->page_table_lock);
+ spin_unlock(src_ptl);
+ spin_unlock(dst_ptl);
out:
return ret;
}
@@ -919,10 +924,11 @@ void huge_pmd_set_accessed(struct mm_struct *mm,
pmd_t *pmd, pmd_t orig_pmd,
int dirty)
{
+ spinlock_t *ptl;
pmd_t entry;
unsigned long haddr;

- spin_lock(&mm->page_table_lock);
+ ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_same(*pmd, orig_pmd)))
goto unlock;

@@ -932,13 +938,14 @@ void huge_pmd_set_accessed(struct mm_struct *mm,
update_mmu_cache_pmd(vma, address, pmd);

unlock:
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
}

static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
struct vm_area_struct *vma, unsigned long address,
pmd_t *pmd, pmd_t orig_pmd, unsigned long haddr)
{
+ spinlock_t *ptl;
pgtable_t pgtable;
pmd_t _pmd;
struct page *page;
@@ -965,7 +972,7 @@ static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
mmun_end = haddr + HPAGE_PMD_SIZE;
mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);

- spin_lock(&mm->page_table_lock);
+ ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_same(*pmd, orig_pmd)))
goto out_free_page;

@@ -992,7 +999,7 @@ static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
}
smp_wmb(); /* make pte visible before pmd */
pmd_populate(mm, pmd, pgtable);
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
put_huge_zero_page();
inc_mm_counter(mm, MM_ANONPAGES);

@@ -1002,7 +1009,7 @@ static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
out:
return ret;
out_free_page:
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
mem_cgroup_uncharge_page(page);
put_page(page);
@@ -1016,6 +1023,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
struct page *page,
unsigned long haddr)
{
+ spinlock_t *ptl;
pgtable_t pgtable;
pmd_t _pmd;
int ret = 0, i;
@@ -1062,7 +1070,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
mmun_end = haddr + HPAGE_PMD_SIZE;
mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);

- spin_lock(&mm->page_table_lock);
+ ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_same(*pmd, orig_pmd)))
goto out_free_pages;
VM_BUG_ON(!PageHead(page));
@@ -1088,7 +1096,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
smp_wmb(); /* make pte visible before pmd */
pmd_populate(mm, pmd, pgtable);
page_remove_rmap(page);
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);

mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);

@@ -1099,7 +1107,7 @@ out:
return ret;

out_free_pages:
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
mem_cgroup_uncharge_start();
for (i = 0; i < HPAGE_PMD_NR; i++) {
@@ -1114,17 +1122,19 @@ out_free_pages:
int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
{
+ spinlock_t *ptl;
int ret = 0;
struct page *page = NULL, *new_page;
unsigned long haddr;
unsigned long mmun_start; /* For mmu_notifiers */
unsigned long mmun_end; /* For mmu_notifiers */

+ ptl = pmd_lockptr(mm, pmd);
VM_BUG_ON(!vma->anon_vma);
haddr = address & HPAGE_PMD_MASK;
if (is_huge_zero_pmd(orig_pmd))
goto alloc;
- spin_lock(&mm->page_table_lock);
+ spin_lock(ptl);
if (unlikely(!pmd_same(*pmd, orig_pmd)))
goto out_unlock;

@@ -1140,7 +1150,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
goto out_unlock;
}
get_page(page);
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
alloc:
if (transparent_hugepage_enabled(vma) &&
!transparent_hugepage_debug_cow())
@@ -1187,11 +1197,11 @@ alloc:
mmun_end = haddr + HPAGE_PMD_SIZE;
mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);

- spin_lock(&mm->page_table_lock);
+ spin_lock(ptl);
if (page)
put_page(page);
if (unlikely(!pmd_same(*pmd, orig_pmd))) {
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
mem_cgroup_uncharge_page(new_page);
put_page(new_page);
goto out_mn;
@@ -1213,13 +1223,13 @@ alloc:
}
ret |= VM_FAULT_WRITE;
}
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
out_mn:
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
out:
return ret;
out_unlock:
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
return ret;
}

@@ -1231,7 +1241,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
struct mm_struct *mm = vma->vm_mm;
struct page *page = NULL;

- assert_spin_locked(&mm->page_table_lock);
+ assert_spin_locked(pmd_lockptr(mm, pmd));

if (flags & FOLL_WRITE && !pmd_write(*pmd))
goto out;
@@ -1278,13 +1288,14 @@ out:
int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, pmd_t pmd, pmd_t *pmdp)
{
+ spinlock_t *ptl;
struct page *page;
unsigned long haddr = addr & HPAGE_PMD_MASK;
int target_nid;
int current_nid = -1;
bool migrated;

- spin_lock(&mm->page_table_lock);
+ ptl = pmd_lock(mm, pmdp);
if (unlikely(!pmd_same(pmd, *pmdp)))
goto out_unlock;

@@ -1302,17 +1313,17 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
}

/* Acquire the page lock to serialise THP migrations */
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
lock_page(page);

/* Confirm the PTE did not while locked */
- spin_lock(&mm->page_table_lock);
+ spin_lock(ptl);
if (unlikely(!pmd_same(pmd, *pmdp))) {
unlock_page(page);
put_page(page);
goto out_unlock;
}
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);

/* Migrate the THP to the requested node */
migrated = migrate_misplaced_transhuge_page(mm, vma,
@@ -1324,7 +1335,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
return 0;

check_same:
- spin_lock(&mm->page_table_lock);
+ spin_lock(ptl);
if (unlikely(!pmd_same(pmd, *pmdp)))
goto out_unlock;
clear_pmdnuma:
@@ -1333,7 +1344,7 @@ clear_pmdnuma:
VM_BUG_ON(pmd_numa(*pmdp));
update_mmu_cache_pmd(vma, addr, pmdp);
out_unlock:
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
if (current_nid != -1)
task_numa_fault(current_nid, HPAGE_PMD_NR, false);
return 0;
@@ -2282,7 +2293,7 @@ static void collapse_huge_page(struct mm_struct *mm,
pte_t *pte;
pgtable_t pgtable;
struct page *new_page;
- spinlock_t *ptl;
+ spinlock_t *pmd_ptl, *pte_ptl;
int isolated;
unsigned long hstart, hend;
unsigned long mmun_start; /* For mmu_notifiers */
@@ -2325,12 +2336,12 @@ static void collapse_huge_page(struct mm_struct *mm,
anon_vma_lock_write(vma->anon_vma);

pte = pte_offset_map(pmd, address);
- ptl = pte_lockptr(mm, pmd);
+ pte_ptl = pte_lockptr(mm, pmd);

mmun_start = address;
mmun_end = address + HPAGE_PMD_SIZE;
mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
- spin_lock(&mm->page_table_lock); /* probably unnecessary */
+ pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
/*
* After this gup_fast can't run anymore. This also removes
* any huge TLB entry from the CPU so we won't allow
@@ -2338,16 +2349,16 @@ static void collapse_huge_page(struct mm_struct *mm,
* to avoid the risk of CPU bugs in that area.
*/
_pmd = pmdp_clear_flush(vma, address, pmd);
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(pmd_ptl);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);

- spin_lock(ptl);
+ spin_lock(pte_ptl);
isolated = __collapse_huge_page_isolate(vma, address, pte);
- spin_unlock(ptl);
+ spin_unlock(pte_ptl);

if (unlikely(!isolated)) {
pte_unmap(pte);
- spin_lock(&mm->page_table_lock);
+ spin_lock(pmd_ptl);
BUG_ON(!pmd_none(*pmd));
/*
* We can only use set_pmd_at when establishing
@@ -2355,7 +2366,7 @@ static void collapse_huge_page(struct mm_struct *mm,
* points to regular pagetables. Use pmd_populate for that
*/
pmd_populate(mm, pmd, pmd_pgtable(_pmd));
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(pmd_ptl);
anon_vma_unlock_write(vma->anon_vma);
goto out;
}
@@ -2366,7 +2377,7 @@ static void collapse_huge_page(struct mm_struct *mm,
*/
anon_vma_unlock_write(vma->anon_vma);

- __collapse_huge_page_copy(pte, new_page, vma, address, ptl);
+ __collapse_huge_page_copy(pte, new_page, vma, address, pte_ptl);
pte_unmap(pte);
__SetPageUptodate(new_page);
pgtable = pmd_pgtable(_pmd);
@@ -2381,13 +2392,13 @@ static void collapse_huge_page(struct mm_struct *mm,
*/
smp_wmb();

- spin_lock(&mm->page_table_lock);
+ spin_lock(pmd_ptl);
BUG_ON(!pmd_none(*pmd));
page_add_new_anon_rmap(new_page, vma, address);
pgtable_trans_huge_deposit(mm, pmd, pgtable);
set_pmd_at(mm, address, pmd, _pmd);
update_mmu_cache_pmd(vma, address, pmd);
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(pmd_ptl);

*hpage = NULL;

@@ -2712,6 +2723,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmd)
{
+ spinlock_t *ptl;
struct page *page;
struct mm_struct *mm = vma->vm_mm;
unsigned long haddr = address & HPAGE_PMD_MASK;
@@ -2723,22 +2735,22 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
mmun_start = haddr;
mmun_end = haddr + HPAGE_PMD_SIZE;
mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
- spin_lock(&mm->page_table_lock);
+ ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_trans_huge(*pmd))) {
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
return;
}
if (is_huge_zero_pmd(*pmd)) {
__split_huge_zero_page_pmd(vma, haddr, pmd);
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
return;
}
page = pmd_page(*pmd);
VM_BUG_ON(!page_count(page));
get_page(page);
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);

split_huge_page(page);
diff --git a/mm/memory.c b/mm/memory.c
index 1046396adf..551b15e3c1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -552,6 +552,7 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
pmd_t *pmd, unsigned long address)
{
+ spinlock_t *ptl;
pgtable_t new = pte_alloc_one(mm, address);
int wait_split_huge_page;
if (!new)
@@ -572,7 +573,7 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
*/
smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */

- spin_lock(&mm->page_table_lock);
+ ptl = pmd_lock(mm, pmd);
wait_split_huge_page = 0;
if (likely(pmd_none(*pmd))) { /* Has another populated it ? */
atomic_inc(&mm->nr_ptes);
@@ -580,7 +581,7 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
new = NULL;
} else if (unlikely(pmd_trans_splitting(*pmd)))
wait_split_huge_page = 1;
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
if (new)
pte_free(mm, new);
if (wait_split_huge_page)
@@ -1516,20 +1517,20 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
split_huge_page_pmd(vma, address, pmd);
goto split_fallthrough;
}
- spin_lock(&mm->page_table_lock);
+ ptl = pmd_lock(mm, pmd);
if (likely(pmd_trans_huge(*pmd))) {
if (unlikely(pmd_trans_splitting(*pmd))) {
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
wait_split_huge_page(vma->anon_vma, pmd);
} else {
page = follow_trans_huge_pmd(vma, address,
pmd, flags);
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
*page_mask = HPAGE_PMD_NR - 1;
goto out;
}
} else
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
/* fall through */
}
split_fallthrough:
@@ -3602,13 +3603,13 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
bool numa = false;
int local_nid = numa_node_id();

- spin_lock(&mm->page_table_lock);
+ ptl = pmd_lock(mm, pmdp);
pmd = *pmdp;
if (pmd_numa(pmd)) {
set_pmd_at(mm, _addr, pmdp, pmd_mknonnuma(pmd));
numa = true;
}
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);

if (!numa)
return 0;
diff --git a/mm/migrate.c b/mm/migrate.c
index 0ac0668a08..4cd63c2379 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1654,6 +1654,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
unsigned long address,
struct page *page, int node)
{
+ spinlock_t *ptl;
unsigned long haddr = address & HPAGE_PMD_MASK;
pg_data_t *pgdat = NODE_DATA(node);
int isolated = 0;
@@ -1700,9 +1701,9 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
WARN_ON(PageLRU(new_page));

/* Recheck the target PMD */
- spin_lock(&mm->page_table_lock);
+ ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_same(*pmd, entry))) {
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);

/* Reverse changes made by migrate_page_copy() */
if (TestClearPageActive(new_page))
@@ -1747,7 +1748,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
* before it's fully transferred to the new page.
*/
mem_cgroup_end_migration(memcg, page, new_page, true);
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);

unlock_page(new_page);
unlock_page(page);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 94722a4d6b..d01a5356b8 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -116,9 +116,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
static inline void change_pmd_protnuma(struct mm_struct *mm, unsigned long addr,
pmd_t *pmd)
{
- spin_lock(&mm->page_table_lock);
+ spinlock_t *ptl = pmd_lock(mm, pmd);
set_pmd_at(mm, addr & PMD_MASK, pmd, pmd_mknuma(*pmd));
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
}
#else
static inline void change_pmd_protnuma(struct mm_struct *mm, unsigned long addr,
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 41fee3e5d5..cbb38545d9 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -151,7 +151,7 @@ void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
pgtable_t pgtable)
{
- assert_spin_locked(&mm->page_table_lock);
+ assert_spin_locked(pmd_lockptr(mm, pmdp));

/* FIFO */
if (!pmd_huge_pte(mm, pmdp))
@@ -170,7 +170,7 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
{
pgtable_t pgtable;

- assert_spin_locked(&mm->page_table_lock);
+ assert_spin_locked(pmd_lockptr(mm, pmdp));

/* FIFO */
pgtable = pmd_huge_pte(mm, pmdp);
--
1.8.4.rc3

2013-09-27 13:17:04

by Kirill A. Shutemov

Subject: [PATCHv4 10/10] x86, mm: enable split page table lock for PMD level

Enable PMD split page table lock for X86_64 and PAE.
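
For reference, another architecture would opt in the same way (a sketch;
the "depends on" condition here is purely illustrative): set the option
added in patch 09 from its Kconfig, e.g.

	config ARCH_ENABLE_SPLIT_PMD_PTLOCK
		def_bool y
		depends on 64BIT

and call pgtable_pmd_page_ctor()/pgtable_pmd_page_dtor() from its
pmd_alloc_one()/pmd_free(), as the pgalloc.h hunk below does for x86.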

Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Alex Thorlton <[email protected]>
---
arch/x86/Kconfig | 4 ++++
arch/x86/include/asm/pgalloc.h | 11 ++++++++++-
2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ee2fb9d377..015d1362b7 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1880,6 +1880,10 @@ config USE_PERCPU_NUMA_NODE_ID
def_bool y
depends on NUMA

+config ARCH_ENABLE_SPLIT_PMD_PTLOCK
+ def_bool y
+ depends on X86_64 || X86_PAE
+
menu "Power management and ACPI options"

config ARCH_HIBERNATION_HEADER
diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index b4389a468f..e2fb2b6934 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -80,12 +80,21 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
#if PAGETABLE_LEVELS > 2
static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
{
- return (pmd_t *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
+ struct page *page;
+ page = alloc_pages(GFP_KERNEL | __GFP_REPEAT| __GFP_ZERO, 0);
+ if (!page)
+ return NULL;
+ if (!pgtable_pmd_page_ctor(page)) {
+ __free_pages(page, 0);
+ return NULL;
+ }
+ return (pmd_t *)page_address(page);
}

static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
{
BUG_ON((unsigned long)pmd & (PAGE_SIZE-1));
+ pgtable_pmd_page_dtor(virt_to_page(pmd));
free_page((unsigned long)pmd);
}

--
1.8.4.rc3

2013-09-27 13:17:08

by Kirill A. Shutemov

Subject: [PATCHv4 06/10] mm, thp: do not access mm->pmd_huge_pte directly

Currently, mm->pmd_huge_pte is protected by the page table lock. That will
not work with a split lock: we need a per-pmd pmd_huge_pte for proper
access serialization.

For now, let's just introduce a wrapper to access mm->pmd_huge_pte.
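
The wrapper keeps callers unchanged when patch 09 later redirects the
storage to the pmd page's struct page; for now the access pattern simply
changes like this (sketch):

	-	pgtable = mm->pmd_huge_pte;
	+	pgtable = pmd_huge_pte(mm, pmdp);

	-	mm->pmd_huge_pte = NULL;
	+	pmd_huge_pte(mm, pmdp) = NULL;

while pmd_huge_pte(mm, pmd) still expands to (mm)->pmd_huge_pte.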

Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Alex Thorlton <[email protected]>
---
arch/s390/mm/pgtable.c | 12 ++++++------
arch/sparc/mm/tlb.c | 12 ++++++------
include/linux/mm.h | 1 +
mm/pgtable-generic.c | 12 ++++++------
4 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index de8cbc30dc..c463e5cd3b 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -1225,11 +1225,11 @@ void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
assert_spin_locked(&mm->page_table_lock);

/* FIFO */
- if (!mm->pmd_huge_pte)
+ if (!pmd_huge_pte(mm, pmdp))
INIT_LIST_HEAD(lh);
else
- list_add(lh, (struct list_head *) mm->pmd_huge_pte);
- mm->pmd_huge_pte = pgtable;
+ list_add(lh, (struct list_head *) pmd_huge_pte(mm, pmdp));
+ pmd_huge_pte(mm, pmdp) = pgtable;
}

pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
@@ -1241,12 +1241,12 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
assert_spin_locked(&mm->page_table_lock);

/* FIFO */
- pgtable = mm->pmd_huge_pte;
+ pgtable = pmd_huge_pte(mm, pmdp);
lh = (struct list_head *) pgtable;
if (list_empty(lh))
- mm->pmd_huge_pte = NULL;
+ pmd_huge_pte(mm, pmdp) = NULL;
else {
- mm->pmd_huge_pte = (pgtable_t) lh->next;
+ pmd_huge_pte(mm, pmdp) = (pgtable_t) lh->next;
list_del(lh);
}
ptep = (pte_t *) pgtable;
diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
index 7a91f288c7..656cc46a81 100644
--- a/arch/sparc/mm/tlb.c
+++ b/arch/sparc/mm/tlb.c
@@ -196,11 +196,11 @@ void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
assert_spin_locked(&mm->page_table_lock);

/* FIFO */
- if (!mm->pmd_huge_pte)
+ if (!pmd_huge_pte(mm, pmdp))
INIT_LIST_HEAD(lh);
else
- list_add(lh, (struct list_head *) mm->pmd_huge_pte);
- mm->pmd_huge_pte = pgtable;
+ list_add(lh, (struct list_head *) pmd_huge_pte(mm, pmdp));
+ pmd_huge_pte(mm, pmdp) = pgtable;
}

pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
@@ -211,12 +211,12 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
assert_spin_locked(&mm->page_table_lock);

/* FIFO */
- pgtable = mm->pmd_huge_pte;
+ pgtable = pmd_huge_pte(mm, pmdp);
lh = (struct list_head *) pgtable;
if (list_empty(lh))
- mm->pmd_huge_pte = NULL;
+ pmd_huge_pte(mm, pmdp) = NULL;
else {
- mm->pmd_huge_pte = (pgtable_t) lh->next;
+ pmd_huge_pte(mm, pmdp) = (pgtable_t) lh->next;
list_del(lh);
}
pte_val(pgtable[0]) = 0;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e3481c6b52..b96aac9622 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1299,6 +1299,7 @@ static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
return &mm->page_table_lock;
}

+#define pmd_huge_pte(mm, pmd) ((mm)->pmd_huge_pte)

static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
{
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 3929a40bd6..41fee3e5d5 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -154,11 +154,11 @@ void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
assert_spin_locked(&mm->page_table_lock);

/* FIFO */
- if (!mm->pmd_huge_pte)
+ if (!pmd_huge_pte(mm, pmdp))
INIT_LIST_HEAD(&pgtable->lru);
else
- list_add(&pgtable->lru, &mm->pmd_huge_pte->lru);
- mm->pmd_huge_pte = pgtable;
+ list_add(&pgtable->lru, &pmd_huge_pte(mm, pmdp)->lru);
+ pmd_huge_pte(mm, pmdp) = pgtable;
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#endif
@@ -173,11 +173,11 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
assert_spin_locked(&mm->page_table_lock);

/* FIFO */
- pgtable = mm->pmd_huge_pte;
+ pgtable = pmd_huge_pte(mm, pmdp);
if (list_empty(&pgtable->lru))
- mm->pmd_huge_pte = NULL;
+ pmd_huge_pte(mm, pmdp) = NULL;
else {
- mm->pmd_huge_pte = list_entry(pgtable->lru.next,
+ pmd_huge_pte(mm, pmdp) = list_entry(pgtable->lru.next,
struct page, lru);
list_del(&pgtable->lru);
}
--
1.8.4.rc3

2013-09-27 13:18:31

by Kirill A. Shutemov

Subject: [PATCHv4 07/10] mm, hugetlb: convert hugetlbfs to use split pmd lock

Hugetlb supports multiple page sizes. We use the split lock only for the
PMD level; larger (PUD-sized) pages keep using mm->page_table_lock.
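
The resulting caller pattern (a sketch, as used throughout the hunks below):

	spinlock_t *ptl;

	ptl = huge_pte_lock(h, mm, ptep);
	/* PMD-sized hugepages take the per-page split lock here;
	   other sizes fall back to mm->page_table_lock */
	/* ... operate on the hugetlb pte ... */
	spin_unlock(ptl);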

Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Alex Thorlton <[email protected]>
---
fs/proc/meminfo.c | 2 +-
include/linux/hugetlb.h | 25 +++++++++++
include/linux/swapops.h | 7 ++--
mm/hugetlb.c | 108 +++++++++++++++++++++++++++++-------------------
mm/mempolicy.c | 5 ++-
mm/migrate.c | 7 ++--
mm/rmap.c | 2 +-
7 files changed, 103 insertions(+), 53 deletions(-)

diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 59d85d6088..6d061f5359 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -1,8 +1,8 @@
#include <linux/fs.h>
-#include <linux/hugetlb.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/mm.h>
+#include <linux/hugetlb.h>
#include <linux/mman.h>
#include <linux/mmzone.h>
#include <linux/proc_fs.h>
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 0393270466..2132532b02 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -392,6 +392,15 @@ static inline int hugepage_migration_support(struct hstate *h)
return pmd_huge_support() && (huge_page_shift(h) == PMD_SHIFT);
}

+static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
+ struct mm_struct *mm, pte_t *pte)
+{
+ if (huge_page_size(h) == PMD_SIZE)
+ return pmd_lockptr(mm, (pmd_t *) pte);
+ VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
+ return &mm->page_table_lock;
+}
+
#else /* CONFIG_HUGETLB_PAGE */
struct hstate {};
#define alloc_huge_page_node(h, nid) NULL
@@ -401,6 +410,7 @@ struct hstate {};
#define hstate_sizelog(s) NULL
#define hstate_vma(v) NULL
#define hstate_inode(i) NULL
+#define page_hstate(page) NULL
#define huge_page_size(h) PAGE_SIZE
#define huge_page_mask(h) PAGE_MASK
#define vma_kernel_pagesize(v) PAGE_SIZE
@@ -421,6 +431,21 @@ static inline pgoff_t basepage_index(struct page *page)
#define dissolve_free_huge_pages(s, e) do {} while (0)
#define pmd_huge_support() 0
#define hugepage_migration_support(h) 0
+
+static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
+ struct mm_struct *mm, pte_t *pte)
+{
+ return &mm->page_table_lock;
+}
#endif /* CONFIG_HUGETLB_PAGE */

+static inline spinlock_t *huge_pte_lock(struct hstate *h,
+ struct mm_struct *mm, pte_t *pte)
+{
+ spinlock_t *ptl;
+ ptl = huge_pte_lockptr(h, mm, pte);
+ spin_lock(ptl);
+ return ptl;
+}
+
#endif /* _LINUX_HUGETLB_H */
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 8d4fa82bfb..c0f75261a7 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -139,7 +139,8 @@ static inline void make_migration_entry_read(swp_entry_t *entry)

extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
unsigned long address);
-extern void migration_entry_wait_huge(struct mm_struct *mm, pte_t *pte);
+extern void migration_entry_wait_huge(struct vm_area_struct *vma,
+ struct mm_struct *mm, pte_t *pte);
#else

#define make_migration_entry(page, write) swp_entry(0, 0)
@@ -151,8 +152,8 @@ static inline int is_migration_entry(swp_entry_t swp)
static inline void make_migration_entry_read(swp_entry_t *entryp) { }
static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
unsigned long address) { }
-static inline void migration_entry_wait_huge(struct mm_struct *mm,
- pte_t *pte) { }
+static inline void migration_entry_wait_huge(struct vm_area_struct *vma,
+ struct mm_struct *mm, pte_t *pte) { }
static inline int is_write_migration_entry(swp_entry_t entry)
{
return 0;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b49579c7f2..1c13a6f8d8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2361,6 +2361,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;

for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
+ spinlock_t *src_ptl, *dst_ptl;
src_pte = huge_pte_offset(src, addr);
if (!src_pte)
continue;
@@ -2372,8 +2373,9 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
if (dst_pte == src_pte)
continue;

- spin_lock(&dst->page_table_lock);
- spin_lock_nested(&src->page_table_lock, SINGLE_DEPTH_NESTING);
+ dst_ptl = huge_pte_lock(h, dst, dst_pte);
+ src_ptl = huge_pte_lockptr(h, src, src_pte);
+ spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
if (!huge_pte_none(huge_ptep_get(src_pte))) {
if (cow)
huge_ptep_set_wrprotect(src, addr, src_pte);
@@ -2383,8 +2385,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
page_dup_rmap(ptepage);
set_huge_pte_at(dst, addr, dst_pte, entry);
}
- spin_unlock(&src->page_table_lock);
- spin_unlock(&dst->page_table_lock);
+ spin_unlock(src_ptl);
+ spin_unlock(dst_ptl);
}
return 0;

@@ -2427,6 +2429,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
unsigned long address;
pte_t *ptep;
pte_t pte;
+ spinlock_t *ptl;
struct page *page;
struct hstate *h = hstate_vma(vma);
unsigned long sz = huge_page_size(h);
@@ -2440,25 +2443,25 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
tlb_start_vma(tlb, vma);
mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
again:
- spin_lock(&mm->page_table_lock);
for (address = start; address < end; address += sz) {
ptep = huge_pte_offset(mm, address);
if (!ptep)
continue;

+ ptl = huge_pte_lock(h, mm, ptep);
if (huge_pmd_unshare(mm, &address, ptep))
- continue;
+ goto unlock;

pte = huge_ptep_get(ptep);
if (huge_pte_none(pte))
- continue;
+ goto unlock;

/*
* HWPoisoned hugepage is already unmapped and dropped reference
*/
if (unlikely(is_hugetlb_entry_hwpoisoned(pte))) {
huge_pte_clear(mm, address, ptep);
- continue;
+ goto unlock;
}

page = pte_page(pte);
@@ -2469,7 +2472,7 @@ again:
*/
if (ref_page) {
if (page != ref_page)
- continue;
+ goto unlock;

/*
* Mark the VMA as having unmapped its page so that
@@ -2486,13 +2489,18 @@ again:

page_remove_rmap(page);
force_flush = !__tlb_remove_page(tlb, page);
- if (force_flush)
+ if (force_flush) {
+ spin_unlock(ptl);
break;
+ }
/* Bail out after unmapping reference page if supplied */
- if (ref_page)
+ if (ref_page) {
+ spin_unlock(ptl);
break;
+ }
+unlock:
+ spin_unlock(ptl);
}
- spin_unlock(&mm->page_table_lock);
/*
* mmu_gather ran out of room to batch pages, we break out of
* the PTE lock to avoid doing the potential expensive TLB invalidate
@@ -2598,7 +2606,7 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
*/
static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pte_t *ptep, pte_t pte,
- struct page *pagecache_page)
+ struct page *pagecache_page, spinlock_t *ptl)
{
struct hstate *h = hstate_vma(vma);
struct page *old_page, *new_page;
@@ -2632,8 +2640,8 @@ retry_avoidcopy:

page_cache_get(old_page);

- /* Drop page_table_lock as buddy allocator may be called */
- spin_unlock(&mm->page_table_lock);
+ /* Drop page table lock as buddy allocator may be called */
+ spin_unlock(ptl);
new_page = alloc_huge_page(vma, address, outside_reserve);

if (IS_ERR(new_page)) {
@@ -2651,12 +2659,12 @@ retry_avoidcopy:
BUG_ON(huge_pte_none(pte));
if (unmap_ref_private(mm, vma, old_page, address)) {
BUG_ON(huge_pte_none(pte));
- spin_lock(&mm->page_table_lock);
+ spin_lock(ptl);
ptep = huge_pte_offset(mm, address & huge_page_mask(h));
if (likely(pte_same(huge_ptep_get(ptep), pte)))
goto retry_avoidcopy;
/*
- * race occurs while re-acquiring page_table_lock, and
+ * race occurs while re-acquiring page table lock, and
* our job is done.
*/
return 0;
@@ -2665,7 +2673,7 @@ retry_avoidcopy:
}

/* Caller expects lock to be held */
- spin_lock(&mm->page_table_lock);
+ spin_lock(ptl);
if (err == -ENOMEM)
return VM_FAULT_OOM;
else
@@ -2680,7 +2688,7 @@ retry_avoidcopy:
page_cache_release(new_page);
page_cache_release(old_page);
/* Caller expects lock to be held */
- spin_lock(&mm->page_table_lock);
+ spin_lock(ptl);
return VM_FAULT_OOM;
}

@@ -2692,10 +2700,10 @@ retry_avoidcopy:
mmun_end = mmun_start + huge_page_size(h);
mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
/*
- * Retake the page_table_lock to check for racing updates
+ * Retake the page table lock to check for racing updates
* before the page tables are altered
*/
- spin_lock(&mm->page_table_lock);
+ spin_lock(ptl);
ptep = huge_pte_offset(mm, address & huge_page_mask(h));
if (likely(pte_same(huge_ptep_get(ptep), pte))) {
ClearPagePrivate(new_page);
@@ -2709,13 +2717,13 @@ retry_avoidcopy:
/* Make the old page be freed below */
new_page = old_page;
}
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
page_cache_release(new_page);
page_cache_release(old_page);

/* Caller expects lock to be held */
- spin_lock(&mm->page_table_lock);
+ spin_lock(ptl);
return 0;
}

@@ -2763,6 +2771,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page *page;
struct address_space *mapping;
pte_t new_pte;
+ spinlock_t *ptl;

/*
* Currently, we are forced to kill the process in the event the
@@ -2849,7 +2858,8 @@ retry:
goto backout_unlocked;
}

- spin_lock(&mm->page_table_lock);
+ ptl = huge_pte_lockptr(h, mm, ptep);
+ spin_lock(ptl);
size = i_size_read(mapping->host) >> huge_page_shift(h);
if (idx >= size)
goto backout;
@@ -2870,16 +2880,16 @@ retry:

if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
/* Optimization, do the COW without a second fault */
- ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page);
+ ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page, ptl);
}

- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
unlock_page(page);
out:
return ret;

backout:
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
backout_unlocked:
unlock_page(page);
put_page(page);
@@ -2891,6 +2901,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
{
pte_t *ptep;
pte_t entry;
+ spinlock_t *ptl;
int ret;
struct page *page = NULL;
struct page *pagecache_page = NULL;
@@ -2903,7 +2914,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (ptep) {
entry = huge_ptep_get(ptep);
if (unlikely(is_hugetlb_entry_migration(entry))) {
- migration_entry_wait_huge(mm, ptep);
+ migration_entry_wait_huge(vma, mm, ptep);
return 0;
} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
return VM_FAULT_HWPOISON_LARGE |
@@ -2959,17 +2970,18 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (page != pagecache_page)
lock_page(page);

- spin_lock(&mm->page_table_lock);
+ ptl = huge_pte_lockptr(h, mm, ptep);
+ spin_lock(ptl);
/* Check for a racing update before calling hugetlb_cow */
if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
- goto out_page_table_lock;
+ goto out_ptl;


if (flags & FAULT_FLAG_WRITE) {
if (!huge_pte_write(entry)) {
ret = hugetlb_cow(mm, vma, address, ptep, entry,
- pagecache_page);
- goto out_page_table_lock;
+ pagecache_page, ptl);
+ goto out_ptl;
}
entry = huge_pte_mkdirty(entry);
}
@@ -2978,8 +2990,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
flags & FAULT_FLAG_WRITE))
update_mmu_cache(vma, address, ptep);

-out_page_table_lock:
- spin_unlock(&mm->page_table_lock);
+out_ptl:
+ spin_unlock(ptl);

if (pagecache_page) {
unlock_page(pagecache_page);
@@ -3005,9 +3017,9 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long remainder = *nr_pages;
struct hstate *h = hstate_vma(vma);

- spin_lock(&mm->page_table_lock);
while (vaddr < vma->vm_end && remainder) {
pte_t *pte;
+ spinlock_t *ptl = NULL;
int absent;
struct page *page;

@@ -3015,8 +3027,12 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
* Some archs (sparc64, sh*) have multiple pte_ts to
* each hugepage. We have to make sure we get the
* first, for the page indexing below to work.
+ *
+ * Note that page table lock is not held when pte is null.
*/
pte = huge_pte_offset(mm, vaddr & huge_page_mask(h));
+ if (pte)
+ ptl = huge_pte_lock(h, mm, pte);
absent = !pte || huge_pte_none(huge_ptep_get(pte));

/*
@@ -3028,6 +3044,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
*/
if (absent && (flags & FOLL_DUMP) &&
!hugetlbfs_pagecache_present(h, vma, vaddr)) {
+ if (pte)
+ spin_unlock(ptl);
remainder = 0;
break;
}
@@ -3047,10 +3065,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
!huge_pte_write(huge_ptep_get(pte)))) {
int ret;

- spin_unlock(&mm->page_table_lock);
+ if (pte)
+ spin_unlock(ptl);
ret = hugetlb_fault(mm, vma, vaddr,
(flags & FOLL_WRITE) ? FAULT_FLAG_WRITE : 0);
- spin_lock(&mm->page_table_lock);
if (!(ret & VM_FAULT_ERROR))
continue;

@@ -3081,8 +3099,8 @@ same_page:
*/
goto same_page;
}
+ spin_unlock(ptl);
}
- spin_unlock(&mm->page_table_lock);
*nr_pages = remainder;
*position = vaddr;

@@ -3103,13 +3121,15 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
flush_cache_range(vma, address, end);

mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
- spin_lock(&mm->page_table_lock);
for (; address < end; address += huge_page_size(h)) {
+ spinlock_t *ptl;
ptep = huge_pte_offset(mm, address);
if (!ptep)
continue;
+ ptl = huge_pte_lock(h, mm, ptep);
if (huge_pmd_unshare(mm, &address, ptep)) {
pages++;
+ spin_unlock(ptl);
continue;
}
if (!huge_pte_none(huge_ptep_get(ptep))) {
@@ -3119,8 +3139,8 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
set_huge_pte_at(mm, address, ptep, pte);
pages++;
}
+ spin_unlock(ptl);
}
- spin_unlock(&mm->page_table_lock);
/*
* Must flush TLB before releasing i_mmap_mutex: x86's huge_pmd_unshare
* may have cleared our pud entry and done put_page on the page table:
@@ -3283,6 +3303,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
unsigned long saddr;
pte_t *spte = NULL;
pte_t *pte;
+ spinlock_t *ptl;

if (!vma_shareable(vma, addr))
return (pte_t *)pmd_alloc(mm, pud, addr);
@@ -3305,13 +3326,14 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
if (!spte)
goto out;

- spin_lock(&mm->page_table_lock);
+ ptl = huge_pte_lockptr(hstate_vma(vma), mm, spte);
+ spin_lock(ptl);
if (pud_none(*pud))
pud_populate(mm, pud,
(pmd_t *)((unsigned long)spte & PAGE_MASK));
else
put_page(virt_to_page(spte));
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
out:
pte = (pte_t *)pmd_alloc(mm, pud, addr);
mutex_unlock(&mapping->i_mmap_mutex);
@@ -3325,7 +3347,7 @@ out:
* indicated by page_count > 1, unmap is achieved by clearing pud and
* decrementing the ref count. If count == 1, the pte page is not shared.
*
- * called with vma->vm_mm->page_table_lock held.
+ * called with page table lock held.
*
* returns: 1 successfully unmapped a shared pte page
* 0 the underlying pte page is not shared, or it is the last user
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 04729647f3..930a3e64bd 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -525,8 +525,9 @@ static void queue_pages_hugetlb_pmd_range(struct vm_area_struct *vma,
#ifdef CONFIG_HUGETLB_PAGE
int nid;
struct page *page;
+ spinlock_t *ptl;

- spin_lock(&vma->vm_mm->page_table_lock);
+ ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, (pte_t *)pmd);
page = pte_page(huge_ptep_get((pte_t *)pmd));
nid = page_to_nid(page);
if (node_isset(nid, *nodes) == !!(flags & MPOL_MF_INVERT))
@@ -536,7 +537,7 @@ static void queue_pages_hugetlb_pmd_range(struct vm_area_struct *vma,
(flags & MPOL_MF_MOVE && page_mapcount(page) == 1))
isolate_huge_page(page, private);
unlock:
- spin_unlock(&vma->vm_mm->page_table_lock);
+ spin_unlock(ptl);
#else
BUG();
#endif
diff --git a/mm/migrate.c b/mm/migrate.c
index 9c8d5f59d3..0ac0668a08 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -130,7 +130,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
ptep = huge_pte_offset(mm, addr);
if (!ptep)
goto out;
- ptl = &mm->page_table_lock;
+ ptl = huge_pte_lockptr(hstate_vma(vma), mm, ptep);
} else {
pmd = mm_find_pmd(mm, addr);
if (!pmd)
@@ -247,9 +247,10 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
__migration_entry_wait(mm, ptep, ptl);
}

-void migration_entry_wait_huge(struct mm_struct *mm, pte_t *pte)
+void migration_entry_wait_huge(struct vm_area_struct *vma,
+ struct mm_struct *mm, pte_t *pte)
{
- spinlock_t *ptl = &(mm)->page_table_lock;
+ spinlock_t *ptl = huge_pte_lockptr(hstate_vma(vma), mm, pte);
__migration_entry_wait(mm, pte, ptl);
}

diff --git a/mm/rmap.c b/mm/rmap.c
index b59d741dcf..55c8b8dc9f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -601,7 +601,7 @@ pte_t *__page_check_address(struct page *page, struct mm_struct *mm,

if (unlikely(PageHuge(page))) {
pte = huge_pte_offset(mm, address);
- ptl = &mm->page_table_lock;
+ ptl = huge_pte_lockptr(page_hstate(page), mm, pte);
goto check;
}

--
1.8.4.rc3

2013-09-27 13:18:51

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 02/10] mm: convert mm->nr_ptes to atomic_t

With split page table lock for PMD level we can't hold
mm->page_table_lock while updating nr_ptes.

Let's convert it to atomic_t to avoid races.
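
The race, sketched (hypothetical snippet, not part of this patch; ptl_a
and ptl_b stand for two different split PMD locks): two faults in the
same mm can each hold their own per-table lock, so the plain counter
update is no longer serialized:

    /* thread A, holding the split lock of pmd A */
    spin_lock(ptl_a);
    mm->nr_ptes++;          /* unserialized read-modify-write */
    spin_unlock(ptl_a);

    /* thread B, holding the split lock of pmd B (a different spinlock) */
    spin_lock(ptl_b);
    mm->nr_ptes++;          /* one of the two increments can be lost */
    spin_unlock(ptl_b);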

Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Alex Thorlton <[email protected]>
---
fs/proc/task_mmu.c | 3 ++-
include/linux/mm_types.h | 2 +-
kernel/fork.c | 2 +-
mm/huge_memory.c | 10 +++++-----
mm/memory.c | 4 ++--
mm/mmap.c | 3 ++-
mm/oom_kill.c | 6 +++---
7 files changed, 16 insertions(+), 14 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 7366e9d63c..c52c597fbf 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -62,7 +62,8 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
total_rss << (PAGE_SHIFT-10),
data << (PAGE_SHIFT-10),
mm->stack_vm << (PAGE_SHIFT-10), text, lib,
- (PTRS_PER_PTE*sizeof(pte_t)*mm->nr_ptes) >> 10,
+ (unsigned long) (PTRS_PER_PTE * sizeof(pte_t) *
+ atomic_read(&mm->nr_ptes)) >> 10,
swap << (PAGE_SHIFT-10));
}

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 84e0c56e1e..99f19e850d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -339,6 +339,7 @@ struct mm_struct {
pgd_t * pgd;
atomic_t mm_users; /* How many users with user space? */
atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */
+ atomic_t nr_ptes; /* Page table pages */
int map_count; /* number of VMAs */

spinlock_t page_table_lock; /* Protects page tables and some counters */
@@ -360,7 +361,6 @@ struct mm_struct {
unsigned long exec_vm; /* VM_EXEC & ~VM_WRITE */
unsigned long stack_vm; /* VM_GROWSUP/DOWN */
unsigned long def_flags;
- unsigned long nr_ptes; /* Page table pages */
unsigned long start_code, end_code, start_data, end_data;
unsigned long start_brk, brk, start_stack;
unsigned long arg_start, arg_end, env_start, env_end;
diff --git a/kernel/fork.c b/kernel/fork.c
index 086fe73ad6..afe70530db 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -532,7 +532,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
mm->flags = (current->mm) ?
(current->mm->flags & MMF_INIT_MASK) : default_dump_filter;
mm->core_state = NULL;
- mm->nr_ptes = 0;
+ atomic_set(&mm->nr_ptes, 0);
memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
spin_lock_init(&mm->page_table_lock);
mm_init_aio(mm);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7489884682..bbd41a2f49 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -737,7 +737,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
pgtable_trans_huge_deposit(mm, pmd, pgtable);
set_pmd_at(mm, haddr, pmd, entry);
add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
- mm->nr_ptes++;
+ atomic_inc(&mm->nr_ptes);
spin_unlock(&mm->page_table_lock);
}

@@ -778,7 +778,7 @@ static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
entry = pmd_mkhuge(entry);
pgtable_trans_huge_deposit(mm, pmd, pgtable);
set_pmd_at(mm, haddr, pmd, entry);
- mm->nr_ptes++;
+ atomic_inc(&mm->nr_ptes);
return true;
}

@@ -903,7 +903,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmd = pmd_mkold(pmd_wrprotect(pmd));
pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
set_pmd_at(dst_mm, addr, dst_pmd, pmd);
- dst_mm->nr_ptes++;
+ atomic_inc(&dst_mm->nr_ptes);

ret = 0;
out_unlock:
@@ -1358,7 +1358,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
if (is_huge_zero_pmd(orig_pmd)) {
- tlb->mm->nr_ptes--;
+ atomic_dec(&tlb->mm->nr_ptes);
spin_unlock(&tlb->mm->page_table_lock);
put_huge_zero_page();
} else {
@@ -1367,7 +1367,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
VM_BUG_ON(page_mapcount(page) < 0);
add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
VM_BUG_ON(!PageHead(page));
- tlb->mm->nr_ptes--;
+ atomic_dec(&tlb->mm->nr_ptes);
spin_unlock(&tlb->mm->page_table_lock);
tlb_remove_page(tlb, page);
}
diff --git a/mm/memory.c b/mm/memory.c
index ca00039471..1046396adf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -382,7 +382,7 @@ static void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
pgtable_t token = pmd_pgtable(*pmd);
pmd_clear(pmd);
pte_free_tlb(tlb, token, addr);
- tlb->mm->nr_ptes--;
+ atomic_dec(&tlb->mm->nr_ptes);
}

static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
@@ -575,7 +575,7 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
spin_lock(&mm->page_table_lock);
wait_split_huge_page = 0;
if (likely(pmd_none(*pmd))) { /* Has another populated it ? */
- mm->nr_ptes++;
+ atomic_inc(&mm->nr_ptes);
pmd_populate(mm, pmd, new);
new = NULL;
} else if (unlikely(pmd_trans_splitting(*pmd)))
diff --git a/mm/mmap.c b/mm/mmap.c
index 9d548512ff..1d0efbc974 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2726,7 +2726,8 @@ void exit_mmap(struct mm_struct *mm)
}
vm_unacct_memory(nr_accounted);

- WARN_ON(mm->nr_ptes > (FIRST_USER_ADDRESS+PMD_SIZE-1)>>PMD_SHIFT);
+ WARN_ON(atomic_read(&mm->nr_ptes) >
+ (FIRST_USER_ADDRESS+PMD_SIZE-1)>>PMD_SHIFT);
}

/* Insert vm structure into process list sorted by address
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 314e9d2743..7ab394e811 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -161,7 +161,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
* The baseline for the badness score is the proportion of RAM that each
* task's rss, pagetable and swap space use.
*/
- points = get_mm_rss(p->mm) + p->mm->nr_ptes +
+ points = get_mm_rss(p->mm) + atomic_read(&p->mm->nr_ptes) +
get_mm_counter(p->mm, MM_SWAPENTS);
task_unlock(p);

@@ -364,10 +364,10 @@ static void dump_tasks(const struct mem_cgroup *memcg, const nodemask_t *nodemas
continue;
}

- pr_info("[%5d] %5d %5d %8lu %8lu %7lu %8lu %5hd %s\n",
+ pr_info("[%5d] %5d %5d %8lu %8lu %7d %8lu %5hd %s\n",
task->pid, from_kuid(&init_user_ns, task_uid(task)),
task->tgid, task->mm->total_vm, get_mm_rss(task->mm),
- task->mm->nr_ptes,
+ atomic_read(&task->mm->nr_ptes),
get_mm_counter(task->mm, MM_SWAPENTS),
task->signal->oom_score_adj, task->comm);
task_unlock(task);
--
1.8.4.rc3

2013-09-27 13:18:49

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 03/10] mm: introduce api for split page table lock for PMD level

Basic api, backed by mm->page_table_lock for now. Actual implementation
will be added later.
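
A minimal usage sketch (illustration only, not part of the patch): the
caller locks a given pmd and unlocks whatever spinlock is returned, so
call sites stay unchanged when the backing lock later becomes per-table:

    spinlock_t *ptl;

    ptl = pmd_lock(mm, pmd);    /* for now this is &mm->page_table_lock */
    /* ... examine or modify *pmd under the lock ... */
    spin_unlock(ptl);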

Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Alex Thorlton <[email protected]>
---
include/linux/mm.h | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6cf8ddb45b..e3481c6b52 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1294,6 +1294,19 @@ static inline void pgtable_page_dtor(struct page *page)
((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd, address))? \
NULL: pte_offset_kernel(pmd, address))

+static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
+{
+ return &mm->page_table_lock;
+}
+
+
+static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
+{
+ spinlock_t *ptl = pmd_lockptr(mm, pmd);
+ spin_lock(ptl);
+ return ptl;
+}
+
extern void free_area_init(unsigned long * zones_size);
extern void free_area_init_node(int nid, unsigned long * zones_size,
unsigned long zone_start_pfn, unsigned long *zholes_size);
--
1.8.4.rc3

2013-09-27 13:18:47

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 04/10] mm, thp: change pmd_trans_huge_lock() to return taken lock

With split ptlock it's important to know which lock pmd_trans_huge_lock()
took. This patch adds one more parameter to the function to return the
lock.

In most places migration to the new api is trivial.
The exception is move_huge_pmd(): we need to take two locks if the pmd
tables are different.
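
Sketch of the new calling convention (the converted call sites below
follow this pattern):

    spinlock_t *ptl;

    if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
        /* *pmd is a stable huge pmd; ptl is whichever lock was taken */
        spin_unlock(ptl);       /* the caller, not the callee, drops it */
    }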

Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Alex Thorlton <[email protected]>
---
fs/proc/task_mmu.c | 13 +++++++------
include/linux/huge_mm.h | 14 +++++++-------
mm/huge_memory.c | 40 +++++++++++++++++++++++++++-------------
mm/memcontrol.c | 10 +++++-----
4 files changed, 46 insertions(+), 31 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index c52c597fbf..d7df94069e 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -506,9 +506,9 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
pte_t *pte;
spinlock_t *ptl;

- if (pmd_trans_huge_lock(pmd, vma) == 1) {
+ if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
smaps_pte_entry(*(pte_t *)pmd, addr, HPAGE_PMD_SIZE, walk);
- spin_unlock(&walk->mm->page_table_lock);
+ spin_unlock(ptl);
mss->anonymous_thp += HPAGE_PMD_SIZE;
return 0;
}
@@ -994,13 +994,14 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
{
struct vm_area_struct *vma;
struct pagemapread *pm = walk->private;
+ spinlock_t *ptl;
pte_t *pte;
int err = 0;
pagemap_entry_t pme = make_pme(PM_NOT_PRESENT(pm->v2));

/* find the first VMA at or above 'addr' */
vma = find_vma(walk->mm, addr);
- if (vma && pmd_trans_huge_lock(pmd, vma) == 1) {
+ if (vma && pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
int pmd_flags2;

if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(*pmd))
@@ -1018,7 +1019,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
if (err)
break;
}
- spin_unlock(&walk->mm->page_table_lock);
+ spin_unlock(ptl);
return err;
}

@@ -1320,7 +1321,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,

md = walk->private;

- if (pmd_trans_huge_lock(pmd, md->vma) == 1) {
+ if (pmd_trans_huge_lock(pmd, md->vma, &ptl) == 1) {
pte_t huge_pte = *(pte_t *)pmd;
struct page *page;

@@ -1328,7 +1329,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
if (page)
gather_stats(page, md, pte_dirty(huge_pte),
HPAGE_PMD_SIZE/PAGE_SIZE);
- spin_unlock(&walk->mm->page_table_lock);
+ spin_unlock(ptl);
return 0;
}

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 3935428c57..4aca0d8da1 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -129,15 +129,15 @@ extern void __vma_adjust_trans_huge(struct vm_area_struct *vma,
unsigned long start,
unsigned long end,
long adjust_next);
-extern int __pmd_trans_huge_lock(pmd_t *pmd,
- struct vm_area_struct *vma);
+extern int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
+ spinlock_t **ptl);
/* mmap_sem must be held on entry */
-static inline int pmd_trans_huge_lock(pmd_t *pmd,
- struct vm_area_struct *vma)
+static inline int pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
+ spinlock_t **ptl)
{
VM_BUG_ON(!rwsem_is_locked(&vma->vm_mm->mmap_sem));
if (pmd_trans_huge(*pmd))
- return __pmd_trans_huge_lock(pmd, vma);
+ return __pmd_trans_huge_lock(pmd, vma, ptl);
else
return 0;
}
@@ -215,8 +215,8 @@ static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
long adjust_next)
{
}
-static inline int pmd_trans_huge_lock(pmd_t *pmd,
- struct vm_area_struct *vma)
+static inline int pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
+ spinlock_t **ptl)
{
return 0;
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bbd41a2f49..59a1340f35 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1342,9 +1342,10 @@ out_unlock:
int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
pmd_t *pmd, unsigned long addr)
{
+ spinlock_t *ptl;
int ret = 0;

- if (__pmd_trans_huge_lock(pmd, vma) == 1) {
+ if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
struct page *page;
pgtable_t pgtable;
pmd_t orig_pmd;
@@ -1359,7 +1360,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
if (is_huge_zero_pmd(orig_pmd)) {
atomic_dec(&tlb->mm->nr_ptes);
- spin_unlock(&tlb->mm->page_table_lock);
+ spin_unlock(ptl);
put_huge_zero_page();
} else {
page = pmd_page(orig_pmd);
@@ -1368,7 +1369,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
VM_BUG_ON(!PageHead(page));
atomic_dec(&tlb->mm->nr_ptes);
- spin_unlock(&tlb->mm->page_table_lock);
+ spin_unlock(ptl);
tlb_remove_page(tlb, page);
}
pte_free(tlb->mm, pgtable);
@@ -1381,14 +1382,15 @@ int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
unsigned char *vec)
{
+ spinlock_t *ptl;
int ret = 0;

- if (__pmd_trans_huge_lock(pmd, vma) == 1) {
+ if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
/*
* All logical pages in the range are present
* if backed by a huge page.
*/
- spin_unlock(&vma->vm_mm->page_table_lock);
+ spin_unlock(ptl);
memset(vec, 1, (end - addr) >> PAGE_SHIFT);
ret = 1;
}
@@ -1401,6 +1403,7 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
unsigned long new_addr, unsigned long old_end,
pmd_t *old_pmd, pmd_t *new_pmd)
{
+ spinlock_t *old_ptl, *new_ptl;
int ret = 0;
pmd_t pmd;

@@ -1421,12 +1424,21 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
goto out;
}

- ret = __pmd_trans_huge_lock(old_pmd, vma);
+ /*
+ * We don't have to worry about the ordering of src and dst
+ * ptlocks because exclusive mmap_sem prevents deadlock.
+ */
+ ret = __pmd_trans_huge_lock(old_pmd, vma, &old_ptl);
if (ret == 1) {
+ new_ptl = pmd_lockptr(mm, new_pmd);
+ if (new_ptl != old_ptl)
+ spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
VM_BUG_ON(!pmd_none(*new_pmd));
set_pmd_at(mm, new_addr, new_pmd, pmd_mksoft_dirty(pmd));
- spin_unlock(&mm->page_table_lock);
+ if (new_ptl != old_ptl)
+ spin_unlock(new_ptl);
+ spin_unlock(old_ptl);
}
out:
return ret;
@@ -1436,9 +1448,10 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, pgprot_t newprot, int prot_numa)
{
struct mm_struct *mm = vma->vm_mm;
+ spinlock_t *ptl;
int ret = 0;

- if (__pmd_trans_huge_lock(pmd, vma) == 1) {
+ if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
pmd_t entry;
entry = pmdp_get_and_clear(mm, addr, pmd);
if (!prot_numa) {
@@ -1454,7 +1467,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
}
}
set_pmd_at(mm, addr, pmd, entry);
- spin_unlock(&vma->vm_mm->page_table_lock);
+ spin_unlock(ptl);
ret = 1;
}

@@ -1468,12 +1481,13 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
* Note that if it returns 1, this routine returns without unlocking page
* table locks. So callers must unlock them.
*/
-int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
+int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
+ spinlock_t **ptl)
{
- spin_lock(&vma->vm_mm->page_table_lock);
+ *ptl = pmd_lock(vma->vm_mm, pmd);
if (likely(pmd_trans_huge(*pmd))) {
if (unlikely(pmd_trans_splitting(*pmd))) {
- spin_unlock(&vma->vm_mm->page_table_lock);
+ spin_unlock(*ptl);
wait_split_huge_page(vma->anon_vma, pmd);
return -1;
} else {
@@ -1482,7 +1496,7 @@ int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
return 1;
}
}
- spin_unlock(&vma->vm_mm->page_table_lock);
+ spin_unlock(*ptl);
return 0;
}

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d5ff3ce130..5f35b2a116 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6376,10 +6376,10 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
pte_t *pte;
spinlock_t *ptl;

- if (pmd_trans_huge_lock(pmd, vma) == 1) {
+ if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE)
mc.precharge += HPAGE_PMD_NR;
- spin_unlock(&vma->vm_mm->page_table_lock);
+ spin_unlock(ptl);
return 0;
}

@@ -6568,9 +6568,9 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
* to be unlocked in __split_huge_page_splitting(), where the main
* part of thp split is not executed yet.
*/
- if (pmd_trans_huge_lock(pmd, vma) == 1) {
+ if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
if (mc.precharge < HPAGE_PMD_NR) {
- spin_unlock(&vma->vm_mm->page_table_lock);
+ spin_unlock(ptl);
return 0;
}
target_type = get_mctgt_type_thp(vma, addr, *pmd, &target);
@@ -6587,7 +6587,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
}
put_page(page);
}
- spin_unlock(&vma->vm_mm->page_table_lock);
+ spin_unlock(ptl);
return 0;
}

--
1.8.4.rc3

2013-09-27 20:47:01

by Cody P Schafer

[permalink] [raw]
Subject: Re: [PATCHv4 02/10] mm: convert mm->nr_ptes to atomic_t

On 09/27/2013 06:16 AM, Kirill A. Shutemov wrote:
> With split page table lock for PMD level we can't hold
> mm->page_table_lock while updating nr_ptes.
>
> Let's convert it to atomic_t to avoid races.
>

> ---

> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 84e0c56e1e..99f19e850d 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -339,6 +339,7 @@ struct mm_struct {
> pgd_t * pgd;
> atomic_t mm_users; /* How many users with user space? */
> atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */
> + atomic_t nr_ptes; /* Page table pages */
> int map_count; /* number of VMAs */
>
> spinlock_t page_table_lock; /* Protects page tables and some counters */
> @@ -360,7 +361,6 @@ struct mm_struct {
> unsigned long exec_vm; /* VM_EXEC & ~VM_WRITE */
> unsigned long stack_vm; /* VM_GROWSUP/DOWN */
> unsigned long def_flags;
> - unsigned long nr_ptes; /* Page table pages */
> unsigned long start_code, end_code, start_data, end_data;
> unsigned long start_brk, brk, start_stack;
> unsigned long arg_start, arg_end, env_start, env_end;

Will 32 bits always be enough here? Should atomic_long_t be used instead?

2013-09-27 21:01:53

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 02/10] mm: convert mm->nr_ptes to atomic_t

On 09/27/2013 01:46 PM, Cody P Schafer wrote:
> On 09/27/2013 06:16 AM, Kirill A. Shutemov wrote:
>> @@ -339,6 +339,7 @@ struct mm_struct {
>> pgd_t * pgd;
>> atomic_t mm_users; /* How many users with user space? */
>> atomic_t mm_count; /* How many references to "struct
>> mm_struct" (users count as 1) */
>> + atomic_t nr_ptes; /* Page table pages */
>> int map_count; /* number of VMAs */
...
>
> Will 32 bits always be enough here? Should atomic_long_t be used instead?

There are 48 bits of virtual address space on x86 today. 12 bits of
that is the address inside the page, so we've at *most* 2^36 pages. 2^9
(512) pages are mapped by a pte page, so that means the page tables only
hold 2^27 pte pages in a single process.

We've got 31 bits of usable space in the atomic_t, so that definitely
works _today_. If the virtual address space ever gets bigger, we might
have problems, though.

In practice, though, we steal a big chunk of that virtual address space
for the kernel, and that doesn't get accounted in mm->nr_ptes, so we've
got a _bit_ more wiggle room than just 4 bits. Also, anybody that's
mapping >4 petabytes of memory with 4k ptes is just off their rocker.
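
Spelling the arithmetic out (same assumptions: x86_64, 4k pages, 48-bit
virtual address space, 512 ptes per page table):

    2^(48 - 12) = 2^36   max pages mapped by a single process
    2^(36 -  9) = 2^27   max pte pages needed, vs. 2^31 - 1 countable in an atomic_t
    2^(31 + 21) = 2^52   bytes covered by 2^31 pte pages, i.e. the >4 petabytes above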

I'm also not sure what the virtual address limits are for the more
obscure architectures, so I guess it's also possible they'll hit this.
I guess it wouldn't hurt to stick an overflow check in there for VM
debugging purposes.

2013-09-28 00:14:13

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCHv4 02/10] mm: convert mm->nr_ptes to atomic_t

On Sat, Sep 28, 2013 at 01:24:51AM +0300, Kirill A. Shutemov wrote:
> Cody P Schafer wrote:
> > On 09/27/2013 06:16 AM, Kirill A. Shutemov wrote:
> > > With split page table lock for PMD level we can't hold
> > > mm->page_table_lock while updating nr_ptes.
> > >
> > > Let's convert it to atomic_t to avoid races.
> > >
> >
> > > ---
> >
> > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > index 84e0c56e1e..99f19e850d 100644
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -339,6 +339,7 @@ struct mm_struct {
> > > pgd_t * pgd;
> > > atomic_t mm_users; /* How many users with user space? */
> > > atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */
> > > + atomic_t nr_ptes; /* Page table pages */
> > > int map_count; /* number of VMAs */
> > >
> > > spinlock_t page_table_lock; /* Protects page tables and some counters */
> > > @@ -360,7 +361,6 @@ struct mm_struct {
> > > unsigned long exec_vm; /* VM_EXEC & ~VM_WRITE */
> > > unsigned long stack_vm; /* VM_GROWSUP/DOWN */
> > > unsigned long def_flags;
> > > - unsigned long nr_ptes; /* Page table pages */
> > > unsigned long start_code, end_code, start_data, end_data;
> > > unsigned long start_brk, brk, start_stack;
> > > unsigned long arg_start, arg_end, env_start, env_end;
> >
> > Will 32 bits always be enough here? Should atomic_long_t be used instead?
>
> Good question!
>
> On x86_64 we need one table to cover 2M (512 entries by 4k, 21 bits) of
> virtual address space. Total size of virtual memory which can be covered
> by 31-bit (32 - sign) nr_ptes is 52 bits (31 + 21).
>
> Currently, on x86_64 with 4-level page tables we can use at most 48 bits of
> virtual address space (only half of it available for userspace), so we're
> pretty safe here.
>
> Although, it can be a potential problem, if (when) x86_64 will implement
> 5-level page tables -- 57-bits of virtual address space.
>
> Any thoughts?

I'd just go with atomic_long_t to avoid having to worry about this in
the first place. It's been ulong forever and I'm not aware of struct
mm_struct size being an urgent issue. Cutting this type in half and
adding overflow checks adds more problems than it solves.
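
With atomic_long_t the accounting would look like (sketch only):

    atomic_long_t nr_ptes;              /* in struct mm_struct */

    atomic_long_set(&mm->nr_ptes, 0);   /* at mm_init() time */
    atomic_long_inc(&mm->nr_ptes);      /* when a pte page is added */
    atomic_long_dec(&mm->nr_ptes);      /* when a pte page is freed */
    atomic_long_read(&mm->nr_ptes);     /* for reporting */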

2013-10-03 23:11:13

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCHv4 09/10] mm: implement split page table lock for PMD level

On Fri, 27 Sep 2013 16:16:26 +0300 "Kirill A. Shutemov" <[email protected]> wrote:

> The basic idea is the same as with PTE level: the lock is embedded into
> struct page of table's page.
>
> We can't use mm->pmd_huge_pte to store pgtables for THP, since we don't
> take mm->page_table_lock anymore. Let's reuse page->lru of table's page
> for that.
>
> pgtable_pmd_page_ctor() returns true if initialization is successful
> and false otherwise. The current implementation never fails, but the
> assumption that the constructor can fail will help to port it to -rt,
> where spinlock_t is rather huge and cannot be embedded into struct
> page -- dynamic allocation is required.

spinlock_t is rather large when lockdep is enabled. What happens?

2013-10-03 23:38:12

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 09/10] mm: implement split page table lock for PMD level

Andrew Morton wrote:
> On Fri, 27 Sep 2013 16:16:26 +0300 "Kirill A. Shutemov" <[email protected]> wrote:
>
> > The basic idea is the same as with PTE level: the lock is embedded into
> > struct page of table's page.
> >
> > We can't use mm->pmd_huge_pte to store pgtables for THP, since we don't
> > take mm->page_table_lock anymore. Let's reuse page->lru of table's page
> > for that.
> >
> > pgtable_pmd_page_ctor() returns true if initialization is successful
> > and false otherwise. The current implementation never fails, but the
> > assumption that the constructor can fail will help to port it to -rt,
> > where spinlock_t is rather huge and cannot be embedded into struct
> > page -- dynamic allocation is required.
>
> spinlock_t is rather large when lockdep is enabled. What happens?

The same as with PTE split lock: CONFIG_SPLIT_PTLOCK_CPUS set to 999999
if DEBUG_SPINLOCK || DEBUG_LOCK_ALLOC. It effectively blocks split locks
usage if spinlock_t is too big.
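
Roughly, in mm/Kconfig (sketch; the arch-specific defaults are omitted
here):

    config SPLIT_PTLOCK_CPUS
        int
        default "999999" if DEBUG_SPINLOCK || DEBUG_LOCK_ALLOC
        default "4"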

--
Kirill A. Shutemov

2013-10-04 07:21:58

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCHv4 09/10] mm: implement split page table lock for PMD level

On Thu, Oct 03, 2013 at 04:11:09PM -0700, Andrew Morton wrote:
> On Fri, 27 Sep 2013 16:16:26 +0300 "Kirill A. Shutemov" <[email protected]> wrote:
>
> > The basic idea is the same as with PTE level: the lock is embedded into
> > struct page of table's page.
> >
> > We can't use mm->pmd_huge_pte to store pgtables for THP, since we don't
> > take mm->page_table_lock anymore. Let's reuse page->lru of table's page
> > for that.
> >
> > pgtable_pmd_page_ctor() returns true if initialization is successful
> > and false otherwise. The current implementation never fails, but the
> > assumption that the constructor can fail will help to port it to -rt,
> > where spinlock_t is rather huge and cannot be embedded into struct
> > page -- dynamic allocation is required.
>
> spinlock_t is rather large when lockdep is enabled. What happens?

I could go fix all the arch code and pgtable ctor thingies and do the
same thing we do on -rt if anybody cares.

Hugh thought the single pagetable lock would catch the more interesting
locking scenarios, but it's of course sad to have an entire locking
scheme not covered by lockdep -- that's just waiting for a bug to sneak
in there.

2013-10-04 20:12:10

by Alex Thorlton

[permalink] [raw]
Subject: Re: [PATCHv4 00/10] split page table lock for PMD tables

Kirill,

I've pasted in my results for 512 cores below. Things are looking
really good here. I don't have a test for HUGETLBFS, but if you want to
pass me the one you used, I can run that too. I suppose I could write
one, but why reinvent the wheel? :)

Sorry for the delay on these results. I hit some strange issues with
running thp_memscale on systems with either of the following
combinations of configuration options set:

[thp off]
HUGETLBFS=y
HUGETLB_PAGE=y
NUMA_BALANCING=y
NUMA_BALANCING_DEFAULT_ENABLED=y

[thp on or off]
HUGETLBFS=n
HUGETLB_PAGE=n
NUMA_BALANCING=y
NUMA_BALANCING_DEFAULT_ENABLED=y

I'm getting segfaults intermittently, as well as some weird RCU sched
errors. This happens in vanilla 3.12-rc2, so it doesn't have anything
to do with your patches, but I thought I'd let you know. There didn't
used to be any issues with this test, so I think there's a subtle kernel
bug here. That's, of course, an entirely separate issue though.

As far as these patches go, I think everything looks good (save for the
bit of discussion you were having with Andrew earlier, which I think
you've worked out). My testing shows that the page fault rates are
actually better on this threaded test than in the non-threaded case!

- Alex

On Fri, Sep 27, 2013 at 04:16:17PM +0300, Kirill A. Shutemov wrote:
> Alex Thorlton noticed that some massively threaded workloads work poorly,
> if THP enabled. This patchset fixes this by introducing split page table
> lock for PMD tables. hugetlbfs is not covered yet.
>
> This patchset is based on work by Naoya Horiguchi.
>
> Please review and consider applying.
>
> Changes:
> v4:
> - convert hugetlb to new locking;
> v3:
> - fix USE_SPLIT_PMD_PTLOCKS;
> - fix warning in fs/proc/task_mmu.c;
> v2:
> - reuse CONFIG_SPLIT_PTLOCK_CPUS for PMD split lock;
> - s/huge_pmd_lock/pmd_lock/g;
> - assume pgtable_pmd_page_ctor() can fail;
> - fix format line in task_mem() for VmPTE;
>
> THP off, v3.12-rc2:
> -------------------
>
> Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):
>
> 1037072.835207 task-clock # 57.426 CPUs utilized ( +- 3.59% )
> 95,093 context-switches # 0.092 K/sec ( +- 3.93% )
> 140 cpu-migrations # 0.000 K/sec ( +- 5.28% )
> 10,000,550 page-faults # 0.010 M/sec ( +- 0.00% )
> 2,455,210,400,261 cycles # 2.367 GHz ( +- 3.62% ) [83.33%]
> 2,429,281,882,056 stalled-cycles-frontend # 98.94% frontend cycles idle ( +- 3.67% ) [83.33%]
> 1,975,960,019,659 stalled-cycles-backend # 80.48% backend cycles idle ( +- 3.88% ) [66.68%]
> 46,503,296,013 instructions # 0.02 insns per cycle
> # 52.24 stalled cycles per insn ( +- 3.21% ) [83.34%]
> 9,278,997,542 branches # 8.947 M/sec ( +- 4.00% ) [83.34%]
> 89,881,640 branch-misses # 0.97% of all branches ( +- 1.17% ) [83.33%]
>
> 18.059261877 seconds time elapsed ( +- 2.65% )
>
> THP on, v3.12-rc2:
> ------------------
>
> Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):
>
> 3114745.395974 task-clock # 73.875 CPUs utilized ( +- 1.84% )
> 267,356 context-switches # 0.086 K/sec ( +- 1.84% )
> 99 cpu-migrations # 0.000 K/sec ( +- 1.40% )
> 58,313 page-faults # 0.019 K/sec ( +- 0.28% )
> 7,416,635,817,510 cycles # 2.381 GHz ( +- 1.83% ) [83.33%]
> 7,342,619,196,993 stalled-cycles-frontend # 99.00% frontend cycles idle ( +- 1.88% ) [83.33%]
> 6,267,671,641,967 stalled-cycles-backend # 84.51% backend cycles idle ( +- 2.03% ) [66.67%]
> 117,819,935,165 instructions # 0.02 insns per cycle
> # 62.32 stalled cycles per insn ( +- 4.39% ) [83.34%]
> 28,899,314,777 branches # 9.278 M/sec ( +- 4.48% ) [83.34%]
> 71,787,032 branch-misses # 0.25% of all branches ( +- 1.03% ) [83.33%]
>
> 42.162306788 seconds time elapsed ( +- 1.73% )

THP on, v3.12-rc2:
------------------

Performance counter stats for './thp_memscale -C 0 -m 0 -c 512 -b 512m' (5 runs):

568668865.944994 task-clock # 528.547 CPUs utilized ( +- 0.21% ) [100.00%]
1,491,589 context-switches # 0.000 M/sec ( +- 0.25% ) [100.00%]
1,085 CPU-migrations # 0.000 M/sec ( +- 1.80% ) [100.00%]
400,822 page-faults # 0.000 M/sec ( +- 0.41% )
1,306,612,476,049,478 cycles # 2.298 GHz ( +- 0.23% ) [100.00%]
1,277,211,694,318,724 stalled-cycles-frontend # 97.75% frontend cycles idle ( +- 0.21% ) [100.00%]
1,163,736,844,232,064 stalled-cycles-backend # 89.07% backend cycles idle ( +- 0.20% ) [100.00%]
53,855,178,678,230 instructions # 0.04 insns per cycle
# 23.72 stalled cycles per insn ( +- 1.15% ) [100.00%]
21,041,661,816,782 branches # 37.002 M/sec ( +- 0.64% ) [100.00%]
606,665,092 branch-misses # 0.00% of all branches ( +- 0.63% )

1075.909782795 seconds time elapsed ( +- 0.21% )

> HUGETLB, v3.12-rc2:
> -------------------
>
> Performance counter stats for './thp_memscale_hugetlbfs -c 80 -b 512M' (5 runs):
>
> 2588052.787264 task-clock # 54.400 CPUs utilized ( +- 3.69% )
> 246,831 context-switches # 0.095 K/sec ( +- 4.15% )
> 138 cpu-migrations # 0.000 K/sec ( +- 5.30% )
> 21,027 page-faults # 0.008 K/sec ( +- 0.01% )
> 6,166,666,307,263 cycles # 2.383 GHz ( +- 3.68% ) [83.33%]
> 6,086,008,929,407 stalled-cycles-frontend # 98.69% frontend cycles idle ( +- 3.77% ) [83.33%]
> 5,087,874,435,481 stalled-cycles-backend # 82.51% backend cycles idle ( +- 4.41% ) [66.67%]
> 133,782,831,249 instructions # 0.02 insns per cycle
> # 45.49 stalled cycles per insn ( +- 4.30% ) [83.34%]
> 34,026,870,541 branches # 13.148 M/sec ( +- 4.24% ) [83.34%]
> 68,670,942 branch-misses # 0.20% of all branches ( +- 3.26% ) [83.33%]
>
> 47.574936948 seconds time elapsed ( +- 2.09% )
>
> THP off, patched:
> -----------------
>
> Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):
>
> 943301.957892 task-clock # 56.256 CPUs utilized ( +- 3.01% )
> 86,218 context-switches # 0.091 K/sec ( +- 3.17% )
> 121 cpu-migrations # 0.000 K/sec ( +- 6.64% )
> 10,000,551 page-faults # 0.011 M/sec ( +- 0.00% )
> 2,230,462,457,654 cycles # 2.365 GHz ( +- 3.04% ) [83.32%]
> 2,204,616,385,805 stalled-cycles-frontend # 98.84% frontend cycles idle ( +- 3.09% ) [83.32%]
> 1,778,640,046,926 stalled-cycles-backend # 79.74% backend cycles idle ( +- 3.47% ) [66.69%]
> 45,995,472,617 instructions # 0.02 insns per cycle
> # 47.93 stalled cycles per insn ( +- 2.51% ) [83.34%]
> 9,179,700,174 branches # 9.731 M/sec ( +- 3.04% ) [83.35%]
> 89,166,529 branch-misses # 0.97% of all branches ( +- 1.45% ) [83.33%]
>
> 16.768027318 seconds time elapsed ( +- 2.47% )
>
> THP on, patched:
> ----------------
>
> Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):
>
> 458793.837905 task-clock # 54.632 CPUs utilized ( +- 0.79% )
> 41,831 context-switches # 0.091 K/sec ( +- 0.97% )
> 98 cpu-migrations # 0.000 K/sec ( +- 1.66% )
> 57,829 page-faults # 0.126 K/sec ( +- 0.62% )
> 1,077,543,336,716 cycles # 2.349 GHz ( +- 0.81% ) [83.33%]
> 1,067,403,802,964 stalled-cycles-frontend # 99.06% frontend cycles idle ( +- 0.87% ) [83.33%]
> 864,764,616,143 stalled-cycles-backend # 80.25% backend cycles idle ( +- 0.73% ) [66.68%]
> 16,129,177,440 instructions # 0.01 insns per cycle
> # 66.18 stalled cycles per insn ( +- 7.94% ) [83.35%]
> 3,618,938,569 branches # 7.888 M/sec ( +- 8.46% ) [83.36%]
> 33,242,032 branch-misses # 0.92% of all branches ( +- 2.02% ) [83.32%]
>
> 8.397885779 seconds time elapsed ( +- 0.18% )

THP on, patched:
----------------

Performance counter stats for './runt -t -c 512 -b 512m' (5 runs):

15836198.490485 task-clock # 533.304 CPUs utilized ( +- 0.95% ) [100.00%]
127,507 context-switches # 0.000 M/sec ( +- 1.65% ) [100.00%]
1,223 CPU-migrations # 0.000 M/sec ( +- 3.23% ) [100.00%]
302,080 page-faults # 0.000 M/sec ( +- 6.88% )
18,925,875,973,975 cycles # 1.195 GHz ( +- 0.43% ) [100.00%]
18,325,469,464,007 stalled-cycles-frontend # 96.83% frontend cycles idle ( +- 0.44% ) [100.00%]
17,522,272,147,141 stalled-cycles-backend # 92.58% backend cycles idle ( +- 0.49% ) [100.00%]
2,686,490,067,197 instructions # 0.14 insns per cycle
# 6.82 stalled cycles per insn ( +- 2.16% ) [100.00%]
944,712,646,402 branches # 59.655 M/sec ( +- 2.03% ) [100.00%]
145,956,565 branch-misses # 0.02% of all branches ( +- 0.88% )

29.694499652 seconds time elapsed ( +- 0.95% )

(these results are from the test suite that I ripped thp_memscale out
of, but it's the same test)

> HUGETLB, patched
> -----------------
>
> Performance counter stats for './thp_memscale_hugetlbfs -c 80 -b 512M' (5 runs):
>
> 395353.076837 task-clock # 20.329 CPUs utilized ( +- 8.16% )
> 55,730 context-switches # 0.141 K/sec ( +- 5.31% )
> 138 cpu-migrations # 0.000 K/sec ( +- 4.24% )
> 21,027 page-faults # 0.053 K/sec ( +- 0.00% )
> 930,219,717,244 cycles # 2.353 GHz ( +- 8.21% ) [83.32%]
> 914,295,694,103 stalled-cycles-frontend # 98.29% frontend cycles idle ( +- 8.35% ) [83.33%]
> 704,137,950,187 stalled-cycles-backend # 75.70% backend cycles idle ( +- 9.16% ) [66.69%]
> 30,541,538,385 instructions # 0.03 insns per cycle
> # 29.94 stalled cycles per insn ( +- 3.98% ) [83.35%]
> 8,415,376,631 branches # 21.286 M/sec ( +- 3.61% ) [83.36%]
> 32,645,478 branch-misses # 0.39% of all branches ( +- 3.41% ) [83.32%]
>
> 19.447481153 seconds time elapsed ( +- 2.00% )
>
> Kirill A. Shutemov (10):
> mm: rename USE_SPLIT_PTLOCKS to USE_SPLIT_PTE_PTLOCKS
> mm: convert mm->nr_ptes to atomic_t
> mm: introduce api for split page table lock for PMD level
> mm, thp: change pmd_trans_huge_lock() to return taken lock
> mm, thp: move ptl taking inside page_check_address_pmd()
> mm, thp: do not access mm->pmd_huge_pte directly
> mm, hugetlb: convert hugetlbfs to use split pmd lock
> mm: convert the rest to new page table lock api
> mm: implement split page table lock for PMD level
> x86, mm: enable split page table lock for PMD level
>
> arch/arm/mm/fault-armv.c | 6 +-
> arch/s390/mm/pgtable.c | 12 +--
> arch/sparc/mm/tlb.c | 12 +--
> arch/x86/Kconfig | 4 +
> arch/x86/include/asm/pgalloc.h | 11 ++-
> arch/x86/xen/mmu.c | 6 +-
> fs/proc/meminfo.c | 2 +-
> fs/proc/task_mmu.c | 16 ++--
> include/linux/huge_mm.h | 17 ++--
> include/linux/hugetlb.h | 25 +++++
> include/linux/mm.h | 52 ++++++++++-
> include/linux/mm_types.h | 18 ++--
> include/linux/swapops.h | 7 +-
> kernel/fork.c | 6 +-
> mm/Kconfig | 3 +
> mm/huge_memory.c | 201 ++++++++++++++++++++++++-----------------
> mm/hugetlb.c | 108 +++++++++++++---------
> mm/memcontrol.c | 10 +-
> mm/memory.c | 21 +++--
> mm/mempolicy.c | 5 +-
> mm/migrate.c | 14 +--
> mm/mmap.c | 3 +-
> mm/mprotect.c | 4 +-
> mm/oom_kill.c | 6 +-
> mm/pgtable-generic.c | 16 ++--
> mm/rmap.c | 15 ++-
> 26 files changed, 379 insertions(+), 221 deletions(-)
>
> --
> 1.8.4.rc3
>

2013-10-04 20:26:15

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 00/10] split page table lock for PMD tables

Alex Thorlton wrote:
> Kirill,
>
> I've pasted in my results for 512 cores below. Things are looking
> really good here. I don't have a test for HUGETLBFS, but if you want to
> pass me the one you used, I can run that too. I suppose I could write
> one, but why reinvent the wheel? :)

Patch below.

> Sorry for the delay on these results. I hit some strange issues with
> running thp_memscale on systems with either of the following
> combinations of configuration options set:
>
> [thp off]
> HUGETLBFS=y
> HUGETLB_PAGE=y
> NUMA_BALANCING=y
> NUMA_BALANCING_DEFAULT_ENABLED=y
>
> [thp on or off]
> HUGETLBFS=n
> HUGETLB_PAGE=n
> NUMA_BALANCING=y
> NUMA_BALANCING_DEFAULT_ENABLED=y
>
> I'm getting segfaults intermittently, as well as some weird RCU sched
> errors. This happens in vanilla 3.12-rc2, so it doesn't have anything
> to do with your patches, but I thought I'd let you know. There didn't
> used to be any issues with this test, so I think there's a subtle kernel
> bug here. That's, of course, an entirely separate issue though.

I'll take a look next week, if nobody does it before.

>
> As far as these patches go, I think everything looks good (save for the
> bit of discussion you were having with Andrew earlier, which I think
> you've worked out). My testing shows that the page fault rates are
> actually better on this threaded test than in the non-threaded case!
>
> - Alex
>
> THP on, v3.12-rc2:
> ------------------
>
> Performance counter stats for './thp_memscale -C 0 -m 0 -c 512 -b 512m' (5 runs):
>
> 568668865.944994 task-clock # 528.547 CPUs utilized ( +- 0.21% ) [100.00%]
> 1,491,589 context-switches # 0.000 M/sec ( +- 0.25% ) [100.00%]
> 1,085 CPU-migrations # 0.000 M/sec ( +- 1.80% ) [100.00%]
> 400,822 page-faults # 0.000 M/sec ( +- 0.41% )
> 1,306,612,476,049,478 cycles # 2.298 GHz ( +- 0.23% ) [100.00%]
> 1,277,211,694,318,724 stalled-cycles-frontend # 97.75% frontend cycles idle ( +- 0.21% ) [100.00%]
> 1,163,736,844,232,064 stalled-cycles-backend # 89.07% backend cycles idle ( +- 0.20% ) [100.00%]
> 53,855,178,678,230 instructions # 0.04 insns per cycle
> # 23.72 stalled cycles per insn ( +- 1.15% ) [100.00%]
> 21,041,661,816,782 branches # 37.002 M/sec ( +- 0.64% ) [100.00%]
> 606,665,092 branch-misses # 0.00% of all branches ( +- 0.63% )
>
> 1075.909782795 seconds time elapsed ( +- 0.21% )
>
> THP on, patched:
> ----------------
>
> Performance counter stats for './runt -t -c 512 -b 512m' (5 runs):
>
> 15836198.490485 task-clock # 533.304 CPUs utilized ( +- 0.95% ) [100.00%]
> 127,507 context-switches # 0.000 M/sec ( +- 1.65% ) [100.00%]
> 1,223 CPU-migrations # 0.000 M/sec ( +- 3.23% ) [100.00%]
> 302,080 page-faults # 0.000 M/sec ( +- 6.88% )
> 18,925,875,973,975 cycles # 1.195 GHz ( +- 0.43% ) [100.00%]
> 18,325,469,464,007 stalled-cycles-frontend # 96.83% frontend cycles idle ( +- 0.44% ) [100.00%]
> 17,522,272,147,141 stalled-cycles-backend # 92.58% backend cycles idle ( +- 0.49% ) [100.00%]
> 2,686,490,067,197 instructions # 0.14 insns per cycle
> # 6.82 stalled cycles per insn ( +- 2.16% ) [100.00%]
> 944,712,646,402 branches # 59.655 M/sec ( +- 2.03% ) [100.00%]
> 145,956,565 branch-misses # 0.02% of all branches ( +- 0.88% )
>
> 29.694499652 seconds time elapsed ( +- 0.95% )
>
> (these results are from the test suite that I ripped thp_memscale out
> of, but it's the same test)

36 times faster. Not bad I think. ;)

Naive patch to use HUGETLB:

--- thp_memscale/thp_memscale.c 2013-09-23 23:44:21.000000000 +0300
+++ thp_memscale/thp_memscale.c 2013-09-26 17:45:47.878429885 +0300
@@ -191,7 +191,10 @@
int id, i, cnt;

id = (long)arg;
- p = malloc(bytes);
+ p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE | MAP_HUGETLB, 0, 0);
+ if (p == MAP_FAILED)
+ perrorx("mmap failed");
ps = p;

if (runon(basecpu + id) < 0)
--
Kirill A. Shutemov

2013-10-04 20:31:41

by Alex Thorlton

[permalink] [raw]
Subject: Re: [PATCHv4 00/10] split page table lock for PMD tables

On Fri, Oct 04, 2013 at 11:26:02PM +0300, Kirill A. Shutemov wrote:
> Alex Thorlton wrote:
> > Kirill,
> >
> > I've pasted in my results for 512 cores below. Things are looking
> > really good here. I don't have a test for HUGETLBFS, but if you want to
> > pass me the one you used, I can run that too. I suppose I could write
> > one, but why reinvent the wheel? :)
>
> Patch below.

Good deal, thanks. I'll get some test results put up soon.

>
> > Sorry for the delay on these results. I hit some strange issues with
> > running thp_memscale on systems with either of the following
> > combinations of configuration options set:
> >
> > [thp off]
> > HUGETLBFS=y
> > HUGETLB_PAGE=y
> > NUMA_BALANCING=y
> > NUMA_BALANCING_DEFAULT_ENABLED=y
> >
> > [thp on or off]
> > HUGETLBFS=n
> > HUGETLB_PAGE=n
> > NUMA_BALANCING=y
> > NUMA_BALANCING_DEFAULT_ENABLED=y
> >
> > I'm getting segfaults intermittently, as well as some weird RCU sched
> > errors. This happens in vanilla 3.12-rc2, so it doesn't have anything
> > to do with your patches, but I thought I'd let you know. There didn't
> > used to be any issues with this test, so I think there's a subtle kernel
> > bug here. That's, of course, an entirely separate issue though.
>
> I'll take a look next week, if nobody does it before.

I'm starting a bisect now. Not sure how long it'll take, but I'll keep
you posted.

>
> >
> > As far as these patches go, I think everything looks good (save for the
> > bit of discussion you were having with Andrew earlier, which I think
> > you've worked out). My testing shows that the page fault rates are
> > actually better on this threaded test than in the non-threaded case!
> >
> > - Alex
> >
> > THP on, v3.12-rc2:
> > ------------------
> >
> > Performance counter stats for './thp_memscale -C 0 -m 0 -c 512 -b 512m' (5 runs):
> >
> > 568668865.944994 task-clock # 528.547 CPUs utilized ( +- 0.21% ) [100.00%]
> > 1,491,589 context-switches # 0.000 M/sec ( +- 0.25% ) [100.00%]
> > 1,085 CPU-migrations # 0.000 M/sec ( +- 1.80% ) [100.00%]
> > 400,822 page-faults # 0.000 M/sec ( +- 0.41% )
> > 1,306,612,476,049,478 cycles # 2.298 GHz ( +- 0.23% ) [100.00%]
> > 1,277,211,694,318,724 stalled-cycles-frontend # 97.75% frontend cycles idle ( +- 0.21% ) [100.00%]
> > 1,163,736,844,232,064 stalled-cycles-backend # 89.07% backend cycles idle ( +- 0.20% ) [100.00%]
> > 53,855,178,678,230 instructions # 0.04 insns per cycle
> > # 23.72 stalled cycles per insn ( +- 1.15% ) [100.00%]
> > 21,041,661,816,782 branches # 37.002 M/sec ( +- 0.64% ) [100.00%]
> > 606,665,092 branch-misses # 0.00% of all branches ( +- 0.63% )
> >
> > 1075.909782795 seconds time elapsed ( +- 0.21% )
> >
> > THP on, patched:
> > ----------------
> >
> > Performance counter stats for './runt -t -c 512 -b 512m' (5 runs):
> >
> > 15836198.490485 task-clock # 533.304 CPUs utilized ( +- 0.95% ) [100.00%]
> > 127,507 context-switches # 0.000 M/sec ( +- 1.65% ) [100.00%]
> > 1,223 CPU-migrations # 0.000 M/sec ( +- 3.23% ) [100.00%]
> > 302,080 page-faults # 0.000 M/sec ( +- 6.88% )
> > 18,925,875,973,975 cycles # 1.195 GHz ( +- 0.43% ) [100.00%]
> > 18,325,469,464,007 stalled-cycles-frontend # 96.83% frontend cycles idle ( +- 0.44% ) [100.00%]
> > 17,522,272,147,141 stalled-cycles-backend # 92.58% backend cycles idle ( +- 0.49% ) [100.00%]
> > 2,686,490,067,197 instructions # 0.14 insns per cycle
> > # 6.82 stalled cycles per insn ( +- 2.16% ) [100.00%]
> > 944,712,646,402 branches # 59.655 M/sec ( +- 2.03% ) [100.00%]
> > 145,956,565 branch-misses # 0.02% of all branches ( +- 0.88% )
> >
> > 29.694499652 seconds time elapsed ( +- 0.95% )
> >
> > (these results are from the test suite that I ripped thp_memscale out
> > of, but it's the same test)
>
> 36 times faster. Not bad I think. ;)
>
> Naive patch to use HUGETLB:
>
> --- thp_memscale/thp_memscale.c 2013-09-23 23:44:21.000000000 +0300
> +++ thp_memscale/thp_memscale.c 2013-09-26 17:45:47.878429885 +0300
> @@ -191,7 +191,10 @@
> int id, i, cnt;
>
> id = (long)arg;
> - p = malloc(bytes);
> + p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
> + MAP_ANONYMOUS | MAP_PRIVATE | MAP_HUGETLB, 0, 0);
> + if (p == MAP_FAILED)
> + perrorx("mmap failed");
> ps = p;
>
> if (runon(basecpu + id) < 0)
> --
> Kirill A. Shutemov

2013-10-07 09:48:29

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 00/10] split page table lock for PMD tables

Alex Thorlton wrote:
> > > Sorry for the delay on these results. I hit some strange issues with
> > > running thp_memscale on systems with either of the following
> > > combinations of configuration options set:
> > >
> > > [thp off]
> > > HUGETLBFS=y
> > > HUGETLB_PAGE=y
> > > NUMA_BALANCING=y
> > > NUMA_BALANCING_DEFAULT_ENABLED=y
> > >
> > > [thp on or off]
> > > HUGETLBFS=n
> > > HUGETLB_PAGE=n
> > > NUMA_BALANCING=y
> > > NUMA_BALANCING_DEFAULT_ENABLED=y
> > >
> > > I'm getting segfaults intermittently, as well as some weird RCU sched
> > > errors. This happens in vanilla 3.12-rc2, so it doesn't have anything
> > > to do with your patches, but I thought I'd let you know. There didn't
> > > used to be any issues with this test, so I think there's a subtle kernel
> > > bug here. That's, of course, an entirely separate issue though.
> >
> > I'll take a look next week, if nobody does it before.
>
> I'm starting a bisect now. Not sure how long it'll take, but I'll keep
> you posted.

I don't see the issue. Could you share your kernel config?

--
Kirill A. Shutemov

2013-10-08 21:47:19

by Alex Thorlton

[permalink] [raw]
Subject: Re: [PATCHv4 00/10] split page table lock for PMD tables

On Mon, Oct 07, 2013 at 12:48:20PM +0300, Kirill A. Shutemov wrote:
> Alex Thorlton wrote:
> > > > Sorry for the delay on these results. I hit some strange issues with
> > > > running thp_memscale on systems with either of the following
> > > > combinations of configuration options set:
> > > >
> > > > [thp off]
> > > > HUGETLBFS=y
> > > > HUGETLB_PAGE=y
> > > > NUMA_BALANCING=y
> > > > NUMA_BALANCING_DEFAULT_ENABLED=y
> > > >
> > > > [thp on or off]
> > > > HUGETLBFS=n
> > > > HUGETLB_PAGE=n
> > > > NUMA_BALANCING=y
> > > > NUMA_BALANCING_DEFAULT_ENABLED=y
> > > >
> > > > I'm getting segfaults intermittently, as well as some weird RCU sched
> > > > errors. This happens in vanilla 3.12-rc2, so it doesn't have anything
> > > > to do with your patches, but I thought I'd let you know. There didn't
> > > > used to be any issues with this test, so I think there's a subtle kernel
> > > > bug here. That's, of course, an entirely separate issue though.
> > >
> > > I'll take a look next week, if nobody does it before.
> >
> > I'm starting a bisect now. Not sure how long it'll take, but I'll keep
> > you posted.
>
> I don't see the issue. Could you share your kernel config?

I put my kernel config up on ftp at:

ftp://shell.sgi.com/collect/atconfig/config_bug

I've been investigating the issue today and the smallest run I've seen
the problem on was with 128 threads, so this might not be something that
most people will hit.

With the config I've shared here the problem appears to only be
intermittent at 128 threads. It happened on every run of the test when
I ran it with 512 threads.

Just for something to compare to, here's a config that seems to behave
just fine for any number of threads:

ftp://shell.sgi.com/collect/atconfig/config_good

It looks like this is a problem all the way back to the current 3.8
stable tree. I'm still working on tracing back to a kernel where this
problem doesn't show up.

- Alex